VIEW
Communicated by Zach Mainen
Mechanisms Shaping Fast Excitatory Postsynaptic Currents in the Central Nervous System

Mladen I. Glavinović
Departments of Anaesthesia Research and Physiology, McGill University, Montreal, PQ H3G 1Y6, Canada
[email protected]
How different factors contribute to determining the time course of the basic element of fast glutamate-mediated excitatory postsynaptic currents (mEPSCs) in the central nervous system has been a focus of interest of neurobiologists for some years. In spite of intensive investigations, these mechanisms are not well understood. In this review, basic hypotheses are summarized, and a new hypothesis is proposed, which holds that desensitization of AMPA receptors plays a major role in shaping the time course of fast mEPSCs. According to the new hypothesis, desensitization shortens the time course of mEPSCs largely by reducing the buffering of glutamate molecules by AMPA receptors. The hypothesis accounts for numerous findings on fast mEPSCs and is expected to be equally fruitful as a framework for further experimental and theoretical investigations.

1 Factors Shaping the Kinetics of Fast mEPSCs: Historical Background

Fusion of synaptic vesicles with the presynaptic membrane of excitatory synapses in the central nervous system leads to a release of glutamate. Glutamate is believed to reach high concentrations before decaying rapidly (Clements, 1996). The kinetics of glutamate release, diffusion, binding by the receptors, and uptake by the transporters, as well as synaptic geometry, are all expected to influence its spatiotemporal concentration profile in the synaptic cleft, which together with the receptor properties (kinetics, density, and spatial distribution) determines the amplitude and the time course of the basic element of synaptic transmission: the miniature excitatory postsynaptic current (mEPSC). The fast component of the excitatory postsynaptic current (EPSC) in neurons of the central nervous system results from activation of AMPA (α-amino-3-hydroxy-methyl-isoxazole)-type glutamate receptors. How different factors contribute to shaping the time course of mEPSCs has been a focus of interest of neurobiologists for some years.
Neural Computation 14, 1–19 (2001) © 2001 Massachusetts Institute of Technology

In spite of intensive investigations, these mechanisms are not well understood, though determining them is critical to issues of synaptic specificity and the induction of synaptic plasticity. In this review, basic hypotheses of mechanisms shaping the time course of fast glutamate-mediated excitatory postsynaptic currents in the central nervous system are summarized, and a new hypothesis is proposed.

Three hypotheses have traditionally been put forward to explain the rate of decay of the synaptic current (Jonas & Spruston, 1994). The first follows the line of argument used to explain the time course of the postsynaptic current at the neuromuscular junction (Magleby & Stevens, 1972): the decay of glutamate concentration in the synaptic cleft is assumed to be very rapid, so the decay time of the postsynaptic current is determined by channel closure, similar to deactivation (Hestrin, Sah, & Nicoll, 1990; Tang, Shi, Katchman, & Lynch, 1991). The second hypothesis argues that the decay of glutamate concentration is slow and that the postsynaptic current is terminated by desensitization of AMPA receptor channels (Trussell, Thio, Zorumski, & Fischbach, 1988; Trussell & Fischbach, 1989; Isaacson & Nicoll, 1991; Vyklicky, Patneau, & Mayer, 1991; Hestrin, 1992; Trussell, Zhang, & Raman, 1993; Livsey, Costa, & Vicini, 1993; Raman & Trussell, 1995). This is plausible, since AMPA receptor channels desensitize very rapidly (they are the fastest-desensitizing ligand-gated ion channels). According to the third hypothesis, the postsynaptic current is determined in a complex manner by the time course of deactivation and desensitization, as well as by the glutamate concentration (its decay has been argued to lie between the two extremes postulated in the first and second hypotheses; Clements, Lester, Tong, Jahr, & Westbrook, 1992; Barbour, Keller, Llano, & Marty, 1994; Tong & Jahr, 1994a).

2 Deactivation and Desensitization of Currents Evoked by Glutamate, Released Either Synaptically or by Local Application

Deactivation is by convention defined as the decrease of current following the end of a very brief pulse of glutamate applied to excised outside-out membrane patches.
Since it reflects the random closure of AMPA channels in the absence of glutamate, deactivation should determine the rate of decay of mEPSCs if the glutamate transient in the synaptic cleft is very fast. Intrinsic desensitization is defined as the decrease in current observed during a prolonged application of glutamate. The time constants of deactivation and desensitization of AMPA receptors have been studied in membrane patches from a variety of central neurons to evaluate these hypotheses. In inhibitory interneurons of the neocortex (Jonas, Racca, Sakmann, Seeburg, & Monyer, 1994) and hippocampus (Livsey et al., 1993) and in cochlear nucleus neurons (Raman & Trussell, 1992; Trussell et al., 1993), the gating of AMPA channels is rapid; it is slower in hippocampal granule cells and pyramidal neurons and in neocortical pyramidal neurons (Colquhoun, Jonas, & Sakmann, 1992; Hestrin, 1992, 1993; Jonas et al., 1994). In all these cells,
the time course of postsynaptic currents is well correlated with the time course of currents induced by short pulses of glutamate (Raman & Trussell, 1992; Silver, Traynelis, & Cull-Candy, 1992; Stern, Edwards, & Sakmann, 1992; Jonas, Major, & Sakmann, 1993). Because intrinsic desensitization is typically two- to fourfold slower than the decay of well-clamped fast EPSCs, many authors have argued that at these synapses, desensitization is too slow to affect the time course of mEPSCs or even EPSCs (Colquhoun et al., 1992; Hestrin, 1992; Stern et al., 1992; Trussell et al., 1993; Jonas et al., 1993; Partin, Patneau, Winters, Mayer, & Buonanno, 1993; Jonas et al., 1994; Jonas & Spruston, 1994; Raman & Trussell, 1995; Edmonds, Gibb, & Colquhoun, 1995; Hausser & Roth, 1997).

3 Need for Reevaluation

For several reasons, these notions need to be reevaluated. First, the apparent time course of intrinsic desensitization is not necessarily a good indicator of the entry into desensitized states, which may develop much faster; thus, a glutamate pulse much shorter than the time course of desensitization leads to a diminished response to a subsequent short test pulse of glutamate (Raman & Trussell, 1995). Second, the decay time of mEPSCs becomes markedly amplitude dependent when AMPA receptor desensitization is blocked by aniracetam or cyclothiazide (Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). This finding suggests that (1) desensitization of synaptic currents depends on the number of glutamate molecules released and is therefore concentration dependent (unlike intrinsic desensitization), and (2) in the absence of desensitization, another process renders the decay phase of mEPSCs concentration dependent. Alternative explanations may be given for the amplitude dependence of mEPSCs in the absence of desensitization. Larger mEPSCs may be of multivesicular origin, and the positive correlation may arise at least partly from asynchronous vesicular release.
The recent report that spontaneous release becomes less multivesicular with maturation (Wall & Usowicz, 1998) argues against such an explanation (both studies were done in hippocampal slices from adult rats; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). Another explanation would be that the deactivation rate is concentration dependent. High glutamate concentrations may preferentially activate large-conductance channels (Rosenmund, Stern-Bach, & Stevens, 1998) that remain open for longer times (Stern-Bach, Russo, Neuman, & Rosenmund, 1998). Though plausible, this seems an unlikely explanation for dentate granule cells or CA1 pyramidal cells of the hippocampus; in excised patches, the decay times of currents produced by short (1 ms) pulses of glutamate are essentially concentration independent (over the range 0.2–1.0 mM). In CA3 pyramidal cells of the hippocampus, however, such a mechanism may be more important (Colquhoun et al., 1992).
4 Current Observations

Our recordings in hippocampal slices showed that the removal of desensitization by specific pharmacological agents prolonged mEPSCs and greatly enhanced the amplitude dependence of their decay times. To investigate this phenomenon further, we made a detailed simulation of the underlying mechanisms, using well-established Monte Carlo methods to examine glutamate release into the synaptic cleft and its interaction with AMPA receptors, as well as the interaction of short and long pulses of glutamate with AMPA receptors in excised patches. More specifically, we aimed to clarify what factors shape the amplitude and time course of mEPSCs and what determines the relationship among the time courses of intrinsic desensitization, deactivation, and mEPSCs and the time course of the occupancy of desensitized states (for synaptic and patch currents; Glavinović & Rabie, 1998). The Monte Carlo method follows individual glutamate molecules as they diffuse randomly within the synaptic cleft and interact with the postsynaptic receptors (see Figure 1A; Bartol, Land, Salpeter, & Salpeter, 1991; Wahl, Pouzat, & Stratford, 1996; Kruk, Horn, & Faber, 1997; Glavinović & Rabie, 1998; Glavinović, 1999). Our model assumes a 200 × 200 nm synaptic contact area and a 15-nm-wide cleft, bounded by a three-dimensional infinite space. Further assumptions are that the movement of a glutamate molecule depends only on its present position (and not on its history) and that the probability of each receptor's remaining in its present state or changing to another state depends only on its present state. Both the movement and the gating kinetics are thus Markovian processes. In the kinetic scheme for the receptors (see Figure 1B; Jonas et al., 1993), two glutamate molecules must bind to the receptor before the channel can open. An AMPA receptor can therefore be unbound (U), singly bound (SB), doubly bound (DB), open (O), or in one of three desensitized states (D1, D2, or D3).
The interaction between glutamate molecules and receptors is defined by the three binding rates in the kinetic scheme (k+1, k+2, and k+3) and by associating with each receptor state a surface area and a probability of binding, given that a glutamate molecule hits this receptor surface. There is no agreement concerning the level of occupancy of AMPA receptors at the peak of the current. Although AMPA receptors may saturate at some synapses, several lines of experimental evidence indicate that this is unlikely to occur in the cerebellum and hippocampus: noise analysis at single-site synapses in the cerebellum (Silver, Colquhoun, Cull-Candy, & Edmonds, 1996), the high coefficient of variation of single-site mEPSCs in hippocampal culture (Forti, Bossi, Bergamaschi, Villa, & Malgaroli, 1997), and the effects of local agonist applications at single-site synapses in hippocampal cultures (Liu, Choi, & Tsien, 1999). We therefore simulated mEPSCs generated by the release of glutamate molecules varying in number over a very wide range (150–20,000) and covering all degrees of saturation.
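The diffusion part of such a Monte Carlo scheme is straightforward to sketch. The toy code below is my illustration, not the authors' simulation code: the diffusion coefficient, time step, and treatment of the square 200 × 200 nm contact with reflecting membranes and an absorbing edge are all assumptions. Each glutamate molecule takes independent Gaussian steps, and the script counts how many remain over the synapse.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 7.6e-10        # m^2/s, free aqueous diffusion coefficient (assumed)
dt = 1e-8          # s, time step (assumed)
sigma = np.sqrt(2 * D * dt)   # per-axis Gaussian step size
half = 100e-9      # half-width of the 200 x 200 nm synaptic contact
height = 15e-9     # cleft width

n = 5000
pos = np.zeros((n, 3))          # all molecules released at a central point source
pos[:, 2] = height / 2
inside = np.ones(n, dtype=bool)  # becomes False once a molecule leaves the cleft

for step in range(1000):         # 10 us of simulated time
    pos[inside] += rng.normal(0.0, sigma, (int(inside.sum()), 3))
    # crude reflection at the pre- and postsynaptic membranes
    pos[:, 2] = np.clip(pos[:, 2], 0.0, height)
    # molecules that diffuse past the lateral edge are lost to the infinite bath
    inside &= np.max(np.abs(pos[:, :2]), axis=1) < half

print(f"fraction still over the synapse after 10 us: {inside.mean():.3f}")
```

Because each step depends only on the current position, the walk is Markovian, matching the assumption stated above; receptor gating can be layered on top as an independent Markov chain for each receptor.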
Figure 1: Schematic models of glutamate release into the synaptic cleft and the kinetics of gating of AMPA receptors. (A) Diagram (approximately to scale) showing how glutamate molecules, after release from an instantaneous point source, diffuse in the synaptic cleft and interact with postsynaptic receptors. (B) Kinetic scheme of gating of AMPA receptor channels by glutamate, used for the Monte Carlo simulations. U, SB, DB, and O indicate the unbound, singly bound, doubly bound, and open states, respectively. D1, D2, and D3 are three desensitized states. The rate constants, taken from Jonas et al. (1993), were adjusted for the temperature of the simulations (37°C) assuming a Q10 of 4.0 (Glavinović & Rabie, 1998). They are: k+1 = 3.67 × 10⁷ M⁻¹ s⁻¹, k−1 = 3.41 × 10⁴ s⁻¹, k+2 = 22.7 × 10⁷ M⁻¹ s⁻¹, k−2 = 2.61 × 10⁴ s⁻¹, k+3 = 1.02 × 10⁷ M⁻¹ s⁻¹, k−3 = 366 s⁻¹ for glutamate binding; α = 3.39 × 10⁴ s⁻¹, β = 7200 s⁻¹ for channel opening; and α1 = 2.31 × 10⁴ s⁻¹, β1 = 314 s⁻¹, α2 = 1376 s⁻¹, β2 = 5.82 s⁻¹, α3 = 142 s⁻¹, β3 = 32 s⁻¹, α4 = 134.4 s⁻¹, β4 = 1523 s⁻¹ for the desensitization pathway.
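Two quick consistency checks on these numbers can be scripted. The Q10 adjustment mentioned in the caption follows the standard rule k(T2) = k(T1) · Q10^((T2 − T1)/10), and the binding rates imply that at a cleft concentration of about 1 mM, the pseudo-first-order binding rate k+1·[glu] is comparable to the unbinding rate k−1, which is what makes repeated binding ("buffering") plausible. The 900 s⁻¹ starting value below is a hypothetical room-temperature rate chosen only to illustrate the scaling.

```python
def q10_scale(k, t_from_c, t_to_c, q10=4.0):
    """Scale a kinetic rate constant between temperatures using Q10."""
    return k * q10 ** ((t_to_c - t_from_c) / 10.0)

# A hypothetical rate of 900 s^-1 at 22 C scaled to 37 C with Q10 = 4:
print(q10_scale(900.0, 22.0, 37.0))   # 900 * 4**1.5 = 7200 s^-1

# Pseudo-first-order binding at ~1 mM glutamate, using the caption's values:
k_plus1 = 3.67e7    # M^-1 s^-1
k_minus1 = 3.41e4   # s^-1
print(k_plus1 * 1e-3, k_minus1)   # binding and unbinding rates are comparable
```

A 15-degree span with Q10 = 4 thus scales every rate by a factor of eight, which is why the 37°C rates above look so much faster than typical room-temperature measurements.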
There was good agreement between the predictions of the simulations and the experimental observations: both rise and decay times of mEPSCs are prolonged when desensitization is reduced or abolished, but the amplitude changes only marginally (Isaacson & Walmsley, 1996; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). The decay phase of mEPSCs, which is not (or is only modestly) amplitude dependent in the presence of desensitization, becomes markedly so when desensitization is absent (see Figure 2A; Isaacson & Walmsley, 1996; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). Intrinsic desensitization is concentration independent (Colquhoun et al., 1992) and two to three times slower than the decay of deactivation or the decay of "perfectly clamped" mEPSCs (see Figures 2B and 3A; Colquhoun et al., 1992; Hestrin, 1992; Stern et al., 1992; Trussell et al., 1993; Jonas et al., 1993, 1994; Partin et al., 1993; Jonas & Spruston, 1994; Raman & Trussell, 1995; Edmonds et al., 1995; Hausser & Roth, 1997; Glavinović & Rabie, 1998). Deactivation and perfectly clamped mEPSCs have very similar decay times (see Figures 2B and 3A; Raman & Trussell, 1992; Silver et al., 1992; Stern et al., 1992; Jonas et al., 1993; Glavinović & Rabie, 1998).

Figure 2: Under control conditions, the decay times of fast unitary excitatory postsynaptic currents (mEPSCs) are only marginally amplitude dependent, but they become markedly so when desensitization is suppressed. (A) Amplitude dependence of the decay time of mEPSCs, recorded from CA1 pyramidal cells in rat hippocampal slices, in control conditions and after suppression of AMPA receptor desensitization by 100 µM cyclothiazide (cyclo; from Atassi & Glavinović, 1999). (B) Amplitude dependence of the decay time of simulated fast mEPSCs, when desensitization is either present (control) or absent (No Des), with either single (Single Bind) or repeated (Repeat Bind) glutamate binding to receptors permitted. The amplitude was normalized to 1.0 when all AMPA channels are open. Over a wide range, the amplitude dependence of simulated mEPSCs is evident only if there is repeated binding; this indicates that desensitization shortens mEPSCs and markedly reduces the amplitude dependence of their decay times by eliminating repeated glutamate binding ("buffering") by AMPA receptors. The slower decay of the observed mEPSCs (A versus B) can be ascribed to the lower temperature of the recordings (32°C versus 37°C for the simulations), without excluding likely different kinetics in CA1 versus CA3 neurons (the rate constants listed in Figure 1B were obtained from CA3 neurons; Jonas et al., 1993).

5 Revised Desensitization Hypothesis

Although it is clear that desensitization plays a major role in shaping the time course of postsynaptic currents, the picture that emerges from these simulations differs considerably from the classical desensitization hypothesis (the second hypothesis). According to our revised hypothesis, desensitization shapes the time course of mEPSCs largely by limiting the repeated binding of glutamate molecules to AMPA receptors (i.e., by reducing glutamate buffering by AMPA receptors) and, to a lesser extent, through the termination of burst openings of AMPA channels. This conclusion is indicated by two sets of observations on simulated mEPSCs. First, if only single binding of glutamate is permitted, the decay times are very similar (brief and virtually amplitude independent) in both the presence and absence of desensitization (see Figure 2B); second, if multiple glutamate binding is permitted, the decay times are long and strongly amplitude dependent in the absence of desensitization. The short duration and the similar time course of mEPSCs and deactivation in the presence of desensitization can thus be explained by the absence of repeated binding of individual glutamate molecules.
In excised patches, this happens because only a brief pulse of glutamate is applied; during mEPSCs, repeated binding is curtailed by desensitization. The decay times are thus very similar not because of a lack of desensitization but because of its importance. The occupancy of the singly bound activatable state is especially relevant if one wishes to understand what determines the time course of mEPSCs. As the simulations indicate, suppression of desensitization leads to a higher occupancy of the singly bound state (see Figure 4A), as glutamate molecules both unbind from it and subsequently rebind more frequently. The high rate of glutamate binding to the singly bound state (6.2 times greater than binding to the unbound state; see Figure 1B; see also Jonas et al., 1993; Wahl et al., 1996; Glavinović & Rabie, 1998) further enhances the frequency of repeated binding. Its proximity to the open state (unlike that of the unbound state or the desensitized state D1; see Figure 1B) results in prolongation of mEPSCs and greater synaptic efficacy (i.e., a higher probability of opening of AMPA channels). Higher occupancy of the singly bound state raises the occupancy of the doubly bound state (DB; see Figure 1B) and thus enhances the likelihood of repeated opening of AMPA channels.

Figure 3: Although the decay of intrinsic desensitization is slower than the decay of deactivation, the desensitization of AMPA receptors during an mEPSC is very pronounced and depends strongly on how many molecules of glutamate are released (Monte Carlo simulations). (A) The decay time due to intrinsic desensitization (defined as a decrease in responsiveness to a constant prolonged pulse of glutamate) is two to three times longer than the decay time due to deactivation, although most AMPA receptors enter desensitized states during an mEPSC. (B) The fraction of AMPA channels in the unbound state (filled and empty circles) is higher in the absence of desensitization. Desensitization of AMPA receptors is very pronounced and depends strongly on the number of molecules released (all desensitized states taken together; filled squares). The occupancy of all states (unbound and desensitized) was estimated 1 ms after instantaneous release (from Glavinović & Rabie, 1998).

Figure 4: In the absence of desensitization, both the fraction of bound receptors and the late, but not the early, phase of the cleft concentration of glutamate are higher, regardless of how many glutamate molecules are released (Monte Carlo simulations). The fraction of singly (circles) and doubly (triangles) bound AMPA receptors (A) and the synaptic cleft concentration (C) as a function of the number of glutamate molecules released, in the presence and absence of desensitization. The occupancies of all receptor states and the concentration of glutamate were estimated 1 ms after instantaneous release (from Glavinović & Rabie, 1998). (B) The fast time constant of the cleft concentration is, however, independent of the number of molecules released, the presence of AMPA receptors, and their ability to desensitize. Smooth lines are the best multiexponential fits to the concentration traces.

Simulations also reveal that during and following an mEPSC, the occupancies of all states, including the desensitized states, depend on the number of molecules released (i.e., they are concentration dependent; see Figures 3 and 4). This is not surprising, given that three of the seven kinetic rates are concentration dependent (see Figure 1B). The postulated concentration dependence of desensitization is not in contradiction, experimentally (Jonas et al., 1993) or theoretically (Glavinović & Rabie, 1998), with the well-documented concentration independence of desensitization in excised patches (intrinsic desensitization; see Figure 3A). Desensitization that develops during a single mEPSC is concentration dependent mainly because it abolishes glutamate buffering, a concentration-dependent process that is not present during desensitization in excised patches. Finally, our simulations remove one of the main objections to the idea that desensitization plays an important role in shaping the time course of mEPSCs: the supposed requirement that the time course of glutamate concentration in the synaptic cleft be very slow (Edmonds et al., 1995). The decay of glutamate concentration is actually predicted to be faster in the presence of desensitization of AMPA channels, because of reduced buffering of glutamate by AMPA receptors (see Figures 4B and 5). A word of caution is, however, necessary. The comparative importance of desensitization as an inactivation pathway, as opposed to its role in diminishing glutamate buffering, may vary from synapse to synapse and may depend on external conditions. Desensitization is more likely to act as an inactivation pathway if the synaptic cleft is wide and the synaptic contact area small.
The kinetics of gating of AMPA channels and their density are also significant factors. The theoretical simulations (Wahl et al., 1996; Glavinović & Rabie, 1998; Glavinović, 1999) assumed that all rates have the same Q10, but that is not necessarily the case. If the association rates have a lower Q10 (Jones, Sahara, Dzubay, & Westbrook, 1998), desensitization would have greater importance as an inactivation pathway at higher temperatures than we evaluated. In the simulations, all glutamate molecules that reached the edge of the synapse diffused freely into the surrounding infinite space (Glavinović & Rabie, 1998; Glavinović, 1999). If, as seems likely, there are cellular barriers to such free diffusion (e.g., neuroglia), they would slow the clearance of glutamate from the synaptic cleft and enhance the role of desensitization in diminishing glutamate buffering by AMPA receptors (bearing in mind that significant removal of glutamate by cellular uptake probably accelerates the clearance of glutamate; see below); this needs to be included explicitly in further simulations. In any case, the experimental findings strongly support the idea that repeated binding is important in shaping the time course of mEPSCs at low (22–27°C) and especially at high (32°C) temperatures (Atassi & Glavinović, 1999).

Figure 5: The diagram compares the time course of glutamate concentration in the synaptic cleft and the unitary excitatory postsynaptic current (below), in the presence and absence of desensitization. (Top) Glutamate concentration appears as a large but short pulse that decays almost instantaneously and is followed by a small, prolonged tail. According to our model, in the presence of AMPA receptor desensitization the effect of glutamate is not only more evanescent, but its concentration in the synaptic cleft also decays faster. Both effects are closely linked to the higher occupancy of the singly bound receptor state in the absence of desensitization, which produces the greater and more prolonged postsynaptic effect (downward arrows) but also elevates the glutamate concentration in the synaptic cleft, owing to greater buffering by the AMPA receptors (curved upward arrows).

6 Fast-Application Protocol on Excised Membrane Patch as Surrogate Synapse

Since it is difficult to examine the steps that are rate limiting at an intact synapse, fast glutamate applications to excised outside-out membrane patches are often used as a surrogate synapse (Colquhoun et al., 1992; Raman & Trussell, 1992; Hestrin, 1992, 1993; Trussell et al., 1993; Livsey et al., 1993; Jonas et al., 1994). This protocol is attractive because it permits the investigator to control not only postsynaptic but also presynaptic (i.e., release) parameters. One can thus systematically examine how the glutamate concentration and its pulse length influence the time course of the postsynaptic currents. For at least three reasons, the surrogate synapse often gives a misleading picture of synaptic events (Glavinović & Rabie, 1998). First, the time course of transmitter in the synaptic cleft is never a square pulse; it has a more complicated shape, and the concentration is spatially nonuniform (Ogston, 1955; Holmes, 1995; Wahl et al., 1996; Kleinle et al., 1996; Glavinović & Rabie, 1998; Glavinović, 1999). Second, in such a system, the binding and unbinding of transmitter from receptors do not affect the concentration of transmitter that activates the responses. In the synaptic cleft, accumulations (and depletions) of transmitter can occur because of such binding and unbinding, and they can exert a profound influence on the time course of the postsynaptic current (Mennerick & Zorumski, 1994, 1995; Glavinović & Rabie, 1998). They are likely to be especially relevant if the occupancy of the AMPA receptors in the singly bound state is more than negligible. Third, the rate of onset of intrinsic desensitization differs from that of entry into the desensitized states. Unlike the time course of intrinsic desensitization, the occupancy of the desensitized states during a synaptic current is clearly concentration dependent.

7 Pharmacological Inhibition of Desensitization of AMPA Receptor Channels

A variety of agents (such as aniracetam or cyclothiazide) are known to inhibit the desensitization of AMPA receptor channels (Ito, Tanabe, Kohda, & Sugiyama, 1990; Yamada & Tang, 1993).
They produce a slower decay of spontaneous and evoked postsynaptic potentials (Larson, Le, Hall, & Lynch, 1994) and currents (Isaacson & Nicoll, 1991; Tang et al., 1991; Trussell et al., 1993; Partin et al., 1993; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999), which is often taken as an argument that desensitization plays an important role in shaping the time course of postsynaptic currents. This interpretation is not universally accepted, on the grounds that these agents also slow the kinetics of deactivation (Jonas & Spruston, 1994; Edmonds et al., 1995). Such criticisms may not be valid, because deactivation should be slower when desensitization is suppressed even though there is no slowing of the rate of closing of AMPA channels (Glavinović & Rabie, 1998). A strong correlation between the time constants of deactivation and of desensitization, observed in a variety of central synapses (Partin, Fleck, & Mayer, 1996), suggests that deactivation is slower largely (and probably entirely) owing to reduced desensitization.

8 Time Course of Glutamate Concentration in the Synaptic Cleft

Albeit of considerable interest, the time course of glutamate concentration is not known with precision. Given that diffusion is fundamentally a multiexponential process (Carslaw & Jaeger, 1959), the time course of transmitter concentration in the synaptic cleft should decay multiexponentially (Ogston, 1955; Eccles & Jaeger, 1958; Wathey, Nass, & Lester, 1979; Silver et al., 1996; Clements, 1996). At least two exponentials are expected even when considering a two-dimensional extracellular space (Destexhe & Sejnowski, 1995). The time constant of the fast initial concentration decay (τf) is expected to be governed by lateral diffusion and determined primarily by the diffusion coefficient (D) of the transmitter and by the cleft radius (r): τf = r²/4D (Bartol & Sejnowski, 1993; Destexhe & Sejnowski, 1995; Clements, 1996; Silver et al., 1996). This has been confirmed by recent Monte Carlo simulations (Bartol et al., 1991; Wahl et al., 1996; Glavinović & Rabie, 1998). Fitting four exponentials to the time course of the cleft glutamate concentration, simulated assuming an instantaneous point source of glutamate (see Figure 4B), yields a very fast initial time constant (2–6 µs) that is independent of the number of glutamate molecules released (300–10,000), the presence or absence of 196 receptors on the 200 × 200 nm postsynaptic membrane, and their ability to desensitize. Though this value is clearly smaller than values reported before (50–200 µs; Eccles & Jaeger, 1958; Burger et al., 1989; Faber, Young, Legendre, & Korn, 1992; Bartol & Sejnowski, 1993), the agreement between them is good when the differences in cleft radius are taken into account (0.25–0.5 µm as opposed to 0.1 µm). The cleft diameters of excitatory synaptic terminals on CA1 pyramidal neurons are variable, ranging from 0.1 to 1.0 µm (Palay & Chan-Palay, 1974; Bekkers, Richerson, & Stevens, 1990). Within 50 µs, the transmitter concentration is essentially spatially uniform throughout the synapse (Wahl et al., 1996; Clements, 1996; Glavinović, 1999).
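The scaling τf = r²/4D can be checked numerically. Assuming the free aqueous diffusion coefficient of glutamate (about 7.6 × 10⁻¹⁰ m²/s, an assumed value that would be lower in a tortuous cleft), a 0.1 µm cleft radius gives a fast time constant in the 2–6 µs range quoted above, while the larger radii used in earlier studies give values in the 50–200 µs range:

```python
D = 7.6e-10   # m^2/s, assumed free aqueous diffusion coefficient of glutamate

def tau_f(radius_m, diff=D):
    """Fast decay time constant of cleft transmitter, tau_f = r^2 / (4 D)."""
    return radius_m ** 2 / (4.0 * diff)

print(tau_f(0.1e-6) * 1e6)   # r = 0.1 um -> ~3.3 us
print(tau_f(0.4e-6) * 1e6)   # r = 0.4 um -> ~53 us
```

The quadratic dependence on r is why a modest difference in assumed cleft radius (0.1 µm versus 0.25–0.5 µm) accounts for the order-of-magnitude spread in published estimates.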
The amplitudes and the time constants of the slower components, however, depend not only on the diffusion constant and cleft geometry but also on the receptor density and kinetics (Wahl et al., 1996; Glavinović & Rabie, 1998; see Figure 4B). The longest time constant ranges from 30 µs to 1.2 ms, values comparable to earlier estimates for the slow component (Clements, 1996). Its amplitude is 1.5 to 15.0% of that of the fastest component. The effective diffusion constant of glutamate in the synaptic cleft influences all components of the time course of glutamate concentration: if the rate of diffusion is lower than in simple aqueous solutions (Longsworth, 1953), the glutamate concentration will change more slowly (Vogt, Luscher, & Streit, 1995; Kleinle et al., 1996). In contrast to the neuromuscular junction, a high-turnover enzyme for transmitter degradation is absent from the synaptic cleft of glutamatergic synapses. This would tend to make the time course of glutamate in the cleft relatively long. However, a significant compensating factor is the presence of transmitter-selective transporters (not found at cholinergic synapses), which affect the time course of glutamate (Tong & Jahr, 1994b; Takahashi, Sarantis, & Attwell, 1996), especially if, as evidence suggests, the transporters are at a high density, surpassing that of AMPA receptor channels by more than an order of magnitude (Arriza et al., 1994; Otis, Kavanaugh, & Jahr, 1997; Diamond & Jahr, 1997). The rate of binding of glutamate to the transporters appears to be at least as rapid as binding to the AMPA receptors; the rate of glutamate unbinding is low, and the turnover rate is very low. Therefore, on the timescale of the fast excitatory postsynaptic current, the binding of glutamate to the transporters is essentially irreversible, and it will speed up the clearance of glutamate from the cleft. The time course of release of the vesicular content will also influence the time course of glutamate concentration in the synaptic cleft. The short rise time of well-clamped, fast, spontaneous EPSCs (Stern et al., 1992; Trussell et al., 1993; Jonas et al., 1993; Spruston, Jonas, & Sakmann, 1995) argues for a rapid onset of the release of the vesicular content and a rapid subsequent rise of glutamate concentration in the synaptic cleft. However, the time course of release of vesicular content appears to be highly variable, judging by the marked variability in the rate of rise of mEPSCs, even when all are equally well clamped (Bekkers & Stevens, 1996; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). This is not surprising, because the intravesicular concentrations of glutamate and the vesicular sizes differ considerably. Estimates of the intravesicular glutamate concentration range from 60 to 210 mM (Burger et al., 1989; Riveros, Fiedler, Lagos, Muñoz, & Orrego, 1986; Shupliakov, Brodin, Cullheim, Ottersen, & Storm-Mathisen, 1992) and those of the vesicle diameters from 25 to 45 nm (Palay & Chan-Palay, 1974; Bekkers et al., 1990). Assuming identical fusion pore geometry and time course of opening, larger vesicles are expected to release their contents more slowly than smaller vesicles (Glavinović, 1999).
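The expectation that larger vesicles release their contents more slowly goes together with the fact that they also contain more transmitter. From the concentration and diameter ranges just quoted, the number of glutamate molecules per vesicle can be estimated (a back-of-envelope calculation, assuming a spherical vesicle with all of its glutamate free in solution):

```python
import math

N_A = 6.022e23   # Avogadro's number

def molecules_per_vesicle(conc_molar, diameter_nm):
    """Glutamate molecules in a spherical vesicle of the given diameter."""
    r_m = diameter_nm * 1e-9 / 2.0
    volume_l = (4.0 / 3.0) * math.pi * r_m ** 3 * 1000.0   # m^3 to litres
    return conc_molar * volume_l * N_A

print(round(molecules_per_vesicle(0.060, 25.0)))   # 60 mM, 25 nm -> ~300
print(round(molecules_per_vesicle(0.210, 45.0)))   # 210 mM, 45 nm -> ~6000
```

The resulting roughly 300 to 6000 molecules per vesicle is bracketed by the 150–20,000-molecule range explored in the simulations above.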
The amplitude dependence of rise times of mEPSCs recorded in the presence and especially the absence of desensitization (Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999) supports such a conclusion. Any variability of the geometry of the fusion pore and its time course of opening would provide another important modulating influence on the rate of release of vesicular content. Finally, according to several recent reports on neuroendocrine secretory cells, agents are not released by simple diffusion but first have to dissociate from the gel matrix in which they are stored at high concentration (Helle, 1990; Yoo & Lewis, 1993; Schroeder, Jankowski, Senyshyn, Holz, & Wightman, 1994; Walker, Glavinović, & Trifaro, 1996). By slowing the release process, a similar mechanism of storing glutamate in vesicles would provide an additional variable in the process of release of vesicular content.

9 Conclusion

According to the revised hypothesis presented in this review, desensitization of AMPA receptors plays a major role in shaping the time course of fast excitatory postsynaptic currents. This hypothesis differs from the classical desensitization hypothesis in several important respects and proposes that: (1) the glutamate concentration decays faster in the presence of desensitization of AMPA channels (see Figure 5), (2) desensitization shapes the time course of mEPSCs to a large extent through its effect on buffering of glutamate molecules by AMPA receptors, (3) the occupancy of the single bound activatable state is an important determinant of both the level of glutamate buffering and the efficacy of glutamate in opening AMPA channels, (4) during an mEPSC, the occupancy of all desensitized states (taken individually or all together) and of all other states, as well as the extent of glutamate buffering, are concentration dependent, and (5) desensitization is concentration dependent largely because it abolishes the glutamate buffering (a concentration-dependent process). Thus, the revised hypothesis accounts for numerous findings on fast mEPSCs. It is expected to be equally fruitful as a framework for further experimental and theoretical investigations.

Acknowledgments

K. Krnjević read the manuscript and made valuable comments. This work was supported by the Operating Grant of the Medical Research Council of Canada.

References

Arriza, J. L., Fairman, W. A., Wadiche, J. L., Murdoch, G. H., Kavanaugh, M. P., & Amara, S. G. (1994). Functional comparisons of three glutamate transporter subtypes cloned from human motor cortex. Journal of Neuroscience, 14, 5559–5569.

Atassi, B., & Glavinović, M. I. (1999). Effect of cyclothiazide on spontaneous miniature excitatory postsynaptic currents in rat hippocampal pyramidal cells. Pflügers Archiv, 437, 471–478.

Barbour, B., Keller, B. U., Llano, I., & Marty, A. (1994). Prolonged presence of glutamate during excitatory synaptic transmission to cerebellar Purkinje cells. Neuron, 12, 1331–1343.

Bartol, T. M., Jr., Land, B. R., Salpeter, E. E., & Salpeter, M. M. (1991). Monte Carlo simulation of miniature endplate current generation in the vertebrate neuromuscular junction. Biophysical Journal, 59, 1290–1307.

Bartol, T. M., Jr., & Sejnowski, T. J. (1993).
Model of the quantal activation of NMDA receptors at a hippocampal synaptic spine. Society for Neuroscience (Abstract), 19, 1515.

Bekkers, J. M., Richerson, G. B., & Stevens, C. F. (1990). Origin of variability in quantal size in cultured hippocampal neurons and hippocampal slices. Proceedings of the National Academy of Science U.S.A., 87, 5359–5362.

Bekkers, J. M., & Stevens, C. F. (1996). Cable properties of cultured hippocampal neurons determined from sucrose-evoked miniature EPSCs. Journal of Neurophysiology, 75, 1250–1255.

Burger, P. M., Mehl, E., Cameron, P. L., Maycox, P. R., Baumert, M., Lottspeich, F., de Camilli, P., & Jahn, R. (1989). Synaptic vesicles immunoisolated from rat cerebral cortex contain high levels of glutamate. Neuron, 3, 715–720.
Carslaw, H. S., & Jaeger, J. C. (1959). Conduction of heat in solids (2nd ed.). Oxford: Clarendon Press.

Clements, J. D. (1996). Transmitter time course in the synaptic cleft: Its role in central synaptic function. Trends in Neuroscience, 19, 163–171.

Clements, J. D., Lester, R. A. J., Tong, G., Jahr, C. E., & Westbrook, G. L. (1992). The time course of glutamate in the synaptic cleft. Science, 258, 1498–1501.

Colquhoun, D., Jonas, P., & Sakmann, B. (1992). Action of brief pulses of glutamate on AMPA/kainate receptors in patches from different neurones of rat hippocampal slices. Journal of Physiology London, 458, 261–287.

Destexhe, A., & Sejnowski, T. S. (1995). G protein activation kinetics and spillover of γ-aminobutyric acid may account for differences between inhibitory responses in the hippocampus and thalamus. Proceedings of the National Academy of Science U.S.A., 92, 9515–9519.

Diamond, J. S., & Jahr, C. E. (1997). Transporters buffer synaptically released glutamate on a submillisecond time scale. Journal of Neuroscience, 17, 4672–4687.

Eccles, J. C., & Jaeger, J. C. (1958). The relationship between the mode of operation and the dimensions of the junctional regions at synapses and motor endorgans. Proceedings of the Royal Society B, 148, 38–56.

Edmonds, B., Gibb, A. J., & Colquhoun, D. (1995). Mechanisms of activation of glutamate receptors and the time course of excitatory synaptic currents. Annual Review of Physiology, 57, 495–519.

Faber, D. S., Young, W. S., Legendre, P., & Korn, H. (1992). Intrinsic quantal variability due to stochastic properties of receptor-transmitter interactions. Science, 258, 1494–1498.

Forti, L., Bossi, M., Bergamaschi, A., Villa, A., & Malgaroli, A. (1997). Loose-patch recordings of single quanta at individual hippocampal synapses. Nature, 388, 874–878.

Ghamari-Langroudi, M., & Glavinović, M. I. (1998). Changes of spontaneous miniature excitatory postsynaptic currents in rat hippocampal pyramidal cells induced by aniracetam.
Pflügers Archiv, 435, 185–192.

Glavinović, M. I. (1999). Monte Carlo simulation of vesicular release, spatiotemporal distribution of glutamate in synaptic cleft, and generation of postsynaptic currents. Pflügers Archiv, 437, 462–470.

Glavinović, M. I., & Rabie, H. R. (1998). Monte Carlo simulation of spontaneous miniature excitatory postsynaptic currents in rat hippocampal synapse in the presence and in the absence of desensitization. Pflügers Archiv, 435, 193–202.

Häusser, M., & Roth, A. (1997). Dendritic and somatic glutamate receptor channels in rat cerebellar Purkinje cells. Journal of Physiology London, 501, 77–95.

Helle, K. B. (1990). Chromogranins: Universal proteins in secretory organelles from Paramecium to man. Neurochemistry International, 17, 165–175.

Hestrin, S. (1992). Activation and desensitization of glutamate-activated channels mediating fast excitatory synaptic currents in the visual cortex. Neuron, 9, 991–999.

Hestrin, S. (1993). Different glutamate receptor channels mediate fast excitatory synaptic currents in inhibitory and excitatory cortical neurons. Neuron, 11, 1083–1091.
Hestrin, S., Sah, P., & Nicoll, R. A. (1990). Mechanisms generating the time course of dual component excitatory synaptic currents recorded in hippocampal slices. Neuron, 5, 247–253.

Holmes, W. R. (1995). Modeling the effect of glutamate diffusion and uptake on NMDA and non-NMDA receptor saturation. Biophysical Journal, 69, 1734–1747.

Isaacson, J. S., & Nicoll, R. A. (1991). Aniracetam reduces glutamate receptor desensitization and slows the decay of fast excitatory synaptic currents in the hippocampus. Proceedings of the National Academy of Science U.S.A., 88, 10936–10940.

Isaacson, J. S., & Walmsley, B. (1996). Amplitude and time course of spontaneous and evoked excitatory postsynaptic currents in bushy cells of the anteroventral cochlear nucleus. Journal of Neurophysiology, 76, 1566–1571.

Ito, I., Tanabe, S., Kohda, A., & Sugiyama, H. (1990). Allosteric potentiation of quisqualate receptors by a nootropic drug aniracetam. Journal of Physiology London, 424, 533–544.

Jonas, P., Major, G., & Sakmann, B. (1993). Quantal components of unitary EPSCs at the mossy fibre synapse on CA3 pyramidal cells of rat hippocampus. Journal of Physiology London, 472, 615–663.

Jonas, P., Racca, C., Sakmann, B., Seeburg, P. H., & Monyer, H. (1994). Differences in Ca2+ permeability of AMPA-type glutamate receptor channels in neocortical neurons caused by differential gluR-B subunit expression. Neuron, 12, 1281–1289.

Jonas, P., & Spruston, N. (1994). Mechanisms shaping glutamate-mediated excitatory postsynaptic currents in the CNS. Current Opinion in Neurobiology, 4, 366–372.

Jones, M. V., Sahara, Y., Dzubay, J. A., & Westbrook, G. L. (1998). Defining affinity with the GABAA receptor. Journal of Neuroscience, 18, 8590–8604.

Kleinle, J., Vogt, K., Lüscher, H. R., Senn, W., Wyler, K., & Streit, J. (1996). Transmitter concentration profiles in the synaptic cleft: An analytical model of release and diffusion. Biophysical Journal, 71, 2413–2426.

Kruk, P. J., Horn, H., & Faber, D. S. (1997).
The effects of geometrical parameters on synaptic transmission: A Monte Carlo simulation study. Biophysical Journal, 73, 2874–2890.

Larson, J., Le, T. T., Hall, R. A., & Lynch, G. (1994). Effects of cyclothiazide on synaptic responses in slices of adult and neonatal rat hippocampus. Neuroreport, 5, 389–392.

Liu, G., Choi, S., & Tsien, R. W. (1999). Variability of neurotransmitter concentration and nonsaturation of postsynaptic AMPA receptors at synapses in hippocampal cultures and slices. Neuron, 22, 395–409.

Livsey, C. T., Costa, E., & Vicini, S. (1993). Glutamate-activated currents in outside-out patches from spiny versus aspiny hilar neurons of rat hippocampal slices. Journal of Neuroscience, 13, 5324–5333.

Longsworth, L. G. (1953). Diffusion measurements at 25°C, of aqueous solutions of amino acids, peptides and sugars. Journal of the American Chemical Society, 75, 5705–5709.
Magleby, K. L., & Stevens, C. F. (1972). A quantitative description of end-plate currents. Journal of Physiology London, 223, 173–197.

Mennerick, S., & Zorumski, C. F. (1994). Glial contributions to excitatory neurotransmission in cultured hippocampal cells. Nature, 368, 59–62.

Mennerick, S., & Zorumski, C. F. (1995). Presynaptic influence on the time course of fast excitatory synaptic currents in cultured hippocampal cells. Journal of Neuroscience, 15, 3178–3192.

Ogston, A. G. (1955). Removal of acetylcholine from limited volume by diffusion. Journal of Physiology London, 128, 222–223.

Otis, T. S., Kavanaugh, M. P., & Jahr, C. E. (1997). Postsynaptic glutamate transport at the climbing fiber–Purkinje cell synapse. Science, 277, 1515–1518.

Palay, S. L., & Chan-Palay, V. (1974). Cerebellar cortex: Cytology and organization. Berlin: Springer.

Partin, K. M., Fleck, M. W., & Mayer, M. L. (1996). AMPA receptor flip/flop mutants affecting deactivation, desensitization, and modulation by cyclothiazide, aniracetam, and thiocyanate. Journal of Neuroscience, 16, 6634–6647.

Partin, K. M., Patneau, D. K., Winters, C. A., Mayer, M. L., & Buonanno, A. (1993). Selective modulation of desensitization at AMPA versus kainate receptors by cyclothiazide and concanavalin A. Neuron, 11, 1069–1082.

Raman, I. M., & Trussell, L. O. (1992). The kinetics of the response to glutamate and kainate in neurons of the avian cochlear nucleus. Neuron, 9, 173–186.

Raman, I. M., & Trussell, L. O. (1995). The mechanism of α-amino-3-hydroxy-5-methyl-4-isoxazolepropionate receptor desensitization after removal of glutamate. Biophysical Journal, 68, 137–146.

Riveros, N., Fiedler, J., Lagos, N., Muñoz, C., & Orrego, F. (1986). Glutamate in rat brain cortex synaptic vesicles: Influence of the vesicle isolation procedure. Brain Research, 386, 405–408.

Rosenmund, C., Stern-Bach, Y., & Stevens, C. F. (1998). The tetrameric structure of a glutamate receptor channel. Science, 280, 1596–1599.

Schroeder, T.
J., Jankowski, J. A., Senyshyn, J., Holz, R. W., & Wightman, R. M. (1994). Zones of exocytotic release on bovine adrenal medullary cells in culture. Journal of Biological Chemistry, 269, 17215–17220.

Shupliakov, O., Brodin, L., Cullheim, S., Ottersen, O. P., & Storm-Mathisen, J. (1992). Immunogold quantification of glutamate in two types of excitatory synapse with different firing patterns. Journal of Neuroscience, 12, 3789–3803.

Silver, R. A., Colquhoun, D., Cull-Candy, S. G., & Edmonds, B. (1996). Deactivation and desensitization of non-NMDA receptors in patches and the time course of EPSCs in rat cerebellar granule cells. Journal of Physiology London, 493, 167–173.

Silver, R. A., Traynelis, S. F., & Cull-Candy, S. G. (1992). Rapid-time-course miniature and evoked excitatory currents at cerebellar synapses in situ. Nature, 355, 163–166.

Spruston, N., Jonas, P., & Sakmann, B. (1995). Dendritic glutamate receptor channels in rat hippocampal CA3 and CA1 pyramidal neurons. Journal of Physiology London, 482, 325–352.
Stern, P., Edwards, F. A., & Sakmann, B. (1992). Fast and slow components of unitary EPSCs on stellate cells elicited by focal stimulation in slices of rat visual cortex. Journal of Physiology London, 449, 247–278.

Stern-Bach, Y., Russo, S., Neuman, M., & Rosenmund, C. (1998). A point mutation in the glutamate binding site blocks desensitization of AMPA receptors. Neuron, 21, 907–918.

Takahashi, M., Sarantis, M., & Attwell, D. (1996). Postsynaptic glutamate uptake in rat cerebellar Purkinje cells. Journal of Physiology London, 497, 523–530.

Tang, C.-M., Shi, Q.-Y., Katchman, A., & Lynch, G. (1991). Modulation of the time course of fast EPSCs and glutamate channel kinetics by aniracetam. Science, 254, 288–290.

Tong, G., & Jahr, C. E. (1994a). Multivesicular release from excitatory synapses of cultured hippocampal neurons. Neuron, 12, 51–59.

Tong, G., & Jahr, C. E. (1994b). Block of glutamate transporters potentiates postsynaptic excitation. Neuron, 13, 1195–1203.

Trussell, L. O., & Fischbach, G. D. (1989). Glutamate receptor desensitization and its role in synaptic transmission. Neuron, 3, 209–218.

Trussell, L. O., Thio, L. L., Zorumski, C. F., & Fischbach, G. D. (1988). Rapid desensitization of glutamate receptors in vertebrate central neurons. Proceedings of the National Academy of Science U.S.A., 85, 2834–2838.

Trussell, L. O., Zhang, S., & Raman, I. M. (1993). Desensitization of AMPA receptors upon multiquantal neurotransmitter release. Neuron, 10, 1185–1196.

Vogt, K., Lüscher, H. R., & Streit, J. (1995). Analysis of synaptic transmission at single identified boutons on rat spinal neurons in culture. Pflügers Archiv, 430, 1022–1028.

Vyklicky, L., Patneau, D. K., & Mayer, M. L. (1991). Modulation of excitatory synaptic transmission by drugs that reduce desensitization at AMPA/kainate receptors. Neuron, 7, 971–984.

Wahl, L. M., Pouzat, C., & Stratford, K. J. (1996). Monte Carlo simulation of fast excitatory synaptic transmission at a hippocampal synapse.
Journal of Neurophysiology, 75, 597–608.

Wall, M. J., & Usowicz, M. M. (1998). Development of the quantal properties of evoked and spontaneous synaptic currents at a brain synapse. Nature Neuroscience, 1, 675–682.

Walker, A., Glavinović, M. I., & Trifaro, J.-M. (1996). Time course of release of content of single vesicles in bovine chromaffin cells. Pflügers Archiv, 431, 729–735.

Wathey, J. C., Nass, M. M., & Lester, H. A. (1979). Numerical reconstruction of the quantal event at nicotinic synapses. Biophysical Journal, 27, 145–164.

Yamada, K. A., & Tang, C.-M. (1993). Benzothiadiazides inhibit rapid glutamate receptor desensitization and enhance glutamatergic synaptic currents. Journal of Neuroscience, 13, 3904–3915.

Yoo, S. H., & Lewis, M. S. (1993). Dimerization and tetramerization properties of the C-terminal chromogranin A: A thermodynamic analysis. Biochemistry, 32, 8816–8822.

Received December 12, 2000; accepted March 27, 2001.
NOTE
Communicated by Leo Breiman
Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure

Marco Saerens
[email protected] IRIDIA Laboratory, cp 194/6, UniversitÂe Libre de Bruxelles, B-1050 Brussels, Belgium, and SmalS-MvM, Research Section, Brussels, Belgium
Patrice Latinne
[email protected] IRIDIA Laboratory, cp 194/6, UniversitÂe Libre de Bruxelles, B-1050 Brussels, Belgium
Christine Decaestecker
[email protected] Laboratory of Histopathology, cp 620, UniversitÂe Libre de Bruxelles, B-1070 Brussels, Belgium
It sometimes happens (for instance, in case-control studies) that a classifier is trained on a data set that does not reflect the true a priori probabilities of the target classes on real-world data. This may have a negative effect on the classification accuracy obtained on the real-world data set, especially when the classifier's decisions are based on the a posteriori probabilities of class membership. Indeed, in this case, the trained classifier provides estimates of the a posteriori probabilities that are not valid for this real-world data set (they rely on the a priori probabilities of the training set). Applying the classifier as is (without correcting its outputs with respect to these new conditions) on this new data set may thus be suboptimal. In this note, we present a simple iterative procedure for adjusting the outputs of the trained classifier with respect to these new a priori probabilities without having to refit the model, even when these probabilities are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. This iterative algorithm is a straightforward instance of the expectation-maximization (EM) algorithm and is shown to maximize the likelihood of the new data. Thereafter, we discuss a statistical test that can be applied to decide if the a priori class probabilities have changed from the training set to the real-world data. The procedure is illustrated on different classification problems involving a multilayer neural network, and comparisons with a standard procedure for a priori probability estimation are provided. Our original method, based on the EM algorithm, is shown to be superior to the standard one for a priori probability estimation. Experimental results also indicate that the classifier with adjusted outputs always performs better than the original one in terms of classification accuracy, when the a priori probability conditions differ from the training set to the real-world data. The gain in classification accuracy can be significant.

Neural Computation 14, 21–41 (2001) © 2001 Massachusetts Institute of Technology

1 Introduction

In supervised classification tasks, sometimes the a priori probabilities of the classes from a training set do not reflect the "true" a priori probabilities of real-world data, on which the trained classifier has to be applied. For instance, this happens when the sample used for training is stratified by the value of the discrete response variable (i.e., the class membership). Consider, for example, an experimental setting—a case-control study—where we select 50% of individuals suffering from a disease (the cases) and 50% of individuals who do not suffer from this disease (the controls), and suppose that we make a set of measurements on these individuals. The resulting observations are used in order to train a model that classifies the data into the two target classes: disease and no disease. In this case, the a priori probabilities of the two classes in the training set are 0.5 each. Once we apply the trained model in a real-world situation (new cases), we have no idea of the true a priori probability of disease (also labeled "disease prevalence" in biostatistics). It has to be estimated from the new data. Moreover, the outputs of the model have to be adjusted accordingly. In other words, the classification model is trained on a data set with a priori probabilities that are different from the real-world conditions. In this situation, knowledge of the "true" a priori probabilities of the real-world data would be an asset for the following important reasons: Optimal Bayesian decision making is based on the a posteriori probabilities of the classes conditional on the observation (we have to select the class label that has maximum estimated a posteriori probability). Now, following Bayes' rule, these a posteriori probabilities depend in a nonlinear way on the a priori probabilities.
Therefore, a change of the a priori probabilities (as is the case for the real-world data versus the training set) may have an important impact on the a posteriori probabilities of membership, which themselves affect the classification rate. In other words, even if we use an optimal Bayesian model, if the a priori probabilities of the classes change, the model will not be optimal anymore in these new conditions. But knowing the new a priori probabilities of the classes would allow us to correct (by Bayes' rule) the output of the model in order to recover the optimal decision. Many classification methods, including neural network classifiers, provide estimates of the a posteriori probabilities of the classes. From the previous point, this means that applying such a classifier as is on new data having different a priori probabilities from the training set can result in a loss of classification accuracy, in comparison with an equivalent classifier that relies on the "true" a priori probabilities of the new data set. This is the primary motivation of this article: to introduce a procedure allowing the correction of the estimated a posteriori probabilities, that is, the classifier's outputs, in accordance with the new a priori probabilities of the real-world data, in order to make more accurate predictions, even if these a priori probabilities of the new data set are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. The experimental section, section 4, will confirm that a significant increase in classification accuracy can be obtained when correcting the outputs of the classifier with respect to new a priori probability conditions. For the sake of completeness, notice also that there exists another approach, the min-max criterion, which avoids the estimation of the a priori probabilities on the new data. Basically, the min-max criterion says that one should use the Bayes decision rule that corresponds to the least favorable a priori probability distribution (see, e.g., Melsa & Cohn, 1978, or Hand, 1981). In brief, we present a simple iterative procedure that estimates the new a priori probabilities of a new data set and adjusts the outputs of the classifier, which is supposed to approximate the a posteriori probabilities, without having to refit the model (section 2). This algorithm is a simple instance of the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997), which aims to maximize the likelihood of the new observed data. We also discuss a simple statistical test (a likelihood ratio test) that can be applied in order to decide if the a priori probabilities have changed or not from the training set to the new data set (section 3).
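The iterative procedure just summarized can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes a classifier whose outputs are posterior estimates valid under the training-set priors, and the two-class Gaussian data in the demonstration are invented for the example.

```python
import numpy as np

def adjust_to_new_priors(post_t, priors_t, max_iter=200, tol=1e-8):
    """EM re-estimation of class priors on a new, unlabeled data set.

    post_t   : (N, n) classifier outputs, posterior estimates valid
               under the training-set priors.
    priors_t : (n,) training-set priors (class frequencies).
    Returns (estimated new priors, adjusted posteriors).
    """
    priors = priors_t.copy()
    for _ in range(max_iter):
        # E-step: reweight each posterior by the ratio of current to
        # training priors, then renormalize (Bayes correction).
        w = post_t * (priors / priors_t)
        post = w / w.sum(axis=1, keepdims=True)
        # M-step: new priors are the average adjusted posteriors.
        new_priors = post.mean(axis=0)
        if np.abs(new_priors - priors).max() < tol:
            return new_priors, post
        priors = new_priors
    return priors, post

# Demonstration on synthetic data: the classifier was "trained" with
# priors 0.5/0.5, but the new data set is drawn with true priors
# 0.9/0.1 (hypothetical numbers chosen for this sketch).
rng = np.random.default_rng(0)
n1, n2 = 1800, 200
x = np.concatenate([rng.normal(0.0, 1.0, n1), rng.normal(2.0, 1.0, n2)])
like = np.column_stack([np.exp(-0.5 * x**2), np.exp(-0.5 * (x - 2.0)**2)])
post_t = like * 0.5 / (like * 0.5).sum(axis=1, keepdims=True)
new_priors, _ = adjust_to_new_priors(post_t, np.array([0.5, 0.5]))
print(new_priors.round(2))
```

With the seed above, the recovered priors come out close to the true 0.9/0.1 even though no class labels for the new data were used.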
We illustrate the procedure on artificial and real classification tasks and analyze its robustness with respect to imperfect estimation of the a posteriori probabilities provided by the classifier (section 4). Comparisons with a standard procedure used for a priori probabilities estimation (also in section 4) and a discussion with respect to the related work (section 5) are also provided.

2 Correcting a Posteriori Probability Estimates with Respect to New a Priori Probabilities

2.1 Data Classification. One of the most common uses of data is classification. Suppose that we want to forecast the unknown discrete value of a dependent (or response) variable ω based on a measurement vector—or observation vector—x. This discrete dependent variable takes its value in Ω = {ω_1, …, ω_n}—the n class labels. A training example is therefore a realization of a random feature vector, x, measured on an individual and allocated to one of the n classes in Ω. A training set is a collection of such training examples recorded for
the purpose of model building (training) and forecasting based on that model. The a priori probability of belonging to class ω_i in the training set will be denoted as p_t(ω_i) (in the sequel, subscript t will be used for estimates carried out on the basis of the training set). In the case-control example, p_t(ω_1) = p_t(disease) = 0.5 and p_t(ω_2) = p_t(no disease) = 0.5. For the purpose of training, we suppose that for each class ω_i, observations on N_ti individuals belonging to the class (with Σ_{i=1}^{n} N_ti = N_t, the total number of training examples) have been independently recorded according to the within-class probability density, p(x | ω_i). Indeed, case-control studies involve direct sampling from the within-class probability densities, p(x | ω_i). In a case-control study with two classes (as reported in section 1), this means that we made independent measurements on N_t1 individuals who contracted the disease (the cases), according to p(x | disease), and on N_t2 individuals who did not (the controls), according to p(x | no disease). The a priori probabilities of the classes in the training set are therefore estimated by their frequencies, p̂_t(ω_i) = N_ti / N_t. Let us suppose that we trained a classification model (the classifier), and denote by p̂_t(ω_i | x) the estimated a posteriori probability of belonging to class ω_i provided by the classifier, given that the feature vector x has been observed, in the conditions of the training set. The classification model (whose parameters are estimated on the basis of the training set, as indicated by subscript t) could be an artificial neural network, a logistic regression, or any other model that provides as output estimates of the a posteriori probabilities of the classes given the observation.
This is, for instance, the case if we use the least-squares error or the Kullback-Leibler divergence as a criterion for training and if the minimum of the criterion is reached (see, e.g., Richard & Lippmann, 1991, or Saerens, 2000, for a recent discussion). We therefore suppose that the model has n outputs, g_i(x) (i = 1, …, n), providing estimated posterior probabilities of membership p̂_t(ω_i | x) = g_i(x). In the experimental section (section 4), we will show that even imperfect approximations of these output probabilities allow reasonably good output corrections by the procedure to be presented below. Let us now suppose that the trained classification model has to be applied on another data set (new cases, e.g., real-world data to be scored) for which the class frequencies, estimating the a priori probabilities p(ω_i) (no subscript t), are suspected to be different from p̂_t(ω_i). The a posteriori probabilities provided by the model for these new cases will have to be corrected accordingly. As detailed in the next two sections, two cases must be considered, according to whether estimates of the new a priori probabilities p̂(ω_i) are, or are not, available for this new data set.

2.2 Adjusting the Outputs to New a Priori Probabilities: New a Priori Probabilities Known. In the sequel, we assume that the generation of the observations within the classes, and thus the within-class densities, p(x | ω_i),
does not change from the training set to the new data set (only the relative proportion of measurements observed from each class has changed). This is a natural requirement; it supposes that we choose the training set examples only on the basis of the class labels ω_i, not on the basis of x. We also assume that we have an estimate of the new a priori probabilities, p̂(ω_i). Suppose now that we are working on a new data set to be scored. Bayes' theorem provides

$$\hat p_t(x \mid \omega_i) = \frac{\hat p_t(\omega_i \mid x)\, \hat p_t(x)}{\hat p_t(\omega_i)}, \tag{2.1}$$

where the a posteriori probabilities p̂_t(ω_i | x) are obtained by applying the trained model as is (subscript t) on some observation x of the new data set (i.e., by scoring the data). These are the estimated a posteriori probabilities in the conditions of the training set (relying on the a priori probabilities of the training set). The corrected a posteriori probabilities, p̂(ω_i | x) (relying this time on the estimated a priori probabilities of the new data set), obey the same equation, but with p̂(ω_i) as the new a priori probabilities and p̂(x) as the new probability density function (no subscript t):

$$\hat p(x \mid \omega_i) = \frac{\hat p(\omega_i \mid x)\, \hat p(x)}{\hat p(\omega_i)}. \tag{2.2}$$

Since the within-class densities p̂(x | ω_i) do not change from training to real-world data (p̂_t(x | ω_i) = p̂(x | ω_i)), by equating equation 2.1 to 2.2 and defining f(x) = p̂_t(x) / p̂(x), we find

$$\hat p(\omega_i \mid x) = f(x)\, \frac{\hat p(\omega_i)}{\hat p_t(\omega_i)}\, \hat p_t(\omega_i \mid x). \tag{2.3}$$

Since Σ_{i=1}^{n} p̂(ω_i | x) = 1, we easily obtain

$$f(x) = \left[ \sum_{j=1}^{n} \frac{\hat p(\omega_j)}{\hat p_t(\omega_j)}\, \hat p_t(\omega_j \mid x) \right]^{-1},$$

and consequently

$$\hat p(\omega_i \mid x) = \frac{\dfrac{\hat p(\omega_i)}{\hat p_t(\omega_i)}\, \hat p_t(\omega_i \mid x)}{\displaystyle\sum_{j=1}^{n} \frac{\hat p(\omega_j)}{\hat p_t(\omega_j)}\, \hat p_t(\omega_j \mid x)}. \tag{2.4}$$

This well-known formula can be used to compute the corrected a posteriori probabilities, p̂(ω_i | x), in terms of the outputs provided by the trained
model, g_i(x) = p̂_t(ω_i | x), and the new priors p̂(ω_i). We observe that the new a posteriori probabilities p̂(ω_i | x) are simply the a posteriori probabilities in the conditions of the training set, p̂_t(ω_i | x), weighted by the ratio of the new priors to the old priors, p̂(ω_i) / p̂_t(ω_i). The denominator of equation 2.4 ensures that the corrected a posteriori probabilities sum to one. However, in many real-world cases, we ignore what the real-world a priori probabilities p(ω_i) are, since we do not know the class labels for these new data. This is the subject of the next section.

2.3 Adjusting the Outputs to New a Priori Probabilities: New a Priori Probabilities Unknown. When the new a priori probabilities are not known in advance, we cannot use equation 2.4, and the p(ω_i) probabilities have to be estimated from the new data set. In this section, we present an already known standard procedure used for new a priori probability estimation (the only one available in the literature to our knowledge); then we introduce our original method based on the EM algorithm.

2.3.1 Method 1: Confusion Matrix. The standard procedure used for a priori probabilities estimation is based on the computation of the confusion matrix, p̂(d_i | ω_j), an estimation of the probability of taking the decision d_i to classify an observation in class ω_i, while in fact it belongs to class ω_j (see, e.g., McLachlan, 1992, or McLachlan & Basford, 1988). In the sequel, this method will be referred to as the confusion matrix method. Here is its rationale. First, the confusion matrix p̂_t(d_i | ω_j) is estimated on the training set from cross-tabulated classification frequencies provided by the classifier. Once this confusion matrix has been computed on the training set, it is used in order to infer the a priori probabilities on a new data set by solving the following system of n linear equations,

$$\hat p(d_i) = \sum_{j=1}^{n} \hat p_t(d_i \mid \omega_j)\, \hat p(\omega_j), \qquad i = 1, \ldots, n, \tag{2.5}$$

with respect to the p̂(ω_j), where p̂(d_i) is simply the marginal probability of classifying an observation in class ω_i, estimated by the class label frequency after application of the classifier on the new data set. Once the p̂(ω_j) are computed from equation 2.5, we use equation 2.4 to infer the new a posteriori probabilities.

2.3.2 Method 2: EM Algorithm. We now present a new procedure for a priori and a posteriori probabilities adjustment, based on the EM algorithm (Dempster et al., 1977; McLachlan & Krishnan, 1997). This iterative algorithm increases the likelihood of the new data at each iteration until a local maximum is reached. Once again, let us suppose that we record a set of N new independent realizations of the stochastic variable x, x_1^N = (x_1, x_2, …, x_N), sampled from
Adjusting a Classifier to New a Priori Probabilities
The likelihood of these new observations is defined as

$$L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} p(x_k) = \prod_{k=1}^{N} \left[ \sum_{i=1}^{n} p(x_k, \omega_i) \right] = \prod_{k=1}^{N} \left[ \sum_{i=1}^{n} p(x_k \mid \omega_i)\, p(\omega_i) \right], \tag{2.6}$$
where the within-class densities, that is, the probabilities of observing $x_k$ given class $\omega_i$, remain the same ($p(x_k \mid \omega_i) = p_t(x_k \mid \omega_i)$), since we assume that only the a priori probabilities change from the training set to the new data set. We have to determine the estimates $\hat{p}(\omega_i)$ that maximize the likelihood 2.6 with respect to $p(\omega_i)$. While a closed-form solution to this problem cannot be found, we can obtain an iterative procedure for estimating the new $p(\omega_i)$ by applying the EM algorithm. As before, let us define $g_i(x_k)$ as the model's output value corresponding to class $\omega_i$ for the observation $x_k$ of the new data set to be scored. The model outputs provide an approximation of the a posteriori probabilities of the classes given the observation in the conditions of the training set (subscript $t$), while the a priori probabilities of the training set are estimated by class frequencies:

$$\hat{p}_t(\omega_i \mid x_k) = g_i(x_k), \tag{2.7}$$

$$\hat{p}_t(\omega_i) = \frac{N_{ti}}{N_t}. \tag{2.8}$$

Let us define $\hat{p}^{(s)}(\omega_i)$ and $\hat{p}^{(s)}(\omega_i \mid x_k)$ as the estimates of the new a priori and a posteriori probabilities at step $s$ of the iterative procedure. If the $\hat{p}^{(s)}(\omega_i)$ are initialized by the frequencies of the classes in the training set (see equation 2.8), the EM algorithm provides the following iterative steps (see the appendix) for each new observation $x_k$ and each class $\omega_i$:

$$\hat{p}^{(0)}(\omega_i) = \hat{p}_t(\omega_i),$$

$$\hat{p}^{(s)}(\omega_i \mid x_k) = \frac{\hat{p}^{(s)}(\omega_i)\, \dfrac{\hat{p}_t(\omega_i \mid x_k)}{\hat{p}_t(\omega_i)}}{\displaystyle\sum_{j=1}^{n} \hat{p}^{(s)}(\omega_j)\, \frac{\hat{p}_t(\omega_j \mid x_k)}{\hat{p}_t(\omega_j)}},$$

$$\hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(\omega_i \mid x_k). \tag{2.9}$$
M. Saerens, P. Latinne, and C. Decaestecker
where $\hat{p}_t(\omega_i \mid x_k)$ and $\hat{p}_t(\omega_i)$ are given by equations 2.7 and 2.8. Notice the similarity between equations 2.4 and 2.9. At each iteration step $s$, both the a posteriori probabilities $\hat{p}^{(s)}(\omega_i \mid x_k)$ and the a priori probabilities $\hat{p}^{(s)}(\omega_i)$ are reestimated sequentially for each new observation $x_k$ and each class $\omega_i$. The iterative procedure proceeds until the estimated probabilities $\hat{p}^{(s)}(\omega_i)$ converge. Of course, if we have some a priori knowledge about the values of the prior probabilities, we can take these as starting values for the initialization of the $\hat{p}^{(0)}(\omega_i)$. Notice also that, although we did not encounter this problem in our simulations, we must keep in mind that local maxima may occur (the EM algorithm finds a local maximum of the likelihood function). In order to obtain good a priori probability estimates, it is necessary that the a posteriori probabilities relative to the training set be reasonably well approximated (i.e., sufficiently well estimated by the model). The robustness of the EM procedure with respect to imperfect a posteriori probability estimates will be investigated in the experimental section (section 4).

3 Testing for Different A Priori Probabilities

In this section, we show that the likelihood ratio test can be used to decide whether the a priori probabilities have significantly changed from the training set to the new data set. Before adjusting the a priori probabilities (when the trained classification model is simply applied to the new data), the likelihood of the new observations is

$$L_t(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}_t(x_k) = \prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid \omega_i)\, \hat{p}_t(\omega_i)}{\hat{p}_t(\omega_i \mid x_k)} \right], \tag{3.1}$$
whatever the class label $\omega_i$, and where we used the fact that $p_t(x_k \mid \omega_i) = p(x_k \mid \omega_i)$. After the adjustment of the a priori and a posteriori probabilities, we compute the likelihood in the same way:

$$L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}(x_k) = \prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid \omega_i)\, \hat{p}(\omega_i)}{\hat{p}(\omega_i \mid x_k)} \right], \tag{3.2}$$
so that the likelihood ratio is

$$\frac{L(x_1, x_2, \ldots, x_N)}{L_t(x_1, x_2, \ldots, x_N)} = \frac{\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid \omega_i)\, \hat{p}(\omega_i)}{\hat{p}(\omega_i \mid x_k)} \right]}{\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid \omega_i)\, \hat{p}_t(\omega_i)}{\hat{p}_t(\omega_i \mid x_k)} \right]} = \frac{\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}(\omega_i)}{\hat{p}(\omega_i \mid x_k)} \right]}{\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}_t(\omega_i)}{\hat{p}_t(\omega_i \mid x_k)} \right]}, \tag{3.3}$$

and the log-likelihood ratio is

$$\log \frac{L(x_1, x_2, \ldots, x_N)}{L_t(x_1, x_2, \ldots, x_N)} = \sum_{k=1}^{N} \log \hat{p}_t(\omega_i \mid x_k) - \sum_{k=1}^{N} \log \hat{p}(\omega_i \mid x_k) + N \log \hat{p}(\omega_i) - N \log \hat{p}_t(\omega_i). \tag{3.4}$$
From standard statistical inference (see, e.g., Mood, Graybill, & Boes, 1974; Papoulis, 1991), $2 \log [L(x_1, x_2, \ldots, x_N) / L_t(x_1, x_2, \ldots, x_N)]$ is asymptotically distributed as a chi square with $(n - 1)$ degrees of freedom ($\chi^2_{(n-1)}$, where $n$ is the number of classes). Indeed, since $\sum_{i=1}^{n} \hat{p}(\omega_i) = 1$, there are only $(n - 1)$ degrees of freedom. This allows us to test whether the new a priori probabilities differ significantly from the original ones, and thus to decide whether the a posteriori probabilities (i.e., the model outputs) need to be corrected. Notice also that standard errors on the estimated a priori probabilities can be obtained through the computation of the observed information matrix, as detailed in McLachlan & Krishnan (1997).

4 Experimental Evaluation

4.1 Simulations on Artificial Data. We present a simple experiment that illustrates the iterative adjustment of the a priori and a posteriori probabilities. We chose a conventional multilayer perceptron (with one hidden layer and softmax output functions, trained with the Levenberg-Marquardt algorithm) as a classification model, as well as a database labeled Ringnorm, introduced by Breiman (1998).¹ This database consists of 7400 cases described by 20 numerical features and divided into two equidistributed classes (each drawn from a multivariate normal distribution with a different variance-covariance matrix).
¹ Available online at http://www.cs.utoronto.ca/~delve/data/datasets.html.
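Before turning to the results, the two estimation procedures of section 2.3 can be sketched in code. The following is our own illustrative NumPy implementation (the function names, the array layout, and the use of a least-squares solve for the linear system of equation 2.5 are our choices, not the paper's):

```python
import numpy as np

def em_new_priors(post_t, priors_t, n_iter=100, tol=1e-8):
    """Estimate new class priors from classifier outputs on an unlabeled
    data set (iterative procedure of equation 2.9).  `post_t` is an (N, n)
    array of the model's a posteriori outputs p_t(w_i | x_k) on the new
    data; `priors_t` holds the training-set priors p_t(w_i)."""
    priors = np.asarray(priors_t, dtype=float).copy()   # p^(0)(w_i) = p_t(w_i)
    for _ in range(n_iter):
        # E-step: reweight the training-set posteriors (equation 2.4)
        w = post_t * (priors / priors_t)
        post_new = w / w.sum(axis=1, keepdims=True)
        # M-step: new priors are the average adjusted posteriors
        priors_next = post_new.mean(axis=0)
        if np.abs(priors_next - priors).max() < tol:
            priors = priors_next
            break
        priors = priors_next
    return priors, post_new

def confusion_matrix_priors(conf_t, decision_freq):
    """Estimate new priors by solving the linear system of equation 2.5,
    p(d_i) = sum_j p_t(d_i | w_j) p(w_j).  `conf_t[i, j]` is p_t(d_i | w_j)
    estimated on the training set; `decision_freq[i]` is the frequency of
    decision d_i observed on the new data set."""
    priors, *_ = np.linalg.lstsq(conf_t, decision_freq, rcond=None)
    priors = np.clip(priors, 0.0, None)   # guard against sampling noise
    return priors / priors.sum()
```

Here `post_t` holds the raw model outputs $g_i(x_k)$ on the new data; `em_new_priors` also returns the adjusted posteriors of equation 2.4 evaluated at the last EM step.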
Table 1: Results of the Estimation of Priors on the Test Sets, Averaged on 10 Runs, Ringnorm Artificial Data Set.

True Priors   Estimated Prior by EM   Estimated Prior by Confusion Matrix   Number of Times the Log-Likelihood Ratio Test Is Significant
10%           14.7%                   18.1%                                 10
20%           21.4%                   24.2%                                 10
30%           33.0%                   34.4%                                 10
40%           42.5%                   42.7%                                 10
50%           49.2%                   49.0%                                 0
60%           59.0%                   57.1%                                 10
70%           66.8%                   64.8%                                 10
80%           77.3%                   73.9%                                 10
90%           85.6%                   80.9%                                 10

Note: The neural network has been trained on a data set with a priori probabilities of (50%, 50%).
Ten replications of the following experimental design were carried out. First, a training set of 500 cases of each class was extracted from the data ($p_t(\omega_1) = p_t(\omega_2) = 0.50$) and used for training a neural network with 10 hidden units. For each training set, nine independent test sets of 1000 cases were selected according to the following a priori probability sequence: $p(\omega_1) = 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90$ (with $p(\omega_2) = 1 - p(\omega_1)$). Then, for each test set, the EM procedure (see equation 2.9), as well as the confusion matrix procedure (see equation 2.5), was applied in order to estimate the new a priori probabilities and adjust the a posteriori probabilities provided by the model ($\hat{p}_t(\omega_1 \mid x) = g(x)$). In each experiment, a maximum of five iteration steps of the EM algorithm was sufficient to ensure the convergence of the estimated probabilities. Table 1 shows the estimated a priori probabilities for $\omega_1$. With respect to the EM algorithm, it also shows the number of times the likelihood ratio test was significant at $p < 0.01$ over these 10 replications. Table 2 presents the classification rates (computed on the test set) before and after the probability adjustments, as well as when the true priors of the test set ($p(\omega_i)$, which are unknown in a real-world situation) were used to adjust the classifier's outputs (using equation 2.4). This latter result can be considered an optimal reference in this experimental context. The results reported in Table 1 show that the EM algorithm was clearly superior to the confusion matrix method for a priori probability estimation and that the a priori probabilities are reasonably well estimated. Except in the case where $p(\omega_i) = p_t(\omega_i) = 0.50$, the likelihood ratio test revealed in each replication a significant difference (at $p < 0.01$) between the training and the test set a priori probabilities ($\hat{p}_t(\omega_i) \neq \hat{p}(\omega_i)$). The a priori estimates appeared slightly biased toward 50%, which reflects a bias of the neural network classifier trained on an equidistributed training set.
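The significance test counted in the last column of Table 1 can be sketched as follows. This is our own helper, not code from the paper; `post_t` and `post_new` are the classifier's a posteriori outputs before and after adjustment, and the statistic is the log-likelihood ratio of equation 3.4, which takes the same value for any fixed class index `i`:

```python
import numpy as np

def log_likelihood_ratio(post_t, post_new, priors_t, priors_new, i=0):
    """Log-likelihood ratio of equation 3.4, computed for an arbitrary
    fixed class label i (the ratio does not depend on the choice of i).
    2 * llr is asymptotically chi-square with (n - 1) degrees of freedom."""
    N = post_t.shape[0]
    return (np.sum(np.log(post_t[:, i])) - np.sum(np.log(post_new[:, i]))
            + N * np.log(priors_new[i]) - N * np.log(priors_t[i]))
```

With two classes there is one degree of freedom, so the change of priors would be declared significant at $p < 0.01$ when $2 \cdot \mathrm{llr}$ exceeds the chi-square critical value of about 6.63.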
Table 2: Classification Rates on the Test Sets, Averaged on 10 Runs, Ringnorm Artificial Data Set. The last four columns give the percentage of correct classification without adjustment and after adjustment using EM, the confusion matrix, and the true priors.

True Priors   No Adjustment   EM      Confusion Matrix   True Priors Used
10%           90.1%           93.6%   93.1%              94.0%
20%           90.3%           91.9%   91.7%              92.2%
30%           88.6%           89.9%   89.8%              90.0%
40%           90.4%           90.4%   90.4%              90.6%
50%           87.0%           86.9%   86.8%              87.0%
60%           90.0%           90.0%   90.0%              90.0%
70%           89.2%           89.8%   89.7%              90.2%
80%           89.5%           90.7%   90.7%              91.0%
90%           88.5%           91.6%   91.3%              92.0%
Looking at Table 2 (classification results), we observe that the impact of the output adjustment on classification accuracy can be significant. The effect was most beneficial when the new a priori probabilities, $p(\omega_i)$, were far from those of the training set ($p_t(\omega_i) = 0.50$). Notice that in each case, the classification rates obtained after adjustment were close to those obtained by using the true a priori probabilities of the test sets. Although the EM algorithm provides better estimates of the a priori probabilities, we found no difference between the EM algorithm and the confusion matrix method in terms of classification accuracy. This could be due to the high recognition rates observed for this problem. Notice also that we observe a small degradation in classification accuracy if we adjust the a priori probabilities when it is not necessary ($p_t(\omega_i) = p(\omega_i) = 0.5$), as indicated by the likelihood ratio test.

4.2 Robustness Evaluation on Artificial Data. This section investigates the robustness of the EM-based procedure with respect to imperfect estimates of the a posteriori probability values provided by the classifier, as well as to the sizes of the training and the test set (the test set alone is used to estimate the new a priori probabilities). In order to degrade the classifier outputs, we gradually decreased the size of the training set in steps. Symmetrically, in order to reduce the amount of data available to the EM and the confusion matrix algorithms, we also gradually decreased the size of the test set. For each condition, we compared the classifier outputs with those obtained with a Bayesian classifier based on the true data distribution (which is known for an artificial data set such as Ringnorm). We were thus able to quantify the error level of the classifier with respect to the true a posteriori probabilities (how far is our neural network from the Bayesian classifier?) and to evaluate the effects of a decrease in the training or test set sizes on the a priori estimates provided by EM and on the classification performances.
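The Ringnorm within-class densities are 20-dimensional Gaussians; as a minimal univariate stand-in, the reference Bayesian posterior $b(x)$ and the mean absolute deviation reported in Table 3 can be sketched as follows (our own helper functions and parameter names, for illustration only):

```python
import numpy as np

def bayes_posterior(x, mu, var, priors):
    """True a posteriori probabilities p(w_i | x) for univariate normal
    within-class densities (a stand-in for the known Ringnorm densities).
    `x` is an array of N scalar observations; `mu`, `var`, and `priors`
    each hold one entry per class."""
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    joint = dens * priors                         # p(x | w_i) p(w_i)
    return joint / joint.sum(axis=1, keepdims=True)

def mean_absolute_deviation(b, g):
    """(1/N) * sum_k |b(x_k) - g(x_k)| for class w_1, as in Table 3."""
    return np.mean(np.abs(b[:, 0] - g[:, 0]))
```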
Table 3: Averaged Results for Estimation of the Priors, Ringnorm Data Set, Averaged on 10 Runs.

Training Set   Test Set      Mean Absolute Deviation       Estimated Prior for ω1 by Using
(#ω1, #ω2)     (#ω1, #ω2)    (1/N) Σ |b(xk) − g(xk)|       EM       Confusion Matrix
(500, 500)     (200, 800)    0.107                         22.0%    24.7%
               (100, 400)    0.110                         21.6%    24.5%
               (40, 160)     0.104                         20.4%    23.5%
               (20, 80)      0.122                         22.7%    26.7%
(250, 250)     (200, 800)    0.139                         22.1%    25.3%
               (100, 400)    0.140                         22.6%    25.6%
               (40, 160)     0.134                         23.1%    25.8%
               (20, 80)      0.167                         22.7%    26.0%
(100, 100)     (200, 800)    0.183                         24.1%    27.5%
               (100, 400)    0.185                         24.4%    28.2%
               (40, 160)     0.181                         23.5%    27.3%
               (20, 80)      0.180                         26.6%    29.2%
(50, 50)       (200, 800)    0.202                         24.9%    28.5%
               (100, 400)    0.199                         25.3%    29.0%
               (40, 160)     0.203                         24.3%    27.6%
               (20, 80)      0.189                         22.3%    26.0%

Note: The true priors of the test sets are (20%, 80%), i.e., p(ω1) = 0.20.
As for the experiment reported above, a multilayer perceptron was trained on the basis of an equidistributed training set ($p_t(\omega_1) = p_t(\omega_2) = 0.5$). An independent and unbalanced test set (with $p(\omega_1) = 0.20$ and $p(\omega_2) = 0.80$) was selected and scored by the neural network. The experiments (10 replications in each condition) were carried out on the basis of training and test sets of decreasing sizes (1000, 500, 200, and 100 cases), as detailed in Table 3. We first compared the artificial neural network's output values ($g(x) = \hat{p}_t(\omega_1 \mid x)$, obtained by scoring the test sets with the trained neural network) with those provided by the Bayesian classifier ($b(x) = p_t(\omega_1 \mid x)$, obtained by scoring the test sets with the Bayesian classifier) on the test sets before output readjustment; that is, we measured the discrepancy between the outputs of the neural and the Bayesian classifiers before output adjustment. For this purpose, we computed the averaged absolute deviation between the output values of the neural and the Bayesian classifiers (the average of $|b(x) - g(x)|$) before output adjustment. Then, for each test set, the EM and the confusion matrix procedures were applied to the outputs of the neural classifier in order to estimate the new a priori probabilities and the new a posteriori probabilities. The results for a priori probability estimation are detailed in Table 3.

Looking at the mean absolute deviation in Table 3, it can be seen that, as expected, decreasing the training set size results in a degradation in the estimation of the a posteriori probabilities (an increase in absolute deviation of about 0.10 between large, i.e., $N_t = 1000$, and small, i.e., $N_t = 100$, training set sizes). Of course, the prior estimates degraded accordingly, but only slightly. The EM algorithm appeared to be more robust than the confusion matrix method. Indeed, on average (over all the experiments), the EM method overestimated the prior $p(\omega_1)$ by 3.3%, while the confusion matrix method overestimated it by 6.6%. In contrast, decreasing the size of the test set seems to have very little effect on the results.

[Figure 1 is a plot of the average classification rate (80% to 100%) for the 16 combinations of training set size, from (50, 50) to (500, 500), and test set size, from (20, 80) to (200, 800).]

Figure 1: Classification rates obtained on the Ringnorm data set. Results are reported for four different conditions: (1) without adjusting the classifier output (no adjustment); (2) adjusting the classifier output by using the confusion matrix method (after adjustment by confusion matrix); (3) adjusting the classifier output by using the EM algorithm (after adjustment by EM); and (4) adjusting the classifier output by using the true a priori probabilities of the new data (using true priors). The results are plotted as a function of the sizes of both the training and the test sets.

Figure 1 shows the classification rates (averaged over the 10 replications) of the neural network before and after the output adjustments made by the EM and the confusion matrix methods. It also illustrates the degradation in
classifier performances due to the decrease in the size of the training sets: a loss of about 8% between large (i.e., $N_t = 1000$) and small (i.e., $N_t = 100$) training set sizes. The classification rates obtained after the adjustments made by the confusion matrix method are very close to those obtained with the EM method. In fact, the EM method almost always (in 15 of the 16 conditions) provided better results, but the differences in accuracy between the two methods are very small (0.3% on average). As already observed in the first experiment (see Table 2), the classification rates obtained after adjustment by the EM or the confusion matrix method are very close to those obtained by using the true a priori probabilities (a difference of 0.2% on average). Finally, we clearly observe (see Figure 1) that adjusting the outputs of the classifier always increased classification accuracy significantly.

4.3 Tests on Real Data. We also tested the a priori estimation and output readjustment method on three real medical data sets from the UCI repository (Blake, Keogh, & Merz, 1998), in order to confirm our results on more realistic data. These data sets are Pima Indian Diabetes (2 classes of 268 and 500 cases, 8 features), Breast Cancer Wisconsin (2 classes of 239 and 444 cases after omission of the 16 cases with missing values, 9 features), and Bupa Liver Disorders (2 classes of 145 and 200 cases, 6 features). A training set of 50 cases of each class was selected from each data set and used for training a multilayer neural network; the remaining cases were used for selecting an independent test set. In order to increase the difference between the class distributions in the training set (0.50, 0.50) and the test sets, we omitted a number of cases from the smallest class in order to obtain a class distribution of ($p(\omega_1) = 0.20$, $p(\omega_2) = 0.80$) for the test set.
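The case-omission step described above can be sketched as follows. This is our own helper, not code from the paper; it assumes the classes in `y` sort (as by `np.unique`) into the same order as the entries of `target_priors`:

```python
import numpy as np

def resample_to_priors(X, y, target_priors, rng=None):
    """Subsample a labeled set so that its class frequencies match
    `target_priors`, by omitting cases from the over-represented classes."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = np.asarray(target_priors, dtype=float)
    # largest total size achievable without oversampling any class
    n_total = int(np.floor((counts / target).min()))
    keep = []
    for c, p in zip(classes, target):
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=int(round(n_total * p)), replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```

Classes over-represented relative to the target distribution are thinned; the kept total is the largest size achievable without oversampling any class.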
Ten different selections of training and test sets were carried out, and for each of them, the training phase was replicated 10 times, giving a total of 100 trained neural networks for each data set. On average over the 100 experiments, Table 4 details the a priori probabilities estimated by means of the EM and the confusion matrix methods, as

Table 4: Classification Results on Three Real Data Sets.

                          Priors Estimated by         Percentage of Correct Classification
Data Set   True Priors    EM      Confusion Matrix    No Adjustment   EM      Confusion Matrix   True Priors
Diabetes   20%            24.8%   31.3%               67.4%           76.3%   74.4%              78.3%
Breast     20%            18.0%   26.2%               91.3%           92.0%   92.1%              92.6%
Liver      20%            24.6%   21.5%               68.0%           75.7%   75.5%              79.1%

Note: The neural network has been trained on a learning set with a priori probabilities of (50%, 50%). The last four columns give the rates before adjustment and after adjustment using EM, the confusion matrix, and the true priors.
well as the classification rates before and after the probability adjustments. These results show that the EM prior estimates were generally better than the confusion matrix ones (except for the Liver data). Moreover, adjusting the classifier outputs on the basis of the new a priori probabilities always increased the classification rates and provided accuracy levels not too far from those obtained by using the true priors for adjusting the outputs (given in the last column of Table 4). However, except for the Diabetes data, for which EM gave better results, the adjustments made on the basis of the EM and the confusion matrix methods seemed to have the same effect on the accuracy improvement.

5 Related Work

The problem of estimating the parameters of a model by including unlabeled data in addition to the labeled samples has been studied in both the machine learning and the artificial neural network communities. In this case, we speak of learning from partially labeled data (see, e.g., Shahshahani & Landgrebe, 1994; Ghahramani & Jordan, 1994; Castelli & Cover, 1995; Towell, 1996; Miller & Uyar, 1997; Nigam, McCallum, Thrun, & Mitchell, 2000). The purpose is to use both labeled and unlabeled data for learning, since unlabeled data are usually easy to collect, while labeled data are much more difficult to obtain. In this framework, the labeled part (the training set in our case) and the unlabeled part (the new data set in our case) are combined into one data set, and a partly supervised EM algorithm is used to fit the model (a classifier) by maximizing the full likelihood of the complete set of data (training set plus new data set). For instance, Nigam et al. (2000) use the EM algorithm to learn classifiers that take advantage of both labeled and unlabeled data. This procedure could easily be applied to our problem of adjusting the a posteriori probabilities provided by a classifier to new a priori conditions. Moreover, it makes fully efficient use of the available data. However, on the downside, the model has to be completely refitted each time it is applied to a new data set. This is the opposite of the approach discussed in this article, where the model is fitted only on the training set. When applied to a new data set, the model is not modified; only its outputs are recomputed based on the new observations. Related problems involving missing data have also been studied in applied statistics. Some good recent reference pointers are Scott and Wild (1997) and Lawless, Kalbfleisch, and Wild (1999).

6 Conclusion

We presented a simple procedure for adjusting the outputs of a classifier to new a priori class probabilities. This procedure is a simple instance of the EM algorithm. When deriving this procedure, we relied on three fundamental
assumptions:

1. The a posteriori probabilities provided by the model (our readjustment procedure can be applied only if the classifier outputs an estimate of the a posteriori probabilities) are reasonably well approximated, which means that the model provides predicted probabilities of belonging to the classes that are sufficiently close to the observed probabilities.

2. The training set selection (the sampling) has been performed on the basis of the discrete dependent variable (the classes), and not of the observed input variable x (the explanatory variable), so that the within-class probability densities do not change.

3. The new data set to be scored is large enough to allow an accurate estimation of the new a priori class probabilities.

If sampling also occurs on the basis of x, the usual sample-survey solution to this problem is to use weighted maximum likelihood estimators with weights inversely proportional to the selection probabilities, which are supposed to be known (see, e.g., Kish & Frankel, 1974).

Experimental results show that our new procedure based on EM performs better than the standard method (based on the confusion matrix) for new a priori probability estimation. The results also show that even if the classifier's outputs provide imperfect a posteriori probability estimates, the EM procedure is able to provide reasonably good estimates of the new a priori probabilities. The classifier with adjusted outputs always performs better than the original one if the a priori conditions differ from the training set to the real-world data, and the gain in classification accuracy can be significant. The classification performances after adjustment by EM are relatively close to the results obtained by using the true priors (which are unknown in a real-world situation), even when the a posteriori probabilities are imperfectly estimated. Additionally, the quality of the estimates does not appear to depend strongly on the size of the new data set. All these results enable us to relax, to a certain extent, the first and third assumptions above.

We also observed that adjusting the outputs of the classifier when not needed (i.e., when the a priori probabilities of the training set and the real-world data do not differ) can result in a decrease in classification accuracy. We therefore showed that a likelihood ratio test can be used to decide whether the a priori probabilities have significantly changed from the training set to the new data set. The readjustment procedure should be applied only when a significant change of the a priori probabilities is found.
Notice that the EM-based adjustment procedure could be useful in the context of disease prevalence estimation. In this application, the primary objective is the estimation of the class proportions in an unlabeled data set (i.e., the class a priori probabilities); classification accuracy is not important per se. Another important problem, also encountered in medicine, concerns the automatic estimation of the proportions of the different cell populations constituting, for example, a smear or a lesion (such as a tumor). Mixed tumors are composed of two or more cell populations with different lineages, as, for example, in brain glial tumors (Decaestecker et al., 1997). In this case, a classifier is trained on a sample of images of reference cells provided by tumors with a pure lineage (which did not present diagnostic difficulties) and labeled by experts. When a tumor is suspected to be mixed, the classifier is applied to a sample of cells from this tumor (a few hundred) in order to estimate the proportions of the different cell populations. The main motivation for determining these proportions in mixed tumors is that the different lineage components may differ significantly with respect to their susceptibility to aggressive progression and may thus influence patients' prognoses. In this case, the primary goal is the determination of the proportions of the cell populations, which correspond to the new a priori probabilities. Another practical use of our readjustment procedure is the automatic labeling of geographical maps based on remote sensing information. Each portion of the map has to be labeled according to its nature (e.g., forest, agricultural zone, urban zone). In this case, the a priori probabilities are unknown in advance and vary considerably from one image to another, since they directly depend on the geographical area that has been observed (e.g., urban area, country area). We are now actively working on these biomedical and geographical problems.
Appendix: Derivation of the EM Algorithm

Our derivation of the iterative process (see equation 2.9) closely follows the estimation of the mixing proportions of densities (see McLachlan & Krishnan, 1997). Indeed, $p(x \mid \omega_i)$ can be viewed as a probability density defined by equation 2.1. The EM algorithm supposes that there exists a set of unobserved data, defined as the class labels of the observations of the new data set. In order to pose the problem as an incomplete-data one, associated with the new observed data, $X_1^N = (x_1, x_2, \ldots, x_N)$, we introduce the unobservable data $Z_1^N = (z_1, z_2, \ldots, z_N)$, where each vector $z_k$ is associated with one of the $n$ mutually exclusive classes: $z_k$ will represent the class label, among $(\omega_1, \ldots, \omega_n)$, of the observation $x_k$. More precisely, each $z_k$ will be defined as an indicator
vector: if $z_{ki}$ is the component $i$ of vector $z_k$, then $z_{ki} = 1$ and $z_{kj} = 0$ for each $j \neq i$ if and only if the class label associated with observation $x_k$ is $\omega_i$. For instance, if the observation $x_k$ is assigned to class label $\omega_i$, then

$$z_k = [0, \ldots, 0, 1, 0, \ldots, 0]^T,$$

with the 1 in position $i$.
Let us denote by $\boldsymbol{\pi} = [p(\omega_1), p(\omega_2), \ldots, p(\omega_n)]^T$ the vector of a priori probabilities (the parameters) to be estimated. The likelihood of the complete data (for the new data set) is

$$L(X_1^N, Z_1^N \mid \boldsymbol{\pi}) = \prod_{k=1}^{N} \prod_{i=1}^{n} \left[ p(x_k, \omega_i) \right]^{z_{ki}} = \prod_{k=1}^{N} \prod_{i=1}^{n} \left[ p(x_k \mid \omega_i)\, p(\omega_i) \right]^{z_{ki}}, \tag{A.1}$$
where $p(x_k \mid \omega_i)$ is constant (it does not depend on the parameter vector $\boldsymbol{\pi}$) and the $p(\omega_i)$ probabilities are the parameters to be estimated. The log-likelihood is

$$l(X_1^N, Z_1^N \mid \boldsymbol{\pi}) = \log L(X_1^N, Z_1^N \mid \boldsymbol{\pi}) = \sum_{k=1}^{N} \sum_{i=1}^{n} z_{ki} \log p(\omega_i) + \sum_{k=1}^{N} \sum_{i=1}^{n} z_{ki} \log p(x_k \mid \omega_i). \tag{A.2}$$
Since the $Z_1^N$ data are unobservable, during the E-step we replace the log-likelihood function by its conditional expectation over $p(Z_1^N \mid X_1^N, \boldsymbol{\pi})$: $E_{Z_1^N}[l \mid X_1^N, \boldsymbol{\pi}]$. Moreover, since we need to know the value of $\boldsymbol{\pi}$ in order to compute $E_{Z_1^N}[l \mid X_1^N, \boldsymbol{\pi}]$ (the expected log-likelihood), we use, as a current guess for $\boldsymbol{\pi}$, the current value (at iteration step $s$) of the parameter vector, $\hat{\boldsymbol{\pi}}^{(s)} = [\hat{p}^{(s)}(\omega_1), \hat{p}^{(s)}(\omega_2), \ldots, \hat{p}^{(s)}(\omega_n)]^T$:

$$Q(\boldsymbol{\pi}, \hat{\boldsymbol{\pi}}^{(s)}) = E_{Z_1^N}\!\left[ l(X_1^N, Z_1^N \mid \boldsymbol{\pi}) \mid X_1^N, \hat{\boldsymbol{\pi}}^{(s)} \right] = \sum_{k=1}^{N} \sum_{i=1}^{n} E_{Z_1^N}\!\left[ z_{ki} \mid x_k, \hat{\boldsymbol{\pi}}^{(s)} \right] \log p(\omega_i) + \sum_{k=1}^{N} \sum_{i=1}^{n} E_{Z_1^N}\!\left[ z_{ki} \mid x_k, \hat{\boldsymbol{\pi}}^{(s)} \right] \log p(x_k \mid \omega_i), \tag{A.3}$$
where we assumed that the complete-data observations $\{(x_k, z_k), k = 1, \ldots, N\}$ are independent. For the expectation of the unobservable data, we obtain

$$E_{Z_1^N}\!\left[ z_{ki} \mid x_k, \hat{\boldsymbol{\pi}}^{(s)} \right] = 0 \cdot p(z_{ki} = 0 \mid x_k, \hat{\boldsymbol{\pi}}^{(s)}) + 1 \cdot p(z_{ki} = 1 \mid x_k, \hat{\boldsymbol{\pi}}^{(s)}) = p(z_{ki} = 1 \mid x_k, \hat{\boldsymbol{\pi}}^{(s)}) = \hat{p}^{(s)}(\omega_i \mid x_k) = \frac{\hat{p}^{(s)}(\omega_i)\, \dfrac{\hat{p}_t(\omega_i \mid x_k)}{\hat{p}_t(\omega_i)}}{\displaystyle\sum_{j=1}^{n} \hat{p}^{(s)}(\omega_j)\, \frac{\hat{p}_t(\omega_j \mid x_k)}{\hat{p}_t(\omega_j)}}, \tag{A.4}$$

where we used equation 2.4 at the last step. The expected likelihood is therefore

$$Q(\boldsymbol{\pi}, \hat{\boldsymbol{\pi}}^{(s)}) = \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(\omega_i \mid x_k) \log p(\omega_i) + \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(\omega_i \mid x_k) \log p(x_k \mid \omega_i), \tag{A.5}$$
where $\hat{p}^{(s)}(\omega_i \mid x_k)$ is given by equation A.4. For the M-step, we compute the maximum of $Q(\boldsymbol{\pi}, \hat{\boldsymbol{\pi}}^{(s)})$ (see equation A.5) with respect to the parameter vector $\boldsymbol{\pi} = [p(\omega_1), p(\omega_2), \ldots, p(\omega_n)]^T$. The new estimate at step $(s + 1)$ will therefore be the value of the parameter vector $\boldsymbol{\pi}$ that maximizes $Q(\boldsymbol{\pi}, \hat{\boldsymbol{\pi}}^{(s)})$. Since we have the constraint $\sum_{i=1}^{n} p(\omega_i) = 1$, we define the Lagrange function as

$$\ell(\boldsymbol{\pi}) = Q(\boldsymbol{\pi}, \hat{\boldsymbol{\pi}}^{(s)}) + \lambda \left( 1 - \sum_{i=1}^{n} p(\omega_i) \right) = \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(\omega_i \mid x_k) \log p(\omega_i) + \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(\omega_i \mid x_k) \log p(x_k \mid \omega_i) + \lambda \left( 1 - \sum_{i=1}^{n} p(\omega_i) \right). \tag{A.6}$$

By computing $\partial \ell(\boldsymbol{\pi}) / \partial p(\omega_j) = 0$, we obtain

$$\sum_{k=1}^{N} \hat{p}^{(s)}(\omega_j \mid x_k) = \lambda\, p(\omega_j) \tag{A.7}$$

for $j = 1, \ldots, n$. If we sum this equation over $j$, we obtain the value of the Lagrange parameter, $\lambda = N$, so that

$$p(\omega_j) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(\omega_j \mid x_k), \tag{A.8}$$
and the next estimate of $p(\omega_i)$ is therefore

$$\hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(\omega_i \mid x_k), \tag{A.9}$$
so that equations A.4 (E-step) and A.9 (M-step) are repeated until the convergence of the parameter vector $\boldsymbol{\pi}$. The overall procedure is summarized in equation 2.9. It can be shown that this iterative process increases the likelihood (see equation 2.6) at each step (see, e.g., Dempster et al., 1977; McLachlan & Krishnan, 1997).

Acknowledgments

Part of this work was supported by project RBC-BR 216/4041 of the Région de Bruxelles-Capitale, and by funding from the SmalS-MvM. P. L. is supported by a grant under an Action de Recherche Concertée program of the Communauté Française de Belgique. C. D. is a research associate with the FNRS (Belgian National Scientific Research Fund). We also thank the two anonymous reviewers for their pertinent and constructive remarks.

References

Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26, 801–849.

Castelli, V., & Cover, T. (1995). On the exponential value of labelled samples. Pattern Recognition Letters, 16, 105–111.

Decaestecker, C., Lopes, M.-B., Gordower, L., Camby, I., Cras, P., Martin, J.-J., Kiss, R., VandenBerg, S., & Salmon, I. (1997). Quantitative chromatin pattern description in Feulgen-stained nuclei as a diagnostic tool to characterise the oligodendroglial and astroglial components in mixed oligo-astrocytomas. Journal of Neuropathology and Experimental Neurology, 56, 391–402.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.

Ghahramani, Z., & Jordan, M. (1994). Supervised learning from incomplete data via an EM algorithm. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 120–127).
San Mateo, CA: Morgan Kaufmann.

Hand, D. (1981). Discrimination and classification. New York: Wiley.

Kish, L., & Frankel, M. (1974). Inference from complex samples (with discussion). Journal of the Royal Statistical Society B, 36, 1–37.
Adjusting a Classifier to New a Priori Probabilities
Lawless, J., Kalbfleisch, J., & Wild, C. (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society B, 61, 413–438.

McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley.

McLachlan, G., & Basford, K. (1988). Mixture models, inference and applications to clustering. New York: Marcel Dekker.

McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.

Melsa, J., & Cohn, D. (1978). Decision and estimation theory. New York: McGraw-Hill.

Miller, D., & Uyar, S. (1997). A mixture of experts classifier with learning based on both labeled and unlabeled data. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 571–578). Cambridge, MA: MIT Press.

Mood, A., Graybill, F., & Boes, D. (1974). Introduction to the theory of statistics (3rd ed.). New York: McGraw-Hill.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.

Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill.

Richard, M., & Lippmann, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.

Saerens, M. (2000). Building cost functions minimizing to some summary statistics. IEEE Transactions on Neural Networks, 11, 1263–1271.

Scott, A., & Wild, C. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71.

Shahshahani, B., & Landgrebe, D. (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 1087–1095.

Towell, G. (1996). Using unlabeled data for supervised learning. In D. Touretzky, M. Mozer, & M.
Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 647–653). Cambridge, MA: MIT Press.

Received April 19, 2000; accepted March 30, 2001.
LETTER
Communicated by George Gerstein
Unitary Events in Multiple Single-Neuron Spiking Activity: I. Detection and Significance

Sonja Grün
[email protected] Department of Neurophysiology, Max-Planck Institute for Brain Research, D-60528 Frankfurt/ Main, Germany Markus Diesmann
[email protected] Department of Nonlinear Dynamics, Max-Planck Institut fur ¨ Str¨omungsforschung, D-37073 G¨ottingen, Germany Ad Aertsen
[email protected] Department of Neurobiology and Biophysics, Institute of Biology III, Albert-LudwigsUniversity, D-79104 Freiburg, Germany It has been proposed that cortical neurons organize dynamically into functional groups (cell assemblies) by the temporal structure of their joint spiking activity. Here, we describe a novel method to detect conspicuous patterns of coincident joint spike activity among simultaneously recorded single neurons. The statistical signicance of these unitary events of coincident joint spike activity is evaluated by the joint-surprise. The method is tested and calibrated on the basis of simulated, stationary spike trains of independently ring neurons, into which coincident joint spike events were inserted under controlled conditions. The sensitivity and specicity of the method are investigated for their dependence on physiological parameters (ring rate, coincidence precision, coincidence pattern complexity) and temporal resolution of the analysis. In the companion article in this issue, we describe an extension of the method, designed to deal with nonstationary ring rates. 1 Introduction
In the classical view, firing rates play a central role in neural coding (Barlow, 1972, 1992). This approach indeed led to fundamental insights into the neuronal mechanisms of brain function (Georgopoulos, Taira, & Lukashin, 1993; Hubel & Wiesel, 1977; Newsome, Britten, & Movshon, 1989). In parallel, however, a different concept was developed, in which the temporal organization of spike discharges within functional groups of neurons—

Neural Computation 14, 43–80 (2001)
© 2001 Massachusetts Institute of Technology
so-called neuronal assemblies (Hebb, 1949)—contribute to neural coding (von der Malsburg, 1981; Abeles, 1982b, 1991; Gerstein, Bedenbaugh, & Aertsen, 1989; Palm, 1990; Singer, 1993). It was argued that the biophysics of synaptic integration favors coincident presynaptic events over asynchronous ones (Abeles, 1982c; Softky & Koch, 1993). Accordingly, synchronized spikes are considered a property of neuronal signals that can be detected and propagated by other neurons (Diesmann, Gewaltig, & Aertsen, 1999). In addition, these spike correlations should be dynamic, reflecting varying affiliations of the neurons, depending on stimulus and behavioral context. Thereby, synchrony of firing would be directly available to the brain as a potential neural code (Perkel & Bullock, 1968; Johannesma, Aertsen, van den Boogaard, Eggermont, & Epping, 1986). Dynamic modulations of spike correlation at various levels of precision have in fact been observed in different cortical areas: visual (Eckhorn et al., 1988; Gray & Singer, 1989; for reviews, see Engel, König, Schillen, & Singer, 1992; Aertsen & Arndt, 1993; Singer & Gray, 1995; Roelfsema, Engel, König, & Singer, 1996; Singer et al., 1997; Singer, 1999), auditory (Ahissar, Bergman, & Vaadia, 1992; Eggermont, 1992; DeCharms & Merzenich, 1996; Sakurai, 1996), somatosensory (Laubach, Wessberg, & Nicolelis, 2000; Nicolelis, Baccala, Lin, & Chapin, 1995; Steinmetz et al., 2000), motor (Murthy & Fetz, 1992; Sanes & Donoghue, 1993; Hatsopoulos, Ojakangas, Paninski, & Donoghue, 1998), and frontal (Aertsen et al., 1991; Abeles, Vaadia, Prut, Haalman, & Slovin, 1993; Abeles, Bergman, Margalit, & Vaadia, 1993; Vaadia et al., 1995; Prut et al., 1998). Little is known, however, about the functional role of temporal organization in such signals. First hints toward the importance of accurate spike patterns came from the work of Abeles and colleagues (Abeles, Vaadia, et al., 1993; Abeles, Bergman, et al., 1993; Prut et al., 1998).
They observed that multiple single-neuron recordings from the frontal cortex of awake, behaving monkeys contain an abundance of recurring precise spike patterns. These patterns had a duration of up to several hundred milliseconds, repeated with a precision of ±1 ms, and occurred in systematic relation to sensory stimuli and behavioral events. To test the hypothesis that cortical neurons coordinate their spiking activity into volleys of precise synchrony, we developed a method to detect the presence of conspicuous spike coincidences in simultaneously recorded multiple single-unit spike trains and to evaluate their statistical significance. We refer to such conspicuous coincidences as unitary events and define them as those joint spike constellations that recur more often than expected by chance (Grün, Aertsen, Abeles, Gerstein, & Palm, 1994; Grün, 1996). Briefly, the algorithm works as follows: The simultaneous observation of spiking events from N neurons is described mathematically by the joint process composed of N parallel point processes. By appropriate binning, this is transformed into an N-fold (0, 1)-process, the statistics of which are described by the set of activity vectors reflecting the various (0, 1)-constellations occurring across the neurons. Under the null hypothesis of independent firing,
the expected number of occurrences of any activity vector and its probability distribution can be calculated analytically on the basis of the single-neuron firing rates. The degree of deviation from independence is derived by comparing these theoretically derived values with their empirical counterparts. Those activity vectors that violate the null-hypothesis of independence define potentially interesting occurrences of joint-events; their composition defines the set of neurons that are momentarily engaged in synchronous activity. To test the significance of such unitary coincident events, we developed a new statistical measure: the joint-surprise. For any particular activity vector, the joint-surprise measures the cumulative probability of finding the observed number of coincidences or an even larger one by chance. To account for nonstationarities in the discharge rates, modulations in spike rates and coincidence rates are determined on the basis of short data segments by sliding a fixed time window (typically 100 ms wide) along the data in steps of the coincidence bin width. This segmentation is applied to each trial, and the data of corresponding segments in all trials are analyzed as one quasi-stationary data set, using the appropriate rate approximation. Having first ascertained the statistical significance of brief epochs of synchronous spiking, the functional significance of such unitary coincident events is then tested by investigating the times of their occurrence and their composition in relation to sensory stimuli and behavioral events. Thus, Riehle, Grün, Diesmann, and Aertsen (1997) found that simultaneously recorded activities of neurons in monkey primary motor cortex exhibited context-dependent, rapid changes in the patterns of coincident action potentials during performance of a delayed-pointing task.
Accurate spike synchronization occurred in relation to external events (visual stimuli, hand movements), commonly accompanied by discharge rate modulations, however, without precise time locking of the spikes to these external events. Accurate spike synchronization also occurred in relation to purely internal events (stimulus expectancy), where firing-rate modulations were distinctly absent. These findings indicate that internally generated synchronization of individual spike discharges may subserve the cortical organization of cognitive motor processes. The clear correlation of the precise spike coincidences with behavioral events was interpreted as evidence for their functional relevance (Riehle et al., 1997; Fetz, 1997). Thus, unitary event analysis evoked a considerable amount of interest in the ongoing debate on spike synchronization (Shadlen & Newsome, 1998; Diesmann et al., 1999) and its detectability in experimental data (Pauluis & Baker, 2000; Roy, Steinmetz, & Niebur, 2000; Gütig, Aertsen, & Rotter, in press). It is currently used in a number of laboratories (Pauluis, 1999; Grammont & Riehle, 1999; Riehle, Grammont, Diesmann, & Grün, 2000). Here we provide for the first time a full account of the unitary event method and discuss its underlying principles in detail. In this article, we describe
the theory and statistical background of the analysis method for stationary conditions—when the firing rates of the neurons under observation do not change as a function of time. Simulated spike trains, consisting of parallel, independent Poisson processes into which we inserted particular coincident spike constellations under controlled conditions, were used to test and calibrate the method. In the companion article in this issue, we extend the method to deal with nonstationary firing rates and to illustrate its potential by analyzing multiple single-neuron recordings from frontal and motor cortical areas in awake, behaving monkeys. Preliminary descriptions of the method have been presented in abstract form (Grün, Aertsen, Abeles, & Gerstein, 1993; Grün et al., 1994; Grün & Aertsen, 1998) and in Riehle et al. (1997).

2 Detecting Unitary Events in Joint Spiking Activity

2.1 Representation of Joint Spiking Activity. By introducing a temporal resolution Δ, the spike train of a single neuron i recorded over a time interval of length T can be represented by a binary process v_i(t), that is, as a (0, 1)-sequence. With T = ⌊T/Δ⌋ denoting the total number of time steps, we define

$$v_i(t) = \begin{cases} 1, & \text{if spike in } [t, t + \Delta) \\ 0, & \text{if no spike in } [t, t + \Delta), \end{cases} \qquad t = 0, 1\Delta, 2\Delta, \ldots, (T - 1)\Delta. \tag{2.1}$$
The minimal Δ is set by the spike time resolution h (in the data we analyzed, typically 1 ms). In our analysis, we used integer multiples of the data resolution, Δ = bh, for the binning grid, with b serving as a control parameter for the analysis precision (Δ is called the analysis bin size). Thus, each point in time is assigned to a unique bin, which we refer to as exclusive binning. A single bin, however, may contain more than one spike. Equation 2.1 guarantees that v_i(t) is restricted to 1, even if the corresponding bin contains more than one spike, which we refer to as clipping. The simultaneous observation of spike events from N neurons can now be represented in this binary framework. The activities of the individual neurons i are described by parallel binary sequences v_i(t). Alternatively, we can describe the N sequences as a single sequence of a vector-valued function v(t), the components of which are formed by the v_i(t):

$$\mathbf{v}(t) = \begin{bmatrix} v_1(t) \\ \vdots \\ v_i(t) \\ \vdots \\ v_N(t) \end{bmatrix}, \qquad i = 1, \ldots, N; \; v_i \in \{0, 1\}. \tag{2.2}$$
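As a minimal illustration of this binning scheme (a sketch only; the function name and interface are ours, not from the original), the exclusive binning and clipping of equation 2.1 might look as follows:

```python
import math

def bin_spike_train(spike_times, duration, delta):
    """Turn a list of spike times into a (0,1)-sequence (equation 2.1).

    Exclusive binning: every point in time belongs to exactly one bin
    [t, t + delta). Clipping: a bin holding several spikes still yields 1.
    """
    n_steps = math.floor(duration / delta)   # T = floor(duration / delta)
    v = [0] * n_steps
    for s in spike_times:
        idx = int(s // delta)
        if 0 <= idx < n_steps:
            v[idx] = 1                       # clipped to at most 1 per bin
    return v

# Spikes at 1.0 and 3.2 ms fall into the same 5 ms bin and are clipped.
print(bin_spike_train([1.0, 3.2, 12.7], duration=20.0, delta=5.0))  # [1, 0, 1, 0]
```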
Figure 1: N parallel binary processes, lasting for T time steps. Each horizontal row, consisting of a sequence of 0s and 1s, represents a realization of a single process v_i. The 1s mark the occurrences of spike events. The joint activity across the processes at each instant in time can be expressed by a vector v(t), as indicated for one example. The empirical firing probability per bin p_i of each single process is evaluated as the marginal probability: the number of spikes in the observation time interval, divided by the number of time steps.
This scheme is illustrated in Figure 1. At each time step, v(t) equals one of the m = 2^N possible constellations of 0s and 1s (coincidence patterns). The m possible constellations v^k are identified by a (for now) arbitrary index function k (e.g., let v^k be mapped to an integer k ∈ {1, …, m} by interpreting v^k as the binary representation of an integer, (v_N^k … v_1^k)_2, and defining k = (v_N^k … v_1^k)_2 + 1). The empirical number of occurrences of a coincidence pattern v^k in data recorded over an interval T is called n_k.

2.2 The Null-Hypothesis of Independent Firing. We are interested in detecting the (sub)groups of neurons jointly involved in a cell assembly—the neurons that act in an interdependent manner. To distinguish these neurons from those that are not involved, we develop a statistical tool to test the null hypothesis (H_0) of independent neurons. Under this null hypothesis, the joint probability P_k = P(v^k) of a coincidence pattern v^k (a particular constellation of spikes and nonspikes across the observed neurons) equals the product of the probabilities of the individual events:
$$H_0: \; P_k = \prod_{i=1}^{N} P\left(v_i^k\right), \qquad \text{with } P\left(v_i^k\right) = \begin{cases} P(v_i = 1), & \text{if } v_i^k = 1 \\ 1 - P(v_i = 1), & \text{if } v_i^k = 0. \end{cases} \tag{2.3}$$
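To make the index function of section 2.1 and the null hypothesis of equation 2.3 concrete, here is a small sketch (function names and the example probabilities are ours); it also checks the normalization condition of equation 2.5:

```python
from itertools import product

def pattern_index(v):
    """Map a pattern (v_1, ..., v_N) to k = (v_N ... v_1)_2 + 1."""
    k = 0
    for i, bit in enumerate(v):        # v_1 is the least significant bit
        k += bit << i
    return k + 1

def pattern_probability(v, p):
    """Joint probability P_k of a pattern under H0 (equation 2.3)."""
    prob = 1.0
    for v_ik, p_i in zip(v, p):
        prob *= p_i if v_ik == 1 else 1.0 - p_i
    return prob

p = [0.01, 0.02, 0.015]                # hypothetical per-bin probabilities p_i
patterns = list(product((0, 1), repeat=len(p)))
print(pattern_index((1, 0, 1)))        # (101)_2 = 5, so k = 6
print(pattern_probability((1, 1, 0), p))                 # 0.01 * 0.02 * 0.985
print(sum(pattern_probability(v, p) for v in patterns))  # equation 2.5: sums to 1
```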
Equation 2.3 assumes independence of the N neuronal processes. In addition, we now make the assumption that the binary sequence describing the activity of a single neuron (equation 2.1) represents a sequence of Bernoulli
trials (Feller, 1968). The probability of a specific outcome v_i(t) does not depend on the value of v_i at any other point in time. A Poisson process, often used to model neuronal spike trains (see section 5), leads to a binary sequence in accordance with the above assumption. Equation 2.3 is based on precise knowledge of the single-neuron firing probabilities p_i = P(v_i = 1). However, in the experimental situation, the firing probabilities are typically unknown and have to be estimated from the data. The simplest scheme is to adopt the frequency interpretation (e.g., Feller, 1968) and use the number of spike events c_i in the observation interval T containing T time steps to calculate the probability p_i = c_i/T as an estimate for the firing probability of neuron i (for an alternative approach, see Gütig et al., in press). The task now is to develop a tool that enables us to judge whether the empirical number of occurrences n_k^emp of a particular coincidence pattern v^k deviates significantly from the expected number n_k^pred.
2.3 Describing Independently Spiking Neurons by Multiple Bernoulli Trials. Consider a set of parallel realizations of the N binary processes with duration T. The N resulting T-dimensional row vectors can be combined to a matrix of 0s and 1s with N rows and T columns (see Figure 1). According to the assumptions made in the previous section, the probability of a particular outcome in a specific matrix element does not depend on the outcome in any of the other matrix elements. Equivalently, we can describe the realization as a succession of T N-dimensional column vectors (coincidence patterns). A process generating such N-dimensional events is called a multiple Bernoulli trial (Feller, 1968). Following Feller (1968), we can write the probability of finding each pattern v^k exactly n_k times in the observation interval directly in terms of the P_k:

$$\psi(n_1, n_2, \ldots, n_m; P_1, P_2, \ldots, P_m; T) = \frac{T!}{\prod_{k=1}^{m} n_k!} \cdot \prod_{k=1}^{m} P_k^{n_k}. \tag{2.4}$$
This expression represents a generalization of the binomial distribution to a process with more than two possible outcomes and is called a multinomial distribution. The P_k and n_k are subject to the normalizing conditions

$$\sum_{k=1}^{m} P_k = 1 \tag{2.5}$$

$$\sum_{k=1}^{m} n_k = T. \tag{2.6}$$
For any particular spike constellation v^k (defining that particular v^k as "the" outcome and all the rest as "the others"), the probability distribution in equation 2.4 can be reduced to the binomial distribution. For such selected
Figure 2: (A) Three examples of Poisson distributions, for parameters n^pred = 5, 15, and 50 (from left to right). (B) The black shaded area under the Poisson distribution (n^pred = 15), ranging from n^emp = 25 to infinity, indicates the joint-p-value Ψ as the cumulative probability. For this example, the joint-p-value equals 0.0112. (C) The joint-surprise S shown as a logarithmic scaling function of the joint-p-value. The dash-dotted line equals the surprise measure as defined by Palm (1981), and the solid line shows the continuous, differentiable version used here: the joint-surprise (see equation 2.10). The value of the joint-surprise corresponding to the joint-p-value in the example in B is 1.9459.
constellation v^k, we obtain

$$\psi(n_k; P_k; T) = \frac{T!}{n_k! \, (T - n_k)!} \cdot P_k^{n_k} \cdot (1 - P_k)^{T - n_k}, \qquad k = 1, \ldots, m. \tag{2.7}$$
Since the number of time steps T (or T_b = T/b for a binning grid b) is usually large for bh on the order of 1 ms, and the associated probabilities P_k are small while their product P_k · T remains moderate, equation 2.7 can be approximated by the Poisson distribution (Feller, 1968; see Figure 2A):
$$\psi(n_k; P_k; T) = \frac{(P_k \cdot T)^{n_k}}{n_k!} \cdot \exp(-P_k \cdot T), \qquad k = 1, \ldots, m. \tag{2.8}$$
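The accuracy of this approximation in the typical regime (small P_k, large T, moderate P_k · T) can be checked numerically. The following sketch uses only the standard library; the parameter values are ours, chosen for illustration:

```python
from math import comb, exp, lgamma, log

def binom_pmf(n, P, T):
    """Binomial probability of equation 2.7."""
    return comb(T, n) * P**n * (1.0 - P)**(T - n)

def poisson_pmf(n, rate):
    """Poisson probability of equation 2.8, computed in log space for stability."""
    return exp(n * log(rate) - rate - lgamma(n + 1))

P_k, T = 1e-4, 100_000       # small per-bin probability, many time steps
rate = P_k * T               # rate parameter P_k * T = 10
for n in (5, 10, 15):
    print(n, binom_pmf(n, P_k, T), poisson_pmf(n, rate))
```

For these values the two probabilities agree to about three decimal places, which is why the Poisson form is used throughout the rest of the analysis.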
Here, P_k · T is the rate parameter of the Poisson distribution, defining the expected number of occurrences n_k^pred = P_k · T of the joint spike constellation v^k.
2.4 Significance of Joint-Events: The Joint-Surprise. For each of the m constellations v^k in the observation set, equation 2.7 describes the null-hypothesis of independent component processes. The expected number of occurrences n_k^pred = P_k · T defines the center of mass of the distribution.
Thus, for each empirical number of occurrences n_k^emp, we can now compute the statistical significance of the deviation from independence. It is defined as the cumulative probability of finding the observed number of occurrences n_k^emp or an even larger one (an alternative approach of measuring deviation from independence using the framework of information theory is discussed in appendix C). We call this cumulative probability the joint-p-value Ψ, defined by

$$\Psi\left(n_k^{\mathrm{emp}} \mid n_k^{\mathrm{pred}}\right) = \sum_{n_k = n_k^{\mathrm{emp}}}^{\infty} \psi\left(n_k, n_k^{\mathrm{pred}}\right) = \sum_{n_k = n_k^{\mathrm{emp}}}^{\infty} \frac{\left(n_k^{\mathrm{pred}}\right)^{n_k}}{n_k!} \cdot \exp\left(-n_k^{\mathrm{pred}}\right), \qquad k = 1, \ldots, m. \tag{2.9}$$
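Equation 2.9 can be evaluated through its finite complement, summing the Poisson probabilities below n_k^emp (for production use, the Poisson survival function via the regularized incomplete gamma function is the standard route). A sketch with names of our own choosing:

```python
from math import exp, lgamma, log

def poisson_pmf(n, rate):
    # Poisson probability of equation 2.8, in log space for stability
    return exp(n * log(rate) - rate - lgamma(n + 1))

def joint_p_value(n_emp, n_pred):
    """Cumulative Poisson probability of seeing n_emp or more coincidences
    (equation 2.9), evaluated via the finite complementary sum."""
    return 1.0 - sum(poisson_pmf(n, n_pred) for n in range(n_emp))

# Example from Figure 2B: n_pred = 15 expected, n_emp = 25 observed.
print(joint_p_value(25, 15.0))   # ~0.0112, matching the figure
```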
It can efficiently be evaluated numerically using the connection to the regularized incomplete gamma function (Press, Teukolsky, Vetterling, & Flannery, 1992). Figure 2B shows Ψ as the black area under the distribution. The smaller this area is, the higher is the significance of the corresponding count:

if n_k^emp > n_k^pred then 0 ≤ Ψ < 0.5
if n_k^emp ≈ n_k^pred then Ψ ≈ 0.5
if n_k^emp < n_k^pred then 0.5 < Ψ ≤ 1.
Thus, the larger the number of excessive coincidences, the closer Ψ is to 0. By similar reasoning, we can define the statistical significance for coincidence patterns occurring at an unexpectedly low rate (i.e., "lacking" coincidences). The statistical significance of finding at most n_k^emp repetitions of pattern v^k is obtained as the complementary part of the sum in equation 2.9 (n_k = 0, …, n_k^emp). In that case, the lower the number of coincidences is (or, the larger the number of lacking ones), the closer Ψ is to 1 and the closer its complement 1 − Ψ is to 0. To enhance the visual resolution at the relevant low values for Ψ (or 1 − Ψ), one may choose a logarithmic scaling as was done for the surprise measure (Legendy, 1975; Palm, 1981; Legendy & Salcman, 1985; Palm, Aertsen, & Gerstein, 1988; Aertsen, Gerstein, Habib, & Palm, 1989). This is a straightforward scale transformation, defined as −log(Ψ) for Ψ smaller than 0.5 and as log(1 − Ψ) for Ψ larger than 0.5. However, since this function is discontinuous at Ψ = 0.5 (see Figure 2C, dotted line), it leads to wildly fluctuating values when considering Ψ-values close to chance level. To overcome this problem, we define a new transformation, the joint-surprise, which approximates the conventional surprise measure for the relevant values of Ψ and yields a continuous and differentiable function around Ψ = 0.5:

$$S(\Psi) = \log \frac{1 - \Psi}{\Psi}. \tag{2.10}$$
For excessive coincidences, this function is dominated by the denominator Ψ; for lacking coincidences, it is dominated by the numerator 1 − Ψ (see Figure 2C, solid line). It turns out that the expression for the joint-surprise (equation 2.10) is equivalent (apart from minor details) to the difference of the surprise for excitation and surprise for inhibition used in Palm et al. (1988) and Aertsen et al. (1989). Equation 2.10 is comparable to measuring significance on a dB scale and yields positive numbers for excessive coincidences (e.g., S = 1 for Ψ = 0.1, or S = 2 for Ψ = 0.01), negative numbers for lacking ones, while changing sign at chance level Ψ = 0.5:

if n_k^emp > n_k^pred then S > 0
if n_k^emp ≈ n_k^pred then S ≈ 0
if n_k^emp < n_k^pred then S < 0.
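A direct transcription of equation 2.10, using the base-10 logarithm that matches the numerical examples in the text (the function name is ours):

```python
from math import log10

def joint_surprise(jp):
    """Joint-surprise S (equation 2.10): a logarithmic rescaling of the
    joint-p-value that is continuous and differentiable at jp = 0.5."""
    return log10((1.0 - jp) / jp)

print(joint_surprise(0.5))      # 0.0: chance level
print(joint_surprise(0.1))      # ~0.95 (S close to 1): excessive coincidences
print(joint_surprise(0.0112))   # ~1.946: the Figure 2 example
print(joint_surprise(0.9))      # ~-0.95: lacking coincidences
```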
2.5 Unitary Events: Definition and Detection. On the basis of the joint-surprise as a measure of significance, we now define unitary events as those joint spike events v^k in a given interval T that occur much more often than expected by chance. To that end, we set a threshold S_α on the joint-surprise measure and denote the occurrences of those v^k for which

$$S\left(n_k^{\mathrm{emp}} \mid n_k^{\mathrm{pred}}\right) \geq S_\alpha \tag{2.11}$$

as unitary events. The arguments of S(n_k^emp | n_k^pred) remind us that this test is performed separately for each coincidence pattern v^k in the interval T. The raster display (or dot display) is the standard tool used by the electrophysiologist to look for temporal structure in the "raw" spike data (see Figure 3). Since we are interested in the dynamics of assembly activation, we want to detect the joint spike constellations that possibly express assembly activity as they occur in time. To visualize occurrences of potentially interesting coincidence patterns in relation to other instances of itself, other patterns, or other events (e.g., behavioral, stimuli), we mark all spikes in all instantiations of a significant pattern v^k by squares (see Figure 3, lower row). For any S_α, there is necessarily a certain probability of detecting a v^k as significant in a realization of independent processes (false positive). Given a realization of dependent processes generating a surplus of v^k, there is a certain probability not to detect the pattern as significant (false negative). To obtain maximum sensitivity while maintaining a minimum level of false
Figure 3: Dot displays in the two top panels show the simultaneous activity of six simulated neurons: independent (A) and dependent (B) firing. Firing rates are: neuron 1: 10 s⁻¹; 2: 20 s⁻¹; 3: 15 s⁻¹; 4: 30 s⁻¹; 5: 25 s⁻¹; 6: 15 s⁻¹. The spike trains in B are generated by first copying the spike trains of A. Dependencies between neurons are then introduced by injecting coincident events, consisting of neuron pairs 1,3 and 2,5 (both at a coincidence rate of 1 s⁻¹), randomly distributed in time over all the trials. Each box contains the spike activity of a single neuron over 100 trials of 1000 ms duration. Each dot represents a spike at the time of its occurrence. Trials are organized in rows. Bottom panels: Spikes belonging to statistically significant constellations (unitary events) are marked by squares. Observe the different numbers of occurrences of unitary events in A and B due to the injected coincidences. In addition, in B, some of the constellations containing the injected spikes as subpatterns are also detected as significant events.
positives (see elaboration in section 4), we set the threshold S_α to a level between 1.28 and 2. This corresponds to a significance level α between 0.05 and 0.01, a commonly used threshold level in statistical significance tests (e.g., Hays, 1994, "p-value"). The coincidence patterns v^k contain different numbers of spikes, ranging from 0 to N. We call the number of spikes in a pattern its complexity:

$$\xi(\mathbf{v}^k) = \sum_{i=1}^{N} v_i^k. \tag{2.12}$$

There are $\binom{N}{\xi}$ patterns of complexity ξ. Because each of the patterns is assigned a complexity ξ ∈ {0, …, N}, we recover the total number of patterns by the binomial theorem,

$$\sum_{\xi=0}^{N} \binom{N}{\xi} = 2^N.$$
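The complexity measure and the pattern counts above can be verified directly (a sketch; names are ours):

```python
from itertools import product
from math import comb

def complexity(v):
    """Number of spikes in a coincidence pattern (equation 2.12)."""
    return sum(v)

N = 6
patterns = list(product((0, 1), repeat=N))
print(len(patterns))                     # 2**N = 64 patterns in total
for xi in range(N + 1):
    n_xi = sum(1 for v in patterns if complexity(v) == xi)
    assert n_xi == comb(N, xi)           # C(N, xi) patterns of complexity xi
# Patterns expressing genuine joint spiking (complexity > 1):
print(2**N - comb(N, 1) - comb(N, 0))    # 64 - 6 - 1 = 57
```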
The single pattern of complexity 0 (no spike) and the N patterns of complexity 1 (spike from one neuron) do not represent joint spiking activity in the natural sense. Therefore, we typically concentrate on patterns with ξ(v^k) > 1. For a significant v^k with ξ > 1, each square in a dot display has a counterpart in at least one other box of the dot display at the same time instant. The procedure is illustrated in Figure 3A for simulated realizations of six independent processes and in Figure 3B for six dependent parallel processes. Here, all patterns with ξ(v^k) > 1 are tested independently using equation 2.11, and all significant occurrences are marked according to the convention. However, there is no need to visualize all patterns simultaneously. In an application of the method to experimental data, it might be useful to generate separate raster displays for individual patterns or subsets of patterns. The simulation, like all further simulations, was performed as follows. Several (here N = 6) spike trains of 100 s duration were generated using independent homogeneous Poisson processes, each with a particular rate parameter λ_i. The single spike trains were then combined, as if they had been recorded simultaneously from as many neurons. For visualization, spike data are organized in 100 consecutive trials of 1 s duration (see Figure 3). Time resolution was set to h = 1 ms. In addition, into one of the data sets (see Figure 3B) we introduced statistical dependencies by injecting pairs of simultaneous spikes into the spike trains of neuron pairs 1,3 and 2,5, respectively. Both coincidences occurred at a rate of 1 s⁻¹ and were randomly distributed in time, such that on average, each trial contained one injected coincident event. In the context of this article, it is important to note that spike trains were generated by stationary processes and that the analysis was performed once, taking into account the entire data set. "Trials"
are introduced here only for visualization. An equivalent description is that each box in Figure 3 displays the activity of a neuron as a page of text (i.e., written left to right and top to bottom). The concept of a trial becomes important only when treating nonstationary data (see the companion article). When data are organized in trials, T is understood to specify the duration of a trial (in time steps) and M · T is the duration of the full data set, with M indicating the number of trials. As expected, the raster displays of the two data sets look very similar (Figure 3, top row). Since the rate of injected coincident spikes (1 s⁻¹) is low compared to the baseline firing rates, comparison of corresponding firing rates in the two data sets does not reveal any noticeable difference (not shown here). The analysis for unitary events was performed with the threshold level set at α = 0.05. With $2^N - \binom{N}{1} - \binom{N}{0} = 57$ patterns independently tested at α = 0.05, we expect to find 2.85 patterns to be marked as significant. Observe that the dependent data set in Figure 3B (bottom) exhibits many unitary events, whereas the independent data set in Figure 3A (bottom) has almost none. Moreover, the few unitary events in the independent data set consist of spike patterns of complexity 3 and 4, appearing three times and once, respectively. Their significance is due to statistical fluctuations (we will return to this dependence on pattern complexity). In the dependent data set, however, almost all significant constellations correspond to the injected coincidences (between neurons 1 and 3, and neurons 2 and 5). In addition, some higher-order constellations containing the injected spikes as subpatterns also appear as unitary events, leading to the few squares in the raster display of neuron 6. As was to be expected from the random insertion times of the injected coincidences, the unitary events appear randomly distributed over time and trials.
In real neuronal data, however, their times of occurrence may provide information concerning the dynamics of these potentially interesting constellations and their relation to stimuli or behavioral events (Riehle et al., 1997).
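The simulation procedure described above can be sketched roughly as follows. This is our own simplified reimplementation, not the authors' code: it uses a Bernoulli-per-bin approximation of the Poisson trains, and the rates, pair indices, and seed are illustrative choices:

```python
import random

def simulate(rates, duration_s, h=0.001, inject=(), rate_c=1.0, seed=0):
    """Independent near-Poisson spike trains as (0,1)-sequences of bin size h,
    with pairwise coincidences injected into the neuron pairs in `inject`.

    Uses the approximation p_i = rate_i * h, valid for rate_i * h << 1.
    """
    rng = random.Random(seed)
    n_steps = int(duration_s / h)
    trains = [[1 if rng.random() < r * h else 0 for _ in range(n_steps)]
              for r in rates]
    for i, j in inject:                       # e.g. (0, 2) = neuron pair 1,3
        for t in range(n_steps):
            if rng.random() < rate_c * h:     # injected coincident spike
                trains[i][t] = trains[j][t] = 1
    return trains

# Six neurons as in Figure 3, 100 s of data, coincidences injected
# into pairs 1,3 and 2,5 at 1 s^-1.
trains = simulate([10, 20, 15, 30, 25, 15], 100.0, inject=[(0, 2), (1, 4)])
print(sum(trains[0]))   # close to (10 + 1) * 100 = 1100 spikes
```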
3 Dependence of Joint-Surprise on Physiological Parameters
Having derived the joint-surprise as a measure for statistical significance of joint spiking events, we now investigate its performance with respect to various physiologically relevant parameters: the firing rates of the neurons under consideration, the time resolution (bin size) chosen for the analysis, the rate of spike coincidences, their coincidence accuracy (allowing the biological system some degree of noise), and the number of neurons involved. To this end, we calibrate the performance of the joint-surprise by applying it to appropriately designed sets of simulated data. As before, the control data sets consist of independently generated Poisson trains of varying base rates. These are compared to different data sets, containing additionally injected coincidences of varying complexities and coincidence rates. Typically, the
simulated data consisted of M = 100 trials of 1000 ms each and a time resolution of h = 1 ms. The rates of the random Poisson trains were chosen to cover a physiologically realistic range for cortical neurons—between 10 and 100 s⁻¹.

3.1 Influence of the Firing Rate. To investigate the influence of the neurons' firing rates, we studied two parallel spike trains generated as independent Poisson processes, both with the same constant rate. We varied this rate from λ = 10 to 100 s⁻¹ in steps of 10 s⁻¹, in the presence of different constant injection rates λ_c. Expectation values for the number of coincidences in the data set, n^emp, and the number of coincidences expected to occur assuming independence, n^pred, are:

$$n^{\mathrm{emp}} = \left[\lambda_c h + (\lambda h)^2\right] \cdot MT$$

$$n^{\mathrm{pred}} = \left[(\lambda_c + \lambda) h\right]^2 \cdot MT. \tag{3.1}$$
The probability per time step for a coincidence in the presence of injected coincidences is the sum of the probability of seeing an injected coincidence, λ_c h, and the probability of seeing a chance coincidence, (λh)². For experimental data, we have to estimate the firing rates from the data set. The marginal probabilities (spike count divided by time interval) cannot distinguish between the base rate and the injection rate. Therefore, we obtain λ_c + λ as the expectation value for the firing rate and [(λ_c + λ)h]² as the expectation value for the probability to find a coincidence assuming independence. Further corrections for the specific injection process are discussed in Grün, Diesmann, Grammont, Riehle, and Aertsen (1999). Values obtained for n_emp and n_pred in different realizations fluctuate around their expectation values. To visualize the effect of statistical fluctuations, we generated 10 data sets for each rate level. Figure 4A (top) shows that the empirical numbers of coincidences n_emp (diamonds) indeed match the number expected assuming independence n_pred (solid lines), apart from small statistical fluctuations. The number of coincidences exhibits a convex dependence on background firing rate. From equation 3.1, we know that the increase is quadratic, (h² · MT) being the coefficient of the leading power. At λ = 0, that is, only injected spikes in both situations, the expectation values n_emp and n_pred are (λ_c h) · MT and (λ_c h)² · MT, respectively. Comparison of the expressions for n_emp and n_pred in equation 3.1 shows that in the regime (λ_c + 2λ)h < 1, the difference decreases linearly with λ. Variability in counts increases with firing rate because of the well-known property of the Poisson distribution (see equation 2.8) that the count variance equals the mean. Figure 4 (top row) demonstrates that the expected number of coincidences assuming independence, n_pred, also exhibits fluctuations. These fluctuations are caused by the fact that the firing rates have to be estimated from the data.
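The expectation values in equation 3.1 are straightforward to evaluate; a minimal sketch with illustrative parameter values (the specific numbers below are our own choices):

```python
# Expected coincidence counts per equation 3.1, for one illustrative setting.
lam = 20.0        # background firing rate of each neuron (1/s)
lam_c = 1.0       # rate of injected coincidences (1/s)
h = 0.001         # elementary time step (s)
M, T = 100, 1000  # trials and time steps per trial

n_emp = (lam_c * h + (lam * h)**2) * M * T   # injected + chance coincidences
n_pred = ((lam_c + lam) * h)**2 * M * T      # prediction from marginal rates

print(n_emp, n_pred)
```

Note that n_emp exceeds n_pred, since the marginal rate estimate λ_c + λ cannot separate injected from background spikes.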
The variance is given by

σ²_{n_pred} = [(λ + λ_c)h]² (2λ + 4λ_c)h · MT,   (3.2)
S. Grün, M. Diesmann, and A. Aertsen
Figure 4: Detection of coincident events under different firing-rate conditions. Simulated data from two parallel processes were analyzed for the presence of coincident events. Realizations consisting of 100 trials, each of duration 1000 ms, were generated with a time resolution of 1 ms. Rates of both processes were varied from 10 to 100 s⁻¹ in steps of 10 s⁻¹. The experiment was repeated 10 times at each rate, to visualize statistical variations. Coincident events, also generated by Poisson processes, were injected into each of the 10 data sets at one of two coincidence rates (B: 0.5 s⁻¹, C: 1 s⁻¹). The results of the control experiments without injected events are shown in A. Data were analyzed for the number of empirical occurrences (top row, diamonds) versus expected level (top row, solid lines, theoretical curve in gray), estimated from the marginal probabilities. The corresponding joint-surprise is shown in the bottom panels (diamonds, theoretical curve in gray). Results for the 10 realizations per firing rate are grouped together, giving rise to the stairway-like appearance of the plots. Horizontal lines in the bottom panels indicate the significance threshold (α = 0.01).
which, with λ_c < λ, is bounded by

3[(λ + λ_c)h]³ · MT.   (3.3)
This dependence on the third power of λh renders it much smaller than the variance of n_emp in the parameter range of interest (say, λh < 1/6, λ_c < λ). Closely related to the probability of obtaining false positives (a significant outcome in the absence of excess coincidences; see section 4) is the question of how precisely n_pred can be estimated when no coincidences are injected.
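The bound in equation 3.3 is easily verified numerically for a point in this parameter range (the parameter values below are illustrative choices of ours):

```python
# Variance of n_pred (eq. 3.2) against its bound (eq. 3.3), valid for lam_c < lam.
lam, lam_c = 50.0, 1.0   # background and injection rates (1/s)
h, M, T = 0.001, 100, 1000

var_pred = ((lam + lam_c) * h)**2 * (2 * lam + 4 * lam_c) * h * M * T
bound = 3 * ((lam + lam_c) * h)**3 * M * T

print(var_pred, bound)   # the variance stays below the bound
```

Algebraically, 2λ + 4λ_c ≤ 3(λ + λ_c) exactly when λ_c ≤ λ, which is where the bound comes from.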
Comparing the variance of the coincidence counts when no coincidences are injected, (λh)² · MT, with equation 3.2 for λ_c = 0 suggests that above λh = 1/2, the variance of n_pred exceeds the variance of n_emp. However, at this high probability, the Poisson distribution is no longer a good approximation. Using the binomial distribution (see equation 2.7) in the argument above, it turns out that the variance of n_pred is always smaller than the variance of n_emp. The insight gained by analyzing the variances of n_pred and n_emp to understand the fluctuations of S is limited, because the two measures are not completely independent: a high spike count for one of the neurons simultaneously leads to high values for n_pred and n_emp. We present an analysis of the relation of significance level α to the percentage of false positives obtained in independent data sets in section 4. Figure 4 (bottom row) shows the joint-surprise values corresponding to the (n_pred, n_emp) pairs (top row). Without injected coincidences, the joint-surprise fluctuates around 0, independent of the rate, due to the fluctuations in n_emp and n_pred. Because we necessarily have fluctuations in n_pred, the percentage of experiments in which the coincidence count is significant may differ from the theoretical value (assuming a known n_pred) determined by S_α (see section 4). In the case of injected coincident events, the measured and expected coincidence counts deviate from each other, the more so the higher the injection rate (for Figure 4B, 0.5 s⁻¹; for Figure 4C, 1 s⁻¹). The joint-surprise declines hyperbolically with increasing background rate due to the decreasing ratio (n_emp − n_pred)/n_pred. At vanishing background firing rate, the expected number of coincidences assuming independence, (λ_c h)² · MT, is practically 0, while the number of measured coincidences, (λ_c h) · MT, remains considerable. Therefore, the joint-surprise obtains a large, finite value (not shown).
For the injected coincidence rate of λ_c = 0.5 s⁻¹ (Figure 4B), the joint-surprise falls below the significance level of 0.01 (horizontal line in bottom graph) at a rate of about 60 s⁻¹ (in total, 30 trials below significance level). For the injected rate of λ_c = 1.0 s⁻¹ (Figure 4C), this occurs only at a considerably higher background rate (about 100 s⁻¹). At higher firing rates, more excess coincident events are needed to escape from the statistically expected fluctuation range. Clearly, this behavior imposes a severe limit on the detectability of excess coincidences at high firing rates. Before the expectation of the joint-surprise falls below the significance threshold, the cloud of joint-surprise values obtained in the individual experiments has already reached it (in Figures 4B and 4C, at 40 s⁻¹ and 70 s⁻¹, respectively). However, there is a large regime where fluctuations in the joint-surprise are well separated from the significance threshold, and, hence, excess coincidences can reliably be detected. When injected coincidences are present, the difference between n_emp and n_pred increases linearly with T, while the width (standard deviation) of the joint-p-value Y increases with √T. Therefore, given enough data, excess coincidences can always be detected.
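A minimal Python sketch of the quantities involved; the functional form S = log₁₀((1 − Y)/Y), with Y the Poisson probability of observing at least n_emp coincidences given n_pred, is our reading of the measure defined in section 2 (not shown in this excerpt) and should be checked against it:

```python
import math

def poisson_ge(n, mu):
    """P(X >= n) for X ~ Poisson(mu), via the complement of the lower tail."""
    return 1.0 - sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(n))

def joint_surprise(n_emp, n_pred):
    """Assumed form of the joint-surprise: strongly positive when n_emp is
    improbably high, near 0 when n_emp matches the independence prediction."""
    y = min(max(poisson_ge(n_emp, n_pred), 1e-300), 1.0 - 1e-15)
    return math.log10((1.0 - y) / y)

print(joint_surprise(140, 44.1))  # far above chance: strongly positive
print(joint_surprise(44, 44.1))   # at chance level: close to 0
```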
3.2 Influence of Binning. The time resolution of data acquisition in extracellular spike recordings is typically 1 ms or better. There is recent experimental evidence from cross-correlation, joint peristimulus time histogram (JPSTH), and, particularly, from spike pattern analysis, that the timing accuracy of spiking events that might be relevant for brain function can be as precise as 1–5 ms (Abeles, Bergman, et al., 1993; Riehle et al., 1997). Similar suggestions come from modeling studies (Diesmann et al., 1999). Here, we want to investigate whether, by choosing a binning grid Δ = bh (see equation 2.1) in that time range, we may be able to detect coincidences with corresponding accuracy. Therefore, we will first study the general influence of binning on the outcome of joint-surprise analysis and then address the effect of varying bin size on the detection of coincidences with a finite temporal jitter. We generated a set of simulated data as before. While the rate of the independent processes was maintained constant (20 s⁻¹), we injected additional coincident events at various rates. Two examples for coincidence rates of 0.5 s⁻¹ and 1.0 s⁻¹ are shown in Figures 5B and 5C; the control set is shown in Figure 5A. In the analysis, we gradually increased the binning grid from b = 1 to b = 10. If there was more than one spike per bin, the result was set to one (clipping). This newly generated process formed the basis of our investigation. Binning has two opposite effects on the coincidence counts: it reduces the number of time steps, T_b = T/b, while increasing the probability p_b to observe an event in a time step, compared to the original probability p = λh. The net effect of binning is therefore comparable to that of increasing the rate while reducing the number of observation time steps. Within a single analysis bin of size bh, the probability of finding exactly k of the b possible positions occupied by a spike is given by the binomial distribution.
Thus, the probability of finding one or more events, P(k ≥ 1) = 1 − P(k = 0), equals

p_b = Σ_{k=1}^{b} C(b, k) p^k (1 − p)^{b−k} = 1 − (1 − λh)^b.   (3.4)
For p ≪ 1, it can be approximated by p_b = b · p. Following equation 3.1, the expectation values for the number of coincidences are now

n_b^emp = [λ_c b h + (1 − (1 − λh)^b)²] · M · T_b
n_b^pred = [1 − (1 − (λ_c h + λh))^b]² · M · T_b.   (3.5)
Improvements of expressions 3.5, not relevant in this context, can be made by taking into account interactions of the background spikes and the injected spikes in the binning process (Grün et al., 1999).
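Equation 3.4 for the clipped occupation probability can be checked by direct simulation (a sketch under our own parameter choices):

```python
import random

def p_bin(p, b):
    """Probability that a bin of b elementary time steps holds >= 1 spike
    (equation 3.4 with p = lam * h)."""
    return 1.0 - (1.0 - p)**b

random.seed(0)
p, b, n_bins = 0.02, 5, 200_000
# Clipping: a bin counts as occupied whether it received one spike or several.
occupied = sum(any(random.random() < p for _ in range(b)) for _ in range(n_bins))

print(occupied / n_bins, p_bin(p, b))  # empirical vs. analytical, both ~0.096
```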
Figure 5: Detection of coincident events using different analysis bin sizes bh. Simulated data from two parallel processes were analyzed for the presence of coincident events. Realizations consisting of 100 trials, each of duration 1000 ms, were generated with a time resolution of 1 ms. Rates of both processes were kept constant at 20 s⁻¹. Coincident events were injected at different rates: (A) no injected events, (B) 0.5 s⁻¹, (C) 1.0 s⁻¹. The experiment was repeated 10 times at each bin size, to visualize statistical variations. The bin width bh was varied from 1 to 10 ms. Data were analyzed for the number of coincidences (top row, diamonds) and compared to the expected number of coincidences assuming independence (top row, solid lines, theoretical curve in gray). The corresponding joint-surprise is shown in the bottom panels (diamonds, theoretical curve in gray). Further details as in Figure 4.
n_b^pred is concave with positive slope for small b, reaches a maximum, and after passing a point of inflection approaches the curve T/b from below. The latter represents an upper bound for the expectation value, reached when each bin is occupied. The initial concave increase can be observed in the simulated data for n_b^pred as well as for n_b^emp (Figure 5, top row). As in the case of increasing background rate (see Figure 4), the difference between the measured and the expected coincidence counts assuming independence decreases with increasing bin size. The effect can clearly be seen at the high coincidence rate (1 s⁻¹; Figure 5C, top), less so at the lower one (0.5 s⁻¹; Figure 5B, top). In the regime shown, binning increases occupation probability (somewhat stronger for n_b^pred than for n_b^emp), and clipping is not dominant yet. In Figure 5C (top), we can clearly observe the fluctuations in n_b^pred increasing with b. Here, we simply estimated n_b^pred from the binned data. However, fluctuations can be reduced by estimating firing rates at the original resolution h and using equation 3.4 to obtain the occupation probability at bin size bh. The dependence of the joint-surprise (see Figure 5, bottom row) on the bin size is similar to the above-described dependence on the rate (cf. Figure 4). In the absence of injected coincidences, S fluctuates around 0 (see Figure 5A). For injected coincidences, S decreases with increasing bin size. The lower the injection rate is, the sooner S starts to decrease and the faster it decays: for λ_c = 0.5 s⁻¹, joint-surprise values start to fall below the 0.01 significance level at b = 3 (see Figure 5B), while for λ_c = 1.0 s⁻¹, significance is maintained up to about b = 6 to 10 (see Figure 5C). Again, the decline in S is controlled by the decreasing ratio (n_b^emp − n_b^pred)/n_b^pred. The similarity between the dependences of S on spike rate and on bin size is not surprising, considering that binning has the net effect of an apparent increase in firing probability, limited by the additional effect of clipping.

3.3 Detection of Near-Coincidences. In a next step, we investigate whether it is also possible to detect noisy (i.e., imprecise) coincidences. This question arises naturally, since neurons are usually considered to exhibit some degree of "noise" or uncertainty in the timing of their action potentials. Note, however, that the degree of this temporal noise has long been questioned (e.g., Abeles, 1983) and is still under debate (e.g., Mainen & Sejnowski, 1995; Shadlen & Newsome, 1998; Diesmann et al., 1999). While keeping both the independent background rate and the injection rate constant, we increase the temporal jitter of the injected near-coincident events stepwise from 0 to 5 ms, such that in each case, the difference in spike times is uniformly distributed within the chosen jitter range.
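As a rough intuition for how analysis bin size interacts with this jitter (a simplified toy model of ours, not the article's derivation), one can ask what fraction of jittered spike pairs ends up in a common bin:

```python
def capture_fraction(b, s):
    """Fraction of near-coincidences whose two spikes fall into the same bin
    of width b time steps, assuming the spike-time difference is uniform on
    0..s and the first spike lands uniformly within its bin (toy model)."""
    return sum(max(0, b - d) / b for d in range(s + 1)) / (s + 1)

print(capture_fraction(1, 2))  # a narrow bin misses most jittered pairs
print(capture_fraction(5, 2))  # a bin covering the jitter captures most
```

In this model, the captured fraction rises with b until the bin spans the full jitter; beyond that, enlarging the bin mainly adds chance coincidences.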
The question is whether, by choosing an appropriate binning grid, we can improve the detection of such near-coincident events. To this end, we analyze the simulated data with varying bin sizes and for each bin size compute the joint-surprise. Figure 6 shows the results for a background rate of λ = 30 s⁻¹ and a rate of injected near-coincidences λ_c = 2 s⁻¹. Each of the curves in Figure 6A represents data with a particular temporal jitter s, analyzed with a bin size bh increasing from 1 ms to 10 ms. Values of S are averages of 100 repetitions of the simulation experiment at constant parameters. Each curve exhibits a global maximum (marked by an asterisk) at a bin size b* close to the magnitude of the jitter of the injected coincidences; b* is shown as a function of temporal jitter in Figure 6B. Indeed, the maxima occur at the bin size that in a given simulation just covers the maximal jitter (e.g., spikes with a maximal time difference of s = 1 are covered by a bin size spanning two time steps of the original time resolution: b = 2). Numerical analysis of an analytical
Figure 6: Detection of near-coincidences for different degrees of coincidence precision. Two parallel spike trains were generated with background rates λ = 30 s⁻¹; the rate of the injected coincidences was λ_c = 2 s⁻¹, for T = 10⁵ time steps, h = 1 ms. The temporal jitter of coincident events was varied from s = 0 to s = 5 time steps. Each simulation was repeated 100 times, and data were analyzed for the number of observed coincidences by varying the analysis bin size from b = 1 to b = 10, and compared to the expected coincidence count assuming independence. (A) Each curve shows the resulting average joint-surprise as a function of the analysis bin size for a given temporal jitter. The top curve shows the results for s = 0, the bottom curve for s = 5; curves for intermediate jitters lie in between (using the maxima as reference). Maxima of the curves are marked by an asterisk. (B) Optimal bin width b* for detecting excess coincidences as a function of temporal jitter; theoretical curve indicated by dashed line.
description of the situation (Grün et al., 1999) shows that maxima are located at b* = s + 1. The fact that in the simulation results in Figure 6B the maxima for s = 4 and 5 occur at a larger bin size is due to fluctuations remaining in the averaged S. For bin sizes smaller than the scatter width, S increases with bin size, since for more and more near-coincidences the constituting spikes fall into a common bin. At bin sizes larger than the coincidence accuracy, the rate at which the number of excess coincidences grows drops, and the probability that an injected coincidence is detected slowly reaches saturation. Thus, S is bound to decrease again, because the expected coincidence count assuming independence continues to grow approximately linearly. The joint-surprise curves for finite temporal jitter s approach the curve for perfect coincidences (s = 0) from below. The comparison of different joint-surprise curves shows that the higher the temporal jitter (i.e., the lower the coincidence accuracy) is, the lower the joint-surprise is. Hence, for a given b, the number of near-coincidences that can be detected increases with decreasing temporal jitter.

3.4 Multiple Parallel Processes. When the number of simultaneously observed neurons N is increased, the variety of coincidence patterns grows
strongly, due to the nonlinear increase in combinatorial possibilities. Each complexity j (i.e., a spike pattern with j 1s and N − j 0s) can in principle occur in C(N, j) variations. On the other hand, the occurrence of higher-order constellations depends in a nonlinear fashion on the rates. The probability for a pattern of complexity j to occur among N neurons, all firing independently with probability p, is given by p^j · (1 − p)^{N−j}. For low firing probabilities (p ≪ 1), this can be approximated by p^j. By the combination of these two effects, constellations of high complexity are actually expected to occur rarely. For low firing probabilities, such as p = 2 · 10⁻², the expected count for a coincidence pattern of complexity j = 3 (assuming a total number of observation time steps T = 10⁵) is of the order of 1 or less, and even less for higher complexities (see Figure 7B, top). For higher firing probabilities, this expectation is shifted to larger values. Consider a pattern of complexity 4, with other parameters as above. The expected coincidence count now is n_pred = 0.016; the probabilities of finding 0, 1, or 2 coincidences are approximately y(0, n_pred) = 0.9841, y(1, n_pred) = 0.0157, and y(2, n_pred) = 0.0001. Here, the discrete nature of the Poisson distribution is fully exhibited. Almost all the mass is at a single outcome (0). The joint probabilities of outcomes 1 and 2 are Y(1 | n_pred) = 0.0159 and Y(2 | n_pred) = 0.0001, respectively. Thus, at an α-level of 0.01, the occurrence of two coincidences is already significant and would still be significant for much lower α values. If the occurrence of 2 or more coincidences than expected is significant for almost any significance level, our measure is obviously susceptible to fluctuations. The significance of the spike constellation in a particular experiment cannot be determined precisely.
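The counts quoted above are easy to reproduce (a sketch; y denotes the Poisson probability mass, as in the text):

```python
import math

p, N, T = 0.02, 6, 100_000   # firing probability per step, neurons, time steps

def expected_count(j):
    """Expected occurrences of one specific complexity-j pattern."""
    return p**j * (1 - p)**(N - j) * T

def y(n, mu):
    """Poisson probability mass function."""
    return math.exp(-mu) * mu**n / math.factorial(n)

n_pred = p**4 * T            # complexity-4 count, using the p**j approximation
print(expected_count(3))     # of the order of 1 or less
print(y(0, n_pred), y(1, n_pred), y(2, n_pred))  # ~0.9841, 0.0157, 0.0001
```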
The obvious and standard cure to this problem is to collect more data for such an experimental situation, shifting the distribution of the coincidence counts for patterns of high complexity to larger expectation values, where the discrete nature of the distribution is of less importance. As a result of the above discussion, high-j constellations, if occurring at all, are typically accompanied by high joint-surprise values (cf. Figure 7B, bottom). It is therefore not surprising that in simulations where we varied the complexity of the injected coincidence patterns from j = 2 to 6 (while keeping the number of processes (N = 6), the background rate (λ = 20 s⁻¹), and the injection rate (λ_c = 1 s⁻¹) constant), all coincidences of complexity ≥ 3 were detected with high significance (see Figure 7B, bottom). Moreover, the measured coincidence counts for j ≥ 3 are close to the expectation for the injected coincidences, λ_c hT = 100 (see Figure 7B, top). For complexity 2, the coincidence count is higher, because we get contributions from the background rate: (λh)²T = 40. This contribution vanishes rapidly for higher complexities (e.g., for j = 3, (λh)³T = 0.8). Similar results were obtained when we increased the number of independent processes N (from 2 to 12), while keeping the complexity of injected coincidences constant (j = 2, Figure 7A). Here, injection means that
Figure 7: Complexity of joint spike patterns. (A) The number of processes into which a pair of coincidences (complexity j = 2) was injected was varied from 2 to 12. The rate of the independent processes was λ = 20 s⁻¹, and the injection rate was 1 s⁻¹ (T = 10⁵, 10 repetitions). The diamonds in the upper graph show the number of occurrences of the coincidence patterns [110], [1100], [11000], and so on, with the number of zeros depending on the number of processes. The expected counts are shown as solid lines. The lower graph shows the joint-surprise for the corresponding pairs of measured and expected counts; the horizontal line marks the significance level of 0.01. (B) The number of processes was kept constant (N = 6), as were their rates (parameters as in A), but the complexity of the injected coincidences was varied from 2 to 6. Thus, the pattern looked for was [110000], [111000], and so on, respectively. The measured counts of the coincidence pattern are displayed as diamonds, the expected counts as solid lines. The latter values cannot be distinguished from 0 for j > 3 in this graph, because the values become very small. The corresponding joint-surprise (bottom panel) was therefore very high (clipped here to an arbitrary value of 400 for visualization).
simultaneous spikes are added to j of the N parallel spike trains, without affecting the N − j remaining ones. It turns out that the joint-surprise of the j-constellation (i.e., j 1s and N − j 0s) at the given firing rates is practically independent of N. There is a small decrease in the number of occurrences of this particular pattern, because with increasing N, more patterns containing the two spikes as a subpattern become available. However, this effect does not seriously affect the detectability of the injected
coincidences. The situation changes when massively parallel data are examined, and patterns of higher complexity become typical. Let the two neurons under consideration be accompanied by 98 other neurons, other parameters as above. The probability that at least 1 of the 98 neurons contributes a spike to the coincidence is Σ_{k=1}^{98} C(98, k) (λh)^k (1 − λh)^{98−k}, which is 1 − (1 − λh)^{98} ≈ 0.86. We conclude that to decide on the empirical relevance of coincidences of higher complexities (j ≥ 3), given a moderate amount of data, it is advisable to set additional criteria, for example, by requiring a minimum absolute number of occurrences (see also Abeles, Bergman, et al., 1993; Martignon, Laskey, Deco, & Vaadia, 1997; Martignon et al., 2000).

4 False Positives
Up to now, we have studied the sensitivity of our method by exploring under which conditions excess coincident events are detectable. However, while striving for high sensitivity (a low fraction of false negatives), we simultaneously need to ensure an appropriately high degree of specificity (a low fraction of false positives). Such false positives are the result of incorrectly assigning the label "excess coincidences" to an experiment where they in fact are not in excess. Thus, we have to establish conditions under which we reach a compromise between a sufficient degree of sensitivity and an acceptable degree of specificity. Therefore, we now analyze various sets of simulated data, with the combined requirement of attaining a high level (90%) of detection (only 10% false negatives), while securing a low level (10%) of false positives. As in the preceding sections, the simulations are described by biologically relevant parameters, varied over a physiologically realistic regime. One hundred independent experiments were performed for each parameter value; from these, the percentage of experiments that crossed a certain threshold level on the joint-surprise was evaluated. This threshold level α was varied in equidistant steps to cover the range of joint-surprise values between −15 and +15.

4.1 Influence of the Firing Rate. In the first step, we kept the number of independent processes constant (N = 2) and varied the rate of the processes. We found that for constellations of complexity 2, the percentage of false positives is practically independent of the background rates (see Figure 8A, left). This is not surprising, because if the rates of the underlying processes were known, and therefore the expected number of coincidences assuming independence n_pred could be determined without error, α would represent the percentage of experiments passing S_α.
The above result ensures that for the parameter regime tested, determination of firing rates from the data does not cause dramatic deviations of the percentage of false positives from the theoretical level α.
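This can be checked with a small simulation of independent pairs (a sketch; the Poisson-based test below is our simplified stand-in for the full joint-surprise machinery):

```python
import math
import random

def poisson_ge(n, mu):
    """P(X >= n) for X ~ Poisson(mu)."""
    return 1.0 - sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(n))

random.seed(1)
p_fire, T, alpha, n_exp = 0.03, 10_000, 0.05, 200
false_pos = 0
for _ in range(n_exp):
    a = [random.random() < p_fire for _ in range(T)]   # independent trains,
    b = [random.random() < p_fire for _ in range(T)]   # no injected events
    n_emp = sum(x and y for x, y in zip(a, b))
    n_pred = (sum(a) / T) * (sum(b) / T) * T           # independence prediction
    if poisson_ge(n_emp, n_pred) < alpha:              # "significant" by chance
        false_pos += 1

print(false_pos / n_exp)  # stays near (at most) alpha
```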
By contrast, the sensitivity for detecting excess coincidences shows a clear dependence on background rates. At low rates, it is very high, but it decreases—rapidly at first, more slowly later—with increasing background rate (see Figure 8A, middle). At background rates above λ = 60 s⁻¹, the threshold for detecting the injected events has decayed to about α = 0.05. Combining these two observations in a single graph, we obtain the intersection range of the joint-surprise necessary to obtain both maximally 10% false positives and minimally 90% sensitivity (the white area in Figure 8A, right). For low α, this region is bounded by an approximately straight vertical line at α = 0.05; the lower boundary of the permissible significance measure is approximately independent of the background rate. The upper bound, however, is clearly curved: the threshold needed for reliable detection decreases with increasing background rate, reaching a level of only 0.05 at λ = 60 s⁻¹. Thus, the higher the rate is, the narrower the bandwidth of α-values permissible to detect excess coincident events selectively and sensitively.

4.2 Influence of the Number of Parallel Processes. Next, we varied the number of independent processes (from 2 to 12) while keeping the rates constant (λ = 20 s⁻¹). For each number of processes, the fractions of false positives and false negatives were evaluated at different threshold levels. We found that the fraction of false positives increased with decreasing threshold and in the given range was independent of the number of processes involved (see Figure 8B, left). Moreover, the sensitivity for excess coincidences (shown for complexity 2 at a coincidence rate of 1 s⁻¹ in Figure 8B, middle) was independent of the number of processes as well. The intersection range of the joint-surprise, necessary to obtain maximally 10% false positives and maximally 10% false negatives, is shown in white in Figure 8B (right panel).
Observe the wide parallel band for selective and sensitive detection, independent of the number of observed processes. If more restrictive criteria (fewer false positives and/or fewer false negatives) are adopted, the band becomes accordingly smaller (not shown here).

4.3 Influence of Pattern Complexity. Higher-order coincidences (coincidences with high complexity j) are rarely found in data with low firing rates and limited numbers of observation time steps (see also section 3.4). Figure 8C (left) illustrates that there are hardly any false positives for complexities 4 or higher, even for threshold levels of α > 0.5. Let n_α represent the smallest n for which Y(n_α | n_pred) < α. Because of the discrete nature of the distribution of coincidence counts y (see section 3.4), Y(n_α | n_pred) can actually be much smaller than α (see the example in section 3.4). If n_pred is exact, Y(n_α | n_pred) actually is the fraction of false positives expected. Therefore, the percentage of false positives can be much smaller than α. This does not contradict the fact that if such coincidences occurred, their joint-surprise would indicate high significance. False positives of lower complexity (up to
3) do show up for threshold levels α below 0.5, but their fraction decreases with pattern complexity. The situation is even more clear-cut for the case of false negatives (see Figure 8C, middle). The detection of injected coincidences is practically 100%. Thus, the intersection graph of the joint-surprise for fewer than 10% false positives and fewer than 10% false negatives (see Figure 8C, right) shows noncompliance for negative values of S_α and j < 4.

5 Discussion
We described a new method to analyze simultaneously recorded single-unit spike trains for signs that (some of) the observed neurons are engaged in a cell assembly. We adopted a widely used operational definition, defining common assembly membership on the basis of near-simultaneity of the joint spike activity of the observed neurons (Abeles, 1982a; Gerstein et al.,
1989). The simultaneous observation of spiking events from N neurons was described by the joint process of N parallel spike trains. By appropriate binning, this was transformed to an N-fold (0,1)-process, a realization of which is represented by a sequence of N-dimensional activity vectors (instances of coincidence patterns) describing the various (0,1)-constellations that occurred across the recorded neurons. Under the null hypothesis of independently firing neurons, the expected number of occurrences of any coincidence pattern and its probability distribution could be calculated analytically on the basis of the single-neuron firing rates. The degree of deviation from independence among the neurons was evaluated by comparing the theoretically expected counts with their empirical counterparts. In order to test the significance of deviations from expectation, we developed a new statistical measure: the joint-surprise. For any coincidence pattern, the joint-surprise measures the probability of finding the observed number of occurrences (or an even larger one) by chance. Those coincidence patterns that violate the null hypothesis of independence define potentially interesting occurrences of unitary joint events. The neurons that contribute a spike to the significant coincidence pattern are considered a subset of the neurons currently engaged in assembly activity. To calibrate the new method and test its performance, we applied it to simulated data sets in which different physiological and analysis parameters were varied in systematic fashion. We used independent Poisson processes

Figure 8: Facing page. Selectivity and sensitivity as a function of firing rate (A), number of neurons (B), and pattern complexity (C). In the left column, the percentage of false positives (fp), that is, 1 − selectivity, is calculated using independent data sets, without injected coincident events.
The percentage of false negatives (fn), that is, 1 − sensitivity, for injected events as a function of the threshold level S_α (abscissa) is shown in the middle column. The overlap regions of maximally 10% false positives and maximally 10% false negatives are indicated in white in the right column. (A) The percentage of false positives and false negatives of pair coincidences within spike data of two parallel processes. The rates of both independent processes λ were identical and increased from 10 s⁻¹ to 100 s⁻¹ in steps of 10 s⁻¹ (ordinate). In the middle and right panels, coincident events of complexity 2 were injected (1 s⁻¹) into the spike data. For each rate, the experiment was repeated 100 times; T = 10⁵, h = 1 ms. The density plots (gray-level coding as indicated) represent the percentage of experiments that crossed the threshold level S_α (fp, left column) or remained below it (fn, middle column), respectively. (B) The percentage of fp and fn for coincidences of complexity 2 for varying numbers of processes N (ordinate), increasing from 2 to 12 (firing rate held constant at 20 s⁻¹, coincidences injected at rate 1 s⁻¹ into 2 of the N processes). Display and other parameters as in A. (C) The percentage of fp and fn for coincidences of varying complexity (ordinate) in a fixed number of processes (N = 6). Display and other parameters as in A. For orientation, the threshold level for α = 0.05 is indicated by a dashed line in all plots.
68
S. Grün, M. Diesmann, and A. Aertsen
to generate control data at various firing rates. The degree of interdependence of the surrogate data was controlled by injecting coincident spiking events at different rates, timing precision, and neuron composition. Specifically, we measured the sensitivity (the probability not to generate false negatives) and the specificity (the probability not to generate false positives) of the method and determined its dependence on various physiological parameters. Overall, the method proved to be both highly sensitive and highly specific to detect the presence of even weak signs of coincident spiking. Moreover, the method is only moderately sensitive to wide-range variations of the tested parameters, largely covering the physiologically relevant regime encountered in cortical neuron recordings. Thus, unitary event analysis provides a simple measure to test for the presence of excess coincident spiking events in experimental data. Since the method takes into account the firing rates of the observed neurons, results from different experiments and/or recordings may be compared. The principal ingredient of the method is the joint-surprise. It provides a convenient measure of the probability that the number of coincident spiking events represents a chance constellation. One may stop at this point and use the resulting probabilities (e.g., by comparing them across different experimental or behavioral conditions) as a means to assess the functional relevance of synchronous spiking. Another way to proceed, explored in this article, is to adopt a common approach in statistics by imposing a threshold level on the joint-surprise function and to focus on the data where this minimum significance level (e.g., α = 0.05 or 0.01) was surpassed. In doing so, selections from the data are highlighted as potentially interesting regarding the presence of excess coincident spiking events. We referred to these events as unitary events, marking highly unexpected joint spike constellations.
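The calibration procedure and significance computation just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: function names and parameter values are our own, and the joint-surprise is taken as the log-odds transform log10((1 − jp)/jp) of the joint-p-value jp, in the spirit of Palm (1981).

```python
import math
import random

def calibration_trains(T, h, rate, c_rate, seed=0):
    """Two parallel (0,1)-processes of T time steps of width h (in s):
    independent background firing at `rate` (spikes/s), plus coincident
    events injected into both trains at `c_rate` (events/s)."""
    rng = random.Random(seed)
    p_bg, p_c = rate * h, c_rate * h
    a, b = [], []
    for _ in range(T):
        coincident = rng.random() < p_c
        a.append(1 if coincident or rng.random() < p_bg else 0)
        b.append(1 if coincident or rng.random() < p_bg else 0)
    return a, b

def joint_p_value(n_emp, n_pred):
    """P(count >= n_emp) for a Poisson count with mean n_pred: the
    probability of finding the observed number of coincidences, or an
    even larger one, by chance."""
    cdf = sum(math.exp(k * math.log(n_pred) - n_pred - math.lgamma(k + 1))
              for k in range(n_emp))
    return 1.0 - cdf

def joint_surprise(n_emp, n_pred):
    """Log-odds of the joint-p-value: positive for excess coincident
    spiking, negative for a deficit."""
    jp = min(max(joint_p_value(n_emp, n_pred), 1e-15), 1.0 - 1e-15)
    return math.log10((1.0 - jp) / jp)

# Calibration run: T = 1e5 bins of h = 1 ms, background rate 20 /s,
# coincidences injected at 1 /s (cf. Figure 8A).
a, b = calibration_trains(100_000, 0.001, 20.0, 1.0, seed=1)
p_a, p_b = sum(a) / len(a), sum(b) / len(b)   # occupation probabilities
n_pred = len(a) * p_a * p_b                   # expected under independence
n_emp = sum(x & y for x, y in zip(a, b))      # empirical coincidence count
S = joint_surprise(n_emp, n_pred)             # clearly positive here
```

With these parameters the injected coincidences raise the empirical count to roughly three times the count expected under independence, driving the joint-surprise far above any conventional threshold.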
Their neuronal composition, as well as the moments at which they occur, may provide information about the underlying dynamics of assembly activation. It is worthwhile to point out that our method does not allow us to distinguish on an individual spike basis which one is an excess event and which is not. Hence, all instances of a significant coincidence pattern are marked. Nevertheless, unitary events may well occur inhomogeneously distributed over the time interval studied, revealing a potentially interesting time structure in relation to the experiment that is not present in the original stationary firing rates. We have formulated the null-hypothesis in terms of statistical independence. In cases where a specific time structure within a single spike train is of interest (e.g., Legendy & Salcman, 1985; Dayhoff & Gerstein, 1983), independence is often formulated as the assumption that the neuronal spike train is a realization of a Poisson process. In cases where parallel processes are tested for spatial and/or temporal patterns (as in our case), often independent Poisson processes are assumed, that is, both independence within the spike trains and independence between them (Palm et al., 1988; Abeles & Gerstein, 1988; Aertsen et al., 1989; Prut et al., 1998). Physiological data, however, often violate the Poisson assumption, and it
Unitary Events: I. Detection and Signicance
is not yet clear how to correct for that in a general (i.e., model-free) manner. One option is to make different assumptions about the nature of the underlying point process, for example, to assume that it is a renewal process (Cox & Isham, 1980) or, more specifically, a γ-process (e.g., Pauluis & Baker, 2000; Baker & Lemon, 2000). We have tested the influence of violations of the Poisson assumption on the occurrence of false positives using γ-processes (see appendix B). Results from our parametric study, where the structure of the point processes was varied from bursty to regular firing, indicate that the unitary event analysis method is quite robust against such violations. Another option, which we are currently exploring, is a bootstrap-type method, shuffling the spike trains across trials to generate surrogate data from which one can estimate the expected numbers of the various constellations quasi-empirically (Pipa, Singer, & Grün, 2001). By this procedure, the temporal structure of the individual spike trains is taken into account, and an explicit hypothesis about the generating processes need not be made. Another aspect of extending the formulation of the null-hypothesis is to take into account correlations among subsets of neurons. In this context, a promising new approach proposed recently is to extend the null-hypothesis of independence to incorporate interactions among subsets of the neurons contributing spikes to a given coincidence pattern (Martignon, von Hasseln, Grün, Aertsen, & Palm, 1995; Martignon et al., 1997, 2000). Constellations of complexity higher than 2 in independent multiple parallel processes are relatively rare. However, if they occur, they are very likely to be detected as false positives in data sets of finite length (see section 3.4; Roy et al., 2000).
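The trial-shuffling surrogate mentioned above can be sketched as follows. This is an illustrative fragment with our own naming, not the procedure of Pipa, Singer, and Grün (2001): pairing the spike train of one neuron in one trial with that of the other neuron in a different trial destroys any trial-locked coincident firing while preserving each train's internal temporal structure.

```python
import random

def shuffled_coincidence_estimate(trials_a, trials_b, n_shuffles=1000, seed=0):
    """Quasi-empirical expected coincidence count per trial, estimated
    by repeatedly counting coincidences between train a of one randomly
    chosen trial and train b of a *different* randomly chosen trial.
    Each trial is a (0,1)-sequence; at least two trials are required."""
    rng = random.Random(seed)
    n_trials = len(trials_a)
    total = 0
    for _ in range(n_shuffles):
        i = rng.randrange(n_trials)
        j = rng.randrange(n_trials)
        while j == i:                 # never pair a trial with itself
            j = rng.randrange(n_trials)
        total += sum(x & y for x, y in zip(trials_a[i], trials_b[j]))
    return total / n_shuffles
```

Comparing the within-trial coincidence count against this shuffled estimate then indicates excess synchrony without assuming Poisson statistics.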
In order to account for such spurious detections of high-complexity patterns, it is advisable to apply an additional test at a meta-level, for example, by requiring a minimal absolute number of occurrences of the high-complexity event or by applying an additional statistical test (Prut et al., 1998). Here too, bootstrap techniques may be invoked (e.g., Nadasdy, Hirase, Czurko, Csicsvari, & Buzsaki, 1999) to provide additional means to differentiate false positives from true positives in such regimes of relatively rare occurrences. Another source of false positives is the violation of the assumption of stationarity. The firing rates, measured by averaging over time (and trials) and serving as the basis to test the null-assumption, may not reflect the instantaneous behavior. Particularly in regions with a higher-than-average rate, unitary events may be detected by our method for incorrect reasons (for related problems with cross-correlation measures, see, e.g., Brody, 1999a, 1999b). Unfortunately, however, a strict requirement of stationarity may sometimes disqualify a large portion of the experimental data, especially from awake, behaving animals. Therefore, a more promising approach is to adopt techniques that enable us to make reliable estimates of the instantaneous firing rates (Nawrot, Aertsen, & Rotter, 1999; Pauluis & Baker, 2000). Giving up the concept that neuronal spiking is driven by a (potentially time-dependent) intensity function, the significance test can also be based
on counting statistics, thereby removing the problem of rate estimation from experimental data (Gütig et al., in press). The method of unitary event analysis bears clear relations to the dynamic analysis of cross-correlation, as exemplified, for example, in the JPSTH (Aertsen et al., 1989). It presents an extension, in that it enables us to analyze more than two neurons at a time, and a restriction, by focusing on coincident events only. In principle, the method could also accommodate any specific arrangement of coincidence delays unequal to zero. However, the combinatorial problems associated with exploring all possible such arrangements are beyond our present capabilities. The synfire model (Abeles, 1991) has prompted scientists to search specifically for spatiotemporal firing patterns in multiple single-neuron spike trains (Abeles & Gerstein, 1988; Abeles, Bergman, et al., 1993; Villa & Abeles, 1990; Prut et al., 1998; Date, Bienenstock, & Geman, 1998). In contrast to the method of Prut et al. (1998), however, unitary event analysis focuses on spatial patterns. Binning provides a general and straightforward mechanism to control the amount of temporal jitter allowed in the definition of a coincidence. It is applicable to N parallel processes. Unfortunately, for coincidences with large temporal jitter, sensitivity is reduced due to the fission of coincidences at the binning grid (see section 3.2; Grün et al., 1999). For N = 2, methods have been developed to detect near-coincidences without the need of binning (Grün et al., 1999; Pauluis & Baker, 2000). However, no such method currently exists for N > 2. We are exploring the detection of near-coincidences in large numbers of parallel processes without discretization of time (Grün & Diesmann, 2000). One important issue remains to be solved before we can apply this framework to physiological data and study neuronal assembly dynamics in relation to stimuli and behavioral events.
Until now, we have considered only the case of neurons firing at a stationary rate and with stationary coincident activity among them. Physiological data, however, are usually not stationary. Firing rates vary considerably as a function of time, particularly when the animal is presented with adequate stimuli or is engaged in a behavioral task. A second type of nonstationarity is that coincident firing itself may be nonstationary, for example, by being time-locked to a stimulus or behavioral event even if the rates of the neurons are constant (Vaadia et al., 1995). Since our analysis so far derives its measures globally from the entire observation interval, the time-locked occurrence of coincidences might be overlooked. In the companion article in this issue, we address both types of nonstationarity and extend our theoretical framework accordingly.

Appendix A: Notation
T         temporal duration of observation interval, [T] = unit of time
h         time resolution of data, [h] = unit of time
T         temporal duration of observation interval in units of h, [T] = 1
M         number of trials
v_i       (0,1)-sequence of neuron i
N         number of simultaneously observed neurons
v(t)      coincidence pattern at time step t, N-dimensional vector
v_k       coincidence pattern k, N-dimensional vector
m         number of possible patterns
ξ(v_k)    complexity of v_k
n         general coincidence count
n_k^emp   empirical coincidence count of v_k
n_k^pred  expected coincidence count of v_k
P         general probability, in expressions like P(k ≥ 1)
p_i       occupation probability for neuron i
P_k       probability of coincidence pattern v_k
ψ         distribution of coincidence counts
Ψ         joint-p-value
S         joint-surprise
α         significance level
λ         background firing rate, [λ] = 1/unit of time
λ_c       coincidence rate, [λ_c] = 1/unit of time
b         bin size in units of h, [b] = 1
T_b       number of time steps after binning, [T_b] = 1
p_b       occupation probability after binning
s         temporal jitter of injected coincidences in units of h, [s] = 1
MD_k      mutual dependence of v_k; see appendix C
MD(t)     time-resolved mutual dependence
Appendix B: Violation of the Assumption of Poissonian Spike Trains
In order to test how sensitive the unitary event analysis method is to a violation of the assumption of Poisson spike trains, we conducted the following experiment: Independent parallel spike trains (N = 2) were modeled as γ-processes and analyzed for the occurrence of significant coincident events, that is, false positives (similar to section 4). A γ-process allows us to vary the spike train structure from "burstiness" to regular spiking by variation of a single parameter only: the "shape" parameter γ. γ-processes belong to the class of renewal processes and can be simulated by successively drawing interspike intervals from the interval distribution

    f(t) = λ · e^(−λt) · (λt)^(γ−1) / Γ(γ).                    (B.1)

For γ = 1 the spike train is Poissonian (coefficient of variation (CV) = 1). If γ is chosen < 1, the resulting spike train exhibits clusters or bursts of spikes, leading to a high variability of the interspike intervals (CV > 1). By contrast, if γ is chosen > 1, the spike train is more regular; the higher γ is, the smaller the variability of the interspike intervals (CV < 1).
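Such a γ-process can be simulated directly; the sketch below is illustrative, not the authors' code. Note one assumption: interspike intervals are drawn here with scale 1/(γλ), so that the mean firing rate equals λ for every shape factor (a common convention); equation B.1 taken literally corresponds to scale 1/λ, that is, a mean interval of γ/λ.

```python
import random

def gamma_spike_train(rate, shape, duration, seed=0):
    """Renewal process with gamma-distributed interspike intervals.
    shape = 1: Poisson (CV = 1); shape < 1: bursty (CV > 1);
    shape > 1: regular (CV < 1). Scale 1/(shape*rate) keeps the
    mean firing rate at `rate` for any shape factor (an assumption
    of this sketch, see the lead-in)."""
    rng = random.Random(seed)
    t, spikes = 0.0, []
    while True:
        t += rng.gammavariate(shape, 1.0 / (shape * rate))
        if t > duration:
            return spikes
        spikes.append(t)
```

Discretizing such spike times onto a 1 ms grid, clipping multiple spikes per bin as described in the Figure 9 caption, yields the (0,1)-processes used in the false-positive test.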
Figure 9: False positives in non-Poissonian spike trains. Two parallel spike trains of duration T = 100 s and time resolution h = 1 ms were simulated as independent γ-processes with parameters rate λ and shape factor γ. λ was varied from 10 to 100 s⁻¹ in steps of 10 s⁻¹, and the shape factor γ was varied between 0.1 and 50: in steps of 0.1 between 0.1 and 1, up to 10 in steps of 1, and above in steps of 5. For each parameter constellation, the simulation was repeated 1000 times; the percentage of cases showing significant outcomes at given significance levels is derived as false positives (fp). (A) The matrix illustrates the percentage of false positives in gray code as a function of shape factor (horizontal) and given rate parameter (vertical) for a significance level α = 0.01. Note that the resulting rate may differ from the given rate parameter, since for shape factors γ < 1, a relatively large number of spikes occur with interspike intervals ≤ h, which are clipped to one spike per time resolution bin in the simulation process. The larger the given rate and the smaller γ, the larger the reduction in rate (at λ = 100 s⁻¹ and γ = 0.1, about 50%). (B) Percentage of false positives as a function of γ averaged over all rate levels (top) and as a function of rate parameter λ averaged over all shape factors (bottom), displayed for various significance levels α = 0.01, 0.02, …, 0.05.
For γ → ∞, the process approaches a clock process, with a fixed value for the interspike interval. Thus, by varying the shape factor from 0.1 to 50, we covered a wide range of variability of experimentally observed spike trains (e.g., Softky & Koch, 1993; Baker & Lemon, 2000; Nawrot, Riehle, Aertsen, & Rotter, 2000). For the significance test, the same procedure was used as introduced for the Poissonian spike trains: a Poisson distribution with its mean set to the expected number of coincidences (see equations 2.3 and 2.9). Two parameters were systematically varied in the simulations: the rate parameter λ of the processes and the shape factor γ. For each parameter constellation, the simulation of duration T = 100 s was repeated 1000 times, and the percentage of cases showing significant outcomes was derived. Figure 9A illustrates
the percentage of false positives in gray code as a function of shape factor (horizontal) and rate parameter (vertical) for a significance level α = 0.01. Observe that the percentage of false positives varies between 0% and 2%, in a range around the value expected given the applied significance level of 1%. The matrix does not appear to be clearly structured but shows a weak tendency toward higher percentages of false positives (2%) with increasing rate and shape factor. The projections (and averages) of the results onto the shape axis (see Figure 9B, top) and onto the rate axis (see Figure 9B, bottom) show that an increasing rate does not change the number of false positives, but with increasing shape factor the number of false positives increases slightly; this effect is somewhat stronger for less strict significance levels (α = 0.02, …, 0.05). In summary, we conclude that the unitary event analysis method behaves quite robustly, with respect to the significance level α, against a violation of the Poisson assumption, realized here as γ-processes.

Appendix C: Mutual Dependence
A different approach to detect dependencies in parallel spike data is to use a measure related to the general framework of information theory: mutual dependence, derived from the mutual information and redundancy (for details, see Grün, 1996). For each particular activity constellation v_k, the mutual dependence MD_k is defined in terms of the joint-probability P_k^pred expected under the null-hypothesis (see equation 2.3) and its empirical counterpart,

    P_k^emp = n_k^emp / T,                                    (C.1)

by

    MD_k = ln( P_k^emp / P_k^pred ).                          (C.2)

Thereby, we obtain:

    if P_k^emp < P_k^pred, then MD_k < 0:  "negative" dependence;
    if P_k^emp = P_k^pred, then MD_k = 0:  independence;      (C.3)
    if P_k^emp > P_k^pred, then MD_k > 0:  "positive" dependence.
Hence, any deviation of MD_k from 0 will indicate deviations from the null-hypothesis of independence for the corresponding activity constellation v_k. Note that mutual dependence is a time-averaged measure over the entire duration of the spike trains under observation. We can make this into
Figure 10: Time-resolved mutual dependence. The dot display (top) and the time-resolved mutual dependence (bottom) are shown for a simulated data set into which coincident spikes were injected (same data as in Figure 3B). In contrast to Figure 3B, the spike data of the trials are concatenated and, individually for each neuron, represented as a single continuous series of dots.
a "local" time measure, however, by replacing each individual vector v_k, at the points in time t_j^k, j ∈ {1, …, n_k^emp}, where it occurs, by the associated mutual dependence value MD_k. This leads to a time-varying function, the time-resolved mutual dependence MD(t):

    MD(t) = Σ_{k=1}^{m} Σ_{j=1}^{n_k^emp} MD_k · δ(t − t_j^k).        (C.4)
The delta function in equation C.4 selects the MD value corresponding to the coincidence pattern occurring at the time of interest. The resulting time series describes the individual contributions to the MD in the course of time. The result is a discontinuous, "peaky" function of time, as shown in Figure 10 for a simulated data set with injected coincident events. The amplitudes, ranging from negative to positive values, typically vary strongly from one time step to the next. Positive amplitudes tend to have higher values than negative ones and pop out from small fluctuations around 0. Positive peaks "point" to spike constellations for which the probability of occurrence is higher than expected, assuming independent processes. Negative peaks "point" to spike constellations with a probability of occurrence lower than expected. Thus, the time-resolved mutual dependence may be interpreted as a "dynamic pointer" to instances of joint spike constellations representing conspicuous deviations from independence. A comparison of the performance of MD to the joint-surprise with regard to false positives and false negatives (cf. section 4) revealed that the
Figure 11: (A) Selectivity and sensitivity of the mutual dependence for pair coincidences injected into two parallel processes as a function of firing rates. (B) Results for the joint-surprise, for comparison. The percentage of false positives (fp: left column), false negatives (fn: middle column), and the resulting overlap of maximally 10% fp and maximally 10% fn (right column). The thresholds on the MD were varied from −2 to 2 in equidistant steps. The dashed line indicates the MD value corresponding to Sα (α = 0.05) for background rate λ = 10 s⁻¹. Further details as in Figure 8.
MD measure, in contrast to the joint-surprise, strongly depends on the firing rates of the neurons involved (see Figure 11). This dependence expresses itself in the curved shape of the sensitivity-selectivity overlap region (the white area in Figure 11A, right). As a result, different significance thresholds would be required for different firing rates. Thus, to obtain adequate performance of the MD, the threshold needs to be adjusted, point by point, to the associated rates. Evidently, this is impractical, considering that the observed processes typically have different individual firing rates. In addition, the rate dependence poses problems when comparing data from different experiments. For these reasons, we do not pursue this measure further here; more details, and results of application to neuronal data, can be found in Grün (1996).
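Equations C.1 through C.4 translate directly into code. The following is an illustrative reconstruction (our own naming; the article provides no implementation), with coincidence patterns encoded as integer indices and time discretized into unit steps:

```python
import math

def mutual_dependence(n_emp_k, n_pred_k, T):
    """MD_k = ln(P_emp / P_pred), with P_emp = n_emp_k / T and
    P_pred = n_pred_k / T (equations C.1 and C.2)."""
    return math.log((n_emp_k / T) / (n_pred_k / T))

def time_resolved_md(pattern_at_t, n_pred):
    """MD(t) of equation C.4 on a discrete time grid: at each step,
    the MD value of the coincidence pattern observed there.
    `pattern_at_t` maps time step -> pattern index k; `n_pred` maps
    pattern index -> expected count under independence."""
    T = len(pattern_at_t)
    n_emp = {}                         # empirical counts per pattern
    for k in pattern_at_t:
        n_emp[k] = n_emp.get(k, 0) + 1
    return [mutual_dependence(n_emp[k], n_pred[k], T)
            for k in pattern_at_t]
```

Steps at which the observed pattern is over-represented relative to its expected count yield positive values; under-represented patterns yield negative values, reproducing the "peaky" trace of Figure 10.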
Acknowledgments
We thank Moshe Abeles, George Gerstein, and Günther Palm for many stimulating discussions and help in the initial phase of the project. We also thank Robert Gütig, Stefan Rotter, and Wolf Singer for their constructive comments on an earlier version of the manuscript for this article. This work was partly supported by the DFG, BMBF, HFSP, GIF, and Minerva.

References

Abeles, M. (1982a). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1982b). Quantification, smoothing, and confidence limits for single-units' histograms. J. Neurosci. Meth., 5, 317–325.
Abeles, M. (1982c). Role of cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M. (1983). The quantification and graphic display of correlations among three spike trains. IEEE Trans. Biomed. Eng., 30, 236–239.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70(4), 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60(3), 909–924.
Abeles, M., Vaadia, E., Prut, Y., Haalman, I., & Slovin, H. (1993). Dynamics of neuronal interactions in the frontal cortex of behaving monkeys. Conc. Neurosci., 4(2), 131–158.
Aertsen, A., & Arndt, M. (1993). Response synchronization in the visual cortex. Curr. Op. Neurobiol., 3, 586–594.
Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Aertsen, A., Vaadia, E., Abeles, M., Ahissar, E., Bergman, H., Karmon, B., Lavner, Y., Margalit, E., Nelken, I., & Rotter, S. (1991). Neural interactions in the frontal cortex of a behaving monkey: Signs of dependence on stimulus context and behavioral state.
J. Hirnf., 32(6), 735–743.
Ahissar, M., Ahissar, E., Bergman, H., & Vaadia, E. (1992). Encoding of sound-source location and movement: Activity of single neurons and interactions between adjacent neurons in the monkey auditory cortex. J. Neurophysiol., 67, 203–215.
Baker, S., & Lemon, R. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance level. J. Neurophysiol., 84, 1770–1780.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Barlow, H. (1992). Single cells versus neuronal assemblies. In A. Aertsen & V. Braitenberg (Eds.), Information processing in the cortex (pp. 169–173). Berlin: Springer-Verlag.
Brody, C. D. (1999a). Correlations without synchrony. Neural Comp., 11, 1537–1551.
Brody, C. D. (1999b). Disambiguating different covariation types. Neural Comp., 11, 1527–1535.
Cox, D. R., & Isham, V. (1980). Point processes. London: Chapman and Hall.
Date, A., Bienenstock, E., & Geman, S. (1998). On the temporal resolution of neural activity (Tech. Rep.). Providence, RI: Division of Applied Mathematics, Brown University.
Dayhoff, J. E., & Gerstein, G. L. (1983). Favored patterns in spike trains. II. Application. J. Neurophysiol., 49(6), 1349–1363.
DeCharms, R., & Merzenich, M. (1996). Primary cortical representation of sounds by the coordination of action-potential timing. Nature, 381, 610–613.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Conditions for stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitböck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Eggermont, J. J. (1992). Neural interaction in cat primary auditory cortex II. Effects of sound stimulation. J. Neurophysiol., 71, 246–270.
Engel, A. K., König, P., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. TINS, 15(6), 218–226.
Feller, W. (1968). An introduction to probability theory and its applications (Vol. 1, 3rd ed.). New York: Wiley.
Fetz, E. E. (1997). Temporal coding in neural populations. Science, 278, 1901–1902.
Georgopoulos, A. P., Taira, M., & Lukashin, A. (1993). Cognitive neurophysiology of the motor cortex. Science, 260, 47–52.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Grammont, F., & Riehle, A. (1999). Precise spike synchronization in monkey motor cortex involved in preparation for movement. Exp. Brain Res., 128, 118–122.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Nat. Acad. Sci. USA, 86, 1698–1702.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance, and interpretation. Thun: Verlag Harri Deutsch.
Grün, S., & Aertsen, A. (1998). Unitary events in non-stationary multiple-neuron activity. Society for Neuroscience Abstracts, 24, 1670.
Grün, S., Aertsen, A., Abeles, M., & Gerstein, G. (1993). Unitary events in multi-neuron activity: Exploring the language of cortical cell assemblies. In N. Elsner & M. Heisenberg (Eds.), Gene-Brain-Behaviour. Göttingen: Thieme Verlag.
Grün, S., Aertsen, A., Abeles, M., Gerstein, G., & Palm, G. (1994). Behavior-related neuron group activity in the cortex. In Proc. 17th Ann. Meeting European Neurosci. Assoc. Oxford: Oxford University Press.
Grün, S., & Diesmann, M. (2000). Evaluation of higher-order coincidences in multiple parallel processes. Society for Neuroscience Abstracts, 26, 2201.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. J. Neurosci. Meth., 94, 67–79.
Gütig, R., Aertsen, A., & Rotter, S. (in press). Statistical significance of coincident spikes: Count-based versus rate-based statistics. Neural Comp.
Hatsopoulos, N. G., Ojakangas, C. L., Paninski, L., & Donoghue, J. P. (1998). Information about movement direction obtained from synchronous activity of motor cortical areas. Proc. Nat. Acad. Sci. USA, 95, 15706–15711.
Hays, W. L. (1994). Statistics (5th ed.). Orlando, FL: Harcourt Brace.
Hebb, D. O. (1949). Organization of behavior: A neurophysiological theory. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1977). Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B., 198(1130), 1–59.
Johannesma, P., Aertsen, A., van den Boogaard, H., Eggermont, J., & Epping, W. (1986). From synchrony to harmony: Ideas on the function of neural assemblies and on the interpretation of neural synchrony. In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 25–47). Berlin: Springer-Verlag.
Laubach, M., Wessberg, J., & Nicolelis, M. (2000). Cortical ensemble activity increasingly predicts behaviour outcomes during learning of a motor task. Nature, 405, 567–571.
Legendy, C. R. (1975). Three principles of brain function and structure. Intern. J. Neurosci., 6, 237–254.
Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53(4), 926–939.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Comp., 12, 2621–2653.
Martignon, L., Laskey, K., Deco, G., & Vaadia, E. (1997). Learning exact patterns of quasi-synchronization among spiking neurons from data on multi-unit recordings. In M. Jordan & M. Mozer (Eds.), Advances in neural information processing systems, 9 (pp. 145–151). Cambridge, MA: MIT Press.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biol. Cybern., 73, 69–81.
Murthy, V. N., & Fetz, E. E. (1992). Coherent 25- to 35-Hz oscillations in the sensorimotor cortex of awake behaving monkeys. Proc. Nat. Acad. Sci. USA, 89, 5670–5674.
Nadasdy, Z., Hirase, H., Czurko, A., Csicsvari, J., & Buzsaki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. J. Neurosci., 19(21), 9497–9507.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates. J. Neurosci. Meth., 94, 81–91.
Nawrot, M., Riehle, A., Aertsen, A., & Rotter, S. (2000). Spike count variability in motor cortical neurons. Eur. J. Neurosci., 12 (suppl. 11), 506.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341(6237), 52–54.
Nicolelis, M. A. L., Baccala, L. A., Lin, R. C. S., & Chapin, J. K. (1995). Sensorimotor encoding by synchronous neural assembly activity at multiple levels in the somatosensory system. Science, 268, 1353–1358.
Palm, G. (1981). Evidence, information and surprise. Biol. Cybern., 42, 57–68.
Palm, G. (1990). Cell assemblies as a guideline for brain research. Conc. Neurosci., 1, 133–148.
Palm, G., Aertsen, A., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Pauluis, Q. (1999). Temporal coding in the superior colliculus. Unpublished doctoral dissertation, Université catholique de Louvain, Brussels, Belgium.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Comp., 12(3), 647–669.
Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neurosci. Res. Prog. Bull., 6.
Pipa, G., Singer, W., & Grün, S. (2001). Non-parametric significance estimation for unitary events. In N. Elsner & G. M. Kreutzberg (Eds.), Proc. of the German Neuroscience Society. Stuttgart: Thieme Verlag.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Hamutal, S., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Riehle, A., Grammont, F., Diesmann, M., & Grün, S. (2000). Dynamical changes and temporal precision of synchronized spiking activity in motor cortex during movement preparation. J. Physiol. (Paris), 94(5–6), 569–582.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roelfsema, P., Engel, A., König, P., & Singer, W. (1996). The role of neuronal synchronization in response selection: A biologically plausible theory of structured representations in the visual cortex. J. Cogn. Neurosci., 8(6), 603–625.
Roy, A., Steinmetz, P., & Niebur, E. (2000). Rate limitations of unitary event analysis. Neural Comp., 12, 2063–2082.
Sakurai, Y. (1996). Population coding by cell assemblies—what it really is in the brain. Neuroscience Research, 26, 1–16.
Sanes, J. N., & Donoghue, J. P. (1993). Oscillations in local field potentials of the primate motor cortex during voluntary movement. Proc. Nat. Acad. Sci. USA, 90, 4470–4474.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W. (1999). Neural synchrony: A versatile code for the definition of relations. Neuron, 24, 49–65.
Singer, W., Engel, A. K., Kreiter, A. K., Munk, M. H. J., Neuenschwander, S., & Roelfsema, P. R. (1997). Neuronal assemblies: Necessity, signature and detectability. Trends in Cognitive Sciences, 1(7), 252–261.
Singer, W., & Gray, C. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. of Neurosci., 18, 555–586.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Steinmetz, P., Roy, A., Fitzgerald, P., Hsiao, S., Johnson, K., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404(6774), 187–190.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373(6514), 515–518.
Villa, A., & Abeles, M. (1990). Evidence for spatiotemporal firing patterns within the auditory thalamus of the cat. Brain Res., 509(2), 325–327.
von der Malsburg, C. (1981). The correlation theory of brain function (Int. Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.

Received July 20, 2000; accepted April 24, 2001.
LETTER
Communicated by George Gerstein
Unitary Events in Multiple Single-Neuron Spiking Activity: II. Nonstationary Data

Sonja Grün
[email protected]
Department of Neurophysiology, Max-Planck Institute for Brain Research, D-60528 Frankfurt/Main, Germany

Markus Diesmann
[email protected]
Department of Nonlinear Dynamics, Max-Planck-Institut für Strömungsforschung, D-37073 Göttingen, Germany

Ad Aertsen
[email protected]
Department of Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, D-79104 Freiburg, Germany

In order to detect members of a functional group (cell assembly) in simultaneously recorded neuronal spiking activity, we adopted the widely used operational definition that membership in a common assembly is expressed in near-simultaneous spike activity. Unitary event analysis, a statistical method to detect the significant occurrence of coincident spiking activity in stationary data, was recently developed (see the companion article in this issue). The technique for the detection of unitary events is based on the assumption that the underlying processes are stationary in time. This requirement, however, is usually not fulfilled in neuronal data. Here we describe a method that properly normalizes for changes of rate: unitary events by moving window analysis (UEMWA). Analysis for unitary events is performed separately in overlapping time segments by sliding a window of constant width along the data. In each window, stationarity is assumed. Performance and sensitivity are demonstrated by use of simulated spike trains of independently firing neurons, into which coincident events are inserted. If cortical neurons organize dynamically into functional groups, the occurrence of near-simultaneous spike activity should be time varying and related to behavior and stimuli. UEMWA also accounts for these potentially interesting nonstationarities and allows locating them in time. The potential of the new method is illustrated by results from multiple single-unit recordings from frontal and motor cortical areas in awake, behaving monkey.

Neural Computation 14, 81–119 (2001)
© 2001 Massachusetts Institute of Technology
1 Introduction
In the companion article in this issue, unitary event analysis was introduced to detect a certain type of statistical dependency in the spiking activities of simultaneously recorded neurons: near-coincident spike constellations that occur more often than expected on the basis of independent firing rates. In the literature, such events are discussed as signatures of coherent cell assemblies, considered to be the building blocks of cortical processing (see the companion article for references). Unitary event analysis as described in the companion article was based on the assumption that the underlying processes are stationary. Typically, however, experimental data show modulations in firing rates. In fact, often stimuli are manipulated to enhance these "responses." In this article, we describe an extension of the stationary method, unitary events by moving window analysis (UEMWA), specifically designed to enable an application to nonstationary data. After introducing and describing the method in detail (section 2), we illustrate its performance and discuss its sensitivity using simulated spike sequences under different scenarios of nonstationarities in firing rate and nonstationarities in coincidence rate, on the same and on different timescales (section 3). Two experimental data sets from frontal and motor cortical recordings are used to illustrate the occurrence of unitary events in neuronal data and their relation to behavioral context (section 4). In section 5, we concentrate on practical aspects of the application of our new method. An assessment of the problems of false positives and false negatives is followed by guidelines for proper choice of analysis parameters.

2 Detecting Unitary Events by Moving Window Analysis
The task is to develop a method that allows the detection of unitary events in nonstationary spike data. The basic idea is to segment the data into sections over which stationarity can be assumed and to analyze these sections separately by the method developed in the companion article for the stationary situation. Here, we construct a procedure in which a time window of width Tw is slid along the data, and unitary event analysis is performed separately at each window position, defining slightly different rate environments (an alternative approach is described in appendix D and discussed in section 5). The moving window has to be narrow enough that the firing rates can be assumed to be stationary, and at the same time long enough to obtain sufficient statistics. It turns out that a third criterion, the time-dependent rate of the spike coincidences to be detected, also influences the optimal choice of the analysis window size. In this section, we describe the procedure sketched above in detail and work out its statistical interpretation, using the results of the companion article.

Peristimulus time histograms (PSTHs) of spike data from motor and frontal cortical areas show that firing rates are usually sufficiently stationary (cf. section 4, Figures 6 and 7) over time windows on the order of 50 to 100 ms. However, considering a rate level of 1 to 50 s⁻¹, we expect only 0.1 to 5.0 spikes in a single window. Clearly, these numbers are too low for a statistical comparison of the numbers of expected and observed coincidences. An important assumption of the PSTH is that firing rates are stationary across trials, and, thus, the average over trials allows computing a reliable estimate of the firing rate at any point in time. Using this assumption, the set of trials performed for a particular experimental condition can be combined to overcome the problem of the low numbers of counts stated above. Figure 1 illustrates how the data of all available trials are used to construct a new process. For a time window centered at ti, the data from the M trials are concatenated to form a new set of parallel spike trains of length M · Tw. Let vj(t) be the parallel (0, 1)-process (see the companion article) describing the neuronal spike data of trial j. In this article, all variables representing time are in units of the temporal resolution h of vj. In these units, the window width is an odd integer,

$$T_w = 2n + 1, \qquad n \in \{0, 1, 2, \ldots\}. \tag{2.1}$$

The new parallel process v on the new time axis t′ is given by

$$v(t') = v_j(t) \tag{2.2}$$

with

$$t' = (j - 1) \cdot T_w + t - \left( t_i - \tfrac{1}{2}(T_w - 1) \right), \tag{2.3}$$

where

$$t \in \left\{ t_i - \tfrac{1}{2}(T_w - 1), \ldots, t_i + \tfrac{1}{2}(T_w - 1) \right\} \tag{2.4}$$

$$j \in \{1, \ldots, M\}. \tag{2.5}$$

The stationary unitary event analysis is then performed on these new parallel spike trains. It can be summarized as follows. The average firing probabilities of the processes within the time window determine the expected number of coincidences; the empirical number of coincidences results from a counting process within the window over all trials. The significance of excess (or lacking) coincidences is evaluated by comparing the expected and empirical numbers using the joint-surprise measure, a logarithmic transform of the joint-p-value. The latter expresses the probability of observing the empirical number of coincidences by chance, under the null hypothesis of independent Poisson processes. The full data set is analyzed by successively moving the time window from one position ti to the next (usually in steps of one bin) and repeating the procedure sketched above.
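The window concatenation of equations 2.2 through 2.5 amounts to a simple slicing loop. A minimal sketch in Python (the function name and data layout are our own, not from the article; spike trains are assumed to be stored as lists of 0/1 values at resolution h):

```python
def concatenate_windows(trials, t_i, T_w):
    """Build the auxiliary process v(t') of equations 2.2-2.5.

    trials : list of M spike trains, each a list of 0/1 values (one per
             time step of resolution h); all times are in units of h.
    t_i    : center of the analysis window (0-based index).
    T_w    : window width, an odd integer T_w = 2n + 1 (equation 2.1).

    Returns a single 0/1 sequence of length M * T_w in which the window
    [t_i - (T_w-1)/2, t_i + (T_w-1)/2] of trial j occupies the segment
    starting at (j - 1) * T_w, as prescribed by equation 2.3.
    """
    if T_w % 2 != 1:
        raise ValueError("window width must be odd (equation 2.1)")
    half = (T_w - 1) // 2
    v = []
    for trial in trials:                                # j = 1, ..., M
        v.extend(trial[t_i - half : t_i + half + 1])    # equation 2.4
    return v
```

The stationary analysis of the companion article can then be run unchanged on the concatenated sequence, one window position at a time.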
Figure 1: Sketch of the moving window analysis. (A) Parallel spike trains of five neurons (spikes marked as dots) for several repetitions 1, 2, . . . , M (trials) of the same experiment. A window of width Tw centered at a given point in time ti defines the segment of the data (shaded in gray) that enters the analysis at ti. (B) From the data in each such time segment, a new time axis t′ is constructed by concatenating the windows from all trials. Unitary event analysis is then performed on this new process. The full data set is analyzed by successively moving the window to the next point in time and repeating the above procedure (indicated by dashing). Typically, the window is shifted in steps of the time resolution of the spike data.
3 Dependence of Significance on Spike Rates
In this section we describe the performance of the UEMWA method under different conditions, including various time courses of the firing rates and of the coincidence rate. Special emphasis will be put on the width of the analysis time window, which sets detectability limits. After introducing a common theoretical framework, we discuss stationary and nonstationary rates. We first formulate a thought experiment in which we analytically describe the variables of the processes as expectation values. This will serve as a description of the average realization. Theoretical results will be illustrated by actual realizations in the form of simulated data.

3.1 Description of the Performance Test. For convenience we will restrict our considerations to two processes, both having the same rates. The results, however, are easily extended to N > 2 processes and to differing rates. Dependencies between the two processes and their consequences for our measures are studied by injecting coincident Poisson events at a given rate level λc(t) into two independent Poisson processes with (background) rates λ(t) (for a detailed discussion, see Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999). The temporal resolution h is assumed to be 1 ms.

Let us now derive expectation values for the empirical number of coincidences n^emp in a time interval Tw and for the number of coincidences n^pred we expect to find on the basis of the rates. The probability of finding a coincidence at time t is λc(t)h + (λ(t)h)². Therefore, the expectation value for the empirical coincidence count in Tw is

$$n^{\mathrm{emp}}(t) = M \cdot \sum_{\tau = t - \frac{1}{2}(T_w - 1)}^{t + \frac{1}{2}(T_w - 1)} \left[ \lambda_c(\tau)\,h + \bigl(\lambda(\tau)\,h\bigr)^2 \right]. \tag{3.1}$$

Knowing only the rate of the processes (λc(t) + λ(t)), the expected number of coincidences in a window Tw centered at t is

$$n^{\mathrm{pred}}_{\ast}(t) = M \cdot \sum_{\tau = t - \frac{1}{2}(T_w - 1)}^{t + \frac{1}{2}(T_w - 1)} \bigl[ (\lambda_c(\tau) + \lambda(\tau))\,h \bigr]^2. \tag{3.2}$$

However, given experimental data, the firing rates are not known and have to be estimated from the data. Assuming that the rates are stationary over the duration of a time window Tw, the rate can be estimated by the ratio of the spike count and the number of time steps. (The effect of rate estimation on stationary unitary event analysis is discussed in the companion article.) Having available only an estimate of the average firing rate, equation 3.2 reads

$$n^{\mathrm{pred}}(t) = M\,T_w \cdot \left[ \frac{1}{T_w} \sum_{\tau = t - \frac{1}{2}(T_w - 1)}^{t + \frac{1}{2}(T_w - 1)} (\lambda_c(\tau) + \lambda(\tau))\,h \right]^2. \tag{3.3}$$
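The three expectation values can be evaluated numerically. A minimal sketch of equations 3.1 through 3.3 (the function name and calling convention are ours; rates are passed as functions of time in units of h and return values in s⁻¹):

```python
def expected_counts(lam, lam_c, t, T_w, M, h=1e-3):
    """Expectation values of equations 3.1-3.3 for two processes.

    lam, lam_c : background rate and injected-coincidence rate, given
                 as functions of the time step (rates in s^-1).
    Returns (n_emp, n_pred_star, n_pred): the expected empirical count
    (eq. 3.1), the prediction from the true time-resolved rates
    (eq. 3.2), and the prediction from the window-averaged rate
    estimate (eq. 3.3).
    """
    half = (T_w - 1) // 2
    taus = range(t - half, t + half + 1)
    # eq. 3.1: injected coincidences plus chance coincidences (lam*h)^2
    n_emp = M * sum(lam_c(u) * h + (lam(u) * h) ** 2 for u in taus)
    # eq. 3.2: prediction if the time-resolved compound rate were known
    n_pred_star = M * sum(((lam_c(u) + lam(u)) * h) ** 2 for u in taus)
    # eq. 3.3: prediction from the rate averaged over the window
    mean_p = sum((lam_c(u) + lam(u)) * h for u in taus) / T_w
    n_pred = M * T_w * mean_p ** 2
    return n_emp, n_pred_star, n_pred
```

For stationary rates, the sum and the squaring commute, so n_pred equals n_pred_star; for nonstationary rates, Jensen's inequality makes n_pred smaller than n_pred_star, which is precisely the error incurred by averaging over the window (studied parametrically in appendix B).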
Comparing equation 3.3 with equation 3.2, we have, using the assumption of stationarity, effectively exchanged the sum and the squaring. The difference between n^pred(t) and n^pred_*(t) determines the error made when nonstationary processes are analyzed with an averaging window of width Tw. (See appendix B for a parametric study of this deviation in terms of false positives.) However, conceptually, rate estimation and coincidence statistics can be separated and performed using different methods (see section 5).

The following measures will be analyzed and illustrated for different time courses of the background rates λ(t) and the injected rate λc(t):

- The number of coincidences expected to occur in the course of time. Both the number of occurrences assuming independence (n^pred; equation 3.3) and the empirical number of coincidences (n^emp; equation 3.1) will be evaluated.

- The joint-surprise S as a function of time, resulting from the comparison of n^pred and n^emp.

The first characterizes the underlying distribution under the null hypothesis of independence, and the second describes the deviation from independence. Data and results will be displayed as sequences of three subfigures (columns in Figures 2 and 4): at the top, empirical (solid line) and expected coincidence rates (dotted line); in the middle, the number of coincidences (empirical [solid] and expected [dotted]); at the bottom, the joint-surprise (gray), including the upper and lower significance levels for α = 0.01 (dashed). All measures are shown as functions of time over the duration of the trial.

3.2 Stationary Background Rates
3.2.1 Stationary Coincidence Rate. The simplest "experimental" situation is given if both background rates and the injected coincidence rate are stationary, that is, independent of time (see Figure 2A). As shown in section C.1, for each combination of λ and λc, there is a minimal analysis window width Ta needed to detect coincidences as significant events. If the window is chosen too narrow, the number of excess coincidences is too small to deviate significantly from the expected number. In the stationary case, detection of excess coincident events can always be ensured by increasing the analysis window, since the larger the analysis window is, the more excess coincidences are detected. The examples in Figure 2A demonstrate the dependence of the significance on the size of the analysis window (columns: Tw < Ta, Tw = Ta, Tw > Ta) for a fixed combination of λ and λc. Observe that in the left column, Tw is clearly too small to detect the injected coincidences, just on the border for detection in the middle column, and clearly large enough in the right column.

Figure 2: Stationary background rates: relevance of window size. (A) Stationary coincidence rate. Three examples of sliding window analysis for different window widths (columns from left to right, Tw = 50, 220, 400; M = 100, h = 1 ms). (Top) Coincident rate (dotted line) and compound rates (solid line). (Middle) Number of coincidences: empirical (solid line) and expected (dotted line). (Bottom) Joint-surprise (gray curve) and upper and lower significance threshold (dash-dot lines) for significance level α = 0.01. In the left column, Tw is smaller than the minimal window size Ta; the number of detected coincidences is not significant. In the middle column, Tw equals Ta and thus is at the border of significance. Only when Tw > Ta (right column) is the detected number of coincidences significant. All measures are independent of time because of the stationarity of the underlying rates. (B) Nonstationary coincidence rate. Three examples of sliding window analysis for different window widths (columns from left to right, Tw = 30, 100, 500). Graphs as in A. In the left column, Tw is narrower than the minimal window size Tmin; the number of detected coincidences is not significant. In the middle column, Tw equals Tc and, hence, S has a triangular shape. Since Tw is larger than Tmin, significance is reached for more than one window position. In the right column, Tw > Tmax; the hot region is not detected as significant.
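The minimal window Ta for the stationary case can be found numerically by scanning window widths until the expected empirical count (equation 3.1) becomes significant against the chance prediction (equation 3.3). A sketch under the assumption of a Poisson-distributed coincidence count for the null hypothesis, as used for the joint-p-value in the companion article (function names are ours; the calibration graphs in appendix C may differ in detail):

```python
import math

def poisson_tail(n, mu):
    """P(N >= n) for N ~ Poisson(mu)."""
    if n <= 0:
        return 1.0
    # 1 - CDF(n-1); direct summation is adequate for the small mu here
    return 1.0 - sum(math.exp(-mu) * mu**k / math.factorial(k)
                     for k in range(int(n)))

def minimal_window(lam, lam_c, M, alpha=0.01, h=1e-3, t_max=1001):
    """Smallest odd analysis window (in units of h) for which the
    expected coincidence count of the injected-coincidence process
    (eq. 3.1) is significant against the chance prediction (eq. 3.3).
    Rates lam, lam_c are stationary, in s^-1; returns None if no
    window up to t_max suffices.
    """
    for T_w in range(1, t_max + 1, 2):
        p = (lam + lam_c) * h                 # compound spike probability
        n_pred = M * T_w * p * p              # eq. 3.3, stationary case
        n_emp = M * T_w * (lam_c * h + (lam * h) ** 2)   # eq. 3.1
        if poisson_tail(math.ceil(n_emp), n_pred) <= alpha:
            return T_w
    return None
```

Consistent with the text, a weaker injected rate requires a wider analysis window before the excess becomes significant, while in the stationary case a sufficiently large window always exists for fixed λ, λc, and M.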
3.2.2 Nonstationary Coincidence Rate. We now discuss the case where coincidences occur clustered in time (i.e., form "hot regions") on top of a constant background rate (see Figure 2B). In that case, the detection of excess coincidences is constrained to a range between a minimal and a maximal analysis window size. Consider a situation where we place the analysis window in the middle of a hot region of width Tc and gradually increase the width of the analysis window Tw (details are in section C.2; cf. Figure 10B). As long as Tw ≤ Tc, we face the situation of stationary injected events, discussed in the preceding paragraph (cf. Figure 10A). Hence, we need a minimum width of the analysis window (Ta) to detect the cluster of injected events. Ta depends only on the combination of λ and λc; its value can be obtained from the calibration graph for the stationary situation (see Figure 10A, bottom). When the analysis window exceeds the hot region (Tw > Tc), the total number of coincidences increases further; now, however, due only to coincidences occurring by chance on the basis of the background activity. Thus, by increasing the analysis window further, excess coincidences are averaged with independent coincidences, and at some window size Tmax they will no longer be detected as significant. Tmax defines the maximal window for detecting excess coincidences as significant. If Tc < Ta, the cluster cannot be detected, even with arbitrarily large analysis windows (assuming the number of trials to be fixed). Thus, in contrast to the stationary condition, the existence of a Tmin depends on the width of the hot region Tc. A cluster of excess coincidences is not detectable if its duration Tc remains below the critical time span Ta. If the cluster is detectable (Tc ≥ Ta, Tmin = Ta), it may still go undetected if the analysis window is too small, Tw < Tmin, or too wide, Tw > Tmax.
The range of appropriate window sizes can be obtained from calibration graphs as in Figure 10B in appendix C. Examples of analysis windows that are too narrow, appropriate, and too wide are presented in Figure 2B.

Let us now discuss the case where the analysis window is shifted gradually into a hot region. Once the window overlaps the hot region, the injected coincidences contribute to the coincidence count. With increasing overlap, the number of contributed coincidences grows linearly, and the joint-surprise increases accordingly (see Figure 2B). A plateau is reached when the analysis window is completely inside the hot region (Tw < Tc) or when the hot region is completely covered by the analysis window (Tw > Tc). Further shifting eventually leads to a decrease in overlap and to a time course of the joint-surprise S that is symmetrical around the center of the hot region. The trapezoidal shape of S degenerates to a triangle in the special case Tw = Tc.

The argument just given shows that S can pass the significance threshold (i.e., the plateau surpasses threshold) only if the analysis window Tw has the appropriate size, Tmin ≤ Tw ≤ Tmax. The duration of the plateau is given by Tp = |Tc − Tw|. Table 1 summarizes the relationships between the observable variables Tw, Tp and the variables generating the trapezoid, Tw, Tc.

Table 1: Relationships of the Plateau Duration Tp, Width of the Analysis Window Tw, and Extent of the Hot Region Tc.

          | Tw < Tc        | Tw = Tc   | Tw > Tc
Tp = 0    | —              | Tc = Tw   | —
Tp < Tw   | Tc = Tw + Tp   | —         | Tc = Tw − Tp
Tp ≥ Tw   | Tc = Tw + Tp   | —         | —

Notes: Tw is compared to Tp (rows) and to Tc (columns). Table entries marked by a dash represent nonexisting (Tp, Tw, Tc) combinations. The trapezoidal shape of the joint-surprise S reaches significance if Tmin exists (Tc ≥ Ta) and Tmin ≤ Tw ≤ Tmax.

Using these relationships, one can determine the extent of the hot region Tc by measuring the size of the plateau and systematically varying the analysis window (see section C.2 for a detailed derivation). Figure 3 summarizes the possible interactions of the width of an excess interval Tc and the width of the analysis window Tw. If the significance threshold is reached at all, typically more than one window is significant. Only in the special case Tw = Tc = Ta is the maximum of S exactly at threshold level, and for a single window only. For a given combination Tc, Tw, we can compute the minimal overlap of the two windows needed to detect the injected coincidences as significant by use of Tmax, and we can construct the extent of the region Ts in which injected coincidences are marked as significant (see section C.2), that is, the time span between the two intersections of S with the significance threshold. According to the unitary event analysis, all coincidences in a significant window are marked as "special." Thus, only in the case of minimal Ts are exactly the coincidences in the region Tc marked as special. The smaller Tw, the better the extent of the marked coincidences approximates Tc.

3.3 Nonstationary Background Rates
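Read as a decision procedure, Table 1 lets one recover the extent Tc of a hot region from a measured plateau duration Tp. A small sketch (the function name is ours):

```python
def hot_region_extent(T_w, T_p):
    """Infer the extent T_c of a hot region from the analysis-window
    width T_w and the measured plateau duration T_p = |T_c - T_w|
    (cf. Table 1).  A single measurement cannot always distinguish the
    two branches, so all consistent candidates are returned; varying
    T_w and observing how T_p changes resolves the ambiguity.
    """
    if T_p == 0:
        return (T_w,)                 # triangular S: T_c = T_w
    if T_p >= T_w:
        return (T_w + T_p,)           # only T_w < T_c is consistent
    return (T_w + T_p, T_w - T_p)     # either T_w < T_c or T_w > T_c
```

For example, a plateau of 30 time steps seen with a 100-step window is consistent with a hot region of either 130 or 70 steps; repeating the analysis with a different Tw singles out the correct branch.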
3.3.1 Stationary Coincidence Rates. We consider two different cases of nonstationary background rates: a stepwise and a gradual increase of λ(t), both in combination with a stationary rate of injected coincidences λc (see Figures 4A and 4B). A stepwise increase in rate does not lead to a discontinuous change in the coincidence counts and the significance measure, due to the smoothing effect of the moving window (see Figure 4A). On a larger timescale, n^pred and n^emp increase parabolically because of their quadratic dependence on the background rate (see Figure 4A, middle graph). The difference between n^pred and n^emp is constant throughout the trial. However, due to the absolute increase in background rate, the significance decreases (see the companion article for an extended discussion of this issue). A linear increase of the background rate (see Figure 4B) gives basically the same result. In both examples, the injected coincidences (λc = 2 s⁻¹) do not reach significance (α = 0.01) at a background rate of λ = 50 s⁻¹. Above this rate, injected coincidences can be detected only with a larger analysis window Tw.

Figure 3: Detectability of short epochs of excess coincidences. The figure illustrates the influence of the choice of Tw on the detectability of a hot region Tc. λ and λc are constant, and Tc ≥ Ta (detectability in principle) is assumed. Each individual graph shows the time course of λc (solid) in its top part (width of the analysis window Tw indicated by the dashed line) and the time course of the significance measure (solid) relative to the significance threshold (dotted line) in the bottom part. The ordinates of the significance measure are individually scaled for better visibility. The first column covers the constellations where Tw < Tc, the second Tw = Tc, and the third Tw > Tc. The rows are organized by the size of Tw relative to the interval [Tmin, Tmax] for which detection is possible. The first row depicts cases where injected coincidences are not detected as significant because Tw is outside the interval [Tmin, Tmax]. The second row shows cases where injected coincidences are just at the border of detectability, because Tw equals either Tmin or Tmax. For the case of Tw = Tc and Tw equal to one of the detection boundaries, Tmin and Tmax are identical. The third row illustrates cases where Tw is inside the interval [Tmin, Tmax]. Here, coincidences are detected as significant for a range of window positions.

3.3.2 Nonstationary Coincidence Rates. Finally, we investigate the general case where the neuronal processes have time-dependent rates and the excess coincident activity occurs in a short interval, triggered by some external or internal event. When the neuronal processes are observed over repeated trials, the coincident activity appears to some degree locked to certain points in time. For the purpose of this article, the situation described above is our model for the composition of neuronal spike trains. Firing rates and coincidence rates vary independently and consistently over the time course of a trial. Regions of increased coincidence rate may be accompanied by elevated firing rates, by suppressed activity, or by no noticeable changes in firing rate. Such data are optimally suited for UEMWA. Reproducibility over trials allows for reliable estimates of firing rates and coincidence rate in relatively narrow time windows.

Consider a data set where several hot regions appear during the trial, while the firing rate is increasing with time. As in the preceding section, two types of increase (stepwise in Figure 4C and a constant slope in Figure 4D) are compared. The width of the analysis window is chosen as Tw = Tc. We again analyze the situation using the theory for the expectation values worked out in section 3.1. Figures 4C and 4D show the results of this analysis. Overall, n^pred and n^emp increase over time; in addition, n^emp exhibits strong peaks in the hot regions. S reflects the transients in the hot regions while staying at naught in between. This clearly demonstrates the rate-normalizing property of the joint-surprise measure. The triangular shape of the peaks is explained by the condition Tw = Tc (cf. Figure 2B, center column). Since λc is the same for each hot region, the peaks in n^emp relative to baseline are of equal height (see equation 3.1). With a constant number of excess coincidences and an increasing background level, the significance decreases (cf. Figure 4 in the companion article and Figure 4A here). Thus, in our example, the height of the peaks in S (see Figures 4C and 4D) decreases over time. The last hot region is just on the border of detectability. Apart from differences in the fluctuations of S, caused by the discretization inherent in the joint-surprise measure (cf. Figure 10), the two types of rate variations are practically indistinguishable at the level of S. In case Tw ≠ Tc (not shown here), the time course of S would exhibit plateau-like shapes around the hot regions. For the case of linearly increasing background rates, the plateaus would be oblique instead of flat. The slope of the plateau would then be comparable to the situation of stationary coincidence rates (cf. Figure 4B).

Before we apply UEMWA to experimental data, we will leave the theory for the expectation values and illustrate the procedure using simulated point processes with a realistic number of repetitions. Figure 5A shows simulations of two parallel processes in repeated trials. Both spike trains are simulated as independent Poisson processes (see the companion article for details). The first one has a nonstationarity in firing rate: at a certain point in time, the firing rate rises stepwise. The second process is stationary. Clusters of coincident events were injected around two points in time. In Figure 5A, all coincidences found (irrespective of their significance, termed "raw") are marked by squares. The two clusters can clearly be seen. However, as expected, coincidences also occur outside the hot regions; more of them occur in the regime where the firing rate of one of the neurons is elevated. In Figure 5B, only those coincidences are marked that occur in windows
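The simulation underlying Figure 5A can be sketched as follows, using per-bin Bernoulli approximations of the Poisson processes (the generator and its names are ours; the parameter values follow the Figure 5 caption):

```python
import random

def simulate_trials(M=100, T=1000, h=1e-3, seed=0):
    """Generate M trials of two parallel (0,1)-processes as in Figure 5A
    (a sketch, not the authors' code).  Process 1 steps from 20 s^-1 to
    60 s^-1 at t = 300; process 2 is stationary at 20 s^-1.  Coincident
    events at 2 s^-1 are injected into both processes within two hot
    regions of width 50 centered at 175 and 775 (all times in bins of h).
    """
    rng = random.Random(seed)

    def rate1(t):
        return 20.0 if t < 300 else 60.0

    def in_hot(t):
        return abs(t - 175) <= 25 or abs(t - 775) <= 25

    trials = []
    for _ in range(M):
        s1, s2 = [], []
        for t in range(T):
            x1 = 1 if rng.random() < rate1(t) * h else 0
            x2 = 1 if rng.random() < 20.0 * h else 0
            if in_hot(t) and rng.random() < 2.0 * h:
                x1 = x2 = 1            # injected coincident event
            s1.append(x1)
            s2.append(x2)
        trials.append((s1, s2))
    return trials
```

Running UEMWA on such surrogate data, with known injection times, is what allows the detection probability and false-positive behavior of the method to be calibrated.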
where S exceeds the significance threshold. The time course of the joint-surprise is shown in Figure 5C. As expected from our considerations above, the joint-surprise remains at baseline outside the hot regions. Thus, only coincidences in the hot regions and, because of the limited temporal resolution of UEMWA, in a small region around them are marked. The triangular shape of the peaks in S indicates that Tw was close to Tc. Only the top of the triangle is above threshold, meaning that Tw is not much larger than Tmin. Consequently, the region of marked coincidences in Figure 5B gives a good estimate of the width of the cluster Tc. The fluctuations of S in repetitions of the same experiment are illustrated in Figure 5D. The time course of the boundaries enclosing at least 70% of the realizations (dark gray curves) shows that the coincidences in the first hot region are detected with a probability exceeding 85% (already the lower boundary is above the significance threshold). Sensitivity is lower for the second hot region because of the higher background rate. The behavior of the lower boundaries in the regime of low firing rates (the analysis window being outside the hot regions) exemplifies the problem of detecting a lack of coincidences. The size of the analysis window and the given number of trials do not allow for the detection of lacking coincidences at a reasonable significance level, because the probability of finding no coincidences is already ≈14% (compare Aertsen & Gerstein, 1985).

4 Unitary Events in Cortical Activity
In the following, we present results from the analysis of simultaneously recorded multiple single-neuron spike trains from frontal and motor cortex in awake, behaving monkeys.

Figure 4: Nonstationary background rates (graphs as in Figure 2; in all graphs the analysis window is Tw = 50, M = 100, h = 1 ms). Stationary coincidence rates combined with (A) stepwise increasing background rates (λ = 20, 30, 40, 50, 60 s⁻¹) and (B) continuously increasing background rates (λ = 20 to 60 s⁻¹). Coincidences are injected at rate λc = 2 s⁻¹ (dotted line). Compound rates are shown as solid curves. In A, coincidence counts (middle panel) reflect rate changes, with some smoothing due to the moving window (dotted: expected; solid: empirical). Above a certain background rate, injected coincidences are masked by coincidences expected from independent rates and are no longer significant (bottom panel, gray curve). For continuously increasing background rates (B, top panel, solid line), significance is lost for firing rates that are too high, as in A (bottom panel, gray curve). Nonstationary coincidence rates combined with (C) stepwise increasing background rates (as in A) and (D) continuously increasing background rates (as in B). Coincidences are injected in three "hot regions" (width 50 ms, λc = 2 s⁻¹). Coincidence counts (middle panel) reflect increasing rates and injected coincidences in the hot regions (solid: empirical; dotted: expected). The triangular shape of the coincidence counts results from smoothing by the moving analysis window, its width being equal to the widths of the hot regions (cf. Figure 3). In the bottom panels, the joint-surprise remains at zero in regions where no coincidences are injected; significant excursions occur in the hot regions. Typically, several consecutive windows detect coincidences as significant. In the last hot region, only a single window is significant, because the size of the hot region matches the size of the analysis window, which is the minimal window for detectability at this combination of rates. Results in C and D are comparable. Dash-dotted lines indicate the significance threshold (α = 0.01) for excess (upper) and lacking (lower) coincidences.
4.1 Motor Cortical Activity. In order to investigate the possible relation between the dynamics of neuronal interactions in the motor cortex and the behavioral reaction time (RT), a task was designed in which RT can be experimentally manipulated (Riehle, Seal, Requin, Grün, & Aertsen, 1995; Riehle, Grün, Diesmann, & Aertsen, 1997). Briefly, monkeys were trained to touch a target on a video display after a preparatory period (PP) of variable duration. To start a trial, the animal had to push down a lever. The preparatory signal (PS) was given by an open circle on the video display. After a delay of variable duration, during which the animal had to continue to press the lever, the response signal (RS) was indicated by a filling circle. Four durations of the PP, lasting 600, 900, 1200, and 1500 ms, occurred with equal probability and in random order. RT is defined as the period between the occurrence of the RS and the release of the lever, whereas movement
Unitary Events: II. Nonstationary Data
95
time (MT) is dened as the period between releasing the lever and touching the screen. After training, the monkeys were prepared for multiple singleunit recording. A multielectrode microdrive (Reitbock, ¨ 1983; Mountcastle, Reitbock, ¨ Poggio, & Steinmetz, 1991) was used to insert transdurally seven independently driven microelectrodes, spaced 330 m m apart, into the primary motor cortex (MI) (for details see Riehle et al., 1995; Riehle, Grun, ¨ Aertsen, & Requin, 1996; Riehle et al., 1997). Figure 6 presents an example of modulation of coincident spiking activity during the preparation for movement. The rst observation is that the number of coincidences marked as signicant coincidences (see Figure 6C) is considerably reduced as compared to the raw coincidences (see Figure 6B). Second, unitary events show a distinct timing structure, with two phases of synchronized activity: about 100 ms after PS (lasting for about 200 ms) and after ES1 (lasting also about 200 ms). The composition of unitary events within these phases is the same: for a rst short period, neurons 2 and 3 are synchronized; then neuron 3 switches its partner and is successively synchronized with neuron 1. Taking into account the condition under which
Figure 5: Facing page. Simulated nonstationarities. (A) Spike times (dots) of two parallel processes, simulated for 1000 ms (h = 1 ms) over 100 trials (upper panel process 1, lower panel process 2, trials displayed in consecutive rows). Process 1 has a nonstationarity in firing rate at 300 ms from trial start. Firing rate increases stepwise from 20 s⁻¹ to 60 s⁻¹. Process 2 is stationary at 20 s⁻¹. Centered at 175 ms and 775 ms from trial onset, two hot regions (Tc = 50) are generated by injecting additional coincidences at rate 2 s⁻¹. All coincidences occurring in the simulation are marked by squares ("raw coincidences"). (B) Same data as in A. Here, only coincidences are marked by squares that occur in analysis windows passing the significance threshold: unitary events (analysis parameters: Tw = 50, α = 0.01). (C) Joint surprise corresponding to the data shown in A and B as a function of time (thick curve), representing at each instant in time the significance resulting from the analysis window centered around this point in time. Thin lines: α = 0.01 for excess (upper) and lacking (lower) coincidences. At around 175 ms and 775 ms, the joint-surprise function passes the significance level for excessive coincidences. (D) Variance of the joint-surprise functions estimated from 1000 repetitions of the simulation experiment shown in A–C. Significance level indicated as in C for orientation. Gray curves represent the width of the distribution as a function of time (dark gray: minimum 70%; light gray: 95% area). In the regime where both background rates are 20 Hz, the probability of finding no coincidences is ≈ 14%. Therefore, no lower boundary for the minimum 95% area region can be drawn. No coincidence count exists such that the cumulative probability of lower counts is less than 2.5%. For the 70% area region, the lower boundary is at coincidence count 1. In some time steps, the probability to obtain no coincidences exceeded 15% because of the finite number of repetitions.
[Figure 6 (facing page): dot display, raw coincidences, and unitary events for neurons 1–3, with PS at time 0 and ES1 at 600 ms.]
Figure 6: Time structure of coincident spiking activity. The three panels of dot displays show the same spike data from three simultaneously recorded neurons in the primary motor cortex of a monkey involved in a delayed-response task. (A) Spiking activity of three neurons (1, 2, 3) organized in separate displays showing 96 trials. Data are pooled from three types of trials (PP 900, 1200, and 1500 ms) and aligned on the preparatory signal (PS, vertical line at time 0). Only the first 800 ms after PS are shown; this includes the first potential end of the waiting period, ES1 ("expected signal"; vertical line at time 600 ms). In the data analyzed here, no movement instruction occurred at ES1. (B) Same spike data as in A. All "raw" coincidences are marked by squares (bin width 5 ms). (C) Same spike data as in A and B. Unitary events are marked by squares (UEMWA, window width 100 ms, α = 0.05). (Modified from Riehle et al., 1997).
these unitary events occur, one may speculate that the occurrence of unitary events can be interpreted as activation of a cell assembly that is involved with the initiation (or reinitiation) of a waiting period.

4.2 Frontal Cortical Activity. In the second experimental study we discuss, rhesus monkeys were trained to perform a "delayed localization task" with two basic paradigms (localizing and nonlocalizing; an example of the latter is shown in Figure 7). In both paradigms, the monkey receives a sequence of two stimuli (visual and auditory) out of five possible locations. After a waiting period, a "GO" signal instructs the monkey to move its arm in the direction of the stimulus relevant in the current trial. In the localizing paradigm, the relevant spatial cue was selected by the color of the GO signal. In the nonlocalizing paradigm, an indicator light between blocks of trials informed the monkey about the reinforced direction for arm movement. Thus, in the latter case, the animal had to ignore the spatial cues given before the GO signal. In the behavioral paradigm analyzed here (nonlocalizing), neither the spatial cues before the GO signal nor the GO signal itself could be used to determine the correct behavioral response (see Vaadia, Bergman, & Abeles, 1989; Vaadia, Ahissar, Bergman, & Lavner, 1991; Aertsen et al., 1991, for further details). The activity of several (up to 16) neurons from the frontal cortex was recorded simultaneously by six microelectrodes during performance of the task. In each recording session, the microelectrodes were inserted into the cortex with interelectrode distances of 300 to 600 µm. Isolation of single units was aided by six spike sorters that could isolate activity of two or three single units, based on their spike shape (Abeles & Goldstein, 1977). The spike sorting procedure introduced a dead time of 600 µs for the spike detection.
Using data from this study, we found that coincident activity in the frontal cortex can be specific to movement direction. We parsed the data of five neurons according to the movement direction and analyzed each of these subsets separately. Figure 7 shows the analysis results for two movement directions (A: to the left; B: to the front); for the three other movement directions, there was no significant activity. For each of the two movement directions, there is mainly one cluster of unitary events (besides some sparsely spread individual ones), occurring at the onset of the movement. The clusters of unitary events differ, however, in both their neuronal composition and their timing. During movement to the left, significant coincidences occur between neurons 6 and 9; for movement to the front, they occur between neurons 6 and 10. The timing of the unitary events differs when measured in absolute time after the GO signal (to the left: 355 ms; to the front: 400 ms); however, both occur shortly after LEAVE. Thus, unitary events appear to be locked better to the behavioral event (LEAVE) than to the external event (GO). The analysis of the same five neurons during the localizing task, where the color of the GO signal contained the information about the reinforced type of stimulus (data not shown), did not reveal any indications for unitary
[Figure 7 (facing page): dot displays, raw coincidences, and unitary events for neurons 6–10 under two behavioral conditions, with the GO, LEAVE, and HIT events marked.]
Figure 7: Task dependence of coincident spiking activity. The dot displays show the spiking activity of five simultaneously recorded neurons (labeled 6 to 10) from the frontal cortex of a monkey involved in a delayed localization task (28 trials). The two columns represent two different behavioral conditions: (A) movement to the left, (B) movement to the front. Organization of the columns (A, B) is the same as in Figure 5, with a bin width of 3 ms for coincidence detection, an analysis window width of 60 ms, and a 0.01 significance level for UEMWA. Data are taken from segments starting 500 ms before and ending 700 ms after the GO signal (vertical line at time 0 ms). The top row dot displays include two behavioral events, LEAVE (monkey leaves central key) and HIT (monkey hits target), marked by diamonds and triangles, respectively. Average times of behavioral events are indicated by vertical lines labeled LEAVE (A: 329 ms; B: 364 ms) and HIT (A: 557 ms; B: 562 ms).
events related to movement direction. Note that neuron 6 participates in significant coincident activity in both movement directions, however with a different coincidence partner in each. This is indicative of a common membership of neuron 6 in two different cell assemblies, one of which is activated depending on the movement direction.

5 Discussion
We developed the unitary events analysis method to detect excess coincidences in multiple single-neuron spike trains. In the companion article, we evaluated the method for the case of stationary rates and calibrated it for physiologically relevant parameters: firing rates, coincidence rates, and the number of neurons analyzed in parallel. The method was shown to be very sensitive to excess coincidences; their significance can be evaluated using the joint-surprise measure. In this article, we extended the method to incorporate nonstationary firing rates by introducing the UEMWA. This method performs the analysis for unitary events within analysis windows of fixed length, which are slid in small steps along the data in order to follow the dynamic changes of firing rates. Within each window, we assume stationarity and apply unitary event analysis as in the stationary case. The resulting time-dependent joint-surprise function provides a convenient measure for the probability that the number of coincident spiking events in a certain observation interval represents a chance event. By imposing a threshold level on the joint-surprise function, certain time segments of the data are highlighted as potentially interesting regarding the presence of excess coincident spiking events, referred to as unitary events. Their neuronal composition, as well as the moments at which they occur in time, may give us information about the underlying dynamics of assembly activation.

5.1 Appropriate Size of Analysis Window. The width of the moving window is clearly an important parameter and may be adjusted according to the data. In the calibration study described in this article (see section 3), we analyzed a model in which excess coincidences were injected into independent background activity. In a first step (see section 3.2.1), we studied the sensitivity under stationary conditions and its dependence on background and coincidence rate levels, using analytical descriptions for coincidence counts.
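The windowed computation and the injected-coincidence calibration model just described can be sketched together as follows. This is a simplified two-neuron illustration assuming Poisson null statistics as in the companion article; it is not the authors' implementation, and the function names (`joint_surprise`, `uemwa`) are our own.

```python
import math
import numpy as np

def joint_surprise(n_emp, n_pred):
    """Joint surprise S = log10((1 - Psi) / Psi), with Psi = P(N >= n_emp)
    for a Poisson count N with mean n_pred."""
    cdf = sum(math.exp(-n_pred) * n_pred**k / math.factorial(k)
              for k in range(n_emp))
    psi = min(max(1.0 - cdf, 1e-12), 1.0 - 1e-12)
    return math.log10((1.0 - psi) / psi)

def uemwa(spikes1, spikes2, Tw, alpha=0.01):
    """Slide a Tw-bin window along (M trials, T bins) binary spike arrays;
    within each window assume stationarity and compare the empirical
    coincidence count with the expectation from the pooled rates."""
    M, T = spikes1.shape
    half = Tw // 2
    S = np.full(T, np.nan)
    for t in range(half, T - half):
        w1 = spikes1[:, t - half:t + half]
        w2 = spikes2[:, t - half:t + half]
        n_pred = w1.mean() * w2.mean() * w1.shape[1] * M  # expected count
        n_emp = int((w1 * w2).sum())                      # empirical count
        S[t] = joint_surprise(n_emp, n_pred)
    return S, math.log10((1 - alpha) / alpha)             # S and threshold

# Demo: independent 20 Hz background (1 ms bins, 100 trials), coincidences
# injected at 5 per s in a hot region 150-200 ms.
rng = np.random.default_rng(0)
M, T, h = 100, 1000, 0.001
s1 = (rng.random((M, T)) < 20 * h).astype(int)
s2 = (rng.random((M, T)) < 20 * h).astype(int)
inject = (rng.random((M, T)) < 5 * h) & (np.arange(T) >= 150) & (np.arange(T) < 200)
s1 |= inject
s2 |= inject
S, thresh = uemwa(s1, s2, Tw=50)
```

In the demo, the joint-surprise curve crosses the threshold around the hot region and stays near zero elsewhere, mirroring the behavior described for Figure 5.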
For a given rate constellation, an increase of the analysis window leads to a linear increase of the coincidence count. In order to reach significance, a certain level of excess coincidences needs to be present to stand out from the chance coincidences due to background activity. This requires a minimum size of the analysis window, specific to the given background and coincidence rates. The larger the coincidence rate, the smaller the minimal window size. By contrast, the larger the background rate, the larger the minimal window size.
In a second step (see section 3.2.2), we explored the detectability of nonstationary coincidence rates. We studied the case in which excess coincidences occurred only within a restricted time interval ("hot regions"), motivated by experimental observations (Riehle et al., 1997). Such hot regions may be the result of loose time-locking of synchronous spiking to an (external) trigger event. By studying the detectability in the symmetrical case (the analysis window is centered in the hot region), we found that there is not only a minimal window size, as discussed for the stationary condition, but also a maximal window size. By increasing the analysis window (starting from a window size smaller than the width of the hot region), more and more excess coincidences are "seen," which, upon reaching the minimal window size, are detected as unitary events. If the analysis window covers exactly the hot region, all injected coincidences are detected, and maximal detectability is reached; that is, the joint surprise reaches its maximum. When further increasing the window, the number of excess coincidences no longer grows; however, the contribution of chance coincidences increases, leading to a decrease of detectability until the joint surprise finally drops below significance. Thus, if the analysis window is too large or too small, the hot region is not detected, although it would be detectable with the appropriate choice of analysis window. Using the results from the symmetrical case, we analyzed the situation where the analysis window is gradually shifted into the hot region. Once the two start to overlap, the injected coincidences contribute to the coincidence count. With increasing overlap, the joint surprise increases accordingly, until it reaches its maximum at maximal overlap. When the analysis window leaves the hot region, the overlap decreases again, and so does the joint surprise.
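The existence of both a minimal and a maximal window size can be illustrated with a small calculation. The numbers below (20 s⁻¹ background, λc = 2 s⁻¹, Tc = 50, M = 100, h = 1 ms) follow the examples used in the text; the helper names are ours, and the calculation idealizes away the small rate increase contributed by the injected spikes.

```python
import math

def p_tail(n, mu):
    """P(N >= n) for a Poisson count with mean mu."""
    cdf = sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(n))
    return max(1.0 - cdf, 1e-15)

def detected(Tw, lam=20.0, lam_c=2.0, Tc=50, M=100, h=1e-3, alpha=0.01):
    """Analysis window of Tw bins centered on a hot region of Tc bins:
    does the expected total coincidence count reach significance?"""
    n_chance = (lam * h) ** 2 * Tw * M        # chance coincidences in window
    n_inj = lam_c * h * min(Tw, Tc) * M       # injected coincidences covered
    return p_tail(round(n_chance + n_inj), n_chance) < alpha

# Scan window sizes: detection holds only between a minimal and a
# maximal window size bracketing the hot region width.
sig = [Tw for Tw in range(5, 1001, 5) if detected(Tw)]
T_min, T_max = min(sig), max(sig)
```

Very small windows see too few injected coincidences; very large windows drown them in chance coincidences, so `T_min < Tc < T_max` in this constellation.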
For symmetrical shape of the hot region (including spike densities and coincidence densities, as is assumed here), the joint-surprise is symmetrical around the center of the hot region. Injected coincidences are detected as unitary events if the size of the analysis window is between Tmin and Tmax . Only in the special case that Tw is equal to or narrower than the width of the hot region and the joint surprise just reaches threshold at its peak, unitary events are restricted to the extent of the hot region. Generally, however, the epoch over which unitary events are detected does not coincide with the extent of the hot region: it may be narrower but also wider, depending on the size of the various windows. Figure 3 summarizes the possible combinations. The time course of the joint-surprise function indicates how the width of the analysis window can be optimized. From the extent of the plateau, we can derive the width of the hot region. Table 1 summarizes the necessary relationships. By shifting the time window Tw in smaller time steps than the width of the analysis window (usually we shift by one bin), we introduce dependencies between the time windows, since the analysis is applied to partly overlapping time segments. When shifting the window in single bin steps, each single joint-event will be considered in Tw analyses, although in slightly
differing contexts. As a result, a single event may be evaluated with differing significance values in the various analyses. Assuming continuity of the processes, the significance does not change drastically from one window to the next. One possibility for dealing with these dependencies is to give "bonus points" each time a joint-event surpasses a certain significance level. Thus, an individual spike constellation would collect bonus points as the window is shifted along the data, indicating that its accumulated significance "counts." This procedure would lead to a gradual evaluation of the "unitarity" of those events. The decision for the final selection of unitary events could be based on a threshold on bonus points. For simplicity, however, we chose the following rule: we define an event as unitary once it fulfills a certain significance criterion (usually α = 0.05 or 0.01) in at least one window. In terms of bonus points, this implies a selection on the basis of a threshold set to 1. A practical evaluation of the performance of a more elaborate bonus point rule is currently under study. Nonstationary coincidence rates (e.g., a hot region) may be the result of loose locking of assembly activation to an (external) trigger event. In this situation, the optimal window for UEMWA is determined by the degree of temporal locking. In physiological data, however, several internal triggers that we do not know of may lead to hot regions of different temporal widths, which cannot optimally be captured by one sliding window size. Thus, an interesting perspective would be to develop an algorithm that dynamically adapts the width of the analysis window to the varying width of the hot regions.

5.2 An Alternative Method of Unitary Event Detection: Cluster Analysis. UEMWA deals with nonstationary firing rates by sliding an analysis
window that is narrow enough to obtain firing rates that are approximately stationary over the extent of the window for all positions along the trial. There is a second approach to obtaining segments of data with joint-stationary rates, based on cluster analysis. In the following, we discuss this option briefly, because some cortical data indeed exhibit joint rate states and the approach has interesting relationships to other methods of analyzing multiple single-neuron spiking activity (e.g., hidden Markov models (HMM); Abeles et al., 1995; Seidemann, Meilijson, Abeles, Bergman, & Vaadia, 1996; Gat, Tishby, & Abeles, 1997). The idea is to segment the data into (exclusive) joint-stationary subintervals, using a standard cluster algorithm (e.g., Hartigan, 1975). Subsequently, the data are analyzed for unitary events in the time segments defined by the joint-stationary rate states. We call this method the unitary event by cluster analysis method (UECA) (a detailed description is given in appendix D). In this method, the width of the analysis window is defined by the covariations of the firing rates of the single neurons. However, from our theoretical results about the detectability of hot regions, we know that the optimal width of the analysis window is given by the width of the hot region or, in more general terms, the time course of the coincidence rate. Thus,
a clustering approach is useful if the occurrence of excess coincidences is connected to the rate state. This, however, is not in agreement with experimental data (Riehle et al., 1997). Moreover, in UECA, the data are analyzed in exclusive regions. This implies abrupt transitions in significance from one time step to the next. Furthermore, if a segment is detected by UECA as significant, that entire segment is marked as containing unitary events, even if a hot region covers only part of the segment. In the case of UEMWA, the analysis window positions are independent of rate transitions. They cover the entire data set step by step, resulting in a measure that is at the same time more localized (only data from a connected time interval enter the analysis) and smoother (a single event is weighed in several consecutive windows) than UECA. Taken together, for the experimental data we have analyzed so far, UEMWA provides a more differentiated picture of the presence of unitary events. We cannot exclude, however, that experimental settings may arise in which UECA, with its variable window size and the property that the counting statistics are not limited by the local window size, might be a promising alternative.

5.3 False Positives. Various sources of false positives can be distinguished. Here, we discuss only those that arise in direct connection to the nonstationarity extension presented in this article. For a more general discussion of the topic of false positives in unitary event analysis, we refer to the companion article. First, there are sources of false positives specific to the moving window analysis. The significance level α, which we demand for events to be qualified as unitary, implies by definition a certain number of false positives. For example, if α is set to 0.01, we expect in 1% of the experiments a detection of significant events by chance. In the case of UEMWA, we undertake, in fact, many such experiments by analyzing step by step successive parts of a single data set.
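A rough order of magnitude for these chance detections can be obtained by counting nonoverlapping windows, as a sketch (the function name is ours):

```python
def expected_chance_detections(T_total, Tw, alpha):
    """Rough estimate of the number of windows expected to cross the
    significance threshold by chance: the number of nonoverlapping
    windows of width Tw fitting into the data, times alpha."""
    return (T_total // Tw) * alpha

# 1000 bins of data scanned with a 50-bin window at alpha = 0.01:
# 20 nonoverlapping windows, so about 0.2 chance detections expected.
fp = expected_chance_detections(1000, 50, 0.01)
```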
However, these experiments are not independent, due to the overlap of the time windows. For a rough estimate of the number of windows expected to give rise to false positives, one has to calculate the number of nonoverlapping windows that would fit within the total length of the data segment and take a fraction α of it. In our calibration experiments on simulated data, accidental crossings of the significance threshold were extremely rare: typically one window per data set at most, and mostly none. As a rule of thumb, we have developed the criterion that only those cases in which the number of windows passing the significance threshold is clearly larger than this lower bound are considered potentially interesting. A more systematic treatment of this issue is under development. Important sources of false positives are nonstationarities of different flavors that give rise to significant results, although the processes observed actually do not violate the null hypothesis of independence. The obvious source is a remaining nonstationarity of rate in the analysis window. In appendix B, we show that unitary event analysis is robust against moderate
violations of the assumption of rate stationarity and quantify the effect on the number of false positives. Specific types of nonstationarity across trials discussed in the literature are variations of "excitability" and of "latency variability" (e.g., Brody, 1999a, 1999b). Knowledge (or "educated guesses") of the type of nonstationarity in the data is important, because it may allow for a compensation of its effects in the analysis. Variation of excitability describes a nonstationarity in which the time course of the firing rate is identical in each trial, although the amplitude is modulated (e.g., Arieli, Sterkin, Grinvald, & Aertsen, 1996). Latency variability describes a nonstationarity in which shape and amplitude of the firing rate are identical in each trial, but the position in time shifts from trial to trial. In both cases, the firing-rate estimation obtained by averaging across trials, as performed in the PSTH and the unitary event analysis, leads to a value that is not representative for a single trial. In the case of excitability variations, the rate is underestimated for some trials and overestimated for others. In the case of latency variability, the rate estimate will generally present a blurred picture of the rate dynamics in an individual trial, due to the convolution with the latency distribution. Thus, "misalignment" of trials may lead to falsely detected unitary events (see the example in Figure 8A, bottom panel). A solution to this problem is to realign the data to an external or behavioral event to which the single-trial rate functions of the observed neurons have a more proper locking. In case no such events are available, one can try to find a consistent realignment directly based on the single-trial rate functions themselves (as shown in Nawrot, Rotter, Riehle, & Aertsen, 1999; Nawrot, Aertsen, & Rotter, 1999; see also Baker & Gerstein, 2000).
Figure 8 illustrates how proper realignment of trials is able to discard false positives (unitary events after RS in A). Note that in order to maintain a common reference time frame, the same realignment should be performed for all neurons under consideration. This implies, however, that when the latency variabilities of (some of) the observed neurons do not covary, such joint realignment is not possible (Nawrot, Rotter, et al., 1999). In such a case, one needs an alternative method to estimate the firing probabilities and the associated coincidence expectancy on the basis of single-trial data. Several such methods have been proposed recently, including convolution-based methods (Nawrot, Rotter, et al., 1999; Nawrot, Aertsen, & Rotter, 1999) and inverse-interval-based methods (Nawrot, Rotter, & Aertsen, 1997; Pauluis & Baker, 2000). The incorporation of these methods into the unitary event analysis is in progress. To ultimately demonstrate the consistent occurrence of unitary events, one may apply an additional test on a meta-level. This can be done by relating the unitary events to behavioral or external events or by an additional statistical test (Prut et al., 1998). An example where unitary events occurred in relation to behaviorally relevant events is shown in Figure 6. On the basis of a meta-analysis over many data sets, Riehle et al. (1997) demonstrated that unitary events occur in relation to stimulus and expected events.
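One way to realign trials on the basis of the rate functions themselves can be sketched as follows. This cross-correlation approach is a simplified stand-in for the single-trial rate-matching methods cited above, not a reimplementation of them; the function name `realign` and the smoothing parameters are our own choices.

```python
import numpy as np

def realign(trials, kernel_w=21, max_shift=60):
    """Estimate a per-trial latency by cross-correlating each smoothed
    single-trial rate with the smoothed trial-averaged rate, then shift
    every trial back to the common time frame."""
    kernel = np.ones(kernel_w) / kernel_w
    avg = np.convolve(trials.mean(axis=0), kernel, mode="same")
    lags = list(range(-max_shift, max_shift + 1))
    shifts = []
    for tr in trials:
        r = np.convolve(tr.astype(float), kernel, mode="same")
        cc = [np.dot(np.roll(r, -lag), avg) for lag in lags]
        shifts.append(lags[int(np.argmax(cc))])
    aligned = np.array([np.roll(tr, -s) for tr, s in zip(trials, shifts)])
    return aligned, np.array(shifts)

# Demo: seven "trials" whose rate profile is a Gaussian bump shifted by a
# known latency per trial; realignment recovers the latencies.
T = 1000
t = np.arange(T)
template = np.exp(-0.5 * ((t - 300) / 15.0) ** 2)
latencies = np.array([-30, -20, -10, 0, 10, 20, 30])
trials = np.array([np.roll(template, d) for d in latencies])
aligned, shifts = realign(trials)
```

As discussed in the text, this only works when the latency variabilities of the observed neurons covary, so that one common shift per trial is meaningful.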
[Figure 8 (facing page): dot displays, raw coincidences, and unitary events for neurons 4 and 7, aligned to RS (A) and to MVT (B), over 1250 ms before to 250 ms after the alignment event.]
Figure 8: False positives due to nonstationarity across trials. (A, B) Identical spiking activities of two neurons (4, 7) recorded from primary motor cortex while the monkey performed a delayed reaching task (1 direction out of 6; Grammont & Riehle, 1999). Columnar organization of the dot displays is the same as in Figure 6. In A, trials are aligned to the reaction signal (RS). In B, trials are aligned to the onset of movement (MVT). Trials are sorted according to increasing reaction time (time difference between RS and MVT) in both cases. All external and behavioral events are marked by filled circles. The events at the beginning of the data segments are the GO signals; the events late in the trials are the ends of movements. The unitary events in A about 250 ms before RS correspond to the ones in B at about 500 ms before MVT. Note, however, that the unitary events in A after the RS have vanished in B. These unitary events were due to nonstationarity across trials: an abrupt decrease in firing shortly before and locked to MVT was dispersed to a different position in each trial by the incorrect alignment to RS. This misalignment of rate functions is removed in B, thereby discarding the falsely detected unitary events.
5.4 Guidelines for Application to Experimental Data. Based on the foregoing, we suggest the following procedure for unitary event analysis. First, inspect the data critically for across-trial nonstationarities (e.g., excitability and/or latency variability) by the use of raster displays. For checking excitability variations, one may use the number of spikes per trial as a rough estimate. If necessary, eliminate the outlier trials. Latency variability has to be eliminated by realignment of the trials, either by aligning the data to another behavioral event or by estimating the instantaneous rates per single trial and realigning the data by matching the shape of the rate functions (Nawrot, Aertsen, & Rotter, 1999). Since we generally do not have prior knowledge about the time structure of the coincident events, we may get some qualitative information about their composition, their distribution in time, and whether there are obvious deviations from the firing rates by inspecting the raw coincident events. The appropriate size of the analysis window cannot be known in advance. It has to be adjusted according to two aspects: rate changes and coincidence rate changes. In order to capture rate changes such that stationarity of rate within a window can be assumed, the window has to be adjusted to the timescale of the rate dynamics. Since we have no prior knowledge about the timescales of the coincidence rate dynamics, we have to scan the data with different window sizes. If we have an indication of a specific time structure from the raw data, we may start with a size in that range. In general, however, we found it best to start with a narrow analysis window and gradually increase its size. If the underlying time structure of coincident firing is clustered, the unitary events will appear at a certain minimum size of the analysis window and will be detected over a certain range of window widths beyond it. If the detected time structure is stable but only broadens due to the increased size of the analysis window, a hot region is detected. For an indication of the best analysis window size, the shape of the joint-surprise function may be used. Plateaus indicate a window size that is either too small or too large. A peaky shape indicates that the optimal window for the hot region size has been found. False positives may be identified by evaluating whether the structure is stable for different alignments to external events.
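The spikes-per-trial excitability check can be sketched as follows; the z-score criterion is our own illustrative choice, not part of the published procedure.

```python
import numpy as np

def excitability_outliers(trials, z_thresh=3.0):
    """Flag trials whose total spike count deviates strongly from the
    across-trial mean (a rough screen for excitability variations)."""
    counts = trials.sum(axis=1)
    z = (counts - counts.mean()) / (counts.std() + 1e-12)
    return np.where(np.abs(z) > z_thresh)[0]

# Demo: 50 trials at 20 Hz (1 ms bins), one hyperexcitable trial at 200 Hz.
rng = np.random.default_rng(1)
trials = (rng.random((50, 1000)) < 0.02).astype(int)
trials[7] = (rng.random(1000) < 0.2).astype(int)
out = excitability_outliers(trials)
```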
We are aware that this is time-consuming. In further work, we intend to develop additional statistical tests on a meta-level to filter out false positives automatically. However, to improve the analysis method further, we need to gain experience from applications of the unitary events analysis method to physiological data. In addition, we have found it useful to compare the outcome of our analyses to the results of other techniques (e.g., cross-correlation or the joint peristimulus time histogram, JPSTH). An additional parameter is the coincidence width. In order to determine the coincidence width of the experimental data, additional manipulations may be performed, for example, changing the bin width before analyzing the data for unitary events (see the companion article) or applying the multiple-shift method (Grün et al., 1999; Riehle, Grammont, Diesmann, & Grün, 2000).

5.5 Unitary Events in Cortical Activity. We have analyzed simultaneously recorded multiple single-neuron spike trains from frontal and motor cortices in awake, behaving monkeys for the occurrence of significant coincident events. Our findings indicate that highly precise (1–5 ms) unitary events occur in these data. Their joint surprise may be well above 2; that is, they are statistically highly significant. The composition and frequency of such patterns appear to be related to behavioral parameters. These results, together with results from other multi-neuron studies, are interpreted as expressions of cell assembly activity (for reviews, see Gerstein, Bedenbaugh, & Aertsen, 1989; Singer et al., 1997; Singer, 1999). The composition of unitary events is interpreted to reflect common membership in a cell assembly. If the composition of the unitary events changes depending on the stimulus or the behavioral conditions, a different group and, hence, a different cell assembly may be activated in relation to the external event. In the example in Figure 7, we showed that the occurrence of unitary events was locked to the onset of the movement, but their composition was different for the different movement directions. Similar findings were made in visual and frontal areas in cross-correlation studies, where neurons were found to be correlated for one stimulus or behavioral condition but not for another (e.g., Vaadia et al., 1991; Aertsen et al., 1991; Vaadia et al., 1995; Freiwald, Kreiter, & Singer, 1995; Kreiter & Singer, 1996; Fries, Roelfsema, Engel, König, & Singer, 1997; Castelo-Branco, Goebel, Neuenschwander, & Singer, 2000). Moreover, JPSTH results from frontal cortex show that the correlation between two neurons may dynamically change depending on the behavioral context, suggesting that the neurons rapidly change their associations into different functional groups (Aertsen et al., 1991; Vaadia et al., 1995).
We demonstrated that unitary events show a marked increase in temporal structure as compared to the spiking events of the participating neurons, including cases where the single neurons did not show any discernible response, as judged from the absence of systematic modulations of their firing rates. This may indicate that neuronal computation uses different kinds of timescales, usually referred to as rate coding and temporal coding (see also Abeles, 1982b; Neven & Aertsen, 1992; Koenig, Engel, & Singer, 1996; Riehle et al., 1997; Shadlen & Newsome, 1998). We have begun to investigate if and how these concepts are implemented in the cortical network. It would seem possible that both coding mechanisms (rate coding and precise time coding) are used in the brain and that, depending on the cortical area, either one might dominate (see Vaadia & Aertsen, 1992, for a detailed discussion of this issue). We hope that the unitary event method presented here may help to decipher the mechanisms of neuronal information processing in the brain.

Appendix A: Notation
h: time resolution of the data, [h] = unit of time
Tw: size of the moving window, integer valued, odd
M: number of trials
Unitary Events: II. Nonstationary Data
λ: background firing rate, [λ] = 1/unit of time
λc: coincidence rate, [λc] = 1/unit of time
n^pred: expected coincidence count
n^emp: empirical coincidence count
S: joint-surprise
α: significance level
nα: number of coincidences needed for significance
Tc: duration of the "hot region"
Ta: minimum analysis window width required by nα
Tmin: minimum analysis window width to reach significance
Tmax: maximum analysis window width to still reach significance
Tp: duration of the plateau of the joint-surprise function
Ts: duration of the interval showing unitary events
fmin: minimal overlap of analysis window Tw and hot region for detection of unitary events in the sliding situation
All capital "T" variables specify time intervals in units of h, [T] = 1.

Appendix B: False Positives Induced by Nonstationarity
In order to estimate the error made by assuming stationarity of rates within the analysis window, we consider the worst-case scenario (with respect to mean count). It can be shown that the maximal error results if the rate changes in stepwise fashion (in comparison to, say, a linear change of rate), if both neurons change their rates in parallel, and if the analysis window is centered at the time of the rate change. Thus, in the following we consider the situation where two neurons change their rate at the same time from the same rate level λ1 to λ2. The analysis window is centered at the rate change; that is, the window Tw is divided into two regions of duration Tw/2, where the rates are stationary at levels λ1 and λ2, respectively. The equations used in the following correspond to equations 3.3 and 3.2, however here with λc = 0 s⁻¹, and adjusted to the special case sketched here. The mean rate within Tw is

    λ̄ = (λ1 · Tw/2 + λ2 · Tw/2) / Tw.    (B.1)
Using the averaged rate, the number of expected coincidences is (cf. equation 3.3)

    ñ = (λ̄ · h)² · Tw · M.    (B.2)
The exact number of expected coincidences is obtained by calculating the number of coincidences in each time segment separately and taking the sum of the two:

    n* = ((λ1 · h)² + (λ2 · h)²) · (Tw/2) · M,    (B.3)
S. Grün, M. Diesmann, and A. Aertsen
expressing a special case of equation 3.2. The error in the number of coincidences is

    ñ − n* = −(1/4) · Tw · M · h² · (λ1 − λ2)².    (B.4)
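As a side check (not part of the original paper), equations B.1 to B.4 and B.7 can be verified numerically in a few lines of Python; the rate values below are purely illustrative:

```python
# Numerical check of the worst-case error (eqs. B.1-B.4, B.7).
# Parameter values are illustrative assumptions, not taken from the paper.
h = 0.001            # time resolution [s]
Tw = 500             # analysis window width, in units of h
M = 100              # number of trials
l1, l2 = 10.0, 30.0  # rate levels before/after the step [1/s]

l_bar = (l1 * Tw / 2 + l2 * Tw / 2) / Tw               # eq. B.1: mean rate
n_tilde = (l_bar * h) ** 2 * Tw * M                    # eq. B.2: approximated count
n_star = ((l1 * h) ** 2 + (l2 * h) ** 2) * Tw / 2 * M  # eq. B.3: exact count

# eq. B.4: the approximation always underestimates the exact count,
# so significance is overestimated
error = n_tilde - n_star
assert abs(error - (-0.25 * Tw * M * h ** 2 * (l1 - l2) ** 2)) < 1e-12
assert error < 0

# eq. B.7: the ratio n*/n_tilde depends only on the rate ratio r
r = l2 / l1
assert abs(n_star - 2 * (1 + r ** 2) / (1 + r) ** 2 * n_tilde) < 1e-12
```

Note that the error grows quadratically with the rate step λ1 − λ2, in line with the statement following equation B.4.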
The larger the difference between the rate levels, the larger the error. The exact number of expected coincidences is larger than the approximated number ñ. Thus, if we estimate the number of coincidences by averaging the rates within the analysis window, we tend to overestimate the significance. The relevant parameter for the significance test, the number of expected coincidences, defines the mean of the assumed Poisson distribution. This number, in turn, is determined by the rates, the size of the analysis window, and the number of trials. For simplicity, we assume a constant Tw. Different rate combinations of λ1 and λ2 can lead to the same number of expected coincidences. Thus, in order to study false positives, we derive the dependence of the exact number of coincidences n* as a function of the approximated number of coincidences ñ by eliminating the rates. We define

    r = λ2 / λ1    (B.5)
as the ratio of the rates. n* = n*,1 + n*,2 is the total number of expected coincidences, with

    n*,i = (1/2) · Tw · M · (λi)² · h²    (B.6)
the expected number of coincidences in each segment. Using equations B.5 and B.6, we obtain

    n* = (2 · (1 + r²) / (1 + r)²) · ñ.    (B.7)
The slope of n*(ñ) is always ≥ 1; that is, for all values of r, n* / ñ ≥ 1. In order to calculate the critical rate relation that leads to a significant outcome in expectation, we systematically vary n* by varying r and compute the minimal number of coincidences needed for significance ñα, given a significance level α = 0.01. The intersection of n*(r, ñ) and ñα(r, ñ) gives the critical rate relation r. The larger r is, the steeper the slope of n*(r, ñ) and thus the smaller the minimal ñ for which false positives are obtained. Interestingly, the mapping of (λ1, λ2) to (ñ, r) is invertible. The critical rate relation and the corresponding rates are shown as functions of ñ in Figure 9A. Up to now, we have considered only the case where n* is above ñα, corresponding to a significant outcome in expectation. Now we are interested in the effective significance level or, equivalently, the percentage of false
Figure 9: False positives induced by a stepwise rate change. As the worst-case scenario, two neurons are considered that change their rates in parallel stepwise from rate level λ1 to λ2, and the analysis window is centered at the time of the rate change. (A) Critical rate relation r = λ2/λ1 (top) that leads to false positives at a given number of coincidences ñ. The corresponding rate levels are shown in the bottom panel. (B) Percentage of false positives. For each parameter constellation (ñ, r), the contour plot shows the percentage of false positives for a significance level α = 0.01: contour lines at 1% (dashed) and 5%, 10%, . . . , 100% (solid). Gray scale indicates the percentage of false positives. In the stationary situation (r = 1), the percentage of false positives equals α.
positives at a given parameter constellation (ñ, r). Therefore, we first calculate for each ñ the minimal number of coincidences ñα at significance level α = 0.01, assuming a Poisson distribution with mean ñ. Then we determine the significance level for ñα, assuming now the exact mean number of coincidences n* and the corresponding distribution. Note that the distribution of coincidence counts is the convolution of the distributions for the two segments. Figure 9B illustrates that the larger ñ is, the lower the rate ratio r that can be tolerated.

Appendix C: Size of the Analysis Window

C.1 Stationary Coincidence Rate. The need for a minimal window size in order to be able to detect excess coincidences will be derived for the case when both background rates and the injected coincidence rate are stationary. The minimal number of coincidences (nα) that just fulfills the condition S(nα, n^pred) = Sα depends nonlinearly on the number of occurrences expected at chance level (see Figure 4 in the companion article); for higher n^pred, disproportionately more coincidences nc are needed to reach significance. Moreover, the minimum number of coincidences to reach threshold nα can take only discrete values. This induces discrete jumps in nα and in the
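The false-positive computation just described can be sketched in Python (this is an illustrative reimplementation, not the authors' code; the parameter values are assumptions). Since the sum of two independent Poisson counts is again Poisson, the exact distribution has mean n*:

```python
import math

def poisson_sf(n, mean):
    """Survival function P(N >= n) for N ~ Poisson(mean), via the pmf."""
    p, pmf = 0.0, math.exp(-mean)
    for k in range(n):
        p += pmf
        pmf *= mean / (k + 1)
    return 1.0 - p

def min_sig_count(mean, alpha=0.01):
    """Smallest n with P(N >= n | mean) <= alpha (n_alpha in the text)."""
    n = 0
    while poisson_sf(n, mean) > alpha:
        n += 1
    return n

# Worst-case step: both neurons jump in parallel from l1 to l2
# (illustrative values).
h, Tw, M = 0.001, 500, 100
l1, l2 = 10.0, 30.0
l_bar = 0.5 * (l1 + l2)
n_tilde = (l_bar * h) ** 2 * Tw * M                    # assumed-stationary mean
n_star = ((l1 * h) ** 2 + (l2 * h) ** 2) * Tw / 2 * M  # exact mean

n_alpha = min_sig_count(n_tilde, alpha=0.01)
false_pos = poisson_sf(n_alpha, n_star)  # effective significance level
```

For any rate step (r ≠ 1), `false_pos` exceeds the nominal level α = 0.01, which is exactly the overestimation of significance described above.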
[Figure 10 appears here: plots of nα and n^emp versus n^pred, and of the joint-surprise S versus Tw; see the caption below.]
joint-surprise function (see Figure 10). In the stationary case, equations 3.1 and 3.3 reduce to

    n^emp = [λc h + (λ h)²] · M Tw
    n^pred = [(λc + λ) h]² · M Tw.

n^emp and n^pred are both linearly dependent on Tw. As a result, we can express n^emp as a linear function of n^pred:

    n^emp(n^pred) = (((λh)² + λc h) / (λh + λc h)²) · n^pred    (C.1)
                  ≈ (1 + λc h / (λh)²) · n^pred,                (C.2)

where in the latter expression we have neglected second-order rate terms involving λc. This linear function has a single intersection with the significance threshold nα (assuming nα to be a smooth function), yielding the minimal number of coincidences considered to be significant. nα, however, is a function of the width of the analysis window. Hence, for each combination of λ and λc, there is a minimal analysis window width Ta needed to detect coincidences as significant events. In the situation described here, with coincidences injected over an arbitrarily long time interval, the minimal analysis window width can always be realized. The pair (λ, λc) determines the value of Ta (here: Ta ≈ 220 ms), while the amount of data available determines whether an analysis window of this width can be realized. When Ta can indeed be realized, we set Tmin = Ta; in section C.2 we will see that this is not always the case in the nonstationary situation.
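The minimal window width Ta can be found numerically by scanning Tw until the expected empirical count first reaches the Poisson significance threshold. The following Python sketch (an illustrative reimplementation, not the authors' code) uses the parameters of Figure 10A:

```python
import math

def poisson_sf(n, mean):
    """Survival function P(N >= n) for N ~ Poisson(mean)."""
    p, pmf = 0.0, math.exp(-mean)
    for k in range(n):
        p += pmf
        pmf *= mean / (k + 1)
    return 1.0 - p

def n_alpha(mean, alpha=0.01):
    """Minimal count needed for significance at level alpha."""
    n = 0
    while poisson_sf(n, mean) > alpha:
        n += 1
    return n

# Parameters as in Figure 10A: h in seconds, Tw in units of h (here: ms)
h, M = 0.001, 100
l, lc = 20.0, 0.4   # background rate and injected coincidence rate [1/s]

Ta = None
for Tw in range(1, 2001):
    n_pred = ((l + lc) * h) ** 2 * M * Tw     # expected under independence
    n_emp = (lc * h + (l * h) ** 2) * M * Tw  # expected with injected coincidences
    if n_emp >= n_alpha(n_pred):
        Ta = Tw                               # minimal detectable window width
        break
```

With these parameters, the crossing point lands in the neighborhood of the 220 ms quoted in the text (the exact value depends on the discrete jumps of nα).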
As is clear from C.2, the function n^emp(n^pred) is steeper if the rate of the injected coincident events is higher, and hence crosses the curve of nα at a lower value of n^emp and, consequently, of Ta. This behavior is summarized in the bottom graph of Figure 10A. Thus, in the stationary case, detection of excess coincident events can be ensured by increasing the analysis window. In addition, enlarging the window decreases possible effects of the stepwise behavior of the joint-surprise.

C.2 Nonstationary Coincidence Rate. We now derive the conditions for the appropriate window size in the case when coincidences occur clustered in time on top of a constant background rate. Consider first a situation where

Figure 10: Facing page. Minimal and maximal window size. (A) Minimal window size. Existence and parameter dependence of the minimal analysis window size Ta needed to detect injected coincidences as significant. (Upper graph) Independent of window size, the number of coincidences minimally needed (nα) to reach the significance criterion is a function of the significance level α and the expected number of coincidences n^pred (gray curve, here for α = 0.01). With increasing n^pred, nα increasingly deviates from the diagonal. The solid line is the number of coincidences n^emp for n^pred coincidences due to the rate of the independent processes (λ = 20 s⁻¹) and additional injected coincidences (λc = 0.4 s⁻¹). At a certain n^pred (here n_min ≈ 9), n^emp intersects nα from below. The minimal size of the analysis window is defined by the requirement that the expected number of coincidences is at least n_min. (Lower graph) Assuming stationary rates, n^pred is proportional to the size of the analysis window Tw. Thus, by proper scaling, the abscissa can as well be expressed in Tw. The dashed-dotted line connecting the two graphs indicates the minimal window size Tmin = Ta for the example pair (λ, λc). The contour plot (contour lines shown for S = 1, 2, . . . , 10) shows the dependence of the joint-surprise S on Tw and λc for the fixed example value of λ. For λc = 0.4 s⁻¹, a joint-surprise value of 2, corresponding to α = 0.01, is reached at a window size of about 220 ms. The sensitivity of the method decreases rapidly when windows narrower than 100 ms are used. For windows wider than 220 ms, sensitivity increases only slowly. (B) Maximal window size. Same analysis as in A for a nonstationary coincidence rate. Coincidences are injected at rate λc = 0.9 s⁻¹ in a "hot region" of duration Tc = 100, centered on t = 500, M = 100, h = 1 ms (background rate λ = 20 s⁻¹). Detectability is shown as a function of the width of the analysis window, centered in the hot region. (Upper graph) The dashed line is the diagonal, where as many coincidences occur as expected for independent rates; the gray curve describes the minimal number of coincidences nα needed to fulfill the significance criterion. For analysis windows below Tc, the number of coincidences increases faster than the minimal number and surpasses the minimal number at Tmin. For Tw wider than Tc, the number of coincidences increases with unit slope, slower than nα. There is a second intersection at Tmax, above which the number of coincidences is no longer significant. The lower graph illustrates the dependence of the joint-surprise on Tw.
the analysis window is placed in the middle of a hot region of width Tc, and the width of the analysis window Tw is gradually increased (see Figure 10B). As long as Tw ≤ Tc, we face the situation of stationary injected events, discussed in the preceding section (cf. Figure 10A). Hence, we need a minimum width of the analysis window (Ta) to detect the cluster of injected events. Ta depends only on the combination of λ and λc; its value can be obtained from the calibration graph for the stationary situation (see Figure 10A, bottom). When the analysis window exceeds the hot region (Tw > Tc), the total number of coincidences increases further. However, the slope is now reduced to (λh)², since the contribution of the injected coincidences remains constant (M Tc · λc h). If n^emp(Tw) increases faster than nα(Tw), n^emp can intersect nα from below, yielding the minimal window size Ta (compare to section C.1). Since the width of the hot region defines the window size from which on n^emp(Tw) shows a reduced slope, a cluster can be detected only if Tc ≥ Ta, with Tmin equal to Ta. For Tc < Ta, n^emp(Tw) bends before reaching nα. As a result, Tmin does not exist. If n^emp(Tw) increases faster than nα(Tw), there is always a Ta, but in contrast to the stationary situation, the existence of Tmin depends on the width of the hot region Tc. When Tc < Ta, the cluster cannot be detected, even with arbitrarily large analysis windows. The reason is that at Tw = Tc, we have n^emp(Tw) < nα(Tw), and for Tw > Tc, the slope ṅ^emp(Tw) < ṅα(Tw). As a result, n^emp(Tw) remains below nα(Tw). This argument also implies that when the cluster can be detected (n^emp(Tw) ≥ nα(Tw) for Ta ≤ Tw ≤ Tc), a second intersection of n^emp(Tw) and nα(Tw) must exist for some Tw > Tc. Hence, in that case, there is a Tmax ≥ Tc at which n^emp(Tw) intersects nα(Tw) from above. Two conclusions can be drawn from this.
First, a cluster of excess coincidences is not detectable if its duration Tc remains below the critical time span Ta. Second, even if the cluster is detectable (Tc ≥ Ta), it may still go undetected if the analysis window is either too small (Tw < Ta) or too wide (Tw > Tmax). The range of appropriate window sizes can be obtained from calibration graphs as in Figure 10B. When the analysis window is shifted gradually across the hot region, the time course of the joint-surprise S appears symmetrical around the center of the hot region. S has a trapezoidal shape in case of Tw ≠ Tc, which degenerates to a triangle in the special case Tw = Tc. The duration of the plateau as a function of the parameter pair Tc, Tw is summarized in Table 1. Note that in two of the three possible outcomes regarding the observables Tp and Tw (Tp = 0 and Tp ≥ Tw), we have a unique expression for Tc (in the remaining case Tp < Tw, two possibilities exist for Tc). This uniqueness can be exploited to determine the extent of the hot region from the size of the plateau:

    Tc = { Tw         for Tp = 0
         { Tw + Tp    for Tp ≥ Tw.    (C.3)
Because we are free in choosing the size of the analysis window Tw, the requirement of equation C.3 can always be met. However, even in the case where the relationship is not unique (Tp < Tw), a small variation of Tw immediately allows differentiating between the two possible values of Tc:

    Tc = { Tw         for Tp = 0
         { Tw + Tp    for Tp ≥ Tw
         { Tw + Tp    for Tp < Tw and Ṫp(Tw) < 0
         { Tw − Tp    for Tp < Tw and Ṫp(Tw) > 0.    (C.4)
From the shape of the significance curve and the relationships worked out in equations C.3 and C.4, we can estimate Tc. If the significance threshold is reached at all, typically more than one window is significant. Only in the special case of Tw = Tc = Ta is the maximum of S exactly at threshold level, and for a single window only. For a given combination Tc, Tw, we can compute the minimal overlap f of the two windows needed to detect the injected coincidences as significant. The overlap f describes the part of Tc "seen" by Tw, expressed as a fraction of Tw:

    T′c = { f · Tw    for f · Tw < Tc
          { Tc        otherwise.    (C.5)
We can now use Tmax to compute the minimal f. To this end, we reverse the question that originally led to the definition of Tmax: given a fixed Tw, what is the minimal T′c (and hence, fmin) needed to detect the injected coincidences as significant? Formally, this fmin can be computed as follows:

    Tmax(T′c,min) = Tw                       (C.6)
    Tmax(fmin · Tw) = Tw                     (C.7)
    fmin(Tw) = (1 / Tw) · Tmax⁻¹(Tw).        (C.8)
Having found the minimal overlap fmin, we can construct the extent of the region Ts in which injected coincidences are marked as significant (the time span between the two intersections of S with the significance threshold):

    Ts = (Tw − fmin · Tw) + Tc + (Tw − fmin · Tw)    (C.9)
       = 2 · Tw + Tc − 2 · fmin · Tw                 (C.10)
       = Tc + 2 · (1 − fmin) · Tw.                   (C.11)

The minimal Ts is obtained for Tw = Ta: in that case, fmin = 1. Hence, it follows that Ts = Tc.
Figure 11: Multimodal rate amplitude histogram of a frontal cortex neuron, observed over 58 trials of 1500 ms duration. The rate was estimated by calculating a PSTH, smoothed with a window of 200 ms.

Appendix D: Detecting Unitary Events by Cluster Analysis
The approach in UECA is based on the assumption that the neurons' firing rates observed in parallel can only be in one of a finite number of joint rate "states" (Abeles et al., 1995; Seidemann et al., 1996). Using a clustering algorithm, segments of joint-stationary rates can be detected. In each of the resulting time segments, the unitary events analysis for the stationary case (see the companion article) is then performed separately. The amplitude distributions of the neurons' firing rates in many cases show indications of multimodality, suggesting that the firing rates can be in any of a finite number of states (e.g., Figure 11). Accordingly, we try to separate the combined rate activity of several simultaneously recorded neurons into joint-stationary regions, defined as the time segments in which all the firing rates of the N observed neurons are stationary in parallel. This is achieved on the basis of estimates of the instantaneous firing rates (e.g., PSTHs) of each neuron. At each instant of discretized time, we derive a joint-rate vector, its components being the rate of neuron i at time t:

    λ⃗(t) = (λ1(t), . . . , λi(t), . . . , λN(t))ᵀ,    i = 1, . . . , N.    (D.1)

These vectors are grouped in N-dimensional λ-space into clusters of similar joint-rate vectors by the k-means clustering algorithm (Hartigan, 1975). It clusters according to R centers of gravity μ⃗k. Since we usually do not know the number of underlying rate states of our data beforehand, we have
to vary the number of clusters and check the clustering result in each case. The stopping criteria for the "correct" number of clusters are given by the constraints that each potentially underlying joint state should be captured and that states should not be split into artificial ones ("overfitting"). For this purpose, we defined the following pairwise relative cluster distance:

    d_{k,l} = |μ⃗k − μ⃗l| / (|σ⃗k| + |σ⃗l|),    ∀ k, l ∈ {1, . . . , R}, k ≠ l,    (D.2)

with σ⃗k representing the vector of standard deviations in the kth cluster. The stopping criteria are fulfilled if

    d_{k,l} ≥ 1,    ∀ k, l ∈ {1, . . . , R}, k ≠ l.    (D.3)
We call the final set of clusters joint-stationary rate states. Estimates of the firing rates are obtained by identifying the membership of each joint-rate vector λ⃗ with the cluster mean it belongs to. Projecting the cluster mean back to the associated position on the time axis, we can observe the time course of cluster membership (PSTHs in Figure 12). The set of time instances identified as belonging to the same cluster defines the time segmentation of a state. Data from all segments belonging to one state are analyzed as one single stationary data set. Note that the set of time steps belonging to a single cluster is not necessarily compact and may in fact contain a number of separate time intervals over which the neurons have approximately the same rates. The performance of the clustering algorithm on a set of simulated parallel Poisson processes is illustrated in Figure 12. Two extreme cases of firing-rate variations are chosen, covering the possible features of stepwise (Figure 12A) and gradual (Figure 12B) changes. To avoid overfitting, that is, fitting of states to variations that are due only to statistical fluctuations, we smoothed the PSTHs by using a moving average (Abeles, 1982a; Kendall, 1976). The choice of the width of the smoothing window has to be a compromise: large enough to reduce the noise level and small enough not to flatten out meaningful changes of the firing rate (see also Nawrot, Aertsen, et al., 1999). Observe that the trajectories of the joint-rate vectors form clouds in rate space, well separated for rapid changes of the firing rates (see Figure 12A). Clusters can clearly be identified. By contrast, and not unexpectedly, gradual changes of firing rates (see Figure 12B) give rise to less well separated clouds or even smooth trajectories.

Acknowledgments
We thank Moshe Abeles, George Gerstein, and Günther Palm for many stimulating discussions and help in the initial phase of the project. We especially thank Moshe Abeles, Hagai Bergmann, Eilon Vaadia, and Alexa
Figure 12: Unitary events by cluster analysis. Dot displays (middle) show the spiking activity of two parallel processes, the rates of which were varied in stepwise (A) or gradual (B) fashion. (A) 100 trials on the basis of rates varying in stepwise fashion: λ1(i) = 30 s⁻¹ for i ∈ [1, 250); λ1(i) = 60 s⁻¹ for i ∈ [250, 500); λ1(i) = 40 s⁻¹ for i ∈ [500, 1000]; λ2(i) = 30 s⁻¹ for i ∈ [1, 250); λ2(i) = 50 s⁻¹ for i ∈ [250, 750); λ2(i) = 10 s⁻¹ for i ∈ [750, 1000]. (B) 100 (out of the 1000) trials in which neuron 2 was simulated with a constant rate of 30 s⁻¹, whereas the rate of neuron 1 changes linearly throughout the trial from λ1 = 20 s⁻¹ to 90 s⁻¹. The PSTHs (bottom panels) represent the time course of the instantaneous firing rates (smoothing window of 20 ms). (Top panels) Trajectories of the joint rates in rate space. Each instance of a joint-rate vector is represented by a small gray star (abscissa: rate of neuron 1; ordinate: rate of neuron 2). Members of different clusters are indicated by different gray levels. Cluster means are indicated by black crosses and standard deviations by the length of the cross lines. Rates corresponding to the cluster means are superimposed on the PSTHs in the bottom panels. Cluster means approximate the original stationary rate levels; standard deviations (dashed lines) illustrate the variability of the rates within each cluster. Stepwise changes of the cluster means indicate the time segmentation used in further analysis. In B, clustering resulted in an average width of a time segment of 75 ms.
Riehle for kindly putting their experimental data (AR: motor cortex; MA, HB, EV: frontal cortex) at our disposal and for the many exciting discussions on our results. We also thank Robert Gütig, Stefan Rotter, and Wolf Singer for their constructive comments on an earlier version of the manuscript for this article. This work was partly supported by the DFG, BMBF, HFSP, GIF, and Minerva.

References

Abeles, M. (1982a). Quantification, smoothing, and confidence limits for single-units' histograms. J. Neurosci. Meth., 5, 317–325.
Abeles, M. (1982b). Role of cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi stationary states. Proc. Nat. Acad. Sci. USA, 92, 8616–8620.
Abeles, M., & Goldstein, M. H. (1977). Multispike train analysis. Proc. IEEE, 65(5), 762–773.
Aertsen, A., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Research, 340, 341–354.
Aertsen, A., Vaadia, E., Abeles, M., Ahissar, E., Bergman, H., Karmon, B., Lavner, Y., Margalit, E., Nelken, I., & Rotter, S. (1991). Neural interactions in the frontal cortex of a behaving monkey: Signs of dependence on stimulus context and behavioral state. J. Hirnf., 32(6), 735–743.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273(5283), 1868–1871.
Baker, S. N., & Gerstein, G. L. (2000). Improvements to the sensitivity of gravitational clustering for multiple neuron recordings. Neural Comp., 12, 2597–2620.
Brody, C. D. (1999a). Correlations without synchrony. Neural Comp., 11, 1537–1551.
Brody, C. D. (1999b). Disambiguating different covariation types. Neural Comp., 11, 1527–1535.
Castelo-Branco, M., Goebel, R., Neuenschwander, S., & Singer, W. (2000). Neural synchrony correlates with surface segregation rules. Nature, 8(405), 685–689.
Freiwald, W., Kreiter, A., & Singer, W. (1995). Stimulus dependent intercolumnar synchronization of single unit responses in cat area 17. Neuroreport, 6, 2348–2352.
Fries, P., Roelfsema, P., Engel, A., König, P., & Singer, W. (1997). Synchronization of oscillatory responses in visual cortex correlates with perception in interocular rivalry. Proc. Nat. Acad. Sci. USA, 94, 12699–12704.
Gat, I., Tishby, N., & Abeles, M. (1997). Hidden Markov modelling of simultaneously recorded cells in the associative cortex of behaving monkeys. Network: Comp. Neural Sys., 8, 297–322.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Grammont, F., & Riehle, A. (1999). Precise spike synchronization in monkey motor cortex involved in preparation for movement. Exp. Brain Res., 128, 118–122.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. J. Neurosci. Meth., 94, 67–79.
Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley.
Kendall, M. (1976). Time-series (2nd ed.). London: Charles Griffin and Company.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. TINS, 19(4), 130–137.
Kreiter, A., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of awake macaque monkey. J. Neurosci., 16(7), 2381–2396.
Mountcastle, V. B., Reitböck, R. J., Poggio, G. F., & Steinmetz, M. A. (1991). Adaptation of the Reitboeck method of multiple electrode recording to the neocortex of the waking monkey. J. Neurosci. Meth., 36, 77–84.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates. J. Neurosci. Meth., 94, 81–91.
Nawrot, M., Rotter, S., & Aertsen, A. (1997). Firing rate estimation from single trial spike trains. In N. Elsner & H. Waessle (Eds.), Göttingen Neurobiology Report 1997 (Vol. 2, p. 623). Stuttgart: Thieme Verlag.
Nawrot, M., Rotter, S., Riehle, A., & Aertsen, A. (1999). Variability of neuronal activity in relation to behaviour. In N. Elsner & U. Eysel (Eds.), Proceedings of the 1st Göttingen Neurobiology Conference of the German Neuroscience Society 1999 (Vol. 1, p. 101). Stuttgart: Thieme Verlag.
Neven, H., & Aertsen, A. (1992). Rate coherence and event coherence in the visual cortex: A neuronal model of object recognition. Biol. Cybern., 67, 309–322.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Comp., 12(3), 647–669.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Hamutal, S., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Reitböck, H. J. (1983). A multi-electrode matrix for studies of temporal signal correlations within neural assemblies. In E. Basar, H. Flohr, H. Haken, & A. J. Mandell (Eds.), Synergetics of the brain (pp. 174–182). Berlin: Springer-Verlag.
Riehle, A., Grammont, F., Diesmann, M., & Grün, S. (2000). Dynamical changes and temporal precision of synchronized spiking activity in motor cortex during movement preparation. J. Physiol. (Paris), 94(5–6), 569–582.
Riehle, A., Grün, S., Aertsen, A., & Requin, J. (1996). Signatures of dynamic cell assemblies in monkey motor cortex. In C. von der Malsburg, J. Vorbrüggen, & B. Sendhoff (Eds.), Artificial Neural Networks—ICANN '96 (pp. 673–678). Berlin: Springer-Verlag.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Riehle, A., Seal, J., Requin, J., Grün, S., & Aertsen, A. (1995). Multi-electrode recording of neuronal activity in the motor cortex: Evidence for changes in
the functional coupling between neurons. In H. J. Hermann, D. E. Wolf, & E. Pöppel (Eds.), Supercomputing in brain research: From tomography to neural networks (pp. 281–288). Singapore: World Scientific.
Seidemann, E., Meilijson, I., Abeles, M., Bergman, H., & Vaadia, E. (1996). Simultaneously recorded single units in the frontal cortex go through sequences of discrete and stable states in monkeys performing a delayed localization task. J. Neurosci., 16(2), 752–768.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Singer, W. (1999). Neural synchrony: A versatile code for the definition of relations. Neuron, 24, 49–65.
Singer, W., Engel, A. K., Kreiter, A. K., Munk, M. H. J., Neuenschwander, S., & Roelfsema, P. R. (1997). Neuronal assemblies: Necessity, signature and detectability. Trends in Cognitive Sciences, 1(7), 252–261.
Vaadia, E., & Aertsen, A. (1992). Coding and computation in the cortex: Single-neuron activity and cooperative phenomena. In A. Aertsen & V. Braitenberg (Eds.), Information processing in the cortex (pp. 81–121). Berlin: Springer-Verlag.
Vaadia, E., Ahissar, E., Bergman, H., & Lavner, Y. (1991). Correlated activity of neurons: A neural code for higher brain functions? In J. Krüger (Ed.), Neuronal cooperativity (pp. 249–279). Berlin: Springer-Verlag.
Vaadia, E., Bergman, H., & Abeles, M. (1989). Neuronal activities related to higher brain functions—theoretical and experimental implications. IEEE Trans. Biomed. Eng., 36(1), 25–35.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.

Received July 20, 2000; accepted April 24, 2001.
LETTER
Communicated by George Gerstein
Statistical Significance of Coincident Spikes: Count-Based Versus Rate-Based Statistics

Robert Gütig
[email protected]
Ad Aertsen
[email protected]
Stefan Rotter
[email protected]
Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany

Inspired by different conceptualizations of temporal neural coding schemes, there has been recent interest in the search for signs of precisely synchronized neural activity in the cortex. One method developed for this task is unitary-event analysis. This method tests multiple single-neuron recordings for short epochs with significantly more coincident spikes than expected from independent neurons. We reformulated the statistical test underlying this method using a coincidence count distribution based on empirical spike counts rather than on estimated spike probabilities. In the case of two neurons, the requirement of stationary firing rates, originally imposed on both neurons, can be relaxed; only the rate of one neuron needs to be stationary, while the other may follow an arbitrary time course. By analytical calculations of the test power curves of the original and the revised method, we demonstrate that the test power can be increased by a factor of two or more in physiologically realistic regimes. In addition, we analyze the effective significance levels of both methods for neural firing rates ranging between 0.2 Hz and 30 Hz.

1 Introduction
Thanks to advances in neurophysiological recording technology, it is now feasible to experimentally test biological hypotheses about cortical information processing and neuronal cooperativity on the basis of multiple single-neuron recordings (Aertsen, Bonhoeffer, & Krüger, 1987; Nicolelis, 1998). However, due to the stochastic appearance of neural response patterns in the cortex (Palm, Aertsen, & Gerstein, 1988), the neurobiological concepts in question must be translated into precise statistical hypotheses to be verified by specifically designed statistical tests. Inspired by a number of different conceptualizations of temporal neural coding schemes, such as correlational cell assemblies (Aertsen & Gerstein, 1991; Gerstein, Bedenbaugh, & Aertsen, 1989; von der Malsburg, 1981), coherent oscillations (Singer, 1993), and precise firing patterns (Abeles, 1982, 1991), there has been particular emphasis on the search for synchronized neural activity (Abeles & Gerstein, 1988; Abeles, Bergman, Margalit, & Vaadia, 1993; Kreiter & Singer, 1996; Prut et al., 1998; Singer, 1999). One of the methods developed for this task is unitary-event analysis (Grün, 1996; Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999; Grün, Diesmann, & Aertsen, in press-a; Grün, Diesmann, & Aertsen, in press-b; Riehle, Grün, Diesmann, & Aertsen, 1997). This method searches recordings from multiple single neurons for epochs with distinctly more (near-)coincident spikes than expected from independent neurons obeying Poissonian spike statistics. The core of unitary-event analysis consists of computing the probabilities (joint-p-values) for the occurrence of a given minimum number of coincident spikes in short time segments, under the null hypothesis of independence. Segments with a joint-p-value below a fixed level of significance α are identified as significant epochs where the null hypothesis is rejected.

Here we demonstrate that by revising the original testing procedure, more specifically, by implementing a different coincidence count distribution, we can substantially increase the power of the method. Figure 1 compares the behavior of the original and the modified versions of the method on an empirically motivated, simulated data set of two neurons, stretching over epochs of correlated activity. Figure 1D depicts all epochs marked as significant by either of the two methods. The plot shows that the original version (Bin) misses epochs of synchronous neural activity that are detected by the modified version (Hyp). The primary goal of this article is to investigate and compare two central properties, the power and the effective significance level, of the statistical test underlying the original and the revised versions of unitary-event analysis.

Neural Computation 14, 121–153 (2001)
© 2001 Massachusetts Institute of Technology
Our investigation focuses on the identification of significant epochs; we will not treat statistical problems of interdependent testing, arising when
Figure 1: Comparison of the original and the modified version of the statistical test underlying unitary-event analysis, applied to simulated data. Matching the data of Riehle, Grün, Diesmann, and Aertsen (1997), we simulated 36 trials of two dependent neurons over 1300 bins (each of 5 ms duration) with the spike event probabilities p1 = 0.15 and p2 = 0.05, corresponding to mean firing rates of 30 Hz and 10 Hz. (A, B) Raster plots of neurons 1 and 2. (C) The spike event and coincidence counts for 65 nonoverlapping analysis windows of 100 ms duration (n = 36 · (100 ms / 5 ms) = 720). (D) The upper part shows the correlation of spike counts ρ of the two neurons that was controlled through the stochastic model underlying the simulation of the data (see section 2 for details). The lower part depicts the epochs that were marked as significant by the original test (upper row) and the modified test (lower row) at a significance level of α = 0.01 (dotted line in E). (E) The corresponding joint-p-values for both versions of the test, calculated in each of the analysis windows.
[Figure 1 appears here: (A, B) spike train rasters for neurons 1 and 2, (C) spike and coincidence counts, (D) significant epochs for the binomial (Bin.) and hypergeometric (Hyp.) tests, (E) joint-p-values.]
analysis windows overlap. Preliminary results have been presented in abstract form (Gütig, Rotter, Grün, & Aertsen, 2000).

2 Stochastic Model
First, we sketch the mathematical framework underlying our statistical assessment of the significance of coincident spiking of two neurons. Based on
this framework, in section 3 we summarize the statistical test underlying the original unitary-event analysis (Grün, 1996) and describe our revised version of the analysis method. One key element of our approach is to parameterize the stochastic dependence of the two neurons in terms of their spike correlation. This will be used in section 4 to calculate and compare the power of both statistical tests. The extension of this new approach to the case of three or more neurons is conceptually straightforward but numerically demanding. For clarity, most mathematical details are deferred to appendix A.

We assume the analysis to extend over a time window composed of n pairs of corresponding time bins (one for each neuron) of width Δt. Each bin can take either the value 1, denoting the observation of at least one spike in that time bin (we refer to this as a spike event), or the value 0, denoting the absence of spikes. Our treatment will refer only to joint observations of all n bins in the analysis window and disregard any information about the fine structure of the data in individual trials. Hence, an application of our results to pooled data from a multiple-trial design needs to assume stationarity across trials. Two main assumptions will be made in the following:

A1: The spike probabilities and the binwise spike correlations of the two neurons are the same for all n pairs of bins of the analysis window ("stationarity").

A2: All bins, except for those in a pair, are stochastically independent ("serial independence").

Note that assumption A1 is conceptually not essential to our framework, which is suited to treat a neural system with nonstationary spike probabilities. Doing so, however, would give rise to probability distributions with many parameters (cf. appendix C), greatly complicating the calculations. By contrast, assumption A2 clearly imposes severe restrictions on the generality of the approach.
Although it was not always explicitly stated, this assumption also underlies previous treatments of the topic (Grün, 1996; Roy, Steinmetz, & Niebur, 2000). The important task of overcoming the difficulties introduced by serially correlated spike trains is the subject of ongoing research. As shown in appendix A, assumptions A1 and A2 allow us to completely characterize the probability space describing the realization of two neural spike trains by specifying the individual spike probabilities of the two neurons, p1 and p2, and their spike correlation ρ. We will abbreviate this parameter triplet throughout by ξ := (p1, p2, ρ). The correlation ρ parameterizes the stochastic dependence of the two neurons, with the case of independently spiking neurons corresponding to ρ = 0. Note that for binary variables as used here, the notion of stochastic (in)dependence and the
Table 1: 2 × 2 Table of Counts for Analysis Windows Comprising n Bins.

                             Spike from       No Spike from
                             Neuron 1         Neuron 1
  Spike from Neuron 2        k                c2 − k               c2
  No Spike from Neuron 2     c1 − k           n − c1 − c2 + k      n − c2
                             c1               n − c1               n

Note: The counts c1 and c2 denote the number of spike events from neurons 1 and 2, respectively, and k denotes the number of coincident spike events, in one particular observation.
notion of correlation coincide. Thus, in our treatment of the test power in section 4, we will use nonvanishing values of ρ to quantify violations of the null hypothesis of independent firing. We will indicate the case of stochastic independence by writing ξ0 instead of ξ.

The central statistics of the following treatment will be the individual spike counts, C1 and C2, and the coincidence count K, all derived from an observation of the two neurons in the analysis window. Since, according to our definition of spike events, each bin can hold at most one count, the total number of spike events observed from each neuron in a given analysis window is restricted to the integers between 0 and n, and the number of coincident spike events K cannot exceed either one of the corresponding spike counts C1 and C2. As shown in Table 1, the statistics C1, C2, and K give rise to a 2 × 2 table of counts for each realization of the analysis window. We note in passing that because of assumption A1, the correlation ρ between spike events in coincident bins (see equation A.3) equals the correlation of the spike event counts of the two neurons, independently of the number of bins n.

Both tests for independence discussed in the following use the coincidence count K as their test statistic; they define statistical significance based on the number of coincident spikes observed within the analysis window. Thus, the probability distribution of this random variable, P_ξ(K = k), directly enters the computation of the joint-p-values in each of the tests and therefore affects their statistical properties. The two methods, however, differ in the amount of information they draw from the data. To make this distinction clear, we discuss here both probability distributions for the case of independent neurons, that is, P_ξ0(K = k) where ρ = 0. The general forms of these distributions for arbitrary ρ ∈ [−1, 1] and their derivations are given in appendix B.
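The parameterization ξ = (p1, p2, ρ) can be made concrete with a short simulation. The sketch below (our illustration, not code from the paper) realizes assumptions A1 and A2 by drawing n independent bin pairs whose joint spike probability follows the standard correlated-Bernoulli construction, P(1,1) = p1 p2 + ρ √(p1(1−p1) p2(1−p2)); this construction is an assumption on our part, consistent with a binwise spike correlation ρ, since the paper's appendix A is not reproduced here.

```python
import random
from math import sqrt

def pair_probs(p1, p2, rho):
    # Joint outcome probabilities for one bin pair with spike probabilities
    # p1, p2 and binwise spike correlation rho (correlated-Bernoulli
    # construction; an assumption of this sketch).
    p11 = p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    return p11, p1 - p11, p2 - p11  # P(1,1), P(1,0), P(0,1)

def simulate_window(n, p1, p2, rho, rng):
    # Draw n independent bin pairs (assumptions A1 and A2) and return
    # the spike counts c1, c2 and the coincidence count k.
    p11, p10, p01 = pair_probs(p1, p2, rho)
    c1 = c2 = k = 0
    for _ in range(n):
        u = rng.random()
        if u < p11:
            c1 += 1; c2 += 1; k += 1
        elif u < p11 + p10:
            c1 += 1
        elif u < p11 + p10 + p01:
            c2 += 1
    return c1, c2, k

rng = random.Random(1)
n = 200_000  # many bins, so the empirical correlation is a tight estimate
c1, c2, k = simulate_window(n, 0.15, 0.05, 0.1, rng)
x, y = c1 / n, c2 / n
rho_hat = (k / n - x * y) / sqrt(x * (1 - x) * y * (1 - y))
print(round(rho_hat, 3))
```

Recovering an estimate close to ρ = 0.1 from the simulated counts illustrates the remark above that, under A1, the binwise correlation equals the correlation of the spike event counts.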
We emphasize that while the distributions for ρ = 0 suffice for the definition of the statistical tests in the next section, the calculation of test power in section 4 will rely on the general expressions from appendix B.

We first consider the situation where only the (constant) spike probabilities p1 and p2 of both neurons are known. In particular, no further knowledge of the spike counts c1 and c2 is assumed. Then, based on assumptions A1 and A2, the probability distribution of K for independent neurons is given by the binomial distribution,

$$
P_{\xi_0}(K = k) = \binom{n}{k} (p_1 p_2)^k (1 - p_1 p_2)^{n-k}. \tag{2.1}
$$
Note that this distribution of coincidence counts was used in the original definition of unitary-event analysis (Grün, 1996), as well as in several recent contributions about the method (Grün et al., 1999; Roy et al., 2000). In the second case, which is conceptually different, we consider the coincidence count distribution making explicit use of the knowledge of the individual spike counts c1 and c2. In this case, we formulate the probability distribution conditional on the specific realization of the spike counts, that is, P_ξ(K = k | c1, c2). Unlike the spike probabilities, these counts are readily accessible in any empirical spike train. It can be shown (cf. appendix B) that the conditional distribution of K for independent neurons is described by the hypergeometric distribution (cf. Palm et al., 1988):
$$
P_{\xi_0}(K = k \mid C_1 = c_1, C_2 = c_2) = \frac{\binom{c_1}{k}\binom{n - c_1}{c_2 - k}}{\binom{n}{c_2}}. \tag{2.2}
$$

Note that this distribution does not refer to the spike probabilities p1 and p2 anymore. Moreover, it can easily be verified that it is symmetric with respect to c1 and c2. The general form of the conditional distribution for ρ ∈ [−1, 1] is much more complicated, though. Its explicit form is derived in appendix B.

3 Count-Based Versus Rate-Based Statistics
Typically, the test of a statistical hypothesis on the grounds of empirical data is based on a specially designed random variable, the so-called test statistic. In order to control the probability that the test fails, certain aspects of the distribution of the test statistic under the null hypothesis must be known. The probability that the null hypothesis is rejected even if it is correct (type I error, or α-error) is usually fixed at some small value (e.g., 5%). Essentially, this is achieved by adjusting the critical region of the test through the choice of an appropriate threshold of the test statistic (Mood, Graybill, & Boes, 1974). Similarly, the case where the null hypothesis is not rejected, even if it is false, is referred to as a type II error, or β-error. Given
a parametric model of deviations from the null hypothesis, which specifies the probability distribution of the test statistic under a given deviation, it is possible to calculate the β-error probability of the test. Based on this probability, one can directly assess the power of the test: the probability of correctly detecting a given violation of the null hypothesis.

Within the framework of the stochastic model defined in the previous section, including the assumptions A1 and A2, the null hypothesis underlying the detection of significant epochs comprises the following additional assumption for each analysis window:

H0: The activity of the two neurons is stochastically independent, ρ = 0.

The original version of unitary-event analysis is based on the binomial coincidence count distribution (see equation 2.1). In this approach, empirical estimates for the parameters p1 and p2 on the basis of the spike counts c1 and c2,

$$
\hat{p}_i = \frac{c_i}{n} \qquad (i = 1, 2), \tag{3.1}
$$

are used to calculate the probability that k̃ or more coincident events are observed under the given conditions. This probability, here denoted by J^bin_ξ0(k̃, c1, c2), is referred to as the joint-p-value (Grün, 1996):

$$
J^{\mathrm{bin}}_{\xi_0}(\tilde{k}, c_1, c_2) := \sum_{k=\tilde{k}}^{n} \binom{n}{k} \left(\frac{c_1 c_2}{n^2}\right)^{k} \left(1 - \frac{c_1 c_2}{n^2}\right)^{n-k}. \tag{3.2}
$$
We emphasize that this procedure uses the observed spike counts solely for the estimation of the spike probabilities. One effect of this is that the binomial distribution gives nonvanishing probabilities for the impossible outcome k > min(c1, c2). In addition, the binomial coincidence count distribution does not take into account the stochastic nature of the rate estimation procedure itself.

To make better use of the information contained in the spike counts, we propose to compute the joint-p-values from the conditional probabilities P_ξ0(k | c1, c2) instead. Given assumption H0, together with the empirically accessible spike counts c1 and c2, these probabilities are determined by the hypergeometric distribution (see equation 2.2; Palm et al., 1988; Aertsen, Gerstein, Habib, & Palm, 1989; Lehmann, 1997). Thus, by using the empirical spike counts to specify the conditional distribution of coincidence counts (see equation 2.2), we can completely eliminate the rate estimation (see equation 3.1) from the testing procedure. Accordingly, the joint-p-value J^hyp_ξ0(k̃, c1, c2) of an epoch with k̃ coincident spikes, based on the hypergeometric distribution, is given by

$$
J^{\mathrm{hyp}}_{\xi_0}(\tilde{k}, c_1, c_2) := \sum_{k=\tilde{k}}^{\min(c_1, c_2)} \frac{\binom{c_1}{k}\binom{n - c_1}{c_2 - k}}{\binom{n}{c_2}}. \tag{3.3}
$$
To illustrate the difference between the two approaches, Figure 2A shows that for large enough values of the coincidence count k, the joint-p-values according to the binomial distribution exceed the corresponding probabilities based on the hypergeometric distribution. Thus, although the two distributions have equal mean, the binomial distribution overestimates the probability for the occurrence of large coincidence counts, for certain combinations of individual spike counts. The effect is that for the parameter values in Figure 2A, a statistical test using the hypergeometric distribution would classify an observation of k̃ ≥ 12 as significant (α = 0.05; dotted line), whereas the corresponding test based on the binomial distribution would need at least k̃ = 13 coincident spikes. Figure 2B shows the corresponding differences in these critical coincidence counts for all count combinations c1, c2 ∈ {0, …, 100}. Light areas depict count combinations where the critical coincidence count for the binomial distribution exceeds the critical count of the modified test by one; for dark areas, the difference is zero. This example suggests that the use of the conditional coincidence count distribution may effectively lead to an increase in sensitivity of the test in certain parameter regimes. In fact, it is a result from mathematical statistics that the randomized version of the proposed test, called Fisher's exact test in the nonrandomized form presented here, is uniformly most powerful unbiased (Lehmann, 1997). A brief discussion of the randomized test is given in appendix E. Figure 2A also illustrates that in general, the joint-p-values of the critical coincidence counts do not equal the nominal significance threshold α. Moreover, for a given α-level, both tests will in general not operate at the same effective significance level.
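The critical counts quoted for Figure 2A can be reproduced directly from equations 3.2 and 3.3 with standard-library Python (a sketch; the function and variable names are ours):

```python
from math import comb

def joint_p_bin(k_min, c1, c2, n):
    # Rate-based joint-p-value, eq. 3.2: binomial tail with p = c1*c2/n^2.
    p = c1 * c2 / n**2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def joint_p_hyp(k_min, c1, c2, n):
    # Count-based joint-p-value, eq. 3.3: hypergeometric tail.
    return sum(comb(c1, k) * comb(n - c1, c2 - k)
               for k in range(k_min, min(c1, c2) + 1)) / comb(n, c2)

def k_crit(joint_p, c1, c2, n, alpha):
    # Smallest coincidence count whose joint-p-value falls below alpha.
    return next(k for k in range(n + 1) if joint_p(k, c1, c2, n) <= alpha)

n, c1, c2, alpha = 720, 100, 51, 0.05
print(k_crit(joint_p_hyp, c1, c2, n, alpha))  # 12
print(k_crit(joint_p_bin, c1, c2, n, alpha))  # 13
```

With n = 720, c1 = 100, and c2 = 51, the smallest significant coincidence count at α = 0.05 is 12 under the hypergeometric distribution but 13 under the binomial distribution, matching the comparison in the text.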
Figure 2: Comparison of binomial and hypergeometric distributions. (A) Joint-p-value: cumulative binomial and hypergeometric probability distributions for observing k or more coincidence counts, for n = 720, c1 = 100, and c2 = 51. (B) Difference between the critical coincidence counts k_crit of both distributions for n = 720 and α = 0.05. Light areas depict count combinations where the critical coincidence count of the binomial distribution exceeds the critical coincidence count of the hypergeometric distribution by one. For count combinations in the dark areas, the difference is zero.
[Figure 2 appears here: (A) joint-p-values of the binomial and hypergeometric distributions; (B) difference in critical counts over all count combinations.]
Next to the issue of sensitivity of the two tests that will be addressed in the following section, we note one further important implication of using the hypergeometric coincidence count distribution. As shown in appendix C, the modified version of the test does not require that both neurons have constant firing rates. Rather, it suffices if the spike probability of only one of the two neurons (either one) remains constant throughout the analysis window. Since the original method had to assume stationarity of both neural firing rates, this relaxation of the stationarity requirement A1 implies an important extension of the class of empirical data that qualify for the statistical analysis described here. Note, however, that this relaxation concerns only the inferential statistical testing of H0 as implemented by our revised method. The following investigation of the statistical properties of the method when applied to neurons that violate the hypothesis of independence still relies on stationarity in both neurons.

4 Test Power
To obtain a thorough understanding of the properties of the statistical significance tests defined above, we calculated the power function (Mood et al., 1974) of the tests for both coincidence count distributions. Specifically, this will allow us to quantify the advantage of using the hypergeometric count distribution in terms of test performance. For a given set of alternative hypotheses, the power of the statistical tests investigated here is the probability of obtaining a coincidence count that yields a significant finding when tested against the null hypothesis. Loosely speaking, the power of the test measures the probability that the test will detect a given violation of the null hypothesis. According to the stochastic model defined in section 2, we use ρ to parameterize the set of alternative hypotheses. We emphasize that the power curves we will compute only characterize the sensitivity of the methods regarding violations of the null hypothesis of independent firing (H0). Our treatment does not include violations of the other assumptions, A1 and A2. We also note that the power curves will be calculated with respect to identical nominal significance levels α. Hence, the resulting values describe the more common nonrandomized application of the methods, as discussed in recent contributions (Grün, 1996; Grün et al., 1999; Pauluis & Baker, 2000; Roy et al., 2000). The effective significance levels of the tests, however, generally differ, and, hence, differences in test power will depend on differences in effective significance levels.

We introduce the auxiliary functions k^bin_crit and k^hyp_crit:

$$
k^{\mathrm{bin}}_{\mathrm{crit}}(c_1, c_2) := \min\{k \in \mathbb{N} : J^{\mathrm{bin}}_{\xi_0}(k, c_1, c_2) \le \alpha\} \tag{4.1}
$$

$$
k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) := \min\{k \in \mathbb{N} : J^{\mathrm{hyp}}_{\xi_0}(k, c_1, c_2) \le \alpha\}. \tag{4.2}
$$
For any given combination of spike event counts c1 and c2, these k-values give the minimum number of coincident spikes that leads to a rejection of the null hypothesis at the significance level α, when tested against the null hypothesis underlying the corresponding coincidence count distribution. The probability of rejecting the null hypothesis for given counts c1 and c2 and parameters ξ is then determined by

$$
P_{\xi}(K \ge k_{\mathrm{crit}}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2), \tag{4.3}
$$
which can be straightforwardly computed from equation B.7. It is important to note that this rejection probability is a function of the individual spike counts c1 and c2. This means that it is not an intrinsic property of the test itself, but rather a property of the test in combination with a specific empirical observation. Thus, to calculate the power of a test, we need to calculate the expectation value of the rejection probability with respect to the joint count distribution (see equation B.5), which is also specified through the parameter ξ of an alternative hypothesis. Thus, for a given ξ, the power π_ξ of the test is given by the expectation value of the rejection probability for each of the two underlying coincidence count distributions:

$$
\pi^{\mathrm{bin}}_{\xi} = E_{\xi}\!\left[ P_{\xi}(K \ge k^{\mathrm{bin}}_{\mathrm{crit}}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2) \right]
= \sum_{c_1, c_2 = 0}^{n} P_{\xi}(C_1 = c_1, C_2 = c_2)\, P_{\xi}(K \ge k^{\mathrm{bin}}_{\mathrm{crit}}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2) \tag{4.4}
$$

$$
\pi^{\mathrm{hyp}}_{\xi} = E_{\xi}\!\left[ P_{\xi}(K \ge k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2) \right]
= \sum_{c_1, c_2 = 0}^{n} P_{\xi}(C_1 = c_1, C_2 = c_2)\, P_{\xi}(K \ge k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2). \tag{4.5}
$$
We refer to appendix D for an account of the numerical evaluation of these probabilities.

4.1 Difference in Test Power. In this section, we investigate the differences in power of the rate-based versus the count-based tests for significance of coincident spike activity. Specifically, we will discuss how this difference depends on the significance level α and the number of bins n in the analysis window, and how it is affected by the spike probabilities p1 and p2. Apart from comparing the performance of the two methods, our analysis is also valuable for applications to empirical data. For a given set of parameters, knowledge of the power function allows us to assess whether a planned analysis is feasible or, vice versa, it can be used to guide the design of experiments. As mentioned before, we will use the spike correlation ρ to parameterize deviations from the null hypothesis of independent firing. The value of ρ will be varied over a physiologically realistic regime, as reviewed by Abeles (1991; see also Aertsen & Gerstein, 1985).
We start the comparison by calculating the power curves of both analysis methods in a parameter regime that matches the empirical findings published by Riehle et al. (1997). Accordingly, we let the analysis window cover n = 720 bins of 5 ms each (originally collected from 36 trials) and set the spike probabilities of the two neurons to p1 = 0.15 and p2 = 0.05, corresponding to mean firing rates of 30 Hz and 10 Hz, respectively. We computed the power curves for the significance levels α = 0.01, α = 0.05 (this value was used by Riehle et al., 1997), and α = 0.1. Figure 3 shows that the tests based on the hypergeometric coincidence count distribution clearly outperform the original tests in terms of their power. Considering a spike correlation of ρ = 0.1, we see from Figure 3B that for α = 0.01, the chance of rejecting the (false) null hypothesis of independent neurons is increased by over 0.1 compared to the original method. This corresponds to a relative increase in power of about 50% (see Figure 3). From Figures 3A and 3B we also see that the difference between the tests increases when they are chosen to operate at more conservative significance levels (i.e., lower α). The inset of Figure 3A shows that for large analysis windows (here n = 720), the power decreases smoothly as α is decreased. It is interesting to note that the relative difference in test power (see Figure 3C) monotonically increases (up to 150%) as ρ approaches zero. Thus, unlike the difference in power, which follows a bell-shaped curve (see Figure 3B), the relative difference in power becomes maximal for ρ = 0, that is, for independent neurons. This means that, as already suggested by Figure 2, the original method yields fewer significant findings in a low-correlation regime. In other words, for ρ → 0, it effectively operates at a lower probability of yielding a false-positive finding than the count-based test.
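The power difference can be checked by Monte Carlo. The sketch below (our illustration) replaces the exact expectation of equations 4.4 and 4.5 with simulated analysis windows, drawing each bin pair from a correlated-Bernoulli model with joint spike probability p11 = p1 p2 + ρ √(p1(1−p1) p2(1−p2)); this pair model is our stand-in for the exact framework of appendix A.

```python
import random
from math import comb, sqrt

def joint_p_bin(k_min, c1, c2, n):
    # Rate-based joint-p-value (eq. 3.2), with an early exit once
    # further tail terms are numerically negligible.
    p = c1 * c2 / n**2
    total = 0.0
    for k in range(k_min, n + 1):
        term = comb(n, k) * p**k * (1 - p)**(n - k)
        total += term
        if k > n * p and term < 1e-12 * total:
            break
    return total

def joint_p_hyp(k_min, c1, c2, n):
    # Count-based joint-p-value (eq. 3.3).
    return sum(comb(c1, k) * comb(n - c1, c2 - k)
               for k in range(k_min, min(c1, c2) + 1)) / comb(n, c2)

def draw_window(n, p1, p2, rho, rng):
    # One analysis window of n independent bin pairs whose joint spike
    # probability encodes the correlation rho (correlated-Bernoulli model).
    p11 = p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    p10, p01 = p1 - p11, p2 - p11
    c1 = c2 = k = 0
    for _ in range(n):
        u = rng.random()
        if u < p11:
            c1 += 1; c2 += 1; k += 1
        elif u < p11 + p10:
            c1 += 1
        elif u < p11 + p10 + p01:
            c2 += 1
    return k, c1, c2

n, p1, p2, rho, alpha = 720, 0.15, 0.05, 0.1, 0.01
rng = random.Random(7)
windows = 1000
rej_bin = rej_hyp = 0
for _ in range(windows):
    k, c1, c2 = draw_window(n, p1, p2, rho, rng)
    rej_bin += joint_p_bin(k, c1, c2, n) <= alpha
    rej_hyp += joint_p_hyp(k, c1, c2, n) <= alpha
print(rej_hyp / windows, rej_bin / windows)
```

In the regime of Figure 3 (ρ = 0.1, α = 0.01), the rejection rate of the count-based test should exceed that of the rate-based test, consistent with the difference of over 0.1 reported above.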
This is important, since it implies that the neglect of count information leads to a conservative behavior of the original test as applied in the past. However, as we will see, this is not the case for small values of n ≈ 20 (i.e., for narrow analysis windows and single-trial applications) in connection with specific values of α. While both tests, as expected, gain power with increasing n (see Figure 4A), there is no qualitative change in the difference in power for larger analysis windows (n ∈ {100, …, 700}; data not shown). The count-based test outperforms the rate-based version, with differences in power qualitatively corresponding to the curves shown in Figure 3B. However, because of the overall increase in power, the peaks of the difference-in-power curves move toward smaller correlations as n grows. Therefore, the revised test is the method of choice when searching for weak correlations in larger analysis windows (n ≥ 100).

Another potentially interesting regime is given by small analysis windows (n ≈ 20), as would be needed for time-resolved single-trial analysis. Both tests suffer a considerable loss of power with decreasing n (see Figure 4A). Moreover, for small n, both methods become dominated by
[Figure 3 appears here: (A) test power curves, (B) differences in test power, (C) relative differences in test power, for the binomial and hypergeometric tests.]
Figure 3: Dependence of both tests on the nominal significance level. (A) Test power curves of the two methods for n = 720, p1 = 0.15, p2 = 0.05, and α-levels of 0.01, 0.05, and 0.1. The inset shows the test power for α ∈ {0.01, 0.011, …, 0.1} at ρ ≈ 0.096. (B) Differences in test power: the power values of the modified test minus the power values of the original version. (C) Difference in test power relative to the power of the original version of the test: the difference in power divided by the power of the original version. The analytical results shown in this figure were checked by computer simulations based on the Ran2 and the MT19937 random number generators (cf. Galassi et al., 1998).
[Figure 4 appears here: (A, B) test power as a function of the number of bins n, (C) test power as a function of the nominal significance level, for the binomial and hypergeometric tests.]
Figure 4: Dependence of both tests on analysis window size. (A, B) Test power curves for α = 0.05, p = p1 = p2 = 0.1, ρ ≈ 0.19, and n ∈ {20, 21, …, 700}. (C) Dependence of both tests on the nominal significance level for n = 20, p = p1 = p2 = 0.05, ρ ≈ 0.26, and α ∈ {0.01, 0.011, …, 0.1}. Arrows mark values of α where π^bin_ξ > π^hyp_ξ.
discreteness effects of the underlying spike statistics (see Figure 4B). Thus, the difference between the two methods becomes extremely sensitive to the nominal level of significance α, in a discontinuous way. For a small number of specific combinations of n and α (e.g., n = 20 and α = 0.049), the power of the rate-based test can even be greater than the power of the count-based test (see Figure 4C). Thus, it would not, as was the case for large n, miss significant epochs, but would instead indicate neural synchrony in cases that are not judged significant once knowledge of the individual spike counts is incorporated into the testing procedure. Because of this behavior for small n, the original method should not be used in these parameter regimes.

Turning to the dependence of both tests on varying spike probabilities p1 and p2, we first let p := p1 = p2 and computed the power curves for p ranging from 0.01 to 0.15 (again, n = 720 and α = 0.05). The results are shown in Figures 5A and 5B: as the power of both methods increases with growing spike probability p, the difference in their power also becomes larger. Thus, the advantage of the count-based method increases with higher neural firing rates. Also note that the power of both methods for low spike probabilities becomes small: the chance to detect a spike correlation of 0.1 is only around 30%. Finally, Figure 5C shows the difference in power for asymmetric spike probabilities p1 ≠ p2 with constant product p1 p2 = 0.0075. Observe that the difference between the two tests grows as the asymmetry in the spike probabilities increases. This result indicates that the common assumption of equal spike probabilities underlying other investigations of the method (Grün et al., 1999; Roy et al., 2000) may shadow nontrivial properties of the tests, which are enhanced in regimes with different spike probabilities.

5 Effective Significance Level
Next to its power, the significance level of a statistical method is another important characteristic of its performance in practical applications. In this section, we analyze and compare the significance levels of the two tests. Generally, the significance level α of a statistical test denotes the probability that it leads to a rejection of its null hypothesis even if it were correct (i.e., the probability of making a type I error). In principle, this probability is fixed before computing the test statistics. However, tests based on discrete random variables can effectively operate only at levels that correspond to p-values of actual realizations of the test statistic (cf. Figure 2; Mood et al., 1974). It is therefore necessary to differentiate between the nominal significance level (i.e., the value of α denoting the significance threshold) and the effective significance level (i.e., the largest possible p-value that still falls below the α threshold) (see also Roy et al., 2000). In principle, "randomized tests" provide the means to adjust the effective significance level of a test to its nominal level α by introducing an auxiliary random variable, not related to the data under testing. However, due to conceptual objections
[Figure 5 appears here: (A) test power curves for symmetrical spike probabilities, (B) the corresponding differences in test power, (C) differences in test power for asymmetrical spike probabilities.]
Figure 5: Dependence of both tests on spike probabilities (n D 720, a D 0.05). (A) Test power curves of the two methods for “symmetrical” spike probabilities p D p1 D p2 2 f0.01, 0.02, . . . , 0.15g. (B) Differences in test power corresponding to A. (C) Differences in test power for asymmetrical spike probabilities ( p1 , p2 ) 2 f( 0.1, 0.075) , ( 0.15, 0.05), ( 0.2, 0.0375) , ( 0.25, 0.03) , ( 0.03, 0.025) g.
Statistical Significance of Coincident Spikes
against this procedure, randomized tests are not commonly used in applied statistics (Mood et al., 1974), in spite of their attractive theoretical properties (Lehmann, 1997). A brief discussion of the randomization of the test investigated in this study is given in appendix E.

Given the critical coincidence counts k^bin_crit(c1, c2) and k^hyp_crit(c1, c2) introduced in the previous section (see equations 4.1 and 4.2), the effective significance level of the tests investigated here is determined by the probability of obtaining a coincidence count k ≥ k_crit(c1, c2), evaluated with respect to the coincidence count distribution of the corresponding test, that is, by Ψ^bin_{ξ0}(k^bin_crit(c1, c2), c1, c2) and Ψ^hyp_{ξ0}(k^hyp_crit(c1, c2), c1, c2), respectively (cf. equations 3.2 and 3.3). However, it is important to note that these effective significance levels can be calculated only if specific values of c1 and c2 are given: only then is the coincidence count distribution belonging to the corresponding test sufficiently specified. Because of this dependence on the specific realization of the counts C1 and C2, we will refer to these effective significance levels as count-dependent effective significance levels. Figure 6 shows the count-dependent effective significance levels for tests with a nominal significance level of α = 0.05, for n = 20 and n = 720. The figure clearly demonstrates that the count-dependent effective significance levels of both tests fluctuate considerably between different realizations of the spike counts. In addition, it is interesting to note the effect of the neglect of count information by the original version of the test from the comparison of Figures 6A and 6C (see the figure caption for details). We stress that the calculation of the count-dependent effective significance level does not take into account that the spike counts themselves are random variables.
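To make the count-dependent quantities concrete, the following sketch (Python; not the authors' code, and the values n = 720, c1 = 50, c2 = 40 are purely illustrative) computes the critical coincidence counts and the resulting count-dependent effective significance levels of both tests, assuming the binomial and hypergeometric null distributions summarized in equations 2.2 and 3.1 through 3.3:

```python
from math import comb

def binom_tail(k, n, p):
    # P(K >= k) under the binomial coincidence count distribution (cf. eq. 3.2)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def hyp_tail(k, n, c1, c2):
    # P(K >= k) under the hypergeometric coincidence count distribution (cf. eq. 3.3)
    return sum(comb(c1, j) * comb(n - c1, c2 - j)
               for j in range(k, min(c1, c2) + 1)) / comb(n, c2)

def crit_and_eff(tail, alpha, kmax):
    # smallest critical count whose tail p-value falls below alpha, together
    # with the resulting count-dependent effective significance level
    for k in range(kmax + 2):
        t = tail(k)
        if t <= alpha:
            return k, t
    return kmax + 1, 0.0

n, c1, c2, alpha = 720, 50, 40, 0.05
kb, eff_bin = crit_and_eff(lambda k: binom_tail(k, n, (c1 / n) * (c2 / n)), alpha, n)
kh, eff_hyp = crit_and_eff(lambda k: hyp_tail(k, n, c1, c2), alpha, min(c1, c2))
print(kb, eff_bin)  # binomial test: critical count and effective level
print(kh, eff_hyp)  # hypergeometric test: critical count and effective level
```

Both effective levels come out strictly below the nominal α, and both change whenever the realized counts (c1, c2) change; this is the fluctuation visible in Figure 6.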
Thus, it is without doubt the appropriate measure to assess the effective significance level for the inferential statistical test of stretches of data with specific spike counts. However, one should be reluctant to interpret this probability as the overall type I error probability of the method. This conceptual difference between the count-dependent effective significance level and the unconditional effective significance level (i.e., independent of the specific realization of the spike counts) gains crucial importance when interpreting the findings reported by Roy et al. (2000), although these authors do not seem to make this distinction consistently. Instead of characterizing the method by its count-dependent effective significance level, we introduce the expected effective significance level α_{ξ0}. The latter is defined as the expectation value of the count-dependent effective significance level with respect to the joint count distribution of independent neurons, characterized by ξ0. The interpretation of this measure as the expectation value of the count-dependent effective significance level rests on the additional assumption that the parameters ξ0 = (p1, p2, ρ = 0) of the joint count distribution are known. This assumption is not part of the null hypotheses of the tests investigated here. Thus, α_{ξ0} has to be interpreted as the expectation value of the count-dependent effective significance level
Figure 6: Count-dependent effective significance levels for α = 0.05. (A, C) Small analysis window, n = 20. The black regions correspond to count combinations where the count-dependent effective significance level is zero; this occurs when either one of the counts is zero or when even the maximal possible number of coincidences—k = 20 for the binomial distribution and k = min(c1, c2) for the hypergeometric distribution—does not yield a significant p-value and, hence, the probability of rejecting the null hypothesis equals zero. (B, D) Large analysis window, n = 720. (A, B) Binomial coincidence count distribution. (C, D) Hypergeometric coincidence count distribution.
of the test when applied to neurons with spike probabilities p1 and p2. By contrast, the landscapes of count-dependent effective significance levels in Figure 6 do not depend on ξ0, that is, on the values of p1 and p2. The parameter ξ0 enters the computation of the expectation value α_{ξ0} only through the joint count distribution underlying its definition. Averaging over the joint count distribution for independent neurons (cf. appendix B), the expected effective significance level α_{ξ0} is obtained in straightforward fashion:

\[
\alpha^{\mathrm{bin}}_{\xi_0}
= E_{\xi_0}\!\left[\Psi^{\mathrm{bin}}_{\xi_0}\big(k^{\mathrm{bin}}_{\mathrm{crit}}(c_1,c_2),\,c_1,\,c_2\big)\right]
= \sum_{c_1,c_2=0}^{n} P_{\xi_0}(c_1,c_2)\,\Psi^{\mathrm{bin}}_{\xi_0}\big(k^{\mathrm{bin}}_{\mathrm{crit}}(c_1,c_2),\,c_1,\,c_2\big)
\tag{5.1}
\]

\[
\alpha^{\mathrm{hyp}}_{\xi_0}
= E_{\xi_0}\!\left[\Psi^{\mathrm{hyp}}_{\xi_0}\big(k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1,c_2),\,c_1,\,c_2\big)\right]
= \sum_{c_1,c_2=0}^{n} P_{\xi_0}(c_1,c_2)\,\Psi^{\mathrm{hyp}}_{\xi_0}\big(k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1,c_2),\,c_1,\,c_2\big).
\tag{5.2}
\]
A discussion of a numerical evaluation of these expectation values is given in appendix D. Since the test based on the hypergeometric coincidence count distribution makes no reference to the spike probabilities p1 and p2, it is independent of the "actual" spike probabilities of the investigated neurons. Thus, its expected count-dependent effective significance level α^hyp_{ξ0} reflects the probability that the test will reject its correct null hypothesis when applied to data from two independent neurons with spike probabilities ξ0. Note that this is not the case for tests based on the binomial coincidence count distribution, where the testing procedure includes the estimation of the spike event probabilities p_i via p̂_i = c_i/n (see equation 3.1). In general, the estimated probabilities p̂_i will deviate from the actual parameters for most spike count combinations. Moreover, as pointed out in section 3, the binomial coincidence count distribution neglects the information contained in the actual spike counts. As a result, the count-dependent effective significance level based on these estimators will generally not correctly describe the probability of obtaining a coincidence count k ≥ k^bin_crit(c1, c2). Therefore, the number α^bin_{ξ0} is of purely theoretical interest and does not bear any operational relevance. Thus, to calculate the probability that a test based on the binomial coincidence count distribution will reject the null hypothesis when applied to independent neurons with spike probabilities p1 and p2, we need to compute the count-dependent probability of obtaining a coincidence count k ≥ k^bin_crit(c1, c2) with respect to the hypergeometric distribution. Note, however, that k^bin_crit(c1, c2) itself is calculated with respect to the binomial coincidence count distribution. Thus, again forming the expectation value with respect to the joint count distribution, we obtain

\[
\epsilon^{\mathrm{bin}}_{\xi_0}
= E_{\xi_0}\!\left[\Psi^{\mathrm{hyp}}_{\xi_0}\big(k^{\mathrm{bin}}_{\mathrm{crit}}(c_1,c_2),\,c_1,\,c_2\big)\right]
= \sum_{c_1,c_2=0}^{n} P_{\xi_0}(c_1,c_2)\,\Psi^{\mathrm{hyp}}_{\xi_0}\big(k^{\mathrm{bin}}_{\mathrm{crit}}(c_1,c_2),\,c_1,\,c_2\big),
\tag{5.3}
\]

which we will refer to as the α-error probability. It is clear from the above that for the corresponding α-error probability ε^hyp_{ξ0} of the test based on the hypergeometric count distribution, we have ε^hyp_{ξ0} = α^hyp_{ξ0}.
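The averaging in equations 5.1 through 5.3 can be illustrated with a short sketch (Python; not the authors' code, and the window size and rates are illustrative). It computes the expected effective significance level α^hyp_{ξ0} for a small window by weighting each count-dependent effective level with the joint count distribution of two independent neurons:

```python
from math import comb

def binom_pmf(c, n, p):
    return comb(n, c) * p**c * (1 - p)**(n - c)

def hyp_tail(k, n, c1, c2):
    # upper tail of the hypergeometric coincidence count distribution (eq. 2.2)
    return sum(comb(c1, j) * comb(n - c1, c2 - j)
               for j in range(k, min(c1, c2) + 1)) / comb(n, c2)

def k_crit_hyp(n, c1, c2, alpha):
    for k in range(min(c1, c2) + 2):
        if hyp_tail(k, n, c1, c2) <= alpha:
            return k
    return min(c1, c2) + 1

def expected_eff_level(n, p1, p2, alpha):
    # equation 5.2: average the count-dependent effective level over the
    # joint count distribution of two independent neurons
    total = 0.0
    for c1 in range(n + 1):
        for c2 in range(n + 1):
            kc = k_crit_hyp(n, c1, c2, alpha)
            if kc <= min(c1, c2):  # otherwise no rejection is possible
                total += (binom_pmf(c1, n, p1) * binom_pmf(c2, n, p2)
                          * hyp_tail(kc, n, c1, c2))
    return total

a_hyp = expected_eff_level(20, 0.1, 0.1, 0.05)
print(a_hyp)  # stays strictly below the nominal level 0.05
```

The α-error probability ε^bin_{ξ0} of equation 5.3 follows the same pattern, with the binomial critical count substituted for the hypergeometric one inside the averaged tail probability.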
5.1 Dependence of the Effective Significance Level on Spike Probabilities. In this section, we investigate the dependence of the α-error probabilities on the spike probabilities p1 and p2, for both types of the test. Figure 7 shows the expected effective significance levels α_{ξ0} and the α-error probability ε^bin_{ξ0} for p = p1 = p2 ranging from 0.001 to 0.15. The surface plots in Figures 7C and 7D display the dependence of these curves on the number of bins n. Overall, Figure 7 shows that the curves for the α-error probability of the modified method, α^hyp_{ξ0}, lie above the curves for the α-error probability of the original method, ε^bin_{ξ0}. Thus, especially for higher firing rates, where the difference between the curves amounts to a considerable fraction of the nominal significance level (α = 0.05), the α-error probability of the modified method lies closer to the nominal significance level. In Figures 7A and 7B, we also display the count-dependent effective significance levels that would arise from the binomial coincidence count distribution with coincidence probability p² (indicated by diamonds), as considered by Roy et al. (2000). Note that in order to obtain a continuous sample of these count-dependent effective significance levels, we did not restrict the values of p² to possible realizations of c1c2/n², but instead treated p as a continuous variable. In contrast to the view expressed by Roy et al. (2000), we emphasize that according to our formalism, the interpretation of the ensuing sawtooth function (cf. Figure 7A) has to take into account that its independent variable p does not correspond to a neural spike probability, as it does in the calculation of α_{ξ0} and ε^bin_{ξ0}. Instead, this variable technically corresponds to an estimator of the spike probability, which would be based on realizations of the random variables C1 and C2 in applications of the method to experimental data. In this context, it is important to recall that the count-dependent effective significance levels are different for different spike count combinations (cf. Figure 6). Thus, because of the stochastic nature of the spike counts, the tests will in general operate at different count-dependent effective significance levels for different realizations of the analysis window.

Figure 7: (Facing page.) Dependence of expected effective significance level and α-error probability on analysis window size, for (A) n = 720 and (B) n = 20, 100, 500, and 1000, respectively. Diamonds depict corresponding count-dependent effective significance levels (cf. Roy et al., 2000; see text for details). (C, D) The α-error probabilities for the binomial and hypergeometric coincidence count distributions for n ∈ {20, 40, ..., 1000}. The oscillatory behavior of the curves for n ≤ 100 at low values of p reflects the periodic structure of the count-dependent effective significance level landscape (cf. Figure 6), which for small p dominates the expectation values because of the increasing localization of the joint count distributions. The analytical results shown in this figure were checked by computer simulations based on the Ran2 random number generator (cf. Galassi et al., 1998).

Since this effect is due to the stochastic nature of the spike counts, it equally holds if the
probabilities p1 and p2 underlying the different realizations of the spike counts would remain perfectly constant. Therefore, contrary to the suggestion by Roy et al. (2000), the variation of count-dependent effective significance levels between different realizations of the analysis window is not an issue of neural firing rates. In fact, the count-dependent effective significance levels do not depend on the spike probabilities p1 and p2. Therefore, a change of neural firing rates would in no way influence the count-dependent effective significance levels—the shape of the sawtooth function in Figures 7A and 7B. Rather, it would lead to a change in the joint count distribution and, hence, affect the expected significance levels α_{ξ0} and the α-error probability ε^bin_{ξ0}. As shown in Figures 7A and 7B, computing these expectation values transforms the discrete sawtooth structure (diamonds) into smooth functions of p (curves). Thus, even for p as low as 0.02 (corresponding to a firing rate of 4 Hz for 5 ms bins) and n as in Figure 7, small changes in firing rate do not lead to large changes in the α-error probabilities of the two methods (as was claimed by Roy et al., 2000).

6 Discussion
We presented a modification of the statistical test underlying unitary-event analysis for the detection of neural synchrony. By incorporating the empirical spike counts into the calculation of the probability distribution of the test statistic (i.e., the coincidence count), we were able to remove the firing-rate estimation from the testing procedure. As a result, the distribution of the test statistic becomes independent of the a priori firing rates. The application of the modified method therefore avoids the problems of firing-rate estimation associated with statistical fluctuations in the spike counts. To quantify the increase in sensitivity of the new method, we calculated and compared the test power of both tests with respect to violations of the null hypothesis of independent firing for various regimes of physiological parameters (firing rates; cf. Figure 5), degree of spike correlation (cf. Figures 3 and 5), analysis parameters (size of analysis window; cf. Figure 4), and significance level (cf. Figure 3). The spike probabilities p1 and p2 were chosen such that the corresponding neural firing rates (for bin size Δt = 5 ms) lay between 2 Hz and 30 Hz. These results are of dual importance. First, they directly specify the probabilities of detecting given deviations from independent firing with the two methods. These probabilities are not only important quantities to characterize the performance of the statistical test; they are also critical in the context of experimental design: they allow one to choose appropriate values for analysis parameters, such as the size of the time window necessary to verify a theoretically predicted dependence between two neurons. Second, the power curves allow us to quantify the effect of the suggested modification of the testing procedure on the performance of the test in comparison to the original version.
This is important in order to reevaluate results obtained with the original method and to point out parameter regimes where using the modified version of the test is especially crucial. Overall, we found that for applications of the test to analysis windows comprising larger values of n (n ≥ 100), the proposed modification leads to an increase in test power of up to 0.12 (a 50% relative increase). This increase becomes especially pronounced for conservative nominal significance levels α, asymmetric firing regimes of the two neurons, high firing rates, and
moderate degrees of spike correlation. In general, for the firing rates analyzed, the peak values of the difference in test power were reached for values of the correlation between 0.05 and 0.15. This range corresponds to the empirical values of the asynchronous gain (ASG) found in the cortex (see reviews by Aertsen & Gerstein, 1985, and Abeles, 1991). For short time windows (n ≈ 20), we found that the test power of both methods falls below 0.2 (cf. Figure 4) for two neurons that operate below 30 Hz (Δt = 5 ms) with ρ ≈ 0.25, corresponding to the maximal ASG for cortical neurons reported by Abeles (1991). Thus, when applying the test to single-trial data with short analysis windows comprising only a low number of bins, one has to face substantial reductions in test power. For these applications of the test, the increase in test power due to the modification of the method is vital. In addition, it is important to realize that for low values of n and for low firing rates, the discrete nature of the underlying test statistic dominates the properties of the test. Their dependence on n and α is complex in this regime, so special care should be taken with respect to experimental design. In addition, we have calculated the expected effective significance levels of the two tests and their α-error probabilities when applied to two neurons operating at equal rates ranging from 0.2 Hz to 30 Hz (Δt = 5 ms). As discussed in detail in section 5, it is important to differentiate between the count-dependent effective significance level, which does not depend on the neuronal spike probabilities, and the expected effective significance level, which depends on the spike probabilities through the joint count distribution. Contrasting the view expressed by Roy et al. (2000), who based their analysis on the count-dependent effective significance levels, our calculations show that the α-error probabilities of both methods vary only slowly as a function of firing rate above 4 Hz.
Thus, in this regime, the probability of falsely indicating neural synchrony is only moderately sensitive to the firing-rate levels of the investigated neurons. While this result describes the behavior of the method when applied to neurons operating at certain firing-rate levels, that is, independent of any specific empirical realization of spike counts (c1, c2), the count-dependent effective significance level (i.e., the effective significance level of the test when applied to a stretch of data with a specific combination of spike counts) does show considerable fluctuations depending on the joint counts in the analysis window (cf. Figure 6). This indeed implies fluctuations of the count-dependent effective significance level with respect to different realizations of the analysis window. We emphasize that these fluctuations are not the result of changes in neural firing rates, but the consequence of stochastic fluctuations in the counts themselves. For firing rates below 4 Hz, the joint count distribution tends to concentrate on a small number of count combinations. This causes the changes of the α-error probabilities ε^bin_{ξ0} and ε^hyp_{ξ0} of the two versions of the test with
changes in neural firing rates to become more pronounced. For neurons operating at rate levels below 1 Hz, the joint count distribution becomes so narrow that the expectation value of the count-dependent effective significance level essentially behaves like the count-dependent effective significance level itself. Thus, the probability of producing a false positive when applying the method at firing rates below 1 Hz steeply decreases as the firing rate approaches lower values. As a consequence, in this firing-rate regime, the expected effective significance level of the test becomes sensitive to the firing rates of the investigated neurons. Since, by construction of the test, the effective significance level cannot surpass the nominal significance level α, the problem is not that the test could produce more false positives than expected. Rather, the decrease of the effective significance level for low firing rates implies that the test will effectively operate at a very conservative significance level. However, as can be seen from the power curves (cf. Figure 5), this decrease in effective significance level is accompanied by a decrease in test power. Thus, for very low firing rates, both versions of the test will fail to detect deviations from the null hypothesis of independent firing. Finally, we could show that the use of the hypergeometric coincidence count distribution allows us to relax the stationarity requirement on the neuronal firing rates. In contrast to the original method, which had to assume the firing rates of both neurons to remain stationary over all bins of the analysis window, the modified test requires only one of the two neurons to have a stationary firing rate, while the other can follow an arbitrary time course. This generalization implies an important increase in the applicability of the method to empirical data.
We are currently investigating whether this "one-sided" stationarity criterion can be further relaxed, possibly by imposing joint (but weaker) requirements on both neuronal rate profiles. The robustness of the modified method with respect to violations of the "one-sided" stationarity assumption will also be the subject of further research. A conceptually different approach to treating nonstationary neural data with count-based statistics could be based on the use of estimators for the instantaneous firing rate. Following recent work by Pauluis and Baker (2000), who implement a rate-based version of unitary-event analysis (Grün, 1996) in connection with an instantaneous rate estimation procedure, it might be interesting to use instantaneous rate estimation (Nawrot, Aertsen, & Rotter, 1999) together with the conditional coincidence count distribution used here for variable spike event probabilities. This approach seems capable of combining the advantages of count-based statistics with the improved applicability of instantaneous rate estimators to nonstationary data sets. In conclusion, in view of the increase in test power, the increased interpretability of the significance measure, and the relaxation of the stationarity requirement, we clearly recommend implementation of the count-based rather than the rate-based version of this analysis method when testing the statistical significance of coincident spikes.
Appendix A: Stochastic Model
According to the definition of a spike event given in section 2, we define the probability space (Ω_i, P_{ξi}) for each pair of bins within the analysis window (indexed by i ∈ {1, ..., n}) on the basis of the sample space

\[
\Omega_i := \big\{ \omega_i = (\omega_{1,i}, \omega_{2,i}) : \omega_{1,i}, \omega_{2,i} \in \{0,1\} \big\},
\tag{A.1}
\]
and a probability P_{ξi} for each of the four possible outcomes (0,0), (0,1), (1,0), (1,1). A parameterization of these probabilities in terms of the individual spike event probabilities p_{1,i} and p_{2,i},

\[
P_{\xi_i}(\omega_{1,i} = 1) = p_{1,i}
\quad\text{and}\quad
P_{\xi_i}(\omega_{2,i} = 1) = p_{2,i}
\qquad (p_{1,i}, p_{2,i} \in (0,1)),
\tag{A.2}
\]
and the spike correlation ρ_i between the two neurons,

\[
\mathrm{Corr}(\omega_{1,i}, \omega_{2,i}) = \rho_i
\qquad (\rho_i \in (-1,1)),
\tag{A.3}
\]
leads to the definition

\[
\begin{aligned}
P_{\xi_i}(\omega_i = (1,1)) &:= p_{1,i}\,p_{2,i} + \rho_i R_i && \text{(A.4)}\\
P_{\xi_i}(\omega_i = (1,0)) &:= p_{1,i}(1 - p_{2,i}) - \rho_i R_i && \text{(A.5)}\\
P_{\xi_i}(\omega_i = (0,1)) &:= (1 - p_{1,i})\,p_{2,i} - \rho_i R_i && \text{(A.6)}\\
P_{\xi_i}(\omega_i = (0,0)) &:= (1 - p_{1,i})(1 - p_{2,i}) + \rho_i R_i, && \text{(A.7)}
\end{aligned}
\]

with $R_i = \sqrt{p_{1,i}(1 - p_{1,i})\,p_{2,i}(1 - p_{2,i})}$ and $\xi_i := (p_{1,i}, p_{2,i}, \rho_i)$. Based on equations A.4 through A.7, the conditional probabilities to observe a spike event from neuron 2, given the behavior of neuron 1, are

\[
\vartheta_i := P_{\xi_i}(\omega_{2,i} = 1 \mid \omega_{1,i} = 1) = p_{2,i} + \frac{\rho_i R_i}{p_{1,i}}
\tag{A.8}
\]

\[
\eta_i := P_{\xi_i}(\omega_{2,i} = 1 \mid \omega_{1,i} = 0) = p_{2,i} - \frac{\rho_i R_i}{1 - p_{1,i}}.
\tag{A.9}
\]
These conditional probabilities will be used in appendix B, yielding compact expressions for the probability distribution functions. Starting from assumptions A1 and A2 as stated in section 2, we can define the probability space describing the entire analysis window by forming the product space (Ω, P_ξ) with

\[
\Omega := \Omega_1 \times \cdots \times \Omega_n,
\qquad
P_\xi := \prod_{i=1}^{n} P_{\xi_i},
\tag{A.10}
\]
and samples ω = (ω_1, ..., ω_n). Based on this product space, we define the discrete random variables

\[
C_m(\omega) := \sum_{i=1}^{n} \omega_{m,i},
\qquad m = 1, 2,
\tag{A.11}
\]

denoting the total number of spike events from neuron m in the analysis window. Similarly,

\[
K(\omega) := \sum_{i=1}^{n} \omega_{1,i}\,\omega_{2,i}
\tag{A.12}
\]
denotes the number of coincident spike events from the two neurons.

Appendix B: Probability Distributions
For the derivations in this section, we will assume stationarity of all parameter triplets ξ_i in the analysis window, as formulated in assumption A1. We let ξ = ξ_i for all i = 1, 2, ..., n. Given the probabilities of the four possible spike event constellations of any two coinciding bins (see equations A.4–A.7) and using the conditional spike probabilities ϑ and η (see equations A.8 and A.9), application of the multinomial distribution (cf. Feller, 1968) yields the probability of finding k coincident events and c1, respectively c2, spike events from the two neurons, that is, of making an observation ω with K(ω) = k, C1(ω) = c1, and C2(ω) = c2:

\[
\begin{split}
P_\xi(K = k, C_1 = c_1, C_2 = c_2)
= {}& \frac{n!}{k!\,(c_1-k)!\,(c_2-k)!\,(n-c_1-c_2+k)!}\\
& \times \big[p_1 \vartheta\big]^{k}\,\big[p_1(1-\vartheta)\big]^{c_1-k}\,\big[(1-p_1)\eta\big]^{c_2-k}\\
& \times \big[(1-p_1)(1-\eta)\big]^{\,n-c_1-c_2+k}.
\end{split}
\tag{B.1}
\]
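Equation B.1 can be cross-checked against a direct Monte Carlo simulation of the model of appendix A (a Python sketch with illustrative parameters; not part of the original paper):

```python
import random
from math import factorial, sqrt

def joint_pmf(k, c1, c2, n, p1, p2, rho):
    # equation B.1, assuming stationary parameters xi = (p1, p2, rho)
    R = sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    th = p2 + rho * R / p1          # theta, equation A.8
    et = p2 - rho * R / (1 - p1)    # eta, equation A.9
    mult = factorial(n) // (factorial(k) * factorial(c1 - k)
                            * factorial(c2 - k) * factorial(n - c1 - c2 + k))
    return (mult * (p1 * th)**k * (p1 * (1 - th))**(c1 - k)
            * ((1 - p1) * et)**(c2 - k) * ((1 - p1) * (1 - et))**(n - c1 - c2 + k))

def sample_counts(n, p1, p2, rho, rng):
    # draw one analysis window (k, c1, c2) from the model of appendix A
    R = sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    th = p2 + rho * R / p1
    et = p2 - rho * R / (1 - p1)
    k = c1 = c2 = 0
    for _ in range(n):
        s1 = rng.random() < p1
        s2 = rng.random() < (th if s1 else et)
        c1 += s1; c2 += s2; k += s1 and s2
    return k, c1, c2

rng = random.Random(0)
n, p1, p2, rho = 10, 0.2, 0.3, 0.1
trials = 100_000
hits = sum(sample_counts(n, p1, p2, rho, rng) == (1, 2, 3) for _ in range(trials))
print(hits / trials, joint_pmf(1, 2, 3, n, p1, p2, rho))
```

The empirical frequency of a particular count triple (k, c1, c2) approaches the value given by equation B.1 as the number of simulated windows grows.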
Using

\[
P_\xi(C_1 = c_1) = \binom{n}{c_1}\, p_1^{c_1} (1 - p_1)^{n - c_1},
\tag{B.2}
\]
it follows that

\[
\begin{split}
P_\xi(K = k, C_2 = c_2 \mid C_1 = c_1)
&= \frac{P_\xi(K = k, C_1 = c_1, C_2 = c_2)}{P_\xi(C_1 = c_1)}\\
&= \binom{c_1}{k}\binom{n - c_1}{c_2 - k}\,
\vartheta^{k}(1-\vartheta)^{c_1-k}\,
\eta^{c_2-k}(1-\eta)^{n-c_1-c_2+k}.
\end{split}
\tag{B.3}
\]
Since the sample space Ω can be decomposed into disjoint subsets containing all elementary events with a specific coincidence count, we can write the joint count distribution as

\[
\begin{split}
P_\xi(C_1 = c_1, C_2 = c_2)
&= P_\xi(C_1 = c_1)\, P_\xi(C_2 = c_2 \mid C_1 = c_1)\\
&= P_\xi(C_1 = c_1) \sum_{k=0}^{\min(c_1, c_2)} P_\xi(K = k, C_2 = c_2 \mid C_1 = c_1),
\end{split}
\tag{B.4}
\]
which by insertion of equation B.3 turns into

\[
P_\xi(C_1 = c_1, C_2 = c_2)
= \binom{n}{c_1} p_1^{c_1}(1-p_1)^{n-c_1}
\sum_{k=0}^{\min(c_1,c_2)} \binom{c_1}{k}\binom{n-c_1}{c_2-k}\,
\vartheta^{k}(1-\vartheta)^{c_1-k}\,\eta^{c_2-k}(1-\eta)^{n-c_1-c_2+k}.
\tag{B.5}
\]
Going back to equation B.1, it is now straightforward to derive the conditional coincidence count distribution. By inserting equations B.1 and B.5 into

\[
P_\xi(K = k \mid C_1 = c_1, C_2 = c_2)
= \frac{P_\xi(K = k, C_1 = c_1, C_2 = c_2)}{P_\xi(C_1 = c_1, C_2 = c_2)},
\tag{B.6}
\]

we find

\[
P_\xi(K = k \mid C_1 = c_1, C_2 = c_2)
= \frac{\displaystyle \binom{c_1}{k}\binom{n-c_1}{c_2-k}\,
\vartheta^{k}(1-\vartheta)^{c_1-k}\,\eta^{c_2-k}(1-\eta)^{n-c_1-c_2+k}}
{\displaystyle \sum_{k'=0}^{\min(c_1,c_2)} \binom{c_1}{k'}\binom{n-c_1}{c_2-k'}\,
\vartheta^{k'}(1-\vartheta)^{c_1-k'}\,\eta^{c_2-k'}(1-\eta)^{n-c_1-c_2+k'}}.
\tag{B.7}
\]
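As a numerical sanity check of equation B.7 (a Python sketch with illustrative parameters; not from the paper), one can verify that the conditional distribution normalizes to one for arbitrary ρ and, for independent neurons, coincides with the hypergeometric distribution of equation 2.2 regardless of p1 and p2:

```python
from math import comb, sqrt

def cond_coinc_pmf(k, n, c1, c2, p1, p2, rho):
    # equation B.7: P(K = k | C1 = c1, C2 = c2) for stationary xi = (p1, p2, rho)
    R = sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    th = p2 + rho * R / p1          # theta, equation A.8
    et = p2 - rho * R / (1 - p1)    # eta, equation A.9
    def term(j):
        return (comb(c1, j) * comb(n - c1, c2 - j) * th**j * (1 - th)**(c1 - j)
                * et**(c2 - j) * (1 - et)**(n - c1 - c2 + j))
    kmin, kmax = max(0, c1 + c2 - n), min(c1, c2)
    return term(k) / sum(term(j) for j in range(kmin, kmax + 1))

n, c1, c2 = 30, 8, 5
for k in range(min(c1, c2) + 1):
    hyp = comb(c1, k) * comb(n - c1, c2 - k) / comb(n, c2)
    # with rho = 0 the result does not depend on the spike probabilities
    assert abs(cond_coinc_pmf(k, n, c1, c2, 0.2, 0.1, 0.0) - hyp) < 1e-9
print("rho = 0 reduces equation B.7 to the hypergeometric distribution")
```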
For independent neurons (ρ = 0), the conditional probabilities ϑ and η become equal to the spike probability of neuron 2 (ϑ = η = p2). Therefore, it is a matter of straightforward substitution to derive the conditional coincidence count distribution P_{ξ0}(K = k | C1 = c1, C2 = c2) and the joint count distribution P_{ξ0}(C1 = c1, C2 = c2) for independent neurons from the general expressions in equations B.7 and B.5, respectively. Finally, the general form
of the unconditional coincidence count distribution (cf. section 2) is found by replacing p2 in equation 2.1 with ϑ, so that the term p1ϑ corresponds to the general form of the probability to observe a pair of coincident spikes within one bin (cf. equation A.4).

Appendix C: Nonstationary Rates
Relaxing the stationarity assumption A1, we return to the general formulation of our stochastic model as developed in section 2 and appendix A. Thus, we replace the parameter triplet ξ with a vector of triplets ξ⃗. Its n components ξ_i = (p_{1,i}, p_{2,i}, ρ_i) describe the probability space (Ω_i, P_{ξi}) of each individual pair of corresponding bins in the analysis window. Assuming stochastic independence of the two neurons (i.e., H0), we can rewrite equation B.6 as

\[
P_{\vec\xi_0}(K = k \mid C_1 = c_1, C_2 = c_2)
= \frac{P_{\vec\xi_0}(K = k, C_1 = c_1, C_2 = c_2)}{P_{\vec\xi_0}(C_1 = c_1)\, P_{\vec\xi_0}(C_2 = c_2)},
\tag{C.1}
\]
where we used ξ⃗0 to indicate this parameter setting under the condition H0. Following assumption A2, we can write the probability of a specific realization as the product of the probabilities of obtaining or not obtaining a spike event in each of the corresponding bins. Thus, the probability P_{ξ⃗0}(K = k, C1 = c1, C2 = c2) of making an observation with c1 spike events from neuron 1, c2 spike events from neuron 2, and k coincident events is given by the sum over the probabilities of all possible arrangements of this count configuration. For M = {1, 2, ..., n}, where n is the number of bins in the analysis window, we define the set M_{c1} as the collection of all subsets of M with c1 elements. Further, for any μ ∈ M_{c1}, we let the set M^{μ,k}_{c2} denote the collection of all subsets of M with c2 elements in total and k elements in common with μ. Using this notation, we have

\[
P_{\vec\xi_0}(K = k, C_1 = c_1, C_2 = c_2)
= \sum_{\mu \in M_{c_1}} \sum_{\lambda \in M^{\mu,k}_{c_2}}
\prod_{i \in \mu} p_{1,i} \prod_{j \in M\setminus\mu} (1 - p_{1,j})
\prod_{l \in \lambda} p_{2,l} \prod_{m \in M\setminus\lambda} (1 - p_{2,m}).
\tag{C.2}
\]
Using the same notation, the probability P_{ξ⃗0}(C1 = c1) is given by

\[
P_{\vec\xi_0}(C_1 = c_1)
= \sum_{\mu \in M_{c_1}} \prod_{i \in \mu} p_{1,i} \prod_{j \in M\setminus\mu} (1 - p_{1,j}).
\tag{C.3}
\]
The probability P_{ξ⃗0}(C2 = c2) can be expressed analogously. Thus, we can rewrite equation C.1 as

\[
P_{\vec\xi_0}(K = k \mid C_1 = c_1, C_2 = c_2)
= \frac{\displaystyle \sum_{\mu \in M_{c_1}} \sum_{\lambda \in M^{\mu,k}_{c_2}}
\prod_{i \in \mu} p_{1,i} \prod_{j \in M\setminus\mu}(1-p_{1,j})
\prod_{l \in \lambda} p_{2,l} \prod_{m \in M\setminus\lambda}(1-p_{2,m})}
{\displaystyle \left[\sum_{\gamma \in M_{c_1}} \prod_{i \in \gamma} p_{1,i}
\prod_{j \in M\setminus\gamma}(1-p_{1,j})\right]
\left[\sum_{\kappa \in M_{c_2}} \prod_{i \in \kappa} p_{2,i}
\prod_{j \in M\setminus\kappa}(1-p_{2,j})\right]}.
\tag{C.4}
\]
This equation describes the conditional probability that, given the individual spike event counts c1 and c2, one will observe k coincident events from two stochastically independent neurons with spike event probabilities according to ξ⃗0. Assuming a stationary rate for neuron 2 (i.e., p_{2,i} = p2 for all i), equation C.4 reduces to

\[
P_{\vec\xi_0}(K = k \mid C_1 = c_1, C_2 = c_2)
= \frac{\left[\displaystyle\sum_{\mu \in M_{c_1}} \prod_{i \in \mu} p_{1,i}
\prod_{j \in M\setminus\mu}(1-p_{1,j})\right]
\binom{c_1}{k}\binom{n-c_1}{c_2-k}\, p_2^{c_2}(1-p_2)^{n-c_2}}
{\left[\displaystyle\sum_{\gamma \in M_{c_1}} \prod_{i \in \gamma} p_{1,i}
\prod_{j \in M\setminus\gamma}(1-p_{1,j})\right]
\binom{n}{c_2}\, p_2^{c_2}(1-p_2)^{n-c_2}}
= \frac{\binom{c_1}{k}\binom{n-c_1}{c_2-k}}{\binom{n}{c_2}}.
\tag{C.5}
\]
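The reduction in equation C.5 can be verified by brute-force enumeration of the independent model for a small window (Python sketch; the rate profile below is an arbitrary illustration, not taken from the paper):

```python
from itertools import product
from math import comb

# nonstationary rate for neuron 1, stationary rate for neuron 2
n = 6
p1 = [0.05, 0.3, 0.5, 0.7, 0.2, 0.9]   # illustrative time course p_{1,i}
p2 = 0.25

# enumerate all outcomes of the independent model and accumulate
# P(K = k, C1 = c1, C2 = c2)
joint = {}
for w1 in product([0, 1], repeat=n):
    pr1 = 1.0
    for i, s in enumerate(w1):
        pr1 *= p1[i] if s else 1 - p1[i]
    for w2 in product([0, 1], repeat=n):
        pr = pr1
        for s in w2:
            pr *= p2 if s else 1 - p2
        key = (sum(a * b for a, b in zip(w1, w2)), sum(w1), sum(w2))
        joint[key] = joint.get(key, 0.0) + pr

# condition on (c1, c2) and compare with the hypergeometric distribution
c1, c2 = 3, 2
norm = sum(p for (k, a, b), p in joint.items() if (a, b) == (c1, c2))
for k in range(min(c1, c2) + 1):
    cond = joint.get((k, c1, c2), 0.0) / norm
    hyp = comb(c1, k) * comb(n - c1, c2 - k) / comb(n, c2)
    assert abs(cond - hyp) < 1e-9
print("conditional distribution is hypergeometric despite nonstationary p1")
```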
This is the hypergeometric distribution as given in equation 2.2, independent of whether the firing rate of neuron 1 is stationary over the observation interval. Obviously, by interchanging the roles of the two neurons in the definitions of M_{c1} and M^{μ,k}_{c2}, the same result can be obtained for a stationary rate in neuron 1, that is, for p_{1,i} = p1 for all i. In other words, stationarity of only one of the two neurons (either one) suffices to obtain the result in equation C.5.

Appendix D: Approximative Evaluation of the Expectation Values
The number of terms in the sums underlying the computation of the expectation values for the count-dependent test power and significance level with respect to the joint count distribution (cf. equations 4.4, 4.5, and 5.1–5.3) grows with n². Thus, the direct evaluation of the sums for realistic values of n, which can reach up to thousands, is rather impractical. However, by using the fact that the mass of the joint count distribution is mainly concentrated at relatively few count combinations, it is straightforward to
calculate an approximate value of these quantities up to arbitrary precision δ, for example, the approximate test power p^approx_{ξ,δ}. Selecting B_{ξ,δ} ⊂ {0, 1, ..., n} × {0, 1, ..., n} such that

\[
\sum_{(c_1,c_2) \in B_{\xi,\delta}} P_\xi(C_1 = c_1, C_2 = c_2) \ge 1 - \delta,
\tag{D.1}
\]

we find the approximate test power p^approx_{ξ,δ} given by

\[
p^{\mathrm{approx}}_{\xi,\delta}
= \sum_{(c_1,c_2) \in B_{\xi,\delta}} P_\xi(C_1 = c_1, C_2 = c_2)\,
P_\xi\big(K \ge k_{\mathrm{crit}}(c_1,c_2) \mid c_1, c_2\big).
\tag{D.2}
\]
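The construction of the truncation set B_{ξ,δ} can be sketched in Python (an illustration, not the authors' code; for brevity, the joint count distribution of two independent neurons is used, whereas the correlated case would substitute equation B.5, and a simple global sort replaces the paper's expansion from the central term):

```python
from math import comb

def binom_pmf(c, n, p):
    return comb(n, c) * p**c * (1 - p)**(n - c)

def build_B(n, p1, p2, delta):
    # greedily add count combinations in order of decreasing mass of the
    # joint count distribution until the accumulated mass reaches 1 - delta
    # (cf. equation D.1)
    cells = sorted(
        (((c1, c2), binom_pmf(c1, n, p1) * binom_pmf(c2, n, p2))
         for c1 in range(n + 1) for c2 in range(n + 1)),
        key=lambda x: -x[1])
    B, mass = [], 0.0
    for cell, w in cells:
        B.append(cell); mass += w
        if mass >= 1 - delta:
            break
    return B, mass

n, delta = 100, 1e-4
B, mass = build_B(n, 0.05, 0.05, delta)
print(len(B), (n + 1)**2)  # far fewer cells than the full grid
```

Only the cells in B then enter the expensive per-cell tail evaluations of equation D.2, which is where the saving lies.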
Since all probabilities are smaller than unity, it is clear from equation D.1 that p_ξ − p^approx_{ξ,δ} ≤ δ. When computing the effective significance levels α_{ξ0} and ε_{ξ0}, the expectation value is formed over quantities smaller than α. Thus, the precision of the approximation of these significance levels improves to αδ. Note that B_{ξ,δ} is not uniquely defined through equation D.1. In our calculations, B_{ξ,δ} was determined by successively adding up the mass of individual spike event count combinations until the cutoff value 1 − δ was reached. To keep the number of individual spike count combinations entering B_{ξ,δ} reasonably low, we started with the central term (cf. Feller, 1968) of the joint count distribution of independent neurons and iteratively added those count combinations from the surrounding of B_{ξ,δ} that contributed most. Although not required for this procedure, this approach was motivated by the fact that the joint count distribution falls off monotonically with increasing distance from its central term.

Appendix E: Randomized Tests
The basic idea behind randomized tests (Mood et al., 1974) is to define a test through a critical function $\psi$ that assigns rejection probabilities to given empirical observations, rather than through a fixed critical region of rejection $R$. Through the incorporation of an additional independent random variable, it becomes possible to adjust the effective significance level of the test to match its nominal significance level $\alpha$ precisely, regardless of any discreteness of its test statistic. Applied to the test based on the hypergeometric coincidence count distribution, this means that the decision of the test will no longer be based on the critical region $R^{\mathrm{hyp}}_{c_1,c_2} = \{k : k \geq k^{\mathrm{hyp}}_{\mathrm{crit}}\}$ of critical coincident spike event counts $k$ but rather on the critical function

$$\psi_{c_1,c_2} := \begin{cases} 1 & k \geq k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) \\ \phi_{c_1,c_2} & k = k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1 \\ 0 & k < k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1, \end{cases} \tag{E.1}$$
Statistical Significance of Coincident Spikes
which for every element $\omega$ of the sample space $\Omega$ sets the probability of rejecting the null hypothesis. Thus, while the null hypothesis will always be rejected if $k \geq k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2)$ and never be rejected if $k < k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1$, the parameter $\phi_{c_1,c_2}$ controls the rejection probability for all $\omega$ with $k = k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1$. In order to adjust the effective significance level of the test to its nominal significance level $\alpha$, we define $\phi_{c_1,c_2}$ such that

$$\phi_{c_1,c_2} := \frac{\alpha - P_{j_0}\bigl(K \geq k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2)\bigr)}{P_{j_0}\bigl(K = k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1\bigr)}. \tag{E.2}$$
From here it is straightforward to see that the probability of falsely rejecting the null hypothesis of stochastically independent neurons becomes

$$P_{j_0}\bigl(K \geq k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2)\bigr) + \phi_{c_1,c_2}\, P_{j_0}\bigl(K = k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1\bigr) = \alpha, \tag{E.3}$$
and thus for all count combinations precisely corresponds to the nominal level of significance. Note that this procedure adjusts the probability of falsely rejecting the null hypothesis by introducing rejections of the null hypothesis with probability $\phi_{c_1,c_2}$ for all count constellations with $k = k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1$. While this raises the probability of a false rejection to the nominal $\alpha$-level of the test, and correspondingly increases its test power, the outcome of the test for a given set of data becomes a random variable. Thus, repeated applications of the test to the same data will in general lead to different findings. This indeterminacy of randomized tests is the reason for their restricted use in applied statistics (Mood et al., 1974).
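The decision rule of equations E.1 and E.2 can be sketched concretely. The following is an illustrative stand-in, not the authors' implementation; it uses the hypergeometric coincidence count distribution of the nonrandomized test and illustrative function names:

```python
import random
from math import comb

def hyper_pmf(k, n, c1, c2):
    # P(K = k | c1, c2): hypergeometric coincidence count distribution
    if k < max(0, c1 + c2 - n) or k > min(c1, c2):
        return 0.0
    return comb(c1, k) * comb(n - c1, c2 - k) / comb(n, c2)

def upper_tail(k, n, c1, c2):
    # P(K >= k | c1, c2)
    return sum(hyper_pmf(j, n, c1, c2) for j in range(k, min(c1, c2) + 1))

def randomized_test(k, n, c1, c2, alpha, rng=random):
    # k_crit: smallest k with P(K >= k) <= alpha under independence
    kc = min(c1, c2) + 1
    for j in range(min(c1, c2) + 1):
        if upper_tail(j, n, c1, c2) <= alpha:
            kc = j
            break
    if k >= kc:
        return True                    # always reject (first case of eq. E.1)
    if k == kc - 1:
        p_at = hyper_pmf(kc - 1, n, c1, c2)
        phi = (alpha - upper_tail(kc, n, c1, c2)) / p_at if p_at > 0 else 0.0
        return rng.random() < phi      # reject with probability phi (eq. E.2)
    return False                       # never reject (third case of eq. E.1)
```

By construction $\phi_{c_1,c_2} \in [0, 1]$, and the expected rejection probability under independence equals $\alpha$ exactly, as in equation E.3.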
By straightforward extension of equation 4.5, we find the power $p^{\mathrm{hyp}}_{j,\mathrm{Rnd}}$ of the randomized version of the test to be given by

$$p^{\mathrm{hyp}}_{j,\mathrm{Rnd}} = \sum_{c_1, c_2 = 0}^{n} P_j(C_1 = c_1, C_2 = c_2) \Bigl[ P_j\bigl(K \geq k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) \mid c_1, c_2\bigr) + \phi_{c_1,c_2}\, P_j\bigl(K = k^{\mathrm{hyp}}_{\mathrm{crit}}(c_1, c_2) - 1 \mid c_1, c_2\bigr) \Bigr]. \tag{E.4}$$
Comparison of the power curves for the randomized versus the nonrandomized test in the same parameter regime as used in Figure 3 demonstrates that, as expected, the power curves of the randomized version of the test lie above the values reached by the nonrandomized version. The maximum increase in test power ranges from approximately 0.05 for $\alpha = 0.01$ ($r = 0.1$) to approximately 0.07 for $\alpha = 0.1$ ($r = 0.05$). Thus, the effect of randomization becomes larger for more permissive significance levels $\alpha$. Finally, we note
that the randomized version of the test based on the hypergeometric coincidence count distribution, that is, of Fisher's exact test, is uniformly most powerful unbiased for testing independence in a 2 × 2 table (Lehmann, 1997).

Acknowledgments
We thank Hans Rudolf Lerche for valuable discussions and for calling our attention to Fisher's exact test. We thank Sonja Grün for helpful discussions and Arup Roy for kindly providing us with his manuscript prior to publication. We gratefully acknowledge Alexandre Kuhn for helpful comments on the manuscript. This work was supported in part by the Studienstiftung des deutschen Volkes, the German-Israeli Foundation for Scientific Research and Development (GIF), the Deutsche Forschungsgemeinschaft (DFG), and the Institut für Grenzgebiete der Psychologie, Freiburg.

References

Abeles, M. (1982). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70(4), 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60(3), 909–924.
Aertsen, A., Bonhoeffer, T., & Krüger, J. (1987). Coherent activity in neuronal populations: Analysis and interpretation. In E. R. Caianiello (Ed.), Physics of cognitive processes (pp. 1–34). Singapore: World Scientific Publishing.
Aertsen, A., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Research, 340, 341–354.
Aertsen, A., & Gerstein, G. L. (1991). Dynamic aspects of neuronal cooperativity: Fast stimulus-locked modulations of effective connectivity. In J. Krüger (Ed.), Neuronal cooperativity (pp. 52–67). Stuttgart: Springer-Verlag.
Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Feller, W. (1968). An introduction to probability theory and its applications (3rd ed.). New York: Wiley.
Galassi, M., Davies, J., Theiler, J., Gough, B., Priedhorsky, R., Jungman, G., & Booth, M. (1998). GNU Scientific Library—Reference Manual (0.4after ed.). Cambridge, MA: Free Software Foundation. Available online at: http://sources.redhat.com/gsl.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance, and interpretation. Thun: Verlag Harri Deutsch.
Grün, S., Diesmann, M., & Aertsen, A. (in press-a). "Unitary events" in multiple single neuron spiking activity. I. Detection and significance. Neural Computation.
Grün, S., Diesmann, M., & Aertsen, A. (in press-b). "Unitary events" in multiple single neuron spiking activity. II. Non-stationary data. Neural Computation.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. J. Neurosci. Meth., 94, 67–79.
Gütig, R., Rotter, S., Grün, S., & Aertsen, A. (2000). Significance of coincident spikes: Count-based versus rate-based statistics. In Ninth Annual Computational Neuroscience Meeting CNS*2000 (p. 69). Brussels.
Kreiter, A. K., & Singer, W. (1996). On the role of neural synchrony in the primate visual cortex. In A. Aertsen & V. Braitenberg (Eds.), Brain theory—Biological basis and computational principles (pp. 201–227). Amsterdam: Elsevier.
Lehmann, E. (1997). Testing statistical hypotheses (2nd ed.). New York: Springer-Verlag.
Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the theory of statistics (3rd ed.). Tokyo: McGraw-Hill.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates. J. Neurosci. Meth., 94, 81–91.
Nicolelis, M. A. L. (Ed.). (1998). Methods for neural ensemble recordings. Boca Raton, FL: CRC Press.
Palm, G., Aertsen, A., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of instantaneous discharge probability, with application to unitary joint-event analysis. Neural Comp., 12, 647–669.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Hamutal, S., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roy, A., Steinmetz, P. N., & Niebur, E. (2000). Rate limitations of unitary event analysis. Neural Comp., 12, 2063–2082.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W. (1999). Time as coding space. Curr. Op. Neurobiol., 9(2), 189–194.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.

Received August 1, 2000; accepted March 19, 2001.
LETTER
Communicated by Peter Dayan
Representational Accuracy of Stochastic Neural Populations

Stefan D. Wilke
[email protected]
Christian W. Eurich
[email protected]
Institut für Theoretische Physik, Universität Bremen, 28334 Bremen, Germany

Fisher information is used to analyze the accuracy with which a neural population encodes D stimulus features. It turns out that the form of response variability has a major impact on the encoding capacity and therefore plays an important role in the selection of an appropriate neural model. In particular, in the presence of baseline firing, the reconstruction error rapidly increases with D in the case of Poissonian noise but not for additive noise. The existence of limited-range correlations of the type found in cortical tissue yields a saturation of the Fisher information content as a function of the population size only for an additive noise model. We also show that random variability in the correlation coefficient within a neural population, as found empirically, considerably improves the average encoding quality. Finally, the representational accuracy of populations with inhomogeneous tuning properties, either with variability in the tuning widths or fragmented into specialized subpopulations, is superior to the case of identical and radially symmetric tuning curves usually considered in the literature.
1 Introduction
Despite impressive progress in the neurosciences, the code by which neural systems transmit and represent stimulus information has remained largely enigmatic (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; deCharms & Zador, 2000). A common way to achieve an understanding of this code is to study the encoding properties of sensory or motor neural systems under biological constraints such as receptive field sizes, correlations in the neuronal firing, or energy consumption. A measure of the neural encoding ability is the stimulus reconstruction error, the "difference" between the actual stimulus and a stimulus estimate obtained solely from measured neural responses. Theoretically, the calculation of reconstruction errors is closely related to estimating an unknown parameter from a set of samples of a random variable, which is a standard problem of statistical estimation theory (Cover & Thomas, 1991; Kay, 1993). Solving this problem can provide an answer
Neural Computation 14, 155–189 (2001)
© 2001 Massachusetts Institute of Technology
to the question of which coding strategy is favorable under given circumstances. Many works of this kind have concentrated, for instance, on tuning properties of neurons within a population, trying to derive under which circumstances broad or narrow tuning is favorable (Hinton, McClelland, & Rumelhart, 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992a; Eurich & Schwegler, 1997; Zhang & Sejnowski, 1999), and, more recently, on the effects of correlated variability in a population (Snippe & Koenderink, 1992b; Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999; Karbowski, 2000). Most of these studies have given little attention to what is in fact one of the most important features of neural systems in this context: the response variability, which makes the reconstruction problem interesting in the first place. While most studies have simply assumed Poissonian or additive gaussian noise on the responses, we will show in this article that the choice of a variability model has a major, nontrivial impact on the encoding properties of the neural population. The immense variability of individual response parameters, such as tuning widths or correlation coefficients, has also been neglected in most previous work. Although these parameter variations are always found in empirical data, they were considered functionally insignificant, and hence theoretical studies have almost always assumed uniform parameters throughout the population. We will show here that this uniform case is unfavorable in the sense that the introduction of parameter variability improves the encoding performance. In this study, neural population codes are analyzed using the Fisher information measure, which has turned out to be a powerful tool for studying questions of neural coding (Paradiso, 1988; Seung & Sompolinsky, 1993; Brunel & Nadal, 1998; Zhang & Sejnowski, 1999; Pouget, Deneve, Ducom, & Latham, 1999; Karbowski, 2000; Johnson, Gruner, Baggerly, & Seshagiri, 2001).
As compared to mutual information and Bayesian techniques, it is especially suitable if the stimulus itself is not a random variable. Here, Fisher information provides a unified framework in which multiple aspects of population responses—response variability models, noise correlations, and tuning widths—can be discussed. In the following section, we give a review of the concept of Fisher information and specify the neural population model to be studied. In sections 3, 4, and 5, we discuss the representational accuracy of the population for different response variability models, noise correlation schemes, and tuning width distributions, respectively. Section 6 provides a summary of the main results. The major equations of the present study are derived in two mathematical appendixes.
2 Theoretical Framework

2.1 Fisher Information and the Cramér-Rao Bound. Consider a population of $N$ neurons encoding a single stimulus $x = (x_1, \ldots, x_D)$ in a $D$-dimensional space, the so-called implicit space (Zemel & Hinton, 1995). The
stimulus components $x_i$ will be referred to as features; they correspond to the different physical properties of the stimulus to which the neurons are sensitive, such as position, velocity, brightness, and contrast, in the case of visual neurons. The neurons fire their action potentials stochastically, so that the spike count vector $n := (n_1, \ldots, n_N)$ measured in a single trial of observation time $\tau$ is a discrete random variable. The spike counts are assumed to carry all stimulus information, so that the population's coding scheme is characterized by the joint probability $P(n \mid x)$ for observing the spike count vector $n$ given the stimulus value $x$. Neural systems are believed to use codes other than spike counts to transmit information (Eckhorn, Grüsser, Kröller, Pellnitz, & Pöpel, 1976; Rieke et al., 1997; Gautrais & Thorpe, 1998). However, in many cases a large amount of sensory information is encoded in the mean spike count (Tovée, Rolls, Treves, & Bellis, 1993; Heller, Hertz, Kjaer, & Richmond, 1995). The representational capacity of the neural population can be quantified by how well the stimulus $x$ can be recovered from an observation $n$. In the framework of classical estimation theory, a rule to construct an estimate $\hat{x}(n)$ of the actual stimulus $x$ from an observation $n$ is called an estimator. A given estimator $\hat{x}$ is usually characterized by two quantities: its bias $b_{\hat{x}}(x) := E[\hat{x}] - x$ and its variance $v_{i,\hat{x}}(x) := E[\hat{x}_i^2] - E^2[\hat{x}_i]$ for $i = 1, \ldots, D$, where $E[\cdot]$ denotes the expectation value with respect to $P(n \mid x)$ at fixed $x$. While the bias is a measure of systematic deviations, the variance describes the scatter around the mean estimate $E[\hat{x}]$. It is a common measure of encoding quality since it can be interpreted as a mean-squared reconstruction error.
There is a convenient formula for a lower bound on the variance of any unbiased estimator ($b_{\hat{x}}(x) = 0$), the famous Cramér-Rao bound (CRB) (Rao, 1945; Cramér, 1946):

$$v_{i,\hat{x}}(x) \geq [J(x)^{-1}]_{ii} \quad \text{for all } x \text{ and } i = 1, \ldots, D. \tag{2.1}$$
It states that unbiased estimates of the stimulus $x$ have a variance that is at least the corresponding matrix element of the inverse of the Fisher information matrix $J(x)$, which is calculated from the probability distribution $P(n \mid x)$ via

$$J_{ij}(x) := -E\left[\frac{\partial^2}{\partial x_i\, \partial x_j} \ln P(n \mid x)\right], \quad i, j = 1, \ldots, D. \tag{2.2}$$
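For a single neuron with Poissonian spike count and gaussian tuning, equation 2.2 reduces to the well-known expression $J(x) = \tau f'(x)^2 / f(x)$, which makes the CRB easy to evaluate numerically. The sketch below is illustrative only; the parameter values ($F = 100$ Hz, $\tau = 0.1$ s, $\sigma = 1$) are assumptions, not values from the text:

```python
import math

def fisher_info_poisson(x, c=0.0, sigma=1.0, F=100.0, tau=0.1):
    # 1-D gaussian tuning without baseline: f(x) = F * exp(-(x-c)^2 / (2*sigma^2))
    f = F * math.exp(-(x - c) ** 2 / (2 * sigma ** 2))
    df = -f * (x - c) / sigma ** 2      # analytic derivative f'(x)
    # Poisson spike count with mean tau*f(x): J(x) = tau * f'(x)^2 / f(x)
    return tau * df ** 2 / f

# CRB: any unbiased estimate of x from this neuron's count has variance >= 1/J(x)
crb = 1.0 / fisher_info_poisson(1.0)
```

For this noise model, $J(x)$ vanishes at the tuning-curve center and peaks at $|x - c| = \sqrt{2}\,\sigma$, in line with the later remark (section 3.3) that the most informative stimuli lie away from the center.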
Fisher information is independent of the estimation mechanism employed and can therefore serve as an objective measure of the coding quality of the encoding scheme $P(n \mid x)$ of the neural population. The formalism of Fisher information is not suited to taking into account prior knowledge about the stimulus, which is known to improve the quality of reconstruction (Kay, 1993). Moreover, there are situations in which biased estimators can perform better than the CRB (Cover & Thomas, 1991). In
the case of a dense, uniform population of neurons considered below, this is not a problem, since translational invariance implies that all reasonable estimators (those that respect this invariance) are either equally biased for all stimulus values or unbiased. In the former case, subtracting the bias yields an unbiased estimator with better performance. Another nontrivial question is whether there is an efficient estimator in the first place. For instance, the flat tuning curves generally used in coarse coding schemes (Hinton et al., 1986; Eurich & Schwegler, 1997) do not allow for unbiased estimates. Recently, Bethge, Rotermund, and Pawelzik (2000) showed that efficient estimation can also become impossible if insufficient time is available for decoding. However, the time available for decoding may vary from one system and task to the other, and there is no simple way to calculate how much time efficient estimation takes in a given system. Thus, it is important to know what encoding accuracy can be expected if no restrictions on timing are imposed; this is achieved by calculating Fisher information (Bethge et al., 2000).

2.2 Tuning Functions. Neural responses are conventionally characterized in terms of tuning functions describing the mean firing rate as a function of the stimulus value. In the present framework, the tuning functions are therefore given by $f(x) := E[n]/\tau$. Typically, $f_k(x)$ is a unimodal function centered around some preferred stimulus $c^{(k)}$ that depends on the neuron $k$. For a more concrete discussion of the encoding properties of the model population, we specify the neuronal tuning characteristics as follows. For neuron $k$, we assume
$$f_k(x) = F\, w(\xi^{(k)2}) = F\zeta + F(1 - \zeta)\, \exp\bigl(-\xi^{(k)2}/2\bigr), \tag{2.3}$$
where $\xi^{(k)2}$ is a rescaled distance between the center of the tuning curve and the actual stimulus, $\xi^{(k)2} := \sum_{i=1}^{D} \xi_i^{(k)2}$, $\xi_i^{(k)} := (x_i - c_i^{(k)})/\sigma_i^{(k)}$, $F$ is the maximum firing rate, and $\sigma_i^{(k)}$ is the neuron's tuning width for feature $i$ (Zhang & Sejnowski, 1999; Wilke & Eurich, 1999; Eurich & Wilke, 2000). The parameter $\zeta$ measures the level of stimulus-unrelated background activity at a given maximum firing rate $F$; it ranges from zero-baseline firing ($\zeta = 0$) to completely stimulus-independent firing ($\zeta = 1$). The tuning curve is depicted in Figure 1. For simplicity, a gaussian tuning curve profile was assumed, but the results do not depend critically on this choice. In order to assess the redundancy of the neural population code, we introduce the local density of tuning curves in stimulus space, $\gamma(x) := \sum_{k=1}^{N} \delta(c^{(k)} - x)$. In the following, it will often be assumed that the tuning curves uniformly cover the stimulus space (see section A.2). This case will be referred to as the continuous limit. Nonuniform tuning curve distributions can optimize performance if the stimuli are not evenly distributed (Brunel & Nadal, 1998), but sparse tuning curves lead to unrepresented
[Figure 1 plot: mean response (Response / Hz, 0–200) versus stimulus (Stimulus / σ, −4 to 4), with standard-deviation bands for additive (flat), proportional, and multiplicative noise.]

Figure 1: Example of tuning curve and variability of a model neuron. Mean response (solid) ± standard deviation. The mean follows a gaussian centered at the preferred stimulus with F = 100 Hz and a baseline firing level of 10 Hz (ζ = 0.1). The shape of the response variance depends on the noise model. Arrows indicate stimuli for which the neuron's Fisher information is maximal.
areas in stimulus space, the effects of which are discussed in section 5 (see also Eurich & Wilke, 2000). Finally, we introduce a distribution of tuning widths, $P_\sigma$, from which the tuning widths of all neurons are assumed to be drawn (Eurich, Wilke, & Schwegler, 2000),

$$P\bigl(\sigma^{(1)}, \ldots, \sigma^{(N)}\bigr) = \prod_{k=1}^{N} P_\sigma\bigl(\sigma_1^{(k)}, \ldots, \sigma_D^{(k)}\bigr). \tag{2.4}$$
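The population model of equations 2.3 and 2.4 can be sketched directly. The uniform distributions below for the centers and tuning widths are illustrative stand-ins for $\gamma(x)$ and $P_\sigma$, and the parameter values are assumptions:

```python
import math
import random

def tuning(x, center, sigma, F=100.0, zeta=0.1):
    # eq. 2.3: f_k(x) = F*zeta + F*(1 - zeta)*exp(-xi^2 / 2), with the
    # rescaled distance xi_i = (x_i - c_i) / sigma_i summed over features
    xi2 = sum(((xi - ci) / si) ** 2 for xi, ci, si in zip(x, center, sigma))
    return F * zeta + F * (1 - zeta) * math.exp(-xi2 / 2)

# population of N neurons in D dimensions; centers and tuning widths drawn
# i.i.d., the uniform distributions being stand-ins for gamma(x) and P_sigma
D, N = 2, 50
rng = random.Random(0)
centers = [[rng.uniform(-3, 3) for _ in range(D)] for _ in range(N)]
sigmas = [[rng.uniform(0.5, 1.5) for _ in range(D)] for _ in range(N)]
rates = [tuning([0.0, 0.0], c, s) for c, s in zip(centers, sigmas)]
```

Every mean rate lies between the baseline $F\zeta$ (far from the preferred stimulus) and the maximum $F$ (at the tuning-curve center), as equation 2.3 requires.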
This allows for a discussion of the effect of tuning properties of the neural population in section 5. Note that stimulus features to which the neurons are not sensitive ($\sigma_i^{(k)} = \infty$) must be excluded from the analysis.

2.3 Spike Count Statistics. For the calculations to follow, a gaussian probability density will be assumed for the spike count. In this framework, correlations between the neurons of the population are described by the covariance matrix,

$$Q(x) := E\left[\bigl(n - \tau f(x)\bigr)\bigl(n - \tau f(x)\bigr)^T\right]. \tag{2.5}$$
For simplicity of notation, we will not write out the $x$-dependence of $f$ and $Q$ in the following. Assuming that $n$ can be treated as a continuous random variable, one has a probability density function

$$p(n \mid x) = \frac{1}{\sqrt{(2\pi)^N \det Q}}\, \exp\left[-\frac{1}{2}(n - \tau f)^T Q^{-1} (n - \tau f)\right]. \tag{2.6}$$
This approximation is expected to fail if the firing rate of any of the neurons becomes small and the discrete nature of the spike count variables becomes important. Inserting equation 2.6 into the definition of Fisher information, equation 2.2, one finds the explicit form (Kay, 1993),

$$J_{ij}(x) = \tau^2\, \frac{\partial f^T}{\partial x_i}\, Q^{-1}\, \frac{\partial f}{\partial x_j} + \frac{1}{2}\, \operatorname{Tr}\left[Q^{-1} \frac{\partial Q}{\partial x_i}\, Q^{-1} \frac{\partial Q}{\partial x_j}\right]. \tag{2.7}$$
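Equation 2.7 can be evaluated numerically for simple cases. The sketch below assumes independent additive noise, $Q = \mathrm{var} \cdot I$, so that the trace term vanishes; gradients are taken by central finite differences. The function name and parameter values are illustrative assumptions:

```python
import math

def fisher_matrix(x, centers, sigmas, tau=0.1, var=1.0, F=100.0, zeta=0.1, h=1e-5):
    # eq. 2.7 for independent additive noise, Q = var * I: the trace term
    # vanishes and J_ij = (tau^2 / var) * sum_k (df_k/dx_i)(df_k/dx_j)
    D = len(x)

    def f(y, c, s):
        # gaussian tuning with baseline, eq. 2.3
        xi2 = sum(((yi - ci) / si) ** 2 for yi, ci, si in zip(y, c, s))
        return F * zeta + F * (1 - zeta) * math.exp(-xi2 / 2)

    def grad(c, s):
        g = []
        for i in range(D):
            xp, xm = list(x), list(x)
            xp[i] += h
            xm[i] -= h
            g.append((f(xp, c, s) - f(xm, c, s)) / (2 * h))  # central difference
        return g

    J = [[0.0] * D for _ in range(D)]
    for c, s in zip(centers, sigmas):
        g = grad(c, s)
        for i in range(D):
            for j in range(D):
                J[i][j] += tau ** 2 / var * g[i] * g[j]
    return J
```

The result is a symmetric $D \times D$ matrix, as required by the definition 2.2.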
Because of the unspecified covariance matrix $Q$, equation 2.7 is too general for a discussion of the encoding properties of the neural population. Thus, the next task is to specify $Q$ on the basis of empirical findings. This will be performed in two steps. First, independent spiking is assumed, which reduces the task to specifying the diagonal elements of $Q$ (section 3). In section 4, the independence assumption is dropped, and correlations are considered.

3 Effect of Neuronal Noise Models on the Encoding Accuracy
The implications of the variability of neuronal responses have been debated with some controversy (Softky & Koch, 1993; Mainen & Sejnowski, 1995; Reich, Victor, Knight, Ozaki, & Kaplan, 1997; Gur, Beylin, & Snodderly, 1997; Bair & O'Keefe, 1998; Pouget et al., 1999). In this section we focus on the deviations from the mean spike count in individual stimulus presentations. An empirical mean-variance relation involving two free parameters will be used to discuss the effect of different variance models on the encoding accuracy of neural populations.

3.1 Mean-Variance Scaling. Throughout this section, it is assumed that there are no correlations between neurons, rendering the covariance matrix 2.5 diagonal. The diagonal elements $Q_{kk}(x)$ to be specified now are the variances of the spike counts. Several empirical studies (Dean, 1981; Tolhurst, Movshon, & Thompson, 1981; Tolhurst, Movshon, & Dean, 1983; van Kan, Scobey, & Gabor, 1985; Teich & Khanna, 1985; Vogels, Spileers, & Orban, 1989; Gershon, Wiener, Latham, & Richmond, 1998; Lee, Port, Kruse, & Georgopoulos, 1998; Maynard et al., 1999) have yielded the result that the logarithm of the spike count variance is a linear function of the logarithm of the mean spike count. Theoretically, additive (flat) noise (Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999) and multiplicative noise (Abbott & Dayan, 1999) have been employed next to Poissonian noise to describe neural spiking statistics. Both the empirical results and the theoretical models are captured by the mean-variance relation of fractional power-law noise,

$$Q_{kk}(x) = a\, f_k(x)^{2\alpha}, \quad k = 1, \ldots, N, \tag{3.1}$$
with two (possibly $\tau$-dependent) parameters $a$ and $\alpha$ yet to be specified. The choice $\alpha = 1/2$ and $a = \tau$ yields the mean-variance relation of the Poisson distribution (proportional noise), and the cases $\alpha = 0$ and $\alpha = 1$ are referred to as additive and multiplicative noise, respectively. Note that equation 3.1 is valid only for large mean spike counts, whereas for small spike counts, the spike count probability must approach proportional noise (Panzeri, Biella, Rolls, Skaggs, & Treves, 1996; Panzeri, Schultz, Treves, & Rolls, 1999; Panzeri, Treves, Schultz, & Rolls, 1999). Although Poissonian spike count statistics are frequently used to describe neural responses, empirical data often show substantial deviations from the Poissonian case, especially at high spike counts (Teich, Johnson, Kumar, & Turcott, 1990; Softky & Koch, 1993; Gershon et al., 1998; Rieke et al., 1997; Lee et al., 1998). Hence, there are aspects of neuronal noise that are not captured by a simple Poissonian spike count distribution. However, there are no clear ideas on what effect this may have on the encoding properties of a neural population. Sections 3.2 and 3.3 will therefore discuss the dependence of Fisher information on the mean-variance relation exponent $\alpha$. It is shown in section A.1 that inserting equation 3.1 into the formula for the Fisher information in the gaussian noise case, equation 2.7, one finds in the limit of large $N$,

$$J_{ij}(x) = \frac{\tau^2}{a}\, F_{ij}(x; \alpha) + 2\alpha^2\, F_{ij}(x; 1), \tag{3.2}$$
with the abbreviation $F_{ij}(x; \alpha)$ from equation A.5. In the continuous limit (see section A.2), equation 3.2 can be written as

$$J_{ij}(x) = \delta_{ij}\, \gamma \left\langle \frac{\prod_{l=1}^{D} \sigma_l}{\sigma_i \sigma_j} \right\rangle \frac{4\pi^{D/2}}{\Gamma\bigl(1 + \frac{D}{2}\bigr)} \left[ \frac{\tau^2 F^{2-2\alpha}}{a}\, A_w(\alpha, D) + 2\alpha^2 A_w(1, D) \right], \tag{3.3}$$
where $A_w(\beta, D) := \int_0^\infty d\xi\, \xi^{D+1}\, w'(\xi^2)^2\, w(\xi^2)^{-2\beta}$ is an integral that depends on the exponent $\alpha$, the number of features $D$, and, through the shape of the tuning curve $w$, the baseline activity level $\zeta$. The tuning widths enter the Fisher information only through the factor $\bigl\langle \sigma_i^{-1} \sigma_j^{-1} \prod_{l=1}^{D} \sigma_l \bigr\rangle$, which denotes an expectation value with respect to the tuning width distribution, equation 2.4. Equation 3.3 shows that the Fisher information matrix is diagonal
for the population considered here and that its elements increase linearly with the density of neurons in stimulus space, regardless of all other parameters.

3.2 Encoding Accuracy and Maximum Firing Rate. Generating action potentials requires relatively large amounts of energy (Laughlin, de Ruyter van Steveninck, & Anderson, 1998), and it has been proposed that this energy constitutes a major constraint on the neural code (Levy & Baxter, 1996). Thus, it is an interesting question how sensory resolution depends on the maximum firing rate. The $F$-dependence of Fisher information for the model population studied here is directly obtained from equation 3.3. Since the term $2\alpha^2 A_w(1, D)$ can be neglected for high maximum spike counts, $\tau^2 F^{2-2\alpha} \gg 2a$, Fisher information exhibits the simple scaling behavior $J_{ij} \propto F^{2-2\alpha}$. Hence, the mean-variance relation, characterized by the exponent $\alpha$, has a major impact on the $F$-dependence of the encoding accuracy. The scaling law $J_{ij} \propto F^{2-2\alpha}$ yields a clear advantage for smaller values of $\alpha$. While for additive noise, Fisher information increases quadratically with the maximum firing rate, multiplicative noise results in constant Fisher information, a fixed minimal encoding error that cannot be improved by higher firing frequencies. These results are independent of the number of encoded features, $D$, and of the level of background activity, $\zeta$, as long as the latter is nonzero. The reason for the great increase of Fisher information with $F$ for small $\alpha$ lies in the mean-variance relation shown in Figure 1. At fixed $\tau$, a constant variance for all firing rates loses relative importance if the mean firing rate increases. Hence, increasing $F$ yields better performance. In the multiplicative noise case, however, any increase of firing rate is accompanied by an overproportional increase of variance, which compensates the higher number of counted spikes.
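The scaling $J \propto F^{2-2\alpha}$ can be checked numerically from the dominant single-neuron term with power-law noise (equation 3.1). This is a sketch under illustrative assumptions (one zero-baseline neuron, unit noise amplitude), not the full population formula 3.3:

```python
import math

def fisher_term(F, x=1.0, alpha=0.5, a=1.0, tau=0.1):
    # single neuron, zero baseline (zeta = 0): f(x) = F*exp(-x^2/2),
    # power-law noise Q = a*f^(2*alpha)  (eq. 3.1),
    # dominant Fisher term J = tau^2 * f'(x)^2 / Q
    f = F * math.exp(-x * x / 2)
    df = -x * f
    return tau ** 2 * df ** 2 / (a * f ** (2 * alpha))

# doubling the maximum rate F scales J by 2^(2-2*alpha): a factor 4 for
# additive (alpha=0), 2 for proportional (alpha=1/2), 1 for multiplicative (alpha=1)
ratios = {alpha: fisher_term(200.0, alpha=alpha) / fisher_term(100.0, alpha=alpha)
          for alpha in (0.0, 0.5, 1.0)}
```

The multiplicative case ($\alpha = 1$) yields a ratio of 1, the fixed encoding floor described in the text.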
How Fisher information changes as a function of observation time $\tau$ remains an open question, since there are no empirical results on the $\tau$-dependence of the parameters $a$ and $\alpha$ in the mean-variance relation, equation 3.1.

3.3 Encoding Accuracy and Baseline Activity. A desirable property for a neural population code is that the representational capacity not be overly sensitive to stimulus-unrelated activity, which naturally arises due to intrinsic noise (White, Rubinstein, & Kay, 2000) or ongoing network dynamics. In the following, it will be shown that the impact of background activity critically depends on $\alpha$, and thus on the mean-variance relation. The analysis of the $\zeta$-dependence in the current model is complicated by a problem with the gaussian approximation, equation 2.6, in the zero-baseline firing limit $\zeta \to 0$. Neurons that are not activated by the stimulus will cease to fire in this case, so that one leaves the range of validity of equation 2.6. In fact, for $\zeta = 0$ and gaussian tuning curves, one has $A_w(\beta, D) = \Gamma\bigl(1 + \frac{D}{2}\bigr) / \bigl[8(1 - \beta)^{1 + D/2}\bigr]$, which obviously diverges for $\beta = 1$.
[Figure 2 plot: normalized Fisher information J/J(ζ = 0) versus baseline activity level ζ (0 to 1), for additive noise and for Poissonian noise with D = 1, 2, 3, 10.]

Figure 2: Normalized Fisher information as a function of baseline firing level ζ for additive gaussian noise (dashed) and Poissonian noise (solid) for different numbers of encoded features. The additive noise result, equation 3.4, does not depend on D, while encoding accuracy rapidly decreases with ζ in the Poissonian noise case for higher D.
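The qualitative $D = 1$ trend of Figure 2 can be reproduced with a direct sum over a dense one-dimensional population. This is a sketch under illustrative assumptions (unit tuning widths, a grid of tuning-curve centers, Poisson variance $\tau f$ versus a constant variance), not the computation used for the figure:

```python
import math

def normalized_fisher(zeta, model, F=100.0, tau=1.0):
    # dense 1-D population, unit tuning widths, stimulus at x = 0;
    # J ~ sum_k tau^2 f_k'(0)^2 / Q_kk with Q_kk = tau*f_k (Poisson)
    # or a constant (additive noise)
    def fisher(z):
        total = 0.0
        for i in range(-240, 241):      # tuning-curve centers on a fine grid
            xi = 0.05 * i               # rescaled distance (0 - c_k) / sigma
            g = math.exp(-xi * xi / 2)
            f = F * z + F * (1 - z) * g          # eq. 2.3
            df = -F * (1 - z) * xi * g           # derivative at x = 0
            q = tau * f if model == "poisson" else 1.0
            total += tau ** 2 * df ** 2 / q
        return total
    return fisher(zeta) / fisher(0.0)
```

In this sketch the additive curve equals $(1 - \zeta)^2$ exactly, matching equation 3.4, while the Poissonian curve lies strictly below it, as in Figure 2.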
Hence, unless $\alpha = 0$, the second term $\propto \alpha^2 A_w(1, D)$ in the approximation 3.3 goes to infinity as the noiseless limit $\zeta = 0$ is approached. This result is related to the fact that the gaussian approximation, equation 2.6, for the spike count statistics is invalid if there is no baseline firing activity. The problem is absent if additive ($\alpha = 0$) gaussian noise is assumed. In this case, equation 3.3 becomes

$$J_{ij}(x) = \delta_{ij}\, \gamma \left\langle \frac{\prod_{l=1}^{D} \sigma_l}{\sigma_i \sigma_j} \right\rangle \frac{(\tau F)^2\, \pi^{D/2}\, (1 - \zeta)^2}{2a}, \tag{3.4}$$
as shown in section A.2. Therefore, there are only two situations that can be analyzed analytically: equation 3.4 for additive gaussian spike count noise and the case of a Poissonian spike count distribution (Wilke & Eurich, 2001). These two cases are compared with respect to their robustness to background noise in Figure 2. As expected, Fisher information is a monotonically decreasing function of the relative level of baseline activity $\zeta$. The plot shows that the susceptibility to background noise is qualitatively different in the two models. While the decrease does not depend on $D$ at all in the additive noise case, it becomes stronger and stronger in the Poissonian
noise case as the number of encoded features increases. Thus, a population with additive gaussian spike count noise is relatively robust with respect to baseline firing, while a Poissonian spike count distribution implies a coding accuracy that quickly declines with $\zeta$, especially if many features are to be encoded. The observation that higher values of $\alpha$ lead to higher sensitivity to background noise is confirmed by numerical calculations on populations of neurons. Using Poissonian behavior for spike counts lower than a certain threshold to avoid the divergence problems for $\alpha > 0$, it turns out that additive noise, $\alpha = 0$, is the only case in which the decrease of Fisher information does not depend on $D$, while for $\alpha > 0$, the susceptibility to background noise at higher $D$ gradually increases with $\alpha$ (data not shown). This result can be understood intuitively as follows. Fisher information for single neurons is equal to the square of the derivative of the tuning function divided by the variance (Seung & Sompolinsky, 1993). This implies that the neurons with maximum Fisher information are at distances from the stimulus that increase with $\alpha$ (see the arrows in Figure 1). On the other hand, neurons activated by a stimulus far away from their tuning curve center are most vulnerable to background noise. Their relative loss of Fisher information, $(dJ/d\zeta)/J$, with increasing $\zeta$ is larger than for neurons closer to the tuning curve center. This explains why Poissonian noise (corresponding to $\alpha = 1/2$) is more affected by baseline firing than additive noise ($\alpha = 0$). The $D$-dependence in Figure 2 can be understood from the following argument. As the dimension of the stimulus space, $D$, increases, a receptive field defined by the tuning curve in this space has an increasing percentage of its volume close to its surface (i.e., far away from its center). This mathematical phenomenon is also known from statistical physics (Callen, 1985).
It implies that for large D, a stimulus has a marginal position in most receptive fields, which results in a mean spike count that is small compared to τF and therefore most vulnerable to background noise. This is especially disadvantageous for large α and Poissonian noise, where most of the Fisher information is carried by neurons far away from the tuning curve center. Thus, the higher D, the more neurons with small mean spike count contribute to the Fisher information, making the code more and more susceptible to baseline firing. Next to the exponentially growing number of neurons required to cover a high-dimensional stimulus space, this could be another reason that neurons in biological systems are usually specialized in only a few stimulus features.

4 Effect of Noise Correlations on the Encoding Accuracy
Correlations in neural systems include noise correlations and signal correlations. While signal correlations, which are equivalent to tuning curve overlap, are captured by the parameter g in the present framework, noise correlations (correlated variability) can be quantified by the nondiagonal elements of the covariance matrix Q or, equivalently, by the normalized
Representational Accuracy of Stochastic Neural Populations
(Pearson) correlation coefficient defined by q_kl(x) := Q_kl(x) / √(Q_kk(x) Q_ll(x)) for k, l = 1, . . . , N. Several studies have found noise correlations in a wide variety of neural systems, for example, in monkey areas IT (Gawne & Richmond, 1993), MT (Zohary, Shadlen, & Newsome, 1994), motor cortex (Lee et al., 1998; Hatsopoulos, Ojakangas, Paninski, & Donoghue, 1998; Maynard et al., 1999), and parietal areas (Lee et al., 1998), as well as in cat visual cortex (Ghose, Ohzawa, & Freeman, 1994; DeAngelis, Ghose, Ohzawa, & Freeman, 1999) and LGN (Dan, Alonso, Usrey, & Reid, 1998). While some experiments have directly or indirectly shown that correlations carry stimulus information, the origin, the precise functional role, and the readout mechanisms (e.g., Feng & Brown, 2000) of these correlations are unclear. As a further complication, they certainly depend on the species and the brain area, but may also depend on the state of the animal (Cardoso de Oliveira, Thiele, & Hoffmann, 1997; Steinmetz et al., 2000). Theoretically, this situation is particularly unsatisfying, since the exact type of correlation assumed greatly influences the theoretical predictions. In consequence, there has been a controversial discussion of the role that correlated variability may play in population coding. If the readout consists of a simple averaging over the population output, positive correlations pose severe limits on the encoding accuracy. It was therefore suggested that correlations in neural systems are harmful in general (Zohary et al., 1994; Shadlen & Newsome, 1998). However, the situation is more complicated if a distributed population code is considered, that is, a code in which the neurons have different preferred stimuli. In this case, the effect of correlations critically depends on the structure of the covariance matrix.
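As a concrete illustration of the normalized coefficient defined above (a minimal numpy sketch, not from the original paper), the Pearson matrix is obtained from any covariance matrix by dividing each entry by the geometric mean of the corresponding diagonal entries:

```python
import numpy as np

def correlation_matrix(Q):
    """Normalized (Pearson) coefficients q_kl = Q_kl / sqrt(Q_kk * Q_ll)."""
    d = np.sqrt(np.diag(Q))
    return Q / np.outer(d, d)

# Toy covariance matrix: variances 4 and 1, covariance 1.
Q = np.array([[4.0, 1.0],
              [1.0, 1.0]])
q = correlation_matrix(Q)
# Off-diagonal coefficient: 1 / sqrt(4 * 1) = 0.5; diagonal entries are 1.
```

The diagonal of the result is always 1, so all stimulus dependence of the noise strength itself is removed and only the correlation structure remains.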
For example, uniform positive correlations across the population increase the achievable encoding accuracy (Abbott & Dayan, 1999; Zhang & Sejnowski, 1999), while correlations of limited range can have the opposite effect (Johnson, 1980; Snippe & Koenderink, 1992b; Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999). In the following, we attempt to clarify this issue by using a covariance matrix that incorporates both uniform and limited-range noise correlations and therefore interpolates between the cases discussed in the literature.

4.1 Model of Noise Correlations. Uniform and limited-range noise correlations have been shown empirically to coexist. For example, a recent study of the primary motor cortex of primates reports both fixed-strength correlations and correlations that depend on the similarity of the preferred stimuli (Maynard et al., 1999). Thus, we choose the covariance matrix

Q_kl(x) = {δ_kl + (1 − δ_kl) [q + b exp(−|c^(k) − c^(l)| / L)]} × a f_k(x)^α f_l(x)^α,   (4.1)

for k, l = 1, . . . , N, where the part in curly brackets is the normalized correlation coefficient q_kl(x) defined above. The parameters 0 ≤ q < 1 and b govern
the strengths of uniform and limited-range correlations in the population, and L determines the length scale in stimulus space over which neurons are correlated. The covariance matrix, equation 4.1, interpolates between the additive (α = 0, b = 0) and multiplicative (α = 1, b = 0) noise models considered by Abbott and Dayan (1999) and also incorporates the limited-range correlations treated by Snippe and Koenderink (1992b) and Abbott and Dayan (1999) as a special case, if equidistant tuning curves in D = 1 dimensions are assumed. For b = 0, it can be viewed as a special case of equations 4.2 and 4.3 of Zhang and Sejnowski (1999).

4.2 Uniform Correlations. In two special cases, equation 4.1 allows for an analytical solution for the Fisher information. First, consider a uniform correlation strength q ≥ 0 without limited-range contributions (b = 0), so that each neuron is positively correlated with all others. As shown in section A.1, one finds in the large-N limit
J_ij(x) = τ² / (a(1 − q)) [F_ij(x; α) − G_ij(x; α)] + α² / (1 − q) [(2 − q) F_ij(x; 1) − q G_ij(x; 1)]   (4.2)
in this case, with the abbreviations F_ij and G_ij from equations A.5 and A.6. In the continuous case, where G_ij = 0, the only corrections for q > 0 (as compared to q = 0) are two monotonically increasing factors, 1/(1 − q) and (2 − q)/(1 − q). Thus, uniform positive correlations always increase the coding accuracy regardless of the values of the other system parameters. Intuitively, this result can be understood by considering the possibility of subtracting the noise if the correlations are strong (Abbott & Dayan, 1999). Hence, for uniform correlation strength, the mean-variance exponent α does not lead to new effects as compared to independent neurons. This is not the case for limited-range correlations, as we will show.

4.3 Limited-Range Correlations. Pure limited-range correlations, obtained by setting q = 0 and b = 1 in equation 4.1, may also be treated analytically if D = 1 and equidistant tuning curves of density g are assumed. In this case, one finds for the Fisher information (see section A.3)
J_11(x) = τ²/a [ (1 − r)/(1 + r) F_11(x; α) + r/(1 − r²) H(x; α) ] + α² [ 2F_11(x; 1) + (1 + r²/(1 − r²)) H(x; 1) ],   (4.3)
where H(x; α) is defined in equation A.16, and r := exp[−1/(Lg)]. In the case of additive noise (α = 0), the second term in equation 4.3 vanishes,
and equation 4.10 of Abbott and Dayan (1999) is recovered. Regardless of α, their conclusion that limited-range correlations decrease the Fisher information remains valid. For large N, the F_11 terms of equation 4.3 dominate. Because of the factor 1 − r, Fisher information decreases with increasing r and, equivalently, with increasing correlation length L. This behavior is reversed for large L, where the H terms ∝ 1/(1 − r²) constitute the dominating contribution to the Fisher information. Apart from this decrease with L, limited-range correlations of the type described by equation 4.1 can have another negative effect on the encoding properties of a population. For uniform correlations, Fisher information increased linearly with the number of neurons (see equations 3.2 and 4.2). Thus, an increase in the number of neurons, or, equivalently, an increase in the density of tuning curves g, resulted in a proportional increase of Fisher information. In equation 4.3, increasing g implies r = exp[−1/(Lg)] ≈ 1 − 1/(Lg) at g ≫ L⁻¹, which leads to the asymptotic form of equation 4.3:

J_11(x) = τ²/a [ (1/(2Lg)) F_11(x; α) + (Lg/2) H(x; α) ] + α² [ 2F_11(x; 1) + (Lg/2) H(x; 1) ].   (4.4)

The fact that F_11 ∝ g and H ∝ g⁻¹ in this limit leads to the following conclusions: For α = 0, Fisher information fails to scale linearly with g but saturates at a finite, g-independent limiting value (Yoon & Sompolinsky, 1999; Abbott & Dayan, 1999). However, the F_11(x; 1) term increases linearly with g. Thus, the statement that J_11 becomes independent of g for large populations is correct only for additive noise, α = 0 (see Figure 3). Hence, for noise models other than additive noise, Fisher information increases linearly with the number of neurons, even in the presence of limited-range correlations.
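The contrast between the saturating additive case and the growing Poisson-like case can be checked with a small numerical sketch. This is not the paper's computation: the tuning curves, parameter values, and the gaussian-model Fisher formula J = m'ᵀQ⁻¹m' + ½Tr[(Q⁻¹Q')²] (means m, stimulus-dependent covariance Q) are illustrative assumptions.

```python
import numpy as np

def population(N, x, sigma=0.1, tau_f=100.0, base=0.1):
    """Mean spike counts of N equidistant gaussian tuning curves at stimulus x."""
    c = (np.arange(N) + 0.5) / N          # tuning curve centers in [0, 1]
    m = tau_f * (base + (1 - base) * np.exp(-(x - c) ** 2 / (2 * sigma ** 2)))
    return m, c

def covariance(m, c, alpha, L=0.05):
    """Limited-range covariance Q_kl = exp(-|c_k - c_l| / L) m_k^alpha m_l^alpha."""
    corr = np.exp(-np.abs(c[:, None] - c[None, :]) / L)
    v = m ** alpha
    return corr * np.outer(v, v)

def fisher(N, alpha, x=0.5, dx=1e-4):
    """Gaussian-model Fisher information, including the variance (trace) term."""
    m, c = population(N, x)
    m_p, _ = population(N, x + dx)
    m_m, _ = population(N, x - dx)
    dm = (m_p - m_m) / (2 * dx)           # derivative of the means (finite diff)
    Q = covariance(m, c, alpha)
    dQ = (covariance(m_p, c, alpha) - covariance(m_m, c, alpha)) / (2 * dx)
    A = np.linalg.solve(Q, dQ)
    return dm @ np.linalg.solve(Q, dm) + 0.5 * np.trace(A @ A)

# Relative growth of J when the population is enlarged: close to 1 for
# additive noise (J saturates), larger for Poisson-like variance scaling.
growth_additive = fisher(200, 0.0) / fisher(50, 0.0)
growth_poisson = fisher(200, 0.5) / fisher(50, 0.5)
```

For the exponent 1/2, the stimulus dependence of the variances contributes information through the trace term, which keeps growing with N even though the neighboring neurons' mean responses are strongly correlated.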
Since the empirically found values for α are in the vicinity of 1/2, this result casts doubt on the conclusion that limited-range correlations generally impose an upper bound on the encoding accuracy for large populations. However, the increase of Fisher information associated with the F_11(x; 1) term in equation 4.4 may be hard to exploit in a biologically plausible network (Deneve, Latham, & Pouget, 1999).

4.4 Interaction of Uniform and Limited-Range Correlations. A new issue that can be studied with the general covariance matrix, equation 4.1, concerns the interactions between limited-range and uniform correlations. A qualitative prediction in this context was given by Oram, Földiák, Perrett, and Sengpiel (1998). They argued that for neurons with positive signal correlation, that is, overlapping tuning curves, negative noise correlation should be favorable because this would allow for a better averaging. At the same time, neurons with negative signal correlation should be positively correlated to avoid reconstruction errors. Hence, a situation that should be
[Figure 3 plot: Fisher information J/J(N=100) versus population size N, with curves for additive noise (α = 0), Poissonian noise (α = 1/2), and multiplicative noise (α = 1).]
Figure 3: Effect of limited-range correlations on the encoding accuracy of a neural population. The plot shows Fisher information as a function of the number of neurons N for different α (g = N/(10σ), D = 1, L = σ, τF = 100, f = 0.1). The population is limited to about 100 degrees of freedom, that is, lim_{N→∞} J(N) ≈ J(N = 100), only in the additive noise case.
especially disadvantageous is the simultaneous presence of positive uniform and limited-range correlations, yielding high noise correlations for the neurons with positive signal correlation. On the other hand, negative limited-range correlations paired with positive uniform correlations should increase the encoding accuracy of the population. To test this prediction, the population Fisher information was calculated for a fixed positive uniform correlation level (q = 0.5) as a function of sign and strength of an additional limited-range correlation contribution, as specified by the parameter b. The result is shown in the inset of Figure 4. For positive short-range correlations, the encoding becomes worse, while it improves for negative short-range correlations. Analytically, this effect can be illustrated by the expansion of Fisher information for small b and r (the latter measuring the correlation range) carried out in section A.4. In leading order, one finds

J_11(x) = J⁰_11(x) − 2br/(1 − q)² { τ²/a [F_11(x; α) − G_11(x; α)] + α² q [F_11(x; 1) − G_11(x; 1)] },   (4.5)
[Figure 4 plot: uniform correlation strength q versus correlation range r, for N = 50 and N = 100; inset: J/J(b = 0) versus b.]
Figure 4: Uniform correlation strength q required to compensate the loss of Fisher information due to limited-range correlations as a function of r = exp(−1/(Lg)). Parameters were α = 1/2, g = N/(10σ), τF = 100, f = 0.1, D = 1, b = 1 − q. Inset: Fisher information for a population with both uniform and limited-range correlations as a function of the coefficient of the limited-range contribution, b, for fixed q = 1/2. For b > 0, limited-range correlations add to the uniform part, while for b < 0, limited-range correlations are negative and weaken the correlations locally. Parameters were N = 100, g = 10/σ, D = 1, α = 1/2, L = 0.03σ, τF = 100, f = 0.1.
where J⁰_11 is the Fisher information for purely uniform correlations, b = 0. Since F_11 ≥ G_11 by the Cauchy-Schwarz inequality, the term in curly brackets is always positive. Hence, positive limited-range correlations decrease the coding accuracy, whereas negative limited-range correlations increase the coding accuracy, in agreement with Oram et al. (1998). While the expansion formula, equation 4.5, is valid only for r ≪ 1, the trend that uniform and limited-range correlations counteract each other continues to hold for larger r. Figure 4 demonstrates this result. The plot shows the level q of uniform correlations that is necessary to compensate the negative effect of limited-range correlations of given range r. This quantity was determined by choosing a fixed r and increasing the level of uniform correlations q, while simultaneously decreasing the strength of limited-range correlations (b = 1 − q), until the Fisher information reached the value obtained in the uncorrelated case (q = 0, b = 0). The initial increase with r reflects the growing loss of Fisher information with r that has to be compensated.
Eventually, limited-range correlations also increase the Fisher information, since the case of uniform correlations is approached, so that the curve approaches zero again for large r. The right part of the plot depends on N, since limited-range correlations include terms that do not scale linearly with the number of neurons (see equation 4.3).

4.5 Uniform Positive Correlations with Jitter. An especially conspicuous feature of experimentally obtained correlation coefficients is their large and seemingly random variability. For example, Maynard et al. (1999) find correlation coefficients ranging from −0.9 to 0.9 in a population of MI neurons, and Zohary et al. (1994) report a similarly large span of values in monkey MT. Despite the fact that this large variability appears to be a general finding in studies of correlations in neural systems, there have been no attempts to relate it to the population's encoding performance. In the following, the effect of nonuniform correlations is studied by using a covariance matrix with entries that contain a stochastic component. It will turn out that, on average, jitter in the correlation coefficient increases the Fisher information. The idea that random covariance matrices could lead to a better encoding performance is based on equation 2.7, which shows that the smallest eigenvalues of Q yield the largest contribution to Fisher information. As the smallest eigenvalue, λ_min, approaches zero, J_ij(x) increases more than proportionally and diverges at λ_min = 0. Random matrices have eigenvalues that are distributed around the eigenvalues of the corresponding "mean" matrix. Thus, it is expected that eigenvalues smaller than those of the corresponding covariance matrix without jitter occur and lead to an increase of average Fisher information. To illustrate this argument, we consider a population in which the correlations between pairs of neurons are normally distributed with mean q. The corresponding covariance matrix reads
Q_ij(x) = q_ij a f_i(x)^α f_j(x)^α,   i, j = 1, . . . , N,   (4.6)
where q_ii = 1 for i = 1, . . . , N. The nondiagonal elements q_ij for i < j are drawn from a gaussian distribution with mean q and standard deviation s, while the symmetry of Q_ij requires q_ji = q_ij. For small s, the covariance matrix, equation 4.6, can be regarded as the uniform-correlation case from section 4.2 (see equation 4.1, with b = 0) with a small perturbation of order s². Expanding the inverse of the covariance matrix in terms of s² and neglecting terms of higher order leads to an expansion formula for the Fisher information derived in section A.5,

J_ij(x) = J⁰_ij(x) + s²N/(1 − q)³ { τ²/a [F_ij(x; α) − G_ij(x; α)] + α² [F_ij(x; 1) − G_ij(x; 1)] }.   (4.7)
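The average benefit of jittered correlation coefficients can be illustrated numerically. The following sketch uses the covariance form of equation 4.6 with a toy gaussian population; all parameter values are illustrative assumptions, and only the mean-derivative term of the Fisher information is evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 50, 0.5

# Toy one-dimensional population (gaussian tuning; values illustrative).
x, centers, sigma, tau_f = 0.5, np.linspace(0.0, 1.0, N), 0.1, 100.0
m = tau_f * np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))
dm = m * (centers - x) / sigma ** 2        # analytic derivative of the means

def fisher(q_matrix, alpha=0.5):
    """Mean-term Fisher information dm^T Q^{-1} dm, Q_ij = q_ij m_i^a m_j^a."""
    v = m ** alpha
    Q = q_matrix * np.outer(v, v)
    return dm @ np.linalg.solve(Q, dm)

def jittered(s):
    """Symmetric correlation matrix, off-diagonal entries drawn from N(q, s^2)."""
    u = np.triu(rng.normal(q, s, size=(N, N)), k=1)
    return u + u.T + np.eye(N)

J_uniform = fisher(np.full((N, N), q) + (1 - q) * np.eye(N))
J_jitter = np.mean([fisher(jittered(0.015)) for _ in range(400)])
# On average, jitter in the correlation coefficients increases J (cf. eq. 4.7).
```

The increase in the sample mean reflects the operator convexity of the matrix inverse: averaging Q⁻¹ over random perturbations can only exceed the inverse of the mean covariance matrix.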
[Figure 5 plot: Fisher information J/J(s = 0) versus SD s of the correlation coefficient, comparing the analytical approximation with the numerical result ± standard deviation.]
Figure 5: Variability in the correlation strength increases coding accuracy. Mean Fisher information (solid) and analytical formula 4.7 (dashed) as a function of the standard deviation (SD) s of the correlation coefficient distribution. Parameters: N = 50, g = 5/σ, D = 1, α = 1/2, q = 1/2, τF = 100, f = 0.1, 10,000 samples.
For D = 1, the correction term is always positive. Hence, for a given mean correlation coefficient q, a population of neurons with noisy correlation coefficients performs better on average than one with uniform correlations. This effect is demonstrated by a numerical simulation in Figure 5, where the mean Fisher information of 10,000 covariance matrices is plotted as a function of the standard deviation of the correlation coefficient. Figure 5 also shows that the range of validity of the expansion 4.7 is quite small. This is due to the N-dependence of the smallest eigenvalue, which is given by λ_min = 1 − q − 2√N s in this case (see appendix A.5). For N = 50 and q = 0.5, this implies that the mean Fisher information must diverge around s_max = (1 − q)/(2√N) ≈ 0.035, explaining why the quadratic approximation, equation 4.7, becomes invalid as soon as s ≈ 0.015. The numerical calculations indicate that the increase in Fisher information is even more pronounced than predicted by the expansion formula. The result that correlations with jitter increase the Fisher information can be interpreted as follows. A zero eigenvalue in the covariance matrix indicates the existence of a linear combination of single-neuron activities that is completely noiseless. This situation of course cannot be achieved in realistic systems. Thus, there must be limits to the choice of the covariance matrix in biological systems. However, the form of these restrictions remains unclear and has not yet been the subject of empirical investigation.
This lack of knowledge makes a systematic optimization of the covariance matrix by theoretical analysis impossible. One may therefore view the introduction of jitter in the correlation coefficients as a means of randomly exploring the space of covariance matrices, which, according to the above, yields better average encoding accuracy as compared to the case of uniform correlation strength. The range of validity of this argument is limited by the fact that the matrices eventually leave the admissible set. If there is a biological cost associated with q, for example, if the correlations are attained through (expensive) lateral connections, equation 4.7 states that random correlations lead to a better encoding performance than uniform correlations at a given biological cost. From this point of view, it appears natural for neural populations to introduce variability in the correlations between neurons. Theoretically, this result shows that uniform correlations are disadvantageous in an estimation-theoretical sense: the existence of noise in the correlation coefficients is likely to improve the representational capacity of the population.

5 Effect of Tuning Widths on the Encoding Accuracy
In the analysis of neural coding strategies, one major question has always been whether narrow tuning or broad tuning is advantageous for the representation of a set of stimulus features. Both cases are encountered empirically. For example, in the human retina, small receptive fields of diameters less than 1 degree of visual angle have been found (Kuffler, 1953), while the early visual system of tongue-projecting salamanders contains neurons with receptive field diameters of up to 180 degrees (Wiggers, Roth, Eurich, & Straub, 1995). Further examples of broadly tuned neurons can be found in the auditory system of the barn owl (Knudsen & Konishi, 1978) and in the motor cortex of primates (Georgopoulos, Schwartz, & Kettner, 1986). On the theoretical side, the situation is also inconclusive. Arguments have been put forward in support of small (Lettvin, Maturana, McCulloch, & Pitts, 1959; Barlow, 1972) as well as large (Hinton et al., 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992a; Eurich & Schwegler, 1997) receptive fields. It turned out that the number of encoded stimulus features, D, plays an important role in this context. For populations of binary neurons, large receptive fields are advantageous for D > 1 (Hinton et al., 1986; Eurich & Schwegler, 1997). Stochastic neurons with graded responses were found to yield optimal performance with small receptive fields for D = 1, and with large receptive fields for D > 2 (Zhang & Sejnowski, 1999). However, all previous studies have considered only the case in which all neurons have identical receptive field shapes, with the same tuning width for all stimulus features. The situation may be different for neural populations that do not exhibit these properties. Equation 3.3 demonstrates that for the model population considered in this article, Fisher information depends
on the tuning widths σ_i, i = 1, . . . , D, via the simple scaling rule J_ii ∝ ⟨σ_i⁻² ∏_{l=1}^{D} σ_l⟩, where ⟨. . .⟩ denotes the average over the tuning width distribution P_σ defined in equation 2.4. Thus, in the current context, the analysis of more complex strategies of tuning widths as specified by P_σ can be carried out separately from the other parameters. For instance, the conclusions drawn in the following are independent of the level of background noise f and the maximum firing rate F and are also unaffected by the existence of noise correlations (of any type) within the population. While arbitrary P_σ could be considered in principle to optimize the population's representational capacity, there is no clear picture of the biological restrictions on this function. Thus, the following section deals with three special cases that illustrate the expected effects.

5.1 Identical Tuning Widths Across the Population. A particularly simple choice for the tuning widths is to choose one width σ̄_i for each of the D dimensions, that is,
P_σ(σ_1, . . . , σ_D) = ∏_{i=1}^{D} δ(σ_i − σ̄_i).   (5.1)
This choice is schematically shown in Figure 6b for D = 2. In this case, one finds for the average population Fisher information

J_ii ∝ (∏_{l=1}^{D} σ̄_l) / σ̄_i².   (5.2)
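The scaling of equation 5.2 can be evaluated directly; in the following hypothetical numerical example (not from the paper), unequal widths at fixed receptive-field volume trade accuracy between the two features:

```python
import numpy as np

def fisher_scaling(widths):
    """J_ii up to a common factor: (prod_l sbar_l) / sbar_i^2, from equation 5.2."""
    w = np.asarray(widths, float)
    return np.prod(w) / w ** 2

# D = 2: uniform widths versus a population specialized for feature 1.
uniform = fisher_scaling([0.1, 0.1])      # equal accuracy for both features
special = fisher_scaling([0.05, 0.2])     # same width product, feature 1 sharpened
# special boosts J_11 by a factor of 4 while J_22 drops to a quarter.
```

Because the product of the widths (the receptive-field volume) is held at 0.01 in both cases, the gain for feature 1 is paid for entirely by feature 2.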
From equation 5.2, it becomes immediately clear that the representational accuracy of the ith stimulus feature depends on i; the accuracies for the stimulus features need not be equal. Thus, the population can be "specialized" by choosing the tuning widths σ̄_i such that a particularly high accuracy results for only a subset of the represented features (Eurich & Wilke, 2000). Note that equation 5.2 includes the situation of identical tuning widths for all features as a special case (see Figure 6a). For σ̄_1 = · · · = σ̄_D = σ̄, one has the scaling behavior J_ii ∝ σ̄^{D−2}, a result already obtained by Zhang and Sejnowski (1999). Thus, in this case, narrow tuning curves are favorable for D = 1, and broad tuning curves improve the encoding accuracy for D > 2. For D = 2, Fisher information does not depend on the tuning width at all. An important limitation to this simple behavior exists for extremely narrow tuning. Assuming a fixed number of neurons, the tuning curves will eventually become too narrow to cover the stimulus space without gaps as σ̄ → 0. The unrepresented areas appearing in the stimulus space are disastrous for the stimulus-averaged estimation error. Thus, the σ̄^{D−2} scaling behavior of Fisher information will, for σ̄ → 0, eventually cross over
Figure 6: Visualization of tuning width distributions with (c, d) and without (a, b) intrapopulation variability for the case D = 2. (a) Uniform tuning width σ̄ for all features. (b) Separate tuning width σ̄_i for feature i, σ̄_1 ≠ σ̄_2. (c) Same as a, but σ̄ may vary from neuron to neuron. Here, σ̄ is uniformly distributed within a range b. (d) Tuning widths uniformly distributed in a (hyper-)cuboid of edge lengths b_1, . . . , b_D around the mean σ̄_1, . . . , σ̄_D.
into a sharp decrease. This implies the existence of an optimal value for σ̄ in the case D = 1 and, by the same argument, of an optimal tuning width σ̄_i if only this tuning width is varied and D > 1 (Eurich & Wilke, 2000). Note that there are no optimal tuning widths for D ≥ 2, where Fisher information is monotonic for all values of σ̄.

5.2 Tuning Width Variability Across the Population. Experimentally measured tuning widths often exhibit a large variability across the population (see, e.g., Wiggers et al., 1995). In order to study the encoding with nonuniform tuning widths as compared to identical tuning widths as in the previous cases, the distribution
P_σ(σ_1, . . . , σ_D) = ∏_{i=1}^{D} (1/b_i) H(σ_i − (σ̄_i − b_i/2)) H((σ̄_i + b_i/2) − σ_i)   (5.3)
is considered, where H denotes the Heaviside step function. Equation 5.3 describes a uniform distribution in a D-dimensional cuboid of size b_1, . . . , b_D around the mean (σ̄_1, . . . , σ̄_D) (see Figure 6d for D = 2). Up to third order in b_i, the Fisher information for a population with this kind of variability in the tuning widths is given by the formula in Figure 6d (Eurich et al., 2000). Comparing this result to the encoding with the mean receptive field
sizes given by equation 5.2, one finds that variability in the tuning width b_i increases the encoding accuracy for feature i, while leaving the Fisher information of the other features unaffected. This result can also be understood by noting that J_ii is linear in σ_j (for j ≠ i) but a convex function of σ_i. Consequently, for feature i, one gains more from the smaller tuning widths in the σ_i distribution than one loses from the larger ones, implying an increase of Fisher information J_ii with b_i. The two effects cancel each other for the other tuning widths, so that J_ii does not depend on b_j (j ≠ i). In contrast to equation 5.3, Zhang and Sejnowski (1999) consider the situation of a correlated variability of tuning widths. In their model, the tuning curves are always assumed to be radially symmetric: the tuning widths in all dimensions are identical (σ_i = σ̄; see Figure 6c). For a distribution of tuning widths with this restriction, one finds an average population Fisher information ∝ ⟨σ̄^{D−2}⟩, so that variability in σ̄ does not improve the encoding for D = 2 or D = 3. In summary, at a given mean tuning width, a neural population with variability in the tuning widths yields a more precise representation than a population with uniform tuning widths. This statement holds for an arbitrary number of encoded features D. It fails, however, at the optimal tuning width introduced in the previous section. Here, any deviation from the optimal value reduces the encoding accuracy. However, we argue that biologically plausible neural populations do not use this mathematically optimal tuning width, since it is associated with an extremely small overlap of tuning functions. The failure of a single neuron would leave an unrepresented gap in the stimulus space. In the case of larger redundancy, the above calculation is valid, and jitter in the tuning widths improves the encoding accuracy.
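The convexity argument can be made concrete for D = 1, where the average in equation 5.2 reduces to ⟨1/σ⟩: by Jensen's inequality, spreading the widths uniformly around the mean can only increase this average. A minimal sketch (illustrative values, not from the paper):

```python
import numpy as np

# Uniform spread of width b around the mean tuning width sbar.
sbar, b = 0.1, 0.05
sigmas = np.linspace(sbar - b / 2, sbar + b / 2, 100001)

# <1/sigma> over the uniform distribution; convexity of 1/sigma implies
# this exceeds 1/sbar = 10 (exact value: (1/b) * ln((sbar+b/2)/(sbar-b/2))).
mean_inverse = np.mean(1.0 / sigmas)
```

Here the exact average is 20 ln(5/3) ≈ 10.22 > 10, so the population with width jitter carries more Fisher information at the same mean tuning width.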
Hence, our result gives a theoretical argument for the large tuning width variability found in biological systems.

5.3 Specialized Subpopulations. Finally, a more complicated choice of tuning widths is discussed to demonstrate that uniform tuning properties across the population are not optimal. The population is split up into D > 1 subpopulations that differ in their tuning properties. Starting from a uniform tuning width, σ̄, for all neurons and features, the ith tuning width in the ith subpopulation is set to λσ̄, with a parameter λ > 0. The other tuning widths of the subpopulation are adjusted such that the receptive field volume in stimulus space ∝ ∏_{k=1}^{D} σ_k remains constant. Mathematically, this formation of subpopulations corresponds to the tuning width distribution
P_σ(σ_1, . . . , σ_D) = (1/D) Σ_{i=1}^{D} [ δ(σ_i − λσ̄) ∏_{j≠i} δ(σ_j − λ^{1/(1−D)} σ̄) ].   (5.4)
Note that for λ = 1, the population is uniform: all neurons have tuning width σ̄ for all stimulus dimensions. For λ ≠ 1, the population is split up
[Figure 7 plot: Fisher information J/J(λ = 1) versus subpopulation asymmetry λ, with curves for D = 2, 3, 4, and 10.]
Figure 7: Fragmentation into D subpopulations. Fisher information increase due to splitting of the population into D subpopulations from equation 5.5, as a function of the splitting parameter λ.
into D subpopulations; in subpopulation i, σ_i is different from the other tuning widths σ_j. The Fisher information for this population is given by

J_ii ∝ [(D − 1) λ^{2D/(D−1)} + 1] / (Dλ²).   (5.5)
It does not depend on i because of the symmetry of the subpopulations. The λ-dependent factor that quantifies the change of encoding accuracy as compared to the uniform population is plotted in Figure 7 for different values of D. It turns out that the uniform case λ = 1 is a local minimum of Fisher information: uniform tuning properties yield the worst encoding performance within the model class. Any other value of λ, that is, any asymmetric receptive field shape for the subpopulations, leads to a more precise stimulus encoding. The increase of Fisher information for smaller λ reflects the better encoding with specialized subpopulations for smaller σ_i, as discussed in section 5.1. In this limit, the ith subpopulation exclusively encodes feature i. For λ ≫ 1, on the other hand, the ith subpopulation encodes all features but x_i. The Fisher information loss that results from decreasing the corresponding tuning widths is compensated by the increase of σ_i. The fact that the approximation 5.5 diverges for λ → 0 and for λ → ∞ is due to the idealization of assuming an infinite population size and an infinite stimulus space. Under these assumptions, the tuning curves will be
strongly deformed if λ approaches extreme values, but the number of tuning curves that cover a given point in stimulus space always remains constant. In a finite neural population, on the other hand, the coverage decreases for large or small λ for two reasons. First, for some features, the tuning curves become narrower, so that neighboring neurons in these directions cease to encode the stimulus. Second, in the other directions, where the tuning curves become broader, no new tuning curves reach the stimulus position, since the corresponding tuning curve centers would lie outside the boundaries of the population. Consequently, Fisher information eventually decreases as λ → 0 or λ → ∞ if the number of neurons is finite. The value of λ at which this effect sets in depends on the initial tuning width σ̄. The larger σ̄, the sooner a considerable fraction of receptive fields will lie outside the stimulus space. A second expected effect is that values of λ that are too large or too small will lead to unrepresented gaps in stimulus space, thus yielding a breakdown of encoding quality by the mechanism described in section 5.1. In contrast to the first effect, this breakdown occurs sooner if σ̄ is smaller, since at given g, this implies a less dense coverage of stimulus space with tuning curves. These arguments are illustrated by an example in Figure 8. The initial deviation from the analytical curve can be attributed to the first effect; it is strongest for large σ̄. The drop of Fisher information of the finite population at large or small λ is due to unrepresented areas in stimulus space; it is most severe for the smallest value of σ̄. Thus, our analysis leads to the prediction that for a finite-size neural population and a bounded stimulus space, there is an optimal level of specialization in terms of tuning curve nonuniformity.

6 Conclusion
We have presented a Fisher information analysis of the representational accuracy achieved by a population of stochastically spiking neurons that encode a stimulus with their spike counts during some fixed time interval. It turned out that the type of neuronal noise, that is, the response variability, has a strong influence on the population Fisher information. It not only determines the increase of representational accuracy with increasing maximum spike count but also governs how vulnerable the encoding scheme is to a high baseline firing level. Additive gaussian noise is very resistant, while Poissonian noise is disadvantageous, especially if many stimulus features are encoded simultaneously. The effect of noise correlations was also shown to depend on the response variability. The result that limited-range correlations can severely limit the effective population size was found to be a special feature of additive gaussian noise. In addition, we analyzed the influence of random variations in the correlation coefficient. This analysis showed that configurations with uniform correlation strength are disadvantageous in the sense that, at fixed mean correlation, variations improve the encoding on average. Finally, we discussed the effects of the tuning widths on the
Stefan D. Wilke and Christian W. Eurich
Figure 8: Formation of subpopulations in a finite-size neural population; Fisher information J/J(λ = 1) is plotted against the subpopulation asymmetry λ. Analytical approximation (see equation 5.5, solid) and mean Fisher information of N = 1000 neurons uniformly distributed in the stimulus space [0, 1]² for σ̄ = 0.05 (dashed), σ̄ = 0.075 (dot-dashed), and σ̄ = 0.1 (dotted). Fisher information was calculated at the middle of stimulus space and averaged over 5 × 10⁵ choices of the tuning curve positions. The apparent symmetry is due to the fact that in equation 5.4, λ is equivalent to λ⁻¹ for D = 2. Note the logarithmic scale for λ.
encoding accuracy. The case of radially symmetric receptive fields, which attracts most attention in the literature, yielded a worse encoding accuracy than all other cases we studied. Both variations of tuning widths within the population and the fragmentation of the population into subpopulations (with respect to tuning properties) improved the representational capacity. The latter case may be the reason for the existence of neural subpopulations specializing in certain sensory features, such as in the visual system (Livingstone & Hubel, 1988). Taken together, these results lead to three main conclusions. First, the structure of neuronal noise can substantially modify the encoding properties of neural systems. Empirically, most studies have reported that the spike count variance follows a power law of the mean spike count. We showed that the theoretical representational capacity strongly depends on the power-law exponent in this case. This shows that choosing the correct response model can be critical for theoretical analysis. Second, the considerations on the parameter variability lead to the hypothesis that this variability may not simply be a by-product of neuronal diversity but could be exploited by
the neural system to achieve better encoding performance. Finally, neural populations can choose from a wide variety of strategies to optimize their tuning properties. This may even occur in a dynamical or task-dependent way (Rolls, Treves, & Tovee, 1997; Wörgötter et al., 1998). Hence, the question of optimal tuning properties may not be reduced to a simple "broad or narrow" dichotomy.

Appendix A: Fisher Information for Neural Populations with Gaussian Spike Count Statistics
In this appendix, we present the calculation of Fisher information for several variants of gaussian spike count statistics. The starting point is equation 2.7, into which the covariance matrix, equation 4.1, is inserted. An analytical form for $J_{ij}(x)$ is derived for two special cases only: uniform correlations (see section A.1) and limited-range correlations for $D = 1$ (see section A.3). On the basis of these explicit formulas, expansions can be formed (see sections A.4 and A.5). For simplicity, the $x$-dependence of the tuning curves $f_k(x)$ and of the covariance matrix $Q(x)$ is not noted explicitly in the following.

A.1 Uniform Correlations. For uniform correlations, $b = 0$, the inverse and the derivative of the covariance matrix, equation 4.1, read
$$
(Q^{-1})_{kl} = \frac{1}{a f_k^a f_l^a}\,\frac{\delta_{kl}\,(Nq+1-q) - q}{(1-q)(Nq+1-q)},
\tag{A.1}
$$

$$
\left(\frac{\partial Q}{\partial x_j}\right)_{kl} = a\left(g_k^{(j)} + g_l^{(j)}\right) Q_{kl},
\tag{A.2}
$$
where $g_k^{(j)} := \frac{1}{f_k}\frac{\partial f_k}{\partial x_j}$. Inserting this into equation 2.7, the first term of the Fisher information $J_{ij}(x)$ becomes

$$
\frac{\tau^2}{a(1-q)}\left[\sum_{k=1}^{N} h_k^{(i)} h_k^{(j)} - \frac{q}{Nq+1-q}\sum_{k,l=1}^{N} h_k^{(i)} h_l^{(j)}\right]
\tag{A.3}
$$

$$
= \frac{\tau^2}{a(1-q)}\left[F_{ij}(x;a) - \frac{qN}{Nq+1-q}\,G_{ij}(x;a)\right],
\tag{A.4}
$$

with the abbreviations $h_k^{(j)} := \frac{1}{f_k^a}\frac{\partial f_k}{\partial x_j}$ and

$$
F_{ij}(x;a) := \sum_{k=1}^{N}\frac{\partial f_k}{\partial x_i}\,\frac{1}{f_k^{2a}}\,\frac{\partial f_k}{\partial x_j} \;\longrightarrow\; \eta \times \text{const.},
\tag{A.5}
$$

$$
G_{ij}(x;a) := \frac{1}{N}\sum_{k,l=1}^{N}\frac{\partial f_k}{\partial x_i}\,\frac{1}{f_k^a f_l^a}\,\frac{\partial f_l}{\partial x_j} \;\longrightarrow\; 0.
\tag{A.6}
$$
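As a sanity check (an illustration, not part of the original derivation), the inverse formula of equation A.1 can be verified numerically against a direct matrix inversion, assuming the uniform-correlation covariance has the form $Q_{kl} = a f_k^a f_l^a [\delta_{kl} + q(1-\delta_{kl})]$:

```python
import numpy as np

# Uniform-correlation covariance (assumed form of equation 4.1 with b = 0):
# Q_kl = a * f_k^a * f_l^a * (delta_kl + q*(1 - delta_kl)).
N, q, a = 6, 0.3, 0.7
rng = np.random.default_rng(0)
f = rng.uniform(0.5, 2.0, N)                 # arbitrary positive tuning-curve values
fa = f**a
Q = a * np.outer(fa, fa) * ((1 - q)*np.eye(N) + q*np.ones((N, N)))

# Equation A.1: (Q^-1)_kl = [delta_kl*(Nq+1-q) - q] / [a f_k^a f_l^a (1-q)(Nq+1-q)].
Qinv = (np.eye(N)*(N*q + 1 - q) - q) / (a * np.outer(fa, fa) * (1 - q) * (N*q + 1 - q))
assert np.allclose(Qinv, np.linalg.inv(Q))
```

The check works because $Q$ factorizes as $D M D$ with $D = \mathrm{diag}(\sqrt{a}\,f_k^a)$ and $M = (1-q)I + q\,\mathbf{1}\mathbf{1}^T$, whose inverse is the Sherman-Morrison expression in A.1.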
The expressions on the right indicate the behavior of $F_{ij}(x;a)$ and $G_{ij}(x;a)$ in the continuous limit (see section A.2). Note that the Cauchy-Schwarz inequality implies $F_{ii}(x;a) \ge G_{ii}(x;a)$. The second term in equation 2.7 is given by

$$
\frac{a^2}{2}\sum_{k=1}^{N}\sum_{l=1}^{N}\sum_{m=1}^{N}\sum_{n=1}^{N} Q_{kl}\left(g_k^{(j)} + g_l^{(j)}\right)(Q^{-1})_{lm}\,Q_{mn}\left(g_m^{(i)} + g_n^{(i)}\right)(Q^{-1})_{nk}.
\tag{A.7}
$$
Expanding the sums and canceling one of the inverse matrices, this becomes

$$
a^2 \sum_{k,l=1}^{N} Q_{kl}\, g_k^{(i)} g_k^{(j)}\,(Q^{-1})_{lk} \;+\; a^2 \sum_{k,l=1}^{N} Q_{kl}\, g_k^{(i)} g_l^{(j)}\,(Q^{-1})_{lk}.
\tag{A.8}
$$
The first sum yields $a^2 \sum_{k=1}^{N} g_k^{(i)} g_k^{(j)} = a^2 F_{ij}(x;1)$, and the second is simplified using the identity $Q_{kl}(Q^{-1})_{lk} = [\delta_{kl}(Nq+1-2q+q^2) - q^2]\,(1-q)^{-1}(Nq+1-q)^{-1}$, which results in

$$
a^2 F_{ij}(x;1) + a^2\,\frac{Nq+1-2q+q^2}{(1-q)(Nq+1-q)}\,F_{ij}(x;1) - a^2\,\frac{q^2 N\,G_{ij}(x;1)}{(1-q)(Nq+1-q)}
\tag{A.9}
$$

$$
= a^2\,\frac{qN\left[(2-q)\,F_{ij}(x;1) - q\,G_{ij}(x;1)\right] + 2(1-q)^2\,F_{ij}(x;1)}{(1-q)(Nq+1-q)}
\tag{A.10}
$$
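The elementwise identity $Q_{kl}(Q^{-1})_{lk}$ quoted in this derivation is easy to confirm numerically, since the tuning-curve factors cancel; a hypothetical check (parameters illustrative):

```python
import numpy as np

# Check of the identity Q_kl (Q^-1)_lk used to simplify the second sum in A.8;
# only the correlation structure matters, the f-dependent factors cancel.
N, q, a = 7, 0.25, 0.8
rng = np.random.default_rng(0)
fa = rng.uniform(0.5, 2.0, N)**a
Q = a * np.outer(fa, fa) * ((1 - q)*np.eye(N) + q*np.ones((N, N)))
P = Q * np.linalg.inv(Q)                     # elementwise product (Q is symmetric)
target = (np.eye(N)*(N*q + 1 - 2*q + q**2) - q**2*np.ones((N, N))) \
         / ((1 - q)*(N*q + 1 - q))
assert np.allclose(P, target)
```

A quick consistency check is that the entries of this elementwise product must sum to $\mathrm{Tr}(QQ^{-1}) = N$.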
for the trace term of the Fisher information. Combining equations A.4 and A.10 and assuming $N \gg 1$, one arrives at equation 4.2. As a special case of equations A.4 and A.10, one finds for large uncorrelated ($q = 0$) neural populations equation 3.2, where the limit $N \gg 1$ must be taken after setting $q = 0$.

A.2 Continuous Limit. The analysis of the equations derived for the Fisher information in section A.1 is simplified considerably in the continuous limit, where it is assumed that
$$
F_{ij}(x;a) = \int d^D c\;\eta(c)\,\frac{\partial f_c}{\partial x_i}\,\frac{1}{f_c^{2a}}\,\frac{\partial f_c}{\partial x_j} \;\approx\; \eta \int d^D c\;\frac{\partial f_c}{\partial x_i}\,\frac{1}{f_c^{2a}}\,\frac{\partial f_c}{\partial x_j}
\tag{A.11}
$$

$$
= \delta_{ij}\,\eta\left\langle\frac{\prod_{r=1}^{D}\sigma_r}{\sigma_i\,\sigma_j}\right\rangle\frac{4\pi^{D/2} F^{2-2a}}{\Gamma\!\left(1+\frac{D}{2}\right)}\int_0^\infty d\xi\;\xi^{D+1}\,\frac{\psi'(\xi^2)^2}{\psi(\xi^2)^{2a}}.
\tag{A.12}
$$
In equation A.12, h. . .i denotes the average over the distribution of tuning widths, Ps , and fc is the tuning curve of a neuron with receptive eld
center at $c$. In the derivation of equation A.12, we have used the identity $\int d^D\xi\; g(\xi^2) = \frac{D\,\pi^{D/2}}{\Gamma(1+D/2)}\int_0^\infty d\xi\;\xi^{D-1}\,g(\xi^2)$ for $g(z) = \psi'(z)^2/\psi(z)^{2a}$. By a similar calculation and the fact that the tuning functions are symmetric, one finds that $G_{ij}(x;a) = 0$ in this limit. The continuous limit is reached if the distribution of neurons in stimulus space, $\eta$, is sufficiently dense and uniform (Zhang & Sejnowski, 1999). The result, equation 3.3, is identical to equation A.12, except that the integral is abbreviated as $A_\psi(a, D)$. For additive noise ($a = 0$) and gaussian tuning curves, $\psi(z) = \phi + (1-\phi)\exp(-z)$, it becomes $A_\psi(a=0, D) = (1-\phi)^2\,\Gamma(1+\frac{D}{2})/8$, so that

$$
F_{ij}(x; a=0) = \delta_{ij}\,\eta\left\langle\frac{\prod_{r=1}^{D}\sigma_r}{\sigma_i\,\sigma_j}\right\rangle\frac{\pi^{D/2} F^2 (1-\phi)^2}{2},
\tag{A.13}
$$
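The shell-integration identity used in the derivation above is standard and easy to verify numerically; the sketch below (illustrative, with $g(z) = e^{-z}$, for which the full $D$-dimensional integral equals $\pi^{D/2}$) checks it for $D = 1, 2, 3$:

```python
import numpy as np
from math import gamma, pi

# Check: int_{R^D} g(xi^2) d^D xi = [D*pi^(D/2)/Gamma(1+D/2)] * int_0^inf xi^(D-1) g(xi^2) dxi,
# with g(z) = exp(-z); the left-hand side then equals pi^(D/2).
errs = []
for D in (1, 2, 3):
    xi = np.linspace(0.0, 12.0, 600_001)
    y = xi**(D - 1) * np.exp(-xi**2)
    h = xi[1] - xi[0]
    radial = h * (y.sum() - 0.5*(y[0] + y[-1]))      # trapezoidal rule
    rhs = D * pi**(D/2) / gamma(1 + D/2) * radial
    errs.append(abs(rhs - pi**(D/2)))
assert max(errs) < 1e-6
```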
and equation 3.4 follows immediately.

A.3 Limited-Range Correlations. For limited-range correlations in one stimulus dimension, that is, $D = 1$, $q = 0$, and $b = 1$, the covariance matrix, equation 4.1, can be written as $Q_{kl} = r^{|k-l|}\,a f_k^a f_l^a$ if the tuning functions are equidistant, with $r := \exp[-1/(L\eta)]$. In this case, its inverse is given by

$$
(Q^{-1})_{kl} = \frac{1}{a f_k^a f_l^a (1-r^2)}\left[\delta_{kl}\,(1+r^2) - r\,(\delta_{k+1,l} + \delta_{k-1,l})\right],
\tag{A.14}
$$
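The tridiagonal structure of equation A.14 can be checked numerically. For a finite matrix, the exact inverse of $M_{kl} = r^{|k-l|}$ differs from A.14 only at the boundary: the two corner diagonal entries are $1$ rather than $1+r^2$. A sketch (illustrative, with the $f$- and $a$-dependent prefactor divided out):

```python
import numpy as np

# Limited-range correlation structure M_kl = r^{|k-l|} (the factor a f_k^a f_l^a
# of Q only rescales rows and columns and is omitted here).
N, r = 8, 0.6
k = np.arange(N)
M = r**np.abs(k[:, None] - k[None, :])

# Tridiagonal inverse of equation A.14 with the finite-size boundary correction:
# corner diagonal entries 1 instead of 1 + r^2.
d = np.full(N, 1 + r**2)
d[0] = d[-1] = 1.0
Minv = (np.diag(d) - r*(np.eye(N, k=1) + np.eye(N, k=-1))) / (1 - r**2)
assert np.allclose(Minv, np.linalg.inv(M))
```

For large $N$, the two boundary entries are negligible, which is the regime in which A.14 is used.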
and its derivative with respect to $x_1 = x$ is again of the form of equation A.2. Inserting this into equation 2.7, the first term of the Fisher information $J_{11}(x)$ becomes

$$
\frac{\tau^2(1+r^2)}{a(1-r^2)}\,F(x;a) - \frac{2\tau^2 r}{a(1-r^2)}\sum_{k=1}^{N-1} h_k^{(1)} h_{k+1}^{(1)},
\tag{A.15}
$$

where $F(x;a) := F_{11}(x;a)$. The sum in the second term of equation A.15 is equal to $F(x;a) - H(x;a)/2$, where

$$
H(x;a) := \sum_{k=1}^{N-1}\left(\frac{1}{f_k^a}\frac{d f_k}{dx} - \frac{1}{f_{k+1}^a}\frac{d f_{k+1}}{dx}\right)^2 \;\longrightarrow\; \eta^{-1} \times \text{const.}
\tag{A.16}
$$

Again, the expression on the right indicates the asymptotic behavior for large $N/\eta$. Hence, equation A.15 can be written as

$$
\frac{\tau^2(1-r)}{a(1+r)}\,F(x;a) + \frac{\tau^2 r\,H(x;a)}{a(1-r^2)}.
\tag{A.17}
$$
In the second term of equation 2.7, the calculation proceeds as in the uniform-correlation case (cf. equations A.7 and A.8) with $i = j = 1$, except that $Q_{kl}(Q^{-1})_{lk} = \delta_{kl}(1+r^2)/(1-r^2) - r^2(\delta_{k+1,l} + \delta_{k-1,l})/(1-r^2)$ yields

$$
a^2 F(x;1) + a^2\,\frac{1+r^2}{1-r^2}\,F(x;1) - a^2\,\frac{2r^2}{1-r^2}\sum_{k=1}^{N-1} h_k^{(1)} h_{k+1}^{(1)} = a^2\left(2F(x;1) + \frac{r^2 H(x;1)}{1-r^2}\right)
\tag{A.18}
$$
instead of equation A.10. Combining the two terms, A.17 and A.18, one finds the result, equation 4.3.

A.4 Uniform and Limited-Range Correlations. For $b \ll 1$ and $r \ll 1$, the covariance matrix, equation 4.1, can be written as $Q = R + \varepsilon S$, where $R_{kl} = [\delta_{kl} + q(1-\delta_{kl})]\,a f_k^a f_l^a$, $S_{kl} = (\delta_{k+1,l} + \delta_{k-1,l})\,a f_k^a f_l^a$, and $\varepsilon := br \ll 1$. To obtain an explicit expression for $Q^{-1}$, one iteratively applies the matrix relation $Q^{-1} = R^{-1} - \varepsilon R^{-1} S Q^{-1}$ and finds the general expansion formula

$$
(R + \varepsilon S)^{-1} = R^{-1} - \varepsilon R^{-1} S R^{-1} + \varepsilon^2 R^{-1} S R^{-1} S R^{-1} + O(\varepsilon^3).
\tag{A.19}
$$
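Expansion A.19 is a second-order Neumann series and can be validated numerically on random matrices; a sketch with an illustrative well-conditioned $R$:

```python
import numpy as np

# Second-order check of (R + eps*S)^-1 = R^-1 - eps R^-1 S R^-1
#                                        + eps^2 R^-1 S R^-1 S R^-1 + O(eps^3).
rng = np.random.default_rng(0)
N, eps = 5, 1e-2
A = rng.normal(size=(N, N))
R = A @ A.T + N*np.eye(N)                    # symmetric, well conditioned
S = rng.normal(size=(N, N)); S = (S + S.T)/2
Ri = np.linalg.inv(R)
approx = Ri - eps*(Ri @ S @ Ri) + eps**2*(Ri @ S @ Ri @ S @ Ri)
exact = np.linalg.inv(R + eps*S)
assert np.max(np.abs(approx - exact)) < 1e-4   # remainder is O(eps^3)
```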
Since the inverse of $R$ is explicitly known (see equation A.1), equation A.19 can be used to obtain the Fisher information $J_{11}(x)$ from equation 2.7 to arbitrary order in $br$. The matrix $R^{-1} S R^{-1}$ can be calculated explicitly; it yields a correction to the first term of equation 2.7 that approaches

$$
-br\,\tau^2\,\frac{d f^T}{dx}\,R^{-1} S R^{-1}\,\frac{d f}{dx} = -br\,\frac{2\tau^2\left[F(x;a) - G(x;a)\right]}{a(1-q)^2}
\tag{A.20}
$$

for large $N$. The correction to the second term requires some more work. First, note that the derivative of $Q$ is given by equation A.2. Inserting the expansion A.19 into the trace and exploiting its cyclic invariance, one finds the two corrections

$$
-br\,\mathrm{Tr}\!\left[R^{-1} S R^{-1} R' R^{-1} R'\right] + br\,\mathrm{Tr}\!\left[R^{-1} S' R^{-1} R'\right]
\tag{A.21}
$$

in leading order. Evaluating the matrix $R^{-1} S R^{-1} R' R^{-1} R'$ explicitly and taking the trace, one has for large $N$

$$
-br\,\mathrm{Tr}\!\left[R^{-1} S R^{-1} R' R^{-1} R'\right] = -br\,\frac{2a^2 q\left[F(x;1) - G(x;1)\right]}{(1-q)^2},
\tag{A.22}
$$

which grows linearly with $N$. The second trace term in equation A.21 yields

$$
-br\,\mathrm{Tr}\!\left[R^{-1} S' R^{-1} R'\right] = -br\,\frac{-8a^2 F(x;1)}{(1-q)N}
\tag{A.23}
$$
for large $N$, which approaches a constant and can therefore be neglected. Combining the equations A.20 and A.22 with the ($b = 0$) result, equation 4.3, derived in section A.3, one arrives at equation 4.5.

A.5 Uniform Correlations with Jitter. In order to analyze noisy uniform correlations of strength $q$, one assumes for the covariance matrix $Q = R + sS$, where $R$ is as above, $S_{kl} = s_{kl}\,a f_k f_l$, and the matrix elements $s_{kl}$ for $k < l$ are independent and identically distributed gaussian random variables with zero mean and unit standard deviation, while $s_{kk} = 0$. Choosing $s_{kl} = s_{lk}$ for $k > l$ ensures that $Q$ is symmetric. This yields the relations

$$
\langle s_{kl}\rangle = 0, \qquad \langle s_{kl}\,s_{mn}\rangle = (1 - \delta_{kl}\delta_{mn})\,(\delta_{km}\delta_{ln} + \delta_{kn}\delta_{lm}).
\tag{A.24}
$$
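The jitter ensemble and the moment relations A.24 can be illustrated with a small Monte Carlo sketch (parameters illustrative):

```python
import numpy as np

# Build symmetric jitter matrices S with s_kk = 0, s_kl = s_lk, and s_kl (k < l)
# i.i.d. standard normal, then check the moments of equation A.24 empirically.
rng = np.random.default_rng(0)
N, trials = 6, 100_000
iu = np.triu_indices(N, k=1)
S = np.zeros((trials, N, N))
S[:, iu[0], iu[1]] = rng.normal(size=(trials, len(iu[0])))
S = S + S.transpose(0, 2, 1)                       # symmetric, zero diagonal
assert abs(S[:, 0, 1].mean()) < 0.02               # <s_kl> = 0
assert abs((S[:, 0, 1]**2).mean() - 1.0) < 0.03    # <s_kl^2> = 1
assert abs((S[:, 0, 1]*S[:, 0, 2]).mean()) < 0.02  # distinct pairs uncorrelated
```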
The matrix expansion formula, A.19, is now employed to find the inverse of $Q$. If the standard deviation $s$ is small compared to $q$, one has for the first term of the Fisher information, equation 2.7, the correction

$$
\tau^2 s^2\,\frac{\partial f^T}{\partial x_i}\,R^{-1}\langle S R^{-1} S\rangle R^{-1}\,\frac{\partial f}{\partial x_j} = \tau^2 s^2\,\frac{N\left[F_{ij}(x;a) - G_{ij}(x;a)\right]}{a(1-q)^3},
\tag{A.25}
$$
which is computed using equation A.24 and the explicit form of $R$. Here $\langle\ldots\rangle$ denotes the average over the joint probability distribution of the matrix elements of $S$. Inserting equation A.19 into the trace term in equation 2.7 and sorting by orders of $s$, the leading-order correction is found to be

$$
\frac{s^2}{2}\left\langle\mathrm{Tr}\Big[\,2R'R^{-1}R'R^{-1}SR^{-1}SR^{-1} + R'R^{-1}SR^{-1}R'R^{-1}SR^{-1} - 2R'R^{-1}S'R^{-1}SR^{-1} - 2R'R^{-1}SR^{-1}S'R^{-1} + S'R^{-1}S'R^{-1}\Big]\right\rangle,
\tag{A.26}
$$

where the cyclic invariance of the trace was used. Inserting $R$ explicitly and making use of the relations A.24, the five trace terms become

$$
2s^2 a^2 N^2\left[\frac{(4-3q)\,F_{ij}(x;1) - q\,G_{ij}(x;1)}{(1-q)^3} + \frac{2\,G_{ij}(x;1)}{(1-q)^2} + (-2-2+1)\,\frac{F_{ij}(x;1) + G_{ij}(x;1)}{(1-q)^2}\right]
\tag{A.27}
$$
for large $N$. Combining this result with the first correction, equation A.25, one arrives at the expansion 4.7.

Appendix B: Eigenvalue Density of a Covariance Matrix with Jitter
In this section, we derive the eigenvalue density of the covariance matrix $R + sS$ discussed in section A.5. The additional factor $a f_k f_l$ that is dropped
here does not influence the appearance of a zero eigenvalue, so that the conclusion drawn in section 4.5 will hold for the full covariance matrix. The distribution of the diagonal elements of $S$ is irrelevant for the eigenvalue density in the limit of large $N$ (e.g., Mehta, 1991). Thus, the eigenvalue density of the matrix ensemble $R + sS$ remains unaltered if the diagonal elements $S_{kk}$ are turned into independent gaussian random variables with standard deviation $\sqrt{2}\,s$. In the latter case, $R + sS$ represents a deformed gaussian orthogonal ensemble (Brody et al., 1981). Its eigenvalue density, $\rho(\lambda)$, can be determined via the relations (Pastur, 1972)

$$
\rho(\lambda) = -\frac{1}{\pi}\,\mathrm{Im}[g(\lambda)], \qquad g(\lambda) = \frac{1}{N}\,\mathrm{Tr}\!\left[\frac{1}{\lambda - R - s^2 N g(\lambda)}\right].
\tag{B.1}
$$

In the example considered here, the eigenvalues of the matrix $R$ are $(N-1)q+1$, once, and $1-q$, $N-1$ times. Thus, the second equation in equation B.1 becomes
$$
g(\lambda) = \frac{1}{N}\,\frac{1}{\lambda - [(N-1)q+1] - s^2 N g(\lambda)} + \frac{N-1}{N}\,\frac{1}{\lambda - (1-q) - s^2 N g(\lambda)},
\tag{B.2}
$$

which yields the solution

$$
g(\lambda) = \frac{1}{2Ns^2}\left[\lambda - (1-q) \pm \sqrt{\left[\lambda - (1-q)\right]^2 - 4Ns^2}\right]
\tag{B.3}
$$
in the limit of large $N$, the minus sign resulting in a semicircular mean eigenvalue density (Wigner, 1957),

$$
\rho(\lambda) = \begin{cases} \dfrac{1}{2\pi N s^2}\,\sqrt{4Ns^2 - \left[\lambda - (1-q)\right]^2} & \text{for } |\lambda - (1-q)| \le 2\sqrt{N}\,s, \\[4pt] 0 & \text{otherwise.} \end{cases}
\tag{B.4}
$$

The correction to $\rho(\lambda)$ associated with the single eigenvalue $(N-1)q+1$ is of order $1/N$ and concerns larger eigenvalues $\gg 1-q$ only (except for the degenerate case $q = 0$, where equation B.3 follows from B.2 without further approximations). The minimum eigenvalue is at the left end of the semicircle described by equation B.4, that is, $\lambda_{\min} = 1 - q - 2\sqrt{N}\,s$.

Acknowledgments
We thank M. Bethge, K. Pawelzik, H. Schwegler, and M. Wezstein for helpful discussions and the anonymous reviewers for insightful comments on an earlier version of the manuscript for this article. This work was supported by Deutsche Forschungsgemeinschaft, SFB 517.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Bair, W., & O'Keefe, L. P. (1998). The influence of fixational eye movements on the response of neurons in area MT of the macaque. Visual Neuroscience, 15, 779–786.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Bethge, M., Rotermund, D., & Pawelzik, K. (2000). Optimal short-term population coding: When Fisher information fails. Manuscript submitted for publication.
Brody, T. A., Flores, J., French, J. B., Mello, P. A., Pandey, A., & Wong, S. S. M. (1981). Random-matrix physics: Spectrum and strength fluctuations. Reviews of Modern Physics, 53(3), 385–479.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Callen, H. B. (1985). Thermodynamics and an introduction to thermostatistics (2nd ed.). New York: Wiley.
Cardoso de Oliveira, S., Thiele, A., & Hoffmann, K.-P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. Journal of Neuroscience, 17(23), 9248–9260.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cramér, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Dan, Y., Alonso, J.-M., Usrey, W. M., & Reid, R. C. (1998). Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neuroscience, 1(6), 501–507.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research, 44, 437–440.
DeAngelis, G. C., Ghose, G. M., Ohzawa, I., & Freeman, R. D. (1999). Functional micro-organization of primary visual cortex: Receptive field analysis of nearby neurons. Journal of Neuroscience, 19(9), 4046–4064.
deCharms, R. C., & Zador, A. (2000). Neural representation and the cortical code. Annual Reviews of Neuroscience, 23, 613–647.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745.
Eckhorn, R., Grüsser, O.-J., Kröller, J., Pellnitz, K., & Pöpel, B. (1976). Efficiency of different neural codes: Information transfer calculations for three different neuronal systems. Biological Cybernetics, 22, 49–60.
Eurich, C. W., & Schwegler, H. (1997). Coarse coding: Calculation of the resolution achieved by a population of large receptive field neurons. Biological Cybernetics, 76, 357–363.
Eurich, C. W., & Wilke, S. D. (2000). Multi-dimensional encoding strategy of spiking neurons. Neural Computation, 12(7), 1519–1529.
Eurich, C. W., Wilke, S. D., & Schwegler, H. (2000). Neural representation of multidimensional stimuli. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 115–121). Cambridge, MA: MIT Press.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12, 671–692.
Gautrais, J., & Thorpe, S. (1998). Rate coding versus temporal order coding: A theoretical approach. BioSystems, 48, 57–65.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13(7), 2758–2771.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. Journal of Neurophysiology, 79, 1135–1144.
Ghose, G. M., Ohzawa, I., & Freeman, R. D. (1994). Receptive-field maps of correlated discharge between pairs of neurons in the cat's visual cortex. Journal of Neurophysiology, 71(1), 330–346.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. Journal of Neuroscience, 17(8), 2914–2920.
Hatsopoulos, N. G., Ojakangas, C. L., Paninski, L., & Donoghue, J. P. (1998). Information about movement direction obtained from synchronous activity of motor cortical neurons. Proceedings of the National Academy of Sciences USA, 95, 15706–15711.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. Journal of Computational Neuroscience, 2, 175–193.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Johnson, D. H., Gruner, C. M., Baggerly, K., & Seshagiri, C. (2001). Information-theoretic analysis of neural coding. Journal of Computational Neuroscience, 10, 47–69.
Johnson, K. O. (1980). Sensory discrimination: Neural processes preceding discrimination decision. Journal of Neurophysiology, 43(6), 1793–1815.
Karbowski, J. (2000). Fisher information and temporal correlations for spiking neurons with stochastic dynamics. Physical Review E, 61(4), 4235–4252.
Kay, S. M. (1993). Fundamentals of statistical signal processing: Estimation theory. Englewood Cliffs, NJ: Prentice Hall.
Knudsen, E. I., & Konishi, M. (1978). A neural map of auditory space in the owl. Science, 200, 795–797.
Kuffler, S. W. (1953). Discharge patterns and functional organization of the mammalian retina. Journal of Neurophysiology, 16, 37–68.
Laughlin, S. B., de Ruyter van Steveninck, R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nature Neuroscience, 1(1), 36–41.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18(3), 1161–1170.
Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog's eye tells the frog's brain. Proceedings of the Institution of Radio Engineers, 47, 1940–1951.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543.
Livingstone, M. S., & Hubel, D. H. (1988). Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science, 240, 740.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. Journal of Neuroscience, 19(18), 8083–8093.
Mehta, M. L. (1991). Random matrices (2nd ed.). San Diego, CA: Academic Press.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends in Neurosciences, 21, 259–265.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network: Computation in Neural Systems, 7, 365–370.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999a). Correlations and the encoding of information in the nervous system. Proceedings of the Royal Society of London B, 266, 1001–1012.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999b). On decoding the responses of a population of neurons from short time windows. Neural Computation, 11, 1553–1577.
Paradiso, M. A. (1988). A theory of the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pastur, L. A. (1972). On the spectrum of random matrices. Theoretical and Mathematical Physics, 10, 67–74. (Russian original: Teoreticheskaya i Matematicheskaya Fizika, 10(1), 102–112, 1972.)
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. E. (1999). Narrow vs. wide tuning curves: What's best for a population code? Neural Computation, 11, 85–90.
Rao, C. R. (1945). Information and accuracy obtainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
Reich, D. S., Victor, J. D., Knight, B. W., Ozaki, T., & Kaplan, E. (1997). Response variability and timing precision of neuronal spike trains in vivo. Journal of Neurophysiology, 77, 2836–2841.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in primate temporal visual cortex. Experimental Brain Research, 114, 149–162.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870–3896.
Snippe, H. P., & Koenderink, J. J. (1992a). Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66, 543–551.
Snippe, H. P., & Koenderink, J. J. (1992b). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67, 183–190.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Teich, M. C., Johnson, D. H., Kumar, A. R., & Turcott, R. G. (1990). Rate fluctuations and fractional power law noise recorded from cells in the lower auditory pathway of the cat. Hearing Research, 46, 41–52.
Teich, M. C., & Khanna, S. M. (1985). Pulse number distribution for the neural spike train in the cat's auditory nerve. Journal of the Acoustical Society of America, 77, 1110–1128.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research, 23, 775–785.
Tolhurst, D. J., Movshon, J. A., & Thompson, I. D. (1981). The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast. Experimental Brain Research, 41, 414–419.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654.
van Kan, P. L. E., Scobey, R. P., & Gabor, A. J. (1985). Response covariance in cat visual cortex. Experimental Brain Research, 60, 559–563.
Vogels, R., Spileers, W., & Orban, G. A. (1989). The response variability of striate cortical neurons in the behaving monkey. Experimental Brain Research, 77, 432–436.
White, J. A., Rubinstein, J. T., & Kay, A. R. (2000). Channel noise in neurons. Trends in Neurosciences, 23(3), 131–137.
Wiggers, W., Roth, G., Eurich, C., & Straub, A. (1995). Binocular depth perception mechanisms in tongue-projecting salamanders. Journal of Comparative Physiology A, 176, 365–377.
Wigner, E. P. (1957). Characteristic vectors of bordered matrices with infinite dimensions. II. Annals of Mathematics, 65, 203–207.
Wilke, S. D., & Eurich, C. W. (1999). What does a neuron talk about? In M. Verleysen (Ed.), Proceedings of the European Symposium on Artificial Neural Networks 1999 (pp. 435–440). Brussels: D Facto.
Wilke, S. D., & Eurich, C. W. (2001). Neural spike statistics modify the impact of background noise. Neurocomputing, 38–40, 445–450.
Wörgötter, F., Suder, K., Zhao, Y., Kerscher, N., Eysel, U. T., & Funke, K. (1998). State-dependent receptive-field restructuring in the visual cortex. Nature, 396, 165–168.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length. Neural Computation, 7, 549–564.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received September 13, 2000; accepted April 2, 2001.
LETTER
Communicated by Suzanna Becker
Supervised Dimension Reduction of Intrinsically Low-Dimensional Data

Nikos Vlassis
[email protected]
RWCP, Autonomous Learning Functions SNN, University of Amsterdam, The Netherlands

Yoichi Motomura
[email protected]
Electrotechnical Laboratory, Tsukuba Ibaraki 305-8568, Umezono 1-1-4, Japan

Ben Kröse
[email protected]
RWCP, Autonomous Learning Functions SNN, University of Amsterdam, The Netherlands

High-dimensional data generated by a system with limited degrees of freedom are often constrained in low-dimensional manifolds in the original space. In this article, we investigate dimension-reduction methods for such intrinsically low-dimensional data through linear projections that preserve the manifold structure of the data. For intrinsically one-dimensional data, this implies projecting to a curve on the plane with as few intersections as possible. We propose a supervised projection pursuit method that can be regarded as an extension of the single-index model for nonparametric regression. We show results from a toy and two robotic applications.

1 Introduction
Suppose one is given the data shown in Figure 1a. These are 200 three-dimensional points sampled from a parameterized conic-like spiral,

$$
f(s) = [(12\pi - 2s)\cos s,\; (12\pi - 2s)\sin s,\; (s - 2\pi)^2]^T,
\tag{1.1}
$$

plus gaussian noise of zero mean and unit variance. The parameter $s$ is evenly spaced in $[0, 5\pi]$, while the quadratic form of the third component makes the curve fold back on itself after $s = 2\pi$. Trying a random orthogonal projection of this data set onto the plane, one observes that in most cases the resulting curve is nonsimple: it intersects itself. A curve that does not intersect itself is called a simple curve
Neural Computation 14, 191–215 (2001)
© 2001 Massachusetts Institute of Technology
Figure 1: (a) Three-dimensional spiral data. (b) Projection on the first two principal components. (c) Projection to a simple curve with the proposed model, h_y = 0.22 (const), h_s = 0.36 (const).
(Spivak, 1979). Even projections that are optimal according to criteria like maximum variance of the projected data can lead to nonsimple curves. In Figure 1b, we plot the resulting curve after projection on the first two principal components of the data, where we note an intersection point. However, there can exist projections that lead to simple curves; one such solution is shown in Figure 1c. Although this setting might seem artificial, it can in fact be very realistic. It is often the case in real-life applications that a system generates data that lie in a high-dimensional space while the system itself has limited degrees of freedom. One example is a robot that moves along a corridor with fixed orientation and observes its environment with sensors. Although the observations are high-dimensional, they are generated by a system with a single degree of freedom (the translation of the robot) and thus can have much lower intrinsic dimensionality. In particular, assuming a smooth environment, the data must form a smooth one-dimensional manifold in the original space.
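The toy construction above can be reproduced with a short sketch (an illustration, not code from the article): generate the noisy spiral of equation 1.1 and project it on its first two principal components, as in Figure 1b.

```python
import numpy as np

# Noisy conic-like spiral of equation 1.1: 200 points, s evenly spaced in [0, 5*pi].
n = 200
s = np.linspace(0.0, 5*np.pi, n)
X = np.stack([(12*np.pi - 2*s)*np.cos(s),
              (12*np.pi - 2*s)*np.sin(s),
              (s - 2*np.pi)**2], axis=1)
X += np.random.default_rng(0).normal(size=X.shape)   # zero-mean, unit-variance noise

# Linear projection y_i = W^T x_i with W the first two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # rows of Vt: principal directions
W = Vt[:2].T                                         # d x q matrix, orthonormal columns
Y = Xc @ W                                           # projected curve (cf. Figure 1b)
```

As the article notes, this maximum-variance projection still yields a self-intersecting curve for these data; avoiding such intersections is what the proposed supervised method optimizes for.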
The embedding of such a nonlinear structure in a high-dimensional Euclidean space is, at least from a conceptual point of view, redundant, and methods that extract relevant features from the data are in order. The simplest and most interpretable feature extraction method is a linear projection, and this should be done in such a way that the topological structure of the data is preserved during the projection. Besides, there are strong statistical arguments that necessitate the extraction of features from the data. It is known that for reasonably sized data sets, any nonparametric statistical inference in high-dimensional spaces fails due to the "curse of dimensionality," meaning the sparsity of points in any neighborhood of interest. An illustrative numerical example of this behavior is given in Huber (1985). Extracting linear features from the data prior to modeling is the central theme of all projection pursuit techniques. In this article, we study the problem of dimension reduction (through linear projections) of high-dimensional data that are intrinsically low-dimensional. For clarity, we will limit our exposition to intrinsically one-dimensional data, but our method can be directly applied to higher intrinsic dimensionalities. For our definition of intrinsic dimensionality, we will adopt the concept of an inverse regression curve (Li, 1991), while our main focus will be on projections onto the plane. Our projection pursuit method is supervised in the sense that the manifold parameterization of the data is known, and thus it can be regarded as a model-free, nonadditive, projection pursuit regression method (see Ripley, 1996, for a review). However, a supervised dimension-reduction method like the one we propose can be more than just regression.
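The "curse of dimensionality" invoked above can be made concrete with a small simulation (illustrative numbers, not the example from Huber, 1985): the fraction of uniformly drawn points that fall within a fixed neighborhood of a query point collapses as the dimension grows.

```python
import numpy as np

# Fraction of uniform points in the unit cube [-1/2, 1/2]^d lying inside the
# inscribed ball of radius 1/2 around the center, for growing dimension d.
rng = np.random.default_rng(0)
n = 100_000
frac = {}
for d in (2, 10, 20):
    x = rng.uniform(-0.5, 0.5, size=(n, d))
    frac[d] = float(np.mean(np.linalg.norm(x, axis=1) < 0.5))
# d = 2: about pi/4 = 0.785; d = 10: about 2.5e-3; d = 20: the true volume
# fraction is about 2.5e-8, so typically no sample lands in the ball at all.
```

In other words, any fixed-radius neighborhood becomes essentially empty in high dimensions, which is why nonparametric inference demands a preceding dimension reduction.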
We rephrase (Li, 1991): During the early stages of data analysis, when one does not have a fixed objective in mind, the need for estimating quantities like conditional averages (regression) is not as pressing as that for finding ways to simplify the data. After projecting the data to a good feature space, we are in a better position to identify what should be pursued further: for example, model building, response surface estimation, or cluster analysis. The proposed method can be regarded as an extension of the single-index model for nonparametric regression (Härdle, Hall, & Ichimura, 1993). This is a fully nonparametric regression model for predicting a scalar variable from a high-dimensional vector. We argue, and show with a real-life example, that for intrinsically low-dimensional data the single-index model can be inadequate, and an extension of it is required. In the next section, we formalize the problem, and in section 3 we describe the proposed method and compare it with the single-index model. The important issues of optimal smoothing and optimization are considered in section 4. In section 5 we demonstrate the proposed method on two robotic applications, and in section 6 we discuss related and further research issues.

2 Intrinsic Dimensionality and Linear Projections
We start with a definition of intrinsic dimensionality.
N. Vlassis, Y. Motomura, and B. Kröse
Definition. We will say that a random vector $x \in \mathbb{R}^d$ is intrinsically one-dimensional with manifold variable $s$ when the conditional average $E[x \mid s]$ defines a smooth curve $f(s)$ in $\mathbb{R}^d$ and the conditional variance $\mathrm{Var}[x \mid s]$ is bounded. The curve $f(s)$ is called the inverse regression curve (Li, 1991).
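As a concrete (hypothetical) illustration of this definition, the following sketch generates intrinsically one-dimensional data in $\mathbb{R}^3$ by adding bounded-variance noise around a smooth curve $f(s)$; the particular spiral and the noise level are our own choices, loosely modeled on the spiral data of Figure 1a:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Manifold variable s and a smooth inverse regression curve f(s) in R^3.
# The spiral shape is a hypothetical choice, not the paper's exact data.
s = rng.uniform(0.0, 4.0 * np.pi, size=n)
f = np.column_stack([np.cos(s), np.sin(s), 0.1 * s])

# Intrinsically one-dimensional data: x = f(s) + noise, Var[x | s] bounded.
x = f + 0.05 * rng.standard_normal((n, 3))

print(x.shape)  # (500, 3): ambient dimension d = 3, intrinsic dimension 1
```

The conditional average of `x` given `s` is by construction the curve `f`, which is what the definition requires.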
We now assume a data set $x_i$, $1 \le i \le n$, of independent and identically distributed (i.i.d.) observations of an intrinsically one-dimensional random vector $x \in \mathbb{R}^d$, with corresponding manifold values $s_i$. According to the definition above, the $d$-dimensional data $x_i$ are constrained to live in a one-dimensional smooth manifold embedded in $\mathbb{R}^d$, plus noise. One such example is the spiral data shown in Figure 1a, with $d = 3$, where the inverse regression curve $f(s)$ that generated the data is also plotted. We are interested in reducing the dimensionality of such data by linearly projecting them to a subspace $\mathbb{R}^q$, $1 \le q < d$, multiplying them with a $d \times q$ matrix $W$ with orthonormal columns,

$$ y_i = W^T x_i, \qquad 1 \le i \le n, \qquad W^T W = I_q, \tag{2.1} $$
where $I_q$ stands for the $q$-dimensional identity matrix. Clearly, the above mapping, being continuous, will project the original inverse regression curve $f(s)$ to a new curve $g(s)$, the inverse regression curve of the projected variable $y$ given the manifold variable $s$. We argued in section 1 that feature extraction from such data is imperative. Moreover, it is often the case that for relatively large $d$, there is a linear projection that maps the original curve $f(s)$ to a plane curve $g(s)$ that preserves most of the topological structure of the original data. One way to find such a projection is by forcing the projected curve to be as near to a simple curve as possible, in other words, by minimizing the number of self-intersections. A formal way to compute the number of self-intersections of $g(s)$ is by monitoring the average modality of the conditional density $p(s \mid y)$. Clearly, if $g(s)$ is simple, $p(s \mid y)$ will exhibit a single mode for all values of $y$; a single intersection of $g(s)$ at $y = y_o$ will make $p(s \mid y = y_o)$ bimodal, and so forth. This observation makes the task of projecting to a simple curve easier because there are several methods for assessing the degree of modality of a distribution. Moreover, since this will be part of an optimization procedure, the measure of multimodality of $p(s \mid y)$ must be computed quickly. The above discussion implies that we need a nonparametric estimate of $p(s \mid y)$. For an appropriate sequence of weights $\lambda_j(y)$, $1 \le j \le n$, such an estimate $\hat{p}(s \mid y)$ is (Stone, 1977)

$$ \hat{p}(s \mid y) = \sum_{j=1}^{n} \lambda_j(y)\, w_{h_s}(s - s_j), \tag{2.2} $$
where

$$ w_{h_s}(s) = \frac{1}{\sqrt{2\pi}\, h_s} \exp\!\left(-\frac{s^2}{2 h_s^2}\right) \tag{2.3} $$

is the univariate gaussian kernel with bandwidth $h_s$, defining a local smoothing region around $s$. A weight function $\lambda_j(y)$ which satisfies the conditions in Stone (1977) and makes the above estimate a smooth function of the projection matrix $W$ is

$$ \lambda_j(y) = \frac{w_{h_y}(y - y_j)}{\sum_{k=1}^{n} w_{h_y}(y - y_k)}, \tag{2.4} $$

where

$$ w_{h_y}(y) = \frac{1}{(2\pi)^{q/2} h_y^{q}} \exp\!\left(-\frac{\|y\|^2}{2 h_y^2}\right) \tag{2.5} $$

is the $q$-dimensional spherical gaussian kernel with bandwidth $h_y$. The two kernel bandwidths $h_y$ and $h_s$ are the only free parameters of the model $\hat{p}(s \mid y)$, and their values affect the resulting projections. We discuss possible choices later.

3 Nonparametric Measures of Multimodality
The above discussion has made it clear that we need a way to assess the degree of modality of $p(s \mid y)$ using the nonparametric estimate $\hat{p}(s \mid y)$ from equation 2.2. This estimate must be computed quickly, a requirement that renders approximations more favorable than accurate methods.

3.1 The Proposed Model. The proposed measure is based on the simple observation that for a given point $x_i$, which is projected through equation 2.1 to $y_i$, the density $p(s \mid y = y_i)$ will always exhibit a mode at $s = s_i$. Moreover, if $g(s)$ is a simple curve, this will be the only mode of $p(s \mid y = y_i)$. Thus, an approximate measure of multimodality is the Kullback-Leibler distance between $p(s \mid y = y_i)$ and a unimodal density sharply peaked at $s = s_i$, giving the approximate estimate $-\log p(s_i \mid y = y_i)$ plus a constant.¹ Averaging over all points $y_i$, we have to minimize the risk
$$ R_K = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(s_i \mid y = y_i), \tag{3.1} $$
¹ As one reviewer pointed out, it would be more accurate to call this a measure of unimodality, since for a given peak of $p(s \mid y = y_i)$ at $s = s_i$ and any number of additional modes, this measure would not discriminate between these distributions. We come back to this point in section 6.4.
with $\hat{p}(s \mid y = y_i)$ the nonparametric estimate of $p(s \mid y = y_i)$ computed from equation 2.2. This risk can be regarded as the negative average log-likelihood of the data given a model defined by the kernel bandwidths $h_y$, $h_s$, and the projection matrix $W$. Applying Bayes' rule to $p(s_i \mid y = y_i)$ gives an alternative interpretation of the above risk,

$$ R_K = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid s = s_i) + \frac{1}{n} \sum_{i=1}^{n} \log p(y_i) + \text{const}, \tag{3.2} $$
which shows that the optimal projection tries to "stretch" the projected points $y_i$ as far from each other as possible by minimizing the (logarithm of the) density of the $y$ vector (the second term), while at the same time reducing the "noise" of $y$ conditional on $s$, expressed by the negative log-likelihood of $p(y_i \mid s = s_i)$ (the first term).

3.2 The Double-Index Model. If a projection to a simple curve $g(s)$ does exist, then $p(s \mid y)$ is strictly unimodal and can be approximated by a gaussian with constant variance. Then the above risk becomes proportional to the mean squared error,
$$ R_D = \frac{1}{n} \sum_{i=1}^{n} \left\{ s_i - \hat{m}(y_i) \right\}^2, \tag{3.3} $$

with $\hat{m}(y_i)$ the nonparametric estimate of the expectation of $p(s \mid y = y_i)$ computed with the Nadaraya-Watson formula for nonparametric regression,

$$ \hat{m}(y_i) = \sum_{j=1}^{n} \lambda_j(y_i)\, s_j, \tag{3.4} $$
with $\lambda_j(y_i)$ from equation 2.4. Minimization of equation 3.3 with respect to the projection matrix $W$ leads to a nonparametric regression model known as the single-index model for $q = 1$ and the multiple-index model for $q \ge 2$ (Härdle et al., 1993). For projections to the plane, we adopt hereafter the term double-index model (DIM). In this model, as we see from equation 3.3, the squared distance of $s_i$ to its nonparametric conditional mean $\hat{m}(y_i)$ acts as a diagnostic of the multimodality of $p(s \mid y = y_i)$. This, in turn, implies a gaussian approximation to the "true" $p(s \mid y = y_i)$. At first glance, this approximation seems too naive, since it is continuously violated during optimization (at least during the first steps), because most random projections will yield nonsimple curves. However, if a projection to a simple curve exists, the global minimum of $R_D$ must correspond to this projection. In fact, it is an individual projection that makes $p(s \mid y = y_i)$ multimodal, and assuming a gaussian
model for it does not imply a violation of its real parametric shape but is rather a constraint to be satisfied by the optimization algorithm.

3.3 Limitations of the Double-Index Model. The above discussion reveals two limitations of the DIM. The first is related to optimization. Although a projection to a simple curve may exist, the mean squared error $R_D$ can exhibit very sharp local minima. One way to see this is by noticing how poor an indicator of multimodality the squared distance to the mean can be: densities with the same variance can have different higher moments and thus can exhibit different modalities. The Givens parameterization (see section 4.3) allows the visualization in Figure 2 of the landscapes of the risks $R_K$ and $R_D$ for the spiral data, where the sharp minima of $R_D$ are evident. Such a local minimum solution of the DIM is shown in Figure 3c, where for the points $y_i$ around $[-0.7, 0.9]^T$, the conditional density $p(s \mid y = y_i)$ is approximately symmetric and trimodal, making the algorithm get stuck in a local minimum.

Figure 2: The landscapes of the (a) negative log-likelihood $R_K$ and (b) mean squared error $R_D$. Although the global minimum occurs for the same parameter vector, the landscape of $R_D$ exhibits sharper local minima.

Figure 3: Projections of the spiral data. (a) Double-index model, $h_y = 0.13$. (b) Proposed method with optimized bandwidths, $h_y = 0.03$, $h_s = 0.02$. (c) Local minimum solution of the DIM, $h_y = 0.04$.

The most severe limitation of the DIM, however, is that it can be used only when the original curve $f(s)$ does not intersect itself in $\mathbb{R}^d$. Clearly, a multimodality of the conditional density $p(s \mid x)$ in the original space $\mathbb{R}^d$ will lead to an equivalent multimodality of $p(s \mid y)$, and this cannot be remedied by any projection. A typical situation is when the curve $f(s)$ constitutes a closed manifold, in which case the respective density $p(s \mid x)$ will exhibit bimodality. For any parameterization of the curve, a single point on it will be the image of two different manifold values: $\exists\, s_1, s_2 : s_1 \ne s_2,\ f(s_1) = f(s_2)$. In this case, the failure of the DIM to compute the correct projections is a direct consequence of the fact that the risk $R_D$ assumes a unimodal $p(s \mid y)$. The proposed risk $R_K$ in equation 3.1 does not suffer from the above two limitations and can be used even for data lying on closed manifolds. This will be illustrated in section 5.1 with a real-life application.
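For concreteness, both risks can be written down directly from the kernel estimates of section 2. The sketch below is our own minimal implementation, not the authors' code; the spiral data, the random projection, and the bandwidth values are illustrative assumptions:

```python
import numpy as np

def risks(X, s, W, hy, hs):
    """R_K of eq. 3.1 and R_D of eq. 3.3 for the projection y = W^T x."""
    Y = X @ W                                      # projected points, n x q
    # Normalized weights lambda_j(y_i) of eqs. 2.4-2.5; the kernel's
    # normalizing constant cancels in the ratio.
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    L = np.exp(-D2 / (2 * hy ** 2))
    L /= L.sum(axis=1, keepdims=True)
    # p_hat(s_i | y = y_i): eq. 2.2 with the univariate kernel of eq. 2.3.
    Ws = np.exp(-((s[:, None] - s[None, :]) ** 2) / (2 * hs ** 2))
    Ws /= np.sqrt(2 * np.pi) * hs
    p_ii = (L * Ws).sum(axis=1)
    RK = -np.mean(np.log(p_ii))                    # eq. 3.1
    m = L @ s                                      # Nadaraya-Watson mean, eq. 3.4
    RD = np.mean((s - m) ** 2)                     # eq. 3.3
    return RK, RD

# Noisy spiral in R^3 and a random orthonormal 3 x 2 projection matrix.
rng = np.random.default_rng(1)
n = 200
s = rng.uniform(0.0, 4.0 * np.pi, n)
X = np.column_stack([np.cos(s), np.sin(s), 0.1 * s])
X += 0.05 * rng.standard_normal((n, 3))
W, _ = np.linalg.qr(rng.standard_normal((3, 2)))   # orthonormal columns
RK, RD = risks(X, s, W, hy=0.3, hs=0.3)
print(RK, RD)
```

In an actual run, either risk would then be minimized over $W$ (and possibly the bandwidths) with the optimization machinery of section 4; here we only evaluate both once for a single projection.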
4 Model Selection and Optimization

4.1 Kernel Smoothing. We discuss here the selection of the kernel bandwidths $h_y$ and $h_s$ in equations 2.5 and 2.3, respectively. For the DIM, this problem has been solved by the cogent proof in Härdle et al. (1993) that a cross-validation estimate of the Nadaraya-Watson conditional mean allows the simultaneous optimization of both the projection directions (the columns of $W$) and the kernel bandwidth $h_y$ in equation 2.5 (note that the DIM requires no kernel smoothing on $s$). The cross-validation estimate of $m(y_i)$ is given by

$$ \tilde{m}(y_i) = \frac{\sum_{j \ne i} w_{h_y}(y_i - y_j)\, s_j}{\sum_{j \ne i} w_{h_y}(y_i - y_j)}, \tag{4.1} $$

as in equation 3.4 but with the points $y_i$ and $s_i$ excluded from the summations in the numerator and the denominator. Application of the DIM to the spiral data results in the curve shown in Figure 3a. For our model we do not have such a proof, but the same methodology seems to apply as well. Specifically, using the cross-validation estimate of $p(s \mid y = y_i)$,

$$ \tilde{p}(s \mid y = y_i) = \frac{\sum_{j \ne i} w_{h_y}(y_i - y_j)\, w_{h_s}(s - s_j)}{\sum_{j \ne i} w_{h_y}(y_i - y_j)}, \tag{4.2} $$

and plugging it into the risk, equation 3.1, we were able to optimize simultaneously over the matrix $W$ and the two kernel bandwidths $h_y$ and $h_s$. The resulting projection of the spiral data is shown in Figure 3b. However, the lack of a proof makes this procedure questionable, so we investigated the possibility of dropping the cross-validation estimate and assigning constant values to $h_y$ and $h_s$ during optimization. From Hall (1989), we know that in projection pursuit regression, the mean integrated squared error (MISE) optimal bandwidth for gaussian data, $h_y = O(n^{-1/5})$, gives direction estimates with error $O(n^{-2/5})$, while under certain assumptions this error can be further reduced to $O(n^{-1/2})$ by choosing a bandwidth between $O(n^{-1/3})$ and $O(n^{-1/4})$. The solution we adopted for the bandwidth $h_y$, for projections to the plane, was $h_y = n^{-2/7}$, which can be kept fixed during optimization after sphering the data (see below). This value satisfies the above bounds and gives good results in practice. For the $s$-bandwidth we chose the gaussian MISE optimal value $h_s = (3n/4)^{-1/5}$ (Wand & Jones, 1995). For these two values, the proposed method projected the spiral data as shown in Figure 1c.

4.2 Sphering. A sphering of the data $x_i$ (a normalization to zero mean and identity covariance matrix) makes the kernel bandwidth $h_y$ independent of the projection. Then $h_y$ can be kept constant during optimization, leading to considerable computational savings. Sphering means a rotation
of the data to their principal component analysis (PCA) directions and then standardization of the individual variances to one. To avoid modeling noise in the data, it is typical to ignore directions with small eigenvalues; a heuristic way to do this is by calculating the ratio of the cumulative variance (the sum of the retained eigenvalues) to the total variance. A reasonable threshold is 0.8. The numerically most accurate way to sphere the data is by singular value decomposition (Press, Teukolsky, Flannery, & Vetterling, 1992). Let $X$ be the $n \times d$ matrix whose rows are the data $x_i$ after they have been normalized to zero mean. For $n > d$, we compute the singular value decomposition $X = U L V^T$ of the matrix $X$ and form the matrix $A = \sqrt{n}\, V L^{-1}$. The points $XA$ are then sphered. For $n \le d$, the data $x_i$ lie in general in an $(n-1)$-dimensional Euclidean subspace of $\mathbb{R}^d$. In this case, it is more convenient to compute the principal directions through eigenanalysis of $K = X X^T$, the inner products matrix of the zero-mean data. We compute its singular value decomposition $K = U L V^T$ and remove the last column of $V$ and the last column and row of $L$ (the last eigenvalue of $K$ will always be zero). Then we form the matrix $A = \sqrt{n}\, V L^{-1}$. The points $KA$ are $(n-1)$-dimensional and sphered. Moreover, all projections of sphered data $x_i$ in the form of equation 2.1 also give sphered data $y_i$ because

$$ E[y y^T] = E[W^T x x^T W] = W^T E[x x^T] W = W^T W = I_q, \tag{4.3} $$
due to the constraint of orthonormal columns of $W$. This frees us from having to reestimate (co)variances of the projected data in each step of the optimization algorithm. In the following, we assume that the data $x_i$ have already been sphered and that the manifold data $s_i$ have been normalized to zero mean and unit variance.

4.3 Optimization. The smooth form of the risk $R_K$ as a function of $W$, $h_y$, and $h_s$ allows its minimization with nonlinear optimization. For constrained optimization, we must compute the gradient of $R_K$ and the gradient of the constraint function $W^T W - I_q$ with respect to $W$, and then plug these into a constrained nonlinear optimization routine to optimize with respect to $R_K$ (Gill, Murray, & Wright, 1981). The gradient of $R_K$ with respect to $W$ is given in section A.1. An alternative approach that avoids constrained nonlinear optimization, in a similar problem involving kernel smoothing for discriminant analysis, has recently been proposed by Torkkola and Campbell (2000). The idea is to parameterize the projection matrix $W$ by a product of Givens (Jacobi) rotation matrices (Press et al., 1992) and then optimize with respect to the angle parameters involved in each matrix. For projections from $\mathbb{R}^d$ to $\mathbb{R}^q$, this parameterization takes the form

$$ W = \prod_{o=1}^{q} \prod_{u=q+1}^{d} G_{ou}, \tag{4.4} $$
where $G_{ou}$ is a Givens rotation matrix that equals $I_d$ except for the elements $g_{oo} = \cos\theta_{ou}$, $g_{ou} = \sin\theta_{ou}$, $g_{uo} = -\sin\theta_{ou}$, and $g_{uu} = \cos\theta_{ou}$, for an angle $\theta_{ou}$ that depends on $o$ and $u$. For simplicity we let $g_{oo}$, $g_{ou}$, etc. in the above notation denote the $(o,o)$-th, $(o,u)$-th, etc. elements of the matrix $G_{ou}$, respectively. To ensure that $W$ is $d \times q$, only the first $q$ columns of the last matrix $G_{qd}$ in equation 4.4 are retained, while multiplications must be carried out from right to left to reduce the evaluation cost (see section A.3). Multiplication with a matrix $G_{ou}$ causes a rotation by $\theta_{ou}$ over the plane defined by the dimensions $o$ and $u$, while the range of indices in equation 4.4 ensures that all rotations take place over planes defined by at least one nonprojective direction, that is, one among the $d - q$ remaining dimensions. This fact also reduces the total number of parameters from $qd$ in the constrained optimization case (the elements of the matrix $W$) to $q(d - q)$ here (the angles $\theta_{ou}$). The required derivatives are given in section A.2. However, as we see in Figure 2, the mixture-like form of equation 2.2 and the additional trigonometric functions in equation 4.4 can make the landscape of the risk $R_K$ have numerous local minima. For this reason, combining a gradient-free optimization method like Nelder-Mead with nonlinear optimization is necessary. Also, a projection (through sphering) to a low-dimensional Euclidean subspace prior to optimization can significantly facilitate the search. In any case, the optimization algorithm must be applied many times, and the solution with the minimum risk must be retained. In section A.3 we derive some complexities and discuss some implementation issues.

5 Applications

5.1 Appearance Models for Visual Servoing.
One potential application of the proposed method is in robot visual servoing: the ability of a robot manipulator to move to a desired position with respect to an object, using visual information from a camera mounted on the robot's end-effector. Usually, this task is carried out using a geometric representation of the object, which is used for estimating the displacement of the robot with respect to the object, and then feeding this estimate to an appropriate controller. However, building such geometric models can be cumbersome when dealing with objects with complex shapes. Nayar, Nene, and Murase (1996) describe an alternative approach where the appearance of the object is modeled for different poses of the camera. This appearance model is learned by capturing a large set of object images while incrementally displacing the robot's end-effector. Because all images are of the same object, consecutive images are strongly correlated, given that no large variations in lighting conditions occur. The image set is compressed using PCA to obtain a low-dimensional subspace. In this low-dimensional space, the variations due to the robot displacement are represented in the form of a parameterized manifold. The manifold is then a continuous representation of the visual workspace of the servoing task.
Figure 4: A subset of the duck images. Closing the loop results in a closed manifold in the image space.
Visual positioning is achieved by first projecting a new image to the eigenspace and then determining the robot displacement to the desired position from the location of the projection on the parameterized manifold. In this application, it is obvious that the manifold should be nonintersecting; otherwise, two different robot poses will correspond to the same low-dimensional representation, leading to bad pose estimation. In Nayar et al. (1996), the number of eigenvectors was high enough to avoid intersections. Here we illustrate that our supervised dimension-reduction method gives a better solution, in the sense that the dimensionality of the resulting subspace in which the parameterized manifold lives can be lower. This can be very important when nonparametric statistical inference in the reduced space is in order. We carried out an experiment involving images of an object (a duck) corresponding to consecutive views obtained every 10 degrees of rotation about a single axis of the object. The data set is supervised in the sense that the angle of view in each image is known. The last image coincides with the first image in the sequence, giving a total set of 37 images forming a closed manifold in the image space. A subset of these images is shown in Figure 4. We sphered the data by applying PCA using the inner products matrix of the zero-mean data (see section 4.2). It turned out that the first four eigenvectors (those with the largest eigenvalues) explained more than 80% of the total variance, so we discarded the other dimensions. From this four-dimensional space, projecting to the plane using the proposed method was easy. In Figure 5, we show the resulting projections using PCA, the proposed method with optimized and fixed bandwidths, and the double-index model. We observe that PCA and the DIM project to nonsimple curves, whereas the proposed method successfully finds a projection to a simple curve, verifying our theoretical claims.
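The sphering step used in this experiment (the $n \le d$ inner-products variant of section 4.2) can be sketched as follows; the random matrix is a stand-in for the 37 duck images, and the ambient dimension is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 37, 1000                       # few images in a high-dimensional pixel space
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                   # normalize the data to zero mean

# n <= d: eigenanalysis of the inner products matrix K = X X^T (section 4.2).
K = X @ X.T
U, lam, Vt = np.linalg.svd(K)
V, lam = Vt.T[:, :-1], lam[:-1]       # drop the last (zero) eigenvalue of K
A = np.sqrt(n) * V / lam              # A = sqrt(n) V L^{-1}, with L diagonal
Y = K @ A                             # the points KA are (n-1)-dim and sphered

# Sphered: zero mean and identity covariance (up to numerical error).
print(np.allclose(Y.T @ Y / n, np.eye(n - 1)))
```

Since the centered Gram matrix $K$ always annihilates the all-ones vector, its last eigenvalue is zero, which is why that direction is removed before inverting $L$.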
Figure 5: Projections of the duck image data from the four-dimensional space. (a) PCA projection. (b) Proposed method with optimized bandwidths, $h_y = 0.08$, $h_s = 0.1$. (c) Proposed method with fixed bandwidths, $h_y = 0.36$, $h_s = 0.52$. (d) Double-index model, $h_y = 0.47$. The dotted lines indicate where the manifold closes.

If the resulting manifold is to be used for visual servoing, it is therefore advisable to use a supervised linear projection instead of PCA.

5.2 Mobile Robot Localization. A second application of the proposed method is in mobile robot localization, a term referring to the ability of a robot to estimate its position in its workspace at any moment. We show here the importance of having a good low-dimensional representation (features) of the robot observations. In contrast to the previous application, where visual servoing can be carried out by learning a direct mapping from features to poses (e.g., using a neural network; see Nayar et al., 1996), here the necessity of propagating the location predictions over time, as we show next, renders direct mappings impossible. In particular, the inverse mapping from robot positions to observation features is needed for localization. Let us assume that at time $t$, the robot has a belief, a probability density function $b_t(s_t)$, of its position in the workspace. We also assume two probabilistic models: (1) a model of the robot kinematics in the form of the conditional density $p(s_t \mid u_{t-1}, s_{t-1})$, where $u_{t-1}$ is an action that takes the robot from its previous position $s_{t-1}$ to $s_t$, and (2) an observation model $p(y_t \mid s_t)$ that realizes the inverse mapping from robot positions to features $y$. Using the Markov assumption that the past has no effect on observations beyond the previous time step, the update rule for the robot belief is (Thrun, 2000)

$$ b_t(s_t) \propto p(y_t \mid s_t) \int p(s_t \mid u_{t-1}, s_{t-1})\, b_{t-1}(s_{t-1})\, ds_{t-1}. \tag{5.1} $$
Figure 6: The robot trajectory in our building.
Moreover, a convenient way to implement this density propagation scheme is through Monte Carlo sampling from the various densities in the above equation. This amounts to approximating the belief density $b_t(s_t)$ with a weighted set of "particle" points sampled from $b_t(s_t)$, in which case equation 5.1 enjoys a very simple implementation, and the method has been shown to give very satisfactory results in practice. (For details, we refer to Thrun, 2000.) We applied the above algorithm to data collected by a Nomad Scout robot following a predefined trajectory in our mobile robot lab and the adjoining hall, as shown in Figure 6. The omnidirectional imaging device mounted on top of the robot consists of a vertically mounted standard camera aimed upward, looking into a spherical mirror. The data set contains 104 omnidirectional images ($320 \times 240$ pixels) captured every 25 centimeters along the robot path. Each image is transformed to a panoramic image ($64 \times 256$), and these 104 panoramic images, together with the robot positions along the trajectory, constitute the training set of our algorithm. A typical panoramic image, shot at position A of the trajectory, is shown in Figure 7. In order to apply our supervised projection method, we first sphered the panoramic image data using the inner products matrix, as explained above, and kept the first 10 dimensions, explaining about 60% of the total variance. The robot positions were normalized to zero mean and unit variance. Then we applied our method, projecting the sphered data points from 10-D to 2-D. In Figure 8 we plot the resulting two-dimensional projections using PCA and our supervised projection method. We clearly see the advantage
Figure 7: Panoramic snapshot from position A in the robot trajectory.
of the proposed method over PCA. The risk is smaller, and from the shape of the projected manifold we see that taking the robot position into account during projection can significantly improve the resulting features: there are fewer self-intersections of the projected manifold in our method than in PCA, which means better robot position estimation on average (smaller risk). In Figure 9 we show the localization performance of the robot when running the Monte Carlo localization method using both the PCA and the supervised features. For the robot kinematics, we assumed a simple gaussian model with standard deviation half the traveled distance of the robot in each action $u_{t-1}$ (translation), while the observation model $p(y \mid s)$ is given by equation 2.2 with the roles of $s$ and $y$ interchanged and the same (fixed) kernel bandwidths,

$$ p(y \mid s) = \frac{\sum_{j=1}^{n} w_{h_s}(s - s_j)\, w_{h_y}(y - y_j)}{\sum_{j=1}^{n} w_{h_s}(s - s_j)}. \tag{5.2} $$
We placed the robot at position B on the trajectory and planned its translation toward position C. The corresponding projections of the panoramic images in the traversed part of the trajectory from B to C are shown with dashed lines in Figure 8. No prior knowledge about the initial position of the robot was provided (the kidnapped robot problem), in which case the initial belief $b_1(s_1)$ is a uniform density over the $s$ space. In Figure 9 we plot the estimated position of the robot as the average of $b_t(s_t)$ at any moment $t \ge 2$, together with $1\sigma$ error bars, when using Monte Carlo localization with the PCA projections of Figure 8a and the supervised projections of Figure 8b. We see that by using the supervised projections, the algorithm converges faster to the true position of the robot (dashed lines) and is more stable than with the PCA solution. This result clearly shows the advantage of using a supervised dimension-reduction method instead of an unsupervised one like PCA. In both applications, we tried both constrained nonlinear optimization using sequential quadratic programming (Gill et al., 1981) and unconstrained optimization using the Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Figure 8: Projection of the panoramic image data from the 10-D space. (a) Projection on the first two principal components. (b) Supervised projection using the proposed method with fixed bandwidths. The parts drawn with dashed lines correspond to projections of the panoramic images captured by the robot between positions B and C of its trajectory.
algorithm with the Givens parameterization. The latter was more sensitive to local minima but many times faster than sequential quadratic programming, so it was our main choice. We repeated the optimization algorithm several times starting from random values for the Givens angles in
Figure 9: Localization performance of the robot (estimated position versus true position) using Monte Carlo localization with (a) the PCA features, and (b) the supervised features, for a translation from point B to point C along the trajectory.
$[-\pi/2, \pi/2]$, and from small values for the bandwidths in the case where the latter were also optimized. In order to make sure that the global minimum solution was reached, we ran the Nelder-Mead algorithm until convergence, and from this solution we started nonlinear optimization with BFGS, a common practice in similar optimization problems (see, e.g., Härdle et al., 1993). A gradient-
free method like Nelder-Mead requires more evaluations of the objective function than BFGS, but this is compensated by the better global convergence properties of the former. The complexity analysis in section A.3 can be used for efficient evaluation of the objective function and its derivatives. A Matlab implementation is available from the first author.

6 Discussion
We have described a method for supervised dimension reduction (feature extraction) of intrinsically low-dimensional data and justified its use with some real-life applications. One issue that should be reemphasized is that the proposed method should not be regarded as a regression method (although it can also be used for regression) but rather as a feature extraction method that takes into account the supervised information of the data during optimization. The mobile robot application provided a clear motivation for the use of the method in this sense. The resulting features might need to be used for realizing the inverse mapping from label data (robot positions) to feature data, and not necessarily for predicting the label data from new observations. Besides, the extracted features can also be used for other purposes (Li, 1991). It is also this fact that renders the nonparametric models adopted in this article attractive: as long as we are interested in the projection matrix $W$ only, the model that we use during optimization must be succinct and employ as few parameters other than the elements of $W$ as possible. In our case, the kernel bandwidths are also model parameters to be estimated, but typically they will be used after feature extraction for kernel smoothing in the resulting $y$ and $s$ spaces. In the rest of this section, we discuss some additional topics related to the method and outline some related and future work.

6.1 Information-Theoretic Concepts. A nonparametric estimate of the modality of $p(s \mid y)$ must be computed quickly and thus be approximate. A more principled approach than the proposed Kullback-Leibler distance to a peaked density would be an information-theoretic one: an indicator of the number of peaks of $p(s \mid y = y_i)$ is the entropy of $p(s \mid y = y_i)$, which reaches its minimum value when the density is unimodal (note that $s$ conditioned on $y = y_i$ does not have constant variance).
Averaging over all $y_i$ corresponds to mutual information maximization between the projected variable $y$ and the manifold variable $s$, which, since $s$ is independent of the projection, corresponds to minimizing the risk (where we have used the empirical estimate of $p(y)$):
$$ R_H = -\frac{1}{n} \sum_{i=1}^{n} \int p(s \mid y = y_i) \log p(s \mid y = y_i)\, ds; \tag{6.1} $$
Dimension Reduction of Intrinsically Low-Dimensional Data
however, the presence of the integral and the logarithm makes the evaluation of this quantity costly. One possibility is to replace the Shannon entropy in the above formula with the Rényi entropy of order two (Cover & Thomas, 1991),

$$h_2(s \mid y = y_i) = -\log \int p(s \mid y = y_i)^2\, ds, \qquad (6.2)$$

and then use the nonparametric estimate of $p(s \mid y)$ from equation 2.2. This makes the integral in equation 6.1 analytical (Principe & Xu, 1999); however, unless some application-specific approximations are employed, the cost of the above estimate is $O(n^3)$. Finally, we note from equation 3.1 that the log-likelihood risk $R_K$ can be regarded as a rough approximation to the above integral, equation 6.1, assuming that $p(s \mid y = y_i)$ is sufficiently peaked about $s_i$. The gained speedup of one order of magnitude (or more; see the discussion about the fast Fourier transform below) can be a good justification for the approximation.

6.2 Dropping the Complexity with the Fast Fourier Transform. As shown in section A.3, the bottleneck of the method lies in the quadratic complexity of computing all the pairwise distances of the projected points $y_i$ and then smoothing with the $p$-gaussian kernel in equation 2.4. However, one property of the gaussian kernel is that smoothing can be efficiently carried out with the fast Fourier transform (Silverman, 1982). This involves a two-stage approach where the projected data are first binned (an approximate histogram of them is computed) and then kernel smoothing is carried out by discrete convolution with a discretized gaussian kernel (see Wand & Jones, 1995, for details). This approach drops the quadratic cost to $O(n \log_2 n)$, which is the cost of discrete convolution with the fast Fourier transform.

6.3 Optimization Issues. The efficiency of the optimizer in locating the global minimum solution depends on the original dimension $d$ of the data.
In our experiments, we noticed that for large values of $d$ (e.g., $d > 10$), the optimizer often had difficulties locating the global minimum solution. If $d$ is too large (e.g., images), then dimension reduction through sphering can help lower $d$ to some manageable number. Sphering to a dimension approximately equal to $\sqrt{n}$ is a reasonable choice, as justified by the complexity analysis in section A.3. However, if projections from large $d$ are in order, alternative optimization methods might be needed, for example, stochastic search or fixed-point iterations. In particular, the log-likelihood form of equation 3.1 suggests that optimization with the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) might be possible, although in its current form, the M-step is not analytical. An alternative definition of the weight function $l_j(y)$ in equation 2.4 can make the M-step analytical, and we are investigating this possibility.
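The two-stage FFT smoothing described in section 6.2 can be sketched as follows. This is an illustrative implementation only: the function name, grid size, and the truncation of the kernel at four bandwidths are our own choices, not the authors'.

```python
import numpy as np

def fft_kde(y, h, n_bins=256):
    """Gaussian kernel density estimate via binning + FFT convolution
    (the two-stage approach of Silverman, 1982)."""
    lo, hi = y.min() - 3 * h, y.max() + 3 * h
    counts, edges = np.histogram(y, bins=n_bins, range=(lo, hi))
    centers = 0.5 * (edges[:-1] + edges[1:])
    delta = centers[1] - centers[0]
    # discretized gaussian kernel on the same grid, truncated at 4 bandwidths
    half = int(np.ceil(4 * h / delta))
    t = np.arange(-half, half + 1) * delta
    kernel = np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2 * np.pi))
    # linear convolution via FFT: pad to the full convolution length
    m = len(counts) + len(kernel) - 1
    f = np.fft.rfft(counts, m) * np.fft.rfft(kernel, m)
    conv = np.fft.irfft(f, m)
    # keep the part aligned with the bin centers ("same" convolution)
    density = conv[half:half + n_bins] / len(y)
    return centers, density
```

The histogram step is $O(n)$ and the convolution $O(n \log n)$, which is the source of the speedup over the naive $O(n^2)$ pairwise smoothing.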
N. Vlassis, Y. Motomura, and B. Kröse
Another interesting topic is the relaxation of the constraint of orthonormality of the columns of $W$, allowing more general features to be discovered. Such a step would require a better optimization strategy in order to avoid columns of $W$ converging to singular or degenerate solutions. Moreover, since equation 4.3 would no longer hold, the $h_y$ kernel bandwidth could not be kept fixed during optimization and should be either optimized using the cross-validation approach mentioned above or computed by estimating the (co)variances of the projected points in each optimization step.

6.4 An Alternative Measure of Multimodality. Another measure of multimodality of $p(s \mid y)$ in a supervised nonlinear feature extraction problem has been proposed by Thrun (1998). There, multimodality was measured by the average distance to the "correct" mode $s_i$, giving the risk

$$R_L = \frac{1}{n} \sum_{i=1}^{n} \int |s - s_i|\, p(s \mid y = y_i)\, ds, \qquad (6.3)$$
which penalizes modes that appear far from $s_i$. In Thrun (1998), this risk was approximated from the sample with cost $O(n^3)$. However, if instead of the absolute loss function $|s - s_i|$ we take the square loss $(s - s_i)^2$ and use the nonparametric estimate of $p(s \mid y)$ from equation 2.2, the risk reads

$$R_L = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} l_j(y_i) \int (s - s_i)^2\, w_{h_s}(s - s_j)\, ds, \qquad (6.4)$$

and the above integral can be analytically computed as $(s_i - s_j)^2 + h_s^2$, yielding the risk (plus a constant)

$$R_L = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} l_j(y_i)\, (s_i - s_j)^2. \qquad (6.5)$$
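As an illustration, equation 6.5 can be computed directly from the sample. The sketch below assumes (our choice, for concreteness) a one-dimensional projection and weights $l_j(y_i)$ given by normalized gaussian kernel weights in the projected space:

```python
import numpy as np

def risk_RL(y, s, h_y):
    """O(n^2) risk of equation 6.5 for 1-D projected points y and manifold
    values s; l_j(y_i) is taken as a normalized gaussian kernel weight
    (an assumption of this sketch)."""
    W = np.exp(-0.5 * (y[:, None] - y[None, :]) ** 2 / h_y ** 2)
    W /= W.sum(axis=1, keepdims=True)          # rows of W are l_j(y_i)
    return (W * (s[:, None] - s[None, :]) ** 2).sum() / len(s)
```

When $s$ varies smoothly along the projection, points that are close in $y$ carry similar manifold values and the risk is small; a projection that scatters the manifold inflates it.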
This risk is $O(n^2)$ and can be regarded as an approximate Bayes risk with negative log-likelihood loss (Stone, 1977). Moreover, Jensen's inequality can relate this risk to both $R_K$ and $R_D$. For the latter, write

$$\{ s_i - \hat m(y_i) \}^2 = \Big\{ \sum_j l_j(y_i)\, (s_i - s_j) \Big\}^2, \qquad (6.6)$$

and then apply $E[x]^2 \le E[x^2]$. Plots of the parameter landscape for this risk showed a large resemblance to the landscape of $R_D$. This can be explained by the fact that in both risks, no smoothing is carried out in the $s$ space, in contrast to the proposed risk $R_K$, which essentially ignores modes of $p(s \mid y = y_i)$ that appear far from the true
position $s_i$. As we showed in the mobile robot localization application, positioning is achieved by propagating the belief estimates over time; therefore, far-away modes of the conditional density $p(s \mid y = y_i)$ are easier to handle by the localization mechanism, while modes close to $s_i$ are more likely to cause a problem. This fact also justifies the use of the Kullback-Leibler distance in the definition of $R_K$ in equation 3.1 as a measure of multimodality of $p(s \mid y = y_i)$.

6.5 Unsupervised Dimension Reduction. Finally, a challenging problem is the unsupervised feature extraction of intrinsically low-dimensional data, where "unsupervised" means that the manifold variable is unobserved in the sample. In this case, the inverse regression curve must be substituted by a principal curve (Hastie & Stuetzle, 1989), and projection direction estimation must be interleaved with curve fitting. We are not aware of any such work and believe that it deserves attention.

7 Conclusion
We proposed a supervised dimension-reduction method that can be used for data sets that are intrinsically low-dimensional. We discussed the case of one-dimensional intrinsic dimensionality and linear projections to the plane, but the method can be applied to other configurations as well. We compared with the single-index (double-index) model for nonparametric regression and showed that in some cases, the proposed method can provide more accurate solutions. We derived gradients to be used with either constrained or unconstrained nonlinear optimization and carried out a time complexity analysis of the associated estimates. We demonstrated the proposed method on two robotic applications and showed its potential over other methods.

Appendix

A.1 Derivatives of $R_K$: Constrained Optimization Case. We compute the derivatives of the risk $R_K$ in equation 3.1, assuming that $p(s \mid y)$ is given by equation 2.2. The results extend easily to the cross-validation case. It is not difficult to see that the derivative of $R_K$ with respect to a parameter $r$ (an element of $W$ or an angle $\theta_{kl}$) is

$$\frac{\partial}{\partial r} R_K = \frac{-1}{2 n h_y^2} \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\, \frac{\partial}{\partial r} \| y_i - y_j \|^2, \qquad (A.1)$$
where

$$b_{ij} = l_j(y_i) - \frac{w_{h_y}(y_i - y_j)\, w_{h_s}(s_i - s_j)}{\sum_{k=1}^{n} w_{h_y}(y_i - y_k)\, w_{h_s}(s_i - s_k)}. \qquad (A.2)$$
In the constrained optimization case, we can directly compute the gradient
$$\nabla_W \|y_i - y_j\|^2 = \nabla_W\, (y_i - y_j)^T (y_i - y_j) = 2\, (x_i - x_j)(y_i - y_j)^T; \qquad (A.3)$$

substituting in equation A.1 and expanding, we get

$$\nabla_W R_K = \frac{-1}{n h_y^2} \Big\{ \sum_i x_i y_i^T \sum_j b_{ij} - \sum_i x_i \sum_j b_{ij}\, y_j^T - \sum_j x_j \sum_i b_{ij}\, y_i^T + \sum_j x_j y_j^T \sum_i b_{ij} \Big\}. \qquad (A.4)$$
The first term is zero by the definition of $b_{ij}$ in equation A.2, while the rest can be written in the convenient matrix notation

$$\nabla_W R_K = \frac{1}{n h_y^2}\, X^T \left[ B + B^T - \mathrm{diag}(\mathbf{1}^T B) \right] Y, \qquad (A.5)$$
where $X$ is the $n \times d$ matrix of the sphered data, $Y$ is the $n \times q$ matrix of the projected data, $B$ is the $n \times n$ matrix with elements $b_{ij}$, $\mathbf{1}$ is a column vector of all ones, and $\mathrm{diag}(\cdot)$ transforms a vector to a diagonal matrix. Multiplying from right to left leads to reduced evaluation cost (see section A.3).

A.2 Derivatives of $R_K$: Unconstrained Optimization Case. In the unconstrained optimization case, it is not difficult to see that

$$\frac{\partial}{\partial \theta_{kl}} \| y_i - y_j \|^2 = 2\, (x_i - x_j)^T \left( \frac{\partial}{\partial \theta_{kl}} W \right) (y_i - y_j), \qquad (A.6)$$
where the derivative of $W$ with respect to $\theta_{kl}$ can be computed from equation 4.4 as

$$\frac{\partial}{\partial \theta_{kl}} W = \prod_{o=1}^{q} \prod_{u=q+1}^{d} \frac{\partial}{\partial \theta_{kl}} G_{ou}, \qquad (A.7)$$
where

$$\frac{\partial}{\partial \theta_{kl}} G_{ou} = \begin{cases} G'_{ou} & \text{if } k = o \text{ and } l = u, \\ G_{ou} & \text{otherwise,} \end{cases} \qquad (A.8)$$
and $G'_{ou}$ is the matrix $G_{ou}$ with the ones substituted by zeros and the trigonometric functions substituted by their derivatives. Expanding equation A.1
using equation A.6, we get the derivative of $R_K$ with respect to an angle $\theta_{kl}$ as

$$\frac{\partial}{\partial \theta_{kl}} R_K = \frac{1}{n h_y^2}\, \mathrm{trace}\left\{ \left( \frac{\partial}{\partial \theta_{kl}} W \right)^T X^T \left[ B + B^T - \mathrm{diag}(\mathbf{1}^T B) \right] Y \right\}. \qquad (A.9)$$

Using the permutation property of the trace and the gradient (see equation A.5), the above derivative can be simplified as

$$\frac{\partial}{\partial \theta_{kl}} R_K = \mathrm{trace}\left\{ (\nabla_W R_K)^T\, \frac{\partial W}{\partial \theta_{kl}} \right\}, \qquad (A.10)$$
in accordance with the chain rule for derivatives. Note that the gradient $\nabla_W R_K$ needs to be computed only once.

A.3 Complexities. We derive here some costs for the evaluation of relevant quantities in the non-cross-validation case, with immediate impact on the algorithmic speed-up. The cost of a projection through equation 2.1 is $O(qdn)$, while a single evaluation of $R_K$ through equations 3.1 and 2.2 requires the computation of all pairwise distances between the projected points in equation 2.4, which is $O(qn^2)$. An approximation via the fast Fourier transform (FFT) (see section 6) can drop this cost to at most $O(qn \log_2 n)$, giving a total cost for evaluating $R_K$ of $O(qn \cdot \max\{d, \log_2 n\})$. If $W$ is parameterized by Givens matrices, there is the additional cost of the multiplications in equation 4.4. Since the last matrix $G_{qd}$ in the sequence is $d \times q$, the multiplications must be carried out from right to left, giving a total complexity for the product of $O(q(d - q) q d^2)$. This makes a single evaluation of $R_K$ using the Givens parametrization and the FFT $O(q \cdot \max\{(d - q) q d^2, nd, n \log_2 n\})$. For the derivatives, we do not have results using the FFT, so the following applies to the non-FFT case. Constrained optimization requires the evaluation of the gradient $\nabla_W R_K$ in equation A.5, which, for the common case $n > d$, has cost $O(qn^2)$. This is achieved if we carry out the multiplications in equation A.5 from right to left,
$$\nabla_W R_K = \frac{1}{n h_y^2}\, X^T \Big( \left[ B + B^T - \mathrm{diag}(\mathbf{1}^T B) \right] Y \Big), \qquad (A.11)$$
and since $q$ (columns of $Y$) $< d$ (columns of $X$). In the unconstrained optimization case, the cost of computing $\partial W / \partial \theta_{kl}$ in equation A.7 for a single angle equals the cost of evaluating $W$ in equation 4.4, which is $O(q^2 (d - q) d^2)$. Since the trace in equation A.10 has negligible cost $O(qd)$, and the cost of $\nabla_W R_K$ above is $O(qn^2)$, we conclude that the cost of computing all $q(d - q)$ derivatives in equation A.10 is $O(q \cdot \max\{n^2, q^2 (d - q)^2 d^2\})$. This suggests that a sensible dimension to compress
the data with sphering is approximately $\sqrt{n}$. However, we should note that the above complexities are in fact lower when the algorithm is implemented in software specializing in matrix computations.

Acknowledgments
We are indebted to R. Bunschoten for providing the data sets and assisting in the experiments. We thank J. Portegies Zwart and J. J. Verbeek for inspiring discussions, K. Fukumizu for bringing Li (1991) to our attention, and the anonymous reviewers for their motivating comments. This work is supported by the Real World Computing program.

References

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B, 39, 1–38.
Gill, P. E., Murray, W., & Wright, M. (1981). Practical optimization. London: Academic Press.
Hall, P. (1989). On projection pursuit regression. Ann. Statist., 17(2), 573–588.
Härdle, W., Hall, P., & Ichimura, H. (1993). Optimal smoothing in single-index models. Ann. Statist., 21(1), 157–178.
Hastie, T., & Stuetzle, W. (1989). Principal curves. J. Amer. Statist. Assoc., 84, 502–516.
Huber, P. J. (1985). Projection pursuit (with discussion). Ann. Statist., 13, 435–525.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 86(414), 316–342.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Subspace methods for robot vision. IEEE Trans. Robotics and Automation, 12(5), 750–758.
Press, W. H., Teukolsky, S. A., Flannery, B. P., & Vetterling, W. T. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Principe, J. C., & Xu, D. (1999). Information-theoretic learning using Rényi's quadratic entropy. In Int. Workshop on Independent Component Analysis and Blind Separation of Signals (pp. 407–412). Aussois, France.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Silverman, B. W. (1982). Kernel density estimation using the fast Fourier transform. Appl. Statist., 31, 93–99.
Spivak, M. (1979). A comprehensive introduction to differential geometry, vol. II (2nd ed.). Berkeley, CA: Publish or Perish.
Stone, C. J. (1977). Consistent nonparametric regression (with discussion). Ann. Statist., 5, 595–645.
Thrun, S. (1998). Bayesian landmark learning for mobile robot localization. Machine Learning, 33(1).
Thrun, S. (2000). Probabilistic algorithms in robotics (Tech. Rep. No. CMU-CS-00126). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Torkkola, K., & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. Int. Conf. on Machine Learning. Stanford, CA.
Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London: Chapman & Hall.
Received February 29, 2000; accepted April 3, 2001.
LETTER
Communicated by Naftali Tishby
Clustering Based on Conditional Distributions in an Auxiliary Space

Janne Sinkkonen
janne.sinkkonen@hut.fi
Samuel Kaski
samuel.kaski@hut.fi
Neural Networks Research Centre, Helsinki University of Technology, FIN-02015 HUT, Finland

We study the problem of learning groups or categories that are local in the continuous primary space but homogeneous by the distributions of an associated auxiliary random variable over a discrete auxiliary space. Assuming that variation in the auxiliary space is meaningful, categories will emphasize similarly meaningful aspects of the primary space. From a data set consisting of pairs of primary and auxiliary items, the categories are learned by minimizing a Kullback-Leibler divergence-based distortion between (implicitly estimated) distributions of the auxiliary data, conditioned on the primary data. Still, the categories are defined in terms of the primary space. An online algorithm resembling the traditional Hebb-type competitive learning is introduced for learning the categories. Minimizing the distortion criterion turns out to be equivalent to maximizing the mutual information between the categories and the auxiliary data. In addition, connections to density estimation and to the distributional clustering paradigm are outlined. The method is demonstrated by clustering yeast gene expression data from DNA chips, with biological knowledge about the functional classes of the genes as the auxiliary data.

1 Introduction
Clustering algorithms and their goals vary, but it is common to aim at clusters that are relatively homogeneous while data in different clusters are dissimilar. The results depend totally on the criterion of similarity. The difficult problem of selecting a suitable criterion is commonly addressed by feature extraction and variable selection methods that define a metric in the data space. Recently, metrics have also been derived by fitting a generative model to the data and using information-geometric methods for extracting a metric from the model (Hofmann, 2000; Jaakkola & Haussler, 1999; Tipping, 1999). We study the related case in which additional useful information exists about the data items during the modeling process. The information is

Neural Computation 14, 217–239 (2001)
© 2001 Massachusetts Institute of Technology
available as auxiliary samples $c_k$; they form pairs $(x_k, c_k)$ with the primary samples $x_k$. In this article, $x_k \in \mathbb{R}^n$, and the $c_k$ are multinomial. The extra information may, for example, be labels of functional classes of genes, as in our case study. It is assumed that differences in the auxiliary data indicate what is important in the primary data space. More precisely, the difference between samples $x_k$ and $x_l$ is significant if the corresponding values $c_k$ and $c_l$ are different. The usefulness of this assumption depends, of course, on the choice of the auxiliary data. Since the relationship between the auxiliary data and the primary data is stochastic, we get a better description of the difference between values $x$ and $x'$ by measuring differences between the distributions of $c$, given $x$ and $x'$. The conditional densities $p(c \mid x)$ and $p(c \mid x')$ are not known, however. Only the set of sample pairs $\{(x_k, c_k)\}_k$ is available. Because our aim is to minimize within-cluster dissimilarities, the clusters should be homogeneous in terms of the (estimated) distributions $p(c \mid x)$. In order to retain the potentially useful structure of the primary space, we use the auxiliary data only to indicate importance and define the clusters in terms of localized basis functions within the primary space. Such a clustering can then be used later for new samples from the primary data space even when the corresponding auxiliary samples are not available. Very loosely speaking, our aim is to preserve the topology of the primary space but measure distances by similarity in the auxiliary space. Clearly, almost any kind of paired data is applicable, but only good auxiliary data improve clustering. If the auxiliary data are closely related to the goal of the clustering task, as, for example, a performance index would be, then the auxiliary data guide the clustering to emphasize the important dimensions of the primary data space and to disregard the rest.
This automatic relevance detection is the main practical motivation for this work. We previously constructed a local metric in the primary space that measures distances in that space by approximating the corresponding differences between conditional distributions $p(c \mid x)$ in the auxiliary space (Kaski, Sinkkonen, & Peltonen, in press). The metric can be used for clustering, and then maximally homogeneous clusters in terms of the conditional distributions appear. In this work, we introduce an alternative method that, contrary to the approach generating an explicit metric, does not need an estimate of the conditional distributions $p(c \mid x)$ as an intermediate step. We additionally show that minimizing the within-cluster distortion is equivalent to maximizing the mutual information between the basis functions used for defining the clusters (interpreted as a multinomial random variable) and the auxiliary data. Maximization of mutual information has been previously used for constructing neural representations (Becker, 1996; Becker & Hinton, 1992). Other related works and paradigms include learning from (discrete) dyadic data (Hofmann, Puzicha, & Jordan, 1998) and
distributional clustering (Pereira, Tishby, & Lee, 1993) with the information bottleneck (Tishby, Pereira, & Bialek, 1999) principle.

2 Clustering Based on the Kullback-Leibler Divergence
We seek to cluster items $x$ of the data space by using the information within a set of pairs $(x_k, c_k)$ of data. The set consists of paired samples of two random variables. The vector-valued random variable $X$ takes values $x \in \mathcal{X} \subset \mathbb{R}^n$, and the $c_k$ (or sometimes just $c$) are values of the multinomial random variable $C$. We wish to keep the clusters local with respect to $x$ but measure similarities between the samples $x$ by the differences of the corresponding conditional distributions $p(c \mid x)$. These distributions are unknown and will be implicitly estimated from the data. Vector quantization (VQ) or, equivalently, K-means clustering, is one approach to categorization. In VQ, the goal is to minimize the average distortion $E$ between the data and the prototypes or code book vectors $m_j$, defined by

$$E = \sum_j \int y_j(x)\, D(x, m_j)\, p(x)\, dx. \qquad (2.1)$$
Here, $D(x, m_j)$ denotes the measure of distortion between $x$ and $m_j$, and $y_j(x)$ is the cluster membership function that fulfills $0 \le y_j(x) \le 1$ and $\sum_j y_j(x) = 1$. In the classic "hard" vector quantization, the membership function is binary valued: $y_j(x) = 1$ if $D(x, m_j) \le D(x, m_i)$ for all $i$, and $y_j(x) = 0$ otherwise. Such functions define a partitioning of the space into discrete cells, Voronoi regions, and the goal of learning is to find the partitioning that minimizes the average distortion. If the membership functions $y_j(x)$ may attain any values between zero and one, the approach may be called soft vector quantization (Nowlan, 1990, has studied a maximum likelihood solution). We measure distortions $D$ as differences between the distributions $p(c \mid x)$ and model distributions. The measure of the differences will be the Kullback-Leibler divergence, defined for two discrete-valued distributions with event probabilities $\{p_i\}$ and $\{\psi_i\}$ as $D_{KL}(p, \psi) \equiv \sum_i p_i \log (p_i / \psi_i)$. In our case, the first distribution is the multinomial distribution in the auxiliary space that corresponds to the data $x$, that is, $p_i \equiv p(c_i \mid x)$. The second distribution is the prototype; let us denote the $j$th prototype by $\psi_j$. When the Kullback-Leibler distortion measure is plugged into equation 2.1, the error function of VQ, the average distortion becomes

$$E_{KL} = \sum_j \int y_j(x)\, D_{KL}(p(c \mid x), \psi_j)\, p(x)\, dx. \qquad (2.2)$$
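To make equation 2.2 concrete, here is a small sketch of the empirical distortion with hard cluster memberships. The names and the hard-assignment simplification are ours, and the rows of `P` stand in for the (in practice unknown) conditional distributions $p(c \mid x_k)$:

```python
import numpy as np

def kl(p, psi, eps=1e-12):
    """Kullback-Leibler divergence between discrete distributions p and psi."""
    return float(np.sum(p * np.log((p + eps) / (psi + eps))))

def kl_distortion(P, Psi, assign):
    """Empirical version of equation 2.2 with hard memberships:
    P[k] plays the role of p(c | x_k), Psi[j] is the prototype psi_j,
    and assign[k] is the cluster of sample k."""
    return float(np.mean([kl(P[k], Psi[assign[k]]) for k in range(len(P))]))
```

The distortion is zero exactly when every sample's conditional distribution equals its cluster's prototype, which is the sense in which the clusters are made homogeneous in the auxiliary space.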
Instead of the distortions between the vectorial samples and vectorial prototypes as in equation 2.1, we now compute point-wise distortions between
the distributions $p(c \mid x)$ and the prototypes $\psi_j$. The prototypes are distributions in the auxiliary space. The average distortion will be minimized by parameterizing the functions $y_j(x)$ and optimizing the distortion with respect to the cluster membership parameters and the prototypes. When the parameters of $y_j(x)$ are denoted by $\theta_j$, the average distortion can be written as

$$E_{KL} = -\sum_{i,j} \int \left[\, y_j(x; \theta_j) \log \psi_{ji} \,\right] p(c_i, x)\, dx + \mathrm{const.}, \qquad (2.3)$$
where the constant is independent of the parameters. The membership functions $y_j(x; \theta_j)$ can be interpreted as conditional densities $p(v_j \mid x) \equiv y_j(x)$ of a multinomially distributed random variable $V$ that indicates the cluster identity. The value of the random variable $V$ will be denoted by $v \in \{v_j\}$, and the value of the random variable $C$ corresponding to the multinomially distributed auxiliary distribution will be denoted by $c \in \{c_i\}$. Given $x$, the choice of the cluster $v$ does not depend on $c$. In other words, $C$ and $V$ are conditionally independent: $p(c, v \mid x) = p(c \mid x)\, p(v \mid x)$. It follows that $p(c, v) = \int p(c \mid x)\, p(v \mid x)\, p(x)\, dx$. It can be shown (see appendix A) that if the membership distributions are of the normalized exponential form

$$y_j(x; \theta_j) = \frac{\exp f(x; \theta_j)}{\sum_l \exp f(x; \theta_l)}, \qquad (2.4)$$
then the gradient of EKL with respect to the parameters µl becomes @EKL @µ l
D
X Z @f ( x I µ l ) i, j
@µ l
log
yji p ( ci , vj , vl , x ) dx . Ã li
(2.5)
The prototypes $\psi_j$ are probabilities of multinomial distributions, and therefore they must fulfill $0 \le \psi_{ji} \le 1$ and $\sum_i \psi_{ji} = 1$. We will incorporate these conditions into our model by reparameterizing the prototypes as follows:

$$\log \psi_{ji} \equiv \gamma_{ji} - \log \sum_m e^{\gamma_{jm}}. \qquad (2.6)$$
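The reparameterization of equation 2.6 is a softmax over the free parameters $\gamma_{ji}$; a minimal sketch (the function name is ours):

```python
import numpy as np

def prototypes(gamma):
    """Rows of psi from equation 2.6: psi_ji = exp(gamma_ji) / sum_m exp(gamma_jm).
    Subtracting the row maximum only improves numerical stability."""
    e = np.exp(gamma - gamma.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Whatever unconstrained values the $\gamma_{ji}$ take, each row of the result is a valid multinomial distribution, so the constraints never have to be enforced explicitly during gradient descent.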
The gradient of the average distortion, equation 2.3, with respect to the new parameters of the prototypes is

$$\frac{\partial E_{KL}}{\partial \gamma_{lm}} = \sum_i \int (\psi_{lm} - \delta_{mi})\, p(c_i, v_l, x)\, dx, \qquad (2.7)$$

where the Kronecker symbol $\delta_{mi} = 1$ when $m = i$, and $\delta_{mi} = 0$ otherwise.
The average distortion can be minimized with stochastic approximation, by sampling from $y_j(x)\, y_l(x)\, p(c_i, x) = p(v_j, v_l, c_i, x)$. This leads to an online algorithm in which the following steps are repeated for $t = 0, 1, \ldots$ with $\alpha(t)$ gradually decreasing toward zero:

1. At step $t$ of the stochastic approximation, draw a data sample $(x(t), c(t))$. Assume that the value of $c(t)$ is $c_i$. This defines the value of $i$ in the following steps.

2. Draw two basis functions, $j$ and $l$, from the multinomial distribution with probabilities $\{y_k(x(t))\}_k$.

3. Adapt the parameters $\theta_l$ and $\gamma_{lm}$, $m = 1, \ldots, N_c$, by

$$\theta_l(t+1) = \theta_l(t) - \alpha(t) \left[ \frac{\partial f(x; \theta_l)}{\partial \theta_l} \right]_{\theta_l = \theta_l(t)} \log \frac{\psi_{ji}}{\psi_{li}}, \qquad (2.8)$$

$$\gamma_{lm}(t+1) = \gamma_{lm}(t) - \alpha(t)\, (\psi_{lm} - \delta_{mi}), \qquad (2.9)$$
where $N_c$ is the number of possible values of the random variable $C$. Due to the symmetry between $j$ and $l$, it is possible to adapt the parameters twice for one $t$ by swapping $j$ and $l$ in equations 2.8 and 2.9 for the second adaptation. Note that $\theta_l(t+1) = \theta_l(t)$ if $j = l$. In stochastic approximation, the $\alpha(t)$ should fulfill the conditions $\sum_t \alpha(t) = \infty$ and $\sum_t \alpha(t)^2 < \infty$. In practice, we have used piecewise-linear decreasing schedules. We will consider two special cases. In the demonstrations in Figure 1, the basis functions are normalized gaussians in the Euclidean space $\mathcal{X} = \mathbb{R}^n$. In the second case in section 3, gene expression data mapped onto a hypersphere, $\mathcal{X} = S^n$, are clustered by using normalized von Mises-Fisher distributions (Mardia, 1975) as the basis functions. For gaussians parameterized by their locations $\theta_l$ and having a diagonal covariance matrix in which the variance $\sigma^2$ is equal in each dimension, $f(x; \theta_l) = -\|x - \theta_l\|^2 / 2\sigma^2$, and

$$\frac{\partial f(x; \theta_l)}{\partial \theta_l} = \frac{1}{\sigma^2}\, (x - \theta_l). \qquad (2.10)$$
The von Mises-Fisher (vMF) distribution is an analog of the gaussian distribution on a hypersphere (Mardia, 1975) in that it is the maximum entropy distribution when the first two moments are fixed. The density of an $n$-dimensional vMF distribution is

$$\mathrm{vMF}(x; \theta) = \frac{1}{Z_n(\kappa)} \exp\left( \kappa\, \frac{x^T \theta}{\|\theta\|} \right). \qquad (2.11)$$

Here, the parameter vector $\theta$ represents the mean direction vector. The normalizing coefficient $Z_n(\kappa) \equiv (2\pi)^{\frac{1}{2} n}\, I_{\frac{1}{2} n - 1}(\kappa) / \kappa^{\frac{1}{2} n - 1}$ is not relevant here,
Figure 1: The location parameters $\theta_j$ (small circles) of the gaussian basis functions of two models, optimized for two-class ($n(c) = 2$), two-dimensional, and three-dimensional data sets. The shades of gray at the background depict densities from which the data were sampled. (A) Two-dimensional data, with roughly gaussian $p(x)$. The inset shows the conditional density $p(c_0 \mid x)$, which is monotonically decreasing as a function of the $y$-dimension. The model has learned to represent only the dimension on which $p(c \mid x)$ changes. (B, C) Two projections of three-dimensional data, with a symmetric gaussian $p(x)$ (ideally, the form of this distribution should not affect the solution). The insets show the conditional density $p(c_0 \mid x)$, which decreases monotonically as a function of a two-dimensional radius and stays constant with respect to the orthogonal third dimension $z$. The one-dimensional cross section describing $p(c_0, x)$ and $p(c_1, x)$ as a function of the two-dimensional radius is shown in the inset of C. The model has learned to represent only variation in the direction of the radius and along the dimensions $x$ and $y$, and discards the dimension $z$ as irrelevant.
but it will be used in the mixture model of section 3. The function $I_r(\kappa)$ is the modified Bessel function of the first kind and order $r$. In the clustering algorithm described, we use vMF basis functions,

$$f(x; \theta_l) = \kappa\, x^T \theta_l / \|\theta_l\|, \qquad (2.12)$$
with a constant dispersion $\kappa$. Then,

$$\frac{\partial f(x; \theta_l)}{\partial \theta_l} = \kappa\, \big( x - x^T \theta_l\, \theta_l / \|\theta_l\|^2 \big) / \|\theta_l\|.$$

The norm $\|\theta_l\|$ does not affect $f$, and we may normalize $\theta_l$, whereby the gradient becomes

$$\frac{\partial f(x; \theta_l)}{\partial \theta_l} = \kappa\, (x - x^T \theta_l\, \theta_l). \qquad (2.13)$$
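A small sketch of equations 2.12 and 2.13 for unit-norm $\theta_l$ (function names ours):

```python
import numpy as np

def vmf_f(x, theta, kappa):
    """vMF basis function of equation 2.12, assuming theta has unit norm."""
    return kappa * x @ theta

def vmf_grad(x, theta, kappa):
    """Gradient of equation 2.13; valid when theta has unit norm."""
    return kappa * (x - (x @ theta) * theta)
```

The gradient is the projection of $\kappa x$ onto the tangent plane of the sphere at $\theta_l$, so each update moves $\theta_l$ along the sphere; this is consistent with the algorithm renormalizing the $\theta_j$ after each step.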
It can be shown (Kaski, 2000) that the stochastic approximation algorithm defined by equations 2.8, 2.9, and 2.13 converges with probability one when the $\theta_j$ are normalized after each step.

2.1 Connections to Mutual Information and Density Estimation. It can be easily shown (using the information inequality) that at the minimum of the distortion $E_{KL}$, the prototype $\psi_l$ takes the form
$$\psi_{li} = p(c_i \mid v_l). \qquad (2.14)$$
Hence, the distortion 2.3 can be expressed as

$$E_{KL} = -\sum_{i,j} p(c_i, v_j) \log \frac{p(c_i, v_j)}{p(c_i)\, p(v_j)} + \mathrm{const.} = -I(C; V) + \mathrm{const.}, \qquad (2.15)$$
$I(C; V)$ being the mutual information between the random variables $C$ and $V$. Thus, minimizing the average distortion $E_{KL}$ is equivalent to maximizing the mutual information between the auxiliary variable $C$ and the cluster memberships $V$. Minimization of the distortion has a connection to density estimation as well; the details are described in appendix B. It can be shown that minimization of $E_{KL}$ minimizes an upper limit of the mean Kullback-Leibler divergence between the real distribution $p(c \mid x)$ and a certain estimate $\hat{p}(c \mid x)$. The estimate is

$$\hat{p}(c_i \mid x) = \frac{1}{Z(x)} \exp \sum_j y_j(x; \theta_j) \log \psi_{ji}, \qquad (2.16)$$
where $Z(x)$ is a normalizing coefficient selected such that $\sum_i \hat{p}(c_i \mid x) = 1$.
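The mutual-information form of equation 2.15 can be checked numerically from a joint probability table; a sketch (the function name is ours):

```python
import numpy as np

def mutual_information(P):
    """I(C; V) for a joint table P[i, j] = p(c_i, v_j), in nats."""
    pc = P.sum(axis=1, keepdims=True)   # marginal p(c_i)
    pv = P.sum(axis=0, keepdims=True)   # marginal p(v_j)
    nz = P > 0
    return float(np.sum(P[nz] * np.log(P[nz] / (pc @ pv)[nz])))
```

For an independent joint table the value is zero, and it grows as the cluster identity $V$ becomes more informative about the auxiliary variable $C$, which is exactly what minimizing $E_{KL}$ rewards.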
For gaussian basis functions, the upper limit becomes tight when $\sigma$ approaches zero and the cluster membership functions become binary. Then the cluster membership functions approach indicator functions of Voronoi regions. Note that the solution cannot be computed in practice for $\sigma = 0$; smooth vector quantization with $\sigma > 0$ results in a compromise in which the solution is tractable, but only an upper limit of the error will be minimized. Furthermore, the mean divergence can be expressed in terms of the divergence of joint distributions as follows:

$$E_X \{ D_{KL}(p(c \mid x), \hat{p}(c \mid x)) \} = D_{KL}(p(c, x), \hat{p}(c \mid x)\, p(x)). \qquad (2.17)$$
Here $E_X\{\cdot\}$ denotes the expectation over values of $X$, and the latter $D_{KL}$ is the divergence of the joint distribution. The expression 2.17 is a cost function of an estimate of the conditional probability $p(c \mid x)$. Intuitively speaking, by minimizing equation 2.17, resources are not wasted in estimating the marginal $p(x)$, but all resources are concentrated on estimating $p(c \mid x)$. It can further be shown (see appendix B) that maximum likelihood estimation of the model $\hat{p}(c \mid x)$ using a finite data set is asymptotically equivalent to minimizing equation 2.17.

2.2 Related Works
2.2.1 Competitive Learning. The early work on competitive learning or adaptive feature detectors (Didday, 1976; Grossberg, 1976; Nass & Cooper, 1975; Pérez, Glass, & Shlaer, 1975) has a close connection to vector quantization (Gersho, 1979; Gray, 1984; Makhoul, Roucos, & Gish, 1985) and K-means clustering (Forgy, 1965; MacQueen, 1967). The neurons in a competitive-learning network are parameterized by vectors describing the synaptic weights, denoted by $m_j$ for neuron $j$. In the simplest models, the activity of a neuron due to external inputs is a nonlinear function $f$ of the inputs $x$ multiplied by the synaptic weights, $f(x^T m_j)$. The activities of the neurons compete; the activity of each neuron reduces the activity of the others by negative feedback. If the competition is of the winner-take-all type (Kaski & Kohonen, 1994), only the neuron with the largest $f(x^T m_j)$ remains active. Each neuron therefore functions as a feature detector that detects whether the input comes from a particular domain of the input space. During Hebbian-type learning, the neurons gradually specialize in representing different types of domains. (For recent, more detailed accounts, see Kohonen, 1984, 1993; Kohonen & Hari, 1999.) Although the winner is usually defined in terms of inner products, it is possible to generalize the model to an arbitrary metric. If the usual Euclidean metric is used, the learning corresponds to minimization of a mean-squared vector quantization distortion or, equivalently, minimization of the distance
Clustering in an Auxiliary Space
225
to the closest cluster center in the K-means clustering. The domain of the input space that a neuron detects can hence be interpreted as a Voronoi region in vector quantization. The relationship of competitive learning to our work is that the “cluster membership functions” y_j(x; µ_j) in section 2 may be interpreted as the outputs of a set of neurons, and in the limit of crisp membership functions (for gaussians, σ → 0), only one neuron—the one having the largest external input—is active and can be interpreted as the winner. After learning, our algorithm therefore corresponds to a traditional competitive network. The learning procedure makes the difference by making the network detect features that are as homogeneous as possible with regard to the auxiliary data. The learning algorithm has a potentially interesting relation to Hebbian or competitive learning as well. Assume that at most two of the neurons may become active at a time, with probabilities y_j(x; µ_j). Then the learning algorithm, equation 2.8, for vMF kernels reads
\[
\mu_l(t+1) = \mu_l(t) + \alpha(t)\,(x - x^T \mu_l\, \mu_l)\,(\log y_{li} - \log y_{ji})
\]

(neuron j is adapted at the same time, swapping j and l). If the activity of the neurons is binary valued, that is, the neurons j and l have activity value one and the others value zero, then the adaptation rule for any neuron k can be expressed by

\[
\mu_k(t+1) = \mu_k(t) + \alpha(t)\, g_k(t)\,(x - x^T \mu_k\, \mu_k)\,(\log y_{ki} - \log y_{ji}).
\tag{2.18}
\]
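As a concrete illustration, one step of the pairwise rule 2.18 can be sketched in a few lines. This is only our own minimal sketch for vMF kernels (unit-norm x and parameter vectors); all names are invented here, and the explicit renormalization step is our addition for the unit-norm parameterization (the paper's parameterization is norm-invariant instead):

```python
import numpy as np

def update_pair(theta, x, i, j, l, psi, lr):
    """One step of the pairwise rule (equation 2.18) for the active units
    j and l.  Rows of theta are unit-norm parameter vectors; psi[k, i]
    is unit k's estimate of p(c_i | v_k); i is the class of sample x."""
    for k, other in ((j, l), (l, j)):
        # Hebbian term x minus the "forgetting" term (x^T theta_k) theta_k
        tangent = x - (x @ theta[k]) * theta[k]
        # the unit whose class profile fits the observed class i better
        # gets a positive sign and moves toward x; the other moves away
        sign = np.log(psi[k, i]) - np.log(psi[other, i])
        theta[k] = theta[k] + lr * tangent * sign
        theta[k] /= np.linalg.norm(theta[k])  # renormalize (our addition)
    return theta

theta = np.eye(3)                              # three units in R^3
x = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)   # unit-norm sample
psi = np.array([[0.7, 0.3],                    # unit 0 favors class 0
                [0.3, 0.7],                    # unit 1 favors class 1
                [0.5, 0.5]])
theta = update_pair(theta, x, i=0, j=0, l=1, psi=psi, lr=0.1)
```

After observing a sample of class 0, unit 0 (whose class profile favors class 0) has moved toward x, and unit 1 has moved away from it.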
Here g_k(t) denotes the activity of the neuron k. The term g_k(t)x is Hebbian, whereas g_k(t)xᵀµ_k µ_k is a kind of forgetting term (cf. Kohonen, 1984; Oja, 1982). The difference from common competitive learning then lies within the last parentheses in equation 2.18. The parameter vector of the neuron of the active pair (j, l) that represents better the class i has a larger value of log y and is moved toward the current sample, whereas the other neuron is moved away from the sample. (Note also the similarity to the learning vector quantization algorithms; see, e.g., Kohonen, 1995.) Note that for normalized gaussian membership functions y_j(x; µ_j) and σ ≠ 0, our model is a kind of variant of gaussian mixture models or soft vector quantization. At the limit of crisp feature detectors (for gaussians, σ → 0), the output of the network reduces to a 1-of-C-coded discrete value. Similarly, the outputs of the soft version can be interpreted as probability density functions of a multinomial random variable. Such an interpretation has already been made by Becker in some of her work, discussed in section 2.2.2.

2.2.2 Multinomial Variables. Becker et al. (Becker & Hinton, 1992; Becker, 1996) have introduced a learning goal for neural networks called Imax. Their networks consist of two separate modules having different inputs,
226
Janne Sinkkonen and Samuel Kaski
and the learning algorithms aim at maximizing the mutual information between the outputs of the modules. For example, if the inputs are two-dimensional arrays of random dots with stereoscopic displacements simulating the views of two eyes, the networks are able to infer depth from the data. The variant called discrete Imax (Becker, 1996) is closely related to the clustering algorithm of this article. In Imax, the outputs of the neurons in each module are interpreted as the probabilities of a multinomial random variable, and the goal of learning is to maximize the mutual information between the variables of the two modules. Our model differs from Becker's in two ways. First, Becker uses (normalized) exp(xᵀµ_j) as basis functions, whereas our parameterization makes the basis functions invariant to the norms of µ_j (cf. equation 2.11). Without such invariance, the units with the largest norms may dominate the representation, a phenomenon that Becker noted as well. The other difference is that Becker optimizes the model using gradient descent based on the whole batch of input vectors, whereas we have a simple on-line algorithm, 2.18, adapting on the basis of one data sample at a time. The gradient of the discrete Imax with respect to the parameters µ_l is, after simplification and in our notation,

\[
\frac{\partial I}{\partial \mu_l} = \sum_{i,j} \int x \log \frac{p(c_i \mid v_j)}{p(c_i \mid v_l)}\, p(v_j, v_l, c_i, x)\, dx.
\tag{2.19}
\]
It would be possible to apply stochastic approximation here by sampling from p(v_j, v_l, c_i, x), which leads to a different adaptation rule from ours. Becker (1996) has also used gaussian basis functions, but with some approximations, ending up with a different formula for the gradient.

2.2.3 Continuous Variables. The mutual information between continuously valued outputs of two neurons can be maximized as well (Becker & Hinton, 1992; Becker, 1996). Some assumptions about the continuously valued signals and the noise have to be made, however. In Becker and Hinton (1992), the outputs were assumed to consist of gaussian signals corrupted by independent, additive gaussian noise. In this article, the multinomial Imax has been reinterpreted as (soft) vector quantization in the Kullback-Leibler “metric” in which the distance is measured in an auxiliary space. In neural terms, the model builds a representation of the input space: each neuron is a detector specialized to represent a certain domain. In contrast, the continuous version tries to represent the input space by generating a parametric transformation to a continuous, lower-dimensional output space. If the parameterization and the assumptions about the noise are correct, the continuous representations are potentially more accurate. The advantage of the quantized representation is that no such assumptions need to be made; the model is almost purely
data driven (semiparametric) and, of course, very useful for clustering-type applications.¹ It is particularly difficult to maximize the mutual information if each module has several continuously valued outputs. In some recent works (Fisher & Principe, 1998; Torkkola & Campbell, 2000), the Shannon entropy has been replaced by the quadratic Rényi entropy, yielding simpler formulas for the mutual information.

2.2.4 Information Bottleneck and Distributional Clustering. In distributional clustering works (Pereira et al., 1993) with the information bottleneck principle (Tishby et al., 1999), mutual information between two discrete variables has been maximized. Tishby et al. get their motivation from the rate distortion theory of Shannon and Kolmogorov (see Cover & Thomas, 1991, for a review). In the rate distortion theory, the aim is to find an optimal code book for a set of discrete symbols when a “cost” in the form of a distortion function describing the effects of a transmission line is given. In our notation, the authors consider the problem of building an optimal representation V for a discrete random variable X. In the rate distortion theory, a real-valued distortion function d(x, v) is assumed known, and I(X; V) is minimized with respect to the representation (or conventionally, the code book) p(v | x) subject to the constraint E_{X,V}{d(x, v)} < k. At the minimum, the conditional distributions defining the code book are

\[
p(v_l \mid x) = \frac{p(v_l)\,\exp[-\beta d(x, v_l)]}{\sum_j p(v_j)\,\exp[-\beta d(x, v_j)]},
\tag{2.20}
\]
where β depends on k. The authors realized that if the average distortion E_{X,V}{d(x, v)} is replaced by the mutual information −I(C; V), then the rate distortion theory gives a solution that captures as much information of the “relevance variable” C as possible. Here, the multinomial random variable C has the same role as our auxiliary data. The functional to be minimized becomes I(X; V) − βI(C; V), and its variational optimization with respect to the conditional densities p(v | x) leads to the solution 2.20 with

\[
d(x, v_j) = D_{KL}\bigl(p(c \mid x),\, p(c \mid v_j)\bigr).
\tag{2.21}
\]
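For a discrete and finite X, equations 2.20 and 2.21 can be turned into an iterative algorithm by alternating the self-consistent updates. The following is only our illustrative sketch on a toy problem (all names are ours), not the authors' implementation:

```python
import numpy as np

def bottleneck_step(q, p_x, p_c_given_x, beta):
    """One pass of the self-consistent updates for discrete X:
    recompute p(v) and p(c|v) from the current assignments q = p(v|x),
    then refresh q via equation 2.20 with the KL distortion 2.21."""
    p_v = q.T @ p_x                                        # p(v)
    p_cv = (q * p_x[:, None]).T @ p_c_given_x              # p(v, c)
    p_c_given_v = p_cv / p_cv.sum(axis=1, keepdims=True)   # p(c|v)
    # d(x, v_j) = D_KL(p(c|x), p(c|v_j))   (equation 2.21)
    d = (p_c_given_x[:, None, :]
         * (np.log(p_c_given_x[:, None, :]) - np.log(p_c_given_v[None]))
         ).sum(-1)
    new = p_v[None] * np.exp(-beta * d)                    # equation 2.20
    return new / new.sum(axis=1, keepdims=True)

# Toy problem: four x-values, two classes, two clusters.
p_x = np.full(4, 0.25)
p_c_given_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
q = np.random.default_rng(0).dirichlet(np.ones(2), size=4)  # random p(v|x)
for _ in range(50):
    q = bottleneck_step(q, p_x, p_c_given_x, beta=5.0)
```

At a fixed point, x-values with similar conditional class distributions p(c | x) end up in the same cluster.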
Together, these two equations give a characterization of the optimal representation V once we accept the criterion I(X; V) − βI(C; V) for the goodness of the representation. The characterization is self-referential through p(c | v) and therefore does not in itself present an algorithm for finding the p(v | x)

1 We have made some assumptions by parameterizing the f(x; µ) in equation 2.4. However, the model becomes semiparametric as a scale parameter, similar to the σ for gaussians and 1/κ for the vMF kernels, approaches zero.
and p(c | v), but Tishby et al. (1999) introduced an algorithm for finding a solution in the case of a multinomial X. Like the method presented in this article, the bottleneck aims at revealing nonlinear relationships between two data sets by maximizing a mutual information-based cost function. The relation between the two approaches is that although we started from the clustering viewpoint, our error criterion E_KL turned out to be equivalent to the (negative) mutual information I(C; V). The bottleneck has an additional term in its error function for keeping the complexity of the representation low, while the complexity of our clusters is restricted by their number and their parameterization. The most fundamental difference between our clustering approach and the published bottleneck works, however, arises from the continuity of our random variable X. The theoretical form of the bottleneck principle, equation 2.21, is not limited to discrete or finite spaces. According to our knowledge, however, no continuous applications of the principle have so far been published. For a continuous X, the distortion d(x, v) in equation 2.21 cannot be readily evaluated without some additional assumptions, such as restrictions to the form of the cluster memberships p(v | x). Our solution is to parameterize p(v | x), which allows us to optimize the partitioning of the data space X into (soft) clusters.²

3 Case Study: Clustering of Gene Expression Data
We tested our approach by clustering a large, high-dimensional data set, i.e., expressions of the genes of the budding yeast Saccharomyces cerevisiae in various experimental conditions. Such measurements, obtained from so-called DNA chips, are used in functional genomics to infer similarity of function of different genes. There are two popular approaches for analyzing expression data: traditional clustering methods (see, e.g., Eisen, Spellman, Brown, & Botstein, 1998) and supervised classification methods (support vector machines; Brown et al., 2000). In this case study, we intend to show that our method has the good sides of both of the approaches. For the majority of yeast genes, there exists a functional classification based on biological knowledge. The goal of the supervised classifiers is to learn this classification in order to predict functions for new genes. The classifiers may additionally be useful in that the errors they make on the genes having known classes may suggest that the original functional classification has errors. The traditional unsupervised clustering methods group solely on the basis of the expression data and do not use the known functional classes.
2 Alternatively, instead of solving a continuous problem, X could be (suboptimally) partitioned into predefined clusters, after which the standard distributional clustering algorithms are applicable.
Hence, they are applicable to sets of genes without known classification, and they may additionally generate new discoveries. There may be hidden similarities between the classes in the hierarchical functional classification, and there may even exist new subclasses that are revealed as more experimental data are collected. The clustering methods can therefore be used as hypothesis-generating machines. The disadvantage of the clustering algorithms is that the results are determined by the metric used for measuring similarity of the expression data. The metric is always somewhat arbitrary unless it is based on a considerable amount of knowledge about the functioning of the genes. Our goal is to use the known functional classification to define implicitly which aspects of the expression data are important. The clusters are local in the expression data space, but the prototypes are placed to minimize the average distortion 2.2 in the space of the functional classes. The difference from supervised classification methods is that while classification methods cannot surpass the original classes, the (supervised) clusters are not tied to the classification and may reveal substructures within and relations between the known functional classes. In this case study, we compare our method empirically with alternative methods and demonstrate its convergence properties and the potential usefulness of the results. More detailed biological interpretation of the results will be presented in subsequent articles.

We compared our model with two standard state-of-the-art mixture density models. The first is a totally unsupervised mixture of vMF distributions. The model is analogous to the usual mixture of gaussians; the gaussian mixture components are simply replaced by the vMF components. The model is

\[
p(x) = \sum_j p(x \mid v_j)\, p_j,
\tag{3.1}
\]
where p(x | v_j) = vMF(x; µ_j), and vMF is defined in equation 2.11. The p_j are the mixing parameters. In the second model, mixture discriminant analysis 2 (MDA2; Hastie, Tibshirani, & Buja, 1995), the joint distribution between the functional classes c and the expression data x is modeled by a set of additive components denoted by u_j:

\[
p(c_i, x) = \sum_j p(c_i \mid u_j)\, p(x \mid u_j)\, p_j,
\tag{3.2}
\]
where p(c_i | u_j) and p_j are parameters to be estimated, and p(x | u_j) = vMF(x; µ_j). Both models are fitted to the data by maximizing their log likelihood with the expectation-maximization algorithm (Dempster, Laird, & Rubin, 1977).
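For concreteness, one EM round for the vMF mixture 3.1 can be sketched as follows. This is only our own sketch (function names invented here), and it assumes one concentration κ shared by all components, so that the vMF normalizing constant cancels from the responsibilities:

```python
import numpy as np

def vmf_responsibilities(X, mu, mix, kappa):
    """E-step: responsibilities p(v_j | x) for the mixture 3.1 with
    vMF components sharing one concentration kappa.  Rows of X and mu
    are unit norm; the common vMF normalizer cancels in the ratio."""
    logits = kappa * X @ mu.T + np.log(mix)      # log p(x|v_j) + log p_j + const
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def vmf_m_step(X, r):
    """M-step: update mixing weights and unit-norm mean directions."""
    mix = r.mean(axis=0)
    mu = r.T @ X
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    return mu, mix

# Toy data: unit vectors scattered around two directions.
rng = np.random.default_rng(0)
A = rng.normal([5.0, 0.0], 1.0, (100, 2))
B = rng.normal([0.0, 5.0], 1.0, (100, 2))
X = np.vstack([A, B])
X /= np.linalg.norm(X, axis=1, keepdims=True)
mu = np.array([[1.0, 0.1], [0.1, 1.0]])
mu /= np.linalg.norm(mu, axis=1, keepdims=True)
mix = np.array([0.5, 0.5])
for _ in range(20):
    r = vmf_responsibilities(X, mu, mix, kappa=10.0)
    mu, mix = vmf_m_step(X, r)
```

The mean directions converge to the two data directions; estimating κ itself requires the vMF normalizer (see Mardia, 1975) and is omitted from this sketch.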
3.1 The Data. Temporal expression patterns of 2476 genes of the yeast were measured with DNA chips in nine experimental settings (for more details, see Eisen et al., 1998; the data are available online at http://rana.Stanford.edu/clustering/). Each sample measures the expression level of a gene compared to the expression in a reference state. Altogether, there were 79 time points for each gene, represented below by the feature vector x. The data were preprocessed in the same way as in Brown et al. (2000)—by taking logarithms of the individual values and normalizing the length of x to unity. The data were then divided into a training set containing two-thirds of the samples and a test set containing the remaining third. All the reported results except those reported in Table 1 are computed for the test set. The functional classification was obtained from the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD).³ The classification system is hierarchic, and we chose to use the 16 highest-level classes to supervise the clustering. Sample classes include metabolism, transcription, and protein synthesis. Some genes belonged to several classes. Seven genes were removed because of a missing classification at the highest level of the hierarchy.

3.2 The Experiments. We first compared the performance of the three models—the mixture of gaussians, MDA2, and our own—after the algorithms had converged. All models had 8 clusters and were run until there was no doubt about convergence. The mixture of gaussians and MDA2 were run for 150 epochs through the whole data and our model for 4.5 million stochastic iterations (α(t) decreased first with a piecewise-linear approximation to an exponential curve, and then linearly to zero in the end). All models were run three times with different randomized initializations, and the best of the three results was chosen.
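The preprocessing just described (logarithms followed by normalization to unit length, then a two-thirds/one-third split) can be sketched as follows; the function name and the random split are ours:

```python
import numpy as np

def preprocess(expression, train_fraction=2.0 / 3.0, seed=0):
    """Log-transform the expression ratios and normalize each gene's
    profile to unit length, then split into training and test sets."""
    X = np.log(expression)                         # logarithms of the values
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # ||x|| = 1 for each gene
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = round(train_fraction * len(X))
    return X[idx[:n_train]], X[idx[n_train:]]

# Toy stand-in for the 2476 x 79 matrix of positive expression ratios.
expr = np.random.default_rng(1).lognormal(size=(30, 79))
train, test = preprocess(expr)
```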
We measured the quality of the resulting clusterings by the average distortion error or, equivalently, the empirical mutual information. When estimating the empirical mutual information, the table of the joint distributions p(c_i, v_j) is first estimated. In our model, the ith row of the table is updated by p(v_j | x) = y_j(x; µ_j) (equation 2.4 with f defined by equation 2.12) for each sample (c_i, x). In the gaussian mixture model, the update is p(v_j | x) = p(x | v_j) p_j / p(x), and in MDA2 it is p(u_j | x) = p(x | u_j) p_j / p(x). After the table p(c_i, v_j) is computed, the performance criterion is obtained from equation 2.15 without the constant.⁴

3 http://www.mips.biochem.mpg.de/proj/yeast.
4 The empirical mutual information is an upward-biased estimate of the real mutual information, and the bias grows with decreasing data. Because the size of our sample is rather large and constant across the compared distributions and the number of values of the discrete variables is small, the bias does not markedly affect the results.
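The estimation procedure above can be condensed into a small function; a sketch (ours) that takes the accumulated joint table p(c_i, v_j) and returns the empirical mutual information in bits:

```python
import numpy as np

def empirical_mutual_information(joint):
    """Mutual information (in bits) of a joint table p(c_i, v_j); the
    table may hold unnormalized soft counts, accumulated by adding the
    memberships p(v_j | x) to row c_i for every sample (c_i, x)."""
    p = joint / joint.sum()                  # normalize to a distribution
    pc = p.sum(axis=1, keepdims=True)        # marginal p(c)
    pv = p.sum(axis=0, keepdims=True)        # marginal p(v)
    nz = p > 0                               # 0 log 0 = 0 convention
    return float((p[nz] * np.log2(p[nz] / (pc @ pv)[nz])).sum())

indep = np.array([[0.25, 0.25], [0.25, 0.25]])  # independent table: 0 bits
diag = np.array([[0.5, 0.0], [0.0, 0.5]])       # deterministic table: 1 bit
```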
Figure 2: Empirical mutual information between the generated gene expression categories and the functional classes of the genes, as a function of the parameter κ, which governs the width of the basis functions. Solid line: our model; dashed line: mixture of vMFs; dotted line: MDA2.
The results are shown in Figure 2 for different values of the parameter κ that governs the width of the vMF kernels, equation 2.11. Our model clearly outperforms the other models for a wide range of widths of the kernels and produces the best overall performance; the clusters of our model convey more information about the functional classification of the genes than the alternative models do. There is a somewhat surprising sideline in the results: the gaussian mixture model is about as good as the MDA2, although it does not use the class information at all. The reason is probably some special property of the data, since for other data sets MDA2 has outperformed the plain mixture model.

Next we demonstrate the number of iterations required for convergence. The empirical mutual information is plotted in Figure 3 as a function of the number of iterations. In our model, the schedule for decreasing the coefficient α(t) was the same in each run, stretched to cover the number of iterations and decaying to zero in the end. The number of complete data epochs for the MDA2 was made comparable to the number of stochastic iterations by multiplying it by the number of data samples and the number of kernels, divided by two (in our algorithm, two kernels are updated at each iteration step). The performance of MDA2 attains its maximum quickly, but our model surpasses MDA2 well before 500,000 iterations (see Figure 3).
Figure 3: Empirical mutual information as a function of the number of iterations. Solid line: our model; dashed line: MDA2. κ = 148.
Finally, we demonstrate the quality of the clustering by showing the distribution of the genes into the clusters from a few functional subclasses, known to be regulated in experimental conditions like those of our data (Eisen et al., 1998). Note that these subclasses were not included in the values of the auxiliary variable C. Instead, they were picked from the second and third level of the functional gene classification. To characterize all the genes, the learning and the test sets were now combined. In Table 1, each gene is assigned to the cluster having the largest value of the membership function for that gene. The table reveals that many subclasses are concentrated in one of the clusters found by our algorithm. The first four subclasses (a–d) belong to the same first-level class and are placed in the same cluster, number 2. For comparison, the distribution of the same subset of genes into clusters formed by the mixture of vMFs and MDA2 is shown in Table 2. The concentratedness of the classes in different clusters can be summarized by the empirical mutual information within the table; the mutual information is 1.2 bits for our approach and 0.92 for the other two. In Table 1, produced by our method, three of the subclasses (c, e, and f) have been clearly divided into two clusters, suggesting a possible biologically interesting division. Its relevance will be determined later by further biological inspection; in this article, our goal is to demonstrate that the semisupervised clustering approach can be used to explore the data set and provide potential further hypotheses about its structure.
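The assignment and tallying used to build Table 1 can be sketched in a few lines (our sketch; the names are ours):

```python
import numpy as np

def cluster_table(memberships, subclass, n_clusters):
    """Assign each gene to the cluster with the largest membership value
    and tally the genes per (subclass, cluster) pair, as in Table 1."""
    hard = memberships.argmax(axis=1)
    classes = sorted(set(subclass))
    table = np.zeros((len(classes), n_clusters), dtype=int)
    for cluster, label in zip(hard, subclass):
        table[classes.index(label), cluster] += 1
    return classes, table

# Toy memberships for five genes over three clusters.
m = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6],
              [0.1, 0.7, 0.2]])
classes, table = cluster_table(m, ["a", "a", "b", "b", "a"], n_clusters=3)
```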
Table 1: Distribution of Genes (Learning and Test Set Combined) of Sample Functional Subclasses into the Eight Clusters Obtained with Our Method.

                 Cluster Number
Class      1    2    3    4    5    6    7    8
a          0    1    6    0    1    0    0    0
b          0    1   16    0    0    0    0    0
c          1    5   39    1    1    4   14    3
d          0    3    8    1    0    0    0    2
e        122    1    0    2    0    2   44    2
f          3    3    1   20    0   46    2   12
g          0    1    0   21    0    0    0    7

Note: These subclasses were not used in supervising the clustering. a: the pentose-phosphate pathway; b: the tricarboxylic acid pathway; c: respiration; d: fermentation; e: ribosomal proteins; f: cytoplasmic degradation; g: organization of chromosome structure.
Table 2: Distribution of Genes (Learning and Test Set Combined) of Sample Functional Subclasses into the Eight Clusters Obtained by the Mixture of vMFs Model and MDA2.

                 Cluster Number
Class      1    2    3    4    5    6    7    8
a          0    3    1    0    0    2    1    1
b          1    1    0    0    0   14    0    1
c          3   16    2    5   23   14    4    1
d          0    9    0    2    0    0    2    1
e          0    6    1    4   32    1  125    4
f         42   12    6    8    0    4    4   11
g          3    1   10    5    0    0    2    8

Note: Both methods yield the same table for these subclasses. For an explanation of the classes, see Table 1.
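The mutual information summary above (1.2 bits for our method versus 0.92 bits for the other two) can be recomputed directly from the counts; a sketch (ours), with the counts transcribed from Tables 1 and 2:

```python
import numpy as np

def table_mi_bits(counts):
    """Empirical mutual information (bits) of a contingency table."""
    p = counts / counts.sum()
    pr = p.sum(axis=1, keepdims=True)
    pc = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pr @ pc)[nz])).sum())

table1 = np.array([[0, 1, 6, 0, 1, 0, 0, 0],       # our method (Table 1)
                   [0, 1, 16, 0, 0, 0, 0, 0],
                   [1, 5, 39, 1, 1, 4, 14, 3],
                   [0, 3, 8, 1, 0, 0, 0, 2],
                   [122, 1, 0, 2, 0, 2, 44, 2],
                   [3, 3, 1, 20, 0, 46, 2, 12],
                   [0, 1, 0, 21, 0, 0, 0, 7]])
table2 = np.array([[0, 3, 1, 0, 0, 2, 1, 1],       # vMF mixture / MDA2 (Table 2)
                   [1, 1, 0, 0, 0, 14, 0, 1],
                   [3, 16, 2, 5, 23, 14, 4, 1],
                   [0, 9, 0, 2, 0, 0, 2, 1],
                   [0, 6, 1, 4, 32, 1, 125, 4],
                   [42, 12, 6, 8, 0, 4, 4, 11],
                   [3, 1, 10, 5, 0, 0, 2, 8]])
mi1, mi2 = table_mi_bits(table1), table_mi_bits(table2)
```

Both tables tally the same 396 genes, and the computed values agree with the figures cited in the text.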
4 Conclusion
We have described a soft clustering method for continuous data that minimizes the within-cluster distortion between distributions of associated, discrete auxiliary data. The approach was inspired by our earlier work in which an explicit density estimator was used to derive an information-geometric metric for similar kinds of clustering tasks (Kaski et al., in press). The method presented here is conceptually simpler and does not require explicit density estimation, which is known to be difficult in high-dimensional spaces. The task is analogous to that of distributional clustering (Pereira et al., 1993) of multinomial data with the information bottleneck method (Tishby
et al., 1999), or learning from dyadic data (Hofmann et al., 1998). The main difference from our method is that these works operate on a discrete and finite data space, while our data are continuous. Our setup and cost function have connections to the information bottleneck method, but the approaches are not equivalent. We showed that minimizing our Kullback-Leibler divergence-based distortion criterion is equivalent to maximizing the mutual information between (neural) representations of the inputs and a discrete variable studied by Becker (Becker & Hinton, 1992; Becker, 1996). The distortion was additionally shown to be bounded by a conditional likelihood, which it approaches in the limit where the clusters sharpen toward Voronoi regions. We derived a simple on-line algorithm for optimizing the distortion measure. The convergence of the algorithm is proven for vMF basis functions in Kaski (2000). The algorithm was shown to have a close connection to competitive learning.

We applied the clustering method to a yeast gene expression data set that was augmented with an independent, functional classification for the genes. The algorithm performs better than other algorithms available for continuous data: the mixture of gaussians and MDA2, a model for the joint density of the expression data and the classes. Our method turned out to be relatively insensitive to the (a priori set) width parameter of the gaussian parameterization, outperforming the competing methods for a wide range of parameter values. It was shown that the obtained clusters mediate information about the function of the genes, and although the results have not yet been biologically analyzed, they potentially suggest novel cluster structures for the yeast genes.
Topics of future work include investigation of more flexible parameterizations for the clusters, the relationship of the method to the metric defined in our earlier work, and variations of the algorithm toward visualizable clusterings and continuous auxiliary distributions.

Appendix A: Gradient of the Error Criterion with Respect to the Parameters of the Basis Functions
If we denote

\[
K_{il}(x) \equiv \sum_j \frac{\partial y_j(x; \mu_j)}{\partial \mu_l} \log y_{ji},
\tag{A.1}
\]

then

\[
\frac{\partial E_{KL}}{\partial \mu_l} = -\sum_i \int K_{il}(x)\, p(c_i, x)\, dx.
\tag{A.2}
\]

Since the basis functions y_j(x; µ_j) are of the normalized exponential form (see equation 2.4), the gradient of y_j is⁵

\[
\frac{\partial y_j(x; \mu_j)}{\partial \mu_l}
= \delta_{lj}\, y_j(x; \mu_j)\, \frac{\partial f_j(x; \mu_j)}{\partial \mu_l}
- y_j(x; \mu_j)\, y_l(x; \mu_l)\, \frac{\partial f_l(x; \mu_l)}{\partial \mu_l}.
\tag{A.3}
\]

Substituting the result into equation A.1 gives

\[
K_{il}(x)
= y_l(x; \mu_l) \Bigl[\log y_{li} - \sum_j y_j(x; \mu_j) \log y_{ji}\Bigr] \frac{\partial f_l(x; \mu_l)}{\partial \mu_l}
= y_l(x; \mu_l) \Bigl[\sum_j y_j(x; \mu_j) \log \frac{y_{li}}{y_{ji}}\Bigr] \frac{\partial f_l(x; \mu_l)}{\partial \mu_l}.
\tag{A.4}
\]

Since the V and C are conditionally independent with respect to X, we may write

\[
y_j(x; \mu_j)\, y_l(x; \mu_l)\, p(c_i, x) = p(c_i, v_j, v_l, x).
\tag{A.5}
\]

Substituting this and equation A.4 into A.2, we arrive at equation 2.5.

Appendix B: Connection to Conditional Density Estimation
Let us denote y_j(x; µ_j) = y_j(x) for brevity, and note that the conditional entropy of C given X, or H(C | X) = −∫ Σ_i p(c_i, x) log p(c_i | x) dx, is independent of the parameters µ_j and y_ji. The distortion of equation 2.3 can then be expressed as

\begin{align}
E_{KL} &= -\sum_{i,j} \int y_j(x) \log y_{ji}\, p(c_i, x)\, dx - H(C \mid X) + \text{const.} \tag{B.1}\\
&= \int \sum_i \Bigl[\log p(c_i \mid x) - \sum_j y_j(x) \log y_{ji}\Bigr] p(c_i, x)\, dx + \text{const.} \tag{B.2}\\
&= \int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\exp \sum_j y_j(x) \log y_{ji}}\, p(x)\, dx + \text{const.} \tag{B.3}\\
&= \int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{q_i(x)}\, p(x)\, dx + \text{const.,} \tag{B.4}
\end{align}
5 Note that the normalized versions of the densities of the so-called exponential family are included—if the y_j(x; µ_j) are interpreted as densities of the random variables p(v_j | x).
where

\[
q_i(x) \equiv \exp \sum_j y_j(x) \log p(c_i \mid v_j).
\tag{B.5}
\]

The {q_i(x)}_i is not a proper density. However, Σ_i q_i(x) ≤ 1 based on Jensen's inequality. Hence, for all x,

\[
\sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{q_i(x)}\, p(x) \;\ge\; \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\hat p(c_i \mid x)}\, p(x),
\tag{B.6}
\]

where

\[
\hat p(c_i \mid x) \equiv \frac{q_i(x)}{\sum_i q_i(x)}.
\tag{B.7}
\]

Therefore, minimizing the clustering criterion E_KL minimizes an upper limit of

\[
\int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\hat p(c_i \mid x)}\, p(x)\, dx = E_X\{D_{KL}(p(c \mid x), \hat p(c \mid x))\},
\tag{B.8}
\]

the expected KL divergence between the conditional density p(c | x) and its estimate p̂(c | x). (Here E_X{·} denotes the expectation over x.) One can also write

\[
E_X\{D_{KL}(p(c \mid x), \hat p(c \mid x))\}
= \int \sum_i p(c_i, x) \log \frac{p(c_i, x)}{\hat p(c_i \mid x)\, p(x)}\, dx
= D_{KL}(p(c, x), \hat p(c \mid x)\, p(x)).
\tag{B.9}
\]
Maximizing the likelihood of the model p̂(c_k | x_k) for data {(c_k, x_k)}_k sampled from p(c, x) is asymptotically equivalent to minimizing the Kullback-Leibler divergence, equation B.9. This is because for the N independent and identically distributed samples {(c_k, x_k)}_k, the scaled log likelihood,

\[
\frac{1}{N} \sum_k \log \hat p(c_k \mid x_k),
\]

converges to

\[
\int \sum_i p(c_i, x) \log \hat p(c_i \mid x)\, dx
= -\int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\hat p(c_i \mid x)}\, p(x)\, dx + \text{const.}
= -E_X\{D_{KL}(p(c \mid x), \hat p(c \mid x))\} + \text{const.}
\tag{B.10}
\]
with probability 1 when N → ∞, and because maximizing equation B.10 minimizes equation B.9. In the special case when the y_j(x) are binary valued, the q_i(x) = p̂(c_i | x) is a proper density and the equality in equation B.6 holds for almost every x. Therefore, the minimization of (an approximation of) the average distortion E_KL on a finite data set is equivalent to maximizing the likelihood of p̂(c | x) on the same sample.

Acknowledgments
This work was supported by the Academy of Finland, in part by grant 50061. We thank Janne Nikkilä, Petri Törönen, and Jaakko Peltonen for their help with the simulations and processing of the data.

References

Becker, S. (1996). Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7, 7–31.
Becker, S., & Hinton, G. E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, Jr., M., & Haussler, D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, USA, 97, 262–267.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Didday, R. L. (1976). A model of visuomotor mechanisms in the frog optic tectum. Mathematical Biosciences, 30, 169–180.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA, 95, 14863–14868.
Fisher III, J. W., & Principe, J. (1998). A methodology for information theoretic feature extraction. In Proc. IJCNN'98, International Joint Conference on Neural Networks (pp. 1712–1716). Piscataway, NJ: IEEE Service Center.
Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21, 768–769.
Gersho, A. (1979). Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25, 373–380.
Gray, R. M. (1984, April). Vector quantization. IEEE ASSP Magazine, 4–29.
Grossberg, S. (1976). On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems. Biological Cybernetics, 21, 145–159.
Hastie, T., Tibshirani, R., & Buja, A. (1995). Flexible discriminant and mixture models. In J. Kay & D. Titterington (Eds.), Neural networks and statistics. New York: Oxford University Press.
Hofmann, T. (2000). Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 914–920). Cambridge, MA: MIT Press.
Hofmann, T., Puzicha, J., & Jordan, M. I. (1998). Learning from dyadic data. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 466–472). San Mateo, CA: Morgan Kaufmann.
Jaakkola, T. S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 487–493). San Mateo, CA: Morgan Kaufmann.
Kaski, S. (2000). Convergence of a stochastic semisupervised clustering algorithm (Tech. Rep. No. A62). Espoo, Finland: Helsinki University of Technology.
Kaski, S., & Kohonen, T. (1994). Winner-take-all networks for physiological models of competitive learning. Neural Networks, 7, 973–984.
Kaski, S., Sinkkonen, J., & Peltonen, J. (in press). Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Kohonen, T. (1993). Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6, 895–905.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer.
Kohonen, T., & Hari, R. (1999). Where the abstract feature maps of the brain might come from. TINS, 22, 135–139.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol.
1: Statistics (pp. 281–297). Berkeley: University of California Press.
Makhoul, J., Roucos, S., & Gish, H. (1985). Vector quantization in speech coding. Proceedings of the IEEE, 73, 1551–1588.
Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society B, 37, 349–393.
Nass, M. M., & Cooper, L. N. (1975). A theory for the development of feature detecting cells in visual cortex. Biological Cybernetics, 19, 1–18.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biology, 15, 267–273.
Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (pp. 183–190).
Pérez, R., Glass, L., & Shlaer, R. J. (1975). Development of specificity in cat visual cortex. Journal of Mathematical Biology, 1, 275–288.
Clustering in an Auxiliary Space
239
Tipping, M. E. (1999). Deriving cluster analytic distance functions from gaussian mixture models. In Proc. ICANN99, Ninth International Conference on Articial Neural Networks (pp. 815–820). London: IEE. Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control, and Computing. Urbana, IL. Torkkola, K. & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. ICML’2000, the 17th International Conference on Machine Learning (pp. 1015–1022). San Mateo, CA: Morgan Kaufmann. Received June 28, 2000; accepted April 7, 2001.
ARTICLE
Communicated by Peter Bartlett
On the Complexity of Computing and Learning with Multiplicative Neural Networks
Michael Schmitt
[email protected]
Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
In a great variety of neuron models, neural inputs are combined using the summing operation. We introduce the concept of multiplicative neural networks that contain units that multiply their inputs instead of summing them and thus allow inputs to interact nonlinearly. The class of multiplicative neural networks comprises such widely known and well-studied network types as higher-order networks and product unit networks. We investigate the complexity of computing and learning for multiplicative neural networks. In particular, we derive upper and lower bounds on the Vapnik-Chervonenkis (VC) dimension and the pseudo-dimension for various types of networks with multiplicative units. As the most general case, we consider feedforward networks consisting of product and sigmoidal units, showing that their pseudo-dimension is bounded from above by a polynomial with the same order of magnitude as the currently best-known bound for purely sigmoidal networks. Moreover, we show that this bound holds even when the unit type, product or sigmoidal, may be learned. Crucial for these results are calculations of solution set components bounds for new network classes. As to lower bounds, we construct product unit networks of fixed depth with superlinear VC dimension. For sigmoidal networks of higher order, we establish polynomial bounds that, in contrast to previous results, do not involve any restriction of the network order. We further consider various classes of higher-order units, also known as sigma-pi units, that are characterized by connectivity constraints. In terms of these, we derive some asymptotically tight bounds. Multiplication plays an important role in both neural modeling of biological behavior and computing and learning with artificial neural networks. We briefly survey research in biology and in applications where multiplication is considered an essential computational element.
The results we present here provide new tools for assessing the impact of multiplication on the computational power and the learning capabilities of neural networks.
Neural Computation 14, 241–301 (2001)
© 2001 Massachusetts Institute of Technology
1 Introduction
Neurons compute by receiving signals from a large number of other neurons and processing them in a complex way to yield output signals sent to other neurons again. A major issue in the formal description of single-neuron computation is how the input signals interact and jointly affect the processing that takes place further. In a great many neuron models, this combination of inputs is specified using a linear summation. The McCulloch-Pitts model and the sigmoidal neuron are examples of these summing neurons, which are very popular in applications of artificial neural networks. Neural network researchers widely agree that there is only a minor correspondence between these neuron models and the behavior of real biological neurons. In particular, the interaction of synaptic inputs is known to be essentially nonlinear (see, e.g., Koch, 1999). In search of biologically closer models of neural interactions, neurobiologists have found that multiplicative-like operations play an important role in single-neuron computations (see also Koch & Poggio, 1992; Mel, 1994). For instance, multiplication models nonlinearities of dendritic processing and shows how complex behavior can emerge in simple networks. In recent years, evidence has also accumulated that specific neurons in the nervous systems of several animals compute in a multiplicative way (Andersen, Essick, & Siegel, 1985; Suga, 1990; Hatsopoulos, Gabbiani, & Laurent, 1995; Gabbiani, Krapp, & Laurent, 1999; Anzai, Ohzawa, & Freeman, 1999a, 1999b). That multiplication increases the computational power and storage capacities of neural networks is well known from extensions of artificial neural networks where this operation appears in the form of higher-order units (see, e.g., Giles & Maxwell, 1987). A more general type of multiplicative neuron model is the product unit introduced by Durbin and Rumelhart (1989), where inputs are multiplied after they have been raised to some power specified by an adjustable weight.
We subsume networks containing units that multiply their inputs instead of summing them under the general concept of multiplicative neural networks and investigate the impact that multiplication has on their computational and learning capabilities. A theoretical tool that quantifies the complexity of computing and learning with function classes in general, and neural networks in particular, is the Vapnik-Chervonenkis (VC) dimension. In this article, we provide a theoretical study of the complexity of computing and learning with multiplicative neural networks in terms of this dimension. The VC dimension and related notions, such as the pseudo-dimension and the fat-shattering dimension, are well known to yield estimates for the number of examples required by learning algorithms for neural networks and other hypothesis classes, such that training results in a low generalization error. Using these dimensions, bounds on this sample complexity of learning can be obtained not only for the model of probably approximately correct (pac) learning due to Valiant (1984) (see also Blumer, Ehrenfeucht,
Haussler, & Warmuth, 1989), but also for the more general model of agnostic learning, that is, when the training examples are generated by some arbitrary probability distribution (see, e.g., Haussler, 1992; Maass, 1995a, 1995b; Anthony & Bartlett, 1999). Furthermore, in terms of the VC dimension, bounds have also been established for the sample complexity of on-line learning (Maass & Turán, 1992) and Bayesian learning (Haussler, Kearns, & Schapire, 1994). The VC dimension, however, is not only useful in the analysis of learning but has also proven to be a successful tool for studying the complexity of computing over, in particular, the real numbers. Koiran (1996) and Maass (1997) employed the VC dimension to establish lower bounds on the size of sigmoidal neural networks for the computation of functions. Further, using the VC dimension, limitations of the universal approximation capabilities of sigmoidal neural networks have been exhibited by the derivation of lower bounds on the size of networks that approximate continuous functions (Schmitt, 2000). Thus, in particular for neural networks, the VC dimension has acquired a wide spectrum of applications in analyzing the complexity of analog computing and learning. There are some known bounds on the VC dimension for neural networks that also partly include multiplicative units. These results are concerned with networks of higher order where the linearly weighted sum of a classical summing unit is replaced by polynomials of a certain degree. All bounds determined thus far, however, are given in terms of the maximum order of the units or require the order to be fixed. This is a severe restriction; it not only imposes a bound on the exponent of each individual input but also limits the degree of interaction among different inputs.
A network with units computing piecewise polynomial functions with order at most d is known to have VC dimension O(W^2 d), where W is the number of network parameters (Goldberg & Jerrum, 1995; Ben-David & Lindenbaum, 1998). For sigmoidal networks of higher order, Karpinski and Macintyre (1997) established the bound O(W^2 k^2 log d), where k is the number of network nodes. There are some further results for other network types of which we give a more complete account in a later section. They all consider the degree of higher-order units to be restricted. In this article, we derive bounds on the VC dimension and the pseudo-dimension of product unit networks. In these networks, the exponents are no longer fixed but are treated as variable weights. In addition, no restriction is imposed on their order. We show that a feedforward network consisting of product and sigmoidal units has pseudo-dimension at most O(W^2 k^2), where W is the number of parameters and k the number of nodes. Hence, the same bound that is known for higher-order sigmoidal networks of restricted degree also holds when product units are used instead of monomials. Moreover, the bound is valid not only when the types of the nodes are fixed but also when they can vary between a sigmoidal unit or a product unit. This may be the case, for instance, when a learning algorithm decides which unit type to assign to which node. The results are based on the method of solution
set components bounds (see Anthony & Bartlett, 1999), and we derive new such bounds. We use this method also for showing that a network with k higher-order sigmoidal units and W parameters has pseudo-dimension at most O(W^4 k^2), a bound that also includes the exponents as adjustable parameters. Thus, the VC dimension and the pseudo-dimension of these networks cannot grow indefinitely in terms of the order, and the exponents can be treated as weights analogous to the coefficients of the polynomials. These results indicate that from the theoretical viewpoint, multiplication can be considered as an alternative to summation without significantly increasing the complexity of computing and learning with higher-order networks. We further derive bounds for specific classes of higher-order units. Considering a higher-order unit as a network with one hidden layer, we define these classes in terms of constraints on the network connectivity. On the one hand, we restrict the number of connections outgoing from the input nodes; on the other hand, we put a limit on the number of hidden nodes. We derive various VC dimension bounds for these classes, order dependent as well as independent, and show for some of them that they are asymptotically tight. It might certainly be possible to embed a class of networks entirely into a single network. The results show, however, that smaller bounds arise when the specific properties of the class are taken into account. We also establish a lower bound for product unit networks stating that networks with two hidden layers of product and summing units have a superlinear VC dimension. Finally, we show that the pseudo-dimension of the single product unit and of the class of monomials is equal to the number of input variables. We focus on feedforward networks with a single output node. This means that at least two possible directions are not pursued further here: recurrent networks and networks with multiple output nodes.
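For a rough numerical sense of how the quoted pseudo-dimension bound expressions scale, the following sketch (ours, purely illustrative; the constants hidden by the O-notation are omitted, so only relative growth between the expressions is meaningful) evaluates W^2 k^2 against W^4 k^2 for a few hypothetical parameter counts:

```python
# Compare the growth of the two pseudo-dimension bound expressions
# quoted above: W^2 k^2 (networks of product and sigmoidal units)
# versus W^4 k^2 (higher-order sigmoidal units with variable
# exponents). O-notation constants are dropped, so the absolute
# numbers are meaningless; only the relative growth matters.

def bound_product_sigmoidal(W: int, k: int) -> int:
    """The O(W^2 k^2) bound expression, constants dropped."""
    return W ** 2 * k ** 2

def bound_higher_order_sigmoidal(W: int, k: int) -> int:
    """The O(W^4 k^2) bound expression, constants dropped."""
    return W ** 4 * k ** 2

for W, k in [(10, 3), (100, 10), (1000, 30)]:
    b1 = bound_product_sigmoidal(W, k)
    b2 = bound_higher_order_sigmoidal(W, k)
    print(f"W={W:>4}, k={k:>2}: W^2 k^2 = {b1:.3e}, W^4 k^2 = {b2:.3e}")
```

The W^4 term dominates quickly, which is the price quoted above for letting the exponents of higher-order sigmoidal units vary as adjustable parameters.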
VC dimension bounds for recurrent networks consisting of summing units, including threshold, sigmoidal, and higher-order units with fixed degree, have been established by Koiran and Sontag (1998). Shawe-Taylor and Anthony (1991) give bounds on the sample complexity for networks of threshold units with more than one output node. Since both of these works build on previous bounds for feedforward networks or single-output networks, respectively, the ideas presented here may be helpful for deriving similar results for recurrent or multiple-output networks containing multiplicative units. The article is organized as follows. In section 2 we introduce the necessary terminology and demarcate multiplicative from summing units. We then report on some research in neurobiology that resulted in the use of multiplication for the modeling of biological neural systems. Further, we give a brief review of some learning applications where multiplication, mainly in the form of higher order, has been employed to increase the capabilities of artificial neural networks. In section 3 we introduce the definitions of the VC dimension and the pseudo-dimension, and exhibit some close relationships between these two combinatorial characterizations of function classes. In this section we also survey previous results where bounds on
these dimensions have been obtained for neural networks. The new results follow in the two subsequent sections: section 4 contains the calculations of the upper bounds, and lower bounds are derived in section 5. Finally, in section 6 we give a summary of the results in a table and conclude with some remarks and open questions. Proofs of some technically more involved results, which are needed in section 4, are in the appendix.
2 Neural Networks with Multiplicative Units
Terminology in neural network literature varies and is occasionally used inconsistently. We introduce the terms and concepts that we shall adhere to throughout this article and present some of the biological and application-specific motivations that led researchers to the use of multiplicative units in neural networks.
2.1 Neural Network Terminology. The connectivity of a neural network is given in terms of an architecture, which is a graph with directed edges, or connections, between the nodes. Nodes with no incoming edges are called input nodes; nodes with no outgoing edges are output nodes. All remaining ones are hidden nodes. The computation nodes of an architecture are its output and hidden nodes. Input nodes serve as input variables of the network. The fan-in of a node is the number of connections entering the node; correspondingly, its fan-out is the number of connections leaving it. The architecture of a feedforward network has no cycles. We focus on feedforward networks with one output node, such as shown in Figure 1, that are suitable for computing functions with a scalar, that is, one-dimensional, output range. Some specific architectures are said to be layered. In this case, all nodes with equal distance from the input nodes constitute one layer, and edges exist only between subsequent layers. An architecture becomes a neural network when edges and nodes are labeled with weights and thresholds, respectively, as the network parameters.
To specify further which function is to be computed by the network, variables must be assigned to the input nodes and units selected for the computation nodes. The types of units studied in this article will be defined in the following section. Finally, when values, that is, in general real numbers, are specified for the network parameters, the network computes a unique function defined by functional composition of its units.
2.2 Neuron Models That Sum or Multiply. The networks we are considering consist of multiplicative as well as standard summing units. First, we briefly recall the definitions of the latter.
2.2.1 Summing Units. The three most widely used summing units are the threshold unit, the sigmoidal unit, and the linear unit. Each of them is parameterized by a set of weights w1, . . . , wn ∈ R, where n is the number
Figure 1: Neural network terminology. An architecture is shown with four input nodes, one hidden layer consisting of three hidden nodes, and one output node. Hidden and output nodes are computation nodes. All input nodes have fan-out 3; the hidden nodes have fan-in 4. We also consider architectures with fan-out restrictions. In this case, subsequent layers need not be fully connected. When parameters, that is, weights and thresholds, are assigned and node functions are specified by choosing units, the architecture becomes a neural network.
of input variables, and a threshold t ∈ R. They compute their output in the form f(w1 x1 + w2 x2 + ··· + wn xn − t), where x1, . . . , xn are the input variables with domain R and f is a nonlinear function, referred to as the activation function of the unit. The particular choice of the activation function characterizes what kind of unit is supposed. The threshold unit, also known as McCulloch-Pitts neuron (after McCulloch & Pitts, 1943) or perceptron,¹ uses for f the sign or Heaviside function, sgn(y) = 1 if y ≥ 0, and sgn(y) = 0 otherwise. The logistic function σ(y) = 1/(1 + e^(−y)) is the activation function employed by the sigmoidal unit.² Finally, we have a linear unit if the activation function is chosen to be the identity, that is, the linear unit plainly outputs the weighted sum w1 x1 + w2 x2 + ··· + wn xn − t. In other words, the linear unit computes an affine function. The activation function also defines the output ranges of the units: {0, 1}, (0, 1), and R for the threshold, sigmoidal, and linear unit, respectively. Linear units as computation nodes in neural networks are in general redundant if
¹ Perceptron is a widely used synonym for the threshold unit, although in the original definition by Rosenblatt (1958), the perceptron is introduced as a network of threshold units with one hidden layer (see also Minsky & Papert, 1988).
² There is a large class of sigmoidal functions of which the logistic function is only one instance. The sigmoidal unit as it is defined here is also referred to as the standard sigmoidal unit.
they feed their outputs only to summing units. For instance, a network consisting solely of linear units is equivalent to a single linear unit. They are mainly used as output nodes of networks that compute or approximate real-valued functions (see the survey by Pinkus, 1999). However, linear units as hidden nodes can help to save network connections.
2.2.2 Multiplicative Units. The simplest type of a multiplicative unit is a monomial, that is, a product x1^d1 x2^d2 ··· xn^dn, where x1, x2, . . . , xn are input variables. Each di is a nonnegative integer referred to as the degree or exponent of the variable xi. The value d1 + ··· + dn is called the order of the monomial. Since monomials restricted to the Boolean domain {0, 1}, where all nonzero exponents are 1 without loss of generality, compute the logical AND function, they are often considered as computing the continuous analog of a logical conjunction. A Boolean conjunction, however, does not necessarily require the use of multiplication because the logical AND can also be computed by a threshold unit. The same holds for the neural behavior known as shunting inhibition, which is sometimes referred to as a continuous AND-NOT operation. In this article, we do not consider conjunction as a multiplicative operation and use monomials as computational models for genuine multiplicative behavior with real-valued input and output. If M1, . . . , Mk are monomials, a higher-order unit is a polynomial w1 M1 + w2 M2 + ··· + wk Mk − t with real-valued weights w1, . . . , wk and threshold t. It is also well known under the name sigma-pi unit (Rumelhart, Hinton, & McClelland, 1986; Williams, 1986). As in the case of a summing unit, weights and threshold are parameters, but to uniquely identify a higher-order unit, one also has to specify the structural parameter of the unit, that is, the set of monomials {M1, . . . , Mk}. We call this set the structure of a higher-order unit.
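The unit types defined so far can be written down in a few lines; the following sketch is ours (all function and variable names are illustrative, not from the article):

```python
import math

# Minimal sketches of the units defined above: summing units with
# threshold, logistic, or identity activation; a monomial with fixed
# nonnegative integer degrees; and a higher-order (sigma-pi) unit.

def summing_unit(w, x, t, f):
    """Summing unit: f(w1*x1 + ... + wn*xn - t)."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) - t)

sgn = lambda y: 1 if y >= 0 else 0            # threshold unit activation
logistic = lambda y: 1 / (1 + math.exp(-y))   # sigmoidal unit activation
identity = lambda y: y                        # linear unit activation

def monomial(d, x):
    """Monomial x1^d1 * ... * xn^dn with fixed integer degrees d."""
    return math.prod(xi ** di for xi, di in zip(x, d))

def higher_order_unit(structure, w, x, t, f=identity):
    """Sigma-pi unit: f(w1*M1(x) + ... + wk*Mk(x) - t), where the
    structure is a list of degree vectors, one per monomial."""
    return f(sum(wi * monomial(di, x) for wi, di in zip(w, structure)) - t)

# As noted above, on the Boolean domain a threshold unit and an
# order-2 monomial both compute the logical AND of two inputs:
AND = lambda x: summing_unit([1, 1], x, 2, sgn)
assert all(AND([a, b]) == monomial([1, 1], [a, b]) for a in (0, 1) for b in (0, 1))
```

The final assertion checks the point made in the text: conjunction alone does not force multiplication, since the same Boolean function is computed by a threshold unit.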
A higher-order unit can be viewed as a network with one hidden layer of monomials and a linear unit as output node. For this reason, a higher-order unit is often referred to in the literature as a higher-order network. In such a network, only the connections leading from hidden nodes to the output node have weights, whereas the connections between input and hidden layer are weightless. Clearly, assigning multiplicative weights to input variables does not increase the power of a higher-order unit since, due to the multiplicative nature of the hidden nodes, all such weights can be moved forward to the output node. For higher-order units, it is also common, depending on the type of application, to use a threshold or a sigmoidal unit instead of a linear unit as output node. This yields a higher-order unit that computes f(w1 M1 + w2 M2 + ··· + wk Mk − t)
with nonlinear activation function f. In case f is the sign function, we have a higher-order threshold unit, or polynomial threshold unit. If the activation function is sigmoidal, we refer to it as a higher-order sigmoidal unit. Finally, we introduce the most general type of multiplicative unit, the product unit. It has the form
x1^w1 x2^w2 ··· xp^wp
with variables x1, . . . , xp and weights w1, . . . , wp. The number p of variables is called the order of the product unit. In contrast to the monomial, the product unit does not have fixed integers as exponents but variable ones that may even take on any real number. Thus, a product unit is computationally at least as powerful as a monomial. Moreover, ratios of variables, and thus division, can be expressed using negative weights. This is one reason the product unit was introduced by Durbin and Rumelhart (1989). Another advantage that these and other authors explore is that the exponents of product units can be adjusted automatically, for instance, by gradient-based and other learning methods (Durbin & Rumelhart, 1989; Leerink, Giles, Horne, & Jabri, 1995a, 1995b). We mentioned above the well-known fact that networks of linear units are equivalent to single units. The same holds for networks consisting solely of product units. Unless there is a restriction on the order, such a network can be replaced by a single unit. Therefore, product units are mainly used in networks where they occur together with other types of units, such as threshold or sigmoidal units. For instance, the standard neural network containing product units is an architecture with one hidden layer such as shown in Figure 1, where the hidden nodes are product units and the output node is a sigmoidal unit. A legitimate question is whether multiplicative units are actually needed in neural networks and whether their task can be done by some reasonably sized network of summing units. Indeed, for multiplication and exponentiation of integers in binary representation, networks of threshold units with a small number of layers have been constructed that perform these and other arithmetic operations (for surveys, see Hofmeister, 1994; Siu, Roychowdhury, & Kailath, 1995).
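The product unit, and the standard product-unit network just described (product units in the hidden layer, sigmoidal output unit), can be sketched as follows; the code and all names are ours, for illustration only:

```python
import math

# Sketch of a product unit and of the standard network described in
# the text: one hidden layer of product units feeding a sigmoidal
# output unit. Example parameters are arbitrary.

def product_unit(w, x):
    """Compute x1^w1 * x2^w2 * ... * xp^wp with real-valued exponents.
    Negative exponents express ratios of variables, i.e., division."""
    return math.prod(xi ** wi for wi, xi in zip(w, x))

def product_unit_network(hidden_exponents, output_weights, threshold, x):
    """Hidden layer of product units, sigmoidal (logistic) output unit."""
    hidden = [product_unit(w, x) for w in hidden_exponents]
    s = sum(wi * hi for wi, hi in zip(output_weights, hidden)) - threshold
    return 1 / (1 + math.exp(-s))

# A product unit with exponents (1, -1) computes the ratio x1/x2,
# something no monomial can express:
assert math.isclose(product_unit([1.0, -1.0], [6.0, 3.0]), 2.0)
```

Note that, as the text observes, a network built solely of product units would collapse to a single product unit; it is the mixture with a sigmoidal output node that makes the architecture interesting.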
However, the drawback with these networks is that they grow in size, albeit polynomially, with the number of bits required for the representation of the numbers to be multiplied. Such a quality is certainly disastrous when one would like to process real numbers with possibly infinite precision. Therefore, networks of summing units do not seem to be an adequate substitute for genuine multiplicative units.
2.2.3 Units versus Neurons. We have introduced all neuron models that will be considered in this article as units. As common in the neural network literature, we could have called them neurons or gates synonymously, so that we could speak of a sigmoidal neuron or a product gate. In what follows, we
shall always use the term unit for the computational element of a neural network, thereby viewing it as an abstraction for the functionality it represents. In an artificial neural network, a unit may be implemented by connecting several model neurons together, as monomials and a linear unit form a higher-order unit. A unit may also serve as a model for information processing in some part of a single biological neuron, such as monomials or product units used to model nonlinear interactions of synaptic inputs in dendritic trees.
2.3 Multiplication in Biological Neural Networks. There are several reasons that neurobiologists study multiplication as a computational mechanism underlying the behavior of neural systems. First, it can be used to model the nonlinearities involved in dendritic processing of synaptic inputs. Second, it is shown to arise in the output activity of individual model neurons or in neural populations. Third, it can be employed to explain how simple model networks can achieve complex behavior with biologically plausible methods. And finally, multiplicative neurons are found in real neural networks. In the following, we mention some of the research that has been done in each of these directions. Further references regarding the fundamental role and the evidence of multiplication-like operations on all levels of neural information processing can be found in Koch and Poggio (1992) and Koch (1999).
2.3.1 Dendritic Multiplication and Division. In a large quantity of neuron models, interaction of synaptic inputs is modeled as a linear operation. Thus, synaptic inputs are combined by adding them.
This is evident for the summing units introduced above, but it holds also for the more complex and biologically closer model of the leaky integrate-and-fire neuron (see, e.g., Softky & Koch, 1995; Koch, 1999, for surveys of single-neuron models).³ Linearity is believed to be sufficient for capturing the passive, or cable, properties of the dendritic membrane, where synaptic inputs are currents that add. From numerous studies using recording experiments or computer simulations, sufficient evidence has arisen that synaptic inputs can interact nonlinearly when the synapses are co-localized on patches of dendritic membrane with specific properties. Thus, the spatial grouping of synapses on the dendritic tree is reflected in the computations performed at local branches. The summing operation, due to its associativity merely representing the dendrite as an amorphous device, does not take hold of this. Consequently,
³ The linearity referred to here concerns the way several inputs interact, and not the neural response to a single synaptic input. Leaky integrate-and-fire models treat single synaptic inputs as nonlinearities by describing the course of the postsynaptic response as a nonlinear function in time. Here, we do not take detailed temporal effects into account but consider synaptic inputs as represented by weighted analog variables.
these local computations must be nonlinear, thereby enriching the computational power of the neuron by nonlinear computations that take place in the dendritic tree prior to the central nonlinearity of the neuron, the threshold. It has been convincingly argued that these dendritic nonlinearities should be modeled by multiplication. For instance, performing extensive computer experiments, Mel (1992b, 1993) has found that excitatory voltage-dependent membrane mechanisms, such as NMDA receptor channels, could form a basis for multiplicative interactions among neighboring synapses. The exhibited so-called cluster sensitivity can give rise to complex pattern recognition tasks performed by single neurons. This is also shown in a learning experiment where a biologically plausible, Hebbian-like learning rule is introduced to manipulate the spatial ordering of synaptic connections onto the dendritic tree (Mel, 1992a). Multiplicative-like operations in dendritic trees are also shown to take place in the form of division, an operation that is not performed by monomials and higher-order units but is available in the product units of Durbin and Rumelhart (1989) by using negative exponents (see section 2.2.2). Neurobiological results exhibiting division arise mainly in investigations concerned with shunting inhibitory synapses. For instance, in a mathematical analysis of a simple model neuron, Blomfield (1974) derives sufficient conditions for inhibitory synapses to perform division. The division operation can be seen as a continuous analog of the logical AND-NOT, also referred to as veto mechanism, a form in which it is studied by Koch, Poggio, and Torre (1983). Using computer simulations, they show that inhibitory synapses are able to perform such an analog veto operation. The capability of computing division is also found to be essential in a model constructed by Carandini and Heeger (1994) for the responses of cells in the visual cortex.
The divisive operation of the cell membrane of the model neuron explains how nonlinear operations required for selectivity to various visual input stimuli, such as position, orientation, and motion direction, can be achieved by single neurons. These authors also show that their theoretical results compare well with physiological data from the monkey primary visual cortex. Perhaps one of the earliest modeling studies involving nonlinear dendritic processing, albeit not explicitly multiplicative, is due to Feldman and Ballard (1982), who demonstrate how complex cognitive tasks can be implemented in a biologically plausible way using their model neurons. For a detailed account of further ideas and results about dendritic computation, we refer to the review by Mel (1994).
2.3.2 Model Neurons and Networks That Multiply. Beyond using multiplication for modeling dendritic computation, investigations have been aimed at revealing that entire neurons can function as multipliers. Even in the early days of cybernetics research, it was known that under certain conditions, a coincidence-detecting neuron performs a multiplication by transforming the frequencies of input spikes arriving at two synaptic sites into an output
spike frequency that is proportional to the product of the input frequencies (Kupfmüller & Jenik, 1961). Similar studies based on slightly different neuron models have been carried out by Srinivasan and Bernard (1976), Bugmann (1991), and Rebotier and Droulez (1994), with comparable results. Bugmann (1991) argues that proximal synapses lead to a multiplicative behavior, while with distal synapses the neuron operates in a summation mode. A further model of Bugmann (1992) includes time-dependent synaptic weights for the compensation of a problem caused by irregularities in input spike trains, which can deteriorate multiplicative behavior. Tal and Schwartz (1997) show that even the summing operation can be used to compute multiplication. They find sufficient conditions for a leaky integrate-and-fire neuron to compute a nonlinearity close to the logarithm. In this way, the logarithm of a product can be obtained by summing the outputs of leaky integrate-and-fire neurons. By means of the ln-exp transform xy = e^(ln x + ln y), the logarithm is also a key ingredient for the biophysical implementation of multiplication proposed by Koch and Poggio (1992) (see also Koch, 1999). As such, the transform is also suggested by Durbin and Rumelhart (1989) for their product units. Multiplication is shown to arise not only in single elements, but also in an ensemble of neurons, where it emerges as a property of the network. Salinas and Abbott (1996) show by computer simulations that population effects in a recurrent network can lead to multiplicative neural responses even when the individual neurons are not capable of computing a product.
2.3.3 Complex Neural Behavior Through Multiplication. Assuming that multiplicative operations can be carried out in the nervous system, several researchers have established that model neural networks relying on multiplication are capable of solving complex tasks in a biologically plausible way.
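The ln-exp transform mentioned in the previous subsection, by which a product of positive quantities reduces to a sum of logarithms, can be sketched directly (illustrative code, ours):

```python
import math

# The ln-exp transform: xy = e^(ln x + ln y). A mechanism that can
# compute logarithms and a mechanism that sums can together implement
# multiplication of positive quantities. The same trick underlies the
# product unit, since x^w = e^(w * ln x) for positive x.

def multiply_via_logs(x: float, y: float) -> float:
    """Multiply two positive numbers using only log, sum, and exp."""
    return math.exp(math.log(x) + math.log(y))

def product_unit_via_logs(weights, inputs):
    """Product unit x1^w1 * ... * xp^wp as a weighted sum in log space
    (positive inputs only)."""
    return math.exp(sum(w * math.log(x) for w, x in zip(weights, inputs)))

assert math.isclose(multiply_via_logs(6.0, 7.0), 42.0)
assert math.isclose(product_unit_via_logs([2.0, -1.0], [3.0, 4.0]), 9.0 / 4.0)
```

The restriction to positive inputs is inherent to the transform; this is one reason log-based implementations are discussed for quantities such as firing rates, which are nonnegative by nature.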
Poggio (1990) proposes a general framework for the approximate representation of high-dimensional mappings using radial basis functions. In particular, he shows that multiplication offers a computationally powerful and biologically realistic possibility of synthesizing high-dimensional gaussian radial basis functions from low dimensions (see also Poggio and Girosi, 1990a, 1990b). Building on this theory, Pouget and Sejnowski (1997) demonstrate that basis functions could be used by neurons in the parietal cortex to encode sensory inputs in a format suitable for generating motor commands. The hidden units of their model network, which is trained using gradient descent, compute the product of a gaussian function of retinal location with a sigmoidal function of eye position. Learning in multiplicative basis function networks is also investigated by Mel and Koch (1990) using a Hebbian method. Olshausen, Anderson, & Van Essen (1993) suggest a mechanism for how the visual system could focus attention and achieve pattern recognition that
252
Michael Schmitt
is invariant to position and scale by dynamic routing of information. The neural circuit contains control units that make multiplicative contacts in order to modulate the strengths of synaptic connections dynamically. The multiplicative network of Salinas and Abbott (1996) is conceived for solving a similar task. They show that it can transform visual information from retinal coordinates to coordinates that represent object locations with respect to the body. That nonlinear dendritic computation can be used in visual cortical cells for translation-invariant tuning is shown by the computer simulations of Mel, Ruderman, and Archie (1998). A neuron model proposed by Maass (1998) includes input variables for the representation of firing correlations among presynaptic neurons, in addition to the common input variables that represent their firing rates. In the computation of the output, the correlation variables are multiplicatively combined with monomials of rate variables. The model accounts for various phenomena believed to be relevant for complex information processing in biological networks, such as synchronization and binding. Theoretical results show that the model neurons and networks are computationally more powerful than models that compute only in terms of firing rates.

2.3.4 Biological Multiplicative Neurons. Neurons that perform multiplicative-like computations have been identified in several biological nervous systems. Recordings performed by Andersen et al. (1985) from single neurons in the visual cortex of monkeys show that the selectivity of their receptive fields changes with the angle of gaze. Moreover, the interaction of the visual stimulus and the eye position in these neurons is found to be multiplicative. Thus, they could contribute to the encoding of spatial locations independent of eye position.
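A multiplicative interaction of retinal stimulus and eye position of the kind just described can be sketched as a gain-modulated response: a gaussian tuning curve for retinal location scaled by a sigmoid of eye position, as in the hidden units of Pouget and Sejnowski. All parameter names and values below are illustrative, not taken from the recordings or from any particular model:

```python
import math

def basis_unit(r, e, r_pref=0.0, sigma=1.0, slope=1.0, e_thresh=0.0):
    """Hypothetical gain-modulated unit: a gaussian tuning curve for
    retinal location r multiplied by a sigmoid of eye position e."""
    gaussian = math.exp(-((r - r_pref) ** 2) / (2.0 * sigma ** 2))
    sigmoid = 1.0 / (1.0 + math.exp(-slope * (e - e_thresh)))
    return gaussian * sigmoid

# the response is largest when r sits at the preferred location and the
# eye-position gain is high
print(basis_unit(0.0, 5.0))
```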
Suga, Olsen, and Butman (1990) describe arrays of neural filters in the auditory system of the bat that provide a means for processing complex sound signals by operating as multipliers. As such, they are involved in a cross-correlation analysis of distance information conveyed by echo delays (see also Suga, 1990). Investigating the visual system of the locust, Hatsopoulos et al. (1995) show that a single motion-sensitive neuron, known as the lobula giant motion detector, performs a multiplication of two independent input signals. From experimental data, they derive an algorithm that could be used in the visual system to anticipate the time of collision with approaching objects. The results reveal multiplication to be an elementary building block underlying motion detection in insects. In the work of Gabbiani et al. (1999), these investigations are continued, resulting in a confirmation and generalization of the model. In an experimental study of the visual system of the cat, Anzai et al. (1999a, 1999b) find that neural mechanisms underlying binocular interaction are based on multiplication. They show that the well-studied simple and complex cells in the visual cortex perform multiplicative operations
Multiplicative Neural Networks
253
analogous to those that are used in an algorithm for the computation of interocular cross-correlation. These results provide a possible explanation of how the visual system could solve the stereo correspondence problem.

2.4 Learning in Artificial Multiplicative Networks. Early on in the history of artificial neural networks and machine learning, higher-order neurons were used as natural extensions of the widespread linearly weighted neuron models. One of the earliest machine learning algorithms using higher-order terms was the checker playing program developed by Samuel (1959). Its decisions are based on computing a weighted sum including second-order Boolean conjunctions. In the 1960s, higher-order units were very common in pattern recognition, where they appear in the form of polynomial discriminant functions (Cover, 1965; Nilsson, 1965; Duda & Hart, 1973). There is an obvious reason for this popularity: higher-order units are computationally more powerful than single neurons since the higher-order terms act as hidden units. Furthermore, since these hidden units have no weights, there is no need to backpropagate errors when using gradient descent for training. Therefore, considering a higher-order unit as a linear unit in an augmented input space, all available learning methods for linear units are applicable. Later, the delta rule, developed within the framework of parallel distributed processing, was easily generalized to networks having hidden units of higher order, resulting in a learning method for backpropagating errors in networks of higher-order sigmoidal units (Rumelhart, Hinton, & McClelland, 1986; Rumelhart, Hinton, & Williams, 1986). Soon, however, higher-order networks were caught up by what is known as the curse of dimensionality. Due to the fact that the number of monomials in a higher-order unit can be exponential in the input dimension, the complete representation of a higher-order unit succumbs to a combinatorial explosion.
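The combinatorial explosion is easy to quantify: the number of monomials of degree at most d in n variables is the binomial coefficient C(n + d, d), a count that reappears in the survey of VC dimension bounds below. A short numerical check, with the dimensions chosen arbitrarily for illustration:

```python
from math import comb

def num_monomials(n, d):
    """Number of monomials of degree at most d in n variables: C(n + d, d)."""
    return comb(n + d, d)

print(num_monomials(2, 2))    # 6: the monomials 1, x1, x2, x1^2, x1*x2, x2^2
# the count grows rapidly with the input dimension n
for n in (10, 50, 100):
    print(n, num_monomials(n, 3))
```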
Even if the order of the monomials is restricted by some fixed value, the number of higher-order terms, and thus the number of free parameters to manipulate, is exponential in this bound—too large for many practical applications. Therefore, networks of higher-order neurons have gained importance in cases where the number of parameters can be kept low. One area of application for these sparse higher-order networks is the recognition and classification of patterns that underlie various geometric transformations such as scale, translation, and rotation. Maxwell, Giles, Lee, and Chen (1986) describe a method for incorporating invariance properties into a network of higher-order units that leads to a significant reduction of the number of parameters (see also Giles & Maxwell, 1987; Bishop, 1995, for an explanation). Thus, faster training can also be achieved by encoding into the network a priori knowledge that need not be learned. Several studies using invariant recognition and classification experiments show that higher-order networks can be superior to standard neural networks and other methods with respect to training time and generalization capabilities (see, e.g., Giles & Maxwell, 1987; Perantonis & Lisboa, 1992;
Schmidt & Davis, 1993; Spirkovska & Reid, 1994). Invariance properties are also established for the so-called pi-sigma networks introduced by Ghosh and Shin (1992). A pi-sigma network consists of one hidden layer of summing units and has a monomial as output unit. Compared to a sigma-pi unit, it has the advantage of reducing the number of adjustable parameters further. The dependence of the number of weights on the order is linear for a pi-sigma network, whereas it can be exponential for a higher-order unit. While invariance and pi-sigma networks are mostly used with restricted order, researchers have also strived to use the full computational power of higher-order networks, that is, without imposing any constraints on the degree of multiplicative interactions. In order for this to be accomplished, however, it is not sufficient simply to adopt learning methods from standard neural networks. Instead, a number of new learning algorithms have been developed that allow units of any order but enforce sparseness by incrementally building up higher orders and adding units only when necessary. Instances of such algorithms can be found in the work of Redding, Kowalczyk, and Downs (1993), Ring (1993), Kowalczyk and Ferrá (1994), Fahner and Eckmiller (1994), Heywood and Noakes (1995), and Roy and Mukhopadhyay (1997). A constructive algorithm for a class of pi-sigma networks, called ridge polynomial networks, is proposed by Shin and Ghosh (1995). A method that replaces units in a quadratic classifier with a limited number of terms is devised in the early work of Samuel (1959). All of these incremental algorithms, however, have the disadvantage that they create higher order in a step-wise, discrete manner. Further, once a unit has its order, it cannot change it unless the unit is deleted and a new one added.
This problem is overcome by the product units of Durbin and Rumelhart (1989), where high order appears in the form of real-valued, adjustable weights that are exponents of the input variables. Thus, a product unit is able to learn any higher-order term. Moreover, its computational power is significantly larger compared to a monomial due to the fact that exponents can be nonintegral and even negative. In further studies by Janson and Frenzel (1993), Leerink et al. (1995a, 1995b), and Ismail and Engelbrecht (2000), product unit networks are found to be computationally more powerful than sigmoidal networks in many learning applications. In particular, it is shown that they can solve many well-studied problems using fewer neurons than networks with summing units. Besides the artificial neural network learning methods backpropagation and cascade correlation, more global optimization algorithms such as genetic algorithms, simulated annealing, and random search techniques are considered. A considerable volume of research has been concerned with the encoding and learning of formal languages in higher-order recurrent networks. Especially second-order networks have proven to be useful for learning finite-state automata and recognizing regular languages (Giles, Sun, Chen, Lee, & Chen, 1990; Pollack, 1991; Giles et al., 1992; Watrous & Kuhn, 1992; Omlin & Giles, 1996a, 1996b). Higher-order units have also been employed
as computational elements in Boltzmann machines and Hopfield networks in order to enlarge the capabilities and overcome the limitations of the first-order versions of these network models (Lee et al., 1986; Sejnowski, 1986; Psaltis, Park, & Hong, 1988; Venkatesh & Baldi, 1991a, 1991b; Burshtein, 1998). Results on storage capacities and learning curves for single higher-order units are established by Yoon and Oh (1998) using an approach from statistical mechanics.

3 Vapnik-Chervonenkis and Pseudo-Dimension

As the main subjects of study, we now introduce the tools for assessing the computational and learning capabilities of multiplicative neural networks: the VC dimension and the pseudo-dimension. We give the definition of the VC dimension in section 3.1 and exhibit as a helpful property that if the output node of a network is a summing unit, then it does not matter for the VC dimension which one it is. We also postulate a condition for the parameter domain of product units that avoids the problem of having to deal with complex numbers. In contrast to the VC dimension, the pseudo-dimension takes into account that neural networks in general deliver output values that are not necessarily binary. Thus, the pseudo-dimension appears suitable for networks that employ as output node a real-valued unit such as the linear or the sigmoidal unit, the monomial or the product unit. Nevertheless, after introducing the pseudo-dimension in section 3.2, we shall point out that the pseudo-dimension of a neural network, being an upper bound on its VC dimension, can almost serve as a lower bound as well, since it can be considered as the VC dimension of a slightly augmented network. In section 3.3 we give a brief review of VC dimension bounds for neural networks that have been previously established in the literature. The bounds concern networks with summing and multiplicative units and are the starting point for the results that will be derived in the subsequent sections.
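The complex-number issue just mentioned is easy to reproduce: a product unit computes the product of its inputs raised to real-valued exponents, and a negative input raised to a nonintegral power leaves the reals. The sketch below (function and variable names are ours) implements a product unit with a guard corresponding to the condition postulated in section 3.1:

```python
def safe_product_unit(x, w):
    """Product unit computing prod_i x_i ** w_i, with a guard enforcing
    the restriction that a negative input must carry an integer weight
    (otherwise the result would leave the reals)."""
    out = 1.0
    for xi, wi in zip(x, w):
        if xi < 0 and not float(wi).is_integer():
            raise ValueError("negative input requires an integer weight")
        out *= xi ** wi
    return out

# without the guard, Python itself leaves the reals:
print(type((-8.0) ** (1.0 / 3.0)))  # <class 'complex'>
# with the guard, mixed inputs are fine as long as the condition holds:
print(safe_product_unit([-2.0, 3.0], [2, 0.5]))  # (-2)^2 * 3^0.5
```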
3.1 Vapnik-Chervonenkis Dimension. Before we can give a formal definition of the VC dimension, we require some additional terms. A partition of a set S ⊆ ℝⁿ into two disjoint subsets (S₀, S₁) is called a dichotomy of S. A function f : ℝⁿ → {0, 1} is said to induce the dichotomy (S₀, S₁) of S if f satisfies f(S₀) ⊆ {0} and f(S₁) ⊆ {1}. More generally, if F is a class of functions mapping ℝⁿ to {0, 1}, then F induces the dichotomy (S₀, S₁) if there is some f ∈ F that induces (S₀, S₁). Further, the class F shatters S if F induces all possible dichotomies of S.

Definition. The Vapnik-Chervonenkis (VC) dimension of a class F of functions that map ℝⁿ to {0, 1}, denoted VCdim(F), is the cardinality of the
largest set shattered by F. If F shatters arbitrarily large sets, then the VC dimension of F is infinite.

The definition applies to function classes, but it is straightforwardly transferred to neural networks. Consider a network having connections and nodes labeled with programmable parameters, that is, weights and thresholds, respectively, and having a certain number, say n, of input nodes and one output node. If the output node is a threshold unit, there is a set of functions mapping ℝⁿ to {0, 1} naturally associated with this network: the set of functions obtained by assigning all possible values to the network parameters. Thus, the VC dimension of a neural network is well defined if the output node is a threshold unit. If the output node is not a threshold unit, the network functions are made {0, 1}-valued to comply with the definition of the VC dimension. A general convention is to compare the output of the network with some fixed threshold h, for instance, h = 1/2 in the case of a sigmoidal unit. This is also common in applications when networks with continuous output values are used for classification tasks. More specifically, the binary output value of the network is obtained by applying the function y ↦ sgn(y − h) to the output y of the real-valued network. Thus, h can be considered as an additional parameter of the network that may be chosen independently for every dichotomy. However, one can show that for a sigmoidal or a linear output node, the VC dimension does not rely on the specific value of this threshold. Moreover, it is also independent of the particular summing unit that is chosen. The following result makes this statement more precise. We omit its proof since it is easily obtained using scalings of the output weights and adjustments of the thresholds.

Proposition 1. The VC dimension of a neural network remains the same regardless of which summing unit (threshold, sigmoidal, or linear) is chosen for the output node.
According to this statement, we may henceforth assume without loss of generality that if the output node of a network is a summing unit, then it is a linear unit. Linear output nodes are commonly used, for instance, in neural networks constructed with the aim of approximating continuous functions (Pinkus, 1999). Beyond analyzing single networks, we also investigate the VC dimension of particular classes of networks. In this case, we refer to the VC dimension of a class of networks as the VC dimension of the class of functions obtained by taking the union of the function classes associated with each particular network. A problem arises with networks containing product units that receive negative inputs and have weights that are not integers. A negative number raised to some nonintegral power yields a complex number and has
no meaning in the reals. Since neural networks with complex outputs are rarely used in applications, Durbin and Rumelhart (1989) suggest a method for proceeding in this case. The idea is to discard the imaginary part and use only the real component for further processing. For Boolean inputs, this implies that the product unit becomes a summing unit that uses the cosine activation function. Although there are no problems reported from applications so far, this manipulation would have disastrous consequences for the VC dimension if it was extended to real-valued inputs. It is known that a summing unit with the sine or cosine activation function can shatter finite subsets of ℝ of arbitrarily large cardinality (Sontag, 1992; Anthony & Bartlett, 1999). Therefore, no finite VC dimension bounds can in general be derived for networks containing such units. To avoid these problems arising from negative inputs in combination with fractional weights, we shall require that the following condition on the parameter domain of product units is always satisfied in the networks we consider.

Condition. If an input x_i of a product unit is negative, the corresponding weight w_i is an integer.

This presupposition guarantees that there are no complex outputs resulting from product units, and it still permits viewing the product unit as a generalization of the monomial and, hence, networks containing product units as generalizations of networks with higher-order units. One of the main results in this article will be that networks with product units, where inputs and parameters satisfy the above condition, have a VC and pseudo-dimension that is a low-degree polynomial in the number of network parameters and the network size. Because product units comprise monomials, this will then also lead to a similar bound for higher-order networks.

3.2 Pseudo-Dimension.
Besides the VC dimension, several other combinatorial measures have been considered in the literature for the characterization of the variety of, in particular, a real-valued function class. The most relevant is the pseudo-dimension, which is also well known for providing bounds on the number of training examples in models of learning (Anthony & Bartlett, 1999).

Definition. Let F be a class of functions that map ℝⁿ to ℝ. The pseudo-dimension of F, denoted Pdim(F), is the VC dimension of the class

{g : ℝⁿ⁺¹ → {0, 1} | there is some f ∈ F such that for all x ∈ ℝⁿ and y ∈ ℝ: g(x, y) = sgn(f(x) − y)}.

The pseudo-dimension of a neural network is then defined as the pseudo-dimension of the class of functions computed by this network. Clearly, the VC dimension of a network is not larger than its pseudo-dimension. The
pseudo-dimension and the VC dimension of a network with a summing unit as output node are even more closely related. In particular, if there is an upper bound on the VC dimension of such a network in terms of the number of weights and computation nodes, then one can also obtain an upper bound on the pseudo-dimension. We shall show this now in a statement that slightly improves theorems 14.1 and 14.2 of Anthony and Bartlett (1999) if the output node is a summing unit. The latter results require two more connections and one more computation node, whereas the following manages with the same number of computation nodes and only one additional connection.

Proposition 2. Let N be a neural network having as output node a summing unit (threshold, sigmoidal, or linear). If the output node is a threshold unit, then Pdim(N) = VCdim(N). If the output node is a sigmoidal or a linear unit, then there is a network N′ with the same computation nodes as N, one more input node, and one more connection such that Pdim(N) = VCdim(N′).

Proof. Suppose N has n input nodes and let the function class G be defined by

G = {g : ℝⁿ⁺¹ → {0, 1} | there is some f computed by N such that for all x ∈ ℝⁿ and y ∈ ℝ: g(x, y) = sgn(f(x) − y)},

according to the definition of the pseudo-dimension of N. If the output node of N is a threshold unit, then it can easily be seen that for s₁, ..., s_m ∈ ℝⁿ and u₁, ..., u_m ∈ ℝ, G shatters {(s₁, u₁), ..., (s_m, u_m)} if and only if G shatters {(s₁, 1/2), ..., (s_m, 1/2)}. Further, the latter set is shattered by G if and only if the set {s₁, ..., s_m} is shattered by N. Thus, Pdim(N) = VCdim(N). If the output node is linear or sigmoidal, we construct N′ by adding to N a new input node for the variable y and a connection with weight −1 from this input node to the output node. Due to proposition 1, we may employ without loss of generality a linear unit for the output node of N′. Clearly, if N has a linear output node, then the set {(s₁, u₁), ..., (s_m, u_m)} is shattered by G if and only if the set {(s₁, u₁), ..., (s_m, u_m)} is shattered by N′. Further, if N has a sigmoidal output node, then the set {(s₁, u₁), ..., (s_m, u_m)} is shattered by G if and only if the set {(s₁, σ⁻¹(u₁)), ..., (s_m, σ⁻¹(u_m))} is shattered by N′.

3.3 Known Bounds on the VC Dimension of Neural Networks. We give a short survey of some previous results on the VC dimension of neural networks. Most of the bounds are for networks consisting solely of summing units. Since a summing unit can be considered as a multiplicative unit with order restricted to one, special cases of these results are relevant for the networks considered in this article. Further, some of the results hold for
polynomial activation functions of restricted order and are given in terms of a bound on this order. It is therefore interesting to see how these bounds are related to the order-independent bounds derived here. We quote the results for the most part in their asymptotic form. The reader may find constants in the cited references or in Anthony and Bartlett (1999). The bounds can be divided into three categories: for single units, networks, and classes of networks.

The VC dimension of a summing unit is known to be n + 1, where n is the number of variables, and is hence equal to the number of adjustable parameters. This holds for the threshold, sigmoidal, and linear unit. The proof of this fact goes back to an ancient result by Schläfli (1901). The pseudo-dimension of these units has also been determined exactly, being n + 1 for each of them as well (see, e.g., Anthony & Bartlett, 1999).

There is a large quantity of results on the VC dimension for neural networks that takes into account various unit types and network architectures. An upper bound for networks of threshold units is calculated by Baum and Haussler (1989). They show that a network with k computation nodes, which are all threshold units, and a total number of W weights has VC dimension O(W log k). This bound can also be derived from an earlier result of Cover (1968). Its tightness in the case of threshold units is established by two separate works. Sakurai (1993) shows that the architecture with one hidden layer has VC dimension Ω(W log k) on real-valued inputs. Maass (1994) provides an architecture with two hidden layers respecting the bound Ω(W log W) on Boolean inputs. There are also some bounds known for networks using more powerful unit types.
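Returning to the single summing unit, the lower-bound half of the n + 1 statement can be verified by brute force for small n: for n = 2, three points in general position are shattered by threshold units. The grid search below is a toy check of ours, not part of any proof cited above:

```python
from itertools import product as iproduct

def threshold_unit(w, b, x):
    """Summing unit with threshold activation: 1 iff w . x + b >= 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b >= 0)

def shattered(points, weight_grid):
    """Brute-force check: do threshold units with parameters drawn from
    weight_grid induce every dichotomy of `points`?"""
    n = len(points[0])
    dichotomies = set()
    for params in iproduct(weight_grid, repeat=n + 1):
        w, b = params[:n], params[n]
        dichotomies.add(tuple(threshold_unit(w, b, p) for p in points))
    return len(dichotomies) == 2 ** len(points)

# three points in general position in R^2 are shattered, matching the
# lower bound n + 1 = 3 for the VC dimension of a summing unit
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
grid = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
print(shattered(pts, grid))  # True
```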
A result of Goldberg and Jerrum (1995), which was independently obtained by Ben-David and Lindenbaum (1998), shows that neural networks employing as computation nodes higher-order units with an order bounded by some fixed value have VC dimension O(W²), where W is the number of weights. This bound holds even if the activation functions of the units are piecewise polynomial functions with an order bounded by a constant.⁴ Koiran and Sontag (1997) construct networks of threshold and linear units with VC dimension Ω(W²), thus showing that the quadratic upper bound of Goldberg and Jerrum (1995) is asymptotically optimal. Networks with depth restrictions are considered by Bartlett, Maiorov, and Meir (1998). They establish the bound O(W log W) for networks with piecewise polynomial activation functions, a fixed number of layers, and
⁴ For polynomial neural networks, Goldberg and Jerrum (1995) derive no explicit bound in terms of the order. The bound O(W²) for neural networks is obtained via an upper bound on the time needed by an algorithm to compute the output value of the network. This latter bound is linear in the running time of the algorithm. (More precisely, it is O(Wt) if the number of computation steps is bounded by t.) Since the algorithm may multiply two real numbers in constant time, the time required for the evaluation of a polynomial grows linearly in the order of the polynomial. These considerations lead to an upper bound that is linear in the order (see also Anthony & Bartlett, 1999, theorem 8.7).
polynomials of bounded order. Expressed in terms of the number L of layers and the bound d on the order, their result is O(WL log W + WL² log d). A similar upper bound is obtained by Sakurai (1999). Bartlett et al. (1998) also show that Ω(WL) is a lower bound for piecewise polynomial networks with W weights and L layers. An upper bound for sigmoidal networks is due to Karpinski and Macintyre (1997), who show that higher-order sigmoidal networks with W weights and k computation nodes of order at most d have VC dimension O(W²k² + Wk log d). Koiran and Sontag (1997) establish Ω(W²) as a lower bound for sigmoidal networks.

Finally, some authors consider classes of networks. Hancock, Golea, and Marchand (1994) show that the so-called class of nonoverlapping neural networks with n input nodes and consisting of threshold units has VC dimension O(n log n). In a nonoverlapping network, all nodes, computation as well as input nodes, have fan-out at most one. That this bound is asymptotically tight for this class of networks is shown by Schmitt (1999) establishing the lower bound Ω(n log n). Classes of higher-order units with order restrictions are studied by Anthony (1995). He shows that the VC dimension of the class of higher-order units with n inputs and order not larger than d is equal to the binomial coefficient (n+d choose d). Hence, this VC dimension respects the upper and lower bound Θ(nᵈ). Karpinski and Werther (1993) consider classes of higher-order units with one input node, or univariate polynomials, with a restriction on the fan-in of the output node, or, equivalently, the number of monomials. They show that the class of higher-order units having one input node and at most k monomials has VC dimension Θ(k). It is noteworthy that this result allows higher-order units of arbitrary order and does not depend on a bound for this order.
4 Upper Bounds

In the following, we derive upper bounds on the VC dimension and the pseudo-dimension of various feedforward networks with multiplicative units in terms of the number of parameters and the network size. We begin in section 4.1 by considering networks with one hidden layer of product units and a summing unit as output node. Then bounds for arbitrary networks, where each node may be a product or a sigmoidal unit, are established in section 4.2. These results are employed in section 4.3 to determine bounds for higher-order sigmoidal networks. In section 4.4 we consider single higher-order units and, finally, in section 4.5 we focus on product units and monomials.

4.1 Product Unit Networks with One Hidden Layer. We consider first the most widely used type of architecture, which has one hidden layer of computation nodes such as shown in Figure 1. Throughout this section, we assume that the output node is a summing unit. The use of this class of
networks is theoretically justified by the so-called universal approximation property. Results from approximation theory show that networks with one hidden layer of product or sigmoidal units and a linear unit as output node are dense in the set of continuous functions and hence can approximate any such function arbitrarily well (see, e.g., Leshno, Lin, Pinkus, & Schocken, 1993; Pinkus, 1999).

We recall from the condition on the parameter domain of product units stated in section 3.1 that for product units to yield output values in the reals, the input domain is restricted such that for a nonintegral weight, the corresponding input value is from the set ℝ₀⁺ = {x ∈ ℝ : x ≥ 0}. There is, however, still a problem when a product unit receives the value 0, and this input is weighted by a negative number. Then the output value of the unit is undefined. The possibility of forbidding 0 as input value in general is certainly too restrictive and would go against many learning applications. Therefore, we use a default value in case a unit raises the input value 0 to some negative power and say that the output value of such a unit is 0.⁵ With these agreements, the following can be shown. (Here and in subsequent formulas we use "log" to denote the logarithm of base 2.)

Theorem 1. Suppose that N is a neural network with one hidden layer consisting of k product units, and let W be the total number of parameters (weights and threshold). Then N has pseudo-dimension at most (Wk)² + 8Wk log(13Wk).

The main ideas of the proof are first to derive from N and an arbitrary set S of input vectors a set of exponential polynomials in the parameter variables of N. The next step is to consider the connected components of the parameter domain that arise from the zero sets of these polynomials. A connected component (also referred to as a cell) of a set D ⊆ ℝᵈ is a maximal nonempty subset C of D such that any two points of C are connected by a continuous curve that lies entirely in C.
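For concreteness, the bound of Theorem 1 can be evaluated numerically (with log base 2, as agreed above); the network sizes below are arbitrary examples of ours:

```python
import math

def pdim_bound(W, k):
    """Pseudo-dimension bound of Theorem 1 for a one-hidden-layer product
    unit network: (W*k)**2 + 8*W*k*log2(13*W*k)."""
    return (W * k) ** 2 + 8 * W * k * math.log2(13 * W * k)

# for example, 5 product units and 20 parameters in total: the bound is
# polynomial in W*k, dominated by the quadratic term
print(pdim_bound(20, 5))
```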
The polynomials are constructed in such a way that the number of dichotomies that N induces on S is not larger than the number of connected components generated by these polynomials. Thus, a bound on the number of connected components also limits the number of dichotomies. The basic tools for this calculation are provided by Karpinski and Macintyre (1997), who combined the work of Warren (1968) and Khovanskiĭ (1991) to obtain for the first time polynomial bounds on the

⁵ Another way would be to introduce three-valued logic such that the output of a network that divides by 0 is some value "undefined," that is, neither 0 nor 1. This then leads to a different notion of dimension that takes multiple-valued outputs into account. Ben-David, Cesa-Bianchi, Haussler, and Long (1995) study such dimensions and their relevance for learning. In particular, they show that a large variety of dimensions for multiple-valued functions are closely related in that they differ from each other at most by a constant factor. Therefore, should the three-valued approach be preferred, the bounds derived here can also be used to obtain estimates for these generalized dimensions.
VC dimension of sigmoidal neural networks. These tools are explicated and further developed in Anthony and Bartlett (1999). We introduce two definitions from this book for sets of functions: the property to have regular zero-set intersections and the solution set components bound (definitions 7.4 and 7.5 of Anthony & Bartlett, 1999).

A set {f₁, ..., f_k} of differentiable real-valued functions on ℝᵈ is said to have regular zero-set intersections if for every nonempty set {i₁, ..., i_l} ⊆ {1, ..., k}, the Jacobian (i.e., the matrix of the partial derivatives) of (f_{i₁}, ..., f_{i_l}) : ℝᵈ → ℝˡ has rank l at every point of the set {a ∈ ℝᵈ : f_{i₁}(a) = ··· = f_{i_l}(a) = 0}.

A class G of real-valued functions defined on ℝᵈ has solution set components bound B if for every k ∈ {1, ..., d} and every {f₁, ..., f_k} ⊆ G that has regular zero-set intersections, the number of connected components of the set {a ∈ ℝᵈ : f₁(a) = ··· = f_k(a) = 0} is at most B.

The proofs for the three subsequent lemmas can be found in the appendix.

Lemma 1. Let G be the class of polynomials of degree at most p in the variables y₁, ..., y_d and in the exponentials e^{g₁}, ..., e^{g_q}, where g₁, ..., g_q are fixed affine functions in y₁, ..., y_d. Then G has solution set components bound

B = 2^{q(q−1)/2} [p(p+1)d + p]^d [(p+1)d(d+1) + 1]^q.

Lemma 2. Let q be a natural number and suppose G is the class of real-valued functions in the variables y₁, ..., y_d satisfying the following conditions: For every f ∈ G there exist affine functions g₁, ..., g_r, where r ≤ q, in the variables y₁, ..., y_d such that f is an affine combination of y₁, ..., y_d and e^{g₁}, ..., e^{g_r}. Then G has solution set components bound

B = 2^{dq(dq−1)/2} [2(dq + d) + 1]^{dq+d} [2(dq + d)(dq + d + 1) + 1]^{dq}.
A class F of real-valued functions is said to be closed under addition of constants if for every c ∈ ℝ and f ∈ F the function z ↦ f(z) + c is a member of F. The following result gives a stronger formulation of a bound stated in theorem 7.6 of Anthony and Bartlett (1999):

Lemma 3. Let F be a class of real-valued functions (y₁, ..., y_d, x₁, ..., x_n) ↦ f(y₁, ..., y_d, x₁, ..., x_n) that is closed under addition of constants and where each function in F is Cᵈ in the variables y₁, ..., y_d. If the class G = {(y₁, ..., y_d) ↦ f(y₁, ..., y_d, s) : f ∈ F, s ∈ ℝⁿ} has solution set components bound B, then
Multiplicative Neural Networks
for any sets $\{f_1, \dots, f_k\} \subseteq F$ and $\{s_1, \dots, s_m\} \subseteq \mathbb{R}^n$, where $m \ge d/k$, the set $T \subseteq \{0,1\}^{mk}$ defined as
\[
T = \{(\mathrm{sgn}(f_1(a, s_1)), \dots, \mathrm{sgn}(f_1(a, s_m)), \dots, \mathrm{sgn}(f_k(a, s_1)), \dots, \mathrm{sgn}(f_k(a, s_m))) : a \in \mathbb{R}^d\}
\]
satisfies
\[
|T| \le B \sum_{i=0}^{d} \binom{mk}{i} \le B \left(\frac{emk}{d}\right)^{d}.
\]
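The final counting step of lemma 3 combines the components bound $B$ with the standard estimate $\sum_{i \le d} \binom{mk}{i} \le (emk/d)^d$. The following Python sketch is illustrative only (the helper name is my own); it spot-checks that estimate numerically:

```python
from math import comb, e

def dichotomy_count_bound(n: int, d: int) -> int:
    """Left-hand side of the counting step in lemma 3:
    the binomial sum C(n, 0) + C(n, 1) + ... + C(n, d)."""
    return sum(comb(n, i) for i in range(d + 1))

# Spot-check  sum_{i<=d} C(mk, i) <= (e*mk/d)^d  for several mk >= d.
for mk, d in [(10, 3), (50, 5), (200, 7)]:
    assert dichotomy_count_bound(mk, d) <= (e * mk / d) ** d
print("inequality holds on all spot checks")
```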
Now we have all that is required for the proof of theorem 1.

Proof of Theorem 1. Assume that $N$ is given as supposed, having $k$ hidden product units and $W$ parameters. Denote the weights of the hidden nodes by $w_{i,j}$, and let $v_0, v_1, \dots, v_k$ be the weights of the output node, where $v_0$ is the threshold. We first use an idea from Karpinski and Macintyre (1997) and divide for each $i \in \{1, \dots, k\}$ the functions computed by $N$ into three categories corresponding to the sign of the parameter $v_i$, that is, depending on whether $v_i < 0$, $v_i = 0$, or $v_i > 0$. This results in a total of $3^k$ categories for all $v_1, \dots, v_k$. We consider each category separately and introduce new variables $v'_1, \dots, v'_k$ for the weights of the output node by defining
\[
v'_i = \begin{cases} \ln v_i & \text{if } v_i > 0, \\ 0 & \text{if } v_i = 0, \\ \ln(-v_i) & \text{if } v_i < 0, \end{cases}
\]
for $i = 1, \dots, k$. Thus, within this category, we can write the computation of $N$ on some real input vector $s$ as a function of the network parameters in the form
\[
(v_0, v', w) \mapsto v_0 + b_1 e^{v'_1} s_{1,1}^{w_{1,1}} \cdots s_{1,p_1}^{w_{1,p_1}} + \cdots + b_k e^{v'_k} s_{k,1}^{w_{k,1}} \cdots s_{k,p_k}^{w_{k,p_k}},
\]
where the $s_{i,j}$ are components of the input vector and $b_i \in \{0, 1\}$ is defined by
\[
b_i = \begin{cases} 0 & \text{if } v_i = 0 \text{ or } s_{i,j} = 0 \text{ for some } j \in \{1, \dots, p_i\}, \\ 1 & \text{otherwise}, \end{cases}
\]
for $i = 1, \dots, k$. Note that if $s_{i,j} = 0$ and $w_{i,j} < 0$ for some $i, j$, we define the output of the affected product unit to be 0. Thus, we may assume that $s_{i,j} \ne 0$ for all $i, j$ without loss of generality. Recall further that, according to the condition on the parameter domain of product units stated in section 3.1, for $s_{i,j} < 0$ we have required $w_{i,j}$ to be an integer. In this case, however, the
Michael Schmitt
sign of $s_{i,j}$ can have an effect only when the weight is odd. And this effect on the product unit consists of only a possible change of the sign of its output value. Since the sign of $s_{i,j}$ is not known, we consider the worst case and assume that each input vector generates all possible signs at the outputs of the product units. Therefore, we can restrict the input vectors to the positive reals if we take note of the fact that each input vector gives rise to at most $2^k$ functions of the form
\[
(v_0, v', w) \mapsto v_0 \pm b_1 e^{v'_1} s_{1,1}^{w_{1,1}} \cdots s_{1,p_1}^{w_{1,p_1}} \pm \cdots \pm b_k e^{v'_k} s_{k,1}^{w_{k,1}} \cdots s_{k,p_k}^{w_{k,p_k}}.
\]
For positive input values, these functions can be rewritten as
\[
(v_0, v', w) \mapsto v_0 \pm b_1 \exp(v'_1 + w_{1,1} \ln s_{1,1} + \cdots + w_{1,p_1} \ln s_{1,p_1}) \pm \cdots \pm b_k \exp(v'_k + w_{k,1} \ln s_{k,1} + \cdots + w_{k,p_k} \ln s_{k,p_k}).
\]
To obtain a bound on the pseudo-dimension, we want to estimate the number of dichotomies that are induced on some arbitrary set $\{(s_1, u_1), \dots, (s_m, u_m)\}$, where $s_1, \dots, s_m$ are input vectors for $N$ and $u_1, \dots, u_m$ are real numbers, by functions of the form $(x, z) \mapsto \mathrm{sgn}(f(x) - z)$, where $f$ is computed by $N$. We do this for each of the categories defined above separately. Thus, within one such category, the number of dichotomies induced is at most as large as the cardinality of the set $T \subseteq \{0,1\}^{m2^k}$ satisfying
\[
T = \{(\mathrm{sgn}(f_1(a, s'_1, u_1)), \dots, \mathrm{sgn}(f_1(a, s'_m, u_m)), \dots, \mathrm{sgn}(f_{2^k}(a, s'_1, u_1)), \dots, \mathrm{sgn}(f_{2^k}(a, s'_m, u_m))) : a \in \mathbb{R}^W\}
\]
for real vectors $s'_1, \dots, s'_m$ and functions $f_1, \dots, f_{2^k}$, where each of these functions has the form
\[
(y, x, z) \mapsto c_0 + y_0 + c_1 \exp(y_1 + y_{1,1} x_{1,1} + \cdots + y_{1,p_1} x_{1,p_1}) + \cdots + c_k \exp(y_k + y_{k,1} x_{k,1} + \cdots + y_{k,p_k} x_{k,p_k}) - z
\]
for real numbers $c_0, c_1, \dots, c_k$ with $c_0 = 0$ and $c_i \in \{-1, 0, 1\}$ for $i = 1, \dots, k$. The variables $y_i$ and $y_{i,j}$ play the role of the network parameters, the $x_{i,j}$ are the input variables receiving values $s'_{i,j} = \ln s_{i,j}$, and $z$ is the input variable for $u_1, \dots, u_m$. Let $F$ denote the class of these functions arising for arbitrary real numbers $c_0, c_1, \dots, c_k$. We have introduced $c_0$ in particular to make this class $F$ closed under addition of constants. Now, for the vectors $s'_1, \dots, s'_m$ and the real numbers $u_1, \dots, u_m$, we consider the function class $G = \{y \mapsto f(y, s'_i, u_i) : f \in F, i = 1, \dots, m\}$. Clearly, every element of $G$ is an affine combination of $W$ variables and $k$ exponentials of affine functions in these variables. According to lemma 2, $G$ has solution set components bound
\[
B = 2^{Wk(Wk-1)/2}\, [2(Wk+W) + 1]^{Wk+W}\, [2(Wk+W)(Wk+W+1) + 1]^{Wk}. \qquad (4.1)
\]
Since $F$ is closed under addition of constants, we have from lemma 3 that $|T| \le B(em2^k/W)^W$, which is by the construction of $T$ an upper bound on the number of dichotomies that are induced on any set of $m$ vectors $\{(s_1, u_1), \dots, (s_m, u_m)\}$. Since this bound is derived for network parameters chosen within one category, we obtain an upper bound for all parameter values by multiplying with the number of categories, which is $3^k$. This yields the bound
\[
|T| \le B \left(\frac{em2^k}{W}\right)^{W} 3^k. \qquad (4.2)
\]
If there is a set of $m$ vectors that is shattered, then all $2^m$ dichotomies of this set must be induced. Since $|T|$ is an upper bound on this number, this implies
\[
m \le \log B + W \log(em2^k/W) + k \log 3.
\]
From the well-known inequality $\ln a \le ab + \ln(1/b) - 1$, which holds for all $a, b > 0$ (see, e.g., Anthony & Bartlett, 1999, appendix A.1.1), we obtain for $a = m$ and $b = (\ln 2)/(2W)$ that $W \log m \le m/2 + W \log(2W/(e \ln 2))$. Using this in the above inequality, we get
\[
m \le \log B + m/2 + W \log(2^{k+1}/\ln 2) + k \log 3,
\]
which is equivalent to
\[
m \le 2 \log B + 2Wk + 2W \log(2/\ln 2) + 2k \log 3.
\]
Substitution of the value for $B$ yields
\[
\begin{aligned}
m \le {} & Wk(Wk - 1) + 2(Wk + W) \log[2(Wk+W) + 1] \\
& + 2Wk \log[2(Wk+W)(Wk+W+1) + 1] \\
& + 2Wk + 2W \log(2/\ln 2) + 2k \log 3.
\end{aligned}
\]
Simplifying and rearranging, we obtain
\[
\begin{aligned}
m \le {} & (Wk)^2 + (2Wk + 2W) \log[2(Wk+W) + 1] \\
& + 2Wk \log[2(Wk+W)(Wk+W+1) + 1] \\
& + Wk + 2W \log(2/\ln 2) + 2k \log 3.
\end{aligned}
\]
The second and the third line together are less than or equal to
\[
2Wk(\log[13(Wk)^2] + 1/2 + \log(2/\ln 2) + \log 3),
\]
which is equal to $2Wk \log[78\sqrt{2}(Wk)^2/\ln 2]$ and hence less than $4Wk \log(13Wk)$. The last term in the first line is at most $4Wk \log(5Wk)$. Using these relations, we arrive at
\[
m < (Wk)^2 + 4Wk \log(5Wk) + 4Wk \log(13Wk)
\]
and hence at $m < (Wk)^2 + 8Wk \log(13Wk)$, which completes the proof of the theorem.

The constants in the bound of theorem 1 can be made slightly smaller, and one can get rid of the dependency on $k$ of the term in the logarithm. Karpinski and Macintyre (1997) show by means of a more direct application of the method of Warren (1968) that for obtaining a solution set components bound for the class of functions considered here, it is not necessary to introduce additional parameters as intermediate variables as we have done in lemma 2.^6 Instead, it is sufficient to employ the bound provided by lemma 1 for affine functions in $W$ variables and $Wk$ fixed exponentials. This yields the improved solution set components bound
\[
B = 2^{Wk(Wk-1)/2}\, [2W + 1]^{W}\, [2W(W+1) + 1]^{Wk}, \qquad (4.3)
\]
which can be used in the proof of theorem 1 in place of equation (4.1) to derive the following improved version of this theorem.

Corollary 1. Let $N$ be a neural network with one hidden layer of $k$ product units and $W$ parameters. Then $N$ has pseudo-dimension at most $(Wk)^2 + 6Wk \log(8W)$.

Proof. Using solution set components bound (4.3) in the proof of theorem 1, we get
\[
\begin{aligned}
m \le {} & (Wk)^2 + 2W \log(2W + 1) \\
& + 2Wk \log[2W(W+1) + 1] + Wk + 2W \log(2/\ln 2) + 2k \log 3.
\end{aligned}
\]
The last line is less than $4Wk \log(8W)$, and the last term of the first line is at most $2W \log(3W)$. This implies $m < (Wk)^2 + 6Wk \log(8W)$.

^6 Theorem 7.6 in Anthony and Bartlett (1999), which also builds on the method of Warren (1968), provides a more general method suitable for any function class that satisfies the conditions of lemma 3. In particular, it applies to networks with more than one hidden layer. For such networks, the direct method of Karpinski and Macintyre (1997) bears no advantage; the latter method also deals with these networks by introducing intermediate variables. We have chosen to employ the result of Anthony and Bartlett (1999) in the proof of lemma 3 because it is more general and makes the proof of theorem 1 shorter and easier to read.
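The improvement of corollary 1 over theorem 1 can be seen numerically. The following Python sketch is my own illustration (assuming base-2 logarithms, as in the dichotomy-counting argument above); it evaluates both pseudo-dimension bounds:

```python
from math import log2

def thm1_bound(W: int, k: int) -> float:
    """Theorem 1: pseudo-dimension bound (Wk)^2 + 8Wk log(13Wk)."""
    return (W * k) ** 2 + 8 * W * k * log2(13 * W * k)

def cor1_bound(W: int, k: int) -> float:
    """Corollary 1: improved bound (Wk)^2 + 6Wk log(8W) via (4.3)."""
    return (W * k) ** 2 + 6 * W * k * log2(8 * W)

# The corollary removes k from the logarithm, so it is never worse here.
for W, k in [(5, 2), (10, 3), (100, 10)]:
    assert cor1_bound(W, k) <= thm1_bound(W, k)
```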
One of the constants can be further improved for networks that operate in the nonnegative domain only. The improvement is marginal, but the result demonstrates how the input restriction affects the calculation of the bound.

Corollary 2. Let $N$ be a neural network with one hidden layer of $k$ product units and $W$ parameters. If the inputs for $N$ are restricted to nonnegative real numbers, then $N$ has pseudo-dimension at most $(Wk)^2 + 6Wk \log(6W)$.

Proof. If all inputs are positive, then no sign changes have to be taken into account for the output values of the product units. Therefore, we can use in the proof of theorem 1 that each input vector gives rise to 1 instead of $2^k$ functions. This gives the new bound $|T| \le B(em/W)^W 3^k$ for inequality (4.2). Using solution set components bound (4.3) yields
\[
\begin{aligned}
m \le {} & (Wk)^2 + 2W \log(2W + 1) \\
& + 2Wk \log[2W(W+1) + 1] - Wk + 2W \log(2/\ln 2) + 2k \log 3.
\end{aligned}
\]
The last line is less than $4Wk \log(6W)$, implying $m < (Wk)^2 + 6Wk \log(6W)$.

The networks considered thus far in this section have in common a rigid architecture with a prescribed number of hidden nodes. In some learning applications, however, it is customary not to fix the architecture in advance but to let the networks grow. In this case, a variety of networks can result from the learning algorithm. It might be possible to accommodate all these networks in a single large network so that a bound for the VC dimension of the class of networks is obtained in terms of a bound for the large network. Often, however, better bounds can be derived if one takes into account the constraint that underlies the growth of the network. In the following, we assume that this growth is limited by a bound on the fan-out of the input nodes. Such networks with sparse connectivity have been suggested, for instance, by Lee et al. (1986) and Hancock et al. (1994).

Corollary 3. Let $C$ be the class of networks with one hidden layer of product units and $n$ input nodes where every input node has fan-out at most $l$. Then $C$ has pseudo-dimension at most $4(nl)^4 + 18(nl)^2 \log(23nl)$.

Proof. The number of networks in $C$ is not larger than $(nl)^{nl}$, since there are at most this many ways to assign $nl$ connections to $nl$ hidden units. Each network has at most $2nl + 1$ parameters, where $nl$ are for connections leading to hidden nodes and $nl + 1$ are for the output node. Thus, with $r = nl$, the number of dichotomies induced by $C$, or more precisely, by functions of the form $(x, z) \mapsto \mathrm{sgn}(f(x) - z)$ where $f$ is computed by $C$, on a set of cardinality $m$ is at most $r^r$ times the number of dichotomies induced by a network with $r$ hidden nodes and $2r + 1$ parameters. Using solution set
components bound (4.3), we get from the proof of corollary 1 with $k = r$ and $W = 2r + 1$,
\[
\begin{aligned}
m \le {} & (r(2r+1))^2 + 2(2r+1) \log(2(2r+1) + 1) + 2r(2r+1) \log[2(2r+1)(2r+2) + 1] \\
& + r(2r+1) + 2(2r+1) \log(2/\ln 2) + 2r \log 3 + r \log r,
\end{aligned}
\]
where the last term is due to the factor $r^r$. From this we obtain
\[
m < 4r^4 + 6r \log(7r) + 12r^2 \log(23r),
\]
and hence $m < 4r^4 + 18r^2 \log(23r)$. Resubstituting $r = nl$ yields the claimed result.

4.2 Networks with Product and Sigmoidal Units. We now consider feedforward architectures with an arbitrary number of layers. The networks are nonhomogeneous in that each node may be a product or a sigmoidal unit independent of the other nodes. Pseudo-dimension bounds for pure product unit networks with the output node being a summing unit are known from the previous section. We noted in section 2.2.2 that a network of product units only is equivalent to a single product unit. A bound for the single product unit will be given in section 4.5. Also, bounds for networks consisting solely of sigmoidal units have been established by Karpinski and Macintyre (1997) and Anthony and Bartlett (1999). In the following, we calculate bounds for networks containing both unit types.

Theorem 2. Suppose $N$ is a feedforward neural network with $k$ computation nodes and $W$ parameters, where each computation node is a sigmoidal or a product unit. Then the pseudo-dimension of $N$ is at most $4(Wk)^2 + 20Wk \log(36Wk)$.

Before proving this, we determine a solution set components bound for classes of functions arising from feedforward networks. The following result, considering arbitrarily many layers, corresponds to lemma 2, which was for networks with one hidden layer only. The proof is given in the appendix.

Lemma 4. Let $G$ be the class of real-valued functions in $d$ variables computed by a network with $r$ computation nodes where $q$ of these nodes compute one of the functions $a \mapsto c + 1/(1 + e^{-a})$, $a \mapsto c \pm e^a$, or $a \mapsto \ln a$ for some arbitrary constant $c$, and $r - q$ nodes compute some polynomial of degree 2. Then $G$ has solution set components bound
\[
B = 2^{dq(dq-1)/2}\, [6(2dr+d) + 2]^{2dr+d}\, [3(2dr+d)(2dr+d+1) + 1]^{dq}.
\]
Besides the above solution set components bound, the following proof uses lemma 3 from the appendix.
Proof of Theorem 2. We show first that we can confine the argumentation to positive inputs. For some input vector $s$, consider all functions computed by $N$ that arise when $s$ is fed into the network and the signs of the parameters are varied in all possible ways. Treating input values 0 to product units as in the proof of theorem 1 and taking into account the at most $2^W$ functions thus generated by each input vector, we may henceforth assume without loss of generality that all input values to product units are positive real numbers. (A number of $2^k$ functions is in general not sufficient, as it was in the proof of theorem 1, since changing the sign of the output of some input unit, for instance, can modify the sign of an input to a sigmoidal unit. This cannot be compensated for by changing the sign of the sigmoidal unit, but by changing the sign of a weight.)

Thus, the number of dichotomies that are induced by functions of the form $(x, z) \mapsto \mathrm{sgn}(f(x) - z)$ with $f$ being computed by $N$ on some arbitrary set $\{(s_1, u_1), \dots, (s_m, u_m)\}$ with input vectors $s_1, \dots, s_m$ and real numbers $u_1, \dots, u_m$ is at most as large as the cardinality of the set $T \subseteq \{0,1\}^{m2^W}$ defined by
\[
T = \{(\mathrm{sgn}(f_1(a, s'_1, u_1)), \dots, \mathrm{sgn}(f_1(a, s'_m, u_m)), \dots, \mathrm{sgn}(f_{2^W}(a, s'_1, u_1)), \dots, \mathrm{sgn}(f_{2^W}(a, s'_m, u_m))) : a \in \mathbb{R}^W\},
\]
where $s'_1, \dots, s'_m$ are positive real vectors and $f_1, \dots, f_{2^W}$ are the functions arising from $N$ after making the sign variations described above. We allow that any arbitrary constant may be added to these functions and use $F$ to denote this class of functions $(w, x, z) \mapsto f(w, x, z)$ in the network parameters $w$ and the input variables $x$ and $z$. Clearly, then, $F$ is closed under addition of constants. Consider the class $G = \{w \mapsto f(w, s'_i, u_i) : f \in F, i = 1, \dots, m\}$ of functions in $W$ variables. Every function in $G$ can be computed by a network where each computation node computes one of the functions $a \mapsto c + 1/(1 + e^{-a})$, $a \mapsto c \pm e^a$, $a \mapsto \ln a$, or some polynomial of degree 2. Here, $c$ is some arbitrary constant that is required for the output nodes due to the closedness of $F$ under addition of constants and also accommodates the subtraction of some $u_i$ from the output. The "$\pm$" comes from the sign variation of the output of the product unit. The networks result as follows: Each product unit receives only positive inputs and can thus be written as
\[
c_i \pm \exp(w_{i,1} \ln v_{i,1} + \cdots + w_{i,p_i} \ln v_{i,p_i}),
\]
and each sigmoidal unit has the form
\[
c_i + 1/(1 + \exp(-w_{i,1} v_{i,1} - \cdots - w_{i,p_i} v_{i,p_i} + t_i)),
\]
where the $v_{i,j}$ are output values of other nodes. Therefore, each product unit can be decomposed into one function $a \mapsto c \pm e^a$, one polynomial of degree 2,
and functions $a \mapsto \ln a$ applied to the outputs of other nodes. Further, each sigmoidal unit can be decomposed into one function $a \mapsto c + 1/(1 + e^{-a})$ and one polynomial of degree 2. This leads to a network with at most $k - 1$ nodes computing $a \mapsto \ln a$ (the logarithm of the output node is not needed), at most $k$ nodes computing a polynomial of degree 2, and at most $k$ nodes each of which computes either $a \mapsto c \pm e^a$ or $a \mapsto c + 1/(1 + e^{-a})$. In total, every function in $G$ can be computed by a network with $3k - 1$ computation nodes of which $2k - 1$ nodes compute one of the functions $a \mapsto c + 1/(1 + e^{-a})$, $a \mapsto c \pm e^a$, or $a \mapsto \ln a$. Lemma 4 with $d = W$, $r = 3k - 1$, and $q = 2k - 1$ shows that $G$ has solution set components bound
\[
\begin{aligned}
B = {} & 2^{W(2k-1)(W(2k-1)-1)/2}\, [6(2W(3k-1) + W) + 2]^{2W(3k-1)+W} \\
& \cdot [3(2W(3k-1) + W)(2W(3k-1) + W + 1) + 1]^{W(2k-1)}.
\end{aligned}
\]
Since $F$ is closed under addition of constants, it follows from lemma 3 that $B(em2^W/W)^W$ is an upper bound on the cardinality of $T$ and, thus, on the number of dichotomies induced on any set of cardinality $m$. If such a set is shattered, this implies
\[
m \le \log B + W \log(em2^W/W).
\]
Using $W \log m \le m/2 + W \log(2W/(e \ln 2))$ (see the proof of theorem 1), we obtain
\[
m \le 2 \log B + 2W^2 + 2W \log(2/\ln 2),
\]
and after substituting the value for $B$,
\[
\begin{aligned}
m \le {} & W(2k-1)(W(2k-1) - 1) + (4W(3k-1) + 2W) \log[6(2W(3k-1) + W) + 2] \\
& + 2W(2k-1) \log[3(2W(3k-1) + W)(2W(3k-1) + W + 1) + 1] \\
& + 2W^2 + 2W \log(2/\ln 2).
\end{aligned}
\]
Simplifying and rearranging leads to
\[
\begin{aligned}
m \le {} & 4(Wk)^2 - 4W^2k + 3W^2 + (12Wk - 2W) \log[6(6Wk - W) + 2] \\
& + 2W(2k-1) \log[3(6Wk - W)(6Wk - W + 1) + 1] \\
& - W(2k-1) + 2W \log(2/\ln 2).
\end{aligned}
\]
The last two lines together are less than $4Wk(\log[109(Wk)^2] + \log(2/\ln 2))$, which equals $4Wk \log[218(Wk)^2/\ln 2]$ and is less than $8Wk \log(18Wk)$. The last three terms of the first line together are less than $12Wk \log(36Wk)$. Thus, we may conclude that $m < 4(Wk)^2 + 20Wk \log(36Wk)$.
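As with theorem 1, the bound of theorem 2 is easy to tabulate. The sketch below is my own illustration (function names are not from the paper); it confirms numerically that the mixed sigmoidal/product bound dominates the pure product-unit bound of corollary 1:

```python
from math import log2

def thm2_bound(W: int, k: int) -> float:
    """Theorem 2: pseudo-dimension bound for networks mixing sigmoidal
    and product units: 4(Wk)^2 + 20Wk log(36Wk)."""
    return 4 * (W * k) ** 2 + 20 * W * k * log2(36 * W * k)

def cor1_bound(W: int, k: int) -> float:
    """Corollary 1: bound for the pure product-unit case (one hidden layer)."""
    return (W * k) ** 2 + 6 * W * k * log2(8 * W)

# Allowing sigmoidal units costs roughly a factor of 4 in the leading term.
for W, k in [(5, 2), (10, 3), (100, 10)]:
    assert thm2_bound(W, k) >= cor1_bound(W, k)
```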
The networks considered in theorem 2 have fixed units in the sense that it is determined in advance for a computation node whether it is to be a product or a sigmoidal unit. One can imagine situations in which the learning algorithm chooses the unit type for each node. Then the function class is no longer represented by a single network, but by a class of networks. An inspection of the proof of theorem 2 in this regard shows that its argumentation does not depend on knowing which unit type is given. Hence, the same bound holds if the unit type of each computation node is variable.

Corollary 4. Let $C$ be a class of feedforward neural networks where each network has $k$ computation nodes and $W$ parameters, and where each computation node is a product unit or a sigmoidal unit. Then $C$ has pseudo-dimension at most $4(Wk)^2 + 20Wk \log(36Wk)$.

4.3 Higher-Order Sigmoidal Networks. A neural network consisting of higher-order sigmoidal units can be considered as a network of product and sigmoidal units where the weights of the product units are restricted to the nonnegative integers. Thus, an upper bound on the VC dimension or pseudo-dimension for a higher-order network is obtained by means of the same network with the monomials replaced by product units and the exponents considered as variables. We distinguish between two ways of viewing higher-order sigmoidal networks: the exponents of the monomials can be parameters of the network, or they can be fixed. We consider the latter case first, that is, when the exponents are not variable. We emphasize that we do not explicitly count the number of monomials in a higher-order network. For both cases, we obtain bounds that do not impose any restriction on the order of the monomials.

Theorem 3. Suppose $N$ is a network with $k$ computation nodes and $W$ parameters where each computation node is a higher-order sigmoidal unit with fixed exponents. Then $N$ has pseudo-dimension at most $36(Wk)^4 + 136(Wk)^2 \log(12Wk)$.

Proof.
The parameters of a sigmoidal higher-order network are the weights and thresholds of the sigmoidal units only. Further, a sigmoidal higher-order network has the following properties: First, every input node and every nonoutput node that is a sigmoidal unit feeds its output only to monomials. Second, every sigmoidal unit receives its input only from monomials through parameterized connections. Thus, after replacing the monomials by product units, we can guide the proof analogously to that of theorem 2 with the following ingredients: Since sign variations of input vectors affect product units only and there are at most $W$ product units, it suffices to consider as upper bound on the number of dichotomies that $N$ induces via functions of the form $(x, z) \mapsto \mathrm{sgn}(f(x) - z)$ on a set of cardinality $m$ the cardinality of a set $T$ with $T \subseteq \{0,1\}^{m2^W}$ defined as in the proof of theorem 2. The networks that give rise to the function class $G$ have at most:

- $k$ nodes computing the function $a \mapsto c + 1/(1 + e^{-a})$ (due to the sigmoidal units)
- $W$ nodes computing the function $a \mapsto c \pm e^a$ (due to the product units)
- $k$ nodes computing the function $a \mapsto \ln a$ (only logarithms of sigmoidal units are needed)
- $W + k$ nodes computing a polynomial of degree 2 (due to the product and sigmoidal units)

This yields networks having at most $2W + 3k \le 5Wk$ nodes of which up to $W + 2k \le 3Wk$ compute a nonpolynomial function. Since each product unit, receiving inputs from sigmoidal units only, has at most $k$ exponents, the functions in $G$ have in total at most $W + Wk \le 2Wk$ variables. Thus, with these assumptions, we get from lemma 4, using $d = 2Wk$, $r = 5Wk$, and $q = 3Wk$, the solution set components bound
\[
\begin{aligned}
B = {} & 2^{6(Wk)^2(6(Wk)^2-1)/2}\, [6(20(Wk)^2 + 2Wk) + 2]^{20(Wk)^2+2Wk} \\
& \cdot [3(20(Wk)^2 + 2Wk)(20(Wk)^2 + 2Wk + 1) + 1]^{6(Wk)^2}.
\end{aligned}
\]
As in the proof of theorem 2, we infer from lemma 3, using $T \subseteq \{0,1\}^{m2^W}$, that $B(em2^W/(2Wk))^{2Wk}$ is an upper bound on the number of dichotomies induced by $N$ on any set of cardinality $m$. Thus, if such a set is shattered, we have
\[
m \le \log B + 2Wk \log(em2^W/(2Wk)),
\]
from which we get $m \le 2 \log B + 4W^2k + 4Wk \log(2/\ln 2)$ using $2Wk \log m \le m/2 + 2Wk \log(4Wk/(e \ln 2))$ (see the proof of theorem 1). With the above solution set components bound, this implies
\[
\begin{aligned}
m \le {} & 6(Wk)^2(6(Wk)^2 - 1) + (40(Wk)^2 + 4Wk) \log[6(20(Wk)^2 + 2Wk) + 2] \\
& + 12(Wk)^2 \log[3(20(Wk)^2 + 2Wk)(20(Wk)^2 + 2Wk + 1) + 1] \\
& + 4W^2k + 4Wk \log(2/\ln 2).
\end{aligned}
\]
From this, we obtain
\[
\begin{aligned}
m \le {} & 36(Wk)^4 + (40(Wk)^2 + 4Wk) \log[6(20(Wk)^2 + 2Wk) + 2] \\
& + 12(Wk)^2 \log[3(20(Wk)^2 + 2Wk)(20(Wk)^2 + 2Wk + 1) + 1] \\
& - 6(Wk)^2 + 4W^2k + 4Wk \log(2/\ln 2).
\end{aligned}
\]
The last two lines together are less than
\[
12(Wk)^2(\log[1519(Wk)^4] + \log(2/\ln 2)),
\]
which is equal to $12(Wk)^2 \log[3038(Wk)^4/\ln 2]$ and less than $48(Wk)^2 \log(9Wk)$. Since the second term of the first line is less than $88(Wk)^2 \log(12Wk)$, we get $m < 36(Wk)^4 + 136(Wk)^2 \log(12Wk)$ as claimed.

The bound we derive next concerns higher-order sigmoidal networks with variable exponents. As is to be expected, the bound is smaller, since more parameters are counted.

Theorem 4. Suppose $N$ is a higher-order sigmoidal network with $k$ computation nodes and $W$ parameters that include the exponents of the monomials. Then the pseudo-dimension of $N$ is at most $9(W^2k)^2 + 34W^2k \log(68W^2k)$.

Proof. In comparison to the case with fixed exponents in theorem 3, the difference here is that the exponents of the monomials do not increase the number of parameters, since they are already counted in $W$. Thus, lemma 4, with $d = W$, $r = 5Wk$, and $q = 3Wk$, provides solution set components bound
\[
B = 2^{3W^2k(3W^2k-1)/2}\, [6(10W^2k + W) + 2]^{10W^2k+W}\, [3(10W^2k + W)(10W^2k + W + 1) + 1]^{3W^2k},
\]
and lemma 3 yields $B(em2^W/W)^W$ as upper bound on the number of dichotomies induced on a set of cardinality $m$. Assuming that such a set is shattered, we obtain $m \le 2 \log B + 2W^2 + 2W \log(2/\ln 2)$ similarly as in the proof of theorem 2. After substituting the above value for $B$, we get
\[
\begin{aligned}
m \le {} & 3W^2k(3W^2k - 1) + (20W^2k + 2W) \log[6(10W^2k + W) + 2] \\
& + 6W^2k \log[3(10W^2k + W)(10W^2k + W + 1) + 1] \\
& + 2W^2 + 2W \log(2/\ln 2),
\end{aligned}
\]
which implies
\[
\begin{aligned}
m \le {} & 9(W^2k)^2 + (20W^2k + 2W) \log[6(10W^2k + W) + 2] \\
& + 6W^2k \log[3(10W^2k + W)(10W^2k + W + 1) + 1] \\
& - 3W^2k + 2W^2 + 2W \log(2/\ln 2).
\end{aligned}
\]
The last two lines together are less than
\[
6W^2k(\log[397W^4k^2] + \log(2/\ln 2)),
\]
which is equal to $6W^2k \log[794W^4k^2/\ln 2]$ and less than $12W^2k \log(34W^2k)$. The second term of the first line is less than $22W^2k \log(68W^2k)$. Thus, we have $m < 9(W^2k)^2 + 34W^2k \log(68W^2k)$.

4.4 Single Higher-Order Units. Since a single unit can be viewed as a small network, bounds on the VC dimension and pseudo-dimension for product units and higher-order units can be obtained from previous sections. For particular cases, however, we shall establish significant improvements in the following. We look at single higher-order units and classes of these units first; then we consider single product units and classes of monomials.

We recall from section 2.2.2 the definition of a higher-order unit, which has the form
\[
w_1 M_1 + w_2 M_2 + \cdots + w_k M_k - t,
\]
where $M_1, \dots, M_k$ are monomials. We also remember that the set $\{M_1, \dots, M_k\}$ is called the structure of the higher-order unit. We can view such a unit as a network with one hidden layer and a linear unit as output node. If a threshold or sigmoidal unit is employed for the output node, we have the higher-order variants of the threshold and sigmoidal unit, respectively, which were also defined in section 2.2.2. According to proposition 1, when studying the VC dimension it makes no difference which summing unit is employed for the output node. Thus, we may focus on linear output units without loss of generality.

If the structure of a higher-order unit is fixed, its only parameters are the weights and the threshold of the output node. Thus, with fixed structure, the VC dimension cannot be larger than the VC dimension of a summing unit with the same number of parameters. This fact is employed in the following result.

Lemma 5. Let $N$ be a higher-order unit with fixed structure that consists of $k$ monomials. Then the number of dichotomies that $N$ induces on a set of cardinality
$m$ is at most
\[
2 \sum_{i=0}^{k} \binom{m-1}{i} < 2 \left(\frac{e(m-1)}{k}\right)^{k},
\]
for $m > k \ge 1$, and the VC dimension of $N$ is at most $k + 1$.
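The idea behind lemma 5 is that a fixed-structure higher-order unit is just a summing unit applied to the feature map $s \mapsto (M_1(s), \dots, M_k(s))$. A minimal Python sketch of this view (the structure, weights, and threshold below are hypothetical examples, not from the text):

```python
# Hypothetical structure {M1, M2, M3} over two variables:
# M1 = x1^2, M2 = x1*x2, M3 = x2^3.
def monomial(exponents):
    def M(s):
        value = 1.0
        for x, exp in zip(s, exponents):
            value *= x ** exp
        return value
    return M

structure = [monomial((2, 0)), monomial((1, 1)), monomial((0, 3))]

def higher_order_unit(w, t, s):
    """w1*M1(s) + ... + wk*Mk(s) - t: once the feature vector
    (M1(s), ..., Mk(s)) is computed, this is an affine function of (w, t),
    so it induces no more dichotomies than a summing unit with k inputs."""
    features = [M(s) for M in structure]  # the map s -> (M1(s), ..., Mk(s))
    return sum(wi * Mi for wi, Mi in zip(w, features)) - t

print(higher_order_unit([1.0, -2.0, 0.5], 1.0, (2.0, 3.0)))  # → 4.5
```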
Proof. Let $\{M_1, \dots, M_k\}$ be the structure of the higher-order unit. Assume further that $S$ is a set of $m$ input vectors and consider the set of vectors $S' = \{(M_1(s), \dots, M_k(s)) : s \in S\}$. Obviously, every dichotomy induced by a summing unit on $S'$ corresponds to at least one dichotomy induced by $N$ on $S$. Hence, the number of dichotomies that $N$ induces on $S$ cannot be larger than the number of dichotomies that a summing unit with $k$ input variables induces on $S'$. The latter quantity is known to be not larger than $2\sum_{i=0}^{k}\binom{m-1}{i}$, an expression that is less than $2(e(m-1)/k)^k$ (see Anthony & Bartlett, 1999, theorems 3.1 and 3.7, respectively). Thus, we have the claimed upper bound on the number of dichotomies. The VC dimension of a summing unit with $k$ input variables is known to be $k + 1$ (Anthony & Bartlett, 1999, section 3.3).

We apply this result in the next two theorems, where we consider higher-order units with variable structure. The variability is given in terms of a class of units that underlie a certain connectivity constraint. In the first class, the fan-in of the output node, or the number of monomials, is limited. In the second class, we have a bound on the fan-out of the input nodes, or the number of monomials in which each variable may occur. Both classes can be considered as multivariate generalizations of a function class studied by Karpinski and Werther (1993). They define a polynomial to be $t$-sparse if it has at most $t$ nonzero coefficients. Thus, a $t$-sparse univariate polynomial has at most $t$ monomials, and each variable occurs in at most $t$ of them. Karpinski and Werther (1993) show that the class of $t$-sparse univariate polynomials has VC dimension $\Theta(t)$.

Theorem 5. Let $C$ be the class of higher-order units with $n$ input variables and at most $k$ monomials where each variable has an exponent of value at most $d$. Then $C$ has VC dimension at most
\[
\min\{(k(nk+k+1))^2 + 6k(nk+k+1) \log(8(nk+k+1)),\; 2nk \log(9d)\}.
\]

Proof. Obviously, a network with one hidden layer of product units and a linear unit as output node comprises the set of functions computed by the
units in $C$. Thus, the first bound is obtained from corollary 1, considering the exponents of the variables as parameters and using the fact that a product unit can compute any monomial. Since in a higher-order unit with at most $k$ monomials there are no more than $nk$ occurrences of variables, this leads to a total number of $nk + k + 1$ parameters.

The second bound is established as follows: Each occurrence of a variable can have an exponent from $\{0, 1, \dots, d\}$ (where we consider a monomial to be 0 if all its variables have exponent 0). Therefore, $(d+1)^{nk}$ is an upper bound on the number of structures in $C$. From this bound and lemma 5, we infer that the number of dichotomies induced by $C$ on a set $S$ of $m$ input vectors is at most
\[
(d+1)^{nk} \cdot 2 \left(\frac{e(m-1)}{k}\right)^{k}.
\]
If $S$ is shattered, its cardinality satisfies
\[
m \le nk \log(d+1) + k \log(e(m-1)/k) + 1.
\]
Now we use that $\ln a \le ab + \ln(1/b) - 1$ for $a, b > 0$ (see, e.g., Anthony & Bartlett, 1999, appendix A.1.1). Assuming $m > 1$, we may substitute $a = m - 1$ and $b = (\ln 2)/(2k)$ to obtain $k \log(m-1) \le (m-1)/2 + k \log(2k/(e \ln 2))$. From this we have
\[
m \le 2nk \log(d+1) + 2k \log(2/\ln 2) + 1.
\]
The right-hand side is at most $2nk(\log(d+1) + \log(2/\ln 2) + 1/2)$, which is equal to $2nk \log(2\sqrt{2}(d+1)/\ln 2)$, and this is less than $2nk \log(9d)$.

Theorem 6. Let $C$ be the class of higher-order units with $n$ input variables where each variable occurs in at most $l$ monomials and has an exponent of value at most $d$. Then $C$ has VC dimension at most
\[
\min\{4(nl)^4 + 18(nl)^2 \log(23nl),\; 2n^2l \log(9d),\; 2nl \log(5dnl)\}.
\]

Proof. The first bound is due to corollary 3 by considering a higher-order unit as a network with one hidden layer of product units and a linear unit as output node. The second bound is obtained from theorem 5 using the fact that every unit in $C$ has at most $nl$ monomials.

We derive the third bound as follows: Since there are at most $nl$ connections between the input nodes and the monomials, there are at most $(nl)^{nl}$ possibilities of connecting the input nodes with the monomials. Further, there are at most $d^{nl}$ ways to assign exponents from $\{1, \dots, d\}$ to the
occurrences of the variables. Thus, there are at most $(dnl)^{nl}$ different structures in $C$. This bound together with lemma 5 implies that the number of dichotomies induced by $C$ on a set $S$ of cardinality $m$ is not larger than
\[
(dnl)^{nl} \cdot 2 \left(\frac{e(m-1)}{nl}\right)^{nl}.
\]
This expression is equal to $2(ed(m-1))^{nl}$. If $S$ is shattered by $C$, it follows that
\[
m \le nl \log(ed(m-1)) + 1.
\]
Similarly as in the previous proof, assuming $m > 1$, we can make use of $nl \log(m-1) \le (m-1)/2 + nl \log(2nl/(e \ln 2))$ to obtain
\[
m \le 2nl \log(2dnl/\ln 2) + 1.
\]
The right-hand side is not larger than $2nl(\log(2dnl/\ln 2) + 1/2)$, which equals $2nl \log(2\sqrt{2}dnl/\ln 2)$. Since this is less than $2nl \log(5dnl)$, the third bound follows.

4.5 Single Product Units and Monomials. Next we look at a single product unit. Since it can be viewed as a (trivial) network with one hidden unit, an upper bound on its VC dimension is immediately obtained from corollary 1 in the form of $n^2 + 6n \log(8n)$, where $n$ is the number of variables. The following statement shows that the exact values for its VC dimension and pseudo-dimension are considerably smaller. This result also contains the first lower bound of this article.

Theorem 7. The VC dimension and the pseudo-dimension of a product unit with $n$ input variables are both equal to $n$.

Proof. That $n$ is a lower bound easily follows from the fact that a monomial with $n$ variables shatters the set of unit vectors from $\{0,1\}^n$, that is, the set of vectors with a 1 in exactly one position. We show now that $n$ is an upper bound on the pseudo-dimension, so that the theorem follows. The idea is to derive the upper bound by means of the pseudo-dimension of a linear unit. Let $(w, x) \mapsto f(w, x)$ be the function computed by a product unit, that is,
\[
f(w, x) = x_1^{w_1} x_2^{w_2} \cdots x_n^{w_n},
\]
where $w_1, \dots, w_n$ are parameters and $x_1, \dots, x_n$ are input variables. Consider some arbitrary set $S = \{(s_1, u_1), \dots, (s_m, u_m)\}$ where $s_i \in \mathbb{R}^n$ and $u_i \in \mathbb{R}$ for $i = 1, \dots, m$. According to the definition, the pseudo-dimension of a
278
Michael Schmitt
product unit is the cardinality of the largest such S that is shattered by functions of the form

(w, x, y) ↦ sgn(f(w, x) − y)
(4.4)
with parameters w_1, …, w_n and input variables x_1, …, x_n, y. We use the same idea as in the proof of theorem 1 to get rid of negative input values. According to the assumptions on the parameters (see section 3.1), if some input value is negative, its weight may take on integer values only. The sole effect of changing the sign of an input value, therefore, is possibly to change the sign of the output of the product unit. Hence, if we consider the set S′ = {(s′_1, u_1), …, (s′_m, u_m)}, where s′_i arises from s_i by taking absolute values in all components, then the number of dichotomies induced on S is less than or equal to the cardinality of the set T ⊆ {0, 1}^{2m} defined by

T = {(sgn(f(a, s′_1) − u_1), …, sgn(f(a, s′_m) − u_m), sgn(−f(a, s′_1) − u_1), …, sgn(−f(a, s′_m) − u_m)) : a ∈ R^n}.

Since we are interested in an upper bound for |T|, we may assume without loss of generality that no s′_i has some component equal to 0, because in that case, the value sgn(f(a, s′_i) − u_i) is the same for all a ∈ R^n. Thus, for inputs from S′, the function f can be written as f(w, x) = exp(w_1 ln x_1 + ··· + w_n ln x_n). Since f(a, s′_i) > 0 for every a ∈ R^n and i = 1, …, m, we may suppose that each of u_1, …, u_m is different from 0. Now, depending on whether u_i > 0 or u_i < 0, exactly one of sgn(f(a, s′_i) − u_i), sgn(−f(a, s′_i) − u_i) changes when a varies, while the other one remains constant. Hence, by defining

b_i = 1 if u_i > 0, −1 if u_i < 0,

we select the varying components and obtain with

T′ = {(sgn(b_1 f(a, s′_1) − u_1), …, sgn(b_m f(a, s′_m) − u_m)) : a ∈ R^n}

a set T′ ⊆ {0, 1}^m that has the same cardinality as T. Consider the function (w, x) ↦ g(w, x) defined for positive input vectors x as

g(w, x) = w_1 ln x_1 + ··· + w_n ln x_n,
that is, g = ln ∘ f. If u_i > 0, then

sgn(b_i f(a, s′_i) − u_i) = sgn(f(a, s′_i) − b_i u_i)
= sgn(ln(f(a, s′_i)) − ln(b_i u_i))
= sgn(g(a, s′_i) − ln(b_i u_i)),

and if u_i < 0, then

sgn(b_i f(a, s′_i) − u_i) = sgn(−f(a, s′_i) + b_i u_i)
= sgn(−ln(f(a, s′_i)) + ln(b_i u_i))
= sgn(−g(a, s′_i) + ln(b_i u_i)).

This implies that sgn(b_i f(a, s′_i) − u_i) = sgn(b_i g(a, s′_i) − b_i ln(b_i u_i)) for every a ∈ R^n and i = 1, …, m. From this, we have that |T′| is not larger than the number of dichotomies induced on the set

S″ = {(b_1 ln(s′_1), b_1 ln(b_1 u_1)), …, (b_m ln(s′_m), b_m ln(b_m u_m))}

(where logarithms of vectors are taken component-wise) by functions of the form

(w, x, y) ↦ sgn(w_1 x_1 + ··· + w_n x_n − y).
(4.5)
To conclude the proof, assume that some set of cardinality n+1 is shattered by functions of the form (4.4). Then from the reasoning above, we may infer that some set of cardinality n+1 is shattered by functions of the form (4.5). This, however, contradicts the fact that the pseudo-dimension of a linear unit with n parameters is at most n (see, e.g., Anthony & Bartlett, 1999, theorem 11.6).

Since a single product unit can compute any monomial, the previous result implies that the VC dimension and the pseudo-dimension of the class of monomials do not grow with the degree of the monomials but are identical with the number of variables.

Corollary 5. The VC dimension and the pseudo-dimension of the class of monomials with n input variables are both equal to n.
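The reduction used in the proof (on positive inputs, a product unit is just a linear unit applied to component-wise logarithms) can be checked numerically; a minimal sketch with illustrative names:

```python
import math

def product_unit(w, x):
    # f(w, x) = x_1^{w_1} * x_2^{w_2} * ... * x_n^{w_n}
    out = 1.0
    for wi, xi in zip(w, x):
        out *= xi ** wi
    return out

def via_log_space(w, x):
    # g(w, x) = w_1*ln(x_1) + ... + w_n*ln(x_n), and f = exp(g)
    return math.exp(sum(wi * math.log(xi) for wi, xi in zip(w, x)))

w = [0.5, -1.3, 2.0]
x = [2.0, 0.7, 1.9]          # positive inputs, as assumed in the proof
assert abs(product_unit(w, x) - via_log_space(w, x)) < 1e-9
# Hence sgn(f(w, x) - y) = sgn(g(w, ln x) - ln y) for y > 0: the product
# unit induces no more dichotomies than a linear unit with n parameters.
```

This identity is exactly why the pseudo-dimension cannot exceed that of a linear unit, namely n.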
5 Lower Bounds

The results presented thus far exclusively dealt with upper bounds, except for the single product unit and the class of monomials for which a lower bound has been given in section 4.5. In the following, we establish some further lower bounds for the VC dimension, which are then, by definition, also lower bounds for the pseudo-dimension. The results in section 5.1 concern classes of higher-order units and show that some upper bounds given in section 4.4 cannot be improved in a certain sense. The main result of section 5.2 is a superlinear lower bound for networks that consist of product and linear units and have constant depth.

5.1 Higher-Order Units. Two types of restrictions have been considered for classes of higher-order units in section 4.4. First, a bound k was imposed on the number of monomials or, equivalently, on the fan-in of the output node. Second, the number of occurrences of each variable or, equivalently, the fan-out of the input nodes was limited by some bound l. We give a lower bound for the latter class first. The following result provides the essential means.

Theorem 8. Let m, r ≥ 1 be natural numbers. Suppose C is the class of higher-order units with m + 2^r variables, where each variable occurs in at most one monomial and, if so, with exponent 1. Then there is a set of cardinality m · r that is shattered by C.

Proof. We show that the class C shatters some set S ⊆ {−1, 1}^{m + 2^r}, which is constructed as the direct product of a set U ⊆ {−1, 1}^m and a set V ⊆ {−1, 1}^{2^r}. First, let U = {u_1, …, u_m} be defined by

u_{i,j} = −1 if i = j, 1 otherwise,

for i, j = 1, …, m, where u_{i,j} denotes the jth component of u_i. Second, given an enumeration L_1, …, L_{2^r} of all subsets of the set {1, …, r}, we define V = {v_1, …, v_r} by

v_{k,j} = −1 if k ∈ L_j, 1 otherwise,

for k = 1, …, r and j = 1, …, 2^r. Then the set

S = {u_i : i = 1, …, m} × {v_k : k = 1, …, r}

obviously has cardinality m · r.
To verify that S is shattered by C, assume that some dichotomy (S_0, S_1) of S is given. We denote the m + 2^r input variables of the units in C by x_1, …, x_m, y_1, …, y_{2^r} such that x_1, …, x_m receive inputs from U and y_1, …, y_{2^r} receive inputs from V. We construct monomials M_1, …, M_{2^r} in these variables as follows: Let the function h: {1, …, m} → {1, …, 2^r} satisfy

L_{h(i)} = {k : u_i v_k ∈ S_1},

where u_i v_k is the vector resulting from the concatenation of u_i and v_k. Clearly, h is well defined. Then we build the monomials by defining

M_j = y_j · ∏_{i : h(i) = j} x_i

for j = 1, …, 2^r. Obviously, every variable occurs in at most one monomial and with exponent 1. Hence, the function f: R^{m + 2^r} → R with

f(x_1, …, x_m, y_1, …, y_{2^r}) = M_1 + M_2 + ··· + M_{2^r}

can be computed by a member of C. We claim that the function sgn ∘ f induces the dichotomy (S_0, S_1). Let u_i v_k be some element of S. Since k occurs in exactly half of the sets L_1, …, L_{2^r}, we have from the definition of v_k that

v_{k,1} + v_{k,2} + ··· + v_{k,2^r} = 0.

If u_i v_k ∈ S_0, then k ∉ L_{h(i)} according to the definition of h. Then v_k satisfies v_{k,h(i)} = 1. Since u_{i,i} is the only component of u_i with value −1 and x_i occurs only in monomial M_{h(i)}, we have M_{h(i)}(u_i v_k) = −1 and thus f(u_i v_k) = −2. On the other hand, if u_i v_k ∈ S_1, then k ∈ L_{h(i)} and v_{k,h(i)} = −1. Now M_{h(i)}(u_i v_k) = 1 implies f(u_i v_k) = 2. Thus, (S_0, S_1) is induced by sgn ∘ f.
We can now derive a lower bound for the class of higher-order units where the number of occurrences of the variables is restricted. The result implies that the bound O(nl log(dnl)), where n is the number of variables, l the number of occurrences, and d the largest degree, given in theorem 6 is asymptotically optimal with respect to n. Moreover, this optimality holds even if each variable is allowed to occur at most once and with exponent 1. By ⌊x⌋ we denote the largest integer less than or equal to x.

Corollary 6. Suppose C is the class of higher-order units with n variables, where each variable occurs in at most one monomial and, if so, with exponent 1. Then the VC dimension of C is at least ⌊n/2⌋ · ⌊log(n/2)⌋.
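To see how this Ω(n log n) lower bound compares with the O(nl log(dnl)) upper bound of theorem 6, one can evaluate both at l = d = 1 (an illustrative computation; logarithms are binary, and the constants are those of the statements above):

```python
import math

def cor6_lower(n):
    # corollary 6: floor(n/2) * floor(log(n/2))
    return (n // 2) * math.floor(math.log2(n / 2))

def thm6_upper(n, l=1, d=1):
    # theorem 6, third bound: 2*n*l*log(5*d*n*l)
    return 2 * n * l * math.log2(5 * d * n * l)

for n in (64, 1024, 4096):
    lo, hi = cor6_lower(n), thm6_upper(n)
    assert lo <= hi
    # both quantities grow as n log n, so for this class the bounds
    # are tight up to a constant factor
```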
Proof. If m = ⌊n/2⌋ and r = ⌊log(n/2)⌋, then m + 2^r ≤ n. Hence, theorem 8 shows that there exists a set S ⊆ R^n of cardinality m · r = ⌊n/2⌋ · ⌊log(n/2)⌋ that is shattered by C, even if each variable has exponent 1.

The next result paves the way to a lower bound for the class of higher-order units with a limited number of monomials.

Theorem 9. Let m, r ≥ 1 be natural numbers. Suppose C is the class of higher-order units with m + 2r variables such that each unit consists of at most 2^r monomials and each variable occurs with exponent 1 only. Then there is a set of cardinality m · 2^r that is shattered by C.

Proof. We construct S ⊆ {−1, 0, 1}^{m + 2r} as the direct product of two sets U ⊆ {−1, 1}^m and V ⊆ {0, 1}^{2r}. As in the previous proof, we define U = {u_1, …, u_m} with u_{i,j}, the jth component of u_i, being

u_{i,j} = −1 if i = j, 1 otherwise,

for i, j = 1, …, m. For the definition of V = {v_1, …, v_{2^r}}, let L_1, …, L_{2^r} be an enumeration of all subsets of the set {1, …, r}. Then the components of v_k are defined by

v_{k,j} = 1 if j ∈ L_k, 0 otherwise,    and    v_{k,r+j} = 0 if j ∈ L_k, 1 otherwise,

for k = 1, …, 2^r and j = 1, …, r. Clearly, the set

S = {u_i : i = 1, …, m} × {v_k : k = 1, …, 2^r}

has cardinality m · 2^r. It remains to show that C shatters S. Let (S_0, S_1) be some arbitrary dichotomy of S. Denote the input variables by x_1, …, x_m, y_1, …, y_{2r} such that x_1, …, x_m and y_1, …, y_{2r} receive inputs from U and V, respectively. First, we define monomials N_1, …, N_{2^r} in the variables y_1, …, y_{2r} by

N_k = ∏_{j ∈ L_k} y_j · ∏_{j ∉ L_k} y_{r+j}
for k = 1, …, 2^r. Next, we use them to construct monomials M_1, …, M_{2^r} defined by

M_k = N_k · ∏_{i : u_i v_k ∈ S_1} x_i
for k = 1, …, 2^r, where u_i v_k denotes the concatenation of u_i and v_k. Then the function f: R^{m + 2r} → R with

f(x_1, …, x_m, y_1, …, y_{2r}) = −M_1 − ··· − M_{2^r}

can clearly be computed by a higher-order unit in C. We show that (S_0, S_1) is induced by sgn ∘ f. Let u_i v_k be some element of S. The definitions of v_k and the monomials N_1, …, N_{2^r} imply that

N_l(u_i v_k) = 1 if l = k, 0 otherwise,

for l = 1, …, 2^r. Hence, we have

f(u_i v_k) = −M_k(u_i v_k) = −∏_{h : u_h v_k ∈ S_1} u_{i,h},

which together with the definition of u_i implies

f(u_i v_k) = −1 if u_i v_k ∈ S_0, 1 if u_i v_k ∈ S_1.

Thus, sgn ∘ f induces the dichotomy (S_0, S_1) as claimed, and, consequently, S is shattered by C.

Finally, we obtain a lower bound for the class of higher-order units with a restricted number of monomials. In particular, the result shows that the bound O(nk log d), where k is the number of monomials, obtained in theorem 5 cannot be improved with respect to nk.

Corollary 7. Suppose C is the class of higher-order units with n variables where each variable has exponent 1 and each unit consists of at most k monomials for some arbitrary k satisfying k ≤ 2^{n/4}. Then C has VC dimension at least ⌊n/2⌋ · ⌊k/2⌋.

Proof. Let r be the largest integer satisfying 2^r ≤ k. Clearly, then, 2^r > ⌊k/2⌋. If we choose m = ⌊n/2⌋, then m + 2r ≤ ⌊n/2⌋ + 2 log k ≤ n, and theorem 9 implies that there is a set S ⊆ R^n of cardinality m · 2^r ≥ ⌊n/2⌋ · ⌊k/2⌋ that is shattered by the class of higher-order units with at most 2^r ≤ k monomials where each variable has exponent 1.

We conclude by observing that the previous result also yields a lower bound for the class of higher-order units with restricted fan-out of the input nodes since, clearly, each variable occurs in at most k monomials.
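As with theorem 8, the selector construction of theorem 9 can be checked exhaustively for small m and r (an illustrative sketch; N_k acts as an indicator that the y-part of the input equals v_k):

```python
from itertools import product as cartesian

m, r = 2, 2
R = 2 ** r
L = [{j + 1 for j in range(r) if (idx >> j) & 1} for idx in range(R)]
U = [tuple(-1 if i == j else 1 for j in range(m)) for i in range(m)]
# v_k in {0,1}^{2r}: v_{k,j} = 1 iff j in L_k, and v_{k,r+j} = 0 iff j in L_k
V = [tuple([1 if (j + 1) in L[k] else 0 for j in range(r)] +
           [0 if (j + 1) in L[k] else 1 for j in range(r)]) for k in range(R)]
points = [(i, k) for i in range(m) for k in range(R)]

def N(k, y):
    # N_k = prod_{j in L_k} y_j * prod_{j not in L_k} y_{r+j}
    out = 1
    for j in range(r):
        out *= y[j] if (j + 1) in L[k] else y[r + j]
    return out

for labels in cartesian([0, 1], repeat=len(points)):
    S1 = {p for p, lab in zip(points, labels) if lab == 1}
    def f(i, k):
        # f = -M_1 - ... - M_{2^r} with M_k2 = N_k2 * prod_{i2:(i2,k2) in S1} x_i2
        total = 0
        for k2 in range(R):
            mono = N(k2, V[k])
            for i2 in range(m):
                if (i2, k2) in S1:
                    mono *= U[i][i2]
            total -= mono
        return total
    for (i, k) in points:
        assert f(i, k) == (1 if (i, k) in S1 else -1)
```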
5.2 Networks with Product Units. Multiplication is certainly the simplest type of arithmetical operation that can be performed by a product unit. All weights just need to be set to 1. Koiran and Sontag (1997) show that there exist networks consisting of linear and multiplication units that have VC dimension quadratic in the number of weights. Hence, this bound remains valid when product units are used instead of multiplication units, and corollary 1 of Koiran and Sontag (1997) implies that for every W, there is a network with O(W) weights that consists only of linear and product units and has VC dimension W^2. This lower bound is based on the use of networks with unrestricted depth. An extension of the result of Koiran and Sontag (1997) is obtained by Bartlett et al. (1998), who give a lower bound for layered sigmoidal networks in terms of the number of weights and the number of layers. Using the constructions of Koiran and Sontag (1997) and Bartlett et al. (1998) in terms of linear and multiplication units, we deduce that for every L and sufficiently large W, there is a network with L layers and O(W) weights that consists only of linear and product units and has VC dimension at least ⌊L/2⌋ · ⌊W/2⌋. Thus, in terms of the number of weights, we have a quadratic lower bound for arbitrary networks and a linear lower bound for networks of constant depth. It is known, however, that networks of summing units can have constant depth and superlinear VC dimension. For threshold units, such networks have been constructed by Sakurai (1993) and Maass (1994). We show now that product unit networks of constant depth also can have a superlinear VC dimension. In particular, we establish this for networks consisting of product and linear units and having two hidden layers. The numbering of the hidden layers in the following statement is done from the input nodes toward the output node.

Theorem 10. Let n, k be natural numbers satisfying k ≤ 2^{n+2}.
There is a network N with the following properties: It has n input nodes, at most k hidden nodes arranged in two layers with product units in the first hidden layer and linear units in the second, and a product unit as output node; furthermore, N has 2n⌊k/4⌋ adjustable and 7⌊k/4⌋ fixed weights. The VC dimension of N is at least (n − ⌊log(k/4)⌋) · ⌊k/8⌋ · ⌊log(k/8)⌋.

With the aim of proving this, we first establish a lemma in which we introduce a new kind of summing unit and make use of a property of sets of vectors. A set of m vectors in R^n is said to be in general position if every subset of at most n vectors is linearly independent. Obviously, a set in general position can be constructed for any m and n. The new summing unit has weights and a threshold as parameters and computes its output by applying the activation function t(y) = 1 + 1/cosh(y) to the weighted sum. This function has its maximum at y = 0 with t(0) = 2 and satisfies lim t(y) = 1 for y → −∞ as well as for y → ∞. Further, t(y) ≥ 1 always holds.
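The stated properties of this auxiliary activation, a unique maximum t(0) = 2, limits 1 at ±∞, and t(y) ≥ 1 throughout, are easy to confirm numerically (an illustrative sketch):

```python
import math

def t(y):
    # summing-unit activation used in lemma 6 below
    return 1.0 + 1.0 / math.cosh(y)

assert t(0.0) == 2.0                              # unique maximum at y = 0
assert all(1.0 < t(y) < 2.0 for y in (-5.0, -0.1, 0.1, 5.0))
assert abs(t(50.0) - 1.0) < 1e-12                 # t(y) -> 1 as y -> +inf
assert abs(t(-50.0) - 1.0) < 1e-12                # t(y) -> 1 as y -> -inf
# Scaling a nonzero weighted sum by a large a > 0 therefore drives the
# output toward 1, which the proof of lemma 6 exploits.
```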
Lemma 6. Let h, m, r be arbitrary natural numbers. Suppose N is a network with m + r input nodes, one hidden layer of h + 2^r nodes that are summing units with activation function 1 + 1/cosh, and a monomial as output node. Then there is a set of cardinality h · m · r that is shattered by N.

Proof. The construction is based on methods due to Sakurai and Yamasaki (1992) and Sakurai (1993). We choose a set {s_1, …, s_{h·m}} ⊆ R^m in general position and let e_1, …, e_r be the unit vectors in R^r, that is, they have a 1 in exactly one component and 0 elsewhere. Clearly, then, the set

S = {s_i : i = 1, …, h·m} × {e_j : j = 1, …, r}

is a subset of R^{m+r} with cardinality h · m · r. We show that it can be shattered by the network N as claimed. Assume that (S_0, S_1) is a dichotomy of S. Let L_1, …, L_{2^r} be an enumeration of all subsets of the set {1, …, r}, and define the function g: {1, …, h·m} → {1, …, 2^r} to satisfy

L_{g(i)} = {j : s_i e_j ∈ S_1},

where s_i e_j denotes the concatenated vectors s_i and e_j. For l = 1, …, 2^r let R_l ⊆ {s_1, …, s_{h·m}} be the set R_l = {s_i : g(i) = l}. For each R_l, we use ⌈|R_l|/m⌉ hidden nodes of which we define the weights as follows: We partition R_l into ⌈|R_l|/m⌉ subsets R_{l,p}, p = 1, …, ⌈|R_l|/m⌉, each of which has cardinality m, except for possibly one set of cardinality less than m. For each subset R_{l,p} there exist real numbers w_{l,p,1}, …, w_{l,p,m}, t_{l,p} such that every s_i ∈ {s_1, …, s_{h·m}} satisfies

(w_{l,p,1}, …, w_{l,p,m}) · s_i − t_{l,p} = 0 if and only if s_i ∈ R_{l,p}.
(5.1)
This follows from the fact that the set {s_1, …, s_{h·m}} is in general position. (In other words, (w_{l,p,1}, …, w_{l,p,m}, t_{l,p}) represents the hyperplane passing through all points in R_{l,p} and through none of the other points.)^7

^7 If |R_{l,p}| = m, then this hyperplane is unique. In case |R_{l,p}| < m, we select one of the hyperplanes containing R_{l,p} and none of the other points. This can be done, for example, by extending the unique (|R_{l,p}| − 1)-dimensional hyperplane determined by R_{l,p} to an appropriate (m − 1)-dimensional hyperplane.

With subset R_{l,p}, we associate a hidden node with threshold t_{l,p} and with weights w_{l,p,1}, …, w_{l,p,m} for the connections from the first m input nodes. Since of
all subsets R_{l,p} at most h have cardinality m and at most 2^r have cardinality less than m, this construction can be done with at most h + 2^r hidden nodes. Thus far, we have specified the weights for the connections outgoing from the first m input nodes. The connections from the remaining r input nodes are weighted as follows: Let ε > 0 be a real number such that for every s_i ∈ {s_1, …, s_{h·m}} and every weight vector (w_{l,p,1}, …, w_{l,p,m}, t_{l,p}), if s_i ∉ R_{l,p} then |(w_{l,p,1}, …, w_{l,p,m}) · s_i − t_{l,p}| > ε.

According to the construction of the weight vectors in equation 5.1, such an ε clearly exists. We define the remaining weights w_{l,p,m+1}, …, w_{l,p,m+r} by

w_{l,p,m+j} = 0 if j ∈ L_l, ε otherwise.

(5.2)

This completes the definition of the hidden nodes. We show that they have the following property:

Claim. If s_i e_j ∈ S_1, then there is exactly one hidden node with output value 2; if s_i e_j ∈ S_0, then all hidden nodes yield an output value less than 2.

In order to establish this we observe that according to equation 5.1, there is exactly one weight vector (w_{l,p,1}, …, w_{l,p,m}, t_{l,p}), where l = g(i), that yields 0 on s_i. If s_i e_j ∈ S_1, then j ∈ L_{g(i)}, which together with equation 5.2 implies that the weighted sum (w_{l,p,m+1}, …, w_{l,p,m+r}) · e_j is equal to 0. Hence, this node gets the total weighted sum 0 and, applying 1 + 1/cosh, outputs 2. The input vector e_j changes the weighted sums of the other nodes by an amount of at most ε. Thus, the total weighted sums for these nodes remain different from 0, and, hence, the output values are less than 2. On the other hand, if s_i e_j ∈ S_0, then j ∉ L_{g(i)}, and the node that yields 0 on s_i receives an additional amount ε through weight w_{l,p,m+j}. This gives a total weighted sum different from 0 and an output value less than 2. All other nodes fail to receive 0 by an amount of more than ε and thus have total weighted sum different from 0 and, hence, an output value less than 2. Thus, the claim is proven.

Finally, to complete the proof, we do one more modification with the weight vectors and define the weights for the output node. Clearly, if we multiply all weights and thresholds defined thus far with any real number a > 0, the claim above remains true. Since lim(1 + 1/cosh(y)) = 1 for y → −∞ and y → ∞, we can find an a such that on every s_i e_j ∈ S, the output values of those hidden nodes that do not output 2 multiplied together yield a value as close to 1 as necessary. Further, this value is at least 1, since 1 + 1/cosh(y) ≥ 1 for all y. Thus, if we employ a monomial with all exponents equal to 1 for the output node, it follows from the reasoning above that the
output value of the network is at least 2 if and only if s_i e_j ∈ S_1. This shows that S is shattered by N.

We now employ the previous result and give a proof of theorem 10.

Proof of Theorem 10. The idea is to take a set S′ constructed as in lemma 6 and, as shown there, shattered by a network N′ with a monomial as output node and one hidden layer of summing units that use the activation function 1 + 1/cosh. Then S′ is transformed into a set S and N′ into a network N such that for every dichotomy (S′_0, S′_1) induced by N′ on S′, the network N induces the corresponding dichotomy (S_0, S_1) of S. Assume that n and k are given as supposed and let S′ be the set defined in lemma 6, choosing h = ⌊k/8⌋, m = n − ⌊log(k/4)⌋, and r = ⌊log(k/8)⌋. Note that the assumption k ≤ 2^{n+2} ensures that m ≥ 0. Then S′ has cardinality

m · h · r = (n − ⌊log(k/4)⌋) · ⌊k/8⌋ · ⌊log(k/8)⌋.

Furthermore, we have m + r = n − 1 and hence S′ ⊆ R^{n−1}, and lemma 6 implies that S′ is shattered by a network N′ with n − 1 input nodes, a monomial as output node, and one hidden layer of h + 2^r ≤ ⌊k/4⌋ summing units with activation function 1 + 1/cosh. From S′ we construct S ⊆ R^n by defining

S = {(e^{s′_1}, …, e^{s′_{n−1}}, e) : (s′_1, …, s′_{n−1}) ∈ S′}.

In other words, S is obtained from S′ by appending a component containing 1 to each vector and applying the function y ↦ exp(y) to every component. On some input vector s′ ∈ S′, a hidden node of N′ with weight vector w and threshold t computes

1 + 1/cosh(w · s′ − t) = (exp(w · s′ − t) + exp(−w · s′ + t) + 2) / (exp(w · s′ − t) + exp(−w · s′ + t)).
(5.3)
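Equation 5.3 and the subsequent simulation step can both be checked directly; the sketch below verifies the identity and that a product unit on the exponentiated inputs reproduces exp(w · s′ − t) (illustrative values):

```python
import math

def lhs(z):
    return 1.0 + 1.0 / math.cosh(z)

def rhs(z):
    # (e^z + e^{-z} + 2) / (e^z + e^{-z}): the two product units supply e^z
    # and e^{-z}; the numerator's linear unit contributes the constant 2
    ep, en = math.exp(z), math.exp(-z)
    return (ep + en + 2.0) / (ep + en)

for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(lhs(z) - rhs(z)) < 1e-12

# A product unit on s = (exp(s'_1), ..., exp(s'_{n-1}), e) with weights
# (w_1, ..., w_{n-1}, -t) computes exp(w . s' - t):
w, thr, s_prime = [0.4, -1.1], 0.3, [1.2, 0.8]
s = [math.exp(v) for v in s_prime] + [math.e]
out = 1.0
for wi, si in zip(w + [-thr], s):
    out *= si ** wi
assert abs(out - math.exp(sum(a * b for a, b in zip(w, s_prime)) - thr)) < 1e-9
```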
If s = (s_1, …, s_n) is the vector in S obtained from the (unique) vector s′ = (s′_1, …, s′_{n−1}) in S′, then according to the construction of S

(s′_1, …, s′_{n−1}, 1) = (ln(s_1), …, ln(s_{n−1}), ln(s_n)),

which implies that an exponential on the right-hand side of equation 5.3 with weights w and threshold t yields on input vector s′ the same output as a product unit with weights w, t on input vector s. (It is clear now that the reason for appending one component to the vectors in S′ was to accommodate the threshold t as a weight in a product unit.) Therefore, the computation of a summing unit with activation function 1 + 1/cosh on s′ ∈ S′ can be simulated
by feeding the vector s ∈ S into a network with two hidden layers, where the first layer consists of two product units, the second layer has two linear units, and the output node computes a division. Furthermore, this network of four hidden nodes has 2n connections with adjustable weights and seven connections with fixed weights (two for each linear unit, one for the threshold of the linear unit computing the numerator, and two for the division). Replacing all ⌊k/4⌋ hidden nodes of N′ in this way, we obtain the network N, which has at most k hidden nodes arranged in two layers, where the first hidden layer consists of product units and the second of linear units. The output node has to compute a product of divisions, which can be done by a single product unit. Further, N has 2n⌊k/4⌋ adjustable and 7⌊k/4⌋ fixed weights. Thus, N has the properties as claimed and shatters the set S, which has the same cardinality as S′.

From the previous result, we derive the following simpler statement of a superlinear lower bound.

Corollary 8. Let n, k be natural numbers where 16 ≤ k ≤ 2^{n/2 + 2}. There is a network of product and linear units with n input units, at most k hidden nodes in two layers, and at most nk weights that has VC dimension at least (nk/32) log(k/16).

Proof. The network constructed in the proof of theorem 10 has 2n⌊k/4⌋ + 7⌊k/4⌋ ≤ nk/2 + 2k weights, which are, using 2 ≤ n/2 from the assumptions, not more than nk weights. The VC dimension of this network was shown to be at least (n − ⌊log(k/4)⌋) · ⌊k/8⌋ · ⌊log(k/8)⌋. Now, k ≤ 2^{n/2 + 2} implies n − ⌊log(k/4)⌋ ≥ n/2, from k ≥ 16 we get ⌊k/8⌋ ≥ k/8 − 1 ≥ k/16, and at last we use ⌊log(k/8)⌋ ≥ log(k/8) − 1 = log(k/16).

6 Summary and Conclusion

Multiplication is an arithmetical operation that, when used in neural networks, certainly helps to increase their computational power by allowing neural inputs to interact nonlinearly.
The question is how this gain is reflected in quantitative measures of complexity and, in particular, of analog computational power. In this article, we have dealt with two such measures: the Vapnik-Chervonenkis dimension and the pseudo-dimension. We have derived upper and lower bounds on these dimensions for neural networks in which multiplication occurs as a fundamental operation in the interaction of network elements. An overview of the results is given in Table 1, where we present the bounds mainly in asymptotic form, abstracting from most of the constant factors.
Table 1: Survey of the Results.

Architecture | Class/Single | Unit Types | Remarks | Bound | Reference
General feedforward | Single | Product and sigmoidal | | (Wk)^2 + O(Wk log W) | Corollary 1
General feedforward | Single | Higher-order sigmoidal | Exponents as parameters | 9(W^2 k)^2 + O(W^2 k log(W^2 k)) | Theorem 3
General feedforward | Single | Higher-order sigmoidal | Exponents fixed | 4(Wk)^2 + O(Wk log(Wk)) | Theorem 4
General feedforward | Single | Product and summing | Unit type variable | 36(Wk)^4 + O((Wk)^2 log(Wk)) | Corollary 4
Two hidden layers | Single | Product and summing | Summing units in second hidden layer, k hidden nodes | Ω(W log k) | Corollary 8
One hidden layer | Single | Product and summing | Input nodes fan-out ≤ l | 4(nl)^4 + O((nl)^2 log(nl)) | Corollary 3
Single unit | Class | Higher-order | Input nodes fan-out ≤ l, exponents ≤ d | O((nl)^4), O(n^2 l log d), O(nl log(dnl)) | Theorem 6
Single unit | Class | Higher-order | Input nodes fan-out ≤ l, exponents 1 | Ω(nl) | Corollary 7
Single unit | Class | Higher-order | Input nodes fan-out 1, exponents 1 | Ω(n log n) | Corollary 6
Single unit | Class | Higher-order | k monomials, exponents ≤ d | O(n^2 k^4), O(nk log d) | Theorem 5
Single unit | Class | Higher-order | k monomials, exponents 1 | Ω(nk) | Corollary 7
Single unit | Single | Product | | n | Theorem 7
Single unit | Class | Monomial | | n | Corollary 5

Notes: If not otherwise stated, W, k, and n refer to the number of parameters, computation nodes, and input nodes, respectively. Upper bounds are valid for the pseudo-dimension, lower bounds for the VC dimension.
The bounds are given in terms of the numbers of network parameters and computation nodes and, for classes, in terms of the restrictions that characterize the architectures in the respective class. We highlight two features. First, the upper bounds are all polynomials of low order. In particular, the bound for general feedforward networks exhibits the same order of magnitude as the best-known upper bound for purely sigmoidal networks, and this even in the case when it is not predetermined whether a node is to become a summing or a product unit. Second, the upper bounds for higher-order networks and some of the bounds for classes of higher-order units do not involve any constraint on the order. It is therefore impossible to find lower bounds that exhibit a growth in terms of the order only. This limitation is also indicated by the fact that some lower bounds for classes of higher-order units are already tight for order one. In this case, the degree of multiplicativity cannot help in proving better lower bounds. In general, the results show that multiplication in neural networks does not lead to an immeasurable growth of VC dimension and pseudo-dimension.

In practical uses of artificial neural networks, such as, for instance, in pattern recognition, higher-order networks and product unit networks are considered natural extensions of the classical linear summing networks. We have reported on some applications where learning algorithms have been designed for training multiplicative networks. The question of how well networks resulting from these algorithms generalize is studied theoretically in numerous models of learning. In a major part of them, the VC dimension and the pseudo-dimension play a central role. They can be used to estimate the number of training examples required by learning algorithms to generate hypotheses with low generalization error.
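For instance, combining the bound of theorem 5 with a standard VC-based sufficient sample size (see, e.g., Anthony & Bartlett, 1999; the constant factors below are illustrative, not those of any particular theorem) gives concrete estimates:

```python
import math

def vc_upper_thm5(n, k, d):
    # theorem 5: higher-order units with n inputs, k monomials, and
    # exponents at most d have VC dimension at most 2*n*k*log2(9*d)
    return 2 * n * k * math.log2(9 * d)

def sample_size(vc_dim, eps, delta):
    # a standard PAC-style sufficient sample size, up to constant factors:
    # m = O((1/eps) * (vc_dim * log(1/eps) + log(1/delta)))
    return math.ceil((vc_dim * math.log(1 / eps) + math.log(1 / delta)) / eps)

d_vc = vc_upper_thm5(n=10, k=5, d=3)
m = sample_size(d_vc, eps=0.1, delta=0.05)
assert m > d_vc        # the estimate exceeds the dimension itself
```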
The results given here imply that estimates can now be given for higher-order sigmoidal networks that do not come with an a priori restriction of their order. Hence, one need not cope with a sample complexity that grows with the order. For learning applications, this suggests the use of higher-order networks without any limit on the order. Further, the estimates are valid for a class of neural network learning algorithms that has yet to be developed. They hold even if the algorithm is allowed to decide for each node whether it is to be a summing or a product unit. Apart from applications of learning, multiplicative neural networks are used for modeling the behavior of biological nervous systems or parts thereof. In this context, questions arise as to what type of functions have been captured in a model that has been constructed in accordance with experimental observations. The VC dimension and the pseudo-dimension are combinatorial measures for the complexity and diversity of function classes. As such, they can be used to compare networks with respect to their expressiveness. Moreover, using upper bounds on these dimensions, lower bounds on the size of networks for the computation and approximation of functions can be calculated. By means of the results given here, such calculations can now be done for multiplicative neural networks. Thus, a new
tool is available for the assessment of these networks and for the verification of their proper use in neural modeling.

Our investigations naturally raise some new questions. Most prominent, since it is also an open problem for networks of sigmoidal units, is the issue whether significantly better upper bounds can be shown for networks of fixed depth. The bounds for depth-restricted networks established so far coincide with the bounds for general feedforward networks. For the latter, however, quadratic lower bounds have been derived using a method that does not apply to constant-depth networks. Thus, the gap between upper and lower bound for depth-restricted networks is larger than in the general feedforward case. The so-called fat-shattering dimension is a further combinatorial measure that is known to give bounds on the complexity of learning. Since it is bounded from above by the pseudo-dimension, the results in this article imply upper bounds on the fat-shattering dimension. Moreover, when the output node is a linear unit, the fat-shattering dimension is equal to the pseudo-dimension. It is an interesting question whether for networks with nonlinear output nodes, bounds can be obtained for the fat-shattering dimension that are significantly smaller than the pseudo-dimension of the network. The lower bounds we presented are all derived for the VC dimension and, hence, are by definition also valid for the pseudo-dimension. It is currently not known how to obtain lower bounds for the pseudo-dimension of neural networks directly. Finally, our calculations resulted in several constants appearing in the bounds. We did not strive for obtaining optimal values but were content with the constants being small. Certainly, improvements might be possible using more detailed calculations or new approaches.

Appendix: Solution Set Components Bounds

In the following we give the proofs of the lemmas required for the upper bounds in section 4.

Proof of Lemma 1.
Consider for 1 ≤ k ≤ d an arbitrary set {f_1, …, f_k} ⊆ G that has regular zero-set intersections and let p_i be the degree of f_i. It follows from Khovanskii (1991, p. 91, corollary 3) that if l is the dimension of the set {a ∈ R^d : f_1(a) = ··· = f_k(a) = 0}, then this set has at most

2^{q(q−1)/2} p_1 ··· p_k S^l [(l+1)S − l]^q

connected components, where S = Σ_{i=1}^{k} p_i + l + 1. From p_i ≤ p, k ≤ d, and l ≤ d, we get S ≤ (p+1)d + 1 and (l+1)S − l ≤ (p+1)d(d+1) + 1, which implies the result.
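The quantities assembled in this proof fit in a small calculator (an illustrative sketch following the estimates above, with every degree p_i replaced by the common bound p and both k and l replaced by d):

```python
def solution_set_components_bound(p, d, q):
    # lemma 1: at most 2^{q(q-1)/2} * p_1...p_k * S^l * ((l+1)S - l)^q
    # components, with S = sum(p_i) + l + 1; substituting p_i <= p, k <= d,
    # l <= d yields S <= (p+1)*d + 1 and (l+1)S - l <= (p+1)*d*(d+1) + 1
    S = (p + 1) * d + 1
    return 2 ** (q * (q - 1) // 2) * p ** d * S ** d \
        * ((p + 1) * d * (d + 1) + 1) ** q

b = solution_set_components_bound(p=2, d=3, q=2)
assert b > 0 and isinstance(b, int)
```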
Proof of Lemma 2. Let k ≤ d and consider some arbitrary set {f_1, …, f_k} ⊆ G that has regular zero-set intersections. According to the assumptions, for i = 1, …, k each f_i can be written in the form

(y_1, …, y_d) ↦ a_i + b_{i,1} y_1 + ··· + b_{i,d} y_d + c_{i,1} e^{g_{i,1}} + ··· + c_{i,r_i} e^{g_{i,r_i}},

where r_i ≤ q, the g_{i,j} are affine functions in y_1, …, y_d, and a_i, b_{i,1}, …, b_{i,d}, c_{i,1}, …, c_{i,r_i} are real numbers. We introduce new functions f̃_i and h_{i,j} in y_1, …, y_d and in new variables z_{i,j} by defining

f̃_i(y_1, …, y_d, z_{i,1}, …, z_{i,r_i}) = a_i + b_{i,1} y_1 + ··· + b_{i,d} y_d + c_{i,1} e^{z_{i,1}} + ··· + c_{i,r_i} e^{z_{i,r_i}},

h_{i,j}(y_1, …, y_d, z_{i,j}) = g_{i,j}(y_1, …, y_d) − z_{i,j},

for i = 1, …, k and j = 1, …, r_i. Let G̃ be the class of affine functions in y_1, …, y_d, and in z_{i,j} and e^{z_{i,j}}, for i = 1, …, k and j = 1, …, q. Clearly, the functions f̃_i and h_{i,j} are elements of G̃. Furthermore, since f_1, …, f_k are chosen arbitrarily from G and at most q new variables are introduced for each f_i, the classes G and G̃ satisfy definition 7.12 of Anthony and Bartlett (1999), that is, G̃ computes G with q intermediate variables. In particular, the partial derivative of h_{i,j} with respect to the variable z_{i,j} is −1, and hence nonzero. Thus, the derivative condition iii of definition 7.12 is met. It follows from theorem 7.13 of Anthony and Bartlett (1999) that any solution set components bound for G̃ is also a solution set components bound for G. Since G̃ consists of polynomials of degree 1 in dq + d variables and dq fixed exponentials, by virtue of lemma 1, class G̃ has solution set components bound

B = 2^{dq(dq−1)/2} [2(dq + d) + 1]^{dq+d} [2(dq + d)(dq + d + 1) + 1]^{dq},

which is hence also a solution set components bound for G as claimed.

Proof of Lemma 3. Let {f_1, …, f_k} ⊆ F and {s_1, …, s_m} ⊆ R^n be given, and T be defined as above. In the proof of theorem 7.8 in Anthony and Bartlett (1999), it is shown that then there exist real numbers λ_{1,1}, …, λ_{k,m} such that the following holds: Let C denote the number of connected components of the set
Rd ¡
k [ m [ iD 1 jD 1
fa 2 Rd : fi ( a, sj ) ¡ li, j D 0g.
Multiplicative Neural Networks
293
Then $T$ satisfies $|T| \le C$, and the set of functions $\{a \mapsto f_i(a, s_j) - \lambda_{i,j} : i = 1, \ldots, k;\ j = 1, \ldots, m\}$ has regular zero-set intersections. Clearly, this set is a subset of $G$, which has solution set components bound $B$. In the proof of Theorem 7.6 in Anthony and Bartlett (1999), it is shown that this implies
$C \le B \sum_{i=0}^{d} \binom{mk}{i} \le B \left( \frac{emk}{d} \right)^{d}$
for $m \ge d/k$. Hence, the claimed result follows using $|T| \le C$.

Proof of Lemma 4. Let $\{f_1, \ldots, f_k\} \subseteq G$, where $k \le d$, be some arbitrary set of functions that has regular zero-set intersections. According to the assumptions, there is for each $f_i$ a network that computes $f_i$ with $r$ and $q$ computation nodes as described. We number the nodes such that the computation of each node depends only on nodes with a smaller number. Then for $i = 1, \ldots, k$ and $j = 1, \ldots, r$, the computation performed by node $j$ in the network for $f_i$ can be represented by a function $n_{i,j}$ in the variables $y_1, \ldots, y_d$ that is recursively defined by
$n_{i,j}(y) = \begin{cases} c_{i,j} + 1/(1 + \exp(-n_{i,l}(y))), & l < j, \\ c_{i,j} \pm e^{n_{i,l}(y)}, & l < j, \\ \ln(n_{i,l}(y)), & l < j, \\ p_{i,j}(y, n_{i,1}(y), \ldots, n_{i,j-1}(y)), \end{cases}$
depending on whether node $j$ computes the function $a \mapsto c_{i,j} + 1/(1 + e^{-a})$, $a \mapsto c_{i,j} \pm e^{a}$, $a \mapsto \ln a$, or the degree-2 polynomial $p_{i,j}$, respectively. We introduce new functions $g_{i,j}$, $\tilde g_{i,j}$ in $y_1, \ldots, y_d$ and in new variables $z_{i,j}$, $\tilde z_{i,j}$ corresponding to the above four cases for $n_{i,j}$ as follows: If node $j$ in the network for $f_i$ computes
1. the function $a \mapsto c_{i,j} + 1/(1 + e^{-a})$, then $\tilde g_{i,j}(y, \tilde z_{i,j}) = n_{i,l}(y) + \tilde z_{i,j}$ and $g_{i,j}(z_{i,j}, \tilde z_{i,j}) = (z_{i,j} - c_{i,j})(1 + \exp(\tilde z_{i,j})) - 1$;
2. the function $a \mapsto c_{i,j} \pm e^{a}$, then $\tilde g_{i,j}(y, \tilde z_{i,j}) = n_{i,l}(y) - \tilde z_{i,j}$ and $g_{i,j}(z_{i,j}, \tilde z_{i,j}) = \exp(\tilde z_{i,j}) \pm c_{i,j} - z_{i,j}$;
3. the function $a \mapsto \ln a$, then $\tilde g_{i,j}(y, \tilde z_{i,j}) = n_{i,l}(y) - \exp(\tilde z_{i,j})$ and $g_{i,j}(z_{i,j}, \tilde z_{i,j}) = \tilde z_{i,j} - z_{i,j}$;
4. the degree-2 polynomial $p_{i,j}$, then $g_{i,j}(y, z_{i,1}, \ldots, z_{i,j}) = p_{i,j}(y, z_{i,1}, \ldots, z_{i,j-1}) - z_{i,j}$;
where $l$, $c_{i,j}$, and $p_{i,j}$ are as in the corresponding definition of $n_{i,j}$ above. Let $\tilde G$ be the class of polynomials of degree at most 2 in the variables $y_1, \ldots, y_d$ and in $z_{i,j}$, $\tilde z_{i,j}$, and $\exp(\tilde z_{i,j})$. There are at most $kr$ variables $z_{i,j}$ and, since the variables $\tilde z_{i,j}$ are introduced only for those nodes that compute a nonpolynomial function, at most $kq$ variables $\tilde z_{i,j}$ and $kq$ exponentials $\exp(\tilde z_{i,j})$. Clearly, $\tilde G$ contains the functions $\tilde g_{i,j}$ and $g_{i,j}$, which implicitly define the variables $\tilde z_{i,j}$ and $z_{i,j}$, respectively. The partial derivative of $\tilde g_{i,j}$ with respect to $\tilde z_{i,j}$ in cases 1–3 is $1$, $-1$, and $-\exp(\tilde z_{i,j})$, respectively. For $g_{i,j}$ the partial derivative with respect to $z_{i,j}$ is $1 + \exp(\tilde z_{i,j})$ in case 1 and $-1$ in cases 2–4. All of these partial derivatives are everywhere nonzero. Hence, condition (iii) of Definition 7.12 in Anthony and Bartlett (1999) is satisfied. Furthermore, Theorem 7.13 of Anthony and Bartlett (1999) implies that $\tilde G$ computes $G$ with $r + q$ intermediate variables, and any solution set components bound for $\tilde G$ is also a solution set components bound for $G$. The polynomials in $\tilde G$ are of degree at most 2 in no more than $2dr + d$ variables and $dq$ fixed exponentials. Thus, Lemma 1 shows that $\tilde G$ has solution set components bound
$B = 2^{dq(dq-1)/2}\, [6(2dr+d)+2]^{2dr+d}\, [3(2dr+d)(2dr+d+1)+1]^{dq}$,
and we conclude that this is also a solution set components bound for $G$.

Acknowledgments

I thank the anonymous referees for helpful comments. This work was supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150. Some of the results were presented at the NeuroCOLT Workshop "New Perspectives in the Theory of Neural Nets" in Graz, Austria, on May 3, 2000. I am grateful to Wolfgang Maass for the invitation to give a talk at this meeting.

References

Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 456–458. Anthony, M. (1995). Classification by polynomial surfaces. Discrete Applied Mathematics, 61, 91–103.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press. Anzai, A., Ohzawa, I., & Freeman, R. D. (1999a). Neural mechanisms for processing binocular information I. Simple cells. Journal of Neurophysiology, 82, 891–908. Anzai, A., Ohzawa, I., & Freeman, R. D. (1999b). Neural mechanisms for processing binocular information II. Complex cells. Journal of Neurophysiology, 82, 909–924. Bartlett, P. L., Maiorov, V., & Meir, R. (1998). Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10, 2159–2173. Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160. Ben-David, S., Cesa-Bianchi, N., Haussler, D., & Long, P. M. (1995). Characterizations of learnability for classes of {0, . . . , n}-valued functions. Journal of Computer and System Sciences, 50, 74–86. Ben-David, S., & Lindenbaum, M. (1998). Localization vs. identification of semialgebraic sets. Machine Learning, 32, 207–224. Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press. Blomfield, S. (1974). Arithmetical operations performed by nerve cells. Brain Research, 69, 115–124. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36, 929–965. Bugmann, G. (1991). Summation and multiplication: Two distinct operation domains of leaky integrate-and-fire neurons. Network: Computation in Neural Systems, 2, 489–509. Bugmann, G. (1992). Multiplying with neurons: Compensation for irregular input spike trains by using time-dependent synaptic efficiencies. Biological Cybernetics, 68, 87–92. Burshtein, D. (1998). Long-term attraction in higher order neural networks. IEEE Transactions on Neural Networks, 9, 42–50. Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex.
Science, 264, 1333–1336. Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326–334. Cover, T. M. (1968). Capacity problems for linear machines. In L. N. Kanal (Ed.), Pattern recognition (pp. 283–289). Washington, DC: Thompson Book Co. Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley. Durbin, R., & Rumelhart, D. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133–142. Fahner, G., & Eckmiller, R. (1994). Structural adaptation of parsimonious higher-order neural classifiers. Neural Networks, 7, 279–289.
Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205–254. Gabbiani, F., Krapp, H. G., & Laurent, G. (1999). Computation of object approach by a wide-field, motion-sensitive neuron. Journal of Neuroscience, 19, 1122–1141. Ghosh, J., & Shin, Y. (1992). Efficient higher-order neural networks for classification and function approximation. International Journal of Neural Systems, 3, 323–350. Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26, 4972–4978. Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405. Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., & Chen, D. (1990). Higher order recurrent networks and grammatical inference. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 380–387). San Mateo, CA: Morgan Kaufmann. Goldberg, P. W., & Jerrum, M. R. (1995). Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18, 131–148. Hancock, T. R., Golea, M., & Marchand, M. (1994). Learning nonoverlapping perceptron networks from examples and membership queries. Machine Learning, 16, 161–183. Hatsopoulos, N., Gabbiani, F., & Laurent, G. (1995). Elementary computation of object approach by a wide-field visual neuron. Science, 270, 1000–1003. Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150. Haussler, D., Kearns, M., & Schapire, R. E. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–113. Heywood, M., & Noakes, P. (1995). A framework for improved training of sigma-pi networks.
IEEE Transactions on Neural Networks, 6, 893–903. Hofmeister, T. (1994). Depth-efficient threshold circuits for arithmetic functions. In V. Roychowdhury, K.-Y. Siu, & A. Orlitsky (Eds.), Theoretical advances in neural computation and learning (pp. 37–84). Norwell, MA: Kluwer. Ismail, A., & Engelbrecht, A. P. (2000). Global optimization algorithms for training product unit neural networks. In International Joint Conference on Neural Networks IJCNN'2000 (Vol. I, pp. 132–137). Los Alamitos, CA: IEEE Computer Society. Janson, D. J., & Frenzel, J. F. (1993). Training product unit neural networks with genetic algorithms. IEEE Expert, 8(5), 26–33. Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54, 169–176.
Karpinski, M., & Werther, T. (1993). VC dimension and uniform learnability of sparse polynomials and rational functions. SIAM Journal on Computing, 22, 1276–1285. Khovanskiĭ, A. G. (1991). Fewnomials. Providence, RI: American Mathematical Society. Koch, C. (1999). Biophysics of computation. New York: Oxford University Press. Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 315–345). Boston: Academic Press. Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interactions in a dendritic tree: Localization, timing, and role in information processing. Proceedings of the National Academy of Sciences USA, 80, 2799–2802. Koiran, P. (1996). VC dimension in circuit complexity. In Proceedings of the 11th Annual IEEE Conference on Computational Complexity CCC'96 (pp. 81–85). Los Alamitos, CA: IEEE Computer Society Press. Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54, 190–198. Koiran, P., & Sontag, E. D. (1998). Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Applied Mathematics, 86, 63–79. Kowalczyk, A., & Ferrá, H. L. (1994). Developing higher-order networks with empirically selected units. IEEE Transactions on Neural Networks, 5, 698–711. Küpfmüller, K., & Jenik, F. (1961). Über die Nachrichtenverarbeitung in der Nervenzelle. Kybernetik, 1, 1–6. Lee, Y. C., Doolen, G., Chen, H. H., Sun, G. Z., Maxwell, T., Lee, H., & Giles, C. L. (1986). Machine learning using a higher order correlation network. Physica D, 22, 276–306. Leerink, L. R., Giles, C. L., Horne, B. G., & Jabri, M. A. (1995a). Learning with product units. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 537–544). Cambridge, MA: MIT Press. Leerink, L. R., Giles, C. L., Horne, B. G., & Jabri, M. A. (1995b). Product unit learning (Tech. Rep.
UMIACS-TR-95-80). College Park: University of Maryland. Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867. Maass, W. (1994). Neural nets with super-linear VC-dimension. Neural Computation, 6, 877–884. Maass, W. (1995a). Agnostic PAC learning of functions on analog neural nets. Neural Computation, 7, 1054–1078. Maass, W. (1995b). Vapnik-Chervonenkis dimension of neural nets. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1000–1003). Cambridge, MA: MIT Press. Maass, W. (1997). Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In M. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 211– 217). Cambridge, MA: MIT Press.
Maass, W. (1998). A simple model for neural computation with firing rates and firing correlations. Network: Computation in Neural Systems, 9, 381–397. Maass, W., & Turán, G. (1992). Lower bound methods and separation results for on-line learning models. Machine Learning, 9, 107–145. Maxwell, T., Giles, C. L., Lee, Y. C., & Chen, H. H. (1986). Nonlinear dynamics of artificial neural systems. In J. S. Denker (Ed.), Neural networks for computing. New York: American Institute of Physics. McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. Mel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex neuron. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 35–42). San Mateo, CA: Morgan Kaufmann. Mel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Computation, 4, 502–517. Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. Journal of Neurophysiology, 70, 1086–1101. Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085. Mel, B. W., & Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 474–481). San Mateo, CA: Morgan Kaufmann. Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. Journal of Neuroscience, 18, 4325–4334. Minsky, M. L., & Papert, S. A. (1988). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press. Nilsson, N. J. (1965). Learning machines. New York: McGraw-Hill. Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information.
Journal of Neuroscience, 13, 4700–4719. Omlin, C. W., & Giles, C. L. (1996a). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the Association for Computing Machinery, 43, 937–972. Omlin, C. W., & Giles, C. L. (1996b). Stable encoding of large finite-state automata in recurrent neural networks with sigmoid discriminants. Neural Computation, 8, 675–696. Perantonis, S. J., & Lisboa, P. J. G. (1992). Translation, rotation, and scale invariant pattern recognition by high-order neural networks and moment classifiers. IEEE Transactions on Neural Networks, 3, 241–251. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8, 143–195. Poggio, T. (1990). A theory of how the brain might work. In Cold Spring Harbor Symposia on Quantitative Biology (Vol. 55, pp. 899–910). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Poggio, T., & Girosi, F. (1990a). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
Poggio, T., & Girosi, F. (1990b). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982. Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252. Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 222–237. Psaltis, D., Park, C. H., & Hong, J. (1988). Higher order associative memories and their optical implementations. Neural Networks, 1, 149–163. Rebotier, T. P., & Droulez, J. (1994). Sigma vs pi properties of spiking neurons. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, & A. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 3–10). Hillsdale, NJ: Erlbaum. Redding, N. J., Kowalczyk, A., & Downs, T. (1993). Constructive higher-order network algorithm that is polynomial time. Neural Networks, 6, 997–1010. Ring, M. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 115–122). San Mateo, CA: Morgan Kaufmann. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408. Roy, A., & Mukhopadhyay, S. (1997). Iterative generation of higher-order nets in polynomial time using linear programming. IEEE Transactions on Neural Networks, 8, 402–412. Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 45–76). Cambridge, MA: MIT Press. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. 
McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press. Sakurai, A. (1993). Tighter bounds of the VC-dimension of three layer networks. In Proceedings of the World Congress on Neural Networks (Vol. 3, pp. 540–543). Hillsdale, NJ: Erlbaum. Sakurai, A. (1999). Tight bounds for the VC-dimension of piecewise polynomial networks. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 323–329). Cambridge, MA: MIT Press. Sakurai, A., & Yamasaki, M. (1992). On the capacity of n-h-s networks. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks (Vol. 2, pp. 237–240). Amsterdam: Elsevier. Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences USA, 93, 11956–11961. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229. Schläfli, L. (1901). Theorie der vielfachen Kontinuität. Zürich: Zürcher & Furrer.
Schmidt, W. A. C., & Davis, J. P. (1993). Pattern recognition properties of various feature spaces for higher order neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 795–801. Schmitt, M. (1999). On the sample complexity for nonoverlapping neural networks. Machine Learning, 37, 131–141. Schmitt, M. (2000). Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 328–334). Cambridge, MA: MIT Press. Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In J. S. Denker (Ed.), Neural networks for computing (pp. 398–403). New York: American Institute of Physics. Shawe-Taylor, J., & Anthony, M. (1991). Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, 2, 107–117. Shin, Y., & Ghosh, J. (1995). Ridge polynomial networks. IEEE Transactions on Neural Networks, 6, 610–622. Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Englewood Cliffs, NJ: Prentice Hall. Softky, W., & Koch, C. (1995). Single-cell models. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 879–884). Cambridge, MA: MIT Press. Sontag, E. (1992). Feedforward nets for interpolation and classification. Journal of Computer and System Sciences, 45, 20–48. Spirkovska, L., & Reid, M. B. (1994). Higher-order neural networks applied to 2D and 3D object recognition. Machine Learning, 15, 169–199. Srinivasan, M. V., & Bernard, G. D. (1976). A proposed mechanism for multiplication of neural signals. Biological Cybernetics, 21, 227–236. Suga, N. (1990). Cortical computational maps for auditory imaging. Neural Networks, 3, 3–21. Suga, N., Olsen, J. F., & Butman, J. A. (1990). Specialized subsystems for processing biologically important complex sounds: Cross-correlation analysis for ranging in the bat's brain.
In Cold Spring Harbor Symposia on Quantitative Biology (Vol. 55, pp. 585–597). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Tal, D., & Schwartz, E. L. (1997). Computing with the leaky integrate-and-fire neuron: Logarithmic computation and multiplication. Neural Computation, 9, 305–318. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142. Venkatesh, S. S., & Baldi, P. (1991a). Programmed interactions in higher-order neural networks: Maximal capacity. Journal of Complexity, 7, 316–337. Venkatesh, S. S., & Baldi, P. (1991b). Programmed interactions in higher-order neural networks: The outer-product algorithm. Journal of Complexity, 7, 443–479. Warren, H. E. (1968). Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133, 167–178.
Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4, 406–414. Williams, R. J. (1986). The logic of activation functions. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 423–443). Cambridge, MA: MIT Press. Yoon, H., & Oh, J.-H. (1998). Learning of higher-order perceptrons with tunable complexities. Journal of Physics A: Math. Gen., 31, 7771–7784. Received July 27, 2000; accepted April 9, 2001.
NOTE
Communicated by Eric Mjolsness
A Lagrange Multiplier and Hopfield-Type Barrier Function Method for the Traveling Salesman Problem
Chuangyin Dang
[email protected]
Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong
Lei Xu
[email protected]
Department of Computer Science and Engineering, Chinese University of Hong Kong, New Territories, Hong Kong

A Lagrange multiplier and Hopfield-type barrier function method is proposed for approximating a solution of the traveling salesman problem. The method is derived from applications of Lagrange multipliers and a Hopfield-type barrier function and attempts to produce a solution of high quality by generating a minimum point of a barrier problem for a sequence of descending values of the barrier parameter. For any given value of the barrier parameter, the method searches for a minimum point of the barrier problem in a feasible descent direction, which has the desired property that lower and upper bounds on variables are always satisfied automatically if the step length is a number between zero and one. At each iteration, the feasible descent direction is found by updating Lagrange multipliers with a globally convergent iterative procedure. For any given value of the barrier parameter, the method converges to a stationary point of the barrier problem without any condition on the objective function. Theoretical and numerical results show that the method seems more effective and efficient than the softassign algorithm.

1 Introduction
The traveling salesman problem (TSP) is an NP-hard combinatorial optimization problem with many important applications. A number of classic algorithms and heuristics have been proposed to solve it; we refer to Lawler, Lenstra, Rinnooy, and Shmoys (1985) for an excellent survey of techniques for solving the problem. Since Hopfield and Tank (1985), combinatorial optimization has become a popular topic in the literature of neural computation, and many neural computational models for combinatorial optimization have been developed. They include Aiyer, Niranjan, and Fallside (1990); van den Bout and Miller (1990);
Neural Computation 14, 303–324 (2001)
© 2001 Massachusetts Institute of Technology
304
Chuangyin Dang and Lei Xu
Durbin and Willshaw (1987); Gee, Aiyer, and Prager (1993); Gee and Prager (1994); Gold, Mjolsness, and Rangarajan (1994); Gold and Rangarajan (1996); Peterson and Soderberg (1989); Rangarajan, Gold, and Mjolsness (1996); Simic (1990); Urahama (1996); Wacholder, Han, and Mann (1989); Waugh and Westervelt (1993); Wolfe, Parry, and MacMillan (1994); Xu (1994); and Yuille and Kosowsky (1994). A systematic investigation of such neural computational models for combinatorial optimization can be found in van den Berg (1996) and Cichocki and Unbehaunen (1993). Most of these algorithms are of the deterministic annealing type, a heuristic continuation method that attempts to find the global minimum of the effective energy at high temperature and track it as the temperature decreases. There is no guarantee that the minimum at high temperature can always be tracked to the minimum at low temperature, but the experimental results are encouraging (Yuille & Kosowsky, 1994). We propose a Lagrange multiplier and Hopfield-type barrier function method for approximating a solution of the TSP. The method is derived from applications of Lagrange multipliers to handle equality constraints and a Hopfield-type barrier function to deal with lower and upper bounds on variables. The method is a deterministic annealing algorithm that attempts to produce a high-quality solution by generating a minimum point of a barrier problem for a sequence of descending values of the barrier parameter. For any given value of the barrier parameter, the method searches for a minimum point of the barrier problem in a feasible descent direction, which has the desired property that the lower and upper bounds on variables are always satisfied automatically if the step length is a number between zero and one. At each iteration, the feasible descent direction is found by updating Lagrange multipliers with a globally convergent iterative procedure.
For any given value of the barrier parameter, the method converges to a stationary point of the barrier problem without any condition on the objective function. Theoretical and numerical results show that the method seems more effective and efficient than the softassign algorithm. The rest of this paper is organized as follows. We introduce the Hopfield-type barrier function and derive some properties in section 2. We present the method in section 3. We report some numerical results in section 4. We conclude in section 5.

2 Hopfield-Type Barrier Function
The problem we consider is as follows: given $n$ cities, find a tour such that each city is visited exactly once and the total distance traveled is minimized. Let
$v_{ik} = \begin{cases} 1 & \text{if city } i \text{ is the } k\text{th city to be visited in a tour}, \\ 0 & \text{otherwise}, \end{cases}$
Traveling Salesman Problem
305
where $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, n$, and $v = (v_{11}, v_{12}, \ldots, v_{1n}, \ldots, v_{n1}, v_{n2}, \ldots, v_{nn})^\top$. In Hopfield and Tank (1985), the problem was formulated as
$\min\ \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} d_{ij} v_{ik} v_{j,k+1}$
subject to $\sum_{j=1}^{n} v_{ij} = 1$, $i = 1, 2, \ldots, n$; $\sum_{i=1}^{n} v_{ij} = 1$, $j = 1, 2, \ldots, n$; $v_{ij} \in \{0, 1\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$,   (2.1)
where $d_{ij}$ denotes the distance from city $i$ to city $j$ and $v_{j,k+1} = v_{j1}$ for $k = n$. Clearly, for any given $r \ge 0$, equation 2.1 is equivalent to
$\min\ e_0(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \sum_{k=1}^{n} d_{ij} v_{ik} v_{j,k+1} - \tfrac{1}{2} r v_{ij}^2 \right)$
subject to $\sum_{j=1}^{n} v_{ij} = 1$, $i = 1, 2, \ldots, n$; $\sum_{i=1}^{n} v_{ij} = 1$, $j = 1, 2, \ldots, n$; $v_{ij} \in \{0, 1\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.   (2.2)
The continuous relaxation of equation 2.2 yields
$\min\ e_0(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \sum_{k=1}^{n} d_{ij} v_{ik} v_{j,k+1} - \tfrac{1}{2} r v_{ij}^2 \right)$
subject to $\sum_{j=1}^{n} v_{ij} = 1$, $i = 1, 2, \ldots, n$; $\sum_{i=1}^{n} v_{ij} = 1$, $j = 1, 2, \ldots, n$; $0 \le v_{ij} \le 1$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.   (2.3)
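As a concreteness check on equations 2.2 and 2.3, $e_0(v)$ is easy to evaluate directly. A minimal sketch in Python/NumPy (the name `tsp_energy` and the toy distance matrix are illustrative, not from the paper):

```python
import numpy as np

def tsp_energy(v, D, r):
    """e0(v) = sum_{i,j} ( sum_k d_ij * v_ik * v_{j,k+1} - (r/2) * v_ij^2 ),
    with the cyclic convention v_{j,n+1} = v_{j,1}."""
    v_next = np.roll(v, -1, axis=1)            # column k now holds v_{., k+1}
    tour = np.einsum('ij,ik,jk->', D, v, v_next)
    return tour - 0.5 * r * np.sum(v ** 2)

# At a permutation matrix (a valid tour), the first term is the tour length:
D = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 3., 0.]])
v = np.eye(3)                                   # visit cities 0, 1, 2 in order
print(tsp_energy(v, D, r=0.0))                  # d01 + d12 + d20 = 1 + 3 + 2 = 6.0
```

At binary points the penalty term $-\tfrac{1}{2} r \sum v_{ij}^2$ is the constant $-rn/2$, which is why equations 2.1 and 2.2 have the same minimizers.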
When $r$ is sufficiently large, one can see that an optimal solution of equation 2.3 is an integer solution. Thus, when $r$ is sufficiently large, equation 2.3 is equivalent to equation 2.1. The term $-\tfrac{1}{2} r \sum_{i=1}^{n} \sum_{j=1}^{n} v_{ij}^2$ was introduced in Rangarajan et al. (1996) to obtain a strictly concave function $e_0(v)$ on the null space of the constraint matrix, which their softassign algorithm requires to converge to a stationary point of a barrier problem. We note that the size of $r$ affects the quality of the solution produced by a deterministic annealing algorithm, and it should be as small as possible. However, when $r$ is a small positive number that still makes equation 2.3 equivalent to equation 2.1, the softassign algorithm may fail to converge to a stationary point of the barrier problem, since $e_0(v)$ may no longer be strictly concave on the null space of the constraint matrix. Numerical tests demonstrate that this indeed happens with the softassign algorithm. Following Xu (1995), we introduce a Hopfield-type barrier term,
$d(v_{ij}) = v_{ij} \ln v_{ij} + (1 - v_{ij}) \ln(1 - v_{ij}),$   (2.4)
to incorporate 0 · xij · 1 into the objective function of equation 2.3 and obtain P P min e ( vI b ) D e0 (v ) C b niD 1 jnD 1 d ( vij ) Pn subject to i D 1, 2, . . . , n, (2.5) j D 1 vij D 1, Pn 1, 1, 2, . . . v D j D , n, i D 1 ij where b is a positive barrier parameter. The barrier term, equation 2.4, appeared rst in an energy function given by Hopeld (1984) and has been extensively used in the literature. Instead of solving equation 2.3 directly, we consider a scheme that obtains a solution of it from the solution of equation 2.5 atPthe limit P of b # 0. Let b (v ) D niD1 jnD1 d ( vij ) . Then e (vI b ) D e0 ( v) C bb (v ) . Let 8 9 P n vij D 1, i D 1, 2, . . . , n, > > jD 1 < = Pn P D v iD 1 vij D 1, j D 1, 2, . . . , n, > > : 0 · vij · 1, i D 1, 2, . . . , n, j D 1, 2, . . . , n ; and B D fv | 0 · vij · 1,
i D 1, 2, . . . , n,
j D 1, 2, . . . , ng.
Then $P$ is the feasible region of equation 2.3. Let us define $d(0) = d(1) = 0$. Since $\lim_{v_{ij}\to 0^+} d(v_{ij}) = \lim_{v_{ij}\to 1^-} d(v_{ij}) = 0$, $b(v)$ is continuous on $B$. From $b(v)$, we obtain

$$\frac{\partial b(v)}{\partial v_{ij}} = \ln v_{ij} - \ln(1 - v_{ij}) = \ln\frac{v_{ij}}{1 - v_{ij}}.$$

Then

$$\lim_{v_{ij}\to 0^+} \frac{\partial b(v)}{\partial v_{ij}} = -\infty \qquad\text{and}\qquad \lim_{v_{ij}\to 1^-} \frac{\partial b(v)}{\partial v_{ij}} = \infty.$$
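The barrier and its derivative are easy to probe numerically. The following sketch (an illustration added here, not part of the original article) evaluates $d(v)$ and $\partial b/\partial v_{ij}$ and shows the behavior just described: the barrier vanishes at the endpoints, while its derivative diverges there, which is what pushes minimizers of the barrier problem into the interior of $B$:

```python
import numpy as np

def d(v):
    # Hopfield-type barrier: d(v) = v ln v + (1 - v) ln(1 - v) on (0, 1)
    return v * np.log(v) + (1.0 - v) * np.log(1.0 - v)

def db(v):
    # its derivative: ln(v) - ln(1 - v) = ln(v / (1 - v))
    return np.log(v / (1.0 - v))

v = np.array([1e-9, 0.25, 0.5, 0.75, 1.0 - 1e-9])
print(d(v))   # ~0 at both endpoints, minimum value -ln 2 at v = 0.5
print(db(v))  # large negative near 0, zero at 0.5, large positive near 1
```

Because $\partial b/\partial v_{ij}$ blows up at the bounds while $\partial e_0/\partial v_{ij}$ stays bounded, the gradient of $e(v;\beta)$ inherits the same blow-up.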
Observe that

$$\frac{\partial e_0(v)}{\partial v_{ij}} = \sum_{k=1}^{n}\left( d_{ki}\, v_{k,j-1} + d_{ik}\, v_{k,j+1} \right) - \rho v_{ij},$$

where $v_{k,j-1} = v_{kn}$ for $j = 1$ and $v_{k,j+1} = v_{k1}$ for $j = n$. Thus, $\partial e_0(v)/\partial v_{ij}$ is bounded on $B$. From

$$\frac{\partial e(v;\beta)}{\partial v_{ij}} = \frac{\partial e_0(v)}{\partial v_{ij}} + \beta\,\frac{\partial b(v)}{\partial v_{ij}},$$

we obtain

$$\lim_{v_{ij}\to 0^+} \frac{\partial e(v;\beta)}{\partial v_{ij}} = -\infty \qquad\text{and}\qquad \lim_{v_{ij}\to 1^-} \frac{\partial e(v;\beta)}{\partial v_{ij}} = \infty.$$
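The explicit form of $e_0$ does not appear in this excerpt, but a natural reading of the stated gradient (with a symmetric distance matrix $d$ and the position index $j$ taken cyclically) is $e_0(v) = \sum_{i,j,k} d_{ik} v_{ij} v_{k,j+1} - \frac{\rho}{2}\sum_{i,j} v_{ij}^2$. Under that assumption, the gradient formula can be checked against finite differences; this is an illustrative sketch, not code from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 5, 2.0
D = rng.random((n, n))
D = (D + D.T) / 2.0                      # symmetric "distances" (an assumption)
np.fill_diagonal(D, 0.0)

def e0(v):
    # assumed energy: sum_{i,j,k} d_ik v_ij v_{k,j+1} - (rho/2) sum v^2,
    # with the position index j cyclic (column n wraps to column 1)
    vnext = np.roll(v, -1, axis=1)
    return float(np.sum(v * (D @ vnext)) - 0.5 * rho * np.sum(v * v))

def grad_e0(v):
    # gradient stated in the text:
    # sum_k (d_ki v_{k,j-1} + d_ik v_{k,j+1}) - rho v_ij
    return D.T @ np.roll(v, 1, axis=1) + D @ np.roll(v, -1, axis=1) - rho * v

v = rng.uniform(0.1, 0.9, size=(n, n))
g = grad_e0(v)
eps, i, j = 1e-6, 2, 3
vp, vm = v.copy(), v.copy()
vp[i, j] += eps
vm[i, j] -= eps
fd = (e0(vp) - e0(vm)) / (2.0 * eps)     # central difference in one coordinate
print(abs(fd - g[i, j]))                 # agrees to roundoff: formulas match
```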
Traveling Salesman Problem
Lemma 1. For any given $\beta > 0$, if $v^*$ is a local minimum point of equation 2.5, then $v^*$ is an interior point of $P$; that is, $0 < v^*_{ij} < 1$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.¹

Let

$$L(v, \lambda^r, \lambda^c) = e(v;\beta) + \sum_{i=1}^{n} \lambda^r_i \left( \sum_{j=1}^{n} v_{ij} - 1 \right) + \sum_{j=1}^{n} \lambda^c_j \left( \sum_{i=1}^{n} v_{ij} - 1 \right).$$
Lemma 1 indicates that if $v^*$ is a local minimum point of equation 2.5, then there exist $\lambda^{r*}$ and $\lambda^{c*}$ satisfying

$$\nabla_v L(v^*, \lambda^{r*}, \lambda^{c*}) = 0, \qquad \sum_{j=1}^{n} v^*_{ij} = 1,\; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} v^*_{ij} = 1,\; j = 1, \ldots, n,$$

where

$$\nabla_v L(v, \lambda^r, \lambda^c) = \left( \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{11}}, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{12}}, \ldots, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{1n}}, \ldots, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{n1}}, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{n2}}, \ldots, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{nn}} \right)^{\!\top}$$

with

$$\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} = \frac{\partial e_0(v)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j + \beta \ln\frac{v_{ij}}{1 - v_{ij}}, \qquad i = 1, \ldots, n,\; j = 1, \ldots, n.$$

Let $\beta_k$, $k = 1, 2, \ldots$, be a sequence of positive numbers satisfying $\beta_1 > \beta_2 > \cdots$ and $\lim_{k\to\infty}\beta_k = 0$. For $k = 1, 2, \ldots$, let $v(\beta_k)$ denote a global minimum point of equation 2.5 with $\beta = \beta_k$.

Theorem 1. For $k = 1, 2, \ldots$,

$$e_0(v(\beta_k)) \ge e_0(v(\beta_{k+1})),$$

and any limit point of $v(\beta_k)$, $k = 1, 2, \ldots$, is a global minimum point of equation 2.3.

¹ All the proofs of lemmas and theorems in this article can be found on-line at www.cityu.edu.hk/meem/mecdang.
This theorem indicates that a global minimum point of equation 2.3 can be obtained if we are able to generate a global minimum point of equation 2.5 for a sequence of descending values of the barrier parameter with zero limit.

Theorem 2. For $k = 1, 2, \ldots$, let $v^k$ be a local minimum point of equation 2.5 with $\beta = \beta_k$. For any limit point $v^*$ of $v^k$, $k = 1, 2, \ldots$, if there are no $\lambda^r = (\lambda^r_1, \lambda^r_2, \ldots, \lambda^r_n)^\top$ and $\lambda^c = (\lambda^c_1, \lambda^c_2, \ldots, \lambda^c_n)^\top$ satisfying

$$\frac{\partial e_0(v^*)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j = 0, \qquad i = 1, \ldots, n,\; j = 1, \ldots, n,$$

then $v^*$ is a local minimum point of equation 2.3.

This theorem indicates that at least a local minimum point of equation 2.3 can be obtained if we are able to generate a local minimum point of equation 2.5 for a sequence of descending values of the barrier parameter with zero limit.

3 The Method
Motivated by the results in the previous section, we propose in this section a method for approximating a solution of equation 2.3. The idea of the method is as follows. Choose $\beta_0$ to be a sufficiently large positive number such that $e(v;\beta_0)$ is strictly convex. Let $\beta_q$, $q = 0, 1, \ldots$, be a sequence of positive numbers satisfying $\beta_0 > \beta_1 > \cdots$ and $\lim_{q\to\infty}\beta_q = 0$. Choose $v^{*,0}$ to be the unique minimum point of equation 2.5 with $\beta = \beta_0$. For $q = 1, 2, \ldots$, starting at $v^{*,q-1}$, we search for a minimum point $v^{*,q}$ of equation 2.5 with $\beta = \beta_q$. Given any $\beta > 0$, consider the first-order necessary optimality condition for equation 2.5:
$$\nabla_v L(v, \lambda^r, \lambda^c) = 0, \qquad \sum_{j=1}^{n} v_{ij} = 1,\; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} v_{ij} = 1,\; j = 1, \ldots, n.$$

From

$$\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} = \frac{\partial e_0(v)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j + \beta \ln\frac{v_{ij}}{1 - v_{ij}} = 0,$$

we obtain

$$v_{ij} = \frac{1}{1 + \exp\!\left( \left( \frac{\partial e_0(v)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j \right) / \beta \right)}.$$

Let $r_i = \exp(\lambda^r_i/\beta)$ and $c_j = \exp(\lambda^c_j/\beta)$. Then,

$$v_{ij} = \frac{1}{1 + r_i c_j \exp\!\left( \frac{\partial e_0(v)}{\partial v_{ij}} / \beta \right)}.$$

For convenience of the following discussion, let $a_{ij}(v) = \exp\!\left( \frac{\partial e_0(v)}{\partial v_{ij}} / \beta \right)$. Then,

$$v_{ij} = \frac{1}{1 + r_i c_j a_{ij}(v)}. \qquad (3.1)$$
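Equation 3.1 is just the logistic (sigmoid) function of the scaled quantity $(\partial e_0/\partial v_{ij} + \lambda^r_i + \lambda^c_j)/\beta$, factored through $r_i$, $c_j$, and $a_{ij}$. A small sketch (illustrative values, not from the article) confirms the two forms coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 4, 0.5
g = rng.normal(size=(n, n))    # stands in for the partial derivatives of e0
lam_r = rng.normal(size=n)     # row multipliers lambda^r
lam_c = rng.normal(size=n)     # column multipliers lambda^c

# direct logistic form of v_ij
v_direct = 1.0 / (1.0 + np.exp((g + lam_r[:, None] + lam_c[None, :]) / beta))

# factored form of equation 3.1: v_ij = 1 / (1 + r_i c_j a_ij(v))
r, c, a = np.exp(lam_r / beta), np.exp(lam_c / beta), np.exp(g / beta)
v_factored = 1.0 / (1.0 + np.outer(r, c) * a)

print(np.max(np.abs(v_direct - v_factored)))  # agreement up to roundoff
```

Note that either form automatically yields $0 < v_{ij} < 1$, so the bound constraints of equation 2.5 never need to be enforced explicitly.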
Substituting equation 3.1 into $\sum_{j=1}^{n} v_{ij} = 1$, $i = 1, \ldots, n$, and $\sum_{i=1}^{n} v_{ij} = 1$, $j = 1, \ldots, n$, we obtain

$$\sum_{j=1}^{n} \frac{1}{1 + r_i c_j a_{ij}(v)} = 1, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \frac{1}{1 + r_i c_j a_{ij}(v)} = 1, \quad j = 1, \ldots, n. \qquad (3.2)$$

Based on the above notation, a conceptual algorithm was proposed in Xu (1995) for approximating a solution of equation 2.3, as follows: Fix $r$ and $c$, and use equation 3.1 to obtain $v$. Fix $v$, and solve equation 3.2 for $r$ and $c$. Let

$$h_{ij}(v, r, c) = \frac{1}{1 + r_i c_j a_{ij}(v)}$$

and

$$h(v, r, c) = \left( h_{11}(v, r, c), h_{12}(v, r, c), \ldots, h_{1n}(v, r, c), \ldots, h_{n1}(v, r, c), h_{n2}(v, r, c), \ldots, h_{nn}(v, r, c) \right)^{\!\top}.$$
If $v$ is an interior point of $B$, the following lemma shows that $h(v, r, c) - v$ is a descent direction of $L(v, \lambda^r, \lambda^c)$.

Lemma 2. Assume $0 < v_{ij} < 1$, $i = 1, \ldots, n$, $j = 1, \ldots, n$. Then:

1. $\partial L(v, \lambda^r, \lambda^c)/\partial v_{ij} > 0$ if $h_{ij}(v, r, c) - v_{ij} < 0$.

2. $\partial L(v, \lambda^r, \lambda^c)/\partial v_{ij} < 0$ if $h_{ij}(v, r, c) - v_{ij} > 0$.

3. $\partial L(v, \lambda^r, \lambda^c)/\partial v_{ij} = 0$ if $h_{ij}(v, r, c) - v_{ij} = 0$.

4. $(h(v, r, c) - v)^\top \nabla_v L(v, \lambda^r, \lambda^c) < 0$ if $h(v, r, c) - v \ne 0$.

5. $(h(v, r, c) - v)^\top \nabla_v e(v;\beta) < 0$ if $h(v, r, c) - v \ne 0$ and $\sum_{k=1}^{n}(h_{ik}(v, r, c) - v_{ik}) = \sum_{k=1}^{n}(h_{kj}(v, r, c) - v_{kj}) = 0$, $i = 1, \ldots, n$, $j = 1, \ldots, n$.
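Before turning to the proof, items 1 through 3 can be spot-checked numerically with arbitrary positive multipliers and an arbitrary interior point (an illustration, not the article's code):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 6, 0.7
g = rng.normal(size=(n, n))                  # stands in for the gradient of e0
lam_r, lam_c = rng.normal(size=n), rng.normal(size=n)
v = rng.uniform(0.05, 0.95, size=(n, n))     # an interior point of B

a = np.exp(g / beta)
r, c = np.exp(lam_r / beta), np.exp(lam_c / beta)
h = 1.0 / (1.0 + np.outer(r, c) * a)         # the map h(v, r, c)
dL = g + lam_r[:, None] + lam_c[None, :] + beta * np.log(v / (1.0 - v))

# dL/dv_ij and h_ij - v_ij have strictly opposite signs in every entry
print(bool(np.all(dL * (h - v) < 0.0)))  # True
```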
Proof. We only need to show that $\partial L(v, \lambda^r, \lambda^c)/\partial v_{ij} > 0$ if $h_{ij}(v, r, c) - v_{ij} < 0$; the rest can be obtained similarly or in a straightforward manner. From

$$h_{ij}(v, r, c) - v_{ij} = \frac{1}{1 + r_i c_j a_{ij}(v)} - v_{ij} < 0,$$

we obtain

$$1 < r_i c_j a_{ij}(v)\,\frac{v_{ij}}{1 - v_{ij}}. \qquad (3.3)$$

Applying the natural logarithm to both sides of equation 3.3, we get

$$0 < \ln\!\left( r_i c_j a_{ij}(v)\,\frac{v_{ij}}{1 - v_{ij}} \right) = \ln a_{ij}(v) + \ln r_i + \ln c_j + \ln\frac{v_{ij}}{1 - v_{ij}} = \frac{1}{\beta}\frac{\partial e_0(v)}{\partial v_{ij}} + \frac{1}{\beta}\lambda^r_i + \frac{1}{\beta}\lambda^c_j + \ln\frac{v_{ij}}{1 - v_{ij}} = \frac{1}{\beta}\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}}.$$

Thus, $\partial L(v, \lambda^r, \lambda^c)/\partial v_{ij} > 0$.
The lemma follows. Since $0 < h_{ij}(v, r, c) < 1$, the descent direction $h(v, r, c) - v$ has the desirable property that any point generated along it automatically satisfies the lower and upper bounds, provided $v \in B$ and the step length is a number between zero and one. For any given point $v$, we use $(r(v), c(v))$ to denote a positive solution of equation 3.2. Let $v$ be an interior point of $P$. In order for $h(v, r, c) - v$ to become a feasible descent direction of equation 2.5, we need to compute a positive solution $(r(v), c(v))$ of equation 3.2. Let

$$f(r, c) = \frac{1}{2}\left( \sum_{i=1}^{n}\left( \sum_{j=1}^{n} \frac{1}{1 + r_i c_j a_{ij}(v)} - 1 \right)^{\!2} + \sum_{j=1}^{n}\left( \sum_{i=1}^{n} \frac{1}{1 + r_i c_j a_{ij}(v)} - 1 \right)^{\!2} \right).$$
Observe that $f(r, c)$ equals zero only at a solution of equation 3.2. For $i = 1, \ldots, n$, let

$$x_i(r, c) = r_i \left( \sum_{j=1}^{n} \frac{1}{1 + r_i c_j a_{ij}(v)} - 1 \right),$$

and for $j = 1, \ldots, n$, let

$$y_j(r, c) = c_j \left( \sum_{i=1}^{n} \frac{1}{1 + r_i c_j a_{ij}(v)} - 1 \right).$$

Let $x(r, c) = (x_1(r, c), x_2(r, c), \ldots, x_n(r, c))^\top$ and $y(r, c) = (y_1(r, c), y_2(r, c), \ldots, y_n(r, c))^\top$. One can easily prove that $(x(r, c)^\top, y(r, c)^\top)^\top$ is a descent direction of $f(r, c)$. For any given $v$, based on this descent direction, the following iterative procedure is proposed for computing a positive solution $(r(v), c(v))$ of equation 3.2. Take $(r^0, c^0)$ to be an arbitrary positive vector, and for $k = 0, 1, \ldots$, let

$$r^{k+1} = r^k + \mu_k x(r^k, c^k), \qquad c^{k+1} = c^k + \mu_k y(r^k, c^k), \qquad (3.4)$$

where $\mu_k$ is a number in $[0, 1]$ satisfying

$$f(r^{k+1}, c^{k+1}) = \min_{\mu\in[0,1]} f(r^k + \mu x(r^k, c^k),\; c^k + \mu y(r^k, c^k)).$$

Observe that $(r^k, c^k) > 0$, $k = 0, 1, \ldots$. There are many ways to determine $\mu_k$ (Minoux, 1986). For example, one can simply choose $\mu_k$ to be any number in $(0, 1]$ satisfying $\sum_{l=0}^{k}\mu_l \to \infty$ and $\mu_k \to 0$ as $k \to \infty$. We have found in our numerical tests that when $\mu_k$ is any fixed number in $(0, 1]$, the iterative procedure, equation 3.4, converges to a positive solution of equation 3.2.

Theorem 3. For any given $v$, every limit point of $(r^k, c^k)$, $k = 0, 1, \ldots$, generated by the iterative procedure, equation 3.4, is a positive solution of equation 3.2.
Based on the feasible descent direction, h (v, r ( v) , c (v ) ) ¡v, and the iterative procedure, equation 3.4, we have developed a method for approximating a solution of equation 2.3, which can be stated as follows: Step 0: Let 2 > 0 be a given tolerance. Let b 0 be a sufciently large, positive
number satisfying that e (vI b 0 ) is convex. Choose an arbitrary interior point vN 2 B, and two arbitrary positive vectors, r 0 and c0. Take an arbitrary positive numberg 2 (0, 1) (in general,g should be close to one). Given v D N use equation 3.4 to obtain a positive solution (r( vN ), c (vN ) ) of equation 3.2. v, Let r 0 D r ( vN ) and c0 D c ( vN ). Let 0 0 0 0 0 0 )> v 0 D ( v11 , v12 , . . . , v1n , . . . , vn1 , vn2 , . . . , vnn
with vij0 D
1 , 1 C ri ( vN ) cj (vN ) aij (vN )
where i D 1, 2, . . . , n, j D 1, 2, . . . , n. Let q D 0 and k D 0, and go to step 1.
Step 1: Given v D v k , use equation 3.4 to obtain a positive solution (r(v k ) ,
c (v k ) ) of equation 3.2. Let r 0 D r ( v k ) and c0 D c ( v k ) . Go to step 2.
Step 2: Let
h ( vk , r ( v k ) , c ( v k ) ) D ( h11 ( v k , r ( v k ) , c ( v k ) ) , h12 ( v k , r ( v k ) , c ( v k ) ) , . . . ,
h1n ( vk , r (v k ) , c (v k ) ) , . . . , hn1 ( v k, r ( vk ) , c ( vk ) ) , hn2 ( vk , r ( v k ) , c ( v k ) ) , . . . , hnn (v k, r ( vk ) , c ( vk ) ) ) >
with hij ( v k, r ( vk ) , c (v k ) ) D
1 1
C ri ( vk ) cj ( v k )aij ( v k )
,
where i D 1, 2, . . . , n, j D 1, 2, . . . , n. If kh(vk, r ( v k ) , c ( v k ) ) ¡ v kk < 2 , do as follows: If bq is sufciently small, the method terminates. Otherwise, let v¤,q D v k, v 0 D v k, bq C 1 D gb q , q D q C 1, and k D 0, and go to step 1. If kh(v k, r (v k ) , c (v k ) ) ¡ v kk ¸ 2 , do as follows: Compute v k C 1 D v k C h k ( h ( vk , r ( v k ) , c ( v k ) ) ¡ vk ) , where hk is a number in [0, 1] satisfying e (v kC 1 I bq ) D min e ( vk C h ( h (v k, r ( v k ) , c ( vk ) ) ¡ v k ) I b q ) . h2[0,1]
Let k D k C 1, and go to step 1.
(3.5)
Traveling Salesman Problem
313
Note that an exact positive solution $(r(v^k), c(v^k))$ of equation 3.2 for $v = v^k$ and an exact solution of $\min_{\eta\in[0,1]} e(v^k + \eta(h(v^k, r(v^k), c(v^k)) - v^k); \beta_q)$ are not required in the implementation of the method; approximate solutions will do. There are many ways to determine $\eta_k$ (Minoux, 1986). For example, one can simply choose $\eta_k$ to be any number in $(0, 1]$ satisfying $\sum_{l=0}^{k}\eta_l \to \infty$ and $\eta_k \to 0$ as $k \to \infty$. The method is insensitive to the starting point since $e(v;\beta_0)$ is convex over $B$.

Theorem 4. For $\beta = \beta_q$, every limit point of $v^k$, $k = 0, 1, \ldots$, generated by equation 3.5 is a stationary point of equation 2.5.

Although it is difficult to prove that for any given $\beta > 0$ a limit point of $v^k$, $k = 0, 1, \ldots$, generated by equation 3.5 is at least a local minimum point of equation 2.5, in general it is indeed one. Theorem 2 implies that every limit point of $v^{*,q}$, $q = 0, 1, \ldots$, is at least a local minimum point of equation 2.3 if $v^{*,q}$ is a minimum point of equation 2.5 with $\beta = \beta_q$. For $\beta = \beta_q$, our method can be proved to converge to a stationary point of equation 2.5 for any given $\rho$; however, the softassign algorithm can be proved to converge to a stationary point of equation 2.5 only if $\rho$ is sufficiently large so that $e_0(v)$ is strictly concave on the null space of the constraint matrix (Rangarajan, Yuille, & Mjolsness, 1999). Numerical tests also show that the softassign algorithm does not converge to a stationary point of equation 2.5 if this condition is not satisfied. Thus, for the softassign algorithm to converge, one has to determine the size of $\rho$ by estimating the maximum eigenvalue of the matrix of the objective function of equation 2.1, which requires extra computational work. As we pointed out, the size of $\rho$ affects the quality of a solution generated by a deterministic annealing algorithm, and it should be as small as possible. Since our method converges for any $\rho$, one can start with a small positive $\rho$ and then increase $\rho$ if the solution generated by the method is not a near-integer solution. In this respect, our method is better than the softassign algorithm. Numerical results support this argument.

4 Numerical Results
The method has been used to approximate solutions of a number of TSP instances and succeeds in finding an optimal or near-optimal tour for each of them. In our implementation of the method:

1. $\epsilon = 0.01$ and $\beta_0 = 200$.

2. We take $r^0 = (r^0_1, r^0_2, \ldots, r^0_n)^\top$ and $c^0 = (c^0_1, c^0_2, \ldots, c^0_n)^\top$ to be two random vectors satisfying $0 < r^0_i < 1$ and $0 < c^0_i < 1$, $i = 1, 2, \ldots, n$.
3. $\mu_k = 0.95$, and for any given $v$, the iterative procedure, equation 3.4, terminates as soon as $\sqrt{f(r^k, c^k)} < 0.001$.

4. We replace $e(v;\beta)$ with $L(v, \lambda^r, \lambda^c)$ in the method, since $(r(v^k), c(v^k))$ is an approximate solution of equation 3.2.

5. $\eta_k$ is determined with the following Armijo-type line search: $\eta_k = \xi^{m_k}$, with $m_k$ being the smallest nonnegative integer satisfying

$$L\!\left( v^k + \xi^{m_k}\left( h(v^k, r(v^k), c(v^k)) - v^k \right), \lambda^{r,k}, \lambda^{c,k} \right) \le L(v^k, \lambda^{r,k}, \lambda^{c,k}) + \xi^{m_k}\zeta\left( h(v^k, r(v^k), c(v^k)) - v^k \right)^{\!\top} \nabla_v L(v^k, \lambda^{r,k}, \lambda^{c,k}),$$

where $\xi$ and $\zeta$ are any two numbers in $(0, 1)$ (we set $\xi = 0.6$ and $\zeta = 0.8$), $\lambda^{r,k} = \beta_q(\ln r_1(v^k), \ln r_2(v^k), \ldots, \ln r_n(v^k))^\top$, and $\lambda^{c,k} = \beta_q(\ln c_1(v^k), \ln c_2(v^k), \ldots, \ln c_n(v^k))^\top$.

The method terminates as soon as $\beta_q < 1$. To produce a solution of higher quality, the size of $\rho$ should be as small as possible. However, a small $\rho$ may lead to a fractional solution $v^{*,q}$. To make sure that an integer solution will be generated, we continue with the following procedure:

Step 0: Let $\beta = 1$, $v^0 = v^{*,q}$, and $k = 0$. Go to step 1.
Step 1: Let $v^* = (v^*_{11}, v^*_{12}, \ldots, v^*_{1n}, \ldots, v^*_{n1}, v^*_{n2}, \ldots, v^*_{nn})^\top$ with

$$v^*_{ij} = \begin{cases} 1 & \text{if } v^k_{ij} \ge 0.9, \\ 0 & \text{if } v^k_{ij} < 0.9, \end{cases}$$

$i = 1, \ldots, n$, $j = 1, \ldots, n$. If $v^* \in P$, the procedure terminates. Otherwise, let $\rho = \rho + 2$, and go to step 2.

Step 2: Given $v = v^k$, use equation 3.4 to obtain a positive solution $(r(v^k), c(v^k))$ of equation 3.2. Let $r^0 = r(v^k)$, $c^0 = c(v^k)$, $\lambda^{r,k} = (\ln r_1(v^k), \ln r_2(v^k), \ldots, \ln r_n(v^k))^\top$, and $\lambda^{c,k} = (\ln c_1(v^k), \ln c_2(v^k), \ldots, \ln c_n(v^k))^\top$. Go to step 3.

Step 3: Let

$$h(v^k, r(v^k), c(v^k)) = \left( h_{11}(v^k, r(v^k), c(v^k)), h_{12}(v^k, r(v^k), c(v^k)), \ldots, h_{nn}(v^k, r(v^k), c(v^k)) \right)^{\!\top}$$

with

$$h_{ij}(v^k, r(v^k), c(v^k)) = \frac{1}{1 + r_i(v^k)\, c_j(v^k)\, a_{ij}(v^k)},$$

where $i = 1, \ldots, n$, $j = 1, \ldots, n$. If $\|h(v^k, r(v^k), c(v^k)) - v^k\| < \epsilon$, let $v^0 = v^k$ and $k = 0$, and go to step 1. Otherwise, compute $v^{k+1} = v^k + \eta_k(h(v^k, r(v^k), c(v^k)) - v^k)$, where $\eta_k$ is a number in $[0, 1]$ satisfying

$$L(v^{k+1}, \lambda^{r,k}, \lambda^{c,k}) = \min_{\eta\in[0,1]} L\!\left( v^k + \eta\left( h(v^k, r(v^k), c(v^k)) - v^k \right), \lambda^{r,k}, \lambda^{c,k} \right).$$

Let $k = k + 1$, and go to step 2.
Let k D k C 1, and go to step 2. The method is programmed in MATLAB. To compare the method with the softassign algorithm proposed in Gold et al. (1994) and Rangarajan et al. (1996, 1999) and the softassign algorithm modied by introducing line search, the softassign algorithm and its modied version are also programmed in MATLAB. All our numerical tests are done on a PC computer. In the presentations of numerical results, DM stands for our method, SA the softassign algorithm, MSA the modied version of the softassign algorithm, CT the computation time in seconds, OPT the length of an optimal tour, OBJ the length of a tour generated by an algorithm, OBJD the length of the tour generated by our method, OBJSA the length of the tour generated by the ¡OPT softassign algorithm or its modied version, and RE D OBJOPT . Numerical results are as follows. These ten TSP instances are from a well-known web site, TSPLIB. We have used the method, the softassign algorithm, and the modied softassign algorithm to approximate solutions of these TSP instances. Numerical results are presented in Figures 1, 2, 3, and 4 and Table 1, where the softassign algorithm fails to converge when r D 30. Example 1.
Example 2. These TSP instances have 100 cities and are generated randomly. Every city is a point in a square with integer coordinates $(x, y)$
Figure 1: Relative error to optimal tour. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Figure 2: Computation time for different algorithms. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Figure 3: Relative error to optimal tour. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Figure 4: Computation time for different algorithms. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Table 1: Numerical Results for TSP Instances from TSPLIB. (For each instance (bays29, att48, eil51, berlin52, st70, eil76, pr76, rd100, eil101, lin105) and for $\gamma = 0.9$ and $\gamma = 0.95$, the table reports the computation time CT in seconds, the tour length OBJ, and the relative error RE in percent for DM with $\rho = 80$, DM with $\rho = 30$, SA with $\rho = 80$, and MSA with $\rho = 30$.)
satisfying $0 \le x \le 100$ and $0 \le y \le 100$. We have used the method, the softassign algorithm, and the modified softassign algorithm to approximate solutions of a number of such TSP instances. Numerical results are presented in Table 2, where the softassign algorithm fails to converge when $\rho = 30$. From these numerical results, one can see that our method seems more effective and efficient than the softassign algorithm. Comparing our method with the softassign algorithm modified by introducing line search, one finds that our method is significantly superior to the modified softassign algorithm in computation time, although the quality of the solutions generated by our method is on average only slightly better than that of the modified softassign algorithm. The reason our method is faster than the softassign algorithm and its modified version lies in the procedures for updating the Lagrange multipliers: our procedure is much more efficient than the Sinkhorn approach adopted in the softassign algorithm. Although our method has advantages over the softassign algorithm and its modified version, it still may not compete with the elastic net and non-neural algorithms for the TSP. The idea presented here for constructing a procedure to update Lagrange multipliers can also be applied to solving more complicated problems.

5 Conclusion
We have developed a Lagrange multiplier and Hopfield-type barrier function method for approximating a solution of the TSP, and some theoretical results have been derived. For any given barrier parameter, we have proved that the method converges to a stationary point of equation 2.5 without any condition on the objective function, which is stronger than the convergence result for the softassign algorithm. The numerical results show that the method seems more effective and efficient than the softassign algorithm. The method could be improved further with a faster iterative procedure for updating the Lagrange multipliers when obtaining a feasible descent direction.

Acknowledgments
We thank the anonymous referees for their constructive comments and remarks, which have significantly improved the quality of this article. The preliminary version of this article was completed while C.D. was on leave at the Chinese University of Hong Kong from 1997 to 1998. The work was supported by SRG 7001061 of CityU, CUHK Direct Grant 220500680, Ho Sin-Hang Education Endowment Fund HSH 95/02, and the Research Fellow and Research Associate Scheme of the CUHK Research Committee.
Table 2: Numerical Results for TSP Instances Generated Randomly. (For ten randomly generated 100-city instances and for $\gamma = 0.9$ and $\gamma = 0.95$, the table reports the computation time CT, the tour length OBJ, and the ratio OBJD/OBJSA, for SA with $\rho = 80$, DM with $\rho = 80$, DM with $\rho = 30$, and MSA with $\rho = 30$.)
References

Aiyer, S., Niranjan, M., & Fallside, F. (1990). A theoretical investigation into the performance of the Hopfield model. IEEE Transactions on Neural Networks, 1, 204–215.

Cichocki, A., & Unbehaunen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.

Durbin, R., & Willshaw, D. (1987). An analogue approach to the traveling salesman problem using an elastic network method. Nature, 326, 689–691.

Gee, A., Aiyer, S., & Prager, R. (1993). An analytical framework for optimizing neural networks. Neural Networks, 6, 79–97.

Gee, A., & Prager, R. (1994). Polyhedral combinatorics and neural networks. Neural Computation, 6, 161–180.

Gold, S., Mjolsness, E., & Rangarajan, A. (1994). Clustering with a domain-specific distance measure. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 96–103). San Mateo, CA: Morgan Kaufmann.

Gold, S., & Rangarajan, A. (1996). Softassign versus softmax: Benchmarking in combinatorial optimization. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 626–632). Cambridge, MA: MIT Press.

Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences of the USA, 81, 3088–3092.

Hopfield, J., & Tank, D. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.

Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, D. B. (1985). The traveling salesman problem. New York: Wiley.

Minoux, M. (1986). Mathematical programming: Theory and algorithms. New York: Wiley.

Peterson, C., & Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3–22.

Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8, 1041–1060.

Rangarajan, A., Yuille, A., & Mjolsness, E. (1999). Convergence properties of the softassign quadratic assignment algorithm. Neural Computation, 11, 1455–1474.

Simic, P. (1990). Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Networks, 1, 89–103.

Urahama, K. (1996). Gradient projection network: Analog solver for linearly constrained nonlinear programming. Neural Computation, 6, 1061–1073.

van den Berg, J. (1996). Neural relaxation dynamics. Unpublished doctoral dissertation, Erasmus University of Rotterdam, Rotterdam, Netherlands.

van den Bout, D., & Miller, T., III (1990). Graph partitioning using annealed networks. IEEE Transactions on Neural Networks, 1, 192–203.
Wacholder, E., Han, J., & Mann, R. (1989). A neural network algorithm for the multiple traveling salesman problem. Biological Cybernetics, 61, 11–19.

Waugh, F., & Westervelt, R. (1993). Analog neural networks with local competition: I. Dynamics and stability. Physical Review E, 47, 4524–4536.

Wolfe, W., Parry, M., & MacMillan, J. (1994). Hopfield-style neural networks and the TSP. In Proceedings of the IEEE International Conference on Neural Networks, 7 (pp. 4577–4582). Piscataway, NJ: IEEE Press.

Xu, L. (1994). Combinatorial optimization neural nets based on a hybrid of Lagrange and transformation approaches. In Proceedings of the World Congress on Neural Networks (pp. 399–404). San Diego, CA.

Xu, L. (1995). On the hybrid LT combinatorial optimization: New U-shape barrier, sigmoid activation, least leaking energy and maximum entropy. In Proceedings of the International Conference on Neural Information Processing (ICONIP'95) (pp. 309–312). Beijing: Publishing House of Electronics Industry.

Yuille, A., & Kosowsky, J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.
Received March 29, 1999; accepted April 27, 2001.
NOTE
Communicated by Jonathan Victor
The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis

Emery N. Brown
[email protected]
Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A., and Division of Health Sciences and Technology, Harvard Medical School / Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Riccardo Barbieri
[email protected]
Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A.

Valérie Ventura
[email protected]
Robert E. Kass
[email protected]
Department of Statistics, Carnegie Mellon University, Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.

Loren M. Frank
[email protected]
Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A.

Measuring agreement between a statistical model and a spike train data series, that is, evaluating goodness of fit, is crucial for establishing the model's validity prior to using it to make inferences about a particular neural system. Assessing goodness of fit is a challenging problem for point process neural spike train models, especially for histogram-based models such as peristimulus time histograms (PSTH) and rate functions estimated by spike train smoothing. The time-rescaling theorem is a well-known result in probability theory, which states that any point process with an integrable conditional intensity function may be transformed into a Poisson process with unit rate. We describe how the theorem may be used to develop goodness-of-fit tests for both parametric and histogram-based point process models of neural spike trains. We apply these tests in two examples: a comparison of PSTH, inhomogeneous Poisson, and inhomogeneous Markov interval models of neural spike trains from the supplementary eye field of a macaque monkey and a comparison of temporal and spatial smoothers, inhomogeneous Poisson, inhomogeneous gamma, and inhomogeneous inverse gaussian models of rat hippocampal place cell spiking activity. To help make the logic behind the time-rescaling theorem more accessible to researchers in neuroscience, we present a proof using only elementary probability theory arguments. We also show how the theorem may be used to simulate a general point process model of a spike train. Our paradigm makes it possible to compare parametric and histogram-based neural spike train models directly. These results suggest that the time-rescaling theorem can be a valuable tool for neural spike train data analysis.

Neural Computation 14, 325–346 (2001) © 2001 Massachusetts Institute of Technology

1 Introduction
The development of statistical models that accurately describe the stochastic structure of neural spike trains is a growing area of quantitative research in neuroscience. Evaluating model goodness of fit, that is, measuring quantitatively the agreement between a proposed model and a spike train data series, is crucial for establishing the model's validity prior to using it to make inferences about the neural system being studied. Assessing goodness of fit for point process neural spike train models is a more challenging problem than for models of continuous-valued processes. This is because typical distance discrepancy measures applied in continuous data analyses, such as the average sum of squared deviations between recorded data values and estimated values from the model, cannot be directly computed for point process data. Goodness-of-fit assessments are even more challenging for histogram-based models, such as peristimulus time histograms (PSTH) and rate functions estimated by spike train smoothing, because the probability assumptions needed to evaluate model properties are often implicit.

Berman (1983) and Ogata (1988) developed transformations that, under a given model, convert point processes like spike trains into continuous measures in order to assess model goodness of fit. One of the theoretical results used to construct these transformations is the time-rescaling theorem. A form of the time-rescaling theorem is well known in elementary probability theory. It states that any inhomogeneous Poisson process may be rescaled or transformed into a homogeneous Poisson process with a unit rate (Taylor & Karlin, 1994). The inverse transformation is a standard method for simulating an inhomogeneous Poisson process from a constant rate (homogeneous) Poisson process. Meyer (1969) and Papangelou (1972) established the general time-rescaling theorem, which states that any point process with an integrable rate function may be rescaled into a Poisson process with a unit rate.
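A minimal numerical illustration of this elementary form (added here for concreteness; the rate function and all parameter values are arbitrary, not from the article's examples): simulate an inhomogeneous Poisson process by thinning, rescale the spike times through the cumulative intensity, and check that the rescaled interevent intervals behave like unit-mean exponentials.

```python
import numpy as np

rng = np.random.default_rng(5)

lam = lambda t: 2.0 + 1.5 * np.sin(t)               # an arbitrary rate function
Lam = lambda t: 2.0 * t + 1.5 * (1.0 - np.cos(t))   # its integral from 0 to t

# simulate the inhomogeneous Poisson process on (0, T] by thinning
T, lam_max = 2000.0, 3.5
cand = np.cumsum(rng.exponential(1.0 / lam_max, size=int(3 * lam_max * T)))
cand = cand[cand <= T]
spikes = cand[rng.random(cand.size) < lam(cand) / lam_max]

# time-rescaling: tau_k = Lam(u_k); the rescaled process is unit-rate Poisson,
# so the intervals diff(tau) should look like draws from Exp(1)
tau = Lam(spikes)
isi = np.diff(tau)
print(isi.mean(), isi.var())  # both near 1
```

Running the inverse transformation, that is, mapping unit-rate Poisson event times through the inverse of the cumulative intensity, is the standard simulation method mentioned above.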
Berman and Ogata derived their transformations by applying the general form of the theorem. While the more general time-rescaling theorem is well known among researchers in point process theory (Brémaud, 1981; Jacobsen, 1982; Daley & Vere-Jones, 1988; Ogata, 1988; Karr, 1991), the
The Time-Rescaling Theorem
327
theorem is less familiar to neuroscience researchers. The technical nature of the proof, which relies on the martingale representation of a point process, may have prevented its significance from being more broadly appreciated. The time-rescaling theorem has important theoretical and practical implications for the application of point process models in neural spike train data analysis. To help make this result more accessible to researchers in neuroscience, we present a proof that uses only elementary probability theory arguments. We describe how the theorem may be used to develop goodness-of-fit tests for both parametric and histogram-based point process models of neural spike trains. We apply these tests in two examples: a comparison of PSTH, inhomogeneous Poisson, and inhomogeneous Markov interval models of neural spike trains from the supplementary eye field of a macaque monkey and a comparison of temporal and spatial smoothers, inhomogeneous Poisson, inhomogeneous gamma, and inhomogeneous inverse gaussian models of rat hippocampal place cell spiking activity. We also demonstrate how the time-rescaling theorem may be used to simulate a general point process.

2 Theory

2.1 The Conditional Intensity Function and the Joint Probability Density of the Spike Train. Define an interval (0, T], and let 0 < u_1 < u_2 < \cdots < u_{n-1} < u_n \le T be a set of event (spike) times from a point process.
For t \in (0, T], let N(t) be the sample path of the associated counting process. The sample path is a right-continuous function that jumps 1 at the event times and is constant otherwise (Snyder & Miller, 1991). In this way, N(t) counts the number and location of spikes in the interval (0, t]. Therefore, it contains all the information in the sequence of events or spike times. For t \in (0, T], we define the conditional or stochastic intensity function as

\lambda(t \mid H_t) = \lim_{\Delta t \to 0} \frac{\Pr(N(t + \Delta t) - N(t) = 1 \mid H_t)}{\Delta t},  (2.1)

where H_t = \{0 < u_1 < u_2 < \cdots < u_{N(t)} < t\} is the history of the process up to time t and u_{N(t)} is the time of the last spike prior to t. If the point process is an inhomogeneous Poisson process, then \lambda(t \mid H_t) = \lambda(t) is simply the Poisson rate function. Otherwise, it depends on the history of the process. Hence, the conditional intensity generalizes the definition of the Poisson rate. It is well known that \lambda(t \mid H_t) can be defined in terms of the event (spike) time probability density, f(t \mid H_t), as

\lambda(t \mid H_t) = \frac{f(t \mid H_t)}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du},  (2.2)

for t > u_{N(t)} (Daley & Vere-Jones, 1988; Barbieri, Quirk, Frank, Wilson, & Brown, 2001). We may gain insight into equation 2.2 and the meaning of the
328
E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank
conditional intensity function by computing explicitly the probability that, given H_t, a spike u_k occurs in [t, t + \Delta t), where k = N(t) + 1. To do this, we note that the events \{N(t + \Delta t) - N(t) = 1 \mid H_t\} and \{u_k \in [t, t + \Delta t) \mid H_t\} are equivalent and that therefore,

\Pr(N(t + \Delta t) - N(t) = 1 \mid H_t) = \Pr(u_k \in [t, t + \Delta t) \mid u_k > t, H_t)
  = \frac{\Pr(u_k \in [t, t + \Delta t) \mid H_t)}{\Pr(u_k > t \mid H_t)}
  = \frac{\int_t^{t + \Delta t} f(u \mid H_t)\, du}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du}
  \approx \frac{f(t \mid H_t)\, \Delta t}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du}
  = \lambda(t \mid H_t)\, \Delta t.  (2.3)

Dividing by \Delta t and taking the limit gives

\lim_{\Delta t \to 0} \frac{\Pr(u_k \in [t, t + \Delta t) \mid H_t)}{\Delta t} = \frac{f(t \mid H_t)}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du} = \lambda(t \mid H_t),  (2.4)
which is equation 2.1. Therefore, \lambda(t \mid H_t)\, \Delta t is the probability of a spike in [t, t + \Delta t) when there is history dependence in the spike train. In survival analysis, the conditional intensity is termed the hazard function because in this case, \lambda(t \mid H_t)\, \Delta t measures the probability of a failure or death in [t, t + \Delta t) given that the process has survived up to time t (Kalbfleisch & Prentice, 1980). Because we would like to apply the time-rescaling theorem to spike train data series, we require the joint probability density of exactly n event times in (0, T]. This joint probability density is (Daley & Vere-Jones, 1988; Barbieri, Quirk, et al., 2001)

f(u_1, u_2, \ldots, u_n \cap N(T) = n)
  = f(u_1, u_2, \ldots, u_n \cap u_{n+1} > T)
  = f(u_1, u_2, \ldots, u_n \cap N(u_n) = n) \Pr(u_{n+1} > T \mid u_1, u_2, \ldots, u_n)
  = \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k}) \exp\left\{ -\int_{u_{k-1}}^{u_k} \lambda(u \mid H_u)\, du \right\} \cdot \exp\left\{ -\int_{u_n}^{T} \lambda(u \mid H_u)\, du \right\},  (2.5)
where

f(u_1, u_2, \ldots, u_n \cap N(u_n) = n) = \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k}) \exp\left\{ -\int_{u_{k-1}}^{u_k} \lambda(u \mid H_u)\, du \right\}  (2.6)

\Pr(u_{n+1} > T \mid u_1, u_2, \ldots, u_n) = \exp\left\{ -\int_{u_n}^{T} \lambda(u \mid H_u)\, du \right\},  (2.7)

and u_0 = 0. Equation 2.6 is the joint probability density of exactly n events in (0, u_n], whereas equation 2.7 is the probability that the (n + 1)st event occurs after T. The conditional intensity function provides a succinct way to represent the joint probability density of the spike times. We can now state and prove the time-rescaling theorem.

Time-Rescaling Theorem. Let 0 < u_1 < u_2 < \cdots < u_n < T be a realization from a point process with a conditional intensity function \lambda(t \mid H_t) satisfying 0 < \lambda(t \mid H_t) for all t \in (0, T]. Define the transformation

\Lambda(u_k) = \int_0^{u_k} \lambda(u \mid H_u)\, du,  (2.8)

for k = 1, \ldots, n, and assume \Lambda(t) < \infty with probability one for all t \in (0, T]. Then the \Lambda(u_k)'s are a Poisson process with unit rate.

Proof. Let \tau_k = \Lambda(u_k) - \Lambda(u_{k-1}) for k = 1, \ldots, n, and set \tau_T = \int_{u_n}^{T} \lambda(u \mid H_u)\, du. To establish the result, it suffices to show that the \tau_k's are independent and identically distributed exponential random variables with mean one. Because the \tau_k transformation is one-to-one and \tau_{n+1} > \tau_T if and only if u_{n+1} > T, the joint probability density of the \tau_k's is

f(\tau_1, \tau_2, \ldots, \tau_n \cap \tau_{n+1} > \tau_T) = f(\tau_1, \ldots, \tau_n) \Pr(\tau_{n+1} > \tau_T \mid \tau_1, \ldots, \tau_n).  (2.9)

We evaluate each of the two terms on the right side of equation 2.9. The following two events are equivalent:

\{\tau_{n+1} > \tau_T \mid \tau_1, \ldots, \tau_n\} = \{u_{n+1} > T \mid u_1, u_2, \ldots, u_n\}.  (2.10)

Hence

\Pr(\tau_{n+1} > \tau_T \mid \tau_1, \tau_2, \ldots, \tau_n) = \Pr(u_{n+1} > T \mid u_1, u_2, \ldots, u_n) = \exp\left\{ -\int_{u_n}^{T} \lambda(u \mid H_{u_n})\, du \right\} = \exp\{-\tau_T\},  (2.11)
where the last equality follows from the definition of \tau_T. By the multivariate change-of-variable formula (Port, 1994),

f(\tau_1, \tau_2, \ldots, \tau_n) = |J|\, f(u_1, u_2, \ldots, u_n \cap N(u_n) = n),  (2.12)

where J is the Jacobian of the transformation between u_j, j = 1, \ldots, n, and \tau_k, k = 1, \ldots, n. Because \tau_k is a function of u_1, \ldots, u_k, J is a lower triangular matrix, and its determinant is the product of its diagonal elements defined as |J| = |\prod_{k=1}^{n} J_{kk}|. By assumption 0 < \lambda(t \mid H_t), and by equation 2.8 and the definition of \tau_k, the mapping of u into \tau is one-to-one. Therefore, by the inverse differentiation theorem (Protter & Morrey, 1991), the diagonal elements of J are

J_{kk} = \frac{\partial u_k}{\partial \tau_k} = \lambda(u_k \mid H_{u_k})^{-1}.  (2.13)

Substituting |J| and equation 2.6 into equation 2.12 yields

f(\tau_1, \tau_2, \ldots, \tau_n) = \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k})^{-1} \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k}) \exp\left\{ -\int_{u_{k-1}}^{u_k} \lambda(u \mid H_u)\, du \right\}
  = \prod_{k=1}^{n} \exp\{-[\Lambda(u_k) - \Lambda(u_{k-1})]\}
  = \prod_{k=1}^{n} \exp\{-\tau_k\}.  (2.14)

Substituting equations 2.11 and 2.14 into 2.9 yields

f(\tau_1, \tau_2, \ldots, \tau_n \cap \tau_{n+1} > \tau_T) = f(\tau_1, \ldots, \tau_n) \Pr(\tau_{n+1} > \tau_T \mid \tau_1, \ldots, \tau_n) = \left( \prod_{k=1}^{n} \exp\{-\tau_k\} \right) \exp\{-\tau_T\},  (2.15)
which establishes the result. The time-rescaling theorem generates a history-dependent rescaling of the time axis that converts a point process into a Poisson process with a unit rate.

2.2 Assessing Model Goodness of Fit. We may use the time-rescaling theorem to construct goodness-of-fit tests for a spike data model. Once a
model has been fit to a spike train data series, we can compute from its estimated conditional intensity the rescaled times

\tau_k = \Lambda(u_k) - \Lambda(u_{k-1}).  (2.16)

If the model is correct, then, according to the theorem, the \tau_k's are independent exponential random variables with mean 1. If we make the further transformation

z_k = 1 - \exp(-\tau_k),  (2.17)

then the z_k's are independent uniform random variables on the interval (0, 1). Because the transformations in equations 2.16 and 2.17 are both one-to-one, any statistical assessment that measures agreement between the z_k's and a uniform distribution directly evaluates how well the original model agrees with the spike train data. Here we present two methods: Kolmogorov-Smirnov tests and quantile-quantile plots. To construct the Kolmogorov-Smirnov test, we first order the z_k's from smallest to largest, denoting the ordered values as z_{(k)}'s. We then plot the values of the cumulative distribution function of the uniform density, defined as b_k = (k - \frac{1}{2})/n for k = 1, \ldots, n, against the z_{(k)}'s. If the model is correct, then the points should lie on a 45-degree line (Johnson & Kotz, 1970). Confidence bounds for the degree of agreement between the models and the data may be constructed using the distribution of the Kolmogorov-Smirnov statistic. For moderate to large sample sizes, the 95% (99%) confidence bounds are well approximated as b_k \pm 1.36/n^{1/2} (b_k \pm 1.63/n^{1/2}) (Johnson & Kotz, 1970). We term such a plot a Kolmogorov-Smirnov (KS) plot. Another approach to measuring agreement between the uniform probability density and the z_k's is to construct a quantile-quantile (Q-Q) plot (Ventura, Carta, Kass, Gettner, & Olson, 2001; Barbieri, Quirk, et al., 2001; Hogg & Tanis, 2001). In this display, we plot the quantiles of the uniform distribution, denoted here also as the b_k's, against the z_{(k)}'s. As in the case of the KS plots, exact agreement occurs between the point process model and the experimental data if the points lie on a 45-degree line. Pointwise confidence bands can be constructed to measure the magnitude of the departure of the plot from the 45-degree line relative to chance. To construct pointwise bands, we note that if the \tau_k's are independent exponential random variables with mean 1 and the z_k's are thus uniform on the interval (0, 1), then each z_{(k)} has a beta probability density with parameters k and n - k + 1, defined as

f(z \mid k, n - k + 1) = \frac{n!}{(n - k)!\,(k - 1)!}\, z^{k-1} (1 - z)^{n-k},  (2.18)

for 0 < z < 1 (Johnson & Kotz, 1970). We set the 95% confidence bounds by finding the 2.5th and 97.5th quantiles of the cumulative distribution
associated with equation 2.18 for k = 1, \ldots, n. These exact quantiles are readily available in many statistical software packages. For moderate to large spike train data series, a reasonable approximation to the 95% (99%) confidence bounds is given by the gaussian approximation to the binomial probability distribution as z_{(k)} \pm 1.96[z_{(k)}(1 - z_{(k)})/n]^{1/2} (z_{(k)} \pm 2.575[z_{(k)}(1 - z_{(k)})/n]^{1/2}). To our knowledge, these local confidence bounds for the Q-Q plots based on the beta distribution and the gaussian approximation are new. In general, the KS confidence intervals will be wider than the corresponding Q-Q plot intervals. To see this, it suffices to compare the widths of the two intervals using their approximate formulas for large n. From the gaussian approximation to the binomial, the maximum width of the 95% confidence interval for the Q-Q plots occurs at the median, z_{(k)} = 0.50, and is 2[1.96/(4n)^{1/2}] = 1.96 n^{-1/2}. For n large, the width of the 95% confidence intervals for the KS plots is 2.72 n^{-1/2} at all quantiles. The KS confidence bounds consider the maximum discrepancy from the 45-degree line along all quantiles; the 95% bands show the discrepancy that would be exceeded 5% of the time by chance if the plotted data were truly uniformly distributed. The Q-Q plot confidence bounds consider the maximum discrepancy from the 45-degree line for each quantile separately. These pointwise 95% confidence bounds mark the amount by which each value z_{(k)} would deviate from the true quantile 5% of the time purely by chance. The KS bounds are broad because they are based on the joint distribution of all n deviations, and they consider the distribution of the largest of these deviations. The Q-Q plot bounds are narrower because they measure the deviation at each quantile separately. Used together, the two plots help approximate upper and lower limits on the discrepancy between a proposed model and a spike train data series.
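The KS and Q-Q constructions reduce to a few lines of code. The sketch below (our Python illustration; the function and variable names are ours) computes the ordered z_{(k)}'s, the uniform quantiles b_k, the KS statistic, and the two kinds of 95% bands. Feeding it the exact exponential quantiles shows the KS discrepancy collapsing to zero, as the theorem predicts for a correct model:

```python
import math

def ks_qq_summary(taus):
    """From rescaled intervals tau_k (equation 2.16), compute the ordered
    z_(k)'s (equation 2.17), the uniform quantiles b_k = (k - 1/2)/n, the
    KS statistic max_k |z_(k) - b_k|, its 95% band half-width 1.36/sqrt(n),
    and the pointwise gaussian-approximation 95% Q-Q half-widths."""
    n = len(taus)
    z = sorted(1.0 - math.exp(-t) for t in taus)
    b = [(k - 0.5) / n for k in range(1, n + 1)]
    ks_stat = max(abs(zk - bk) for zk, bk in zip(z, b))
    ks_band = 1.36 / math.sqrt(n)
    qq_bands = [1.96 * math.sqrt(zk * (1.0 - zk) / n) for zk in z]
    return z, b, ks_stat, ks_band, qq_bands

# A model in perfect agreement with its data: take the tau_k's to be the
# exact exponential quantiles, so that z_(k) = b_k up to rounding error.
n = 200
perfect_taus = [-math.log(1.0 - (k - 0.5) / n) for k in range(1, n + 1)]
z, b, ks_stat, ks_band, qq_bands = ks_qq_summary(perfect_taus)
print(ks_stat <= ks_band)  # → True: the KS plot lies inside the 95% band
```

Note that the pointwise Q-Q half-widths are largest near the median quantile, consistent with the width comparison above.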
3 Applications

3.1 An Analysis of Supplementary Eye Field Recordings. For the first application of the time-rescaling theorem to a goodness-of-fit analysis, we analyze a spike train recorded from the supplementary eye field (SEF) of a macaque monkey. Neurons in the SEF play a role in oculomotor processes (Olson, Gettner, Ventura, Carta, & Kass, 2000). A standard paradigm for studying the spiking properties of these neurons is a delayed eye movement task. In this task, the monkey fixates, is shown locations of potential target sites, and is then cued to the specific target to which it must saccade. Next, a preparatory cue is given, followed a random time later by a go signal. Upon receiving the go signal, the animal must saccade to the specific target and hold fixation for a defined amount of time in order to receive a reward. Beginning from the point of the specific target cue, neural activity is recorded for a fixed interval of time beyond the presentation of the go signal. After a brief rest period, the trial is repeated. Multiple trials from an experiment
such as this are jointly analyzed using a PSTH to estimate the firing rate for a finite interval following a fixed initiation point. That is, the trials are time aligned with respect to a fixed initial point, such as the target cue. The data across trials are binned in time intervals of a fixed length, and the rate in each bin is estimated as the average number of spikes in the fixed time interval. Kass and Ventura (2001) recently presented inhomogeneous Markov interval (IMI) models as an alternative to the PSTH for analyzing multiple-trial neural spike train data. These models use a Markov representation for the conditional intensity function. One form of the IMI conditional intensity function they considered is

\lambda(t \mid H_t) = \lambda(t \mid u_{N(t)}, \theta) = \lambda_1(t \mid \theta)\, \lambda_2(t - u_{N(t)} \mid \theta),  (3.1)

where u_{N(t)} is the time of the last spike prior to t, \lambda_1(t \mid \theta) modulates firing as a function of the experimental clock time, \lambda_2(t - u_{N(t)} \mid \theta) represents Markov dependence in the spike train, and \theta is a vector of model parameters to be estimated. Kass and Ventura modeled \log \lambda_1(t \mid \theta) and \log \lambda_2(t - u_{N(t)} \mid \theta) as separate piecewise cubic splines in their respective arguments t and t - u_{N(t)}. The cubic pieces were joined at knots so that the resulting functions were twice continuously differentiable. The number and positions of the knots were chosen in a preliminary data analysis. In the special case \lambda(t \mid u_{N(t)}, \theta) = \lambda_1(t \mid \theta), the conditional intensity function in equation 3.1 corresponds to an inhomogeneous Poisson (IP) model because it assumes no temporal dependence among the spike times. Kass and Ventura used their models to analyze SEF data that consisted of 486 spikes recorded during the 400 msec following the target cue signal in 15 trials of a delayed eye movement task (neuron PK166a from Olson et al., 2000, using the pattern condition). The IP and IMI models were fit by maximum likelihood, and statistical significance tests on the spline coefficients were used to compare goodness of fit. Included in the analysis were spline models with higher-order dependence among the spike times than the first-order Markov dependence in equation 3.1. They found that the IMI model gave a statistically significant improvement in the fit relative to the IP and that adding higher-order dependence to the model gave no further improvements. The fit of the IMI model was not improved by including terms to model between-trial differences in spike rate. The authors concluded that there was strong first-order Markov dependence in the firing pattern of this SEF neuron. Kass and Ventura did not provide an overall assessment of model goodness of fit or evaluate how much the IMI and the IP models improved over the histogram-based rate model estimated by the PSTH.
Using the KS and Q-Q plots derived from the time-rescaling theorem, it is possible to compare directly the fits of the IMI, IP, and PSTH models and to determine which gives the most accurate description of the SEF spike train structure. The equations for the IP rate function, \lambda_{IP}(t \mid \theta_{IP}) = \lambda_1(t \mid \theta_{IP}),
and for the IMI conditional intensity function, \lambda_{IMI}(t \mid u_{N(t)}, \theta_{IMI}) = \lambda_1(t \mid \theta_{IMI})\, \lambda_2(t - u_{N(t)} \mid \theta_{IMI}), are given in the appendix, along with a discussion of the maximum likelihood procedure used to estimate the coefficients of the spline basis elements. The estimated conditional intensity functions for the IP and IMI models are, respectively, \lambda_{IP}(t \mid \hat{\theta}_{IP}) and \lambda_{IMI}(t \mid u_{N(t)}, \hat{\theta}_{IMI}), where \hat{\theta} is the maximum likelihood estimate of the specific spline coefficients. For the PSTH model, the conditional intensity estimate is the PSTH computed by averaging the number of spikes in each of 40 10-msec bins (the 400 msec following the target cue signal) across the 15 trials. The PSTH is the fit of another inhomogeneous Poisson model because it assumes no dependence among the spike times. The results of the IMI, IP, and PSTH model fits compared by KS and Q-Q plots are shown in Figure 1. For the IP model, there is lack of fit at lower quantiles (below 0.25) because in that range, its KS plot lies just outside the 95% confidence bounds (see Figure 1A). From quantile 0.25 and beyond, the IP model is within the 95% confidence bounds, although beyond the quantile 0.75, it slightly underpredicts the true probability model of the data. The KS plot of the PSTH model is similar to that of the IP except that it lies entirely outside the 95% confidence bands below quantile 0.50. Beyond this quantile, it is within the 95% confidence bounds. The KS plot of the PSTH underpredicts the probability model of the data to a greater degree in the upper range than the IP model. The IMI model is completely within the 95% confidence bounds and lies almost exactly on the 45-degree line of complete agreement between the model and the data. The Q-Q plot analyses (see Figure 1B) agree with the KS plot analyses, with a few exceptions. The Q-Q plot analyses show that the lack of fit of the IP and PSTH models is greater at the lower quantiles (0–0.50) than suggested by the KS plots. The Q-Q plots for the IP and PSTH models also show that the deviations of these two models near quantiles 0.80 to 0.90 are statistically significant. With the exception of a small deviation below quantile 0.10, the Q-Q plot of the IMI lies almost exactly on the 45-degree line.

Figure 1: Facing page. (A) Kolmogorov-Smirnov (KS) plots of the inhomogeneous Markov interval (IMI) model, inhomogeneous Poisson (IP), and peristimulus time histogram (PSTH) model fits to the SEF spike train data. The solid 45-degree line represents exact agreement between the model and the data. The dashed 45-degree lines are the 95% confidence bounds for exact agreement between the model and experimental data based on the distribution of the Kolmogorov-Smirnov statistic. The 95% confidence bounds are b_k \pm 1.36 n^{-1/2}, where b_k = (k - \frac{1}{2})/n for k = 1, \ldots, n and n is the total number of spikes. The IMI model (thick, solid line) is completely within the 95% confidence bounds and lies almost exactly on the 45-degree line. The IP model (thin, solid line) has lack of fit at lower quantiles (< 0.25). From the quantile 0.25 and beyond, the IP model is within the 95% confidence bounds. The KS plot of the PSTH model (dotted line) is similar to that of the IP model except that it lies outside the 95% confidence bands below quantile 0.50. Beyond this quantile, it is within the 95% confidence bounds, yet it underpredicts the probability model of the data to a greater degree in this range than the IP model does. The IMI model agrees more closely with the spike train data than either the IP or the PSTH models. (B) Quantile-quantile (Q-Q) plots of the IMI (thick, solid line), IP (thin, solid line), and PSTH (dotted line) models. The dashed lines are the local 95% confidence bounds for the individual quantiles computed from the beta probability density defined in equation 2.18. The solid 45-degree line represents exact agreement between the model and the data. The Q-Q plots suggest that the lack of fit of the IP and PSTH models is greater at the lower quantiles (0–0.50) than suggested by the KS plots. The Q-Q plots for the IP and PSTH models also show that the deviations of these two models near quantiles 0.80 to 0.90 are statistically significant. With the exception of a small deviation below quantile 0.10, the Q-Q plot of the IMI lies almost exactly on the 45-degree line.
We conclude that the IMI model gives the best description of these SEF spike trains. In agreement with the report of Kass and Ventura (2001), this analysis supports a first-order Markov dependence among the spike times and not a Poisson structure, as would be suggested by either the IP or the PSTH models. This analysis extends the findings of Kass and Ventura by showing that of the three models, the PSTH gives the poorest description of the SEF spike train. The IP model gives a better fit to the SEF data than the PSTH model because the maximum likelihood analysis of the parametric IP model is more statistically efficient than the histogram (method of moments) estimate obtained from the PSTH (Casella & Berger, 1990). That is, the IP model fit by maximum likelihood uses all the data to estimate the conditional intensity function at all time points, whereas the PSTH analysis uses only spikes in a specified time bin to estimate the firing rate in that bin. The additional improvement of the IMI model over the IP is due to the fact that the former represents temporal dependence in the spike train.

3.2 An Analysis of Hippocampal Place Cell Recordings. As a second example of using the time-rescaling theorem to develop goodness-of-fit tests, we analyze the spiking activity of a pyramidal cell in the CA1 region of the rat hippocampus recorded from an animal running back and forth on a linear track. Hippocampal pyramidal neurons have place-specific firing (O'Keefe & Dostrovsky, 1971); a given neuron fires only when the animal is in a certain subregion of the environment termed the neuron's place field. On a linear track, these fields approximately resemble one-dimensional gaussian surfaces. The neuron's spiking activity correlates most closely with the animal's position on the track (Wilson & McNaughton, 1993). The data series we analyze consists of 691 spikes from a place cell in the CA1 region of the hippocampus recorded from a rat running back and forth for 20 minutes on a 300 cm U-shaped track.
The track was linearized for the purposes of this analysis (Frank, Brown, & Wilson, 2000). There are two approaches to estimating the place-specific firing maps of a hippocampal neuron. One approach is to use maximum likelihood to fit a specific parametric model of the spike times to the place cell data, as in Brown, Frank, Tang, Quirk, and Wilson (1998) and Barbieri, Quirk, et al. (2001). If x(t) is the animal's position at time t, we define the spatial function for the one-dimensional place field model as the gaussian surface

s(t) = \exp\left\{ \alpha - \frac{\beta (x(t) - \mu)^2}{2} \right\},  (3.2)

where \mu is the center of the place field, \beta is a scale factor, and \exp\{\alpha\} is the maximum height of the place field at its center. We represent the spike time probability density of the neuron as either an inhomogeneous gamma (IG)
model, defined as

f(u_k \mid u_{k-1}, \theta) = \frac{\psi s(u_k)}{\Gamma(\psi)} \left[ \int_{u_{k-1}}^{u_k} \psi s(u)\, du \right]^{\psi - 1} \exp\left\{ -\int_{u_{k-1}}^{u_k} \psi s(u)\, du \right\},  (3.3)

or as an inhomogeneous inverse gaussian (IIG) model, defined as

f(u_k \mid u_{k-1}, \theta) = \frac{s(u_k)}{\left[ 2\pi \left( \int_{u_{k-1}}^{u_k} s(u)\, du \right)^3 \right]^{1/2}} \exp\left\{ -\frac{1}{2} \frac{\left( \int_{u_{k-1}}^{u_k} s(u)\, du - \psi \right)^2}{\psi^2 \int_{u_{k-1}}^{u_k} s(u)\, du} \right\},  (3.4)

where \psi > 0 is a location parameter for both models and \theta = (\mu, \alpha, \beta, \psi) is the set of model parameters to be estimated from the spike train. If we set \psi = 1 in equation 3.3, we obtain the IP model as a special case of the IG model. The parameters for all three models (the IP, IG, and the IIG) can be estimated from the spike train data by maximum likelihood (Barbieri, Quirk, et al., 2001). The models in equations 3.3 and 3.4 are Markov, so that the current value of either the spike time probability density or the conditional intensity (rate) function depends only on the time of the previous spike. Because of equation 2.2, specifying the spike time probability density is equivalent to specifying the conditional intensity function. If we let \hat{\theta} denote the maximum likelihood estimate of \theta, then the maximum likelihood estimate of the conditional intensity function for each model can be computed from equation 2.2 as

\lambda(t \mid H_t, \hat{\theta}) = \frac{f(t \mid u_{N(t)}, \hat{\theta})}{1 - \int_{u_{N(t)}}^{t} f(u \mid u_{N(t)}, \hat{\theta})\, du},  (3.5)

for t > u_{N(t)}. The estimated conditional intensity from each model may be used in the time-rescaling theorem to assess model goodness of fit as described in Section 2.2. The second approach is to compute a histogram-based estimate of the conditional intensity function by using either spatial smoothing (Muller & Kubie, 1987; Frank et al., 2000) or temporal smoothing (Wood, Dudchenko, & Eichenbaum, 1999) of the spike train. To compute the spatial smoothing estimate of the conditional intensity function, we followed Frank et al. (2000) and divided the 300 cm track into 4.2 cm bins, counted the number of spikes per bin, and divided the count by the amount of time the animal spends in the bin. We smooth the binned firing rate with a six-point gaussian window with a standard deviation of one bin to reduce the effect of
running velocity. The spatial conditional intensity estimate is the smoothed spatial rate function. To compute the temporal rate function, we followed Wood et al. (1999) and divided the experiment into time bins of 200 msec and computed the rate as the number of spikes per 200 msec. These two smoothing procedures produce histogram-based estimates of \lambda(t). Both are histogram-based estimates of Poisson rate functions because neither the estimated spatial nor the temporal rate function makes any history-dependence assumption about the spike train. As in the analysis of the SEF spike trains, we again use the KS and Q-Q plots to compare directly the goodness of fit of the five spike train models for the hippocampal place cells. The IP, IG, and IIG models were fit to the spike train data by maximum likelihood. The spatial and temporal rate models were computed as described. The KS and Q-Q plot goodness-of-fit comparisons are in Figure 2. The IG model overestimates at lower quantiles, underestimates at intermediate quantiles, and overestimates at the upper quantiles (see Figure 2A).
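The spatial smoothing estimate just described can be sketched as follows (our Python illustration; the bin width and smoothing window follow the text, but the helper names and the clamping of the window at the track ends are our assumptions):

```python
import math

def spatial_rate_estimate(spike_positions, path_positions, dt,
                          track_length=300.0, bin_width=4.2):
    """Histogram-based spatial rate: count spikes per spatial bin, divide by
    the occupancy time in that bin, then smooth with a six-point gaussian
    window whose standard deviation is one bin, as in Frank et al. (2000)."""
    n_bins = int(math.ceil(track_length / bin_width))
    counts = [0] * n_bins
    occupancy = [0.0] * n_bins
    bin_of = lambda x: min(int(x / bin_width), n_bins - 1)
    for x in spike_positions:
        counts[bin_of(x)] += 1
    for x in path_positions:            # animal position sampled every dt sec
        occupancy[bin_of(x)] += dt
    raw = [c / o if o > 0.0 else 0.0 for c, o in zip(counts, occupancy)]
    offsets = [-3, -2, -1, 0, 1, 2]     # six-point window
    weights = [math.exp(-0.5 * k * k) for k in offsets]  # sigma = one bin
    wsum = sum(weights)
    smoothed = []
    for i in range(n_bins):
        acc = sum(w * raw[min(max(i + k, 0), n_bins - 1)]
                  for k, w in zip(offsets, weights))
        smoothed.append(acc / wsum)
    return smoothed

# Toy data: position sampled at 100 Hz over one pass; spikes near 150 cm.
path = [i * 0.1 for i in range(3000)]           # 0 to 299.9 cm
spikes = [148.0, 149.5, 150.0, 151.2, 152.8]
rate = spatial_rate_estimate(spikes, path, dt=0.01)
```

The temporal rate estimate is simpler still: bin the spikes into 200 msec windows and divide each count by 0.2 s. Both outputs can be plugged into equation 2.16 as piecewise-constant conditional intensities.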
The IP model underestimates the lower and intermediate quantiles and overestimates the upper quantiles. The KS plot of the spatial rate model is similar to that of the IP model yet closer to the 45-degree line. The temporal rate model overestimates the quantiles of the true probability model of the data. This analysis suggests that the IG, IP, and spatial rate models are most likely oversmoothing this spike train, whereas the temporal rate model undersmooths it. Of the five models, the one that is closest to the 45-degree line and lies almost entirely within the confidence bounds is the IIG. This model disagrees only with the experimental data near quantile 0.80. Because all of the models with the exception of the IIG have appreciable lack of fit in terms of the KS plots, the findings in the Q-Q plot analyses are almost identical (see Figure 2B). As in the KS plot, the Q-Q plot for the IIG model is close to the 45-degree line and within the 95% confidence bounds, with the exception of quantiles near 0.80. These analyses suggest that the IIG rate model gives the best agreement with the spiking activity of this pyramidal neuron. The spatial and temporal rate function models and the IP model have the greatest lack of fit.

Figure 2: Facing page. (A) KS plots of the parametric and histogram-based model fits to the hippocampal place cell spike train activity. The parametric models are the inhomogeneous Poisson (IP) (dotted line), the inhomogeneous gamma (IG) (thin, solid line), and the inhomogeneous inverse gaussian (IIG) (thick, solid line). The histogram-based models are the spatial rate model (dashed line) based on a 4.2 cm spatial bin width and the temporal rate model (dashed and dotted line) based on a 200 msec time bin width. The 45-degree solid line represents exact agreement between the model and the experimental data, and the 45-degree thin, solid lines are the 95% confidence bounds based on the KS statistic, as in Figure 1. The IIG model KS plot lies almost entirely within the confidence bounds, whereas all the other models show significant lack of fit. This suggests that the IIG model gives the best description of this place cell data series, and the histogram-based models agree least with the data. (B) Q-Q plots of the IP (dotted line), IG (thin, solid line), IIG (thick, solid line), spatial (dashed line), and temporal (dotted and dashed line) models. The 95% local confidence bounds (thin dashed lines) are computed as described in Figure 1. The findings in the Q-Q plot analyses are almost identical to those in the KS plots. The Q-Q plot for the IIG model is close to the 45-degree line and within the 95% confidence bounds, with the exception of quantiles near 0.80. This analysis also suggests that the IIG rate model gives the best agreement with the spiking activity of this pyramidal neuron. The spatial and temporal rate function models and the IP model have the greatest lack of fit.

3.3 Simulating a General Point Process by Time Rescaling. As a second application of the time-rescaling theorem, we describe how the theorem may be used to simulate a general point process. We stated in the Introduction that the time-rescaling theorem provides a standard approach for simulating an inhomogeneous Poisson process from a simple Poisson
process. The general form of the time-rescaling theorem suggests that any point process with an integrable conditional intensity function may be simulated from a Poisson process with unit rate by rescaling time with respect to the conditional intensity (rate) function. Given an interval (0, T], the simulation algorithm proceeds as follows:

1. Set u_0 = 0; set k = 1.
2. Draw \tau_k, an exponential random variable with mean 1.
3. Find u_k as the solution to \tau_k = \int_{u_{k-1}}^{u_k} \lambda(u \mid u_0, u_1, \ldots, u_{k-1})\, du.
4. If u_k > T, then stop.
5. Set k = k + 1.
6. Go to 2.
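The steps above translate directly into code. In this sketch (ours, not the authors'; step 3 is solved by accumulating a Riemann sum of the conditional intensity rather than analytically), any callable conditional intensity can be supplied:

```python
import math
import random

def simulate_point_process(cond_intensity, T, dt=1e-3, seed=0):
    """Time-rescaling simulation on (0, T]: repeatedly draw tau_k ~
    Exponential(1) (step 2) and advance u until the accumulated integral of
    lambda(u | u_0, ..., u_{k-1}) reaches tau_k (step 3), recording a spike
    there; stop once u exceeds T (step 4)."""
    rng = random.Random(seed)
    spikes, t, acc = [], 0.0, 0.0
    tau = rng.expovariate(1.0)
    while t < T:
        acc += cond_intensity(t, spikes) * dt   # Riemann-sum integral
        t += dt
        if acc >= tau:                          # u_k located
            spikes.append(t)
            acc = 0.0
            tau = rng.expovariate(1.0)
    return spikes

# Example: an inhomogeneous Poisson process (history-independent intensity).
rate = lambda t, history: 10.0 * (1.0 + math.sin(2.0 * math.pi * t))
spikes = simulate_point_process(rate, T=5.0)
```

For a history-dependent model such as the time-varying-\psi IG case discussed below, cond_intensity would evaluate the ratio in equation 3.5 numerically; the algorithm requires only that the conditional intensity remain integrable.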
By using equation 2.3, a discrete version of the algorithm can be constructed as follows. Choose J large, and divide the interval (0, T] into J bins, each of width \Delta = T/J. For k = 1, \ldots, J, draw a Bernoulli random variable u_k^* with probability \lambda(k\Delta \mid u_1^*, \ldots, u_{k-1}^*)\Delta, and assign a spike to bin k if u_k^* = 1 and no spike if u_k^* = 0. While in many instances there will be faster, more computationally efficient algorithms for simulating a point process, such as model-based methods for specific renewal processes (Ripley, 1987) and thinning algorithms (Lewis & Shedler, 1978; Ogata, 1981; Ross, 1993), the algorithm above is simple to implement given a specification of the conditional intensity function. For an example of where this algorithm is crucial for point process simulation, we consider the IG model in equation 3.3. Its conditional intensity function is infinite immediately following a spike if \psi < 1. If, in addition, \psi is time varying (\psi = \psi(t) < 1 for all t), then neither thinning nor standard algorithms for making draws from a gamma probability distribution may be used to simulate data from this model. The thinning algorithm fails because the conditional intensity function is not bounded, and the standard algorithms for simulating a gamma model cannot be applied because \psi is time varying. In this case, the time-rescaling simulation algorithm may be applied as long as the conditional intensity function remains integrable as \psi varies temporally.

4 Discussion
Measuring how well a point process model describes a neural spike train data series is imperative prior to using the model for making inferences. The time-rescaling theorem states that any point process with an integrable conditional intensity function may be transformed into a Poisson process with a unit rate. Berman (1983) and Ogata (1988) showed that this theorem may be used to develop goodness-of-fit tests for point process models of seismologic data. Goodness-of-fit methods for neural spike train models based on
The Time-Rescaling Theorem
341
time-rescaling transformations but not the time-rescaling theorem have also been reported (Reich, Victor, & Knight, 1998; Barbieri, Frank, Quirk, Wilson, & Brown, 2001). Here, we have described how the time-rescaling theorem may be used to develop goodness-of-fit tests for point process models of neural spike trains. To illustrate our approach, we analyzed two types of commonly recorded spike train data. The SEF data are a set of multiple short (400 msec) series of spike times, each measured under identical experimental conditions. These data are typically analyzed by a PSTH. The hippocampus data are a long (20 minute) series of spike time recordings that are typically analyzed with either spatial or temporal histogram models. To each type of data we fit both parametric and histogram-based models. Histogram-based models are popular neural data analysis tools because of the ease with which they can be computed and interpreted. These apparent advantages do not override the need to evaluate the goodness of fit of these models. We previously used the time-rescaling theorem to assess goodness of fit for parametric spike train models (Olson et al., 2000; Barbieri, Quirk, et al., 2001; Ventura et al., 2001). Our main result in this article is that the time-rescaling theorem can be used to evaluate goodness of fit of parametric and histogram-based models and to compare directly the accuracy of models from the two classes. We recommend that before making an inference based on either type of model, a goodness-of-fit analysis be performed to establish how well the model describes the spike train data. If the model and data agree closely, then the inference is more credible than when there is significant lack of fit. The KS and Q-Q plots provide assessments of overall goodness of fit.
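As a concrete illustration of the KS construction discussed here (a sketch under our own conventions, not the authors' code): under a correct model, the rescaled intervals $\tau_k$ are independent exponentials with mean 1, so $z_k = 1 - \exp(-\tau_k)$ are uniform on (0, 1), and the sorted $z_k$ can be plotted against the uniform quantiles. The $\pm 1.36/\sqrt{n}$ band below is the usual large-sample 95% KS bound; the exact constant used in the paper is an assumption here.

```python
import math

def ks_plot_points(taus):
    """From rescaled intervals tau_k, return (uniform quantiles,
    sorted empirical quantiles, half-width of an approximate 95% KS band).
    """
    n = len(taus)
    z = sorted(1.0 - math.exp(-t) for t in taus)     # uniform under the model
    b = [(k - 0.5) / n for k in range(1, n + 1)]     # model quantiles
    band = 1.36 / math.sqrt(n)                       # assumed 95% constant
    return b, z, band

# Intervals actually drawn from a unit-rate exponential should hug the
# 45-degree line of the KS plot.
import random
rng = random.Random(1)
b, z, band = ks_plot_points([rng.expovariate(1.0) for _ in range(500)])
```

Plotting `z` against `b` with the band `b ± band` gives the KS plot; a model whose curve escapes the band agrees poorly with the data.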
For the models fit by maximum likelihood, these assessments can be applied along with methods that measure the marginal value and marginal costs of using more complex models, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), in order to gain a more complete evaluation of model agreement with experimental data (Barbieri, Quirk, et al., 2001). We assessed goodness of fit by using the time-rescaling theorem to construct KS and Q-Q plots having, respectively, liberal and conservative confidence bounds. Together, the two sets of confidence bounds help characterize the range of agreement between the model and the data. For example, a model whose KS plot lies consistently outside the 95% KS confidence bounds (the IP model for the hippocampal data) agrees poorly with the data. On the other hand, a model that is within all the 95% confidence bounds of the Q-Q plots (the IMI model for the SEF data) agrees closely with the data. A model such as the IP model for the SEF data, which lies within nearly all the KS bounds, may still lie outside the Q-Q plot intervals. In this case, if the lack of fit with respect to the Q-Q plot intervals is systematic (i.e., occurs over a set of contiguous quantiles), this suggests that the model does not fit the data well. As a second application of the time-rescaling theorem, we presented an algorithm for simulating spike trains from a point process given its conditional intensity (rate) function.
342
E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank
This algorithm generalizes the well-known technique of simulating an inhomogeneous Poisson process by rescaling a Poisson process with a constant rate. Finally, to make the reasoning behind the time-rescaling theorem more accessible to neuroscience researchers, we proved its general form using elementary probability arguments. While this elementary proof is most certainly apparent to experts in probability and point process theory (D. Brillinger, personal communication; Guttorp, 1995), its details, to our knowledge, have not been previously presented. The original proofs of this theorem use measure theory and are based on the martingale representation of point processes (Meyer, 1969; Papangelou, 1972; Brémaud, 1981; Jacobsen, 1982). The conditional intensity function (see equation 2.1) is defined in terms of the martingale representation. Our proof uses elementary arguments because it is based on the fact that the joint probability density of a set of point process observations (a spike train) has a canonical representation in terms of the conditional intensity function. When the joint probability density is represented in this way, the Jacobian in the change of variables between the original spike times and the rescaled interspike intervals simplifies to a product of the reciprocals of the conditional intensity functions evaluated at the spike times. The proof also highlights the significance of the conditional intensity function in spike train modeling; its specification completely defines the stochastic structure of the point process. This is because in a small time interval, the product of the conditional intensity function and the time interval defines the probability of a spike in that interval given the history of the spike train up to that time (see equation 2.3). When there is no history dependence, the conditional intensity function is simply the Poisson rate function.
An important consequence of this simplification is that unless history dependence is specifically included, histogram-based models, such as the PSTH and the spatial and temporal smoothers, are implicit Poisson models. In both the SEF and hippocampus examples, the histogram-based models gave poor fits to the spike train. These poor fits arose because these models used few data points to estimate many parameters and because they do not model history dependence in the spike train. Our parametric models used fewer parameters and represented temporal dependence explicitly as Markov. Point process models with higher-order temporal dependence have been studied by Ogata (1981, 1988), Brillinger (1988), and Kass and Ventura (2001) and will be considered further in our future work. Parametric conditional intensity functions may be estimated from neural spike train data in any experiment where there are enough data to estimate reliably a histogram-based model. This is because if there are enough data to estimate many parameters using an inefficient procedure (histogram / method of moments), then there should be enough data to estimate a smaller number of parameters using an efficient one (maximum likelihood). Using the KS and Q-Q plots derived from the time-rescaling theorem, it is possible to devise a systematic approach to the use of histogram-based
models. That is, it is possible to determine when a histogram-based model accurately describes a spike train and when a different model class, temporal dependence, or the effects of covariates (e.g., the theta rhythm and the rat running velocity in the case of the place cells) should be considered, and the degree of improvement the alternative models provide. In summary, we have illustrated how the time-rescaling theorem may be used to compare directly goodness of fit of parametric and histogram-based point process models of neural spiking activity and to simulate spike train data. These results suggest that the time-rescaling theorem can be a valuable tool for neural spike train data analysis.

Appendix

A.1 Maximum Likelihood Estimation of the IMI Model. To fit the IMI model, we choose J large and divide the interval (0, T] into J bins of width $\Delta = T/J$. We choose J so that there is at most one spike in any bin. In this way, we convert the spike times $0 < u_1 < u_2 < \cdots < u_{n-1} < u_n \le T$ into a binary sequence $u_j^*$, where $u_j^* = 1$ if there is a spike in bin j and 0 otherwise, for $j = 1, \ldots, J$. By definition of the conditional intensity function for the IMI model in equation 3.1, it follows that each $u_j^*$ is a Bernoulli random variable with the probability of a spike at $j\Delta$ defined as $\lambda_1(j\Delta \mid \theta)\, \lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta)\, \Delta$. We note that this discretization is identical to the one used to construct the discretized version of the simulation algorithm in Section 3.3. The log probability of a spike is thus

$$\log\bigl(\lambda_1(j\Delta \mid \theta)\bigr) + \log\bigl(\lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta)\,\Delta\bigr), \qquad (A.1)$$

and the cubic spline models for $\log(\lambda_1(j\Delta \mid \theta))$ and $\log(\lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta))$ are, respectively,

$$\log\bigl(\lambda_1(j\Delta \mid \theta)\bigr) = \sum_{l=1}^{3} \theta_l\, (j\Delta - \xi_1)_+^l + \theta_4\, (j\Delta - \xi_2)_+^3 + \theta_5\, (j\Delta - \xi_3)_+^3 \qquad (A.2)$$

$$\log\bigl(\lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta)\bigr) = \sum_{l=1}^{3} \theta_{l+5}\, (j\Delta - u^*_{N(j\Delta)} - \gamma_1)_+^l + \theta_9\, (j\Delta - u^*_{N(j\Delta)} - \gamma_2)_+^3, \qquad (A.3)$$

where $\theta = (\theta_1, \ldots, \theta_9)$, the knots are defined as $\xi_l = lT/4$ for $l = 1, 2, 3$, and $\gamma_l$ is the observed $100l/3$ quantile of the interspike interval distribution of the trial spike trains for $l = 1, 2$. We have for Section 3.1 that $\theta_{IP} = (\theta_1, \ldots, \theta_5)$
and $\theta_{IMI} = (\theta_1, \ldots, \theta_9)$ are the coefficients of the spline basis elements for the IP and IMI models, respectively. The parameters are estimated by maximum likelihood with $\Delta = 1$ msec using the gam function in S-PLUS (MathSoft, Seattle), as described in Chapter 11 of Venables and Ripley (1999). The inputs to gam are the $u_j^*$ values, the time arguments $j\Delta$ and $j\Delta - u^*_{N(j\Delta)}$, the symbolic specification of the spline models, and the knots. Further discussion of spline-based regression methods for analyzing neural data may be found in Olson et al. (2000) and Ventura et al. (2001).

Acknowledgments
We are grateful to Carl Olson for allowing us to use the supplementary eye field data and Matthew Wilson for allowing us to use the hippocampal place cell data. We thank David Brillinger and Victor Solo for helpful discussions and the two anonymous referees whose suggestions helped improve the exposition. This research was supported in part by NIH grants CA54652, MH59733, and MH61637 and NSF grants DMS 9803433 and IBN 0081548.

References

Barbieri, R., Frank, L. M., Quirk, M. C., Wilson, M. A., & Brown, E. N. (2001). Diagnostic methods for statistical models of place cell spiking activity. Neurocomputing, 38–40, 1087–1093.
Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (2001). Construction and analysis of non-Poisson stimulus-response models of neural spike train activity. J. Neurosci. Meth., 105, 25–37.
Berman, M. (1983). Comment on "Likelihood analysis of point processes and its applications to seismological data" by Ogata. Bulletin Internatl. Stat. Instit., 50, 412–418.
Brémaud, P. (1981). Point processes and queues: Martingale dynamics. Berlin: Springer-Verlag.
Brillinger, D. R. (1988). Maximum likelihood analysis of spike trains of interacting nerve cells. Biol. Cyber., 59, 189–200.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411–7425.
Casella, G., & Berger, R. L. (1990). Statistical inference. Belmont, CA: Duxbury.
Daley, D., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag.
Frank, L. M., Brown, E. N., & Wilson, M. A. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27, 169–178.
Guttorp, P. (1995). Stochastic modeling of scientific data. London: Chapman & Hall.
Hogg, R. V., & Tanis, E. A. (2001). Probability and statistical inference (6th ed.). Englewood Cliffs, NJ: Prentice Hall.
Jacobsen, M. (1982). Statistical analysis of counting processes. New York: Springer-Verlag.
Johnson, A., & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions—2. New York: Wiley.
Kalbfleisch, J., & Prentice, R. (1980). The statistical analysis of failure time data. New York: Wiley.
Karr, A. (1991). Point processes and their statistical inference (2nd ed.). New York: Marcel Dekker.
Kass, R. E., & Ventura, V. (2001). A spike train probability model. Neural Comput., 13, 1713–1720.
Lewis, P. A. W., & Shedler, G. S. (1978). Simulation of non-homogeneous Poisson processes by thinning. Naval Res. Logistics Quart., 26, 403–413.
Meyer, P. (1969). Démonstration simplifiée d'un théorème de Knight. In Séminaire de probabilités V (pp. 191–195). New York: Springer-Verlag.
Muller, R. U., & Kubie, J. L. (1987). The effects of changes in the environment on the spatial firing of hippocampal complex-spike cells. Journal of Neuroscience, 7, 1951–1968.
Ogata, Y. (1981). On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, IT-27, 23–31.
Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83, 9–27.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34, 171–175.
Olson, C. R., Gettner, S. N., Ventura, V., Carta, R., & Kass, R. E. (2000). Neuronal activity in macaque supplementary eye field during planning of saccades in response to pattern and spatial cues. Journal of Neurophysiology, 84, 1369–1384.
Papangelou, F. (1972). Integrability of expected increments of point processes and a related random change of scale. Trans. Amer. Math. Soc., 165, 483–506.
Port, S. (1994). Theoretical probability for applications. New York: Wiley.
Protter, M. H., & Morrey, C. B. (1991). A first course in real analysis (2nd ed.). New York: Springer-Verlag.
Reich, D., Victor, J., & Knight, B. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. Journal of Neuroscience, 18, 10090–10104.
Ripley, B. (1987). Stochastic simulation. New York: Wiley.
Ross, S. (1993). Introduction to probability models (5th ed.). San Diego, CA: Academic Press.
Snyder, D., & Miller, M. (1991). Random point processes in time and space (2nd ed.). New York: Springer-Verlag.
Taylor, H. M., & Karlin, S. (1994). An introduction to stochastic modeling (rev. ed.). San Diego, CA: Academic Press.
Venables, W., & Ripley, B. (1999). Modern applied statistics with S-PLUS (3rd ed.). New York: Springer-Verlag.
Ventura, V., Carta, R., Kass, R., Gettner, S., & Olson, C. (2001). Statistical analysis of temporal evolution in single-neuron firing rate. Biostatistics. In press.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Wood, E. R., Dudchenko, P. A., & Eichenbaum, H. (1999). The global record of memory in hippocampal neuronal activity. Nature, 397, 613–616.
Received October 27, 2000; accepted May 2, 2001.
LETTER
Communicated by Carson Chow
The Impact of Spike Timing Variability on the Signal-Encoding Performance of Neural Spiking Models

Amit Manwani
[email protected]
Peter N. Steinmetz
[email protected]
Christof Koch
[email protected]
Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125, U.S.A.

It remains unclear whether the variability of neuronal spike trains in vivo arises due to biological noise sources or represents highly precise encoding of temporally varying synaptic input signals. Determining the variability of spike timing can provide fundamental insights into the nature of strategies used in the brain to represent and transmit information in the form of discrete spike trains. In this study, we employ a signal estimation paradigm to determine how variability in spike timing affects encoding of random time-varying signals. We assess this for two types of spiking models: an integrate-and-fire model with random threshold and a more biophysically realistic stochastic ion channel model. Using the coding fraction and mutual information as information-theoretic measures, we quantify the efficacy of optimal linear decoding of random inputs from the model outputs and study the relationship between efficacy and variability in the output spike train. Our findings suggest that variability does not necessarily hinder signal decoding for the biophysically plausible encoders examined and that the functional role of spiking variability depends intimately on the nature of the encoder and the signal processing task; variability can either enhance or impede decoding performance.

1 Introduction
Deciphering the neural code remains an essential and yet elusive key to understanding how brains work. Unraveling the nature of representation of information in the brain requires an understanding of the biophysical constraints that limit the temporal precision of neural spike trains, the dominant mode of communication in the brain (Mainen and Sejnowski, 1995; van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997). The representation used by the nervous system depends on the precision with which neurons

Neural Computation 14, 347–367 (2001)
© 2001 Massachusetts Institute of Technology
348
A. Manwani, P. N. Steinmetz, and C. Koch
respond to their synaptic inputs (Theunissen & Miller, 1995), which in turn is influenced by noise present at the single cell level (Koch, 1999). Neuronal hardware inherently behaves in a probabilistic manner, and thus the encoding of information in the form of spike trains is noisy and may result in irregular timing of individual action potentials in response to identical inputs (Schneidman, Freedman, & Segev, 1998). As in other physical systems, noise has a direct bearing on how information is represented, transmitted, and decoded in biological information processing systems (Cecchi et al., 2000), and a quantitative understanding of neuronal noise sources and their effect on the variability of spike timing reveals the constraints under which neuronal codes must operate. The variability of spike timing observed in vivo can arise due to a variety of factors. One possibility is that it originates at the level of the single neuron, due to either a delicate balance of the excitatory and inhibitory synaptic inputs it receives (Shadlen & Newsome, 1998) or sources of biological noise intrinsic to it (Schneidman et al., 1998). The other possibility is that variability is an emergent property of large, recurrent networks of spiking neurons connected in a certain fashion, representing faithful encoding of nonlinear or chaotic network dynamics (van Vreeswijk & Sompolinsky, 1996, 1998). Such faithful encoding argues in favor of the hypothesis that single neurons are capable of very precise signaling; high spike timing reliability is intuitively appealing since it provides a substrate for efficient temporal coding (Abeles, 1990; Bialek, Rieke, van Steveninck, & Warland, 1991; Softky & Koch, 1993). In this study, we eschew the debate regarding the origin of variability and instead assess the functional role of spike timing variability in a specific instance of a neural coding problem: signal estimation.
We quantify the ability of two types of neural spiking models, integrate-and-fire models and stochastic ion channel models, to encode information about their random time-varying inputs. The goal in signal estimation is to estimate a random time-varying current injected into a spike-encoding model from the corresponding spike train output. The efficacy of encoding is estimated by the ability to reconstruct the inputs from the output spike trains using optimal least-mean-square estimation; we use information-theoretic measures to quantify the fraction of variability in the output spike train that conveys information about the input. We have previously reported the coding fraction for the Hodgkin-Huxley dynamics for a limited number of input bandwidths (Steinmetz, Manwani, & Koch, in press); here we report the results for more biophysically realistic encoders and for a more complete range of model parameters.

2 Methods
In the following, we consider spiking models that transform continuous, time-varying input signals into sequences of action potentials or spikes
Signal Encoding in Noisy Spiking Models
349
Figure 1: Noisy models of spike timing variability. (A) For an adapting integrate-and-fire model with random threshold, the time-varying input current m(t) is integrated by a combination of the passive membrane resistance and capacitance (the RC circuit) to give rise to the membrane voltage $V_m$. When $V_m$ exceeds a threshold $V_{th}$ drawn from a random distribution $p(V_{th})$, a spike is generated and the integrator is reset for a duration equal to the refractory period $\tau_{ref}$. The output spike train of the model in response to the input is represented as a point process s(t). $p(V_{th})$ is modeled as an nth-order gamma distribution, where n determines the variability in spike timing (the inset shows gamma distributions for n = 1, 2, and 10). Each spike increases the amplitude of a conductance $g_{adapt}$ by an amount $G_{inc}$. $g_{adapt}$ corresponds to a calcium-dependent potassium conductance responsible for firing-rate adaptation and decays exponentially to zero between spikes with a time constant $\tau_{adapt}$. (B) A time-varying current input m(t) is injected into a membrane patch containing stochastic voltage-gated ion channels, which are capable of generating action potentials in response to adequately strong current inputs. When the membrane voltage exceeds an arbitrarily chosen reference value above resting potential (+10 mV, in this case), a spike is recorded in the output spike train s(t). Parameters correspond to the kinetic model for regular spiking cortical neurons derived by Golomb and Amitai (1997). (C, D) Sample traces of the input m(t), the membrane voltage $V_m(t)$, and the spike train s(t) for the models in A and B, respectively.
(shown in Figure 1). The spike train output of a model in response to the injection of an input current i ( t) is denoted by s ( t) , which is assumed to be a point process and is mathematically modeled as a sequence of delta
functions at time instants $\{t_i\}$:

$$s(t) = \sum_i \delta(t - t_i).$$
The models we consider generate irregular spike trains in response to repeated presentations of the same input and can be regarded as representations of irregular spiking behavior in real biological neurons. We use a specific signal processing task (signal estimation) to study the effects of spike timing variability on the encoding of time-varying input modulations by these models.

2.1 Measurement of Spike Timing Variability. Spike timing variability has previously been examined using two methods. The first, introduced by Mainen and Sejnowski (1995), measures the precision and reliability of spike times generated by one neuron in response to repeated presentations of the same input current. The second method, measurement of the coefficient of variation (CV) of the interspike interval distribution, examines variability of firing when a neuron is responding to natural inputs. Thus, these two measures represent two different paradigms of neuronal variability. CV is the measure used here, since determining the efficacy of information transfer during signal encoding requires the presentation of a large group of randomly selected stimuli. Although there is no general relationship between precision and reliability as measured by Mainen and Sejnowski (1995) for the encoding models examined in this study, the two measures roughly correspond, as shown in Figure 2C. We will return to this point in the Discussion.

2.2 Measurement of Coding Efficiency. In order to compute coding efficiency, we construct the optimal linear estimator of the input current i(t). The difference, m(t), from the mean current is a zero-mean random time-varying input signal that is encoded in the form of a spike train s(t). For the purposes of this article, we assume that the current i(t) injected into the model has the form $i(t) = I + m(t)$, where I is the constant component and m(t) is the fluctuating component of the injected current.
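Later in the article (Figure 2), m(t) is taken to be a gaussian, band-limited WSS process with a flat spectrum over a bandwidth $B_m$ and standard deviation $\sigma_m$. One common way to synthesize such an input, a hypothetical construction and not necessarily the one used by the authors, is a sum of equal-amplitude cosines with random phases, which is approximately gaussian by the central limit theorem:

```python
import math
import random

def bandlimited_gaussian(sigma_m, bandwidth, n_components=200, seed=0):
    """Return a function m(t) with approximately flat power spectrum over
    (0, bandwidth], zero mean, and standard deviation sigma_m."""
    rng = random.Random(seed)
    freqs = [bandwidth * (j + 0.5) / n_components for j in range(n_components)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    amp = sigma_m * math.sqrt(2.0 / n_components)  # each cosine carries
                                                   # variance sigma_m^2 / N
    def m(t):
        return amp * sum(math.cos(2.0 * math.pi * f * t + p)
                         for f, p in zip(freqs, phases))
    return m

# 10 seconds of a 50 Hz band-limited input with unit standard deviation.
m = bandlimited_gaussian(sigma_m=1.0, bandwidth=50.0)
samples = [m(0.001 * k) for k in range(10000)]
```

Each cosine contributes variance $\sigma_m^2/N$, so the summed signal has variance $\sigma_m^2$; increasing `n_components` sharpens both the gaussianity and the flatness of the spectrum.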
We assume that m(t) and s(t) are (real-valued) jointly weak-sense stationary (WSS) processes with finite variances, $\langle m^2(t) \rangle = \sigma_m^2 < \infty$ and $\langle |s(t) - \lambda|^2 \rangle < \infty$, where $\lambda = \langle s(t) \rangle$ is the mean firing rate of the neuron. The operator $\langle \cdot \rangle$ denotes an average over the joint input and spike train ensemble. The objective in signal estimation is to reconstruct the input m(t) from the spike train s(t) such that the mean square error (MSE) between m(t) and its estimate is minimized. In general, the optimal MSE estimator is mathematically intractable to derive, so we shall restrict ourselves to the optimal linear estimator of the input (denoted by $\hat{m}(t)$), which can be written as

$$\hat{m}(t) = (g \star s)(t). \qquad (2.1)$$
Figure 2: Block diagram of the signal estimation paradigm. (A) A noisy spikeencoding mechanism transforms a random time-varying input m ( t ) drawn from a probability distribution into a spike train s ( t ) . Techniques from statistical esO ( t ) of the input timation theory are used to derive the optimal linear estimate, m m ( t ) from the spike train s ( t ) . m ( t) is a gaussian, band-limited, wide sense stationary (WSS) stochastic process with a power spectrum Smm ( f ) that is at over a bandwidth Bm and whose standard deviation is denoted by sm . (B) Variability of the spike train is characterized by CV, the ratio of the standard deviation sT of the interspike intervals (Ti D tiC 1 ¡ ti ) to the mean interspike interval m T . The estimation performance is characterized by the coding fraction j D 1 ¡ E / sm . E is the mean-square error between the time-varying input m ( t) and its opO ( t) from the spike train s ( t ) . (C) Correspondence timal linear reconstruction m between CV and other measures of spike irregularity. Using the procedure described in Mainen and Sejnowski (1995), reliability and precision are estimated from responses of the model to repeated presentations of the same input. The spike sequences are used to obtain the poststimulus time histogram (PSTH) shown in the lowest trace. Instances when the PSTH exceeds a chosen threshold (dotted line) are termed events. Reliability is dened as the fraction of spikes occurring during these events, and precision is dened as the mean length of the events. The inverse relationship between reliability and CV (as the input bandwidth is varied) validates our use of CV as a representative measure of spike variability. A similar relationship exists between precision and CV (data not shown).
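The CV of Figure 2B is straightforward to compute from a list of spike times. A minimal sketch (the population-style 1/n variance is used rather than the n−1 form, an arbitrary choice):

```python
import math

def coefficient_of_variation(spike_times):
    """CV = sigma_T / mu_T of the interspike intervals T_i = t_{i+1} - t_i."""
    isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
    mu_t = sum(isis) / len(isis)
    var_t = sum((t - mu_t) ** 2 for t in isis) / len(isis)
    return math.sqrt(var_t) / mu_t

# A perfectly regular spike train has CV = 0; a Poisson train has CV near 1.
print(coefficient_of_variation(list(range(1, 101))))  # 0.0
```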
The reconstruction noise $\hat{n}(t)$ for the estimation task is given as the difference between the input and the optimal estimate,

$$\hat{n}(t) = \hat{m}(t) - m(t), \qquad (2.2)$$

and the reconstruction error E is equal to the variance of $\hat{n}(t)$, $E = \langle \hat{n}^2 \rangle$. As shown in Gabbiani (1996) and Gabbiani and Koch (1996), $\langle \hat{n}^2 \rangle \le \langle m^2 \rangle$ for the optimal linear estimator. Following their conventions, we define a normalized dimensionless quantity, called the coding fraction $\xi$, as follows:

$$\xi = 1 - \frac{E}{\sigma_m^2}, \qquad 0 \le \xi \le 1. \qquad (2.3)$$
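Equation 2.3 translates directly into code. A sketch (for the optimal linear estimator $\xi$ lies in [0, 1], but a poor, non-optimal reconstruction can drive this sample estimate negative):

```python
def coding_fraction(m, m_hat):
    """Coding fraction xi = 1 - E / sigma_m^2 (equation 2.3) computed from
    samples of the input m and its reconstruction m_hat."""
    n = len(m)
    mean = sum(m) / n
    sigma2 = sum((x - mean) ** 2 for x in m) / n   # input variance
    mse = sum((x - y) ** 2 for x, y in zip(m, m_hat)) / n
    return 1.0 - mse / sigma2

m = [0.0, 1.0, 2.0, 3.0, 4.0]
print(coding_fraction(m, m))          # 1.0  (perfect reconstruction)
print(coding_fraction(m, [2.0] * 5))  # 0.0  (predicting the mean: chance)
```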
$\xi$ can be regarded as a quantitative measure of estimation performance; $\xi = 0$ implies that the input modulations cannot be reconstructed at all, denoting estimation performance at chance, whereas $\xi = 1$ implies that the input modulations can be perfectly reconstructed. Another measure of estimation performance is the mutual information rate, denoted by $I[m(t); s(t)]$, between the random processes m(t) and s(t) (Cover & Thomas, 1991). The data processing inequality (Cover & Thomas, 1991) maintains that the mutual information rate between the input m(t) and the spike train s(t) is greater than the mutual information between m(t) and its optimal estimate $\hat{m}(t)$, $I[m(t); s(t)] \ge I[m(t); \hat{m}(t)]$. When the input m(t) is gaussian, it can be shown (Gabbiani, 1996; Gabbiani & Koch, 1996) that $I[m(t); \hat{m}(t)]$ (and thus $I[m(t); s(t)]$) is bounded below by

$$I[m(t); \hat{m}(t)] \ge I_{LB} = \frac{1}{2} \int_{-\infty}^{\infty} df\, \log_2\bigl[\mathrm{SNR}(f)\bigr] \quad (\text{bits sec}^{-1}), \qquad (2.4)$$

where SNR(f) is the signal-to-noise ratio defined as

$$\mathrm{SNR}(f) = \frac{S_{mm}(f)}{S_{\hat{n}\hat{n}}(f)}. \qquad (2.5)$$

$S_{mm}(f)$ and $S_{\hat{n}\hat{n}}(f)$ are the power spectral densities of the input m(t) and the noise $\hat{n}(t)$. In our simulations, SNR(f) may be greater than one since we are separately adjusting the mean and variance of the input signal and the bandwidth of the noise. The lower bound on the information rate, $I_{LB}$, lies in the interval $(0, \infty)$. The lower limit, $I_{LB} = 0$, corresponds to chance performance ($\xi = 0$), whereas the upper limit, $I_{LB} = \infty$, corresponds to perfect estimation ($\xi = 1$). The information rate denotes the amount of information about the input (measured in units of bits) that can be reliably transmitted per second in
the form of spike trains. Clearly, it depends on the rate at which spikes are generated: the higher the mean firing rate, the higher the maximum amount of information that can be transmitted per second. Thus, in order to eliminate this extrinsic dependence on the mean firing rate, we define the quantity $I_S = I_{LB}/\lambda$, which measures the amount of information communicated per spike on average. $I_S$ is measured in units of bits per spike. Thus, the coding fraction ($\xi$) and the information rates ($I_{LB}$, $I_S$) can be used to assess the ability of the spiking models to encode time-varying inputs in the specific context of signal estimation (Schneidman, Segev, & Tishby, 2000).

2.3 Models of Spike Encoding. We have previously reported the coding efficiency for a noisy nonadapting integrate-and-fire model, as well as for a stochastic version of the Hodgkin-Huxley kinetic scheme (Steinmetz et al., in press). A major goal of this work was to expand this analysis to use more biophysically realistic encoders.
2.3.1 Integrate-and-Fire Model. Integrate-and-fire models (I&F) are simplified, phenomenological descriptions of spiking behavior in biological neurons (Tuckwell, 1988). They retain two important aspects of neuronal firing: a subthreshold regime, where the input to the neuron is passively integrated, and a voltage threshold, which, when exceeded, leads to the generation of stereotypical spikes. Although I&F models are physiologically inaccurate, they are often used to model biological spike trains because of their analytical tractability. Real neurons show evidence of firing-rate adaptation; their firing rate decreases with time in response to constant, steady inputs. Such adaptation can be caused by processes like the release of neurotransmitters and neuromodulators and the presence of specific ionic currents (Ca2+-dependent, slow K+) among others. Wehmeier, Dong, Koch, and van Essen (1989) introduced an I&F model with a purely time-dependent shunting conductance, $g_{adapt}$, with a reversal potential equal to the resting potential to account for short-term (10–50 millisecond) adaptation. Each spike increases $g_{adapt}$ by a fixed amount $G_{inc}$. Between spikes, $g_{adapt}$ decreases exponentially with a time constant $\tau_{adapt}$. This models the effect of membrane Ca2+-dependent potassium conductance, reproducing the effect of a relative refractory period following spike generation. We refer to this model as an adapting integrate-and-fire model. An absolute refractory period $\tau_{ref}$ is required in order to mimic very short-term adaptation. In the subthreshold domain, the membrane voltage $V_m$ is given by

$$C \frac{dV_m}{dt} + \frac{V_m \bigl(1 + R\, g_{adapt}\bigr)}{R} = i(t), \qquad (2.6)$$

$$\tau_{adapt} \frac{dg_{adapt}}{dt} + g_{adapt} = G_{inc}\, \tau_{adapt} \sum_i \delta(t - t_i). \qquad (2.7)$$
354
A. Manwani, P. N. Steinmetz, and C. Koch
Table 1: Parameters for the Leaky Integrate-and-Fire Model.

  Vth       16.4 mV
  C         0.207 nF
  R         38.3 MΩ
  τref      2.68 msec
  Ginc      20.4 nS
  τadapt    52.3 msec
  Δt        0.5 msec
When Vm reaches the voltage threshold Vth at time ti, a spike is generated if ti − ti−1 > τref, where ti−1 is the time of the previous spike. When a spike is generated, gadapt(ti) is also incremented by Ginc. The model is completely characterized by six parameters: Vth, C, R, τref, Ginc, and τadapt. The values used for our simulations are adapted from Koch (1999) and are given in Table 1. Biological neurons in vivo show a substantial variability in the exact timing of action potentials to identical stimulus presentations (Calvin & Stevens, 1968; Softky & Koch, 1993; Shadlen & Newsome, 1998). A simple modification to reproduce the random nature of biological spike trains is to regard the voltage threshold Vth as a random variable drawn from some arbitrary probability distribution p(Vth) (Holden, 1976). We refer to this class as integrate-and-fire models with random threshold. In general, p(Vth) can be arbitrary, but here we assume that it is an nth-order gamma distribution,

pn(Vth) = cn (Vth / V̄th)^(n−1) exp(−n Vth / V̄th),      (2.8)

with

cn = n^n / ((n − 1)! V̄th),

where V̄th denotes the mean voltage threshold. The order of the distribution, n, determines the variability of spike trains in response to the injection of a constant current. Thus, one can obtain spike trains of varying regularity by modifying n. For constant current injection and in the absence of a refractory period, the CV varies from CV = 1 to CV = 0 as n is increased from n = 1 to n = ∞ (corresponding to a deterministic threshold). A schematic diagram of the adapting I&F model with random threshold used in this article is shown in Figure 1A.

2.3.2 Stochastic Ion Channel Model. While a proper adjustment of model parameters allows I&F models to provide a fairly accurate description of
Signal Encoding in Noisy Spiking Models
355
the firing properties of some cortical neurons (Stevens & Zador, 1998; Koch, 1999), many neurons cannot be modeled by I&F models. Nerve membranes contain several voltage- and ligand-gated ionic currents, which are responsible for a variety of physiological properties that phenomenological models fail to capture. The successful elucidation of the ionic basis underlying neuronal excitability in the squid giant axon by Hodgkin and Huxley (1952) led to the development of more sophisticated mathematical models that described the initiation and propagation of action potentials by explicitly modeling the different ionic currents flowing across a neuronal membrane. In the original Hodgkin and Huxley model, membrane currents were expressed in terms of macroscopic deterministic conductances representing the selective permeabilities of the membrane to different ionic species. However, it is now known that the macroscopic currents arise as a result of the summation of stochastic microscopic currents flowing through a multitude of ion channels in the membrane. Ion channels have been modeled as finite-state Markov chains with state transition probabilities proportional to the kinetic transition rates between different conformational states (Skaugen & Walløe, 1979; Clay & DeFelice, 1983; Strassberg & DeFelice, 1993). In earlier research, we studied the influence of the stochastic nature of voltage-gated ion channels in excitable neuronal membranes on subthreshold membrane voltage fluctuations (Steinmetz, Manwani, Koch, London, & Segev, 2000; Manwani, Steinmetz, & Koch, 2000). Here we are interested in assessing the influence of variability in spike timing on the ability of noisy spiking mechanisms to encode time-varying inputs, in the context of stochastic ion channel models. For voltage-gated ion channels, the kinetic transition rates (and other parameters determined by them) are functions of the membrane voltage Vm.
As in the case of the I&F model, a band-limited white noise current m(t) is injected into a patch of membrane containing stochastic voltage-gated ion channels, and Monte Carlo simulations are carried out to determine the response of the model to random suprathreshold stimuli (see Figure 1B). The ion channel model we consider here is a stochastic counterpart of the single-compartment model of a regular-spiking cortical neuron developed by Golomb and Amitai (1997). The original version consists of a fast sodium current, a persistent sodium current, a delayed-rectifier potassium current, an A-type potassium current (for adaptation), a slow potassium current, a passive leak current, and excitatory synaptic (AMPA- and NMDA-type) currents. We are interested in the variability due to the stochastic nature of ion channels, and so here we assume that the synaptic currents are absent. In order to simulate stochastic Markov models of the ion channel kinetics associated with five voltage-dependent ionic currents, we performed Monte Carlo simulations of single-compartmental models of membrane patches of area A. The Markov models correspond to equations A1 through A20 of Golomb and Amitai (1997) and were constructed using methods given in
Table 2: Parameters for the Stochastic Ion Channel Model.

  Na⁺ current (INa):                   2 states for h;  gNa = 0.24 nS/μm²;   γNa = 0.18 pS
  Persistent Na⁺ current (INaP):       Deterministic
  Delayed rectifier K⁺ current (IKdr): 5 states for n;  gKdr = 0.24 nS/μm²;  γKdr = 0.21 pS
  A-type K⁺ current (IKA):             2 states for b;  gKA = 0.24 nS/μm²;   γKA = 0.020 pS
  Slow K⁺ current (IK-slow):           2 states for z;  gKsl = 0.24 nS/μm²;  γKsl = 0.20 pS
  Leakage current (IL):                Deterministic
Skaugen and Walløe (1979), Skaugen (1980), and Steinmetz et al. (2000). The particular parameters used for these simulations are given in Table 2. We chose a sufficiently small time step for the simulation so that the membrane voltage can be assumed to be relatively constant over the duration of the step. The voltage-dependent state transition probabilities are computed for each time step.¹ Knowledge of the transition probabilities between states is used to determine the modified channel populations occupying different states. This is done by drawing random numbers specifying the number of channels making transitions between any two states from multinomial distributions parameterized by the transition probabilities. The membrane conductance due to a specific ion channel is determined by tracking the number of members of the given channel type that are in its open state. This membrane current is integrated over the time step to compute the membrane voltage for the next time step. This procedure is applied iteratively to obtain the membrane voltage trajectory in response to the input current waveform. For a detailed description of the Monte Carlo simulations, see Schneidman et al. (1998) and Steinmetz et al. (2000). The voltage trajectory is transformed into a point process by considering the instance of the voltage crossing a threshold (here, 10 mV with respect to resting potential) as a spike occurrence. Thresholding allows us to treat the output of the model as a sequence of spike times rather than as membrane voltage modulations. This simple recipe to detect spikes works well for the model we consider here.

¹ The transition probabilities are computed by multiplying the corresponding rates by the length of the time step, assuming the product is much smaller than one.
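The multinomial update step described above can be illustrated on a hypothetical two-state channel (closed ↔ open), for which the multinomial draw reduces to two binomials. The rates and channel count below are made-up values for illustration, not the Table 2 parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_two_state(n_open, n_total, alpha, beta, dt):
    """One Monte Carlo step for a two-state (closed <-> open) channel.

    alpha, beta: opening/closing rates (1/s); dt is chosen so that
    alpha*dt and beta*dt (the per-step transition probabilities) are << 1.
    """
    n_closed = n_total - n_open
    # The number of channels making each transition this step is drawn from
    # a binomial (the two-state special case of the multinomial draw).
    opening = rng.binomial(n_closed, alpha * dt)
    closing = rng.binomial(n_open, beta * dt)
    return n_open + opening - closing

# Relax 1000 channels toward the steady state p_open = alpha/(alpha+beta)
n_open, n_total = 0, 1000
alpha, beta, dt = 100.0, 300.0, 1e-5   # hypothetical voltage-clamped rates
for _ in range(50000):
    n_open = update_two_state(n_open, n_total, alpha, beta, dt)
print(n_open / n_total)   # fluctuates around alpha/(alpha+beta) = 0.25
```

In the full model, each current in Table 2 has its own state diagram, and the open count sets the conductance that enters the membrane-voltage update.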
3 Results
We carried out simulations for the I&F model and the stochastic ion channel model and recorded the output spike times in response to the injection of a current equal to a mean current, I, plus pseudo-random, gaussian, band-limited noise (flat power spectrum Smm(f) over bandwidth Bm, f ∈ (0, Bm]). We then computed the CV of the interspike interval distribution and the coding fraction, ξ, for the signal estimation task. CV measures the variability of the spike train in response to the input, whereas the coding fraction quantifies the fraction of the variability in the spike train which is functionally useful to reconstruct the input modulations. Once again, our goal is to understand how spike timing variability influences performance in a specific biological information processing task (here, a signal estimation task).

3.1 Dependence of Variability on Firing Rate and Bandwidth. First, we explored the dependence of the coefficient of variability of the interspike interval, CV, on the mean firing rate λ of the spiking models. Figure 3A shows the CV as a function of λ for an input bandwidth of Bm = 50 Hz. The mean firing rate λ for the I&F model depends on only the mean injected current I, whereas for a membrane patch containing stochastic ion channels, λ depends on a variety of additional parameters, such as the area of the patch A, the standard deviation of the stochastic input noise current σm, and the bandwidth of the input Bm. For both models, we varied λ while maintaining the contrast of the input, defined as c = σm / I, constant at c = 1/3. In both cases, CV increased monotonically with mean firing rate. When the contrast is kept constant, an increase in I (to increase the mean firing rate) requires a corresponding increase in the magnitude of the fluctuations σm. This results in an increase in the amplitude of firing-rate modulations, allowing the input to be estimated more accurately. Figure 3B shows the CV for the two models as a function of the bandwidth of the input Bm.
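The CV used throughout is computed directly from the interspike intervals; a minimal sketch (the regular and Poisson trains below are illustrative limiting cases, not the models' output):

```python
import numpy as np

def cv_isi(spike_times):
    """Coefficient of variability of the interspike-interval distribution."""
    isi = np.diff(np.asarray(spike_times))
    return isi.std() / isi.mean()

# A perfectly regular train has CV = 0; a Poisson train has CV near 1.
regular = np.arange(0.0, 10.0, 0.02)               # 50 Hz clock-like train
rng = np.random.default_rng(1)
poisson = np.cumsum(rng.exponential(0.02, 5000))   # 50 Hz Poisson train
print(cv_isi(regular), cv_isi(poisson))
```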
For both models, total noise power was held constant at all bandwidths, and I was adjusted so that the mean firing rate λ was approximately equal to 50 Hz. In both cases, CV decreases with increasing Bm, which is in qualitative agreement with earlier experimental (Mainen & Sejnowski, 1995) and computational (Schneidman et al., 1998) findings demonstrating an inverse relationship between spike timing precision (using a measure different from CV) and the temporal bandwidth of the input. Strong temporal dynamics in the input to a neuron dwarf the effect of the inherent noise in the spiking mechanism and regularize the spike train. Within the class of I&F models, models with higher n fire more regularly and have lower CV values, as expected.

3.2 Dependence of Encoding Performance on Firing Rate and Bandwidth. Next, we explored the dependence of coding efficiency on the mean
Figure 3: Variability and coding efficiency of spiking models. (A) CV of the interspike interval distribution of the spike train as a function of the mean firing rate of the spike train λ. The input to the model is a gaussian, white, band-limited (bandwidth Bm = 50 Hz) input, with mean I and standard deviation σm. The mean firing rate of the model is varied by changing the mean current I while maintaining the contrast of the input, defined as c = σm / I, constant (c = 1/3). The solid curves correspond to the adapting I&F model for different values of the order n of the gamma-distributed voltage threshold distribution (n = ∞ corresponds to a deterministic threshold). The dotted curve corresponds to a 1000 μm² membrane patch containing stochastic ion channels. (B) CV as a function of input bandwidth Bm. λ for both models was maintained at 50 Hz. (C, D) The dependence of the coding fraction ξ in the signal estimation task for the two types of spiking models on the mean firing rate λ (for Bm = 50 Hz) and the input bandwidth Bm (for λ = 50 Hz), respectively. Model parameters are summarized in the caption of Figure 1.
firing rate and the input bandwidth. In Figures 3C and 3D, the coding fraction ξ is plotted as a function of λ (for Bm = 50 Hz) and Bm (for λ = 50 Hz), respectively. For both spike-encoding mechanisms, ξ increases with mean
firing rate and decreases with input bandwidth. One interpretation of the previously observed decrease in variability of spike timing with input bandwidth (see Figure 3B) is that it suggests an improvement in coding: that, in a sense, the neuron prefers higher bandwidths (Schneidman, Freedman, & Segev, 1997). By contrast, we find that the noisy spike-encoding models considered here encode slowly varying stimuli more effectively than rapid ones. We believe that in the context of a signal estimation task, this is a generic property of all models that encode continuous signals as firing-rate modulations of sequences of discrete spikes.

3.3 Dependence of Mutual Information on Firing Rate and Bandwidth.
Next, we explored the dependence of the mutual information rates on the mean firing rate and input bandwidth. Figures 4A and 4B, respectively, show that the lower bound of the mutual information rate ILB increases with λ and Bm. This behavior can be better understood in the light of the phenomenological expression ILB = Bm log₂(1 + k λ / Bm), where k is a constant that depends on the details of the encoding scheme. The above expression is exact when the instantaneous firing rate of the model is a linear function of the input (as in the case of the perfect I&F model) and the input is a white band-limited gaussian signal with bandwidth Bm. For a Poisson model without adaptation, k = c²/2. (Details of the derivation of this expression are provided in Manwani and Koch, 2001.) From the above expression, one can deduce that for low firing rates, ILB increases linearly with λ, but at higher rates, ILB becomes logarithmic in λ. One can also conclude that ILB increases with Bm for small bandwidths but quickly saturates at high bandwidths at the value k λ / ln 2. This qualitatively agrees with Figures 4A and 4B. The dependence of the information rate per spike IS on λ and Bm can be similarly explored. The expression for ILB is sublinear with respect to λ, and thus one can deduce that IS should decrease monotonically with firing rate when the bandwidth Bm is held fixed. In fact, its maximum value, IS = k, occurs at λ = 0. On the other hand, when λ is held fixed, IS should increase with Bm initially but saturate at high bandwidths at k / ln 2. Once again, Figures 4C and 4D agree qualitatively with these predictions.

3.4 Relationship Between Variability and Encoding Performance. In order to understand further the role of variability in the context of signal estimation, we plot measures of coding efficiency (ξ and ILB / Bm) versus the corresponding CV values for the two models as different parameters are varied.
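The limiting behaviors of the Section 3.3 expression ILB = Bm log₂(1 + k λ / Bm) can be checked numerically; a brief sketch using the Poisson-model constant k = c²/2 with c = 1/3:

```python
import numpy as np

def info_rate(lam, Bm, k):
    """Phenomenological lower bound from Section 3.3: Bm * log2(1 + k*lam/Bm)."""
    return Bm * np.log2(1.0 + k * lam / Bm)

k, Bm = (1.0 / 3.0) ** 2 / 2.0, 50.0   # k = c^2/2 for a Poisson model, c = 1/3

# ILB grows sublinearly in lam, so IS = ILB/lam falls with firing rate:
for lam in (1.0, 10.0, 100.0, 1000.0):
    print(lam, info_rate(lam, Bm, k), info_rate(lam, Bm, k) / lam)

# ILB saturates with bandwidth at the value k*lam/ln 2:
lam = 50.0
print(info_rate(lam, 1e6, k), k * lam / np.log(2.0))
```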
Figures 5A and 5B show the dependence of coding performance on the variability of spike timing for the I&F model, and Figures 5C and 5D show the corresponding behaviors for the stochastic ion channel model. For both models, estimation performance improves with variability when the mean firing rate was increased or the input bandwidth was decreased. This implies that the variability in the output spike train represents faithful encoding of the input modulations, and thus greater variability leads to better signal estimation.

Figure 4: Information rates in signal estimation for spiking models. (A) Lower bounds of the information rate ILB for the two spiking model classes considered in this article. The solid curves correspond to the adapting I&F model for different values of n, and the dotted curve corresponds to the stochastic ion channel model. As in Figure 3, the input is a band-limited gaussian process with bandwidth Bm = 50 Hz. (B) ILB as a function of the input bandwidth Bm for λ = 50 Hz. (C) The mutual information transmitted per spike on average, IS = ILB / λ, as a function of λ (Bm = 50 Hz). (D) IS as a function of input bandwidth Bm for λ = 50 Hz. Model parameters are summarized in the caption of Figure 1.

On the other hand, when either the order of the gamma distribution for the I&F model or the area of the membrane patch for the stochastic ion channel model was decreased, coding performance decreased with variability. This suggests that the variability is due to noise (randomness of the spiking threshold). Thus, we find that the effect of spike timing variability on the coding efficiency of spiking models may be beneficial or detrimental; the direction of its influence depends on the specific nature of the signal processing task the neuron is expected to perform (signal estimation here) and the parameter that is varied.
Figure 5: Is it signal or is it noise? Parametric relationships between measures of coding efficiency and the variability of the spike train as different parameters were varied for the two spiking models. (A) Coding fraction ξ and (B) mutual information transmitted per input time constant ILB / Bm for the I&F model as a function of the CV of the spike train. Squares plot results while input bandwidth was varied between 10 and 150 Hz. Open circles plot results while the mean input was varied to change the firing rate from 40 to 92 Hz. Filled circles show results for the order of the gamma distribution of thresholds varied from 2 to infinity. The increase in estimation performance with CV when the mean firing rate λ was increased or the input bandwidth Bm was decreased (with n = ∞ for the I&F model) suggests that the variability arises as a result of faithful encoding of the input and thus represents signal, whereas a decrease with CV when the order n of the threshold distribution was decreased suggests that the variability impedes encoding and thus represents noise. (C) Coding fraction ξ and (D) mutual information transmitted per input time constant ILB / Bm for the stochastic ion channel model as a function of the CV of the spike train as the mean firing rate λ (A = 1000 μm², Bm = 50 Hz, open circles), input bandwidth Bm (A = 1000 μm², λ = 50 Hz, open squares), and the area of the patch A (Bm = 50 Hz, λ = 50 Hz, filled circles) were varied.
3.5 Mean Rate Code in Signal Estimation. Figures 6A and 6B demonstrate that for the spiking models we have considered here, performance in the signal estimation task is determined by the ratio λ / Bm and not by
Figure 6: Coding efficiency as a function of λ / Bm. (A) Coding fraction ξ and (B) mutual information transmitted per input time constant, ILB / Bm, for the two spiking models as a function of the mean number of spikes available per input time constant, λ / Bm, for different combinations of Bm and λ (empty symbols: Bm varied, filled symbols: λ varied). The solid curves correspond to the adapting I&F model (different symbols represent different values of the order, n, of the voltage threshold gamma distribution), whereas the dotted curve corresponds to a 1000 μm² membrane patch containing stochastic ion channels. The contrast of the input, c, was maintained at one-third.
the absolute values of λ and Bm. The quantity λ / Bm represents the number of spikes observed during an input time constant, a time interval over which the input is relatively constant. Thus, the larger the number of spikes available for the estimation task, the better the estimate of the neuron's instantaneous firing rate λ(t) and, consequently, the better the estimate of
the instantaneous value of the input m(t). This suggests that the relevant variable that encodes the input modulations is the neuron's (instantaneous) firing rate computed over time intervals of length 1 / Bm. Furthermore, the more action potentials available per encoding time constant, the more efficiently information can be estimated using mean rate codes.

4 Discussion
In this article, we use two types of noisy spiking models to study the influence of spike timing variability on the ability of single neurons to encode time-varying signals in their output spike trains. Here we extend the preliminary results for simplified channel models reported previously (Steinmetz et al., in press) to more biophysically realistic models of spike encoding. For both I&F models with noisy thresholds and stochastic ion channel encoders, we find that decreased spike timing variability, as assayed using the CV of the interspike interval distribution, does not necessarily translate into an increase in performance for all signal processing tasks. These results show that although the variability of spike timing decreases for these encoders as the bandwidth of the input is increased, the ability to estimate random continuous signals drops. Conversely, an increase in variability with firing rate also causes an increase in estimation performance. A similar connection between increased variability and increased signal detection performance is observed in systems that exhibit stochastic resonance (Chialvo, Dykman, & Millonas, 1995; Chialvo, Longtin, & Muller-Gerking, 1997; Collins, Chow, & Imhoff, 1995a, 1995b; Henry, 1999; Russell, Wilkens, & Moss, 1999). In these systems, the addition of noise both increases output variability and improves signal transduction. In the encoding models studied here, noise is added by the stochastic nature of the ion channels responsible for action potential production, which could evoke stochastic resonance for specific encoding tasks. For hippocampal CA1 cells, the addition of synaptic noise has recently been shown to evoke stochastic resonance when detecting periodic pulse trains (Stacey & Durand, 2000). The observed trends of CV as a function of firing rate (cf.
Figure 3) are in agreement with those previously reported (Christodoulou & Bugmann, 2000; Tiesinga, Jose, & Sejnowski, 2000) and correspond to an encoder driven by gaussian noise, but not to a system driven by a Poisson process (Tiesinga et al., 2000). We have previously shown that for subthreshold voltages, noise generated by stochastic channel models is well approximated by a gaussian distribution (Steinmetz et al., 2000); thus, the combination of these observations suggests that a gaussian distribution may serve as a good approximation for suprathreshold effects as well. The trends of coding fraction as a function of input bandwidth are also in qualitative agreement with experimental measurements of the coding fraction in cortical pyramidal cells reported by Fellous et al. (2001), although
there are differences in the input signal, which was sinusoidal for these experiments. While interpreting these results, a few limitations must be borne in mind. We measure coding performance for gaussian, white band-limited inputs in the context of a specific signal estimation paradigm. Generally, the computation performed by real neurons in the brain and the statistical properties of the milieu of signals in which they operate are difficult to determine. Estimation performance using white noise does not shed light on the operation of neural systems highly specialized to detect specific input patterns or those optimized to process natural and ecologically relevant signals with specific statistical properties. However, in the absence of knowledge regarding the role of a single neuron, the coding fraction for white noise stimuli represents a convenient metric to quantify its behavior. The second general limitation is that we employ simple linear decoding to recover the input from the spike train, which is inferior in performance to general nonlinear decoding mechanisms. However, it has been argued that when the mean rate or some function of it is the relevant encoding variable, the difference in performance between linear and nonlinear estimators is marginal (Rieke, Warland, van Steveninck, & Bialek, 1997) and the coding fraction is a good indicator of coding efficiency. Finally, this study assumes that information is encoded at the level of single neurons. The investigation of the role of variability in the ability of a population of neurons to encode information is a significantly more complicated and interesting problem, since temporal synchrony between groups of neurons in a population may enhance or decrease coding efficiency. We are currently investigating this issue. Earlier studies have shown that neurons fire more predictably and precisely when their inputs have richer temporal structure (Mainen & Sejnowski, 1995; Schneidman et al., 1998).
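The simple linear decoding invoked in the limitations above can be sketched in a toy setting: a band-limited gaussian signal observed through additive white noise (a crude stand-in for a noisy spike train) and reconstructed with a frequency-domain Wiener filter. All parameters here are illustrative, and the noise spectrum is taken as known rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, dt = 2**16, 1e-3
f = np.fft.rfftfreq(n, dt)

# Band-limited gaussian "input" s(t) (bandwidth 50 Hz), observed with
# additive white noise of standard deviation 1.
s = np.fft.irfft(np.fft.rfft(rng.standard_normal(n)) * (f < 50.0))
s /= s.std()                     # unit variance
noise_sd = 1.0
y = s + noise_sd * rng.standard_normal(n)

# Wiener filter H(f) = S_ss / (S_ss + S_nn); the signal spectrum is known
# here because the signal is synthetic, the noise level by assumption.
S = np.fft.rfft(s)
S_ss = np.abs(S) ** 2            # realized signal spectrum
S_nn = n * noise_sd ** 2         # expected white-noise periodogram level
H = S_ss / (S_ss + S_nn)
s_hat = np.fft.irfft(H * np.fft.rfft(y))   # linear estimate of s(t)

xi = 1.0 - np.mean((s - s_hat) ** 2) / s.var()
print(xi)   # coding fraction of the linear reconstruction
```

In the paper's setting, the "observation" is the spike train itself and the filter is estimated from signal-spike cross-spectra, but the coding-fraction bookkeeping is the same.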
It has also been argued that the ability to alter the accuracy of their representation depending on the nature of their inputs may enable neurons to adapt to the statistical structure of their ecological environment and act as "smart encoders" (Schneidman et al., 1998; Brenner, Rieke, & de Ruyter van Steveninck, 2000; Cecchi et al., 2000). One example would be using a coarse rate code to encode slowly varying input signals but a fine temporal code to encode rapidly varying inputs. While we agree with the basic premise of the argument, viewing variability from a different paradigm, namely signal estimation, and assaying variability with CV, suggests that variability could represent the input signal in biophysically plausible encoders; thus, encoding efficiency increases with increasing variability for several of the encoders examined here. For those encoders, spike timing reliability decreases with increasing CV, as shown in Figure 2, so in these cases, reliability will be decreasing as coding efficiency increases, providing one counterexample wherein an increase in reliability does not lead to better performance in a signal reconstruction
task. This leads us to argue that the role of variability depends intimately on the nature of the information processing task and the nature of the spike encoder. These results also highlight the need to further measure and understand biophysical noise sources and the mechanisms of computation in cortical neurons. Acknowledgments
This work was funded by the NSF Center for Neuromorphic Systems Engineering at Caltech, the NIMH, and the Sloan-Swartz Center for Theoretical Neuroscience. We thank Idan Segev, Michael London, and Elad Schneidman for their invaluable suggestions.

References

Abeles, M. (1990). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Bialek, W., Rieke, F., van Steveninck, R. R. D., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motoneuron interspike intervals. J. Neurophysiol., 31, 574–587.
Cecchi, G. A., Mariano, S., Alonso, J. M., Martinez, L., Chialvo, D. R., & Magnasco, M. O. (2000). Noise in neurons is message-dependent. PNAS, 97(10), 5557–5561.
Chialvo, D. R., Dykman, M. I., & Millonas, M. M. (1995). Fluctuation-induced transport in a periodic potential: Noise versus chaos. Physical Review Letters, 78(8), 1605.
Chialvo, D. R., Longtin, A., & Muller-Gerking, J. (1997). Stochastic resonance in models of neuronal ensembles. Physical Review E, 55(2), 1798–1808.
Christodoulou, C., & Bugmann, G. (2000). Near Poisson-type firing produced by concurrent excitation and inhibition. Biosystems, 58, 41–48.
Clay, J. R., & DeFelice, L. J. (1983). Relationship between membrane excitability and single channel open-close kinetics. Biophys. J., 42, 151–157.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995a). Aperiodic stochastic resonance in excitable systems. Physical Review E, 52(4), R3321–R3324.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995b). Stochastic resonance without tuning. Nature, 376(6537), 236–238.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Fellous, J.
M., Houweling, A. R., Modi, R. H., Rao, R. P. N., Tiesinga, P. H. E., & Sejnowski, T. J. (2001). Frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. J. Neurophysiol., 85, 1782–2001.
Gabbiani, F. (1996). Coding of time-varying signals in spike trains of linear and half-wave rectifying neurons. Network: Comput. Neural Syst., 7, 61–85.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comput., 8, 44–66.
Golomb, D., & Amitai, Y. (1997). Propagating neuronal discharges in neocortical slices: Computational and experimental study. J. Neurophysiol., 78, 1199–1211.
Henry, K. R. (1999). Noise improves transfer of near-threshold, phase-locked activity of the cochlear nerve: Evidence for stochastic resonance? J. Comp. Physiol. (A), 184(6), 577–584.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Holden, A. V. (1976). Models of the stochastic activity of neurones. New York: Springer-Verlag.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Manwani, A., & Koch, C. (2001). Detecting and estimating signals over noisy and unreliable synapses: Information-theoretic analysis. Neural Comput., 13, 1–33.
Manwani, A., Steinmetz, P. N., & Koch, C. (2000). Channel noise in excitable neuronal membranes. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 143–149). Cambridge, MA: MIT Press.
Rieke, F., Warland, D., van Steveninck, R. D. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Russell, D. F., Wilkens, L. A., & Moss, F. (1999). Use of behavioural stochastic resonance by paddlefish for feeding. Nature, 402(6759), 291–294.
Schneidman, E., Freedman, B., & Segev, I. (1997). Ion-channel stochasticity may be a critical factor in determining the reliability of spike timing.
Neurosci-L Supplement, 48, 543–544.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion-channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Comput., 10, 1679–1703.
Schneidmann, E., Segev, I., & Tishby, N. A. (2000). Information capacity and robustness of stochastic neuron models. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Skaugen, E. (1980). Firing behavior in stochastic nerve membrane models with different pore densities. Acta Physiol. Scand., 108, 49–60.
Skaugen, E., & Walløe, L. (1979). Firing behavior in a stochastic nerve membrane model based upon the Hodgkin-Huxley equations. Acta Physiol. Scand., 107, 343–363.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stacey, W. C., & Durand, D. M. (2000). Stochastic resonance improves signal detection in hippocampal CA1 neurons. J. Neurophysiol., 83(3), 1394–1402.
Steinmetz, P. N., Manwani, A., & Koch, C. (in press). Variability and coding efficiency of noisy neural spike encoders. Biosystems.
Steinmetz, P. N., Manwani, A., Koch, C., London, M., & Segev, I. (2000). Subthreshold voltage noise due to channel fluctuations in active neuronal membranes. J. Comput. Neurosci., 9, 133–148.
Stevens, C. F., & Zador, A. M. (1998). Novel integrate-and-fire-like model of repetitive firing in cortical neurons. In Proceedings of the 5th Joint Symposium on Neural Computation (La Jolla, CA).
Strassberg, A. F., & DeFelice, L. J. (1993). Limitations of the Hodgkin-Huxley formalism: Effect of single channel kinetics on transmembrane voltage dynamics. Neural Comput., 5, 843–855.
Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comput. Neurosci., 2, 149–162.
Tiesinga, P. H. E., Jose, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Physical Review E, 62(6), 8413–8419.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology II: Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
van Steveninck, R. R. D., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wehmeier, U., Dong, D., Koch, C., & van Essen, D. (1989). Modeling the mammalian visual system. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press. Received June 20, 2000; accepted May 2, 2001.
LETTER
Communicated by Wulfram Gerstner
Temporal Correlations in Stochastic Networks of Spiking Neurons

Carsten Meyer
[email protected]
Carl van Vreeswijk
[email protected]
Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

The determination of temporal and spatial correlations in neuronal activity is one of the most important neurophysiological tools to gain insight into the mechanisms of information processing in the brain. Its interpretation is complicated by the difficulty of disambiguating the effects of architecture, single-neuron properties, and network dynamics. We present a theory that describes the contribution of the network dynamics in a network of "spiking" neurons. For a simple neuron model including refractory properties, we calculate the temporal cross-correlations in a completely homogeneous, excitatory, fully connected network in a stable, stationary state, for stochastic dynamics in both discrete and continuous time. We show that even for this simple network architecture, the cross-correlations exhibit a large variety of qualitatively different properties, strongly dependent on the level of noise, the decay constant of the refractory function, and the network activity. At the critical point, the cross-correlations oscillate with a frequency that depends on the refractory properties or decay exponentially with a diverging damping constant (for "weak" refractory properties). We also investigate the effect of the synaptic time constants. It is shown that these time constants may, apart from their influence on the asymmetric peak arising from the direct synaptic connection, also affect the long-term properties of the cross-correlations.
1 Introduction
Neural Computation 14, 369–404 (2001)
© 2001 Massachusetts Institute of Technology

One of the most important tools in current neurophysiological brain research is the determination of temporal and spatial correlations in neuronal activity. In 1967, Perkel, Gerstein, and Moore (1967a, 1967b) showed that the characteristics of the interactions between two neurons are reflected in their temporal correlations. This work proposed that the spike correlations are a physiological indication of anatomical connections between the neurons. Many studies have used this approach to study the anatomy of different cortical areas (Toyama, Kimura, & Tanaka, 1981a, 1981b; Ts'o, Gilbert, & Wiesel, 1986; Reid & Alonso, 1995). However, it should be recognized that the spike correlations between neurons depend not only on the network architecture, but also on the properties of the individual neurons, as well as on the dynamical state of the network. Indeed, in prefrontal cortex, the correlation in activity can vary with the task that is to be performed (Vaadia et al., 1995). And, in sensory cortices, the level of correlation is strongly stimulus dependent (Gray, König, Engel, & Singer, 1989; deCharms & Merzenich, 1996; for a comprehensive overview, see Gray, 1999). These variations in spike correlation presumably reflect differences in the dynamical state of the cortical network. Unfortunately, the interplay among the factors in generating the correlations is highly complex, making it difficult to interpret experimentally obtained correlations. Therefore, a theoretical framework is needed to analyze cross-correlations in different functional, anatomical, and dynamical contexts, enabling comparisons to experimental measurements and facilitating the interpretation of observed results.

The importance of the network dynamics for the cross-correlations has been stressed by Ginzburg and Sompolinsky (1994), who showed that the temporal correlations between a pair of neurons may, even in an "asynchronous state," be dominated by the cooperative dynamics of the network the neuron pair belongs to. In this analysis, the network consists of neurons described by asynchronously updated Ising spins. The irregularity of the neuronal activity is described by thermal noise, and it is shown that for noise levels close to the bifurcation point, the cross-correlations become large. A major limitation of this theory is the absence of refractoriness of the neurons. This leads to autocorrelations of the neuron that decay exponentially.
Viewed on a timescale of single spikes (i.e., a few milliseconds), this does not even qualitatively correspond to experimental results. Thus, the biological interpretation of their theory in terms of single spikes is not clear. Another problem is that within their model, due to the lack of refractoriness, a single neuron without noise always approaches a state of either maximal or minimal activity. Noise has to be added to induce an intermediate activity level.

The aim of this article is to extend the theory of Ginzburg and Sompolinsky to networks of spiking neurons. In our approach, a single neuron is described by the spike response model introduced by Gerstner and van Hemmen (1992), which we call a renewal neuron. We calculate and analyze the cross-correlations in a large but finite stochastic, homogeneous, fully connected excitatory network in a stable stationary state. We show that the temporal correlations depend crucially and in a complex manner on the refractory properties of the neurons, apart from the dependence on the synaptic transmission and the network dynamics.

In section 2, we define the network model for dynamics in discrete and continuous time, based on the spike response model. In section 3, we analyze the basic properties of a single renewal neuron, following Gerstner
and van Hemmen (1992). In section 4, we summarize the basic properties of a stochastic, homogeneous, fully connected excitatory network of renewal neurons within mean-field theory. Section 5 presents the analysis of the temporal cross-correlations in a large but finite network in a stable, stationary state, for dynamics in both discrete and continuous time. Finally, in section 6, we summarize and discuss our results. The mathematical theory is presented in the appendix.

2 Model
A simple way to introduce relative refractory properties into a neuron model is given within renewal theory (Cox, 1962; Perkel et al., 1967a, 1967b; Stein, 1967a, 1967b; Tuckwell, 1988; Gerstner, 1995). The basic assumption is that the internal state of the neuron, that is, its refractory properties, depends only on the time since the neuron emitted its most recent spike; previous spikes do not contribute. This assumption can be justified if the range of the refractory properties is smaller than the average interspike interval. Due to the renewal assumption, the state of neuron i is completely characterized by two quantities: the neuron's input at time t and a variable μ_i(t), denoting the time interval that has passed since the neuron emitted its last spike. If μ_i(t) = ν, then neuron i emitted its last spike at time t − ν; μ_i(t) = 0 is equivalent to the emission of a spike at time t.

We investigate two different models, one using discrete time and the other continuous time. The discrete-time model is introduced as the conceptually and theoretically simpler model. It has the advantage of offering more analytical insight and is numerically the more tractable. For reasons of simplicity, a synaptic transmission function is not included in the discrete-time model. The continuous-time model, on the other hand, is biologically more realistic.

In the discrete-time model, we assume that the probability that a neuron will fire within the time interval Δt is given by P(fire) = g[h^in(t) − J_r(μ(t)) − ϑ]Δt (Gerstner & van Hemmen, 1992). Here h^in is the part of the potential due to the various inputs into the neuron, ϑ is the threshold of a neuron at rest, and J_r describes the effects of the last spike, such as an afterhyperpolarizing (AHP) current and a temporary increase in the threshold. We denote the latter as the refractory properties of the neuron and refer to J_r as the refractory function.
The function g(h) describes the probability of firing, given the net membrane potential h = h^in − J_r(μ) − ϑ. It is a sigmoidal function that subsumes the various sources of noise in the system: an increasing function that goes from 0 for h → −∞ to 1/Δt for h → ∞. Stochastic dynamics in discrete time (Gerstner & van Hemmen, 1992) is given by

P(μ_i(t + Δt) = 0 | {μ(t)}) = g[h_i^in(t) − J_r(μ_i(t)) − ϑ_i] Δt    (2.1)
and, for ν > 0,

P(μ_i(t + Δt) = ν | {μ(t)}) = {1 − g[h_i^in(t) − J_r(μ_i(t)) − ϑ_i] Δt} δ_{ν, μ_i(t)+Δt},    (2.2)
where Δt is the discrete time step. Equation 2.1 gives the probability that neuron i emits a spike at time t + Δt, given the refractory states {μ(t)} of all the neurons in the network at time t. As explained below, the dependence of this probability on μ_j for j ≠ i is through the network feedback h_i^in(t).

We define a continuous model as the limit Δt → 0 of the discrete model. Introducing a probability density 𝒫(μ), with 𝒫(μ) Δμ = P(μ), and a transition rate w(h) into the active state, the dynamical equations of the continuous model are, in integrated form, given by

𝒫(μ_i(t) = 0) = ∫₀^∞ d{μ(t)} w[h_i^in(t) − J_r(μ_i(t)) − ϑ_i] 𝒫({μ(t)})    (2.3)

and, for μ > 0,

∂𝒫(μ_i(t) = μ)/∂t = − ∂𝒫(μ_i(t) = μ)/∂μ − ∫₀^∞ d{μ(t)} w[h_i^in(t) − J_r(μ_i(t)) − ϑ_i] 𝒫({μ(t)}),    (2.4)

where d{μ(t)} refers to integration over all neurons in the network. Notice that for the transition rate, we use the notation w in the continuous model rather than g. This is to emphasize the difference between these two functions: w(h) is the instantaneous transition rate, and g(h)Δt is the probability of transition averaged over a finite time window Δt. The two functions are related by the requirement that for constant h, the portion of neurons remaining inactive for the time step Δt should be the same in the discrete and the continuous model, exp(−Δt w) = 1 − gΔt, yielding

w(h) = − (1/Δt) ln[1 − g(h) Δt].    (2.5)
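The relation of equation 2.5 can be checked numerically. The sketch below uses arbitrary illustrative values for g(h) and Δt (not taken from the article) and verifies that the discrete and continuous descriptions keep the same fraction of neurons inactive over one time step.

```python
import math

def survival_discrete(g, dt):
    """Probability that a neuron stays inactive over one step of length dt
    in the discrete-time model, where g*dt is the firing probability."""
    return 1.0 - g * dt

def rate_from_g(g, dt):
    """Continuous-time transition rate w implied by the discrete firing
    probability g*dt, via exp(-dt*w) = 1 - g*dt (equation 2.5)."""
    return -math.log(1.0 - g * dt) / dt

# For constant h, both models keep the same fraction of neurons inactive.
g_val, dt = 0.3, 0.1                      # illustrative values of g(h) and dt
w_val = rate_from_g(g_val, dt)
print(abs(math.exp(-dt * w_val) - survival_discrete(g_val, dt)) < 1e-12)  # True
```

Note that w exceeds g whenever gΔt > 0, since the continuous rate must compensate for the decaying pool of inactive neurons within the time step.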
The postsynaptic potential h_i^in(t) consists of two parts:

h_i^in(t) = I_i^ext(t) + h_i^syn(t).    (2.6)

The external inputs, I_i^ext, describe the inputs from other parts of the brain. The second term, h_i^syn, describes the feedback input from the other neurons in the network and is given by

h_i^syn(t) = Σ_{j=1, j≠i}^N J_ij ε(μ_j(t)),    (2.7)
where the J_ij are the synaptic efficacies, and the postsynaptic response function ε(μ) ≥ 0 describes the time course of the synaptic transmission (Gerstner & van Hemmen, 1992); it is chosen to satisfy ε(μ) = 0 for μ < 0 and ∫₀^∞ dμ ε(μ) = 1 or Σ_μ ε(μ) = 1, respectively, for the continuous- and discrete-time model. Note that we have made the further simplification that only the most recent spike from each presynaptic neuron is taken into account for synaptic transmission; if a given presynaptic neuron emits more than one spike in a short time period, the influence of spikes other than the last one is neglected.

With these definitions, we have a model in which the probability that neuron i emits an action potential depends on h_i and μ_i, while h_i is given by I_i^ext and {μ}, the refractory state of all neurons in the network. Thus, in the discrete-time model, one can construct the transition matrix P({μ(t + Δt)} | {μ(t)}, {I^ext(t)}), which describes the probability of being in the state {μ(t + Δt)} at time t + Δt, given that at time t, the system is in state {μ(t)} and the inputs are given by {I(t)}. In the continuous-time model, one finds an expression for the evolution of 𝒫({μ(t)} | {I^ext(t)}). (For details, see the appendix.)

For the refractory function J_r, we choose

J_r(μ) = { ∞     for μ < τ_abs
         { c/μ   for μ ≥ τ_abs    (2.8)

for dynamics in discrete time and

J_r(μ) = { ∞               for μ ≤ τ_abs
         { c/(μ − τ_abs)   for μ > τ_abs    (2.9)

for continuous time. The parameter c > 0 measures the strength of the refractory function. Using the respective dynamical equations, 2.1 and 2.2 for discrete time and 2.3 and 2.4 for continuous time, both models yield nearly identical results for the autocorrelation function and the interspike interval distribution, provided that the activity is not too high. During the absolute refractory period τ_abs, the neuron cannot emit another spike. For μ > τ_abs, J_r(μ) can be viewed as enhancing the effective threshold ϑ of the neuron to become active again (Horn & Usher, 1989; Gerstner & van Hemmen, 1992) ("relative refractoriness"). The inclusion of refractory properties allows an interpretation of the active state μ = 0 as the emission of a single spike, whereby the emission of a spike is described as a binary event in time without modeling the detailed time course of the neuron's membrane potential.

Real neurons have an absolute refractory period of at least 1 millisecond. It is convenient to rescale time to this period. Accordingly, throughout this study, we will use τ_abs = 1 ≙ 1 ms. In the discrete-time model, we will
for simplicity also set the time step Δt to Δt = 1. For the spike emission probability g, we will use

g(h) = ½ [1 + tanh(βh)]    (2.10)

in the discrete model. Here, the noise parameter β measures the degree of stochasticity (β → ∞ leads to the deterministic case). For this choice of g and Δt, the transition rate w in the continuous model is, according to equation 2.5, given by

w(h) = − ln( ½ [1 − tanh(βh)] ).    (2.11)

This rate leads to a linear dependence on h for large h and should be contrasted with the exponential rate w(h) = exp(h)/τ₀ suggested by Gerstner and van Hemmen (1992). For the time course ε of the synapses, we will, for simplicity, use ε(μ) = δ_{μ,0} in the discrete-time model, and in the continuous-time model we will use the "alpha function" (Jack, Noble, & Tsien, 1975; Brown & Johnston, 1983),

ε(μ) = α² μ e^{−αμ} H(μ).    (2.12)
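These ingredients can be collected in a short sketch (parameter values here are illustrative, not the article's): the firing probability of equation 2.10, the rate of equation 2.11 with Δt = 1, and the alpha function of equation 2.12, together with a numerical check that the alpha function is normalized.

```python
import math

def g(h, beta):
    """Spike emission probability per time step, equation 2.10."""
    return 0.5 * (1.0 + math.tanh(beta * h))

def w(h, beta):
    """Continuous-time transition rate, equation 2.11 (Delta t = 1)."""
    return -math.log(0.5 * (1.0 - math.tanh(beta * h)))

def eps_alpha(mu, alpha):
    """Alpha-function postsynaptic response, equation 2.12."""
    return alpha**2 * mu * math.exp(-alpha * mu) if mu >= 0 else 0.0

# The alpha function integrates to 1 over mu >= 0, as required of eps(mu).
alpha, dmu = 2.0, 0.001
total = sum(eps_alpha(k * dmu, alpha) for k in range(int(50 / dmu))) * dmu
print(abs(total - 1.0) < 1e-3)  # True
```

At h = 0 the neuron fires with probability ½ per step; for strongly negative h both g and w vanish, as required during refractoriness.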
Here, α is the synaptic rate constant, and H(μ) is the Heaviside function, H(μ) = 0 for μ < 0 and H(μ) = 1 for μ ≥ 0. Because in this study we want to concentrate on the effects of the refractoriness, we will keep the synaptic coupling as simple as possible. We therefore consider only networks with equal all-to-all coupling:

J_ij = (J₀/(N − 1)) (1 − δ_{i,j}).    (2.13)
We also assume that the external input into all neurons is identical, I_i^ext(t) = I(t), and we assume that the threshold is identical for all cells, ϑ_i = ϑ. This homogeneity assumption implies that all neurons have the same mean firing rate and autocorrelations. The cross-correlations are also identical for each pair of neurons.

3 Single Renewal Neuron
Before we consider the properties of networks of renewal neurons, it is useful to summarize some basic properties of a single renewal neuron, following the analysis by Gerstner and van Hemmen (1992). The input of the neuron is assumed to be stationary and is given by

h_i^in(t) = I_i^ext = I    (3.1)

(in the rest of this section, the index i will be dropped).
In stochastic dynamics, the time between two consecutive spikes is characterized by the interspike interval distribution D(τ). In continuous time, D(τ) is given by

D(τ) = { 0                                for 0 ≤ τ < τ_abs
       { w_τ exp(−∫_{τ_abs}^τ dν w_ν)     for τ ≥ τ_abs    (3.2)

with

w_μ := w[I − J_r(μ) − ϑ].    (3.3)
The frequency of the neuron is given by the inverse of the mean interspike interval ⟨τ⟩ (Gerstner & van Hemmen, 1992):

ν(I) = 1/⟨τ⟩ = 1 / [ τ_abs + ∫_{τ_abs}^∞ dμ exp( −∫_{τ_abs}^μ dν w_ν(I) ) ].    (3.4)
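Equation 3.4 can be evaluated on a grid. The sketch below assumes the continuous-time refractory function of equation 2.9 and the rate of equation 2.11; the default parameter values are illustrative, not fitted to the article's figures, and the expression is valid only while βh stays moderate (so that 1 − tanh(βh) does not underflow).

```python
import math

def w(h, beta):
    """Continuous-time transition rate, equation 2.11."""
    return -math.log(0.5 * (1.0 - math.tanh(beta * h)))

def gain(I, beta=5.0, c=1.0, theta=0.1, t_abs=1.0, dmu=0.01, mu_max=200.0):
    """Numerical evaluation of the gain function, equation 3.4, for the
    refractory function J_r(mu) = c/(mu - t_abs) of equation 2.9."""
    outer = 0.0   # integral over mu of exp(-inner)
    inner = 0.0   # running integral of w_nu from t_abs up to mu
    steps = int((mu_max - t_abs) / dmu)
    for k in range(1, steps + 1):
        mu = t_abs + k * dmu          # mu > t_abs, so J_r stays finite
        inner += w(I - c / (mu - t_abs) - theta, beta) * dmu
        outer += math.exp(-inner) * dmu
    return 1.0 / (t_abs + outer)
```

As expected of a gain function, ν(I) lies between 0 and 1/τ_abs and increases monotonically with the stationary input I.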
Some examples of gain functions ν(I) are presented in Figure 1. In addition to the interspike interval distribution and the gain function, the firing pattern of the neuron is conveniently described by the autocorrelation function. The (reduced) autocorrelation A_i is defined as the joint probability that neuron i emits action potentials at times t and t + τ, minus the uncorrelated expectation value:

A_i(t, t + τ) = P(μ_i(t) = 0, μ_i(t + τ) = 0) − P(μ_i(t) = 0) P(μ_i(t + τ) = 0).    (3.5)
In stationary states, renewal theory provides a simple relation between the interspike interval distribution and the autocorrelation:

A(τ) = ∫₀^τ dμ D(τ − μ) A(μ) + P₀ D(τ) + P₀² [ ∫₀^τ dμ D(μ) − 1 ].    (3.6)
Here we have used P₀ to denote P(μ_i(t + τ) = 0). Note that equation 3.6 describes only the bounded part of the autocorrelation. It is easily seen from definition 3.5 that A(τ) diverges for τ = 0; the central peak is given by P₀ δ(τ). This peak is not included in equation 3.6.

In Figure 2, the interspike interval distribution D(τ) (A) and the autocorrelation A(τ) (without central peak, B) are shown for various parameters c and β. For small c (weak refractoriness) and small β (high degree of stochasticity), D(τ) has an asymmetric shape with a long tail. The autocorrelation A(τ) increases monotonically after the minimum, which is caused by the refractory properties. For increasing c (i.e., stronger refractoriness) and increasing β (lower degree of stochasticity), D(τ) becomes more and
Figure 1: Gain function (frequency) ν(I) of a single renewal neuron with stationary input I. The lines labeled "determ." show results for deterministic dynamics, for the (a) discrete-time and (b) continuous-time model. The other lines are obtained using stochastic dynamics with noise parameter β. The line labeled "discr." shows a gain function in discrete time, for a refractory function J_r, equation 2.8, with c = 1.0, β = 10.3, and ϑ = 0.1020. The other lines show gain functions in continuous time (J_r from equation 2.9, threshold ϑ = 0.1). The intersection of the gain function ν(I) with the identity I gives the fixed points in a network with synaptic efficacy J₀ = 1.0. For the example in discrete time, the marked fixed point P₀ = 0.05 is unstable (see Figure 4B). ν is given in units of 1000 Hz; τ_abs = 1.0.
more symmetric around the average interspike interval ⟨τ⟩, and the autocorrelation becomes oscillatory, with a damping constant that increases with increasing c and β. From renewal theory, similar equations can be derived for dynamics in discrete time. Choosing the refractory function of equation 2.8 and the transition probability of equation 2.10 in the discrete-time model (instead of equations 2.9 and 2.11, respectively, in the continuous model), the numerical results for the interspike interval distribution and the autocorrelation are almost identical, provided the activity is not too large (≤ 50 Hz).
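The renewal relation of equation 3.6 can be solved recursively on a time grid, since D(τ) vanishes on [0, τ_abs), so each A(τ) depends only on earlier values. The sketch below assumes, purely for illustration, an interspike interval distribution with an absolute refractory period followed by a constant rate w₀ (not one of the article's examples); it reproduces the refractory dip A(τ) = −P₀² for τ < τ_abs and the decay of A(τ) toward zero.

```python
import math

# Assumed ISI distribution: silent for tau < t_abs, constant rate w0 after.
w0, t_abs = 0.2, 1.0
P0 = 1.0 / (t_abs + 1.0 / w0)             # activity = 1/<tau>

dt, n = 0.02, 2000                         # grid of n steps covering tau <= 40
D = [w0 * math.exp(-w0 * (k * dt - t_abs)) if k * dt >= t_abs else 0.0
     for k in range(n)]

# Recursive solution of equation 3.6: A[k] needs only A[0..k-1]
# because D vanishes on [0, t_abs).
A, cumD = [0.0] * n, 0.0
for k in range(n):
    conv = sum(D[k - m] * A[m] for m in range(k)) * dt
    cumD += D[k] * dt
    A[k] = conv + P0 * D[k] + P0**2 * (cumD - 1.0)

print(A[0] < 0 and abs(A[-1]) < abs(A[0]))  # refractory dip, then decay: True
```

The dip at small τ reflects the refractoriness; for this simple D(τ) the oscillatory structure of Figure 2B is absent, since the ISI distribution has no pronounced mode.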
Figure 2: (A) Interspike interval distribution D(τ) and (B) autocorrelation A(τ) (without central peak) of a single renewal neuron with stationary input I, for various noise parameters β and various strengths c of the refractory function J_r. Dynamics in discrete time with J_r from equation 2.8, or in continuous time with J_r from equation 2.9, give the same results. The activity P₀ is kept constant by adjusting the threshold ϑ. P₀ = 0.05 ≙ 50 Hz, I = 0.05, τ_abs = 1.0.
Figure 2 demonstrates that the interspike interval distribution and especially the autocorrelation of a renewal neuron are qualitatively similar to experimental results. Thus, the renewal neuron appears to be a suitable model to describe a "regular" (nonadaptive, nonbursting) neuron in terms of the generation of single spikes (Gerstner & van Hemmen, 1992). Yet the model is simple enough to allow the cross-correlations in stationary states to be calculated, as will be shown in section 5. Therefore, it is a convenient model for studying cross-correlations on the timescale of single spikes.

4 Mean-Field Theory
Before we consider the cross-correlations, we first summarize the mean properties of homogeneous networks of renewal neurons. The mean-field theory of such networks has been studied intensively (for a review, see Gerstner, 1995). The aim of this section is to introduce the notation used for calculating the cross-correlations (section 5) and to summarize some basic properties of the network that are important for an interpretation of the cross-correlations. In this section, we focus on dynamics in discrete time, equations 2.1 and 2.2; the model with continuous time behaves qualitatively similarly. For simplicity, we assume that the network receives no external input, I_i^ext(t) = 0, so that the postsynaptic potential h_i^in(t) satisfies

h_i^in(t) = (J₀/(N − 1)) Σ_{j=1, j≠i}^N δ_{μ_j(t),0}.    (4.1)
In mean-field theory (N → ∞), the postsynaptic potential can be replaced by its (noise) average ⟨h_i^in(t)⟩. In a homogeneous network, the homogeneous solution P_μ(t) := P(μ(t) = μ) (independent of i) is a consistent solution of the network dynamics. Thus, h_i^in(t) can be written h_i^in(t) = J₀ P₀(t), and the network dynamics is given by P(t + 1) = T(t) · P(t), with

P(t) := ( P₀(t), P₁(t), P₂(t), P₃(t), … )^T,    (4.2)

T(t) := [ g₀^t        g₁^t       g₂^t       ⋯
          1 − g₀^t    0          0          ⋯
          0           1 − g₁^t   0          ⋯
          0           0          1 − g₂^t   ⋯
          ⋮           ⋮          ⋮             ],    (4.3)

and

g_μ^t := g( J₀ P₀(t) − J_r(μ) − ϑ ).    (4.4)
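The mean-field iteration P(t + 1) = T(t) · P(t) of equations 4.2 to 4.4 can be sketched directly, truncating the refractory state μ at some maximum M and lumping the tail into the last state (a numerical device introduced here, not used in the article, which works with the infinite state space); parameter values are illustrative.

```python
import math

def g(h, beta):
    """Spike emission probability, equation 2.10."""
    return 0.5 * (1.0 + math.tanh(beta * h))

def step(P, J0, beta, c, theta, M):
    """One mean-field update P(t+1) = T(t) P(t), equations 4.2-4.4,
    with refractory states truncated at M (tail lumped into state M)."""
    Jr = [math.inf] + [c / m for m in range(1, M + 1)]   # equation 2.8, t_abs = 1
    gm = [g(J0 * P[0] - Jr[m] - theta, beta) for m in range(M + 1)]
    newP = [sum(gm[m] * P[m] for m in range(M + 1))]     # mu = 0: neurons that spike
    for m in range(1, M + 1):
        newP.append((1.0 - gm[m - 1]) * P[m - 1])        # silent neurons age by one
    newP[M] += (1.0 - gm[M]) * P[M]                      # lumped tail state
    return newP

M = 200
P = [1.0 / (M + 1)] * (M + 1)                            # arbitrary initial state
for _ in range(1000):
    P = step(P, J0=1.0, beta=5.0, c=1.0, theta=0.1, M=M)
print(abs(sum(P) - 1.0) < 1e-9)  # probability is conserved: True
```

Recording P[0] over the iterations reproduces trajectories of the network activity P₀(t) of the kind shown in Figures 3 and 4, converging or oscillating depending on β, c, and ϑ.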
The fixed point P* = T · P* is stable if all eigenvalues λ_k of the matrix

S := [ Σ_{μ=0}^∞ ĝ_μ P_μ + g₀    g₁        g₂        ⋯
       −ĝ₀ P₀ + 1 − g₀          0         0         ⋯
       −ĝ₁ P₁                   1 − g₁    0         ⋯
       −ĝ₂ P₂                   0         1 − g₂    ⋯
       ⋮                        ⋮         ⋮            ]    (4.5)

satisfy |λ_k| < 1, where

ĝ_μ := J₀ ∂g(h)/∂h |_{h = J₀P₀ − J_r(μ) − ϑ}.    (4.6)
Note that with g(h) from equation 2.10, ĝ_μ contains an explicit factor β from the derivative ∂g(h)/∂h. The dynamics, equation 4.2, has been studied intensively. Here, we restrict ourselves to summarizing some basic effects of the refractory function J_r on the existence and basic properties of stable, stationary states for dynamics in discrete time.

For neurons without refractoriness (J_r(μ) = 0 for all μ) or with absolute refractoriness of one time step only (J_r(0) = ∞, J_r(μ) = 0 for μ ≥ 1), the fixed-point structure is well known. There is either one stable fixed point P₀ or three fixed points P₀^l < P₀^m < P₀^h, where P₀^l and P₀^h are stable and P₀^m is unstable; stable oscillations do not exist. The dynamics can become more interesting for neurons with relative as well as absolute refractory properties. For suitable parameter values, even if there is only one fixed point, unstable or stable oscillations around this fixed point may show up. An example is shown in Figure 1, where the fixed point P₀ = 0.05 for c = 1.0 is unstable (see Figure 4).

We now address the existence of stable stationary states in dependence on the noise parameter β. First, we consider the network behavior when the noise level β is varied, while the neuron parameters c and ϑ and the synaptic efficacy J₀ are kept constant. Figure 3 shows the temporal evolution of the network activity P₀(t) in dependence on the noise parameter β for the same initial distribution. For low noise levels (β = 13.0), the network activity evolves into a stable, stationary state with "low" activity P₀. Since P₀ < ϑ, the network activity is completely noise driven. Increasing the noise level (decreasing β) drives the network from a stationary state into stable oscillations (see Figure 3 for β = 9.8). This is because increasing the amount of noise leads to an increase of the mean network activity P₀, overcompensating the stabilizing effect of noise.
Even larger noise levels (e.g., β = 5.0) stabilize the network activity again, but at a much larger activity level. As in the case of low noise (β = 13.0), the network activity is highly stochastic, with the noise stabilizing the network activity, which is slightly above threshold. It can be concluded that the network is susceptible to stable oscillations especially in parameter regions where the average network activity is of the order of the threshold. In discrete dynamics, the network activity in a stable stationary state is highly stochastic: either the activity is completely noise driven ("low" activity states), or a high degree of noise stabilizes the network activity ("high" activity states).

A second way to vary the noise level β is to keep the average network activity P₀ constant by adjusting the threshold ϑ of the neurons. This allows studying the dependence of the cross-correlations on the noise level for constant activity (see section 5). An example of the network behavior for
Figure 3: Activity P₀(t) as a function of time t in a homogeneous, infinitely large network of neurons with absolute and relative refractoriness, for the same initial distribution but different noise parameters β. Parallel, stochastic dynamics in discrete time, J_r from equation 2.8, with c = 1.0, J₀ = 1.0, ϑ ≈ 0.1032. The initial distribution is the same as that of Figure 4A. Compare with Usher, Schuster, and Niebur (1993).
different noise levels but constant activity P₀ is presented in Figure 4. In Figure 4A, the network asymptotically approaches the fixed point P₀ = 0.05. In Figure 4B, the noise level is reduced (β increased), with a simultaneous decrease of the threshold ϑ in order to keep the fixed-point value at P₀ = 0.05. P₀ then becomes unstable, and the network is driven into stable oscillations. Thus, noise-maintained stationary states in homogeneous networks of neurons with sufficiently strong relative refractoriness become unstable against oscillations if the noise level and the threshold ϑ are decreased, keeping the network activity P₀ constant. The critical point β_c depends on the strength c of the refractoriness. For weak refractoriness, the network becomes unstable against a fixed point with larger activity.

To summarize, two important points have to be stressed. First, due to relative refractoriness, the fixed-point activity in an infinitely large, fully connected, homogeneous, excitatory network of renewal neurons may assume rather low and medium values even without noise (Gerstner & van Hemmen, 1992; Usher et al., 1993). Second, due to the existence of stable oscillations in discrete-time dynamics, the activity in stable fixed points is
Figure 4: Activity P₀(t) as a function of time t in a homogeneous, infinitely large network of renewal neurons with absolute and relative refractoriness. Parallel, stochastic dynamics in discrete time, J_r from equation 2.8, with c = 1.0. The noise parameter β and threshold ϑ are chosen so that P₀ = 0.05 is a stationary solution; J₀ = 1.0. The initial distribution is the stationary distribution for the parameter values of A, slightly perturbed by setting P_μ = 0 for μ > 50 and normalizing P₅₀. Note the different scaling of the ordinate.
highly stochastic: either purely noise driven or stabilized by a large amount of noise.

5 Cross-Correlations

5.1 General Analysis. In mean-field theory, the dynamics of a homogeneous network of renewal neurons can be solved exactly, since in this case the synaptic input can be replaced by its average value. For a finite network, the synaptic input of a given neuron in general deviates from its mean, leading to correlations between the neurons. Ginzburg and Sompolinsky (1994) have shown that for a large network (N ≫ 1) in an asynchronous state,¹ these correlations are dominated by pairwise correlations, for which closed equations can be derived. In the following, the analysis of Ginzburg and Sompolinsky (1994) is extended to a large but finite (N ≫ 1) homogeneous network of renewal neurons in a stable, stationary state (the mathematical theory is sketched in the appendix). The cross-correlation

C_ij^{νν′}(t, t + τ) := P(μ_i(t) = ν, μ_j(t + τ) = ν′) − P(μ_i(t) = ν) P(μ_j(t + τ) = ν′)    (5.1)
is the joint probability for neuron i to be in refractory state ν at time t and neuron j to be in refractory state ν′ at time t + τ, minus the uncorrelated expectation value for this configuration. The stationary value is defined by

C_ij^{νν′}(τ) := lim_{t→∞} C_ij^{νν′}(t, t + τ)    (5.2)

and satisfies C_ij^{νν′}(τ) = C_ji^{ν′ν}(−τ). For a homogeneous network with identical couplings,

J_ij = (J₀/(N − 1)) (1 − δ_{i,j}),    (5.3)

the cross-correlations are identical for all neuron pairs i and j, C_ij^{νν′} = C^{νν′} for i ≠ j, and the "spike-spike cross-correlation" C(τ) := C^{00}(τ) is symmetric with respect to τ:

C(τ) = C(−τ).    (5.4)

The autocorrelation is defined accordingly (compare with equation 3.5):

A^{νν′}(t, t + τ) = A_i^{νν′}(t, t + τ) := C_ii^{νν′}(t, t + τ)    (5.5)

¹ An asynchronous state is defined by the condition that the cross-correlations vanish for N → ∞.
and in a stationary state

A(τ) := A^{00}(τ) = A(−τ).    (5.6)

In the appendix, it is shown, by expanding the postsynaptic potential h_i^in(t), equation 4.1, around its average value ⟨h_i(t)⟩ = J₀ P₀(t) for large networks N ≫ 1, that the correlations in a stable, stationary state are determined in leading order by the two-point correlations, equation 5.1, for which closed equations can be derived. They can be written in matrix form:

C(τ + 1) = C(τ) · S^T + (1/N) A(τ) · ( S^T − T^T ),    (5.7)
where

C(τ) := ( C^{νν′}(τ) )_{ν,ν′ ≥ 0}   and   A(τ) := ( A^{νν′}(τ) )_{ν,ν′ ≥ 0}.    (5.8)
S and T are given in equations 4.5 and 4.3, respectively, and the superscript T denotes the transpose of a matrix. The initial value C(0) of the recursion, equation 5.7, is given by the solution of

C(0) = S · [ C(0) + (1/N) A(0) ] · S^T − (1/N) T · A(0) · T^T.    (5.9)

Similar to Ginzburg and Sompolinsky (1994), the dynamics of the correlations is governed by fluctuations of the synaptic input of the "target neuron," described by the derivative ĝ_μ, which is proportional to the synaptic efficacy J₀ (see equations 4.6 and 4.5). In addition, due to the dynamics of the refractory states μ, the correlations depend directly on the transition probabilities g_μ and 1 − g_μ, as can be seen in the definition of the matrix S in equation 4.5. Since the "reference neuron" i also contributes to the synaptic input of the target neuron j, there is also a contribution of the autocorrelation A(τ) to the dynamics of the cross-correlations, which is proportional to the derivatives of the transition probabilities (the term S^T − T^T in equation 5.7) and can be viewed as the "source" of the cross-correlations (Ginzburg & Sompolinsky, 1994). If there is no direct synaptic connection from neuron i to neuron j, the contribution of the autocorrelation to the cross-correlation C_ij vanishes.

Because at the fixed point the input into a neuron is constant, up to fluctuations of order 1/N, the autocorrelations of the neurons are, to leading order, the same as those we calculated in section 3 for the single neuron, provided that the appropriate level of input is taken. The fluctuations add a term of order 1/N to the autocorrelation, but because these contribute only a term of order 1/N² to the cross-correlations, according to equation 5.7, we will not calculate this correction here.
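The recursion of equation 5.7 is linear in C(τ), so once S, T, and A(τ) are known (truncated to a finite number of refractory states), it can be iterated with ordinary matrix products. The sketch below implements one step in pure Python and checks, with an assumed stable toy matrix S (not taken from the article) and the autocorrelation source set to zero, that C(τ) then vanishes asymptotically, as stated below for stable fixed points.

```python
def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def corr_step(C, A, S, T, N):
    """One step of equation 5.7: C(tau+1) = C(tau) S^T + (1/N) A(tau) (S^T - T^T)."""
    St, Tt = transpose(S), transpose(T)
    SmT = [[St[i][j] - Tt[i][j] for j in range(len(St[0]))] for i in range(len(St))]
    CS, AS = matmul(C, St), matmul(A, SmT)
    return [[CS[i][j] + AS[i][j] / N for j in range(len(CS[0]))]
            for i in range(len(CS))]

# Toy check: with a stable S (spectral radius < 1) and no autocorrelation
# source, the cross-correlations decay to zero, as in a stable fixed point.
S = [[0.5, 0.2], [0.1, 0.4]]
T = [[0.5, 0.2], [0.1, 0.4]]           # placeholder; irrelevant when A = 0
A = [[0.0, 0.0], [0.0, 0.0]]
C = [[1.0, 0.3], [0.3, 1.0]]
for _ in range(50):
    C = corr_step(C, A, S, T, N=100)
print(max(abs(c) for row in C for c in row) < 1e-6)  # True
```

With a nonzero A(τ) matrix, the same step function drives the cross-correlations from their source term, exactly as in the analysis above.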
Expanding the spike-spike cross-correlation C(τ) = C^{00}(τ) with respect to the normal modes C₀^k(τ) := Σ_ν C^{0ν}(τ) L_ν^k (with L_ν^k the νth component of the kth left eigenvector of the matrix S), the asymptotic behavior of C(τ) can be derived. Some general conclusions can be drawn from this analysis. First, in a stable fixed point, the eigenvalues λ_k of S satisfy |λ_k| < 1, and the cross-correlations vanish asymptotically: lim_{τ→∞} C(τ) = 0. If the fixed point is unstable, the amplitude of C(τ) diverges in the order 1/N, and the cross-correlations are of higher order (Ginzburg & Sompolinsky, 1994); thus, the analysis presented in this work is no longer applicable. Second, it can be seen that the cross-correlations in a stable fixed point are composed as a sum of monotonically decaying or damped oscillating normal modes, the source of which is given by the autocorrelation, convoluted and weighted by the eigenvalues of the stability matrix S. Note that, in contrast to a corresponding network of neurons without refractory properties, oscillating cross-correlations are possible even in a completely homogeneous network, due to the dynamics of the refractory states μ. Third, near a bifurcation point, the normal mode belonging to the largest eigenvalue |λ_k| of S, which satisfies |λ_k| → 1, dominates the long-term behavior of the cross-correlations. If λ_k = 1 − ε is real, C(τ) decays monotonically; if λ_k = 1 − ε + iωΔt is complex, C(τ) oscillates with frequency ω (compare Figure 6). Fourth, the longer the range of the refractory function J_r, the more complex the behavior of C(τ). In the following, some examples of cross-correlations are discussed in more detail.

5.2 Cross-Correlations: Discrete Time. For neurons without any refractoriness, the cross-correlation C(τ) in a homogeneous network with synaptic input, equation 4.1, decays monotonically after a peak at τ = 1.
This peak is due to the direct synaptic connection between the two neurons and vanishes if this direct connection is removed. If an absolute refractory period of one time step is included, J_r(0) = ∞, J_r(μ) = 0 for μ ≥ 1, C(τ) may show damped oscillations of period 2 due to the absolute refractoriness of the presynaptic and the postsynaptic neuron. In addition, C(0) and C(τ) may become negative. Close to the critical point, however, C(τ) decays asymptotically monotonically. For renewal neurons with relative refractoriness, Figure 5 shows some examples of the spike-spike cross-correlation C(τ) = C^{00}(τ), multiplied by the number N of network neurons, for the refractory function J_r of equation 2.8, determined by numerical solution of equations 5.9 and 5.7. For comparability, we keep the network activity P_0 fixed; each time the parameter values c or β are changed, the threshold θ is adjusted accordingly to keep P_0 constant. For numerical reasons (matrix inversion), we focus on network states of "medium" activity P_0 = 0.05 ≙ 50 Hz (see also the appendix). Figure 5 shows that even for the simplest network architecture (completely homogeneous and fully connected) in a stable, stationary state,
Stochastic Networks of Spiking Neurons
385
Figure 5: Cross-correlation N·C(τ) in a homogeneous network with N neurons (N ≫ 1) with absolute and relative refractoriness. The network is in a stable stationary state with activity P_0 = 0.05 ≙ 50 Hz; the noise parameter β is varied (and the threshold θ adjusted to keep P_0 constant). Dynamics in discrete time, J_r from equation 2.8, J_0 = 1.0. (A) "Medium" refractoriness c = 0.2; the threshold drops from θ ≈ 0.31 for β = 5.0 to θ ≈ 0.16 for β = 11.0. (B) "Strong" refractoriness c = 1.0; the threshold drops from θ ≈ 0.24 for β = 5.0 to θ ≈ 0.11 for β = 9.8.
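The threshold-adjustment procedure used throughout this section (re-tuning θ so that the stationary activity stays at P_0 = 0.05 whenever c or β is changed) can be sketched numerically. The escape function g(h) and the refractory function J_r below are plausible stand-ins, since the paper's equations 2.8 and 4.22 are not reproduced in this excerpt, so the resulting θ values are only illustrative.

```python
import numpy as np

# Sketch of the threshold adjustment: find theta such that the
# stationary activity of a single renewal neuron (with self-consistent
# mean input J0 * P0) equals a target value.  g(h) and J_r are assumed
# stand-ins for the paper's (unreproduced) eqs. 4.22 and 2.8.

L = 200                                     # cutoff in the refractory state
J0, beta, c, target_P0 = 1.0, 10.0, 0.2, 0.05

def g_of_h(h):
    z = np.clip(beta * h, -60.0, 60.0)      # assumed sigmoidal escape noise
    return 1.0 / (1.0 + np.exp(-z))

def J_r(mu):
    # assumed relative refractoriness c/mu, absolute refractoriness at mu = 0
    return np.where(mu > 0, c / np.maximum(mu, 1), np.inf)

def stationary_activity(theta, n_iter=300):
    """Self-consistent stationary activity P0 of the renewal chain."""
    mu = np.arange(L)
    P0 = 0.05                               # initial guess
    for _ in range(n_iter):
        g = g_of_h(J0 * P0 - J_r(mu) - theta)
        surv = np.concatenate(([1.0], np.cumprod(1.0 - g[:-1])))
        P0 = 0.5 * P0 + 0.5 / surv.sum()    # damped fixed-point iteration:
    return P0                               # activity = 1 / mean ISI

lo, hi = -2.0, 2.0                          # activity decreases with theta,
for _ in range(60):                         # so bisect on theta
    mid = 0.5 * (lo + hi)
    if stationary_activity(mid) > target_P0:
        lo = mid
    else:
        hi = mid
theta = 0.5 * (lo + hi)
print(theta, stationary_activity(theta))
```

Because the activity is monotonically decreasing in θ, a simple bisection suffices; this mirrors the statement in the figure caption that θ drops as β (or c) is increased at fixed P_0.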
the cross-correlations exhibit a large variety of qualitative and quantitative properties, depending on the strength c of the refractory function, the noise level β, and the activity P_0. For small c (Figure 5A), the cross-correlation has an asymmetric peak at τ = 1, which is the result of the direct synaptic connection between the neuron pair; this peak vanishes if the direct connection is removed. The peak value C(1) increases with β. The angular shape of the peak C(1) is a consequence of the δ-form of the postsynaptic response function, as will be shown in the next subsection. The equal-time cross-correlation C(0) first decreases and then increases with β. For low noise (large β), some minima and maxima with a period of about 1/P_0 and decreasing amplitude show up. The damping constant of these oscillations increases with β. Close to the critical point, C(τ) decays exponentially ("weak refractoriness"; see also Figure 6). For large c (Figure 5B), C(τ) in general exhibits damped oscillations, with the damping constant increasing with β. The first maximum may be located at τ > 1, since the equal-time cross-correlations may be strongly negative. The peak value itself increases with β, whereas C(0) first decreases and then increases with β. Far below the critical point, the frequency ν of the oscillations of C(τ) is nearly constant and corresponds to the frequency of a single neuron, whereas near the critical point, the frequency ν depends strongly on β. Approaching the critical point, the asymmetric peak for τ > 0 becomes a symmetric peak at τ = 0. The damping constant diverges, and C(τ) oscillates with a frequency that strongly depends on the strength c of the refractory function. This is shown in Figure 6, where the frequency of C(τ) at the critical point is plotted as a function of the strength c of the refractory function; the frequency of a single neuron is indicated by the solid line.
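The link between damped oscillations of C(τ) and the spectrum of the stability matrix can be made concrete numerically: a complex eigenvalue λ yields a damped oscillation with decay constant −1/ln|λ| and frequency arg(λ)/(2π) per time step. The matrix below is an illustrative refractory-state transition matrix standing in for the matrix S of equation 4.5, which is not reproduced in this excerpt.

```python
import numpy as np

# Normal-mode sketch: for a renewal neuron with an absolute refractory
# period, the subdominant eigenvalues of the refractory-state transition
# matrix form complex pairs, so the correlations show damped oscillations.
# This matrix is a stand-in for the paper's stability matrix S (eq. 4.5).

L = 40                                   # cutoff in the refractory state mu
g = np.full(L, 0.2)                      # spiking probability g_mu ...
g[:5] = 0.0                              # ... zero during 5 refractory steps

S = np.zeros((L, L))
S[0, :] = g                              # a spike resets mu to 0
for mu in range(L - 1):
    S[mu + 1, mu] = 1.0 - g[mu]          # otherwise mu ages by one step
S[L - 1, L - 1] = 1.0 - g[L - 1]         # states beyond the cutoff are lumped

lams = np.linalg.eigvals(S)
osc = lams[np.abs(lams.imag) > 1e-9]     # complex modes: damped oscillations
lam = osc[np.argmax(np.abs(osc))]        # dominant oscillatory mode
decay_time = -1.0 / np.log(np.abs(lam))  # steps until the mode decays by 1/e
freq = np.abs(np.angle(lam)) / (2.0 * np.pi)   # cycles per time step
print(abs(lam), decay_time, freq)
```

With these refractory parameters, the oscillation period comes out on the order of the mean interspike interval, consistent with the observation above that far below the critical point the frequency of C(τ) matches that of a single neuron.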
Note that near the critical point, the cross-correlation C(τ) may oscillate even if, beyond the critical point, the network dynamics will evolve not into stable oscillations but into a fixed point of larger activity. The dependence of C(τ) on the strength c of the refractory function for fixed noise level β is shown in Figure 7. Figures 7B and 7C show the corresponding autocorrelation A(τ) (without central peak) and interspike interval distribution D(τ), respectively, to first order. A(τ) and D(τ) correspond to the single-neuron results, as mentioned before. Near the critical point, the 1/N corrections from the cross-correlations become relevant and cannot be neglected anymore (Ginzburg & Sompolinsky, 1994). We argued in section 4 that the activity in a stable, stationary state is highly stochastic: either purely maintained by noise or stabilized by a high level of noise. In the examples presented in this section, since P_0 < θ, the activity is completely noise generated. This is reflected in the monotonic shape of A(τ) and the asymmetric form of D(τ). For a smaller activity P_0, the frequency and the amplitude of the cross-correlation become smaller, and the equal-time cross-correlation in general
Figure 6: Dependence of the frequency ν_β(c) of the time-delayed cross-correlation C(τ) at the critical point β_cr on the strength c of the refractory function J_r, equation 2.8. The network is in a stationary state with activity P_0 = 0.05 ≙ 50 Hz (threshold θ adjusted to keep P_0 constant). Dynamics in discrete time, J_0 = 1.0. The frequency of a single neuron is indicated by the solid line.
turns out to be positive. The qualitative dependence of the cross-correlation on c and β, however, is similar.

5.3 Cross-Correlations: Continuous Time. Within the discrete model analyzed in section 4 and the previous subsection, the absolute refractory period as well as the postsynaptic response function have been restricted to exactly one time step. This considerably simplifies the equations for the cross-correlations, enabling an effective calculation and analysis. The square form of the synaptic transmission we have assumed is, however, a poor description of the synaptic transmission, leading to the angular shape of the maximum of C(τ) (see Figures 5 and 7A). Obviously, within the discrete model, it is not possible to study the influence of the synaptic transmission on the short-time and long-time behavior of C(τ). For this purpose, we extend our analysis to the more general continuous model introduced in section 2. The dynamics of the continuous model is defined by equations 2.3 and 2.4 with h_i^in(t) from equation 2.7. For the postsynaptic response function ε(μ),
Figure 7: Cross-correlation N·C(τ) (A); autocorrelation A(τ), to leading order O(1), without central peak (B); and interspike interval distribution D(τ) (C) for various values of the refractory strength c (threshold θ adjusted to keep P_0 constant), in a homogeneous network of N ≫ 1 renewal neurons. The network is in a stable stationary state with P_0 = 0.05. Dynamics in discrete time, β = 10.0, J_0 = 1.0. The threshold drops from θ ≈ 0.19 for c = 0.0 to θ ≈ 0.14 for c = 0.6.
we choose the "alpha function," equation 2.12. Due to the renewal assumption, only the most recent spike of each presynaptic neuron is taken into account; the contribution of spikes that have been emitted before is neglected. It is easily seen that this assumption is justified in asynchronous states with low activity. Introducing densities 𝒞(τ) and 𝒜(τ) for the (stationary) cross- and autocorrelations, respectively, the theory of the continuous model is developed in appendix A.3. It is shown that the cross-correlation in a homogeneous network is determined by a linear, inhomogeneous system of coupled integro-differential equations; see equations A.22 and A.25. A few remarks are in order. First, the postsynaptic response function ε(μ) directly enters into the equation for 𝒞(τ), even though the mean network activity P_0 (for low activity) does not depend on ε(μ). If the synaptic transmission is modeled to be instantaneous (similar to the discrete model, using a δ-distribution), the cross-correlation 𝒞(τ) also shows a δ peak. Second, comparing the equations for 𝒞(0) and 𝒞(τ), it can be seen that the cross-correlation may be discontinuous at τ = 0 (see appendix A.3). From Figure 8, it may be inferred, however, that a continuous shape of J_r and ε(μ) smoothes this discontinuity. Figures 8 and 9A show some examples of N·𝒞(τ) in a homogeneous network of N ≫ 1 renewal neurons with refractory function J_r from equation 2.9, calculated by numerical solution of equations A.22 and A.25, far below the critical point. The network activity is fixed at P_0 = 0.02 ≙ 20 Hz. Figures 8 and 9A show that the short-time behavior of the cross-correlations, far below the critical point, is characterized by an asymmetric peak, the shape of which is strongly influenced by the postsynaptic response function ε(μ).
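The smoothing role of the alpha function can be illustrated directly. Equation 2.12 is not reproduced in this excerpt; the common unit-area form below, with time constant α, is an assumption used only to show the shape of the response.

```python
import numpy as np

# Illustrative "alpha function" postsynaptic response (an assumed form,
# since the paper's eq. 2.12 is not reproduced here): unit area, smooth
# rise from zero, peak at mu = alpha.  Unlike the delta-like transmission
# of the discrete model, this response has no jump, which is what smooths
# the angular peak of C(tau).

def eps(mu, alpha):
    return (mu / alpha**2) * np.exp(-mu / alpha)

mu = np.linspace(0.0, 80.0, 200001)
dmu = mu[1] - mu[0]
shapes = {}
for alpha in (0.5, 1.0, 2.0):
    vals = eps(mu, alpha)
    shapes[alpha] = (vals.sum() * dmu, mu[np.argmax(vals)])  # (area, peak time)
print(shapes)
```

For every α the response integrates to 1 and peaks at μ = α, so varying α changes the time constant of the synaptic transmission without changing its total weight, which is why the mean activity can stay fixed while the shape of 𝒞(τ) changes.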
For oscillating 𝒞(τ), even its long-time behavior, its frequency and damping constant, may depend on α (see Figure 8 for c = 0.4), if the strength c of the refractory function is not too large. Qualitatively, the dependence of the cross-correlation 𝒞(τ) on the strength c of the refractoriness and the noise level β is similar to the discrete model (see also section 5.4). Figure 9 presents an example where the network activity P_0 exceeds the threshold θ. Thus, the network activity is not maintained purely by noise, and a relatively low amount of noise is required to stabilize the fixed point. This is reflected in the oscillatory shape of the autocorrelation and the more or less symmetric interspike interval distribution (see Figures 9B and 9C), respectively. The fact that, in contrast to similar P_0 in the discrete model, the asynchronous state is nevertheless stable indicates that the network in continuous time is less susceptible to stable oscillations than the discrete model.

5.4 Comparison Between Discrete- and Continuous-Time Models. The examples in sections 5.2 and 5.3 show that the basic properties of the time-dependent cross-correlation are very similar in the discrete and the continuous model. This is true for the existence of an asymmetric peak, the
Figure 8: Cross-correlation N·𝒞(τ) in a homogeneous network of N ≫ 1 renewal neurons in a stationary state with P_0 = 0.02 (threshold θ adjusted accordingly). The figure shows the dependence on the time constant α of the postsynaptic response function ε(μ), for two sets of parameters for the noise level β and refractory strength c. The equal-time cross-correlation 𝒞(0) is marked (see the legend). Dynamics in continuous time, τ_abs = 1.0, J_0 = 1.0.
monotonic or oscillating form of the time-delayed cross-correlation, and its dependence on the strength c of the refractory function and the noise level β. The most striking difference, the different form of the asymmetric peak, is a consequence of the different description of the synaptic transmission in the two models. Quantitatively, however, large differences can be observed between the two models, as illustrated in Figure 10. First, in the continuous model, the cross-correlation is much more damped than in the model with discrete time. This may indicate that in continuous time, there is less of a tendency for stable oscillations, which may be the result of the larger time constant of the postsynaptic response function (see also Abbott & van Vreeswijk, 1993). The effect of the discretization of time on the stability of the stationary solutions has not been investigated yet. Second, in the continuous model, we generally observed the equal-time cross-correlations to be positive, whereas they may be strongly negative in the discrete model, influencing the location of the asymmetric peak (see Figure 10). Third, the amplitude of the extrema of
Figure 9: (A) Cross-correlation N·𝒞(τ), (B) autocorrelation 𝒜(τ), to leading order O(1), without central peak, and (C) interspike interval distribution D(τ) in a homogeneous network of N ≫ 1 renewal neurons in a stationary state with P_0 = 0.02 (threshold θ adjusted accordingly). The figure shows the dependence on the time constant α of the postsynaptic response function ε(μ) and on the noise level β. (For low activity, ε(μ) does not enter into the first order of D(τ) and 𝒜(τ).) Dynamics in continuous time, J_r from equation 2.9, with c = 5.0, τ_abs = 1.0, J_0 = 1.0. The equal-time cross-correlation 𝒞(0) is marked (see the legend).
Figure 10: Cross-correlation N·C(τ) in a homogeneous network of N ≫ 1 renewal neurons in a stationary state with activity P_0 = 0.05 (threshold θ adjusted to keep P_0 constant). Comparison of results for dynamics in discrete time (labeled "discr."), with refractory function J_r from equation 2.8, and dynamics in continuous time (labeled "cont."), J_r from equation 2.9, with c = 1.0, β = 8.0, τ_abs = 1.0, J_0 = 1.0. The equal-time cross-correlations are marked with a symbol (see the legend).
the time-delayed cross-correlation is much larger in the discrete than in the continuous model. Fourth, from the continuous model with varying time constant of the postsynaptic response function, it is seen that even the long-time behavior of the time-delayed cross-correlation may be influenced by the time constant of the synaptic transmission (see Figure 8). Note also that for large activity, the autocorrelation is much more damped in the continuous model than in the discrete model (not shown). It is not clear yet to what extent these differences are due to the time constant of the postsynaptic response function ε(μ) only or are also an effect of the discretization of time. To summarize, the simple discrete model allows an efficient calculation and analysis of the qualitative properties of the cross-correlations for medium and large frequencies (about 50 Hz), including an analysis close to the critical point. The quantitative properties, especially of the short-time
behavior of the time-delayed cross-correlations, however, can be addressed only within the more complicated framework of a generalized model with a postsynaptic response function, such as the continuous model.

6 Summary and Discussion
We have analyzed the cross-correlations in the stable stationary state of a large but finite, homogeneous, fully connected, excitatory network of renewal neurons with stochastic dynamics, in both discrete and continuous time. Our work extends the analysis of Ginzburg and Sompolinsky (1994) to neurons with an absolute and relative refractory period. This allows for the interpretation of the cross-correlations on a timescale of single spikes. The model neuron used is based on a model proposed by Gerstner and van Hemmen (1992). We have shown that even for the simplest network architecture, homogeneous and fully connected, the cross-correlation of a pair of neurons can have a wide variety of qualitatively different features, even if the network is in a stable asynchronous state with low activity. Below the critical point, the cross-correlation exhibits an asymmetric peak resulting from the direct connection of the neuron pair. The detailed shape of the peak is strongly influenced by the synaptic transmission. Beyond the peak, the cross-correlation decays monotonically (for weak refractory properties) or shows damped oscillations (for stronger refractory properties), even if to leading order the autocorrelation does not oscillate. As the critical point is approached, the amplitude of the cross-correlations increases rapidly, and the damping time constant diverges. For strong refractory properties, the cross-correlations start to oscillate without damping, with a frequency that strongly depends on the strength of the refractory function. In this regime, the behavior and the magnitude of the cross-correlation are dominated by the network dynamics. In general, the details of the cross-correlation, its qualitative shape and its magnitude, depend crucially on the strength of the refractory properties, the noise level, the details of the synaptic transmission, the level of activity, and the closeness of the network dynamics to a bifurcation point.
In addition, the analysis shows that the refractory properties of all neurons (the target neuron, the reference neuron, and the other neurons in the network) influence the behavior of the cross-correlation. Because of this dependence on a multitude of factors, there is no simple relation between the cross-correlations and the neurons' refractory function or interspike interval distribution. Instead, any interpretation of the cross-correlations and their magnitude has to take into account all of the above-mentioned factors. The analysis was performed in a simple network with homogeneous all-to-all coupling, thus highlighting the effect of the single-neuron and network dynamics, as opposed to the network architecture, on the cross-correlations. The analytical techniques used here can be readily adapted to
describe correlations in networks with several populations, with different intrinsic dynamics or coupling strengths (Ginzburg & Sompolinsky, 1994). In addition, random dilution of the coupling matrix can also be dealt with in a straightforward manner. We have described a network in which the strength of the individual synapses scales as 1/N. This has the effect that the fluctuations in the inputs scale as 1/√N. Nevertheless, to obtain a variability in the spiking of the neurons that is biologically plausible, the spiking of the neurons is governed by a stochastic process, described by g(h) in the discrete and w(h) in the continuous model. This process can be viewed as describing either an intrinsic stochasticity of the spike-generating mechanism of the neuron or as being due to fluctuations in the input into the cells from outside the network. In the latter view, it should, however, be emphasized that these fluctuations are uncorrelated, both in time and between neurons. The cross-correlations studied here arise purely as a result of the network dynamics. Our analysis can be extended to include correlations in the input, provided that these are sufficiently weak, that is, the cumulants of the input statistics scale as N^{−k/2}. They will contribute an extra "source" in the matrix equations 5.7 and 5.9. Still, the theory can deal only with cross-correlations that are of order 1/N. Only close to the bifurcation point, where the asynchronous state becomes unstable, do they grow to an appreciable value. It is still a matter of debate whether it is biologically plausible for the cross-correlations to be this weak. In experiments, one regularly obtains cross-correlations with peak values that are several percent of the products of their rates (Vaadia et al., 1995).
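The scaling argument above (mean input O(1), input fluctuations of order 1/√N for synapses of strength J_0/N) is easy to verify numerically. The sketch below uses independent Bernoulli spike indicators with an assumed spiking probability; the specific numbers are illustrative.

```python
import numpy as np

# Check of the 1/sqrt(N) scaling: with synaptic strengths J0/N and N
# independent presynaptic spike indicators (probability p per step), the
# mean input stays at J0*p while its standard deviation shrinks as
# 1/sqrt(N), so h.std() * sqrt(N) is roughly constant across N.

rng = np.random.default_rng(1)
J0, p = 1.0, 0.05                       # assumed efficacy and spike probability

results = {}
for N in (100, 1000, 10000):
    counts = rng.binomial(N, p, size=20000)   # active inputs per trial
    h = (J0 / N) * counts                     # synaptic input per trial
    results[N] = (h.mean(), h.std() * np.sqrt(N))
print(results)
```

The rescaled standard deviation stays near J_0·√(p(1−p)) ≈ 0.22 for all N, while the mean stays at J_0·p = 0.05, which is exactly why an extra stochastic spike-generation mechanism is needed to obtain biologically plausible variability.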
But cortical neurons receive input from thousands of other cells (Stevens, 1989), implying, in our theoretical model, that the cross-correlations should be of order 1/1000, unless the network operates very close to the bifurcation point. For the network to be close to the bifurcation point, the parameters have to be fine-tuned. It may be the case that being close to the bifurcation point has functional advantages, so that an argument could be made for a learning mechanism that puts the network near this point (Ginzburg & Sompolinsky, 1994). An alternative scenario that has recently received considerable attention is the possibility that the individual synapses are much stronger than of order 1/N, and yet the mean input is small because the total excitatory input is nearly balanced by the total inhibitory input. This scenario has recently been investigated by Amit and Brunel (1997a, 1997b) and van Vreeswijk and Sompolinsky (1996, 1998) in extremely diluted networks. In such networks, the cross-correlations can be reasonably large, even far from the bifurcation point, if there is a direct connection between the cells, due to the relatively strong synapses. The mean correlations are still of order 1/N, due to the sparsity of the connections. Experimental data suggest that the extremely small fraction of neuron pairs with appreciable cross-correlation is an unrealistic feature of these networks. Models with a connectivity that is less diluted but still with relatively strong synapses are needed to account
for this finding. Unfortunately, the analytical approach presented here cannot be used in such networks. This is because in networks with a coupling that does not scale as 1/N, the higher cumulants of the spike distribution are not negligible compared to the two-point cross-correlation. A theoretical framework to study this scenario is not currently available, so a comparison of this scenario with the model presented in our article close to the bifurcation point cannot be undertaken yet. Several other authors have recently considered the dynamics of networks of spiking neurons with a considerable degree of stochasticity (Gerstner, 2000; Spiridon & Gerstner, 1999; Brunel & Hakim, 1999). Gerstner (2000) and Spiridon and Gerstner (1999) have investigated the response of such networks to small variations in external input and characterized this response analytically. This was done by studying the mean network response to these input changes. Whereas the analysis of such an input change within mean-field theory is straightforward within our model, its effect on the cross-correlations cannot be calculated within our framework, since the assumption of an asynchronous state and "weak" cross-correlations is no longer justified. The effect of an input change on the cross-correlations is still an open issue. Nevertheless, one can relate the cross-correlations in a network that receives constant input to the mean network response to weakly varying input; the time course of both depends on the perturbation matrix S (see equation 4.5). Brunel and Hakim (1999) consider fast oscillations in a network of integrate-and-fire neurons receiving noisy external input. Their study is mainly concerned with sparsely connected networks in which the asynchronous state is unstable, whereas our focus is on fully connected networks in stable asynchronous states. Still, it is interesting to ask how their study relates to the model presented here.
However, the different network architecture and the different dynamical state, as well as the considerable difficulty of the calculations in both models, make such a comparison extremely involved. There are also indications that networks of deterministic integrate-and-fire neurons with stochastic inputs and models involving stochastically spiking neurons behave qualitatively differently. The rapid response to external input found for stochastic neurons (Gerstner, 2000) is absent for integrate-and-fire neurons with input noise (Brunel, Chance, Fourcaud, & Abbott, 2001). Hence, a detailed comparison between the results of Brunel and Hakim (1999) and the work presented here is beyond the scope of this article. The main goal of this work was the theoretical calculation and analysis of the cross-correlations in large networks. Thus, we posed rather strong constraints on our model, such as the homogeneity assumption and full connectivity. It is an interesting question how our results apply to real networks when some of these constraints are lifted. Such an investigation and a quantitative comparison to experimental results, carried out by numerical simulations of realistic networks, is an interesting topic for future research, but beyond the scope of this article.
Appendix
In this appendix, we present the general theory for calculating the temporal cross-correlations in a large, fully connected, excitatory network (N ≫ 1) in a stable, stationary state. For simplicity, it is assumed that the network is homogeneous, θ_i = θ, and that there is no external input, I_i^ext(t) = 0.

A.1 Expansion Around the Mean-Field Results. In a large but finite network (N ≫ 1) in a stable, stationary state, it is expected that the network dynamics deviates only slightly from the mean-field results. For synaptic couplings, equation 2.13, and postsynaptic response function ε(μ) = δ_{μ,0}, the postsynaptic potential, equation 4.1, can be written as

$$
h_i^{\mathrm{in}}(t) = \frac{J_0}{N-1} \sum_{\substack{j=1 \\ j \neq i}}^{N} \delta_{\mu_j(t),0}
= J_0 P_0(t) + \delta h_i^{\mathrm{in}}(t),
\tag{A.1}
$$

where

$$
\delta h_i^{\mathrm{in}}(t) = \frac{J_0}{N-1} \sum_{j \neq i} \bigl( \delta_{\mu_j(t),0} - P_0(t) \bigr).
\tag{A.2}
$$
In an asynchronous state, δ_{μ_j(t),0} is to leading order independent for different j, the mean of δ_{μ_j(t),0} − P_0(t) is 0, and its variance of order 1. This implies δh_i^{in}(t) = O(1/√N), and we can expand the spike emission probability

$$
\begin{aligned}
g\bigl( h_i^{\mathrm{in}}(t) - J_r(\mu_i(t)) - \theta \bigr)
&= g^{t}_{\mu_i(t)}
+ \hat g^{t}_{\mu_i(t)}\, \frac{1}{N-1} \sum_{k \neq i} \bigl( \delta_{\mu_k(t),0} - P_0(t) \bigr) \\
&\quad + \hat{\hat g}^{t}_{\mu_i(t)}\, \frac{1}{(N-1)^2}
\sum_{k \neq i} \sum_{l \neq i} \bigl( \delta_{\mu_k(t),0} - P_0(t) \bigr) \bigl( \delta_{\mu_l(t),0} - P_0(t) \bigr),
\end{aligned}
\tag{A.3}
$$

with g_μ^t from equation 4.22,

$$
\hat g^{t}_{\mu} := J_0 \left. \frac{\partial g(h)}{\partial h} \right|_{h_\mu}
\qquad \text{and} \qquad
\hat{\hat g}^{t}_{\mu} := \frac{J_0^2}{2} \left. \frac{\partial^2 g(h)}{\partial h^2} \right|_{h_\mu},
\tag{A.4}
$$

where h_μ = J_0 P_0(t) − J_r(μ) − θ. In addition, the joint probability distribution of the network states can be expanded, setting δ_{μ_i(t),μ} = P_i^μ(t) + (δ_{μ_i(t),μ} − P_i^μ(t)), and so on (cumulant expansion):

$$
P\bigl( \mu_i(t) = \nu,\ \mu_j(t') = \mu \bigr)
= \bigl\langle \delta_{\mu_i(t),\nu}\, \delta_{\mu_j(t'),\mu} \bigr\rangle
= P_i^{\nu}(t)\, P_j^{\mu}(t') + C_{ij}^{\nu\mu}(t,t')
\tag{A.5}
$$
$$
\begin{aligned}
P\bigl( \mu_i(t) = \nu,\ \mu_j(t') = \mu,\ \mu_k(t'') = v \bigr)
&= \bigl\langle \delta_{\mu_i(t),\nu}\, \delta_{\mu_j(t'),\mu}\, \delta_{\mu_k(t''),v} \bigr\rangle \\
&= P_i^{\nu}(t)\, P\bigl( \mu_j(t') = \mu,\ \mu_k(t'') = v \bigr)
+ P_j^{\mu}(t')\, C_{ik}^{\nu v}(t,t'')
+ P_k^{v}(t'')\, C_{ij}^{\nu\mu}(t,t') \\
&\quad + \Bigl\langle \bigl( \delta_{\mu_i(t),\nu} - P_i^{\nu}(t) \bigr)
\bigl( \delta_{\mu_j(t'),\mu} - P_j^{\mu}(t') \bigr)
\bigl( \delta_{\mu_k(t''),v} - P_k^{v}(t'') \bigr) \Bigr\rangle,
\end{aligned}
\tag{A.6}
$$

and so on for higher-order correlations. Here we have used P_i^μ(t) to denote P(μ_i(t) = μ). Using equation A.3, the cumulants can be evaluated. As mentioned above, if the asynchronous state is stable, one expects that the moments (1/(N−1)) Σ_{j≠i} (δ_{μ_j(t),μ} − P_i^μ(t)) become small in the large-N limit. Since these moments are of order 1/√N, the kth-order moment should be of order 1/N^{k/2} if all indices i, j, k, . . . are different, of order 1/N^{(k−1)/2} if one pair of indices is the same, and so on. If this ansatz is made, the joint distribution can be calculated to leading order in 1/N, since for the kth moment, only cumulants of equal or lower order contribute. Using this ansatz in equation A.3, it can then be shown that the kth cumulant is indeed of order 1/N^{k/2}, verifying the self-consistency of the ansatz. Thus, within this ansatz, higher-order terms in equation A.3, as well as the three-point cross-correlations C_{ijk}^{νμv} (the last summand in equation A.6), can be neglected.
A.2 Discrete Model. We are particularly interested in the "spike-spike cross-correlation," C(t,t') = C^{00}(t,t'). However, because the probability of firing an action potential depends on the time μ since the last spike, this depends on C^{μν} for all μ and ν. We therefore have to evaluate all of these. Because μ is reset to 0 after a spike is emitted, C^{μ0} and C^{μν} have to be treated separately. According to definition equation 5.1, the time-delayed cross-correlations C_{ij}^{ν0}(t, t'+1) (i ≠ j) can, for t' − t > 0, be written as

$$
\begin{aligned}
C_{ij}^{\nu 0}(t, t'+1)
&= \sum_{\{\mu(t')\}} P\bigl( \mu_j(t'+1) = 0 \mid \{\mu(t')\},\ \mu_i(t) = \nu \bigr)\, P\bigl( \mu_i(t) = \nu,\ \{\mu(t')\} \bigr) \\
&\quad - \sum_{\{\mu(t')\}} P\bigl( \mu_i(t) = \nu \bigr)\, P\bigl( \mu_j(t'+1) = 0 \mid \{\mu(t')\} \bigr)\, P\bigl( \{\mu(t')\} \bigr) \\
&= \sum_{\{\mu(t')\}} g_j\bigl[ \{\mu(t')\} \bigr]
\Bigl[ P\bigl( \mu_i(t) = \nu,\ \{\mu(t')\} \bigr) - P\bigl( \mu_i(t) = \nu \bigr)\, P\bigl( \{\mu(t')\} \bigr) \Bigr],
\end{aligned}
\tag{A.7}
$$
where, in the last equality, equation 2.1 has been used. Note that g_j depends on all network states at time t', but not on μ_i(t) = ν, since t < t'. For t' = t, a similar equation can be written, where the sum runs over all network states
(at time t' = t) except μ_i(t), which is fixed to be ν. This eventually leads to the same result. In asynchronous states, the expansion A.3 can be used. Considering all neuron states (at time t') that contribute to the right-hand side of equation A.3 separately (i.e., summing over μ = μ_j(t'), v = μ_k(t'), and so on) and integrating out the other neuron states (at time t' > t), which are not explicitly involved in expansion A.3, one arrives at

$$
\begin{aligned}
C_{ij}^{\nu 0}(t, t'+1)
&= \sum_{\mu=0}^{\infty} g_{\mu}^{t'} \Bigl[ P\bigl( \mu_i(t) = \nu,\ \mu_j(t') = \mu \bigr)
- P\bigl( \mu_i(t) = \nu \bigr)\, P\bigl( \mu_j(t') = \mu \bigr) \Bigr] \\
&\quad + \sum_{\mu, v = 0}^{\infty} \hat g_{\mu}^{t'} \sum_{k \neq j} \frac{\delta_{v,0} - P_0(t)}{N-1}
\Bigl[ P\bigl( \mu_i(t) = \nu,\ \mu_j(t') = \mu,\ \mu_k(t') = v \bigr) \\
&\qquad\qquad - P\bigl( \mu_i(t) = \nu \bigr)\, P\bigl( \mu_j(t') = \mu,\ \mu_k(t') = v \bigr) \Bigr].
\end{aligned}
\tag{A.8}
$$

Here, the neuron state μ_i(t) = ν remains in any case, since only neuron states at time t' > t are involved in the summation (or, for t' = t, the summation runs only over neuron states j ≠ i). Note that the term containing the second derivative $\hat{\hat g}^{t'}_{\mu_j(t')}$ does not contribute in leading order 1/N, since it already describes pair correlations of order 1/N in the synaptic input of neuron j and leads to a term of order N^{−3/2} in equation A.8. The first term in equation A.8 gives C_{ij}^{νμ}(t,t'). The second term is expanded using equation A.6. As explained in section A.1, in asynchronous states, the three-point correlation C_{ijk}^{νμv} (the last term in equation A.6) can be neglected because it contributes only a term of order N^{−3/2}. Since

$$
\sum_{v=0}^{\infty} \bigl( \delta_{v,0} - P_0(t) \bigr)\, P^{v}(t)
= P_0(t) \Bigl( 1 - \sum_{v=0}^{\infty} P^{v}(t) \Bigr) = 0,
\tag{A.9}
$$

the term proportional to (1/(N−1)) Σ_{k≠j} P_k^v(t'') = P^v(t'') does not contribute. The only remaining term proportional to P_j^μ(t') results for k ≠ i in the cross-correlation C_{ik}^{νv}(t,t') and for k = i in the autocorrelation A_i^{νv}(t,t'). The result can be further simplified with the following identity. Since

$$
P\bigl( \mu_i(t) = \mu \bigr)
= \sum_{\nu=0}^{\infty} P\bigl( \mu_i(t) = \mu,\ \mu_j(t') = \nu \bigr)
= \sum_{\nu=0}^{\infty} \Bigl\{ P\bigl( \mu_i(t) = \mu \bigr)\, P\bigl( \mu_j(t') = \nu \bigr) + C_{ij}^{\mu\nu}(t,t') \Bigr\}
= P\bigl( \mu_i(t) = \mu \bigr) + \sum_{\nu=0}^{\infty} C_{ij}^{\mu\nu}(t,t'),
\tag{A.10}
$$
and likewise P(μ_j(t') = ν) = Σ_{μ=0}^∞ P(μ_i(t) = μ, μ_j(t') = ν), the correlations C_{ij}^{μν} satisfy the equality

$$
\sum_{\nu=0}^{\infty} C_{ij}^{\mu\nu}(t,t')
= \sum_{\mu=0}^{\infty} C_{ij}^{\mu\nu}(t,t') = 0.
\tag{A.11}
$$
Using this relation, it can be seen that the term proportional to P_0(t) does not contribute. Thus, one obtains in a homogeneous network in a stable, stationary state, with τ = t' − t > 0,

$$
C^{\nu 0}(\tau + 1)
= \sum_{\mu=0}^{\infty} g_{\mu}\, C^{\nu\mu}(\tau)
+ \Bigl( \sum_{\mu=0}^{\infty} \hat g_{\mu}\, P_{\mu} \Bigr)
\Bigl[ C^{\nu 0}(\tau) + \frac{1}{N}\, A^{\nu 0}(\tau) \Bigr].
\tag{A.12}
$$
Here, we have used that for t → ∞, C^{νμ}(t, t+τ) → C^{νμ}(τ), g_μ^t → g_μ, and ĝ_μ^t → ĝ_μ. For τ = 0, a similar derivation leads to the same result. For ν, v ≥ 0, C^{ν(v+1)}(τ+1) can be calculated analogously and obeys

$$
C^{\nu (v+1)}(\tau + 1)
= (1 - g_{v})\, C^{\nu v}(\tau)
- \hat g_{v}\, P_{v} \Bigl[ C^{\nu 0}(\tau) + \frac{1}{N}\, A^{\nu 0}(\tau) \Bigr].
\tag{A.13}
$$
Equations A.12 and A.13 can be written in matrix form, equation 5.7. This is an inhomogeneous matrix equation from which C^{νv}(τ) can be inferred for all ν, v, and τ > 0, if A^{νv}(τ) and C^{νv}(0) are given. The equal-time cross-correlations C^{νv}(0) can be calculated in a fashion similar to the time-delayed cross-correlations (involving a product of equation A.3), and using

$$
C^{\nu v}(0) = C^{v \nu}(0),
\tag{A.14}
$$

one arrives at equation 5.9. Since the autocorrelations are of order 1, it is seen from equations 5.7 and 5.9 that C(τ) ∼ O(1/N). Finally, A^{v0}(τ) is derived as follows: A^{v0}(τ) = lim_{t→∞} A^{v0}(t, t+τ) is the excess probability that the neuron fires at times t − v and t + τ and does not fire between times t − v and t. Thus, A^{v0}(τ) is given by

$$
A^{v 0}(\tau) = \frac{P_{v}}{P_0}\, A^{00}(v + \tau).
\tag{A.15}
$$
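Equations A.12, A.13, and A.15 can be iterated directly once the refractory state is truncated at a cutoff L. In the NumPy sketch below, g_μ, ĝ_μ, and the single-neuron autocorrelation A^{00}(τ) are illustrative placeholders (in the paper they follow from equations 4.22 and A.4 and from section 3), and the initial condition is set to C(0) = 0 instead of solving equation 5.9, so the output only shows how the autocorrelation "source" builds up N·C(τ).

```python
import numpy as np

# Direct iteration of the recursion A.12 / A.13, with A^{v0} from A.15.
# g_mu, ghat_mu, and A00(tau) are illustrative placeholders, and C(0)=0
# replaces the proper initial condition of eq. 5.9 (sketch only).

L, N, T_max = 50, 1000, 200
mu = np.arange(L)
g = 0.2 / (1.0 + np.exp(5.0 - mu))          # assumed spiking probabilities g_mu
ghat = 0.5 * g * (1.0 - g)                  # assumed derivatives ghat_mu

surv = np.concatenate(([1.0], np.cumprod(1.0 - g[:-1])))
P = surv / surv.sum()                       # stationary distribution P_mu
P0 = P[0]                                   # stationary activity

# Placeholder single-neuron autocorrelation A00(tau), then eq. A.15:
tau_ax = np.arange(T_max + L)
A00 = P0**2 * np.exp(-tau_ax / 20.0) * np.cos(2.0 * np.pi * tau_ax / 15.0)
A00[0] = P0 * (1.0 - P0)
Av0 = (P[:, None] / P0) * A00[mu[:, None] + np.arange(T_max)[None, :]]

gP = float(ghat @ P)                        # the factor sum_mu ghat_mu P_mu
C = np.zeros((L, L))                        # C[nu, v] = C^{nu v}(tau)
trace = []
for tau in range(T_max - 1):
    src = C[:, 0] + Av0[:, tau] / N         # bracket [C^{nu 0} + A^{nu 0}/N]
    Cnew = np.empty_like(C)
    Cnew[:, 0] = C @ g + gP * src                          # eq. A.12
    Cnew[:, 1:] = C[:, :-1] * (1.0 - g[:-1]) \
        - np.outer(src, ghat[:-1] * P[:-1])                # eq. A.13
    C = Cnew                                # (cutoff handled by truncation)
    trace.append(N * C[0, 0])               # N * C(tau), as in Figure 5
print(min(trace), max(trace))
```

Even with this crude truncation at the cutoff, the iteration reproduces the qualitative point of the text: the spike-spike cross-correlation is driven entirely by the autocorrelation source and stays of order 1/N.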
For a numerical analysis, it is convenient to introduce a cutoff L in the refractory function by setting J_r(μ) = 0 for all μ ≥ L, that is, g_μ = g := g(J_0 P_0 − θ) for all μ ≥ L. Then the fixed-point activity P_0 can be expressed in terms of the first L components P_μ, 0 ≤ μ ≤ L − 1. It is straightforward to replace the infinite-dimensional equations 4.2, 5.7, and 5.9 by corresponding equations of finite dimension L, if T and S are
400
Carsten Meyer and Carl van Vreeswijk
modified accordingly. Note that in order to keep the matrix dimension $L$ small (for numerical reasons), the network activity has to be chosen large enough to avoid artifacts arising from the cutoff $L$; thus, in section 5.2, the cross-correlations are evaluated for an activity of $P_0 = 0.05 \,\hat{=}\, 50$ Hz and a cutoff $L = 50$.

A.3 Continuous Model. The equations for the model in continuous time are derived from a generalized discrete model with time step $\Delta t$ and transition rate $w$, equation 2.5. Including the postsynaptic response function $\epsilon(\mu)$, the postsynaptic potential in a homogeneous network (equation 2.13) is, according to equation 2.7, given by

$$h_i^{\mathrm{syn}}(t) = \frac{J_0}{N-1} \sum_{j \neq i} \int_0^{\infty} d\mu\, \epsilon(\mu)\, \delta\big(m_j(t) - \mu\big), \qquad (A.16)$$
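The fixed-point computation with a refractory cutoff described above can be sketched numerically. Everything below is an assumption made for the sake of the example: the sigmoidal gain, the refractory function, and all parameter values are invented, not taken from the paper.

```python
import numpy as np

# Sketch: stationary age distribution P_mu (mu = time since last spike) and
# fixed-point activity P_0 for a discrete refractory model with cutoff L.
def gain(h):
    return 1.0 / (1.0 + np.exp(-4.0 * h))     # assumed smooth gain g(h)

def stationary_distribution(J0=2.0, theta=1.0, L=50, n_iter=2000):
    """Iterate P_0 <- sum_mu g_mu P_mu and P_{mu+1} <- (1 - g_mu) P_mu."""
    Jr = np.where(np.arange(L) < 5, 10.0, 0.0)  # assumed refractoriness, 0 beyond cutoff
    P = np.ones(L) / L                           # initial age distribution
    for _ in range(n_iter):
        g = gain(J0 * P[0] - Jr - theta)         # firing probability per age bin
        new = np.empty(L)
        new[0] = np.sum(g * P)                   # neurons that just fired
        new[1:] = (1.0 - g[:-1]) * P[:-1]        # survivors age by one step
        new[-1] += (1.0 - g[-1]) * P[-1]         # lump ages >= L into last bin
        P = new / new.sum()                      # keep normalization
    return P

P = stationary_distribution()
```

The returned `P[0]` plays the role of the network activity $P_0$ entering the finite-dimensional equations.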
yielding for $N \to \infty$

$$\langle h_i^{\mathrm{syn}}(t) \rangle = J_0 \int_0^{\infty} d\mu\, \epsilon(\mu)\, P_\mu(t) = J_0 \bar P(t) \qquad (A.17)$$

with

$$P_\mu(t) = \langle \delta(m_j(t) - \mu) \rangle \qquad\text{and}\qquad \bar P(t) = \int_0^{\infty} d\mu\, \epsilon(\mu)\, P_\mu(t). \qquad (A.18)$$
Expanding the transition rate $w(h)$ in a stationary state around the average synaptic input $J_0 \bar P$, similar to equation A.3, the following equation can be derived for the time-delayed cross-correlation (compare equation A.12):

$$C_{ij}^{\nu 0}(t, t' + \Delta t) = \sum_{\mu=0}^{\infty} \Delta t\, w_{\mu}^{t'}\, C_{ij}^{\nu\mu}(t, t') + \Big( \sum_{\mu=0}^{\infty} \Delta t\, \hat w_{\mu}^{t'}\, P_\mu(t') \Big)\, \frac{1}{N} \sum_{\kappa=0}^{\infty} \Delta\kappa\, \epsilon(\kappa) \Big[ \sum_{\substack{k=1 \\ k \neq i}}^{N} C_{ik}^{\nu\kappa}(t, t') + A_{i}^{\nu\kappa}(t, t') \Big], \qquad (A.19)$$

with

$$w_{\mu}^{t} := w\big[J_0 \bar P(t) - J_r(\mu) - \vartheta\big] \qquad\text{and}\qquad \hat w_{\mu}^{t} := J_0 \left. \frac{\partial w(h)}{\partial h} \right|_{h = J_0 \bar P(t) - J_r(\mu) - \vartheta}. \qquad (A.20)$$
Introducing densities $\mathcal{C}_{ij}^{\nu v}(t, t')$ and $\mathcal{A}_i^{\nu v}(t, t')$ for the cross-correlations and autocorrelations, respectively, by

$$C_{ij}^{\nu v}(t, t') = \Delta\nu\, \Delta v\, \mathcal{C}_{ij}^{\nu v}(t, t'), \qquad A_i^{\nu v}(t, t') = \Delta\nu\, \Delta v\, \mathcal{A}_i^{\nu v}(t, t'), \qquad (A.21)$$
Stochastic Networks of Spiking Neurons
401
and a probability density $\mathcal{P}_\mu$ by $P_\mu = \Delta\mu\, \mathcal{P}_\mu$, we get for the time-delayed spike-spike cross-correlation $\mathcal{C}(\tau) := \mathcal{C}^{00}(\tau) := \lim_{t\to\infty} \mathcal{C}^{00}(t, t+\tau)$ in a stable, stationary state,

$$\mathcal{C}(\tau) = \int_0^{\infty} d\mu\, w_\mu\, \mathcal{C}^{0\mu}(\tau) + \Big( \int_0^{\infty} d\mu\, \hat w_\mu \mathcal{P}_\mu \Big) \int_0^{\infty} d\kappa\, \epsilon(\kappa) \Big[ \mathcal{C}^{0\kappa}(\tau) + \frac{1}{N}\, \mathcal{A}^{0\kappa}(\tau) \Big], \qquad (A.22)$$
where $w_\mu$ and $\hat w_\mu$ are the time-independent versions of equation A.20 (stationary $\bar P$). $\mathcal{A}^{0\kappa}(\tau)$ is calculated from

$$\mathcal{A}^{0\kappa}(\tau) = \begin{cases} \dfrac{\mathcal{P}_\kappa}{P_0}\, \mathcal{A}^{00}(\tau - \kappa) & \text{for } \kappa < \tau, \\[4pt] \mathcal{P}_\tau\, \delta(\kappa - \tau) - P_0\, \mathcal{P}_\kappa & \text{for } \kappa \geq \tau. \end{cases} \qquad (A.23)$$

To derive an equation for $\mathcal{C}_{ij}^{\nu(\mu + \Delta t)}(\tau + \Delta t)$, we expand the function

$$\mathcal{C}_{ij}^{\nu(\mu+\Delta t)}(\tau + \Delta t) = \mathcal{C}_{ij}^{\nu\mu}(\tau) + \Delta t\, \frac{\partial}{\partial \mu} \mathcal{C}_{ij}^{\nu\mu}(\tau) + \Delta t\, \frac{\partial}{\partial \tau} \mathcal{C}_{ij}^{\nu\mu}(\tau) + o(\Delta t), \qquad (A.24)$$
which is continuous for $\mu > 0$, $\tau > 0$, and $\nu \geq 0$, and arrive at the integro-differential equation ($\tau, \mu > 0$):

$$\frac{\partial}{\partial \tau} \mathcal{C}^{0\mu}(\tau) + \frac{\partial}{\partial \mu} \mathcal{C}^{0\mu}(\tau) = -w_\mu\, \mathcal{C}^{0\mu}(\tau) - \hat w_\mu \mathcal{P}_\mu \int_0^{\infty} d\kappa\, \epsilon(\kappa) \Big[ \mathcal{C}^{0\kappa}(\tau) + \frac{1}{N}\, \mathcal{A}^{0\kappa}(\tau) \Big]. \qquad (A.25)$$
The equal-time cross-correlations are given by

$$\mathcal{C}^{00}(0) = \int_0^{\infty} d\nu\, w_\nu \int_0^{\infty} d\mu\, w_\mu\, \mathcal{C}^{\nu\mu}(0) + 2\tilde\sigma \int_0^{\infty} d\mu\, w_\mu \int_0^{\infty} dv\, \epsilon(v) \Big[ \mathcal{C}^{v\mu}(0) + \frac{1}{N}\, \mathcal{A}^{v\mu}(0) \Big] + \tilde\sigma^2 \int_0^{\infty} dv\, \epsilon(v) \int_0^{\infty} d\kappa\, \epsilon(\kappa) \Big[ \mathcal{C}^{v\kappa}(0) + \frac{1}{N}\, \mathcal{A}^{v\kappa}(0) \Big] \qquad (A.26)$$

and

$$\mathcal{C}^{0\mu}(0) = \int_0^{\infty} d\nu\, w_\nu\, \mathcal{C}^{\nu\mu}(0) + \tilde\sigma \int_0^{\infty} dv\, \epsilon(v) \Big[ \mathcal{C}^{v\mu}(0) + \frac{1}{N}\, \mathcal{A}^{v\mu}(0) \Big] \qquad (\mu > 0) \qquad (A.27)$$
$$\frac{\partial}{\partial \nu} \mathcal{C}^{\nu\mu}(0) + \frac{\partial}{\partial \mu} \mathcal{C}^{\nu\mu}(0) = -w_\nu\, \mathcal{C}^{\nu\mu}(0) - w_\mu\, \mathcal{C}^{\nu\mu}(0) - \hat w_\nu \mathcal{P}_\nu \int_0^{\infty} dv\, \epsilon(v) \Big[ \mathcal{C}^{v\mu}(0) + \frac{1}{N}\, \mathcal{A}^{v\mu}(0) \Big] - \hat w_\mu \mathcal{P}_\mu \int_0^{\infty} dv\, \epsilon(v) \Big[ \mathcal{C}^{\nu v}(0) + \frac{1}{N}\, \mathcal{A}^{\nu v}(0) \Big] \qquad (\nu, \mu > 0), \qquad (A.28)$$
where

$$\tilde\sigma := \int_0^{\infty} d\mu\, \hat w_\mu\, \mathcal{P}_\mu, \qquad (A.29)$$

and $\mathcal{A}^{v\nu}(0)$, the excess probability density of $m_i(t) = v$ and $m_i(t) = \nu$, is given by

$$\mathcal{A}^{v\nu}(0) = \mathcal{P}_v\, \delta(v - \nu) - \mathcal{P}_v\, \mathcal{P}_\nu. \qquad (A.30)$$
Note that with $\mathcal{C}^{\nu\mu}(0) = \mathcal{C}^{\mu\nu}(0)$, it follows from equations A.22, A.27, and A.26, using equations A.23 and A.30, that

$$\lim_{\mu \to 0} \mathcal{C}^{0\mu}(0) = \lim_{\tau \to 0} \mathcal{C}^{00}(\tau) = \mathcal{C}^{00}(0) + \frac{\tilde\sigma}{N} \int d\nu\, \big[\epsilon(0) - \epsilon(\nu)\big]\, w_\nu\, \mathcal{P}_\nu \;\neq\; \mathcal{C}^{00}(0) \quad\text{in general}. \qquad (A.31)$$
This is because $\mathcal{C}^{00}(0)$ is obtained from the probability that two neurons fire in the same small time window $dt$. It is due to the fact that if neuron $i$ fires at time $t$, its input into other neurons changes instantaneously from $J_0\, \epsilon(m_i(t))/N$ to $J_0\, \epsilon(0)/N$. As a result, $\mathcal{C}^{00}(\tau)$ is different from $\mathcal{C}^{00}(0)$ even in the limit $\tau \to 0$. Only when $\epsilon$ is continuous at $t = 0$ with $\epsilon(0) = 0$, and $\epsilon$ is always zero when a spike occurs, $\epsilon(t)\, w_t = 0$ for all $t$, do instantaneous jumps in the input not occur. In this case $\lim_{\tau \to 0} \mathcal{C}^{00}(\tau)$ is indeed equal to $\mathcal{C}^{00}(0)$. In the limit $dt \to 0$, the probability of two spikes occurring at the same time becomes vanishingly small, and this double process should not contribute to the cross-correlations at $\tau = 0$. Indeed, its contribution vanishes in the limit $dt \to 0$, since it affects only a single point on the continuous time axis.

Acknowledgments
We thank Haim Sompolinsky for many helpful discussions and ideas. Further thanks to David Hansel, Hagai Bergman, and Eilon Vaadia for stimulating discussions. The work of C. M. was supported by the Minerva Foundation and the graduate students' foundation "Organisation and Dynamics of Neural Networks" at the University of Göttingen, Germany.
Received September 8, 2000; accepted May 22, 2001.
LETTER

Communicated by Misha Tsodyks

Measuring Information Spatial Densities

Michele Bezzi
[email protected].it
Cognitive Neuroscience Sector, S.I.S.S.A., Trieste, Italy, and INFM sez. di Firenze, 2 I-50125 Firenze, Italy

Inés Samengo
[email protected]
Cognitive Neuroscience Sector, S.I.S.S.A., Trieste, Italy

Stefan Leutgeb
[email protected]
Program in Neuroscience and Department of Psychology, University of Utah, Salt Lake City, UT 84112, U.S.A.

Sheri J. Mizumori
[email protected]
Psychology Department, University of Washington, Seattle, WA 98195, U.S.A.

A novel definition of the stimulus-specific information is presented, which is particularly useful when the stimuli constitute a continuous and metric set, as, for example, position in space. The approach allows one to build the spatial information distribution of a given neural response. The method is applied to the investigation of putative differences in the coding of position in hippocampus and lateral septum.

Neural Computation 14, 405–420 (2001)
© 2001 Massachusetts Institute of Technology

1 Introduction

It has long been known that many of the pyramidal cells in the rodent hippocampus selectively fire when the animal is in a particular location of its environment (O'Keefe & Dostrovsky, 1971). This phenomenon gave rise to the concept of place fields and place cells, that is, the association between a given cell and the particular region of space where it fires. Many computational models of hippocampal coding for space (Tsodyks, 1999) are based on the idea that the information provided by each place cell concerns whether the animal is or is not at a particular location: the place field. It is clear, however, that at least in principle, a given cell could provide an appreciable amount of information about the position of the animal without having a place field in the rigorous sense. For example, it could happen that a cell indiscriminately fired all over the environment except in one specific location,
where it remained silent. Such a coding would be particularly informative on those occasions when the neuron did not fire. More generally, a cell could fire throughout the whole environment but with a stable and reproducible distribution of firing rates strongly selective to the position. Place field coding would mean that such a firing distribution is a very specific one, one where the cell remains silent everywhere except inside the place field. In this sense, the idea of a place cell is to the coding of position what in other contexts has been referred to as a grandmother cell. Whether a given place cell behaves strictly as a grandmother cell depends on the size of the spatial bins. However, broadly speaking, place cells use a sparse code and respond to only a very small fraction of all possible locations. Figure 1 shows four examples of the firing-rate distribution of four different neurons when a rat is exploring an eight-arm maze. Similar to previous reports (Zhou, Tamura, Kuriwaki, & Ono, 1999), all of these neurons provide an appreciable amount of information about the location of the animal. Table 1 summarizes the data corresponding to the figure.

Figure 1: Firing-rate distribution of four different neurons when a rat is exploring an eight-arm maze. In each case, the density of the color is proportional to the number of spikes per unit time, as a function of space. Since cells of different types have very dissimilar mean firing rates, each plot has been normalized. The absolute rates are shown in Table 1. Each cell provides an amount of information that is at least as large as $\langle I \rangle + \sigma_I$, where $\langle I \rangle$ is the mean information provided by the whole set of cells of that particular type, and $\sigma_I$ is its standard deviation (see Table 2 for the quantitative details).

Table 1: Data Corresponding to the Cells Whose Firing Density Is Shown in Figure 1.

Figure   Type of Neuron               ⟨r⟩ (spikes per second)   I/t (bits per second)
1a       Hippocampal pyramidal cell    0.1981                    0.7431
1b       Hippocampal interneuron      15.8995                    2.1941
1c       Lateral septum                2.2688                    1.2921
1d       Medial septum                10.4878                    0.4411

Note: ⟨r⟩ = mean firing rate of the cell, averaged throughout the maze. I/t = corrected information rate (see equations 2.7 and 3.1).

Figure 1a shows a typical place field; the cell fires only when the animal reaches the end point of the right arm. Figures 1b and 1c show two different distributed codes; the first corresponds to a hippocampal interneuron and the latter to a neuron in the lateral septum that selectively fires when the rat occupies the end points of the maze. Finally, Figure 1d shows a cell in the medial septum whose discharge corresponds to all locations within the environment, with a somewhat lower rate in the end points of the maze, especially in the upper arm. These examples show that there might be different coding schemes for position other than localized place fields. Here, we explore such coding strategies in both hippocampal pyramidal neurons and lateral septal cells of behaving rats. The lateral septum receives a massive projection from the hippocampus (Swanson, Sawchenko, & Cowan, 1981; Jakab & Leranth, 1995), which presumably provides information about spatial location. Our aim is to see whether different types of neurons use different codes to represent position. In the following section, the local information is defined, and its relation to previous similar quantities is established. By taking the limit of a very fine binning, such local information gives rise to a spatial information density that can be used to explore the coding strategy of a cell. In section 3, the spatial information distribution is calculated for actual recordings in rat hippocampal and lateral septal cells. In section 4 we make some concluding remarks.

2 Stimulus-Specific Informations
Our analysis is based on the calculation of the mutual information between the neural response of each single cell and the location of the animal,

$$I = \sum_j \sum_n P(n, x_j)\, \log_2 \left[ \frac{P(n, x_j)}{P(n)\, P(x_j)} \right], \qquad (2.1)$$
where $x_j$ is a small region of space and $n$ is the number of spikes fired in a given time window. $P(n, x_j)$ is the joint probability of finding the rat at $x_j$ while measuring response $n$ and can always be written as $P(n, x_j) = P(n | x_j)\, P(x_j)$. The a priori probability $P(x_j)$ is estimated from the ratio between the time spent at position $x_j$ and the total time. The probability of response $n$ reads

$$P(n) = \sum_j P(n, x_j). \qquad (2.2)$$
The mutual information $I$ measures the selectivity of the firing of the cell to the location of the animal. It quantifies how much can be learned about the position of the rat by looking at the response of the neuron. In contrast to other correlation measures, its numerical value does not depend on whether the cell fires only in a particular location or whether it remains silent there. It may happen, however, that the neural responses are highly selective to some very specific locations and not to others. It is clear that the quantity defined in equation 2.1 provides the total amount of information, averaged over all positions. The scope of this section is to characterize the detailed structure of the spatial locations where the cell is most informative. To do so, we build a spatial information map, that is, a way to quantify the amount of information provided by the cell about every single location $x_j$. This issue has been discussed in De Weese and Meister (1999), although in the context of more general stimuli rather than specifically position in space. Two definitions (among infinitely many) have been pointed out: the stimulus-specific surprise (which will be addressed here as the position-specific surprise),

$$I_1(x_j) = \sum_n P(n | x_j)\, \log_2 \left[ \frac{P(n | x_j)}{P(n)} \right], \qquad (2.3)$$

and the stimulus-specific information (here, position-specific information),

$$I_2(x_j) = -\sum_n P(n)\, \log_2 [P(n)] + \sum_n P(n | x_j)\, \log_2 [P(n | x_j)]. \qquad (2.4)$$

Both of these quantities, when averaged in $x_j$, give the total information (see equation 2.1),

$$\sum_j P(x_j)\, I_{1,2}(x_j) = I. \qquad (2.5)$$
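The relation 2.5 is easy to verify numerically. The sketch below builds a small joint table $P(n, x_j)$ (invented purely for illustration) and checks that both $I_1$ and $I_2$ average back to the total mutual information.

```python
import numpy as np

# Mutual information (equation 2.1), position-specific surprise I1
# (equation 2.3), and position-specific information I2 (equation 2.4),
# for a toy joint distribution over responses and positions.
P_nx = np.array([[0.30, 0.20, 0.10],   # rows: response n = 0, 1 spikes
                 [0.05, 0.15, 0.20]])  # columns: positions x_1, x_2, x_3
P_n = P_nx.sum(axis=1)                 # P(n), equation 2.2
P_x = P_nx.sum(axis=0)                 # P(x_j)
P_n_given_x = P_nx / P_x               # P(n | x_j)

# Equation 2.1: I = sum_{j,n} P(n, x_j) log2[ P(n, x_j) / (P(n) P(x_j)) ]
I = np.sum(P_nx * np.log2(P_nx / (P_n[:, None] * P_x[None, :])))

# Equation 2.3: positive, but not always additive
I1 = np.sum(P_n_given_x * np.log2(P_n_given_x / P_n[:, None]), axis=0)

# Equation 2.4: additive, but can be negative for some positions
I2 = -np.sum(P_n * np.log2(P_n)) \
     + np.sum(P_n_given_x * np.log2(P_n_given_x), axis=0)

# Equation 2.5: both average back to the total information I
assert np.isclose(np.sum(P_x * I1), I)
assert np.isclose(np.sum(P_x * I2), I)
```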
However, neither of the two is, by itself, a proper mutual information. The stimulus-specific surprise, equation 2.3, is guaranteed to be positive but may not be additive, while the stimulus-specific information, equation 2.4, is additive but not always positive. Moreover, any weighted sum of $I_1$ and $I_2$ is also a valid estimator of the information to be associated to each location (De Weese & Meister, 1999). However, in specific situations, these two local information estimators can be very different, which means that their weighted sum can lead to any possible result.

Let us examine the behavior of $I$ and $I_{1,2}$ in the short time limit. We consider a time interval $t$ and a cell whose mean firing rate at position $x_j$ is $r(x_j)$. Therefore, if $t \ll 1/r(x_j)$, the cell will probably remain silent at $x_j$, seldom firing a spike. The short time approximation involves discarding any response consisting of two or more spikes. It does not mean that such events will not occur, but rather that the set of symbols considered as informative responses are the firing of a single spike, with probability $P(1 | x_j) \approx r(x_j)\, t$, and whatever other event, which we call response 0, with probability $P(0 | x_j) = 1 - P(1 | x_j)$. Therefore, as derived in Skaggs, McNaughton, Gothard, and Markus (1993) and Panzeri, Biella, Rolls, Skaggs, and Treves (1996),

$$I = t \sum_j P(x_j) \left\{ r(x_j)\, \log_2 \left[ \frac{r(x_j)}{\langle r \rangle} \right] + \frac{\langle r \rangle - r(x_j)}{\ln 2} \right\} + O(t^2), \qquad (2.6)$$

where

$$\langle r \rangle = \sum_j P(x_j)\, r(x_j). \qquad (2.7)$$

The short time limit of $I$ is much more easily evaluated from recorded data than the full equation 2.1, since it does not need the estimation of the conditional probabilities. Only the firing rates at each location are needed. The first term in the curly brackets comes from the firing of a spike in $x_j$, and the second describes the silent response. Similarly, $I_{1,2}$ tend to

$$I_1(x_j) = t \left\{ r(x_j)\, \log_2 \left[ \frac{r(x_j)}{\langle r \rangle} \right] + \frac{\langle r \rangle - r(x_j)}{\ln 2} \right\} + O(t^2), \qquad (2.8)$$

$$I_2(x_j) = t \left\{ \frac{\langle r \rangle - r(x_j)}{\ln 2} + r(x_j)\, \log_2 [r(x_j)\, t] - \langle r \rangle\, \log_2 (\langle r \rangle\, t) \right\} + O(t^2). \qquad (2.9)$$
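The short-time expressions need only an occupancy map and a rate map; a minimal sketch (all numbers invented for illustration):

```python
import numpy as np

# Short-time information rate of equations 2.6-2.8, from occupancy
# probabilities and firing rates alone (three spatial bins).
P_x = np.array([0.5, 0.3, 0.2])     # occupancy P(x_j)
r = np.array([0.2, 5.0, 12.0])      # firing rate r(x_j), spikes per second
r_mean = np.sum(P_x * r)            # <r>, equation 2.7

# Equation 2.8 divided by t: position-specific surprise rate, I1(x_j) / t
i1_rate = r * np.log2(r / r_mean) + (r_mean - r) / np.log(2)

# Equation 2.6 divided by t: total information rate in bits per second
info_rate = np.sum(P_x * i1_rate)
```

Note that the second term in the braces averages to zero over positions, so the information rate reduces to the occupancy-weighted average of $r \log_2(r/\langle r \rangle)$, which is nonnegative.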
Equation 2.8 states that the stimulus-specific surprise also rises linearly as a function of $t$. Its first term comes when the cell fires a spike, and the second corresponds to the silent response. The stimulus-specific information, on the other hand, diverges. However, since for some stimuli $I_2(x)$ is negative and for some others it is positive, the average of them all is finite, as stated in equation 2.6. The infinitely large discrepancy between equations 2.8 and 2.9 shows that for small $t$, the choice of any one of these estimators is particularly meaningless. Although this procedure is usually referred to as a short time limit, the crucial step in deriving equations 2.6 through 2.9 is to partition the set of all
possible responses into two subsets: one containing a single response (the case $n = 1$) and the complementary one. The conditional probabilities for the occurrence of the distinguished response ($n = 1$) are taken proportional to a parameter $t$, which is supposed to be small. Such a procedure with the response variable inspires the exploration of an analogous partition in the set of locations.

2.1 A Spatial Information Density. To find a well-behaved measure of a location-specific information, we now introduce the local information $I^{\ell}(x_j)$, which quantifies how much can be learned from the responses about whether the animal is or is not in $x_j$. In other words, we partition the set of possible locations into two subsets: one containing position $x_j$ and the complementary set $\bar x_j = \{x : x \text{ does not belong to region } j\}$. Mathematically,
$$I^{\ell}(x_j) = \sum_{n=0}^{+\infty} P(n) \left\{ P(x_j | n)\, \log_2 \left[ \frac{P(x_j | n)}{P(x_j)} \right] + P(\bar x_j | n)\, \log_2 \left[ \frac{P(\bar x_j | n)}{P(\bar x_j)} \right] \right\}, \qquad (2.10)$$

where $P(n)$ is the probability of the cell firing $n$ spikes no matter where, $P(x_j)$ is the probability of visiting location $x_j$, $P(\bar x_j) = 1 - P(x_j)$ is the probability of not being in $x_j$, $P(x_j | n)$ is the conditional probability of being in $x_j$ when the cell fires $n$ spikes, and $P(\bar x_j | n)$ is the conditional probability of not being in $x_j$ while the cell fires $n$ spikes, which can be obtained from

$$P(x_j | n) + P(\bar x_j | n) = 1. \qquad (2.11)$$
Equation 2.10 defines a proper mutual information, in the sense that it is positive and additive. In contrast to the short time limit in the response set, now there is no preferred location to be separated out. That is why we calculate as many $I^{\ell}(x_j)$ as there are positions $x_j$. As $j$ changes, however, $I^{\ell}(x_j)$ refers to a different partition of the environment. This means that one should not average the various $I^{\ell}(x_j)$. In parallel to the short time limit, we now make the area $\Delta$ of region $x_j$ tend to zero. To do so, we assume that both $P(x_j)$ and $P(x_j | n)$ arise from a continuous spatial density $\rho$,

$$P(x_j | n) = \int_{\text{region } j} \rho(x | n)\, dx, \qquad (2.12)$$

$$P(x_j) = \int_{\text{region } j} \rho(x)\, dx, \qquad (2.13)$$

where

$$\rho(x) = \sum_{n=0}^{+\infty} P(n)\, \rho(x | n). \qquad (2.14)$$

For $\Delta$ sufficiently small,

$$P(x_j | n) \approx \Delta\, \rho(x | n), \qquad P(x_j) \approx \Delta\, \rho(x). \qquad (2.15)$$

The continuity of $\rho$ allows us to drop the subindex $j$. Expanding $I^{\ell}(x)$ in powers of $\Delta$, it may be seen that the first term is

$$I^{\ell}(x) = \Delta \sum_{n=0}^{+\infty} P(n) \left\{ \rho(x | n)\, \log_2 \left[ \frac{\rho(x | n)}{\rho(x)} \right] + \frac{\rho(x) - \rho(x | n)}{\ln 2} \right\} + O(\Delta^2). \qquad (2.16)$$
The local information is therefore proportional to $\Delta$, which means that in the limit $\Delta \to 0$, it becomes a differential. Equation 2.16 is completely analogous to equation 2.6. This behavior indicates that the density

$$i(x) = \frac{I^{\ell}(x)}{\Delta} \qquad (2.17)$$
approaches a well-defined limit when $\Delta \to 0$. As pointed out earlier, $I^{\ell}(x)$ is conceptually different from the full information $I$ defined in equation 2.1, and for finite $\Delta$, there is no meaning in summing together the $I^{\ell}(x_j)$ corresponding to different $j$. However, it is easy to verify that when $\Delta \to 0$, the integral of $i(x)$ throughout the whole space coincides with the full information $I$, when the latter is calculated in the limit of very fine binning. Therefore, $i(x)$ behaves as an information spatial density. Moreover, in contrast to $I_1(x)$ and $I_2(x)$, it derives from a properly defined mutual information. Equation 2.16 is the continuous version of 2.10. It should be noticed, however, that in practice, one cannot calculate $\rho(x | n)$ from experimental data for finite time bins. If $\Delta$ is sufficiently small and the animal moves around with a given velocity, it never remains within $x_j$ during the chosen time window. Nevertheless, there is no inconvenience in giving a theoretical definition of $\rho(x | n) = \lim_{\Delta \to 0} P(x_j | n)/\Delta$, imagining one could perform the experiment placing the animal in $x_j$ and confining it there throughout the whole time interval. In order to bridge the gap between the theoretical definition of $\rho(x | n)$ and the actual possibility of measuring the information spatial density with freely moving animals, we now take equation 2.16 and
calculate its short time limit. The result is

$$I^{\ell}(x) = t\, \Delta\, \rho(x) \left\{ r(x)\, \log_2 \left[ \frac{r(x)}{\langle r \rangle} \right] + \frac{\langle r \rangle - r(x)}{\ln 2} \right\} + O(\Delta^2) + O(t^2). \qquad (2.18)$$

This same expression is obtained if one starts with the full definition, equation 2.10, and first calculates the limit of $t \to 0$ and subsequently makes $\Delta \to 0$. Comparing equation 2.18 with 2.8, it is clear that

$$I^{\ell}(x) = P(x)\, I_1(x) + O(\Delta^2) + O(t^2). \qquad (2.19)$$
We therefore conclude that in the short time limit, the position-specific surprise coincides with the local information (multiplied by the probability of occupancy). This gives the position-specific surprise $I_1$ a different status from $I_2$ or any combination of the two; although in a general situation $I_1$ is not additive, when the number of stimuli is very large (or the binning very fine, if the stimuli are continuous), it coincides with a well-defined quantity: the local information. However, such a correspondence between $I^{\ell}(x)$ and $I_1(x)$ is valid only in the short time limit, or, more precisely, when computing the information provided by one spike.

3 Data Analysis
In this section, we evaluate $I^{\ell}(x_j)$ as a function of $x_j$ using electrophysiological data recorded from rodents performing a spatial task. First, the experimental procedure is explained, and then we show that different brain regions use different coding strategies in the representation of space.

3.1 Experiment Design. Nine young adult Long-Evans rats were tested during a forced-choice and a spatial working memory task. Both tasks were performed in an eight-arm radial maze, each arm containing a small amount of chocolate milk in its distal part. In addition, the proximal part of each arm could be selectively lowered, thereby forbidding the entrance to that particular arm (for details, see Leutgeb & Mizumori, 2000). In the forced-choice task, the animal was placed in the center of the maze, while the entrance to a single arm was available. The other seven arms were kept at a lower level. After the animal had entered the available arm and taken its food reward, a second arm was raised and the previous one lowered. The procedure was continued by allowing the animal to enter just one arm at a time, with no repetitions. The session ended when the rat had taken the milk of all eight arms. A pseudo-random sequence of arms was chosen for each trial. The beginning of the working memory task was identical to the forced-choice one, until the animal had entered the fourth arm in the sequence. At this point, all arms of the maze were raised, and the rat
could move freely. However, only the four new arms still contained the food reward. The session continued until the animal had taken all the available chocolate milk, or for a maximum of 16 choices. Reentries into previously visited arms of the maze were counted as working memory errors, since in principle, the animal should have kept in mind that the food had already been eaten in that arm. Septal and hippocampal cells were simultaneously recorded during both of the tasks (for recording details, see Leutgeb & Mizumori, 2000). A head stage held the field effect transistor amplifiers for the recording electrodes as well as a diode system for keeping track of the animals' positions. Single units were separated using on-line and off-line separation software. Units were then identified according to their anatomical location and the characteristics of the spikes. Hippocampal pyramidal cells and interneurons, as well as lateral and medial septal cells, were identified. Animals were tested in either standard illumination conditions or in darkness.

3.2 Results and Discussion. In order to compute $I^{\ell}(x_j)$ from the experimental data, both $r(x_j)$ and $P(x_j)$ are needed for each position. In order to compute $r(x_j)$, the total number of spikes fired in location $x_j$ is divided by the total time spent there. The a priori probability $P(x_j)$ of visiting the spatial bin $j$ was obtained by the ratio of the time spent in $x_j$ and the total duration
of the trial. The computation of mutual information typically introduces an upward bias due to limited sampling. Therefore, a correction has been applied in order to reduce this overestimation, as suggested in Panzeri and Treves (1996). In our case, the first-order correction for equations 2.6 and 2.10 can be derived analytically,

$$I_{\mathrm{corr}} = I - \frac{t\,(N - 1)}{2\, T \ln 2}, \qquad (3.1)$$

where $N$ is the number of positions $x_j$ in which the environment has been binned, $T$ is the total duration of the recording, and $t$ the time window used to measure the response. Throughout the article, when specifying experimental data, we always give the corrected value of $I$, although for simplicity of notation, we drop the subindex "corr." Table 2 summarizes the overall statistics of our experimental data. The values of $I/t$ have been calculated in the short time limit, equation 2.6, and further subtracting the correction, equation 3.1.

The proportionality between $I^{\ell}(x)$ and $\Delta$ (see equation 2.18) was based on the assumption that the conditional probabilities $P(x_j | n)$ emerged from a continuous density $\rho(x | n)$. In order to verify whether such a supposition actually holds, we evaluated $I^{\ell}(x_j)$ from our experimental data for different values of the area $\Delta$. In Figure 2 we show a spatial average of our results:

$$\langle I^{\ell}(x) \rangle = \frac{1}{N} \sum_j I^{\ell}(x_j), \qquad (3.2)$$
Table 2: Statistics Corresponding to the Whole Population of Recorded Neurons.

Type of Neuron   Number of Units   ⟨⟨r⟩⟩ (spikes per second)   σ⟨⟨r⟩⟩ (spikes per second)   ⟨I⟩/t (bits per second)   σ⟨I⟩/t (bits per second)
HP               114               0.99665                     1.2353                       0.43851                   0.39535
HI                21               15.228                      6.9618                       0.81994                   0.85995
LS               327               5.0732                      7.7119                       0.25274                   0.33575
MS                34               10.971                      12.177                       0.27802                   0.46498

Notes: ⟨⟨r⟩⟩ = mean firing rate averaged throughout the maze and over all cells. σ⟨⟨r⟩⟩ = average quadratic deviation from the mean. ⟨I⟩/t = mean information rate averaged over cells. σ⟨I⟩/t = mean quadratic deviation. HP = pyramidal cells in the hippocampus. HI = interneurons in the hippocampus. LS = units in the lateral septum. MS = units in the medial septum.
where $N$ is the total number of positions $j$ in which the maze has been binned. Clearly, $N = \text{total area}/\Delta$. The local information $I^{\ell}(x_j)$ has been evaluated in the short time limit. In all cases, the local information grows linearly with $\Delta$, as predicted by equation 2.16. We therefore build the local information maps for all the cells recorded. In other words, we calculate $i(x_j)$ for all the positions $x_j$. We have restrained ourselves from going into too fine a binning, however, in order to avoid limited sampling problems. In Figure 3 we show the information maps corresponding to the firing distributions of Figure 1. The hippocampal pyramidal cell in Figure 3a is informative only at the same location where the cell fires. In this case, the intuition seems to be justified: the cell codes for a single position and does so by increasing its firing rate at that location. However, the other three cases show that the neuron may well provide information not only where it fires most but also where it fires least. In particular, the hippocampal interneuron in Figure 3b, which tends to fire all over the maze, is particularly informative in the upper-left and upper-right end points, where it remains almost silent. In Figures 3c and 3d, the cells provide information where there is both a high and a low firing rate. As a consequence, we conclude that if a cell has a distributed coding scheme (as opposed to a very local one, more typical of hippocampal place cells), the information map may well not coincide with the firing-rate one. In this sense, one should beware not to judge cells with a distributed firing pattern as noninformative. If such a pattern is stable and reproducible and covers a wide range of firing rates, the neuron may well be providing spatial information. Could a quantitative analysis of the coding strategies of hippocampal pyramidal cells and neurons in the lateral septum be given?
We have not considered hippocampal interneurons or medial septum cells, since in these two cases we do not have enough statistics to draw conclusions. In addition, they are on average less informative (see Table 2).
Measuring Information Spatial Densities
[Figure 2 appears here: panels (a) Hippocampal pyramids and (b) Lateral septum, plotting the mean local information rate ⟨I(x)⟩/t against the bin area Δ.]
Figure 2: Mean local information rate, defined in equation 3.2, as a function of the area Δ of each bin. (a) Three pyramidal cells in the hippocampus. (b) Three cells in the lateral septum. All cells carry an appreciable amount of information when compared to other cells of the same type. Both a and b contain one unit with a high firing rate, an intermediate one, and a low-frequency cell.
One possible measure of the degree of localization of the coding scheme is given by the correlation between the information maps and the firing-rate distributions (that is, between the graphs of Figures 1 and 3). We evaluate the Pearson correlation coefficient between the two maps (writing s[I_ℓ] and s[r] for the standard deviations of the two maps across bins),

C = Σ_j [I_ℓ(x_j) − ⟨I_ℓ(x_j)⟩_j] [r(x_j) − ⟨r⟩] / ( N s[I_ℓ] s[r] ),    (3.3)
M. Bezzi, I. Samengo, S. Leutgeb, and S. Mizumori
Figure 3: Local information distributions corresponding to the firing densities of Figure 1. In each case, the density of the color is proportional to the number of bits per unit time, as a function of space.
where ⟨I_ℓ(x_j)⟩_j is the spatial average of the local information. Thus, C is equal to 1 if I_ℓ(x_j) − ⟨I_ℓ(x_j)⟩_j is proportional to r(x_j) − ⟨r⟩, and takes the value −1 if the proportionality factor is negative. Notice that there is one such C for every single cell. In Figure 4a we show the frequency distribution of the correlation C for the 114 pyramidal cells recorded in the hippocampus, and in Figure 4b, for the 297 units recorded in the lateral septum. Pyramidal cells tend to have larger values of the correlation C, indicating that they tend to provide information in the same locations where they fire most. In other words, the code in the hippocampus can be characterized as localized, as is well known, giving rise to place cells and place fields. In contrast, septal cells have a somewhat more symmetrical distribution around zero. If C ≈ 0, the cell gives as much information in the locations where it fires most as where it remains silent (or, more precisely, where it fires less than its average spontaneous rate). A negative value of C indicates that the cell is most informative in the locations where it does not fire (Figures 1b and 3b provide an example from a hippocampal interneuron). As stated in Table 2, hippocampal pyramidal cells are more informative than lateral septal cells. The point we want to stress is that the lateral septum follows a different coding strategy: cells do not specialize in a particular region of space but rather respond with a complex, distributed firing pattern over the whole environment.
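As a concrete illustration, the correlation of equation 3.3 can be computed directly from two binned maps; the following is a minimal sketch (the function name and map layout are our own, not the authors'):

```python
import numpy as np

def coding_locality(info_map, rate_map):
    """Pearson correlation C of equation 3.3 between a cell's local
    information map I_l(x_j) and its firing-rate map r(x_j).
    C near +1: informative where it fires most (localized place code);
    C near -1: informative where it is silent; C near 0: distributed code."""
    I = np.asarray(info_map, dtype=float).ravel()
    r = np.asarray(rate_map, dtype=float).ravel()
    dI = I - I.mean()                # I_l(x_j) - <I_l(x_j)>_j
    dr = r - r.mean()                # r(x_j)   - <r>
    return np.sum(dI * dr) / (len(I) * I.std() * r.std())
```

For a map exactly proportional to the rate map the function returns 1, and −1 when the proportionality factor is negative, matching the interpretation above.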
Figure 4: Frequency distribution of the correlation C between the information and firing-rate spatial distributions for (a) 114 pyramidal cells in the hippocampus and (b) 327 neurons in the lateral septum.
4 Conclusion
The aim of this work was to characterize the way the information provided by neural activity is distributed among the elements of a given set of stimuli. In our case, the stimuli were the different positions in which an animal can be located within its environment. We defined the local information I_ℓ(s): the information provided by the responses about whether the stimulus is or is
not s. In other words, it is the mutual information between the responses and a reduced set of stimuli consisting of only two elements: stimulus s and its complement. In contrast to other quantities introduced previously, this is a well-defined mutual information, which can be employed in the short time limit. In fact, other possible definitions have some drawbacks; for example, the position-specific surprise has the disadvantage of not being additive. From the theoretical point of view, it is therefore not a very sound candidate for quantifying the information to be associated with each stimulus. The position-specific information, on the other hand, may not be positive and diverges for t → 0, thus making its application to actual data quite cumbersome. In this article, we have studied the properties of I_ℓ in the particular situation where the stimuli arise from a continuous variable (such as position in space), which has an underlying metric. In this case, the binning that transforms the continuous variable (in our case, x) into a discrete set (x_j) may be chosen, in principle, at will. When working with real data, however, the size of the bins is determined by the amount of data, since the mean error in the calculation of the mutual information is linear in the number of bins (see equation 3.1). We have shown analytically that when the size of the bin goes to zero, the local information is proportional to the bin size. This means that I_ℓ(x)/Δ behaves as an information spatial density, in that it tends to a constant value when Δ → 0, and its integral over all of space coincides with the full information I. These two properties hold only in the limit Δ → 0, whereas the position-specific surprise and the position-specific information fulfill equation 2.5 for any size of the bins. We have also shown that in the short time limit and for Δ → 0, the local information coincides with the position-specific surprise multiplied by the probability of occupancy.
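To make the definition concrete: the local information of a bin x_j is the ordinary mutual information between the response and the two-element stimulus set {x_j, complement of x_j}. A minimal sketch of this computation for binary (spike / no-spike) responses in a short window follows (our own illustration; the occupancy and spike-probability input format is an assumption, not the authors' data structure):

```python
import numpy as np

def local_information(occupancy, spike_prob):
    """Local information I_l(x_j), in bits, for every spatial bin j: the
    mutual information between a binary response (spike / no spike in a
    short window) and the two-element stimulus set {x_j, complement of x_j}.
    occupancy[j]  = probability of finding the animal in bin j (sums to 1).
    spike_prob[j] = probability of a spike while the animal is in bin j."""
    p = np.asarray(occupancy, dtype=float)
    q = np.asarray(spike_prob, dtype=float)
    q_bar = np.dot(p, q)                      # overall spike probability
    I = np.zeros(len(p))
    for j in range(len(p)):
        # Stimulus probabilities: x_j versus its complement.
        ps = np.array([p[j], 1.0 - p[j]])
        # Spike probability conditioned on each of the two "stimuli".
        qs = np.array([q[j], (q_bar - p[j] * q[j]) / (1.0 - p[j])])
        for s in (0, 1):
            for resp_prob, marg in ((qs[s], q_bar), (1.0 - qs[s], 1.0 - q_bar)):
                joint = ps[s] * resp_prob
                if joint > 0.0:
                    I[j] += joint * np.log2(joint / (ps[s] * marg))
    return I
```

As a sanity check, a cell that fires uniformly everywhere yields I_ℓ(x_j) = 0 in every bin, while a place cell is informative both inside and, to a lesser degree, outside its field, consistent with the discussion of Figure 3.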
This result may seem puzzling, since I₁ is known not to be additive, while additivity is guaranteed for I_ℓ. However, the additivity of I_ℓ is more restricted than the one desired for I₁. If the position-specific surprise were additive, I₁ would obey the relation

I₁(x_{j1}, x_{j2}) = I₁(x_{j1}) + I₁(x_{j2} | x_{j1}),    (4.1)

where I₁(x_{j1}, x_{j2}) is the information provided by the responses about two particular results of the measurement of the stimulus. As shown by De Weese and Meister (1999), I₁ does not follow equation 4.1. The local information, on the other hand, does fulfill the condition

I_ℓ(x_a, x_b) = I_ℓ(x_a) + I₁(x_b | x_a),    (4.2)

where x_a and x_b may only be x_j or x̄_j, and I_ℓ(x_a, x_b) represents a true mutual information between the set of responses and the two sets {x_j, x̄_j} (one set for each measurement). Additivity for I₁ requires additivity for any choice
of x_{j1} and x_{j2} in equation 4.1, while the possible values of x_a and x_b are much more restricted in equation 4.2. One should therefore not mistrust the fact that I₁ does not obey a very demanding additivity condition, while I_ℓ fulfills a quite relaxed one. The local information, as defined here, allows the characterization of the spatial properties of the information conveyed by a given cell. By measuring the mutual information between a given neuronal response and the set of possible locations, one sees that there are cells (both in the hippocampus and in the lateral septum) that provide an appreciable amount of information without actually having a place field. By means of the local information, it is possible to draw a spatial information density, which may, in these nontypical cases, differ significantly from the rate distribution. The local information can be calculated easily from experimental data, and it can be used to characterize the coding strategy of different cell types. In particular, we saw that hippocampal place cells tend to provide spatial information in the same places where they fire, whereas lateral septal neurons use a more distributed coding scheme. This result is in agreement with the different goals in the encoding of movement-related quantities in the two regions, as described recently in Bezzi, Leutgeb, Treves, and Mizumori (2000).

Acknowledgments
We thank Bill Bialek and Alessandro Treves for very useful discussions. This work was supported by the Human Frontier Science Program, Grant No. RG 01101998B.

References

Bezzi, M., Leutgeb, S., Treves, A., & Mizumori, S. J. Y. (2000). Information analysis of location-selective cells in hippocampus and lateral septum. Society for Neuroscience Abstracts, 26, 173.11.
De Weese, M. R., & Meister, M. (1999). How to measure the information gained from one symbol. Network, 10, 325–340.
Jakab, R., & Leranth, C. (1995). In G. Paxinos (Ed.), The rat nervous system (2nd ed., pp. 405–442). New York: Academic Press.
Leutgeb, S., & Mizumori, S. J. (2000). Temporal correlations between hippocampus and septum are controlled by environmental cues: Evidence from parallel recordings. Manuscript submitted for publication.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely moving rat. Brain Res., 34, 171–175.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Skaggs, W. E., McNaughton, B. L., Gothard, K., & Markus, E. (1993). An information theoretic approach to deciphering the hippocampal code. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 4 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.
Swanson, L. W., Sawchenko, P. E., & Cowan, W. M. (1981). Evidence for collateral projections by neurons in Ammon's horn, the dentate gyrus, and the subiculum: A multiple retrograde labeling study in the rat. J. Neurosci., 1, 548–559.
Tsodyks, M. (1999). Attractor neural network models of spatial maps in hippocampus. Hippocampus, 9, 481–489.
Zhou, T., Tamura, R., Kuriwaki, J., & Ono, T. (1999). Comparison of medial and lateral septum neuron activity during performance of spatial tasks in rat. Hippocampus, 9, 220–234.

Received January 31, 2001; accepted May 1, 2001.
LETTER
Communicated by Erkki Oja
Stochastic Trapping in a Solvable Model of On-Line Independent Component Analysis

Magnus Rattray
[email protected]
Computer Science Department, University of Manchester, Manchester M13 9PL, U.K.

Previous analytical studies of on-line independent component analysis (ICA) learning rules have focused on asymptotic stability and efficiency. In practice, the transient stages of learning are often more significant in determining the success of an algorithm. This is demonstrated here with an analysis of a Hebbian ICA algorithm, which can find a small number of nongaussian components given data composed of a linear mixture of independent source signals. An idealized data model is considered in which the sources comprise a number of nongaussian and gaussian sources, and a solution to the dynamics is obtained in the limit where the number of gaussian sources is infinite. Previous stability results are confirmed by expanding around optimal fixed points, where a closed-form solution to the learning dynamics is obtained. However, stochastic effects are shown to stabilize otherwise unstable suboptimal fixed points. Conditions required to destabilize one such fixed point are obtained for the case of a single nongaussian component, indicating that the initial learning rate η required to escape successfully is very low (η = O(N⁻²), where N is the data dimension), resulting in very slow learning, typically requiring O(N³) iterations. Simulations confirm that this picture holds for a finite system.

1 Introduction
Independent component analysis (ICA) is a statistical modeling technique that has attracted a significant amount of research interest in recent years (for a review, see Hyvärinen, 1999). In ICA, the goal is to find a representation of data in terms of a combination of statistically independent variables. This technique has a number of useful applications, most notably blind source separation, feature extraction, and blind deconvolution. A large number of neural learning algorithms have been applied to this problem, as detailed in Hyvärinen (1999). Theoretical studies of on-line ICA algorithms have mainly focused on asymptotic stability and efficiency, using the established results of stochastic approximation theory. However, in practice, the transient stages of learning will often be more significant in determining the success of an algorithm. In this article, a Hebbian ICA algorithm is analyzed, and a solution to the learning dynamics is obtained in the limit of large data dimension. The analysis highlights the critical importance of the transient dynamics; in particular, an extremely low initial learning rate is found to be essential in order to avoid trapping in a suboptimal fixed point close to the initial conditions of the learning dynamics.

Neural Computation 14, 421–435 (2001). © 2001 Massachusetts Institute of Technology.

This work focuses on the bigradient learning algorithm introduced by Wang and Karhunen (1996) and studied in the context of ICA by Hyvärinen and Oja (1998), where it was shown to have nice stability conditions. This algorithm can be used to extract a small number of independent components from high-dimensional data and is closely related to projection pursuit algorithms that detect interesting projections in high-dimensional data. The algorithm can be defined in on-line mode or can form the basis of a fixed-point batch algorithm, which has been found to improve computational efficiency (Hyvärinen & Oja, 1997). In this work, the dynamics of the on-line algorithm is studied. This may be the preferred mode when the model parameters are nonstationary or the data set is very large. Although the analysis is restricted to a stationary data model, the results are relevant to the nonstationary case, in which learning strategies are often designed to increase the learning rate when far from the optimum (Müller, Ziehe, Murata, & Amari, 1998). The results obtained here suggest that this strategy can lead to very poor performance. In order to gain insight into the learning dynamics, an idealized model is considered in which data are composed of a small number of nongaussian source signals linearly mixed in with a large number of gaussian signals. A solution to the dynamics is obtained in the limiting case where the number of gaussian signals is infinite.
In this limit, one can use techniques from statistical mechanics similar to those that have previously been applied to other on-line learning algorithms, including other unsupervised Hebbian algorithms such as Sanger's principal components analysis algorithm (Biehl & Schlösser, 1998). For the asymptotic dynamics close to an optimal solution, the stability conditions due to Hyvärinen and Oja (1998) are confirmed, and the eigensystem determining the asymptotic dynamics and the optimal learning rate decay is obtained. However, the dynamical equations also have suboptimal fixed points that are stable for any O(N⁻¹) learning rate, where N is the data dimension. Conditions required to destabilize one such fixed point are obtained in the case of a single nongaussian source, indicating that learning must be very slow initially in order to learn successfully. The analysis requires a careful treatment of fluctuations, which prove to be important even in the limit of large input dimension. Finally, the simulation results presented indicate that this phenomenon also persists in finite-sized systems.

2 Data Model
The linear data model is shown in Figure 1. In order to apply the Hebbian ICA algorithm, one should first sphere the data, that is, linearly transform
Figure 1: Linear mixing model generating the data. There are M nongaussian independent sources s, while the remaining N − M sources n are uncorrelated gaussian variables. There are N outputs x formed by multiplying the sources by the square nonsingular mixing matrix A ≡ [A_s A_n]. The outputs are linearly projected onto the K-dimensional vector y = Wᵀx. It is assumed that M ≪ N and K ≪ N.
the data so that they have an identity covariance matrix. This can be achieved by standard transformations in a batch setting, but for on-line learning, an adaptive sphering algorithm, such as the one introduced by Cardoso and Laheld (1996), could be used. To simplify matters, it is assumed here that the data have already been sphered. Without loss of generality, it can also be assumed that the sources each have unit variance. The sources are decomposed into M nongaussian and N − M gaussian components, respectively:

p(s) = ∏_{i=1}^{M} p_i(s_i),    p(n) = ∏_{i=M+1}^{N} e^{−n_i²/2} / √(2π).    (2.1)

To conform with the above model assumptions, the mixing matrix A must be unitary. The mixing matrix is decomposed into two rectangular matrices A_s and A_n associated with the nongaussian and gaussian components, respectively:

x = A [s; n] = [A_s A_n] [s; n] = A_s s + A_n n,    (2.2)

where [s; n] denotes the vertically stacked source vector. The unitary nature of A results in the following constraints:

[A_s A_n] [A_sᵀ; A_nᵀ] = A_s A_sᵀ + A_n A_nᵀ = I,    (2.3)

[A_sᵀ; A_nᵀ] [A_s A_n] = [ A_sᵀA_s  A_sᵀA_n ; A_nᵀA_s  A_nᵀA_n ] = [ I  0 ; 0  I ],    (2.4)

where semicolons separate block rows.
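As an illustration of this data model, the following sketch (our own; the dimensions and the choice of uniform sources are assumptions for demonstration) generates sphered data from a unitary mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 2  # data dimension and number of nongaussian sources

# Random unitary (orthogonal) mixing matrix A = [A_s A_n], as in section 2.
A = np.linalg.qr(rng.standard_normal((N, N)))[0]
A_s, A_n = A[:, :M], A[:, M:]

def sample(T):
    """Draw T sphered data vectors x = A_s s + A_n n (equation 2.2):
    M uniform (subgaussian) unit-variance sources plus N - M gaussians."""
    s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(T, M))  # variance 1
    n = rng.standard_normal((T, N - M))
    return s @ A_s.T + n @ A_n.T

X = sample(50000)
cov = np.cov(X.T)  # close to the identity because A is unitary
```

Because A is unitary and the sources have unit variance, the sample covariance of X is close to the identity, so no further sphering step is needed.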
3 Algorithm
The following on-line Hebbian learning rule was introduced by Wang and Karhunen (1996) and analyzed in the context of ICA by Hyvärinen and Oja (1998):

W^{t+1} − W^t = η x^t φ(y^t)ᵀ Σ + α W^t (I − (W^t)ᵀ W^t),    (3.1)

where φ(y^t)_i = φ(y_i^t), with φ an odd nonlinear function that is applied to every component of the K-dimensional vector y ≡ Wᵀx. The first term on the right is derived by maximizing the nongaussianity of the projections in each component of the vector y. The second term ensures that the columns of W are close to orthogonal, so that the projections are uncorrelated and of unit variance. The learning rate η is a positive scalar parameter, which must be chosen with care and may depend on time. The parameter α is less critical; setting α = 0.5 seems to provide reasonable performance in general. The diagonal matrix Σ has elements σ_ii ∈ {−1, 1}, which ensure the stability of the desired fixed point. The elements of this matrix can be chosen either adaptively or according to a priori knowledge of the source statistics. Stability of the optimal fixed point is ensured by the condition (Hyvärinen & Oja, 1998)

σ_ii = Sign(⟨s_i φ(s_i) − φ′(s_i)⟩),    (3.2)

assuming we order indices such that y_i → ±s_i for i ≤ min(K, M) asymptotically. The angled brackets denote an average over the source distribution. A remarkable feature of equation 3.2 is that the same nonlinearity can be used for source signals with very different characteristics. For example, both subgaussian and supergaussian signals can be separated using either φ(y) = y³ or φ(y) = tanh(y), two common choices of nonlinearity.

4 Dynamics for Large Input Dimension
Define the following two matrices:

R ≡ Wᵀ A_s,    Q ≡ Wᵀ W.    (4.1)

Using the constraint in equation 2.3, one can show that

y = Wᵀ(A_s s + A_n n) = R s + z,    where z ~ N(0, Q − R Rᵀ).    (4.2)
Knowledge of the matrices R and Q is therefore sufficient to describe the relationships between the projections y and the sources s in full. Although
the dimension of the data is N, the dimension of these matrices is K × M and K × K, respectively. The system can therefore be described by a small number of macroscopic quantities in the limit of large N, as long as K and M remain manageable. In appendix A, it is shown that in the limit N → ∞, Q → I, while R evolves deterministically according to the following first-order differential equation,

dR/dτ = μ Σ (⟨φ(y) sᵀ⟩ − ½⟨φ(y) yᵀ + y φ(y)ᵀ⟩ R) − ½ μ² ⟨φ(y) φ(y)ᵀ⟩ R,    (4.3)

with rescaled variables τ ≡ t/N and μ ≡ Nη. This deterministic equation is valid only for R = O(1); a different scaling is considered in section 4.2, in which case fluctuations have to be considered even in the limit. The brackets denote expectations with respect to the source distribution and z ~ N(0, I − R Rᵀ). The bracketed terms therefore depend only on R and statistics of the source distributions, so that equation 4.3 forms a closed system.

4.1 Optimal Asymptotics. The desired solution is one where as many as possible of the K projections mirror one of the M sources. If K < M, then not all the sources can be learned, and which ones are learned depends on details of the initial conditions. If K ≥ M, then which projections mirror each source also depends on the initial conditions. For K > M, there will be projections that do not mirror any sources; these will be statistically independent of the sources and have a gaussian distribution with identity covariance matrix. Consider the case where y_i → s_i for i = 1, …, min(K, M) asymptotically (all other solutions can be obtained by a trivial permutation of indices or changes in sign). The optimal solution is then given by R*_ij = δ_ij, which is a fixed point of equation 4.3 as μ → 0. Asymptotically, the learning rate should be annealed in order to approach this fixed point, and the usual inverse-law annealing can be shown to be optimal subject to a good choice of prefactor,

μ ~ μ₀/τ  as τ → ∞,  or equivalently  η ~ μ₀/t  as t → ∞.    (4.4)
The asymptotic solution to equation 4.3 with the above annealing schedule was given by Leen, Schottky, and Saad (1998). Let u_ij = R_ij − R*_ij be the deviation from the fixed point. Expanding equation 4.3 around the fixed point, one obtains

u_ij(τ) ~ Σ_{k,n=1}^{K} Σ_{l,m=1}^{M} V_ijkl { −½ X_kl ⟨φ²(s_n)⟩ δ_nm + (τ₀/τ)^{−μ₀ λ_kl} u_nm(τ₀) } V⁻¹_klnm,    (4.5)
where

X_ij = ( μ₀² / (−μ₀ λ_ij − 1) ) [ 1/τ − (1/τ₀)(τ₀/τ)^{−μ₀ λ_ij} ].    (4.6)
Here, τ₀ is the time at which annealing begins, and λ_ij and V_ijkl are the eigenvalues and eigenvectors of the Jacobian of the learning equation to first order in μ. These are written as matrices and tensors, respectively, because the system's variables are in a matrix. One can think of the pairs of indices (i, j) and (k, l) as each representing a single index in a vectorized system. The explicit solution to the eigensystem is given in appendix B. From the eigenvalues defined in equation B.4, it is clear that the fixed point is stable if and only if the condition in equation 3.2 is met. There is a critical learning rate, μ₀^crit = −1/λ_max, where λ_max is the largest eigenvalue (smallest in magnitude, since the eigenvalues are negative), such that if μ₀ < μ₀^crit, then the approach to the fixed point will be slower than the optimal 1/τ decay. From the eigenvalues given in equation B.4, we find that λ_max = −ξ_min, where ξ_i = σ_ii ⟨s_i φ(s_i) − φ′(s_i)⟩, so that μ₀^crit = 1/ξ_min. As long as μ₀ > μ₀^crit, the terms involving τ₀ in equations 4.5 and 4.6 will be negligible, and the asymptotic decay will be independent of the initial conditions and transient dynamics. Assuming μ₀ > μ₀^crit and substituting in the explicit expressions for the eigensystem, we find the following simple form for the asymptotic dynamics to leading order in 1/τ:

u_ij(τ) ~ −(δ_ij/τ) ( μ₀² ⟨φ²(s_i)⟩ / (4 μ₀ ξ_i − 2) ).    (4.7)
4.2 Escape from the Initial Transient. Unfortunately, the optimal fixed points described in the previous section are not the only stable fixed points of equation 4.3. In some cases, the algorithm will converge to a suboptimal solution in which one or more potentially detected signals remain unlearned and the corresponding entry in R decays to zero. The stability of these points is due to the O(μ²) term in equation 4.3, which becomes less significant as the learning rate is reduced, in which case the corresponding negative eigenvalue of the Jacobian eventually vanishes. Higher-order terms then lead to instability and escape from this suboptimal fixed point. One can therefore avoid trapping by selecting a sufficiently low learning rate during the initial transient. Consider the simplest case where K = M = 1, in which case the matrix R reduces to a scalar: R = R₁₁ and σ = σ₁₁. Expanding equation 4.3 around R = 0,

dR/dτ = −½ ⟨φ²(z)⟩ μ² R + (1/6) κ₄ ⟨φ‴(z)⟩ σ μ R³ + O(R⁵).    (4.8)
Here, κ₄ is the fourth cumulant of the source distribution, and the brackets denote averages over z ~ N(0, 1). Although R = 0 is a stable fixed point, the range of attraction is reduced as μ → 0, until eventually instability occurs. The condition under which one will successfully escape the fixed point is found by setting d|R|/dτ > 0:

σ = Sign(κ₄ ⟨φ‴(z)⟩),    μ < R² |κ₄ ⟨φ‴(z)⟩| / (3 ⟨φ²(z)⟩).    (4.9)
Notice that the condition on σ for escaping the initial transient is not generally the same as the condition in equation 3.2, which ensures stability of the asymptotic fixed point. For φ(y) = y³, the conditions are exactly equivalent. However, in other cases the conditions may conflict, and an adaptive choice of σ based on equation 3.2, as suggested by Hyvärinen and Oja (1998), may give poor results. With φ(y) = tanh(y), the conditions appear to be equivalent in many cases. This condition is equally applicable to gradient-based batch algorithms, since it is due to the O(μ) term above, which is not related to fluctuations. If the entries in A and W are initially of similar order, then one would expect R = O(N^{−1/2}). This is the typical case if we consider a random and uncorrelated choice for A and the initial entries in W. Larger initial values of R could be obtained only with some prior knowledge of the mixing matrix, which we will not assume. With R = O(N^{−1/2}), the initial value of μ required to escape is O(N⁻¹), indicating a very slow initial phase in the dynamics (recall that the unscaled learning rate η ≡ μ/N). Larger learning rates will result in trapping in the suboptimal fixed point. However, the above result is not strictly valid unless R = O(1), since this was an assumption used in the derivation of equation 4.3. When R = O(N^{−1/2}), one can no longer assume that fluctuations are negligible as N → ∞. Define the O(1) quantities r ≡ R√N and ν ≡ ηN². The mean and variance of the change in r at each learning step can be calculated (to leading order in N⁻¹):

E[Δr] ≈ ( −½ ⟨φ²(z)⟩ ν² r + (1/6) κ₄ ⟨φ‴(z)⟩ σ ν r³ ) N⁻³,    (4.10)

Var[Δr] ≈ ⟨φ²(z)⟩ ν² N⁻³.    (4.11)
This expression is derived by a similar adiabatic elimination of Q₁₁ to that carried out for the deterministic case in appendix A. This requires that α and η are of the same order before taking the limit N → ∞, followed by the limit α → ∞, and corresponds to the usual form of "slaving" in Haken's terminology, in which the eliminated variable contributes only to the change in the mean. Other scalings may result in slightly different expressions (see, for example, Gardiner, 1985, section 6.6), although it is expected that the main conclusions described below will not be affected.
Figure 2: For r = R√N = O(1), the dynamics can be represented as a diffusion in a symmetric quartic potential U(r). The escape time from an unstable fixed point at r = 0 is mainly determined by the potential barrier ΔU.
The equation for the mean is similar to equation 4.8. However, the variance is of the same order as the mean in the limit N → ∞, and fluctuations cannot be ignored in this case. The system is better described by a Fokker-Planck equation (see, e.g., Gardiner, 1985) with a characteristic timescale of O(N³). The system is locally equivalent to a diffusion in the following quartic potential,

U(r) = ¼ ⟨φ²(z)⟩ ν² r² − (1/24) |κ₄ ⟨φ‴(z)⟩| ν r⁴,    (4.12)

with a diffusion coefficient D = ⟨φ²(z)⟩ ν², which is independent of r. The shape of this potential is shown in Figure 2. A potential barrier ΔU must be overcome to escape an unstable state close to R = 0 (assuming that the condition on σ in equation 4.9 is satisfied). For large ν, this system corresponds to an Ornstein-Uhlenbeck process with a gaussian stationary distribution of fixed unit variance. Thus, if one chooses too large a ν initially, the dynamics will become localized close to R = 0. As ν is reduced, the potential barrier confining the dynamics is reduced. The timescale for escape for large ν is mainly determined by the effective size of the barrier (Gardiner, 1985):

T_escape ∝ exp(ΔU/D) = exp( 3 ⟨φ²(z)⟩ ν / (8 |κ₄ ⟨φ‴(z)⟩|) ).    (4.13)
As the learning rate is reduced, the timescale for escape is also reduced. However, the choice of optimal learning rate is nontrivial and cannot be determined by considering only the leading-order terms in R as above: although a small ν will result in a quicker escape from the unstable fixed point near R = 0, it will then lead to a very slow learning transient after escape.
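The quantities entering equations 4.9 and 4.13 are easy to evaluate for a specific nonlinearity and source. The following worked numbers are our own example for the cubic nonlinearity φ(y) = y³ and a uniform unit-variance source; the moments used are standard closed-form values:

```python
# <phi^2(z)> = <z^6> = 15 for z ~ N(0, 1); phi'''(y) = 6 identically;
# the fourth cumulant of a unit-variance uniform source is -6/5.
phi2 = 15.0
phi3d = 6.0
kappa4 = -6.0 / 5.0

# Equation 4.9: the sign choice and the largest mu = N*eta that escapes.
sigma = 1.0 if kappa4 * phi3d > 0 else -1.0   # = -1, as used in section 5
def mu_max(R):
    return R**2 * abs(kappa4 * phi3d) / (3.0 * phi2)

# Equation 4.13: barrier exponent Delta U / D controlling T_escape.
def escape_exponent(nu):
    return 3.0 * phi2 * nu / (8.0 * abs(kappa4 * phi3d))
```

For this choice, σ = −1, the escape threshold is μ < 0.16 R², and the barrier exponent is ΔU/D ≈ 0.78 ν, so the escape time grows rapidly once ν exceeds a few units.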
From this discussion, we can draw two important conclusions. First, the initial learning rate should be O(N⁻²) or less in order to avoid trapping close to the initial conditions. Second, the timescale required to escape the initial transient is O(N³), resulting in an extremely slow initial stage of learning.

4.3 Other Suboptimal Fixed Points. In studies of other on-line learning algorithms, such as Sanger's rule and backpropagation, a class of suboptimal fixed points has been discovered that is due to symmetries inherent in the learning machine's structure (Saad & Solla, 1995a, 1995b; Biehl & Schwarze, 1995; Biehl & Schlösser, 1998). These symmetric fixed points are unstable for small learning rates, but the eigenvalues determining escape are typically of very small magnitude, so that trapping can occur if the initial conditions are sufficiently symmetric. In practice, this will typically occur only for very large input dimensions (N > 10⁶) and will result in learning timescales of O(N²) for O(N⁻¹) learning rates. Equation 4.3 does exhibit fixed points of this type for particular initial conditions. Consider the case K = M = 2 as an example. If initially R₁₁ ≈ R₂₁ and R₁₂ ≈ R₂₂, then the dynamics will preserve this symmetry until instabilities due to slight initial differences lead to escape from an unstable fixed point. This symmetry breaking is necessary for good performance, since each projection must specialize to a particular source signal. Sufficiently small differences in the initial values of the entries in R typically will occur only for very large N, much larger than the dimensions currently typical in ICA. A very small learning rate is then required to avoid trapping in a fixed point near the initial conditions, as discussed in the previous section. This initial trapping is far more serious than the symmetric fixed point, since it requires a learning rate of O(N⁻²) in order to escape, resulting in a far greater loss of efficiency.
In practice, symmetric fixed points do not appear to be a serious problem, and we have not observed any such fixed points in simulations of finite systems. This may be due to the highly stochastic nature of the initial dynamics, in which fluctuations are large compared to the average dynamical trajectory. This is in contrast to the picture for backpropagation, for example, where fluctuations result in relatively small corrections to the average trajectory (Barber, Sollich, & Saad, 1996). The strong fluctuations observed here may help break any symmetries that might otherwise lead to trapping in a symmetric fixed point, although a full understanding of this effect requires careful analysis of the multivariate diffusion equation describing the dynamics near the initial conditions.

5 Simulation Results
The theoretical results in the previous section are for the limiting case where N → ∞. In practice, we should verify that the results are relevant in the case of large but finite N. In this section, simulation evidence is presented
430
Magnus Rattray
[Figure 3: six panels of axis residue removed; only the caption is recoverable.]
Figure 3: 100-dimensional data (N = 100) are produced from a mixture containing a small number of uniformly distributed sources. Figures on the left (a–c) are for a single nongaussian source and a single projection (M = K = 1), while figures on the right (d–f) are for two nongaussian sources and two projections (M = K = 2). Each column shows examples of learning with the same initial conditions and data but with different learning rates. From top to bottom: η = 10^{-5} (ν = 0.1), η = 10^{-4} (ν = 1), and η = 4 × 10^{-4} (ν = 4).
demonstrating that the trapping predicted by the theory occurs in finite systems. Figures 3a through 3c show results produced by an algorithm learning a single projection from 100-dimensional data with a single nongaussian
Stochastic Trapping in On-Line ICA
431
(uniformly distributed) component (N = 100, M = K = 1). The matrices A and W are randomly initialized with orthogonal, normalized columns. Similar results are obtained for other random initializations. A cubic nonlinearity is used, and σ is set to −1, although the adaptive scheme for setting σ suggested by Hyvärinen and Oja (1998) gives similar results. In each example, dashed lines show the maxima of the potential in Figure 2. Figure 3a shows the learning dynamics from a single run with η = 10^{-5} (ν = 0.1). The dynamics follows a relatively smooth trajectory in this case, and much of the learning is determined by the cubic term in equation 4.10. With this choice of learning rate, there is a strong dependence on the initial conditions, with a larger initial magnitude of R often resulting in significantly faster learning. However, recall that a high value for R cannot be chosen without prior knowledge of the mixing matrix. Figure 3b shows the learning dynamics with a larger learning rate η = 10^{-4} (ν = 1) for exactly the same initial conditions and sequence of data. In this case, the learning trajectory is more obviously stochastic and is initially confined within the unstable suboptimal state with R ≈ 0. Eventually the system leaves this unstable state and quickly approaches R ≈ 1. In this case, the dynamics is not particularly sensitive to the initial magnitude of R, although the escape time can vary significantly due to the inherent randomness of the learning process. In Figure 3c, the learning dynamics is shown for a larger learning rate η = 4 × 10^{-4} (ν = 4). In this case, the system remains trapped in the suboptimal state for the entire simulation time. The analysis in section 4.2 is strictly valid only for the case of a single nongaussian source and a single projection. However, similar trapping occurs in general, as demonstrated in Figures 3d through 3f.
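The on-line rule simulated in these runs can be sketched as follows. This is a minimal reading of a Hebbian ICA update with cubic nonlinearity and σ = −1; the normalization term and all constants (in particular a ≫ η, cf. the scaling used in appendix A) are illustrative assumptions, not the exact published rule of Hyvärinen and Oja (1998).

```python
import numpy as np

def hebbian_ica_step(W, x, eta, sigma=-1.0, a=50.0):
    """One on-line step: Hebbian term with cubic nonlinearity phi(y) = y**3,
    plus a normalization term with a >> eta (constants are illustrative)."""
    y = W.T @ x                                  # K projections of the data point
    W = W + eta * sigma * np.outer(x, y ** 3)    # (anti-)Hebbian update, sigma = -1
    W = W + a * eta * (W - W @ (W.T @ W))        # pulls W back toward W^T W = I
    return W

# Toy run: N = 100 dimensions, one uniform source among gaussians (M = K = 1).
rng = np.random.default_rng(0)
N, K = 100, 1
A = np.linalg.qr(rng.standard_normal((N, N)))[0]   # orthogonal mixing matrix
W = np.linalg.qr(rng.standard_normal((N, K)))[0]   # random orthonormal start
for _ in range(2000):
    s = rng.standard_normal(N)
    s[0] = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0))  # unit-variance uniform source
    W = hebbian_ica_step(W, A @ s, eta=1e-4)
R = float(A[:, 0] @ W[:, 0])   # overlap with the nongaussian direction
```

Tracking R over many more iterations, at the three learning rates quoted above, is what Figure 3 reports.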
The components of R are plotted for an algorithm learning two projections from 100-dimensional data with two nongaussian (uniformly distributed) components (N = 100, M = K = 2). The different learning regimes identified in the single-component case are mirrored closely in this two-component model.

6 Conclusion
An on-line Hebbian ICA algorithm was studied for the case in which the data comprise a linear mixture of gaussian and nongaussian sources, and a solution to the dynamics was obtained for the idealized scenario in which the number of nongaussian sources is finite, while the number of gaussian sources is infinite. The analysis confirmed the stability conditions found by Hyvärinen and Oja (1998), and the eigensystem characterizing the asymptotic regime was determined. However, it was also shown that there exist suboptimal fixed points of the learning dynamics, which are stabilized by stochastic effects under certain conditions. The simplest case of a single nongaussian component was studied in detail. The analysis revealed that typically a very low learning rate (η = O(N^{-2}), where N is the data dimension) is required to escape this suboptimal fixed point, resulting in a long
learning time of O(N^3) iterations. Simulations of a finite system support these theoretical conclusions. The suboptimal fixed point studied here has some interesting features. In the limit η → 0, the dynamics becomes deterministic, and fluctuations due to the stochastic nature of on-line learning vanish. In this case, the suboptimal fixed point is unstable, but the Jacobian is zero at the fixed point (in the one-dimensional case), indicating that one must go to higher order to describe the dynamics. Standard methods for describing the dynamics of on-line algorithms have all been developed in the neighborhood of fixed points with negative eigenvalues and are not applicable in this case (Heskes & Kappen, 1993). Furthermore, stability of the fixed point is induced by fluctuations. This is contrary to our intuition that fluctuations may be beneficial, resulting in quicker escape from suboptimal fixed points. In the present case, one has precisely the opposite effect: stochasticity stabilizes an otherwise unstable fixed point. In similar studies of on-line PCA (Biehl & Schlösser, 1998) and backpropagation algorithms (Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b), suboptimal fixed points have been found that are also stabilized when the learning rate exceeds some critical value. However, the scale of the critical learning rate stabilizing these fixed points is typically O(N^{-1}), much larger than in the present case. Also, the resulting timescale for learning is O(N^2) with a very small prefactor (in practice, an O(N) term will dominate for realistic N). These fixed points reflect saddle points in the mean flow, while here we have a flat region, and escape is through much weaker higher-order effects. This type of suboptimal fixed point is more reminiscent of those found in studies of small networks, which often have a much more dramatic effect on learning efficiency (Heskes & Wiegerinck, 1998).
It is unclear whether on-line ICA algorithms based on maximum-likelihood and information-theoretic principles (see, for example, Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996) exhibit suboptimal fixed points similar to those studied here. These algorithms estimate a square demixing matrix and require a different theoretical treatment from the projection model considered here, since there may be no simple macroscopic description of the system for large N.

Appendix A: Derivation of the Dynamical Equations
From equation 3.1, one can calculate the change in R and Q (defined in equation 4.1) after a single learning step,
ΔR = ησ φ(y)s^T + a(I − Q)R,

ΔQ = ησ (I + a(I − Q)) φ(y)y^T + ησ yφ(y)^T (I + a(I − Q)) + 2a(I − Q)Q + a^2 (I − Q)^2 Q + η^2 φ(y) x^T x φ(y)^T.   (A.1)
Here, the definition in equation 2.2 and the constraint in equation 2.4 have been used to set x^T A s = s^T. One can obtain a set of differential equations
in the limit N → ∞ using a statistical mechanics formulation that has previously been applied to the dynamics of on-line PCA algorithms (Biehl, 1994; Biehl & Schlösser, 1998) as well as other unsupervised and supervised learning algorithms (see, e.g., Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b; and contributions in Saad, 1998). To obtain differential equations, one should scale the parameters of the learning algorithm in an appropriate way, in particular, η ≡ μ/N. Typically one chooses a = O(1), but in order to obtain an analytical solution, it is more convenient to choose a ≡ a_0/N before taking N → ∞ and then take the limit a_0 → ∞. The dynamics do not appear to be sensitive to the exact value of a as long as a ≫ η, and it is therefore hoped that the dynamical equations are also valid for a = O(1), which is usually the case. The learning rate is taken to be constant here, but the dynamical equations are also valid when the learning rate is changed slowly, as suggested for the annealed learning studied in section 4.1. As N → ∞ one finds,

dR/dτ = μσ ⟨φ(y)s^T⟩ + a_0 (I − Q)R,   (A.2)

dQ/dτ = μσ ⟨φ(y)y^T + yφ(y)^T⟩ + μ^2 ⟨φ(y)φ(y)^T⟩ + 2a_0 Q(I − Q),   (A.3)
where τ ≡ t/N is a rescaled time parameter. The angled brackets denote averages over y, as defined in equation 4.2. In deriving equations A.2 and A.3, one should check that fluctuations in R and Q vanish in the limit N → ∞. This relies on the assumption that R = O(1), which may not be appropriate in some cases. For example, in section 4.2, a suboptimal fixed point is analyzed where it is more appropriate to consider R = O(1/√N), and a more careful treatment of fluctuations is required. As a_0 is increased, Q approaches I. If one sets Q − I ≡ q/a_0 and makes the a priori assumption that q = O(1), then,

(1/a_0) dq/dτ = μσ ⟨φ(y)y^T + yφ(y)^T⟩ + μ^2 ⟨φ(y)φ(y)^T⟩ − 2q + O(1/a_0).   (A.4)
As a_0 → ∞, one can solve for q,

q = (1/2) ( μσ ⟨φ(y)y^T + yφ(y)^T⟩ + μ^2 ⟨φ(y)φ(y)^T⟩ ),   (A.5)
which is consistent with the a priori assumption. Substituting this result into equation A.2 leads to equation 4.3 in the main text. This is an example of adiabatic elimination of fast variables (Gardiner, 1985, section 6.6) and greatly simplifies the dynamical equations.
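Adiabatic elimination can be illustrated on a toy fast-slow system (purely illustrative; these are not the ICA equations): a fast variable q, relaxing at a rate proportional to a_0 toward an r-dependent equilibrium, may be replaced by that equilibrium in the slow equation when a_0 is large.

```python
def full_system(a0, dt=1e-5, T=1.0):
    """Slow variable r coupled to a fast variable q: dr/dt = q*r,
    dq/dt = a0*(r - 2*q), so q relaxes quickly toward r/2 for large a0."""
    r, q = 0.5, 0.0
    for _ in range(int(T / dt)):
        r, q = r + dt * q * r, q + dt * a0 * (r - 2.0 * q)
    return r

def reduced_system(dt=1e-5, T=1.0):
    """Adiabatic limit a0 -> infinity: substitute q = r/2, giving dr/dt = r**2/2."""
    r = 0.5
    for _ in range(int(T / dt)):
        r = r + dt * 0.5 * r * r
    return r
```

For a_0 = 1000, the full and reduced trajectories at T = 1 agree to well under 1%, while the reduced system has one fewer variable, which is the whole benefit of the elimination.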
Appendix B: Eigensystem of Asymptotic Jacobian
The Jacobian of dR/dτ as μ → 0 is defined (divided by μ),

J_{ijkl} = ∂/∂R_{kl} [ (1/μ) dR_{ij}/dτ ] |_{μ=0, R=R*}.   (B.1)
This is a tensor rather than a matrix because the system's variables form a matrix. One can think of the pairs of indices (i, j) and (k, l) as each representing a single index in a vectorized system. If the dynamics were equivalent to gradient descent on some potential function, then the above quantity would be proportional to the (negative) Hessian of this cost function. The Jacobian is not guaranteed to be symmetric in the present case, so this will not be possible in general. From equation 4.3 one obtains,

J_{ijkl} = −δ_{ik}δ_{jl} (ξ_i + (1/2)ξ_j) − (1/2) δ_{il}δ_{jk} ξ_i,   (B.2)

with

ξ_i = σ ⟨s_i ψ(s_i) − ψ′(s_i)⟩ for i ≤ min(K, M), and ξ_i = 0 otherwise.
One must solve the following eigenvalue problem,

∑_{kl} J_{ijkl} V_{kl}^{nm} = λ_{nm} V_{ij}^{nm},   (B.3)
where λ_{ij} and V_{kl}^{ij} are the eigenvalues and eigenvectors, respectively. A solution is required for all i ≤ K and j ≤ M in order to get a complete set of eigenvalues,

λ_{ii} = −2ξ_i,   V_{kl}^{ii} = δ_{ik}δ_{il},
λ_{ij} = −(1/2)(ξ_i + ξ_j),   V_{kl}^{ij} = δ_{ik}δ_{jl} − δ_{jk}δ_{il}   for i < j ≤ K,
λ_{ij} = −ξ_i,   V_{kl}^{ij} = δ_{ik}δ_{jl}   for j > K,
λ_{ij} = −(ξ_i + ξ_j),   V_{kl}^{ij} = ξ_i δ_{ik}δ_{jl} + ξ_j δ_{jk}δ_{il}   for i > j.   (B.4)
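Equations B.2 and B.4 can be checked against each other numerically by vectorizing the index pairs (i, j); the ξ values below are arbitrary positive placeholders, with ξ_i = 0 for i > min(K, M).

```python
import numpy as np

# Arbitrary positive xi values for i <= min(K, M); zero beyond (here K = 2, M = 3).
K, M = 2, 3
xi = np.array([0.7, 1.3, 0.0])

# Build J[i,j,k,l] = -d_ik d_jl (xi_i + xi_j/2) - (1/2) d_il d_jk xi_i  (eq. B.2).
J = np.zeros((K, M, K, M))
for i in range(K):
    for j in range(M):
        for k in range(K):
            for l in range(M):
                J[i, j, k, l] = (-(i == k) * (j == l) * (xi[i] + xi[j] / 2)
                                 - 0.5 * (i == l) * (j == k) * xi[i])

# Vectorize the (i,j) and (k,l) pairs and compute the spectrum numerically.
eigs = np.linalg.eigvals(J.reshape(K * M, K * M))

# Eigenvalues predicted by eq. B.4 for the pairs (i, j), i <= K, j <= M.
predicted = sorted([
    -2 * xi[0], -2 * xi[1],        # lambda_ii = -2 xi_i
    -0.5 * (xi[0] + xi[1]),        # i < j <= K (antisymmetric mode)
    -(xi[0] + xi[1]),              # i > j (mode weighted by the xi)
    -xi[0], -xi[1],                # j > K
])
```

Sorting the numerically computed eigenvalues and comparing with `predicted` confirms the decomposition for this small case.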
Acknowledgments
This work was supported by an EPSRC award (GR/M48123).

References

Amari, S.-I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind source separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barber, D., Sollich, P., & Saad, D. (1996). Finite size effects in on-line learning of multi-layer neural networks. Europhysics Letters, 34, 151–156.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Biehl, M. (1994). An exactly solvable model of unsupervised learning. Europhysics Letters, 25, 391–396.
Biehl, M., & Schlösser, E. (1998). The dynamics of on-line principal component analysis. Journal of Physics A, 31, L97–L103.
Biehl, M., & Schwarze, H. (1995). Learning by on-line gradient descent. Journal of Physics A, 28, 643–656.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017–3030.
Gardiner, C. W. (1985). Handbook of stochastic methods. New York: Springer-Verlag.
Heskes, T. M., & Kappen, B. (1993). On-line learning processes in artificial neural networks. In J. Taylor (Ed.), Mathematical foundations of neural networks (pp. 199–233). Amsterdam: Elsevier.
Heskes, T. M., & Wiegerinck, W. (1998). On-line learning with time-correlated examples. In D. Saad (Ed.), On-line learning in neural networks (pp. 251–278). Cambridge: Cambridge University Press.
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general non-linear Hebbian-like learning rules. Signal Processing, 64, 301–313.
Leen, T. K., Schottky, B., & Saad, D. (1998). Two approaches to optimal annealing. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Müller, K.-R., Ziehe, A., Murata, N., & Amari, S.-I. (1998). On-line learning in switching and drifting environments with application to blind source separation. In D. Saad (Ed.), On-line learning in neural networks (pp. 93–110). Cambridge: Cambridge University Press.
Saad, D. (Ed.). (1998). On-line learning in neural networks. Cambridge: Cambridge University Press.
Saad, D., & Solla, S. A. (1995a). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74, 4337–4340.
Saad, D., & Solla, S. A. (1995b). On-line learning in soft committee machines. Physical Review E, 52, 4225–4243.
Wang, L.-H., & Karhunen, J. (1996). A unified neural bigradient algorithm for robust PCA and MCA. International Journal of Neural Systems, 7, 53–67.

Received November 8, 2000; accepted May 2, 2001.
LETTER
Communicated by Eric Mjolsness
A Neural-Network-Based Approach to the Double Traveling Salesman Problem

Alessio Plebe
[email protected]
Angelo Marcello Anile
[email protected]
Department of Mathematics and Informatics, University of Catania, I-95125 Catania, Italy

The double traveling salesman problem is a variation of the basic traveling salesman problem where targets can be reached by two salespersons operating in parallel. The real problem addressed by this work concerns the optimization of the harvest sequence for the two independent arms of a fruit-harvesting robot. This application poses further constraints, like a collision-avoidance function. The proposed solution is based on a self-organizing map structure, initialized with as many artificial neurons as the number of targets to be reached. One of the key components of the process is the combination of competitive relaxation with a mechanism for deleting and creating artificial neurons. Moreover, in the competitive relaxation process, information about the trajectory connecting the neurons is combined with the distance of neurons from the target. This strategy prevents tangles in the trajectory and collisions between the two tours. Results of tests indicate that the proposed approach is efficient and reliable for harvest sequence planning. Moreover, the enhancements added to the pure self-organizing map concept are of wider importance, as proved by a traveling salesman problem version of the program, simplified from the double version for comparison.
1 Introduction
The traveling salesman problem (TSP) is probably the most famous topic in the field of combinatorial optimization, and many studies have searched for a reliable solution. In this problem, the objective is to find the shortest route between a set of N randomly located cities, in which each city is visited only once. The number of possible routes grows exponentially with an increase in the number of cities. This makes an exhaustive search for the shortest route infeasible, even if there is only a small number of cities. The high level of interest in the TSP has arisen from the wide range of related

Neural Computation 14, 437–471 (2001)  © 2001 Massachusetts Institute of Technology
optimization problems in fields ranging from parts placement in electronic circuits, to routing in communication network systems, to resource planning. (For general overviews of the TSP, see Lawler, Lenstra, Rinnooy Kan, & Shmoys, 1985, and Reinelt, 1994.) However, real problems are seldom defined in terms of just the pure TSP; they often involve multiple traveling salesmen planning routes between a single set of cities and include additional constraints. A well-known case is the vehicle routing problem (VRP), involving the design of a set of minimum-cost delivery routes for a fleet of vehicles, originating and terminating at a central depot, serving a set of customers (Golden & Assad, 1988; Laporte, 1992; Fisher, 1995). In the multi-vehicle covering tour problem (m-CTP), only a subset of all vertices has to be collectively covered by the m vehicles (Hodgson, Laporte, & Semet, 1998; Hachicha, Hodgson, Laporte, & Semet, 2000). Examples of the many possible additional constraints are time deadlines for serving customers (Thangiah, Osman, Vinayagamoorthy, & Sun, 1993), time windows (Desrochers, Desrosiers, & Solomon, 1992), and equity in customer satisfaction over periodic services (Dell, Batta, & Karwan, 1996). From a theoretical point of view, all such problems could be mapped to an equivalent TSP formulation, but in practice this is not a viable route to an efficient solution. Therefore, there is a vast literature on heuristics tailored to each abstract class in the family of TSP variants. The work presented here considers the case in which there are two salesmen who each visit an independent subset of the cities. As in the TSP, each city should be visited only once, and each tour should start and end in the same city. In this case, the best solution is the one for which the difference between the lengths of the two salesmen's routes is smallest and the total distance traveled in visiting all of the cities is shortest.
This work has been developed to plan the path of the two arms of a robot that was designed to harvest oranges. The two arms are electrically driven and move along the direction of unit vectors in a spherical coordinate system. The 2-TSP algorithm described here has been successfully implemented in the CRAM-OPR harvesting robot prototype. A general description of this robot and results of harvesting in real conditions are available in Recce, Taylor, Plebe, and Tropiano (1996) and Plebe and Grasso (in press). The 2-TSP algorithm incorporates two of the constraints arising from the robotic application. One is the partition of the space allowed for the two salesmen. The total space where the targets lie is divided into an overlap area, where both salesmen can reach targets, and two disjoint areas, where only one of the salesmen can move. The other constraint is that in the overlap area, the paths planned by the two salesmen cannot cross. Despite being conceived for a real application with specific constraints, the solving algorithm also turns out to be a very promising neural solution to the pure TSP.
1.1 Neural Network Approach. The idea of investigating neural networks as an alternative to classical heuristics from the operations research domain for solving the TSP has gained widespread attention since the early work of Hopfield and Tank (1985) and Durbin and Willshaw (1987) (see Potvin, 1993, for a recent survey). Both pioneering methods basically perform a gradient-descent search of an appropriate energy function. The Hopfield-Tank model suffers from an expensive network representation, with a number of nodes proportional to the square of the number of cities. Despite subsequent efforts in the tuning of the involved parameters (Kamgar-Parsi & Kamgar-Parsi, 1992; Lai & Coghill, 1992) and in several enhancements (Lo, 1992; Van den Bout & Miller, 1989), there is an intrinsic difficulty in the quadratic nonconvex formulation of the TSP objective function, with many more local minima than a standard linear formulation (Smith, 1996). Instead, a linear formulation is shown to be the energy function of the so-called elastic net paradigm of Durbin-Willshaw, originating from models of the biological visual cortex, which has been adopted and extended by several researchers (Boeres, Carvalho, & Barbosa, 1992; Vakhutinsky & Golden, 1995). Although quite accurate, this method is still computationally expensive. Most recently, several algorithms have relied on the self-organizing feature map (SOM) (Kohonen, 1984, 1990, 1995), which is also a biologically motivated model and apparently shares some similarity with the elastic net but is not mathematically equivalent to an energy-function formalism (Erwin, Obermayer, & Schulten, 1992). First attempts at using the SOM to solve the TSP can be found in Fort (1988) and Hueter (1988). In this method, the activity and weights of a set of artificial neurons are adapted to match the topological relationship of the targets within the input set by means of iterative competitive learning.
The input sample is presented to the neurons, and the connection weights of the neuron with the best match, together with those of neighboring neurons, are modified in the direction that strengthens the match. This process results in a neural encoding that gradually relaxes toward a valid tour. Although in its original form the SOM does not perform very well on the TSP, it is an attractive paradigm, which has been used by several researchers as a basis for more refined schemes (Angéniol, Vaubois, & Texier, 1988; Fritzke & Wilke, 1991; Burke & Damany, 1992; Burke, 1994, 1996; Aras, Oommen, & Altinel, 1999). In general, the relaxation process of the SOM is not sufficient to find an optimal tour, since in most cases there is a mismatch between the relative spatial distribution of the randomly placed neurons and the location of the targets. This drawback has been addressed in the algorithm of Angéniol et al. (1988) by a network that starts with just one neuron, and neurons are cloned as long as they win on more than one city during the same iteration step. The same algorithm has been extended by Goldstein (1990) to the m-TSP problem.
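The iterative competitive learning common to these SOM schemes can be sketched for the plain TSP as follows; this is a generic ring-SOM, not the enhanced algorithm of this paper, and the neighborhood and learning-rate schedules are illustrative choices.

```python
import numpy as np

def som_tsp(cities, n_iter=5000, seed=0):
    """Generic SOM heuristic for the TSP: neurons on a ring relax toward cities."""
    rng = np.random.default_rng(seed)
    n = len(cities)
    ring = rng.random((n, 2))                     # neuron weight vectors in [0,1]^2
    for t in range(n_iter):
        c = cities[rng.integers(n)]               # present one randomly chosen city
        w = np.argmin(((ring - c) ** 2).sum(1))   # winner: closest neuron
        d = np.abs(np.arange(n) - w)              # index distance along the ring...
        d = np.minimum(d, n - d)                  # ...with circular wraparound
        radius = max(1.0, (n / 8.0) * (1 - t / n_iter))
        h = np.exp(-((d / radius) ** 2))          # neighborhood strength
        lr = 0.01 + 0.8 * (1 - t / n_iter)        # decaying learning rate
        ring += lr * h[:, None] * (c - ring)      # move winner and its neighbors
    # read the tour off the ring: order cities by their nearest neuron's index
    keys = [np.argmin(((ring - c) ** 2).sum(1)) for c in cities]
    return np.argsort(keys, kind="stable")
```

Calling `som_tsp` on an array of city coordinates returns a permutation of the city indices, that is, a visiting order; the enhancements described below (regeneration, trajectory-based competition) address the cases where this plain relaxation fails.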
In the “guilty net” of Burke, the number of neurons is fixed, but the function used in the competition is modified with a penalty term for neurons that win too often. Aras et al. (1999) included in the SOM adaptation a mechanism for restoring the centroid of the overall network to the centroid of the targets.

2 The New Heuristics in the 2-TSP
The 2-TSP presented here is an algorithm based on the SOM mechanism, with the addition of two main heuristics.

2.1 Regeneration. One strategy is to combine the competitive relaxation with a mechanism for deleting and creating neurons, referred to as regeneration. The questions then become when to perform the regeneration process and where to perform it along the tours. In the approach taken here, the latter question is addressed by increasing the amount of information acquired during the map relaxation process. During each competition step, information on the spatial mismatch between individual targets and neurons is accumulated in local indicators called activity. In addition, in this algorithm, explicit information concerning the history of the competition process is associated with each target not yet assigned. The former question, when to perform the regeneration, is addressed through continuous monitoring of the tour relaxation, synthesized into two global indicators, called frustration and knowledge. The frustration measure increases when there are local failures of individual neurons to map to a single target and decreases when a final match is made between a neuron and a target. The knowledge is a measure of the number of stored dependencies between a target and surrounding neurons. These stored dependencies are used to sort out the mismatch between the distribution of targets and neurons in regeneration mode. The value of both of these parameters decreases with the successful discovery of the optimal tour. The first parameter can be seen as a measure of the need to regenerate the neural network, and the second determines the potential usefulness of the regeneration process. There is no need to perform a regeneration if the frustration is low, and even if it is high, there is no benefit if the knowledge is too small to drive the creation and deletion of neurons.

2.2 Competition Strategy. The other new heuristic in the 2-TSP concerns the competition process.
Unlike other adaptive neural approaches, the competition rule is based not only on the minimization of the distance between neurons and candidate targets; it also includes the distance to the trajectory connecting the neurons. At any point in the competition process, when a segment of the trajectory of the tour is closer to a target than any of the neurons, the winning neuron is the nearest along that trajectory rather than simply the nearest neuron. This modification brings two major benefits to the algorithm: it reduces the
tangles that can occur in the trajectory, and it ensures that the two tours do not intersect. As a minor effect, this competition process also acts as a limit on the number of allowed candidates in a competition, with a speed benefit. Prior work with adaptive networks has also introduced criteria to limit the candidates in the competition process. Fritzke and Wilke (1991) use a local selection of the tour, essentially to achieve better speed. The “guilty net” of Burke (1996) inhibits multiple winning neurons to ensure node separation. In this work, the candidates are also limited to a variable-length section of the trajectory between the neurons. In addition, the orientation of the tour trajectory with respect to the city is also used as a preliminary selection of candidates.

3 Description of the Algorithm
First, a formal definition of the problem dealt with by the algorithm will be given. In a two-dimensional (2D) space where the coordinates of the targets are normalized in the interval [0, 1], it is assumed that the common area allowed to both salesmen is defined by two limits on the x dimension. The left salesman extends on the right side up to x2, and the right salesman extends to the left up to x1, with x1 < x2. Given a set of N targets T, let H(T) be the shortest tour for T and d(H(T)) be its length. H(T) is a set of N segments h_i connecting two targets T_i^-, T_i^+, with T_i^+ = T_{i+1}^-, T_N^+ = T_0^-, and T_i^+, T_i^- ∈ T. The double traveling salesman problem addressed here is formally the following:

Problem 1 (2-TSP). From all pairs of target sets {L, R}, L ⊆ T, R ⊆ T that satisfy:

L ∪ R = T;   L ∩ R = ∅;
h_l ∩ h_r = ∅,   ∀h_l ∈ H(L), ∀h_r ∈ H(R);
0 < x_l < x2,   ∀{x_l, y_l} ∈ h_l ∈ H(L);
x1 < x_r < 1,   ∀{x_r, y_r} ∈ h_r ∈ H(R),

find the two sets {L*, R*} ∈ {{L, R}} such that:

max( d(H(L*)), d(H(R*)) ) = min_{L,R} { max( d(H(L)), d(H(R)) ) }.

The solution is the shortest path, with the condition that the paths of the two arms do not cross. Since the arms work in parallel, the total time required is the longest time spent by one of the arms. Theoretically, the 2-TSP could be reduced to several TSP problems by assigning each combination of targets to one tour, from a minimum of two targets up to N/2, and the remaining targets to the other tour. In the worst case,
Table 1: 2-TSP Algorithm.

Input: the set of targets T, the limits of the overlap area x1, x2
T is split into the sets O, D of targets in the overlap and in the nonoverlap area
initialize the tours X^l, X^r with |X^l| + |X^r| = |T|; U ← T; default to regeneration mode
while U ≠ ∅
    if in adaptation mode:
        choose randomly a target T_j ∈ U
    else (in regeneration mode):
        choose T_j ∈ U with the largest activity E_T
    identify one section of the tour if T_j ∈ D, or two sections if T_j ∈ O,
        and compute from the candidate set C the winner neuron X_w ∈ X^l ∪ X^r
    if in regeneration mode:
        if E_w ≥ 0 (X_w is not yet assigned):
            x_w ← t_w, U ← U − {T_j}
        else (X_w is already assigned):
            delete the worst neuron, and add a new neuron near X_w with x ← t_w
    else (in adaptation mode):
        if X_w is a previous winner on T_j for t times:
            x_w ← t_w, U ← U − {T_j}
        else (X_w is a (relatively) new winner on T_j):
            change the weights of the winner, x_w ← x_w + (t_w − x_w) η,
                and of its neighbors with a smaller learning rate
    update the process parameters K and F and the activity E
    according to the value of the parameters K and F, the current η, and the number
        of cities with positive activity E_T, choose the next process mode:
        adaptation or regeneration
Output the tours X^l, X^r
this requires ∑_{i=2}^{N/2} N!/(N − i)! TSP computations. All of these TSPs should first be checked for collisions, and the shortest tour that meets the conditions in problem 1 is selected. The approach taken here is to find an approximate solution that fully complies with the collision-avoidance condition and is sufficiently close to the optimal solution. The algorithm, summarized in Table 1, includes an initialization phase and an iterative process described in detail in the following sections. The input of the algorithm is a set of targets T, which is divided according to the horizontal limitation of the two salesmen,

O = {T ∈ T : x1 < x_T < x2};   D = {T ∈ T : x_T < x1 ∨ x_T > x2},

where x_T is the horizontal component of the vector t associated with the target T.
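Written out directly from these set definitions (targets as (x, y) pairs; membership at the exact boundaries x1, x2 is left open by the strict inequalities, and is assigned to D here as an arbitrary choice):

```python
def split_targets(targets, x1, x2):
    """Partition targets into the overlap set O (reachable by both arms)
    and the disjoint set D, by the horizontal coordinate x_T."""
    O = [t for t in targets if x1 < t[0] < x2]
    D = [t for t in targets if t[0] <= x1 or t[0] >= x2]
    return O, D
```

For example, with x1 = 0.4 and x2 = 0.6, a target at x = 0.5 falls into O while targets at x = 0.1 and x = 0.9 fall into D.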
The 2-TSP works with a fixed number of neurons N, equal from the start to the number of targets. In most of the prior work based on SOM neural networks, the network is initialized with a small number of neurons (e.g., one neuron in the work of Angéniol et al., 1988), and the number increases during the search for the best tour. Only Burke (1994, 1996) has started with a fixed number of neurons equal to the number of targets, but with no addition and deletion mechanism. After an initial setup of the two tours, the algorithm iterates through a process of adaptation and regeneration until all targets are assigned to neurons. Throughout the iteration process, the number of neurons never changes, since the creation and deletion of neurons are balanced.

3.1 The Starting Tours. The initialization process should pose in the space all the neurons, organized in two initial left and right tours, X^l, X^r, with X = X^l ∪ X^r, X^l ∩ X^r = ∅, |X| = |T|. The general principle for initializing the tours in the 2-TSP would be to lay two polygonal tours with the vertices approximating the centers of gravity of the targets in some discrete partition of the whole space and with neurons on the edges approximating the target density in the same partition. Calling A_i the areas into which the space related to one tour is partitioned and C the closed polygonal interpolating the centers of gravity of the targets in the A_i, the linear density of neurons on the initial tour, ν(s), shall approximate the area density ρ(A) of the targets,

∫_{C_i} ν(s) ds = ∫∫_{A_i} ρ(A) dA,   (3.1)

where C_i is the section of C inside A_i. The choice of the partition can vary in granularity and geometry. An efficient solution is to divide each of the two subspaces into radial sectors, from the center of gravity of all targets in the subspace, at fixed angular steps. Because of the weak sensitivity of the algorithm to the initial tour shape, a simpler solution is adopted, based on a tessellation of the space into 16 squares of equal area (4 × 4). The neurons are initially placed into these squares, where the number of neurons is set equal to the number of targets contained in the square. The center of gravity (c_i) of the targets is computed for each of the squares, and piecewise-linear loops are connected between the squares (see Figure 1). More precisely, neurons in a square i are placed over a line segment whose end points are

[β(i − b) c_i + (1 − β) c_b] / [1 − β + β(i − b)];   [β(f − i) c_i + (1 − β) c_f] / [1 − β + β(f − i)],   (3.2)

where b is the index of the first nonempty square backward from i and f the index of the first nonempty square forward. When all squares are populated with
444
Alessio Plebe and Angelo M. Anile
Figure 1: The partition of the problem space into 16 regions and the shape of the two starting tours. The medium-sized dots are the targets (oranges), the larger dots are centroids for each square, and the small dots are the starting locations for the artificial neurons.
neurons, equation 3.2 reduces to βc_i + (1−β)c_{i−1}; βc_i + (1−β)c_{i+1}. If an index falls outside the boundary, it is wrapped around. These segments are connected, and the neurons are placed equidistant from each other along the segment within each square. The proportion of segments populated with neurons to joint segments between them is ruled by β. For example, with β = 2/3, along the line connecting c_i and c_{i+1}, the end point of the segment carrying the neurons of A_i lies at distance (1/3)D(c_i, c_{i+1}) from c_i, and the starting point of the segment carrying the neurons of A_{i+1} lies at distance (2/3)D(c_i, c_{i+1}). In between, there is a joint segment without neurons. (See Figure 1 for an example with 50 targets.) While for a number of conventional TSP heuristics the choice of a good starting tour has an important impact (Perttunen, 1994), experimentation has revealed that for the 2-TSP the initialization is not critical. Even a very crude initialization, like two squares with neurons placed without any relationship to the target distribution, leads to a worsening in time and quality of less than 1% on a 1000-target problem with respect to the solution just described. The initial assignment of nodes is independent of the size of the mechanical overlap between the two arms; the partition of targets between the two tours relies on the competition process, and each target initially included in the overlap set O is presented to both tours for the competition, in the way described below.
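As a concrete illustration, the tessellation-based initialization can be sketched in Python. This is a minimal sketch, not the authors' code: the function and variable names are ours, the loop order over the squares is a crude simplification, and equation 3.2 is simplified to symmetric joint segments controlled by `beta`.

```python
def initial_tour(targets, grid=4, beta=2/3):
    """Sketch of the 2-TSP tour initialization of section 3.1.

    Tessellate the unit square into grid x grid cells, compute the
    centroid of the targets inside each nonempty cell, and place as many
    neurons as targets equidistantly on a segment associated with the
    cell.  The segment end points are a simplified stand-in for
    equation 3.2: each populated segment covers the central fraction
    `beta` of the line toward the neighboring centroids, leaving joint
    segments without neurons in between."""
    cells = [[] for _ in range(grid * grid)]
    for tx, ty in targets:
        i = min(int(tx * grid), grid - 1)
        j = min(int(ty * grid), grid - 1)
        cells[i * grid + j].append((tx, ty))
    # nonempty cells in index order (a crude loop order over the squares)
    occ = [(pts, (sum(p[0] for p in pts) / len(pts),
                  sum(p[1] for p in pts) / len(pts)))
           for pts in cells if pts]
    neurons = []
    for m, (pts, c) in enumerate(occ):
        c_prev = occ[m - 1][1]              # wraps around at the boundary
        c_next = occ[(m + 1) % len(occ)][1]
        a = tuple(beta * ci + (1 - beta) * pi for ci, pi in zip(c, c_prev))
        b = tuple(beta * ci + (1 - beta) * ni for ci, ni in zip(c, c_next))
        n = len(pts)
        for k in range(n):
            s = k / (n - 1) if n > 1 else 0.5
            neurons.append(tuple((1 - s) * ai + s * bi
                                 for ai, bi in zip(a, b)))
    return neurons
```

Since each neuron is a convex combination of cell centroids, the starting tour always stays inside the bounding box of the targets, mirroring the piecewise linear loops of Figure 1.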
An Approach to the Double Traveling Salesman Problem
445
3.2 The Iterative Loop. In the 2-TSP algorithm, one iteration step consists of the presentation of a single target T_j ∈ T and the competition for the winner neuron X_w ∈ X, followed by either the adaptation of synaptic weights or the neural regeneration process. In each iteration step, the target is selected randomly, and the overall likelihood of selecting a target is independent of its membership in the sets O or D. However, the assignment of common targets is more critical, since it is essential to avoid a premature assignment of free neurons in the overlapped area prior to an assessment of the remaining tours. For this reason, there is an explicit alternation in the selection process between distinct targets and common targets:

\[ T_{j,n} \in \begin{cases} \mathcal{O}, & \text{when } \mathcal{D} = \emptyset \ \lor\ n \bmod \left\lceil 1 + \frac{|\mathcal{D}|}{|\mathcal{O}|} \right\rceil = 0, \\ \mathcal{D}, & \text{otherwise,} \end{cases} \tag{3.3} \]
where ⌈·⌉ is the ceiling function. One conventional practice that reduces the time spent in finding the winner neuron for each target is to restrict the search to a local neighborhood of the target (e.g., Fritzke & Wilke, 1991). In our algorithm, there is a modified version of this approach, based on shrinking the neighborhood of a target during the process. The set of neurons within the neighborhood of target T_j on the nth iteration is referred to as S_j^{k,n}, with k ∈ {l, r}. If T_j ∈ D, there is only one selection S_j^{k,n}; otherwise, if T_j ∈ O, the selection is done for both the left tour, S_j^{l,n}, and the right tour, S_j^{r,n}.
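The alternation rule of equation 3.3 can be sketched as a small selector; this is a minimal sketch with names of our choosing, returning only which set the next target is drawn from.

```python
from math import ceil

def select_set(n, n_distinct, n_common):
    """Equation 3.3: choose which target set to draw from at iteration n.

    The common set O is chosen when the distinct set D is empty, or on
    every ceil(1 + |D|/|O|)-th iteration; otherwise the distinct set D
    is chosen."""
    if n_distinct == 0:
        return "O"
    if n_common > 0 and n % ceil(1 + n_distinct / n_common) == 0:
        return "O"
    return "D"
```

With |D| roughly twice |O|, for example, O is drawn every third step, so each common target keeps about the same overall likelihood of selection as a distinct one, as the text requires.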
In order to determine the initial definition of S_j^{k,n}, the central neuron in the selection is found using

\[ X = \begin{cases} W(T_j, 1), & \text{if } W(T_j, 1) \neq \bot, \\ X_{A_i}, & \text{otherwise,} \end{cases} \tag{3.4} \]
where ⊥ is the symbol for an undefined winner. If the target has already been selected, there will be a previous winner, given by the function W defined later in equation 3.9, and used as the current center of S_j^{k,n}. Otherwise, resorting to the information that T_j belongs to area A_i in the partition of the plane (see section 3.1), neuron X_{A_i} is used. X_{A_i} is the most central neuron in the segment whose end points are given by equation 3.2. The neighborhood selection spans L neurons backward and forward from the central neuron along the connected chain of artificial neurons. The neighborhood set of every target is changed in each iteration step. In particular, the neighborhood set of the nonselected targets always decreases, while the neighborhood of the selected target may increase. More precisely, the value L_p associated with each target T_p is updated at each
iteration n as

\[ \begin{aligned} L_{p,0} &= \delta N, & p = 0 \ldots N, & \quad \text{[initial value of } L\text{]} \\ L_{p,n} &= \left(1 - \tfrac{1}{\delta N}\right) L_{p,n-1}, & p = 0 \ldots N,\ p \neq j, & \quad \text{[decreases } L\text{]} \\ L_{j,n} &= \frac{L_{j,n-1} + L_{j,0}}{2}, & & \quad \text{[increases } L\text{]} \end{aligned} \tag{3.5} \]
with δ < 1/2. At the beginning of the search process, S_j^{k,n} is large and includes many potential neurons within the neighborhood of each target. As the iteration proceeds, the neurons in the tour move closer to the associated targets, and the search gradually becomes more of a local process. Nevertheless, the neighborhood of a target that has been selected but not assigned by the competition is enlarged the next time it is selected (L_j is increased). Actually, only a subset of the neurons in S_j^{k,n} competes to be closest to T_j, as described in section 3.3.

3.3 The Competition Rule. The orientation of the tour trajectory with respect to the target T_j is used to select possible candidate winner neurons out of the group defined by S_j. There is a set of candidates C_j^p for which the distance D(X_i, T_j) from target j to the candidate neuron i is measured, and another set C_j^s for which the segment distance D_s(X_i X_{i+1}, T_j) from the segment of the tour connecting two candidate neurons to the target is used. The two sets are constructed according to the sign changes of the derivative of the distance along the trajectory:
\[ \begin{aligned} C_j^p &\stackrel{\text{def}}{=} \left\{ X_i \in S_j : \left.\frac{dD}{ds}\right|_{X_i^-} \le 0 \ \wedge\ \left.\frac{dD}{ds}\right|_{X_i^+} \ge 0 \right\}, \\ C_j^s &\stackrel{\text{def}}{=} \left\{ X_i \in S_j : \left.\frac{dD}{ds}\right|_{X_i^+} \le 0 \ \wedge\ \left.\frac{dD}{ds}\right|_{X_{i+1}^-} \ge 0 \right\}. \end{aligned} \tag{3.6} \]
As long as the tour trajectory is oriented toward the target, the next neuron is closer to the target, and there is no need to compute the distance. It is necessary to compute the distance only when a change in this orientation occurs. A change in orientation can occur either at a neuron or along a part of the trajectory connecting two neurons. In the first case, it is more appropriate to use the point-to-point distance as a score in the competition for the winner neuron. In the second case, the distance is measured to the segment, and if the segment is closer than the nearest of the two neurons connected by the segment, it is selected as the winner. The effect of including the segment distance in the competition is illustrated in an example in Figure 2. In this case, if the segment distance is not included in the algorithm, the tour becomes longer and tangled.
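The segment distance D_s used above is the ordinary point-to-segment distance; a minimal sketch (the function name is ours):

```python
from math import hypot

def segment_distance(a, b, t):
    """Segment distance D_s from target t to the tour segment with end
    points a and b (2-D tuples): the perpendicular distance when the
    foot of the perpendicular falls inside the segment, otherwise the
    distance to the nearer end point."""
    (ax, ay), (bx, by), (tx, ty) = a, b, t
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    # parameter of the perpendicular foot, clipped to the segment
    s = 0.0 if denom == 0.0 else max(
        0.0, min(1.0, ((tx - ax) * dx + (ty - ay) * dy) / denom))
    return hypot(tx - (ax + s * dx), ty - (ay + s * dy))
```

In the situation of Figure 2, it is this quantity, evaluated on the segment between X_2 and X_3, that beats the point distance to X_5 and prevents the tangle.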
Figure 2: A typical situation when prior competition rules lead to a tangle in the tour (A). Neuron X5 will win and move toward the current target, together with X4 and X6 (B). With the current algorithm, the winner will be X2 .
In Figure 2A, the closest neuron to the selected target is X_5, but a part of the tour, between X_2 and X_3, is much closer to the target. However, neither X_2 nor X_3 is close enough to win. Without including the segment distances, X_5 would win and be moved, together with its neighbors X_4 and X_6, toward the selected target (see Figure 2B). Instead, X_2 is selected as the winner and is moved toward the target. This safeguard feature inside the competition is one of the reasons that this algorithm can use learning-rate values much higher than in a conventional SOM, as will be shown in section 4.1, with faster convergence. The speed of the algorithm is further increased by computing the distances only when a neuron or a segment could be a winner in the competition. The orientation of the tour in relation to the target is computed at both end points of each segment within the neighborhood S_j of a target. The segments and neurons that participate in the competition are found by computing the sign of the change in distance to the selected target at the end points of the segments. This is most easily calculated by taking the sign of the dot product of the vector from X_i to X_{i+1} along the segment of the tour and the vector
from an end point of the segment to the target,
\[ \begin{aligned} \operatorname{sign}\left(\left.\frac{dD}{ds}\right|_{X_i^+}\right) &= \operatorname{sign}\left( (x_{i+1} - x_i)^T (x_i - t_j) \right), \\ \operatorname{sign}\left(\left.\frac{dD}{ds}\right|_{X_{i+1}^-}\right) &= \operatorname{sign}\left( (x_{i+1} - x_i)^T (x_{i+1} - t_j) \right), \end{aligned} \tag{3.7} \]
since, for example, sign((x_{i+1} − x_i)^T (x_i − t_j)) is the sign of cos ∠T_j X_i X_{i+1}. Table 2 shows the possible cases that need to be considered in the evaluation of a neuron on the trajectory, in accordance with equation 3.6. In the first case, the tour is progressing toward the target, so the neuron is not shortlisted. The trajectory is directed away from the target in the second case. In the third case, the tour has just passed the selected target, and the distance from the nearest point on the trajectory is used, so X_i ∈ C_j^s. The fourth case is when the trajectory was previously directed toward the target and changes at X_i; the distance from its position is used: X_i ∈ C_j^p. Note that the case dD/ds|_{X_i^+} > 0 and dD/ds|_{X_{i+1}^-} < 0 is physically impossible. All the X_i ∈ C_j^p ∪ C_j^s are selected for the final competition. In this competition, the neuron with the smallest distance D wins. An example of the role of the segment orientation in the competition process is shown in Figure 3. The signs of the derivatives are shown in parentheses, and thin lines show the distances D taken into account. Note that some of these are point-to-point distances, and others are segment distances. This competition process is also part of the mechanism for selecting collision-free tours. Let the two tours X^l and X^r be in any state of the competition process but not yet colliding, and let T_j be an arbitrary next-selected target. A tour intersection cannot take place if T_j is already included in one of the tours. Figure 4 shows an example in which the current algorithm avoids a collision between the two trajectories that would have occurred without the competition process already described. There is no absolute means to avoid a collision when the contended target is outside both tours. The rule described in equation 3.6 is applied independently on the trajectory of each tour, and at the end, the two results are compared. For an absolute guarantee of collision-free tours, it would be necessary to extend the rule to take into account simultaneously the orientation of the two trajectories with respect to the current target. In practice, the current algorithm is sufficient. It is interesting to note that collision avoidance is included in the algorithm as a soft constraint, without imposing any check or penalty term in which crossing of the two tours is mathematically described. This pertains to the self-adaptive nature of the SOM, where an explicit formulation of the desired objective function is not required, and this is another advantage of
Table 2: Rules for Selecting Candidates for the Competition to the Selected Target.

Number   dD/ds at X_i^-   dD/ds at X_i^+   dD/ds at X_{i+1}^-   Distance Type
1        any              ≤ 0              ≤ 0                  None
2        ≥ 0              ≥ 0              any                  None
3        any              ≤ 0              ≥ 0                  D_s
4        ≤ 0              ≥ 0              any                  D_p

Notes: Columns 2–4 are the inputs to the table: the two distance derivatives of the competing neuron and the derivative of the next neuron on the incoming trajectory. The output, in the fifth column, is the distance used as score in the competition, if any. (The original table's last column shows an example trajectory configuration for each case.) A distance is calculated only in cases 3 and 4.
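The shortlisting described by Table 2 and equations 3.6 and 3.7 can be sketched as follows; this is a minimal sketch under our own naming, operating on a single closed tour and using only signs of dot products, as equation 3.7 prescribes.

```python
def candidates(x, t):
    """Shortlisting of Table 2 / equations 3.6 and 3.7 on a closed tour.

    `x` is the list of neuron positions (2-D tuples), `t` the target.
    Returns the point candidates C_p (neuron indices, case 4 of
    Table 2) and the segment candidates C_s (index i stands for the
    segment X_i X_{i+1}, case 3)."""
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]
    n = len(x)
    C_p, C_s = [], []
    for i in range(n):
        xi, xn = x[i], x[(i + 1) % n]
        seg_in = (xi[0] - x[i - 1][0], xi[1] - x[i - 1][1])  # entering X_i
        seg_out = (xn[0] - xi[0], xn[1] - xi[1])             # leaving X_i
        d_in = dot(seg_in, (xi[0] - t[0], xi[1] - t[1]))     # dD/ds at X_i^-
        d_out = dot(seg_out, (xi[0] - t[0], xi[1] - t[1]))   # dD/ds at X_i^+
        d_end = dot(seg_out, (xn[0] - t[0], xn[1] - t[1]))   # dD/ds at X_{i+1}^-
        if d_in <= 0 and d_out >= 0:       # case 4: use point distance D_p
            C_p.append(i)
        if d_out <= 0 and d_end >= 0:      # case 3: use segment distance D_s
            C_s.append(i)
    return C_p, C_s
```

For a target just outside an edge of the tour, only that edge (and any other local minimum of the distance along the trajectory) is shortlisted, so distances are computed for a small fraction of the neighborhood, which is where the speedup claimed above comes from.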
Figure 3: An example of the competition process. This is a part of an example tour, in which the thick, solid lines correspond to segments in the current tour, and there is an artificial neuron (not shown) at each end of each segment. The signs of the distance derivatives with respect to the target are shown in parentheses for each of the neurons. Following the rules described in equation 3.6, the distances to only four neurons out of nine are evaluated. Two of the measurements are to segments of the tour, and two of the distances are to artificial neurons. The thin lines show the minimum (perpendicular) distance to the evaluated line segments and the evaluated neurons.
the chosen approach. It is known that the most efficient local heuristics for the TSP, like Lin-Kernighan, are not suitable for accepting additional constraints, but also methods based on an explicit list of constraints, like linear and integer programming (Nemhauser & Wolsey, 1988; Jünger, Reinelt, & Rinaldi, 1995), are limited to modeling linear equalities and inequalities. For collision avoidance, the mathematical representation of the necessary inequality constraints would require a number of binary variables and extra constraints, resulting in a significant increase in model size in terms of variables and constraints. There are several alternative methods for directly including nonlinear constraints, like special branch-and-bound formulations (Hajian & Richards, 1996) and translators from constraint logic to mixed integer programming (Rodosek & Wallace, 1996), which are too expensive in terms of computation for the 2-TSP application.

3.4 Neural Adaptation. In the 2-TSP algorithm, the state of the neurons is not only reflected in the weight values; there is also another measure, the
Figure 4: An example of the mechanism for avoiding a collision between the two tours. (A) A potential initial situation that could lead to a crossing of the two tours. (B) The modication to the tours that would result using conventional competition rules. A neuron of the tour on the right wins and is pushed toward the target, crossing the left tour. Its two neighbors are also moved, as shown by the dashed arrows. (C) The modication of the tours that results using the current 2-TSP competition rule. The winner is part of the left tour, and it is moved toward the target, together with the unassigned neighboring neuron.
activity of the neurons, which is updated during the process. The activity is an accumulation of winning events for the same neuron and is a measure of the local mismatch between neuron and target density, which will be exploited during the regeneration phase. The activity will be high for neurons in an area crowded with targets and remain low where many neurons compete in an area with sparse targets. It is also a kind of energy level of the neuron in the transition from the free state to the state of being assigned to a target. Once the neuron is assigned, it will continue to compete, without the possibility of conquering further targets. Its activity will still be updated, but it will have a negative sign, being actually a negative potential for all the free neurons competing for the same targets. At the end of the competition at the nth iteration, the activity E of the winner is updated by the following rule:

\[ E_{w,n} \leftarrow \begin{cases} E_{w,n-1} + 1 + \alpha \frac{n}{N}, & \text{when } E_{w,n-1} \ge 0, \\ E_{w,n-1} - 1 - \alpha \frac{n}{N}, & \text{when } E_{w,n-1} < 0. \end{cases} \tag{3.8} \]
The absolute value given by equation 3.8 will be explained in the next subsection; the only effect for the adaptation process is that in the first case, X_w is a free neuron, and in the second case, it is already assigned. The target activity is updated by replacing the previous value,

E^T_{j,n} ← |E_{w,n}|,
and the winner is registered in the winner history for T_j, which is a function defined as

\[ W(T, t) : \mathcal{T} \times [0, \tau] \to X \cup \{\bot\}. \tag{3.9} \]
W is a function giving the winner for a target at discrete time lags; t is the delay of drawing on a target, t = 0 being the most recent time that the target has been drawn. Note that t is not an absolute time metric, since drawing events are not synchronous for the different targets. ⊥ is the symbol for the indeterminate winner. At iteration n = 0 (at the beginning of the process), W = ⊥, ∀T, ∀t. τ is the memory of W, the longest delay for which the winner is still recorded for every target. W is updated as follows:

\[ W(T_j, t) \leftarrow W(T_j, t-1), \quad t \in [1, \tau], \qquad W(T_j, 0) \leftarrow X_w. \tag{3.10} \]
Further updating of the activity of a neuron takes place only if the process is not in the regeneration state. If the following condition is met,

\[ W(T_j, t) = W(T_j, t+1), \quad t \in [0, \tau-1], \tag{3.11} \]
X_w is the same winner for the whole history of T_j and is accepted as the node assigned to the target T_j:

\[ x_w \leftarrow t_j, \qquad E_w \leftarrow -E_w. \tag{3.12} \]
Otherwise, the winner is updated toward the target, using the current learning rate, in a similar fashion to that used in Kohonen self-organizing maps (Kohonen, 1984):

\[ x_{w,n} \leftarrow x_{w,n-1} + \eta_n \,(t_j - x_{w,n-1}). \tag{3.13} \]
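The assign-or-adapt decision of equations 3.10 through 3.13 can be sketched for one selected target as follows; the function names and the history length are ours, and None stands for the undefined-winner symbol.

```python
TAU = 3  # memory length tau of the winner history (the value here is ours)

def step(history, x, w, t, eta):
    """Sketch of equations 3.10-3.13 for one selected target.

    `history` holds [W(T_j,0), ..., W(T_j,tau)] for this target, `x` is
    the list of neuron positions (2-D tuples), `w` the winner index,
    `t` the target.  Returns True when the winner is assigned."""
    history[:] = [w] + history[:-1]                  # equation 3.10
    if all(h == w for h in history):                 # equation 3.11
        x[w] = t                                     # equation 3.12: assign
        return True
    x[w] = tuple(xi + eta * (ti - xi)                # equation 3.13: adapt
                 for xi, ti in zip(x[w], t))
    return False
```

A neuron is thus assigned only after it has won the same target tau + 1 consecutive draws, which is what makes a premature assignment in the overlap area unlikely.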
The learning rate η is normally reduced at each iteration by

\[ \eta_n \leftarrow \left( \frac{\eta_N}{\eta_0} \right)^{1/N} \eta_{n-1}. \tag{3.14} \]
η_0 is the value at n = 1. The parameter η_N is the value η would reach after N iteration steps in adaptation mode. η is also reduced to half its value each time the process moves into the regeneration state. The weights of all of the neurons in the neighborhood region S_j are updated by an amount that depends on the distance of the neuron, along the connected segments, from the winner neuron. The amount of movement of the ith neighboring neuron is given by

\[ x_{i,n} \leftarrow x_{i,n-1} + \eta_n \left( \frac{U}{N} \right)^{|w-i|} (t_j - x_{i,n-1}), \tag{3.15} \]
where U is the number of unassigned targets. The changes are symmetric in the neighborhood in each direction along the tour away from X_w. This update process is stopped for a particular neuron as soon as the neuron is assigned to a target. The parameterization of the neural adaptation in equations 3.8 and 3.15 normalizes the behavior with respect to N, so that the values of the parameters are almost independent of the size of the 2-TSP problem, and furthermore this stabilizes the relaxation process throughout the search for a tour. The ratio U/N in equation 3.15, for example, provides a more global change in the weights when the tour is still in an early stage and restricts the changes to nearby neurons in the later stages of the search process.

3.5 Neuron Regeneration. In each iteration of the algorithm, the values of two key parameters are used to determine whether the search process is in the state of neural adaptation or neural regeneration. These two parameters are called frustration (F) and knowledge (K).
Let the following define the sets:

\[ X_n^{-} \stackrel{\text{def}}{=} \{ X_{w,n'} : E_{w,n'-1} < 0, \ n' \in [n_0, n] \}, \tag{3.16} \]
\[ X_n^{\neq} \stackrel{\text{def}}{=} \{ X_{w,n'} : X_{w,n'} \neq W(T_{j,n'}, 1), \ W(T_{j,n'}, 1) \neq \bot, \ n' \in [n_0, n] \}, \tag{3.17} \]
\[ X_n^{=} \stackrel{\text{def}}{=} \{ X_{w,n'} : X_{w,n'} = W(T_{j,n'}, t), \ t \in [1, \tau], \ n' \in [n_0, n] \}, \tag{3.18} \]
\[ T_n^{+} \stackrel{\text{def}}{=} \{ T_{j,n'} : W(T_{j,n'}, t) = \bot, \ t \in [1, \tau], \ n' \in [n_0, n] \}. \tag{3.19} \]
n_0 is the last iteration since which the process has been in adaptation mode: n_0 = 0, or the iteration immediately after the end of a regeneration state. During neural adaptation, F and K are computed as follows:

\[ F_{n_0} = 0, \qquad F_n = |X_n^{-}| + |X_n^{\neq}| - \frac{|X_n^{=}|}{2}; \tag{3.20} \]
\[ K_{n_0} = 0, \qquad K_n = |T_n^{+}| - |X_n^{=}|. \tag{3.21} \]
F is increased if the current competition has not improved the tours. This happens every time the winner is already assigned to a different target (equation 3.16) or is different from the previous winner (equation 3.17). On the contrary, F is decreased by successful neural assignments (see equations 3.11 and 3.18). K is simply increased by any competition involving a new target (see equation 3.19) and decreased by an assignment of a neuron to a target. The regeneration state occurs when the following condition is met:

\[ \frac{F}{K} > \gamma\, U. \tag{3.22} \]
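The bookkeeping behind equations 3.20 through 3.22 can be sketched incrementally; this is a simplified sketch of our own making (the paper accumulates set cardinalities since the last regeneration, which we fold into per-iteration counter updates, ignoring the set semantics).

```python
GAMMA = 2.0  # transition parameter gamma (a typical value from section 4)

def step_indicators(F, K, *, winner_taken, winner_changed, assigned, new_target):
    """Incremental sketch of equations 3.20 and 3.21: update frustration
    F and knowledge K after one competition step."""
    if winner_taken:       # winner already assigned elsewhere (equation 3.16)
        F += 1.0
    if winner_changed:     # winner differs from the previous one (equation 3.17)
        F += 1.0
    if assigned:           # successful assignment (equation 3.18)
        F -= 0.5
        K -= 1.0
    if new_target:         # competition involved a new target (equation 3.19)
        K += 1.0
    return F, K

def regeneration_due(F, K, U, gamma=GAMMA):
    """Condition of equation 3.22, with U the number of unassigned targets."""
    return K > 0 and F / K > gamma * U
```

As in the text, a low gamma makes the switch into regeneration easy, trading tour quality for speed.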
When the condition described in equation 3.22 switches the process to the regeneration state, the competition process continues, but in these iteration steps, the target T_j is not drawn randomly. Instead, the target with the largest activity E^T_{j,n} is selected. The competition calculation finds the winning neuron X_w, and the action performed by the regeneration process depends on the activity E_w of this neuron. If E_w < 0, a real create-and-delete process occurs:

\[ X \leftarrow X \setminus \{\tilde{X}\} \cup \{\hat{X}\}. \tag{3.23} \]
\tilde{X} is the worst neuron, using as a metric the activation of the unassigned neurons:

\[ \tilde{X} = \arg\min_X \{ E_X : E_X \ge 0 \}. \tag{3.24} \]
From equation 3.8, E_X is basically increased by 1 every time neuron X is the winner; therefore, rule 3.24 selects the less successful neuron. There is an additional term α n/N in equation 3.8, typically smaller than 1 and increasing as the algorithm runs. Due to this term, neurons that became inactive earlier will be processed first. For example, if two neurons have each been winners only once, the first to be removed by the regeneration mechanism will be the older winner. If there is more than one neuron with E = 0, \tilde{X} is chosen randomly among such neurons. \hat{X} is the newly generated neuron, placed in the same tour X^k as X_w, in a position identified by the index \hat{\imath} given by

\[ \hat{\imath} = \begin{cases} w, & \text{when } D(X_{w-1} X_w, T_j) > D(X_w X_{w+1}, T_j), \\ w+1, & \text{when } D(X_{w-1} X_w, T_j) < D(X_w X_{w+1}, T_j), \end{cases} \tag{3.25} \]
and with weights given by

\[ \hat{x} = t_j. \tag{3.26} \]
If the activity E_w of the winning neuron is nonnegative, then it has not yet been assigned to a target, and the regeneration process assigns it to the current target, with x_w ← t_j and E_w ← 0; no create-and-delete process takes place. When the process is in the regeneration state, the condition of equation 3.22 is not checked, and the next iteration step remains in the regeneration state, causing the next T_j to be the one with the largest activity E^T_{j,n}. The normal state of neural adaptation is resumed only when max{E^T} ≤ 0, unless the learning rate is below a fixed threshold, η < η_2, with η_2 > η_N. The first of these conditions allows the regeneration process to continue until all of the unassigned targets that have been in a prior competition process are assigned. When the latter learning-rate condition is met, the neural adaptation never resumes, and the regeneration process continues on all of the remaining targets until the end of the search for the tours. This last condition is a very efficient shortcut in the final stage of the search process, when the neural convergence is very slow and the few targets left may take many iterations to attract neurons. The regeneration process leads to the assignment of a target to a neuron at every step and therefore rapidly completes the development of the tour. When the neural adaptation process resumes, frustration F and knowledge K are always reset to zero, as in equations 3.20 and 3.21.
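The create-and-delete step of equations 3.23 through 3.26 can be sketched as follows; the list representation, the helper, and the function names are ours, and activity ≥ 0 marks a free neuron, as in equation 3.24.

```python
from math import hypot

def _seg_dist(a, b, t):
    """Point-to-segment distance (helper for the sketch; 2-D tuples)."""
    (ax, ay), (bx, by), (tx, ty) = a, b, t
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    s = 0.0 if denom == 0.0 else max(
        0.0, min(1.0, ((tx - ax) * dx + (ty - ay) * dy) / denom))
    return hypot(tx - (ax + s * dx), ty - (ay + s * dy))

def regenerate(tour, activity, w, t_j):
    """Create-and-delete step of equations 3.23-3.26: insert a new neuron
    with weights t_j next to the winner w (equations 3.25 and 3.26) and
    remove the least successful free neuron (equation 3.24), keeping the
    number of neurons constant, as equation 3.23 requires."""
    n = len(tour)
    free = [i for i in range(n) if activity[i] >= 0]
    worst = min(free, key=lambda i: activity[i])          # equation 3.24
    # equation 3.25: index of the new neuron, before or after the winner
    if _seg_dist(tour[w - 1], tour[w], t_j) > _seg_dist(tour[w], tour[(w + 1) % n], t_j):
        pos = w
    else:
        pos = w + 1
    tour.insert(pos, t_j)                                 # equation 3.26
    activity.insert(pos, 0.0)
    if worst >= pos:
        worst += 1            # the insertion shifted the indices
    del tour[worst]                                       # equation 3.23
    del activity[worst]
```

Because one neuron is created and one deleted at every such step, the balance of |X| = |T| claimed at the beginning of section 3 is preserved throughout the search.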
4 Results

4.1 Parameter Sensitivity. Several of the mathematical expressions in the algorithm include working parameters: equations 3.2, 3.5, 3.8, 3.9, 3.14, and 3.22. However, the algorithm works reasonably well over a broad range of most of the parameters, and in general the effort of fine-tuning brings a very marginal improvement in terms of tour quality. The same has already been noted for the starting tour (which can actually be interpreted as another parameter choice) in section 3.1. Moreover, the effect of the parameters is almost independent of the problem size, which is automatically accounted for, as in equations 3.5, 3.8, 3.14, and 3.22. The learning rate is a classically crucial parameter in neural-based TSP algorithms, balancing the speed of convergence on the final tour against the avoidance of local minima. Kohonen (1995) suggested in general a decrease from unity in the initial adaptation phase and a low value, like 0.02, during the final phase. In Angéniol et al. (1988), the learning rate is decreased from unity at every iteration using the rule η_n ← (1 − C_η) η_{n−1}, where a typical value of the constant C_η is 0.02; the same value is used in Goldstein (1990). With this setting on a 1000-target run, the average learning rate will be 0.05. In the 2-TSP, the alternative mechanism of regeneration and the selection of candidates, equation 3.6, allow a higher learning rate to be applied without errors due to local minima. For example, with a typical setting, the 2-TSP will apply an average learning rate of 0.5 when processing a 100-target problem and an average learning rate of 0.6 for a 1000-target problem. Figure 5 shows the performance of the 2-TSP on a 250-target problem, with 20% overlap, at various settings of the learning rate. From equation 3.14, the learning rate is ruled by its initial value, η_0, and the value η_N it decays to after N iterations. In the figure, η_0 is varied in [0.1, 1.2], and the ratio η_0/η_N spans four decades.
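The decay rule of equation 3.14 amounts to a geometric schedule; a minimal sketch (the halving of η on entry into the regeneration state, mentioned in section 3.4, is not included):

```python
def eta_schedule(eta0, etaN, N):
    """Equation 3.14: geometric decay of the learning rate.

    Each step multiplies eta by (etaN/eta0)**(1/N), so that after N
    adaptation steps eta has moved from eta0 down to etaN.  Yields the
    value used at each of the N steps."""
    eta, r = eta0, (etaN / eta0) ** (1.0 / N)
    for _ in range(N):
        yield eta
        eta *= r
```

With the best-case setting reported below (η_0 = 0.85, η_N = 0.000114, N = 1000), the schedule stays near unity for many early steps, which is consistent with the high average learning rates the 2-TSP can afford.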
The quality of a computed tour is measured as the gap from the best value, 100(D/D̂ − 1), where D is the total distance traveled by the tours and D̂ is the minimum D achieved over all runs, which in this case happens for η_0 = 0.85 and η_N = 0.000114. All data points are averages over 1000 runs, since results from one run are always affected by the random sequence of presentation of the targets during the process. There is a constraint placed on the selection of targets from the set D or O that reduces the likelihood of problematic sequences. It is evident in the figure that for a broad range of values ruling η, the surface of the gap is contained within 1% of the best case. The most important parameter for the behavior of the 2-TSP is γ, which determines the transition between the regeneration and adaptation states. As is evident in equation 3.22, with lower values of γ, the algorithm switches more easily into the regeneration state. This results in a reduction of the time required for the search but may lead to a premature assignment of targets and less optimal final tours.
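The two quality figures used in this section can be stated compactly; this is our own restatement, and for the balance measure we take σ_D to be the standard deviation of the two tour lengths (their absolute half-difference), since the text does not spell out its exact definition.

```python
def gap_percent(D, D_best):
    """Tour-quality gap of Figures 5 and 6: 100 * (D / D_best - 1)."""
    return 100.0 * (D / D_best - 1.0)

def balance_percent(d_left, d_right, D_best):
    """Balance between the two tours, 100 * sigma_D / D_best, with
    sigma_D assumed here to be |d_left - d_right| / 2."""
    return 100.0 * abs(d_left - d_right) / 2.0 / D_best
```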
Figure 5: Sensitivity of the 2-TSP quality to different learning-rate parameterizations. η_0 is the initial learning rate, and η_0/η_N is an indication of the span of the learning rate during the iteration.
Figure 6: Results from running the 2-TSP algorithm with different values of γ (x-axis) on 250 targets with 20% overlap of the search area of the two tours. (A) Number of iterations performed by the algorithm. (B) Gap of the total length of the two tours from the best value. (C) Gap of the mean difference between the two tours. B and C refer to the left-hand y-axis, and A refers to the right-hand one. Each data point corresponds to the average over 1000 different random tours.
Figure 6 shows how the performance of the algorithm depends on changes in the value of γ, using the same problem described for Figure 5. Curve A in Figure 6 confirms that when neural adaptation is the prevailing process (larger values of γ), the convergence of the tours toward the targets requires
many more iterations. Up to γ equal to 10, most of the targets are actually assigned during the regeneration state, and since each iteration step in this state results in one assignment, there are nearly 250 steps of regeneration. For values γ > 2000, the regeneration is almost inhibited; in other words, the algorithm behaves more and more like an ordinary SOM, apart from the competition rule, and the number of iterations grows exponentially. With a total exclusion of the regeneration mechanism (γ → ∞), the final convergence is no longer ensured, which is known from the mathematics of the SOM, as well as from experimental results (Jeffries & Niznik, 1994; Aras et al., 1999). Convergence of the algorithm with the regeneration is guaranteed, since during this process, one city is assigned at each iteration, and in the adaptation state, from equation 3.14, lim_{i→∞} η_i = 0. Therefore, for a number of iterations i large enough, η_i < η_2, so that the adaptation process is definitely inhibited, and the algorithm will terminate in regeneration mode. Curve B in Figure 6 shows the overall optimization of the tours, expressed as a percentage gap, as in Figure 5. Curve C shows the effectiveness of the algorithm in balancing the lengths of the two tours, measured as 100 σ_D/D̂, where σ_D is the deviation between left and right tour lengths, and D̂ the shortest total length over all trials. Neither parameter varies substantially with γ. Note that the best results lie in an area where the computation time is still very close to the lower limit. Other tests have shown that similar behavior occurs with a larger or smaller number of targets. Figure 7 shows an example of the progress of the search process. In particular, it shows the times at which the algorithm switched between the regeneration and competition states. In this case, there are 100 targets, and γ is equal to 2. The plot of U shows the rate of assignment of the targets as a function of the number of iteration steps.
Whenever the amounts of frustration and knowledge enable the regeneration to take place (e.g., near step 40), the number of unassigned targets decreases almost linearly. In this case, with γ equal to 2, there is no assignment during the neural adaptation state. As a result of applying equation 3.22, in the final part of the search process, the regeneration state is more likely than the adaptation state. In the example (after step 120), regeneration is the sole process, and there is a target assignment in each iteration step. Figure 8 shows the evolution of the 2-TSP algorithm on a large problem. In this case, there are 2000 targets, and the two tours are completely overlapping. This figure contains the initial configuration, an intermediate step, and the final tours.

4.2 Performances on Double Problems. Since the 2-TSP has been designed for a specific real application, it is not easily comparable with respect to all its features with other algorithms, because the m-TSP variants with additional constraints available in the literature are quite different, and none was exactly
Figure 7: Plot of target assignment during the iteration of the 2-TSP algorithm. The middle panel shows the values of the two controlling parameters, frustration and knowledge, and the upper bar shows the state of the process—in black during adaptation and in white during regeneration.
formulated as for this robotic application. However, it is possible to compare the two most important figures, computation speed and tour length, with other m-TSP algorithms, neglecting their other features. A comparison has been made with goldstein, another SOM-based algorithm (Goldstein, 1990), which is a direct extension of the algorithm of Angéniol. When a target T_j is selected, the distance from a neuron X_i is weighted by an additional term,

\[ D^{*}(T_j, X_i) = D(T_j, X_i)\left( 1 + \frac{d(k) - \bar{d}}{\bar{d}} \right), \tag{4.1} \]

where d(k) is the total current distance of tour k ∈ {l, r}, and \bar{d} is the mean of the two total distances. In this way, neurons on an overloaded tour are less likely to win. Table 3 reports comparisons made on the same two problems used in the original work of Goldstein. The tour quality is slightly better for 2-tsp, but the major difference is in time performance, where goldstein is at least an order of magnitude slower. Both algorithms were run on a 266 MHz
Figure 8: An example 2-TSP result with 2000 targets and 100% overlap of the search space of the two arms. The upper two panels contain the starting tour and one of the intermediate steps. The lower panel shows the two final tours.

Table 3: The Algorithm Compared with Another m-TSP Neural Solver in Double Tour Problems.

Problem Size    goldstein                      2-tsp
                Length   Deviation   Time      Length   Deviation   Time
100             10.33    0.68        127       10.13    0.08        7
200             13.37    0.27        264       12.08    0.24        34

Note: The length is the sum of the two tours, the deviation is between tours, and the running time is in milliseconds.
Pentium II PC with Linux OS, C code compiled using the GNU compiler with O1-level optimization; timing is measured with the getrusage system call. These are the conditions used in all computations in this work.

4.3 Collision Avoidance. The current algorithm does not guarantee that the two tours do not cross. Using a simulation, the probability that the rules given in equation 3.6 result in two tours that cross has been evaluated. A large number of different tours have been run, with sizes of the problem in the range of interest for the harvesting application. Results are shown in Table 4.
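Checking whether two final tours cross reduces to pairwise segment-intersection tests. The paper does not give its implementation; the orientation-based test below is a standard assumption of how such a check could be done:

```python
def _orient(a, b, c):
    # Signed area of triangle abc: > 0 counterclockwise, < 0 clockwise, 0 collinear.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p, q, r, s):
    """True if segments pq and rs properly intersect (cross at an interior point)."""
    d1, d2 = _orient(p, q, r), _orient(p, q, s)
    d3, d4 = _orient(r, s, p), _orient(r, s, q)
    return d1 * d2 < 0 and d3 * d4 < 0

def tours_collide(tour_a, tour_b):
    """True if any edge of closed tour_a properly crosses any edge of closed tour_b."""
    def edges(t):
        return [(t[i], t[(i + 1) % len(t)]) for i in range(len(t))]
    return any(segments_cross(p, q, r, s)
               for p, q in edges(tour_a) for r, s in edges(tour_b))
```

With the two final tours given as lists of (x, y) target coordinates, a check of this kind could accumulate crossing counts like those reported in Table 4.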
Table 4: Number of Final Tours with Collision over 100,000 2-TSP Runs, for Various Problem Sizes and Conditions.

                Fraction of Targets in the Overlap Area
Problem Size    0.50    0.70    0.90
10              2       12      5
20              7       8       16
30              12      5       4
40              6       6       8
50              6       12      7
60              6       6       3
Simulations were run using 50% of the total space as the overlap area and several distributions of targets in the overlap and disjointed areas. The case with 50% of targets in the overlap area corresponds to the uniform distribution. The number of tours in which the trajectories cross is fewer than 10 cases over 100,000 runs for most of the conditions. The values shown in the table were computed with a value c = 0.1; other tests were done with c = 2000 to check whether the regeneration process had an influence on the collision avoidance. For such a value of c, regeneration is almost inactive, and an inevitable side effect is an increase in computation time by a factor of at least 1000, so only 10,000 runs were performed. For most of the cases, a single collision occurred; two collisions were detected for 60 targets and 0.9 overlap. These numbers are too small for a statistical investigation; however, the avoidance is mainly due to the competition strategy, and the regeneration has no evident influence. The most important evidence is that collision episodes are very marginal. These trajectory crossings would actually lead to a potential collision only if the two arms are simultaneously in the overlapping section of the corresponding tours. A real collision would never occur, since there are safety mechanisms in the low-level control system of the arms. If a collision condition occurs, the robot controller is forced into a nonnominal feedback system, which protects the arms but takes up valuable time that could be spent in the harvesting process. The very low collision probability makes the avoidance function of the algorithm sufficient for the application.

4.4 TSP Version of the Algorithm. The 2-TSP algorithm has been designed with the robotic application in mind; however, its key mechanisms, like the adaptation-regeneration process and the competition based on trajectories, seem to be an enhancement of the SOM paradigm of more general significance.
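The adaptation state is a SOM-style relaxation. As a point of reference, a minimal single-step sketch of the Kohonen ring adaptation that Angéniol-type TSP solvers build on; the learning rate eta, neighborhood width sigma, and Gaussian profile are illustrative choices here, not the paper's equations:

```python
import math

def som_tsp_step(neurons, target, eta=0.6, sigma=3.0):
    """One Kohonen adaptation step on a closed ring of neurons: the winner
    (nearest neuron) and its ring neighbors move toward the selected target,
    weighted by a Gaussian profile over distance along the ring."""
    n = len(neurons)
    win = min(range(n), key=lambda i: math.dist(neurons[i], target))
    for i in range(n):
        ring_d = min(abs(i - win), n - abs(i - win))  # distance along the ring
        g = math.exp(-(ring_d ** 2) / (2 * sigma ** 2))
        nx, ny = neurons[i]
        tx, ty = target
        neurons[i] = (nx + eta * g * (tx - nx), ny + eta * g * (ty - ny))
    return win
```

Repeating this step over randomly selected targets while shrinking eta and sigma deforms the ring into a closed tour; the 2-TSP augments the winner selection with trajectory distance (equation 3.6) and alternates this adaptation with the regeneration state.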
In order to assess the effectiveness of the 2-TSP solutions, the algorithm has been simplified to solve the classical single-TSP problem, where competitors abound. A benchmark set that is increasingly used in
Table 5: Quality of the TSP Version of the Existing Algorithm Compared with Other Neural Algorithms, Run on Benchmark TSP Problems with Known Optimum Solutions.

                          Gap (%)
Problem Name    Size (#)  guilty  aras   ang    2-tsp
bier127         127       31.2    2.76   3.71   5.92
eil51           51        10.6    2.82   3.99   3.05
eil76           76        14.1    5.02   6.13   5.76
eil101          101       22.7    4.61   6.68   6.83
kroA200         200       37.8    5.72   8.50   8.49
lin105          105       7.6     1.98   6.49   4.93
pcb442          442       —       11.07  17.47  11.08
pr107           107       81.7    0.73   1.79   1.93
pr124           124       47.1    0.08   5.12   0.33
pr136           136       40.4    4.53   6.89   2.54
pr152           152       42.8    0.97   1.30   1.17
rat195          195       65.6    12.23  15.41  9.25
rd100           100       10.4    2.10   4.50   4.07
st70            70        12.0    1.48   2.67   1.19
Average                   32.6    4.00   6.47   4.79
Standard dev.             23.0    3.66   4.7    3.27

Note: Quality is measured as the percentage gap from the known optimum.
the TSP community is the TSPLIB (Reinelt, 1991), a growing collection of sample instances. Recent work (Aras et al., 1999) has included, for the first time, a comparison of neural TSP solutions on some TSPLIB benchmarks. Therefore, runs of the 2-TSP have been performed on the same instances.1 Results are given in Table 5. All 14 problems have a known optimum tour; therefore, the output is given in terms of the gap from the optimum value. The compared algorithms are Burke's "guilty-net" guilty, Angéniol's SOM extended by Aras et al. aras, the original Angéniol SOM ang, and the 2-TSP 2-tsp. All algorithms have been described in the introduction and are compared using best results, as in the original work. The guilty algorithm clearly has a lower-quality output, about an order of magnitude worse
1 For clarity, the algorithm of this work will always be called 2-TSP, even when applied to a "single" problem.
Figure 9: Distribution of TSP tour lengths from 500 runs of the standard problem p654 (runs [#] versus gap [%]). These results are compared with Reinelt's (1994) measurements of the performance of several methods for solving the TSP: NN: Nearest neighbor; IN: Insertion; CH: Christofides; 2O: 2-Opt; LK: Lin-Kernighan.
than the other neural approaches. It does not converge for problem pcb442, so statistics for this algorithm have been computed on one fewer sample. The aras, ang, and 2-tsp are of similar quality, with the best average quality for aras (4.00), followed by 2-tsp (4.79) and ang (6.47). For the 2-tsp, the quality is less affected by the specific instance of the problem, as indicated by the standard deviation.

4.5 Comparison with Nonneural Solutions. It is well known that neural network approaches to the TSP, while of theoretical importance in developing and demonstrating the capabilities of neural models, are reputed to be far less effective than the operational-research state of the art. Therefore, comparisons with classical operational-research methods are necessary in assessing the progress of neural network approaches. Figure 9 shows the results of running the TSP version of the algorithm 500 times on one of the benchmark problems, p654, compared with several other classical heuristics of the operational-research domain (Reinelt, 1994). The Nearest-Neighbor (NN) and Insertion (IN) methods successively build a tour according to some construction rules (Rosenkrantz, Stearns, & Lewis, 1977). In the NN algorithm, a new node is added at every step by simply choosing the nearest target that is not yet assigned, and the tour is closed at the end by connecting the last and first nodes. The results from an implementation of the NN algorithm shown in Figure 9 use an additional heuristic to avoid a characteristic problem with NN. At the end of the search
with the NN algorithm, isolated targets (the so-called forgotten nodes) have to be connected at a high cost. The IN algorithm starts from a small tour, and at every iteration, an edge is broken to insert a new node. In the version of the algorithm presented in the figure, the target inserted is the one whose minimal distance to the current tour is maximal. Moreover, the candidate set is reduced to the targets connected to the tour by a subgraph edge. If this set is empty, an arbitrary target is inserted. The Christofides (CH) heuristic (Christofides, 1976) is one of a number of methods that use a minimum spanning tree (the shortest tree connecting all targets, without cycles) as a basis for generating tours. In the CH algorithm, a Eulerian graph (a graph connecting all targets, not necessarily with unique paths) is built from the minimum spanning tree by connecting every odd-degree node using exactly one edge. Then multiple paths are eliminated to get the final tour. The 2-opt algorithm (2O) is a general and widely used improvement principle that consists of eliminating two edges and reconnecting the two resulting paths in a different way. The shorter of the original and the newly obtained tour is selected at each iteration. In this algorithm, the initial tour must be constructed using a different technique. In its simplest form, the 2O algorithm can be iterated on every node in a certain sequence until the tour stops improving; this requires a very large and unbounded number of iterations. For this reason, the number of 2O iteration steps is often restricted using some criterion, such as restricting moves to the k-nearest-neighbor subgraph (a graph built on the nodes of the tour by connecting each target to its k nearest neighbors), but this can reduce the quality of the final tour. In the case presented in Figure 9, the starting tour is obtained with NN, and the termination of the 2O algorithm is not restricted.
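The NN construction and 2-opt improvement steps described above can be sketched as follows. This is a minimal illustration: variable names are ours, and no candidate-set restriction is applied, matching the unrestricted 2O variant used for Figure 9:

```python
import math
from itertools import combinations

def nearest_neighbor_tour(targets):
    """NN construction rule: repeatedly append the nearest not-yet-assigned
    target, then close the tour by joining the last node back to the first."""
    unvisited = list(range(1, len(targets)))
    tour = [0]
    while unvisited:
        last = targets[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, targets[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def two_opt(tour, targets):
    """2-opt improvement: delete two edges, reconnect by reversing the path
    between them, and keep the change whenever it shortens the tour; iterate
    until no improving move remains."""
    def d(i, j):
        return math.dist(targets[tour[i]], targets[tour[j]])
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i, j in combinations(range(n), 2):
            if j - i < 2 or (i == 0 and j == n - 1):
                continue  # edges share a node; nothing to swap
            # Replace edges (i, i+1) and (j, j+1) with (i, j) and (i+1, j+1).
            delta = (d(i, j) + d((i + 1) % n, (j + 1) % n)
                     - d(i, (i + 1) % n) - d(j, (j + 1) % n))
            if delta < -1e-12:
                tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                improved = True
    return tour
```

Starting two_opt from a NN tour, as done for the 2O curve in Figure 9, combines a cheap construction with the exchange-based improvement.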
A more flexible set of heuristics is found in the LK algorithm, originally developed by Lin and Kernighan (1973). This algorithm accepts 2O moves that temporarily increase the length of the tour. This is clearly a way to avoid local minima, but it may considerably increase the running time. In practical implementations of LK, a move is defined as a sequence of simple submoves, like 2O, and the number of submoves in a move, and the alternatives for each submove, are limited (Mak & Morton, 1993). Despite their age, Lin-Kernighan-based algorithms are still the most successful tour-finding approaches (Codenotti, Manzini, Margara, & Resta, 1996; Johnson & McGeoch, 1995; Helsgaun, 2000). The solution compared here has moves with a maximum of 15 submoves and up to two alternatives for each of the first three submoves. The starting tour was again obtained with the NN algorithm.

4.6 Time Performances. The speed of the algorithm in the simple TSP version has been tested and compared with neural and classical solutions using the benchmarks of the TSPLIB library, now with problem sizes up to 13,509. The Lin-Kernighan heuristic has been at the heart
of many solvers, with different designs and performances. The version linkern compared here is a state-of-the-art implementation including several enhancements for speed efficiency (Helsgaun, 2000). For the neural solutions, comparison has been made with an improvement of the Durbin-Willshaw elastic net algorithm elnet, which runs about 15 times faster than the original one (Vakhutinsky & Golden, 1995), and with ang. No original time performances are available for aras; however, from the description of the algorithm, we believe it cannot be faster than ang, since it is based on ang itself with an added mechanism counteracting the SOM relaxation. The runs of elnet in the original work were computed on a Sparc-10 machine and have been downscaled by a factor of 3.0, a figure derived from several tests of the 2-tsp algorithm on the 266 MHz Pentium II and on a Sparc-10. In the original work of Helsgaun, linkern was run on a 300 MHz G3 Power Macintosh, which is about the same class as the 266 MHz Pentium II, so the data have been left unchanged. Angéniol's SOM algorithm ang has been run on the same 266 MHz Pentium II machine as the 2-TSP. Figure 10 presents the comparison. The top graph spans all the problem sizes, and the bottom graph is a zoom close to the origin. The data points correspond to the 23 benchmarks: pr76, pr107, pr124, pr136, pr152, pr226, pr264, pr299, pr439, pr1002, pr2392, kroA200, pcb442, rl1304, rl1323, rl1889, pcb3038, fl417, fl3795, fnl4461, rl5915, rl5934, usa13509. For elnet, the original data points were used: 19 random problems of sizes from 50 to 500. The 2-tsp is clearly much faster than the compared algorithms in the range of problem sizes used for the evaluation. The time required depends almost entirely on the problem size, while for both linkern and ang, one problem may require more running time than a larger one.
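Asymptotic exponents of the kind reported in this section (fitting running time as T ≈ c·N^a) can be estimated by an ordinary least-squares fit on log-log axes. A sketch with hypothetical timing data:

```python
import math

def fit_power_law(sizes, times):
    """Ordinary least-squares fit of T = c * N**a on log-log axes; returns (c, a)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - a * mx)
    return c, a

# Hypothetical timings growing as N^2.1 (illustration only, not measured data).
sizes = [100, 200, 400, 800, 1600]
times = [0.01 * n ** 2.1 for n in sizes]
c, a = fit_power_law(sizes, times)
```

On real measurements the recovered exponent depends on the size range used for the fit, which is why the text quotes different exponents over the full range and over sizes up to 4000.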
It has to be noted that linkern has impressive quality performance; even on the largest problem, with 13,509 cities, the gap is 0.008%. For this problem, the gap of the 2-tsp was 15%, solved in 6 minutes against the 13 hours of linkern. Gaps of ang are of the same order of magnitude as those of the 2-tsp, in accordance with the previous analysis, while its computation time is quite a bit longer. For 2-tsp, the running time is a function of the problem size and is weakly affected by the problem typology. On the contrary, for both linkern and ang, one problem may require much more running time than a larger one, depending on the target typology. The curves are fitted with N^a functions in order to express the asymptotic time complexity of the algorithms. Values of a for elnet, linkern, ang, and 2-tsp are 1.7, 2.4, 2.0, and 2.1, respectively. This complexity would suggest that elnet will perform better on larger problems (as reported by the authors); however, data for this algorithm are available only up to 500 cities, and it is not known whether a similar figure would be maintained over a wider range. Also, the 2-tsp exhibits a lower complexity when fitting
Figure 10: Comparison of time performance on 23 TSP benchmarks for the following algorithms: linkern, curve A and star marks; elnet, curve B and cross marks; ang, curve C and triangle marks; 2-tsp, curve D and diamond marks. The plot on the top spans all the sizes of the TSP problems; the plot on the bottom is a zoom on the smallest problems, up to 500 targets.
data in a limited range of sizes; for example, up to 4000 cities, the fit is N^1.8. In the regeneration process, it is possible to compute the theoretical complexity, since one target is assigned in each iteration, and during each iteration, a number of scans proportional to the total number of targets is performed. Therefore, the total time complexity becomes O(N^2). There are a couple of mechanisms reducing this figure: the rule selecting candidates (see equation 3.6) and the reduction of the scan section during the process
(see equation 3.5), which explain why the measured complexity is less than quadratic in the range up to 4000 targets. During the adaptation process, complexity cannot be computed theoretically, since it is not dependent only on the problem size, and this portion of the overall process accounts for the higher measured complexity over the full range of problem sizes. Also for linkern, the theoretical assessment of time complexity is difficult; a typical reported value is O(N^2.2) (Helsgaun, 2000). However, taking the above complexities for granted over a wide range of sizes, elnet would outperform ang on problems larger than 50,000 cities, and 2-tsp only for problems with more than 3 × 10^7 cities.

5 Conclusion
The 2-TSP algorithm described here, based on neural network concepts, proved to be a reliable yet simple solution to the 2-TSP problem. In the case of the harvesting robot application, the algorithm satisfies the real constraints of the working space and the collision avoidance of the two arms. Thanks to its simplicity and limited memory requirements, it is suitable for implementation on embedded real-time hardware without the need for powerful general-purpose computer hardware. The evaluation of the algorithm revealed that there was no substantial dependence on the particular values of the parameters of the algorithm and no significant trade-off between the speed of the algorithm and the optimality of the resulting tours. The mechanics of the harvesting robot, with two telescopic arms in spherical coordinates, allows a simple projection of the path-optimization problem into a 2D Euclidean space, which was used during the algorithm description. Multi-arm robots with different joint geometry in general require a higher-dimensional space for representing their path planning, especially with respect to collision avoidance. Furthermore, time optimization in robotics involves joint dynamics as well, requiring non-Euclidean space representations, typically Riemannian (Shin & McKay, 1985). While in principle the algorithm can be directly extended to any space representation, some of its features will certainly no longer work (like the collision avoidance), and its overall efficiency may not be as good as in Euclidean 2D space. While the 2-TSP algorithm was intended initially for the harvesting application, a TSP variant of the algorithm performed well in comparison with
the prior solutions to the TSP problem, especially as a quick finder of tours where very high quality is not the main requirement. In particular, several concepts of the 2-TSP algorithm, like combining neural adaptation and regeneration and taking the trajectory distance into account in the neural competition, are very general in nature and proved to enhance the SOM paradigm considerably, which appears to be the most promising neural strategy for solving TSP-class problems.

Appendix: List of Symbols
C    set of neuron candidates for a target competition
D    set of targets in disjointed areas not yet assigned
H    shortest tour
L    targets on the left tour
O    set of targets in the overlap area not yet assigned
R    targets on the right tour
S    set of neurons in the neighborhood of a neuron
T    set of all targets
U    set of targets not yet assigned
X    set of neurons
D    Euclidean distance function
Dp   Euclidean distance from a point to the target
Ds   Euclidean distance from a segment to the target
E    activity of a neuron
ET   activity of a target
F    frustration
K    knowledge
L    size of the neighborhood of a neuron
N    total number of targets and number of neurons
O    number of targets in the overlap area
T    target
U    number of unassigned targets
W    winner history function
X    neuron
h    segment of a tour
j    current target index
i    current neuron index (unless otherwise stated)
k    tour index ∈ {l, r}
n    current iteration index
s    length variable along the trajectory
w    index of current winner neuron
c    centroid of targets in a space subregion (vector of coordinates)
t    vector of target coordinates
x    vector of neuron weights
α    decay of neural activity
β    weighting factor for the tour initialization
δ    decay of neuron neighborhood size
γ    threshold for regeneration or adaptation state change
η    learning rate
η0   initial learning rate
η̄    learning-rate decay
ν    linear density of neurons in the tour
ρ    area density of targets
τ    longest time lag of the winner history
Acknowledgments
This work was supported in part by the European Community ESPRIT Project 6715 CONNY (Robot Control based on Neural Network Systems).

References

Angéniol, B., Vaubois, G. d. l. C., & Texier, J.-Y. L. (1988). Self-organizing feature maps and the travelling salesman problem. Neural Networks, 1, 289–293.
Aras, N., Oommen, B. J., & Altinel, I. K. (1999). The Kohonen network incorporating explicit statistics and its application to the traveling salesman problem. Neural Networks, 12, 1273–1284.
Boeres, M., Carvalho, L., & Barbosa, V. (1992). A faster elastic-net algorithm for the travelling salesman problem. In Proc. IJCNN-92 (Baltimore).
Burke, L. (1994). Neural methods for the traveling salesman problem: Insights from operations research. Neural Networks, 7(4), 532–541.
Burke, L. (1996). Conscientious neural nets for tour construction in the traveling salesman problem: The vigilant net. Computers and Operations Research, 23(2), 121–129.
Burke, L., & Damany, P. (1992). The guilty net for the traveling salesman problem. Computers and Operations Research, 19, 255–265.
Christofides, N. (1976). Worst case analysis of a new heuristic for the travelling salesman problem (Research Rep.). Pittsburgh: Carnegie-Mellon University.
Codenotti, B., Manzini, G., Margara, L., & Resta, G. (1996). Perturbation: An efficient technique for the solution of very large instances of the Euclidean TSP. INFORMS Journal on Computing, 8(2), 125–133.
Dell, F. R., Batta, R., & Karwan, M. H. (1996). The multiple vehicle TSP with time windows and equity constraints over a multiple day horizon. Transportation Science, 23(2), 121–129.
Desrochers, M. J., Desrosiers, J., & Solomon, M. (1992). The vehicle routing problem: An overview of exact and approximate algorithms. Operations Research, 40, 342–354.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Erwin, E., Obermayer, K., & Schulten, K. (1992).
Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.
Fisher, M. L. (1995). Vehicle routing. In M. Ball, T. L. Magnanti, C. L. Momma, & G. L. Nemhauser (Eds.), Handbook in operations research and management science. Amsterdam: North-Holland.
Fort, J. C. (1988). Solving a combinatorial problem via self-organizing process: An application of Kohonen-type neural networks to the travelling salesman problem. Biological Cybernetics, 59, 33–40.
Fritzke, B., & Wilke, P. (1991). A neural network for the travelling salesman problem with linear time and space complexity. In Proc. IJCNN-91 (Singapore).
Golden, B. L., & Assad, A. A. (1988). Vehicle routing: Methods and studies. New York: North-Holland.
Goldstein, M. (1990). Self organizing feature maps for the multiple travelling salesmen problem. In Proc. INNC '90 (Paris).
Hachicha, M., Hodgson, M., Laporte, G., & Semet, F. A. (2000). Heuristics for the multi-vehicle covering tour problem. Computers and Operations Research, 27, 29–42.
Hajian, M. T., & Richards, E. B. (1996). Introduction of a new class of variables to discrete and integer programming problems. In Proc. Intern. Symposium on Combinatorial Optimization (London).
Helsgaun, K. (2000). An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research, 126, 106–130.
Hueter, G. J. (1988). Solution of the travelling salesman problem with an adaptive ring. In Proceedings of the IEEE International Conference on Neural Networks.
Hodgson, M., Laporte, G., & Semet, F. A. (1998). A covering tour model for planning mobile health care facilities in Suhum district, Ghana. Journal of Regional Science, 38, 621–638.
Hopfield, J. J., & Tank, D. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Jeffries, C., & Niznik, T. (1994). Easing the conscience of the guilty net. Computers and Operations Research, 21(4), 961–968.
Johnson, D. S., & McGeoch, L. A. (1995). The traveling salesman problem: A case study in local optimization. In L. E.
H. Aarts & J. K. Lenstra (Eds.), Local search in combinatorial optimization. New York: Wiley.
Jünger, M., Reinelt, G., & Rinaldi, G. (1995). The traveling salesman problem. In M. Ball, T. L. Magnanti, C. L. Momma, & G. L. Nemhauser (Eds.), Handbook in operations research and management science. Amsterdam: North-Holland.
Kamgar-Parsi, B., & Kamgar-Parsi, B. (1992). Dynamical stability and parameter selection in neural optimization. In Proc. IJCNN-92 (Baltimore).
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78, 74–90.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Lai, W. K., & Coghill, G. C. (1992). Genetic breeding of control parameters for the Hopfield-Tank neural net. In Proc. IJCNN-92 (Baltimore).
Laporte, G. (1992). The vehicle routing problem: An overview of exact and approximate algorithms. European Journal of Operational Research, 59, 345–358.
Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, D. B. (1985). The travelling salesman problem: A guided tour of combinatorial optimization. Chichester: Wiley.
Lin, S., & Kernighan, B. W. (1973). An effective heuristic algorithm for the travelling salesman problem. Operations Research, 21, 498–516.
Lo, J. T. H. (1992). A new approach to global optimization and its application to neural networks. In Proc. IJCNN-92 (Baltimore).
Mak, K. T., & Morton, A. J. (1993). A modified Lin-Kernighan for the travelling salesman heuristic. Operations Research Letters, 13, 127–132.
Nemhauser, G. L., & Wolsey, L. A. (1988). Integer and combinatorial optimization. Chichester: Wiley.
Perttunen, J. (1994). The vehicle routing problem: An overview of exact and approximate algorithms. Journal of the Operational Research Society, 45, 1131–1140.
Plebe, A., & Grasso, G. (in press). Localization of spherical fruits for robotic harvesting. Machine Vision and Applications.
Potvin, J. Y. (1993). The travelling salesman problem: A neural network perspective. ORSA Journal on Computing, 5, 328–348.
Recce, M., Taylor, J., Plebe, A., & Tropiano, G. (1996). Vision and neural control for orange harvesting robot. In International Workshop on Neural Networks for Identification, Control, Robotics and Signal/Image Processing (Venice, Italy). IEEE Computer Society Press.
Reinelt, G. (1991). TSPLIB—a traveling salesman problem library. ORSA Journal on Computing, 3, 376–384.
Reinelt, G. (1994). The traveling salesman. Berlin: Springer-Verlag.
Rodosek, R., & Wallace, M. (1996). Translating CLP(R) programs to mixed integer programs, a first step towards integrating two programming paradigms. In Proc. Intern. Symposium on Combinatorial Optimization (London).
Rosenkrantz, D. J., Stearns, R. E., & Lewis, P. M. (1977). An analysis of several heuristics for the travelling salesman problem. SIAM Journal of Computing, 6, 563–581.
Shin, K. G., & McKay, N. D. (1985).
Minimum time control of robotic manipulator with geometric path constraints. IEEE Transactions on Automatic Control, 31(6), 532–541.
Smith, K. (1996). An argument for abandoning the traveling salesman problem as a neural-network benchmark. IEEE Transactions on Neural Networks, 7(6), 1542–1544.
Thangiah, S. R., Osman, I. H., Vinayagamoorthy, R., & Sun, T. (1993). Algorithms for the vehicle routing problems with time deadlines. American Journal of Mathematical and Management Sciences, 13, 323–355.
Vakhutinsky, A. I., & Golden, B. (1995). A hierarchical strategy for solving traveling salesman problems using elastic nets. Journal of Heuristics, 2, 67–76.
Van den Bout, D. E., & Miller, T. K. (1989). Improving the performance of the Hopfield-Tank neural network through normalization and annealing. Biological Cybernetics, 62, 129–139.

Received March 14, 2000; accepted May 14, 2001.
VIEW
Communicated by Bard Ermentrout
What Geometric Visual Hallucinations Tell Us about the Visual Cortex Paul C. Bressloff [email protected] Department of Mathematics, University of Utah, Salt Lake City, Utah 84112, U.S.A.
Jack D. Cowan [email protected] Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.
Martin Golubitsky [email protected] Department of Mathematics, University of Houston, Houston, TX 77204-3476, U.S.A.
Peter J. Thomas [email protected] Computational Neurobiology Laboratory, Salk Institute for Biological Studies, San Diego, CA 92186-5800, U.S.A.
Matthew C. Wiener [email protected] Laboratory of Neuropsychology, National Institutes of Health, Bethesda, MD 20892, U.S.A.
Many observers see geometric visual hallucinations after taking hallucinogens such as LSD, cannabis, mescaline, or psilocybin; on viewing bright flickering lights; on waking up or falling asleep; in "near-death" experiences; and in many other syndromes. Klüver organized the images into four groups called form constants: (I) tunnels and funnels, (II) spirals, (III) lattices, including honeycombs and triangles, and (IV) cobwebs. In most cases, the images are seen in both eyes and move with them. We interpret this to mean that they are generated in the brain. Here, we summarize a theory of their origin in visual cortex (area V1), based on the assumption that the form of the retino-cortical map and the architecture of V1 determine their geometry. (A much longer and more detailed mathematical version has been published in Philosophical Transactions of the Royal Society B, 356 [2001].) We model V1 as the continuum limit of a lattice of interconnected hypercolumns, each comprising a number of interconnected iso-orientation columns. Based on anatomical evidence, we assume that the lateral connectivity between hypercolumns exhibits symmetries, rendering it invariant under the action of the Euclidean group E(2), composed of reflections and translations in the plane, and a (novel) shift-twist action. Using this symmetry, we show that the various patterns of activity that spontaneously emerge when V1's spatially uniform resting state becomes unstable correspond to the form constants when transformed to the visual field using the retino-cortical map. The results are sensitive to the detailed specification of the lateral connectivity and suggest that the cortical mechanisms that generate geometric visual hallucinations are closely related to those used to process edges, contours, surfaces, and textures.

Neural Computation 14, 473–491 (2002) © 2002 Massachusetts Institute of Technology

1 Introduction

Seeing vivid visual hallucinations is an experience described in almost all human cultures. Painted hallucinatory images are found in prehistoric caves (Clottes & Lewis-Williams, 1998) and scratched on petroglyphs (Patterson, 1992). Hallucinatory images are seen both when falling asleep (Dybowski, 1939) and on waking up (Mavromatis, 1987), following sensory deprivation (Zubek, 1969), after taking ketamine and related anesthetics (Collier, 1972), after seeing bright flickering light (Purkinje, 1918; Helmholtz, 1925; Smythies, 1960), on applying deep binocular pressure on the eyeballs (Tyler, 1978), in "near-death" experiences (Blackmore, 1992), and, most strikingly, shortly after taking hallucinogens containing ingredients such as LSD, cannabis, mescaline, or psilocybin (Siegel & Jarvik, 1975). In most cases, the images are seen in both eyes and move with them, but maintain their relative positions in the visual field. We interpret this to mean that they are generated in the brain. One possible location for their origin is provided by fMRI studies of visual imagery suggesting that V1 is activated when human subjects are instructed to inspect the fine details of an imagined visual object (Miyashita, 1995).
In 1928, Klüver (1966) organized such images into four groups called form constants: (I) tunnels and funnels, (II) spirals, (III) lattices, including honeycombs and triangles, and (IV) cobwebs, all of which contain repeated geometric structures. Figure 1 shows their appearance in the visual field. Ermentrout and Cowan (1979) provided a first account of a theory of the generation of such form constants. Here we develop and elaborate this theory in the light of the anatomical and physiological data that have accumulated since then.

1.1 Form Constants and the Retino-Cortical Map. Assuming that form constants are generated in part in V1, the first step in constructing a theory of their origin is to compute their appearance in V1 coordinates. This is done using the topographic map between the visual field and V1. It is well established that the central region of the visual field has a much bigger representation in V1 than does the peripheral field (Drasdo, 1977; Sereno
Geometric Visual Hallucinations
475
Figure 1: Hallucinatory form constants. (I) Funnel and (II) spiral images seen following ingestion of LSD (redrawn from Siegel & Jarvik, 1975), (III) honeycomb generated by marijuana (redrawn from Siegel & Jarvik, 1975), and (IV) cobweb petroglyph (redrawn from Patterson, 1992).
et al., 1995). Such a nonuniform retino-cortical magnification is generated by the nonuniform packing density of ganglion cells in the retina, whose axons in the optic nerve target neurons in the lateral geniculate nucleus (LGN), and (subsequently) in V1, that are much more uniformly packed. Let {r_R, θ_R} be the (polar) coordinates of a point in the visual field and r = {x, y} its corresponding V1 coordinates. Given a retinal ganglion cell
476
P. C. Bressloff et al.
packing density ρ_R (cells per unit retinal area) approximated by the inverse square law,

\[
\rho_R = \frac{1}{(w_0 + \epsilon r_R)^2}
\]

(Drasdo, 1977), and a uniform V1 packing density ρ, we derive a coordinate transformation (Cowan, 1977) that reduces, sufficiently far away from the fovea (when r_R ≫ w_0/ε), to a scaled version of the complex logarithm (Schwartz, 1977),

\[
x = \frac{\alpha}{\epsilon}\,\ln\!\left(\frac{\epsilon}{w_0}\, r_R\right), \qquad
y = \frac{\beta\,\theta_R}{\epsilon},
\tag{1.1}
\]

where w_0 and ε are constants. Estimates of w_0 = 0.087 and ε = 0.051 in appropriate units can be obtained from Drasdo's data on the human retina, and α and β are constants. These values correspond to a magnification factor of 11.5 mm/degree of visual angle at the fovea and 5.75 mm/degree of visual angle at r_R = w_0/ε = 1.7 degrees of visual angle. We can also compute how the complex logarithm maps local edges or contours in the visual field, that is, its tangent map. Let φ_R be the orientation preference of a neuron in V1 whose receptive field is centered at the point r_R in the visual field, and let {r, φ} be the V1 image of {r_R, φ_R}. It can be shown (Wiener, 1994; Cowan, 1997; Bressloff, Cowan, Golubitsky, Thomas, & Wiener, 2001) that under the tangent map of the complex logarithm, φ = φ_R − θ_R. Given the retino-cortical map {r_R, φ_R} → {r, φ}, we can compute the form of logarithmic spirals, circles, and rays and their local tangents in V1 coordinates. The equation for spirals can be written as θ_R = a ln[r_R exp(−b)] + c, whence y − c = a(x − b) under the action r_R → r. Thus, logarithmic spirals become oblique lines of constant slope a in V1. Circles and rays correspond to vertical (a = ∞) and horizontal (a = 0) lines. Since the slopes of the local tangents to logarithmic spirals are given by the equation tan(φ_R − θ_R) = a, and therefore in V1 coordinates by tan φ = a, the local tangents in V1 lie parallel to the images of logarithmic spirals. It follows that types I and II form constants correspond to stripes of neural activity at various angles in V1 and types III and IV to spatially periodic patterns of local tangents in V1, as shown in Figure 2. On average, about 30 to 36 stripes and about 60 to 72 contours are perceived in the visual field, corresponding to a wavelength of about 2.4 to 3.2 mm in human V1.
This is comparable to estimates of about twice V1 hypercolumn spacing (Hubel & Wiesel, 1974) derived from the responses of human subjects to perceived grating patterns (Tyler, 1982) and from direct anatomical measurements of human V1 (Horton & Hedley-Whyte, 1984). All of these facts and calculations reinforce the suggestion that the form constants are located in V1. It might be argued that since hallucinatory images are seen as continuous across the midline, they must be located at higher levels in the visual pathway than V1. However,
Figure 2: (a) Retino-cortical transform. Logarithmic spirals in the visual field map to straight lines in V1. (b, c) The action of the map on the outlines of funnel and spiral form constants.
there is evidence that callosal connections along the V1 / V2 border can act to maintain continuity of the images across the vertical meridian (Hubel & Wiesel, 1962). Thus, if hallucinatory images are generated in both halves of V1, as long as the callosal connections operate, such images will be continuous across the midline.
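The action of the retino-cortical map on logarithmic spirals can be checked numerically. The sketch below (Python; not part of the original analysis) uses the w_0 and ε estimates quoted above and sets the unspecified scale constants α and β to 1; it confirms that a spiral in visual-field coordinates maps to a line of constant slope in V1.

```python
import math

# Retino-cortical map of eq. 1.1, valid far from the fovea (rR >> w0/eps):
#   x = (alpha/eps) * ln(eps * rR / w0),   y = beta * thetaR / eps
W0, EPS = 0.087, 0.051      # estimates from Drasdo's data, as quoted in the text
ALPHA, BETA = 1.0, 1.0      # arbitrary scale constants (left unspecified in the text)

def retino_cortical(rR, thetaR):
    """Map visual-field polar coordinates (rR, thetaR) to V1 coordinates (x, y)."""
    x = (ALPHA / EPS) * math.log(EPS * rR / W0)
    y = BETA * thetaR / EPS
    return x, y

# A logarithmic spiral thetaR = a*ln(rR) + c should become an oblique straight
# line of constant slope in V1 (circles -> vertical lines, rays -> horizontal).
a, c = 0.5, 0.2
pts = [retino_cortical(r, a * math.log(r) + c) for r in (5.0, 10.0, 20.0, 40.0)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(slopes)  # constant slope: the spiral has become a straight line
```

With α = β, the slope of the image line equals the spiral parameter a, matching the claim that spirals become oblique lines of slope a.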
Figure 3: Outline of the architecture of V1 represented by equation 1.2. Local connections between iso-orientation patches within a hypercolumn are assumed to be isotropic. Lateral connections between iso-orientation patches in different hypercolumns are assumed to be anisotropic.
1.2 Symmetries of V1 Intrinsic Circuitry. In recent years, data concerning the functional organization and circuitry of V1 have accrued from microelectrode studies (Gilbert & Wiesel, 1983), labeling, and optical imaging (Blasdel & Salama, 1986; Bosking, Zhang, Schofield, & Fitzpatrick, 1997). Perhaps the most striking result is Blasdel and Salama's (1986) demonstration of the large-scale organization of iso-orientation patches in primates. We can conclude that approximately every 0.7 mm or so in V1 (in macaque), there is an iso-orientation patch of a given preference φ and that each hypercolumn has a full complement of such patches, with preferences running from φ to φ + π. The other striking result concerns the intrinsic horizontal connections in layers 2, 3, and (to some extent) 5 of V1. The axons of these connections make terminal arbors only every 0.7 mm or so along their tracks (Gilbert & Wiesel, 1983) and connect mainly to cells with similar orientation preferences (Bosking et al., 1997). In addition, there is a pronounced anisotropy to the pattern of such connections: differing iso-orientation patches preferentially connect to patches in neighboring hypercolumns in such a way as to form continuous contours under the action of the retino-cortical map described above (Gilbert & Wiesel, 1983; Bosking et al., 1997). This contrasts with the pattern of connectivity within any one hypercolumn, which is much more isotropic: any given iso-orientation patch connects locally in all directions to all neighboring patches within a radius of less than 0.7 mm (Blasdel & Salama, 1986). Figure 3 shows a diagram of such connection patterns.
Since each location in the visual field is represented in V1, roughly speaking, by a hypercolumn-sized region containing all orientations, we treat r and φ as independent variables, so that all possible orientation preferences exist at each corresponding position r_R in the visual field. This continuum approximation allows a mathematically tractable treatment of V1 as a lattice of hypercolumns. Because the actual hypercolumn spacing (Hubel & Wiesel, 1974) is about half the effective cortical wavelength of many of the images we wish to study, care must be taken in interpreting cortical activity patterns as visual percepts. In addition, it is possible that lattice effects could introduce deviations from the continuum model. Within the continuum framework, let w(r, φ | r′, φ′) be the strength or weight of connections from the iso-orientation patch at {x′, y′} = r′ in V1 with orientation preference φ′ to the patch at {x, y} = r with preference φ. We decompose w in terms of local connections from elements within the same hypercolumn and patchy lateral connections from elements in other hypercolumns, that is,

\[
w(\mathbf{r}, \phi \mid \mathbf{r}', \phi') = w_{LOC}(\phi - \phi')\,\delta(\mathbf{r} - \mathbf{r}')
+ \beta\, w_{LAT}(s)\,\delta(\mathbf{r} - \mathbf{r}' - s\mathbf{e}_\phi)\,\delta(\phi - \phi'),
\tag{1.2}
\]

where δ(·) is the Dirac delta function, e_φ is a unit vector in the φ-direction, β is a parameter that measures the weight of lateral relative to local connections, and w_LAT(s) is the weight of lateral connections between iso-orientation patches separated by a cortical distance s along a visuotopic axis parallel to their orientation preference. The fact that the lateral weight depends on a rotated vector expresses its anisotropy. Observations by Hirsch and Gilbert (1991) suggest that β is small and therefore that the lateral connections modulate rather than drive V1 activity. It can be shown (Wiener, 1994; Cowan, 1997; Bressloff et al., 2001) that the weighting function defined in equation 1.2 has a well-defined symmetry: it is invariant with respect to certain operations in the plane of V1: translations {r, φ} → {r + u, φ}, reflections {x, y, φ} → {x, −y, −φ}, and a rotation defined as {r, φ} → {R_θ[r], φ + θ}, where R_θ[r] is the vector r rotated by the angle θ. This form of the rotation operation follows from the anisotropy of the lateral weighting function and comprises a translation or shift of the orientation preference label φ to φ + θ, together with a rotation or twist of the position vector r by the angle θ. Such a shift-twist operation (Zweck & Williams, 2001) provides a novel way to generate the Euclidean group E(2) of rigid motions in the plane. The fact that the weighting function is invariant with respect to this form of E(2) has important consequences for any model of the dynamics of V1. In particular, the equivariant branching lemma (Golubitsky & Schaeffer, 1985) guarantees that when the homogeneous state a(r, φ) = 0 becomes unstable, new states with the symmetry of certain subgroups of E(2) can arise. These new states will be linear combinations of patterns we call planforms.
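Why the twist is needed can be checked directly: the lateral term of equation 1.2 depends on the vector r − r′ − s e_φ, which transforms covariantly under a rotation only if the orientation label is shifted along with the positions. A short numerical illustration (Python; the particular values of θ, φ, s, r, r′ are arbitrary test points):

```python
import math

def rot(theta, v):
    """Rotate a 2D vector v by the angle theta."""
    x, y = v
    return (math.cos(theta) * x - math.sin(theta) * y,
            math.sin(theta) * x + math.cos(theta) * y)

def e(phi):
    """Unit vector in the phi-direction."""
    return (math.cos(phi), math.sin(phi))

theta, phi, s = 0.7, 0.3, 2.0
r, rp = (1.0, 2.0), (0.5, -1.0)

def lateral_arg(r, rp, phi):
    """Argument r - r' - s*e_phi of the lateral delta function in eq. 1.2."""
    return tuple(a - b - s * c for a, b, c in zip(r, rp, e(phi)))

before = lateral_arg(r, rp, phi)
# Shift-twist: rotate both positions AND shift the label phi -> phi + theta.
after_twist = lateral_arg(rot(theta, r), rot(theta, rp), phi + theta)
# Rotation without the twist: positions rotated, label left unchanged.
after_no_twist = lateral_arg(rot(theta, r), rot(theta, rp), phi)

rotated_before = rot(theta, before)
print(max(abs(u - v) for u, v in zip(rotated_before, after_twist)))     # ~0: covariant
print(max(abs(u - v) for u, v in zip(rotated_before, after_no_twist)))  # nonzero: invariance lost
```

The covariance rests on the identity R_θ e_φ = e_{φ+θ}; dropping the shift of φ breaks it, which is exactly why the ordinary rotation action must be replaced by the shift-twist action.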
2 A Model of the Dynamics of V1

Let a(r, φ, t) be the average membrane potential or activity in an iso-orientation patch at the point r with orientation preference φ. The activity variable a(r, φ, t) evolves according to a generalization of the Wilson-Cowan equations (Wilson & Cowan, 1973) that incorporates an additional degree of freedom to represent orientation preference and the continuum limit of the weighting function defined in equation 1.2:

\[
\frac{\partial a(\mathbf{r}, \phi, t)}{\partial t} = -\alpha a(\mathbf{r}, \phi, t)
+ \frac{\mu}{\pi} \int_0^{\pi} w_{LOC}(\phi - \phi')\, \sigma[a(\mathbf{r}, \phi', t)]\, d\phi'
+ \nu \int_{-\infty}^{\infty} w_{LAT}(s)\, \sigma[a(\mathbf{r} + s\mathbf{e}_\phi, \phi, t)]\, ds,
\tag{2.1}
\]

where α, μ, and ν = μβ are, respectively, time and coupling constants, and σ[a] is a smooth sigmoidal function of the activity a. It remains only to specify the form of the functions w_LOC(φ) and w_LAT(s). In the single population model considered here, we do not distinguish between excitatory and inhibitory neurons and therefore assume both w_LOC(φ) and w_LAT(s) to be "Mexican hats" of the generic form,

\[
\sigma_1^{-1} \exp(-x^2 / 2\sigma_1^2) - \sigma_2^{-1} \exp(-x^2 / 2\sigma_2^2),
\tag{2.2}
\]

with Fourier transform

\[
W(p) = \exp(-\sigma_1^2 p^2) - \exp(-\sigma_2^2 p^2),
\]

(σ_1 < σ_2), such that iso-orientation patches at nearby locations mutually excite if they have similar orientation preferences, and inhibit otherwise, and similar iso-orientation patches at locations at least r_0 mm apart mutually inhibit. In models that distinguish between excitatory and inhibitory populations, it is possible to achieve similar results using inverted Mexican hat functions that incorporate short-range inhibition and long-range excitation. Such models may provide a better fit to anatomical data (Levitt, Lund, & Yoshioka, 1996). An equation F[a] = 0 is said to be equivariant with respect to a symmetry operator γ if it commutes with it, that is, if γF[a] = F[γa]. It follows from the Euclidean symmetry of the weighting function w(r, φ | r′, φ′) with respect to the shift-twist action that equation 2.1 is equivariant with respect to it. This has important implications for the dynamics of our model of V1. Let a(r, φ) = 0 be a homogeneous stationary state of equation 2.1 that depends smoothly on the coupling parameter μ. When no iso-orientation patches are activated, it is easy to show that a(r, φ) = 0 is stable for all values of μ less than a critical value μ_0 (Ermentrout, 1998). However, the parameter μ can reach μ_0 if, for example, the excitability of V1 is increased by the action of hallucinogens on brain stem nuclei such as the locus ceruleus or the
raphe nucleus, which secrete the monoamines serotonin and noradrenalin. In such a case, the homogeneous stationary state a(r, φ) = 0 becomes unstable. If μ remains close to μ_0, then new stationary states develop that are approximated by (finite) linear combinations of the eigenfunctions of the linearized version of equation 2.1. The equivariant branching lemma (Golubitsky & Schaeffer, 1985) then guarantees the existence of new states with the symmetry of certain subgroups of the Euclidean group E(2). This mechanism, by which dynamical systems with Mexican hat interactions can generate spatially periodic patterns that displace a homogeneous equilibrium, was first introduced by Turing (1952) in his article on the chemical basis of morphogenesis.

2.1 Eigenfunctions of V1. Computing the eigenvalues and associated eigenfunctions of the linearized version of equation 2.1 is nontrivial, largely because of the presence of the anisotropic lateral connections. However, given that lateral effects are only modulatory, so that β ≪ 1, degenerate Rayleigh-Schrödinger perturbation theory can be used to compute them (Bressloff et al., 2001). The results can be summarized as follows. Let σ_1 be the slope of σ[a] at the stationary equilibrium a = 0. Then to lowest order in β, the eigenvalues λ_± are

\[
\lambda_\pm(p, q) = -\alpha + \mu \sigma_1 \left[ W_{LOC}(p) + \beta \{ W_{LAT}(0, q) \pm W_{LAT}(2p, q) \} \right],
\tag{2.3}
\]

where W_LOC(p) is the pth Fourier mode of w_LOC(φ) and

\[
W_{LAT}(p, q) = (-1)^p \int_0^{\infty} w_{LAT}(s)\, J_{2p}(qs)\, ds
\]

is the Fourier transform of the lateral weight distribution with J_{2p}(x) the Bessel function of x of integer order 2p. To these eigenvalues belong the associated eigenfunctions

\[
\nu_\pm(\mathbf{r}, \phi) = c_n u_\pm(\phi - \phi_n) \exp[i \mathbf{k}_n \cdot \mathbf{r}]
+ c_n^* u_\pm^*(\phi - \phi_n) \exp[-i \mathbf{k}_n \cdot \mathbf{r}],
\tag{2.4}
\]

where k_n = {q cos φ_n, q sin φ_n} and (to lowest order in β) u_±(φ) = cos 2pφ or sin 2pφ, and the * operator is complex conjugation. These functions will be recognized as plane waves modulated by even or odd phase-shifted π-periodic functions cos[2p(φ − φ_n)] or sin[2p(φ − φ_n)]. The relation between the eigenvalues λ_± and the wave numbers p and q given in equation 2.3 is called a dispersion relation. In case λ_± = 0 at μ = μ_0, equation 2.3 can be rewritten as

\[
\mu_c = \frac{\alpha}{\sigma_1 \left[ W_{LOC}(p) + \beta \{ W_{LAT}(0, q) + W_{LAT}(2p, q) \} \right]}
\tag{2.5}
\]
Figure 4: Dispersion curves showing the marginal stability of the homogeneous mode of V1. Solid lines correspond to odd eigenfunctions, dashed lines to even ones. If V1 is in the Hubel-Wiesel mode (p_c = 0), eigenfunctions in the form of unmodulated plane waves or roll patterns can appear first at the wave numbers (a) q_c = 0 and (b) q_c = 1. If V1 is in the coupled-ring mode (p_c = 1), odd modulated eigenfunctions can form first at the wave numbers (c) q_c = 0 and (d) q_c = 1.
and gives rise to the marginal stability curves plotted in Figure 4. Examination of these curves indicates that the homogeneous state a(r, φ) = 0 can lose its stability in four differing ways, as shown in Table 1.

Table 1: Instabilities of V1.

Wave Numbers         Local Interactions   Lateral Interactions   Eigenfunction
p_c = q_c = 0        Excitatory           Excitatory             c_n
p_c = 0, q_c ≠ 0     Excitatory           Mexican hat            c_n cos[k_n · r]
p_c ≠ 0, q_c = 0     Mexican hat          Excitatory             c_n u_±(φ − φ_n) + c_n^* u_±^*(φ − φ_n)
p_c ≠ 0, q_c ≠ 0     Mexican hat          Mexican hat            c_n u_±(φ − φ_n) cos[k_n · r]
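The dispersion relation 2.3 can be explored numerically. The sketch below (Python; the parameter values σ_1, σ_2, α, β and the sigmoid slope are illustrative choices, not taken from the text) uses the Mexican hat of equation 2.2 for both weight functions, evaluates W_LAT(p, q) by quadrature with an integral representation of the Bessel function, and checks the self-consistency of equations 2.3 and 2.5: at μ = μ_c the corresponding eigenvalue vanishes.

```python
import math

S1, S2 = 1.0, 2.0                    # Mexican hat widths, sigma1 < sigma2 (arbitrary)
ALPHA, SLOPE, BETA = 1.0, 1.0, 0.1   # decay rate, sigmoid slope sigma_1, lateral weight beta

def w_hat(x):
    """Mexican hat weight, eq. 2.2: short-range excitation, longer-range inhibition."""
    return (math.exp(-x * x / (2 * S1 * S1)) / S1
            - math.exp(-x * x / (2 * S2 * S2)) / S2)

def W_loc(p):
    """Fourier modes of the local weights: exp(-s1^2 p^2) - exp(-s2^2 p^2)."""
    return math.exp(-(S1 * p) ** 2) - math.exp(-(S2 * p) ** 2)

def bessel_j(n, z, steps=300):
    """J_n(z) via the integral representation (1/pi) * int_0^pi cos(n*t - z*sin t) dt."""
    h = math.pi / steps
    return sum(math.cos(n * ((k + 0.5) * h) - z * math.sin((k + 0.5) * h))
               for k in range(steps)) * h / math.pi

def W_lat(p, q, smax=12.0, steps=800):
    """W_LAT(p, q) = (-1)^p * int_0^inf w_lat(s) J_{2p}(q s) ds, truncated at smax."""
    h = smax / steps
    total = sum(w_hat((k + 0.5) * h) * bessel_j(2 * p, q * (k + 0.5) * h)
                for k in range(steps))
    return ((-1) ** p) * total * h

def lam(sign, p, q, mu):
    """Eigenvalues lambda_+/-(p, q) of eq. 2.3."""
    return -ALPHA + mu * SLOPE * (W_loc(p)
                                  + BETA * (W_lat(0, q) + sign * W_lat(2 * p, q)))

p, q = 1, 1.0
mu_c = ALPHA / (SLOPE * (W_loc(p) + BETA * (W_lat(0, q) + W_lat(2 * p, q))))  # eq. 2.5
print(lam(+1, p, q, mu_c))   # ~0: this mode is marginally stable at mu = mu_c
```

The split between λ_+ and λ_−, which is 2μσ_1β W_LAT(2p, q), is what separates the even (dashed) from the odd (solid) branches in Figure 4; it vanishes with the lateral coupling β.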
We call responses with p_c = 0 (rows 1 and 2) the Hubel-Wiesel mode of V1 operation. In such a mode, any orientation tuning must be extrinsic to V1, generated, for example, by local anisotropies of the geniculo-cortical map (Hubel & Wiesel, 1962). We call responses with p_c ≠ 0 (rows 3 and 4) the coupled ring mode of V1 operation (Somers, Nelson, & Sur, 1996; Hansel & Sompolinsky, 1997; Mundel, Dimitrov, & Cowan, 1997). In both cases, the model reduces to a system of coupled hypercolumns, each modeled as a ring of interacting iso-orientation patches with local excitation and inhibition. Even if the lateral interactions are weak (β ≪ 1), the orientation preferences become spatially organized in a pattern given by some combination of the eigenfunctions ν_±(r, φ).

3 Planforms and V1

It remains to compute the actual patterns of V1 activity that develop when the uniform state loses stability. These patterns will be linear combinations of the eigenfunctions described above, which we call planforms. To compute them, we find those planforms that are left invariant by the axial subgroups of E(2) under the shift-twist action. An axial subgroup is a restricted set of the symmetry operators of a group, whose action leaves invariant a one-dimensional vector subspace containing essentially one planform. We limit these planforms to regular tilings of the plane (Golubitsky, Stewart, & Schaeffer, 1988), by restricting our computation to doubly periodic planforms. Given such a restriction, there are only a finite number of shift-twists to consider (modulo an arbitrary rotation of the whole plane), and the axial subgroups and their planforms have either rhombic, square, or hexagonal symmetry. To determine which planforms are stabilized by the nonlinearities of the system, we use Lyapunov-Schmidt reduction (Golubitsky et al., 1988) and Poincaré-Lindstedt perturbation theory (Walgraef, 1997) to reduce the dynamics to a set of nonlinear equations for the amplitudes c_n in equation 2.4.
Euclidean symmetry restricts the structure of the amplitude equations to the form (on rhombic and square lattices)

\[
\frac{d c_n}{dt} = c_n \Big[ (\mu - \mu_0) - \gamma_0 |c_n|^2 - 2 \sum_{m \neq n} \gamma_{nm} |c_m|^2 \Big],
\tag{3.1}
\]

where γ_nm depends on Δφ, the angle between the two eigenvectors of the lattice. We find that all rhombic planforms with 30° ≤ Δφ ≤ 60° are stable. Similar equations obtain for hexagonal lattices. In general, all such planforms on rhombic and hexagonal lattices are stable (in appropriate parameter regimes) with respect to perturbations with the same lattice structure. However, square planforms on square lattices are unstable and give way to simple roll planforms comprising one eigenfunction.
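The behavior of the amplitude equations 3.1 can be illustrated by direct integration. The sketch below (Python; the coefficients are illustrative choices, with γ_0 > 2γ_nm so that the mixed rhombic state is the stable one) integrates two coupled real amplitudes by forward Euler and shows them settling onto the equal-amplitude fixed point |c|² = (μ − μ_0)/(γ_0 + 2γ_nm).

```python
# Forward-Euler integration of the amplitude equations (eq. 3.1) for two real
# modes on a rhombic lattice. All coefficient values are illustrative.
mu_minus_mu0 = 0.2
gamma0, gamma12 = 1.0, 0.25   # gamma0 > 2*gamma12: the mixed (rhombic) state is stable

c = [0.06, 0.05]              # small, slightly asymmetric initial amplitudes
dt = 0.01
for _ in range(200_000):
    d0 = c[0] * (mu_minus_mu0 - gamma0 * c[0] ** 2 - 2 * gamma12 * c[1] ** 2)
    d1 = c[1] * (mu_minus_mu0 - gamma0 * c[1] ** 2 - 2 * gamma12 * c[0] ** 2)
    c = [c[0] + dt * d0, c[1] + dt * d1]

fixed = (mu_minus_mu0 / (gamma0 + 2 * gamma12)) ** 0.5
print(c, fixed)               # both amplitudes approach the rhombic fixed point
```

Reversing the inequality (2γ_nm > γ_0) destabilizes the mixed state and the system falls onto a single-mode roll, mirroring the statement above that square planforms give way to rolls.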
Figure 5: V1 planforms corresponding to some axial subgroups. (a, b) Roll and hexagonal patterns generated in V1 when it is in the Hubel-Wiesel operating mode. (c, d) Honeycomb and square patterns generated when V1 is in the coupled-ring operating mode.
3.1 Form Constants as Stable Planforms. Figure 5 shows some of the (mostly) stable planforms defined above in V1 coordinates and Figure 6 in visual field coordinates, generated by plotting at each point r a contour element at the orientation φ most strongly signaled at that point, that is, the φ for which a(r, φ) is largest. These planforms are either contoured or noncontoured and correspond closely to Klüver's form constants. In what we call the Hubel-Wiesel mode, interactions between iso-orientation patches in a single hypercolumn are weak and purely excitatory, so individual hypercolumns do not amplify any particular orientation. However, if the long-range interactions are stronger
Figure 6: The same planforms as shown in Figure 5 drawn in visual field coordinates. Note, however, that the feathering around the edges of the bright blobs in b is a fortuitous numerical artifact.
and effectively inhibitory, then plane waves of cortical activity can emerge, with no label for orientation preference. The resulting planforms are called noncontoured and correspond to the types I and II form constants, as originally proposed by Ermentrout and Cowan (1979). On the other hand, if coupled-ring-mode interactions between neighboring iso-orientation patches are strong enough so that even weakly biased activity can trigger a sharply tuned response, under the combined action of many interacting hypercolumns, plane waves labeled for orientation preference can emerge. The resulting planforms correspond to types III and IV form constants. These results suggest that the circuits in V1 that are normally involved in the detection of oriented edges and the formation of contours are also responsible
for the generation of the form constants. Since 20% of the (excitatory) lateral connections in layers 2 and 3 of V1 end on inhibitory interneurons (Hirsch & Gilbert, 1991), the overall action of the lateral connections can become inhibitory, especially at high levels of activity. The mathematical consequence of this inhibition is the selection of odd planforms that do not form continuous contours. We have shown that even planforms that do form continuous contours can be selected when the overall action of the lateral connection is inhibitory, if the model assumes deviation away from the visuotopic axis by at least 45 degrees in the pattern of lateral connections (Bressloff et al., 2001) (as opposed to our first model, in which connections between iso-orientation patches are oriented along the visuotopic axis). This might seem paradoxical given observations suggesting that there are at least two circuits in V1: one dealing with contrast edges, in which the relevant lateral connections have the anisotropy found by Bosking et al. (1997), and others that might be involved with the processing of textures, surfaces, and color contrast, which seem to have a much more isotropic lateral connectivity (Livingstone & Hubel, 1984). However, there are several intriguing possibilities that make the analysis more plausible. The first can occur if V1 operates in the Hubel-Wiesel mode (p_c = 0). In such an operating mode, if the lateral interactions are not as weak as we have assumed in our analysis, then even contoured planforms can form.
The second related possibility can occur if, at low levels of activity, V1 operates initially in the Hubel-Wiesel mode and the lateral and long-range interactions are all excitatory (p_c = q_c = 0), so that a bulk instability occurs if the homogeneous state becomes unstable. This can be followed by the emergence of patterned noncontoured planforms at the critical wavelength of about 2.67 mm when the level of activity rises and the longer-ranged inhibition is activated (p_c = 0, q_c ≠ 0), until V1 operates in the coupled-ring mode when the short-range inhibition is also activated (p_c ≠ 0, q_c ≠ 0). In such a case, even planforms can be selected by the anisotropic connectivity, since the degeneracy has already been broken by the noncontoured planforms. A third possibility is related to an important gap in our current model: we have not properly incorporated the observation that the distribution of orientation preferences in V1 is not spatially homogeneous. In fact, the well-known orientation "pinwheel" singularities originally found by Blasdel and Salama (1986) turn out to be regions in which the scatter or rate of change |Δφ/Δr| of differing orientation preference labels is high (about 20 degrees), whereas in linear regions it is only about 10 degrees (Maldonado & Gray, 1996). Such a difference leads to a second model circuit (centered on the orientation pinwheels) with a lateral patchy connectivity that is effectively isotropic (Yan, Dimitrov, & Cowan, 2001) and therefore consistent with observations of the connectivity of pinwheel regions (Livingstone & Hubel, 1984). In such a case, even planforms are selected in the coupled-ring mode. Types I and II noncontoured form constants could also arise from a "filling-in" process similar to that embodied in the retinex algorithm of Land and McCann (1971) following the generation of types III or IV form constants in the coupled-ring mode.

Figure 7: (a) Lattice tunnel hallucination seen following the taking of marijuana (redrawn from Siegel & Jarvik, 1975), with the permission of Alan D. Iselin. (b) Simulation of the lattice tunnel generated by an even hexagonal roll pattern on a square lattice.

The model allows for oscillating or propagating waves, as well as stationary patterns, to arise from a bulk instability. Such waves are in fact observed: many subjects who have taken LSD and similar hallucinogens report seeing bright white light at the center of the visual field, which then "explodes" into a hallucinatory image (Siegel, 1977) in at most 3 seconds, corresponding to a propagation velocity in V1 of about 2.5 cm per second, suggestive of slowly moving epileptiform activity (Senseman, 1999). In this respect, it is worth noting that in the continuum approximation we use in this study, both planforms arising when the long-range interactions are purely excitatory correspond to the perception of a uniform bright white light. Finally, it should be emphasized that many variants of the Klüver form constants have been described, some of which cannot be understood in terms of the model we have introduced. For example, the lattice tunnel shown in Figure 7a is more complicated than any of the simple form constants shown earlier. One intriguing possibility is that this image is generated as a result of a mismatch between the corresponding planform and the underlying structure of V1. We have (implicitly) assumed that V1 has patchy connections that endow it with lattice properties. It is clear from the optical image data (Blasdel & Salama, 1986; Bosking et al., 1997) that the cortical lattice is somewhat disordered. Thus, one might expect some distortions to occur when planforms are spontaneously generated in such
a lattice. Figure 7b shows a computation of the appearance in the visual field of a roll pattern on a hexagonal lattice represented on a square lattice, so that there is a slight incommensurability between the two. The resulting pattern matches the hallucinatory image shown in Figure 7a quite well.

4 Discussion

This work extends previous work on the cortical mechanisms underlying visual hallucinations (Ermentrout & Cowan, 1979) by explicitly considering cells with differing orientation preferences and by considering the action of the retino-cortical map on oriented edges and local contours. This is carried out by the tangent map associated with the complex logarithm, one consequence of which is that φ, the V1 label for orientation preference, is not exactly equal to orientation preference in the visual field, φ_R, but differs from it by the angle θ_R, the polar angle of receptive field position. It follows that elements tuned to the same angle φ should be connected along lines at that angle in V1. This is consistent with the observations of Blasdel and Sincich (personal communication, 1999) and Bosking et al. (1997) and with one of Mitchison and Crick's (1982) hypotheses on the lateral connectivity of V1. Note that near the vertical meridian (where most of the observations have been made), changes in φ closely approximate changes in φ_R. However, such changes should be relatively large and detectable with optical imaging near the horizontal meridian. Another major feature outlined in this article is the presumed Euclidean symmetry of V1. Many systems exhibit Euclidean symmetry. What is novel here is the way in which such a symmetry is generated: by a shift {r, φ} → {r + s, φ} followed by a twist {r, φ} → {R_θ[r], φ + θ}, as well as by the usual translations and reflections. It is the twist φ → φ + θ that is novel and is required to match the observations of Bosking et al. (1997).
In this respect, it is interesting that Zweck and Williams (2001) recently introduced a set of basis functions with the same shift-twist symmetry as part of an algorithm to implement contour completion. Their reason for doing so is to bind sparsely distributed receptive fields together functionally, so as to perform Euclidean invariant computations. It remains to explain the precise relationship between the Euclidean invariant circuits we have introduced here and Euclidean invariant receptive field models. Finally, we note that our analysis indicates the possibility that V1 can operate in two different dynamical modes, the Hubel-Wiesel mode or the coupled-ring mode, depending on the levels of excitability of its various excitatory and inhibitory cell populations. We have shown that the Hubel-Wiesel mode can spontaneously generate noncontoured planforms and the coupled-ring mode contoured ones. Such planforms are seen as hallucinatory images in the visual field, and their geometry is a direct consequence of the architecture of V1. We conjecture that the ability of V1 to process edges,
contours, surfaces, and textures is closely related to the existence of these two modes.

Acknowledgments

We thank A. G. Dimitrov, T. Mundel, and G. Blasdel and the referees for many helpful discussions. The work was supported by the Leverhulme Trust (P.C.B.), the James S. McDonnell Foundation (J.D.C.), and the National Science Foundation (M.G.). P. C. B. thanks the Mathematics Department, University of Chicago, for its hospitality and support. M. G. thanks the Center for Biodynamics, Boston University, for its hospitality and support. J. D. C. and P. C. B. thank G. Hinton and the Gatsby Computational Neurosciences Unit, University College, London, for hospitality and support.

References

Blackmore, S. J. (1992). Beyond the body: An investigation of out-of-the-body experiences. Chicago: Academy of Chicago. Blasdel, G., & Salama, G. (1986). Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321(6070), 579–585. Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17, 2112–2127. Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. C. (2001). Geometric visual hallucinations, Euclidean symmetry, and the functional architecture of striate cortex. Phil. Trans. Roy. Soc. Lond. B, 356, 299–330. Clottes, J., & Lewis-Williams, D. (1998). The shamans of prehistory: Trance and magic in the painted caves. New York: Abrams. Collier, B. B. (1972). Ketamine and the conscious mind. Anaesthesia, 27, 120–134. Cowan, J. D. (1977). Some remarks on channel bandwidths for visual contrast detection. Neurosciences Research Program Bull., 15, 492–517. Cowan, J. D. (1997). Neurodynamics and brain mechanisms. In M. Ito, Y. Miyashita, & E. T. Rolls (Eds.), Cognition, computation, and consciousness (pp. 205–233). New York: Oxford University Press. Drasdo, N. (1977). The neural representation of visual space.
Nature, 266, 554–556. Dybowski, M. (1939). Conditions for the appearance of hypnagogic visions. Kwart. Psychol., 11, 68–94. Ermentrout, G. B. (1998). Neural networks as spatial pattern forming systems. Rep. Prog. Phys., 61, 353–430. Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cybernetics, 34, 137–150. Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133. Golubitsky, M., & Schaeffer, D. G. (1985). Singularities and groups in bifurcation theory I. Berlin: Springer-Verlag.
Golubitsky, M., Stewart, I., & Schaeffer, D. G. (1988). Singularities and groups in bifurcation theory II. Berlin: Springer-Verlag. Hansel, D., & Sompolinsky, H. (1997). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods of neuronal modeling (2nd ed., pp. 499–567). Cambridge, MA: MIT Press. Helmholtz, H. (1925). Physiological optics (Vol. 2). Rochester, NY: Optical Society of America. Hirsch, J. D., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11, 1800–1809. Horton, J. C., & Hedley-Whyte, E. T. (1984). Mapping of cytochrome oxidase patches and ocular dominance columns in human visual cortex. Phil. Trans. Roy. Soc. B, 304, 255–272. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (Lond.), 160, 106–154. Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol., 158, 267–294. Klüver, H. (1966). Mescal and mechanisms of hallucinations. Chicago: University of Chicago Press. Land, E., & McCann, J. (1971). Lightness and retinex theory. J. Opt. Soc. Am., 61(1), 1–11. Levitt, J. B., Lund, J., & Yoshioka, T. (1996). Anatomical substrates for early stages in cortical processing of visual information in the macaque monkey. Behavioral Brain Res., 76, 5–19. Livingstone, M. S., & Hubel, D. H. (1984). Specificity of intrinsic connections in primate primary visual cortex. J. Neurosci., 4, 2830–2835. Maldonado, P., & Gray, C. (1996). Heterogeneity in local distributions of orientation-selective neurons in the cat primary visual cortex. Visual Neuroscience, 13, 509–516. Mavromatis, A. (1987). Hypnagogia: The unique state of consciousness between wakefulness and sleep. London: Routledge & Kegan Paul. Mitchison, G., & Crick, F. (1982).
Long axons within the striate cortex: Their distribution, orientation, and patterns of connection. Proc. Nat. Acad. Sci. (USA), 79, 3661–3665.
Miyashita, Y. (1995). How the brain creates imagery: Projection to primary visual cortex. Science, 268, 1719–1720.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 887–893). Cambridge, MA: MIT Press.
Patterson, A. (1992). Rock art symbols of the greater southwest. Boulder, CO: Johnson Books.
Purkinje, J. E. (1918). Opera omnia (Vol. 1). Prague: Society of Czech Physicians.
Schwartz, E. (1977). Spatial mapping in the primate sensory projection: Analytic structure and relevance to projection. Biol. Cybernetics, 25, 181–194.
Senseman, D. M. (1999). Spatiotemporal structure of depolarization spread in cortical pyramidal cell populations evoked by diffuse retinal light flashes. Visual Neuroscience, 16, 65–79.
Geometric Visual Hallucinations
491
Sereno, M. I., Dale, A. M., Reppas, J. B., Kwong, K. K., Belliveau, J. W., Brady, T. J., Rosen, B. R., & Tootell, R. B. H. (1995). Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268, 889–893.
Siegel, R. K. (1977). Hallucinations. Scientific American, 237(4), 132–140.
Siegel, R. K., & Jarvik, M. E. (1975). Drug-induced hallucinations in animals and man. In R. K. Siegel & L. J. West (Eds.), Hallucinations: Behavior, experience and theory (pp. 81–161). New York: Wiley.
Smythies, J. R. (1960). The stroboscopic patterns: III. Further experiments and discussion. Brit. J. Psychol., 51(3), 247–255.
Somers, D., Nelson, S., & Sur, M. (1996). An emergent model of orientation selectivity in cat visual cortex simple cells. J. Neurosci., 15(8), 5448–5465.
Turing, A. M. (1952). The chemical basis of morphogenesis. Phil. Trans. Roy. Soc. Lond. B, 237, 37–72.
Tyler, C. W. (1978). Some new entoptic phenomena. Vision Res., 18, 1633–1639.
Tyler, C. W. (1982). Do grating stimuli interact with the hypercolumn spacing in the cortex? Suppl. Invest. Ophthalmol., 222, 254.
Walgraef, D. (1997). Spatio-temporal pattern formation. New York: Springer-Verlag.
Wiener, M. C. (1994). Hallucinations, symmetry, and the structure of primary visual cortex: A bifurcation theory approach. Unpublished doctoral dissertation, University of Chicago.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Yan, C.-P., Dimitrov, A., & Cowan, J. (2001). Spatial inhomogeneity in the visual cortex and the statistics of natural images. Unpublished manuscript.
Zubek, J. (1969). Sensory deprivation: Fifteen years of research. New York: Appleton-Century-Crofts.
Zweck, J. W., & Williams, L. R. (2001). Euclidean group invariant computation of stochastic completion fields using shiftable-twistable functions. Manuscript submitted for publication.
Received December 20, 2000; accepted April 26, 2001.
LETTER
Communicated by Bard Ermentrout
An Amplitude Equation Approach to Contextual Effects in Visual Cortex

Paul C. Bressloff
[email protected]
Department of Mathematics, University of Utah, Salt Lake City, Utah 84112, U.S.A.
Jack D. Cowan
[email protected]
Mathematics Department, University of Chicago, Chicago, IL 60637, U.S.A.
A mathematical theory of interacting hypercolumns in primary visual cortex (V1) is presented that incorporates details concerning the anisotropic nature of long-range lateral connections. Each hypercolumn is modeled as a ring of interacting excitatory and inhibitory neural populations with orientation preferences over the range 0 to 180 degrees. Analytical methods from bifurcation theory are used to derive nonlinear equations for the amplitude and phase of the population tuning curves in which the effective lateral interactions are linear in the amplitudes. These amplitude equations describe how mutual interactions between hypercolumns via lateral connections modify the response of each hypercolumn to modulated inputs from the lateral geniculate nucleus; such interactions form the basis of contextual effects. The coupled ring model is shown to reproduce a number of orientation-dependent and contrast-dependent features observed in center-surround experiments. A major prediction of the model is that the anisotropy in lateral connections results in a nonuniform modulatory effect of the surround that is correlated with the orientation of the center.

1 Introduction

The discovery that a majority of neurons in the visual or striate cortex of cats and primates (usually referred to as V1) respond selectively to the local orientation of visual contrast patterns (Hubel & Wiesel, 1962) initiated many studies of the precise circuitry underlying this property. Two cortical circuits have been fairly well characterized. There is a local circuit operating at subhypercolumn dimensions (<0.7 mm) in monkeys comprising strong orientation-specific recurrent excitation and weaker intrahypercolumnar inhibition (Michalski, Gerstein, Czarkowska, & Tarnecki, 1983; Hata, Tsumoto, Sato, Hagihara, & Tamura, 1988; Douglas, Koch, Mahowald, Martin, & Suarez, 1995). The other circuit operates between hypercolumns,

Neural Computation 14, 493–525 (2002)
© 2002 Massachusetts Institute of Technology
494
Paul C. Bressloff and Jack D. Cowan
connecting cells with similar orientation preferences separated by up to several millimeters of cortical tissue (Gilbert & Wiesel, 1989; Hirsch & Gilbert, 1992; Malach, Amir, Harel, & Grinvald, 1993; Fitzpatrick, Zhang, Schofield, & Muly, 1993; Grinvald, Lieke, Frostig, & Hildesheim, 1994; Yoshioka, Blasdel, Levitt, & Lund, 1996). The lateral connections that mediate this circuit arise almost exclusively from excitatory neurons (Rockland & Lund, 1983; Gilbert & Wiesel, 1983), although 20% terminate on inhibitory cells and can thus have significant inhibitory effects (McGuire, Gilbert, Rivlin, & Wiesel, 1991). Stimulation of a hypercolumn via lateral connections modulates rather than initiates spiking activity (Hirsch & Gilbert, 1992; Toth, Rao, Kim, Somers, & Sur, 1996). Thus, the lateral connectivity is ideally structured to provide local cortical processes with contextual information about the global nature of stimuli. As a consequence, these lateral connections have been invoked to explain a wide variety of context-dependent visual processing phenomena (Gilbert, Das, Ito, Kapadia, & Westheimer, 1996; Fitzpatrick, 2000). (Interestingly, contextual processing also occurs in extrastriate visual areas, and feedback from these areas to V1 has been invoked to explain aspects of contextual processing; Wenderoth & Johnstone, 1988.) One common experimental paradigm is to investigate the response to stimuli consisting of circular (center) and annular (surround) gratings of differing contrasts, orientations, and diameters.
When both center and surround are stimulated with strong suprathreshold inputs, one typically finds that the surround suppresses the center unit's tuning response for surround stimulation close to the orientation of the unit's peak tuning response and facilitates the response when the surround is stimulated at orientations sufficiently dissimilar to the preferred orientation of the center (Blakemore & Tobin, 1972; Li & Li, 1994; Sillito, Grieve, Jones, Cudeiro, & Davis, 1995). The center response to surround stimulation depends significantly on the contrast of center stimulation (Toth et al., 1996; Levitt & Lund, 1997; Polat, Mizobe, Pettet, Kasamatsu, & Norcia, 1998). For example, a fixed surround stimulus tends to facilitate responses to stimuli at a preferred orientation when the center contrast is low but suppresses responses when it is high. Both effects are strongest when center and surround stimuli are iso-oriented. The response to variations in the size of a stimulus is also contrast dependent. For example, responses to a high-contrast circular grating decline beyond a characteristic preferred stimulus size due to activation of an inhibitory surround. In addition, the effective size of the excitatory receptive field tends to increase with decreasing contrast, and at low contrasts becomes a monotonically increasing function of stimulus size (Jagadeesh & Ferster, 1990; Sceniak, Ringach, Hawken, & Shapley, 1999). One common approach to modeling the role of lateral connections in center-surround modulation is to consider a reduced local cortical circuit composed of excitatory and inhibitory populations receiving feedforward
Amplitude Equation Approach
495
inputs from the lateral geniculate nucleus (LGN), together with long-range intracortical inputs, some of which could provide feedback from higher cortical areas (Somers, Nelson, & Sur, 1994; Mundel, 1996; Mundel, Dimitrov, & Cowan, 1997; Somers et al., 1998; Li, 1999; Dragoi & Sur, 2000; Stetter, Bartsch, & Obermayer, 2000; Grossberg & Raizada, 2000). Inclusion of some form of contrast-related asymmetry between local excitatory and inhibitory neurons is then sufficient to account for the switch between low-contrast facilitation and high-contrast suppression (Somers et al., 1998; Stetter et al., 2000). Although these (mainly) computational studies have provided some insights into the possible role of lateral connections in mediating center-surround effects, a more general analytical framework has so far been lacking. In this article, we develop such a framework by considering the nonlinear dynamics of interacting hypercolumns. In particular, previous work on the dynamics of sharp orientation tuning in recurrent models of a cortical hypercolumn (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Somers, Nelson, & Sur, 1995; Vidyasagar, Pei, & Volgushev, 1996; Ben-Yishai, Hansel, & Sompolinsky, 1997; Mundel et al., 1997; Bressloff, Bressloff, & Cowan, 2000; Pugh, Ringach, Shapley, & Shelley, 2000) is extended to incorporate the effects of lateral connections between hypercolumns. A related computational approach has been initiated by McLaughlin, Shapley, Shelley, and Wielaard (2000). Our basic assumptions are as follows: (1) each active hypercolumn is close to a bifurcation point signaling the onset of sharp orientation tuning, and (2) the interactions between hypercolumns are weak. We first use analytical methods from bifurcation theory to derive nonlinear equations for the amplitude and phase of the tuning curves in which the effective interactions are linear in the amplitudes.
(Amplitude equations with linear interaction terms also arise in studies of weakly interacting neurons; Hoppensteadt & Izhikevich, 1997.) These amplitude equations describe how mutual interactions between hypercolumns via lateral connections modify the response of each hypercolumn to modulated inputs from the lateral geniculate nucleus; such interactions form the basis of contextual effects. We then show how our model reproduces a number of orientation-dependent and contrast-dependent features observed in center-surround experiments. One novel aspect of our model is that we incorporate the explicit anisotropy of lateral connections as observed in a number of optical imaging experiments (Yoshioka et al., 1996; Bosking, Zhang, Schofield, & Fitzpatrick, 1997). In recent studies of the global dynamics of V1, idealized as a continuous two-dimensional sheet of interacting hypercolumns, we showed that the patterns of connection exhibit a very interesting symmetry (Wiener, 1994; Cowan, 1997; Bressloff, Cowan, Golubitsky, Thomas, & Wiener, 2001a); they are invariant under the action of the planar Euclidean group E(2), the group of rigid motions in the plane (rotations, reflections, and translations). By virtue of the anisotropy of the lateral connections, they are also
invariant with respect to certain shifts and twists of the plane, such that a new E(2) group action is needed to represent its properties. This shift-twist symmetry plays an important role in determining which cortical activity patterns form through spontaneous symmetry breaking. Our previous results on the global dynamics of V1 can also be obtained by considering a continuum version of the amplitude equations for weakly interacting hypercolumns derived in this paper. This establishes that both the feedforward response properties of V1 and its intrinsic dynamical properties can be understood within the same analytical framework.
2 A Dynamical Model of V1 and Its Intrinsic Circuitry

Recent work using optical imaging (Blasdel & Salama, 1986; Blasdel, 1992) augments the early work of Hubel and Wiesel (1962) concerning the large-scale organization of iso-orientation patches in V1. It shows that approximately every 0.7 mm or so in monkey V1, there is an iso-orientation patch of a given preference φ. We therefore view the tangential structure of the cortex as a lattice of hypercolumns, where each hypercolumn comprises a continuum of iso-orientation patches that are designated by their preferred orientation φ ∈ [−90°, 90°]. This simplified anatomy is most easily motivated (though the analysis does not depend on this assumption in any crucial manner) if we assume a pinwheel two-dimensional architecture for V1 (Obermayer & Blasdel, 1993; Bonhoeffer, Kim, Malonek, Shoham, & Grinvald, 1995). The sharply orientation-tuned elements of a hypercolumn then form a ring of interacting neural populations. (A lattice model is also consistent with the observation that each location in the visual field is represented in V1, roughly speaking, by a hypercolumn-sized region containing all orientations.) In addition to this two-dimensional tangential architecture, V1 has a complex laminar vertical structure. Although this vertical organization has been studied in some detail, its functional characteristics with regard to orientation selectivity are largely unknown. The solution we adopt here is to collapse this vertical organization into the dynamical interaction between two neuronal components: one excitatory (E), representing pyramidal cells, and the other inhibitory (I). The latter lumps together a number of different types of interneuron, for example, interneurons with horizontally distributed axonal fields of the basket cell subtype and inhibitory interneurons that have predominantly vertically aligned axonal fields (see Figure 1). Suppose that there are N hypercolumns labeled i = 1, …, N.
Let a_l(r_i, φ, t) denote the average membrane potential or activity at time t of a population of cells of type l = E, I belonging to an iso-orientation patch of the ith hypercolumn, i = 1, …, N, where r_i denotes the cortical position of the hypercolumn (on some coarsely grained spatial scale) and φ ∈ [−90°, 90°] is the orientation preference of the patch. The activity variables a_l evolve according to
Figure 1: Local interacting neuronal populations. Py: excitatory population of pyramidal cells; Ba: spatially extended inhibitory population of basket cells; Ma: local inhibitory population comprising Martinotti, chandelier, and double bouquet cells.
the set of equations

$$\tau\,\frac{\partial a_l(\mathbf{r}_i,\phi,t)}{\partial t} = -a_l(\mathbf{r}_i,\phi,t) + h_l(\mathbf{r}_i,\phi) + \sum_{m=E,I}\sum_{j=1}^{N}\int_{-\pi/2}^{\pi/2} w_{lm}(\mathbf{r}_i,\phi\,|\,\mathbf{r}_j,\phi')\,\sigma_m\bigl(a_m(\mathbf{r}_j,\phi',t)\bigr)\,\frac{d\phi'}{\pi} \tag{2.1}$$

where τ is a decay time constant (set equal to unity), h_l(r_i, φ) is the input to the ith hypercolumn (in this study presumed to be from the LGN), and σ_l is taken to be a smooth output function,¹

$$\sigma_l(x) = \frac{1}{1 + e^{-\eta_l(x-\kappa_l)}} \tag{2.2}$$

¹ If σ_l is interpreted as the firing rate of a single neuron with a_l its membrane potential, then σ_l typically has a hard threshold, that is, σ_l(a_l) = 0 for a_l < κ_l. However, at the population level, such a firing-rate function can be smoothed out by either spontaneous activity or dispersion of membrane properties within the population (Wilson & Cowan, 1972). From a mathematical perspective, the assumption of smoothness is needed in order to carry out the bifurcation analysis of Section 3.
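As a concrete sketch, the smooth output function of equation 2.2 can be written directly; the gain and threshold values below (η = 4, κ = 0.5) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def sigma(x, eta=4.0, kappa=0.5):
    """Smooth sigmoidal population output, equation 2.2:
    sigma(x) = 1 / (1 + exp(-eta * (x - kappa)))."""
    return 1.0 / (1.0 + np.exp(-eta * (x - kappa)))

def hard_threshold(x, kappa=0.5):
    """Single-neuron firing rate with a hard threshold (cf. footnote 1)."""
    return np.where(x < kappa, 0.0, 1.0)

x = np.linspace(-1.0, 2.0, 7)
print(sigma(x))       # smooth and strictly increasing
print(sigma(0.5))     # equals 0.5 exactly at the threshold kappa
```

The smooth version is what makes the Taylor expansions of Section 3 possible; the hard-threshold version is the single-neuron limit mentioned in the footnote.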
where η_l determines the slope or sensitivity of the input-output characteristics of the population and κ_l is a threshold. The kernel w_lm(r_i, φ | r_j, φ′) represents the strength or weight of connections from the mth population of the iso-orientation patch φ′ at cortical position r_j to the lth population of the orientation patch φ at position r_i. Note that in this article, we do not specify how visual stimuli impinging on the retina are mapped to the cortex via the LGN. We simply assume that when there is an oriented stimulus in the (aggregate) classical receptive field of the ith hypercolumn, then the input h_l(r_i, φ) is maximal at φ = Φ_i, where Φ_i is the stimulus orientation. We also assume that the hypercolumns have nonoverlapping aggregate fields so that each probes a distinct region of visual space. Optical imaging combined with labeling techniques has generated considerable information concerning the pattern of connections both within and between hypercolumns (Blasdel & Salama, 1986; Blasdel, 1992; Malach et al., 1993; Yoshioka et al., 1996; Bosking et al., 1997). A particularly striking result concerns the intrinsic lateral connections in layers II, III, and (to some extent) V of V1. The axons of these connections make terminal arbors only every 0.7 mm or so along their tracks (Rockland & Lund, 1983; Gilbert & Wiesel, 1983), and they seem to connect mainly to cells with similar orientation preferences (Malach et al., 1993; Yoshioka et al., 1996; Bosking et al., 1997). In addition, there is a pronounced anisotropy of the pattern of such connections; its long axis runs parallel to a patch's preferred orientation (Gilbert & Wiesel, 1983; Bosking et al., 1997). Thus, differing iso-orientation patches connect to patches in neighboring hypercolumns in differing directions.

This contrasts with the pattern of connectivity within any one hypercolumn, which is much more isotropic; any given iso-orientation patch connects locally in all directions to all neighboring patches within a radius of less than 0.7 mm. Motivated by these observations concerning the intrinsic circuitry of V1, we decompose w in terms of local connections from elements within the same hypercolumn and patchy (excitatory) lateral connections from elements in other hypercolumns:

$$w_{lm}(\mathbf{r}_i,\phi\,|\,\mathbf{r}_j,\phi') = w_{lm}(\phi-\phi')\,\delta_{i,j} + \epsilon\,\beta_l\,\delta_{m,E}\,\hat{w}(\mathbf{r}_i,\phi\,|\,\mathbf{r}_j,\phi')\,(1-\delta_{i,j}) \tag{2.3}$$

where ε is a parameter that measures the weight of lateral relative to local connections. Observations by Hirsch and Gilbert (1992) suggest that ε is small and therefore that the lateral connections modulate rather than drive V1 activity. The relative strengths of the lateral inputs into local excitatory and inhibitory populations are represented by the factors β_l. (Recall that although the lateral connections are excitatory (Rockland & Lund, 1983; Gilbert & Wiesel, 1983), 20% of the connections in layers II and III of V1 end on inhibitory interneurons, so the overall action of the lateral connections can become inhibitory, especially at high levels of activity; Hirsch & Gilbert, 1992.) The contrast dependence of the factors β_l will be considered
in Section 4. Substituting equation 2.3 into equation 2.1 gives

$$\frac{\partial a_l(\mathbf{r}_i,\phi,t)}{\partial t} = -a_l(\mathbf{r}_i,\phi,t) + \sum_{m=E,I}\int_{-\pi/2}^{\pi/2} w_{lm}(\phi-\phi')\,\sigma_m\bigl(a_m(\mathbf{r}_i,\phi',t)\bigr)\,\frac{d\phi'}{\pi} + h_l(\mathbf{r}_i,\phi) + \epsilon\,\beta_l\sum_{j\neq i}\int_{-\pi/2}^{\pi/2}\hat{w}(\mathbf{r}_i,\phi\,|\,\mathbf{r}_j,\phi')\,\sigma_E\bigl(a_E(\mathbf{r}_j,\phi',t)\bigr)\,\frac{d\phi'}{\pi} \tag{2.4}$$

for i = 1, …, N. In order to incorporate information concerning the anisotropic nature of lateral connections, we further decompose ŵ according to²

$$\hat{w}(\mathbf{r}_i,\phi\,|\,\mathbf{r}_j,\phi') = \hat{w}(r_{ij})\,p_0(\phi'-\theta_{ij})\,p_1(\phi-\phi') \tag{2.5}$$
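The decomposition in equation 2.5 can be made concrete with a small numerical sketch. The functional forms below (wrapped-Gaussian choices for p_0 and p_1, an exponential footprint for ŵ(r_ij)) are hypothetical placeholders chosen only to satisfy the qualitative requirements stated in the text, not forms specified by the paper:

```python
import numpy as np

def wrap_pi(d):
    # wrap an orientation difference into [-pi/2, pi/2); orientation is pi-periodic
    return (d + np.pi / 2) % np.pi - np.pi / 2

def p(d, sigma):
    # hypothetical pi-periodic, even, monotonically decreasing spread function
    return np.exp(-wrap_pi(d) ** 2 / (2 * sigma ** 2))

def lateral_weight(ri, rj, phi, phi_p, sigma0=0.3, sigma1=0.3, r0=1.0):
    """Equation 2.5: w_hat(ri, phi | rj, phi') = w_hat(r_ij) p0(phi' - theta_ij) p1(phi - phi')."""
    dr = np.asarray(rj, float) - np.asarray(ri, float)
    r_ij = np.hypot(dr[0], dr[1])
    theta_ij = np.arctan2(dr[1], dr[0])   # direction of the visuotopic axis
    w_r = np.exp(-r_ij / r0)              # hypothetical distance footprint w_hat(r_ij)
    return w_r * p(phi_p - theta_ij, sigma0) * p(phi - phi_p, sigma1)

# coupling is strongest between iso-oriented patches lying along their common axis
w_aligned = lateral_weight([0, 0], [1, 0], 0.0, 0.0)        # orientations match the axis
w_oblique = lateral_weight([0, 0], [1, 0], 0.0, np.pi / 4)  # presynaptic patch misaligned
print(w_aligned > w_oblique)   # -> True
```

This reproduces the qualitative content of Figure 2: horizontal patches couple along a horizontal visuotopic axis, and misaligning either orientation weakens the weight.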
where r_j = r_i + r_ij(cos θ_ij, sin θ_ij). In the special case p_k(φ) = δ(φ), the distribution ŵ(r_ij) determines the weight of lateral connections between iso-orientation patches separated by a cortical distance r_ij along a visuotopic axis whose direction θ_ij is parallel to their (common) orientation preference (see Figure 2). The π-periodic functions p_k(φ) then determine the spread of the lateral connections with respect to the visuotopic axis for k = 0 (see Figure 3a) and orientation preference for k = 1 (see Figure 3b). We take p_k(φ) to be an even, monotonically decreasing function of φ over the domain −π/2 < φ < π/2 such that ∫_{−π/2}^{π/2} p_k(φ) dφ/π = 1.

3 Amplitude Equation for Interacting Hypercolumns

3.1 Sharp Orientation Tuning in a Local Cortical Circuit. In the absence of lateral connections (ε = 0), each hypercolumn is independently described by the ring model of orientation tuning (Somers et al., 1995; Ben-Yishai et al., 1995, 1997; Mundel et al., 1997; Bressloff et al., 2000). That is, equation 2.4 reduces to the pair of equations

$$\frac{\partial a_E}{\partial t} = -a_E + w_{EE}*\sigma_E(a_E) - w_{EI}*\sigma_I(a_I) + h_E \tag{3.1}$$

$$\frac{\partial a_I}{\partial t} = -a_I + w_{IE}*\sigma_E(a_E) - w_{II}*\sigma_I(a_I) + h_I \tag{3.2}$$

where ∗ indicates a convolution operation,

$$w * f(\mathbf{r}_i,\phi) = \int_{-\pi/2}^{\pi/2} w(\phi-\phi')\,f(\mathbf{r}_i,\phi')\,\frac{d\phi'}{\pi} \tag{3.3}$$
² The distribution of lateral connections given by equation 2.5 differs from the one used by Li (1999) in that there is no explicit cross-orientation inhibition.
Figure 2: Anisotropic lateral interactions connect iso-orientation patches located along a visuotopic axis parallel to their (common) orientation preference. Horizontal patches are connected along axis A, whereas oblique patches are connected along axis B.
for a hypercolumn at a fixed cortical position r_i. In the case of φ-independent external inputs, h_l(r_i, φ) = h̄_l(r_i), i = 1, …, N, there exists at least one fixed-point solution a_l(r_i, φ) = ā_l(r_i) of equations 3.1 and 3.2, which satisfies the algebraic equations

$$\bar{a}_E = W_{EE}(0)\,\sigma_E(\bar{a}_E) - W_{EI}(0)\,\sigma_I(\bar{a}_I) + \bar{h}_E \tag{3.4}$$

$$\bar{a}_I = W_{IE}(0)\,\sigma_E(\bar{a}_E) - W_{II}(0)\,\sigma_I(\bar{a}_I) + \bar{h}_I \tag{3.5}$$
Figure 3: Angular spread in the anisotropic lateral connections with respect to the (a) visuotopic axis and (b) orientation preference. Shaded orientation patches are connected.
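Equations 3.4 and 3.5 are a closed pair of algebraic equations and can be solved numerically by damped fixed-point iteration. In the sketch below, all weight, gain, and input values are invented for illustration and chosen so that the iteration map is a contraction; they are not parameters from the paper:

```python
import numpy as np

def sigma(x, eta=2.0, kappa=1.0):
    # smooth output function of equation 2.2 (illustrative gain/threshold)
    return 1.0 / (1.0 + np.exp(-eta * (x - kappa)))

# zeroth-order Fourier weights W_lm(0) and uniform inputs (illustrative values)
WEE, WEI, WIE, WII = 1.0, 0.5, 1.0, 0.25
hE, hI = 0.2, 0.1

aE, aI = 0.0, 0.0
for _ in range(200):   # damped iteration of equations 3.4-3.5
    fE = WEE * sigma(aE) - WEI * sigma(aI) + hE
    fI = WIE * sigma(aE) - WII * sigma(aI) + hI
    aE, aI = 0.5 * aE + 0.5 * fE, 0.5 * aI + 0.5 * fI

# residuals of the fixed-point equations should be essentially zero
rE = aE - (WEE * sigma(aE) - WEI * sigma(aI) + hE)
rI = aI - (WIE * sigma(aE) - WII * sigma(aI) + hI)
print(abs(rE) < 1e-8, abs(rI) < 1e-8)   # -> True True
```

With η = 2 the slope σ′ is at most η/4 = 0.5, which keeps the damped map contractive for these weights; larger gains or weights can push the map past this regime, which is exactly the loss of a unique stable fixed point discussed next in the text.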
where W_lm(0) = ∫_{−π/2}^{π/2} w_lm(φ) dφ/π. If h̄_l is sufficiently small relative to the threshold κ_l, then this fixed point is unique and stable for each r_i. Under the change of coordinates a_l → a_l − h̄_l, it can be seen that the effect of h̄_l is to shift the threshold by the amount −h̄_l. Thus, there are two ways to increase the excitability of the network and thus destabilize the fixed point: increasing the external input h̄_l or reducing the threshold κ_l. In the case of external visual stimuli, adaptation mechanisms tend to filter out the uniform part of the stimulus so that the contribution to h̄_l from the LGN is expected to be small. We assume, however, that various arousal and attentional mechanisms act so that neurons sit close to threshold and can respond to small spatial (and temporal) variations of h̄_l (Tsodyks & Sejnowski, 1995). An alternative mechanism for raising the excitability of the cortex is through the action of drugs on certain brain stem nuclei, which can induce the experience of seeing geometric visual hallucinations (Ermentrout & Cowan, 1979; Bressloff et al., 2001a; Bressloff, Cowan, Golubitsky, Thomas, & Wiener, 2001b). The local stability of (ā_E, ā_I) is found by linearization:

$$\frac{\partial b_E}{\partial t} = -b_E + \left[\sigma_E'(\bar{a}_E)\,w_{EE}*b_E - \sigma_I'(\bar{a}_I)\,w_{EI}*b_I\right] \tag{3.6}$$

$$\frac{\partial b_I}{\partial t} = -b_I + \left[\sigma_E'(\bar{a}_E)\,w_{IE}*b_E - \sigma_I'(\bar{a}_I)\,w_{II}*b_I\right] \tag{3.7}$$
where b_l(r_j, φ, t) = a_l(r_j, φ, t) − ā_l(r_j) and σ_l′(x) = dσ_l(x)/dx. Equations 3.6 and 3.7 have solutions of the form

$$b_l(\mathbf{r}_j,\phi,t) = B_l\,e^{\lambda t}\left[z(\mathbf{r}_j)\,e^{2in\phi} + \bar{z}(\mathbf{r}_j)\,e^{-2in\phi}\right] \tag{3.8}$$

where z(r_j) is an arbitrary (complex) amplitude. For each positive integer n, the eigenvalues λ and corresponding eigenvectors B = (B_E, B_I)^T satisfy the matrix equation

$$(1+\lambda)B = \begin{pmatrix} \sigma_E'(\bar{a}_E)\,W_{EE}(n) & -\sigma_I'(\bar{a}_I)\,W_{EI}(n) \\ \sigma_E'(\bar{a}_E)\,W_{IE}(n) & -\sigma_I'(\bar{a}_I)\,W_{II}(n) \end{pmatrix} B \tag{3.9}$$

where W_lm(n) is the nth Fourier coefficient in the expansion of the π-periodic weight kernels w_lm(φ),

$$w_{lm}(\phi) = W_{lm}(0) + 2\sum_{n=1}^{\infty} W_{lm}(n)\cos(2n\phi), \qquad l,m = E,I \tag{3.10}$$
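The convention in equation 3.10 can be checked numerically: since ∫_{−π/2}^{π/2} cos(2nφ)cos(2mφ) dφ/π = δ_nm/2 for n, m ≥ 1, projecting w_lm(φ) onto cos(2nφ) recovers W_lm(n) directly, while the plain average recovers W_lm(0). A sketch with arbitrary test coefficients:

```python
import numpy as np

W = {0: 0.4, 1: 0.9, 2: 0.3}   # arbitrary test coefficients W(n)
phi = np.linspace(-np.pi / 2, np.pi / 2, 20001)

# build the pi-periodic kernel from equation 3.10 (truncated at n = 2)
w = W[0] + 2 * sum(W[n] * np.cos(2 * n * phi) for n in (1, 2))

# project back: W(n) = integral of w(phi) cos(2n phi) dphi / pi
recovered = {n: np.trapz(w * np.cos(2 * n * phi), phi) / np.pi for n in (0, 1, 2)}
print(recovered)   # matches W up to quadrature error
```

The factor of 2 in the expansion is exactly what makes the forward and inverse formulas take the same simple form for n ≥ 1.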
and we assume that w_lm(φ) = w_lm(−φ). For the sake of illustration, suppose that σ_m′(ā_m) = μ, so that an increase in the excitability of the network can be modeled as an increase in μ. Equation 3.9 then has solutions of the form

$$\lambda_n^{\pm} = -1 + \mu W_n^{\pm} \tag{3.11}$$

for integer n, where

$$W_n^{\pm} = \frac{1}{2}\left[W_{EE}(n) - W_{II}(n) \pm S(n)\right] \tag{3.12}$$

are the eigenvalues of the weight matrix with

$$S(n) = \sqrt{\left[W_{EE}(n) + W_{II}(n)\right]^2 - 4\,W_{EI}(n)\,W_{IE}(n)} \tag{3.13}$$

The corresponding eigenvectors (up to an arbitrary normalization) are

$$B_n^{\pm} = \begin{pmatrix} W_{EI}(n) \\ \frac{1}{2}\left[W_{EE}(n) + W_{II}(n) \mp S(n)\right] \end{pmatrix} \tag{3.14}$$
Suppose that Re W_n^+ has a unique maximum as a function of n ≥ 0 at n = 1, with Im W_1^+ = 0, and define μ_c = 1/W_1^+. For μ < μ_c, we have λ_n^± < 0 for all n, and thus the homogeneous resting state is stable. However, as μ is steadily increased, the homogeneous state will destabilize at the critical point μ = μ_c due to excitation of eigenmodes of the form B_1^+[z(r_j)e^{2iφ} + z̄(r_j)e^{−2iφ}], where z(r_j) is an arbitrary complex function of r. It can be shown that the saturating nonlinearities of the system stabilize the tuning curves beyond the critical point μ_c (Ermentrout, 1998; Bressloff et al., 2000). Using the polar representation z(r) = Z(r)e^{2ipφ(r)}, the eigenmode can be rewritten as B_p^+ Z(r_j) cos(2p[φ − φ(r_j)]). Thus, the maximum (linear) response of the jth hypercolumn occurs at the orientations φ(r_j) + kπ/p, k = 0, 1, …, p − 1 for p > 0. In terms of the orientation tuning properties observed in real neurons, the most relevant cases are p = 0 and p = 1. The former corresponds to a bulk instability within a hypercolumn in which the new steady state exhibits no orientation preference. We call such a response the Hubel–Wiesel mode, since any orientation tuning must be extrinsic to V1, generated, for example, by local anisotropies of the geniculocortical map (Hubel & Wiesel, 1962). A bulk instability will occur when the local inhibition is sufficiently weak. In the second case, each hypercolumn supports an activity profile consisting of a solitary peak centered about the angle φ(r_j); that is, the population response is characterized by an orientation tuning curve. One mechanism that generates a p > 0 mode is if the local connections comprise short-range excitation and longer-range inhibition, which would be the case if the inhibitory neurons are identified with basket cells (I = Ba) whose lateral axonal spread has a space constant of about 250 μm.
However, it is also possible within a two-population model to generate orientation tuning in the presence of short-range inhibition, as would occur if the inhibitory neurons are instead identified with local interneurons (I = Ma), whose lateral axonal spread is about 20 μm (see Figure 1). Example spectra W_n^+ for both of these cases are plotted in Figure 4 for gaussian coefficients

$$W_{lm}(n) = \sqrt{2\pi}\,\xi_{lm}\,\alpha_{lm}\,e^{-n^2\xi_{lm}^2/2} \tag{3.15}$$
Figure 4: (a) Distribution of eigenvalues W_n^+ as a function of n in the case of long-range inhibition. Parameter values are ξ_EE = ξ_IE = 15°, ξ_II = ξ_EI = 60°, α_EE = 1, α_EI α_IE = 0.3, α_II = 0. (b) Distribution of eigenvalues W_n^+ as a function of n in the case of short-range inhibition. Parameter values are ξ_EE = 35°, ξ_IE = 40°, ξ_II = ξ_EI = 5°, α_EE = 0.5, α_EI α_IE = 0.5, α_II = 1. Solid lines show the real part and dashed lines the imaginary part.
where ξ_lm determine the range of the axonal fields of the excitatory and inhibitory populations. For both examples, it can be seen that Re W_1^+ > Re W_n^+ for all n ≠ 1, so that the n = 1 mode becomes marginally stable first, leading to the spontaneous formation of sharp orientation tuning curves. (Note that if Im W_1^+ ≠ 0 at the critical point, then the homogeneous resting state bifurcates to an orientation tuning curve whose peak spontaneously either rotates as a traveling wave or pulsates as a standing wave at a frequency determined by Im λ_1^+ (Ben-Yishai et al., 1997; Bressloff et al., 2000). We consider the time-periodic case in Section 3.3.) The location of the peak φ*(r_j) of the tuning curve at r_j is arbitrary in the presence of φ-independent inputs h_l(r_j, φ) = h̄_l(r_j). However, the inclusion of an additional small-amplitude input Δh_l(r_j, φ) ∝ cos[2(φ − Φ_j)] breaks the rotational invariance of the system and locks the location of the tuning curve to the orientation corresponding to the peak of the stimulus, that is, φ*(r_j) = Φ_j. This is illustrated in Figure 5, where the input and output
Figure 5: Sharp orientation tuning curve in a single hypercolumn. Local recurrent excitation and inhibition amplifies a weakly modulated input from the LGN. Dotted line is the baseline output without orientation tuning.
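The locking behavior illustrated in Figure 5 can be reproduced with a crude discretization of the ring equations 3.1 through 3.3. Every parameter value below (kernel widths and amplitudes, gain, input contrast) is invented for illustration; the only property checked is that the steady-state peak locks to the stimulus orientation Φ:

```python
import numpy as np

N = 180
phi = np.linspace(-90.0, 90.0, N, endpoint=False)   # orientation grid (degrees)

def sigma(x, eta=1.0, kappa=1.0):
    return 1.0 / (1.0 + np.exp(-eta * (x - kappa)))

def kernel(width, amp):
    # pi-periodic Gaussian weight matrix; W @ f approximates the
    # convolution of equation 3.3, the 1/N absorbing dphi'/pi
    d = phi[:, None] - phi[None, :]
    d = (d + 90.0) % 180.0 - 90.0
    return amp * np.exp(-d ** 2 / (2 * width ** 2)) / N

W_EE, W_EI = kernel(20.0, 4.0), kernel(60.0, 2.0)   # narrow excitation
W_IE, W_II = kernel(20.0, 3.0), kernel(60.0, 1.0)   # broad inhibition

Phi = 30.0                                              # stimulus orientation
hE = 1.0 + 0.1 * np.cos(2 * np.deg2rad(phi - Phi))      # weakly modulated LGN input
hI = np.zeros(N)

aE, aI, dt = np.zeros(N), np.zeros(N), 0.1
for _ in range(2000):   # forward-Euler integration of equations 3.1-3.2
    dE = -aE + W_EE @ sigma(aE) - W_EI @ sigma(aI) + hE
    dI = -aI + W_IE @ sigma(aE) - W_II @ sigma(aI) + hI
    aE, aI = aE + dt * dE, aI + dt * dI

print(phi[np.argmax(aE)])   # steady-state peak sits at the stimulus orientation
```

Because the kernels are even and the input is symmetric about Φ, the steady state inherits that symmetry, so the peak cannot drift away from Φ; what the recurrent circuit adds, as in Figure 5, is amplification of the weak input modulation.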
of the excitatory population of a single hypercolumn are shown (see also Section 4.1). Thus, the local intracortical connections within a hypercolumn serve to amplify a weakly oriented input signal from the LGN (Somers et al., 1995; Ben-Yishai et al., 1997). The proportion of thalamocortical versus intracortical input in vivo and the tuning of the thalamocortical input are matters of ongoing debate. It is still not known for certain whether orientation selectivity is introduced in a feedforward manner to subsequent layers or arises from the intrinsic recurrent architecture of the upper layers of layer 4 or higher, or from some combination of these mechanisms. Recent work appears to confirm the original proposal of Hubel and Wiesel (1962) that geniculate input is sharply tuned in the cat (Ferster, Chung, & Wheat, 1997). On the other hand, a considerable body of work indicates that intracortical processes, including recurrent excitation and intracortical inhibition, are important in determining orientation selectivity (Blakemore & Tobin, 1972; Sillito, 1975; Douglas et al., 1995; Somers et al., 1995). In primates such as the macaque, most layer IVC cells receiving direct thalamocortical input are unoriented, and thus the introduction of orientation asymmetry is clearly a cortical process. If orientation information is to be represented cortically and kept "on-line" for any period of time, then specific intracortical circuitry not too dissimilar from the tuning circuitry of the primate (e.g., local recurrent excitation and inhibition) is necessary to stably represent the sharply tuned input. The work of Phleger and Bonds (1995) demonstrating a loss of stable orientation tuning in cats with blocking of intracortical inhibition supports this view.

3.2 Cubic Amplitude Equation: Stationary Case. So far we have established that in the absence of lateral connections, each hypercolumn (labeled
by cortical position $r_j$) can exhibit sharp orientation tuning when in a sufficiently excited state. The peak of the tuning curve is fixed by inputs from the LGN signaling the presence of a local oriented bar in the classical receptive field (CRF) of the given hypercolumn. We wish to investigate how such activity is further modulated by stimuli outside the CRF due to the presence of anisotropic lateral connections between hypercolumns leading to contextual effects. Our approach will be to exploit the fact that the lateral connections are weak relative to local circuit connections and to use bifurcation analysis to derive dynamical equations for the amplitudes of the excited modes close to the bifurcation point. These equations then allow us to explore the effects of local and lateral inputs on sharp orientation tuning.

For simplicity, we assume that each hypercolumn can be in one of two states of activation distinguished by the index $\chi(r_i) = 0, 1$, where $r_i$ is the cortical position of the $i$th hypercolumn. If $\chi(r_i) = 0$, then the neurons within the hypercolumn are well below threshold, so that there is only spontaneous activity, $s_l \approx 0$. On the other hand, if $\chi(r_i) = 1$, then the hypercolumn exhibits sharp orientation tuning with mean levels of output activity $s_l(\bar{a}_l(r_i))$, where $\bar{a}_l(r_i)$ is a fixed-point solution of equations 3.4 and 3.5 close to the bifurcation point. Any hypercolumn in the active state is taken to have the same mean output activity. That is, if we denote the set of active hypercolumns by $J = \{i \in 1, \ldots, N;\ \chi(r_i) = 1\}$, then $\bar{a}_l(r_i) = \bar{a}_l$ for all $i \in J$, and we set $s_l(\bar{a}_l) = \bar{s}_l$.

First, perform a Taylor expansion of equation 2.4 with respect to $b_l(r_i, \phi, t) = a_l(r_i, \phi, t) - \bar{a}_l$, $i \in J$:

$$\frac{\partial b_l}{\partial t} = -b_l + \sum_{m=E,I} w_{lm} * \left[\mu b_m + \gamma_m b_m^2 + \gamma'_m b_m^3 + \cdots\right] + \Delta h_l + \varepsilon \beta_l \left(\hat{w} \circ \left[\bar{s}_E + \mu b_E + \cdots\right]\chi\right), \tag{3.16}$$

where $\Delta h_l = h_l - \bar{h}_l$ and $\mu = s'_l(\bar{a}_l)$, $\gamma_l = s''(\bar{a}_l)/2$, $\gamma'_l = s'''(\bar{a}_l)/6$. The convolution operation $*$ is defined by equation 3.3, and

$$[\hat{w} \circ f](r_i, \phi) = \sum_{j \in J,\ j \neq i} \hat{w}(r_{ij}) \int_{-\pi/2}^{\pi/2} p_0(\phi' - \theta_{ij})\, p_1(\phi - \phi')\, f(r_j, \phi')\, \frac{d\phi'}{\pi} \tag{3.17}$$
for an arbitrary function $f(r, \phi)$, and $\hat{w}$ given by equation 2.5. Suppose that the system is $\varepsilon$-close to the point of marginal stability of the homogeneous fixed point associated with excitation of the modes $e^{\pm 2i\phi}$. That is, take $\mu = \mu_c + \varepsilon \Delta\mu$, where $\mu_c = 1/W_1^+$ (see equation 3.11). Substitute into equation 3.16
Paul C. Bressloff and Jack D. Cowan
the perturbation expansion

$$b_m = \varepsilon^{1/2} b_m^{(1)} + \varepsilon b_m^{(2)} + \varepsilon^{3/2} b_m^{(3)} + \cdots. \tag{3.18}$$

Finally, introduce a slow timescale $\tau = \varepsilon t$ and collect terms with equal powers of $\varepsilon$. This leads to a hierarchy of equations of the form (up to $O(\varepsilon^{3/2})$):

$$[\mathcal{L} b^{(1)}]_l = 0, \tag{3.19}$$

$$[\mathcal{L} b^{(2)}]_l = v_l^{(2)} \equiv \sum_{m=E,I} \gamma_m\, w_{lm} * [b_m^{(1)}]^2 + \beta_l \bar{s}_E\, \hat{w} \circ \chi, \tag{3.20}$$

$$[\mathcal{L} b^{(3)}]_l = v_l^{(3)} \equiv -\frac{\partial b_l^{(1)}}{\partial \tau} + \sum_{m=E,I} w_{lm} * \left[\Delta\mu\, b_m^{(1)} + \gamma'_m [b_m^{(1)}]^3 + 2\gamma_m b_m^{(1)} b_m^{(2)}\right] + \Delta h_l + \mu_c \beta_l\, \hat{w} \circ \left(b_E^{(1)} \chi\right), \tag{3.21}$$

with the linear operator $\mathcal{L}$ defined according to

$$[\mathcal{L} b]_l = b_l - \mu_c \sum_{m=E,I} w_{lm} * b_m. \tag{3.22}$$
We have also assumed that the modulatory external input is $O(\varepsilon^{3/2})$ and rescaled $\Delta h_l \to \varepsilon^{3/2} \Delta h_l$. Recall that each active hypercolumn is assumed to exhibit sharp orientation tuning, so that the local connections are such that the first equation in the hierarchy, equation 3.19, has solutions of the form

$$b^{(1)}(r_i, \phi, \tau) = \left(z(r_i, \tau)\, e^{2i\phi} + \bar{z}(r_i, \tau)\, e^{-2i\phi}\right) B \tag{3.23}$$

for all $i \in J$, with $B \equiv B_1^+$ defined in equation 3.14, and $\bar{z}$ denotes the complex conjugate of $z$. We obtain a dynamical equation for the complex amplitude $z(r_i, \tau)$ by deriving solvability conditions for the higher-order equations. We proceed by taking the inner product of equations 3.20 and 3.21 with the dual eigenmode $\tilde{b}(\phi) = e^{2i\phi} \tilde{B}$, where

$$\tilde{B} = \begin{pmatrix} W_{IE}(1) \\[4pt] -\tfrac{1}{2}\left[W_{EE}(1) + W_{II}(1) - \Sigma(1)\right] \end{pmatrix}, \tag{3.24}$$

so that

$$[\mathcal{L}^T \tilde{b}]_l \equiv \tilde{b}_l - \mu_c \sum_{m=E,I} w_{ml} * \tilde{b}_m = 0.$$
The inner product of any two vector-valued functions of $\phi$ is defined as

$$\langle u | v \rangle = \int_0^{\pi} \left[\bar{u}_E(\phi) v_E(\phi) + \bar{u}_I(\phi) v_I(\phi)\right] \frac{d\phi}{\pi}. \tag{3.25}$$

With respect to this inner product, the linear operator $\mathcal{L}$ satisfies $\langle \tilde{b} | \mathcal{L} b \rangle = \langle \mathcal{L}^T \tilde{b} | b \rangle = 0$ for any $b$. Since $\mathcal{L} b^{(p)} = v^{(p)}$, we obtain a hierarchy of solvability conditions $\langle \tilde{b} | v^{(p)} \rangle = 0$ for $p = 2, 3, \ldots$. It can be shown from equations 3.17, 3.20, and 3.23 that the first solvability condition requires that

$$\beta_l \bar{s}_E P_0^{(1)} P_1^{(1)} \sum_{j \in J} \hat{w}(r_{ij})\, e^{-2i\theta_{ij}} \leq O(\varepsilon^{1/2}), \tag{3.26}$$

where

$$P_k^{(n)} = \int_{-\pi/2}^{\pi/2} e^{-2ni\phi}\, p_k(\phi)\, \frac{d\phi}{\pi}. \tag{3.27}$$
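The coefficients $P_k^{(n)}$ are straightforward to evaluate numerically for a given orientation spread. A minimal sketch (the von Mises-like anisotropic spread used here is purely illustrative, not taken from the text): it checks that a uniform, isotropic spread gives $P^{(n)} = 0$ for all $n \geq 1$, while an anisotropic spread concentrated about $\phi = 0$ gives a nonzero real $P^{(1)}$.

```python
import numpy as np

def P(n, p):
    """P^(n) = int_{-pi/2}^{pi/2} e^{-2ni*phi} p(phi) dphi/pi  (midpoint rule)."""
    N = 4096
    phi = -np.pi / 2 + (np.arange(N) + 0.5) * np.pi / N
    # with the measure dphi/pi, the integral reduces to a mean over grid points
    return np.mean(np.exp(-2j * n * phi) * p(phi))

uniform = lambda phi: np.ones_like(phi)        # isotropic spread, normalized: P^(0) = 1
print(abs(P(1, uniform)), abs(P(2, uniform)))  # both vanish for isotropic connections

# Illustrative anisotropic spread concentrated around phi = 0 (hypothetical choice)
grid = -np.pi / 2 + (np.arange(4096) + 0.5) * np.pi / 4096
w = lambda phi: np.exp(2.0 * np.cos(2 * phi))
norm = np.mean(w(grid))
aniso = lambda phi: w(phi) / norm              # normalized so that P^(0) = 1
print(P(1, aniso).real)                        # nonzero: the lateral spread is anisotropic
```

The uniform case makes concrete the claim, used below, that isotropic lateral connections wipe out all $P_k^{(n)}$ with $n \geq 1$.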
This is analogous to the so-called adaptation condition in the bifurcation theory of weakly connected neurons (Hoppensteadt & Izhikevich, 1997). If it is violated, then at this level of approximation, the interactions between hypercolumns become trivial. There are two ways in which equation 3.26 can be satisfied: either the configuration of surrounding hypercolumns is such that $\sum_j \hat{w}(r_{ij}) e^{-2i\theta_{ij}} \leq O(\varepsilon^{1/2})$, or the mean firing rate $\bar{s}_E \leq O(\varepsilon^{1/2})$. Here we assume that $\bar{s}_E = O(\varepsilon^{1/2})$.

The solvability condition $\langle \tilde{b} | v^{(3)} \rangle = 0$ generates a cubic amplitude equation for $z(r_i, \tau)$. As a further simplification, we set $\gamma_m = 0$, since this does not alter the basic structure of the amplitude equation. (Note, however, that the coefficients of the amplitude equation are $\gamma_m$-dependent, and hence the stability properties of a pattern may change if $\gamma_m \neq 0$.) Using equations 3.17, 3.21, and 3.23, we then find that (after rescaling $\tau$)

$$\frac{\partial z(r_i, \tau)}{\partial \tau} = z(r_i, \tau)\left(\Delta\mu - A |z(r_i, \tau)|^2\right) + f(r_i) + \beta \sum_{j \in J} \hat{w}(r_{ij}) \left[z(r_j, \tau) + \bar{z}(r_j, \tau)\, P_0^{(2)} e^{-4i\theta_{ij}} + \sigma P_0^{(1)} e^{-2i\theta_{ij}}\right] \tag{3.28}$$

for all $i \in J$, where $\sigma = \bar{s}_E / (2\sqrt{\varepsilon}\, \mu_c B_E)$,

$$\beta = P_1^{(1)} \sum_{l=E,I} D_l \beta_l, \tag{3.29}$$

$$f(r_i) = \mu_c \sum_{l=E,I} \tilde{B}_l \int_0^{\pi} e^{-2i\phi}\, \Delta h_l(r_i, \phi)\, \frac{d\phi}{\pi}, \tag{3.30}$$
and

$$D_l = \frac{\mu_c^2 B_E \tilde{B}_l}{\tilde{B}^T B}, \qquad A = -\frac{3}{\tilde{B}^T B} \sum_{l=E,I} \tilde{B}_l \gamma'_l B_l^3. \tag{3.31}$$

Equation 3.28 is our reduced model of weakly interacting hypercolumns. It describes the effects of anisotropic lateral connections and modulatory inputs from the LGN on the dynamics of the (complex) amplitude $z(r_i, \tau)$. The latter determines the response properties of the orientation tuning curve associated with the hypercolumn at cortical position $r_i$. The coupling parameter $\beta$ is a linear combination of the relative strengths of the lateral connections innervating excitatory neurons and those innervating inhibitory neurons, with $D_E$, $D_I$ determined by the local weight distribution. Since $D_E > 0$ and $D_I < 0$, we see that the effective interactions between hypercolumns have both an excitatory and an inhibitory component. (The factor $P_1^{(1)}$ appearing in equation 3.29, which arises from the spread in lateral connections with respect to orientation (see Figure 3b), generates a positive rescaling of the lateral interactions and can be absorbed into the parameters $D_E$ and $D_I$.) Note that in the case of isotropic lateral connections, $P_k^{(n)} = 0$ for all $n \geq 1$, so that $z$ decouples from $\bar{z}$ in the amplitude equation 3.28.

3.3 Cubic Amplitude Equation: Oscillatory Case. In our derivation of the amplitude equation, 3.28, we assumed that the local cortical circuit generates a stationary orientation tuning curve. However, as shown in Section 3.1, it is possible for a time-periodic tuning curve to occur when $\mathrm{Im}\, W_1^+ \neq 0$. Taylor expanding equation 2.4 as before leads to the hierarchy of equations 3.19 through 3.21, except that the linear operator $\mathcal{L} \to \mathcal{L}_t = \mathcal{L} + \partial/\partial t$. The lowest-order solution, equation 3.23, now takes the form

$$b^{(1)}(r_i, \phi, t, \tau) = \left[z_L(r_i, \tau)\, e^{i(\Omega_0 t - 2\phi)} + z_R(r_i, \tau)\, e^{i(\Omega_0 t + 2\phi)}\right] B + \mathrm{c.c.}, \tag{3.32}$$

where $z_L$ and $z_R$ represent the complex amplitudes for anticlockwise (L) and clockwise (R) rotating waves (around the ring of a single hypercolumn), and

$$\Omega_0 = \mu_c \sqrt{4 W_{EI}(1) W_{IE}(1) - \left[W_{EE}(1) + W_{II}(1)\right]^2}. \tag{3.33}$$
Introduce the generalized inner product

$$\langle u | v \rangle = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} \int_0^{\pi} \left[\bar{u}_E(\phi, t) v_E(\phi, t) + \bar{u}_I(\phi, t) v_I(\phi, t)\right] \frac{d\phi}{\pi}\, dt, \tag{3.34}$$

and the dual vectors $\tilde{b}_L = \tilde{B}\, e^{i(\Omega_0 t - 2\phi)}$, $\tilde{b}_R = \tilde{B}\, e^{i(\Omega_0 t + 2\phi)}$. Using the fact that $\langle \tilde{b}_L | \mathcal{L}_t b \rangle = \langle \tilde{b}_R | \mathcal{L}_t b \rangle = 0$ for arbitrary $b$, we obtain the pair of solvability conditions $\langle \tilde{b}_L | v^{(p)} \rangle = \langle \tilde{b}_R | v^{(p)} \rangle = 0$ for each $p \geq 2$.
The $p = 2$ solvability conditions are identically satisfied. The $p = 3$ solvability conditions then generate cubic amplitude equations for $z_L$, $z_R$ of the form

$$\frac{\partial z_L(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_L(r_i, \tau) \left[\Delta\mu - A |z_L(r_i, \tau)|^2 - 2A |z_R(r_i, \tau)|^2\right] + f_-(r_i) + \beta \sum_{j \in J} \hat{w}(r_{ij}) \left[z_L(r_j, \tau) + z_R(r_j, \tau)\, P_0^{(2)} e^{4i\theta_{ij}}\right] \tag{3.35}$$

and

$$\frac{\partial z_R(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_R(r_i, \tau) \left[\Delta\mu - A |z_R(r_i, \tau)|^2 - 2A |z_L(r_i, \tau)|^2\right] + f_+(r_i) + \beta \sum_{j \in J} \hat{w}(r_{ij}) \left[z_R(r_j, \tau) + z_L(r_j, \tau)\, P_0^{(2)} e^{-4i\theta_{ij}}\right], \tag{3.36}$$

where

$$f_\pm(r_i) = \lim_{T \to \infty} \frac{\mu_c}{T} \int_{-T/2}^{T/2} \int_0^{\pi} e^{-i(\Omega_0 t \pm 2\phi)} \sum_{l=E,I} \tilde{B}_l\, \Delta h_l(r_i, \phi, t)\, \frac{d\phi}{\pi}\, dt. \tag{3.37}$$
It can be seen that the amplitudes couple only to time-dependent inputs from the LGN.

4 Contextual Effects

In the previous section, we used perturbation techniques to reduce the original infinite-dimensional system of Wilson-Cowan equations (2.4) to a corresponding finite-dimensional system of amplitude equations (3.28). This is illustrated in Figure 6. The complex amplitude $z(r_i) = Z_i e^{-2i\phi_i}$ determines the linear response function $R_i(\phi)$ of the $i$th hypercolumn according to

$$R_i(\phi) = z(r_i)\, e^{2i\phi} + \bar{z}(r_i)\, e^{-2i\phi} = 2 Z_i \cos(2[\phi - \phi_i]), \tag{4.1}$$

and this in turn determines the population tuning curves of the excitatory and inhibitory populations. In this section, we use the amplitude equation, 3.28, to investigate (at least qualitatively) how contextual stimuli falling outside the classical receptive field of a hypercolumn modify its response to stimuli within the classical receptive field; such contextual effects are mediated by the lateral interactions between hypercolumns.

4.1 A Single Driven Hypercolumn. We begin by determining the response of a single hypercolumn to an LGN input in the absence of lateral interactions ($\beta = 0$) and under the assumption that the orientation tuning
Figure 6: Reduction from a dynamical model of interacting hypercolumns to a dynamical model of interacting population tuning curves.
curves and external stimuli are stationary. Let the input from the LGN to the $i$th active hypercolumn be of the form

$$\Delta h_l(r_i, \phi) = g_l C_i \cos(2[\phi - \Theta_i]), \tag{4.2}$$

where $C_i > 0$ is proportional to the contrast of the center stimulus, $\Theta_i$ is its orientation, and $g_l \geq 0$ determines the relative strengths of the feedforward input to the local excitatory and inhibitory populations. Such a stimulus represents an oriented edge or bar in the aggregate receptive field of the hypercolumn. It then follows from equation 3.30 that

$$f(r_i) = C_i e^{-2i\Theta_i}, \tag{4.3}$$

where we have set $\mu_c \sum_{l=E,I} \tilde{B}_l g_l = 1$ for convenience. This term is positive if $g_E \gg g_I$. Setting $\beta = 0$ in equation 3.28 and writing $z(r_i) = Z_i e^{-2i\phi_i}$, with $Z_i$ real, we obtain the following pair of equations:

$$\frac{dZ_i}{dt} = Z_i (\Delta\mu - A Z_i^2) + C_i \cos(2[\phi_i - \Theta_i]), \tag{4.4}$$

$$\frac{d\phi_i}{dt} = -\frac{C_i}{2 Z_i} \sin(2[\phi_i - \Theta_i]). \tag{4.5}$$
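The locking behavior described next can be checked by integrating this pair directly. A minimal sketch (forward Euler; $\Delta\mu = 0.1$, $A = 1$, $C = 1$ follow the Figure 8 parameter values, while the stimulus orientation and initial conditions are arbitrary illustrative choices):

```python
import math

# Delta-mu and A as in Figure 8; C is the contrast, Theta the stimulus orientation
dmu, A, C, Theta = 0.1, 1.0, 1.0, 0.3

Z, phi = 0.5, 1.0               # arbitrary initial amplitude and phase
dt = 0.01
for _ in range(100_000):        # forward Euler on equations 4.4 and 4.5
    dZ = Z * (dmu - A * Z**2) + C * math.cos(2 * (phi - Theta))
    dphi = -C / (2 * Z) * math.sin(2 * (phi - Theta))
    Z, phi = Z + dt * dZ, phi + dt * dphi

print(phi)                      # locks onto the stimulus orientation Theta = 0.3
print(Z * (dmu - A * Z**2) + C) # ~0: Z sits at the positive root of the cubic
```

The phase converges to the stimulus orientation while the amplitude settles on the positive root of $Z(\Delta\mu - AZ^2) + C = 0$, anticipating the fixed-point analysis below.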
Assuming that the coefficients $\Delta\mu > 0$, $A > 0$, these have a stable fixed-point solution $(\phi_i^*, Z_i^*)$ with $\phi_i^* = \Theta_i$ (independent of the contrast) and $Z_i^*$ the positive root of the cubic $Z_i(\Delta\mu - A Z_i^2) + C_i = 0$. Thus, the hypercolumn encodes the orientation $\Theta_i$ of a local bar in its aggregate receptive field by locking the peak of its tuning curve to $\Theta_i$ (see Figure 5). The amplitude of the tuning curve is determined by $Z_i^*$, which is an increasing function of contrast. Note that in the absence of an external stimulus ($C_i = 0$), the tuning curve is only marginally stable, since $\phi_i^*$ is arbitrary. This means that the phase will be susceptible to noise-induced diffusion. Recall from Section 3 that the amplitude of the LGN input was assumed to be $O(\varepsilon^{3/2})$, whereas the amplitude of the response is $O(\varepsilon^{1/2})$. Since $\varepsilon \ll 1$, the intrinsic circuitry of the hypercolumn amplifies the LGN input.

Now consider the oscillatory case. In order that the amplitudes $z_L$ and $z_R$ of equations 3.35 and 3.36 couple to an LGN input, the latter must be time dependent, with a Fourier component at the natural frequency of oscillation $\Omega_0$ of the hypercolumn. It is certainly reasonable to assume that external stimuli have a time-dependent part. For example, neurophysiological experiments often use flashing or drifting oriented gratings as external stimuli. (These are chosen to compensate for the spike-frequency adaptation of neurons in the cortex.) Time-dependent signals might also be generated in the LGN. Suppose that $C_i$ is the contrast of the relevant temporal frequency component and $\Theta_i$ is the orientation of the external stimulus. Neglecting lateral inputs, the amplitude equations 3.35 and 3.36 become

$$\frac{\partial z_L(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_L(r_i, \tau) \left[\Delta\mu - A |z_L(r_i, \tau)|^2 - 2A |z_R(r_i, \tau)|^2\right] + C_i e^{-2i\Theta_i} \tag{4.6}$$

and

$$\frac{\partial z_R(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_R(r_i, \tau) \left[\Delta\mu - A |z_R(r_i, \tau)|^2 - 2A |z_L(r_i, \tau)|^2\right] + C_i e^{2i\Theta_i}. \tag{4.7}$$

If $C_i = 0$, $\Delta\mu > 0$, and $A > 0$, there are three types of fixed-point solution: (1) anticlockwise rotating waves, $|z_L|^2 = \Delta\mu / A$, $z_R = 0$; (2) clockwise rotating waves, $|z_R|^2 = \Delta\mu / A$, $z_L = 0$; and (3) standing waves, $|z_R|^2 = |z_L|^2 = \Delta\mu / 3A$. The standing waves are unstable, and the traveling waves are marginally stable due to the existence of arbitrary phases (Kath, 1981; Aronson, Ermentrout, & Kopell, 1990). (Note, however, that it is possible to obtain stable standing waves when $\gamma_m \neq 0$ in equation 3.21.) If $C_i > 0$, the traveling wave solutions no longer exist, but standing wave solutions do, of the form $z_L = Z_i e^{-2i\Theta_i} e^{i\psi}$, $z_R = Z_i e^{2i\Theta_i} e^{i\psi}$, with $\tan\psi = -\Omega_0$ and $Z_i$ a positive root of the cubic

$$Z_i \left[\Delta\mu - 3A Z_i^2\right] + C_i \cos(\psi) = 0. \tag{4.8}$$
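The forced standing wave can be checked by integrating equations 4.6 and 4.7 from an asymmetric initial condition. A sketch (forward Euler; $\Omega_0$, the contrast, and the initial conditions are illustrative choices, with $\Delta\mu = 0.1$, $A = 1$ as before, and the contrast chosen large enough that the standing wave is stable):

```python
import cmath, math

dmu, A, Om, C, Theta = 0.1, 1.0, 1.0, 1.0, 0.0   # high contrast, natural frequency Omega_0 = 1

zL, zR = 0.5 + 0.0j, 0.1j                        # deliberately asymmetric start
dt = 0.01
for _ in range(100_000):                         # forward Euler on equations 4.6 and 4.7
    dzL = (1 + 1j * Om) * zL * (dmu - A * abs(zL)**2 - 2 * A * abs(zR)**2) + C * cmath.exp(-2j * Theta)
    dzR = (1 + 1j * Om) * zR * (dmu - A * abs(zR)**2 - 2 * A * abs(zL)**2) + C * cmath.exp(2j * Theta)
    zL, zR = zL + dt * dzL, zR + dt * dzR

print(abs(zL), abs(zR))          # equal amplitudes: a standing wave
print(math.tan(cmath.phase(zL))) # common phase psi satisfies tan(psi) = -Omega_0
```

The iterate settles on $|z_L| = |z_R|$ with common phase $\psi = \arctan(-\Omega_0)$, and the amplitude agrees with the positive root of the cubic in equation 4.8.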
Equation 3.32 then implies that the tuning curves are of the form $4 Z_i \cos(\Omega_0 t + \psi) \cos(2[\phi - \Theta_i])$. It can be shown that these solutions are unstable for sufficiently low contrasts but stable for high contrasts. The existence of a stable, time-periodic tuning curve in the case of a single isolated hypercolumn means that it effectively acts as a single giant oscillator. One could then investigate, for example, the synchronization properties of a network of hypercolumns coupled via anisotropic lateral interactions. However, it is not clear how to interpret these oscillations biologically. One possibility is that they correspond to the 40 Hz $\gamma$-frequency oscillations that are thought to play a role in the synchronization of cell assemblies (Gray, Konig, Engel, & Singer, 1989; Singer & Gray, 1995). One difficulty with this interpretation is that the $\gamma$ oscillations appear in the spikes of individual neurons, and thus the mechanism for generating them is likely to be washed out in any mean-field analysis used to derive the rate models considered in this article. On the other hand, recent work on integrate-and-fire networks has shown that time-dependent firing rates can be viewed as modulations of single-neuron spikes (Bressloff et al., 2000). Given the difficulties concerning the significance of oscillatory tuning curves, we assume that the normal operating regime of each hypercolumn is to generate stationary tuning curves in response to slowly changing LGN inputs.

4.2 Center-Surround Interactions. Consider the following experimental situation (Blakemore & Tobin, 1972; Li & Li, 1994; Sillito et al., 1995): a particular hypercolumn, designated the "center," is stimulated with a grating at orientation $\Theta_c$, while outside the receptive area of this hypercolumn (the "surround") there is a grating at some uniform orientation $\Theta_s$, as shown in Figure 7a. In order to analyze this problem, we introduce a mean-field approximation. That is, the active region of cortex responding to the center-surround stimulus of Figure 7a is taken to have the configuration shown in Figure 7b. This consists of a center hypercolumn at $r_c = 0$ interacting with a ring of $N$ identical surround hypercolumns at relative positions $r_j$, with $|r_j| = R$, $j = 1, \ldots, N$. Note that the total input from surround to center is strong relative to that from center to surround. Therefore, as a further approximation, we can neglect the latter and treat the system as a single hypercolumn receiving a mixture of inputs from the LGN and lateral inputs from the surround (see Figure 7c). Suppose that the surround hypercolumns are sharply tuned to the stimulus orientation $\Theta_s$; that is, they have steady-state amplitudes of the form $z(r_j) = Z_s e^{-2i\Theta_s}$, where $Z_s$ is positive and real. It follows from equation 3.28 that the complex amplitude $z_c(t)$ characterizing the response of the center
Figure 7: (a) Circular center-surround stimulus configuration. (b) Cortical configuration of a center hypercolumn interacting with a ring of surround hypercolumns. (c) Effective single hypercolumn circuit in which there are direct LGN inputs from the center and lateral inputs from the surround. The relative strengths of the lateral inputs to the excitatory (E) and inhibitory (I) populations are determined by $\beta_E$ and $\beta_I$, respectively. The corresponding LGN input strengths are denoted by $g_E$ and $g_I$.
satisfies the equation

$$\frac{dz_c(t)}{dt} = z_c(t)\left(\Delta\mu - A |z_c(t)|^2\right) + \beta w_s \left(e^{-2i\Theta_s} + \Lambda\, e^{2i\Theta_s} + \tilde{\Lambda}\right) + C_c e^{-2i\Theta_c}, \tag{4.9}$$

where $w_s = N Z_s \hat{w}(R) > 0$, $C_c$ is the contrast of the center stimulus, and

$$\Lambda = \frac{P_0^{(2)}}{N} \sum_{j=1}^{N} e^{-4i\theta_j}, \qquad \tilde{\Lambda} = \frac{\sigma P_0^{(1)}}{N Z_s} \sum_{j=1}^{N} e^{-2i\theta_j}, \tag{4.10}$$

with $\theta_j$ the direction of the line from the center to the $j$th surround hypercolumn (see Figure 7b). We now make the simplifying assumption that $\Lambda = \tilde{\Lambda} = 0$ due to the rotational symmetry of the mean-field configuration. Writing $z_c = Z_c e^{-2i\phi_c}$, we then obtain the following pair of equations:

$$\frac{dZ_c}{dt} = Z_c (\Delta\mu - A Z_c^2) + C_c \cos(2[\phi_c - \Theta_c]) + \beta w_s \cos(2[\phi_c - \Theta_s]), \tag{4.11}$$

$$\frac{d\phi_c}{dt} = -\frac{C_c}{2 Z_c} \sin(2[\phi_c - \Theta_c]) - \frac{\beta w_s}{2 Z_c} \sin(2[\phi_c - \Theta_s]). \tag{4.12}$$
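Integrating this pair for different relative orientations $\Delta\Theta = \Theta_s - \Theta_c$ reproduces the switch between suppression and facilitation discussed below. A sketch using the Figure 8 parameter values ($\Delta\mu = 0.1$, $\beta w_s = -0.8$, $C_c = 1$, $A = 1$; forward Euler, with arbitrary initial conditions):

```python
import math

dmu, A, Cc, bws = 0.1, 1.0, 1.0, -0.8      # high contrast: effective coupling beta < 0

def steady(Theta_s, Theta_c=0.0, bws=bws):
    """Integrate equations 4.11 and 4.12 to their stable fixed point (forward Euler)."""
    Z, phi = 1.0, 0.1
    dt = 0.01
    for _ in range(100_000):
        dZ = (Z * (dmu - A * Z**2) + Cc * math.cos(2 * (phi - Theta_c))
              + bws * math.cos(2 * (phi - Theta_s)))
        dphi = (-Cc / (2 * Z) * math.sin(2 * (phi - Theta_c))
                - bws / (2 * Z) * math.sin(2 * (phi - Theta_s)))
        Z, phi = Z + dt * dZ, phi + dt * dphi
    return Z, phi

Z0, _ = steady(0.0, bws=0.0)               # no surround
Zpar, ppar = steady(0.0)                   # collinear surround
Zorth, porth = steady(math.pi / 2)         # orthogonal surround

dZ_par = Zpar * math.cos(2 * (0.0 - ppar)) / Z0    # delta-Z_c of equation 4.13
dZ_orth = Zorth * math.cos(2 * (0.0 - porth)) / Z0
print(dZ_par, dZ_orth)   # suppression (< 1) for collinear, facilitation (> 1) for orthogonal
```

The collinear surround suppresses the center response ($\delta Z_c < 1$), while the orthogonal surround facilitates it ($\delta Z_c > 1$), as in Figure 8.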
Figure 8: Response of the center population as a function of the relative orientation of the surround stimulus, $\Delta\Theta = \Theta_s - \Theta_c$. Parameters in amplitude equations 4.11 and 4.12 are $\Delta\mu = 0.1$, $|\beta| w_s = 0.8$, $C_c = 1$, and $A = 1$. (a) Solid line: total response $\delta Z_c$ at the preferred orientation as a fraction of the response in the absence of surround modulation. Dashed line: amplitude of maximum response $Z_c^*(\beta)$. (b) Plot of the shift in the peak of the orientation tuning curve, $\delta\phi_c = \Theta_c - \phi_c^*(\beta)$.
Recall from equation 3.29 that the effective interaction parameter $\beta$ has both an excitatory and an inhibitory component. As noted previously, experimental data suggest that the effect of lateral inputs on the response of a hypercolumn is contrast sensitive, tending to be suppressive at high contrasts and facilitatory at low contrasts (Toth et al., 1996; Levitt & Lund, 1997; Polat et al., 1998). In terms of the amplitude equation, 4.9, this implies that $\beta = \beta(C_c)$. For the moment, we assume that the contrast of the center stimulus is large enough so that $\beta < 0$. Let $(Z_c^*(\beta), \phi_c^*(\beta))$ be the unique, stable, steady-state solution of equations 4.11 and 4.12 for a given $\beta$. We define two important quantities: the shift $\delta\phi_c$ in the peak of the tuning curve and the fractional change $\delta Z_c$ in the linear response at the preferred orientation $\Theta_c$,

$$\delta\phi_c = \Theta_c - \phi_c^*(\beta), \qquad \delta Z_c = \frac{Z_c^*(\beta) \cos(2\,\delta\phi_c)}{Z_c^*(0)}. \tag{4.13}$$

If $\delta Z_c < 1$, the effect of the lateral inputs is suppressive, whereas if $\delta Z_c > 1$, it is facilitatory. In Figure 8 we plot the resulting quantities $\delta Z_c$ and $\delta\phi_c$ as a function of $\Delta\Theta = \Theta_s - \Theta_c$. It can be seen that surround modulation changes from suppression to facilitation as the difference in the orientation of the surround and the center increases beyond around 60°. This is consistent with experimental results on center-surround suppression and facilitation at high contrasts (Blakemore & Tobin, 1972; Li & Li, 1994; Sillito et al., 1995). An important contribution to the asymmetry between suppression and facilitation arises from the shift in the peak of the effective tuning curve in the presence of the surround, as measured by $\delta\phi_c$. Since this is positive, there is an apparent increase in the difference between the center and surround tuning curve peaks, which is analogous to the direct tilt effect observed in psychophysical experiments (Wenderoth & Johnstone, 1988), although the effect is much larger here.

We now give a simple argument for the switch from suppression to facilitation as the relative angle $\Delta\Theta = \Theta_s - \Theta_c$ is increased. If the surround is stimulated by a grating parallel to that of the center stimulus ($\Theta_s = \Theta_c$), then $\phi_c^* = \Theta_c$ and equation 4.11 becomes

$$\frac{dZ_c}{dt} = Z_c (\Delta\mu - A Z_c^2) + C_c + \beta w_s. \tag{4.14}$$
Similarly, in the case of an orthogonal surround ($\Theta_s = \Theta_c + 90°$),

$$\frac{dZ_c}{dt} = Z_c (\Delta\mu - A Z_c^2) + C_c - \beta w_s. \tag{4.15}$$

The steady-state amplitude $Z_c^*$ for $\beta = 0$ is an increasing function of $C_c$ (see Section 4.1). Hence, if $\beta < 0$, then $C_c \to C_c - |\beta w_s|$ for a collinear surround, resulting in a suppression of the center response, whereas $C_c \to C_c + |\beta w_s|$ for an orthogonal surround, leading to facilitation. The occurrence of a facilitatory response to an orthogonal surround stimulus has been observed experimentally by a number of groups (Blakemore & Tobin, 1972; Sillito et al., 1995; Levitt & Lund, 1997; Polat et al., 1998). At first sight, however, this conflicts with the consistent experimental finding that stimulating a hypercolumn with an orthogonal stimulus suppresses the response to the original stimulus. In particular, DeAngelis, Robson, Ohzawa, and Freeman (1992) show that cross-orientation suppression (with orthogonal gratings) originates within the receptive field of most cat neurons examined and is a consistent finding in both complex and simple cells. The degree of suppression depends linearly on the size of the orthogonal grating up to a critical dimension, which is smaller than the classical receptive field dimension. Such observations are compatible with the above findings. In the case of orthogonal inputs to the same hypercolumn, we have the simple linear summation $C_c \cos(2\phi) + C_c' \cos(2[\phi - \pi/2]) = (C_c - C_c') \cos(2\phi)$, where $C_c > C_c' > 0$. Thus, the orthogonal input of amplitude $C_c'$ reduces the amplitude of the original input and hence gives rise to a smaller response. On the other hand, the effect of an orthogonal surround stimulus is mediated by the lateral connections and (in the suppressive setting) is input primarily to the orthogonal inhibitory population, which can lead to disinhibition and consequently facilitation. Similar arguments were used by Mundel et al. (1997) and more recently by Dragoi and Sur (2000) in their numerical study of interacting hypercolumns.
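The collinear steady state of equation 4.14 also makes the contrast dependence of the switch easy to sketch, using the illustrative contrast-dependent coupling $\beta = 0.5 - C_c$ from the Figure 9 caption ($w_s = 0.8$, $\Delta\mu = 0.1$, $A = 1$; the identification of the response with the steady-state amplitude $Z_c$, and the specific contrasts probed, are assumptions for illustration):

```python
import numpy as np

dmu, A, ws = 0.1, 1.0, 0.8

def Zc(Cc, surround=True):
    """Collinear steady state of equation 4.14: Z(dmu - A Z^2) + Cc + b*ws = 0."""
    b = (0.5 - Cc) if surround else 0.0   # contrast-dependent coupling (Figure 9)
    I = Cc + b * ws                       # total effective input to the center
    roots = np.roots([A, 0.0, -dmu, -I])  # positive root of A Z^3 - dmu Z - I = 0
    return roots[np.argmax(roots.real)].real

print(Zc(0.1, True), Zc(0.1, False))   # low contrast: surround facilitates
print(Zc(1.0, True), Zc(1.0, False))   # high contrast: surround suppresses
```

At low center contrast the coupling is positive and the collinear surround boosts the response; at high contrast the coupling turns negative and the same surround suppresses it, as in Figure 9.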
The center response to surround stimulation depends significantly on the contrast of the center stimulation (Toth et al., 1996; Levitt & Lund, 1997; Polat et al., 1998). For example, a fixed surround stimulus tends to facilitate responses to preferred orientation stimuli when the center contrast is low
Figure 9: Variation in the response $R_c$ of a center hypercolumn at its preferred orientation as a function of (log) contrast $C_c$ for a collinear high-contrast surround (solid curve) and no surround (dashed curve). Parameters in the amplitude equations 4.11 and 4.12 are $\Delta\mu = 0.1$, $A = 1$, and $w_s = 0.8$. The contrast-dependent coupling is taken to be of the form $\beta = (0.5 - C_c)$.
but suppresses responses when it is high. Both effects are strongest when the center and surround stimuli are at the same orientation. It has recently been shown that inclusion of some form of contrast-related asymmetry between local excitatory and inhibitory neurons is sufficient to account for the switch between low-contrast facilitation and high-contrast suppression (Somers et al., 1998). One way to model the thresholding properties of the lateral interactions is to distinguish between the differing classes of inhibitory interneurons. For example, certain types of local interneuron are well placed to provide a source of feedforward inhibition from lateral connections, as illustrated in Figure 1. This feedforward inhibition will have its own threshold and gain that will be determined by the contrast of the LGN inputs, and there will thus be a contrast-dependent disynaptic inhibition arising from the lateral connections. We incorporate such an effect into our cortical model by including a contrast-dependent contribution to the lateral inputs of the excitatory population in equation 2.4: $\beta_E \to \beta_E - \beta^*$, where $\beta^*$ is contrast dependent. In the case of high contrasts, we expect $\beta^*$ to be sufficiently large so that the effective coupling $\beta < 0$. This has been assumed in our analysis of high-contrast center-surround stimulation. Now suppose that we consider the response of the center in the presence of a high-contrast surround and varying center contrast. For the sake of illustration, suppose that $\beta^*$ varies linearly with the contrast. The response of the center at its preferred orientation, $R_c$, is plotted as a function of contrast in Figure 9, both with and without a collinear surround. It can be seen that there is facilitation at low contrasts and suppression at high contrasts, which is consistent with experimental data (Polat et al., 1998).

4.3 Anisotropy in Surround Suppression. So far we have considered a surround configuration in which the anisotropic nature of the lateral connections is effectively averaged away. This is no longer the case if only a subregion of the surround is stimulated. For concreteness, suppose that only a single hypercolumn in the surround is active, and take its location relative to the center to be $r = r(\cos\theta, \sin\theta)$ (see Figure 7b). Since the effects of feedback from the center hypercolumn to the surround cannot now be ignored, we have to solve the pair of equations

$$\frac{dz_c(t)}{dt} = z_c(t)\left(\Delta\mu - A |z_c(t)|^2\right) + C e^{-2i\Theta_c} + \beta \hat{w}(r)\left[z_s + \bar{z}_s P_0^{(2)} e^{-4i\theta} + \sigma P_0^{(1)} e^{-2i\theta}\right], \tag{4.16}$$

$$\frac{dz_s(t)}{dt} = z_s(t)\left(\Delta\mu - A |z_s(t)|^2\right) + C e^{-2i\Theta_s} + \beta \hat{w}(r)\left[z_c + \bar{z}_c P_0^{(2)} e^{-4i\theta} + \sigma P_0^{(1)} e^{-2i\theta}\right], \tag{4.17}$$

where for simplicity the contrast of the center and surround stimuli is taken to be equal. It is clear that if the lateral connections are isotropic, then $p_0(\phi) = 1$ for all $\phi \in [-\pi/2, \pi/2)$, so that $P_0^{(2)} = 0 = P_0^{(1)}$, and the effective interaction between the center and surround is independent of their relative angular location $\theta$. However, if $P_0^{(1)}$ and $P_0^{(2)}$ are nonzero, then there will be some dependence on the relative angle $\theta$, which is a signature of the anisotropy of the lateral connections.

For the sake of illustration, suppose that $\Theta_c = \Theta_s = 0$ (iso-oriented center and surround). Then, by symmetry, a fixed-point solution exists for which $z_c = z_s = Z e^{-2i\phi}$, where

$$Z(\Delta\mu - A Z^2) + C \cos(2\phi) + \beta \hat{w}(r)\left[Z\left(1 + P_0^{(2)} \cos(4[\phi - \theta])\right) + \sigma P_0^{(1)} \cos(2[\phi - \theta])\right] = 0, \tag{4.18}$$

$$C \sin(2\phi) + \beta \hat{w}(r)\left[Z P_0^{(2)} \sin(4[\phi - \theta]) + \sigma P_0^{(1)} \sin(2[\phi - \theta])\right] = 0. \tag{4.19}$$
Some solutions of these equations are plotted in Figure 10, with $\delta Z_c = Z(\beta) \cos(2\phi) / Z(0)$. It can be seen that the degree of suppression of the center (at high contrast) depends on both the relative angular position of the surround hypercolumn and the degree of spread in the lateral connections, as characterized by the parameters $P_0^{(1)}$ and $P_0^{(2)}$. For certain parameter values
Figure 10: Variation in the degree of suppression of a center hypercolumn as a function of the relative angular position $\theta$ of a surround hypercolumn. Both center and surround are stimulated by a high-contrast horizontal bar such that $\theta = 0°$ corresponds to a collinear or end-flank configuration and $\theta = 90°$ corresponds to an orthogonal or side-flank configuration, as shown. The two curves differ in the choice of spread parameters: (i) $\sigma P_0^{(1)} = 0.1$, $P_0^{(2)} = 0.6$ and (ii) $\sigma P_0^{(1)} = 0.8$, $P_0^{(2)} = 0.4$. Other parameters in equations 4.16 and 4.17 are $\Delta\mu = 0.1$, $A = 1$, $C = 1.0$, and $|\beta| \hat{w}(r) = 0.5$.
(see curve ii in Figure 10), the suppressive effect of the surround hypercolumn is maximal when located close to the end-flank position and minimal when located around the side-flank position. However, it is also possible for the suppressive effect to be minimal at some oblique configuration (see curve i in Figure 10). It is interesting that recent experimental data concerning the spatial organization of surrounds in primary visual cortex of the cat imply that in many cases, suppression originates from a localized region in the surround rather than being distributed uniformly in the region encircling the center's classical receptive field; that is, the surrounds are spatially asymmetric (Walker, Ohzawa, & Freeman, 1999). The suppressive portion of the surround can arise at any location, with a bias toward the ends of the center's classical receptive field. Our analysis suggests that the effect of the surround depends on a subtle combination of the spread in the anisotropic lateral connections and the spatial organization of the surround.

5 Discussion

The mathematical model introduced here assumes that V1 dynamics is close to threshold, so that even weakly modulated signals from the LGN can trigger a tuned response. One way to achieve this is if there is a balance between intracortical excitation and inhibition. This and related ideas have been explored in a number of recent studies of the role of intracortical interactions in the observed properties of cortical neurons (Douglas et al., 1995; Carandini et al., 1997; Tsodyks & Sejnowski, 1995; Chance & Abbott, 2000; Wielaard, Shelley, McLaughlin, & Shapley, 2001).
5.1 Balanced Excitation and Inhibition. As discussed by Tsodyks and Sejnowski (1995), when a network with balanced excitation and inhibition is close to threshold, several properties obtain. (1) Fluctuations about the threshold are large and remain so, even in the continuum limit. (2) The network switches discontinuously from threshold to a tuned excited state, indicating the existence of a subcritical bifurcation. (3) Critical slowing down occurs; that is, the network state changes more slowly from threshold to tuned the closer the stimulus is to threshold (see also Cowan & Ermentrout, 1978); given realistic neural parameters, it takes about 50 msec to reach the peak of the tuned state. However, (4) if the input orientation bias is switched instantaneously to a new value, the network switches to the new tuned state in less than 50 msec, particularly if the inhibition in the network is faster than the excitation. Finally, (5) the orientation-tuned response is contrast invariant. Most, if not all, of these properties are found in the mathematical model described above.

On a more technical level, we note that one consequence of having balanced excitation and inhibition is that to have any effect on the bifurcation process whereby new orientation-tuned stable states emerge, external stimuli such as those from the LGN, which we labeled as $h_l(r_j, \phi)$, must be of the form

$$h_l(r_j, \phi) = g_l C_j \left\{1 - \varepsilon^{3/2} + \varepsilon^{3/2} \cos(2[\phi - \Theta_j])\right\},$$

where $g_l$ measures the relative strengths of LGN inputs to excitatory and inhibitory populations in V1, and $C_j$ is the stimulus contrast at the $j$th hypercolumn. Thus, the nonoriented component $(1 - \varepsilon^{3/2}) g_l C_j$ is $O(1)$ and couples to the steady-state equations, whereas the orientation-tuned component $\varepsilon^{3/2} g_l C_j \cos(2[\phi - \Theta_j])$ is $O(\varepsilon^{3/2})$ and couples to the cubic amplitude equations.
It follows that if V1 is not sitting close to a bifurcation point, so that much stronger modulations are needed to drive it to an orientation-tuned state, then a different mathematical analysis is needed. In this respect, it is clear that the noise uctuations described above can play an important role in keeping the network dynamics close to the bifurcation point. As recently noted by Anderson, Lampl, Gillespie, and Ferster (2000), noise uctuations can also secure contrast invariance in the tuned response of a population of spiking neurons, which is a property we found in the uncoupled hypercolumn model (Bressloff et al., 2000) using sigmoid population equations that effectively take account of effects produced by stochastic resonance. 5.2 Divisive Inhibition. Another well-known set of studies concerns the use of recurrent inhibition to model the nearly linear behavior of simple cortical cells. Thus, Carandini et al. (1997) have explored the use of shunting or divisive inhibition provided by a pool of inhibitory neurons, to linearize neural responses and produce contrast invariance and satura-
520
Paul C. Bressloff and Jack D. Cowan
tion properties consistent with observations. As Majaj, Smith, and Movshon (2000) recently showed, many contextual effects can be modeled within this framework if the inhibition acts to normalize the neural response. Thus, cross–orientation suppression within the classical receptive eld is easily obtained. In general, many suppression experiments can be modeled with such a mechanism, but it is more difcult to account for effects such as cross–orientation facilitation by stimuli in the nonclassical receptive eld. The idea of using division rather than subtraction to model the effects of inhibition has been around for a long time. One early mathematical use was by Grossberg (1973), who developed population equations similar to those of Wilson and Cowan (1972, 1973), but using divisive inhibition. An immediate problem is then to nd a biophysical basis for such a mechanism. Recent work by Wielaard et al. (2001) has provided a natural solution. If one uses conductance-based models of the evolution of changes in membrane potentials in a network of integrate–and–re neurons with recurrent excitation and inhibition and if the conductances are sufciently large, then the steady-state voltages reached depend on both differences between excitation and inhibition, and also on ratios of conductances, in such a way as to implement divisive inhibition. One can then use such conductance models to see how linear simple cell properties can emerge from network interactions. It is the balance between excitation and inhibition that produces effective linearity. A similar but less biophysically direct model was also recently introduced by Chance and Abbott (2000). However, they went further in devising a circuit to maintain rapid switching in the presence of critical slowing down by using a three-population model comprising one excitatory and two (divisive) inhibitory populations. 
This circuit is more elaborate than that of Tsodyks and Sejnowski (1995) and in fact is very close to the three-neuron circuit we have studied in this article (see Figure 1), except that we have used subtractive inhibition. But the mathematical analysis carried out in this article proceeds by first linearizing the basic equations about an equilibrium state and then carrying out various perturbation expansions. It follows that either subtractive or divisive inhibition will lead to very similar equations and conclusions. The only difference lies in the particular parameter values needed to obtain the various behaviors. Thus, our results are qualitatively similar to what can be expected from a three-neuron circuit employing divisive rather than subtractive inhibition, particularly as we deal only with steady-state behaviors in this article. Given the framework described above, we note again that the analysis outlined in this article provides an account of both suppression and facilitation in a single model. We expect that the equivalent divisive inhibitory model will also exhibit such properties, at least if stationary tuning curves are analyzed. It remains to be seen what differences, if any, exist in the case of time-dependent responses.
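The conductance argument can be made concrete with a minimal steady-state sketch of a single compartment (a generic textbook current-balance equation, not the specific network model of Wielaard et al., 2001; parameter values and function names are illustrative assumptions):

```python
# Steady-state membrane voltage of a single conductance-based compartment:
#   0 = -g_L*(V - E_L) - g_E*(V - E_E) - g_I*(V - E_I)
# so V = (g_L*E_L + g_E*E_E + g_I*E_I) / (g_L + g_E + g_I).

def steady_state_voltage(g_E, g_I, g_L=10.0, E_L=-70.0, E_E=0.0, E_I=-70.0):
    """Solve the steady-state current-balance equation (units: nS, mV)."""
    return (g_L * E_L + g_E * E_E + g_I * E_I) / (g_L + g_E + g_I)

# With the inhibitory reversal potential at rest, inhibition contributes no
# driving force of its own; it only enlarges the denominator, dividing down
# the depolarization produced by excitation.
v_no_inh = steady_state_voltage(g_E=5.0, g_I=0.0)
v_inh = steady_state_voltage(g_E=5.0, g_I=20.0)

depol_no_inh = v_no_inh - (-70.0)   # = g_E*(E_E - E_L)/(g_L + g_E)
depol_inh = v_inh - (-70.0)         # = g_E*(E_E - E_L)/(g_L + g_E + g_I)
print(depol_no_inh, depol_inh)
```

The ratio of the two depolarizations is set entirely by conductance ratios, which is the sense in which large recurrent conductances implement divisive rather than subtractive inhibition.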
Amplitude Equation Approach
5.3 Contrast Dependence. One other benefit is provided by the three-neuron circuit described here. In various parts of the article, we have imposed contrast dependence of the lateral connectivity in a somewhat ad hoc manner, simply by assuming the effective coupling coefficient b of the lateral connections between hypercolumns to be of the form b = b_E − b*(C), where b_E is the excitatory coupling between hypercolumns and C is the stimulus contrast. Evidently, if b*(C) increases monotonically with contrast, there exists a contrast threshold at which b < 0. One relatively straightforward way to achieve this is to suppose that basket cells, which receive about 20% of the signal from excitatory lateral connections (McGuire et al., 1991), have relatively high thresholds and are therefore not activated by low-contrast stimuli. In effect, this implies that the effective excitatory part of the aggregate hypercolumn receptive field is larger at low contrast (Movshon, pers. comm., 2001).

Acknowledgments

We thank Trevor Mundel for many helpful discussions. This work was supported in part by grant 96-24 from the James S. McDonnell Foundation to J. D. C. The research of P. C. B. was supported by a grant from the Leverhulme Trust. P. C. B. wishes to thank the Mathematics Department, University of Chicago, for its hospitality and support. P. C. B. and J. D. C. also thank Geoffrey Hinton and the Gatsby Computational Neuroscience Unit, University College, London, for hospitality and support.

References

Anderson, S., Lampl, I., Gillespie, D. C., & Ferster, D. (2000). The contribution of noise to contrast invariance of orientation tuning in cat visual cortex. Science, 290(5498), 1968–1972.
Aronson, D. G., Ermentrout, G. B., & Kopell, N. (1990). Amplitude response of coupled oscillators. Physica D, 41, 403–449.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci., 92, 3844–3848.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in a cortical network module. J. Comput. Neurosci., 4, 57–77. Blakemore, C., & Tobin, E. (1972). Lateral inhibition between orientation detectors in the cat’s visual cortex. Exp. Brain Res., 15, 439–440. Blasdel, G. G. (1992). Orientation selectivity, preference, and continuity in monkey striate cortex. J. Neurosci., 12, 3139–3161. Blasdel, G. G., & Salama, G. (1986). Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585.
Bonhoeffer, T., Kim, D., Malonek, D., Shoham, D., & Grinvald, A. (1995). Optical imaging of the layout of functional domains in area 17/18 border in cat visual cortex. European J. Neurosci., 7(9), 1973–1988. Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17, 2112–2127. Bressloff, P. C., Bressloff, N. W., & Cowan, J. D. (2000). Dynamical mechanism for sharp orientation tuning in an integrate-and-fire model of a cortical hypercolumn. Neural Comput., 12, 2473–2511. Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. (2001a). Geometric visual hallucinations, Euclidean symmetry and the functional architecture of striate cortex. Phil. Trans. Roy. Soc. Lond. B, 356, 299–330. Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. (2001b). What geometric visual hallucinations tell us about the visual cortex. Neural Comput., 13, 1–19. Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. J. Neurosci., 17(21), 8621–8644. Chance, F. S., & Abbott, L. F. (2000). Divisive inhibition in recurrent networks. Network: Comp. Neural Sys., 11, 119–129. Cowan, J. D. (1997). Neurodynamics and brain mechanisms. In M. Ito, Y. Miyashita, & E. T. Rolls (Eds.), Cognition, computation, and consciousness (pp. 205–233). Oxford: Oxford University Press. Cowan, J. D., & Ermentrout, G. B. (1978). Some aspects of the "eigenbehavior" of neural nets. In S. Levin (Ed.), Studies in mathematical biology, 15 (pp. 67–117). Providence, RI: Mathematical Association of America. DeAngelis, G., Robson, J., Ohzawa, I., & Freeman, R. (1992). Organization of suppression in receptive fields of neurons in cat visual cortex. J. Neurophysiol., 68(1), 144–163. Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995).
Recurrent excitation in neocortical circuits. Science, 269, 981–985. Dragoi, V., & Sur, M. (2000). Dynamic properties of recurrent inhibition in primary visual cortex: Contrast and orientation dependence of contextual effects. J. Neurophysiol., 83, 1019–1030. Ermentrout, G. B. (1998). Neural networks as spatial pattern forming systems. Rep. Prog. Phys., 61, 353–430. Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cybernetics, 34, 137–150. Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252. Fitzpatrick, D. (2000). Seeing beyond the receptive field in primary visual cortex. Curr. Op. in Neurobiol., 10, 438–443. Fitzpatrick, D., Zhang, Y., Schofield, B., & Muly, M. (1993). Orientation selectivity and the topographic organization of horizontal connections in striate cortex. Soc. Neurosci. Abstracts, 19, 424. Gilbert, C., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Nat. Acad. Sci., 93, 615–622.
Gilbert, C., & Wiesel, T. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133. Gilbert, C., & Wiesel, T. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442. Gray, C. M., Konig, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337. Grinvald, A., Lieke, E., Frostig, R., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. J. Neurosci., 14, 2545–2568. Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52(3), 213–257. Grossberg, S., & Raizada, R. D. S. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Res., 40, 1413–1432. Hata, Y., Tsumoto, T., Sato, H., Hagihara, K., & Tamura, H. (1988). Inhibition contributes to orientation selectivity in visual cortex of cat. Nature, 335, 815–817. Hirsch, J., & Gilbert, C. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11, 1800–1809. Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural networks. New York: Springer. Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. Lond., 160, 106–154. Jagadeesh, B., & Ferster, D. (1990). Receptive field lengths in cat striate cortex can increase with decreasing stimulus contrast. Soc. Neurosci. Abstr., 16, 130.11. Kath, W. L. (1981). Resonance in periodically perturbed Hopf bifurcation. Stud. Appl. Math., 65, 95–112. Levitt, J. B., & Lund, J. (1997). Contrast dependence of contextual effects in primate visual cortex.
Nature, 387, 73–76. Li, C., & Li, W. (1994). Extensive integration field beyond the classical receptive field of cat's striate cortical neurons—classification and tuning properties. Vision Res., 34(18), 2337–2355. Li, Z. (1999). Pre-attentive segmentation in the primary visual cortex. Spatial Vision, 13, 25–39. Majaj, N., Smith, M. A., & Movshon, J. A. (2000). Contrast gain control in macaque area MT. Soc. Neurosci. Abstr. Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci., 90, 10469–10473. McGuire, B., Gilbert, C., Rivlin, P., & Wiesel, T. (1991). Targets of horizontal connections in macaque primary visual cortex. J. Comp. Neurol., 305, 370–392. McLaughlin, D., Shapley, R., Shelley, M., & Wielaard, D. J. (2000). A neuronal network model of macaque primary visual cortex (V1): Orientation tuning and dynamics in the input layer 4Cα. Proc. Natl. Acad. Sci., 97, 8087–8092.
Michalski, A., Gerstein, G., Czarkowska, J., & Tarnecki, R. (1983). Interactions between cat striate cortex neurons. Exp. Brain Res., 53, 97–107. Mundel, T. (1996). A theory of cortical edge detection. Unpublished doctoral dissertation, University of Chicago. Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 886–893). Cambridge, MA: MIT Press. Obermayer, K., & Blasdel, G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129. Phleger, B., & Bonds, A. B. (1995). Dynamic differentiation of GABA_A-sensitive influences on orientation selectivity of complex cells in the cat striate cortex. Exp. Brain Res., 104, 81–88. Polat, U., Mizobe, K., Pettet, M. W., Kasamatsu, T., & Norcia, A. M. (1998). Collinear stimuli regulate visual responses depending on a cell's contrast threshold. Nature, 391, 580–584. Pugh, M. C., Ringach, D. L., Shapley, R., & Shelley, M. J. (2000). Computational modeling of orientation tuning dynamics in monkey primary visual cortex. J. Comput. Neurosci., 8, 143–159. Rockland, K. S., & Lund, J. (1983). Intrinsic laminar lattice connections in primate visual cortex. J. Comp. Neurol., 216, 303–318. Sceniak, M. P., Ringach, D. L., Hawken, M. J., & Shapley, R. (1999). Contrast's effect on spatial summation by macaque V1 neurons. Nature Neurosci., 2, 733–739. Sillito, A. M. (1975). The contribution of inhibitory mechanisms to the receptive-field properties of neurones in the striate cortex of the cat. J. Physiol. Lond., 250, 305–329. Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378, 492–496. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. of Neurosci., 18, 555–586.
Somers, D., Nelson, S., & Sur, M. (1994). Effects of long-range connections on gain control in an emergent model of visual cortical orientation selectivity. Soc. Neurosci. Abstr., 20, 646.7. Somers, D., Nelson, S., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465. Somers, D., Todorov, E. V., Siapas, A. G., Toth, L. J., Kim, D.-S., & Sur, M. (1998). A local circuit approach to understanding integration of long-range inputs in primary visual cortex. Cerebral Cortex, 8, 204–217. Stetter, M., Bartsch, H., & Obermayer, K. (2000). A mean field model for orientation tuning, contrast saturation, and contextual effects in the primary visual cortex. Biol. Cybernetics, 82, 291–304. Toth, L. J., Rao, S. C., Kim, D., Somers, D., & Sur, M. (1996). Subthreshold facilitation and suppression in primary visual cortex revealed by intrinsic signal imaging. Proc. Natl. Acad. Sci., 93, 9869–9874. Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network: Comp. Neural Sys., 6, 1–14.
Vidyasagar, T. R., Pei, X., & Volgushev, M. (1996). Multiple mechanisms underlying the orientation selectivity of visual cortical neurons. TINS, 19(7), 272–277. Walker, G. A., Ohzawa, I., & Freeman, R. D. (1999). Asymmetric suppression outside the classical receptive field of the visual cortex. J. Neurosci., 19, 10536–10553. Wenderoth, P., & Johnstone, S. (1988). The different mechanisms of the direct and indirect tilt illusions. Vision Res., 28, 301–312. Wielaard, D., Shelley, M., McLaughlin, D., & Shapley, R. (2001). How simple cells are made in a nonlinear network model of the visual cortex. J. Neurosci., 21, 5203–5211. Wiener, M. C. (1994). Hallucinations, symmetry, and the structure of primary visual cortex: A bifurcation theory approach. Unpublished doctoral dissertation, University of Chicago. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24. Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80. Yoshioka, T., Blasdel, G. G., Levitt, J. B., & Lund, J. (1996). Relation between patterns of intrinsic lateral connectivity, ocular dominance, and cytochrome oxidase–reactive regions in macaque monkey striate cortex. Cerebral Cortex, 6, 297–310. Received January 10, 2001; accepted May 25, 2001.
LETTER
Communicated by Fred Rieke
Derivation of the Visual Contrast Response Function by Maximizing Information Rate

Allan Gottschalk
[email protected]
Department of Anesthesiology and Critical Care Medicine, Johns Hopkins Medical Institutions, Baltimore, MD 21287, U.S.A.

A graph of neural output as a function of the logarithm of stimulus intensity often produces an S-shaped function, which is frequently modeled by the hyperbolic ratio equation. The response of neurons in early vision to stimuli of varying contrast is an important example of this. Here, the hyperbolic ratio equation with a response exponent of two is derived exactly by considering the balance between information rate and the neural costs of making that information available, where neural costs are a function of synaptic strength and spike rate. The maximal response and semisaturation constant of the neuron can be related to the stimulus ensemble and therefore shift accordingly to exhibit contrast gain control and normalization.

1 Introduction
The response of neurons in early vision to variations in stimulus contrast has been studied extensively (Albrecht & Hamilton, 1982; Sclar & Freeman, 1982; Albrecht, Farrar, & Hamilton, 1984; Ohzawa, Sclar, & Freeman, 1985; Sclar, Maunsell, & Lennie, 1990; Bonds, 1991; Geisler & Albrecht, 1992; Wilson & Humanski, 1993) and is an important example of the many types of neural data that can be fit by the hyperbolic ratio equation (Naka & Rushton, 1966),

$$R = R_{\max} C^n / (C^n + C_{1/2}^n),$$
(1.1)
where the response R is an S-shaped function of the logarithm of the stimulus amplitude C, $R_{\max}$ is the maximal possible response, n is the response exponent, and $C_{1/2}$ is the semisaturation constant, the amplitude of the stimulus that will produce a response of $R_{\max}/2$. This relationship is also well known from enzyme kinetics (Koshland, Goldbeter, & Stock, 1982). What is less clear is the role that such a relationship plays in the processing of sensory inputs, particularly the stimulus-dependent alterations in $R_{\max}$ and/or $C_{1/2}$ that are often observed in cortical neurons but not in those of the lateral geniculate nucleus (Albrecht et al., 1984; Ohzawa et al., 1985; Bonds, 1991; Geisler & Albrecht, 1992). Hypotheses regarding these stimulus-dependent
Neural Computation 14, 527–542 (2002)
© 2002 Massachusetts Institute of Technology
alterations range from neural fatigue to a neural strategy designed to maximize sensitivity to adjustments in stimulus amplitude (Albrecht et al., 1984; Ohzawa et al., 1985; Ullman & Schechtman, 1982).

2 Derivation of the Hyperbolic Ratio Equation
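Before deriving it, note that equation 1.1 itself is simple to evaluate; a minimal sketch (parameter values are arbitrary) confirms that the response passes through $R_{\max}/2$ at $C = C_{1/2}$ and saturates toward $R_{\max}$:

```python
def hyperbolic_ratio(C, R_max=30.0, C_half=0.1, n=2.0):
    """Hyperbolic ratio (Naka-Rushton) contrast response, equation 1.1."""
    return R_max * C**n / (C**n + C_half**n)

# Response at the semisaturation constant is half the maximal response,
# and the response saturates toward R_max at high contrast.
print(hyperbolic_ratio(0.1))   # R_max / 2 = 15.0
print(hyperbolic_ratio(10.0))  # close to R_max = 30.0
```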
To investigate these possibilities and to motivate the existence of an S-shaped neural response to sensory stimuli, consider the simple neural model in Figure 1A, where the objective is to adjust the size of the overall neural gain $\beta$ to maximize the amount of information conveyed to the output about the input, subject to the costs associated with increases in neural gain and neural output power. In this context, the neural gain $\beta$ is the product of the transduction of the input at the synaptic level, where both pre- and postsynaptic mechanisms could be involved, and the translation of the resulting membrane potential to a given neural spike rate, whose power is also considered to be a limited neural resource. For the model of Figure 1A, the mutual information between g and h is well known (Gallager, 1968) and given by $\tfrac{1}{2}\log(1 + \beta^2\lambda/\sigma_n^2)$, where $\lambda$ is the input power and $\sigma_n^2$ is the power of the neural noise. If neural gain is penalized as $C_0\beta^m$ and neural output power is penalized as $\kappa_0\beta^2\lambda$, where $C_0$ and $\kappa_0$ are, respectively, the relative costs of neural gain and neural output power, then the balance between mutual information and neural resource use is given by the criterion function

$$\tfrac{1}{2}\log(1 + \beta^2\lambda/\sigma_n^2) - C_0\beta^m - \kappa_0\beta^2\lambda. \tag{2.1}$$

The value of the neural gain $\beta$ that maximizes this expression can be obtained by differentiating the expression with respect to $\beta$, setting the result equal to zero, and solving for $\beta$. For m = 2, the total output power ($\beta^2\lambda + \sigma_n^2$) at the optimum is given exactly by

$$P = (2\kappa_0)^{-1}\, C^2 / (C^2 + C_0/\kappa_0), \tag{2.2}$$
since the input power is obtained from the square of the stimulus amplitude ($\lambda = C^2$). Thus, $R_{\max} \sim (2\kappa_0)^{-1}$, $C_{1/2} \sim (C_0/\kappa_0)^{1/2}$, and the response exponent n is equal to two in order to obtain the stimulus power from the stimulus amplitude. This relationship is depicted by the middle curve of Figure 1B. Although neural output power has a natural definition in the form of $\beta^2 C^2$ against which a penalty can be assessed, there is no natural penalty function for the neural gain $\beta$. However, as shown in Figure 1B, the optimal solution is somewhat robust with respect to the choice of a penalty function for neural gain from the family $\beta^m$, although the corresponding solutions are not nearly as tractable as equation 2.2 and must be obtained numerically. The above results readily generalize to the case where multiple inputs are involved without changing the form of equation 2.2. To demonstrate this,
Figure 1: Simplified neural channel and its optimal response function. (A) Zero-mean gaussian stimulus ensemble g of power $\lambda$ undergoes neural gain $\beta$ and corruption by an independent zero-mean gaussian noise source n of power $\sigma_n^2$ to produce the neural output h. (B) Optimal neural output power as a function of input amplitude for the gain that optimizes the balance between information rate and the neural costs of making that information available. Although both neural output power and neural gain are penalized, there is a natural definition for neural output power only; consequently, a penalty on neural gain proportional to $\beta^m$ is considered for a range of m, where the value of m corresponding to each curve increases in the direction of the arrow. The optimal solution for the case m = 2 corresponds to the hyperbolic ratio equation, whereas the other results must be obtained numerically.
consider the same neural model as in Figure 1A, except that multiple inputs are multiplied by the vector $\varphi$ prior to being summed and corrupted by an additive gaussian noise source of power $\sigma_n^2$. Thus, the mutual information
between the input and output of the neuron becomes $\tfrac{1}{2}\log(1 + \varphi^T C\varphi/\sigma_n^2)$, where C is the correlation matrix of the inputs and $(\cdot)^T$ denotes the vector transpose. In this context, the neural output power becomes $\varphi^T C\varphi$, and the cost of neural gain is $\varphi^T\varphi$. The optimum can then be obtained from

$$\partial\{\tfrac{1}{2}\log(1 + \varphi^T C\varphi/\sigma_n^2) - C_0\,\varphi^T\varphi - \kappa_0\,\varphi^T C\varphi\}/\partial\varphi = 0. \tag{2.3}$$

This can be solved by inspection and the use of earlier results. The optimal $\varphi = \beta v$ is proportional to the eigenvector v of C associated with the largest eigenvalue $\lambda$ of C, since this provides the greatest power with the smallest use of neural gain. If v is normalized such that $v^T v = 1$, the power of the optimal solution is given by

$$P = (2\kappa_0)^{-1}\,\lambda/(\lambda + C_0/\kappa_0), \tag{2.4}$$
where the exponent of two is present implicitly, since the eigenvalue $\lambda$ is already an expression of stimulus power. Note that a penalty only on neural output power leads to an optimal solution where the relative distribution of the gains is arbitrary as long as they are sufficiently large.

3 Contrast Gain Control
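The closed form of equation 2.2 can be checked by maximizing criterion 2.1 numerically for m = 2; the sketch below uses arbitrary parameter values, chosen so that the optimum is interior ($\beta > 0$):

```python
import math

lam, sigma2 = 1.0, 0.1   # input power lambda = C**2, neural noise power
C0, k0 = 0.2, 0.3        # relative costs of neural gain and output power

def criterion(beta):
    """Criterion function of equation 2.1 with m = 2."""
    return (0.5 * math.log(1.0 + beta**2 * lam / sigma2)
            - C0 * beta**2 - k0 * beta**2 * lam)

# Brute-force maximization over beta on a fine grid.
betas = [i * 1e-4 for i in range(40001)]  # beta in [0, 4]
beta_opt = max(betas, key=criterion)

# Total output power at the optimum vs. the closed form of equation 2.2.
P_numeric = beta_opt**2 * lam + sigma2
P_closed = (1.0 / (2.0 * k0)) * lam / (lam + C0 / k0)   # lam = C**2
print(P_numeric, P_closed)
```

The two values agree to within the grid resolution, confirming that the total output power at the optimum follows the hyperbolic ratio form.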
Thus far, these results have assumed that, if justified by the information rate, sufficient neural resources will be available to achieve the optimum. However, it is conceivable that neural gain is regulated so that it remains bounded. This situation is investigated by adjoining the constraint $\beta^m \le S_{\max}$ to the expression of equation 2.1 with the Lagrange multiplier $\gamma$ (Luenberger, 1969) and obtaining the optimum as before. The optimal neural output power when m = 2 becomes

$$P = (2\kappa_0)^{-1}\, C^2/(C^2 + (C_0 + \gamma)/\kappa_0), \tag{3.1}$$

where $\gamma$ is zero unless the constraint is binding ($\beta^m = S_{\max}$), and where $\gamma$ is obtained algebraically by jointly solving equation 3.1 and the binding constraint. Note that the form of equation 3.1 is the same as equation 2.2, except that $C_0 + \gamma$ appears in place of $C_0$. In this context, the Lagrange multiplier $\gamma$ expresses the improvement in information rate that would occur for an increase in $S_{\max}$ and therefore refines the balance between information rate and neural gain induced by $C_0$. If an adapting contrast determines the value of $\gamma$ but the neuron is still free to instantaneously adjust its output power accordingly, the contrast response function (CRF) will shift to the right in a manner analogous to physiologic contrast gain control (Albrecht et al., 1984; Ohzawa et al., 1985), as shown in Figure 2A for one-octave steps in the adapting contrast. When determining the contrast response of neurons in the visual pathway, drifting sinusoids of varying contrast are typically employed as the
Figure 2: (A) Shift of contrast response function (CRF) to the right for one-octave increments in the contrast (3.1%, 6.3%, 12.5%, 25%) of an adapting stimulus when neural gain is constrained. (B) CRFs for one-octave increments in the contrast (3.1%, 6.3%, 12.5%, 25%, 50%) of the adapting stimulus when both neural gain and neural output power have an impact on some common neural substrate, such as energy. For both panels, the curve on the left represents the unadapted state.
stimulus, and the response is reported as the spike rate obtained from the peristimulus time histogram, from either the DC or first harmonic (Albrecht & Hamilton, 1982), depending on the linearity of the neuron being recorded from (e.g., lateral geniculate, simple, or complex). Spike rate is the usual output variable reported for physiological determinations of the contrast response and is the mean of the underlying point process. Consequently, neural output power is equated with neural spike rate in Figures 2 and 4 since, to a first approximation, the variance (power) and mean of such a process are proportional (Fienberg, 1974; Dean, 1981; Tolhurst, Movshon, & Dean, 1983). Consider what takes place if both neural gain and neural output power affect some regulated common neural substrate, such as cellular energy. In this case, the constraint $\beta^2\lambda + \rho\beta^m \le E_{\max}$ is adjoined to the criterion function of equation 2.1 with the Lagrange multiplier $\mu$, where $\rho$ quantifies the relative impact of neural output power and neural gain on $E_{\max}$. When m = 2, the optimal solution can be obtained using the previously described approach to reveal

$$P = (2(\kappa_0 + \mu))^{-1}\, C^2/(C^2 + (C_0 + \mu\rho)/(\kappa_0 + \mu)), \tag{3.2}$$
where $\mu = 0$ unless the constraint is binding ($\beta^2\lambda + \rho\beta^m = E_{\max}$), and where $\mu$ can be obtained algebraically by jointly solving equation 3.2 and the binding constraint. Again, the optimal solution takes the same form as the prior equations; only the constants differ. Figure 2B illustrates the shift in the CRF with one-octave steps in the adapting stimulus. In addition to a rightward shift in the CRF, a decrease in the maximal response occurs. Both of these effects are often observed in physiological measurements of contrast gain control (Albrecht et al., 1984; Ohzawa et al., 1985).

4 Impact of Static Nonlinearity on Information Rate
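The constrained case of equation 3.1 (m = 2, so the constraint is $\beta^2 \le S_{\max}$) can be sketched directly: for each input power the unconstrained gain is clipped at $S_{\max}$, and when the constraint binds, the multiplier $\gamma$ follows algebraically from equation 3.1 evaluated at the constrained power. All parameter values below are illustrative assumptions:

```python
# Constrained neural gain (equation 3.1 with m = 2, constraint beta**2 <= S_max).
sigma2, C0, k0, S_max = 0.1, 0.2, 0.3, 0.5

def optimal_gain_sq(lam):
    """Unconstrained optimizer of criterion 2.1 (m = 2), clipped at S_max."""
    u = max(0.0, (lam / (2.0 * (C0 + k0 * lam)) - sigma2) / lam)
    return min(u, S_max)

def multiplier(lam):
    """Lagrange multiplier gamma: zero unless the gain constraint binds."""
    if optimal_gain_sq(lam) < S_max:
        return 0.0
    # Binding constraint: P = S_max*lam + sigma2; solve equation 3.1 for gamma,
    # i.e., P = (2*k0)**-1 * lam / (lam + (C0 + gamma)/k0).
    P = S_max * lam + sigma2
    return lam / (2.0 * P) - k0 * lam - C0

# At a weak adapting contrast the constraint is slack (gamma = 0); at a strong
# adapting contrast it binds and gamma > 0, raising the semisaturation
# constant sqrt((C0 + gamma)/k0): the CRF shifts to the right.
gamma_low = multiplier(0.05**2)   # weak adapting contrast
gamma_high = multiplier(0.5**2)   # strong adapting contrast
print(gamma_low, gamma_high)
```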
Apart from preventing unbounded use of neural resources, there is an additional motivation for the nervous system to regulate neural gain. Consider an implementation of the above relationships, where a static nonlinearity adjusts output power in accordance with equation 2.2. Although the inputs may be gaussian, the outputs of the nonlinearity are not, and the extent of the deviation from a gaussian distribution depends on the level of input power relative to the semisaturation constant. In addition to inducing distortions that the visual processor may or may not be able to compensate for, nonlinearities also limit the information rate, since gaussian signals are the most information rich of all possible distributions (Gallager, 1968). The decrease in information transmission due to the nonlinearity is illustrated in Figure 3, where the stimulus is processed in a push-pull manner (Palmer & Davis, 1981), with the nonlinearity applied separately to the positive and negative portions of the signal, which are then recombined with an appropriate
Figure 3: Static nonlinearity and its impact on information rate. (A) The input g and its negative value are half-wave rectified, passed through the static nonlinearity, and recombined with the appropriate sign before being corrupted by an additive independent zero-mean gaussian noise source. (B) Mutual information between g and h as a function of the semisaturation constant of the static nonlinearity for one-octave increments in the neural noise ($\sigma_n^2$ = 0.125, 0.25, 0.50, 1.0), where the stimulus power is 1.0. The mutual information for the nonlinear case (solid line) is shown along with that for a gaussian output of equal power (dashed line), where the curve with the smallest level of neural noise is shown at the top. The vertical line is the value of the semisaturation constant at which, for half of the stimuli, the amplitude of the input is at or below the value of the semisaturation constant. Mutual information is expressed in units of nats, which reflect the use of natural logarithms, and may be converted to bits by dividing by ln 2.
sign and corrupted by an additive independent gaussian noise source. The mutual information between the neural input and output is computed as a function of the semisaturation constant of the nonlinearity. Separate curves
are given for one-octave increments in the power of the neural noise while the stimulus input power remains constant. The results are contrasted with those for a gaussian process of the same output power. Note that differences between the gaussian (linear) and nongaussian (nonlinear) cases are not observed until approximately the point (vertical line) where half of the stimuli have amplitudes below the value of the semisaturation constant and half above. Thus, the regulation of neural gain helps to conserve this resource and keeps the information rate close to its optimum value. In this sense, an increase in the semisaturation constant represents the increasing cost of neural gain at higher stimulus power, since the corresponding information rate is lower because of the nongaussian distribution of neural output induced by the nonlinearity.

5 Normalization
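The nongaussian character of the output can be illustrated by passing gaussian samples through a push-pull saturating nonlinearity of the kind in Figure 3A. The Monte Carlo sketch below uses arbitrary parameters and makes no attempt to reproduce the mutual information curves of Figure 3B; it simply shows the output bounded by the saturation level and strongly platykurtic when input power is large relative to the semisaturation constant:

```python
import random
import statistics

random.seed(0)

def push_pull(x, c_half=0.2):
    """Half-wave rectify both signs of the input, pass each through the
    saturating nonlinearity x**2/(x**2 + c_half**2), and recombine."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * x * x / (x * x + c_half * c_half)

# Zero-mean, unit-power gaussian inputs; strong saturation (c_half << 1).
samples = [push_pull(random.gauss(0.0, 1.0)) for _ in range(20000)]

# Outputs are strictly bounded by the saturation level (here 1),
# which no gaussian variable is.
print(max(abs(s) for s in samples))

# Saturation piles probability mass near +/-1; the symmetric, zero-mean
# output is platykurtic (negative excess kurtosis vs. 0 for a gaussian).
m2 = statistics.fmean(s * s for s in samples)
m4 = statistics.fmean(s**4 for s in samples)
excess_kurtosis = m4 / (m2 * m2) - 3.0
print(excess_kurtosis)
```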
It is conceivable that there may be an upper limit to a neuron's output, despite the otherwise clear benefits of being able to increase its output in response to larger stimuli. Because of the explicit relationship between synaptic size and neural output power for the model of Figure 1A, a single neuron cannot, in general, have binding constraints on both. However, when multiple neurons are present, binding constraints on total neural gain and output power are possible. Consider constraints of the form $\sum_i \beta_i^m \le S_{\max}$ on neural gain and $\sum_i \beta_i^2\lambda_i \le P_{\max}$ on total neural output power, which are adjoined to the criterion function by the Lagrange multipliers $\gamma$ and $\kappa$, respectively, and where the different neurons are indexed by i. The constraint on total neural power induces a type of normalization in the sense that neural power is collectively adjusted so that its sum remains within a given limit (Heeger, 1992a). Because multiple neurons are involved, each contributes a term of the form of equation 2.2 to the criterion function to which the above constraints are adjoined. This will be the case as long as the neural outputs remain uncorrelated. This permits an analytic solution for the optimal output in terms of the Lagrange multipliers, which is found to be

$$P_i = (2(\kappa_0 + \kappa))^{-1}\, C_i^2/(C_i^2 + (C_0 + \gamma)/(\kappa_0 + \kappa)), \tag{5.1}$$

which is the same as equation 3.1, except that $\kappa_0 + \kappa$ now appears in place of $\kappa_0$, and $\gamma$ now corresponds to the constraint on total neural gain. Closed-form solutions for the Lagrange multipliers cannot generally be obtained, and their solution must be found numerically with techniques such as Newton's method (Luenberger, 1969). Although neural outputs are not necessarily uncorrelated, there are specific physiological examples to which this expression can be applied. Consider the example where a counterphase grating is adjusted so that the output of a given cortical neuron is zero, and then a test grating that stimulates the neuron is added to the stimulus (Geisler & Albrecht, 1992).
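Treating the multipliers in equation 5.1 as given, rather than solving for them (which, as noted above, generally requires Newton's method), a short sketch shows how $\kappa$ scales the maximal response down while $\gamma$ pushes the semisaturation constant up, producing the down-and-right shift of Figure 4. Parameter values are arbitrary:

```python
def crf(C, gamma=0.0, kappa=0.0, C0=0.2, k0=0.3):
    """Per-neuron output power from equation 5.1 for given multipliers."""
    return (1.0 / (2.0 * (k0 + kappa))) * C**2 / (C**2 + (C0 + gamma) / (k0 + kappa))

def r_max(gamma=0.0, kappa=0.0, C0=0.2, k0=0.3):
    """Maximal response implied by equation 5.1."""
    return 1.0 / (2.0 * (k0 + kappa))

def c_half(gamma=0.0, kappa=0.0, C0=0.2, k0=0.3):
    """Semisaturation constant implied by equation 5.1."""
    return ((C0 + gamma) / (k0 + kappa)) ** 0.5

# A binding power constraint (kappa > 0) pushes the CRF down; a binding
# gain constraint (gamma > 0) pushes it to the right. Both together give
# the down-and-right shift of Figure 4.
print(r_max(), r_max(kappa=0.3))     # maximal response drops
print(c_half(), c_half(gamma=0.4))   # semisaturation constant rises
```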
Figure 4: Shift in the CRF to the right and down in response to one-octave increments in stimulus contrast to surrounding neurons when total output power and neural gain are constrained. The use of linear axes permits these results to be compared with those in the physiological literature (Geisler & Albrecht, 1992). See the text for details.
For a fixed contrast level of the counterphase stimulus, the CRF of the neuron is then obtained by varying the contrast of the test stimulus. If the Lagrange multipliers from equation 5.1 are computed for each contrast level of the counterphase stimulus, a family of CRFs is generated that shift down and to the right in response to one-octave increments in the contrast level of the counterphase stimulus, as shown in Figure 4. Note that considering only a limit on total neural output power will not lead to a shift in the CRF that is both down and to the right, since increasing $\kappa$ by itself would lead to a decrease in the value $C_0/(\kappa_0 + \kappa)$ of the semisaturation constant.

6 Discussion
The hyperbolic ratio equation with a response exponent of two can be derived exactly from a simple information processing model that balances
536
Allan Gottschalk
information transmission against the costs of obtaining that information. This leads to an interpretation of neural data approximated by this relationship as efforts to maximize the extraction of sensory information from the environment with available neural resources. As derived here, the important parameters, but not the shape, of the hyperbolic ratio equation are functions of the costs of and constraints on the available neural resources, and these results lead to shifts in the CRF analogous to physiological descriptions of contrast gain control and normalization. Although simplistic, the model described here would lead to very different conclusions if either synaptic size or neural output power did not in some way constrain the acquisition of information. A system for which there was no penalty on synaptic size ($C_0 = 0$ in equation 2.2) should always operate near its maximal power, whereas one for which there was no constraint on neural output power ($k_0 = 0$ in equation 2.2) should linearly increase its output as a function of stimulus power. These results provide a basis for the expansive nonlinearity that is thought to underlie a large array of visual data, including directional selectivity (Albrecht & Geisler, 1991; Heeger, 1992a, 1992b, 1993). Moreover, the interpretation of the CRF given here provides a means of estimating the balance between information rate and the neural resources allocated to making that information available, since the semisaturation constant and maximal response are readily estimated from the CRF and may be related to $C_0$, $C$, $k_0$, and $k$ for a family of CRFs obtained under different adaptation conditions. The full neurobiological justification for using mutual information as a component of the criterion function remains to be established.
In the current context, its use sets an upper bound on system performance, and the extent to which the results presented here are compatible with established physiology argues that mutual information is relevant. The chief concern about very general information measures based on the stimulus ensemble is that, by themselves, they presume that transduction of each member of the stimulus ensemble is important to the survival of the organism and species in proportion to its presence in the stimulus ensemble. However, because the current model determines an optimal gain only as a function of stimulus contrast, as opposed to a spatial or temporal filter, it is not emphasizing a particular component of the stimulus ensemble and only seeks to best preserve the outputs of earlier filtering operations without undue burden to the system. Since redundancy is the only means of preserving information in a noisy environment and redundancy requires neural resources, the need to strike this balance makes clear the benefit of avoiding excessive redundancy (Attneave, 1954; Barlow, 1961). This was recognized in earlier work that used graphical arguments to obtain the CRF of the fly eye from the distribution of visual stimuli (Laughlin, 1981). By linking the parameters that govern the shape of the CRF with an information measure, it may be possible to impute these aspects of cortical information processing from contrast response data.
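The balance described here can be made concrete with a toy calculation. The sketch below is not the paper's model: it scores a single gaussian channel by mutual information minus synapse-size and output-power costs, in the spirit of equation 2.2, with all parameter values (`C0`, `k0`, the noise variance, the search grid) chosen arbitrarily for illustration.

```python
import math

# Toy criterion in the spirit of eq. 2.2: gaussian-channel information
# minus synaptic and output-power costs. C0, k0, noise, and the search
# grid are illustrative choices, not values from the paper.

def criterion(b, lam, noise=1.0, C0=0.1, k0=0.1):
    """Information gained minus the cost of gain b at stimulus power lam."""
    info = 0.5 * math.log2(1.0 + b * b * lam / noise)   # mutual information (bits)
    cost = C0 * b * b + k0 * b * b * lam                # synapse-size + output-power costs
    return info - cost

def best_gain(lam):
    """Grid search for the gain that best balances information against cost."""
    grid = [i * 0.01 for i in range(1, 1001)]
    return max(grid, key=lambda b: criterion(b, lam))
```

With these choices, the optimal gain shrinks as stimulus power grows, so neural output power increases with stimulus power but sublinearly, which is the qualitative behavior the text attributes to the joint synapse and power costs.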
An issue that may affect the shape of the derived CRF is the assumption made concerning the neural noise, which was introduced as an independent additive gaussian source. Although this maintained the analytic tractability that permitted the above relationships to be obtained, neural noise is stimulus dependent and varies for the most part with the stimulus amplitude (Fienberg, 1974). In early vision, this relationship can be quite complex, with a variance-to-mean ratio greater than the value of one that would be expected for a Poisson process (Dean, 1981; Tolhurst et al., 1983). The incorporation of statistical models that reflect this behavior could substantially affect the detailed shape of the model CRF. For the hyperbolic ratio equation that best fits the outputs of such models, this could lead to response exponents that are more consistent with those obtained physiologically (Albrecht & Hamilton, 1982; Sclar et al., 1990). Despite these limitations, the results presented here are significant in that they represent bounds on system performance, since gaussian signals are, for a given variance, the most information rich of all distributions, and additive gaussian noise is, for a given variance, the most corrupting (Gallegher, 1968). The simplified assumptions concerning signal and noise permitted an analytical expression for mutual information that was balanced against the costs of making that information available. These costs considered both the size of the synapse and the output of the neuron, and failure to include either one would result in qualitative behavior grossly at variance with physiological experience. The expression for the cost of neural output was an exact expression for output power, which relates directly to the expenditure of cellular energy, a commodity previously hypothesized to be an important factor in the rate of information transmission (Levy & Baxter, 1996).
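The gaussian bound invoked above is easy to verify numerically for a specific competitor. The check below compares the differential entropies of gaussian and uniform sources of equal variance; it illustrates the maximum-entropy property for one case rather than proving it in general.

```python
import math

# Check of the bound invoked above: for a fixed variance, a gaussian
# source has the largest differential entropy, so gaussian-signal results
# bound the information any signal distribution of that variance can
# carry. A uniform source of equal variance serves as the competitor.

def gaussian_entropy(var):
    """Differential entropy of N(0, var), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def uniform_entropy(var):
    """Differential entropy of a uniform density with the same variance."""
    half_width = math.sqrt(3 * var)    # uniform on [-a, a] has variance a**2 / 3
    return math.log(2 * half_width)
```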
The hypothesis that cortical neurons should strike a balance between information transmission and firing rate has been given some experimental support (Baddeley et al., 1997). Although it is known that information transfer across a synapse also requires the expenditure of cellular energy (Laughlin, Anderson, O'Carroll, & de Ruyter van Steveninck, 2000), and it is reasonable that larger synaptic connections should cost more than smaller ones, there is no natural expression for the cost of a synapse. Consequently, a family of synaptic cost functions was explored as part of Figure 1 and demonstrated that the properties of the optimal solution did not depend critically on the penalty for synaptic costs. Therefore, in order to facilitate the subsequent analysis, the function that permitted the optimal solution to be obtained analytically was employed, although, in principle, any member of the family of functions evaluated in Figure 1 could have been used. Although the assumptions introduced could account for many types of stimulus-dependent shifts in the CRF, they do not indicate why contrast gain control is present in cortical neurons but not in those of the lateral geniculate nucleus (Ohzawa et al., 1985). The scope of the model may be too limited to address this issue satisfactorily. It may be that such shifts in the responsiveness of lateral geniculate neurons prior to the formation
of cortical receptive fields would introduce distortion that could not be compensated for. A computational test of this hypothesis would require a model of cortical receptive field formation from lateral geniculate afferents. Another limitation of the model is the value of the response exponent of two, which emerged in the process of obtaining input power from input amplitude. However, measurements of the response exponent that corresponds to the steepness of the CRF appear to vary as a function of location along the visual pathway, species, and cell type. The response exponent is least in the lateral geniculate, larger and closest to two in V1, and even larger in MT (Sclar et al., 1990). Those of the monkey are greater than those of the cat, and those of complex cells may be slightly greater than those of simple cells (Albrecht & Hamilton, 1982). More realistic assumptions concerning the signal and noise statistics, as discussed above, as well as a more comprehensive model of how receptive fields from earlier points along the visual pathway are combined to generate the receptive fields of subsequent points, may be necessary to account fully for the detailed shape of the CRF at these different locations. It is reasonable to speculate as to the extent to which implementation of the gain control and normalization schemes described here could be responsible for aspects of the influence of stimuli from outside the classical receptive field. Recent evidence supports the concept of a network of horizontal inhibitory interactions that are radially symmetric, decrease their influence as a function of cortical separation, and are independent of orientation preference (Das & Gilbert, 1999). In the context of the model used to generate the normalization in Figure 4, the interactions described by Das and Gilbert would permit normalization of receptive field output by local neural output power, where more proximal neural outputs exert the greatest influence. These relationships may not be straightforward.
Although the overall output by the stimulated portion of the cortex would lead to decreases in the maximal response of the unstimulated region, it would also drive the semisaturation constant in the direction of increased receptive field sensitivity, as long as the limitation on neural gain does not become dominant. These conditions would be particularly favored if regional neural output power were moderated by cortical interactions (Toyama, Kimura, & Tanaka, 1981a, 1981b; Das & Gilbert, 1999) whose spatial extent differed from that of a local neurochemical moderator of synaptic efficacy such as the diffuse retrograde neurotransmitter nitric oxide (Montague, Gancayco, Winn, Marchase, & Friedlander, 1994). These effects could contribute to the marked increases in receptive field sensitivity seen in experiments with an artificial scotoma (Pettet & Gilbert, 1992; DeAngelis, Anzai, Ohzawa, & Freeman, 1995; Gilbert, Das, Ito, Kapadia, & Westheimer, 1996). It is important to recognize that the results presented here do not determine the time course or means by which the above relationships are implemented. However, the form of the optimal solution contains terms involving pre- and postsynaptic correlations, which lead naturally to a Hebbian synaptic modification scheme (Gottschalk, 1997).

Figure 5: Change in optimal neural output power as stimulus power is varied from some reference value (arrow). These changes are shown as a function of pre- and postsynaptic correlation, which was determined for the optimum at each stimulus power level. Although pre- and postsynaptic correlation is always nonzero, whether the level of neural output at the optimum increases or decreases with respect to that of some reference stimulus depends on the difference between the new stimulus power and the reference level (prior experience). Effectively, prior experience sets a "sliding threshold" with respect to whether the synapse should increase or decrease in strength. The asymptote on the left occurs because eventually the synapse is set to zero, since this may be more efficient than incurring the neural costs necessary to convey relatively little information.

An important aspect of this algorithm is that correlated pre- and postsynaptic activity will not always lead to an increase in synaptic strength, since strength should increase only as long as the resulting information is appropriately balanced against the neural costs of making it available (see Figure 5). Otherwise, it may be optimal for synaptic efficacy to decrease, and in some cases the information produced by a synapse does not justify its existence at all. Thus, synaptic strength for the optimal solution varies with stimulus power, but the direction of change depends entirely on prior stimulus levels. This balance between information and the neural resources necessary for making it available may be why "sliding threshold" models appear necessary to account for certain types of physiological data (Bienenstock, Cooper, & Munro, 1982; Bear, Cooper, & Ebner, 1987; Clothiaux, Bear, & Cooper, 1991). Examples of so-called anti-Hebbian behavior have emerged in other types of neural network analysis (Plumbley, 1993). In summary, nonlinearities of the form of the hyperbolic ratio equation, which are seen throughout sensory physiology, can be interpreted as a natural means of striking a balance between the information these neurons convey about their inputs and the neural costs of doing so. If either neural gain or output power were not costly and therefore highly regulated, the
form of this nonlinearity should be very different. This balance between information and neural costs may play a vital role in neural development by explaining why Hebbian synapses with correlated pre- and postsynaptic activity might either increase or decrease their efficacy, or even disappear. These results also permit the stimulus-dependent shifts of the CRF in early visual cortex to be interpreted as an effort to maximize information when the resources for doing so are limited.

Acknowledgments
I thank Judith McLean, Larry A. Palmer, and Alan C. Rosenquist for their many helpful suggestions during the conduct of this work and the preparation of this article. This work was supported by grant EY10915 from the National Institutes of Health.

References

Albrecht, D. G., Farrar, S. B., & Hamilton, D. B. (1984). Spatial contrast adaptation characteristics of neurones recorded in the cat's visual cortex. J. Physiol. (Lond.), 347, 713–739.
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Visual Neuroscience, 7, 531–546.
Albrecht, D. G., & Hamilton, D. B. (1982). Striate cortex of monkey and cat: Contrast response function. J. Neurophysiol., 48, 217–237.
Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61, 183–193.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bear, M. F., Cooper, L. N., & Ebner, F. F. (1987). A physiological basis for a theory of synapse modification. Science, 237, 42–48.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bonds, A. B. (1991). Temporal dynamics of contrast gain in single cells of the cat striate cortex. Visual Neuroscience, 6, 239–255.
Clothiaux, E. E., Bear, M. F., & Cooper, L. N. (1991). Synaptic plasticity in visual cortex: Comparison of theory with experiment. J. Neurophys., 66, 1785–1804.
Das, A., & Gilbert, C. D. (1999). Topography of contextual modulations mediated by short-range interactions in primary visual cortex. Nature, 399, 655–661.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440.
DeAngelis, G. C., Anzai, A., Ohzawa, I., & Freeman, R. D. (1995). Receptive field structure in the visual cortex: Does selective stimulation induce plasticity? Proc. Natl. Acad. Sci. U.S.A., 92, 9682–9686.
Fienberg, S. E. (1974). Stochastic models for single neuron firing trains: A survey. Biometrics, 30, 399–427.
Gallegher, R. G. (1968). Information theory and reliable communication. New York: Wiley.
Geisler, W. S., & Albrecht, D. G. (1992). Cortical neurons: Isolation of contrast gain control. Vision Res., 32, 1409–1410.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Natl. Acad. Sci. U.S.A., 93, 615–622.
Gottschalk, A. (1997). Information based limits on synaptic growth in Hebbian models. In J. M. Bower (Ed.), Computational neuroscience: Trends in research, 1997 (pp. 309–313). New York: Plenum.
Heeger, D. J. (1992a). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Heeger, D. J. (1992b). Half-squaring in responses of cat striate cells. Visual Neuroscience, 9, 427–443.
Heeger, D. J. (1993). Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol., 70, 1885–1898.
Koshland, D. E., Goldbeter, A., & Stock, J. B. (1982). Amplification and adaptation in regulatory and sensory systems. Science, 217, 220–225.
Laughlin, S. B. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36, 910–912.
Laughlin, S. B., Anderson, J. C., O'Carroll, D., & de Ruyter van Steveninck, R. (2000). Coding efficiency and the metabolic cost of sensory and neural information. In R. Baddeley, P. Hancock, & P. Földiák (Eds.), Information theory and the brain (pp. 41–61). New York: Cambridge University Press.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543.
Luenberger, D. G. (1969). Optimization by vector space methods. New York: Wiley.
Montague, P. R., Gancayco, C. D., Winn, M. J., Marchase, R. B., & Friedlander, M. J. (1994). Role of NO production in NMDA receptor-mediated neurotransmitter release in cerebral cortex. Science, 263, 973–977.
Naka, K. I., & Rushton, W. A. H. (1966). S-potentials from colour units in the retina of fish (Cyprinidae). J. Physiol. (Lond.), 185, 536–555.
Ohzawa, I., Sclar, G., & Freeman, R. D. (1985). Contrast gain control in the cat's visual system. J. Neurophysiol., 54, 651–667.
Palmer, L. A., & Davis, T. L. (1981). Receptive-field structure in cat striate cortex. J. Neurophysiol., 46, 260–276.
Pettet, M. W., & Gilbert, C. W. (1992). Dynamic changes in receptive-field size in cat primary cortex. Proc. Natl. Acad. Sci. U.S.A., 89, 8366–8370.
Plumbley, M. D. (1993). Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 6, 823–833.
Sclar, G., & Freeman, R. D. (1982). Orientation selectivity in the cat's striate cortex is invariant with stimulus contrast. Exp. Brain Res., 4, 457–461.
Sclar, G., Maunsell, J. H. R., & Lennie, P. (1990). Coding of image contrast in central visual pathways of the macaque monkey. Vision Res., 30, 1–10.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research, 23, 775–785.
Toyama, K., Kimura, M., & Tanaka, K. (1981a). Cross-correlation analysis of interneuronal connectivity in cat visual cortex. J. Neurophysiol., 46, 191–201.
Toyama, K., Kimura, M., & Tanaka, K. (1981b). Organization of cat visual cortex as investigated by cross-correlation. J. Neurophysiol., 46, 202–214.
Ullman, S., & Schechtman, G. (1982). Adaptation and gain normalization. Proc. R. Soc. Lond. B, 216, 299–313.
Wilson, H. R., & Humanski, R. (1993). Spatial frequency adaptation and contrast gain control. Vision Res., 33, 1133–1149.

Received April 24, 2000; accepted May 25, 2001.
LETTER
Communicated by Christian Wehrhahn
A Bayesian Framework for Sensory Adaptation

Norberto M. Grzywacz
[email protected]
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089-1451, U.S.A.

Rosario M. Balboa
[email protected]
Departamento de Biotecnología, Universidad de Alicante, Apartado de Correos 99, 03080 Alicante, Spain

Adaptation allows biological sensory systems to adjust to variations in the environment and thus to deal better with them. In this article, we propose a general framework of sensory adaptation. The underlying principle of this framework is the setting of internal parameters of the system such that certain prespecified tasks can be performed optimally. Because sensory inputs vary probabilistically with time and biological mechanisms have noise, the tasks could be performed incorrectly. We postulate that the goal of adaptation is to minimize the number of task errors. This minimization requires prior knowledge of the environment and of the limitations of the mechanisms processing the information. Because these processes are probabilistic, we formulate the minimization with a Bayesian approach. Application of this Bayesian framework to the retina is successful in accounting for a host of experimental findings.

1 Introduction
One of the most important properties of biological sensory systems is adaptation. This property allows these systems to adjust to variations in the environment and thus to deal better with them. Adaptation has been studied in many biological sensory systems (Thorson & Biederman-Thorson, 1974; Laughlin, 1989). Are there general principles that guide biological adaptation? A host of theoretical principles has been proposed in the literature, and many of them have been applied to the visual system (Srinivasan, Laughlin, & Dubs, 1982; Atick & Redlich, 1992; Field, 1994). Recently, we showed that many of these principles cannot account for spatial adaptation in the retina (Balboa & Grzywacz, 2000a). We then proposed that the principle underlying this type of adaptation is the setting of internal parameters of the retina such that it can perform optimally certain prespecified visual tasks (Balboa & Grzywacz, 2000b). Key to this proposal was that the visual information is probabilistic (Yuille & Bülthoff, 1996; Knill, Kersten, & Mamassian, 1996; Knill, Kersten, & Yuille, 1996) and retinal mechanisms noisy (Rushton, 1961; Fuortes & Yeandle, 1964; Baylor, Lamb, & Yau, 1979). Hence, any optimization would have to follow Bayesian probability theory (Berger, 1985). For this purpose, the visual system would have to store knowledge about the statistics of the visual environment (Carlson, 1978; van Hateren, 1992; Field, 1994; Zhu & Mumford, 1997; Balboa & Grzywacz, 2000b, 2000c; Balboa, Tyler, & Grzywacz, 2001; Ruderman & Bialek, 1994), and about its own mechanisms and neural limitations. In this article, we extend these principles to formalize a general framework of sensory adaptation. This framework begins with the Bayesian formulas but encodes tasks and errors through terms akin to the loss function (Berger, 1985), which quantifies the cost of making certain sensory decisions. As an example, we will show how to apply the framework to the spatial adaptation of retinal horizontal cells. The framework and some of the results appeared previously in abstract form (Balboa & Grzywacz, 1999).

Neural Computation 14, 543–559 (2002)
© 2002 Massachusetts Institute of Technology

2 A Bayesian Framework of Adaptation

2.1 Rationale. The basic architecture of the new framework of adaptation comprises three stages (see Figure 1). The first, which we call the stage of preprocessing, transforms the sensory input into the internal language of the system. The second stage, task coding, transforms the output of the stage of preprocessing to a code that the system can use directly to perform its tasks.¹ We propose that the tasks are the estimation of desired attributes from the world, which we call the task-input attributes. This estimation would be performed by specially implemented recovery functions (task-coding functions). Finally, the third stage estimates the typical error in performing the tasks, that is, the discrepancy between the task-input attributes and their estimates from the task-coding functions.
The key phrase here is typical error. Because the system does not know what the actual environmental attributes are and can only estimate them, it cannot know what the actual error is at every instant. Hence, the best the system can aim at is a statistically typical estimation of error for the possible ensemble of inputs. Such an estimate is possible only if one assumes that the environment is not random, which means that the statistics of the input to the system at a given time has some consistency with the statistics of the input at subsequent times. For instance, a forest at night produces mostly dark images and thus defines a temporally consistent environment for the visual system. In turn, the same forest during the day produces bright images and represents a different, consistent environment from the one defined at night.

¹ We postulate a stage of preprocessing separate from a stage of task coding, because the output of the stage of preprocessing can be used for multiple tasks. For example, the output of retinal bipolar cells can carry contrast information, which could be useful for motion-, shape-, and color-based tasks.
Figure 1: Schematic of the new framework of adaptation. A multivariate input impinges on the system. The input is first processed by the stage of preprocessing, whose output is then fed to a task-coding apparatus. The goal of this apparatus is to prepare the code for the system to achieve the desired task. The apparatus has two outputs: one to fulfill the task and another to an error-estimation center. This center integrates the incoming information to evaluate the environment and then selects the appropriate prior knowledge about the world to use. In addition, the error-estimation center uses knowledge about the stage of preprocessing and the tasks to be performed. With this knowledge, the system estimates the mean task-performance error, not just for the current input but for all possible incoming inputs. The system then sends an adaptation signal to the stage of preprocessing so that this stage modifies its internal parameters to minimize the error.
The principle underlying our framework for adaptation is the setting of internal parameters to adjust the system to the environment such that the system can perform optimally certain prespecified tasks (Balboa & Grzywacz, 2000a, 2000b). This optimization means that the goal of the system is to minimize the expected number of errors in the task-coding process, adjusting the free parameters of adaptation at the stage of preprocessing such that the estimated error is as low as possible. Therefore, optimality implies that the system has knowledge about the environment and about the mechanisms and limitations of the system itself (for instance, noisiness, sluggishness, and small number of information channels) (see Figure 1). These kinds of knowledge could come from evolution, development, or learning.²

² Although people often refer to these three processes as forms of adaptation, we are not addressing them in this article. We mention them here only as a means through which biological systems can attain knowledge useful for sensory adaptation.
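The three stages of Figure 1 can be sketched as composable functions. Everything concrete below is an invented placeholder (a logarithmic preprocessing nonlinearity with one free gain parameter, an exponential task-coding readout, squared error): the point is only the division of labor among preprocessing, task coding, and typical-error estimation over an input ensemble.

```python
import math

# Placeholder three-stage pipeline in the spirit of Figure 1. The
# nonlinearities and the squared-error measure are illustrative choices.

def preprocess(stimulus, gain):
    """Stage 1: transform sensory input into the system's internal language."""
    return gain * math.log1p(stimulus)

def task_code(internal):
    """Stage 2: recovery function estimating a task-input attribute (here,
    the original stimulus intensity) from the preprocessed signal."""
    return math.expm1(internal)          # inverts log1p only when gain == 1

def typical_error(stimuli, gain):
    """Stage 3: mean estimation error over an ensemble of possible inputs,
    not the error on any single input (the 'typical error' of the text)."""
    errs = [(s - task_code(preprocess(s, gain))) ** 2 for s in stimuli]
    return sum(errs) / len(errs)
```

Adaptation, in this sketch, would amount to choosing the `gain` that minimizes `typical_error` for the current input ensemble.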
2.2 Framework. This section expresses mathematically each element of the framework illustrated in Figure 1 and emphasized in section 2.1. Let $\vec{I}$ be the input in Figure 1 (the process labeled 1 in the figure) and $H_1(\vec{I}), H_2(\vec{I}), \ldots, H_N(\vec{I})$ the relevant task-input attributes to extract from $\vec{I}$. For instance, in section 3, these attributes will be things like contrasts and positions of occluding borders in a visual image. The output (the process labeled 3) of the stage of preprocessing (the process labeled 2), $\vec{O}$, is transformed by the task-coding stage (the process labeled 4) to $R_{H_1}(\vec{O}), R_{H_2}(\vec{O}), \ldots, R_{H_N}(\vec{O})$. These functions are estimates of the values of the task-input attributes. To evaluate the error (the process labeled 5), $E_i$, in each of these estimates, one must measure the discrepancy between $H_i(\vec{I})$ and $R_{H_i}(\vec{O})$. The system cannot know exactly what this discrepancy is, since it does not have access to the input, only to its estimates. However, the system can estimate the expected amount of error. To do so, the system must have previous knowledge (the process labeled 6) about the probable $\vec{I}$, about the task, and about the stage of processing that produces $\vec{O}$. Adaptation (the process labeled 7) would be the setting of the stage-of-preprocessing parameters ($A$) such that the error is minimal over the ensemble of inputs.³ The error to be minimized is

$$E(A) = \sum_{i=1}^{N} E_i(A). \qquad (2.1)$$

The error, $E_i(A)$, can be defined using statistical decision theory (Berger, 1985). We begin by defining a loss function $L(\vec{I}, \vec{R}_{H_i})$ (where $\vec{R}_{H_i} = R_{H_i}(\vec{O})$), which measures the cost of deciding that the $i$th attribute is $\vec{R}_{H_i}$ given that the input is $\vec{I}$. Using the loss function, statistical decision theory defines the Bayesian expected loss as

$$l_A\left(\vec{R}_{H_i}\right) = \int_{\vec{I}} P_A\left(\vec{I} \mid \vec{R}_{H_i}\right) L\left(\vec{I}, \vec{R}_{H_i}\right), \qquad (2.2)$$

where $P_A(\vec{I} \mid \vec{R}_{H_i})$ is the conditional probability, with the parameters of adaptation set to $A$, that the correct interpretation of the input is $\vec{I}$ given that the response of the system is $\vec{R}_{H_i}$. The Bayesian expected loss is the mean loss of the system given that response. We define the error as the mean Bayesian expected loss over all possible responses, that is,

$$E_i(A) = \int_{\vec{R}_{H_i}} P_A\left(\vec{R}_{H_i}\right) l_A\left(\vec{R}_{H_i}\right). \qquad (2.3)$$

That we minimize the sum of these errors (see equation 2.1) is the same as using conditional Bayes' principle of statistical decision theory (Berger, 1985). In simple words, we choose the "action" that minimizes the loss. From standard Bayesian analysis, one can obtain a particularly useful form of the error in equation 2.3 by using Bayes' theorem,⁴ that is,

$$P_A\left(\vec{I} \mid \vec{R}_{H_i}\right) = \frac{P_A\left(\vec{R}_{H_i} \mid \vec{I}\right) P\left(\vec{I}\right)}{P_A\left(\vec{R}_{H_i}\right)}. \qquad (2.4)$$

By substituting equation 2.4 for $P_A(\vec{I} \mid \vec{R}_{H_i})$ in equation 2.2 and then plugging the result in equation 2.3, one gets

$$E_i(A) = \int_{\vec{I}} \int_{\vec{R}_{H_i}} P_A\left(\vec{R}_{H_i} \mid \vec{I}\right) P\left(\vec{I}\right) L\left(\vec{I}, \vec{R}_{H_i}\right). \qquad (2.5)$$

One can give intuitive and practical interpretations of the probability terms in this equation. The first term, the likelihood function $P_A(\vec{R}_{H_i} \mid \vec{I})$, embodies the knowledge about the sensory mechanisms; the second, $P(\vec{I})$, is the prior knowledge about the input. In other words, $P_A(\vec{R}_{H_i} \mid \vec{I})$ tells how the system responds when the stimulus is $\vec{I}$. Because the response is composed of the stage of preprocessing and the task-coding functions, it is sometimes useful to unfold equation 2.5 as

$$E_i(A) = \int_{\vec{I}} \int_{\vec{O}} \int_{\vec{R}_{H_i}} P_A\left(\vec{R}_{H_i} \mid \vec{O}\right) P_A\left(\vec{O} \mid \vec{I}\right) P\left(\vec{I}\right) L\left(\vec{I}, \vec{R}_{H_i}\right).$$

In this equation, we divided the knowledge of the sensory process into $P_A(\vec{R}_{H_i} \mid \vec{O})$ and $P_A(\vec{O} \mid \vec{I})$. The former reflects the task, while the latter reflects the computations performed in the stage of preprocessing. In this article, we assume for simplicity that the task-coding functions have no noise and thus use the simpler form in equation 2.5.

³ The parameters of adaptation can be thought of as a representation of the environment. Here, we choose to define the framework through the parameters of adaptation instead of those of the environment (for instance, time of the day) for two reasons. First, our framework is a mathematical formulation for the role of adaptation. Second, there may not be a one-to-one correspondence between the parameters of the environment and those of adaptation. If we had chosen to parameterize our framework through the environment, then the framework would be similar to the Bayesian process of prior selection (Berger, 1985). In prior selection, one chooses the correct prior distribution to use before performing a task. One defines the various possible priors by special hyperparameters.

⁴ We write $P(\vec{I})$ in this equation as if this probability function does not depend on the parameters of adaptation. However, as discussed in note 3, these parameters can be thought of as a representation of the environment. Hence, it would have been more correct to insert the subindex $A$ in $P(\vec{I})$ in equation 2.4. The reason that we avoid doing so is that for some readers it might be confusing to have a notation suggesting that the distribution of images depends on adaptation. We preferred not to use the subindex $A$ and to think of $P(\vec{I})$ as a prior distribution of the current environment (see section 2.1).
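A fully discretized toy version of equations 2.1 through 2.5 makes the adaptation loop explicit. The world model below (four input levels, a saturating noisy channel whose gain plays the role of the adaptation parameter $A$, and a squared-error loss for an intensity-recovery task) is invented for illustration; only the structure, choosing $A$ to minimize the mean Bayesian expected loss over the input ensemble, follows the text.

```python
# Discretized sketch of eqs. 2.1-2.5: pick the preprocessing parameter
# (here, a gain) that minimizes the mean Bayesian expected loss. The
# prior, noise model, saturation level, and loss are illustrative only.

inputs = [0.0, 1.0, 2.0, 3.0]
prior = [0.4, 0.3, 0.2, 0.1]                     # P(I): dim inputs more common
noise = [(-0.5, 0.25), (0.0, 0.5), (0.5, 0.25)]  # additive noise values and probs

def response(I, gain):
    """Stage of preprocessing: noisy channel saturating at 2.0.

    Returns (R, P(R | I)) pairs, i.e., the likelihood of eq. 2.5."""
    return [(min(gain * I + n, 2.0), p) for n, p in noise]

def expected_loss(gain):
    """E(A) of eq. 2.5 with L(I, R) = (I - R/gain)**2 (recover intensity)."""
    total = 0.0
    for I, pI in zip(inputs, prior):
        for R, pR in response(I, gain):
            total += pR * pI * (I - R / gain) ** 2
    return total

def adapt(gains):
    """Adaptation: choose the parameter value with minimal expected error."""
    return min(gains, key=expected_loss)
```

Too small a gain amplifies the noise in the estimate, and too large a gain drives inputs into saturation, so the expected loss is minimized at an intermediate gain.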
3 An Application of the Framework: Spatial Retinal Adaptation

3.1 Some Task-Input Attributes Proposed for the Retina. We previously proposed that retinal horizontal cells are part of a system to extract from images the position of occluding borders, contrast at occluding borders, and intensity away from them (Balboa & Grzywacz, 2000b). To quantify these image attributes (the functions H_i of equation 2.5), we begin by defining contrast as |∇I(r⃗)/I(r⃗)| (see Balboa & Grzywacz, 2000b, 2000c, for a justification).5 From this definition, the functions H_i may be
$$\begin{aligned}
H_1(I(\vec r)) &= P_O\!\left(\left|\frac{\nabla I(\vec r)}{I(\vec r)}\right|\right)\\
H_2(I(\vec r)) &= \left|\frac{\nabla I(\vec r)}{I(\vec r)}\right|\\
H_3^{\,n+1}(I(\vec r)) + H_3(I(\vec r)) &= I(\vec r)\, G(\bar I),
\end{aligned} \qquad (3.1)$$
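On a discrete 1-D image, the first two attributes can be computed directly; the only subtlety is H_3, which equation 3.1 defines implicitly through the monotone polynomial H_3^{n+1} + H_3 = I G(Ī), so it can be solved by bisection. The sketch below assumes n = 2, G ≡ 1, and a saturating stand-in for P_O (the actual P_O is estimated from natural images in Balboa & Grzywacz, 2000c):

```python
import numpy as np

def attributes(I, n=2, G=1.0):
    """Task-input attributes of equation 3.1 for a positive 1-D image I.
    P_O is replaced by a hypothetical saturating function of contrast."""
    contrast = np.abs(np.gradient(I) / I)       # |grad I / I|
    H2 = contrast
    H1 = contrast / (1.0 + contrast)            # stand-in for P_O(contrast), in [0, 1)
    # Solve H3**(n+1) + H3 = I * G by bisection; the left side is monotone in H3
    rhs = I * G
    lo = np.zeros_like(rhs)
    hi = np.maximum(rhs, 1.0)                   # bracket: hi**(n+1) + hi >= rhs
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        below = mid ** (n + 1) + mid < rhs
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    H3 = 0.5 * (lo + hi)
    return H1, H2, H3
```

On a uniform image of intensity 2, for example, the contrast attributes vanish and H_3 solves x³ + x = 2, that is, x = 1; the compression of intensity shows in H_3 growing much more slowly than I.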
where 0 ≤ P_O(|∇I(r⃗)/I(r⃗)|) ≤ 1 is the probability that a point with the given contrast has an occluding border and 0 < G(Ī) ≤ 1 is a prespecified, positive, decreasing function of the mean intensity, Ī. Consequently, H_1, H_2, and H_3 quantify the likelihood of a border at position r⃗, contrast, and intensity, respectively. In this application, H_1 is made to depend only on ∇I(r⃗)/I(r⃗) for simplicity; one may introduce other variables in more complex models. The quantification of intensity by H_3 is indirect, but we can justify this quantification from independent results. The function G transforms the right-hand side of the definition of H_3 (see equation 3.1) to a compressed function of I. The polynomial form of H_3 causes it to be a fractional power law of the right-hand side. Hence, H_3 is a compressed function of intensity, allowing the retina to deal with the wide range of natural intensities (Shapley, 1997; Smirnakis, Berry, Warland, Bialek, & Meister, 1997; Adorjan, Piepenbrock, & Obermayer, 1999).

3.2 Prior Knowledge of Images. Many literature studies address the prior knowledge on the distribution of natural images (Carlson, 1978; van Hateren, 1992; Field, 1994; Zhu & Mumford, 1997; Ruderman & Bialek, 1994; Balboa & Grzywacz, 2000b, 2000c). These studies focus on certain statistical moments of the images and thus cannot tell what P(I⃗) is. What these
5 For notational simplicity, we assume that the input is mapped to the output of the outer plexiform layer (OPL) in a one-to-one manner. (In this layer, the photoreceptor synapses serve as input, the horizontal cells serve as interneurons, and the bipolar cells serve as output.) Thus, we will use r⃗ to indicate position in both the input and the output. Alternatively, one could use subscripts to indicate the discrete positions of receptive-field centers (Srinivasan et al., 1982; Balboa & Grzywacz, 2000b).
A Bayesian Framework for Sensory Adaptation
549
studies reveal is that the moments obey certain regularities, and thus not all possible images occur naturally. The moments of interest here are those specified by equation 3.1, that is, the task-input attributes. For simplicity, the approach of this article is to assume first that P(I⃗) is constant over the range of naturally occurring images and zero outside it. Then we use a sample of natural images to estimate the occurrence of the attributes in images. (By the last assumption, the images in the sample have equal probability among themselves and the same probability as any image outside the sample.)

3.3 Knowledge of Neural Processes. To specify the knowledge of the neural processing, one has to describe what is known about the OPL (see note 5), define the task-coding functions (see Figure 1), and then calculate P_A(R_{H_i} | I⃗). Unfortunately, the outputs of the task-coding functions have complex, nonlinear dependencies on I⃗. Therefore, it is hard to provide complete analytical formulas for P_A(R_{H_i} | I⃗). In Balboa and Grzywacz (2000b), we diminish the need for such formulas through some reasonable approximations. A model for the OPL is what we use for the stage-of-preprocessing box of Figure 1. This model is based on a wealth of literature (for a review, see Dowling, 1987) and was presented elsewhere (Balboa & Grzywacz, 2000b). The basic equation of the model is
$$B(\vec r) = \frac{T(\vec r)}{1 + \bigl(h_A(\vec r) * B(\vec r)\bigr)^n}, \qquad (3.2)$$
where T(r⃗) and B(r⃗) are the intensity-dependent input and output of the synapse of the photoreceptor, * stands for convolution, h_A(r⃗) is the lateral-inhibition filter (which can change depending on the state A of adaptation), and n ≥ 1 is a constant. The variable B(r⃗) can be pre- or postsynaptic, since we assume here for simplicity that the photoreceptor-bipolar synapse is linear. It is also postulated that h_A(r⃗) is positive (h_A(r⃗) > 0), isotropic (if |r⃗| = |r⃗′|, then h_A(r⃗) = h_A(r⃗′)), and normalized (∫ h_A(r⃗) = 1). The normalization assumption is made for simplicity, because it was shown that the results are invariant with the integral of h_A (Balboa & Grzywacz, 2000b). One can think of T(r⃗) as the current generated by phototransduction. Hence, a simple way to introduce the phototransduction adaptation is to define a gain function G(Ī) such that T(r⃗) = I(r⃗) G(Ī) (see equation 3.1). In this case, equation 3.2 becomes

$$B(\vec r) = \frac{I(\vec r)\, G(\bar I)}{1 + \bigl(h_A(\vec r) * B(\vec r)\bigr)^n}. \qquad (3.3)$$
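Equation 3.3 is implicit in B, but because the right-hand side shrinks as B grows, a damped fixed-point iteration converges in practice. The sketch below solves it on a 1-D grid with a flat, normalized filter; the filter radius, damping factor, and iteration count are illustrative choices, not values from the model:

```python
import numpy as np

def opl_response(I, n=2, radius=5, G=1.0, iters=400, alpha=0.3):
    """Damped fixed-point solve of B = I*G / (1 + (h_A * B)^n)  (equation 3.3)
    on a 1-D grid, with a flat, normalized lateral-inhibition filter h_A."""
    h = np.ones(2 * radius + 1)
    h /= h.sum()                                  # normalization: integral of h_A is 1
    B = I * G / 2.0                               # positive initial guess
    for _ in range(iters):
        inhib = np.convolve(B, h, mode="same")    # h_A * B
        B = (1.0 - alpha) * B + alpha * I * G / (1.0 + inhib ** n)
    return B
```

On a uniform field, the interior of the solution should approach the steady state of equation 3.4 below in the text; for I = 2 and n = 2 that steady state is B̄ = 1.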
Defined as in this equation, B(r⃗) provides a straightforward signal for the desired attributes specified in equation 3.1. If one stimulates the retina
with a full-field illumination and disregards noise, then in steady state, equation 3.3 becomes

$$\bar B = \frac{\bar I\, G(\bar I)}{1 + \bar B^{\,n}}, \qquad (3.4)$$
which resembles the H_3 term in equation 3.1. Moreover, mathematical analysis (derivations not shown) shows that if one stimulates the retina with an edge and disregards noise, then in steady state, one gets to a good approximation at the edge:

$$\frac{\nabla B(\vec r)}{B(\vec r)} = \frac{\nabla I(\vec r)}{I(\vec r)}. \qquad (3.5)$$
In other words, luminance contrast is proportional to the contrast in the bipolar cell responses.

3.4 Task-Coding Functions. The role of the task-coding functions is to estimate the task attributes (H_i) from the output of the OPL. By comparing equation 3.4 to the H_3 term of equation 3.1, one sees that B and H_3 have the same dependence on light intensity. Therefore, to recover a compressed version of light intensity from the bipolar signals, all that the task-coding stage has to do is to read them directly. Comparison of equation 3.5 to the H_2 term of equation 3.1 yields a similar conclusion. The task-coding stage can compute the illumination contrast directly from the bipolar signals. To do so, this mechanism only has to compute the gradient of the bipolar cell responses and divide it by the responses themselves. Furthermore, assume that the function P_O from the first term of equation 3.1 is known (Balboa & Grzywacz, 2000c). In this case, the task-coding stage can extract the border-position attribute from the contrast signal in the bipolar cells. For given adaptation settings (A), we can express these conclusions in the following equation:
$$\begin{aligned}
R_{H_1}(\vec r) &= P_O\!\left(\left|\frac{\nabla B(\vec r)}{B(\vec r)}\right|\right)\\
R_{H_2}(\vec r) &= \left|\frac{\nabla B(\vec r)}{B(\vec r)}\right|\\
R_{H_3}(\vec r) &= B(\vec r).
\end{aligned} \qquad (3.6)$$
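Equation 3.6 says the readout stage needs nothing beyond the bipolar signal itself: intensity is read directly, contrast is a gradient divided by the signal, and border probability is P_O applied to that contrast. A minimal sketch, again with a hypothetical saturating stand-in for P_O:

```python
import numpy as np

def task_coding(B):
    """Attribute estimates of equation 3.6 from the OPL output B (1-D array).
    P_O is a hypothetical saturating stand-in, as in the attribute sketch."""
    contrast = np.abs(np.gradient(B) / B)   # |grad B / B|
    R_H2 = contrast
    R_H1 = contrast / (1.0 + contrast)      # stand-in for P_O
    R_H3 = B                                # intensity is read out directly
    return R_H1, R_H2, R_H3
```

For an exponential ramp B(x) = e^{0.1x}, the contrast readout is constant (about 0.1), as it should be for a signal with constant logarithmic slope.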
3.5 Retinal Loss Functions. We now must specify the cost of the system extracting these image attributes incorrectly. The first attribute (H_1) quantifies the likelihood that there is a border in a given position. A good loss
function for this attribute should penalize missing true occluding borders. (We worry less about false borders being detected, since the central visual system has heuristics to eliminate false borders, such as false borders not giving rise to long, continuous contours; Field, Hayes, & Hess, 1993; Kovacs & Julesz, 1993; Pettet, McKee, & Grzywacz, 1997.) For the contrast attribute (H_2), the loss function should increase as the contrast diverges from truth, especially at points likely to contain a border. In contrast, a good loss function for the third attribute (H_3) should penalize discrepancies of intensity estimation at points unlikely to be occluding borders. A loss function for the attribute H_1 that meets these conditions is

$$L(\vec I, \vec R_{H_1}) = \left(\int_{\vec r} H_1(I(\vec r))\,\bigl(1 - R_{H_1}(\vec r)\bigr)\bigl(R_{H_1}(\vec r) - H_1(I(\vec r))\bigr)^{2k}\right)^{1/2}, \qquad (3.7)$$
where k is an integer. This equation penalizes missing borders. The term H_1(I(r⃗)) enforces computations on input borders, while the term (1 − R_{H_1}(r⃗)) matters only if the system missed a border at r⃗. Because 0 ≤ H_1(I(r⃗)), R_{H_1}(r⃗) ≤ 1 (see equations 3.1 and 3.6), large values of k force the loss function to penalize clear-cut errors. Such errors are R_{H_1}(r⃗) ≈ 0 and H_1(I(r⃗)) ≈ 1. In contrast, results like R_{H_1}(r⃗) = H_1(I(r⃗)) = 0.5, which are not in error but would contribute to equation 3.7 if k = 0, would not do so if k ≫ 1. In general, this equation works as a counter of the number of errors instead of as a measure of their magnitudes. This equation is similar to the standard 0-1 loss function (Berger, 1985). For the attribute H_2, a loss function meeting the conditions above is

$$L(\vec I, \vec R_{H_2}) = \left(\int_{\vec r} H_1(I(\vec r))\,\bigl(R_{H_2}(\vec r) - H_2(I(\vec r))\bigr)^{2}\right)^{1/2}. \qquad (3.8)$$
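On discretized images (integrals become sums), these loss functions are a few lines each. The sketch below implements the border loss of equation 3.7 and the contrast loss of equation 3.8 just defined, together with the intensity loss of equation 3.9 below; the arrays hold the true attributes and their estimates, and k = 4 is an illustrative choice:

```python
import numpy as np

def retinal_losses(H1, H2, H3, R1, R2, R3, k=4):
    """Loss functions of equations 3.7-3.9 on discrete 1-D attribute arrays."""
    # Eq. 3.7: counts missed borders (large k makes it a near 0-1 loss)
    L1 = np.sqrt(np.sum(H1 * (1.0 - R1) * (R1 - H1) ** (2 * k)))
    # Eq. 3.8: squared-error contrast loss, weighted toward border points
    L2 = np.sqrt(np.sum(H1 * (R2 - H2) ** 2))
    # Eq. 3.9: relative intensity error, weighted away from border points
    L3 = np.sqrt(np.sum((1.0 - H1) * ((R3 - H3) / H3) ** 2))
    return L1, L2, L3
```

Perfect estimates give zero loss; missing a certain border (H_1 = 1 at a point, R_{H_1} = 0 there) contributes a full unit to the first loss regardless of k.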
This equation also enforces computations on borders, as equation 3.7 does. The H_2 term in equation 3.8 expresses contrast discrepancies in absolute terms, since the visual system is sensitive to contrast. This equation is similar to a standard type of loss function called the squared-error loss (Berger, 1985) and to the L_2 metric of functional analysis (Riesz & Sz.-Nagy, 1990). Finally, a good loss function for the attribute H_3 is

$$L(\vec I, \vec R_{H_3}) = \left(\int_{\vec r} \bigl(1 - H_1(I(\vec r))\bigr)\left(\frac{R_{H_3}(\vec r) - H_3(I(\vec r))}{H_3(I(\vec r))}\right)^{2}\right)^{1/2}, \qquad (3.9)$$
which computes errors away from borders (the 1 − H_1(I(r⃗)) term). The H_3 ratio of this equation shows that intensity discrepancies are measured as
a noise-to-signal ratio. Such a ratio allows the expression of the noise in a dimensionless form and is the same type of error measurement that successfully explained spatial adaptation in Balboa and Grzywacz (2000b).

4 Results
Elsewhere (Balboa & Grzywacz, 2000b), we showed that our retinal application of the general framework of adaptation can account correctly for the spatial adaptation in horizontal cells. This application can also explain the division-like mechanism of inhibitory action (see equation 3.3 and Balboa & Grzywacz, 2000b) and linearity at low contrasts (Tranchina, Gordon, Shapley, & Toyoda, 1981). Moreover, the application is consistent with the psychophysical Stevens' power law of intensity perception (Stevens, 1970). This section extends the validity of this application by showing that it can account for three new retinal effects. The first new retinal effect that this application of the framework can account for is the shrinkage of the extent of lateral inhibition for stimuli that are not spatially homogeneous. Experiments by Reifsnider and Tranchina (1995) show that background contrast reduces the OPL lateral spread of responses to superimposed stimuli (see also Smirnakis et al., 1997). Kamermans, Haak, Habraken, and Spekreijse (1996) have neurally realistic computer simulations of the OPL, demonstrating a similar reduction. Figure 2 shows that our model could account for such reductions. Furthermore, this figure illustrates a qualitative prediction of our model: it can account for this reduction when the retina is stimulated by checkerboards. The smaller the squares in the checkerboard are, the less extent the predicted lateral inhibition has. The reason is that with smaller squares, this extent must decrease to prevent as much as possible the straddling of multiple borders by lateral inhibition. If this straddling occurs, the model makes errors in border localization. Because an important task postulated in the retinal application of the general framework is the measurement of contrast, we argued for a division-like model of lateral inhibition (see section 3.3).
Figure 3 shows that a consequence of this model is the disappearance of lateral inhibition at low background intensities. The inhibition emerges only at high intensities, with its threshold decreasing as n (the cooperativity of inhibition) increases. The disappearance of lateral inhibition at low background intensities is well known, having been demonstrated both physiologically (Barlow, Fitzhugh, & Kuffler, 1957; Bowling, 1980) and psychophysically (Van Ness & Bouman, 1967). Functionally, it is not too hard to understand why horizontal cell inhibition disappears in the model. As intensity falls, horizontal cell responses also do so. This causes the conductance change in the photoreceptor synapse (the B̄^n term in equation 3.4) to be smaller than the resting conductance (the 1 in that equation). The conductance changes specified by the model also lead to a quantitative prediction about the modulation of the inhibitory strength
Figure 2: Dependence of lateral inhibition extent on the spatial structure of the image. The stimuli (checkerboard images) are those that appear in the insets. The curves display a bell-shaped behavior, with the lateral inhibition extent increasing as the size of the squares in the checkerboard increases.
with intensity. Because the phototransduction adaptation gain (G(Ī)) can be estimated from the photoreceptor literature (Hamer, 2000; Fain, Matthews, Cornwall, & Koutalos, 2001), equation 3.4 can be solved. With B̄ in hand, one can then predict how the inhibitory strength increases with intensity and test this dependence experimentally. The division-like mechanism of lateral inhibition also has implications for the model's Mach bands. A property of Mach bands is that there is an asymmetry of positive and negative Mach bands at an intensity border (Fiorentini & Radici, 1957; Ratliff, 1965); the Mach band is larger on the positive side than it is on the negative side. This is a well-known Mach band effect without a good explanation until now. Figure 4 shows that this Mach band asymmetry occurs in our model. This figure also shows that the asymmetry is more prominent as the parameter n increases. From our model, it is not difficult to understand this edge asymmetry. Although the inhibition coming from the positive side of the edge is strong, if the intensity at the negative side is near zero, then the response there will be near zero (see the I(r⃗) term in equation 3.3). In contrast, the intensity on the positive side of the edge is high, and thus the effect of lateral inhibition can work there, producing a large Mach band.
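This asymmetry can be reproduced with a few lines of simulation. The sketch below iterates equation 3.3 (with G = 1) on the Figure 4 stimulus, an edge between intensities 1.8 and 0.1 with a flat filter of radius 15, and compares the overshoot on the bright side with the undershoot on the dark side; the damping factor and iteration count are illustrative choices:

```python
import numpy as np

def opl(I, n=2, radius=15, iters=800, alpha=0.2):
    """Damped fixed-point solve of B = I / (1 + (h*B)^n) with a flat filter."""
    h = np.ones(2 * radius + 1)
    h /= h.sum()
    B = I / 2.0
    for _ in range(iters):
        B = (1.0 - alpha) * B + alpha * I / (1.0 + np.convolve(B, h, mode="same") ** n)
    return B

# Edge stimulus as in Figure 4: intensity 1.8 below position 50, 0.1 above it
x = np.arange(100)
I = np.where(x < 50, 1.8, 0.1)
B = opl(I, n=2)

bright_plateau = B[30]                          # far from both edge and array border
dark_plateau = B[70]
overshoot = B[45:50].max() - bright_plateau     # positive (bright-side) Mach band
undershoot = dark_plateau - B[50:55].min()      # negative (dark-side) Mach band
```

In this sketch the positive band comes out markedly larger than the negative one, for the reason given in the text: on the dark side, the small I(r⃗) factor caps the response however strong the incoming inhibition is.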
Figure 3: Lateral inhibition strength as a function of intensity. These curves are parametric on n (see equation 3.2) and assume G(Ī) = 1, and their inhibitory strength is defined as B̄^n/(1 + B̄^n). As the intensity falls, the inhibition disappears, as indicated by its strength going to zero. The threshold for the emergence of inhibition as a function of intensity falls as n increases.
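With G(Ī) = 1, curves like those of Figure 3 only require solving the scalar steady state of equation 3.4, B̄(1 + B̄^n) = Ī, whose left side is monotone, so bisection suffices:

```python
def steady_state_B(I_bar, n, G=1.0):
    """Solve B*(1 + B**n) = I_bar * G  (the steady state of equation 3.4)."""
    rhs = I_bar * G
    lo, hi = 0.0, max(rhs, 1.0)         # f(hi) >= rhs for this bracket
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if mid * (1.0 + mid ** n) < rhs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def inhibitory_strength(I_bar, n):
    """Strength B^n / (1 + B^n), the definition used in Figure 3."""
    B = steady_state_B(I_bar, n)
    return B ** n / (1.0 + B ** n)

strengths = [inhibitory_strength(I, n=2) for I in (0.01, 0.1, 1.0, 10.0)]
```

The strength rises monotonically with intensity and vanishes at low intensities, matching the disappearance of lateral inhibition in dim light; with G = 1, at Ī = 2 (where B̄ = 1) the strength is exactly 1/2 for any n.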
5 Discussion

5.1 Adaptation as a Bayesian Process. Ours is not the first Bayesian framework of sensory processing. Other theories have proposed Bayesian mechanisms for perception (for instance, Yuille & Bülthoff, 1996; Knill, Kersten, & Mamassian, 1996; Knill, Kersten, & Yuille, 1996). Those theories begin with the output of neurons or filters, and then ask how to interpret the environment most correctly. What is new in our framework is not the use of the Bayesian framework but the computation of the best internal state for the system. In other words, the new framework is concerned not with interpreting the current input but rather with setting the parameters of the system such that future interpretations are as correct as possible. Another novel aspect of the new framework is the emphasis on the loss function. Past theories tended to make their perceptual decision based on the maximal probability of interpreting the input correctly given some internal data (Yuille & Bülthoff, 1996; Knill, Kersten, & Mamassian, 1996; Knill, Kersten, & Yuille, 1996). The use of loss functions means that the most probable interpretation of the input may not be sufficient. Sometimes errors made by not picking less probable interpretations are more costly.
Figure 4: Mach band asymmetries in the retinal application of the framework. The stimulus was an edge at position 50 (arbitrary units) and of intensities 1.8 and 0.1 at positions lower and higher than 50, respectively. The responses of the bipolar cells were simulated parametric on n (see equation 3.2), with G(Ī) = 1, and the filter h being flat and with a radius of 15. The Mach band was larger at the high-intensity side of the edge than at the low-intensity side. This Mach band asymmetry became more prominent as n increased.
One positive aspect of using a Bayesian framework is that it forces us to state explicitly the assumptions of our models. In particular, one must state the knowledge that the system has about the environment and about its mechanisms, and state the tasks that the system must perform. How does one go about specifying tasks? We believe that tasks are chosen by both what the systems need to do and can do. For instance, although it would be lovely for a barnacle to perform recognition tasks, this animal has only ten photoreceptors. (It has four photoreceptors in its only median eye—Hayashi, Moore, & Stuart, 1985; Oland & Stuart, 1986—and three photoreceptors in each of its two lateral eyes—Krebs & Schaten, 1976; Oland & Stuart, 1986.) Nevertheless, its visual system is sufficiently good to allow the animal to hide in its shell when a predator approaches. Therefore, one should not assume automatically that the goal of the early sensory system is to maximize transmission of information (Haft & van Hemmen, 1998; Wainwright, 1999). Furthermore, the tasks that a system performs might be intimately coupled to its hardware, against what was argued by Marr (1982). The system must perform computation, and its hardware makes certain types of computations easier than others.

5.2 Limitations of the Framework. Elsewhere, we discuss the limitations of our retinal application of the framework (Balboa & Grzywacz, 2000b); here, we focus on problems related to the framework itself.
Perhaps the most serious problem with the framework is that to apply equation 2.5, one must have P_A(R_{H_i} | I⃗), P(I⃗), and L(I⃗, R_{H_i}), that is, the prior knowledge and the loss function. We assume that (but do not state how) the system would attain these things through evolution, development, and learning. A complete framework would have to specify the algorithms and mechanisms to attain the prior knowledge and the loss function. An even more serious problem is how to apply the framework when the environment changes (see note 4).

5.3 The Importance of Understanding Tasks in Neurobiology. We have provided one example of how to apply the new framework for sensory adaptation—namely, to spatial adaptation in horizontal cells. Besides specifying the neural processes, the example had to formalize the retinal tasks. As we showed here and elsewhere (Balboa & Grzywacz, 2000a, 2000b), the choice of the tasks can have a large impact on the behavior of the system. For instance, the assumed horizontal cell functions were border localization, contrast estimation, and intensity estimation. One could have argued that all of this may just be a consequence of contrast sensitivity (Shapley, 1997). However, requiring maximization of contrast sensitivity has different consequences from requiring optimization of border location. Responding to any edge is not the same as encoding its position with precision. The requirement of encoding position correctly is essential to obtain the correct behavior for the spatial adaptation of horizontal cells (Balboa & Grzywacz, 2000b). Hence, we argue that to understand the behavior of ensembles of nerve cells, it is not sufficient to figure out their neurobiological processes. One must also carefully study their information processing tasks.

Acknowledgments
We thank Alan Yuille and Christopher Tyler for critical comments on the manuscript and Joaquín de Juan for support and many discussions in the early phases of this project. We also thank the Smith-Kettlewell Institute, where we performed a large portion of the work. This work was supported by National Eye Institute Grants EY08921 and EY11170 to N.M.G.

References

Adorjan, P., Piepenbrock, C., & Obermayer, K. (1999). Contrast adaptation and infomax in visual cortical neurons. Rev. Neurosci., 10(3–4), 181–200.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comp., 4, 196–210.
Balboa, R. M., & Grzywacz, N. M. (1999). Biological evidence for an ecological-based theory of early retinal lateral inhibition. Invest. Ophthalmol. Vis. Sci., 40, S386.
Balboa, R. M., & Grzywacz, N. M. (2000a). The role of early retinal lateral inhibition: More than maximizing luminance information. Visual Neurosci., 17, 77–89.
Balboa, R. M., & Grzywacz, N. M. (2000b). The minimal-local asperity hypothesis of early retinal lateral inhibition. Neural Comp., 12, 1485–1517.
Balboa, R. M., & Grzywacz, N. M. (2000c). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Res., 40, 2661–2669.
Balboa, R. M., Tyler, C. W., & Grzywacz, N. M. (2001). Occlusions contribute to scaling in natural images. Vision Res., 41, 955–964.
Barlow, H. B., Fitzhugh, R., & Kuffler, S. W. (1957). Change of organisation in the receptive fields of the cat's retina during dark adaptation. J. Physiol., 137, 338–354.
Baylor, D. A., Lamb, T. D., & Yau, K.-W. (1979). Responses of retinal rods to single photons. J. Physiol., 288, 613–634.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. New York: Springer-Verlag.
Bowling, D. B. (1980). Light responses of ganglion cells in the retina of the turtle. J. Physiol., 209, 173–196.
Carlson, C. R. (1978). Thresholds for perceived image sharpness. Phot. Sci. and Eng., 22, 69–71.
Dowling, J. E. (1987). The retina: An approachable part of the brain. Cambridge, MA: Belknap Press, Harvard University Press.
Fain, G. L., Matthews, H. R., Cornwall, M. C., & Koutalos, Y. (2001). Adaptation in vertebrate photoreceptors. Physiol. Rev., 81, 117–151.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comp., 6, 559–601.
Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.
Fiorentini, A., & Radici, T. (1957). Binocular measurements of brightness on a field presenting a luminance gradient. Atti. Fond. Giorgio Ronchi, 12, 453–461.
Fuortes, M. G. F., & Yeandle, S. (1964). Probability of occurrence of discrete potential waves in the eye of the Limulus. J. Physiol., 47, 443–463.
Haft, M., & van Hemmen, J. L. (1998). Theory and implementation of infomax filters for the retina. Network, 9(1), 39–71.
Hamer, R. D. (2000). Analysis of Ca2+-dependent gain changes in PDE activation in vertebrate rod phototransduction. Mol. Vis., 6, 265–286.
Hateren, J. H. van (1992). Theoretical predictions of spatiotemporal receptive fields of fly LMC's, and experimental validation. J. Comp. Physiol. A, 171, 157–170.
Hayashi, J. H., Moore, J. W., & Stuart, A. E. (1985). Adaptation in the input-output relation of the synapse made by the barnacle's photoreceptor. J. Physiol., 368, 179–195.
Kamermans, M., Haak, J., Habraken, J. B., & Spekreijse, H. (1996). The size of the horizontal cell receptive fields adapts to the stimulus in the light adapted goldfish retina. Vision Res., 36, 4105–4119.
Knill, D. C., Kersten, D., & Mamassian, P. (1996). Implications of a Bayesian formulation of visual information for processing for psychophysics. In D. C. Knill & R. Whitman (Eds.), Perception as Bayesian inference (pp. 239–286). Cambridge: Cambridge University Press.
Knill, D. C., Kersten, D., & Yuille, A. (1996). Introduction: A Bayesian formulation of visual perception. In D. C. Knill & R. Whitman (Eds.), Perception as Bayesian inference (pp. 1–21). Cambridge: Cambridge University Press.
Kovacs, I., & Julesz, B. (1993). A closed curve is much more than an incomplete one: Effect of closure in figure-ground segmentation. Proc. Natl. Acad. Sci. USA, 90, 7495–7497.
Krebs, W., & Schaten, B. (1976). The lateral photoreceptor of the barnacle, Balanus eburneus: Quantitative morphology and fine structure. Cell Tissue Res., 168, 193–207.
Laughlin, S. B. (1989). The role of sensory adaptation in the retina. J. Exp. Biol., 146, 39–62.
Marr, D. (1982). Vision. San Francisco: Freeman.
Oland, L. A., & Stuart, A. E. (1986). Pattern of convergence of the photoreceptors of the barnacle's three ocelli onto second-order cells. J. Neurophysiol., 55, 882–895.
Pettet, M. W., McKee, S. P., & Grzywacz, N. M. (1997). Constraints of long range interactions mediating contour detection. Vision Res., 38, 865–879.
Ratliff, F. (1965). Mach bands: Quantitative studies on neural networks in the retina. San Francisco: Holden-Day.
Reifsnider, E. S., & Tranchina, D. (1995). Background contrast modulates kinetics and lateral spread of responses to superimposed stimuli in outer retina. Vis. Neurosci., 12, 1105–1126.
Riesz, F., & Sz.-Nagy, B. (1990). Functional analysis. New York: Dover.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett., 73, 814–817.
Rushton, W. A. H. (1961). The intensity factor in vision. In W. D. McElroy & H. B. Glass (Eds.), Light and life (pp. 706–722). Baltimore, MD: Johns Hopkins University Press.
Shapley, R. (1997). Retinal physiology: Adapting to the changing scene. Curr. Biol., 7(7), R421–R423.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386(6620), 69–73.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B, 216, 427–459.
Stevens, S. S. (1970). Neural events and the psychophysical law. Science, 170, 1043–1050.
Thorson, J., & Biederman-Thorson, M. (1974). Distributed relaxation processes in sensory adaptation. Science, 183, 161–172.
Tranchina, D., Gordon, J., Shapley, R., & Toyoda, J. (1981). Linear information processing in the retina: A study of horizontal cell responses. Proc. Natl. Acad. Sci. USA, 78, 6540–6542.
Van Ness, F. L., & Bouman, M. A. (1967). Spatial modulation transfer in the human eye. J. Opt. Soc. Am., 57, 401–406.
Wainwright, M. J. (1999). Visual adaptation as optimal information transmission. Vision Res., 39, 3960–3974.
Yuille, A., & Bülthoff, H. H. (1996). Bayesian decision theory and psychophysics. In D. C. Knill & R. Whitman (Eds.), Perception as Bayesian inference (pp. 123–161). Cambridge: Cambridge University Press.
Zhu, S. C., & Mumford, D. (1997). Prior learning and Gibbs reaction-diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 1236–1250.

Received November 13, 2000; accepted May 22, 2001.
LETTER
Communicated by Bard Ermentrout
Analysis of Oscillations in a Reciprocally Inhibitory Network with Synaptic Depression

Adam L. Taylor
[email protected]
Garrison W. Cottrell
[email protected]
William B. Kristan, Jr.
[email protected]
University of California, San Diego, La Jolla, CA 92093-0357, U.S.A.
We present and analyze a model of a two-cell reciprocally inhibitory network that oscillates. The principal mechanism of oscillation is short-term synaptic depression. Using a simple model of depression and analyzing the system in certain limits, we can derive analytical expressions for various features of the oscillation, including the parameter regime in which stable oscillations occur, as well as the period and amplitude of these oscillations. These expressions are functions of three parameters: the time constant of depression, the synaptic strengths, and the amount of tonic excitation the cells receive. We compare our analytical results with the output of numerical simulations and obtain good agreement between the two. Based on our analysis, we conclude that the oscillations in our network are qualitatively different from those in networks that oscillate due to postinhibitory rebound, spike-frequency adaptation, or other intrinsic (rather than synaptic) adaptational mechanisms. In particular, our network can oscillate only via the synaptic escape mode of Skinner, Kopell, and Marder (1994).
1 Introduction
A reciprocally inhibitory (RI) oscillator, also known as a half-center oscillator, is an oscillatory neuronal circuit that consists of two neurons, each of which exerts an inhibitory effect on the other. (More generally, a circuit could be constructed of two populations of neurons, each population inhibiting the other.) An example of such a circuit is shown in Figure 1A. When the circuit oscillates, one cell fires for a time while the other is inhibited below threshold; then the other cell fires for a time while the first is inhibited, and this cycle repeats itself indefinitely. Such oscillators have been found to be important components in a number of central pattern

Neural Computation 14, 561–581 (2002)
© 2002 Massachusetts Institute of Technology
562
Adam L. Taylor, Garrison W. Cottrell, and William B. Kristan, Jr.
Figure 1: (A) Circuit schematic. Excitatory synapses are represented by bars and inhibitory synapses by circles. (B) Example oscillation. The values of the four state variables of the system, u1, u2, d1, and d2, are shown versus time, t. All state variables are unitless, owing to the fact that they are nondimensionalized versions of physical variables. The u_i values are membrane potentials measured relative to the potential at which the synapses are half-activated, this quantity being given in units of the width of the linear regime of the synaptic activation curve. Time is in units of the membrane time constant of the cells. The parameters for the simulation are W = 16, β = 9, τ = 16, with σ(x) = logistic(4x) = 1/(1 + e^{−4x}). See the text for an explanation of the intervals labeled a, b, c, and d.
generators (CPGs), including the leech heartbeat CPG (Calabrese, Nadim, & Olsen, 1995), the Clione swim CPG (Arshavsky et al., 1998), the lobster gastric and pyloric CPGs (Selverston & Moulins, 1987), the leech swim CPG (Brodfuehrer, Debski, O'Gara, & Friesen, 1995; Friesen, 1989), the Xenopus tadpole swim CPG (Roberts, Soffe, & Perrins, 1997), and the lamprey swim CPG (Grillner et al., 1995). A number of mechanisms have been shown to be sufficient to cause an RI circuit to oscillate. Among these are postinhibitory rebound, spike-frequency adaptation, and short-term synaptic depression (Reiss, 1962;
Oscillations with Synaptic Depression
563
Perkel & Mulloney, 1974). The first two mechanisms are similar, and we will refer to them collectively as intrinsic adaptation. Intrinsic adaptation is a property of the cells themselves, whereas synaptic depression is a property of the connections between them. Without some sort of adaptational mechanism in the circuit, one would not expect it to oscillate. This has in fact been proven in one natural formalization of a nonadaptational RI circuit (Ermentrout, 1995). A number of authors (Wang & Rinzel, 1992; Skinner, Kopell, & Marder, 1994; LoFaro, Kopell, Marder, & Hooper, 1994; Rowat & Selverston, 1997) have analyzed the behavior of intrinsic adaptation RI circuits by using dynamical systems techniques. This work has revealed a number of interesting properties. For instance, Skinner et al. (1994) found that there were at least four functionally distinct modes of oscillation their model could exhibit: intrinsic release, intrinsic escape, synaptic escape, and synaptic release. Each of these modes displayed a different mechanism of burst termination and a different frequency response to parameter variations. Despite the success of applying dynamical systems methods to intrinsic adaptation RI circuits, we are aware of no previous attempt to apply the same techniques to RI circuits in which oscillations arise solely due to synaptic depression. Synaptic depression is known to play a role in the oscillation of at least one well-studied RI circuit, the leech heart CPG, for some modes of oscillation (Calabrese et al., 1995). There is also evidence of synaptic depression in CPGs that contain RI subcircuits, including the leech swim CPG (Mangan, Cometa, & Friesen, 1994) and the lobster pyloric CPG (Manor, Nadim, Abbott, & Marder, 1997). It has been proposed that synaptic depression acts as a switch in the lobster pyloric CPG, although it is not the ultimate source of oscillation in this system (Nadim, Manor, Kopell, & Marder, 1999).
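A depression-driven RI oscillation is easy to exhibit numerically. The sketch below integrates a guessed stand-in for such a circuit: leaky units u_i with the Figure 1 parameters (W = 16, β = 9, τ = 16, σ(x) = logistic(4x)) and a depression variable d_i that scales each synapse and depletes with presynaptic activity. These ODEs are our own illustrative choice, not the equations defined in section 2:

```python
import numpy as np

def simulate_ri(W=16.0, beta=9.0, tau=16.0, T=300.0, dt=0.02):
    """Forward-Euler integration of a hypothetical two-cell RI circuit with
    depressing synapses. State variables (u_i, d_i) and parameter names follow
    Figure 1; the ODEs themselves are an illustrative stand-in."""
    def sigma(x):
        return 1.0 / (1.0 + np.exp(-4.0 * x))   # logistic(4x)
    steps = int(T / dt)
    u = np.array([1.0, -1.0])        # asymmetric start so one cell fires first
    d = np.array([1.0, 1.0])         # synapses fully recovered
    trace = np.empty((steps, 4))
    for t in range(steps):
        s = sigma(u)
        du = -u + beta - W * d[::-1] * s[::-1]   # cross-inhibition, scaled by d
        dd = ((1.0 - d) - d * s) / tau           # depletion with presynaptic activity
        u = u + dt * du
        d = d + dt * dd
        trace[t] = [u[0], u[1], d[0], d[1]]
    return trace
```

With these parameter values, the active cell depresses its own synapse until the suppressed cell's input crosses threshold and the roles swap, so each burst is terminated synaptically rather than intrinsically.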
In addition, there has recently been a great deal of interest in possible functions of synaptic depression in mammalian neocortex (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997; Galarreta & Hestrin, 1998). In this article, we use dynamical systems methods to analyze an RI circuit in which oscillations arise due to synaptic depression. Using these methods and employing certain approximations, we are able to derive a closed-form expression for the parameter regime in which the model will oscillate. We are also able to derive closed-form expressions for the period, amplitude, and DC offset of these oscillations. We find that these synaptic-depression-mediated oscillations always operate via the synaptic escape mode of Skinner et al. (1994). The work presented here grew out of a larger effort to understand how the leech swim CPG generates oscillations (Taylor, Cottrell, & Kristan, 2000). In our earlier work on this subject, we modeled the leech segmental swim CPG as a set of passive cells with instantaneous synapses between them and connectivity roughly consistent with that determined by experiment. The tentative result of this work was that such a model is able to oscillate
Adam L. Taylor, Garrison W. Cottrell, and William B. Kristan, Jr.
but that even small perturbations of the synaptic strengths destroy this ability. We therefore became interested in the possible role of various cellular and synaptic properties in generating more robust oscillations. One such property, for which there is some evidence in the leech swim CPG (Mangan et al., 1994), is synaptic depression. This mechanism was incorporated in a recent VLSI model of the leech swim CPG (Wolpert, Friesen, & Laffely, 2000). While this was our original motivation, we should say at the outset that the purpose of this article is not to model a known neuronal system (such as the leech swim CPG). Our goal is simply to elucidate some of the behaviors one would expect of an RI circuit that oscillates via synaptic depression. It therefore behooves us to study a simple model of such a system, since little data is available to justify various possible embellishments. Our model makes predictions about the possible modes of oscillation of such a system, about what parameter regimes permit oscillations, and about how those oscillations should change as the parameters change. Thus, we provide a set of hypothetical hallmarks of RI oscillation via synaptic depression. An experimenter could then compare these hallmarks with a system under investigation to evaluate whether it is likely to oscillate via synaptic depression in the way described by our model. 2 Description of the Model
A common simplified model of neuronal dynamics evolves according to the equation

    u̇_i = -u_i + Σ_j W_ij s(u_j) + b_i.    (2.1)

There are several ways to derive this equation (or similar equations) from detailed descriptions of neuronal biophysics, all of which involve some simplifying assumptions (Hopfield, 1984; Wilson & Cowan, 1972). All variables in this equation are unitless, owing to the fact that they are nondimensionalized versions of physical variables. u_i corresponds to the membrane potential of cell i, W_ij to the maximal postsynaptic current due to the synapse from cell j to cell i (the synaptic strength), and b_i to any constant current being injected into cell i. s(x) is a sigmoid function as defined by Wilson and Cowan (1972). We add synaptic depression to this basic model by adding a state variable d_j associated with the synapses from cell j. This variable ranges from zero to one; at zero, the synapse is not depressed at all, and at one, the synapse is completely depressed and can exert no influence on the postsynaptic cell. The modified equation for the dynamics of u_i is then given by

    u̇_i = -u_i + Σ_j (1 - d_j) W_ij s(u_j) + b_i.    (2.2)
We adopt a form for the dynamics of synaptic depression that is first order and linear:

    τ ḋ_i = (1/2) s(u_i) - d_i.    (2.3)

The parameter τ is the time constant of synaptic depression. This form for the dynamics of synaptic depression is adopted because it is simple and captures the basic qualitative features of the phenomenon (Tsodyks & Markram, 1997). The factor of 1/2 in equation 2.3 is to ensure that the steady-state synaptic efficacy ((1 - d_j) W_ij s(u_j), where u_j has been held constant long enough for d_j to come to equilibrium) does not decline with increasing u_j. (In fact, we can generalize by replacing the factor 1/2 by a factor r, such that 0 < r ≤ 1/2, and the dynamics do not change qualitatively. We assume r = 1/2 here because this makes the effects of depression large and for simplicity.) We remain agnostic about the specific biophysics of depression: whether it arises from presynaptic vesicle depletion, presynaptic Ca²⁺ channel inactivation, postsynaptic receptor desensitization, or some other mechanism. By using a single time constant, we are implicitly assuming that the time constants associated with onset and offset of depression are identical, another simplifying assumption. Since we are concerned here only with two-cell networks with symmetric reciprocal inhibition and no self-connections (see Figure 1A), we can rewrite the system equations:

    u̇_1 = -u_1 - (1 - d_2) W s(u_2) + b    (2.4)
    u̇_2 = -u_2 - (1 - d_1) W s(u_1) + b    (2.5)
    τ ḋ_1 = (1/2) s(u_1) - d_1             (2.6)
    τ ḋ_2 = (1/2) s(u_2) - d_2.            (2.7)
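As a concrete illustration, equations 2.4 through 2.7 can be integrated numerically. The sketch below is our own (forward Euler; the parameter values W = 10, b = 6, τ = 50 and the initial conditions are illustrative choices, not taken from the article's figures):

```python
import math

def simulate(W=10.0, b=6.0, tau=50.0, dt=0.01, t_end=2000.0):
    """Forward-Euler integration of the two-cell system, eqs. 2.4-2.7."""
    s = lambda x: 1.0 / (1.0 + math.exp(-4.0 * x))   # s(x) = logistic(4x)
    u1, u2, d1, d2 = 0.1, -0.1, 0.25, 0.25           # mildly asymmetric start
    trace = []
    for k in range(int(t_end / dt)):
        du1 = -u1 - (1.0 - d2) * W * s(u2) + b       # eq. 2.4
        du2 = -u2 - (1.0 - d1) * W * s(u1) + b       # eq. 2.5
        dd1 = (0.5 * s(u1) - d1) / tau               # eq. 2.6
        dd2 = (0.5 * s(u2) - d2) / tau               # eq. 2.7
        u1, u2 = u1 + dt * du1, u2 + dt * du2
        d1, d2 = d1 + dt * dd1, d2 + dt * dd2
        trace.append((k * dt, u1, u2, d1, d2))
    return trace

trace = simulate()
# Count threshold crossings of u1 after the transient: a sustained antiphase
# oscillation shows repeated sign changes; a fixed point shows none.
late = [p for p in trace if p[0] > 1000.0]
crossings = sum((a[1] > 0) != (c[1] > 0) for a, c in zip(late, late[1:]))
```

With these values b/W = 0.6, and the run contains on the order of a dozen alternation cycles.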
Here W is taken to be nonnegative, since we are interested only in inhibitory networks. Thus, the model has just three parameters: W, b, and τ. For a range of parameter settings, the two-cell system with synaptic depression is capable of stable oscillation, as shown in Figure 1B. The sigmoid function used here and in later examples is s(x) = logistic(4x) = 1/(1 + e^(-4x)). The factor of four is to make the slope at x = 0 equal to one, and thus make the linear regime approximately one unit wide. The oscillation proceeds in the following manner. Beginning with cell 1 above threshold and cell 2 below (marked as interval a in the figure), cell 1 inhibits cell 2, and the membrane potential of cell 1 remains relatively constant. This inhibition wanes, however, due to a buildup of synaptic depression (d_1), so the membrane potential of cell 2 creeps slowly upward. At the same time, the synapse from cell 2 onto cell 1 is recovering (reflected by the decrease in d_2) while cell 2 is below threshold. Eventually, cell 2's membrane potential is high enough
to cross threshold and inhibit cell 1 (interval b). Following this quick transition, we arrive at a state in which cell 2 is above threshold and cell 1 is below (interval c). This situation is the mirror image of the original state, with the roles of the cells reversed. Thus, the membrane potential of cell 1 creeps slowly upward as the incoming inhibition wanes, until cell 1 crosses threshold and begins to inhibit cell 2. Again, a switch quickly occurs (interval d), bringing cell 1 above threshold and pushing cell 2 below. This is the original state, and so the process repeats itself, and an oscillation is produced. 3 Analysis of Model Behavior
In this section we analyze the system to understand in detail how the oscillation works. This enables us to derive the range of parameters for which the system oscillates and how the features of that oscillation (e.g., period, amplitude) vary as the parameters are varied. Our analysis relies on two simplifications: (1) synaptic depression changes slowly relative to changes in membrane potential (τ ≫ 1), and (2) the sigmoid function can be approximated as a unit step function (s(x) ≈ 1 if x ≥ 0, and s(x) ≈ 0 otherwise). Using these two simplifications will allow us to derive approximate closed-form expressions for the parameter regime in which oscillations are possible and for the period and amplitude of these oscillations. The first allows us to apply singular perturbation theory, considering separately the fast u system of equations 2.4 and 2.5 and the slow d system of equations 2.6 and 2.7. Furthermore, synaptic depression seems to be about 5 to 10 times slower than typical membrane time constants in many systems of interest, among them the leech swim CPG (Mangan et al., 1994; Calabrese et al., 1995; Manor et al., 1997). The second simplification allows us to replace a difficult-to-analyze nonlinear function with an easier-to-analyze piecewise constant function. This also seems to accord roughly with the behavior of known RI circuits. Cells oscillate between a hyperpolarized level at which they have little synaptic output and a highly active level at which they have substantial synaptic output (Friesen, 1989; Calabrese et al., 1995; Manor et al., 1997). Making these two assumptions results in a reduced system that is piecewise linear and thus can be solved exactly. 3.1 The Fast System. Assumption 1 implies that the u variables will come to equilibrium much faster than the d variables and will remain at approximate equilibrium as the d variables slowly evolve. If the u variables
are approximately at equilibrium, then u̇_1 ≈ u̇_2 ≈ 0. Thus, equations 2.4 and 2.5 become

    -u_1 - (1 - d_2) W s(u_2) + b ≈ 0    (3.1)
    -u_2 - (1 - d_1) W s(u_1) + b ≈ 0.   (3.2)

We would like to solve these equations for u_1 and u_2 given d_1 and d_2, but the nonlinearity makes it impossible to do this exactly, so we invoke assumption 2, replacing the sigmoid with a unit step function. Thus equations 3.1 and 3.2 become

    -u_1 - (1 - d_2) W s(u_2) + b ≈ 0    (3.3)
    -u_2 - (1 - d_1) W s(u_1) + b ≈ 0.   (3.4)
Since the step function can take on only the values 0 or 1, we can break down equations 3.3 and 3.4 into four separate cases, corresponding to the four possible values of the pair (s(u_1), s(u_2)). In each of these cases we can then solve exactly for the u equilibrium.

Case 00. s(u_1) = 0, s(u_2) = 0. In this case, u_1 < 0 and u_2 < 0, and we have

    u_1 = b < 0    (3.5)
    u_2 = b < 0.   (3.6)

In this case, the inequalities imply that this equilibrium of the u system exists only if b < 0. This makes intuitive sense: b < 0 corresponds to the situation in which the tonic input is insufficient to elevate the cells' membrane potential above synaptic threshold.

Case 01. s(u_1) = 0, s(u_2) = 1. In this case, u_1 < 0 and u_2 > 0, and we have

    u_1 = b - (1 - d_2) W < 0    (3.7)
    u_2 = b > 0.                 (3.8)

Note that the inequalities imply that this solution exists only if b > 0, and then only if the values of d_2, b, and W satisfy the inequality in equation 3.7. We can reexpress this inequality as d_2 < 1 - b/W. Again, this makes intuitive sense. Cell 2 is above threshold and cell 1 below, which can happen only if the tonic input is sufficiently excitatory and the inhibition onto cell 1 is not too depressed, that is, only if d_2 is below some specified value. Note that the inequality in equation 3.7 implies that b > 0, since d_2 must lie in the interval [0, 1/2].
Case 10. s(u_1) = 1, s(u_2) = 0. This case is just the mirror image of the 01 case. In this case, u_1 > 0 and u_2 < 0, and we have

    u_1 = b > 0                   (3.9)
    u_2 = b - (1 - d_1) W < 0.    (3.10)

As in the above case, the inequality in equation 3.10 can be reexpressed as d_1 < 1 - b/W. Again, this implies that b > 0.

Case 11. s(u_1) = 1, s(u_2) = 1. In this case, u_1 > 0 and u_2 > 0, and we have

    u_1 = b - (1 - d_2) W > 0    (3.11)
    u_2 = b - (1 - d_1) W > 0.   (3.12)

As in cases 01 and 10, the inequalities can be reexpressed as

    d_1 > 1 - b/W    (3.13)
    d_2 > 1 - b/W.   (3.14)
Intuitively, this case corresponds to both synapses being so depressed that neither cell is able to inhibit the other below threshold, so both cells are active. And again, it should be noted that the inequalities in this case imply that b > 0. In addition to deriving solutions for the equilibria of the u system, we have also derived the conditions under which each exists in terms of W, b, d_1, and d_2. Furthermore, for each case above, there is a unique u equilibrium, so below we will often speak, for instance, of the 01 equilibrium or the 11 equilibrium. The regions in the (d_1, d_2) plane where each equilibrium is possible are shown in Figure 2. Figure 2A shows the rather dull situation that results when b < 0. In this case, only the 00 equilibrium is possible; both cells are inactive because they receive no tonic excitation. For b > 0, as shown in Figures 2B and 2C, the picture is more interesting. The lines d_1 = 1 - b/W and d_2 = 1 - b/W divide the plane into four quadrants, each of which allows for a different set of u equilibria, from the case-by-case analysis above. Only one equilibrium is possible in three of these quadrants, but in the fourth (the lower left one), two equilibria are possible: 01 and 10. This is because d_1 < 1 - b/W defines the region of possible (d_1, d_2) values for the 10 equilibrium (we will call points in this region viable and the region as a whole the viable region for that equilibrium), and d_2 < 1 - b/W defines the viable region for the 01 equilibrium, and the lower left quadrant in Figure 2B is simply the overlap of these two regions. Figure 2B represents the case where b/W > 1/2, and thus 1 - b/W < 1/2. In this case, the lines d_1 = 1 - b/W and d_2 = 1 - b/W fall within the square
Figure 2: Regions of the (d_1, d_2) plane for which the various u equilibria exist. Each panel shows the (d_1, d_2) plane for a different parameter regime. The dashed lines in each panel indicate the square to which (d_1, d_2) is constrained. (A) b < 0. Only u equilibrium 00 is possible. (B) b > 0 and b/W > 1/2. There are three possible u equilibria. (C) b > 0 and b/W < 1/2. Similar to B, except that the lines d_1 = 1 - b/W and d_2 = 1 - b/W are now outside the square of possible (d_1, d_2) values.
of possible (d_1, d_2) values (from equation 2.3, each of the d_i's must be on the interval [0, 1/2]). If b/W < 1/2 (Figure 2C), these lines fall outside the square, and the system no longer has any regions with only one possible u equilibrium. The system can be in either the 01 or 10 equilibria, regardless of (d_1, d_2), and it must be in one of these two equilibria. This has important consequences for the overall behavior of the system, as will become clear when we consider the behavior of the slow d system.

3.2 The Slow System. Now that we have characterized the equilibria of the fast u system for given d_1 and d_2, we would like to describe how the slow d variables evolve, given that the u system is maintained at equilibrium. In equations 2.6 and 2.7, u_1 and u_2 appear only via s(u_1) and s(u_2). Utilizing assumption 2, we can write the slow dynamics as

    τ ḋ_1 = (1/2) s(u_1) - d_1    (3.15)
    τ ḋ_2 = (1/2) s(u_2) - d_2.   (3.16)
Each of the u equilibria discussed above corresponds to one of the four possible values of (s(u_1), s(u_2)). Which u equilibrium the system currently occupies will determine the dynamics of the d system. Because of this, we can draw a separate phase portrait in the (d_1, d_2) plane for each of these equilibria. This is done in Figure 3. In general, the d dynamics have a single fixed point, which is stable, at (s(u_1)/2, s(u_2)/2). However, for u equilibria 01 and 10, this point may or may not be viable, depending on whether b/W > 1/2 (i.e., depending on whether Figure 2B or 2C applies). We concentrate here on the case depicted in Figure 2B, for which the d fixed points for the 01 and 10 equilibria are not in the respective viable regions. Accordingly, Figure 3 represents this case.
Figure 3: Facing page. Viable regions and phase portraits for the three u equilibria possible when b/W > 1/2. (A) 10 equilibrium. (B) 01 equilibrium. (C) 11 equilibrium. Points representing viable states of the system are in white; nonviable points are dark gray. Example trajectories for each u equilibrium are shown. The system as shown contains two attractors: a stable fixed point at s(u_1) = 1, s(u_2) = 1, d_1 = 1/2, d_2 = 1/2 (C) and a stable limit cycle comprising line segments along the line d_1 + d_2 = 1/2 for s(u_1) = 1, s(u_2) = 0 and s(u_1) = 0, s(u_2) = 1 (A and B). The points in the limit cycle where the u equilibrium changes are indicated by the arcs connecting the 10 and 01 planes (A and B, respectively). The various segments of the limit cycle are also labeled a, b, c, and d, to correspond with the labeled parts of the oscillation in Figure 1B. The light gray triangles represent regions for which the system will not oscillate, but rather will eventually transition to the 11 equilibrium.
For b > 0 and b/W > 1/2, the d fixed points for the 01 and 10 equilibria are not in the viable region. In this case, the d system will move toward the fixed point until it hits the edge of the viable region, where the u equilibrium the system has been tracking will disappear. At this point, (u_1, u_2) will quickly (i.e., on the timescale of the fast system) change to some other equilibrium of the fast u system. Note that at all points where the d system can cross from a viable region to a nonviable region, there is only one u equilibrium to which the fast system can settle after the transition. Thus, the posttransition u equilibrium is uniquely determined in all cases. Graphically, the current value of (d_1, d_2)
will move within the white region of one of the panels of Figure 3 until it hits a viable–nonviable (white–dark gray) border, at which point it will jump to the same (d_1, d_2) point on another panel, the destination panel being that one for which the current value of (d_1, d_2) is in the viable region. Intuitively, the jumps happen when one of the synapses depresses enough (or recovers enough) that the system as a whole can no longer support the u equilibrium the system was previously in. For instance, if the system is in the 10 equilibrium, but (d_1, d_2) crosses the line d_1 = 1 - b/W, d_1 will now be too large for cell 1 to inhibit cell 2 below threshold, and the system will quickly jump to a state with cell 2 above threshold.

3.3 Attractors. Since the reduced dynamics of (d_1, d_2) are piecewise linear, they can be solved exactly. The general solution (including transients) is cumbersome, however, and does not yield additional insight. Instead of presenting the general solution, we describe the various steady-state solutions of the system. It should be noted that these solutions are exact for the reduced system only and are only approximate solutions of the full system. We assess how good these approximations are in section 4.
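The reduced fast-slow dynamics can also be simulated directly: the d variables follow the linear slow equations for whichever u equilibrium is current, and the state jumps when (d_1, d_2) leaves the viable region. The event-driven sketch below is our own illustration (state labels follow the case analysis; all function and variable names are ours):

```python
def reduced_run(W, b, tau, d1=0.3, d2=0.05, dt=0.01, t_end=1000.0):
    """Integrate the reduced slow system, jumping between u equilibria
    (labeled '10', '01', '11') when (d1, d2) crosses d = 1 - b/W."""
    border = 1.0 - b / W
    state = "10" if d1 < border else "01"    # assume a viable starting label
    switches = []                            # times at which the state jumps
    for k in range(int(t_end / dt)):
        s1 = 1.0 if state in ("10", "11") else 0.0   # step-function s(u1)
        s2 = 1.0 if state in ("01", "11") else 0.0   # step-function s(u2)
        d1 += dt * (0.5 * s1 - d1) / tau             # eq. 3.15
        d2 += dt * (0.5 * s2 - d2) / tau             # eq. 3.16
        if state == "10" and d1 >= border:           # synapse 1 too depressed
            state = "11" if d2 >= border else "01"
            switches.append(k * dt)
        elif state == "01" and d2 >= border:         # synapse 2 too depressed
            state = "11" if d1 >= border else "10"
            switches.append(k * dt)
    return state, switches

state, switches = reduced_run(W=10.0, b=6.0, tau=50.0)
# With b/W = 0.6, the state keeps alternating between 10 and 01; once in 11
# (both d's above the border), no switching rule applies and the system stays.
```

On the attracting cycle, successive switch times become evenly spaced at half the oscillation period.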
3.3.1 Limit Cycles. For certain values of the parameters, the system is capable of oscillation, and the mechanism of oscillation can be visualized in Figure 3. For the 10 equilibrium (see Figure 3A), the dynamics of (d_1, d_2) have an attracting fixed point at (1/2, 0). Therefore (d_1, d_2) moves toward this point but is unable to reach it because it is in the nonviable region. Upon reaching the line d_1 = 1 - b/W, (s(u_1), s(u_2)) switches to the value (0, 1). (Actually, if d_2 > 1 - b/W at the transition, (s(u_1), s(u_2)) switches instead to (1, 1), and no oscillation occurs. We discuss this possibility below, but for now assume that d_2 < 1 - b/W at the transition.) For (s(u_1), s(u_2)) = (0, 1) (see Figure 3B), the dynamics of (d_1, d_2) have an attracting fixed point at (0, 1/2), and (d_1, d_2) begins to move toward this point. Once again the nonviable region intervenes, causing a switch back to (s(u_1), s(u_2)) = (1, 0). This cycle then repeats itself. The fact that this cyclic behavior approaches a stable limit cycle can be seen by considering the quantity d_1 + d_2. Adding equation 3.15 to equation 3.16, we have that

    τ d(d_1 + d_2)/dt = (1/2)[s(u_1) + s(u_2)] - (d_1 + d_2).    (3.17)

Since s(u_1) + s(u_2) = 1 for both (s(u_1), s(u_2)) = (1, 0) and (s(u_1), s(u_2)) = (0, 1), the sum d_1 + d_2 monotonically approaches 1/2 as the system switches back and forth between the u equilibria, and thus the system approaches a stable limit cycle along the line d_1 + d_2 = 1/2. This limit cycle is shown in
Figures 3A and 3B. The various phases of the limit cycle are labeled a, b, c, and d in the figure to correspond with the same labels in Figure 1. Since the reduced dynamics are linear between the jump points, we can write an exact expression for the steady-state oscillatory solution:

    d_1(t) = d_hi exp(-t̃/τ)                          for t̃ ≤ T/2
             1/2 + (d_lo - 1/2) exp(-(t̃ - T/2)/τ)    for t̃ > T/2    (3.18)

    d_2(t) = 1/2 + (d_lo - 1/2) exp(-t̃/τ)            for t̃ ≤ T/2
             d_hi exp(-(t̃ - T/2)/τ)                  for t̃ > T/2    (3.19)

where t̃ = mod(t, T), d_hi = 1 - b/W, d_lo = 1/2 - d_hi = b/W - 1/2, and T is the period of the oscillation, given below. This is the steady-state solution for the initial conditions d_1(0) = d_hi and d_2(0) = d_lo, but any initial conditions such that d_1 + d_2 = 1/2 will also yield a steady-state oscillation, which will simply be a phase-shifted version of the above. Essentially, the d_i's exponentially decay from d_hi to d_lo and back again, in antiphase to one another. The period of the oscillations is twice the time it takes (d_1, d_2) to go from one jump point to the other along the line d_1 + d_2 = 1/2; thus, we can show that the period is equal to

    T = -2τ ln[ 1/(2(1 - b/W)) - 1 ].    (3.20)
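Written out programmatically, the steady-state solution and period look as follows. This is our own sketch; we write the second branch of equations 3.18 and 3.19 with an explicit T/2 time shift so that each d_i is continuous at the half-period:

```python
import math

def limit_cycle(W, b, tau):
    """Closed-form steady-state d1(t) and the period T (eqs. 3.18 and 3.20)."""
    d_hi = 1.0 - b / W
    d_lo = b / W - 0.5                       # = 1/2 - d_hi
    T = -2.0 * tau * math.log(1.0 / (2.0 * (1.0 - b / W)) - 1.0)   # eq. 3.20

    def d1(t):
        tq = t % T                           # t-tilde = mod(t, T)
        if tq <= T / 2.0:                    # first half: synapse 1 recovering
            return d_hi * math.exp(-tq / tau)
        # second half: relaxation toward 1/2 while cell 1 is active
        return 0.5 + (d_lo - 0.5) * math.exp(-(tq - T / 2.0) / tau)

    return d1, d_hi, d_lo, T

d1, d_hi, d_lo, T = limit_cycle(W=10.0, b=6.0, tau=50.0)
# d1 starts at d_hi, reaches d_lo at T/2, and returns to d_hi at T;
# d2 is the same waveform shifted by half a period.
```

For the illustrative values W = 10, b = 6, τ = 50, this gives d_hi = 0.4, d_lo = 0.1, and T ≈ 139.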
From equations 3.18 and 3.19 (or by examining the geometry of the limit cycle, as shown in Figures 3A and 3B), we can derive the amplitude (defined here as the difference between the maximum and minimum values) and time average of the oscillations of the d variables:

    A_d = 3/2 - 2b/W    (3.21)
    ⟨d⟩ = 1/4.          (3.22)

Similarly, we can derive these quantities for the u variables:

    A_u = (3/2)W - b                                       (3.23)
    ⟨u⟩ = b - (1/4)W + (τ/T)(b - W)[1 - exp(-T/(2τ))].     (3.24)
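For reference, equations 3.20 through 3.24 can be bundled into one helper. This is a sketch with our own naming, and the parameter values in the check are illustrative:

```python
import math

def oscillation_features(W, b, tau):
    """Period, amplitudes, and means of the reduced limit cycle (eqs. 3.20-3.24)."""
    T = -2.0 * tau * math.log(1.0 / (2.0 * (1.0 - b / W)) - 1.0)   # eq. 3.20
    A_d = 1.5 - 2.0 * b / W                                        # eq. 3.21
    mean_d = 0.25                                                  # eq. 3.22
    A_u = 1.5 * W - b                                              # eq. 3.23
    mean_u = (b - 0.25 * W                                         # eq. 3.24
              + (tau / T) * (b - W) * (1.0 - math.exp(-T / (2.0 * tau))))
    return T, A_d, mean_d, A_u, mean_u

# Doubling W and b together doubles A_u but leaves T, A_d, and mean_d
# unchanged, since those depend only on the ratio b/W (and on tau).
f1 = oscillation_features(W=10.0, b=6.0, tau=50.0)
f2 = oscillation_features(W=20.0, b=12.0, tau=50.0)
```

This scaling behavior is the same one noted in section 4 when discussing the accuracy of the approximations.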
3.3.2 Equilibrium Points. In addition to the stable limit cycle, the system also allows for stable fixed points. A stable fixed point is located at (d_1, d_2) = (1/2, 1/2) for the 11 equilibrium (see Figure 3C). This corresponds to the situation in which both synapses are maximally depressed, and neither synapse is able to inhibit the postsynaptic cell below threshold. In fact, all states for which the u system is at the 11 equilibrium are in the
basin of attraction for this fixed point, so if both d_1 and d_2 are greater than 1 - b/W at any point in time, the system settles to this fixed point. This reflects the fact that if both synapses get so depressed that neither cell is able to inhibit the other below threshold, then both synapses can only get increasingly depressed as time goes on, and hence approach a state of maximal depression, represented by the fixed point. Points in the other u equilibria that are also in the basin of attraction for this fixed point are shown in light gray in Figures 3A and 3B. All of the above applies to the b/W > 1/2 case. On the other hand, if 0 < b/W < 1/2, as in Figure 2C, neither the limit cycle nor the described stable fixed point exists. In this case, the d equilibria in Figures 3A and 3B are in the viable region and are therefore equilibria of the full system. The system behaves very much like an RI circuit without synaptic depression. It possesses two mirror-symmetric stable states: one with cell 1 above threshold and cell 2 below and one with cell 2 above and cell 1 below.

3.4 Oscillatory Regime. The size of the viable region for the 11 equilibrium changes as a function of the parameters, since its borders are defined by the lines d_1 = 1 - b/W and d_2 = 1 - b/W. In contrast, the limit cycle (when it exists) is always on the line d_1 + d_2 = 1/2. As a consequence, the limit cycle exists only for parameter values such that the viable region of the 11 equilibrium does not overlap the line d_1 + d_2 = 1/2. This leads to the constraint that 1 - b/W > 1/4 (or, equivalently, b/W < 3/4) in order for a stable limit cycle to exist, since the corner of the viable region for the 11 equilibrium is at (d_1, d_2) = (1 - b/W, 1 - b/W), and this point just intersects the limit cycle when 1 - b/W = 1/4.
Combined with the fact that b must also satisfy b/W > 1/2 for the limit cycle to exist, we have the following constraint that must be satisfied for the system to oscillate:

    1/2 < b/W < 3/4.    (3.25)
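Equation 3.25 amounts to a one-line predicate for the analytically predicted oscillatory regime (the function name is ours):

```python
def predicted_to_oscillate(W, b):
    """Eq. 3.25: the reduced analysis predicts a limit cycle iff 1/2 < b/W < 3/4."""
    return W > 0 and 0.5 < b / W < 0.75

# b/W = 0.6 lies inside the regime; 0.4 and 0.8 lie outside.
print(predicted_to_oscillate(10.0, 6.0),
      predicted_to_oscillate(10.0, 4.0),
      predicted_to_oscillate(10.0, 8.0))   # True False False
```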
This defines the region of parameter space in which the system possesses a limit cycle attractor.

4 Comparison with Simulations
The analysis above applies in the limits of large τ and steep sigmoid function. However, if typical values of the u_i's are such that |u_i| ≫ 1, then even if the linear regime of the sigmoid is of approximately unit width, the sigmoids will be saturated most of the time and so will be effectively steep. In the case of oscillations, therefore, we would expect our approximations to be most accurate when the amplitude of the oscillations is much greater than one. In simulations, we find that the oscillations get bigger in amplitude as W and b get bigger. (This is fully consistent with equation 3.23: if we scale both W and b by a constant factor, A_u will be scaled by the same factor.)
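This comparison can be reproduced in miniature: integrate the full system, estimate the period from upward threshold crossings of u_1, and compare with equation 3.20. The sketch below is our own, with illustrative parameters in the large-τ, large-W regime:

```python
import math

def measured_period(W, b, tau, dt=0.01, t_end=3000.0):
    """Euler-integrate eqs. 2.4-2.7 and estimate the period of u1 from its
    upward threshold crossings, discarding the first half of the run."""
    s = lambda x: 1.0 / (1.0 + math.exp(-4.0 * x))
    u1, u2, d1, d2 = 0.1, -0.1, 0.25, 0.25
    ups, prev = [], u1
    for k in range(int(t_end / dt)):
        du1 = -u1 - (1.0 - d2) * W * s(u2) + b
        du2 = -u2 - (1.0 - d1) * W * s(u1) + b
        d1 += dt * (0.5 * s(u1) - d1) / tau
        d2 += dt * (0.5 * s(u2) - d2) / tau
        u1, u2 = u1 + dt * du1, u2 + dt * du2
        if prev <= 0.0 < u1 and k * dt > t_end / 2.0:
            ups.append(k * dt)
        prev = u1
    gaps = [t2 - t1 for t1, t2 in zip(ups, ups[1:])]
    return sum(gaps) / len(gaps) if gaps else float("nan")

T_sim = measured_period(W=10.0, b=6.0, tau=50.0)
T_ana = -2.0 * 50.0 * math.log(1.0 / (2.0 * (1.0 - 0.6)) - 1.0)   # eq. 3.20
```

For these values the two estimates should be close; as in Figure 5, the agreement degrades as τ or W is reduced.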
Thus, we would expect our approximations to give the best agreement with simulations of the full system for large τ, W, and b. As a practical matter, however, we find that the results of numerical simulations of the full system agree fairly well with the analytical predictions, even when the system is far from these limits. Figure 4 shows the volume of parameter space for which the model of equations 2.4 through 2.7 is oscillatory, both in numerical simulations and according to the analysis. In general, the analysis gives a good approximation to the oscillatory regime of the full system. Of course, the closer the system parameters adhere to the assumptions, the better the predictions are. For large τ, the area of the (W, b) plane that is oscillatory in simulation is closest to the theoretical prediction, with gradually lessening agreement for lower τ. To illustrate further the qualitative agreement between simulation and analysis, Figure 5 shows the period of the system as a function of b for a number of discrete values of W and τ. The analytical value given by equation 3.20 shows good agreement with the measured period from numerical simulations over a wide range of parameter values. The agreement is again best for large values of τ, as would be expected. Also, the agreement is better for large values of W. As W is decreased, assumption 2 becomes less appropriate, and the agreement between analysis and numerical experiment is correspondingly worse. Nonetheless, the agreement between analysis and simulation is quite respectable even in Figure 5C, when W is only a few times greater than one.

5 Discussion
We have demonstrated an extremely simple RI circuit that oscillates solely due to synaptic depression. We have analyzed the circuit to elucidate the mechanisms of this oscillation. Using simple approximations, we have derived closed-form expressions for the region in parameter space in which oscillation occurs, and for various characteristics of this oscillation. Furthermore, we have shown that these approximate expressions match simulations of the full system for a wide range of parameter values. 5.1 Synaptic Depression vs. Intrinsic Adaptation. A natural question to ask is whether oscillations in our system differ in any qualitative way from oscillations arising due to intrinsic cellular properties. On the face of it, we might expect the two mechanisms to lead to similar behavior. If we consider only the input-output relationship of a single neuron, intrinsic adaptation and synaptic depression seem very similar in their effects. In response to a step increase in input, both intrinsic adaptation and synaptic depression produce a similar effect on the output: a large transient increase in output followed by a decay to a level intermediate between the initial level and the peak transient level. This similarity is misleading, however. The oscillations in our system do differ qualitatively from those arising in
Figure 4: Volume of parameter space for which the system oscillates. Systems that produce stable oscillations in numerical simulations of the full system are on the inside of the gray wedge. The planes b/W = 1/2 and b/W = 3/4, which delineate the analytically determined oscillatory regime (see equation 3.25), are also shown, in outline, by solid lines. A log scale is used for the τ-axis to display a large range of possible values. The insets show the oscillatory regime in the (W, b) plane for fixed values of τ (horizontal slices through the three-dimensional volume). The gray region in the insets is the interior of the wedge shown in the three-dimensional view.
intrinsic adaptation circuits. In order to describe the ways in which they differ, we will first discuss the modes of oscillation that arise in intrinsic adaptation RI networks and then contrast these with the single mode of oscillation available to our system. Skinner et al. (1994), building on earlier work by Wang and Rinzel (1992), analyzed oscillations of RI networks in which the model cells were of the
Morris-Lecar type. These cells include a fast Ca²⁺ current as well as a slow K⁺ current. (Actually, their analysis does not depend on the particular ionic currents present, only on the nullcline structure of the dynamics. The Morris-Lecar cell is simply a well-known model with the required nullcline structure.) They found that such a model circuit can support oscillation in one of four distinct modes: intrinsic escape, intrinsic release, synaptic escape, and synaptic release. Which mode the system is in is characterized by the events leading up to burst termination, and each mode has particular features, such as whether the period of oscillation is sensitive to the synaptic threshold. The two intrinsic modes of oscillation are made possible by the presence of the fast Ca²⁺ current, which makes the membrane bistable on short timescales. When this current is removed, oscillation can still occur, but only via the two synaptic mechanisms (A. L. Taylor, unpublished simulations). In the synaptic escape mode of oscillation, the inhibited cell's membrane potential creeps upward until it crosses the synaptic threshold. At this point, a fast switch occurs in which the inhibited cell becomes dominant, and vice versa. When the dominant-inhibited transitions take place in this fashion, it is found that the frequency of oscillations increases as the synaptic threshold is lowered. In the synaptic release mode, the dominant cell's membrane potential creeps downward until it crosses the synaptic threshold, at which point the dominant-inhibited switch occurs. In this case, it is found that the frequency of oscillations decreases as the synaptic threshold is lowered. Since it is possible for the membrane potential of the dominant cell to creep downward at the same time as the membrane potential of the inhibited cell is creeping upward, the mode in which the system oscillates will be determined by which cell's membrane potential hits the synaptic threshold first.
In our system, on the other hand, only the synaptic escape mode of oscillation occurs. The membrane potential of the inhibited cell creeps upward due to the waning synaptic inhibition from the dominant cell, but the membrane potential of the dominant cell stays constant for the duration of dominance, since there are no slow currents to cause it to decrease, and it is not under any synaptic influence from the inhibited cell. Thus, dominant-inhibited transitions can occur only when the inhibited cell crosses the synaptic threshold. In our model, raising b is analogous to lowering the synaptic threshold, and from equation 3.20 we see that increasing b results in an increase in frequency, as we would expect for the synaptic escape mode of oscillation. Thus, oscillation via synaptic depression is qualitatively different from oscillation via intrinsic adaptation: synaptic depression oscillations occur only via a single mode, the synaptic escape mode, whereas intrinsic adaptation oscillations can occur via either the synaptic release or synaptic escape modes. Furthermore, this has immediately relevant functional consequences: The frequency of oscillation in an intrinsic adaptation RI circuit can either increase or decrease in response to increased tonic excitation, whereas the frequency of a synaptic depression RI circuit can only increase.
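The claim in the last sentence can be checked directly against equation 3.20: over the entire oscillatory regime 1/2 < b/W < 3/4, the period decreases, and hence the frequency increases, monotonically in b. A quick sketch:

```python
import math

def period(W, b, tau=50.0):
    """Eq. 3.20 for the period of the reduced limit cycle."""
    return -2.0 * tau * math.log(1.0 / (2.0 * (1.0 - b / W)) - 1.0)

# Sweep b across the oscillatory regime for W = 10: T falls monotonically,
# i.e., the frequency can only increase with the tonic drive b.
bs = [5.1 + 0.1 * k for k in range(24)]          # b/W from 0.51 to 0.74
Ts = [period(10.0, b) for b in bs]
print(all(t1 > t2 for t1, t2 in zip(Ts, Ts[1:])))   # True
```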
Adam L. Taylor, Garrison W. Cottrell, and William B. Kristan, Jr.
Figure 5: The period, T, of the oscillation as a function of b for various values of W and τ. This figure shows the agreement between the analytical value for the period of the oscillation (see equation 3.20) and the results of numerical simulations of the full four-dimensional system. A log scale is used for the T-axis to display periods over several orders of magnitude conveniently. The dashed lines indicate the limits of the oscillatory regime according to the analysis. The curves end where the system no longer oscillates, in simulation or according to the analysis.
Oscillations with Synaptic Depression
5.2 Relation to Previous Work. The idea that synaptic depression can lead to oscillations in an RI circuit is not novel. It was suggested in some of the earliest work on spinal CPGs in mammals (Graham Brown, 1911) and since then has been considered one of a handful of standard mechanisms for generating oscillations in such a circuit (Reiss, 1962; Perkel & Mulloney, 1974; Friesen, 1994; Ermentrout, 1995). Outside their role in motor CPGs, RI circuits have also been considered as a possible mechanism underlying various kinds of perceptual alternations, such as binocular rivalry and Marroquin patterns (Matsuoka, 1984; Blake, 1989; Mueller, 1990; Wilson, Krupa, & Wilkinson, 2000). Synaptic depression has also been studied in connection with oscillations in larger networks. A model of rhythmic bursting in developing chick spinal cord relies on synaptic depression of excitatory synapses on two separate timescales, one being that of the within-burst oscillations and the other being that of the interburst period (Tabak, Senn, O'Donovan, & Rinzel, 2000). Another recent model shows that large networks of randomly connected neurons with nonlinear synapses (in this case, synapses that both depress and facilitate) can spontaneously generate population bursts: times at which nearly all cells emit a spike, and do so in near synchrony (Tsodyks, Uziel, & Markram, 2000). However, none of the above work examines the case of two cells with depressing reciprocal inhibition in an analytically tractable model. As a result, none is able to derive simple expressions for the oscillatory regime, period, and amplitude of the resulting oscillations. Reiss (1962) considers a model similar to ours, but with spiking neurons that are not amenable to analysis. Matsuoka (1984) considers a model that is amenable to analysis, but one incorporating a form of intrinsic adaptation rather than synaptic depression. Hence, this article provides a novel analysis of a form of neuronal bursting.

Acknowledgments
A.L.T. was supported by National Institutes of Health training grant GM08107 and a La Jolla Interfaces in Science Predoctoral Fellowship (funded by the Burroughs Wellcome Fund). G.W.C. and W.B.K. were supported by National Institutes of Health grant MH43396. We thank Peter Rowat and the two reviewers for their helpful comments and suggestions.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275(5297), 220–224.
Arshavsky, Y. I., Deliagina, T. G., Orlovsky, G. N., Panchin, Y. V., Popova, L. B., & Sadreyev, R. I. (1998). Analysis of the central pattern generator for swimming in the mollusk Clione. Annals of the New York Academy of Sciences, 860(5), 51–69.
Blake, R. (1989). A neural theory of binocular rivalry. Psychological Review, 96(1), 145–167.
Brodfuehrer, P. D., Debski, E. A., O'Gara, B. A., & Friesen, W. O. (1995). Neuronal control of leech swimming. Journal of Neurobiology, 27(3), 403–418.
Calabrese, R. L., Nadim, F., & Olsen, Ø. H. (1995). Heartbeat control in the medicinal leech: A model system for understanding the origin, coordination, and modulation of rhythmic motor patterns. Journal of Neurobiology, 27(3), 390–402.
Ermentrout, G. B. (1995). Phase-plane analysis of neural activity. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 732–738). Cambridge, MA: MIT Press.
Friesen, W. O. (1989). Neuronal control of leech swimming movements. In J. Jacklet (Ed.), Neuronal and cellular oscillators. New York: Marcel Dekker.
Friesen, W. O. (1994). Reciprocal inhibition: A mechanism underlying oscillatory animal movements. Neuroscience and Biobehavioral Reviews, 18(4), 547–553.
Galarreta, M., & Hestrin, S. (1998). Frequency-dependent synaptic depression and the balance of excitation and inhibition in the neocortex. Nature Neuroscience, 1(7), 587–594.
Graham Brown, T. (1911). The intrinsic factors in the act of progression in the mammal. Proceedings of the Royal Society of London, B84, 308–319.
Grillner, S., Deliagina, T., Ekeberg, O., el Manira, A., Hill, R. H., Lansner, A., Orlovsky, G. N., & Wallén, P. (1995). Neural networks that coordinate locomotion and body orientation in lamprey. Trends in Neurosciences, 18(6), 270–279.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA, 81(10), 3088–3092.
LoFaro, T., Kopell, N., Marder, E., & Hooper, S. (1994). Subharmonic coordination in networks of neurons with slow conductances. Neural Computation, 6(1), 69–84.
Mangan, P. S., Cometa, A. K., & Friesen, W. O. (1994). Modulation of swimming behavior in the medicinal leech. IV: Serotonin-induced alteration of synaptic interactions between neurons of the swim circuit. Journal of Comparative Physiology A, 175(6), 723–736.
Manor, Y., Nadim, F., Abbott, L. F., & Marder, E. (1997). Temporal dynamics of graded synaptic transmission in the lobster stomatogastric ganglion. Journal of Neuroscience, 17(14), 5610–5621.
Matsuoka, K. (1984). The dynamic model of binocular rivalry. Biological Cybernetics, 49(3), 201–208.
Mueller, T. J. (1990). A physiological model of binocular rivalry. Visual Neuroscience, 4(1), 63–73.
Nadim, F., Manor, Y., Kopell, N., & Marder, E. (1999). Synaptic depression creates a switch that controls the frequency of an oscillatory circuit. Proceedings of the National Academy of Sciences USA, 96(14), 8206–8211.
Perkel, D. H., & Mulloney, B. (1974). Motor pattern production in reciprocally inhibitory neurons exhibiting postinhibitory rebound. Science, 185(146), 181–183.
Reiss, R. F. (1962). A theory and simulation of rhythmic behavior due to reciprocal inhibition in small nerve nets. In AFIPS Spring Joint Computer Conference, 21 (pp. 171–194). San Francisco: National Press.
Roberts, A., Soffe, S., & Perrins, R. (1997). Spinal networks controlling swimming in hatchling Xenopus tadpoles. In P. Stein, S. Grillner, A. Selverston, & D. Stuart (Eds.), Neurons, networks, and motor behavior (pp. 83–90). Cambridge, MA: MIT Press.
Rowat, P. F., & Selverston, A. I. (1997). Oscillatory mechanisms in pairs of neurons connected with fast inhibitory synapses. Journal of Computational Neuroscience, 4(2), 103–127.
Selverston, A., & Moulins, M. (1987). The crustacean stomatogastric system. Berlin: Springer-Verlag.
Skinner, F. K., Kopell, N., & Marder, E. (1994). Mechanisms for oscillation and frequency control in reciprocally inhibitory model neural networks. Journal of Computational Neuroscience, 1(1–2), 69–87.
Tabak, J., Senn, W., O'Donovan, M. J., & Rinzel, J. (2000). Modeling of spontaneous activity in developing spinal cord using activity-dependent depression in an excitatory network. Journal of Neuroscience, 20(8), 3041–3056.
Taylor, A., Cottrell, G. W., & Kristan, Jr., W. B. (2000). A model of the leech segmental swim central pattern generator. Neurocomputing, 32–33, 573–584.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proceedings of the National Academy of Sciences USA, 94(2), 719–723.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. Journal of Neuroscience, 20(1), RC50.
Wang, X.-J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Computation, 4(1), 84–97.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12(1), 1–24.
Wilson, H. R., Krupa, B., & Wilkinson, F. (2000). Dynamics of perceptual oscillations in form vision. Nature Neuroscience, 3(2), 170–176.
Wolpert, S., Friesen, W. O., & Laffeley, A. J. (2000). A silicon model of the Hirudo swim oscillator. IEEE Engineering in Medicine and Biology Magazine, 19(1), 64–75.

Received February 28, 2000; accepted May 30, 2001.
LETTER
Communicated by Rajesh P. N. Rao
Activity-Dependent Development of Axonal and Dendritic Delays, or, Why Synaptic Transmission Should Be Unreliable

Walter Senn
[email protected]
Physiologisches Institut, Universität Bern, CH-3012 Bern, Switzerland

Martin Schneider
[email protected]
Berthold Ruf
[email protected]
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria

Systematic temporal relations between single neuronal activities or population activities are ubiquitous in the brain. No experimental evidence, however, exists for a direct modification of neuronal delays during Hebbian-type stimulation protocols. We show that in fact an explicit delay adaptation is not needed if one assumes that the synaptic strengths are modified according to the recently observed temporally asymmetric learning rule, with the downregulating branch dominating the upregulating branch. During development, slow, unbiased fluctuations in the transmission time, together with temporally correlated network activity, may control neural growth and implicitly induce drifts in the axonal delays and dendritic latencies. These delays and latencies become optimally tuned in the sense that the synaptic response tends to peak in the soma of the postsynaptic cell when this cell is most likely to fire. The nature of the selection process requires unreliable synapses in order to give successful synapses an evolutionary advantage over the others. The width of the learning function also determines the preferred dendritic delay and the preferred width of the postsynaptic response. Hence, it may implicitly determine whether a synaptic connection provides a precisely timed or a broadly tuned "contextual" signal.

1 Introduction
Neural Computation 14, 583–619 (2002) © 2002 Massachusetts Institute of Technology

The temporal structure in any type of cognitive or behavioral task, such as memory recall, motor execution, or navigation, must be reflected in the temporal activity of neurons or neuronal populations. Time in this neuronal correlate is often considered to be encoded in the synaptic strengths that map to a phase advance of a postsynaptic neuron (Roberts, 1999; see also Hopfield & Brody, 2000, 2001). Time may also directly be encoded in axonal propagation delays and dendritic latencies. While in the case of a representation in synaptic weights the timing of a postsynaptic cell may be adapted by any synaptic modification, there is no experimental evidence for a direct adaptation of axonal and dendritic delays. Here we show that a temporally asymmetric spike-time-dependent synaptic modification may substitute for such a direct adaptation of delays in favor of an implicit adaptation through selection. According to these rules, the synaptic efficacy is upregulated if the presynaptic action potential (AP) is elicited before the postsynaptic AP, while it is downregulated otherwise (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Feldman, 2000). The high temporal resolution occurring in these synaptic rules is puzzling, especially in view of the wide range of possible delays and latencies encountered in the brain. On the one hand, the temporal order of the pre- and postsynaptic signals can be distinguished with a remarkable sensitivity, which may fall below 10 ms. On the other hand, substantial delays may be encountered between the generation of the presynaptic AP, the excitatory postsynaptic potential (EPSP) at the synaptic site, and the maximal response of the EPSP in the soma of the postsynaptic cell (see Figure 1a). We show that despite this temporal discrepancy, the learning rule can select the correct pre- and postsynaptic delays (where "correct" means that the somatic response peaks when the postsynaptic cell is most likely to fire), provided that the synaptic transmission is unreliable. We also show that in the presence of small, unbiased delay fluctuations, the learning rule makes the delays drift until the correct timing is achieved.
In order to prevent an unbounded growth of the synaptic weights during this drift, it is sufficient to require that the downregulating branch of the learning rule dominates the upregulating branch in strength (cf. Abbott & Song, 1999; Kempter, Gerstner, & van Hemmen, 1999). The paradigm of delay adaptation through selection may be of particular importance in the developing nervous system, where it could explain the temporal fine tuning of specific synaptic pathways. The idea that delays may be selected through an adaptation of synaptic weights was explored early in theoretical works focused on learning spatiotemporal activity patterns (Sompolinsky & Kanter, 1986; Kühn & van Hemmen, 1991; Gerstner, 1993; Natschläger & Ruf, 1998). Delay selection is also a key mechanism for localizing sound signals. It has been suggested that during the ontogenetic development of the barn owl auditory system, synaptic connections with matching delays are selected through an asymmetric learning rule (Gerstner, Kempter, van Hemmen, & Wagner, 1996; see also Carr & Friedman, 1999). Other theoretical works discuss learning rules that explicitly adjust the synaptic delay as a function of the pre- and postsynaptic activities (Hüning, Glünder, & Palm, 1998; Eurich, Pawelzik, Ernst, Cowan, & Milton, 1999). Instead of postulating an explicit form of delay adaptation, however, we claim that delay learning is a natural consequence of a temporally asymmetric modification of the synaptic strength. Our analysis can be generalized to reinforcement learning, which may work with a similar type of asymmetric learning rule, although on a larger timescale and on the level of neuronal populations. An example of learning motor sequences within the basal ganglia–thalamocortical feedback loop outlined in Section 7 shows that the same delay structure between a presynaptic spike, the synaptic release, the postsynaptic spike, and the backward reinforcement signal may also occur within cortical circuitries (compare the stages A, S, and B in Figures 1a and 8).

Figure 1: The model. (a) The different signal delays between a presynaptic neuron (or a neuronal population) A and a postsynaptic neuron B. The total forward delay is composed of an axonal delay D_ax, a (vanishing) synaptic delay D_syn, and the dendritic latency, consisting of a forward dendritic (propagation) delay D^f_den and the EPSP rise time t_rise measured in the soma, D^lat_den = D^f_den + t_rise. At the peak of the EPSP, there is an enhanced probability that neuron B will fire. An AP generated in the soma of neuron B arrives with a backward dendritic delay D^b_den at the synaptic site S. (b) The synaptic weight w is modified depending on the local time difference Δt_syn between the EPSP and the backward AP at site S (inset, Δt_syn = (t_A + D_ax) − (t_B + D^b_den)). This rule, together with slow delay fluctuations and unreliable synaptic transmissions, induces a drift of the total forward delay from A to B toward the most probable spike time difference t_B − t_A between the two cells. Depending on the time difference, a, that corresponds to the maximum of the learning function, the final synaptic position will be either close to the soma (a small) or far out on the apical dendritic tree (a large). Hence, the shape of the learning function determines whether a synaptic connection tends to trigger a postsynaptic AP (through a proximal synapse) or whether it tends to support the postsynaptic activity by a slow contextual EPSP (through a distal synapse), respectively.

1.1 Integration Field of a V1 Cell: An Example. As a first example for delay adaptations, we consider the integration field of a layer 5 pyramidal cell in the primary visual cortex. This cell is driven by feedforward visual input through the lateral geniculate nucleus (LGN), which synapses on the dendritic shaft in layer 4 (Nieuwenhuys, 1994). At the same time, the cell receives lateral input onto the apical dendritic tree from neighboring cortical columns (cf. Figure 1b). In vivo experiments show that the cell horizontally integrates neuronal activity over a cortical patch of roughly 1 cm in diameter (Bringuier, Chavane, Glaeser, & Frégnac, 1999). This cortical integration field corresponds to a visual receptive field on the retina of roughly 10 degrees. However, only nearby cortical input, corresponding to visual stimulation within 2 degrees of the receptive field center, can activate the cell, while peripheral horizontal input provides only subthreshold input. The weak peripheral horizontal input can also be delayed by up to 50 ms. It is assumed that these delays are mainly due to the slowly conducting horizontal pathways, which show axonal conduction velocities ranging from 0.1 to 0.3 m per second. The existence of the different latencies between the feedforward vertical input onto the proximal region and the peripheral intracortical input onto the distal portion of the apical tree suggests that the latencies are adjusted to optimize the somatic summation (Bringuier et al., 1999). Our simulations show that the temporally asymmetric learning rule is in fact a means to tune these intracortical delays optimally.

1.2 Outline of the Basic Mechanisms. To explain the basic mechanisms, we consider a neocortical cell (or population of cells) A and a layer 5 pyramidal cell B with different delay lines from A to B. In a first scenario, we assume that the feedforward input triggers in sequence an AP in cell A and in cell B at times t_A < t_B, respectively.
The presynaptic AP arrives at the synaptic site at time t_A + D_ax, while the backpropagating postsynaptic AP arrives at the same location at time t_B + D^b_den, with D_ax and D^b_den being the axonal delay and the backward dendritic delay of the corresponding synaptic connection, respectively (see Figure 1a). Assuming a competitive Hebbian modification among the synaptic connections (realized by a learning function with long-term depression, LTD, dominating long-term potentiation, LTP; cf. Song, Miller, & Abbott, 2000), only those synapses experiencing the strongest upregulation survive; all the others are suppressed. In other words, the rule selects the connections for which the local time difference measured at the synaptic site, Δt_syn, corresponds to the maximum of the learning function, Δt_syn = −a (see Figure 1b). Hence, the optimal delay line satisfies the condition Δt_syn = (t_A + D_ax) − (t_B + D^b_den) = −a. Assuming that the maximum of the learning function covers the sum of the (forward) dendritic latency and the backward dendritic delay, a = D^lat_den + D^b_den, this equation reduces to D_ax + D^lat_den = t_B − t_A. As the dendritic latency D^lat_den, we define the sum of the EPSP propagation delay and the rise time of the somatic EPSP. The last equation confirms our expectation that if cell A should support the firing of cell B, the total forward delay should match the typical spike time difference between these cells.

In a second scenario, we assume that the third input provides only subthreshold input and that the spike in cell B is finally elicited by the synaptic responses from cell A. In this case, the spike time t_B is roughly given by t_B = t_A + D_ax + D^lat_den. If, as above, we insert this expression into the condition (t_A + D_ax) − (t_B + D^b_den) = −a, we obtain D^lat_den + D^b_den = a. Hence, in the second scenario, the rule selects the connections for which the sum of the forward and backward dendritic delays corresponds to the peak of the learning function. For instance, if the maximum of the learning function is given by a = 9 ms, that synaptic pathway is selected whose EPSP triggers, 7 ms after the synaptic release, a postsynaptic AP, which itself peaks 2 ms later back at the synaptic site. It is tempting to speculate that the match between the total dendritic delays and the peak in the learning function determines the position of a synapse on the dendritic tree and that different learning functions (different a's) prefer different synaptic positions (see Figure 1b).

To make our reasoning on the pre- and postsynaptic delay selection physiologically more plausible, we weaken the assumptions in two ways. First, we do not require that the full spectrum of possible delay lines be initially present. Rather, we assume only a small number of synaptic connections (say, 10) from A to B with similar pre- and postsynaptic delays. These delays, however, are assumed to exhibit slow stochastic fluctuations caused, for instance, by the disappearance and reappearance of synapses at different dendritic locations or by the modulation of active membrane conductances. The fluctuations, together with the asymmetric synaptic modifications, are enough to induce drifts in the average axonal and dendritic delays toward optimality.
Second, we do not restrict ourselves to the above scenarios, according to which the firing of cell B is determined always by the third input or always by the firing of cell A. Rather, we assume some degree of temporal correlation in the Poisson activities of the two cells A and B, which is implicitly induced by the third input and depends on the strength of the synaptic connections. In this generalized scenario, the axonal delay and the dendritic latency are both shifted toward optimality; they are adapted such that their sum becomes equal to the typical spike time difference, D_ax + D^lat_den = t_B − t_A. Note that there are different ways to distribute the total delay over the individual components D_ax and D^lat_den. In fact, after the adaptation process, the additional relation D^lat_den + D^b_den = a is satisfied, and this implicitly determines D^lat_den and therefore the time course of the somatic EPSP. Hence, in the generalized scenario, the learning rule not only adjusts the correct total delay from A to B but also determines whether the corresponding somatic EPSP is sharp (a small) or wide (a large).

The simultaneous selection of axonal and dendritic delays deserves a closer look. It starts with the observation that the learning rule cannot distinguish between delay lines with axonal and backward dendritic delays of the form D_ax + ε and D^b_den + ε, simply because the argument Δt_syn = (t_A + D_ax) − (t_B + D^b_den) of the learning function depends only on the difference between these delays. For instance, two delay lines, the second having a 3 ms longer axonal and a 3 ms longer backward dendritic delay than the first, both have the same signal time difference at the synaptic site and therefore go through the same synaptic modifications. Assuming that the forward dendritic latency is equal to the backward dendritic delay, however, the somatic EPSP of the second delay line peaks systematically 6 ms (= 2ε) after the EPSP of the first and therefore favors a different postsynaptic spike time. A way nevertheless to select between the two delay lines is to introduce unreliable synaptic transmissions and to modify the synaptic strength only if neurotransmitter was released. A synapse that then actively supports a postsynaptic spike will have more chances to be upregulated than a synapse that does not, and this resolves the above ambiguity. From a general point of view, any selection process is built on some source of randomness. In our case, the selection of the appropriate axonal delay requires slow delay fluctuations, while the selection of the appropriate dendritic latency requires unreliable synaptic transmissions.
1.3 Integration Field of a V1 Cell: Some Implications. Coming back to the introductory example, A is an orientation-selective layer 5 pyramidal cell in the primary visual cortex of a cat, and B is a nearby neural population in the same cortical area. According to our reasoning, the axonal delays plus the dendritic latencies from B to A should reflect the time differences with which the corresponding retinal positions are most likely to be stimulated. Since an orientation-selective cell is more sensitive to motions orthogonal to the preferred orientation, the integration field should be more developed along this orthogonal orientation. For instance, we expect from our model that the subthreshold receptive field of such a cell is largest along the orthogonal orientation and that along this same orientation, the propagation velocity is highest. Both predictions seem to be confirmed to some degree by the in vivo recordings of Bringuier et al. (1999). In fact, the statistics of the orientation-selective cells (Figure 4 in the cited work) show an average width of the subthreshold receptive field of 5 and 12 degrees in the preferred and its orthogonal orientation, respectively, and a common latency of 50 ms across these differently sized half-diameters. A selection of different axonal delays is further confirmed by the statistics of the apparent speed of horizontal propagation, which ranges from 0.05 to 0.6 m per second. According to our model, the receptive field asymmetries concerning the axonal delays would be enhanced if, during rearing, the kittens were exposed to fast motions in exclusively one direction. A similar confirmation of the model concerning the adaptation of the dendritic latency is more difficult due to the lack of data. However, the intracellular recordings show that the relative width of the subthreshold depolarization is narrower for stimuli within 2 degrees of the receptive field center than for stimuli in the periphery (Figure 1 in Bringuier et al., 1999). This is in agreement with the anatomical finding that feedforward input projects proximal to the soma, while horizontal cortical connections project onto the distal apical tree (Nieuwenhuys, 1994). According to our model, this synaptic distribution would be the result of a self-organizing process, provided that the modification of proximal (feedforward) synaptic projections is governed by a narrower learning function than that of distal (horizontal) ones.

1.4 Overview. This article is organized as follows. In section 2, we formalize the model and state the assumptions on the synaptic modifications and the delay fluctuations. In section 3, we introduce spike correlations between the inputs onto our neuron and derive the population equation for the dynamics of the synaptic strengths. In section 4, we consider the change in the average delays induced by the weight modifications under the assumption of fixed spike correlations between the pre- and postsynaptic sites. In section 5, the spike correlation is allowed to be influenced by the synaptic modifications, and we study the induced shift in the dendritic latencies at fixed axonal delays. In section 6, the general case is treated, where correlated spike activity jointly entrains axonal and dendritic delays in the presence of stochastic delay fluctuations and unreliable synaptic transmissions. Section 7 provides physiological evidence for our assumptions and ends with an example of delay learning between different cortical regions.

2 The Model
We consider different delay lines from a presynaptic neuron (or neuronal population) A to a postsynaptic neuron B. The delay lines may vary in the axonal pathway and in the synaptic position on the dendritic tree (see Figure 1a). Each delay line is specified by an axonal delay, D_ax, a (negligible) synaptic delay, D_syn, a forward dendritic (propagation) delay, D^f_den, the EPSP rise time in the postsynaptic soma, t_rise (to which, together with D^f_den, we refer as the dendritic latency), and a backward dendritic delay, D^b_den, specifying the delay of the postsynaptically generated AP back to the subsynaptic site (the site in the dendritic tree adjacent to the synapse). If pre- and postsynaptic APs are triggered at times t_A and t_B, respectively, the EPSP faces the postsynaptic AP at the subsynaptic site with a time difference

Δt_syn = (t_A + D_ax + D_syn) − (t_B + D^b_den) = Δt + δ,        (2.1)

where Δt = t_A − t_B and δ = D_ax + D_syn − D^b_den (cf. Figure 2b). Note that δ is the relative delay encountered at the subsynaptic site for the case of equal spike times.
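Equation 2.1 is simple enough to state directly in code. The helper below is illustrative (not from the paper) and checks the equivalence of the two forms of Δt_syn with hypothetical spike times and delays:

```python
def dt_syn(t_A, t_B, D_ax, D_syn, D_b_den):
    """Local time difference at the subsynaptic site (equation 2.1)."""
    return (t_A + D_ax + D_syn) - (t_B + D_b_den)

# equivalently, dt_syn = dt + delta, with dt = t_A - t_B and
# delta = D_ax + D_syn - D_b_den (the relative delay)
t_A, t_B = 0.0, 10.0                     # hypothetical spike times (ms)
D_ax, D_syn, D_b_den = 4.0, 0.0, 2.0     # hypothetical delays (ms)
delta = D_ax + D_syn - D_b_den
assert dt_syn(t_A, t_B, D_ax, D_syn, D_b_den) == (t_A - t_B) + delta
```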
Figure 2: The synaptic learning function and the different delays after learning. (a) The learning function y(Δt_syn) for the synaptic weights as a function of the time difference between the pre- and postsynaptic signals at the subsynaptic site (full line; see equations 2.2 and 2.1). For repetitive regular stimulations (& = 0), the average axonal delay D_ax and the average dendritic latency D^lat_den = D^f_den + t_rise are adapted according to the derivative y′ (dashed line; cf. equations 4.3 and 5.5). The delay adaptation has an attracting fixed point Δt_syn = −a and a repulsive fixed point Δt_syn = b. (b) The different delays involved in the signal transmission from neuron A to neuron B and back to the synaptic site. The bent curve represents the learning function y centered at the time when the backpropagated AP reaches the subsynaptic site. The choice of the delays reflects the steady-state behavior when Δt_syn = −a.
The synaptic strength w is assumed to be up- or downregulated depending on the sign of the local time difference Δt_syn. Specifically, we assume that w is modified proportionally to the bi-alpha-shaped function of the local time difference,

Δw ∝ y(Δt_syn) =
     (c/a) |Δt_syn| exp(−(Δt_syn)² / (2a²)),   Δt_syn < 0,
    −(c/b) |Δt_syn| exp(−(Δt_syn)² / (2b²)),   Δt_syn ≥ 0,        (2.2)
with appropriate constants a, b, c > 0 (see Figure 2a). The form of y± is such that the weight is maximally upregulated if the presynaptic spike precedes the postsynaptic spike by a ms, and it is maximally downregulated if the presynaptic spike follows the postsynaptic spike by b ms. The normalization is such that the absolute values of the peaks of the LTP and LTD branches are independent of their widths a and b, respectively. We assume that the proportionality factor in equation 2.2 is the actual weight itself, 4w D wy± ( 4t) . If w expresses the number of quantal release sites, for
Activity-Dependent Development of Axonal and Dendritic Delays
591
instance, the factor w states that each site separately undergoes long-term modification in that it vanishes or forms another nearby release site. The same factor emerges if one deduces equation 2.2 from kinetic schemes governing the involved neurotransmitter receptors and secondary messengers (Senn, Tsodyks, & Markram, 2001). To avoid a blowing up of the synaptic strengths, our analysis will use the intrinsic normalization property of the learning rule, which is obtained when the downregulation dominates the upregulation (b > a; see equation 3.5).

Next, we consider slow stochastic fluctuations in the axonal delays and dendritic latencies. We observe that according to equation 2.1, it is reasonable to group together those delay lines with the same relative delay d. This is possible since only the time difference Δt_syn = Δt + d of the signals measured at the subsynaptic site enters the synaptic dynamics of equation 2.2. We may therefore parameterize the synaptic weights by the relative delay, w(t) = w(d, t). Due to the linear scaling factor occurring in the rule, these lumped synaptic weights are adapted according to Δw(d, t) = w(d, t) ψ(Δt). Similarly, stochastic changes in the axonal and dendritic delays are relevant for the synaptic dynamics only insofar as they change the relative delay d. For an individual pathway, this relative delay may be shortened or extended by changing the axonal delay D_ax, the backward dendritic delay D_den^b, or both. To take account of these fluctuations, we assume that for each delay line, the relative delay d may change during a time interval T by ±Δd, each direction with a small probability ε ≥ 0. With probability 1 − 2ε, the delay d of the synaptic pathway does not change. On the level of the lumped weights w(d, t), this dynamics can be expressed by

w(d, t) = ε w(d − Δd, t − T) + (1 − 2ε) w(d, t − T) + ε w(d + Δd, t − T).   (2.3)
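Equation 2.3 can be illustrated with a few lines of Python (a minimal sketch; the grid size, ε, Δd, and step count are arbitrary illustrative choices). Iterating the three-point update spreads an initially localized weight profile diffusively while conserving the total weight, with the variance growing by 2ε(Δd)² per step:

```python
# Iterate the lumped-weight update (2.3):
# w(d) <- eps*w(d - dd) + (1 - 2*eps)*w(d) + eps*w(d + dd)
eps, dd, n_bins, steps = 0.1, 0.2, 401, 500
w = [0.0] * n_bins
w[n_bins // 2] = 1.0                 # all weight at one relative delay
for _ in range(steps):
    w = [eps * (w[i - 1] if i > 0 else 0.0)
         + (1 - 2 * eps) * w[i]
         + eps * (w[i + 1] if i < n_bins - 1 else 0.0)
         for i in range(n_bins)]
total = sum(w)                       # conserved (up to negligible boundary loss)
mean = sum((i - n_bins // 2) * dd * wi for i, wi in enumerate(w)) / total
var = sum(((i - n_bins // 2) * dd - mean) ** 2 * wi for i, wi in enumerate(w)) / total
print(round(total, 6), round(var, 6), 2 * eps * dd**2 * steps)
```

The printed variance matches the random-walk prediction 2ε(Δd)² × steps = 4.0 ms², the discrete counterpart of the diffusion coefficient ε that appears in the continuum limit.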
If the timescale T of these fluctuations is long compared to the timescale of the synaptic adaptation, the two processes do not interfere, and the dynamics for the lumped weights is the same as that for the individual connections. On a timescale longer than T, we may pass to derivatives, and taking the limit Δd, T → 0, we obtain after simple algebraic manipulations ∂w(d, t)/∂t = ε ∂²w(d, t)/∂d², provided that lim_{Δd,T→0} (Δd)²/T = 1. Thus, the slow, unbiased stochastic fluctuation introduces a diffusion of the synaptic weights along the relative delays.

The general scenario we consider is that neuron B fires, with some jitter, |Δt| after neuron A (note that in this case Δt < 0). The firing time t_B depends on additional input from a third population onto B (as, e.g., forward input from the LGN; see Figure 1b) but may also be influenced by the firing of neuron A itself. The basic mechanism is that the rule 2.2 selects delay lines for which the pre- and postsynaptic signal time difference at the subsynaptic site equals the maximum of the learning function, Δt_syn = Δt + d = −a. The
Walter Senn, Martin Schneider, and Berthold Ruf
stochastic process, equation 2.3, of building new delay lines (i.e., of assigning positive weights to delay lines with previously vanishing weights) ensures that after a while, there is always a delay line with an appropriate relative delay d and nonvanishing w. If the LTD branch of the learning function dominates the LTP branch, b > a in equation 2.2, delay lines that do not meet the above condition will be suppressed. If, in addition, we introduce synaptic transmission failures, those delay lines that do satisfy Δt + d = −a but do not support the postsynaptic spike by virtue of their total forward delay, that is, for which D_tot^f = D_ax + D_syn + D_den^f + t_rise ≠ −Δt, will eventually be suppressed as well.
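The normalization property stated above (peak heights independent of the widths a and b) and the sign of the window's integral invoked here can be checked numerically against the explicit window used later (the σ = 0 limit of equation 3.6); a small Python sketch, with parameter values chosen for illustration only:

```python
import math

def psi(dt, a, b, c):
    """Learning window: LTP branch for dt < 0 (pre before post), LTD for dt >= 0;
    this is the sigma = 0 limit of equation 3.6."""
    if dt < 0:
        return (c / a) * abs(dt) * math.exp(-dt**2 / (2 * a**2))
    return -(c / b) * abs(dt) * math.exp(-dt**2 / (2 * b**2))

def peaks_and_integral(a, b, c, h=0.001, span=100.0):
    vals = [psi(i * h - span, a, b, c) for i in range(int(2 * span / h) + 1)]
    return max(vals), min(vals), sum(vals) * h

hi1, lo1, int1 = peaks_and_integral(a=5.0, b=7.0, c=3.5)
hi2, lo2, int2 = peaks_and_integral(a=10.5, b=14.0, c=3.5)
# Peaks are +/- c*e^(-1/2) regardless of a and b; the integral is -c*(b - a) < 0:
print(hi1, hi2, 3.5 * math.exp(-0.5))
print(int1, int2)
```

Both parameter sets give the same peak height, c e^{−1/2}, while the integral −c(b − a) is negative exactly when the LTD branch is wider (b > a), the condition used for normalization throughout.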
3 Weight Modification Induced by Correlated Activity
We assume that the pre- and postsynaptic neurons are firing with some instantaneous Poisson rates f_pre and f_post, respectively, which are themselves correlated to some degree. Let us consider a delay line with relative delay d as defined in equation 2.1, and let us assume that the synaptic strength is modified according to equation 2.2. The change in w accumulated in the time interval T (≫ a, b) preceding the current time t can then be approximated by

Δw(t) ≈ w(t̃) ∫_{t−T}^{t} dt′ ∫_{t−t′−T}^{t−t′} dτ ψ(τ + d) f_pre(t′ + τ) f_post(t′),   (3.1)
with some appropriate value t̃ between t − T and t. Assuming that the individual weight changes in equation 2.2 are small, we obtain, on a timescale that is large with respect to T, the differential equation

dw(t)/dt ≈ Δw(t)/T = w(t) ∫_{−∞}^{∞} dτ ψ(τ + d) C(τ; t),   (3.2)

with correlation function

C(τ; t) = (1/T) ∫_{t−T}^{t} dt′ f_pre(t′ + τ) f_post(t′).   (3.3)
Observe that the integration domain in equation 3.2 can be extended to infinity since T is larger than the width of ψ. Note further that in equations 3.1 and 3.2, we assume that the decay of the synaptic memory (i.e., of w) is on an even longer timescale, so that all the changes accumulate without degradation.

Next, we require that the instantaneous pre- and postsynaptic firing rates are correlated in the sense that there is a tendency for the postsynaptic neuron to fire around Δt ± σ ms after or before a presynaptic spike (after for Δt < 0, before for Δt > 0). Thus, we assume a correlation function of the form

C(τ; t) = c G(Δt, σ; τ) + f̄_pre f̄_post(t),   (3.4)
with a constant presynaptic mean frequency f̄_pre; a postsynaptic mean frequency f̄_post(t), which may depend on the synaptic weight w(t) but is not correlated with the presynaptic spikes; a spike correlation factor c ≥ 0; and G(Δt, σ; τ) = (2πσ²)^{−1/2} exp(−(τ − Δt)²/(2σ²)) being a gaussian distribution around Δt with standard deviation σ. For the specific choices of the learning function 2.2 and the correlation function 3.4, the integral in equation 3.2 can be calculated analytically. One obtains

dw(t)/dt = w(t) ( c ψ_σ(Δt + d) − ψ̄ f̄_pre f̄_post(t) ),   (3.5)

where ψ_σ(Δt) = ∫_{−∞}^{∞} ψ(τ) G(Δt, σ; τ) dτ is approximately

ψ_σ(Δt) ≈ (c a² / (a² + σ²)^{3/2}) |Δt| exp(−Δt² / (2(a² + σ²)))   for Δt < 0,
ψ_σ(Δt) ≈ −(c b² / (b² + σ²)^{3/2}) |Δt| exp(−Δt² / (2(b² + σ²)))   for Δt ≥ 0,   (3.6)
and −ψ̄ = ∫_{−∞}^{∞} dτ ψ(τ) = −c(b − a). Note that for b > a, one has −ψ̄ < 0. A negative integral of the learning function was postulated in Abbott and Song (1999) and Kempter et al. (1999) to normalize the synaptic weights and stabilize the postsynaptic potential in a subthreshold regime. There is in fact later experimental evidence that for synaptic connections onto pyramidal cells in layer II of the rat barrel cortex, the width of the LTD window, b, is larger than that of the LTP window, a, and that the overall integral over the learning function is negative (Feldman, 2000). The exact form of ψ_σ is a smoothed version of equation 3.6, and the approximation is strict for a = b. The shape of ψ_σ is thus again of the original form depicted in Figure 2a, but now with maximum at Δt = −√(a² + σ²) and minimum at Δt = +√(b² + σ²). Note that for vanishing variance in the spike time differences, σ = 0, one retrieves equation 2.2. The effective learning window for a synapse exposed to gaussian-distributed spike time differences therefore appears to be downscaled and broadened.

To close the feedback loop, we split the uncorrelated postsynaptic firing rate arising in equation 3.4 into a part originating from the specific presynaptic neuron and a constant background rate due to the remaining afferents,

f̄_post(t) = Σ_d w_d(t) f̄_pre + f̄_post^o,   (3.7)

where the sum extends over the different synaptic delay lines from the specific presynaptic neuron (with weights w = w_d). In writing a linear sum, we assume that the postsynaptic neuron is operating in a linear regime
of its input-output function. We also assumed that the specific presynaptic neuron affects the correlation function 3.4 only by the overall synaptic input, without considering the temporal structure of the interaction (but compare equation 6.1). Inserting equation 3.7 into 3.5 yields

dw(t)/dt = w(t) [ c ψ_σ(Δt + d) − ψ̄ f̄_pre² Σ_d w(t) − ψ̄ f̄_pre f̄_post^o ].   (3.8)
Observe that in the presence of noncorrelated spike activity, c = 0 in equation 3.4, the synaptic weight would asymptotically decay toward zero due to the negative overall integral over the learning function. Furthermore, the synaptic strength cannot grow to infinity, due to the negative feedback loop represented by the second term in the brackets.

Finally, we consider a family of delay lines whose weights may stochastically fluctuate among nearest neighbors according to equation 2.3. In the limit of a continuum of delay lines, the fluctuations result in a diffusion term that adds to equation 3.8,

dw(d, t)/dt = w(d, t) [ c ψ_σ(Δt + d) − ψ̄ f̄_pre² ∫_{−∞}^{∞} w(d, t) dd − ψ̄ f̄_pre f̄_post^o ] + ε ∂²w(d, t)/∂d².   (3.9)
To reveal the structure of this population equation for the synaptic weights, we rewrite it in the form

dw/dt = w [ ψ_σ(Δt + d) − c₁⟨w⟩ − c₂ ] + ε ∂²w/∂d²,   (3.10)

with ⟨w⟩ = ∫_{−∞}^{∞} w(d, t) dd and constants c₁ = ψ̄ f̄_pre² and c₂ = ψ̄ f̄_pre f̄_post^o. The correlation factor c was absorbed into the learning function ψ_σ.

4 Induced Delay Shift for Fixed Spike Correlations
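As a numerical cross-check of the effective window (a plain Python sketch; the grid and parameter values are my own choices): convolving the raw window ψ with the gaussian G reproduces the closed form 3.6, exactly so in the case a = b, for which the text notes the approximation is strict, and the maximum indeed sits at Δt = −√(a² + σ²):

```python
import math

a = b = 5.0
c, sigma = 3.5, 3.0

def psi(t):                                  # raw learning window (sigma = 0)
    width = a if t < 0 else b
    return math.copysign((c / width) * abs(t) * math.exp(-t**2 / (2 * width**2)), -t)

def gauss(m, s, t):                          # gaussian G(m, s; t)
    return math.exp(-(t - m)**2 / (2 * s**2)) / math.sqrt(2 * math.pi * s**2)

def psi_sigma_num(dt, h=0.01, span=60.0):    # integral of psi(tau)*G(dt, sigma; tau)
    return h * sum(psi(i * h - span) * gauss(dt, sigma, i * h - span)
                   for i in range(int(2 * span / h) + 1))

def psi_sigma_closed(dt):                    # equation 3.6
    w2 = (a**2 if dt < 0 else b**2) + sigma**2
    amp = c * (a**2 if dt < 0 else b**2) / w2**1.5
    return math.copysign(amp * abs(dt) * math.exp(-dt**2 / (2 * w2)), -dt)

err = max(abs(psi_sigma_num(x) - psi_sigma_closed(x))
          for x in (-10.0, -5.83, -2.0, 1.5, 6.0))
grid = [-20.0 + 0.01 * i for i in range(2001)]       # search [-20, 0] for the peak
dt_star = max(grid, key=psi_sigma_closed)
print(err, dt_star, -math.sqrt(a**2 + sigma**2))     # err ~ 0, dt_star ~ -5.83
```

The peak location −√(a² + σ²) ≈ −5.83 ms is the quantity that controls all the delay steady states derived below.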
Let us consider the supervised learning scenario by imposing the stationary spike statistics, equation 3.4, between the pre- and postsynaptic neuron. Thus, besides some uncorrelated component, the time differences of the signals at the synapse with delay d are gaussian distributed around Δt + d with standard deviation σ. We are interested in the dynamics of the relative delays d implied by the weight modification 3.9. To gain insight into this dynamics, let us first neglect the stochastic renewal process and consider the synaptic modification in the form 3.10 with ε = 0.

Due to the normalization term −c₁⟨w⟩ in equation 3.10, the delay adaptation is then purely governed by a selection process, selecting those
connections that experience the strongest potentiation. (To prove this, we first observe that due to the negative feedback induced by the uncorrelated spikes (the term −c₁⟨w⟩) and the saturation toward zero (the factor w), the weights w(d, t) always converge to a steady state. In this limit, either w(d, t) = 0 or ψ_σ(Δt + d) − c₁⟨w⟩ − c₂ = 0. Since −c₁⟨w⟩ − c₂ is independent of d, there are only discrete values of d satisfying the latter equation. Now let us assume that the delay line d_a with Δt + d_a = −a, representing the maximum of ψ_σ, initially has a positive weight, w(d_a, 0) > 0. Since according to equation 3.10 weights with ψ_σ(Δt + d) − c₁⟨w⟩ − c₂ > 0 are potentiated while others are depressed, we conclude that in the steady state, the delay line d_a survives. Following the reasoning above, we then must have ψ_σ(Δt + d_a) − c₁⟨w⟩ − c₂ = 0, while for all other delay lines, w(d, t) = 0. In general, exactly those delay lines survive for which, among the initial weight distribution, the value ψ_σ(Δt + d) is maximal (and positive).)

If we next consider stochastic fluctuations of the delays, ε > 0 in equation 3.10, then the delays may shift beyond their initial regime, and delays can finally be selected that were not present in the initial configuration. As can be guessed from the analysis, the relative delays d will in general shift such that the average time difference at the synaptic sites becomes Δt + d = −a. To formalize this statement, we consider the average delay d̄ defined by the center of gravity of the d distribution,

d̄(t) = ∫_{−∞}^{∞} d w(d, t) dd / ∫_{−∞}^{∞} w(d, t) dd = ⟨d w(d, t)⟩ / ⟨w(d, t)⟩,   (4.1)

where the brackets represent the integral over d. Note that d̄ = D̄_ax + D_syn − D̄_den^b. The dynamics of d̄ is obtained by differentiating equation 4.1 with respect to time,

dd̄/dt = ( ⟨d ẇ⟩⟨w⟩ − ⟨d w⟩⟨ẇ⟩ ) / ⟨w⟩² = (1/⟨w⟩) ( ⟨d ẇ⟩ − d̄ ⟨ẇ⟩ ),   (4.2)
and inserting for ẇ the weight modification 3.10. As we show in Section A.1, the equation can then be reduced to

dd̄/dt ≈ s² ψ_σ′(Δt + d̄),   (4.3)
where ψ_σ′ is the derivative of the learning function 3.6. The dynamics 4.3 has exactly two stationary solutions, d̄_a and d̄_b, corresponding to the zeros Δt̄_syn = Δt + d̄_a = −√(a² + σ²) and Δt̄_syn = Δt + d̄_b = +√(b² + σ²) of ψ_σ′ (cf. Figure 2a). To investigate the stability, we consider the second derivative of ψ_σ at these points and find ψ_σ″(−√(a² + σ²)) < 0 and ψ_σ″(+√(b² + σ²)) > 0. We conclude that the delay d̄_a is attracting with domain of attraction (−∞, d̄_b),
while the delay d̄_b is repulsive. The standard deviation of the delay distribution at the attracting steady state can qualitatively be approximated by

s ≈ ( 2ε / |ψ_σ″(−√(a² + σ²))| )^{1/4} = ( √e (a² + σ²)² ε / (c a²) )^{1/4},   (4.4)

(see Section A.2). Since s increases steeply already for small ε, we conclude from equation 4.3 that small delay fluctuations are enough to make the delay drift. Moreover, due to the term ψ_σ″, the steady-state weight distribution is broad for learning functions with a flat peak.

To summarize, for fixed and nonflat pre- and postsynaptic spike correlations, the synaptic pathways develop such that a presynaptic signal on average meets a postsynaptic signal at the synaptic site with a time difference of Δt̄_syn = −√(a² + σ²) ms, where a is the peak of the LTP function and σ is the width of the gaussian part in the spike correlation function 3.4. In particular, the average spike time difference Δt can be compensated by appropriate delays D_ax and D_den^b as long as, on average, the presynaptic signals arrive at the synaptic site not later than √(b² + σ²) ms after the backpropagated spike (cf. Figure 2). Two assumptions are crucial for these delay adaptations: (1) the small stochastic fluctuations in the delays and (2) the negative integral over the learning function, leading to a normalization of the synaptic weights through the uncorrelated spikes.

4.1 Simulation Results. To examine the quality of the approximation 4.3, we simulated a stochastic version of equation 3.10 with discrete weight updates and discrete random delay fluctuations after each pairing of pre- and postsynaptic spikes. The spike time differences Δt = t_A − t_B were sampled from a gaussian distribution with mean Δt̄ = −20 ms and standard deviation σ = 3 ms, at a sampling rate of f̄_pre = 20 Hz. We chose seven delay lines with axonal delays D_ax evenly distributed between 9.4 and 11.6 ms and between 17.4 and 18.6 ms, respectively (left and right rectangular curves in Figure 3a). The synaptic delay and the backward dendritic delay were fixed to 1 ms, D_syn = D_den^b = 1, so that d = D_ax. After each spike pair, we evaluated the learning function 2.2 and set for the weight change Δw(d, t) = w(d, t) ψ(Δt + d).
The parameters for ψ were a = 5 ms, b = 7 ms, and c = 3.5. The allowed range for the relative delays d was between d_min = 9 and d_max = 21 ms, with mesh width Δd = 0.2 ms. The additional stochastic delay fluctuations were implemented by replacing a synaptic weight w(d, t) after each presynaptic spike with probability ε = 0.1 by one of its two neighbors w(d ± Δd, t); see equation 2.3. Finally, the different weights were reduced after each pairing by −c₁ w(d, t)⟨w⟩, with c₁ = 0.3 and ⟨w⟩ = Σ_{i=0}^{60} w(d_min + iΔd, t) Δd. The remaining parameter in equation 3.10 was set to c₂ = 0, thus assuming f̄_post^o = 0.
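The prediction of equation 4.3 for this parameter set can be previewed by integrating the averaged dynamics directly (a Python sketch of the reduced ODE only, not of the full stochastic simulation; s = 0.2 is the value the authors match to the simulation): from either initial delay range, d̄ relaxes to −Δt̄ − √(a² + σ²) = 20 − √34 ≈ 14.17 ms.

```python
import math

a, b, c, sigma = 5.0, 7.0, 3.5, 3.0
dt_mean, s = -20.0, 0.2                  # mean spike time difference; delay spread

def psi_sigma(x):                        # effective learning window, equation 3.6
    w2 = (a**2 if x < 0 else b**2) + sigma**2
    amp = c * (a**2 if x < 0 else b**2) / w2**1.5
    return math.copysign(amp * abs(x) * math.exp(-x**2 / (2 * w2)), -x)

def dpsi_sigma(x, h=1e-4):               # numerical derivative of psi_sigma
    return (psi_sigma(x + h) - psi_sigma(x - h)) / (2 * h)

def relax(d0, step=1.0, n=40000):        # Euler integration of equation 4.3
    d = d0
    for _ in range(n):
        d += step * s**2 * dpsi_sigma(dt_mean + d)
    return d

d_from_low, d_from_high = relax(10.0), relax(18.0)
print(d_from_low, d_from_high, -dt_mean - math.sqrt(a**2 + sigma**2))
```

Both starting points lie in the domain of attraction (−∞, d̄_b) and settle at the same delay, the value reported as D̄_ax ≈ 14.2 ms in Figure 3.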
Figure 3: Axonal delay adaptation with fixed synaptic and dendritic delays (= 1 ms) and with gaussian spike time differences fluctuating around Δt̄ = −20 ms. (a) Evolution of the axonal delays D_ax for two different initial distributions (left and right rectangular curves), shown after 120 ms (corresponding to two pairings, circled curves) and after 20 sec (corresponding to 400 pairings, curves with dots), when converged to a gaussian distribution centered at D̄_ax = 14.2 ms. (b) Time evolution of the average axonal delay D̄_ax (= d̄) corresponding to the simulations in a (blurred lines) and according to the approximation 4.3 (dashed lines). The steady state for D̄_ax is well predicted by equation 4.3, according to which D̄_ax adapts such that eventually the argument of ψ_σ′, Δt̄ + D̄_ax, is equal to the zero −√(a² + σ²) of ψ_σ′ (which itself corresponds to the peak of ψ_σ; cf. Figure 2a). In fact, in the steady state, we have Δt̄ + D̄_ax = −20 + 14.2 = −5.8 and −√(a² + σ²) = −5.83. Without jitter in the spike time differences (σ = 0), the average axonal delay D̄_ax would be roughly 1 ms longer (dot at 100 sec). Other parameters: a = 5 ms and σ = 3 ms.
The simulation was run for 100 seconds of biological time (which is determined by the learning rate c). From both rectangular initial distributions, the weights w(d, t) converged to a gaussian distribution centered at d = D̄_ax = 14.2 ms with standard deviation s = 0.8 (see Figure 3a). The time course of D̄_ax, obtained by evaluating equation 4.1 with the corresponding weights from the simulation, can be matched with the dynamics 4.3 with s = 0.2 (see Figure 3b). Although, due to the different approximation steps, the two s's do not coincide (the one from equation 4.4, for comparison, would be 0.31), the steady state for the average axonal delay (≈ 14 ms) is well predicted by the dynamics 4.3. Note that the jitter in the spike times shortens the average axonal delay (the dot in Figure 3b).

5 Induced Delay Shift for Variable Spike Correlations
We next consider an unsupervised learning scenario according to which the presynaptic neuron may directly influence the postsynaptic spike time t_B,
which in turn affects the synaptic strength from the considered presynaptic neuron. To investigate this feedback loop, we first consider the simplified scenario where the postsynaptic spike is exclusively triggered by the presynaptic input. We further simplify matters by fixing the axonal and synaptic delays to some constant value, say, D_ax = D_syn = 0, and focus on the adaptation of the dendritic delays and latencies.

To take account of the dependencies among the different dendritic delays, we parameterize the synapses on the dendritic tree by their distance D from the soma and identify this distance with the forward dendritic delay, D_den^f ≡ D. Since distal synapses induce a longer (and smaller) somatic EPSP, we assume that the rise time is a monotonically increasing function of the distance, t_rise = R(D). For the backward dendritic delay, we likewise assume a monotonically increasing function, D_den^b = B(D). The somatic EPSP induced by synapse D is assumed to have the form

E_{R(D)}(t) = (t / R²(D)) e^{−t/R(D)},   for t ≥ 0,   (5.1)
and E_{R(D)}(t) = 0 for t < 0 (see the inset of Figure 4b). To obtain the time course of the somatic voltage induced by a single presynaptic spike at time t_A, we integrate over all synaptic connections,

V(t) = ∫_0^∞ w(D, t) E_{R(D)}(t − (t_A + D)) dD,   (5.2)
where w(D, t) denotes the synaptic weight of the connection from neuron A to neuron B with forward dendritic delay D (recall that D_ax = 0). We further assume that the postsynaptic neuron is sufficiently depolarized such that the input from neuron A alone may push the postsynaptic voltage across some fixed threshold h. The postsynaptic spike time t_B is then given by the first crossing of h. If h is in the upper third of the maximal depolarization, equation 5.2, say (cf. also the inset in Figure 4), and the delay distribution is not too wide, we may roughly estimate

t_B ≈ t_A + D̄ + R̄ = t_A + D̄_den^lat.   (5.3)
Here, D̄ and R̄ represent the average forward dendritic delay and the average rise time of the somatic EPSPs, respectively, and D̄_den^lat ≡ D̄ + R̄ is the average dendritic latency. The averages are taken with respect to the synaptic weight distribution. For instance, for the average dendritic latency, we have

D̄_den^lat(t) = ∫_{−∞}^{∞} D_den^lat(D) w(D, t) dD / ∫_{−∞}^{∞} w(D, t) dD,   (5.4)
Figure 4: Adaptation of the dendritic latency, with postsynaptic spikes triggered by the specific presynaptic neuron. (a) Evolution of the weight distribution as a function of the dendritic latency D_den^lat = D_den^f + t_rise, with snapshots of the initial distributions (left and right rectangular curves), after 120 ms (slightly deformed circled curves), and after 60 sec (nearly identical gaussians centered at D̄_den^lat ≈ 8 ms). (b) Time evolution of the average dendritic latency D̄_den^lat = D̄_den^f + t̄_rise corresponding to the two simulations in a (full lines) and the approximation 5.5 with s = 0.2 (dashed lines). Superimposed is the evolution of the spike time difference t_B − t_A (dashed-dotted lines) for the two simulations in a. The average dendritic latency is implicitly adapted such that in the steady state, the argument of ψ′ is equal to the first zero, D̄_tot^den ≡ D̄_den^lat + D̄_den^b = a (cf. equation 5.5 and Figure 2a). In fact, from D̄_den^lat ≈ 8, t_rise = 4, and D̄_den^b = ½ D̄_den^f one calculates D̄_den^b ≈ 2, D̄_tot^den ≈ 10, and this is in the range of a = 10.5 (cf. also Figure 2b). The inset shows the normalized EPSP according to equation 5.1, with the threshold crossing at ¾ t_rise (horizontal and vertical dashed lines).
with D_den^lat(D) = D + R(D). To estimate the shift in D̄_den^lat induced by the synaptic modifications, we have to relate D̄_den^lat to the average synaptic time difference Δt̄_syn. From equation 2.1 we get Δt_syn(D) = t_A − (t_B + B(D)). Setting t_A = 0 and inserting equation 5.3 yields Δt̄_syn = −D̄_den^lat − B̄, where B̄ is the average backward dendritic delay, defined similarly to equation 5.4. If the synaptic weights w(D, t) are now subject to the dynamics 3.9, with d replaced by D, we obtain (see Section A.3)

(d/dt) D̄_den^lat ≈ −s² γ ψ′(−D̄_den^lat − B̄),   with γ = B′(D̄)(1 + R′(D̄)) / (1 + R′(D̄) + B′(D̄)).   (5.5)
Here, s represents the standard deviation of the distribution of the forward dendritic delays D. It is given by equation 4.4 with σ = 0.
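The estimate 5.3 is easy to visualize with the kernel 5.1 (a plain Python sketch; the discrete synapse set, the weights, and the threshold at 80% of the peak are illustrative assumptions, not the paper's simulation): the first threshold crossing of the summed EPSPs falls between t_A + D̄ and t_A + D̄ + R̄, on the rising flank just before the peak.

```python
import math

R = 4.0                                     # EPSP rise time R(D), constant here
synapses = [(4.0, 1.0), (4.8, 1.0), (5.6, 1.0)]   # (forward dendritic delay D, weight)
t_A = 0.0                                   # presynaptic spike time

def epsp(t):                                # equation 5.1: E(t) = (t/R^2) e^(-t/R)
    return (t / R**2) * math.exp(-t / R) if t > 0 else 0.0

def V(t):                                   # equation 5.2, discretized
    return sum(w * epsp(t - (t_A + D)) for D, w in synapses)

ts = [0.001 * i for i in range(40000)]      # 0..40 ms
v_peak = max(V(t) for t in ts)
h = 0.8 * v_peak                            # threshold in the upper part of the EPSP
t_B = next(t for t in ts if V(t) >= h)      # first threshold crossing
D_bar = sum(D * w for D, w in synapses) / sum(w for _, w in synapses)
print(t_B, D_bar, D_bar + R)                # crossing between D_bar and D_bar + R
```

Raising the threshold pushes t_B closer to the EPSP culmination t_A + D̄ + R̄, which is the regime in which the rough estimate 5.3 applies.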
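The reduced dynamics 5.5 can be integrated in the same spirit as equation 4.3 (a Python sketch of the averaged ODE only; R′ = 0 and B′ = 1/2 as in the simulation below, hence γ = 1/3, and B̄ is tied to the latency through B̄ = (D̄_den^lat − t_rise)/2, a simplifying assumption of mine): both initial latency ranges relax to the value where D̄_den^lat + B̄ = a, that is, D̄_den^lat = 25/3 ≈ 8.33 ms.

```python
import math

a, c = 10.5, 0.7                         # LTP width and amplitude of the window
t_rise, s, gamma = 4.0, 0.2, 1.0 / 3.0   # gamma = B'(1 + R') / (1 + R' + B')

def dpsi(x):                             # derivative of the sigma = 0 window, x < 0 branch
    return (c / a) * math.exp(-x**2 / (2 * a**2)) * (x**2 / a**2 - 1.0)

def relax(lat0, step=10.0, n=200000):    # Euler integration of equation 5.5
    lat = lat0
    for _ in range(n):
        b_bar = 0.5 * (lat - t_rise)     # average backward delay, B(D) = D/2
        lat += step * (-s**2 * gamma * dpsi(-lat - b_bar))
    return lat

lat_from_low, lat_from_high = relax(5.0), relax(11.0)
print(lat_from_low, lat_from_high, 25.0 / 3.0)
```

Note the minus sign relative to the sketch of equation 4.3: here the latency grows (or shrinks) until the total dendritic delay matches the LTP width a, in line with the sign discussion following equation 5.5.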
The unique attracting steady state of equation 5.5 is again given by the zero of ψ′ with ψ″ < 0, that is, by Δt̄_syn = −D̄_den^lat − B̄ = −a. Thus, assuming small, unbiased fluctuations in the synaptic positions (encoded by D), together with repeatedly suprathreshold input from the considered presynaptic neuron, the synaptic population slowly shifts toward an average position D̄ with total average dendritic delay D̄_tot^den ≡ D̄_den^lat + B̄ ≈ a. Observe that the derivative of B(D) enters as a factor in equation 5.5 and that the postsynaptic delay is therefore adapted only if the backward delay is a nonconstant function of the synaptic position. The reason is that we assume a simultaneous occurrence of the synaptic releases, and if the backward dendritic delay were the same for all synapses, each synapse would see the same local time difference between the forward and backward signal. For B(D) = const, therefore, no shift in the average dendritic latency is expected. Note further that the sign in equations 4.3 and 5.5 is different. This reflects the fact that in order to change from an initial Δt_syn with −a < Δt_syn < 0 to the final Δt_syn = −a, either the axonal delay has to decrease, as in equation 4.3 (at fixed spike times t_A, t_B), or the dendritic latency has to increase, as in equation 5.5 (thereby increasing t_B).

5.1 Simulation Results. To test the dynamics 5.5 qualitatively for the
average dendritic latency D̄_den^lat, we simulated again the discretized version of equation 3.10. We initially chose eight and seven delay lines with dendritic latencies D_den^lat equally spaced within 4.2-5.6 ms and 10.4-11.6 ms, respectively (see Figure 4). The dendritic latency was decomposed into a constant EPSP rise time R(D) = t_rise = 4 ms and a corresponding forward dendritic delay D, with D_den^lat = D + t_rise. The backward dendritic delays were half the forward ones, B(D) = D/2. We further set σ = 0 and D_ax = D_syn = 0. The presynaptic cell was stimulated at a frequency of 20 Hz, and whenever the sum of EPSPs crossed the threshold h = 6.5 (cf. equation 5.2), a postsynaptic spike was elicited. The weight changes were calculated by evaluating ψ at Δt_syn = t_A − t_B − B(D), where t_A and t_B are the corresponding spike times of the pre- and postsynaptic neuron. The parameters in ψ were a = 10.5, b = 14, c = 0.7. To implement the other terms in equation 3.10, we reduced the weights by −c₁ w(D, t)⟨w⟩ with c₁ = 0.3 and c₂ = 0, and implemented the stochastic delay fluctuations as described in the previous section.

The simulation was run for 60 seconds of biological time. From both initial distributions, the weights w(D, t) converged to a gaussian distribution centered at D̄ = 4.8 ms (see Figure 4a). The time course of the spike time differences Δt = t_A − t_B and of the average dendritic latency D̄_den^lat extracted from the simulation described above are shown in Figure 4b. Superimposed is the evolution of D̄_den^lat(t) obtained from the approximated dynamics 5.5. According to the steady state of equation 5.5, the average total dendritic
delay, D̄_tot^den ≡ D̄_den^lat + B(D̄) ≈ D̄ + t_rise + B(D̄) ≈ 11.2 ms, is related to the first zero of ψ′, that is, to a = 10.5. The difference arises from the fact that the postsynaptic spike is typically triggered before the EPSP culmination, and the effective dendritic latency is therefore smaller than the one considered (see the inset of Figure 4b). Note that the time course of D̄_den^lat is roughly three times slower than that of D̄_ax (see Figure 3b), as predicted by the factor γ = 1/3 obtained from the formula in equation 5.5.

6 Coevolution of Axonal Delays and Dendritic Latencies
In the previous simulations, we fixed either the dendritic or the axonal delay, while fluctuations only in the other delay were considered. Under this restriction, we showed that each of the delays will separately evolve until the optimal timing imposed by the learning function is reached. For fixed dendritic delays and fixed spike correlations, the axonal delays adapt such that the induced EPSPs peak at the time the postsynaptic neuron is expected to fire. For fixed axonal delays, in turn, the dendritic delays adapt such that they finally match the width of the learning window. Such a self-organization toward unique delays, however, is no longer possible if the fluctuations affect both the axonal and the dendritic delays at the same time. This is because the synaptic strength is changed based only on the local time difference measured at the synaptic site, and this may be the same for different pairs of axonal and dendritic delays. In fact, according to equation 2.1, the local time difference is the same for all axonal and backward dendritic delays with d = D_ax − D_den^b = const, and the stability condition Δt̄_syn = Δt + d = −a (see the discussion after equations 4.3 and 5.5) is formally met for all pairs of the form (D_ax + ε, D_den^b + ε) with ε > 0. The same degeneracy is also present if D_ax changes confluently with the width a of the learning function. In either case, synapses that are not causally contributing to the postsynaptic spike are going to be strengthened, although they see the pre- and postsynaptic signal with the optimal timing (i.e., with a ms delay). It turns out that the unreliability of synaptic transmission is the property that prevents an acausal development, by disadvantaging synapses that are only "blind passengers" and do not contribute to the postsynaptic spike.
To reveal this point, we first observe that if the synaptic modification is induced, for example, via activation of the postsynaptic NMDA receptors, a synaptic release, and not just a presynaptic spike, must occur to cause the synaptic modification. Let us now consider two unreliable delay lines with the same release probability P_rel < 1 and the same relative delay d, but different axonal and dendritic delays and thus different total forward delays. Let us assume that the EPSP of the first delay line falls preferentially together with a subthreshold depolarization caused by a third input onto the same postsynaptic neuron. In this case, the EPSP from the first delay line is expected to trigger a spike more often than the EPSP from the second. This
implies that compared to the second delay line, the synaptic release from the first line is more strongly correlated with a postsynaptic spike, and since the occurrence of a release is a prerequisite for inducing a synaptic change, the first delay line is more often upregulated. The second delay line, which shows the same number of releases with the same local time difference, is less upregulated, since these releases do not exhibit the strong correlation with an immediately following spike. Additional uncorrelated postsynaptic spikes may downregulate both weights such that only the first delay line survives.

To formalize this reasoning in the general case, we reparameterize the synaptic weights according to w(D_ax, D), where D ≡ D_den^f again abbreviates the forward dendritic delay. We consider a postsynaptic priming scenario with additional synaptic input from other neurons onto B roughly Δt after the activity of the presynaptic neuron A. In addition, we now assume that the individual releases may also influence the timing of the postsynaptic spikes in a statistical sense. To simplify matters, we again assume a linear transfer function with threshold at 0 and, to reduce the number of constants, with gain 1. Instead of the correlation between the (instantaneous) pre- and postsynaptic firing rates, it is now the correlation between the presynaptic release rate and the postsynaptic firing rate given a release that determines the synaptic weight change. To describe the weight change of the synapse (D_ax, D) formally, we introduce the postsynaptic firing rate f_post(τ, D_ax, D; t) conditional on a presynaptic spike at time t + τ and a subsequent release at synapse (D_ax, D). In extension of equation 3.7, this conditional postsynaptic firing rate is

f_post(τ, D_ax, D; t) = f̄_pre w(D_ax, D; t) E_{R(D)}(τ + D_ax + D)
  + P_rel f̄_pre Σ_{D′_ax ≠ D_ax, D′ ≠ D} w(D′_ax, D′; t) E_{R(D′)}(τ + D′_ax + D′) + f̄_post^o.   (6.1)
Note that since we assume a release at synapse (D_ax, D), no factor P_rel occurs in the first term on the right-hand side. To lighten the notational load, we set D_syn = 0. The last term f̄_post^o represents the contribution of uncorrelated afferents to the postsynaptic firing rate.

By considering the effect of the individual spikes on the instantaneous postsynaptic rate, we now obtain an additional dependency of the spike correlation function 3.3 on the individual synapses and the relative presynaptic spike times τ. Substituting f̄_pre(t) with f̄_rel(t) and f̄_post(t) with f_post(τ, D_ax, D; t) in equation 3.4, we obtain a conditional release-spike correlation function for synapse (D_ax, D) of the form

C^{(D_ax, D)}(τ; t) = c G(Δt, σ; τ) + f̄_rel f_post(τ, D_ax, D; t).   (6.2)
Recall that the first term represents the contribution to the correlation function caused by a postsynaptic spike induced by other input. Instead of equations 3.2 and 3.8, we now get

dw(D_ax, D; t)/dt = w(D_ax, D; t) ∫_{−∞}^{∞} dτ ψ(τ + d) C^{(D_ax, D)}(τ; t)
  ≈ w(D_ax, D; t) [ c ψ_σ(Δt + d) + (1 − P_rel) f̄_pre w(D_ax, D; t) ψ_{R(D)}(−D_tot^den)
  + f̄_rel Σ_{D′_ax, D′} w(D′_ax, D′; t) ψ_{R(D′)}(−D′_tot^f + d) − ψ̄ f̄_rel f̄_post^o ],   (6.3)
where D_tot^den = D + R(D) + B(D) is the sum of the forward dendritic delay plus the EPSP rise time plus the backward dendritic delay, d = D_ax − B(D), and D′_tot^f = D′_ax + D′ + R(D′) is the total forward delay of another delay line. The ≈ in equation 6.3 comes from the approximation of the EPSPs by gaussian functions of normalized area, with center τ + D′_tot^f and standard deviation R(D′), so that the integrals reduce to the evaluation of ψ_{R(D′)} given by equation 3.6. For P_rel = 1 we recover in equation 6.3 our problem that the learning rule cannot distinguish between delay lines (D_ax, D) with common relative delay d (since then the bracket in equation 6.3 is independent of D_ax and D). If the synapses are unreliable, however, the second term in the brackets becomes positive, and the synaptic weights evolve differently for each delay line. Due to this positive feedback term, delay lines for which the value ψ_{R(D)}(−D_tot^den) is large have an evolutionary advantage over the other delay lines with the same d. Provided that √(a² + R(D)²) does not vary too much, the value of ψ_{R(D)} is largest for the delay line satisfying D_tot^den ≡ D + R(D) + B(D) ≈ √(a² + R(D)²). Note that the analogous condition is satisfied for the attracting steady state of the dynamics 5.5, which we deduced under the assumption of fixed axonal delays. On top, the first term in the brackets favors delay lines for which ψ_σ(Δt + d) is maximal and thus delays satisfying Δt̄_syn ≡ Δt + d = Δt + D_ax − B(D) ≈ −√(a² + σ²). Together with the previous condition, this fixes the optimal axonal delay D_ax, as well as the optimal forward dendritic delay D, characterizing the delay line. Recall that the latter condition is also satisfied in the attracting steady state of the dynamics 4.3. Note further that combining the two conditions while assuming R(D) ≈ σ yields D_tot^f ≡ D_ax + D + R(D) ≈ −Δt, and that this is also compatible with the third term within the bracket in equation 6.3. Thus, the learning rule prefers a total dendritic delay corresponding to the width of the effective learning function and a total forward delay corresponding to the average spike time difference between the pre- and postsynaptic neurons (see Figure 5).
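This reliability argument can be caricatured in a toy Monte Carlo (a deliberately minimal Python sketch; the release probability, background spike probability, and all-or-none update rule are my own simplifications, not the dynamics of equation 6.3). Two delay lines release equally often and see the same local time difference, but only the first line's EPSP can trigger the postsynaptic spike; because the update is conditioned on a release, the causal line accumulates more potentiation per trial:

```python
import random

random.seed(1)
P_rel, p_bg, trials = 0.5, 0.2, 20000
gain = [0.0, 0.0]                     # accumulated potentiation per delay line

for _ in range(trials):
    rel = [random.random() < P_rel for _ in range(2)]   # independent releases
    # Line 0 is causal: its release can trigger the spike. Line 1 arrives too
    # late to help; a spike then needs the causal line or background input.
    spike = rel[0] or (random.random() < p_bg)
    for i in range(2):
        if rel[i] and spike:          # NMDA-like: the update requires a release,
            gain[i] += 1.0            # and is potentiating when a spike follows

per_trial = [g / trials for g in gain]
print(per_trial)                      # roughly [0.5, 0.3]: the causal line wins
```

With reliable transmission (P_rel = 1), both lines would update on every trial with a spike and the asymmetry would vanish, which is precisely the degeneracy described above.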
604
Walter Senn, Martin Schneider, and Berthold Ruf
Figure 5: Sketch of the joint adaptation of the axonal and dendritic delays in the postsynaptic priming protocol. (a) Configuration of the different delays before learning. (b) Repetitive stimulations of neuron A, together with an increased background firing rate ("priming") of neuron B during the subsequent depolarization $|\Delta t|$ later, will lead to a drift of the axonal and dendritic delays such that eventually the average total dendritic delay matches the width of the learning function, $D^{den}_{tot} \equiv D^{f}_{den} + t_{rise} + D^{b}_{den} \approx a$, and the average total forward delay matches the time delay of the depolarization, $D^{f}_{tot} \equiv D_{ax} + D^{f}_{den} + t_{rise} \approx |\Delta t|$.
By comparing equation 3.8 with equation 6.3, we see that the negative feedback loop preventing the synaptic weights from growing to infinity is lost when we consider the spike correlation induced by the presynaptic neuron. The stability property is regained if we consider synaptic projections from other neurons being subject to the same type of synaptic modifications of the weights. In fact, if the spike activity among the presynaptic neurons has an uncorrelated component, an overall increase in the synaptic weights leads to an enhanced component $f^{o}_{post}$, and this establishes a negative feedback loop through the last term in equation 6.3. Recall that a positive $\bar{y}_\varepsilon$ is equivalent to a negative integral over the learning function $y_\varepsilon$. Alternatively, we may postulate that with each postsynaptic spike, the synaptic weight slightly degrades, independent of any spike timing (Kempter et al., 1999). In this case, we obtain an additional term in the brackets of equation 6.3, which is proportional to $-(f^{o}_{post} + f_{rel}\sum_{D'_{ax},D'} w(D'_{ax}, D'; t))$. Since this term is equal for all delay lines, it acts only as a normalization without distorting the weight distribution and therefore changes neither the average axonal nor the average dendritic delay. Finally, small, unbiased stochastic fluctuations in the axonal and dendritic delays will cause the average delays to move toward the steady state given by the above conditions, $D^{den}_{tot} \approx \sqrt{a^2 + R(D)^2}$ and $D^{f}_{tot} \approx -\Delta t$. Similarly to the last term in equation 3.9, these stochastic fluctuations will appear in equation 6.3 as a diffusion term of the form $\epsilon^2\left(\frac{\partial^2 w}{\partial D_{ax}^2} + \frac{\partial^2 w}{\partial D^2}\right)$.
Activity-Dependent Development of Axonal and Dendritic Delays
605
6.1 Simulation Results. To test whether the unreliability of the synaptic transmission indeed helps to select a specific configuration among the two-dimensional parameterization of delay lines, we combined the "supervised" and "unsupervised" learning scenarios from the previous sections. We initially distributed 20 delay lines with axonal delays $D_{ax}$ between 8.5 and 11.5 ms and forward dendritic delays $D \equiv D^{f}_{den}$ between 4.5 and 7.5 ms and set the corresponding synaptic weights to $w(D_{ax}, D) = 0.4$, while the remaining weights were zero. The backward dendritic delay of each delay line was half the forward dendritic delay, $B(D) \equiv D^{b}_{den}(D) = \frac{1}{2}D$, and the EPSP rise time was set to $R(D) \equiv t_{rise} = 3$ ms. The common release probability was $P_{rel} = 0.5$, and the threshold of the postsynaptic neuron was set to $\vartheta = 20$ mV. We stimulated the presynaptic neuron with a periodic train of $f_{pre} = 20$ Hz and primed the postsynaptic cell with a 10 mV subthreshold depolarization between 25 and 27 ms after each presynaptic spike (cf. Figure 5). This induces the additional spike correlation described in equation 6.2, with $\Delta t = -26$ ms and $\varsigma \approx 1$ ms. Whenever the postsynaptic potential, composed of this depolarization and the sum of the evoked EPSPs (cf. equations 5.1 and 5.2), crossed the threshold, a postsynaptic spike was elicited unless a spontaneous spike had already been triggered within the preceding 5 ms. The postsynaptic cell showed a spontaneous background firing rate $f^{o}_{post} = 3$ Hz, which was increased to an instantaneous Poisson rate of 30 Hz during the 2 ms period of the additional depolarization. For each pair of synaptic release and postsynaptic spike, we modified the corresponding synaptic weight according to $\Delta w = w\, y_\varepsilon(\Delta t_{syn})$, with the learning function $y_\varepsilon$ given by equation 2.2 and the local time difference $\Delta t_{syn}$ given by equation 2.1. The parameters of $y_\varepsilon$ were $a = 14$ ms, $b = 20$ ms, and $c = 0.8$.
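The functional form of the learning function $y_\varepsilon$ (equation 2.2) is not reproduced in this excerpt. The sketch below therefore uses a generic temporally asymmetric window — exponential LTP and LTD lobes, with the quoted parameters $a$, $b$, and $c$ reinterpreted as peak offset, decay constant, and LTD/LTP ratio — purely as an assumed stand-in, to illustrate the multiplicative update $\Delta w = w\, y_\varepsilon(\Delta t_{syn})$:

```python
import math
import random

def learning_window(dt_syn, a=14.0, b=20.0, c=0.8):
    """Generic temporally asymmetric window (an assumed form, NOT eq. 2.2):
    potentiation peaking at dt_syn = -a ms, depression for dt_syn > 0."""
    if dt_syn <= 0:
        return math.exp(-abs(dt_syn + a) / b)  # LTP lobe
    return -c * math.exp(-dt_syn / b)          # LTD lobe

w = 0.4
random.seed(1)
for _ in range(1000):
    # presynaptic release precedes the postsynaptic spike by ~14 ms on average
    dt_syn = random.gauss(-14.0, 2.0)
    w += 0.001 * w * learning_window(dt_syn)   # multiplicative update
    w = min(w, 1.0)                            # illustrative cap

print(w > 0.4)  # True: the weight grows when dt_syn clusters near the LTP peak
```

With pairings clustered near the LTP peak the multiplicative rule strengthens the connection, as in the simulation; shifting the mean of `dt_syn` into positive values would instead depress it.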
To prevent infinite synaptic growth, we reduced the synaptic weights after each postsynaptic spike by $0.01\,w$ times the sum of all synaptic weights, as discussed at the end of the last paragraph. To simulate the stochastic fluctuations, we assumed a two-dimensional mesh grid of axonal and dendritic delays $(D_{ax}, D)$ with a granularity of 0.2 ms. After each presynaptic spike, we replaced each synaptic weight, with probability 0.1, by 80% of its own value plus 5% of the weight of each of its four neighbors. The simulation was run for 35 minutes of biological time. Although initially the presynaptic neuron could not contribute to the firing of the postsynaptic cell, it learned to do so after 28 minutes of correlated spike activity (see Figure 6a). After this period, the postsynaptic spikes were always evoked by the EPSPs from the presynaptic cell, which were strengthened by the rule. The connections adapted their total forward delay such that $D^{f}_{tot} \equiv D_{ax} + D^{f}_{den} + t_{rise} = 27.0 \approx -\Delta t = 26$ ms, and this led to a peak of the EPSPs during the period of the additional postsynaptic depolarization (see Figure 6b and the dashed-dotted line in Figure 7a).
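The stochastic redistribution step on the delay mesh can be sketched as follows. The 5×5 grid is illustrative, and for a deterministic demonstration the redistribution is applied with probability 1 instead of the 0.1 used in the simulation:

```python
import random

def diffuse(w, p=0.1):
    """One delay-fluctuation step on the (D_ax, D) mesh grid: with
    probability p, a weight is replaced by 80% of its own value plus
    5% of each of its four neighbors (zero outside the grid)."""
    rows, cols = len(w), len(w[0])
    out = [row[:] for row in w]
    for i in range(rows):
        for j in range(cols):
            if random.random() < p:
                nb = sum(w[x][y]
                         for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                         if 0 <= x < rows and 0 <= y < cols)
                out[i][j] = 0.8 * w[i][j] + 0.05 * nb
    return out

grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 0.4                 # a single active delay line
for _ in range(50):
    grid = diffuse(grid, p=1.0)  # p = 1 makes the demo deterministic
print(grid[2][2] < 0.4)          # True: weight spreads to neighboring delay lines
```

This is exactly the discrete counterpart of the diffusion term $\epsilon^2(\partial^2 w/\partial D_{ax}^2 + \partial^2 w/\partial D^2)$: without the weight-dependent drift supplied by the learning rule, the redistribution only smears the weight distribution over neighboring delays.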
Figure 6: Postsynaptic activity during the priming protocol (cf. Figure 5). (a) Raster plot of the postsynaptic spikes. Only every tenth trial is shown. During the time interval 25–27 ms after a presynaptic spike, a postsynaptic depolarization increased the probability of a spontaneous postsynaptic spike. After 28 minutes of simulated time (corresponding to ≈34,000 pairings), the presynaptic neuron learned to trigger the postsynaptic spike during the period of the postsynaptic depolarization (darker points, top right). (b) Same as in (a), but superposed with the sum of EPSPs induced by the presynaptic neuron. The horizontal lines indicate the period of the additional postsynaptic depolarization. For parameter values, see the text.
The simulation shows that it is possible to learn the peak EPSP implicitly with a precision of almost 1 ms, although the width $a$ of the learning function itself is more than 10 times broader. To explain this fact, we inspect the different dendritic delays after reaching the steady state (see Figure 7a). Toward 30 minutes of simulation time, the total dendritic delay eventually fully covers the learning window, $D^{den}_{tot} \equiv D^{f}_{den} + t_{rise} + D^{b}_{den} = 13.5 \approx \sqrt{a^2 + t_{rise}^2} = 14.3$ ms, as predicted by the theoretical considerations above (see Figure 7a; cf. the sketch in Figure 5). Independent of the width of the LTP branch, the total forward delay $D^{f}_{tot}$ adapts toward the typical spike time difference $\Delta t$ and thus eventually supports existing temporal structures in the spike activity, as long as the width $a$ of the learning function can be absorbed by the total dendritic delay $D^{den}_{tot}$. Finally, the stochastic synaptic transmission ($P_{rel} < 1$) led to a unique average axonal delay $D_{ax} \approx 17$ ms and a unique average forward dendritic delay $D^{f}_{den} \approx 7$ ms after the application of the stimulation protocol (see Figure 7b). Without transmission failures ($P_{rel} = 1$), an elongated ridge would evolve in the $(D_{ax}, D^{f}_{den})$-space, corresponding to those delays with fixed synaptic time difference $\Delta t_{syn}$ between the pre- and postsynaptic signals. This shows that unreliable synapses are in fact necessary for a unique development of the axonal and dendritic delays.
Figure 7: Implicit adaptation of the different delays during the postsynaptic priming protocol (cf. Figure 5). (a) The evolution of the mean axonal delay and mean dendritic latency during the simulation shows that the synaptic modification implicitly adjusts the different delays according to equations 4.3 and 5.5, such that eventually the average total forward delay matches the average spike time difference, $D^{f}_{tot} = D_{ax} + D^{lat}_{den} = |\Delta t| = 26$ ms, and such that the total average dendritic delay matches the width of the learning function, $D^{den}_{tot} = D^{lat}_{den} + D^{b}_{den} \approx a$ (with $D^{lat}_{den} = D^{f}_{den} + t_{rise}$). (Lower thick line: $D_{ax}$; dots: $D_{ax} + D^{f}_{den}$; dashed-dotted line: $D^{f}_{tot} = D_{ax} + D^{f}_{den} + t_{rise}$; upper thick line: $D^{f}_{tot} + D^{b}_{den}$; upper dots: $D_{ax} + a$; shaded band: time of subthreshold input.) (b) Distribution of the axonal and forward dendritic delays at the initial configuration (left blur) and after 35 minutes of simulated time (right blur). The gray level codes the synaptic weight corresponding to the different delays. Without transmission failures, the final distribution of the delays would be smeared out between the two dashed lines.
7 Discussion
The work presented here extends the idea of delay selection by showing that slow stochastic fluctuations in the axonal and dendritic delays lead to delay drifts beyond the original delay distributions (cf. Figures 3 and 4). We showed that during correlated pre- and postsynaptic activity, the temporally asymmetric synaptic modification implicitly (1) adjusts the total forward delay, composed of the axonal delay, the synaptic delay, the forward dendritic delay, and the EPSP rise time, until it matches the statistically dominant time difference between the pre- and postsynaptic spikes, and (2) tunes the total dendritic delay, composed of the forward dendritic delay, the EPSP rise time, and the backward dendritic delay, until it fits the width of the effective learning function (cf. Figures 5, 6, and 7). Although the evolution of these delays is perturbed by uncorrelated spontaneous activity, the precision of the delay adaptation may be an order of magnitude finer than the width of the learning function itself. Interestingly, the tight selection of the optimal delays is possible only if the synaptic transmission is unreliable. This unreliability gives the "successful" delay lines an evolutionary advantage over the others. To stabilize the self-organization of the different delays, one must assume that the integral over the learning function is negative. Thus, three properties are important for an activity-dependent development of axonal delays and dendritic latencies through asymmetric weight modifications: slow, unbiased delay fluctuations; unreliable synapses; and LTD dominating LTP.

7.1 Physiological Evidence for the Hypothesized Mechanisms. The shape of the temporally asymmetric learning function (see equation 2.2 and Figure 2) is motivated by different recent experiments (Markram et al., 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Feldman, 2000). These experiments do not allow discriminating among axonal, synaptic, and dendritic delays as defined in equation 2.1. Instead, they provide the change in the synaptic strength as a function of the spike time difference $t_A - t_B$ between the two neurons recorded from. In these experiments, the neurons lie nearby, and in the case of cortical recordings, they mostly synapse on each other's proximal dendritic tree. As a consequence, the axonal and dendritic delays appear to be small, and the time difference at which maximal upregulation is observed corresponds roughly to our time difference $a$ defining the peak in the learning function 2.2. Depending on the experimental study, this peak lies between $-25$ and 0 ms. Different sources for the pre- and postsynaptic delays exist. Axonal propagation velocities for horizontal cortico-cortical connections are on the order of 0.2 m per second, corresponding to a propagation time of 50 ms for cells that are 1 cm apart, while the propagation velocities of vertical fibers are roughly 10 times faster (Bringuier et al., 1999).
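The quoted conduction times follow directly from distance over velocity; a trivial check (the 1 cm distance for the vertical-fiber line is an illustrative choice):

```python
def propagation_ms(distance_m, velocity_m_per_s):
    """Axonal conduction delay in milliseconds."""
    return 1000.0 * distance_m / velocity_m_per_s

print(propagation_ms(0.01, 0.2))  # horizontal fibers: 1 cm at 0.2 m/s -> 50.0 ms
print(propagation_ms(0.01, 2.0))  # vertical fibers, ~10x faster     ->  5.0 ms
```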
Considerable delays may also be present postsynaptically. In passive dendritic structures, delays of up to 30 ms and more were calculated for the time difference between the centroids of an injected current, the locally induced EPSP, and the EPSP propagated into the soma (Agmon-Snir & Segev, 1993). The peak-to-peak time from a distal EPSP to the forward-propagated somatic EPSP and from the somatic AP to the backpropagated dendritic AP can both reach up to 8 ms (Segev, Rapp, Manor, & Yarom, 1992; Stuart & Sakmann, 1994). In addition to these peak-to-peak delays, the signal is delayed by the rise time of the somatic EPSP, which is on the order of 3 to 10 ms or more. Moreover, in active dendritic structures, transient potassium currents can delay the generation of a postsynaptic AP by tens up to hundreds of milliseconds (McCormick, 1991). The small delay fluctuations we assume may come along with the appearance of new dendritic spines and synapses observed during the induction of long-term modifications, or with their unspecific stochastic disappearance (Engert & Bonhoeffer, 1999; Toni, Buchs, Nikonenko, Bron, & Muller, 1999). Other sources of delay fluctuations may be new axonal collaterals bifurcating from an existing synaptic connection and targeting onto a different
dendritic subtree. The axonal and dendritic delays of such new connections may be slightly different owing to different channel densities or morphological properties of the corresponding axonal collateral or dendritic subtrees (Engert & Bonhoeffer, 1999; Maletic-Savatic, Malinow, & Svoboda, 1999). Variable delays are particularly present at the developmental stage of the central nervous system, during which neurites can grow at speeds of up to 50 μm per hour (Gomez & Spitzer, 1999). This speed is inversely proportional to the frequency of the endogenous Ca²⁺ transients, and it is tempting to relate the Ca²⁺ signals decelerating neurite growth to the Ca²⁺ signals accelerating synaptic long-term changes once the connections are formed (see Bliss & Collingridge, 1993). Evidence for the formation of new connections during development is given by the functional differences between the axonal arborizations of cortical V1 cells in young and adult cats. In the juvenile animal, horizontal axon collaterals often form connections only in a restricted proximal region, while connections are found on distal axon terminals in the adult animal (Katz & Shatz, 1996). In pyramidal cells, dendritic latencies may change during the gated growth of the apical tree toward the pial surface, where it integrates input from the superficial layers (Polleux, Morrow, & Ghosh, 2000). The hypothesis that the emergence and selection of interneuronal delays might be guided by a temporally asymmetric learning rule is further supported by the fact that this type of synaptic modification is particularly prominent in embryonic cultures (Bi & Poo, 1998, 1999) and in cortical slices of animals younger than 2 or 3 weeks (Markram et al., 1997; Feldman, 2000), while long-term plasticity cannot be found at synapses from the thalamic input after this stage (Crair & Malenka, 1995; Feldman, Nicoll, Malenka, & Isaac, 1998).
There are two possible concerns related to recent experimental findings that are worth discussing. First, we would like to emphasize that short-term synaptic plasticity, as found between cortico-cortical pyramidal cells (Markram, Pikus, Gupta, & Tsodyks, 1998), does not fundamentally change the existing picture of delay adaptation. Short-term depression, for instance, is believed to reduce the vesicle release probability as a function of the presynaptic activity and hence could easily be incorporated in our analysis, where the release probability explicitly enters. On a phenomenological level, synaptic depression accentuates changes in the presynaptic firing rates, which may even support the temporal structure between the pre- and postsynaptic cells (compare also Tsodyks & Markram, 1996; Abbott, Varela, Sen, & Nelson, 1997; and Senn, Segev, & Tsodyks, 1998). In this context, the long-term modification of synaptic depression (Markram & Tsodyks, 1996) has a similar effect on delay selection as the long-term modification of the absolute synaptic strength. Second, recent findings show that for CA1 pyramidal neurons in the hippocampus, the shape of the somatic EPSP is independent of the synaptic location on the dendritic tree (Magee & Cook, 2000). In our framework, this implies that there is no one-to-one correspondence between the synaptic location and the peak of the learning function.
It should be emphasized, however, that experimental support for the location independence of the synaptic response in mammals is present only in hippocampal neurons and, to some degree, in motoneurons (see Magee & Cook, 2000, for references). In contrast, neocortical layer 5 pyramidal cells, for instance, actively attenuate synaptic inputs from the apical tree to the soma and show a wide range of somatic EPSP rise times (Berger, Larkum, & Lüscher, 2001). These neurons can well act as integrator and coincidence detector in one (Larkum, Zhu, & Sakmann, 1999) and therefore require a mechanism that determines whether a synaptic input should supply contextual information (by a broad, distally generated EPSP) or timing information (by a sharp, proximally generated EPSP).

7.2 Some Predictions. Synaptic delay lines and their modification have recently been the focus of an experimental study in cultured networks of embryonic rat hippocampus (Bi & Poo, 1999). This work shows that the different mono- or polysynaptic delay lines between two neurons may indeed play a crucial role in the induction of either LTP or LTD. If paired-pulse stimulation of a presynaptic neuron triggers an AP in a postsynaptic neuron through a specific pathway, then the strengths of that and of putatively shorter pathways are upregulated, while pathways with putatively longer delays are downregulated. Our investigation suggests that simultaneous single-pulse stimulations at two different locations with interpulse intervals of 10, 30, and 50 ms would shift, after minutes of repetitions, the (axonal) delays of the pathways toward 10, 30, and 50 ms, respectively. Besides studying network effects, it would be interesting to investigate in vitro synaptic modifications at neurons with delayed responses and to test whether the learning windows are in fact broadly tuned to match the dendritic latencies.
In the context of subthreshold receptive fields (Bringuier et al., 1999), our work predicts an asymmetric distribution of horizontal propagation velocities in the primary visual cortex of cats reared under unidirectional background motions (see Section 1). Another prediction is that, due to the self-organization of the synaptic locations, distal synapses with slow somatic EPSPs should have broader learning functions than proximal ones with fast somatic EPSPs. In fact, for certain connections to layer IV spiny stellate cells, the peak of the spike-based temporal modification function is measured to be at $t_{post} - t_{pre} = 60$ ms (Cowan & Stricker, 1999).

7.3 A Further Example: Learning Delays Between Cortical Areas. Our analysis also applies to the problem of delay adaptation between neuronal populations with correlated activities. As an example, we may consider the learning of finger sequences, say, for playing a musical instrument. The storage and recall of such sequences require a temporal control of the activity in the motor cortex, from which the appropriate spinal motor patterns are recruited. According to a current hypothesis, the automation of motor sequences is a form of habit learning, which involves the basal ganglia–thalamocortical feedback loop (Petri & Mishkin, 1994; Rolls & Treves, 1998). Such learning is driven by an external input and probably internally involves a delayed reinforcement signal. Let us assume that when looking in sequence at two notes, say a–b, the visual input eventually evokes through the parietal pathway a well-timed activity at sites A and B, respectively, in the motor cortex (see Figure 8a; cf. also Ungerleider, 1995). During the many repetitions of the finger sequence, the basal ganglia are taught by the motor cortex to store a mirror image of the cortical spatiotemporal activity pattern. The input from the basal ganglia to the motor cortex, which correctly predicts the ongoing visual input, may evoke a reinforcement signal back to the basal ganglia, for instance, through dopamine projections, which strengthens the appropriate pathways through a temporally asymmetric rule. After learning, the motor sequence can be retrieved in the habit mode by a sequence of look-ups from the basal ganglia, thereby freeing the motor cortex for other, higher-level tasks. The temporal structure between the two cortical populations and the corresponding memory trace in the basal ganglia is similar to that between a pre- and postsynaptic neuron and the connecting synapses (see Figure 8b).

Figure 8: Learning time delays between neuronal populations. (a) Sequences of finger movements, for example, for playing a musical instrument, require the temporal control of activity in the motor cortex. For instance, when looking at the notes a–b, the visual stimulus activates through cortico-cortical pathways in sequence the populations A and B in the motor cortex at times $t_A$ and $t_B = t_A + |\Delta t|$. When automatizing such finger sequences, the corresponding activity pattern is believed to be stored in the basal ganglia (S), from where it can be retrieved through the basal ganglia–thalamocortical feedback loop. (b) The temporal structure between the activities at the sites A, B, and S in the motor cortex and the basal ganglia, respectively, is similar to that considered for a single synaptic connection (see Figure 1a, where A, B, and S stand for the presynaptic soma, the postsynaptic soma, and the synaptic site, respectively), although it may extend over a larger timescale. A reinforcement signal from B to S (mediated, e.g., by dopamine projections) modifying the strength of the relay S according to a temporally asymmetric rule may similarly select pathways through S that support a sequence of activities in the upper area. Assuming a variety of delay lines with stochastic transmission, our analysis predicts that after repeated presentation of the visual stimulus sequence, the total forward delay through S will eventually match the time difference between the cortical populations A and B, $D^{f}_{AS} + D^{f}_{SB} = |\Delta t|$. This is possible even in the presence of a delay $D^{b}_{SB}$ of the reinforcement signal. In this case, the forward delays are selected such that, in addition, $D^{f}_{SB} + D^{b}_{SB} = a$ holds, where $a$ is the peak in the learning function of the relay S (cf. also Figure 5).

Since the reinforcement signal intrinsically has a delay beyond the temporal precision of the motion, an additional gating mechanism is required that takes this delay into account. Our analysis shows that a temporally asymmetric learning rule is enough to solve this timing problem. Put into a general framework, the problem is to adjust the delay from A to B, given that the tuning mechanism is located at some intermediate site S and hence itself receives only delayed signals from A and B. It is not evident how such a delay adaptation can work at all. Summarized, the constraints are:

Implicitness: No explicit mechanism exists for a directed adaptation of the individual delays. Rather, the average delay of different nearby delay lines can only implicitly be adapted by strengthening or weakening the connections in the presence of slow, unbiased delay fluctuations.

Locality: There is no global view that tells the time elapsed between the events at the two sites. Rather, only the time differences between a forward and a backward signal can be locally measured at the site S in between the two signal sources.
Directionality: In adapting the delay line based on the local time differences, one should take into account the backward delay from site B to site S and the fact that this backward delay may differ from the corresponding forward delay.

Recurrence: The change in the connection strengths may itself influence the activity at location B and thus change the statistics of the temporal relationship between the signals at the two locations.

As we showed, a temporally asymmetric learning rule may simultaneously cope with all these constraints.

7.4 Comparison with Recent Theoretical Work. The apparent importance of temporal relationships among cortical activities has motivated different studies on delay adaptation. Hüning et al. (1998) suggested that synaptic delays should adapt proportionally to the negative temporal derivative of the EPSP, with the peak centered at the postsynaptic spike time. Such an explicit
delay adaptation, which assumes temporal modifications of intracellular messenger cascades, is, however, difficult to motivate, apart from the fact that the effective synaptic delays measured in central synapses are in the range of 0.4 ms (Lloyd, 1960), a number that was corrected downward in a later study to 0.17 ms (Munson & Sypert, 1979). Other works suggested that delay changes should have qualitatively the same form as the learning function for the synaptic weights (Eurich, Cowan, & Milton, 1997). If deduced from the modification of the synaptic strength, however, the rule for the delay adaptation is not given by the derivative of the EPSP and is not identical to the original weight modification. In contrast, as our analysis shows, the best motivated form for an explicit delay adaptation is given by the derivative of the original weight modification function (cf. Figure 2a). In a series of other works, delay modifications are suggested to result from two independent processes: delay selection through weight modification and delay shifts through explicit delay adaptation (Eurich, Ernst, & Pawelzik, 1998; Eurich et al., 1999). However, no physiological evidence for an explicit delay adaptation exists so far, and, as we argue, no such rule is in fact necessary. In the presence of weak delay fluctuations and stochastic synaptic transmission, a temporally asymmetric synaptic weight modification is enough to explain axonal and dendritic delay shifts. While the adaptation of the total forward delay occurs independently of the peak in the learning function, this peak implicitly determines the width of the somatic EPSP. Hence, during development, the rule controls axonal delays and dendritic latencies to support stimulus-driven temporal structures and to generate different types of causal relationships between neuronal activities.

Appendix

A.1 Proof of Equation 4.3.
Defining the normalized density $p(\delta) = \frac{w(\delta, t)}{\langle w \rangle_t}$ of connections from neuron B to neuron A with relative delay $\delta$, we may write

$$\bar\delta = \int_{-\infty}^{+\infty} \delta\, p(\delta)\, d\delta \equiv \langle \delta\, p(\delta) \rangle,$$

where by definition $\langle \cdot \rangle$ denotes the integral over $\delta$. Inserting the expansion

$$y(\Delta t + \delta) \approx y(\Delta t + \bar\delta) + (\delta - \bar\delta)\, y'(\Delta t + \bar\delta) \qquad (A.1)$$

into the rule 3.10 and averaging with respect to $\delta$, we get

$$\langle \dot w \rangle \approx \langle w \rangle\, y(\Delta t + \bar\delta) + \langle w (\delta - \bar\delta) \rangle\, y'(\Delta t + \bar\delta) - \langle w \rangle \left( c_1 \langle w \rangle + c_2 \right) + \epsilon^2 \left\langle \frac{\partial^2 w}{\partial \delta^2} \right\rangle. \qquad (A.2)$$

This approximation is good if we assume that the width $\sigma$ of the delay distribution is narrow compared to the width $2(a + \varsigma)$ of $y$. Evaluating the integral in the last term of the right-hand side, we find $\langle \frac{\partial^2 w}{\partial \delta^2} \rangle = 0$, since we may assume that the derivative $\frac{\partial w}{\partial \delta}$ vanishes at the boundaries $\delta = \pm\infty$. Since

$$\langle w (\delta - \bar\delta) \rangle = \langle w \delta \rangle - \langle w \rangle \bar\delta = \frac{\langle w \delta \rangle}{\langle w \rangle} \langle w \rangle - \langle w \rangle \bar\delta = 0,$$

the second term in equation A.2 cancels as well. Multiplying the remainder of equation A.2 by $\bar\delta$, we get

$$\bar\delta \langle \dot w \rangle \approx \bar\delta \langle w \rangle\, y(\Delta t + \bar\delta) - \bar\delta \langle w \rangle \left( c_1 \langle w \rangle + c_2 \right). \qquad (A.3)$$

By equation 3.10 and the Taylor expansion, equation A.1, we also approximate after averaging

$$\langle \delta \dot w \rangle \approx \langle \delta w \rangle\, y(\Delta t + \bar\delta) + \left( \langle \delta^2 w \rangle - \langle \delta w \rangle \bar\delta \right) y'(\Delta t + \bar\delta) - \langle \delta w \rangle \left( c_1 \langle w \rangle + c_2 \right) + \epsilon^2 \left\langle \frac{\partial^2 (\delta w)}{\partial \delta^2} \right\rangle. \qquad (A.4)$$

Again, the last term on the right-hand side vanishes by partial integration with boundary conditions $w|_{\pm\infty} = \frac{\partial w}{\partial \delta}\big|_{\pm\infty} = 0$. The factor in the second term reduces to

$$\langle \delta^2 w \rangle - \langle \delta w \rangle \bar\delta = \left( \frac{\langle \delta^2 w \rangle}{\langle w \rangle} - \bar\delta\, \frac{\langle \delta w \rangle}{\langle w \rangle} \right) \langle w \rangle = \left( \overline{\delta^2} - \bar\delta^2 \right) \langle w \rangle = \sigma^2 \langle w \rangle,$$

where $\sigma^2$ is the variance of the delay distribution given by the density $p(\delta)$. Using $\langle \delta w \rangle = \bar\delta \langle w \rangle$, equation A.4 simplifies to

$$\langle \delta \dot w \rangle \approx \bar\delta \langle w \rangle\, y(\Delta t + \bar\delta) + \sigma^2 \langle w \rangle\, y'(\Delta t + \bar\delta) - \bar\delta \langle w \rangle \left( c_1 \langle w \rangle + c_2 \right). \qquad (A.5)$$

Subtracting equation A.3 from A.5, we obtain

$$\langle \delta \dot w \rangle - \bar\delta \langle \dot w \rangle \approx \langle w \rangle\, \sigma^2\, y'(\Delta t + \bar\delta),$$

and combining this with equation 4.2, we get the desired rule, equation 4.3, governing the adaptation of the average relative delay.

A.2 Proof of Equation 4.4. We first show that the weight distribution is nonsingular. To this end, we derive the differential equation for $\sigma^2$ and show that for small $\sigma^2$, we always have $\frac{d\sigma^2}{dt} > 0$. Substituting equation 3.10 for $\frac{dw}{dt}$, one calculates
$$\frac{d}{dt}\sigma^2 = \frac{d}{dt}\left( \overline{\delta^2} - \bar\delta^2 \right) = \frac{d}{dt}\left( \frac{\langle \delta^2 w \rangle}{\langle w \rangle} - \frac{\langle \delta w \rangle^2}{\langle w \rangle^2} \right) = \cdots = 2\epsilon^2 + \left( \overline{\delta^2 y} - \overline{\delta^2}\, \bar y \right) + 2 \bar\delta \left( \bar\delta\, \bar y - \overline{\delta y} \right),$$

where the overlining denotes the average with respect to the normalized density $p(\delta)$. In the course of this calculation, we used that by partial integration with boundary conditions $w|_{\pm\infty} = \frac{\partial w}{\partial \delta}\big|_{\pm\infty} = 0$, one finds $\langle \frac{\partial^2 w}{\partial \delta^2} \rangle = \langle \delta \frac{\partial^2 w}{\partial \delta^2} \rangle = 0$ and $\langle \delta^2 \frac{\partial^2 w}{\partial \delta^2} \rangle = 2 \langle w \rangle$. Assuming that $\sigma^2 \to 0$, the density $p$ would become a delta function, and the average would commute with the product. Hence we would also have $(\overline{\delta^2 y} - \overline{\delta^2}\, \bar y) \to 0$ and $(\overline{\delta y} - \bar\delta\, \bar y) \to 0$, and therefore $\frac{d}{dt}\sigma^2 \to 2\epsilon^2$. But this contradicts the above assumption, and we conclude that, in fact, $\sigma^2$ cannot shrink to zero.

We can now fit the nonsingular steady-state distribution with the gaussian $\tilde w(\delta) = w(\bar\delta) \exp\left( -\frac{(\delta - \bar\delta)^2}{2\sigma^2} \right)$. Inserting $\tilde w$ for $w$ in equation 3.10 and evaluating at $\bar\delta$ and $\bar\delta \pm \sigma$ yields $0 = y(-\sqrt{a^2 + \varsigma^2}) - c_1 \langle \tilde w \rangle - c_2 - \epsilon^2 / \sigma^2$ and $0 = y(-\sqrt{a^2 + \varsigma^2} \pm \sigma) - c_1 \langle \tilde w \rangle - c_2$. Eliminating $c_1 \langle \tilde w \rangle + c_2$ and using the approximation $y(-\sqrt{a^2 + \varsigma^2}) - y(-\sqrt{a^2 + \varsigma^2} \pm \sigma) \approx -y''(-\sqrt{a^2 + \varsigma^2})\, \sigma^2 / 2$ gives $y''(-\sqrt{a^2 + \varsigma^2})\, \sigma^2 / 2 + \epsilon^2 / \sigma^2 = 0$, from which equation 4.4 follows. Finally, the total synaptic weight at the steady state can be qualitatively approximated by $\langle \tilde w \rangle \approx \left( y(-\sqrt{a^2 + \varsigma^2}) - c_2 - \epsilon^2 / \sigma^2 \right) / c_1$, with $\sigma$ given by equation 4.4. Since $\langle \tilde w \rangle = \sigma \sqrt{2\pi}\, w(\bar\delta)$, the maximal weight at the steady state is roughly

$$w(\bar\delta) \approx \frac{y(-\sqrt{a^2 + \varsigma^2}) - c_2 - \epsilon^2 / \sigma^2}{c_1\, \sigma \sqrt{2\pi}}.$$
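The drift relation derived in Section A.1, $\langle \delta \dot w \rangle - \bar\delta \langle \dot w \rangle \approx \langle w \rangle\, \sigma^2\, y'(\Delta t + \bar\delta)$, can be checked numerically for a narrow gaussian weight profile. The smooth stand-in used below for the learning function is an arbitrary illustrative choice:

```python
import math

# Discretized delay axis and a narrow gaussian weight profile w(delta)
N, lo, hi = 4001, -10.0, 10.0
h = (hi - lo) / (N - 1)
delta = [lo + i * h for i in range(N)]
mu, sigma = 1.0, 0.5
w = [math.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) for d in delta]

def avg(f):
    """<.> : integral over delta (simple Riemann sum)."""
    return sum(f) * h

y  = lambda d: math.sin(0.3 * d)        # smooth stand-in for the learning function
yp = lambda d: 0.3 * math.cos(0.3 * d)  # its derivative
dt = -2.0                               # Delta t

# wdot from rule 3.10, dropping the decay and diffusion terms, which
# cancel exactly in the difference <delta*wdot> - mu*<wdot>
wdot = [wi * y(dt + d) for wi, d in zip(w, delta)]

lhs = avg([d * wd for d, wd in zip(delta, wdot)]) - mu * avg(wdot)
rhs = avg(w) * sigma ** 2 * yp(dt + mu)
print(abs(lhs - rhs) / abs(rhs) < 0.05)  # True: the drift matches sigma^2 * y'
```

The agreement improves as $\sigma$ shrinks relative to the curvature scale of $y$, exactly the narrowness assumption invoked after equation A.2.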
A.3 Proof of Equation 5.5. For fixed presynaptic delays, we have, according to equation 5.3,

$$\frac{dt_B}{dt} = \frac{dD_{post}}{dt} \approx \left( 1 + R'(\bar D) \right) \frac{d\bar D}{dt} = \frac{dD_{post}}{d\bar D}\, \frac{d\bar D}{dt}, \qquad (A.6)$$

where $\bar D$ is the average forward dendritic delay and $R(D) \approx R(\bar D)$ the average rise time. Since we are interested in replacing the factor $d\bar D / dt$ in equation A.6 by a term containing the learning function $y_\varepsilon$, we first replace the change in $\bar D$ by a change in $\bar B$, from where we bridge to $y_\varepsilon$. Thus, we first write

$$\frac{d\bar D}{dt} = \left( \frac{dB}{dD} \right)^{-1} \frac{dB(\bar D)}{dt}. \qquad (A.7)$$

In the same way as we deduced equation 4.3 in Section A.1 (note that there we have $\frac{d\Delta t_{syn}}{dt} = \frac{d(\Delta t + \bar\delta)}{dt}$), we may approximate the change of the average time difference $\overline{\Delta t}_{syn} = t_A + D_{pre} - (t_B + \bar B)$ at the synaptic site by

$$\frac{d\overline{\Delta t}_{syn}}{dt} = -\frac{dt_B}{dt} - \frac{d\bar B}{dt} \approx \sigma_B^2\, y_\varepsilon'(\overline{\Delta t}_{syn}), \qquad (A.8)$$

where $\sigma_B$ is the standard deviation of the distribution of $B(D)$. Solving equation A.8 for $\frac{d\bar B}{dt}$, identifying $\frac{d\bar B}{dt} \approx \frac{dB(\bar D)}{dt}$, and inserting into equation A.7 yields

$$\frac{d\bar D}{dt} \approx \left( -\frac{dt_B}{dt} - \sigma_B^2\, y_\varepsilon' \right) \left( \frac{dB(\bar D)}{d\bar D} \right)^{-1}. \qquad (A.9)$$

Inserting equation A.9 into A.6 and solving for $\frac{dt_B}{dt}$, we obtain

$$\frac{dt_B}{dt} \approx -\sigma_B^2\, y_\varepsilon'\, \frac{1 + R'}{1 + B' + R'}.$$

Estimating the standard deviation of the backward delay, $\sigma_B$, by the one for the forward dendritic delay, $\sigma_B \approx B'(\bar D)\, \sigma$, yields the desired formula, equation 5.5.

Acknowledgments
This study was supported by the Swiss National Science Foundation (grant 21-57076.99) and the Silva Casa Foundation (W. S.). W. S. thanks Stefano Fusi for helpful discussions.

References

Abbott, L., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Agmon-Snir, H., & Segev, I. (1993). Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol., 70(5), 2066–2085.
Berger, T., Larkum, M., & Lüscher, H.-R. (2001). A high Ih channel density in the distal apical dendrite of layer V neocortical pyramidal cells increases bidirectional attenuation of EPSPs. J. Neurophysiol., 85, 855–868.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neuroscience, 18(24), 10464–10472.
Bi, G., & Poo, M. (1999). Distributed synaptic modifications in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Bliss, T., & Collingridge, G. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Carr, C., & Friedman, M. (1999). Evolution of time coding systems. Neural Computation, 11, 1–20.
Activity-Dependent Development of Axonal and Dendritic Delays
Walter Senn, Martin Schneider, and Berthold Ruf
Received May 31, 2000; accepted May 24, 2001.
LETTER
Communicated by Paul Tiesinga
Impact of Geometrical Structures on the Output of Neuronal Models: A Theoretical and Numerical Analysis

Jianfeng Feng
[email protected]
Computational Neuroscience Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K., and COGS, University of Sussex at Brighton, BN1 9QH, U.K.

Guibin Li
[email protected]
Computational Neuroscience Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K.

What is the difference between the efferent spike train of a neuron with a large soma and that of a neuron with a small soma? We propose an analytical method, called the decoupling approach, to tackle the problem. Two limiting cases, in which the soma is much smaller than the dendrite or vice versa, are theoretically investigated. For both the two-compartment integrate-and-fire model and the Pinsky-Rinzel model, we show, both theoretically and numerically, that the smaller the soma is, the faster and the more irregularly the neuron fires. We further conclude, in terms of numerical simulations, that cells falling in between the two limiting cases form a continuum with respect to their firing properties (mean firing time and coefficient of variation of interspike intervals).

1 Introduction
Neural Computation 14, 621–640 (2002)
© 2002 Massachusetts Institute of Technology

It is well documented in the literature that the geometrical structure of a neuron contributes considerably to its information processing capacity (Koch, 1999). However, a fully detailed model is usually hard to study theoretically, and the nonhomogeneous distribution of ionic channels along dendritic trees prevents such an investigation. Hence, devising methods for approximating detailed biophysical models by simple models, which preserve the essential complexity of the detailed biophysical mechanism yet are simultaneously concise and transparent, is an important continuing task in computational neuroscience. The advantages are obvious: detailed biophysical models are usually difficult to understand and can be extremely time-consuming to simulate. Thus, two-compartment models, which mimic detailed biophysical neuron models of hundreds of compartments, were proposed and studied in Mainen and Sejnowski (1996) and Pinsky and Rinzel (1994). Their results, however, are exclusively numerical
and primarily based on models with deterministic inputs as well. The main question we address in this article is: can we, at least for some extreme cases, theoretically understand the behavior of two-compartment models? Such a study will provide new insights into the problem of understanding the function of neuronal morphology. Two-compartment models with deterministic inputs have been extensively studied in Mainen and Sejnowski (1996) and Pinsky and Rinzel (1994). A more interesting and theoretically more challenging problem is to investigate the model behavior with stochastic inputs, since the stochastic part of an input signal might play a functional role in processing information (see Collins, Chow, & Imhoff, 1995; Feng, 1997; Harris & Wolpert, 1998; and Sejnowski, 1998). We propose a novel theoretical approach, which we call a decoupling approach, to investigate the properties of two-compartment models, in particular, how the somatic size affects the output when the neuron receives stochastic inputs from the dendritic compartment. The theory is general enough to be applied both to abstract models, such as the two-compartment integrate-and-fire (IF) model, and to biophysical models, such as the Pinsky-Rinzel model. When the somatic compartment is small, the model tends to burst, and the bursting length is totally determined by the activity of the dendritic compartment; in other words, the somatic compartment is only a reporter of the dendritic compartment's activity. When the somatic compartment is large, the model can also be reduced to a single compartment. For both cases, the problem of studying a complicated dynamical system of two coupled variables is reduced to a much simpler dynamical system of a single variable; hence the name decoupling approach. As a consequence, all theoretical results developed in the literature for single-compartment models can be applied.
For example, we could theoretically estimate the bursting length and firing rates of the two-compartment IF model. Our conclusions are obtained for generic two-compartment models. The decoupling approach serves as a bridge between one-compartment and two-compartment models. For both the IF and the Pinsky-Rinzel model, in terms of numerical simulations, we further conclude that cells falling in between the two limiting cases of large soma and small soma form a continuum with respect to their firing properties (mean firing time and coefficient of variation of interspike intervals). Hence, we grasp the firing properties of the models in the general case once the two limiting cases are tackled by a theoretical approach. The paper is organized as follows. In section 2, the two-compartment IF model and the Pinsky-Rinzel model are defined. Section 3 is devoted to theoretically analyzing generic two-compartment models. Section 4 presents
numerical results. As an application of the results of the previous sections, in Feng and Li (2001) we consider two-compartment models with signal inputs (nonhomogeneous inputs). Analytically, we prove that when the somatic compartment is small enough, the two-compartment IF model can naturally act as a slope detector via its bursting activity. Numerically, we show that the conclusion is also valid for the Pinsky-Rinzel model.

2 Models
Let us assume that a neuron is composed of two compartments: a somatic compartment and a dendritic compartment. Suppose that a cell receives excitatory postsynaptic potentials (EPSPs) at q_E excitatory synapses and inhibitory postsynaptic potentials (IPSPs) at q_I inhibitory synapses, and that V_s(t) and V_d(t) are the membrane potentials of the somatic and dendritic compartments at time t, respectively. The following two kinds of two-compartment models are considered in this article.

2.1 Two-Compartment IF Model. When the somatic membrane potential V_s(t) is between the resting potential V_rest and the threshold V_thre,
\[
\begin{cases}
dV_s(t) = -\dfrac{1}{c}\bigl(V_s(t)-V_{rest}\bigr)\,dt
          + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[2mm]
dV_d(t) = -\dfrac{1}{c}\bigl(V_d(t)-V_{rest}\bigr)\,dt
          + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt
          + \dfrac{i_{syn}(t)}{1-p},
\end{cases}
\tag{2.1}
\]
where 1/c is the decay rate and p is the ratio between the membrane area of the somatic compartment and that of the whole cell. g_c > 0 is a constant, and the synaptic input is

\[
i_{syn}(t) = a\sum_{i=1}^{q_E} dE_i(t) - b\sum_{j=1}^{q_I} dI_j(t),
\]
where E_i(t), I_j(t) are Poisson processes with rates λ_E and λ_I, respectively, and a, b are the magnitudes of each EPSP and IPSP. After V_s(t) crosses V_thre from below, a spike is generated, and V_s(t) is reset to V_rest. This model is called the two-compartment IF model. The interspike interval of efferent spikes is T(p) = inf{t: V_s(t) ≥ V_thre} for 1 > p > 0. It is well known that Poisson input can be approximated by I_syn(t) = μt + σB_t,
where B_t is the standard Brownian motion, μ = a q_E λ_E − b q_I λ_I, and σ = \sqrt{a^2 q_E λ_E + b^2 q_I λ_I}. Thus, equation 2.1 can be approximated by

\[
\begin{cases}
dV_s(t) = -\dfrac{1}{c}\bigl(V_s(t)-V_{rest}\bigr)\,dt
          + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[2mm]
dV_d(t) = -\dfrac{1}{c}\bigl(V_d(t)-V_{rest}\bigr)\,dt
          + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt
          + \dfrac{\mu\,dt + \sigma\,dB_t}{1-p}.
\end{cases}
\tag{2.2}
\]
In the following, we consider the two-compartment IF model to be the model defined by equation 2.2.

2.2 Pinsky-Rinzel Model. We also consider a simplified, two-compartment biophysical model proposed by Pinsky and Rinzel (1994). They have demonstrated that the model mimics a full, very detailed Traub model quite well. The Pinsky-Rinzel model is defined by
\[
\begin{cases}
C_m\,dV_s(t) = -I_{Leak}(V_s)\,dt - I_{Na}(V_s,h)\,dt - I_{K\text{-}DR}(V_s,n)\,dt
               + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[2mm]
C_m\,dV_d(t) = -I_{Leak}(V_d)\,dt - I_{Ca}(V_d,s)\,dt - I_{K\text{-}AHP}(V_d,q)\,dt
               - I_{K\text{-}C}(V_d,Ca,c)\,dt
               + \dfrac{dI_{Syn}}{1-p}
               + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt \\[2mm]
Ca' = -0.002\,I_{Ca} - 0.0125\,Ca.
\end{cases}
\tag{2.3}
\]
In our calculations, all parameters and equations are identical to those used in Pinsky and Rinzel (1994) except for the parameters of the calcium equation, which are from Wang (1998).

3 Analytical Results
In this section, we consider two limiting cases: an infinitely small soma and the limit of a small dendrite. For these two cases, analytical results are derived that give the mean interspike intervals as well as the variance and the coefficient of variation (CV). The analysis implies that neurons with a small soma fire more irregularly than neurons with a large soma. Also, small-soma models show intermittent bursting; the interburst intervals and burst lengths are calculated. Without loss of generality, we assume that g_c = 1 in the following discussion.
3.1 Decoupling Approach. Let us now consider a generic two-compartment model defined by

\[
\begin{cases}
dV_s(t) = f(V_s,V_d)\,dt + \dfrac{V_d(t)-V_s(t)}{p}\,dt \\[2mm]
dV_d(t) = g(V_s,V_d)\,dt + \dfrac{V_s(t)-V_d(t)}{1-p}\,dt + \dfrac{dI_{syn}(t)}{1-p},
\end{cases}
\tag{3.1}
\]
where f, g are two functions. Equation 3.1 certainly includes the Pinsky-Rinzel model as a special case.

Case 1: p → 0. Multiplying both sides of the equation for the somatic compartment by p and taking p → 0, we see that V_d = V_s (for the existence of the limit, see appendix B). At the same time, the equation for the dendritic compartment becomes
\[
dV_d = g(V_d, V_d)\,dt + dI_{syn}(t),
\]
(3.2)
since p → 0 and V_d = V_s.

Case 2: p → 1. Multiplying both sides of the equation for the dendritic compartment by 1 − p and taking p → 1, we see that V_d − V_s = I'_{syn}. At the same time, the equation for the somatic compartment becomes
\[
dV_s = f(V_s,\,V_s + I'_{syn})\,dt + dI_{syn}(t).
\]
(3.3)
We emphasize that the derivation of equations 3.2 and 3.3 is independent of the concrete form of the synaptic inputs. Hence, the conclusions hold true for more biologically based inputs, such as AMPA, NMDA, GABA_A, and GABA_B (Destexhe, Mainen, & Sejnowski, 1998). Summarizing the results above, we arrive at the following theorem:

Theorem 1 (Decoupling Theorem). For a two-compartment model defined by equation 3.1, we have the following conclusions. When p → 0, we have V_s = V_d and dV_d = g(V_d, V_d) dt + dI_{syn}(t). When p → 1, the model is reduced to

\[
dV_s = f(V_s,\,V_s + I'_{syn})\,dt + dI_{syn}(t).
\]
3.2 Firing Patterns. Now we can apply theorem 1 to the two-compartment IF and Pinsky-Rinzel models. Let us consider the two-compartment IF model first.
Case 1: p → 0. The somatic membrane potential is identical to that of the dendritic compartment. Suppose that V_d(t) > V_thre for t ∈ [t_0, t_1]. Then V_s(t) will fire very fast during [t_0, t_1]; in theory, it fires with an infinite frequency. This cannot happen in a real neuron, which has a refractory period. On the other hand, if V_d(t) < V_thre for t ∈ [t_0, t_1], then V_s(t) = V_d < V_thre, and the neuron is silent. The analysis above for p = 0 gives rise to bursting behavior: during an interval [t_0, t_1] on which V_d > V_thre, the cell fires with a very high frequency; during an interval [t_0, t_1] on which V_d < V_thre, the cell is completely silent. Figure 1 provides an intuitive picture of the firing pattern. The somatic compartment reports the activity of the dendritic compartment: it bursts whenever the membrane potential of the dendritic compartment is above the threshold and is completely silent if the membrane potential of the dendritic compartment is below the threshold.
Case 2: p → 1. The two-compartment model behavior is the same as that of the conventional IF model, which is well studied in the literature (Brown, Feng, & Feerick, 1999; Feng, 1997, in press). When cμ > V_thre, the output spike trains are generally regular, with a CV smaller than 0.5; in this case, threshold crossings are mainly due to deterministic forces. When cμ < V_thre, the output spike trains are irregular, with a CV greater than 0.5, and threshold crossings are primarily due to random fluctuations (refer to Figure 2 for an intuitive illustration). The issue of how to generate spike trains with a large CV has been extensively discussed in the literature (Bernander, Douglas, & Koch, 1992; Destexhe & Pare, 2000; Shadlen & Newsome, 1998; Schneidman, Freedman, & Segev, 1998; Troyer & Miller, 1997). Nevertheless, it seems that all computational models are concentrated on synaptic inputs or biophysical neurons. Our results here provide a transparent, geometrical picture of how to generate spike trains with a large CV.
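The two regimes can be illustrated by simulating the reduced one-compartment model dV = (μ − V/c) dt + σ dB directly. The sketch below is our own illustration, not the authors' code: c, σ, and the threshold follow the section 4.1 values (in mV and ms), and the two μ values are chosen so that cμ lies above or below V_thre.

```python
import numpy as np

def one_compartment_cv(mu, sigma=2.5 ** 0.5, c=20.2, v_thre=20.0,
                       T=2000.0, dt=0.01, seed=2):
    """CV of interspike intervals of the reduced one-compartment IF model
    dV = (mu - V/c) dt + sigma dB, with reset to 0 when v_thre is crossed."""
    rng = np.random.default_rng(seed)
    v, spikes = 0.0, []
    sdt = np.sqrt(dt)
    for k in range(int(T / dt)):
        v += (mu - v / c) * dt + sigma * sdt * rng.standard_normal()
        if v >= v_thre:
            spikes.append(k * dt)
            v = 0.0
    isi = np.diff(spikes)
    return isi.std() / isi.mean()

# suprathreshold drift (c*mu = 101 mV > V_thre): regular firing
cv_supra = one_compartment_cv(mu=5.0)
# subthreshold drift (c*mu ~ 18.2 mV < V_thre): noise-driven, irregular firing
cv_sub = one_compartment_cv(mu=0.9)
```

The comparison cv_supra < cv_sub reproduces the qualitative distinction between the deterministic and fluctuation-driven crossing regimes.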
Similar arguments can be applied to the Pinsky-Rinzel model. When p → 1, the model is reduced to a somatic compartment. When p → 0, if the neuron fires, it will fire with its highest possible frequency, provided that V_d(t) is greater than the threshold of the somatic membrane potential.

3.3 Statistical Characteristics. For the case of small p, it is not very informative to calculate the interspike intervals. It is more interesting to consider the interburst intervals T_i and burst lengths T_l, defined as follows:
\[
\begin{cases}
T_i = \sup\{t: V_d(0) = V_{thre},\ V_d(s) \le V_{thre} \text{ for } 0 \le s \le t\} \\[1mm]
T_l = \sup\{t: V_d(0) = V_{thre},\ V_d(s) \ge V_{thre} \text{ for } 0 \le s \le t\}.
\end{cases}
\tag{3.4}
\]
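In discrete time, these two quantities can be read directly off a sampled dendritic trace: interburst intervals are the durations of sub-threshold excursions and burst lengths the durations of supra-threshold excursions. The helper below is a hypothetical illustration of equation 3.4 (the function name and sampling convention are ours, not from the paper).

```python
import numpy as np

def burst_segments(vd, v_thre, dt):
    """Split a sampled dendritic trace into supra-threshold (burst) and
    sub-threshold (interburst) excursions; return the two duration lists."""
    above = vd >= v_thre
    # sample indices at which the trace crosses the threshold (either way)
    edges = np.flatnonzero(np.diff(above.astype(int))) + 1
    bounds = np.concatenate(([0], edges, [len(vd)]))
    burst_len, interburst = [], []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        (burst_len if above[lo] else interburst).append((hi - lo) * dt)
    return interburst, burst_len
```

Applied to a long stationary trace, the means of the two lists estimate ⟨T_i⟩ and ⟨T_l⟩.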
Figure 1: Membrane potential of the two-compartment IF model with p = 0.05. Note that the membrane potentials of the soma and the dendrite are almost identical, except for the reset of the somatic compartment. Compare with Figure 2.
Note that V_s for p = 1 and V_d for p = 0 are very different when t is small, although they appear to be identical: V_s for p = 1 is not a stationary process, but V_d for p = 0 is. Combining the arguments above and defining

\[
CV(p) = \sqrt{\langle T(p)^2\rangle - \langle T(p)\rangle^2}\,\big/\,\langle T(p)\rangle,
\]

we arrive at the following conclusions. For the two-compartment IF model, we have

\[
P(B_i)\,\langle T_i\rangle \sim \langle T(0)\rangle < \langle T(1)\rangle
\quad\text{and}\quad
CV(0) > CV(1),
\tag{3.5}
\]

where a rigorous and analytical expression for ⟨T(1)⟩ is presented in the following and B_i = {V_d(0) = V_thre, (dV_d/dt)|_{t=0} < 0}. Numerical examples for 0 < p < 1 are included in the next section. In the following, we discuss the cases of p = 1 and p = 0 separately. When p = 1, we consider the conventional one-compartment IF model. An analytical expression for the mean interspike intervals of the IF model has been obtained in terms of the solution of partial differential equations, as shown in Musila and Lánský (1994), Ricciardi and Sato (1990), and Tuckwell
Figure 2: Membrane potential of the two-compartment IF model with p D 0.95. The sudden jumping up and down of the membrane potential between 0 mV (resting potential) and 20 mV (threshold) is due to the reset of the IF model.
(1988). However, such a formula usually involves special functions and is in general not very informative; various approximations have to be found. In appendix C, we present a rigorous, analytical approach for calculating the mean firing time, which can be applied to any one-dimensional neuron model. Examples are the IF model, the θ-neuron (Ermentrout, 1996), and the IF–Fitzhugh-Nagumo model, although here we confine ourselves to the IF model. Applying theorem 2 in appendix C to the IF model, we obtain

\[
\langle T(1)\rangle
= \frac{2}{\sigma^2}
  \left[\int_{V_{rest}}^{V_{thre}}
    \exp\!\left(\frac{(x-c\mu)^2}{\sigma^2 c}\bigg|_{x=V_{rest}}^{x=y}\right)dy\right]
  \left[\int_{-\infty}^{V_{rest}}
    \exp\!\left(-\frac{(x-c\mu)^2}{\sigma^2 c}\bigg|_{x=V_{rest}}^{x=y}\right)dy\right]
+ \frac{2}{\sigma^2}\int_{V_{rest}}^{V_{thre}}
  \left[\int_{y}^{V_{thre}}
    \exp\!\left(\frac{(x-c\mu)^2}{\sigma^2 c}\bigg|_{x=V_{rest}}^{x=u}\right)du\right]
  \exp\!\left(-\frac{(x-c\mu)^2}{\sigma^2 c}\bigg|_{x=V_{rest}}^{x=y}\right)dy.
\tag{3.6}
\]
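Since equation 3.6 involves only one-dimensional integrals of smooth functions, it can be evaluated by elementary quadrature. The following is a minimal sketch (pure Python; the composite trapezoid rule, the truncation of the −∞ limit at ten stationary standard deviations, and the default parameter values are our assumptions, not part of the paper).

```python
import math

def mean_first_passage(mu, sigma, c, v_rest=0.0, v_thre=20.0, n=800):
    """Evaluate equation 3.6 (mean interspike interval of the reduced
    p -> 1 IF model) with the trapezoid rule.  g is the exponent
    appearing in equation 3.6; subtracting g(v_rest) keeps the
    exponentials bounded, exactly as in the evaluated-limits notation."""
    g = lambda y: (y - c * mu) ** 2 / (sigma ** 2 * c)

    def trapz(f, a, b, m):
        h = (b - a) / m
        return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, m)))

    g0 = g(v_rest)
    lo = v_rest - 10.0 * sigma * math.sqrt(c / 2.0)  # truncated -infinity
    outer = trapz(lambda y: math.exp(g(y) - g0)
                  * trapz(lambda x: math.exp(g0 - g(x)), lo, y, n),
                  v_rest, v_thre, n)
    return 2.0 / sigma ** 2 * outer

# mean ISI of the reduced model for the section 4.1 drift (mu in mV/ms)
t_mean = mean_first_passage(5.0, 2.5 ** 0.5, 20.2)
```

For strongly suprathreshold drift, the result is close to the deterministic rise time −c ln(1 − V_thre/(cμ)), as expected.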
Note that it is numerically easier to calculate the mean firing time using
equation 3.6 than using the method in Tuckwell (1988). From equation 3.6, we see that when V_thre < cμ, ⟨T(1)⟩ < ∞ as σ → 0; when V_thre > cμ, ⟨T(1)⟩ → ∞ as σ → 0. In terms of equation 3.6, we could further calculate, for example, the Fisher information; we will discuss this in subsequent publications. Now we turn to the case where p = 0. Solving the equation for the dendritic compartment, we obtain

\[
V_d(t) - c\mu\,\bigl(1 - e^{-t/c}\bigr)
= V_d(0)\,e^{-t/c} + \sigma\int_0^t e^{-(t-s)/c}\,dB_s.
\]
Its covariance function is given by

\[
r(t) = \frac{\sigma^2 c}{2}\,e^{-|t|/c}, \qquad t \in \mathbb{R}.
\]

The burst frequency is identical to the rate of upcrossings of V_thre by the stationary process V_d. By a direct application of Rice's formula (see Leadbetter, Lindgren, & Rootzén, 1983), we obtain the following conclusion: the bursting frequency is given by

\[
\frac{1}{2\pi}\sqrt{\frac{|r''(0)|}{r(0)}}\,
\exp\!\left(-\frac{(V_{thre}-c\mu)^2}{2\,r(0)}\right)
= \frac{1}{2\pi}\sqrt{\frac{2\,|r''(0)|}{\sigma^2 c}}\,
\exp\!\left(-\frac{(V_{thre}-c\mu)^2}{\sigma^2 c}\right).
\]

Furthermore, using large deviation theory (Albeverio, Feng, & Qian, 1995; Feng, in press), we could estimate the mean interburst interval T_i when V_thre > cμ and the mean burst length T_l when V_thre < cμ. The results obtained are approximations, so we do not present them here; a rigorous and exact estimate would be of interest. Finally, we emphasize that in practice neither p = 0 nor p = 1 ever occurs exactly; all of our conclusions above hold when p is sufficiently close to zero or sufficiently close to one. As we have demonstrated, the decoupling approach reduces the problem of studying a complicated dynamical system of two coupled variables to a much simpler dynamical system of a single variable, so that all results on single-compartment models developed in the literature can be applied. Moreover, the approach opens up many new theoretical problems for further study. In the next section, we consider models and behavior between these two extreme cases.

4 Simulation Studies
In this section, for both the IF model and the Pinsky-Rinzel model, we show that cells falling in between the two limiting cases (large soma or large dendrite) form a continuum with respect to their firing properties (mean and CV); that is, for CV and mean firing time, the gap between p = 0 and p = 1 is filled in a monotone manner with respect to p.
4.1 Parameters and Computation Methods. We simulate the two-compartment IF model and the Pinsky-Rinzel model with the following synaptic parameters: a = b = 0.5 mV, λ_E = 100 Hz, λ_I = 0, 10, ..., 100 Hz, q_E = q_I = 100. The threshold for detecting a spike for both the soma and dendrite of the Pinsky-Rinzel model is 30 mV, and g_c = 2.1. In the two-compartment IF model, we let V_rest = 0, V_thre = 20 mV, c = 20.2, and g_c = 4. All numerical simulations for the IF model are carried out with the Euler scheme, a step size of 0.01, and zero initial values for both V_s and V_d. For the Pinsky-Rinzel model, an algorithm for solving stiff equations from the Numerical Algorithms Group (NAG) library (D02NBF) is used with a step size of 0.01; the initial values for V_s, V_d, Ca, n, h, s, c, q are −4.6, −4.5, 0.2, 0.001, 0.999, 0.009, 0.007, 0.01, respectively. The normal random numbers used in the simulation are generated by the NAG subroutine G05DDF. Smaller step sizes were also used, and we conclude that no significant improvements are obtained.
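The Euler procedure just described can be sketched as follows for the two-compartment IF model (equation 2.2 with the diffusion-approximated input). This is an illustrative reimplementation rather than the authors' code: NumPy's default generator replaces the NAG routine G05DDF, the function name is ours, and rates are expressed per millisecond so that all units are mV and ms.

```python
import numpy as np

def simulate_two_compartment_if(p, lam_I=0.0, T=500.0, dt=0.01, seed=0):
    """Euler simulation of equation 2.2 with the section 4.1 parameters:
    a = b = 0.5 mV, lambda_E = 100 Hz (0.1/ms), q_E = q_I = 100,
    V_rest = 0, V_thre = 20 mV, c = 20.2, g_c = 4.  lam_I is in 1/ms.
    Returns (mean ISI, CV) of the efferent spike train."""
    rng = np.random.default_rng(seed)
    a = b = 0.5
    qE = qI = 100
    lam_E = 0.1
    mu = a * qE * lam_E - b * qI * lam_I           # drift of the diffusion
    sigma = np.sqrt(a ** 2 * qE * lam_E + b ** 2 * qI * lam_I)
    c, gc, v_rest, v_thre = 20.2, 4.0, 0.0, 20.0
    vs = vd = v_rest
    spikes = []
    sdt = np.sqrt(dt)
    for k in range(int(T / dt)):
        noise = sigma * sdt * rng.standard_normal()
        dvs = (-(vs - v_rest) / c + gc * (vd - vs) / p) * dt
        dvd = (-(vd - v_rest) / c + gc * (vs - vd) / (1 - p)) * dt \
              + (mu * dt + noise) / (1 - p)
        vs, vd = vs + dvs, vd + dvd
        if vs >= v_thre:                            # spike: reset the soma only
            spikes.append(k * dt)
            vs = v_rest
    isi = np.diff(spikes)
    return isi.mean(), isi.std() / isi.mean()

mean_isi, cv = simulate_two_compartment_if(p=0.5)
```

Sweeping p and lam_I with this routine reproduces the kind of mean/CV curves shown in Figure 3.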
From results in the previous section, we know that when p D 0 and p D 1, the two-compartment models are reduced to singlecompartment models; their behavior is well studied in the literature (Feng, in press). In conclusion, we have obtained a complete picture of the behavior of these two-compartment models. Both mean ring time and CV are monotone functions of p where p is the ratio of the somatic compartment area with respect to the whole cell. 5 Discussion
We have considered the behavior of two-compartment models as a rst step to understanding the impact of the geometrical structures of neurons on their input-output relationship. We nd that when the somatic compartment is large, a two-compartment model can be reduced to a single
Figure 3: Mean firing time and CV versus λ_I for p = 0.1, 0.3, 0.5, 0.8 (upper panel) and mean firing time and CV versus p for r = λ_I/λ_E = 0.2, 0.5, 0.8 (bottom panel) for the two-compartment IF model. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text.
Figure 4: Mean firing time and CV versus λ_I with p = 0.3, 0.5, 0.8 (upper panel) and mean firing time and CV versus p with r = λ_I/λ_E = 0.2, 0.5, 0.8 (bottom panel) for the Pinsky-Rinzel model. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text.
one-compartment model. When the somatic compartment is small, it exhibits bursting behavior, and its behavior is determined by the activity of the dendritic compartment alone. Therefore, in each case, the original coupled two-compartment model is reduced to a single-compartment model, which we call the decoupling approach. By a combination of theoretical analysis and numerical simulations, we conclude that both the first- and second-order statistics (mean and CV) of interspike intervals are monotone functions of the somatic size. We then present theoretical results for calculating the mean interspike intervals and burst frequencies. We briefly discuss the biological implications of changing p, which we call global plasticity. Subcellular morphology changes have been reported recently in a few types of neurons (Andersen, 1999; Engert & Bonhoeffer, 1999). The most striking and direct example of global plasticity is from vasopressin and oxytocin cells in the supraoptic and paraventricular nuclei. For both types of neurons, it is found that p changes during lactation (Stern & Armstrong, 1998). For example, in the oxytocin cell, a shrinkage of the dendritic tree and an enlargement of the soma is observed, accompanied by a considerable change in the electrical properties of the neuron. In the classical approach of learning theory, the Hebb learning rule plays an essential role, and many models have been developed based on it (Feng, in press). The Hebb learning rule is a local plasticity principle whereby increasing the activities presynaptically and postsynaptically induces an increase in the strength of the synapse. Due to the large number of synapses of each neuron, the functional consequence of modifying the strength of a single synapse is bound to be limited. In contrast, a global change of the neuronal morphology[1] would be a much more efficient way to modify the input-output relationship. As we demonstrate here theoretically and numerically, an increase or a decrease in p alone causes a neuron to behave completely differently, with both the firing rate and the firing pattern (CV) varying. Furthermore, we have not taken into account the effect of input modifications due to the pruning or elaboration of dendritic trees, as reported in experiments (Stern & Armstrong, 1998), which will modify the input-output relationship even more dramatically. Several challenging problems remain. We must determine whether global plasticity can be observed experimentally in other neurons and, if so, for which stimuli global plasticity occurs. We may also be able to apply global plasticity to the design of artificial neural networks that achieve more powerful and efficient learning. It would also be interesting to consider the effect of changing p on the synchronization properties of a group of neurons. We want to emphasize that here we exclusively focus on the impact of minimal changes of geometrical structures on the input-output relationship of a neuronal model. There are many other factors that might interfere with
[1] We do not exclude the possibility that modifying a cell's biophysical properties can result in effects similar to morphology changes.
global plasticity. For example, a neuromodulator such as acetylcholine might play a role similar to changing p, but on a shorter timescale in comparison with the change of geometrical structures. Finally, we examine the scaling of synaptic inputs in the models considered in this article. All synaptic inputs are scaled by a factor of 1/(1 − p) (see equations 2.1 and 2.2). Therefore, when p → 1, according to the results in section 3, we know that a two-compartment model is reduced to a single-compartment model with an input I_syn. The situation is totally different if the synaptic input is not scaled. By this, we mean that the two-compartment IF and Pinsky-Rinzel models are
  dV_s(t) = −(1/γ)(V_s(t) − V_rest) dt + g_c (V_d(t) − V_s(t))/p dt
  dV_d(t) = −(1/γ)(V_d(t) − V_rest) dt + g_c (V_s(t) − V_d(t))/(1 − p) dt + dI_syn(t)   (5.1)

and

  C_m dV_s(t) = −I_Leak(V_s) dt − I_Na(V_s, h) dt − I_{K-DR}(V_s, n) dt + g_c (V_d(t) − V_s(t))/p dt
  C_m dV_d(t) = −I_Leak(V_d) dt − I_Ca(V_d, s) dt − I_{K-AHP}(V_d, q) dt − I_{K-C}(V_d, Ca, c) dt + dI_syn(t) + g_c (V_s(t) − V_d(t))/(1 − p) dt
  Ca′ = −0.002 I_Ca − 0.0125 Ca.   (5.2)
By applying the decoupling method of section 3 to a two-compartment model without scaling of synaptic inputs, we conclude that when p → 1, the two-compartment model is again reduced to a single-compartment model, but with an input that is identically zero. In other words, the model is now completely silent. When p → 0, the conclusion is the same as in section 3. In Figure 5, we plot numerical simulations for both the two-compartment IF model and the Pinsky-Rinzel model without scaling of synaptic inputs. We see that, qualitatively, all conclusions of this article remain true. In general, both the IF and the Pinsky-Rinzel models without scaled synaptic inputs fire more slowly than the models with scaled synaptic inputs. This is easily understandable, since the synaptic input I_syn(t)/(1 − p) (with scaling) is stronger than I_syn(t) (without scaling).
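The qualitative claim that the unscaled model fires more slowly can be checked with a short simulation. The sketch below is a minimal Euler–Maruyama integration of the two-compartment IF dynamics of equation 5.1, with a spiking threshold and reset added at the soma; the threshold, reset value, g_c = 1, V_rest = 0, and all numerical values are illustrative assumptions, not parameters taken from the article.

```python
import numpy as np

def firing_rate(p, mu, sigma, scale_input, gamma=1.0, g_c=1.0,
                v_thre=1.0, v_rest=0.0, dt=5e-4, t_max=50.0, seed=0):
    """Euler-Maruyama integration of the two-compartment IF model (eq. 5.1).

    The dendrite receives the diffusion input dI_syn = mu*dt + sigma*dB,
    multiplied by 1/(1 - p) when scale_input is True; the soma spikes and
    resets when it reaches v_thre.
    """
    rng = np.random.default_rng(seed)
    factor = 1.0 / (1.0 - p) if scale_input else 1.0
    n_steps = int(t_max / dt)
    db = rng.normal(0.0, np.sqrt(dt), size=n_steps)   # Brownian increments
    v_s = v_d = v_rest
    spikes = 0
    for i in range(n_steps):
        dv_s = (-(v_s - v_rest) / gamma + g_c * (v_d - v_s) / p) * dt
        dv_d = (-(v_d - v_rest) / gamma + g_c * (v_s - v_d) / (1.0 - p)) * dt \
               + factor * (mu * dt + sigma * db[i])
        v_s, v_d = v_s + dv_s, v_d + dv_d
        if v_s >= v_thre:        # somatic spike and reset
            spikes += 1
            v_s = v_rest
    return spikes / t_max

rate_scaled = firing_rate(p=0.5, mu=1.5, sigma=0.5, scale_input=True)
rate_unscaled = firing_rate(p=0.5, mu=1.5, sigma=0.5, scale_input=False)
```

With these illustrative numbers the scaled model fires markedly faster than the unscaled one, consistent with the comparison reported in Figure 5.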
Impact of Geometrical Structures
Figure 5: Mean firing time and CV versus λ_I for p = 0.3, 0.5, 0.8 for the Pinsky-Rinzel model (upper panel) and for the two-compartment IF model (bottom panel). Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text.
Appendix A
Assuming V_s|_{t=0} = V_d|_{t=0} = 0, we solve the linear equation for the two-compartment IF model:

  V_s − V_d = (μγp/(p(1−p)+γ)) (e^{−t/γ} − e^{−t/γ − t/(p(1−p))}) − (σ/(1−p)) ∫_0^t e^{−(1/γ + 1/(p(1−p)))(t−x)} dB_x.   (A.1)

Therefore,

  V_s = μγ − (μγ²/(p(1−p)+γ)) e^{−t/γ} − (μγ p(1−p)/(p(1−p)+γ)) e^{−t/γ − t/(p(1−p))}
        + ((1−2p)σ/(2p(1−p)²)) ∫_0^t e^{−(t−x)/γ} ∫_0^x e^{−(1/γ + 1/(p(1−p)))(x−y)} dB_y dx
        + (σ/(2(1−p))) ∫_0^t e^{−(t−x)/γ} (1 − e^{−(t−x)/(p(1−p))}) dB_x

  V_d = μγ − ((μγp + μγ²)/(p(1−p)+γ)) e^{−t/γ} + (μγp²/(p(1−p)+γ)) e^{−t/γ − t/(p(1−p))}
        + ((1−2p)σ/(2p(1−p)²)) ∫_0^t e^{−(t−x)/γ} ∫_0^x e^{−(1/γ + 1/(p(1−p)))(x−y)} dB_y dx
        + (σ/(2(1−p))) ∫_0^t e^{−(t−x)/γ} (1 + e^{−(t−x)/(p(1−p))}) dB_x.   (A.2)

By applying the martingale property of the stochastic integral ∫_0^t f dB_t for any measurable function f, we have the following identities:

Lemma 1.

  ⟨V_s⟩ = μγ − (μγ²/(p(1−p)+γ)) e^{−t/γ} − (μγ p(1−p)/(p(1−p)+γ)) e^{−t/γ − t/(p(1−p))}
  ⟨V_d⟩ = μγ − ((μγp + μγ²)/(p(1−p)+γ)) e^{−t/γ} + (μγp²/(p(1−p)+γ)) e^{−t/γ − t/(p(1−p))}   (A.3)
and

  ⟨(V_s − ⟨V_s⟩)²⟩ = σ²γ/2 − (σ²γ³/(2[p(1−p)+γ][2p(1−p)+γ])) e^{−2t/γ}
        + (2σ²γ p(1−p)/(2p(1−p)+γ)) e^{−2t/γ − t/(p(1−p))}
        − (σ²γ p(1−p)/(2[p(1−p)+γ])) e^{−2t/γ − 2t/(p(1−p))}

  ⟨(V_d − ⟨V_d⟩)²⟩ = (4σ²γ³(1−p)² + 8σ²γ²p(1−p)² + 4σ²γ²p(1−p)(3−2p)) / (8(1−p)²[p(1−p)+γ][2p(1−p)+γ])
        − (σ²γ/2) e^{−2t/γ} − (2σ²γp²/(2p(1−p)+γ)) e^{−2t/γ − t/(p(1−p))}
        − (σ²γp³/(2(1−p)[p(1−p)+γ])) e^{−2t/γ − 2t/(p(1−p))}.   (A.4)

Appendix B
Denote by V_s(t, p) and V_d(t, p) the solution of equation 3.1. We consider only the case p → 0. For any t > 0, the space C[0, t] of all continuous functions on the time interval [0, t] is compact, with the metric ‖x‖ = max_{[0,t]} |x(t)|. Hence, it suffices to prove that for any subsequence V_s(t, p_n) and V_d(t, p_n) with lim_{n→∞} p_n = 0, the limit is unique. Multiplying both sides of equation 3.1 for the somatic compartment by p_n and taking p_n → 0, we see that lim_{n→∞} V_d(t, p_n) = lim_{n→∞} V_s(t, p_n), provided that V_d and V_s are bounded. Now the equation for the dendritic compartment becomes

  dV_d = g(V_s, V_d) dt + dI_syn(t) = g(V_d, V_d) dt + dI_syn(t),   (B.1)

and its solution is unique under some reasonable conditions on g and I_syn. Hence, for any sequence p_n with p_n → 0, the limit lim_{n→∞} V_d(t, p_n) = lim_{n→∞} V_s(t, p_n) is unique and satisfies equation B.1. This completes our proof.

Appendix C
We first consider a diffusion process defined by dX_t = μ(X_t) dt + σ(X_t) dB_t. Let us introduce the following quantities:

  s(x) = exp(−∫_0^x 2μ(y)/σ²(y) dy)   (C.1)

  m(x) = 1/(σ²(x) s(x)) = exp(∫_0^x 2μ(z)/σ²(z) dz) / σ²(x),   (C.2)
where m is the speed density and s is the scale function. We say that a diffusion process is positive recurrent if ∫_{−∞}^{∞} m(x) dx < ∞, which is equivalent to ⟨T⟩ < ∞, where T is the first exit time through V_thre. For a positive-recurrent process, the stationary distribution density is given by p(x) ∝ m(x).

Theorem 2.  For a positive-recurrent diffusion process X_t,

  ⟨T⟩ = 2 ∫_{V_rest}^{V_thre} s(u) du · ∫_{−∞}^{V_rest} m(u) du + 2 ∫_{V_rest}^{V_thre} (∫_y^{V_thre} s(u) du) m(y) dy.   (C.3)
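As a concrete instance of theorem 2, the sketch below evaluates equation C.3 for an Ornstein–Uhlenbeck diffusion dX_t = −(X_t/γ) dt + σ dB_t (an assumed example of the kind of diffusion the reduced models produce) and checks the result against a direct Monte Carlo estimate of the mean first-passage time from V_rest to V_thre. The truncation of the lower integration limit at −8 and all parameter values are illustrative.

```python
import numpy as np

gamma, sigma = 1.0, 1.0
v_rest, v_thre = 0.0, 1.0

def trapezoid(y, x):
    """Simple trapezoidal quadrature."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Scale function s(x) and speed density m(x) (equations C.1 and C.2)
# for drift mu(x) = -x/gamma and constant diffusion coefficient sigma.
s = lambda x: np.exp(x**2 / (gamma * sigma**2))
m = lambda x: np.exp(-x**2 / (gamma * sigma**2)) / sigma**2

# Mean exit time <T> from equation C.3, lower limit truncated at -8.
u = np.linspace(v_rest, v_thre, 2001)
lo = np.linspace(-8.0, v_rest, 4001)
inner = np.array([trapezoid(s(u[i:]), u[i:]) for i in range(len(u))])
mean_T = 2.0 * trapezoid(s(u), u) * trapezoid(m(lo), lo) \
         + 2.0 * trapezoid(inner * m(u), u)

# Direct Monte Carlo estimate of the same first-passage time.
rng = np.random.default_rng(1)
n_paths, dt, t_cap = 4000, 2e-3, 80.0
x = np.full(n_paths, v_rest)
hit = np.full(n_paths, np.nan)
alive = np.ones(n_paths, dtype=bool)
t = 0.0
while t < t_cap and alive.any():
    n = int(alive.sum())
    x[alive] += -(x[alive] / gamma) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
    t += dt
    crossed = alive & (x >= v_thre)
    hit[crossed] = t
    alive &= ~crossed
mc_T = float(np.nanmean(hit))
```

The two estimates agree to within a few percent; the small residual discrepancy is the usual Euler-scheme threshold-crossing bias of the Monte Carlo side.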
For a ∈ ℝ, let G_{a,V_thre}(V_rest, y) Δy be the mean time spent in (y, y + Δy) by the process X_t started at V_rest and run until it exits (a, V_thre). The expression for G_{a,V_thre}(V_rest, y) is given in Karlin and Taylor (1982, p. 198). Therefore, we have

  ⟨T⟩ = lim_{a→−∞} ∫_a^{V_thre} G_{a,V_thre}(V_rest, y) dy,

and theorem 2 follows.

Acknowledgments
We are grateful to the anonymous referees for their valuable comments on an earlier version of this article. The work was partially supported by BBSRC and an ESEP grant of the Royal Society.

References

Albeverio, S., Feng, J., & Qian, M. (1995). Role of noises in neural networks. Phys. Rev. E, 52, 6593–6606.
Andersen, P. (1999). A spine to remember. Nature, 399, 19–21.
Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1992). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. PNAS, 88, 11569–11573.
Brown, D., Feng, J., & Feerick, S. (1999). Variability of firing of Hodgkin-Huxley and FitzHugh-Nagumo neurons with stochastic synaptic input. Phys. Rev. Lett., 82, 4731–4734.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Destexhe, A., & Paré, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32, 113–119.
Engert, F., & Bonhoeffer, T. (1999). Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature, 399, 66–70.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Feng, J. (1997). Behaviours of spike output jitter in the integrate-and-fire model. Phys. Rev. Lett., 79, 4505–4508.
Feng, J. (in press). Is the integrate-and-fire model good enough? A review. Neural Networks.
Feng, J., & Li, G. B. (2001). Two compartment model as slope detector. Manuscript in preparation.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–783.
Karlin, S., & Taylor, H. M. (1982). A second course in stochastic processes. New York: Academic Press.
Koch, C. (1999). Biophysics of computation. Oxford: Oxford University Press.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and related properties of random sequences and processes. New York: Springer-Verlag.
Mainen, Z. F., & Sejnowski, T. J. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366.
Musila, M., & Lánský, P. (1994). On the interspike intervals calculated from diffusion approximations for Stein's neuronal model with reversal potentials. J. Theor. Biol., 171, 225–232.
Pinsky, P. F., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. J. Computational Neuroscience, 1, 39–60.
Ricciardi, L. M., & Sato, S. (1990). Diffusion processes and first-passage-time problems. In L. M. Ricciardi (Ed.), Lectures in applied mathematics and informatics. Manchester: Manchester University Press.
Sejnowski, T. J. (1998). Making smooth moves. Nature, 394, 725–726.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Computation, 10, 1679–1703.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Stern, J. E., & Armstrong, W. E. (1998). Reorganization of the dendritic trees of oxytocin and vasopressin neurons of the rat supraoptic nucleus during lactation. J. Neuroscience, 18, 841–853.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Tuckwell, H. C. (1988). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Wang, X. J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79, 1549–1566.
Received September 18, 2000; accepted May 22, 2001.
LETTER
Communicated by Chris Williams
Sparse On-Line Gaussian Processes

Lehel Csató
[email protected]
Manfred Opper
[email protected]
Neural Computing Research Group, Department of Information Engineering, Aston University, B4 7ET Birmingham, U.K.

We develop an approach for sparse representations of gaussian process (GP) models (which are Bayesian types of kernel machines) in order to overcome their limitations for large data sets. The method is based on a combination of a Bayesian on-line algorithm with a sequential construction of a relevant subsample of the data that fully specifies the prediction of the GP model. By using an appealing parameterization and projection techniques in a reproducing kernel Hilbert space, recursions for the effective parameters and a sparse gaussian approximation of the posterior process are obtained. This allows for both a propagation of predictions and Bayesian error measures. The significance and robustness of our approach are demonstrated on a variety of experiments.
1 Introduction
Gaussian processes (GPs) (Bernardo & Smith, 1994; Williams & Rasmussen, 1996) provide promising Bayesian tools for modeling real-world statistical problems. Like other kernel-based methods, such as support vector machines (SVMs) (Vapnik, 1995), they combine the high flexibility of models that work in often infinite-dimensional feature spaces with the simplicity that all operations are "kernelized", that is, performed in the lower-dimensional input space using positive definite kernels. An important advantage of GPs over other non-Bayesian models is the explicit probabilistic formulation of the model. This allows the modeler to assess the uncertainty of the predictions by providing Bayesian confidence intervals (for regression) or posterior class probabilities (for classification). It also opens the possibility of treating a variety of other nonstandard data models (e.g., quantum inverse statistics, Lemm, Uhlig, & Weiguny, 2000; wind fields, Evans, Cornford, & Nabney, 2000; Berliner, Wikle, & Cressie, 2000) using a kernel method. Neural Computation 14, 641–668 (2002)
© 2002 Massachusetts Institute of Technology
GPs are nonparametric in the sense that the "parameters" to be learned are functions f_x of a usually continuous input variable x ∈ R^d. The value f_x is used as a latent variable in a likelihood P(y | f_x, x) that denotes the probability of an observable output variable y given the input x. The a priori assumption on the statistics of f is that of a gaussian process: any finite collection of random variables f_i is jointly gaussian. Hence, one must specify the prior means and the prior covariance function of the variables f_x. The latter is called the kernel K_0(x, x′) = Cov(f_x, f_{x′}) (Vapnik, 1995; Kimeldorf & Wahba, 1971). Thus, if a zero-mean GP is assumed, the kernel K_0 fully specifies the entire prior information about the model. Based on a set of input-output observations (x_n, y_n) (n = 1, …, N), the Bayesian approach computes the posterior distribution of the process f_x using the prior and the likelihood (Williams, 1999; Williams & Rasmussen, 1996; Gibbs & MacKay, 1999). A straightforward application of this simple and appealing idea is impeded by two major obstacles: nongaussianity of the posterior process and the size of the kernel matrix K_0(x_i, x_j). A first obvious problem stems from the fact that the posterior process is usually nongaussian (except when the likelihood itself is gaussian in the f_x). Hence, in many important cases, its analytical form precludes an exact evaluation of the multidimensional integrals that occur in posterior averages. Nevertheless, various methods have been introduced to approximate these averages. A variety of such methods may be understood as approximations of the nongaussian posterior process by a gaussian one (Jaakkola & Haussler, 1999; Seeger, 2000); for instance, in Williams and Barber (1998), the posterior mean is replaced by the posterior maximum (MAP), and information about the fluctuations is derived by a quadratic expansion around this maximum.
The computation of these approximations, which become exact for regression with gaussian noise, requires the solution of a system of coupled nonlinear equations whose size equals the number of data points. The second obstacle that prevents GPs from being applied to large data sets is that the matrix coupling these equations is typically not sparse. Hence, the development of good sparse approximations is of major importance. Such approximations aim at performing the most time-consuming matrix operations (inversions or diagonalizations) only on a representative subset of the training data. In this way, the computational time is reduced from O(N³) to O(Np²), where N is the total size of the training data and p is the size of the representative set. The memory requirement is O(p²) as opposed to O(N²). So far, a variety of sparsity techniques (Smola & Schölkopf, 2000; Williams & Seeger, 2001) for batch training of GPs have been proposed. This article presents a new approach that combines the idea of a sparse representation with an on-line algorithm, which allows for a speedup of GP training by sweeping through a data set only once. A different sparse approximation, which also allows for on-line processing, was recently introduced by Tresp (2000). It is based on combining predictions of models trained on smaller data subsets and needs an additional query set of inputs.
Central to our approach are exact expressions for the posterior mean ⟨f_x⟩_t and the posterior covariance K_t(x, x′) (subscripts denote the number of data points), which are derived in section 2. Although both quantities are continuous functions, they can be represented as finite linear (respectively bilinear) combinations of kernels K_0(x, x_i) evaluated at the training inputs x_i (Csató, Fokoué, Opper, Schottky, & Winther, 2000). Using sequential projections of the posterior process onto the manifold of gaussian processes, we obtain approximate recursions for the effective parameters of these representations. Since the size of the representation grows with the number of training data, we use a second type of projection to extract a smaller subset of input data (reminiscent of the support vectors of Vapnik, 1995, or the relevance vectors of Tipping, 2000). This subset builds up a sparse representation of the posterior process on which all predictions of the trained GP model rely. Our approach is related to that introduced in Wahba (1990). Although we use the same measure for projection, we do not fix the set of basis vectors from the beginning but decide on-line which inputs to keep.

2 On-Line Learning with Gaussian Processes
In Bayesian learning, all information about the parameters that we wish to infer is encoded in probability distributions (Bernardo & Smith, 1994). In the GP framework, the parameters are functions, and the GP priors specify a gaussian distribution over a function space. The posterior process is entirely specified by all its finite-dimensional marginals. Hence, let f = {f(x_1), …, f(x_M)} be a set of function values such that f_D ⊆ f, where f_D is the set of f(x_i) = f_i with x_i in the observed set of inputs. We compute the posterior distribution using the data likelihood together with the prior p_0(f) as

  p_post(f) = P(D | f) p_0(f) / ⟨P(D | f_D)⟩_0,   (2.1)

where ⟨P(D | f_D)⟩_0 is the average of the likelihood with respect to the prior GP (the GP at time 0). This form of the posterior distribution can be used to express posterior expectations as typically high-dimensional integrals. For prediction, one is especially interested in expectations of functions of the process at inputs that are not contained in the training set. At first glance, one might assume that every prediction on a novel input would require the computation of a new integral. Even if we had good methods for approximate integration, this would make predictions at new inputs a rather tedious task. Luckily, the following lemma shows that simple but important predictive quantities like the posterior mean and the posterior covariance of the process at arbitrary inputs can be expressed as a combination of a finite set of parameters that depend on the training data only. For arbitrary likelihoods, we can show the following:
Lemma 1 (parameterization).  The result of the Bayesian update, equation 2.1, using a GP prior with mean function ⟨f_x⟩_0 and kernel K_0(x, x′) and data D = {(x_n, y_n) | n = 1, …, N}, is a process with mean and kernel functions given by

  ⟨f_x⟩_post = ⟨f_x⟩_0 + Σ_{i=1}^N K_0(x, x_i) q(i)
  K_post(x, x′) = K_0(x, x′) + Σ_{i,j=1}^N K_0(x, x_i) R(ij) K_0(x_j, x′).   (2.2)

The parameters q(i) and R(ij) are given by

  q(i) = (1/Z) ∫ df p_0(f) ∂P(D | f)/∂f(x_i)
  R(ij) = (1/Z) ∫ df p_0(f) ∂²P(D | f)/(∂f(x_i) ∂f(x_j)) − q(i) q(j),   (2.3)

where f = [f(x_1), …, f(x_N)]^T and Z = ∫ df p_0(f) P(D | f) is a normalizing constant.
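For the special case of a gaussian likelihood (regression with noise variance σ_n²), the integrals in equation 2.3 can be evaluated in closed form and reduce to the familiar GP-regression coefficients q = (K + σ_n² I)^{-1} y and R = −(K + σ_n² I)^{-1}; this closed form is a standard identity that the excerpt does not spell out, so treat it as an assumption here. The sketch below checks that the parameterization of equation 2.2 then reproduces the textbook posterior mean and variance.

```python
import numpy as np

def k0(a, b, ell=0.7):
    """Squared-exponential prior kernel K0 (illustrative choice)."""
    return np.exp(-0.5 * np.subtract.outer(a, b)**2 / ell**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=8)
y_train = np.sin(2 * x_train) + 0.1 * rng.normal(size=8)
noise = 0.01                                  # sigma_n^2 (assumed)

K = k0(x_train, x_train)
A = np.linalg.inv(K + noise * np.eye(8))
q = A @ y_train                               # q(i) of lemma 1, gaussian case
R = -A                                        # R(ij) of lemma 1, gaussian case

x_test = np.linspace(-2, 2, 50)
ks = k0(x_test, x_train)

# Posterior via the parameterization of equation 2.2 (zero prior mean).
mean_param = ks @ q
var_param = np.diag(k0(x_test, x_test)) + np.einsum('ni,ij,nj->n', ks, R, ks)

# Textbook GP-regression posterior for comparison.
mean_ref = ks @ np.linalg.solve(K + noise * np.eye(8), y_train)
var_ref = np.diag(k0(x_test, x_test)) - np.einsum('ni,ij,nj->n', ks, A, ks)
```

The two computations agree to numerical precision, illustrating that the lemma's finite parameter set (q, R) is all that predictions at arbitrary inputs require.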
The parameters q(i) and R(ij) have to be computed only once during the training of the model and are fixed when we make predictions. The parametric form of the posterior mean (assuming a zero mean for the prior) resembles the representations of the predictors in other kernel approaches (such as support vector machines) that are obtained by minimizing certain cost functions. While the latter representations are derived from the celebrated representer theorem of Kimeldorf and Wahba (1971), our result, equation 2.2, does not, to our best knowledge, follow from this theorem but is derived from simple properties of gaussian distributions. To keep focused on the main flow, we defer the proof to appendix B. Making immediate use of this representation is usually not possible because the posterior process is in general not gaussian and the integrals cannot be computed exactly. Hence, we need approximations in order to keep the inference tractable (Csató et al., 2000). One popular method is to approximate the posterior by a gaussian process (Williams & Barber, 1998; Seeger, 2000). This may be formulated within a variational approach, where a certain dissimilarity measure between the true and the approximate distribution is minimized. The most popular choice is the Kullback-Leibler divergence between distributions, defined as

  KL(p ‖ q) = ∫ dθ p(θ) ln [p(θ)/q(θ)],   (2.4)

where θ denotes the set of arguments of the densities. If p̂ denotes the approximating gaussian distribution, one usually tries to minimize KL(p̂ ‖ p_post) with respect to p̂, which, in contrast to KL(p_post ‖ p̂), requires only the computation of expectations over tractable distributions.

Figure 1: Visualization of the on-line approximation of the intractable posterior process. The resulting approximate process from the previous iteration is used as the prior for the next one.

In this article, we use a different approach. To speed up the learning process in order to allow for the learning of large data sets, we aim at learning the data in a single sequential sweep through the examples. Let p̂_t denote the gaussian approximation after processing t examples; we use Bayes' rule,

  p_post(f) =
P(y_{t+1} | f) p̂_t(f) / ⟨P(y_{t+1} | f_D)⟩_t,   (2.5)
to derive the updated posterior. Since p_post is no longer gaussian, we use a variational technique in order to project it onto the closest gaussian process p̂_{t+1} (see Figure 1). Unlike the usual variational method, we now minimize the divergence KL(p_post ‖ p̂). This is possible because in our on-line method, the posterior, equation 2.5, contains only the likelihood for a single example, and the corresponding nongaussian integral is one-dimensional, which can be performed analytically in many relevant cases. It is a simple exercise to show (Opper, 1998) that the projection results in the matching of the first two moments (mean and covariance) of p_post and the approximating gaussian posterior p̂_{t+1}. We expect that the use of the divergence KL(p_post ‖ p̂) has several advantages over other variational methods (Gibbs & MacKay, 1999; Williams & Barber, 1998; Jaakkola & Haussler, 1999; Seeger, 2000). First, this choice avoids the numerical optimizations that are usually necessary for the divergence with inverted arguments. Second, this method is very robust, allowing for arbitrary choices of the single-data likelihood. The likelihood can be discontinuous and may even vanish over some range of values of the process. Finally, if one interprets the KL divergence as the expectation of the relative log loss of two distributions, our choice of divergence weights the losses with the correct distribution rather than the approximated one.
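The moment-matching property can be verified numerically in one dimension. In the sketch below, p_post is a standard-gaussian prior multiplied by a probit likelihood Φ(f) (an assumed toy likelihood, not one used in the article); a grid search over gaussian candidates shows that KL(p_post ‖ q) is minimized by the gaussian whose mean and variance equal the quadrature moments of p_post.

```python
import math
import numpy as np

# Posterior on a grid: N(f; 0, 1) prior times probit likelihood Phi(f).
f = np.linspace(-8.0, 8.0, 40001)
df = f[1] - f[0]
prior = np.exp(-0.5 * f**2) / math.sqrt(2 * math.pi)
lik = 0.5 * (1.0 + np.vectorize(math.erf)(f / math.sqrt(2)))
post = prior * lik
post /= post.sum() * df                       # normalize on the grid

# Quadrature moments of the (nongaussian) posterior.
m_star = float(np.sum(f * post) * df)
v_star = float(np.sum((f - m_star)**2 * post) * df)
m2 = float(np.sum(f**2 * post) * df)

# KL(p_post || N(m, v)) equals, up to a constant in (m, v), the cross-entropy
# 0.5*ln(2*pi*v) + E_p[(f - m)^2]/(2v); minimize that over a grid of (m, v).
best, best_kl = None, np.inf
for m in np.linspace(0.0, 1.0, 401):
    for v in np.linspace(0.2, 1.2, 401):
        ce = 0.5 * math.log(2 * math.pi * v) + (m2 - 2*m*m_star + m**2) / (2*v)
        if ce < best_kl:
            best_kl, best = ce, (m, v)
m_hat, v_hat = best
```

The grid minimizer (m_hat, v_hat) coincides (up to grid resolution) with (m_star, v_star), which is exactly the moment-matching projection used in the text.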
We expect that this may correspond to an improved quality of approximation. In order to compute the on-line approximations of the mean and the covariance kernel K_t, we apply lemma 1 sequentially, with only one likelihood term P(y_t | x_t) at each iteration step. Proceeding recursively, we arrive at

  ⟨f_x⟩_{t+1} = ⟨f_x⟩_t + q(t+1) K_t(x, x_{t+1})
  K_{t+1}(x, x′) = K_t(x, x′) + r(t+1) K_t(x, x_{t+1}) K_t(x_{t+1}, x′),   (2.6)

where the scalars q(t+1) and r(t+1) follow from lemma 1 (see appendix B for details):

  q(t+1) = (∂/∂⟨f_{t+1}⟩_t) ln ⟨P(y_{t+1} | f_{t+1})⟩_t
  r(t+1) = (∂²/∂⟨f_{t+1}⟩_t²) ln ⟨P(y_{t+1} | f_{t+1})⟩_t.   (2.7)
The averages in equation 2.7 are with respect to the gaussian process at time t, and the derivatives are taken with respect to ⟨f_{t+1}⟩_t = ⟨f(x_{t+1})⟩_t. Note again that these averages require only a one-dimensional integration over the process at the input x_{t+1}. Unfolding the recursion steps in the update rules, equation 2.6, we arrive at the parameterization of the approximate posterior GP at time t as a function of the initial kernel and the likelihoods ("natural parameterization"):

  ⟨f_x⟩_t = Σ_{i=1}^t K_0(x, x_i) α_t(i) = α_t^T k_x
  K_t(x, x′) = K_0(x, x′) + Σ_{i,j=1}^t K_0(x, x_i) C_t(ij) K_0(x_j, x′) = K_0(x, x′) + k_x^T C_t k_{x′},   (2.8)
with coefficients α_t(i) and C_t(ij) not depending on x and x′ (for details, see appendix C). For simplicity, the values α_t(i) are grouped into the vector α_t = [α_t(1), …, α_t(t)]^T and C_t = {C_t(ij)}_{i,j=1,t}, and we also use vectorial (typeset in bold) notation k_x = [K_0(x_1, x), …, K_0(x_t, x)]^T. The recursion for the GP parameters in equation 2.8 follows from the recursion equation 2.6 and the parameterization lemma:

  α_{t+1} = T_{t+1}(α_t) + q(t+1) s_{t+1}
  C_{t+1} = U_{t+1}(C_t) + r(t+1) s_{t+1} s_{t+1}^T
  s_{t+1} = T_{t+1}(C_t k_{t+1}) + e_{t+1},   (2.9)
where k_{t+1} = k_{x_{t+1}}, e_{t+1} is the (t+1)th unit vector, and s_{t+1} is introduced for clarity. We also introduced the operators T_{t+1} and U_{t+1}. They extend a t-dimensional vector and matrix to a (t+1)-dimensional one by appending zeros at the end of the vector and to the last row and column of the matrix, respectively. Since e_{t+1} is the (t+1)th unit vector, we see that the dimension of the vector α and the size of the matrix C increase with each likelihood point added. Equations 2.6 and 2.7 show some resemblance to the well-known extended Kalman filter. This is to be expected, because the latter approach can also be understood as a sequential propagation of an approximate gaussian distribution. However, the main difference between the two methods lies in the way the likelihood model is incorporated. While the extended Kalman filter (see Bottou, 1998, for a general framework) is based on a linearization of the likelihood, our approach uses a more robust smoothing of the likelihood instead. The drawback of using equation 2.9 in practice is the quadratic increase of the number of parameters with the number of training examples. This is a feature common to most other methods of inference with gaussian processes. A modification of the learning rule that controls the number of parameters is the main contribution of this article and is detailed in the following.

3 Sparseness in Gaussian Processes
Sparseness can be introduced within the GP framework by using suitable approximations of the representation, equation 2.8. Our goal is to perform an update without increasing the number of parameters α and C when, according to a certain criterion, the error due to the approximation is not too large. This could be achieved exactly if the new input x_{t+1} were such that the relation

  K_0(x, x_{t+1}) = Σ_{i=1}^t ê_{t+1}(i) K_0(x, x_i)   (3.1)

holds for all x. In such a case, we would have a representation for the updated process in the form of equation 2.8 using only the first t inputs, but with "renormalized" parameters α̂ and Ĉ. A glance at equation 2.9 shows that the only change would be the replacement of the vector s_{t+1} by

  ŝ_{t+1} = C_t k_{t+1} + ê_{t+1}.   (3.2)

Note that ê_{t+1} is a vector of dimensionality t. Unfortunately, for most kernels and inputs x_{t+1}, equation 3.1 cannot be fulfilled for all x. Nevertheless, as
an approximation, we could try an update of the form 3.2, where ê_{t+1} is determined by minimizing the error measure

  ‖K_0(·, x_{t+1}) − Σ_{i=1}^t ê_{t+1}(i) K_0(·, x_i)‖²,   (3.3)

where ‖·‖ is a suitably defined norm in a space of functions of inputs x (related optimization criteria in a function space are presented by Vijayakumar & Ogawa, 1999). Equation 3.3 becomes especially simple when the norm is based on the inner product of the reproducing kernel Hilbert space (RKHS) generated by the kernel K_0. In this case, for any two functions g and h that are represented as g(x) = Σ_i c_i K_0(x, u_i) and h(x) = Σ_i d_i K_0(x, v_i), for some arbitrary sets of u_i's and v_i's, the RKHS inner product is defined as (Wahba, 1990)

  (g(·), h(·))_RKHS = Σ_{ij} c_i d_j K_0(u_i, v_j),   (3.4)

with norm

  ‖g‖²_RKHS = (g(·), g(·))_RKHS = Σ_{ij} c_i c_j K_0(u_i, u_j).   (3.5)

Hence, in this case, equation 3.3 is

  K_0(x_{t+1}, x_{t+1}) + Σ_{i,j=1}^t ê_{t+1}(i) ê_{t+1}(j) K_0(x_i, x_j) − 2 Σ_{i=1}^t ê_{t+1}(i) K_0(x_{t+1}, x_i),   (3.6)

and simple minimization of equation 3.6 yields (Smola & Schölkopf, 2000)

  ê_{t+1} = K_t^{−1} k_{t+1},   (3.7)

where K_t = {K_0(x_i, x_j)}_{i,j=1,t} is the Gram matrix. The expression

  K̂_0(x, x_{t+1}) = Σ_{i=1}^t ê_{t+1}(i) K_0(x, x_i)   (3.8)

is simply the orthogonal projection (in the sense of the inner product, equation 3.4) of the function K_0(x, x_{t+1}) on the linear span of the functions K_0(x, x_i). The approximate update using equation 3.2 will be performed
Figure 2: Visualization of the projection step. The new feature vector φ_{t+1} is projected to the subspace spanned by {φ_1, …, φ_t}, resulting in the projection φ̂_{t+1} and the orthogonal residual (the "left-out" quantity) φ_res. It is important that φ_res has t+1 components; it needs the extended basis including φ_{t+1}.
only when a certain measure for the approximation error (to be discussed later) is not exceeded. The set of inputs for which the exact update is performed and the number of parameters is increased will be called the basis vector set (BV set); an element will be a BV. Proceeding sequentially, some of the inputs are left out, and others are included in the BV set. However, due to the projection 3.8, the inputs left out from the BV set will still contribute to the final GP configuration, the one used for prediction and to measure the posterior uncertainties. But the latter inputs will not be stored and do not lead to an increase of the size of the parameter set. This procedure leads to the new representation for the posterior GP only in terms of the BV set and the corresponding parameters α and C:

  ⟨f_x⟩ = Σ_{i∈BV} K_0(x, x_i) α(i)
  K(x, x′) = K_0(x, x′) + Σ_{i,j∈BV} K_0(x, x_i) C(ij) K_0(x_j, x′).   (3.9)

An alternative derivation of these results can be obtained from the representation of the Mercer kernel (Wahba, 1990; Vapnik, 1995),

  K_0(x, x′) = φ(x)^T φ(x′),   (3.10)

in terms of the possibly infinite-dimensional "feature vector" φ(x) (Wahba, 1990). Minimizing equation 3.6 and using equation 3.2 for an update is equivalent to replacing the feature vector φ_{t+1} corresponding to the new input by its orthogonal projection,

  φ̂_{t+1} = Σ_i ê_{t+1}(i) φ_i,   (3.11)
Lehel CsatÂo and Manfred Opper
onto the space of the old feature vectors (as in Figure 2). Note, however, that this derivation may be somewhat misleading by suggesting that the mapping induced by the feature vectors plays a special role for our approximation. This would be confusing because the representation equation, 3.10, is not unique. Our rst derivation based on the RKHS norm shows, however, that our approximation uses only geometrical properties that are induced by the kernel K0 . 3.1 Projection-Induced Error. We need a rule to decide if the current input will be included in the BV set. We base the decision on a measure of change on the sample averaged posterior mean of the GP due to the sparse approximation. Assuming a learning scenario where only the basis vectors are memorized, we measure the change of the posterior mean due to the approximation by
D h fx it C1
D h fx it C 1 ¡ h fx i[ , tC1
where h fx i[ is the posterior mean with respect to the approximated protC 1 cess. Summing up the absolute values of the changes for the elements in the BV set and the new input leads to
et C 1 D
tC1 X iD 1
D |q
| D h fi it C 1 | D |qt C 1 |
( t C 1)
tC1 X b0 (xi , xt C 1 ) K0 ( xi , xt C 1 ) ¡ K iD 1
® ® b0 (¢, xt C 1 ) ®2 | ®K0 (¢, xt C 1 ) ¡ K , RKHS
(3.12)
where the second line follows from the orthogonal projection together with the denition of the inner product in the RKHS (see equation 3.5) It is an important observation that, also due to orthogonal projection, the b0 (¢, xt C 1 ) D K 0 (¢, xt C 1 ) error is concentrated only on the last data point since K at the old data points xi , i D 1, . . . , t. Rewriting equation 3.12 using the b0 (¢, xt C 1 ) from equation 3.7, the error is coefcients for K ± ² et C 1 D |q ( t C 1) | k¤t C 1 ¡ kTtC 1 K t¡1 kt C 1 D |q ( t C 1) |c t C 1 , (3.13) where k¤t C 1 D K0 ( xt C 1 , xt C 1 ) and q ( t C 1) is given from equation 2.7. The error measure et C 1 is a product of two terms. If the new input would be included in BV , the corresponding coefcient at C 1 in the posterior mean would be equal to q ( t C 1) , which is the likelihood-dependent part. The second term, c t C 1 D kt¤C 1 ¡ kTtC 1 K t¡1 kt C 1 ,
(3.14)
gives the geometrical part, which is the squared norm of the “residual vector” from the projection in the RKHS (shown in Figure 2), or, equivalently,
Sparse On-Line Gaussian Processes
651
the "novelty" of the current input. If we use RBF kernels, the error equation 3.13 is similar to the one used in deciding whether new centers have to be included in the resource-allocating network (McLachlan & Lowe, 1996; Platt, 1991).

To compute the geometrical component of the error ε_{t+1}, a matrix inversion is needed at each step. The costly matrix inversion can be avoided by keeping track of the inverse Gram matrix Q_t = K_t^{-1}. The updates for this matrix can also be expressed with the variables γ_{t+1} and ê_{t+1} (for details, see appendix D), and these updates will be important when deleting a BV:

Q_{t+1} = U_{t+1}(Q_t) + \gamma_{t+1}^{-1} \left( T_{t+1}(\hat{e}_{t+1}) - e_{t+1} \right) \left( T_{t+1}(\hat{e}_{t+1}) - e_{t+1} \right)^T,   (3.15)

where U_{t+1} and T_{t+1} are the extension operators for a matrix and a vector, respectively (introduced in equation 2.9).

3.2 Deleting a Basis Vector. Our algorithm may run into problems when there is no possibility of including new inputs into BV without deleting one of the old basis vectors, because we are at the limit of our resources. This motivates pruning: whenever a new example is found important, one should get rid of the BV with the smallest error and replace it with the new input vector. First, we discuss the elimination of a BV and then the criterion by which we choose the BV to be removed.

To remove a basis vector from the BV set, we first assume that the respective BV has just been added; thus, the previous update step was done with e_{t+1}, the (t+1)th unit vector. With this assumption, we identify the elements q^{(t+1)}, r^{(t+1)}, and s_{t+1} from equation 2.9; compute ê_{t+1} (this computation is also replaced by an identification from equation 3.15); and use equation 3.11 for an update without including the new point in the BV set. If we assume t+1 basis vectors, α_{t+1} has t+1 elements, and the matrices C_{t+1} and Q_{t+1} are (t+1) × (t+1).
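The novelty γ_{t+1} of equation 3.14 and the inverse-Gram update of equation 3.15 can be sketched in a few lines of numpy. The function names and the RBF kernel below are illustrative choices, not code from the paper:

```python
import numpy as np

def k0(a, b):
    """Spherical RBF prior kernel (cf. equation 4.1), an illustrative choice."""
    return np.exp(-np.sum((a - b) ** 2) / 2.0)

def novelty_and_extend(Q, bv, x_new):
    """Return gamma_{t+1} (equation 3.14) and the extended inverse Gram
    matrix Q_{t+1} (equation 3.15) for a candidate basis vector x_new."""
    k_new = np.array([k0(x, x_new) for x in bv])     # k_{t+1}
    k_star = k0(x_new, x_new)                        # k*_{t+1}
    e_hat = Q @ k_new                                # ê_{t+1} = K_t^{-1} k_{t+1}
    gamma = k_star - k_new @ e_hat                   # squared residual norm in the RKHS
    t = len(bv)
    d = np.append(e_hat, -1.0)                       # T_{t+1}(ê_{t+1}) - e_{t+1}
    Q_ext = np.zeros((t + 1, t + 1))
    Q_ext[:t, :t] = Q                                # U_{t+1}(Q_t)
    return gamma, Q_ext + np.outer(d, d) / gamma
```

Extending point by point from a single basis vector reproduces the direct inverse of the Gram matrix, which is a convenient sanity check on the recursion.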
Further assuming that we want to delete the last added element, the decomposition is as illustrated in Figure 3. Computing the “previous” model parameters and then using the nonincreasing update leads to the deletion equations (see appendix E for details):
\hat{\alpha} = \alpha^{(t)} - a^* \frac{Q^*}{q^*},

\hat{C} = C^{(t)} + c^* \frac{Q^* Q^{*T}}{q^{*2}} - \frac{1}{q^*} \left[ Q^* C^{*T} + C^* Q^{*T} \right],   (3.16)

\hat{Q} = Q^{(t)} - \frac{Q^* Q^{*T}}{q^*},
652
Lehel Csató and Manfred Opper
Figure 3: Grouping of the GP parameters for the update equation, 3.16.
where α̂, Ĉ, and Q̂ are the parameters after the deletion of the last basis vector, and C^{(t)}, Q^{(t)}, α^{(t)}, Q^*, C^*, q^*, and c^* are taken from the GP parameters before the deletion. A graphical illustration of each element is provided in Figure 3. Of particular interest is the identification of the parameters q^{(t+1)} and γ_{t+1}, since their product gives the score of the basis vector that is being deleted. This leads to the score
\varepsilon_{t+1} = \frac{a^*}{q^*} = \frac{a_{t+1}(t+1)}{Q_{t+1}(t+1, t+1)}.   (3.17)
Thus, we have the score for the last input point. Neglecting dependence of the GP posterior on the ordering of the data, equation 3.17 gives us a score measure for each element i in the BV set,
\varepsilon_i = \frac{|a_{t+1}(i)|}{Q_{t+1}(i, i)},   (3.18)
by rearranging the order in BV so that element i is at the last position. To summarize, if a deletion is needed, the basis vector with minimal score (from equation 3.18) will be deleted. The scores are computationally cheap to obtain (linear cost).

3.3 The Sparse GP Algorithm. The following numerical experiments are based on the version of the algorithm that assumes a given maximal size for the BV set. We start by initializing the BV set with an empty set, the maximal number of elements in the BV set with d, the prior kernel K_0, and a tolerance parameter ε_tol. This latter will be used to prevent the Gram matrix from becoming singular and is used in step 2. The GP parameters α, C, and the inverse Gram matrix Q are set to empty values. For each data element (y_{t+1}, x_{t+1}), we iterate the following steps:
1. Compute q^{(t+1)}, r^{(t+1)}, k^*_{t+1}, k_{t+1}, ê_{t+1}, and γ_{t+1}.
2. If γ_{t+1} < ε_tol, then perform a reduced update with ŝ_{t+1} from equation 3.2 without extending the size of the parameters α and C. Advance to the next data element.
Sparse On-Line Gaussian Processes
653
3. (else) Perform the update equation 2.9 using the unit vector e_{t+1}. Add the current input to the BV set, and compute the inverse of the extended Gram matrix using equation 3.15.

4. If the size of the BV set is larger than d, then compute the scores ε_i for all BVs from equation 3.18, find the basis vector with the minimum score, and delete it using equations 3.16.

The computational time for a single step is quadratic in d, the maximal number of BVs allowed. Iterating over all data, the computational time is O(Nd^2). This is a significant improvement over the O(N^3) scaling of the nonsparse GP algorithm. Since we propose a "subspace" algorithm, we have the same computing time as in Wahba (1990) and the Nyström approximation for kernel matrices in Williams and Seeger (2001). An important feature of the sparse GP algorithm is that we provide an approximation to the posterior kernel of the process, yielding a predictive variance for new inputs, as shown by the error bars in Figure 4. Another aspect of the algorithm is that the basis vectors are selected at runtime from the data. A final remark is that the iterative computation of the inverse Gram matrix Q prevents the numerical instabilities caused by inverting a singular Gram matrix: γ_{t+1} is zero if the new element x_{t+1} would make the Gram matrix singular. The comparison of γ_{t+1} with the preset tolerance value ε_tol prevents this.

4 Experimental Results
In all experiments, we used spherical RBF kernels,

K_0(x, x') = \exp \left( - \frac{\| x - x' \|^2}{2 d \sigma_K^2} \right),   (4.1)
where σ_K is the width of the kernel and d is the input dimension.

4.1 Regression. In the regression model, we assume a multidimensional input x ∈ R^m and an output y with the likelihood

P(y | x) = \frac{1}{\sqrt{2\pi} \, \sigma_0} \exp \left( - \frac{\| y - f_x \|^2}{2 \sigma_0^2} \right).   (4.2)
Since the likelihood is gaussian, the use of a gaussian posterior in the on-line algorithm is exact. Hence, only the sparsity introduces an approximation into our procedure. For a given number of examples, the parameterization 2.8 of the posterior in terms of α and C leads to a predictive distribution of
Figure 4: Results of the GP regression with 1000 noisy training data points with noise variance σ_0² = 0.02. The figures show the results for RBF kernels with two different widths. (Left) The good fit of the GP mean function (continuous lines) to the true function (dashed lines) is also consistent with the tight Bayesian error bars (dash-dotted lines) around the means. (Right) The error bars are broader, reflecting the larger uncertainty. The BV set is marked with diamonds; we kept 10 and 15 basis vectors. We used σ_K² = 1 for the left and σ_K² = 0.1 for the right subfigure, respectively.
y for an input x,

p(y | x, \alpha, C) = \frac{1}{\sqrt{2\pi \sigma_x^2}} \exp \left\{ - \frac{\| y - \alpha^T k_x \|^2}{2 \sigma_x^2} \right\},   (4.3)
with σ_x² = σ_0² + k_x^T C_t k_x + k^*_x. The on-line update rules, equation 2.9, for α and C in terms of the parameters q^{(t+1)} and r^{(t+1)} are

q^{(t+1)} = \frac{y - \alpha_t^T k_x}{\sigma_x^2}, \qquad r^{(t+1)} = - \frac{1}{\sigma_x^2}.   (4.4)
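Because the gaussian likelihood makes the on-line posterior exact, the updates of equations 2.9 and 4.4, applied without sparseness (every input kept as a basis vector), must reproduce the batch GP regression solution α = (K + σ_0² I)^{-1} y and C = −(K + σ_0² I)^{-1}. The following numpy sketch, with illustrative names, makes that check concrete:

```python
import numpy as np

def online_gp_regression(X, y, kernel, s0_sq):
    """Sequential (non-sparse) on-line GP regression: the updates of
    equations 2.9 and 4.4, with every input entering the basis."""
    alpha, C, seen = np.zeros(0), np.zeros((0, 0)), []
    for x, yt in zip(X, y):
        k = np.array([kernel(xi, x) for xi in seen])   # k_{t+1}
        k_star = kernel(x, x)                          # k*_{t+1}
        sx_sq = s0_sq + k_star + k @ C @ k             # sigma_x^2: noise + predictive variance
        q = (yt - alpha @ k) / sx_sq                   # q^{(t+1)}, equation 4.4
        r = -1.0 / sx_sq                               # r^{(t+1)}, equation 4.4
        s = np.append(C @ k, 1.0)                      # s_{t+1} = T(C_t k_{t+1}) + e_{t+1}
        alpha = np.append(alpha, 0.0) + q * s          # mean update, equation 2.9
        C_ext = np.zeros((len(s), len(s)))
        C_ext[:-1, :-1] = C
        C = C_ext + r * np.outer(s, s)                 # covariance update, equation 2.9
        seen.append(x)
    return alpha, C
```

After processing all the data, the posterior mean at a test point x is α^T k_x with k_x = [K_0(x_i, x)]_i; since no approximation is involved here, the result agrees with the batch solution regardless of the ordering of the data.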
To illustrate the performance, we begin with the toy example y = sin(x)/x + η, where η is zero-mean gaussian noise with variance σ_0² = 0.02. The results for the posterior means and the Bayesian error bars, together with the basis vectors, are shown in Figure 4. The large error bars obtained for the "misspecified" kernel (with small width σ_K² = 0.1) demonstrate the advantage of propagating not only the means but also the uncertainties in the algorithm. The second data set is the Friedman data set 1 (Friedman, 1991), an artificial data set frequently used to assess the performance of regression algorithms. For this example, we demonstrate the effect of the approximation introduced by the sparseness. The upper solid line in Figure 5 shows the development of the test error with an increasing number of examples without sparseness, that is, when all data are included sequentially. The dots are the
Figure 5: Results for the Friedman data using the full GP regression (continuous line) and the proposed sparse GP algorithm with a fixed BV size (dots with error bars). The dash-dotted line is obtained by sequentially reducing the size of the BV set. The lines show the average performance over 50 runs. The full GP solution uses only the specified number of data, whereas the other two curves are obtained by iterating over the full data set (σ_K² = 1 was used, with 300 training and 500 test data).
test errors obtained by running the sparse algorithm with different sizes of the BV set. We see that almost two-thirds of the original training set can be excluded from the BV set without a significant loss of predictive performance. Finally, we have tested the effect of the greediness of the algorithm by adding or removing examples in different ways. The dependence of the sparse GP on the data is shown by the error bars around the dots, and the dependence of the result on the different orderings is well within these error bars. The dash-dotted line is obtained by first running the on-line algorithm without sparseness on the full data set and then building BV sets of decreasing sizes by removing the least significant examples one after the other. Remarkably, the performance is rather stable against these variations in the plateau region of (almost) constant test error.

4.2 Classification. For classification, we use the probit model (Neal, 1997), where a binary value y ∈ {−1, 1} is assigned to an input x ∈ R^m with the data likelihood

P(y | f_x) = \mathrm{Erf} \left( \frac{y f_x}{\sigma_0} \right).   (4.5)
Figure 6: Results for the binary (left) and multiclass (right) classification. The multiclass case is a combination of the 10 individual classifiers. The example x is assigned to the class with the highest P(C_i | x). We compare different sizes of the BV set and the effect of reusing the data a second time (signified by the circles).
Erf(x) is the cumulative gaussian distribution,^1 with σ_0 the noise variance. The predictive distribution for a new example x is

p(y | x, \alpha, C) = \langle P(y | f_x) \rangle_t = \mathrm{Erf} \left( \frac{y \langle f_x \rangle}{\sigma_x} \right),   (4.6)
where ⟨f_x⟩ is the mean of the GP at x, given by equation 2.8, and σ_x² = σ_0² + k^*_x + k_x^T C k_x. Based on equation 2.9, for a given input-output pair (x, y), the update coefficients q^{(t+1)} and r^{(t+1)} are computed (for details, see Csató et al., 2000):

q^{(t+1)} = \frac{y}{\sigma_x} \frac{\mathrm{Erf}'}{\mathrm{Erf}}, \qquad r^{(t+1)} = \frac{1}{\sigma_x^2} \left\{ \frac{\mathrm{Erf}''}{\mathrm{Erf}} - \left( \frac{\mathrm{Erf}'}{\mathrm{Erf}} \right)^2 \right\},   (4.7)

with Erf(z) evaluated at z = y α_t^T k_x / σ_x, and Erf' and Erf'' the first and second derivatives at z. We have tested the sparse GP algorithm on the U.S. Postal Service (USPS) data set^2 of gray-scale handwritten digit images (of size 16 × 16) with 7291 training patterns and 2007 test patterns. In the first experiment, we studied the problem of classifying the digit 4 against all other digits. Figure 6 (left) plots the test errors of the algorithm for different BV set sizes and a fixed value of the hyperparameter σ_K² = 1.
^1 \mathrm{Erf}(x) = \int_{-\infty}^{x} dt \, \exp(-t^2/2) / \sqrt{2\pi}.
^2 Available from http://www.kernel-machines.org/data/.
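For this probit model, the gaussian-smoothed likelihood is analytic: with the marginal f_x ~ N(m, σ_x² − σ_0²), one gets ⟨P(y | f_x)⟩ = Erf(ym/σ_x), so q^{(t+1)} and r^{(t+1)} of equation 4.7 are just the first and second derivatives of ln Erf(ym/σ_x) with respect to the prior mean m = α_t^T k_x. A sketch using only the Python standard library (function names illustrative):

```python
import math

def phi(z):
    """Standard gaussian density: Erf'(z) in the paper's notation."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Cumulative gaussian: the paper's Erf(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_coefficients(y, m, sx):
    """q^{(t+1)} and r^{(t+1)} of equation 4.7 for a label y in {-1, +1},
    prior mean m = alpha_t^T k_x, and sx = sigma_x."""
    z = y * m / sx
    ratio = phi(z) / Phi(z)
    q = y * ratio / sx
    r = (-z * ratio - ratio ** 2) / (sx * sx)   # uses Erf''(z) = -z Erf'(z)
    return q, r
```

Both coefficients can be checked against finite differences of ln Erf(ym/σ_x), which is exactly the logarithm of the averaged likelihood.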
The USPS data set has been used previously to test the performance of other kernel-based classification algorithms that are based on sparse representations. We mention the kernel PCA method of Schölkopf et al. (1999) and the Nyström method of Williams and Seeger (2001). They obtained slightly better results than our on-line algorithm. When the basis of the Nyström approach is reduced to 512, the mean error is ≈ 1.7% (Williams & Seeger, 2001), and the PCA reduced-set method of Schölkopf et al. (1999) leads to an error rate of ≈ 5%. This may be due to the fact that the sequential replacement of the posterior by a gaussian is an approximation for the classification problem. Hence, some of the information contained in an example is lost even when the BV set contains all data. As shown in Figure 6, we observe a slight improvement when the algorithm sweeps several times through the data. However, it should be noted that the use of the algorithm (in its present form) on data that it has already seen is a mere heuristic and can no longer be justified from a probabilistic point of view. A change of the update rule based on a recycling of examples will be investigated elsewhere. We have also tested our method on the more realistic problem of classifying all 10 digits simultaneously. Our ability to compute Bayesian predictive probabilities is essential in this case. We have trained 10 classifiers on the 10 binary classification problems of separating a single digit from the rest. A new input was assigned to the class with the highest predictive probability given by equation 4.6. Figure 6 summarizes the results for the multiclass case for different BV set sizes and gaussian kernels (with the external noise variance σ_0² = 0). In this case, the recycling of examples was of less significance. The gap between our on-line result and the batch performance reported in Schölkopf et al.
(1999) is also smaller; this might be due to the Bayesian nature of the GPs, which avoids overfitting. To reduce the computational cost, we used the same BV set for all individual classifiers (only a single inverse of the Gram matrix was needed, and the storage cost is also smaller). This made the implementation of deleting a basis vector for the multiclass case less straightforward: for each input and each basis vector, there are 10 individual scores. We implemented a minimax deletion rule: whenever a deletion was needed, the basis vector having the smallest maximum value among the 10 classifier problems was deleted; that is, the index l of the deleted input was

l = \arg \min_{i \in BV} \max_{c \in \{0, \ldots, 9\}} \varepsilon_i^c.   (4.8)
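With the per-classifier scores arranged in a matrix, the minimax rule of equation 4.8 reduces to one line of numpy; the score values below are invented purely for illustration:

```python
import numpy as np

# scores[c, i] = score eps_i^c of basis vector i under classifier c (made-up values)
scores = np.array([[0.9, 0.2, 0.5],
                   [0.4, 0.1, 0.8],
                   [0.3, 0.3, 0.6]])

# Equation 4.8: delete the basis vector whose worst-case score is smallest.
worst_case = scores.max(axis=0)        # max over classifiers c
l = int(np.argmin(worst_case))         # l == 1 here: column 1 never exceeds 0.3
```

Basis vector 1 is deleted because no classifier considers it more important than a score of 0.3.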
Figure 7 shows the evolution of the test error when the sparse GP algorithm was initially trained with 1500 BVs and, without any retraining, the least-scoring basis vectors are deleted. As in the regression case (see Figure 5), we observe a long plateau of almost constant test error when up to 70% of the BVs are removed.
Figure 7: Performance of the combined classifier trained with an initial BV size of 1500 and a sequential removal of basis vectors.
5 Conclusion and Further Investigations
We have presented a greedy algorithm that allows computing a sparse gaussian approximation to the posterior of GP models with general (factorizing) likelihoods based on a single sweep through the data set. So far, we have applied the method to regression and classification tasks and obtained a performance close to batch methods. The strength of the method lies in the fact that arbitrary, even noncontinuous likelihoods, which may be zero in certain regions, can be treated by our method. Such likelihoods may cause problems for other approximations based on local linearizations (advanced Kalman filters) or on the averaging of the log-likelihood (variational gaussian approximation). Our method merely requires the explicit computation of a gaussian-smoothed likelihood and is thus well suited for cases where (local) likelihood functions can be modeled empirically as mixtures of gaussians. If such expressions are available, the necessary one-dimensional integrals can be done analytically, and the on-line updates require just matrix multiplications and function evaluations. A model of this structure for which we have already obtained promising preliminary results is the one used to predict wind fields from ambiguous satellite measurements, based on a GP prior for the wind fields and a likelihood model for the measurement process. A further development of the method requires the solution of various theoretical problems on which we are now working. An important problem is to assess the quality of our approximations. There are two sources of errors: one coming from the gaussian on-line approximation and another
stemming from the additional sparsity. In both cases, it is easy to obtain explicit expressions for single-step errors, but it is not obvious how to combine these in order to estimate the cumulative deviation between the true posterior and our approximation. It may be interesting to concentrate on the regression problem first, because in this case there is no approximation involved in computing the posterior. A different question is the (frequentist) statistical quality of the algorithm. Our on-line gaussian approximation (without sparseness) was found to be asymptotically efficient (in the sense of Fisher) in the finite-dimensional (parametric) case (Opper, 1996, 1998). This result does not trivially extend to the present infinite-dimensional GP case, and further studies are necessary. These may be based on the idea of an effective, finite dimensionality for the set of well-estimated parameters (Trecate, Williams, & Opper, 1999). Such work should also give an estimate for the sufficient number of basis vectors and explain the existence of the long plateaus (see Figures 7 and 5) with practically constant test errors. Besides a deeper understanding of the algorithm, we find it also important to improve our method. Our sparse approximation was found to preserve the posterior means on previous data points when projecting on a representation that leaves out the current example. A further improvement might be achieved if information on the posterior variance were also used (e.g., by taking the Kullback-Leibler loss rather than the RKHS norm) in optimizing the projection. This may, however, result in more complex and time-consuming updates. Our experiments show that in some cases, the performance of the on-line algorithm is inferior to a batch method. We expect that our algorithm can be adapted to a recycling of data (e.g., along the lines of Minka, 2000) such that convergence to a sparse representation of the TAP mean-field method (Opper & Winther, 1999) is achieved.
A further drawback that will be addressed in future work is the lack of an (on-line) adaptation of the kernel hyperparameters. Rather than setting them by hand, an approximate propagation of posterior distributions for the hyperparameters would be desirable. Finally, there may be cases of probabilistic models where the restriction to unimodal posteriors, as given by the gaussian approximation, is too severe. Although we know that for gaussian regression and for classification with logistic or probit functions the posterior is unimodal, for more complicated models an on-line propagation of a mixture of GPs should be considered.
Acknowledgments
We thank Bernhard Schottky for initiating the gaussian process parameterization lemma. The work was supported by EPSRC grant no. GR/M81608.
Appendix A: Properties of Zero-Mean Gaussians
The following property of the gaussian probability density function (pdf) is often used in this article; here we state it in the form of a theorem:

Theorem 1. Let x ∈ R^m and p(x) be a zero-mean gaussian pdf with covariance Σ = {Σ_ij} (i, j from 1 to m). If g: R^m → R is a differentiable function not growing faster than a polynomial, with partial derivatives

\partial_j g(x) = \frac{\partial}{\partial x_j} g(x),

then

\int_{R^m} dx \, p(x) \, x_i \, g(x) = \sum_{j=1}^{m} \Sigma_{ij} \int_{R^m} dx \, p(x) \, \partial_j g(x).   (A.1)
In the following, we will assume definite integration over R^m whenever an integral appears. Alternatively, using vector notation, the above identity reads

\int dx \, p(x) \, x \, g(x) = \Sigma \int dx \, p(x) \, \nabla g(x).   (A.2)
For a general gaussian pdf with mean μ, the identity A.2 transforms to

\int dx \, p(x) \, x \, g(x) = \mu \int dx \, p(x) \, g(x) + \Sigma \int dx \, p(x) \, \nabla g(x).   (A.3)

Proof. The proof uses the partial integration rule

\int dx \, p(x) \, \nabla g(x) = - \int dx \, g(x) \, \nabla p(x),

where we have used the fast decay of the gaussian function to dismiss one of the terms. Using the derivative of a gaussian pdf, we have

\int dx \, p(x) \, \nabla g(x) = \int dx \, g(x) \, \Sigma^{-1} x \, p(x).

Multiplying both sides by Σ leads to equation A.2, completing the proof. For the nonzero mean, the deduction is also straightforward.
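Theorem 1 is a gaussian integration-by-parts (Stein) identity and is easy to check by Monte Carlo; the covariance and the test function g below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])                       # covariance of the zero-mean gaussian
x = rng.multivariate_normal([0.0, 0.0], Sigma, size=400_000)

g = lambda x: np.sin(x[:, 0]) + x[:, 1] ** 2         # differentiable, polynomially bounded
grad_g = lambda x: np.stack([np.cos(x[:, 0]), 2.0 * x[:, 1]], axis=1)

lhs = (x * g(x)[:, None]).mean(axis=0)               # E[x g(x)]
rhs = Sigma @ grad_g(x).mean(axis=0)                 # Sigma E[grad g(x)], equation A.2
```

The two sides agree up to Monte Carlo error, which shrinks with the sample size.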
Appendix B: Proof of the Parameterization Lemma
Using Bayes' rule, the posterior process has the form

\hat{p}(f) = \frac{p_0(f) \, P(D | f)}{\int df \, p_0(f) \, P(D | f)},
where f is a set of realizations of the random process indexed by arbitrary points from R^m, the inputs for the GPs. We first compute the mean function of the posterior process:

\langle f_x \rangle_{post} = \int df \, \hat{p}(f) \, f_x = \frac{\int df \, p_0(f) \, f_x \, P(D | f)}{\int df \, p_0(f) \, P(D | f)} = \frac{1}{Z} \int df_x \prod_{i=1}^{N} df_i \; p_0(f_x, f_1, \ldots, f_N) \, f_x \, P(D | f_1, \ldots, f_N),   (B.1)
where the denominator was denoted by Z and we used index notation for the realizations of the process as well (thus, f(x) = f_x and f(x_i) = f_i). Observe that regardless of the number of random variables of the process considered, the dimension of the integral we need to consider is only N+1; all other random variables integrate out (as in equation B.1). We thus have an (N+1)-dimensional integral in the numerator, and Z is an N-dimensional integral. If we group the variables related to the data as f_D = [f_1, \ldots, f_N]^T and apply theorem 1 (see equation A.1), replacing x_i by f_x and g(x) by P(D | f_D), we have
\langle f_x \rangle_{post} = \frac{1}{Z} \left( \langle f_x \rangle_0 \int df_x \, df_D \, p_0(f_x, f_D) \, P(D | f_D) + \sum_{i=1}^{N} K_0(x, x_i) \int df_x \, df_D \, p_0(f_x, f_D) \, \partial_i P(D | f_D) \right),   (B.2)
where K_0 is the kernel function generating the covariance matrix (Σ in theorem 1). The variable f_x in the integrals disappears, since it is contained only in p_0. Substituting back Z leads to

\langle f_x \rangle_{post} = \langle f_x \rangle_0 + \sum_{i=1}^{N} K_0(x, x_i) \, q_i,   (B.3)

where q_i is read off from equation B.2,

q_i = \frac{\int df_D \, p_0(f_D) \, \partial_i P(D | f_D)}{\int df_D \, p_0(f_D) \, P(D | f_D)},   (B.4)
662
Lehel CsatÂo and Manfred Opper
and the coefficients q_i depend only on the data and are independent of x, the point at which the posterior mean is evaluated. We can simplify the expression for q_i by performing a change of variables in the numerator, f'_i = f_i − ⟨f_i⟩_0, where ⟨f_i⟩_0 is the prior mean at x_i, keeping all other variables unchanged, f'_j = f_j, j ≠ i, leading to the numerator

\int df'_D \, p_0(f'_D) \, \partial_i P(D | f'_1, \ldots, f'_i + \langle f_i \rangle_0, \ldots, f'_N),

where the differentiation is with respect to f'_i. We then replace the partial differentiation with respect to f'_i by the partial differentiation with respect to ⟨f_i⟩_0 and exchange the differentiation and integration operators (they apply to distinct sets of variables), leading to

\frac{\partial}{\partial \langle f_i \rangle_0} \int df'_D \, p_0(f'_D) \, P(D | f'_1, \ldots, f'_i + \langle f_i \rangle_0, \ldots, f'_N).

We then perform the inverse change of variables inside the integral and substitute back into the expression for q_i:

q_i = \frac{\frac{\partial}{\partial \langle f_i \rangle_0} \int df_D \, p_0(f_D) \, P(D | f_D)}{\int df_D \, p_0(f_D) \, P(D | f_D)} = \frac{\partial}{\partial \langle f_i \rangle_0} \ln \int df_D \, p_0(f_D) \, P(D | f_D).   (B.5)

Writing the expression for the posterior kernel,

K_{post}(x, x') = \langle f_x f_{x'} \rangle_{post} - \langle f_x \rangle_{post} \langle f_{x'} \rangle_{post},   (B.6)
and applying theorem 1 twice leads to

K_{post}(x, x') = K_0(x, x') + \sum_{i=1}^{N} \sum_{j=1}^{N} K_0(x, x_i) \left( D_{ij} - q_i q_j \right) K_0(x_j, x'),   (B.7)
where D_{ij} is

D_{ij} = \frac{1}{Z} \int df_D \, p_0(f_D) \, \frac{\partial^2}{\partial f_j \, \partial f_i} P(D | f_D).   (B.8)
Identifying R_{ij} = D_{ij} − q_i q_j leads to the required parameterization, equation 2.3, from lemma 1. Simplification of R_{ij} = D_{ij} − q_i q_j is made by changing the arguments of the partial derivatives and using the logarithm of the expectation (repeating steps B.4 and B.5 for q_i), leading to

R_{ij} = \frac{\partial^2}{\partial \langle f_i \rangle_0 \, \partial \langle f_j \rangle_0} \ln \int df_D \, p_0(f_D) \, P(D | f_D),   (B.9)

and using a single datum in the likelihood leads to the scalar coefficients q^{(t+1)} and r^{(t+1)} from equation 2.7.
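Equation B.5 can be checked directly in the gaussian regression case, where the evidence ∫ df_D p_0(f_D) P(D | f_D) is analytic, namely N(y; m_0, K + σ_0² I), so q should match a finite-difference derivative of its logarithm with respect to the prior mean m_0. A sketch (all names illustrative):

```python
import numpy as np

def log_evidence(m0, y, K, s0_sq):
    """ln N(y; m0, K + s0_sq I), the evidence of gaussian regression."""
    S = K + s0_sq * np.eye(len(y))
    d = y - m0
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * S)
    return -0.5 * (d @ np.linalg.solve(S, d) + logdet)

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
K = A @ A.T + np.eye(3)                  # stand-in for a prior Gram matrix
y = rng.normal(size=3)
m0 = np.zeros(3)

# Closed form implied by equation B.5: q = (K + s0_sq I)^{-1} (y - m0)
q = np.linalg.solve(K + 0.1 * np.eye(3), y - m0)
```

The vector q here also agrees with the regression coefficients q^{(t+1)} of equation 4.4 when a single datum is used.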
Appendix C: On-Line Learning in GP Framework
We prove equation 2.8 by induction. We will show that for every time step, we can express the mean and kernel functions with coefficients α and C given by the recursion (also equation 2.9):

\alpha_{t+1} = T_{t+1}(\alpha_t) + q^{(t+1)} s_{t+1}, \qquad C_{t+1} = U_{t+1}(C_t) + r^{(t+1)} s_{t+1} s_{t+1}^T,   (C.1)

s_{t+1} = T_{t+1}(C_t k_{t+1}) + e_{t+1},   (C.2)
where α and C depend only on the data points x_i and the kernel function K_0 but do not depend on the values x and x' (from equation 2.8) at which the mean and kernel functions are computed. Proceeding by induction and using the initialization α_0 = C_0 = 0, at time t = 1 we have a_1(1) = q^{(1)} and C_1(1,1) = r^{(1)}. The mean function at time t = 1 is ⟨f_x⟩ = a_1(1) K_0(x_1, x) (from lemma 1 for a single datum; see equation 2.6). Similarly, the modified kernel is K_1(x, x') = K_0(x, x') + K_0(x, x_1) C_1(1,1) K_0(x_1, x'), with α and C independent of x and x', proving the induction hypothesis. We assume that at time t, we have the parameters α_t and C_t independent of the points x and x'. These parameters specify a prior GP to which we apply the on-line learning:
\langle f_x \rangle_{t+1} = \sum_{i=1}^{t} K_0(x_i, x) \, a_t(i) + q^{(t+1)} \sum_{i,j=1}^{t} K_0(x, x_i) \, C_t(i,j) \, K_0(x_j, x_{t+1}) + q^{(t+1)} K_0(x, x_{t+1}) = \sum_{i=1}^{t+1} K_0(x, x_i) \, a_{t+1}(i),   (C.3)
and by pairwise identification we have equations C.1 and C.2 from the main body. The parameters α_{t+1} do not depend on the particular value of x. Writing down the update equation for the kernels,

K_{t+1}(x, x') = K_t(x, x') + r^{(t+1)} K_t(x, x_{t+1}) \, K_t(x_{t+1}, x'),

leads to equation C.2 in a straightforward manner, with C_{t+1}(i,j) independent of x and x', completing the induction.

Appendix D: Iterative Computation of the Inverse Gram Matrix
In the sparse approximation, equation 3.7, we need the inverse of the Gram matrix of the BV set, K_BV = {K_0(x_i, x_j)}. In the following, the elements
of the BV set are indexed from 1 to t. Using the matrix inversion formula,^3 the addition of a new element can be carried out sequentially. This is a well-known fact, exploited also in the Kalman filter algorithm. We consider the new element at the end (last row and column) of the matrix K_{t+1}, which is decomposed as

K_{t+1} = \begin{bmatrix} K_t & k_{t+1} \\ k_{t+1}^T & k_{t+1}^* \end{bmatrix}.   (D.1)
Assuming K_t^{-1} known and applying the matrix inversion lemma to K_{t+1},

K_{t+1}^{-1} = \begin{bmatrix} K_t & k_{t+1} \\ k_{t+1}^T & k_{t+1}^* \end{bmatrix}^{-1} = \begin{bmatrix} K_t^{-1} + K_t^{-1} k_{t+1} k_{t+1}^T K_t^{-1} \gamma_{t+1}^{-1} & - K_t^{-1} k_{t+1} \gamma_{t+1}^{-1} \\ - k_{t+1}^T K_t^{-1} \gamma_{t+1}^{-1} & \gamma_{t+1}^{-1} \end{bmatrix},   (D.2)
where γ_{t+1} = k^*_{t+1} − k^T_{t+1} K_t^{-1} k_{t+1} is the geometric term from equation 3.14. Using the notations K_t^{-1} k_{t+1} = ê_{t+1} from equation 3.7, K_t^{-1} = Q_t, and K_{t+1}^{-1} = Q_{t+1}, we have the recursion

Q_{t+1} = \begin{bmatrix} Q_t + \gamma_{t+1}^{-1} \hat{e}_{t+1} \hat{e}_{t+1}^T & - \gamma_{t+1}^{-1} \hat{e}_{t+1} \\ - \gamma_{t+1}^{-1} \hat{e}_{t+1}^T & \gamma_{t+1}^{-1} \end{bmatrix},   (D.3)
and, in more compact matrix notation,

Q_{t+1} = Q_t + \gamma_{t+1}^{-1} \left( \hat{e}_{t+1} - e_{t+1} \right) \left( \hat{e}_{t+1} - e_{t+1} \right)^T,   (D.4)
where e_{t+1} is the (t+1)th unit vector. With this recursion, all matrix inversions are eliminated (the result is general for block matrices; such an implementation, together with an interpretation of the parameters, has also been made in Cauwenberghs & Poggio, 2001). Using the score (see equation 3.17) and including in BV only inputs with nonzero scores, the Gram matrix is guaranteed to be nonsingular; γ_{t+1} > 0 guarantees nonsingularity of the extended Gram matrix.

^3 A useful guide to formulas for matrix inversions and block matrix manipulation can be found at Sam Roweis's home page: http://www.gatsby.ucl.ac.uk/~roweis/notes.html.

For numerical stability, we can use the Cholesky decomposition of the inverse Gram matrix Q. Using the lower triangular matrix R with the corresponding indices and the identity Q = R^T R, we have the update for the Cholesky decomposition,
R_{t+1} = \begin{bmatrix} R_t & 0 \\ - \gamma_{t+1}^{-1/2} \hat{e}_{t+1}^T & \gamma_{t+1}^{-1/2} \end{bmatrix},   (D.5)
which is a computationally very inexpensive operation, requiring no additional work provided that the quantities γ_{t+1} and ê_{t+1} are already computed.

Appendix E: Deleting a BV
Adding a basis vector is done with the equations

s_{t+1} = T_{t+1}(C_t k_{t+1}) + e_{t+1},   (E.1)

\alpha_{t+1} = T_{t+1}(\alpha_t) + q^{(t+1)} s_{t+1}, \qquad C_{t+1} = U_{t+1}(C_t) + r^{(t+1)} s_{t+1} s_{t+1}^T,   (E.2)

Q_{t+1} = U_{t+1}(Q_t) + \gamma_{t+1}^{-1} \left( T_{t+1}(\hat{e}_{t+1}) - e_{t+1} \right) \left( T_{t+1}(\hat{e}_{t+1}) - e_{t+1} \right)^T,   (E.3)
where α and C are the GP parameters, Q is the inverse Gram matrix, γ_{t+1} and ê_{t+1} are the geometric terms of the new basis vector, k_{t+1} = [K_0(x_1, x_{t+1}), …, K_0(x_t, x_{t+1})]^T, and e_{t+1} is the (t+1)th unit vector. The optimal decrease of the BV set needs answers to two questions. The first question is how to delete a basis vector from the set of basis vectors with minimal loss of information. Once the method is given, we then have to find the BV to remove. The first problem is solved by inverting the learning equations E.1 through E.3. Assuming α_{t+1}, C_{t+1}, and Q_{t+1} known and using pairwise correspondence, considering the (t+1)th element of α_{t+1} we can identify

q^{(t+1)} = a_{t+1}(t+1) \stackrel{def}{=} a_{t+1}^*

(the notations are illustrated in Figure 3). Using similar correspondences for the matrix C_{t+1}, the following identifications can be made:

r^{(t+1)} = C_{t+1}(t+1, t+1) \stackrel{def}{=} c_{t+1}^*, \qquad C_t k_{t+1} = \frac{C_{t+1}(1..t, \, t+1)}{c_{t+1}^*} \stackrel{def}{=} \frac{C_{t+1}^*}{c_{t+1}^*},   (E.4)
with c_{t+1}^* and C_{t+1}^* sketched in Figure 3. Substituting back into equations E.1 and E.2, the old values of the GP parameters are

\alpha_t = \alpha_t^{(t+1)} - a_{t+1}^* \frac{C_{t+1}^*}{c_{t+1}^*},   (E.5)

C_t = C_t^{(t+1)} - \frac{C_{t+1}^* C_{t+1}^{*T}}{c_{t+1}^*},   (E.6)
where \alpha_t^{(t+1)} = T_{t+1}^{-1} \alpha_{t+1} and T_{t+1}^{-1} is the inverse operator that takes the first t elements of a (t+1)-dimensional vector. We define C_t^{(t+1)} = U_{t+1}^{-1} C_{t+1} similarly. Proceeding in the same way with the elements of the matrix Q_{t+1}, the correspondence with equation E.3 is as follows:
\gamma_{t+1} = \frac{1}{Q_{t+1}(t+1, t+1)} \stackrel{def}{=} \frac{1}{q_{t+1}^*}, \qquad \hat{e}_{t+1} = - \frac{Q_{t+1}(1..t, \, t+1)}{q_{t+1}^*} \stackrel{def}{=} - \frac{Q_{t+1}^*}{q_{t+1}^*},   (E.7)
with the reduced-set matrix \hat{Q}_{t+1} given by

\hat{Q}_{t+1} = Q_{t+1}^{(t)} - \frac{Q_{t+1}^* Q_{t+1}^{*T}}{q_{t+1}^*}.   (E.8)
The matrix Q does not need any further modification; however, for α and C, a sparse update is needed. Replacing γ_{t+1} and ê_{t+1}, together with the "old" parameters α_t and C_t, we have the "optimally reduced" GP parameters as
\hat{\alpha}_t = \alpha_t^{(t+1)} - a_{t+1}^* \frac{Q_{t+1}^*}{q_{t+1}^*},   (E.9)

\hat{C}_t = C_t^{(t+1)} + c_{t+1}^* \frac{Q_{t+1}^* Q_{t+1}^{*T}}{q_{t+1}^{*2}} - \frac{1}{q_{t+1}^*} \left[ Q_{t+1}^* C_{t+1}^{*T} + C_{t+1}^* Q_{t+1}^{*T} \right].   (E.10)
References

Berliner, L. M., Wikle, L., & Cressie, N. (2000). Long-lead prediction of Pacific SST via Bayesian dynamic modelling. Journal of Climate, 13, 3953–3968.
Bernardo, J. M., & Smith, A. F. (1994). Bayesian theory. New York: Wiley.
Bottou, L. (1998). Online learning and stochastic approximations. In D. Saad (Ed.), On-line learning in neural networks (pp. 9–42). Cambridge: Cambridge University Press.
Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Diettrich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Csató, L., Fokoué, E., Opper, M., Schottky, B., & Winther, O. (2000). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Evans, D. J., Cornford, D., & Nabney, I. T. (2000). Structured neural network modelling for multi-valued functions for wind vector retrieval from satellite scatterometer measurements. Neurocomputing, 30, 23–30.
Sparse On-Line Gaussian Processes
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141.

Gibbs, M., & MacKay, D. J. (1999). Efficient implementation of gaussian processes (Tech. Rep.). Cambridge: Department of Physics, Cavendish Laboratory, Cambridge University. Available at: http://wol.ra.phy.cam.ac.uk/mackay/abstracts/gpros.html.

Jaakkola, T., & Haussler, D. (1999). Probabilistic kernel regression. In Online Proceedings of 7th Int. Workshop on AI and Statistics. Available at: http://uncertainty99.microsoft.com/proceedings.htm.

Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33, 82–95.

Lemm, J. C., Uhlig, J., & Weiguny, A. (2000). A Bayesian approach to inverse quantum statistics. Phys. Rev. Lett., 84, 2068–2071.

McLachlan, A., & Lowe, D. (1996). Tracking of non-stationary time-series using resource allocating RBF networks. In R. Trappl (Ed.), Cybernetics and Systems '96 (pp. 1066–1071). Proceedings of the 13th European Meeting on Cybernetics and Systems Research. Vienna: Austrian Society for Cybernetic Studies.

Minka, T. P. (2000). Expectation propagation for approximate Bayesian inference. Unpublished doctoral dissertation, MIT. Available at: vismod.www.media.mit.edu/~tpminka/.

Neal, R. M. (1997). Regression and classification using gaussian process priors (with discussion). In J. M. Bernerdo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 6). New York: Oxford University Press. Available at: ftp://ftp.cs.utoronto.ca/pub/radford/mc-gp.ps.Z.

Opper, M. (1996). Online versus offline learning from random examples: General results. Phys. Rev. Lett., 77(22), 4671–4674.

Opper, M. (1998). A Bayesian approach to online learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 363–378). Cambridge: Cambridge University Press.

Opper, M., & Winther, O. (1999). Gaussian processes and SVM: Mean field results and leave-one-out estimator. In A. Smola, P. Bartlett, B.
Schölkopf, & C. Schuurmans (Eds.), Advances in large margin classifiers (pp. 43–65). Cambridge, MA: MIT Press.

Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213–225.

Schölkopf, B., Mika, S., Burges, C. J., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. J. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017.

Seeger, M. (2000). Bayesian model selection for support vector machines, gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.

Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Available at: www.kernel-machines.org/papers/upload 4467 kfa long.ps.gz.
Lehel Csató and Manfred Opper
Tipping, M. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.

Trecate, G. F., Williams, C. K. I., & Opper, M. (1999). Finite-dimensional approximation of gaussian processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12(11), 2719–2741.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vijayakumar, S., & Ogawa, H. (1999). RKHS based functional analysis for exact incremental learning. Neurocomputing, 29(1–3), 85–113.

Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Williams, C. K. I. (1999). Prediction with gaussian processes. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.

Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342–1351.

Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Diettrich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Received February 23, 2001; accepted May 24, 2001.
Communicated by Erkki Oja
LETTER
Orthogonal Series Density Estimation and the Kernel Eigenvalue Problem Mark Girolami [email protected] Laboratory of Computing and Information Science, Helsinki University of Technology, FIN-02015 HUT, Finland
Kernel principal component analysis has been introduced as a method of extracting a set of orthonormal nonlinear features from multivariate data, and many impressive applications are being reported within the literature. This article presents the view that the eigenvalue decomposition of a kernel matrix can also provide the discrete expansion coefficients required for a nonparametric orthogonal series density estimator. In addition to providing novel insights into nonparametric density estimation, this article provides an intuitively appealing interpretation for the nonlinear features extracted from data using kernel principal component analysis.

1 Introduction

Kernel principal component analysis (KPCA) is an elegant method of extracting nonlinear features from data, the number of which may exceed the dimensionality of the data (Schölkopf, Smola, & Müller, 1996, 1998). There have been many notable applications of KPCA for the denoising of images and the extraction of features for subsequent use in linear support vector classifiers (Schölkopf et al., 1996; Schölkopf, Burges, & Smola, 1999). Computationally efficient methods have been proposed in Rosipal and Girolami (2001) for the extraction of nonlinear components from a Gram matrix, thus obviating the computationally burdensome requirement of diagonalizing a potentially high-dimensional Gram matrix.¹ In KPCA, the implicit nonlinear mapping from input space to a possibly infinite-dimensional feature space often makes it difficult to interpret the features extracted from the data. However, by considering the estimation of a probability density function from a finite data sample using an orthogonal
¹ The term Gram matrix refers to the $N \times N$ kernel matrix. The terms kernel matrix and Gram matrix may be used interchangeably.
Neural Computation 14, 669–688 (2002)
© 2002 Massachusetts Institute of Technology
series, some insights into the nature of the features extracted by KPCA can be provided. Section 2 briefly reviews orthogonal series density estimation, and section 3 introduces the notion of using KPCA to extract the orthonormal features required in constructing a finite series density estimator. Section 4 considers the important aspect of selecting the appropriate number of components that should appear in the series. Section 5 highlights the fact that the quadratic Renyi entropy of the data sample can be estimated using the associated Gram matrix. This strengthens the view that KPCA provides features that can be viewed as estimated components of the underlying data density. Section 6 provides some illustrative examples, and section 7 is devoted to conclusions and related discussion.

2 Finite Sequence Density Estimation

The estimation of a probability density by the construction of a finite series of orthogonal functions is briefly described in this section. (See Izenman, 1991, and the references in it for a complete exposition of this nonparametric method of density estimation.) We first consider density estimation using an infinite-length series expansion.

2.1 Infinite-Length Sequence Density Estimator. A probability density function that is square integrable can be represented by a convergent orthogonal series expansion (Izenman, 1991), such that

$$p(\mathbf{x}) = \sum_{k=1}^{\infty} c_k \Psi_k(\mathbf{x}), \tag{2.1}$$

where $\mathbf{x} \in \mathbb{R}^D$ and the expansion coefficients are

$$c_k = \int p(\mathbf{x})\, \Psi_k(\mathbf{x})\, d\mathbf{x} = E_p\{\Psi_k(\mathbf{x})\} \equiv \mu_{\Psi_k}, \tag{2.2}$$

where $E_p$ denotes expectation with respect to the density function $p$. Thus, we can write $p(\mathbf{x}) = \sum_{k=1}^{\infty} \mu_{\Psi_k} \Psi_k(\mathbf{x}) = \langle \boldsymbol{\mu}_\Psi \cdot \Phi(\mathbf{x}) \rangle$, where $\langle \cdot \rangle$ represents the canonical (Euclidean) inner product. We can also consider the elements of the series $\Phi(\mathbf{x})$ as a nonlinear map from the data space to some feature space (Schölkopf et al., 1999), such that the inner product in feature space is computed directly using a kernel function, $\langle \Phi(\mathbf{x}') \cdot \Phi(\mathbf{x}) \rangle = K(\mathbf{x}', \mathbf{x})$. Consider a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with associated kernel $K(\mathbf{x}', \mathbf{x}) = \sum_{k=1}^{\infty} \lambda_k \phi_k(\mathbf{x}')\phi_k(\mathbf{x})$ and inner product denoted by $\langle \cdot \rangle_{\mathcal{H}}$, therefore implicitly defining the elements of the mapping as $\Psi_k(\mathbf{x}) \equiv \sqrt{\lambda_k}\, \phi_k(\mathbf{x})$. The density function is given by the inner product of the mapped point in feature space $\Phi(\mathbf{x}')$ and the mean of the distribution, $\boldsymbol{\mu}_\Psi$, in the defined feature space; thus, the reproducing property of the space yields the following density at the point $\mathbf{x}'$:

$$p(\mathbf{x}') = \langle p(\mathbf{x}) \cdot K(\mathbf{x}', \mathbf{x}) \rangle_{\mathcal{H}} = \langle \boldsymbol{\mu}_\Psi \cdot \Phi(\mathbf{x}') \rangle. \tag{2.3}$$
The form of the expression for the density function 2.3 gives an indication of a link between the fundamental problem of density estimation and unsupervised learning methods that employ the kernel trick (Schölkopf et al., 1998, 1999). If there is a finite sample of points $[\mathbf{x}_1, \cdots, \mathbf{x}_N]$ drawn from the true distribution $p(\mathbf{x})$, then an unbiased numerical estimate of the above expectation (the coefficients of the expansion)² is

$$c_k \approx \hat c_k = \frac{1}{N}\sum_{n=1}^{N}\int \delta(\mathbf{x}_n - \mathbf{x})\, \Psi_k(\mathbf{x})\, d\mathbf{x} = \frac{1}{N}\sum_{n=1}^{N} \Psi_k(\mathbf{x}_n) \equiv \hat\mu_{\Psi_k}.$$

The estimated value of the probability density function for a point $\mathbf{x}'$, denoted by $\hat p(\mathbf{x}')$, is then given by the expression

$$\hat p(\mathbf{x}') = \langle \hat{\boldsymbol{\mu}}_\Psi \cdot \Phi(\mathbf{x}') \rangle = \frac{1}{N}\sum_{n=1}^{N} \langle \Phi(\mathbf{x}_n) \cdot \Phi(\mathbf{x}') \rangle = \frac{1}{N}\sum_{n=1}^{N} K(\mathbf{x}', \mathbf{x}_n) = \mathbf{1}_N^T \mathbf{k}(\mathbf{x}'),$$

where the $N \times 1$ vector $[K(\mathbf{x}', \mathbf{x}_1) \cdots K(\mathbf{x}', \mathbf{x}_N)]^T$ is denoted by $\mathbf{k}(\mathbf{x}')$ and the vector $\mathbf{1}_N$ is an $N \times 1$-dimensional vector whose individual elements each have the value $1/N$. This expression is the familiar Parzen window density estimator, with $K(\cdot,\cdot)$ being the smoothing kernel. For the case where a density-dependent weighting $a_n$ is employed in estimating the series coefficients, then $\hat p(\mathbf{x}') = \sum_{n=1}^{N} a_n K(\mathbf{x}', \mathbf{x}_n)$, and support vector methods can be employed in estimating the appropriate weighting coefficients (Mukherjee & Vapnik, 1999), which will yield a sparser representation of the density estimate than the Parzen window estimator.
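The Parzen form $\hat p(\mathbf{x}') = \mathbf{1}_N^T \mathbf{k}(\mathbf{x}')$ is a one-liner once the kernel is fixed. A minimal sketch with a normalized gaussian smoothing kernel (function names are illustrative, not from the article):

```python
import numpy as np

def gauss_kernel(x, y, h=0.25):
    """Isotropic gaussian smoothing kernel, normalized so that
    K(., y) integrates to one over R^d."""
    d = x.shape[-1]
    sq = ((x - y) ** 2).sum(-1)
    return np.exp(-sq / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)

def parzen(x_new, X, h=0.25):
    """Parzen window estimate p_hat(x') = (1/N) sum_n K(x', x_n)."""
    return gauss_kernel(x_new[None, :], X, h).mean()
```

Averaging the kernel over the sample gives a proper density: nonnegative everywhere and integrating to one, in contrast to the truncated series estimator discussed next.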
² Note that the empirical density estimate based on the sum of delta functions $\frac{1}{N}\delta(\mathbf{x}_n - \mathbf{x})$ can also be exchanged for a weighting $a_n$ such that $0 \le a_n \le 1$ and $\sum_{n=1}^{N} a_n = 1$, in which case $\hat c_k = \sum_{n=1}^{N} a_n \Psi_k(\mathbf{x}_n)$, where each $a_n$ will be estimated from the data using, for example, maximum likelihood.
2.2 Truncated Sequence Density Estimator. Turning to the truncated form of the infinite series estimate, the estimated value of the probability density function for a point $\mathbf{x}'$ is then given by the expression

$$\hat p_M(\mathbf{x}') = \sum_{k=1}^{M} \hat c_k \Psi_k(\mathbf{x}') = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{M} \Psi_k(\mathbf{x}_n)\Psi_k(\mathbf{x}') = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{M} \lambda_k \phi_k(\mathbf{x}_n)\phi_k(\mathbf{x}'), \tag{2.4}$$

where $\hat p_M(\mathbf{x}')$ denotes the density estimate yielded when $M$ components of the series are retained. Many systems of orthogonal functions exist; for example, the Legendre, Fourier, and Hermite systems have all been used in equation 2.4 for density estimation (Izenman, 1991). For the case where $p(\mathbf{x})$ has infinite support on the real line, the system of orthogonal functions chosen is typically the normalized Hermite polynomials, and if the support is strictly positive, the Laguerre system may be used (Izenman, 1991; Tou & Gonzalez, 1974). Although density estimation using an orthogonal series is asymptotically unbiased, it has one distinct shortcoming: it can produce negative point values (Izenman, 1991). A standard method for estimating a probability density function is the truncated orthogonal series density estimate, and this has been briefly introduced. The following section begins by considering the solution of the continuous Karhunen-Loève (KL) expansion from discrete data and points out that the nonlinear features provided by KPCA may be used in forming the basis functions for a series density estimator.

3 KPCA and Orthonormal Basis Functions

The solution of the continuous KL expansion from a discrete sample of data is a well-studied problem. Consider the integral equation that describes the KL expansion:

$$\int_{\mathcal{D}} K(\mathbf{x}, \mathbf{x}')\, \phi_i(\mathbf{x})\, d\mathbf{x} = \lambda_i \phi_i(\mathbf{x}'), \tag{3.1}$$
where $K(\mathbf{x}, \mathbf{x}')$ defines the covariance kernel of the associated stochastic process and the domain of integration is defined by $\mathcal{D}$. This integral can be generalized to take into account the variation over the domain of integration given by the underlying probability density function, in which case the integral in equation 3.1 becomes an expectation, and the eigenfunctions are then orthonormal with respect to the data density (Williams & Seeger, 2000). The estimation of the eigenfunctions of the continuous integral equation 3.1 from the associated discrete version, equation 3.2, generated by a finite sample of $N$ data points was studied in Ogawa and Oja (1986):

$$\frac{1}{N}\sum_{n=1}^{N} K(\mathbf{x}_n, \mathbf{x}_m)\, \phi_i(\mathbf{x}_n) = \lambda_i \phi_i(\mathbf{x}_m). \tag{3.2}$$

Indeed, equation 3.1 is a homogeneous Fredholm integral equation of the second kind, and there are a number of quadrature methods available for its approximate numerical solution based on the discretized form of equation 3.2 (Delves & Mohamed, 1985). This is achieved by performing an eigenvalue decomposition on the $N \times N$-dimensional Gram matrix $\mathbf{K}$ whose elements are $K_{in} = K(\mathbf{x}_i, \mathbf{x}_n)$, where $K(\mathbf{x}_i, \mathbf{x}_n)$ denotes the kernel function associated with the two points. The eigen-decomposition satisfies $\mathbf{K}\mathbf{U} = \mathbf{U}\mathbf{S}$, where the columns of the $N \times N$ matrix $\mathbf{U}$ are the eigenvectors of the Gram matrix and the diagonal matrix $\mathbf{S}$ contains the elements $\tilde\lambda_k$, the corresponding eigenvalues. These eigenvectors form an estimate of the actual eigenfunctions such that $\phi_k(\mathbf{x}_n) \approx \hat\phi_k(\mathbf{x}_n) = \sqrt{N}\, u_{nk}$, and the eigenvalues are estimated by $\lambda_k \approx N^{-1}\tilde\lambda_k$ (Williams & Seeger, 2001). Estimates of the eigenfunctions at a new point $\mathbf{x}'$ can be made by simply using equation 3.2 as an interpolatory formula (Delves & Mohamed, 1985), in which case $\hat\phi_k(\mathbf{x}') = \frac{\sqrt{N}}{\tilde\lambda_k}\sum_{n=1}^{N} u_{nk} K(\mathbf{x}', \mathbf{x}_n)$. This approach is a form of the Nyström routine (Delves & Mohamed, 1985) and has recently been proposed in Williams and Seeger (2001) as a method for speeding up the inversion of the Gram matrix in kernel-based classification methods such as gaussian process classifiers. In summary, the eigenvalue decomposition of a Gram matrix associated with a particular kernel $K(\mathbf{x}_i, \mathbf{x}_n)$ provides estimates of the orthogonal system of eigenfunctions associated with the continuous integral equation (the KL expansion). It is now proposed that the extracted eigenvectors can be used to form the required orthogonal series for a probability density function estimate, and by doing so we gain some insights into the nature of the features extracted using KPCA.
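These estimates are simple to compute in practice. The sketch below (NumPy; helper names are illustrative) diagonalizes a Gram matrix and evaluates $\hat\phi_k$ at new points via the interpolatory (Nyström) formula; at the training points themselves the formula reproduces $\sqrt{N}\,u_{nk}$ exactly, since $\mathbf{K}\mathbf{u}_k = \tilde\lambda_k \mathbf{u}_k$:

```python
import numpy as np

def rbf(a, b, h=0.5):
    # gaussian RBF kernel matrix between two sets of points
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def eigensystem(X, h=0.5):
    """Eigenvectors/eigenvalues of the Gram matrix K, sorted descending."""
    K = rbf(X, X, h)
    lam, U = np.linalg.eigh(K)          # eigh returns ascending order
    return lam[::-1], U[:, ::-1], K

def nystrom_eigfun(x_new, X, U, lam, k, h=0.5):
    """Estimate phi_k at new points via the interpolatory formula
    phi_k(x') ~ (sqrt(N)/lam_k) * sum_n u_nk K(x', x_n)."""
    N = X.shape[0]
    kvec = rbf(x_new, X, h)             # shape (num_new, N)
    return np.sqrt(N) / lam[k] * kvec @ U[:, k]
```

Note that for trailing eigenvalues near zero the division by $\tilde\lambda_k$ is numerically delicate; in practice only the leading eigenfunctions are interpolated.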
3.1 Orthogonal Series Density Estimation from KPCA. It is clear that the eigenvalue decomposition of the Gram matrix $\mathbf{K}$ as in KPCA now provides estimates of the eigenfunctions associated with the kernel appearing in equation 3.1.³ Using the eigenvectors as finite sample estimates of the corresponding eigenfunctions, it is straightforward to see that the truncated estimate of the probability density function (see equation 2.4) at a point $\mathbf{x}'$

³ In fact, kernel PCA is an eigenvalue decomposition performed on the centered kernel matrix $\tilde{\mathbf{K}}$, which is related to the original kernel matrix by $\tilde{\mathbf{K}} = (\mathbf{I} - \mathbf{1}_N)\mathbf{K}(\mathbf{I} - \mathbf{1}_N)$, where $\mathbf{I}$ is an $N \times N$ identity matrix and $\mathbf{1}_N$ is an $N \times N$ matrix whose elements are all $1/N$.
follows as

$$\hat p_M(\mathbf{x}') = \sum_{k=1}^{M} \hat c_k \Psi_k(\mathbf{x}') \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{M} \frac{\tilde\lambda_k}{N}\, \hat\phi_k(\mathbf{x}_n)\, \hat\phi_k(\mathbf{x}') = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{M} \frac{\tilde\lambda_k}{N}\, \sqrt{N}\, u_{nk}\, \frac{\sqrt{N}}{\tilde\lambda_k}\sum_{l=1}^{N} u_{lk} K(\mathbf{x}', \mathbf{x}_l)$$

$$= \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{M} \sqrt{\tilde\lambda_k}\, u_{nk}\, \frac{1}{\sqrt{\tilde\lambda_k}}\sum_{l=1}^{N} u_{lk} K(\mathbf{x}', \mathbf{x}_l) = \mathbf{1}_N^T \left\{\sum_{k=1}^{M} \sqrt{\tilde\lambda_k}\, \mathbf{u}_k\, \frac{1}{\sqrt{\tilde\lambda_k}}\sum_{l=1}^{N} u_{lk} K(\mathbf{x}', \mathbf{x}_l)\right\},$$

where $\mathbf{1}_N$ is the $N \times 1$-dimensional vector whose individual elements are each the value $1/N$. It is worthy of note that the bracketed term appearing in the last line of the above equations is the projection of the kernel for the point $\mathbf{x}'$ onto the corresponding eigenvector, that is, the nonlinear principal component as in KPCA (Schölkopf et al., 1996). In other words, the density estimate is a weighted sum of the $M$ significant (the sense of this significance will be defined in the following section) normalized eigenvectors of the Gram matrix. The weighting terms correspond to the associated nonlinear principal components of the point under question, so in this sense, the features extracted using KPCA can be viewed as components of the estimated data density. The implication is that the choice of kernel and any associated parameters⁴ will have a profound effect on the quality of the associated density estimate. Therefore, good⁵ features will be extracted by KPCA when the Gram matrix is such that a faithful reconstruction of the underlying data density can be made from the associated eigenfunctions. The choice of kernel to use requires some consideration, and indeed it is toward this question that the focus of much research into kernel methods in machine learning is moving. From the viewpoint taken in this article, the choice of kernel is determined by the desire to model any density function. The gaussian, radial basis function (RBF) kernel has well-known universal approximation properties, and fitting a sufficient number of them to continuous data provides a means of estimating an arbitrary density function. This can be achieved either by semiparametric modeling using, for example, a mixture of gaussians, or by nonparametric modeling using the gaussian smoothing window. The enhanced performance of support vector machines (Schölkopf et al., 1999) when employing features extracted by

⁴ The width parameter, in the case of an RBF kernel.
⁵ In the sense of providing discrimination power for a classifier, for example.
KPCA on difficult classification problems using RBF kernels (Schölkopf et al., 1998) can now be understood from the perspective given in this article. The principled selection of other kernels will be motivated by prior data and domain knowledge, and this is an open area of research investigation. The eigenvectors extracted are estimates of the corresponding eigenfunctions of the continuous KL expansion; the accuracy of this estimate has been studied in Ogawa and Oja (1986). More recently, in Williams and Seeger (2000), the significance of the eigenvectors of the Gram matrix for the purposes of classification has been discussed in detail. A method for selecting the number of eigenvectors that should be retained in the expansion is presented in the following section.

4 Selecting the Length of Sequence

Now we see that this probability density function estimator can be written in compact format as

$$\hat p_M(\mathbf{x}') = \mathbf{1}_N^T \left\{\sum_{k=1}^{M} \sqrt{\tilde\lambda_k}\, \mathbf{u}_k\, \frac{1}{\sqrt{\tilde\lambda_k}}\sum_{l=1}^{N} u_{lk} K(\mathbf{x}', \mathbf{x}_l)\right\} = \mathbf{1}_N^T \mathbf{U}_M \mathbf{U}_M^T \mathbf{k}(\mathbf{x}'), \tag{4.1}$$

where $\mathbf{U}_M$ is the $N \times M$ matrix that retains $M$ of the eigenvectors of $\mathbf{K}$. For the case when $M = N$, as $\mathbf{U}\mathbf{U}^T = \mathbf{I}$, equation 4.1 reduces to the familiar form $p(\mathbf{x}') = \mathbf{1}_N^T \mathbf{k}(\mathbf{x}')$, the standard Parzen window density estimate. This representation of the orthogonal series estimator provides some insight into the kernel PCA approach to density estimation. It is clear from this that inserting the matrix $\mathbf{U}_M\mathbf{U}_M^T$ in the standard Parzen window estimator has a smoothing effect on the estimated density. In other words, if, by way of an example, a series of noisy observations is available to estimate the density, then the orthogonal series estimator will potentially smooth out the effects of the noise by selectively removing $N - M$ eigenvectors from the expansion. The length of the sequence of eigenvectors retained in the expansion requires consideration. The error generated by truncating the infinite expansion at $M$ can be shown (Kreyszig, 1989; Tou & Gonzalez, 1974) to give an overall integrated-square truncation error of $E = \sum_{k=M+1}^{\infty} c_k^2$, where each $c_k$ is defined in equation 2.2. Therefore, the corresponding error associated with each element of the expansion is

$$E_k = c_k^2 = \left\{\int p(\mathbf{x})\, \Psi_k(\mathbf{x})\, d\mathbf{x}\right\}^2 \approx \left\{\frac{1}{N}\sum_{n=1}^{N} \Psi_k(\mathbf{x}_n)\right\}^2 \approx \frac{\lambda_k}{N}\left\{\sum_{n=1}^{N} u_{nk}\right\}^2 \approx \tilde\lambda_k \left\{\mathbf{1}_N^T \mathbf{u}_k\right\}^2,$$
where $\mathbf{u}_k$ and $\tilde\lambda_k$ are the $k$th eigenvector and eigenvalue of $\mathbf{K}$, respectively. The approximate truncation error associated with each element in the series based on the $N$-sample density estimate is $\hat E_k = \tilde\lambda_k \{\mathbf{1}_N^T \mathbf{u}_k\}^2$. Therefore, expansion coefficients that have a small squared-sum value of the components will have a negligible effect on the overall integrated-square error.⁶ Although this error gives an indication of the contribution to the overall error from each component, it does not provide a stopping rule as such for the elimination of eigenvectors within the expansion. In Kronmal and Tarter (1968), the value of the mean integrated squared error is used in deriving a stopping criterion. This is one criterion that is appropriate for the application of kernel PCA in density estimation. In Diggle and Hall (1986), an alternative criterion is proposed; however, for the purposes of this work, the criterion of Kronmal and Tarter (1968) is adequate for the exposition of the ideas set forth. Details found in Kronmal and Tarter (1968) show that series elements that satisfy the following inequality should be considered for retention in the sequence:

$$\left\{\frac{1}{N}\sum_{n=1}^{N} \Psi_k(\mathbf{x}_n)\right\}^2 > \frac{2}{1+N}\left\{\frac{1}{N}\sum_{n=1}^{N} \Psi_k^2(\mathbf{x}_n)\right\}.$$
Substituting the eigenvector approximations for the eigenfunctions in the above equation gives the following cutoff threshold:

$$\left\{\mathbf{1}^T \mathbf{u}_k\right\}^2 > \frac{2N}{1+N}\left(\mathbf{u}_k^T \mathbf{u}_k\right) = \frac{2N}{1+N}. \tag{4.2}$$
If the sample size is large, then $\frac{2N}{1+N} \to 2$, and the stopping criterion is simply $\{\mathbf{1}^T \mathbf{u}_k\}^2 > 2$. Much has been written in the literature of nonparametric statistics regarding the stopping criteria for orthogonal series density estimators. (See Diggle & Hall, 1986, for an extensive overview of these.) This section has shown that the eigenvalue decomposition of the Gram matrix (KPCA) provides features that can be used in density function estimation based on a finite sample estimate of a truncated expansion of orthonormal basis functions. One criterion for the selection of the appropriate eigenvectors that will appear in the series has been considered. The following section presents the nonparametric estimation of Renyi entropy from a data sample. The importance of the constructed Gram matrix along with the associated eigenspectrum is considered.
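Equations 4.1 and 4.2 combine into a short estimator. The sketch below (NumPy; names are illustrative) retains only the eigenvectors passing the Kronmal-Tarter threshold and evaluates $\hat p_M(\mathbf{x}') = \mathbf{1}_N^T \mathbf{U}_M \mathbf{U}_M^T \mathbf{k}(\mathbf{x}')$; with all eigenvectors kept, it reduces to the Parzen window estimate:

```python
import numpy as np

def rbf(a, b, h=0.4):
    # normalized gaussian smoothing kernel matrix between point sets
    d = a.shape[1]
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)

def series_density(x_new, X, h=0.4, select=True):
    """Orthogonal-series density estimate p_M(x') = 1_N^T U_M U_M^T k(x').

    With select=True, eigenvectors are retained by the Kronmal-Tarter
    rule {1^T u_k}^2 > 2N/(1+N) (eq. 4.2); with select=False, all N
    eigenvectors are kept and the estimate reduces to the Parzen window.
    """
    N = X.shape[0]
    K = rbf(X, X, h)
    _, U = np.linalg.eigh(K)
    if select:
        U = U[:, (U.sum(axis=0) ** 2) > 2 * N / (1 + N)]
    one_N = np.full(N, 1.0 / N)          # vector with entries 1/N
    kvec = rbf(x_new, X, h)              # rows index the new points
    return kvec @ U @ (U.T @ one_N)
```

As the text notes, the truncated estimate is smoother than the Parzen window but, unlike it, is not guaranteed to be nonnegative everywhere.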
6 This does not take into account the error in approximating the eigenfunctions using the estimated eigenvectors.
5 Nonparametric Estimation of Quadratic Renyi Entropy

Thus far, the discussion regarding the form of the kernel appearing in equation 3.1 has been general, and no specific kernel has been assumed. Let us now consider specifically a gaussian RBF kernel. Note that the quadratic Renyi entropy, defined as

$$H_{R2}(X) = -\log \int p(\mathbf{x})^2\, d\mathbf{x}, \tag{5.1}$$
can easily be estimated using a nonparametric Parzen estimator based on an RBF kernel. The above integral formed a measure of distribution compactness in Friedman and Tukey (1974) and has recently been used for certain forms of information-theoretic learning (Principe, Fisher, & Xu, 2000). Denoting an isotropic gaussian computed at $\mathbf{x}$, centered at $\boldsymbol\mu$, with covariance $\boldsymbol\Sigma$ as $\mathcal{N}_\mathbf{x}(\boldsymbol\mu, \boldsymbol\Sigma)$, employing the standard result for a nonparametric density estimator using gaussian kernels, and noting the convolution theorem for gaussians, the following holds:

$$\int p(\mathbf{x})^2\, d\mathbf{x} \approx \int \hat p(\mathbf{x})^2\, d\mathbf{x} = \frac{1}{N^2}\int \left\{\sum_{i=1}^{N}\sum_{j=1}^{N} \mathcal{N}_\mathbf{x}(\mathbf{x}_i, \boldsymbol\Sigma)\, \mathcal{N}_\mathbf{x}(\mathbf{x}_j, \boldsymbol\Sigma)\right\} d\mathbf{x} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \mathcal{N}_{\mathbf{x}_i}(\mathbf{x}_j, 2\boldsymbol\Sigma).$$
For an RBF kernel with a common width of $2\boldsymbol\Sigma$, it is clear that the quadratic integral can be estimated from the sum of each element in the Gram matrix; in other words,

$$\int \hat p(\mathbf{x})^2\, d\mathbf{x} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{1}_N^T \mathbf{K}\mathbf{1}_N, \tag{5.2}$$

where each $N \times 1$ vector $\mathbf{1}_N$ has each element equal to $1/N$. Now we can see that the contribution to the overall estimated data entropy of each orthonormal component vector can be viewed using an eigenvalue decomposition of the Gram matrix:

$$\int \hat p(\mathbf{x})^2\, d\mathbf{x} = \sum_{k=1}^{N} \tilde\lambda_k \left\{\mathbf{1}_N^T \mathbf{u}_k\right\}^2 = \sum_{k=1}^{N} \hat E_k.$$
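Both routes to the quadratic integral, the double sum of equation 5.2 and the eigenspectrum form, are straightforward to verify numerically. A sketch under the RBF assumption (function names are illustrative):

```python
import numpy as np

def gauss_gram(X, var):
    # Gram matrix of normalized gaussian kernels with covariance var*I
    d = X.shape[1]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def renyi2_entropy(X, var=0.1):
    """Quadratic Renyi entropy estimate H_R2 = -log(1_N^T K 1_N).

    The Gram matrix uses covariance 2*var*I: the convolution of two
    gaussian windows of covariance var*I is a gaussian of covariance
    2*var*I (eq. 5.2)."""
    N = X.shape[0]
    K = gauss_gram(X, 2 * var)
    one_N = np.full(N, 1.0 / N)
    return -np.log(one_N @ K @ one_N)
```

Since $\mathbf{1}_N^T \mathbf{K}\mathbf{1}_N = \mathbf{1}_N^T \mathbf{U}\mathbf{S}\mathbf{U}^T \mathbf{1}_N$, the double-sum and eigenspectrum computations agree to machine precision.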
It is clear that large contributions to the entropy will come from components that have small values of $\tilde\lambda_k\{\mathbf{1}_N^T \mathbf{u}_k\}^2$ and can be attributed to elements with little or no structure. This can be considered as the contribution caused by
observation noise in some cases or diffuse regions in the data. Large values of $\tilde\lambda_k\{\mathbf{1}_N^T \mathbf{u}_k\}^2$ therefore indicate regions of high density or compactness and are also indicative of possible modes of the density or underlying class and cluster structure. Interestingly, the integral $\int \hat p(\mathbf{x})^2\, d\mathbf{x}$ considered in the computation of the quadratic Renyi entropy also defines the squared norm of the functional form of $\hat p(\mathbf{x})$, such that

$$\int \hat p(\mathbf{x})^2\, d\mathbf{x} = \|\hat p\|^2_{\mathcal H} = \langle \hat{\boldsymbol\mu}_\Psi \cdot \hat{\boldsymbol\mu}_\Psi \rangle = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\sum_{i=1}^{\infty} \lambda_i\, \phi_i(\mathbf{x}_n)\phi_i(\mathbf{x}_m) = \mathbf{1}_N^T \mathbf{K}\mathbf{1}_N = \mathbf{1}_N^T \mathbf{U}\mathbf{S}\mathbf{U}^T \mathbf{1}_N = \sum_{k=1}^{N} \hat E_k.$$
The above equation provides a more general result in that the kernel used in estimating the quadratic integral need not be restricted to an RBF form. The main point being made here is that the Gram matrix is fundamental to the estimation of the Renyi entropy, which is based on the data density. When creating a Gram matrix for the extraction of nonlinear features using, for example, KPCA, from the arguments presented in this and the previous section, the key is to choose a kernel that provides a reasonable estimate of the underlying data density. Because different types of kernel produce varying forms of associated eigenfunction, it is clear that the eigenfunctions should be appropriate for the density to be estimated. For example, an RBF kernel has eigenfunctions of the form of normalized Hermite polynomials (Zhu, Williams, Rohwer, & Morciniec, 1998; Williams & Seeger, 2001) and as such would be suitable for distributions with infinite support. The following section provides a number of illustrative examples.

6 Simulation

The first simulation provides a two-dimensional illustration of the extracted features from the Gram matrix and how these can be interpreted. Analytic solutions to the one-dimensional form of equation 3.1 have been provided in Zhu et al. (1998) and Williams and Seeger (2001). An RBF kernel is used, and the weighting function is taken as a gaussian, in which case the eigenfunctions take the form of normalized Hermite polynomials (Kreyszig, 1989). Normalized Hermite polynomials form an orthonormal sequence such that in the univariate case,

$$\Psi_i(x) = \frac{1}{\left(2^i\, i!\, \sqrt{\pi}\right)^{1/2}} \exp(-x^2/2)\, H_i(x), \tag{6.1}$$
where the Hermite polynomials $H_i(x)$ are defined by the recursion $H_0(x) = 1$ and

$$H_i(x) = (-1)^i \exp(x^2)\, \frac{d^i}{dx^i}\exp(-x^2). \tag{6.2}$$
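For computation, the Rodrigues form of equation 6.2 is usually replaced by the equivalent three-term recurrence $H_0(x) = 1$, $H_1(x) = 2x$, $H_{i+1}(x) = 2xH_i(x) - 2iH_{i-1}(x)$. A sketch of the normalized functions of equation 6.1:

```python
import numpy as np
from math import factorial, pi, sqrt

def hermite(i, x):
    """Physicists' Hermite polynomial H_i via the three-term
    recurrence, equivalent to the Rodrigues form of equation 6.2."""
    h_prev, h = np.ones_like(x), 2 * x
    if i == 0:
        return h_prev
    for n in range(1, i):
        h_prev, h = h, 2 * x * h - 2 * n * h_prev
    return h

def psi(i, x):
    """Normalized Hermite function Psi_i of equation 6.1, orthonormal
    on the real line."""
    norm = 1.0 / sqrt(2 ** i * factorial(i) * sqrt(pi))
    return norm * np.exp(-x ** 2 / 2) * hermite(i, x)
```

Numerical quadrature confirms the orthonormality $\int \Psi_i(x)\Psi_j(x)\,dx = \delta_{ij}$ that the series estimator relies on.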
Figure 1: A sample of 300 two-dimensional points where 100 are drawn from each of three two-dimensional gaussian clusters. (Left) Scatterplot of the 300 points drawn from the gaussians with an isotropic variance of value 0.1. (Right) The same points with additive gaussian noise whose variance is 0.38.
The generalization of the univariate form to a multivariate representation is straightforward (Tou & Gonzalez, 1974). Consider a sample of two-dimensional data points distributed in a similar manner to those presented in Schölkopf et al. (1998) for illustrative purposes. Three clusters of identical variance with value $\sigma = 0.1$ and centers $\boldsymbol\mu = [0.0, 0.7;\ 0.7, -0.7;\ -0.7, -0.7]$ are generated. The left-hand plot of Figure 1 shows the scatter plot of the data. The density of the data corresponds to the general form of mixture such that $p(\mathbf{x}) = \sum_{k}^{K} c_k \mathcal{N}_\mathbf{x}(\boldsymbol\mu_k, \sigma\mathbf{I})$, where the usual constraints $\sum_k c_k = 1$ hold. Now note that an orthonormal set of basis functions with respect to the density of the data $p(\mathbf{x})$ must satisfy

$$\sum_{k=1}^{K} c_k \int \mathcal{N}_\mathbf{x}(\boldsymbol\mu_k, \sigma\mathbf{I})\, Q_i(\mathbf{x}-\boldsymbol\mu_k)\, Q_j(\mathbf{x}-\boldsymbol\mu_k)\, d\mathbf{x} = \delta_{ij}. \tag{6.3}$$
Because the weighting function is a gaussian, two-dimensional Hermite polynomials of the form $H_i(\mathbf{x}-\boldsymbol\mu_k)$ will satisfy this requirement. This indicates that each of the components of the mixture (clusters in this instance) will have a set of orthonormal Hermite polynomial functions associated with them. An RBF kernel of equal width to the individual cluster components was used, and 300 points were drawn from the distribution (see Figure 1). An eigenvalue decomposition was performed on the associated Gram matrix, and the eigenvectors were used in forming the series density estimate.
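This setup is easy to reproduce (a sketch; the particular random draw, and hence the exact count of retained eigenvectors, differs from the article's figures):

```python
import numpy as np

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.7], [0.7, -0.7], [-0.7, -0.7]])
sigma2 = 0.1                                   # isotropic cluster variance
# 100 points from each of the three gaussian clusters
X = np.concatenate([c + np.sqrt(sigma2) * rng.standard_normal((100, 2))
                    for c in centers])

# Gram matrix with RBF width equal to the cluster variance
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * sigma2))

lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]                 # sort descending
N = X.shape[0]
# Kronmal-Tarter retention rule, eq. 4.2
retained = np.flatnonzero((U.sum(axis=0) ** 2) > 2 * N / (1 + N))
# For well-separated clusters, roughly one dominant "zeroth-order"
# eigenvector per cluster is expected to pass the criterion, while the
# higher, antisymmetric Hermite-like modes fail it.
```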
Figure 2: The nonlinear features extracted from the data of Figure 1, shown starting from the top row. The characteristic structure of the two-dimensional orthonormal Hermite polynomial functions is clearly evident.
Figure 2 shows the nonlinear features associated with each of the first sixteen eigenvectors. These are clearly estimates of the orthonormal Hermite polynomial expansion coefficients associated with the density. The first four extracted features for one of the clusters are shown in Figure 3. It is clear that the extracted features are indeed, up to a rotation and scaling, estimates of the orthonormal Hermite polynomial functions (see Figure 3). In essence, what we see is that the nonlinear features extracted by kernel PCA using an RBF kernel are estimates of the orthonormal Hermite polynomial components associated with the underlying data distribution. The empirical Kullback-Leibler divergence (KLD) between the actual density and each estimated density (a gaussian mixture density estimated using the expectation-maximization algorithm, and the KPCA series estimate) was computed for a range of the isotropic variance values ranging from 0.05 to 0.5, in 0.05 increments. The empirical KLD was computed
Figure 3: (Top) The first four eigenvectors associated with one of the clusters in the data. (Bottom) The first four two-dimensional orthonormal Hermite polynomial functions.
using the standard form $\mathrm{KLD}(p\,\|\,\hat p) = \sum_i p(\mathbf{x}_i)\log\{p(\mathbf{x}_i)/\hat p(\mathbf{x}_i)\}$, evaluated over a set of test points.
Three hundred sample points were used to create the Gram matrix and estimate the parameters of the gaussian mixture. Six hundred uniformly distributed points within the region defined in Figure 1 were used to compute the KLD for both methods. The mean KLD for the KPCA method was 0.036, compared to 0.043 for the mixture method, over the range of values. The comparison shows similar performance for both methods on this particular two-dimensional data, as would be expected. The next simulation examines the effect of additive noise on the KPCA density estimation method. The 300 points from the clustered data in the previous simulation had gaussian noise of variance 0.38 added; Figure 1 shows the original data and the noisy samples. From Figure 4, it is quite obvious that three components are all that is required to estimate the distribution corresponding to the noiseless data. However, in the case of the noisy data, a slowly decaying eigenspectrum can be seen (see Figure 5). However, when examining the contributions to the error of each eigenvector, it is still apparent that there are three significant generators of the data. The first three eigenvectors satisfy the Kronmal and Tarter criterion, after which a small number of eigenvectors satisfy the criterion. The right-hand plot of Figure 6 shows the estimated density contour plots when the first three eigenvectors are retained in the series expansion. This should be contrasted with that which a Parzen window estimator yields (middle plot of Figure 6). The smoothing effect can be noted, due to the removal of series elements that capture the diffuse areas within the data, and the sharpening of the three modes in the sample.
Figure 4: (Top) The first 50 eigenvalues of the Gram matrix created from the noiseless and well-separated clusters of Figure 1. Due to the distinct structure in the data, there are only three dominant eigenvalues. (Bottom) The contribution to the overall integrated square error of each of the first 50 eigenvectors. Again it is clear that only three eigenvectors satisfy the Kronmal and Tarter (1968) criterion, and it is these that are required for the density estimate.
The final illustrative simulation uses data drawn from a uniform distribution with finite support. The left-hand plot of Figure 7 shows the data drawn from a uniform distribution within the annular region that satisfies 9 ≤ x² + y² ≤ 25. The Parzen window estimated density iso-contours are superimposed on these. The adjacent plot in Figure 7 gives the contour plot of the estimated density using the kernel PCA method. Figures 8 and 9 show the related nonlinear features and the relative importance of each.

7 Conclusion

Kernel PCA has proven to be an extremely useful method for extracting nonlinear features from a data set, and its utility has been demonstrated
Figure 5: (Top) The first 50 eigenvalues of the Gram matrix created from the noisy clustered data of Figure 1. It is apparent that the slow exponential decay of the eigenvalues is attributed to the high level of additive noise on the finite number of observations. The fast decay of the previous example is now difficult to discern from the eigenvalues alone. (Bottom) The contribution to the overall integrated square error of each of the first 50 eigenvectors. It is apparent that there are only three dominant eigenvectors required for the majority of the density estimate. It is also clear that only the first three eigenvectors satisfy the Kronmal and Tarter (1968) criterion before it is violated, and it is these that are retained for the density estimate.
on, among other applications, many complex and demanding classification problems. An intuitive insight into the nature of these particular features has been somewhat lacking to date. This article has presented an argument that the nonlinear features extracted using KPCA (the eigendecomposition of a Gram matrix created using a specific kernel) provide features that can be considered as components of an orthogonal series density estimate. This follows somewhat from the observations made in Williams and Seeger (2000) regarding the effect of the data distribution on kernel-based classifiers.
Figure 6: (Left) The estimated probability density contour plot of the data consisting of the three noiseless clusters using the kernel PCA-based orthogonal series approach. A Parzen window estimator using a gaussian kernel with a width of 0.1 gives an identical result. (Middle) The contour plots of the density estimate, using a Parzen window estimator, for the noisy data in Figure 1. (Right) The density for the noisy data using an orthogonal series estimator that consists of the first three eigenvectors of the Gram matrix that satisfied the Kronmal and Tarter (1968) criterion. A smoothing of the effects of the noise on the density estimate is apparent in this example.
Figure 7: (Left) The scatterplot of 1000 points drawn from a uniform annular ring centered at the origin with uniform width. Superimposed on this are the iso-contours of the estimated probability density using a Parzen window estimator. (Right) The iso-contours of the estimated probability density using the kernel PCA method. An RBF kernel of unit width was used in this experiment. It is apparent that the uniform region of support has been extended in the density estimate due to the infinite support of the RBF kernel. Only eight eigenvectors satisfied the stopping criterion, and these were retained in the series expansion estimate, which amounts to a representation that uses 0.8% of the possible features.
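The annular data set and the Parzen window reference estimate can be reproduced with a short sketch. The kernel width and the evaluation points below are illustrative choices on my part, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# draw 1000 points uniformly from the annulus 9 <= x^2 + y^2 <= 25
n = 1000
theta = rng.uniform(0.0, 2.0 * np.pi, n)
radius = np.sqrt(rng.uniform(9.0, 25.0, n))    # sqrt gives uniform area density
data = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

def parzen(x, sample, h=1.0):
    # Parzen window estimate with an isotropic gaussian kernel of width h
    d2 = np.sum((sample - x) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * h * h))) / (2.0 * np.pi * h * h)

on_ring = parzen(np.array([4.0, 0.0]), data)   # inside the annular support
in_hole = parzen(np.array([0.0, 0.0]), data)   # the central hole
```

The gaussian kernel's infinite support is visible even here: the estimate in the central hole is small but strictly positive, which is exactly the extension of the support region noted in the caption.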
Figure 8: (Top) The eigenvalue spectrum for the first 50 eigenvalues of the kernel. (Bottom) The values of λ̃_k (1_N' u_k)² associated with each eigenvector. Eight of the eigenvectors satisfy the Kronmal and Tarter (1968) criterion.
The probability density function estimate provided by the relevant M eigenvectors, p̂_M(x') = 1_N' U_M U_M' k(x'), can be seen to be a smoothed Parzen window estimate, where the matrix U_M U_M' acts to smooth the estimate based on the data sample. In some sense, this can be seen as a reduced-set representation of the density function estimate based on the retained eigenvectors of the Gram matrix. The decomposition of the Gram matrix shows how each of the eigenvectors contributes to the overall data entropy (or norm of the functional form of the estimated density). Components that are related to the possible class structure (or modes) have a large sum-squared value, while those that are attributed to unstructured noise have low values. These particular values correspond to the induced error in the series density estimate when the related eigenvectors are discarded.

Figure 9: The nonlinear features (eigenvectors) of the Gram matrix. The characteristic Hermite polynomial structure of the features is most apparent.

One point of note is the accuracy of the eigenvectors as estimates of the corresponding eigenfunctions and their effect on the density estimate. It is noted that the series representation is very sparse, with only eight components required to form the series density estimator for the data uniformly distributed within the annular region (see Figure 7). This amounts to the removal of 99.2% of the possible components in the series, with the large majority, corresponding to the smaller eigenvalues, being discarded. Williams and Seeger (2001) show that for an RBF kernel, the accuracy of the estimates of the dominant eigenvalues is good, and this deteriorates for the estimation of the smaller values. The important components for the density function estimate are situated at the top end of the eigenspectrum, which do not suffer the effects of poor estimation. The kernel PCA decomposition thus provides an orthogonal series density estimate and an insight into the significance of the associated nonlinear features. This view may prove useful when considering kernel PCA as a means of nonlinear feature extraction for classifier design or data clustering and, of course, nonparametric density estimation.
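A minimal sketch of the reduced-set series estimator p̂_M(x') = 1_N' U_M U_M' k(x') follows, under two assumptions on my part: that 1_N is the vector with entries 1/N, and that the kernel is a normalized (density-valued) RBF. On a one-dimensional two-cluster sample, retaining all eigenvectors recovers the Parzen window estimate exactly, since the full eigenvector matrix satisfies U U' = I:

```python
import numpy as np

rng = np.random.default_rng(1)
# 1-d sample from a two-component gaussian mixture (illustrative data)
data = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 0.5, 150)])
N, h = len(data), 0.5

def k(a, b):
    # normalized gaussian (RBF) kernel of width h
    return np.exp(-(a - b) ** 2 / (2 * h * h)) / (h * np.sqrt(2 * np.pi))

K = k(data[:, None], data[None, :])            # Gram matrix
evals, evecs = np.linalg.eigh(K)
order = np.argsort(evals)[::-1]                # largest eigenvalues first
evals, evecs = evals[order], evecs[:, order]

ones = np.full(N, 1.0 / N)                     # assumed form of the 1_N vector

def series_density(x, M):
    # reduced-set estimate p_M(x) = 1_N' U_M U_M' k(x)
    U = evecs[:, :M]
    return ones @ U @ (U.T @ k(data, x))

parzen = lambda x: np.mean(k(data, x))         # full Parzen window estimate

sparse_est = series_density(0.0, 2)            # only two retained eigenvectors
```

With a small M, the estimate keeps only the dominant structured components, which is the smoothing behavior described above.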
Acknowledgments

This work was carried out under the support of the Finnish National Technology Agency TEKES. I am grateful to Harri Valpola and Erkki Oja for helpful discussions regarding this work. Many thanks to the anonymous reviewers for their helpful comments.

References

Delves, L. M., & Mohamed, J. L. (1985). Computational methods for integral equations. Cambridge: Cambridge University Press.
Diggle, P. J., & Hall, P. (1986). The selection of terms in an orthogonal series density estimator. Journal of the American Statistical Association, 81, 230-233.
Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23, 881-890.
Izenman, A. J. (1991). Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86, 205-224.
Kreyszig, E. (1989). Introductory functional analysis with applications. New York: Wiley.
Kronmal, R., & Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier series methods. Journal of the American Statistical Association, 63, 925-952.
Mukherjee, S., & Vapnik, V. (1999). Support vector method for multivariate density estimation (AI Memo 1653). Available at: ftp://publications.ai.mit.edu/ai-publications/1500-1999/AIM-1653.ps.
Ogawa, H., & Oja, E. (1986). Can we solve the continuous Karhunen-Loève eigenproblem from discrete data? Transactions of the IECE of Japan, 69(9), 1020-1029.
Principe, J., Fisher III, J., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering. New York: Wiley.
Rosipal, R., & Girolami, M. (2001). An expectation maximisation approach to nonlinear component analysis. Neural Computation, 13(3), 500-505.
Schölkopf, B., Burges, C., & Smola, A. (Eds.). (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K. R. (1996). Nonlinear component analysis as a kernel eigenvalue problem (Tech. Rep. MPI TR 44). Available at: http://www.kernel-machines.org.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.
Tou, J. T., & Gonzalez, R. C. (1974). Pattern recognition principles. Reading, MA: Addison-Wesley.
Williams, C. K. I., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In P. Langley (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1159-1166). San Mateo, CA: Morgan Kaufmann.
Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682-688). Cambridge, MA: MIT Press.
Zhu, H., Williams, C. K. I., Rohwer, R. J., & Morciniec, M. (1998). Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop (Ed.), Neural networks and machine learning. Berlin: Springer-Verlag.

Received September 26, 2000; accepted May 30, 2001.
LETTER
Communicated by Carsten Peterson
Natural Discriminant Analysis Using Interactive Potts Models

Jiann-Ming Wu
[email protected]
Department of Applied Mathematics, National Donghwa University, Shoufeng, Hualien 941, Taiwan, Republic of China

Natural discriminant analysis based on interactive Potts models is developed in this work. A generative model composed of piece-wise multivariate gaussian distributions is used to characterize the input space, exploring the embedded clustering and mixing structures and developing proper internal representations of input parameters. The maximization of a log-likelihood function measuring the fitness of all input parameters to the generative model, and the minimization of a design cost summing up square errors between posterior outputs and desired outputs, constitute a mathematical framework for discriminant analysis. We apply a hybrid of the mean-field annealing and the gradient-descent methods to the optimization of this framework and obtain multiple sets of interactive dynamics, which realize coupled Potts models for discriminant analysis. The new learning process is a whole process of component analysis, clustering analysis, and labeling analysis. Its major improvement compared to the radial basis function and the support vector machine is described by using some artificial examples and a real-world application to breast cancer diagnosis.

1 Introduction
The task of discriminant analysis (Hastie & Simard, 1998; Hastie & Tibshirani, 1996; Hastie, Tibshirani, & Buja, 1994; Hastie, Buja, & Tibshirani, 1995) aims to achieve a mapping function from a parameter space R^d to a set of discrete and nonordered output labels S subject to interpolating conditions proposed by training samples {(x_i, q_i), 1 ≤ i ≤ N, x_i ∈ R^d, q_i ∈ S}. S is represented by {e_1^M, e_2^M, ..., e_M^M} for the discrete and nonordered property of output labels, where M denotes the number of categories and e_m^M is a unitary vector of M elements with only the mth bit one. The category of an item is predicted by d measurements of features x ∈ R^d. The following design cost quantitatively measures the fitness of all training samples to a mapping function g: R^d → S,

D = Σ_{1≤i≤N} L(g(x_i), q_i),   (1.1)
Neural Computation 14, 689–713 (2002)
© 2002 Massachusetts Institute of Technology
where L(·) denotes an arbitrary distance between two unitary vectors. A mapping function absolutely minimizing the design cost automatically satisfies all interpolating conditions proposed by training samples; however, such a mapping function can be obtained by simply recording all samples in a look-up table. This is not the purpose of supervised learning, since without any other objective, a learning process subject only to the design cost is an ill-posed problem. Coming from a future validation by testing samples, the generalization cost proposes another essential criterion. Both the training set and the testing set are assumed to have the same underlying input-output relation. An effective supervised learning process is expected to minimize the design cost as well as the generalization cost. The minimization of this generalization cost has been considered as the reduction of model size in the field of statistics (Hastie & Simard, 1998; Hastie & Tibshirani, 1996; Hastie et al., 1994, 1995), which is used for maximal generalization in this work.

The derivation of our new learning process starts with a generative model responsible for characterizing the parameter space on the basis of a nonoverlapping partition. Assume that there exist K internal regions V_k, 1 ≤ k ≤ K, in the partition; each region V_k is centered at y_k and is defined by V_k = {x | arg min_j ||x − y_j||_A = k, x ∈ R^d}, where ||x||_A denotes the Mahalanobis distance √(x'Ax). The local distribution of the input parameter in each region V_k is modeled by a multivariate gaussian distribution with a mean vector at the center y_k and a covariance matrix A_k. As a special case, all local generative models are assumed to have the same covariance matrix A in this work to facilitate our presentation. The kernels {y_k} partition the space R^d into K nonoverlapping subspaces with the property ∪_k V_k = R^d and V_k1 ∩ V_k2 = ∅ for all k1 ≠ k2. The fitness of all parameters in each internal region V_k to the corresponding local generative model is quantitatively measured by a log-likelihood function. In this work, the supervised learning process for discriminant analysis is a process of maximizing the sum of all log-likelihood functions and minimizing the design cost.

This work insists on solving the task of discriminant analysis by collective decisions performed by the architecture of neural networks. The formulation of the above two objectives involves two kinds of variables: discrete combinatorial variables and continuous geometrical variables. The resulting optimization framework is a mixed integer and linear programming, of which the optimization is difficult for the gradient-descent method due to numerous shallow local minima within the corresponding energy function. The Potts encoding, which possesses flexibility in internal representations and reliability in collective decisions, is employed to deal with this computational difficulty. The Potts encoding is suitable for the design of neural networks and has been applied to fundamental complex tasks, including combinatorial optimization (Peterson & Söderberg, 1989), self-organization (Liou & Wu, 1996; Rose, Gurewitz, & Fox, 1990, 1993), classification, and regression (Rao, Miller, Rose, & Gersho, 1999). The multistate
Potts neuron, generalized from the two-state spin neuron, is used to reduce the search space of feasible configurations and realize the problem modeling for effective internal representations. In this work, the combinatorial internal representations include the assignment of each input parameter x_i to one and only one internal region, denoted by the membership vector δ_i ∈ {e_1^K, ..., e_K^K}, and the dynamical assignment of each region V_k to one output label, denoted by the category response ξ_k ∈ {e_1^M, ..., e_M^M}. Each δ_i or ξ_k is considered as a discrete Potts neural variable. The continuous geometrical variables include the kernels {y_k} and the common covariance matrix A. By these representations, the maximization of the sum of all log-likelihood functions, each measuring the fitness of the corresponding local generative model, and the minimization of the design cost together form a mixed integer and linear programming and lead to a novel energy function for discriminant analysis. All these variables, {δ_i}, {ξ_k}, {y_k}, and A, are collectively optimized by a hybrid of the mean-field annealing and the gradient-descent methods toward the minimization of the energy function. The resulting learning process consists of four sets of interactive dynamics characterizing the coupled Potts models of discriminant analysis. The evolution of the four sets of interactive dynamics is well controlled by an annealing process for the minimization of the energy function. The annealing process is analogous to physical annealing, which is a process of gradually and carefully scaling the temperature from a sufficiently large scale to a small one. At each temperature, the mean configuration of the whole system is a balancing result of trading off minimizing the mean energy against maximizing the entropy. When this process is used, mean activations of a Potts neuron, indicating probabilities of active states, are increasingly influenced by injected mean fields.
At the beginning, mean activations are independent of injected mean fields; the system is ruled over by the principle of maximal entropy; a Potts neuron has almost the same probability of activating each of its states. As the process progresses, the symmetry is broken; each Potts neuron has a decreasing degree of freedom; the mean configuration of the system is increasingly dominated by the tendency toward the minimal mean energy and decreasingly by the criterion of maximal entropy. Toward the end of the process, the mean configuration is totally controlled by the force of minimal mean energy; mean activations of a Potts neuron behave winner-take-all. The Potts encoding has been applied to model collective decisions (Liou & Wu, 1996; Peterson & Söderberg, 1989). Its applicability to the task of discriminant analysis is explored in this work.
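The transition just described, from near-uniform activations at high temperature to winner-take-all at low temperature, is simply the β-dependence of the Potts (softmax) mean activations. A minimal numerical illustration, with arbitrary mean-field values:

```python
import numpy as np

def potts_mean(v, beta):
    # mean activations of a Potts neuron: softmax of mean fields v
    # at inverse temperature beta (shifted by max(v) for numerical stability)
    e = np.exp(beta * (v - np.max(v)))
    return e / e.sum()

v = np.array([1.0, 0.5, 0.2])               # illustrative injected mean fields

high_T = potts_mean(v, beta=0.001)          # high temperature: nearly uniform
low_T  = potts_mean(v, beta=100.0)          # low temperature: winner-take-all
```

At β near zero every state is activated with probability close to 1/3; at large β virtually all probability mass sits on the state with the largest mean field.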
1.1 The Learning Network. The learning process is composed of four sets of interactive dynamics, which constitute the coupled Potts model in architecture. The coupled Potts model is a modular recurrent neural network
Figure 1: The learning network is composed of four interactive dynamics and the interconnection network.
for supervised learning. As shown in Figure 1, the coupled Potts model consists of four interactive modules, of which two are Potts neural networks calculating the means ⟨δ⟩ and ⟨ξ⟩ of the combinatorial variables {δ_i} and {ξ_k}, respectively, and the others are linear networks for updating the kernels {y_k} and the covariance matrix A. The four interactive modules communicate with each other through interconnection networks. The same learning network has appeared in implementing the recurrent backpropagation (Pineda, 1987, 1989) with two modules.

1.2 The Discriminant Network. The discriminant network derived by the new learning process is closely related to a network of multilayer perceptrons (MLP) (Rumelhart & McClelland, 1986; Pineda, 1987, 1989) and radial basis functions (RBF) (Benaim, 1994; Freeman & Saad, 1995; Girosi, Jones, & Poggio, 1995; Girosi, 1998; Moody & Darken, 1989). It is a network of normalized radial basis functions (RBFs) with generalized hidden units. The hidden units of a normalized RBF network (Moody & Darken, 1989) use a normalized gaussian activation function,

G_k^I(x) = exp(−||x − y_k||² / 2σ²) / Σ_j exp(−||x − y_j||² / 2σ²),   (1.2)
with a scalar variance σ². The hidden units of the current recognizing network have a normalized multivariate gaussian activation function,

G_k^A(x) = exp(−β(x − y_k)'A(x − y_k)) / Σ_j exp(−β(x − y_j)'A(x − y_j)),   (1.3)

where A is a covariance matrix and β is the inverse of an artificial temperature used in the annealing process. By normalization, we mean that Σ_j G_j^A(x) = 1 for any x. G_k^I(x) is a special case of G_k^A(x), whereas the function G_k^A(x) can be translated to the form of G_k^I(z) if one rewrites the term (x − y_k)'A(x − y_k) as (z − z_k)'(z − z_k) with z_k = B y_k and z = B x, where B'B = A. By this translation, the function G_k^A(x) is decomposed as the composition of z = Bx and G_k^I(z). The current recognizing network is exactly the composition of a linear transformation and a normalized RBF network. The learning process derived in this work can be directly applied to a normalized RBF network by fixing the covariance matrix as I, and it is also applicable to an MLP network based on the connection (Girosi et al., 1995; Girosi, 1998) between a normalized RBF network and an MLP network. Practical experiments (Miller & Uyar, 1998) have shown the gradient-descent-based learning algorithms, including the backpropagation algorithm for an MLP network and the learning algorithm (Moody & Darken, 1988, 1989) for an RBF network, suffering from the trap of tremendous local minima in optimizing their internal representations. Based on a hybrid of the mean-field annealing and the gradient-descent method, the new learning process proposed in this work is essential for developing effective nonlinear boundaries, well optimizing the internal representation for the parameter space of a real application.

The normalized multivariate gaussian activation function in equation 1.3 defines an overlapping partition of the parameter space, where the overlapping degree is modulated by the β parameter, and indirectly by the annealing process. We consider the function G_k^A(x) as the projection probability of assigning an input parameter x to an internal region V_k. For an input parameter x, all its projection probabilities {G_k^A(x), 1 ≤ k ≤ K} characterize different partition phases; at a sufficiently low β, they are nearly identical to 1/K, denoting a completely overlapping partition. As the β value increases, they become asymmetric for some degree of overlapping partition. At a sufficiently large β, they behave winner-take-all, such that only the winner G_{k*}^A(x) is one and the others are zero, where k* = arg min_k ||x − y_k||_A. If each region V_k is equipped with an optimal category response ξ_k, based on the above projection mechanism, the mapping function of the current recognizing network is

g(x) = Σ_k G_k^A(x) ξ_k.   (1.4)
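Equations 1.3 and 1.4 can be sketched directly. The kernels, category responses, and β values below are hypothetical choices for illustration:

```python
import numpy as np

def activations(x, Y, A, beta):
    # normalized multivariate gaussian activations G_k^A(x) of equation 1.3
    diffs = x - Y                                  # rows are x - y_k
    q = np.einsum('kd,de,ke->k', diffs, A, diffs)  # (x - y_k)' A (x - y_k)
    e = np.exp(-beta * (q - q.min()))              # shift for numerical stability
    return e / e.sum()

def g(x, Y, A, Xi, beta):
    # mapping function g(x) = sum_k G_k^A(x) xi_k of equation 1.4
    return activations(x, Y, A, beta) @ Xi

# hypothetical two-region, two-class setup
Y  = np.array([[0.0, 0.0], [4.0, 0.0]])   # kernels y_k
A  = np.eye(2)                            # common matrix A
Xi = np.eye(2)                            # rows are category responses xi_k

soft = g(np.array([0.5, 0.0]), Y, A, Xi, beta=1.0)
hard = g(np.array([0.5, 0.0]), Y, A, Xi, beta=50.0)
```

As β grows, the output approaches the one-hot category response of the nearest region, which is the winner-take-all limit discussed above.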
This discriminant function at a sufficiently large β is similar to the nearest prototype classifier, but with a generalized distance measure || · ||_A,

g(x) = ξ_{k*},   k* = arg min_k ||x − y_k||_A,   (1.5)

and can be further translated to a composition of

g(z) = ξ_{k*},   (1.6)
k* = arg min_k ||z − z_k||   and   z = Bx,   (1.7)
where z_k = B y_k and A = B'B. If A = I, the latter form exactly defines the nearest prototype classifier (Rao et al., 1999), of which the measure is the Euclidean distance. The partition of {z_k} of the parameter space is nonoverlapping, and each internal region is attached with its own category response. The nearest prototype classifier is known to be suitable for the case with statistically independent components, but for most real applications, this assumption is not valid, so a preprocessor for feature extraction, like the above linear transformation z = Bx, is usually additionally employed. But the development of a preprocessor, such as using independent component analysis (Lin, Grier, & Cowan, 1997; Makeig, Jung, & Bell, 1997) or principal component analysis, is independent of the formation of a classifier; the combined discriminant function may suffer from the inconsistency between the extracted features and the classifier. Alternatively, based on the Mahalanobis distance, the new learning process in this work focuses on the mapping function in equation 1.5 and directly explores the whole discriminant process of component analysis, clustering analysis, and labeling analysis.

In the next section, we introduce the generative model for characterizing the parameter space and derive a mathematical framework for discriminant analysis. Four sets of interactive dynamics and the coupled Potts model are developed in section 3. Another issue in this work is incremental learning, a procedure for determining the optimal number of internal regions or the minimal model size for maximal generalization. The incremental learning scheme is introduced in section 4.
In the final section, we test the new method in comparison with RBF networks (Müller et al., 1999; Rätsch, Onoda, & Müller, 2001) and the support vector machine (Vapnik, 1995; Platt, 1999; Cawley, 2000) using artificial examples and a real-world application to breast cancer diagnosis (Wolberg & Mangasarian, 1990; Malini Lamego, 2001), and discussions of the simulation results are described.
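The translation in equations 1.5 through 1.7 rests on the identity ||x − y||_A = ||Bx − By|| for any B with B'B = A. A quick numerical check, using an arbitrary positive definite A and a Cholesky factor for B (both chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # an arbitrary symmetric positive definite A
B = np.linalg.cholesky(A).T           # A = L L', so B = L' satisfies B'B = A

x, y = rng.normal(size=2), rng.normal(size=2)

maha = np.sqrt((x - y) @ A @ (x - y))   # ||x - y||_A, the Mahalanobis distance
eucl = np.linalg.norm(B @ x - B @ y)    # ||z - z'|| with z = Bx, the Euclidean distance
```

Because the two quantities coincide, nearest-prototype classification under || · ||_A in the original space equals Euclidean nearest-prototype classification after the linear transformation z = Bx.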
2 Supervised Learning for Discriminant Analysis

2.1 The Generative Model. The generative model for a parameter space R^d is composed of K piece-wise multivariate normal distributions. Each is of the form

p_k(x) = (1 / ((2π)^{d/2} det(A^{-1})^{1/2})) exp(−(1/2)(x − y_k)'A(x − y_k)),   (2.1)

centered at the vector y_k with a nonsingular covariance matrix A. The K distributions have been assumed to have the same covariance matrix. These kernels {y_k} form K internal regions {V_k} in the parameter space, which are nonoverlapping, such that ∪_k V_k = R^d and V_k1 ∩ V_k2 = ∅ for all k1 ≠ k2. For each internal region V_k, the fitness of p_k(x) to all input parameters x_i ∈ V_k is measured proportional to the following log-likelihood function:

l_k = log Π_{x_i ∈ V_k} p_k(x_i).   (2.2)
A summation of all l_k leads to the following function:

l = Σ_k l_k
  = Σ_k log Π_{x_i ∈ V_k} p_k(x_i)
  = Σ_k Σ_{x_i ∈ V_k} log p_k(x_i).   (2.3)
Recall that the assignment of each input x_i to one of the K internal regions has been represented by a membership vector δ_i, of which each element δ_ik is either one or zero and Σ_k δ_ik = 1. The function l can be rewritten as

l = Σ_i Σ_k δ_ik log p_k(x_i)
  = −(1/2) Σ_i Σ_k δ_ik (x_i − y_k)'A(x_i − y_k) − (N/2) log det(A^{-1}) − (Nd/2) log(2π),   (2.4)

where det(·) denotes the determinant of a matrix. By the fact det(A^{-1}) = det(A)^{-1} and neglecting the last constant term, we obtain the following objective:

E1 = (1/2) Σ_i Σ_k δ_ik (x_i − y_k)'A(x_i − y_k) − (N/2) log det(A).   (2.5)
Maximizing the function l is equivalent to minimizing the function E1 .
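The objective of equation 2.5 can be evaluated directly. The small example below uses hypothetical values and simply confirms that, for fixed hard memberships, E1 is lower when the kernels sit at the cluster means than when they are displaced:

```python
import numpy as np

def E1(delta, Y, A, X):
    # equation 2.5: the negative log-likelihood objective up to a constant
    N = X.shape[0]
    diffs = X[:, None, :] - Y[None, :, :]                  # (N, K, d) terms x_i - y_k
    quad = np.einsum('nkd,de,nke->nk', diffs, A, diffs)    # (x_i - y_k)' A (x_i - y_k)
    return 0.5 * np.sum(delta * quad) - 0.5 * N * np.log(np.linalg.det(A))

# hypothetical data: two clusters of two points each, hard memberships
X = np.array([[0., 0.], [2., 0.], [10., 0.], [12., 0.]])
delta = np.repeat(np.eye(2), 2, axis=0)
A = np.eye(2)

centered  = E1(delta, np.array([[1., 0.], [11., 0.]]), A, X)   # kernels at the means
displaced = E1(delta, np.array([[0., 0.], [10., 0.]]), A, X)   # kernels off the means
```

This is the behavior that the closed-form kernel update of section 3 exploits: for fixed memberships, the weighted means minimize the quadratic term of E1.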
2.2 A Mathematical Framework. Recall that each region V_k has been attached with a category response ξ_k ∈ {e_1^M, ..., e_M^M} for classification. The design cost in equation 1.1 can be expressed as

E2 = (1/2) Σ_i ||q_i − Σ_k δ_ik ξ_k||² = (1/2) Σ_i ||q_i − Λδ_i||²,   (2.6)

where the matrix Λ = [ξ_1, ..., ξ_k, ..., ξ_K]. By combining the two objectives of equations 2.5 and 2.6 and injecting all constraints, we have the following mathematical framework for discriminant analysis. Minimize

E(δ, ξ, y, A) = E1 + cE2   (2.7)
  = (1/2) Σ_i Σ_k δ_ik (x_i − y_k)'A(x_i − y_k) − (N/2) log det(A) + (c/2) Σ_i ||q_i − Λδ_i||²,   (2.8)

subject to

δ_ik ∈ {0, 1} for all i, k;   Σ_k δ_ik = 1 for all i;
ξ_km ∈ {0, 1} for all k, m;   Σ_m ξ_km = 1 for all k,   (2.9)

where δ, ξ, and y denote the collections {δ_i}, {ξ_k}, and {y_k}, respectively, and c is a weighting constant. The learning process for discriminant analysis turns to search for a set of δ, ξ, y, and A that minimizes the weighted sum of the negative log-likelihood function and the design cost subject to the set of constraints in equation 2.9. We consider the mathematical framework in equations 2.8 and 2.9 as a mixed integer and linear programming, of which {y_k} and A are continuous geometrical variables and {δ_i} and {ξ_k} are discrete combinatorial variables. In the following section, we employ a hybrid of the mean-field annealing and gradient-descent methods to optimize all these variables simultaneously.

3 Interactive Dynamics and Coupled Potts Models
A hybrid of the mean-field annealing and the gradient-descent methods is applied to the above mixed integer and linear programming. As a result,
four sets of interactive dynamics are developed for the variables A, {y_k}, {δ_i}, and {ξ_k}, respectively. These dynamics interact following a process analogous to physical annealing and perform a parallel and distributed learning process for discriminant analysis. When we relate each vector δ_i or ξ_k to a Potts neuron, the two unitary constraints in equation 2.9 are subsequently taken over by the normalization of the Potts activation function. The above mathematical framework is reduced to the minimization of the energy function E. Fixing the matrix A and {y_k}, the mean-field annealing traces the mean configurations ⟨δ⟩ and ⟨ξ⟩, emulating thermal equilibrium at each temperature. It follows that the probability of a system configuration is proportional to the Boltzmann distribution:

Pr(δ, ξ) ∝ exp(−βE(δ, ξ)).   (3.1)

Following the annealing process to a sufficiently large β value, the Boltzmann distribution is ultimately dominated by the optimal configuration,

lim_{β→∞} Pr(δ*, ξ*) = 1,

where

E(δ*, ξ*) = min_{δ,ξ} E(δ, ξ).

The annealing process gradually increases the parameter β from a sufficiently low value to a large one. At each β value, the process iteratively executes the mean-field equations to a stationary point, which represents the mean configuration at thermal equilibrium. The mean configuration obtained at each β value is used as the initial configuration for the process at the subsequent β value. The mean-field equations can be derived from the following free energy function, which is similar to that proposed by Peterson and Söderberg (1989):

Ψ(y, A, ⟨δ⟩, ⟨ξ⟩, v, u) = E(y, A, ⟨δ⟩, ⟨ξ⟩) + Σ_i Σ_k ⟨δ_ik⟩ v_ik + Σ_k Σ_m ⟨ξ_km⟩ u_km   (3.2)
  − (1/β) Σ_i ln(Σ_k exp(βv_ik)) − (1/β) Σ_k ln(Σ_m exp(βu_km)),   (3.3)

where ⟨δ⟩, ⟨ξ⟩, u, and v denote {⟨δ_i⟩}, {⟨ξ_k⟩}, {u_km}, and {v_ik}, respectively, and u_k and v_i are auxiliary vectors. When fixing y, A, and β, a saddle point of the
free energy satisfies the following conditions:

∂Ψ/∂⟨δ_i⟩ = 0,   ∂Ψ/∂v_i = 0, for all i,
∂Ψ/∂⟨ξ_k⟩ = 0,   ∂Ψ/∂u_k = 0, for all k.

These lead to two sets of mean-field equations in the following vector form:

v_i = −∂E(y, A, ⟨δ⟩, ⟨ξ⟩)/∂⟨δ_i⟩,
with elements v_ik = −(1/2)(x_i − y_k)'A(x_i − y_k) + c ξ_k'(q_i − Λ⟨δ_i⟩),   (3.4)

⟨δ_i⟩ = [exp(βv_i1)/Σ_h exp(βv_ih), ..., exp(βv_iK)/Σ_h exp(βv_ih)]',   (3.5)

u_k = −∂E(y, A, ⟨δ⟩, ⟨ξ⟩)/∂⟨ξ_k⟩ = c Σ_i ⟨δ_ik⟩ (q_i − Λ⟨δ_i⟩),   (3.6)

⟨ξ_k⟩ = [exp(βu_k1)/Σ_m exp(βu_km), ..., exp(βu_kM)/Σ_m exp(βu_km)]'.   (3.7)
During the stage of evaluating the mean configuration at each β value, the matrix A and the kernels {y_k} are considered constants. Once determined, the mean configuration feeds back to the adaptation of the covariance matrix and the kernels. By applying the gradient-descent method to the free energy, we have the following updating rule for each element A_mn of the matrix A:

ΔA_mn ∝ −∂Ψ/∂A_mn = −∂E/∂A_mn
  = −(1/2) Σ_i Σ_k ⟨δ_ik⟩ (x_im − y_km)(x_in − y_kn) + (N/2) [(A')^{-1}]_mn.   (3.8)

When all ΔA_mn = 0, we have

A = (W^{-1})',   (3.9)
where

W_mn = (1/N) Σ_i Σ_k ⟨δ_ik⟩ (x_im − y_km)(x_in − y_kn).   (3.10)
The adaptation of the kernels {y_k} is also derived by the gradient-descent method:

Δy_k ∝ −∂Ψ/∂y_k = (1/2) Σ_i ⟨δ_ik⟩ (A + A')(x_i − y_k).   (3.11)

Again, when Δy_k = 0, we have

y_k = Σ_i ⟨δ_ik⟩ x_i / Σ_i ⟨δ_ik⟩.   (3.12)
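The fixed-point updates of equations 3.9, 3.10, and 3.12 are closed-form once the mean memberships are given. A sketch with hypothetical data and hard (converged) memberships:

```python
import numpy as np

def update_kernels(X, mean_d):
    # equation 3.12: each kernel y_k is the responsibility-weighted mean of the inputs
    return (mean_d.T @ X) / mean_d.sum(axis=0)[:, None]

def update_A(X, Y, mean_d):
    # equations 3.9-3.10: A = (W^{-1})' with W the weighted scatter matrix
    N = X.shape[0]
    diffs = X[:, None, :] - Y[None, :, :]                      # (N, K, d) terms x_i - y_k
    W = np.einsum('ik,ika,ikb->ab', mean_d, diffs, diffs) / N  # equation 3.10
    return np.linalg.inv(W).T                                  # equation 3.9

# hypothetical data: two symmetric clusters with hard memberships
X = np.array([[0., 0.], [2., 0.], [1., 1.], [1., -1.],
              [10., 0.], [12., 0.], [11., 1.], [11., -1.]])
mean_d = np.repeat(np.eye(2), 4, axis=0)

Y_new = update_kernels(X, mean_d)   # kernels move to the cluster means
A_new = update_A(X, Y_new, mean_d)  # isotropic scatter yields a multiple of I
```

For this symmetric example the weighted scatter is isotropic, so A comes out proportional to the identity; anisotropic cluster shapes would instead produce the Mahalanobis metric that equation 1.5 relies on.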
Now we summarize the new learning process for discriminant analysis as follows:

1. Set a sufficiently low β value, each kernel y_k near the mean of all predictors, each ⟨δ_ik⟩ near 1/K, and each ⟨ξ_km⟩ near 1/M.

2. Iteratively update all ⟨δ_ik⟩ and v_ik by equations 3.4 and 3.5, respectively, to a stationary point.

3. Iteratively update each ⟨ξ_km⟩ and u_km by equations 3.6 and 3.7, respectively, to a stationary point.

4. Update each y_k by equation 3.12.

5. Update A by equations 3.9 and 3.10.

6. If Σ_ik ⟨δ_ik⟩² and Σ_km ⟨ξ_km⟩² are larger than a prior threshold, then halt; otherwise increase β by an annealing schedule and go to step 2.

The convergence of the algorithm is well guaranteed. For steps 2 and 3, the two sets of mean-field equations define a stationary point of the free energy, equation 3.2. For steps 4 and 5, since all {⟨δ_ik⟩} and {⟨ξ_km⟩} are fixed, the change ΔΨ of the free energy, equation 3.2, due to the changes Δy_k and ΔA_mn has the nonincreasing property supported by the gradient-descent method. A mathematical treatment of the convergence property is given in the appendix. The two sets of mean-field equations, 3.4-3.5 and 3.6-3.7, constitute two interactive Potts models. The learning network for the whole learning process is shown in Figure 1.
700
Jiann-Ming Wu
4 Incremental Learning for Optimal Model Size
An incremental learning procedure is developed for the interactive Potts models to optimize the model size subject to the criterion of minimal design cost. The model size is the number of kernels. According to the learning process in the last section, when the annealing process halts with a sufficiently large β, each individual mean ⟨δ_ik⟩ or ⟨ξ_km⟩ is close to either one or zero, and the two square sums of mean activations are larger than the predetermined threshold. The variables {⟨δ_ik⟩} and {⟨ξ_km⟩} are first recorded in temporary variables {δ*_ik} and {ξ*_km}, respectively, to establish an intermediate recognizing network with kernels {y_k} and the matrix A. Whether the model size of this intermediate recognizing network, which is the size of {y_k}, is sufficient depends on the design cost Σ_i ‖q_i − Σ_k ξ*_k δ*_ik‖², which counts the errors in predicting all training samples. Define the hit ratio r as 1 − (1/N) Σ_i ‖q_i − Σ_k ξ*_k δ*_ik‖². If the hit ratio is not acceptable, such as r < h with h = 0.98, it is conjectured that the boundary structure needed to discriminate the training samples well overloads the partition formed by the kernels {y_k} and the covariance matrix A. The incremental learning procedure aims to improve the hit ratio r by properly increasing the model size and adapting the boundary structure. Define the local hit ratio r_k as 1 − (1/N_k) Σ_i δ*_ik ‖q_i − ξ*_k‖² for each internal region V_k = {x | k = arg min_j ‖x − y_j‖_A}, where N_k denotes the size of the set {x_i ∈ V_k}. A set of underestimated internal regions, each having an unacceptable local hit ratio r_k < h, is then picked out, and their kernels are duplicated. The duplication changes the organization of the original coupled Potts model. The idea of divide-and-conquer would suggest invoking a subtask for each underestimated region and then dealing with each subtask independently.
This is not the best choice, since it loses the global optimization of the boundary structure for discriminant analysis. Alternatively, the incremental learning procedure makes use of collective decisions of the interactive Potts models. The kernel y_k of an underestimated internal region, with r_k < h, is duplicated with a small perturbation to produce its twin kernel, denoted y*_k. A set of new kernels {y_k^new} is created as the union of {y_k} and {y*_k | r_k < h}, having model size K + K*, where K and K* denote, respectively, the number of original kernels and the number of underestimated internal regions. In the new set, let the index of y_k still be k and let that of y*_k be denoted k′. For each input parameter x_i, assuming x_i ∈ V(y_k), a new membership vector δ_i^new, now with K + K* elements, is created as follows:
\[
\delta_i^{new} =
\begin{cases}
e_k^{K+K^*} & \text{if } r_k \ge h, \\
\tfrac{1}{2}\left(e_k^{K+K^*} + e_{k'}^{K+K^*}\right) & \text{otherwise.}
\end{cases}
\]

In the second line of the above equation, the new vector δ_i^new means that x_i has the same probability 1/2 of being mapped to each of the twins, y_k^new and y_{k′}^new. Create new category responses {ξ_k^new} for the new kernels {y_k^new}, and set each element of ξ_k^new near 1/M. After duplication, replacing all means with the new vectors {δ_i^new} and {ξ_k^new, 1 ≤ k ≤ K + K*} and the kernels with {y_k^new}, the system variables include {y_k, 1 ≤ k ≤ K + K*}, {⟨δ_i⟩}, and {⟨ξ_k⟩, 1 ≤ k ≤ K + K*}. Note that each ⟨δ_i⟩ now contains K + K* elements. Return to the annealing process. The β value has already been increased to a sufficiently large one, at which the two sums Σ(δ*_ik)² and Σ(ξ*_km)² are larger than the predetermined halting threshold, but the system variables after duplication no longer satisfy the halting condition, so the β value can be further increased to continue the annealing process. We conclude the incremental learning process for the interactive Potts models as follows:

1. Set a sufficiently low β value, a threshold h, and an initial model size K. Set each kernel y_k near the mean of all predictors, each ⟨δ_ik⟩ near 1/K, and each ⟨ξ_km⟩ near 1/M.
2. Iteratively update all ⟨δ_ik⟩ and v_ik by equations 3.4 and 3.5, respectively, to a stationary point.
3. Iteratively update each ⟨ξ_km⟩ and u_km by equations 3.6 and 3.7, respectively, to a stationary point.
4. Update each y_k by equation 3.12.
5. Update A by equations 3.9 and 3.10.
6. If Σ_ik ⟨δ_ik⟩² and Σ_km ⟨ξ_km⟩² are larger than a prior threshold, such as 0.98, then go to step 7. Otherwise, increase β by an annealing schedule, and then go to step 2.
7. Record {⟨δ_i⟩} and {⟨ξ_k⟩} in temporary variables {δ*_i} and {ξ*_k}, respectively.
8. Determine r and all r_k using A, {y_k}, {δ*_i}, and {ξ*_k}.
9. If r > h, halt. Otherwise, duplicate the kernels of the K* underestimated regions with small perturbations and create the new variables {y_k^new}, {δ_i^new}, and {ξ_k^new} from {y_k}, {y*_k | r_k < h}, {δ*_i}, and {ξ*_k}, as described in the text.
10. K ← K + K*. Decrease β by a small constant, replace {y_k}, {⟨δ_i⟩}, and {⟨ξ_k⟩} with {y_k^new}, {δ_i^new}, and {ξ_k^new}, respectively, and go to step 2.

5 Numerical Simulations and Discussion
The incremental learning process in section 4 has been implemented in Matlab code and is referred to as PottsDA in the following.

5.1 Examples. We first test the new method (PottsDA) in comparison with the RBF and support vector machine (SVM) methods (Vapnik, 1995) on some artificial examples. In our simulations, the β parameter of the PottsDA is initialized as 1/3.8, and each annealing process increases it to a
β value of 0.98; the weighting constant is c = 4.5. Theoretical derivations of the weighting constant c and the initial β parameter can be found in Aiyer, Niranjan, and Fallside (1990) and Peterson and Söderberg (1989), respectively. The Matlab package used for the RBF (Müller et al., 1999; Rätsch et al., 2001) is provided in Rätsch et al. (2001), where the centers are initialized with k-means clustering. For the SVM (Vapnik, 1995), the Matlab package is provided in Cawley (2000), and the corresponding learning method is sequential minimal optimization (Platt, 1999). In this section, each of the three methods is executed five times per example, and their average error rates for training and testing are reported.

The input parameters in the first example are generated by a linear mixture, x(t) = H s(t), where s(t) = [s1(t) s2(t)]′ denotes time-varying samples from two independent uniform distributions within [−0.5, 0.5], and

\[
H = \begin{bmatrix} 0.4384 & -0.8988 \\ -0.8493 & 0.5279 \end{bmatrix}
\]
is a randomly generated mixing matrix. The desired output of each input parameter is determined by the rule q(t) = sign(s1(t)) · sign(s2(t)). We use the same process to generate 1600 samples and split them into two equal sets, one for training and the other for testing. In Figure 2, the position of each input parameter in the training set is marked with a black or gray dot, denoting the two distinct output labels. Since the input parameter in this example is a linear mixture of independent sources, a discriminant rule that depends only on the input parameter directly, such as sign(x1(t)) · sign(x2(t)), does not describe an optimal prediction of the correct output labels. The primary challenge to the learning process is to recover the original independent sources so that an effective discriminant rule can be encoded by a minimal set of kernels to achieve maximal generalization. Our simulations show that the PottsDA method outperforms the RBF and SVM methods on this example. The PottsDA starts with an initial model of two kernels and halts with an optimal hit ratio of r = 100%. As shown in Table 1, the error rates of the PottsDA for both training and testing are zero. This result is achieved by a discriminant network composed of a covariance matrix, four kernels, and their category responses. In Figure 2, the position of each of the four kernels is marked with a circle or cross symbol, representing the category denoted by black or gray dots, respectively. By the relation B′B = A, one can obtain a demixing matrix,

\[
B = \begin{bmatrix} 12.0370 & 5.8743 \\ 5.8743 & 10.2847 \end{bmatrix},
\]

from the covariance matrix A. The two columns of the inverse B⁻¹ are shown as the two lines in Figure 2, which coincide exactly in direction with the mixing structure of this example. The obtained covariance matrix provides
Figure 2: The training patterns of the first example and the result of the learning process, including the two columns of the inverse of the demixing matrix, the four kernels, and their category responses.

Table 1: Performance of the Three Methods, First Example.

            RBF(4)   RBF(8)   RBF(12)   RBF(24)   SVM     PottsDA(4)
Training    14.1%    12.0%    8.6%      3.9%      13.2%   0%
Testing     13.0%    12.1%    8.3%      4.5%      14.3%   0%

Note: Numbers in parentheses refer to number of kernels.
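The setup of this first example is straightforward to reproduce (a sketch; the mixing matrix is the one printed in the text, the labels follow q(t) = sign(s1(t)) · sign(s2(t)), and the seed and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.array([[0.4384, -0.8988],
              [-0.8493, 0.5279]])        # mixing matrix from the text

# 1600 samples of two independent uniform sources in [-0.5, 0.5].
S = rng.uniform(-0.5, 0.5, size=(1600, 2))
X = S @ H.T                              # observed mixtures x(t) = H s(t)
q = np.sign(S[:, 0]) * np.sign(S[:, 1])  # desired output q(t)

# Equal split into training and testing sets, as in the text.
X_train, X_test = X[:800], X[800:]
q_train, q_test = q[:800], q[800:]

# A sign rule applied directly to the mixtures is suboptimal: it
# disagrees with the true labels on a sizable fraction of the samples.
naive = np.sign(X[:, 0]) * np.sign(X[:, 1])
error_rate = (naive != q).mean()
```

This is why an adaptive distance measure that recovers the mixing structure is needed before a small set of kernels can classify the data.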
a suitable distance measure between input parameters and kernels, such that the four kernels faithfully partition the parameter space into four internal regions, and the resulting discriminant network successfully classifies samples in both the training set and the testing set. In contrast, since the RBF network is based on the Euclidean distance, its kernels yield nonfaithful representations of the input parameters. To illustrate this point, the RBF method was tested separately with 4, 8, 12, and 24 kernels. Our simulations show that the average error rate of the RBF method with 24 kernels is 3.9% for training and 4.5% for testing, and that of the SVM method is 13.2% for training and 14.3% for testing. The average execution time of the PottsDA for this example is 13.4 seconds.

In the second example, the input parameter contains three elements. Each input parameter, x(t) = [x1(t) x2(t) x3(t)]′, is the result of the linear
mixture, x(t) = H s(t), of three independent sources, s(t) = [s1(t) s2(t) s3(t)]′, where H is a randomly generated mixing matrix with entries

\[
H = \begin{bmatrix} 0.9288 & 0.2803 & 0.3770 \\ 0.3122 & 0.9366 & 0.2572 \\ 0.1994 & 0.2098 & 0.8897 \end{bmatrix}.
\]

The first two sources, s1(t) and s2(t), are uniformly distributed within [−0.5, 0.5], and s3(t) is gaussian noise of N(0, √2). The discriminant rule is the same as in the first example, sign(s1(t)) · sign(s2(t)), treating the third source as noise for prediction. To retrieve this discriminant rule from the mixture with minimal model size, the learning process has to deal with the interference caused by the mixing structure and the noise source. As in the first example, the training set and the testing set each contain 800 samples generated by the same linear mixture. For the testing set, all samples of the three independent sources are shown in the first three rows of Figure 3, and the three mixed signals, x1(t), x2(t), and x3(t), are shown in the next three rows. The seventh row of Figure 3 shows the desired output of each
Figure 3: The time sequence of the three independent sources, the three input parameters, the desired output, and the predicted output of the second example.
Table 2: Performance of the Three Methods, Second Example.

            RBF(4)   RBF(8)   RBF(12)   RBF(24)   SVM    PottsDA(4)
Training    45.3%    31.2%    22.6%     10.9%     3.2%   0.2%
Testing     44%      31.1%    24.6%     13.9%     5.9%   0%

Note: Numbers in parentheses refer to number of kernels.
sample. In all five executions, the PottsDA derives a discriminant network composed of a distance measure, four kernels, and their category responses, by which the resulting output for the 800 testing samples, shown in the eighth row, coincides exactly with the desired output. To facilitate the presentation, we have sorted the 800 testing samples in Figure 3 into four segments according to the combination of the signs of s1(t) and s2(t), so that each segment contains the same category response. As shown in Table 2, the PottsDA method is better than the RBF and SVM methods at handling input parameters that are linear mixtures with noise. The average execution time of the PottsDA for this example is 30.07 seconds.

The third example tests the three methods on the spiral data shown in Figure 4, where the two distinct categories are denoted by stars and dots, respectively. In this example, both the training set and the testing set contain 40 spiral-distributed interleaved clusters, and each cluster contains 20 input parameters. The primary challenge to a learning method is to find the centers of the 40 clusters. For this example, the PottsDA method starts with an initial model of 10 kernels. The quantity Σ⟨δ_ik⟩² measured at step 7 of the PottsDA learning process along the updating iterations is shown in Figure 5, which also displays the change of the model size. The final model size has been further reduced to 40 by considering possible combinations of any two neighboring kernels. The 40 kernels obtained in one of the five executions, with their category responses denoted by circle or cross symbols, are plotted in Figure 4. The two lines in Figure 4 denote the directions of the two columns of the inverse of the obtained demixing matrix. The average training and testing error rates of the three methods are shown in Table 3. The PottsDA method clearly outperforms the RBF and SVM methods on this example as well.

5.2 Breast Cancer Data.
We use the Wisconsin Breast Cancer Database (as of July 1992) to test the PottsDA method on a real application. This database contains 699 instances, each with 9 features for predicting one of two categories, benign or malignant. There are 458 instances in the benign category and 241 in the malignant category. The input parameters are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses, each represented by an integer ranging from 1 to 10.

Figure 4: The training patterns of the third example and the result of the learning process, including the two columns of the inverse of the demixing matrix, the 40 kernels, and their category responses.

The original work (Wolberg & Mangasarian, 1990) applied the multisurface method to a 369-case subset of the database, resulting in testing error rates of more than 6% (Malini Lamego, 2001). Recently, Malini Lamego (2001) used a neural network with algebraic loops on the first 683 instances of the database. In his experiment, the last 200 of the 683 instances form the testing set and the others form the learning set; the resulting error rates are 2.3% for learning and 4.5% for testing. It has been claimed (Malini Lamego, 2001) that this classification is more difficult than the previous one (Wolberg & Mangasarian, 1990) and that the result is better than those of all previous works on this database. For comparison, we use the same training set and testing set as Malini Lamego (2001) to evaluate the performance of the PottsDA method. The PottsDA method obtains a discriminant network with 42 kernels and has error rates of 1.4% for training and 1% for testing, as shown in Table 4. Only two instances in the testing set are incorrectly classified by the discriminant network. If the testing set also includes the last 19 instances of the database, one additional instance is missed by the same discriminant network, and the testing error rate is 1.39%. On the 219-case testing set, the RBF method with 80 kernels and the SVM method yield testing error rates of 4.17% and 4.63%, respectively. The PottsDA method is significantly better than the other two methods on the Wisconsin Breast Cancer Database.
(Figure 5 plot; model-size annotations along the curve: K = 10, K = 20, K = 35, K = 44.)
Figure 5: The convergence of the learning network for the third example and the change of the model size. The vertical coordinate denotes the ratio of the square sum of the mean activations of membership vectors to the number of training patterns. The horizontal coordinate is the time index, each denoting a change of the beta value.
Table 3: Performance of the Three Methods, Third Example.

            RBF(40)   RBF(50)   RBF(60)   RBF(80)   SVM     PottsDA(40)
Training    14.6%     10.4%     7.8%      3.3%      45.5%   0.8%
Testing     15.7%     12.3%     9.5%      4.1%      45.6%   0.4%

Note: Numbers in parentheses refer to number of kernels.
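A data set of this type can be constructed as follows (the paper does not give the exact spiral parameters, so the radii, angles, and noise scale below are our own stand-ins for 40 interleaved clusters of 20 points each):

```python
import numpy as np

rng = np.random.default_rng(2)
per_cluster = 20
t = np.linspace(0, 2.5 * np.pi, 20)              # angles along a spiral
r = 0.2 + 0.8 * t / t.max()                      # radius grows outward

# Two interleaved spirals of 20 cluster centers each (40 clusters total).
c0 = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)
c1 = np.stack([r * np.cos(t + np.pi), r * np.sin(t + np.pi)], axis=1)
centers = np.vstack([c0, c1])
labels = np.repeat([0, 1], 20)                   # one class per spiral

# 20 noisy points around each of the 40 cluster centers: 800 samples.
X = np.vstack([c + 0.02 * rng.normal(size=(per_cluster, 2))
               for c in centers])
q = np.repeat(labels, per_cluster)
```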
Table 4: Performance of the PottsDA Method and the Neural Network with Algebraic Loops for the 683-Case Subset of the Wisconsin Breast Cancer Database.

                  PottsDA(42)   Neural Net with Algebraic Loops
Training (483)    1.4%          2.3%
Testing (200)     1%            4.5%
5.3 Discussion. The major improvements of the PottsDA method over the other methods can be illustrated from four perspectives: the flexibility of the discriminant network, the effective collective decisions of the annealed recurrent learning network, the advantages of the generative model, and the capability of the incremental learning process.
5.3.1 Discriminant Network. The discriminant network of the PottsDA method, composed of the normalized multivariate gaussian activation functions {G_k^A} of equation 1.3, is a generalized version of a normalized RBF network. In the limit of an extremely large β, the discriminant function, equation 1.5, is a piecewise function composed of a set of local functions, each defined within an internal region of a faithful nonoverlapping partition of the parameter space based on the Mahalanobis distance ‖·‖_A. This discriminant function is indeed a composition of a linear transformation and the nearest-prototype classifier, as in equation 1.6, and it possesses more flexibility for a desired mapping function. Consider the first two artificial examples, where the training parameters are the results of linear mixtures of independent sources and their targets are encoded exactly by the source instances instead of the mixtures. With an adaptive distance measure A, the PottsDA method succeeds in locating four kernels {y_k} for the optimal discriminant function. In contrast, because it uses the Euclidean distance and makes no particular effort to recover independent instances from mixtures, the RBF method produces nonfaithful internal representations: a relatively large number of local functions based on a Voronoi partition. As shown in Tables 1 and 2, the testing error rate of the RBF method with 24 kernels is higher than that of the PottsDA method with four kernels. This reflects the weakness of the discriminant function of a normalized RBF network, which is a special case of the PottsDA method.

5.3.2 Annealed Recurrent Learning Network. The annealed recurrent learning network of the PottsDA method, containing four sets of interactive dynamics, is effective for optimizing the highly coupled parameters of the discriminant network under an annealing process.
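In the large-β limit the discriminant function reduces to a linear transformation followed by nearest-prototype classification, which can be sketched as follows (our notation; B is any factor satisfying B′B = A):

```python
import numpy as np

def discriminant(x, Y, xi, A):
    """Nearest-prototype classifier under the Mahalanobis distance ||.||_A.

    x: (d,) input; Y: (K, d) kernels; xi: (K, M) category responses.
    Returns the category response of the nearest kernel.
    """
    diff = Y - x                                  # (K, d)
    d2 = np.einsum('kd,de,ke->k', diff, A, diff)  # squared ||x - y_k||_A
    return xi[np.argmin(d2)]

# The equivalence with a linear transform: factor A = B'B (Cholesky),
# map z = B x; the Euclidean distance in z-space then equals the
# Mahalanobis distance in x-space.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.linalg.cholesky(A).T                       # so that B.T @ B == A
x = np.array([0.3, -0.2])
y = np.array([1.0, 0.4])
lhs = (x - y) @ A @ (x - y)
rhs = np.sum((B @ x - B @ y) ** 2)
assert np.isclose(lhs, rhs)
```

This is exactly why the PottsDA kernels can remain few in number: the learned B absorbs the mixing structure, after which a plain nearest-prototype rule suffices.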
The supervised learning process is formulated in the mathematical framework of equation 2.7, composed of a mixed integer and linear programming problem, and a hybrid of mean-field annealing and the gradient-descent method is employed to derive the linear and nonlinear interactive dynamics. The primary advantage of the annealed recurrent learning network over a cascaded learning process, or a purely gradient-descent-based learning process, is the effect of collective decisions over the four sets of continuous and discrete variables and the capability of escaping the traps of the numerous local minima of the energy function to approach a global minimum. Consider the second artificial example, where the input parameters are the result of linear mixtures and one independent source acts as noise with respect to the discriminant rule. The collective decisions realized by the annealed recurrent learning network succeed in treating component analysis, clustering
analysis, and label determination as a single process and can thus achieve an optimal discriminant function for this problem. Numerical simulations show that the power of the discriminant network is strongly supported by the annealed recurrent learning network. Although the RBF method employs the k-means algorithm to set up its initial kernels, it cannot handle well the interference caused by linear mixtures and noise, owing to the limitations of its discriminant function and learning process. Further evaluations show that the testing error rate of the RBF method with 100 kernels is still above 10% for this example.

5.3.3 Generative Model. Both the discriminant network and the annealed recurrent learning network are rooted in a generative model of multiple disjoint multivariate gaussian distributions. The overall distribution corresponding to the generative model is general enough to characterize an arbitrary parameter space, and the parameter estimation involved gains potential advantages from the maximum likelihood principle, which provides a solid theoretical foundation for the mathematical framework. To realize natural discriminant analysis, the PottsDA method fits a generative model to all predictors, using the annealed recurrent learning network to adapt the kernels and the covariance matrix subject to the interpolating conditions posed by the paired training samples, and uses the discriminant network to classify instances. A simplified version of the generative model, with a unified covariance matrix, is simulated here. For the Wisconsin Breast Cancer Database, its performance is encouraging. The obtained parameters, including the 42 kernels and the covariance matrix, could provide feedback for understanding relations among the components of the predictors.

5.3.4 Incremental Learning Process. The incremental learning process is capable of determining the minimal model size subject to the interpolating conditions.
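The kernel-duplication step at the heart of this procedure (step 9 of section 4) can be sketched as follows (a schematic with our own variable names; the hit ratios are computed from hardened memberships, and the per-sample error is simplified to a 0/1 mismatch):

```python
import numpy as np

def duplicate_underestimated(Y, delta_star, xi_star, Q, h=0.98, eps=1e-3):
    """Duplicate kernels of regions whose local hit ratio r_k < h.

    Y: (K, d) kernels; delta_star: (N, K) hardened memberships (0/1 rows);
    xi_star: (K, M) category responses; Q: (N, M) desired outputs.
    Returns enlarged kernels, memberships, and category responses,
    following the twin-kernel rule in the text.
    """
    rng = np.random.default_rng(0)
    N, K = delta_star.shape
    assign = delta_star.argmax(axis=1)
    pred_err = (Q != xi_star[assign]).any(axis=1)   # per-sample 0/1 error

    # Local hit ratio r_k over each internal region V_k.
    bad = []                                        # underestimated regions
    for k in range(K):
        members = assign == k
        if members.any() and 1.0 - pred_err[members].mean() < h:
            bad.append(k)

    # Twin kernels: small perturbation of each underestimated kernel.
    twins = Y[bad] + eps * rng.normal(size=(len(bad), Y.shape[1]))
    Y_new = np.vstack([Y, twins])

    # New memberships: probability 1/2 on each twin, unchanged otherwise.
    delta_new = np.hstack([delta_star, np.zeros((N, len(bad)))])
    for j, k in enumerate(bad):
        members = assign == k
        delta_new[members, k] = 0.5
        delta_new[members, K + j] = 0.5

    # New category responses start near uniform (1/M) for the twins.
    M = xi_star.shape[1]
    xi_new = np.vstack([xi_star, np.full((len(bad), M), 1.0 / M)])
    return Y_new, delta_new, xi_new
```

After duplication, β is decreased slightly and the annealing loop resumes (step 10).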
Consider the third artificial example, composed of 40 interleaving clusters in two different classes. Without interference caused by linear mixtures and noise, this problem tests the capability for clustering analysis. For this example, the RBF method behaves better than the SVM method, but it still takes the RBF method 80 kernels to produce a testing error rate of 4.1%. The incremental learning process of the PottsDA method is more effective: it obtains a discriminant network of 40 kernels with testing error rates near zero.

6 Conclusions
We have proposed a new learning process for discriminant analysis based on four sets of interactive dynamics, and its encouraging performance has been demonstrated by numerical simulations on artificial and real examples. To develop the interactive dynamics, we have proposed multiple disjoint multivariate gaussian distributions to serve as a generative model for
the parameter space. By combining the maximization of the log-likelihood functions, which measures the fitness of the generative model to all input parameters, with the minimization of the design cost, we obtain a mathematical framework for discriminant analysis. By relating the discrete variables to Potts neural variables, we can further apply a hybrid of mean-field annealing and gradient-descent methods to the optimization of the mixed integer and linear programming problem and obtain the four sets of interactive dynamics constituting the annealed recurrent learning network for discriminant analysis. The new learning process is a parallel and distributed process, and its evolution is well controlled by an annealing process in analogy with physical annealing. An effective incremental learning procedure is also developed for optimizing the model size. The adaptive covariance matrix of the discriminant network plays a central role in retrieving the unknown mixing structure within the input parameters and extracting output-dependent features for discriminant analysis. The new learning process is effective for developing faithful internal representations of the input parameters and constructing the essential boundary structures for classification.

Appendix
That steps 2–5 in the learning process converge can be proved. Rewrite the mean-field equations in the text in the following continuous form:

\[
\frac{du_{ik}}{dt} = -\frac{\partial \Psi}{\partial \langle\delta_{ik}\rangle} = -\frac{\partial E(y, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial \langle\delta_{ik}\rangle},
\]
\[
\langle\delta_i\rangle = \left[\frac{\exp(\beta u_{i1})}{\sum_l \exp(\beta u_{il})} \;\cdots\; \frac{\exp(\beta u_{iK})}{\sum_l \exp(\beta u_{il})}\right]' = \sum_k \frac{\exp(\beta u_{ik})}{\sum_l \exp(\beta u_{il})}\, e_k,
\]
and

\[
\frac{dv_{km}}{dt} = -\frac{\partial \Psi}{\partial \langle\xi_{km}\rangle} = -\frac{\partial E(y, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial \langle\xi_{km}\rangle},
\]
\[
\langle\xi_k\rangle = \left[\frac{\exp(\beta v_{k1})}{\sum_l \exp(\beta v_{kl})} \;\cdots\; \frac{\exp(\beta v_{kM})}{\sum_l \exp(\beta v_{kl})}\right]' = \sum_h \frac{\exp(\beta v_{kh})}{\sum_l \exp(\beta v_{kl})}\, e_h,
\]
where e_k is a standard unit vector whose kth element is one. Then rewrite the updating rules as the following dynamics:

\[
\frac{dA_{mn}}{dt} \equiv -\eta_1 \frac{\partial \Psi}{\partial A_{mn}} = -\eta_1 \frac{\partial E(y, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial A_{mn}}
\]
and

\[
\frac{dy_k}{dt} \equiv -\eta_2 \frac{\partial \Psi}{\partial y_k} = -\eta_2 \frac{\partial E(y, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial y_k}.
\]

Then the convergence of the free energy Ψ along the trace of these four sets of dynamics can be shown:

\[
\frac{d\Psi}{dt} = \sum_i \left(\frac{\partial \Psi}{\partial \langle\delta_i\rangle}\right)' \frac{d\langle\delta_i\rangle}{dt} + \sum_k \left(\frac{\partial \Psi}{\partial \langle\xi_k\rangle}\right)' \frac{d\langle\xi_k\rangle}{dt} + \sum_{mn} \frac{\partial \Psi}{\partial A_{mn}} \frac{dA_{mn}}{dt} + \sum_k \left(\frac{\partial \Psi}{\partial y_k}\right)' \frac{dy_k}{dt}
\]
\[
= -\sum_i \left(\frac{du_i}{dt}\right)' C_1 \left(\frac{du_i}{dt}\right) - \sum_k \left(\frac{dv_k}{dt}\right)' C_2 \left(\frac{dv_k}{dt}\right) - \frac{1}{\eta_1} \sum_{mn} \left(\frac{dA_{mn}}{dt}\right)^2 - \frac{1}{\eta_2} \sum_k \left(\frac{dy_k}{dt}\right)' \left(\frac{dy_k}{dt}\right) \le 0,
\]

where C_1 is the Hessian of \(\ln z(u_k, \beta)\),

\[
C_1 = \frac{\sum_{[s_l]} \exp(\beta \langle\delta_k\rangle' s_l)\,[s_l - \langle\delta_k\rangle][s_l - \langle\delta_k\rangle]'}{\sum_{[s_l]} \exp(\beta \langle\delta_k\rangle' s_l)},
\]

and [s_l] runs over {e_1, . . . , e_K}. Since C_1 is positive definite,

\[
\left(\frac{du_k}{dt}\right)' C_1 \left(\frac{du_k}{dt}\right) > 0.
\]

For the same reason,

\[
\left(\frac{dv_m}{dt}\right)' C_2 \left(\frac{dv_m}{dt}\right) > 0.
\]

Thus dΨ/dt ≤ 0 is shown.
References

Aiyer, S. V. B., Niranjan, M., & Fallside, F. (1990). A theoretical investigation into the performance of the Hopfield model. IEEE Trans. Neural Networks, 1, 204–215.
Benaim, M. (1994). On functional approximation with normalized gaussian units. Neural Computation, 6, 319–333.
Cawley, G. C. (2000). MATLAB Support Vector Machine Toolbox. Available online at: http://theoval.sys.uea.ac.uk/svm/toolbox.
Freeman, J. A. S., & Saad, D. (1995). Learning and generalization in radial basis function networks. Neural Computation, 7, 1000–1020.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 1455–1480.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219–269.
Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Ann. Stat., 23, 73–102.
Hastie, T., & Simard, P. Y. (1998). Metrics and models for handwritten character recognition. Stat. Sci., 13, 54–65.
Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by gaussian mixtures. J. Roy. Stat. Soc. B Met., 58, 155–176.
Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc., 89, 1255–1270.
Lin, J. K., Grier, D. G., & Cowan, J. D. (1997). Faithful representation of separable distributions. Neural Computation, 9, 1305–1320.
Liou, C. Y., & Wu, J.-M. (1996). Self-organization using Potts models. Neural Networks, 9, 671–684.
Makeig, S., Jung, T. P., & Bell, A. J. (1997). Blind separation of auditory event-related brain responses into independent components. P. Natl. Acad. Sci. USA, 94, 10979–10984.
Malini Lamego, M. (2001). Adaptive structures with algebraic loops. IEEE Trans. Neural Networks, 12, 33–42.
Miller, D. J., & Uyar, H. S. (1998). Combined learning and use for a mixture model equivalent to the RBF classifier. Neural Computation, 10, 281–293.
Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 133–143). San Mateo, CA: Morgan Kaufmann.
Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Müller, K.-R., Smola, A. J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1999). Using support vector machines for time series prediction. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 243–254). Cambridge, MA: MIT Press.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst., 1, 3–22.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59, 2229–2232.
Pineda, F. J. (1989). Recurrent back-propagation and the dynamical approach to adaptive neural computation. Neural Computation, 1, 161–172.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Rao, A. V., Miller, D. J., Rose, K., & Gersho, A. (1999). A deterministic annealing approach for parsimonious design of piecewise regression models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21, 159–173.
Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett., 65, 945–948.
Rose, K., Gurewitz, E., & Fox, G. C. (1993). Constrained clustering as an optimization method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15, 785–794.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193–9196.

Received September 8, 2000; accepted June 18, 2001.
Communicated by Peter Földiák
ARTICLE
Slow Feature Analysis: Unsupervised Learning of Invariances

Laurenz Wiskott
[email protected]
Computational Neurobiology Laboratory, Salk Institute for Biological Studies, San Diego, CA 92168, U.S.A.; Institute for Advanced Studies, D-14193 Berlin, Germany; and Innovationskolleg Theoretische Biologie, Institute for Biology, Humboldt-University Berlin, D-10115 Berlin, Germany
Terrence J. Sejnowski
[email protected]
Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92037, U.S.A.
Invariant features of temporally varying signals are useful for analysis and classification. Slow feature analysis (SFA) is a new method for learning invariant or slowly varying features from a vectorial input signal. It is based on a nonlinear expansion of the input signal and application of principal component analysis to this expanded signal and its time derivative. It is guaranteed to find the optimal solution within a family of functions directly and can learn to extract a large number of decorrelated features, which are ordered by their degree of invariance. SFA can be applied hierarchically to process high-dimensional input signals and extract complex features. SFA is applied first to complex cell tuning properties based on simple cell output, including disparity and motion. Then more complicated input-output functions are learned by repeated application of SFA. Finally, a hierarchical network of SFA modules is presented as a simple model of the visual system. The same unstructured network can learn translation, size, rotation, contrast, or, to a lesser degree, illumination invariance for one-dimensional objects, depending only on the training stimulus. Surprisingly, only a few training objects suffice to achieve good generalization to new objects. The generated representation is suitable for object recognition. Performance degrades if the network is trained to learn multiple invariances simultaneously.

1 Introduction

Generating invariant representations is one of the major problems in object recognition. Some neural network systems have invariance properties

Neural Computation 14, 715–770 (2002)
© 2002 Massachusetts Institute of Technology
716
L. Wiskott and T. Sejnowski
built into the architecture. In both the neocognitron (Fukushima, Miyake, & Ito, 1983) and the weight-sharing backpropagation network (LeCun et al., 1989), for instance, translation invariance is implemented by replicating a common but plastic synaptic weight pattern at all shifted locations and then spatially subsampling the output signal, possibly blurred over a certain region. Other systems employ a matching dynamics to map representations onto each other in an invariant way (Olshausen, Anderson, & Van Essen, 1993; Konen, Maurer, & von der Malsburg, 1994). These two types of systems are trained or applied to static images, and the invariances, such as translation or size invariance, need to be known in advance by the designer of the system (dynamic link matching is less strict in specifying the invariance in advance; Konen et al., 1994).

The approach described in this article belongs to a third class of systems based on learning invariances from temporal input sequences (Földiák, 1991; Mitchison, 1991; Becker, 1993; Stone, 1996). The assumption is that primary sensory signals, which in general code for local properties, vary quickly while the perceived environment changes slowly. If one succeeds in extracting slow features from the quickly varying sensory signal, one is likely to obtain an invariant representation of the environment.

The general idea is illustrated in Figure 1. Assume three different objects in the shape of striped letters that move straight through the visual field with different direction and speed, for example, first an S, then an F, and then an A. On a high level, this stimulus can be represented by three variables changing over time. The first one indicates object identity, that is, which of the letters is currently visible, assuming that only one object is visible at a time. This is the what-information. The second and third variables indicate vertical and horizontal location of the object, respectively. This is the where-information.
This representation would be particularly convenient because it is compact and the important aspects of the stimulus are directly accessible. The primary sensory input—the activities of the photoreceptors in this example—is distributed over many sensors, which respond only to simple localized features of the objects, such as local gray values, dots, or edges. Since the sensors respond to localized features, their activities will change quickly, even if the stimulus moves slowly. Consider, for example, the response of the first photoreceptor to the F. Because of the stripes, as the F moves across the receptor by just two stripe widths, the activity of the receptor rapidly changes from low to high and back again. This primary sensory signal is a low-level representation and contains the relevant information, such as object identity and location, only implicitly. However, if receptors cover the whole visual field, the visual stimulus is mirrored by the primary sensory signal, and presumably there exists an input-output function that can extract the relevant information and compute a high-level representation like the one described above from this low-level representation.
Figure 1: Relation between slowly varying stimulus and quickly varying sensor activities. (Top) Three different objects, the letters S, F, and A, move straight through the visual field one by one. The responses of three photoreceptors to this stimulus at different locations are recorded and correspond to the gray-value profiles of the letters along the dotted lines. (Bottom left) Activities of the three (out of many) photoreceptors over time. The receptors respond vigorously when an object moves through their localized receptive field and are quiet otherwise. High values indicate white; low values indicate black. (Bottom right) These three graphs show a high-level representation of the stimulus in terms of object identity and object location over time. See the text for more explanation.
What can serve as a general objective to guide unsupervised learning to find such an input-output function? One main difference between the high-level representation and the primary sensory signal is the timescale on which they vary. Thus, a slowly varying representation can be considered to be of a higher abstraction level than a quickly varying one. It is important to note here that the input-output function computes the output signal instantaneously, only on the basis of the current input. Slow variation of the output signal can therefore not be achieved by temporal low-pass filtering, but must be based on extracting aspects of the input signal that are inherently slow and useful for a high-level representation. The vast majority of possible input-output functions would generate quickly varying output signals, and only a very small fraction will generate slowly varying output signals. The task is to find some of these rare functions.

The primary sensory signal in Figure 1 is quickly varying compared to the high-level representation, even though the components of the primary sensory signal have extensive quiet periods due to the blank background of the stimulus. The sensor of x_1, for instance, responds only to the F, because its receptive field is at the bottom right of the visual field. However, this illustrative example assumes an artificial stimulus, and under more natural conditions, the difference in temporal variation between the primary sensory signals and the high-level representation should be even more pronounced. The graphs for object identity and object location in Figure 1 have gaps between the linear sections representing the objects. These gaps need to be filled in somehow. If that is done, for example, with some constant value, and the graphs are considered as a whole, then both the object identity and location vary on similar timescales, at least in comparison to the much faster varying primary sensory signals.
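The distinction between temporal low-pass filtering and instantaneous extraction of slow aspects can be made concrete with a toy signal. This is our own illustration, not an example from the article: a memoryless function of a quickly varying input can nevertheless have a slowly varying output.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
a = 2 + np.sin(t)            # slowly varying amplitude (one oscillation)
x1 = a * np.sin(40 * t)      # quickly varying primary sensory signals
x2 = a * np.cos(40 * t)
# g(x1, x2) = x1^2 + x2^2 is computed instantaneously from the current
# input values, yet its output equals a(t)^2 and varies slowly:
y = x1**2 + x2**2
# Mean squared time differences: the output is far slower than the input.
fast = np.mean(np.diff(x1) ** 2)
slow = np.mean(np.diff(y) ** 2)
```

No temporal filtering is involved: the slowness of y comes entirely from the choice of the function g, which is the kind of function SFA is designed to find.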
This means that object location and object identity can be learned based on the objective of slow variation, which sheds new light on the problem of learning invariances. The common notion is that object location changes quickly while object identity changes slowly, or rarely, and that the recognition system has to learn to represent only object identity and ignore object location. However, another interpretation of this situation is that object translation induces quick changes in the primary sensory signal, in comparison to which object identity and object location vary slowly. Both of these aspects can be extracted as slow features from the primary sensory signal. While it is conceptually convenient to call object identity the what-information of the stimulus and object location the where-information, they are of a similar nature in the sense that they may vary on similar timescales compared to the primary sensory signal.

In the next two sections, a formal definition of the learning problem is given, and a new algorithm to solve it is presented. The subsequent sections describe several sample applications. The first shows how complex cell behavior found in visual cortex can be inferred from simple cell outputs. This is extended to include disparity and direction of motion in the second example. The third example illustrates that more complicated input-output
functions can be approximated by applying the learning algorithm repeatedly. The fourth and fifth examples show how a hierarchical network can learn translation and other invariances. In the final discussion, the algorithm is compared with previous approaches to learning invariances.

2 The Learning Problem

The first step is to give a mathematical definition for the learning of invariances. Given a vectorial input signal x(t), the objective is to find an input-output function g(x) such that the output signal y(t) := g(x(t)) varies as slowly as possible while still conveying some information about the input, to avoid the trivial constant response. Strict invariances are not the goal, but rather approximate ones that change slowly. This can be formalized as follows:

Learning problem. Given an I-dimensional input signal x(t) = [x_1(t), ..., x_I(t)]^T, with t ∈ [t_0, t_1] indicating time and [...]^T indicating the transpose of [...], find an input-output function g(x) = [g_1(x), ..., g_J(x)]^T generating the J-dimensional output signal y(t) = [y_1(t), ..., y_J(t)]^T with y_j(t) := g_j(x(t)) such that for each j ∈ {1, ..., J}
Δ_j := Δ(y_j) := ⟨ẏ_j²⟩  is minimal   (2.1)

under the constraints

⟨y_j⟩ = 0   (zero mean),   (2.2)
⟨y_j²⟩ = 1   (unit variance),   (2.3)
∀ j′ < j:  ⟨y_{j′} y_j⟩ = 0   (decorrelation),   (2.4)
where the angle brackets indicate temporal averaging, that is,

⟨f⟩ := (1 / (t_1 − t_0)) ∫_{t_0}^{t_1} f(t) dt.

Equation 2.1 expresses the primary objective of minimizing the temporal variation of the output signal. Constraints 2.2 and 2.3 help avoid the trivial solution y_j(t) = const. Constraint 2.4 guarantees that different output signal components carry different information and do not simply reproduce each other. It also induces an order, so that y_1(t) is the optimal output signal component, while y_2(t) is a less optimal one, since it obeys the additional constraint ⟨y_1 y_2⟩ = 0. Thus, Δ(y_{j′}) ≤ Δ(y_j) if j′ < j. The zero mean constraint, 2.2, was added for convenience only. It could be dropped, in which case constraint 2.3 should be replaced by ⟨(y_j − ⟨y_j⟩)²⟩ = 1 to avoid the trivial solutions y_1(t) = ±1. One can also drop the unit variance constraint, 2.3, and integrate it into the objective, which would
720
L. Wiskott and T. Sejnowski
then be to minimize ⟨ẏ_j²⟩ / ⟨(y_j − ⟨y_j⟩)²⟩. This is the formulation used by Becker and Hinton (1992). However, integrating the two constraints (equations 2.2 and 2.3) into the objective function leaves the solution undetermined by an arbitrary offset and scaling factor for y_j. The explicit constraints make the solution less arbitrary.

This learning problem is an optimization problem of variational calculus and in general is difficult to solve. However, for the case that the input-output function components g_j are constrained to be a linear combination of a finite set of nonlinear functions, the problem simplifies significantly. An algorithm for solving the optimization problem under this constraint is given in the following section.

3 Slow Feature Analysis

Given an I-dimensional input signal x(t) = [x_1(t), ..., x_I(t)]^T, consider an input-output function g(x) = [g_1(x), ..., g_J(x)]^T, each component of which is a weighted sum over a set of K nonlinear functions h_k(x): g_j(x) := Σ_{k=1}^{K} w_{jk} h_k(x). Usually K > max(I, J). Applying h = [h_1, ..., h_K]^T to the input signal yields the nonlinearly expanded signal z(t) := h(x(t)). After this nonlinear expansion, the problem can be treated as linear in the expanded signal components z_k(t). This is a common technique to turn a nonlinear problem into a linear one. A well-known example is the support vector machine (Vapnik, 1995). The weight vectors w_j = [w_{j1}, ..., w_{jK}]^T are subject to learning, and the jth output signal component is given by y_j(t) = g_j(x(t)) = w_j^T h(x(t)) = w_j^T z(t). The objective (see equation 2.1) is to optimize the input-output function and thus the weights such that
Δ(y_j) = ⟨ẏ_j²⟩ = w_j^T ⟨ż ż^T⟩ w_j   (3.1)

is minimal. Assume the nonlinear functions h_k are chosen such that the expanded signal z(t) has zero mean and a unit covariance matrix. Such a set h_k of nonlinear functions can be easily derived from an arbitrary set h′_k by a sphering stage, as will be explained below. Then we find that the constraints (see equations 2.2–2.4)

⟨y_j⟩ = w_j^T ⟨z⟩ = 0   (since ⟨z⟩ = 0),   (3.2)
⟨y_j²⟩ = w_j^T ⟨z z^T⟩ w_j = w_j^T w_j = 1   (since ⟨z z^T⟩ = I),   (3.3)
∀ j′ < j:  ⟨y_{j′} y_j⟩ = w_{j′}^T ⟨z z^T⟩ w_j = w_{j′}^T w_j = 0   (since ⟨z z^T⟩ = I),   (3.4)
are automatically fulfilled if and only if we constrain the weight vectors to be an orthonormal set of vectors. Thus, for the first component of the input-output function, the optimization problem reduces to finding the normed weight vector that minimizes Δ(y_1) of equation 3.1. The solution is the normed eigenvector of matrix ⟨ż ż^T⟩ that corresponds to the smallest eigenvalue (cf. Mitchison, 1991). The eigenvectors of the next higher eigenvalues produce the next components of the input-output function with the next higher Δ values. This leads to an algorithm for solving the optimization problem stated above.

It is useful to make a clear distinction among raw signals, exactly normalized signals derived from training data, and approximately normalized signals derived from test data. Let x̃(t) be a raw input signal that can have any mean and variance. For computational convenience and display purposes, the signals are normalized to zero mean and unit variance. This normalization is exact for the training data x(t). Correcting test data by the same offset and factor will in general yield an input signal x′(t) that is only approximately normalized, since each data sample has a slightly different mean and variance, while the normalization is always done with the offset and factor determined from the training data. In the following, raw signals have a tilde, and test data have a dash; symbols with neither a tilde nor a dash usually (but not always) refer to normalized training data. The algorithm now has the following form (see also Figure 2):

1. Input signal. For training, an I-dimensional input signal is given by x̃(t).

2. Input signal normalization. Normalize the input signal to obtain

x(t) := [x_1(t), ..., x_I(t)]^T   (3.5)
with x_i(t) := (x̃_i(t) − ⟨x̃_i⟩) / √⟨(x̃_i − ⟨x̃_i⟩)²⟩,   (3.6)
so that ⟨x_i⟩ = 0   (3.7)
and ⟨x_i²⟩ = 1.   (3.8)
3. Nonlinear expansion. Apply a set of nonlinear functions h̃(x) to generate an expanded signal z̃(t). Here all monomials of degree one (resulting in linear SFA, sometimes denoted by SFA1) or of degree one and two, including mixed terms such as x_1 x_2 (resulting in quadratic SFA, sometimes denoted by SFA2), are used, but any other set of functions could be used as well. Thus, for quadratic SFA,

h̃(x) := [x_1, ..., x_I, x_1 x_1, x_1 x_2, ..., x_I x_I]^T,   (3.9)
z̃(t) := h̃(x(t)) = [x_1(t), ..., x_I(t), x_1(t) x_1(t), x_1(t) x_2(t), ..., x_I(t) x_I(t)]^T.   (3.10)

Using first- and second-degree monomials, h̃(x) and z̃(t) are of dimensionality K = I + I(I + 1)/2.
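The dimensionality count K = I + I(I + 1)/2 can be checked directly. A minimal sketch of the quadratic expansion (the helper name `quadratic_expand` is ours, not from the paper):

```python
import numpy as np

def quadratic_expand(x):
    """All monomials of degree one and two of each row of x (shape (T, I))."""
    T, I = x.shape
    iu = np.triu_indices(I)                     # pairs (i, j) with i <= j
    quad = (x[:, :, None] * x[:, None, :])[:, iu[0], iu[1]]
    return np.hstack([x, quad])                 # K = I + I*(I+1)/2 columns

x = np.random.randn(100, 4)                     # I = 4
z = quadratic_expand(x)                         # K = 4 + 10 = 14 columns
```

Taking only the upper triangle of the outer product avoids duplicating x_i x_j and x_j x_i, which would otherwise make the expanded representation even more redundant.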
4. Sphering. Normalize the expanded signal z̃(t) by an affine transformation to generate z(t) with zero mean and identity covariance matrix I:

z(t) := S (z̃(t) − ⟨z̃⟩),   (3.11)
with ⟨z⟩ = 0   (3.12)
and ⟨z z^T⟩ = I.   (3.13)
This normalization is called sphering (or whitening). Matrix S is the sphering matrix and can be determined with the help of principal component analysis (PCA) on the matrix (z̃(t) − ⟨z̃⟩). It therefore depends on the specific training data set. This also defines

h(x) := S (h̃(x) − ⟨z̃⟩),   (3.14)
which is a normalized function, while z(t) is the sphered data.

5. Principal component analysis. Apply PCA to the matrix ⟨ż ż^T⟩. The J eigenvectors with the lowest eigenvalues λ_j yield the normalized weight vectors w_j:

⟨ż ż^T⟩ w_j = λ_j w_j   (3.15)
with λ_1 ≤ λ_2 ≤ ··· ≤ λ_J,   (3.16)

which provide the input-output function

g(x) := [g_1(x), ..., g_J(x)]^T   (3.17)
with g_j(x) := w_j^T h(x)   (3.18)
Figure 2: Facing page. Illustration of the learning algorithm by means of a simplified example. (a) Input signal x̃(t) is given by x̃_1(t) := sin(t) + cos²(11t), x̃_2(t) := cos(11t), t ∈ [0, 2π], where sin(t) constitutes the slow feature signal. Shown is the normalized input signal x(t). (b) Expanded signal z̃(t) is defined as z̃_1(t) := x_1(t), z̃_2(t) := x_2(t), and z̃_3(t) := x_2²(t). x_1²(t) and x_1(t) x_2(t) are left out for easier display. (c) Sphered signal z(t) has zero mean and unit covariance matrix. Its orientation in space is algorithmically determined by the principal axes of z̃(t) but otherwise arbitrary. (d) Time derivative signal ż(t). The direction of minimal variance determines the weight vector w_1. This is the direction in which the sphered signal z(t) varies most slowly. The axes of next higher variance determine the weight vectors w_2 and w_3, shown as dashed lines. (e) Projecting the sphered signal z(t) onto the w_1-axis yields the first output signal component y_1(t), which is the slow feature signal sin(t). (f) The first component g_1(x_1, x_2) of the input-output function derived by steps a to e is shown as a contour plot.
and the output signal

y(t) := g(x(t))   (3.19)
with ⟨y⟩ = 0,   (3.20)
⟨y y^T⟩ = I,   (3.21)
and Δ(y_j) = ⟨ẏ_j²⟩ = λ_j.   (3.22)
The components of the output signal have exactly zero mean and unit variance, and they are uncorrelated.

6. Repetition. If required, use the output signal y(t) (or the first few components of it, or a combination of different output signals) as an input signal x(t) for the next application of the learning algorithm. Continue with step 3.

7. Test. In order to test the system on a test signal, apply the normalization and input-output function derived in steps 2 to 6 to a new input signal x̃′(t). Notice that this test signal needs to be normalized with the same offsets and factors as the training signal to reproduce the learned input-output relation accurately. Thus, the test signal is normalized only approximately to yield

x′(t) := [x′_1(t), ..., x′_I(t)]^T   (3.23)
with x′_i(t) := (x̃′_i(t) − ⟨x̃_i⟩) / √⟨(x̃_i − ⟨x̃_i⟩)²⟩,   (3.24)
so that ⟨x′_i⟩ ≈ 0   (3.25)
and ⟨x′_i²⟩ ≈ 1.   (3.26)
The normalization is accurate only to the extent that the test signal is representative of the training signal. The same is true for the output signal

y′(t) := g(x′(t))   (3.27)
with ⟨y′⟩ ≈ 0   (3.28)
and ⟨y′ y′^T⟩ ≈ I.   (3.29)
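Steps 1 through 5 can be condensed into a short script. This is a minimal sketch, not the authors' original Mathematica code; the function name `sfa` and the tolerance for discarding near-zero components are our own assumptions, and sphering is done via SVD as suggested for degenerate data:

```python
import numpy as np

def sfa(x, n_out=3, eps=1e-7):
    """Quadratic SFA, a sketch of steps 2-5 of the algorithm above.

    x : array of shape (T, I), one time sample per row.
    Returns the output signal (shape (T, n_out)) and eigenvalues
    proportional to Delta(y_j) (the time step is not factored out,
    which leaves the ordering of components unchanged).
    """
    T, I = x.shape
    # Step 2: normalize the input to zero mean and unit variance.
    x = (x - x.mean(0)) / x.std(0)
    # Step 3: expand with all monomials of degree one and two.
    iu = np.triu_indices(I)
    z = np.hstack([x, (x[:, :, None] * x[:, None, :])[:, iu[0], iu[1]]])
    # Step 4: sphering via SVD; near-zero singular values are discarded.
    z = z - z.mean(0)
    U, s, Vt = np.linalg.svd(z, full_matrices=False)
    z = U[:, s > eps * s[0]] * np.sqrt(T)      # <z> = 0, <z z^T> = I
    # Step 5: PCA on the time derivative; smallest eigenvalues = slowest.
    zdot = np.diff(z, axis=0)
    evals, evecs = np.linalg.eigh(zdot.T @ zdot / (T - 1))
    return z @ evecs[:, :n_out], evals[:n_out]
```

Applied to the toy signals of Figure 2 (x̃_1(t) = sin(t) + cos²(11t), x̃_2(t) = cos(11t)), the first output component should reproduce the slow feature sin(t) up to sign and scaling, since sin(t) = x̃_1(t) − x̃_2²(t) lies exactly in the span of the quadratic expansion.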
For practical reasons, singular value decomposition is used in steps 4 and 5 instead of PCA. Singular value decomposition is preferable for analyzing degenerate data in which some eigenvalues are very close to zero; these are then discarded in step 4. The nonlinear expansion sometimes leads to degenerate data, since it produces a highly redundant representation in which some components may have a linear relationship. In general, signal components with eigenvalues close to zero typically contain noise, such as rounding errors, which after normalization is very quickly fluctuating and
Figure 3: Two possible network structures for performing SFA. (Top) Interpretation as a group of units with complex computation on the dendritic trees (thick lines), such as sigma-pi units. (Bottom) Interpretation as a layered network of simple units with fixed nonlinear units in the hidden layer, such as radial basis function networks with nonadaptable hidden units. In both cases, the input-output function components are given by g_j(x) = w_j^T h(x) = w̃_{j0} + w̃_j^T h̃(x), with appropriate raw weight vectors w̃_j. The input signal components are assumed to be normalized here.
would not be selected by SFA in step 5 in any case. Thus, the decision as to which small components should be discarded is not critical.

It is useful to measure the invariance of signals not by the value of Δ directly but by a measure that has a more intuitive interpretation. A good
measure may be an index η defined by

η(y) := (T / 2π) √Δ(y)   (3.30)
for t ∈ [t_0, t_0 + T]. For a pure sine wave y(t) := √2 sin(n 2π t / T) with an integer number of oscillations n, the index η(y) is just the number of oscillations, that is, η(y) = n.¹ Thus, the index η of an arbitrary signal indicates what the number of oscillations would be for a pure sine wave of the same Δ value, at least for integer values of η. Low η values indicate slow signals. Since output signals derived from test data are only approximately normalized, η(y′) is meant to include an exact normalization of y′ to zero mean and unit variance, to make the η index independent of an accidental scaling factor.

¹ For symmetry reasons and since (√2 sin(x))² + (√2 cos(x))² = 2, it is evident that ⟨y⟩ = 0 and ⟨y²⟩ = 1 if y(t) := √2 sin(n 2π t / T) is averaged over t ∈ [t_0, t_0 + T], that is, over an integer number of oscillations n. Setting t_0 = 0 without loss of generality, we find that

Δ(y) = ⟨ẏ²⟩ = (1/T) ∫_0^T (n² 4π² / T²) 2 cos²(n 2π t / T) dt = (n² 4π² / T²) (1 / (n 2π)) ∫_0^{n 2π} 2 cos²(t′) dt′ = n² 4π² / T²

and η(y) = (T / 2π) √Δ(y) = (T / 2π) √(n² 4π² / T²) = n.

3.1 Neural Implementation. The SFA algorithm is formulated for a training input signal x(t) of definite length. During learning, it processes this signal as a single batch in one shot and does not work incrementally, as online learning rules do. However, SFA is a computational model for neural processing in biological systems and can be related to two standard network architectures (see Figure 3). The nonlinear basis functions h̃_k(x) can, for instance, be considered as synaptic clusters on the dendritic tree, which locally perform a fixed nonlinear transformation on the input data and can be weighted independently of other synaptic clusters performing other, but also fixed, nonlinear transformations on the input signal (see Figure 3, top). Sigma-pi units (Rumelhart, Hinton, & McClelland, 1986), for instance, are of this type. In another interpretation, the nonlinear basis functions could be realized by fixed nonlinear units in a hidden layer. These then provide weighted inputs to linear output units, which can be trained (see Figure 3, bottom). The radial basis function network with nonadaptable basis functions is an example of this interpretation (cf. Becker & Hinton, 1995; see also Bishop, 1995, for an introduction to radial basis function networks). Depending on the type of nonlinear functions used for h̃_k(x), the first or the second interpretation is more appropriate. Lateral connections between the output
units g_j, either only from lower to higher units, as shown in the figure, or between all units, are needed to decorrelate the output signal components by some variant of anti-Hebbian learning (see Becker & Plumbley, 1996, for an overview). Each of these two networks forms a functional unit performing SFA. In the following, we will refer to such a unit as an SFA module, modeled by the algorithm described above.

4 Examples

The properties of the learning algorithm are now illustrated by several examples. The first example is about learning the response behavior of complex cells based on simple cell responses (simple and complex cells are two types of cells in primary visual cortex). The second one is similar but also includes estimation of disparity and motion. One application of slow feature analysis is sufficient for these two examples. The third example is more abstract and requires a more complicated input-output function, which can be approximated by three SFAs in succession. This leads to the fourth example, which shows a hierarchical network of SFA modules learning translation invariance. This is generalized to other invariances in example 5. Each example illustrates a different aspect of SFA; all but the third example also refer to specific learning problems in the visual system and present possible solutions on a computational level based on SFA, although these examples do not claim to be biologically plausible in any detail. All simulations were done with Mathematica.

4.1 Examples 1 and 2: Complex Cells, Disparity, and Motion Estimation. The first two examples are closely related and use subsets of the same data. Consider five monocular simple cells for the left and right eyes with receptive fields, as indicated in Figure 4.
The simple cells are modeled by spatial Gabor wavelets (Jones & Palmer, 1987), whose responses x̃(t) to a visual stimulus smoothly moving across the receptive field are given by a combination of a nonnegative amplitude ã(t) and a phase φ̃(t), both varying in time: x̃(t) := ã(t) sin(φ̃(t)). The output signals of the five simple cells shown in Figure 4 are modeled by

x̃_1(t) := (4 + a_0(t)) sin(t + 4 φ_0(t)),   (4.1)
x̃_2(t) := (4 + a_1(t)) sin(t + 4 φ_1(t)),   (4.2)
x̃_3(t) := (4 + a_1(t)) sin(t + 4 φ_1(t) + π/4),   (4.3)
x̃_4(t) := (4 + a_1(t)) sin(t + 4 φ_1(t) + π/2 + 0.5 φ_D(t)),   (4.4)
x̃_5(t) := (4 + a_1(t)) sin(t + 4 φ_1(t) + 3π/4 + 0.5 φ_D(t)),   (4.5)
Figure 4 (Examples 1 and 2): (Top) Receptive fields of five simple cells for the left and right eye, which provide the input signal x̃(t). (Bottom) Selected normalized input signal components x_i(t) plotted versus time and versus each other. All graphs range from −4 to +4; time axes range from 0 to 4π.
t ∈ [0, 4π]. All signals have a length of 512 data points, resulting in a step size of Δt = 4π/512 between successive data points. The amplitude and phase modulation signals are low-pass filtered gaussian white noise normalized to zero mean and unit variance. The width σ of the gaussian low-pass filters is 30 data points for φ_D(t) and 10 for the other four signals: a_0(t), a_1(t), φ_0(t), and φ_1(t). Since the amplitude signals a_0(t) and a_1(t) have positive and negative values, an offset of 4 was added to shift these signals to a positive range, as required for amplitudes. The linear ramp t within the sine ensures that all phases occur equally often; otherwise, a certain phase would be overrepresented, because the phase signals φ_0(t) and φ_1(t) are concentrated around zero. The factor 4 in front of the phase signals ensures that phase changes more quickly than amplitude. The raw signals x̃_1(t), ..., x̃_5(t) are
normalized to zero mean and unit variance, yielding x_1(t), ..., x_5(t). Five additional signal components are obtained by a time delay of one data point: x_6(t) := x_1(t − Δt), ..., x_10(t) := x_5(t − Δt). Some of these 10 signals are shown in Figure 4 (bottom), together with some trajectory plots of one component versus another. The η value of each signal as given by equation 3.30 is shown in the upper right corner of the signal plots.

Since the first simple cell has a different orientation and location from the others, it also has a different amplitude and phase modulation. This makes it independent, as can be seen from the trajectory plot x_2(t) versus x_1(t), which does not show any particular structure. The first simple cell serves only as a distractor, which needs to be ignored by SFA. The second and third simple cells have the same location, orientation, and frequency. They therefore have the same amplitude and phase modulation. However, the positive and negative subregions of the receptive fields are slightly shifted relative to each other. This results in a constant phase difference of 45° (π/4) between these two simple cells. This is reflected in the trajectory plot x_3(t) versus x_2(t) by the elliptic shape (see Figure 4). Notice that a phase difference of 90° (π/2) would be computationally more convenient but is not necessary here. If the two simple cells had a phase difference of 90°, like sine- and cosine-Gabor wavelets, the trajectory would describe circles, and the square of the desired complex cell response a_1(t) would be just the square sum x_2²(t) + x_3²(t). Since the phase difference is 45°, the transformation needs to be slightly more complex to represent the elliptic shape of the x_3(t)-x_2(t) trajectory. The fourth and fifth simple cells have the same relationship as the second and third ones do, but for the right eye instead of the left eye.
If the two eyes received identical input, the third and fifth simple cells would also have this relationship—the same amplitude and phase modulation with a constant phase difference. However, disparity induces a shift of the right image versus the left image, which results in an additional slowly varying phase difference φ_D(t). This leads to the slowly varying shape of the ellipses in the trajectory plot x_5(t) versus x_3(t), varying back and forth between slim left-oblique ellipses over circles to slim right-oblique ellipses. This phase difference between simple cell responses of different eyes can be used to estimate disparity (Theimer & Mallot, 1994). Since x_10(t) is a time-delayed version of x_5(t), the vertical distance from the diagonal in the trajectory plot x_10(t) versus x_5(t) in Figure 4 is related to the time derivative of x_5(t). Just as the phase difference between corresponding simple cells of different eyes can be used to estimate disparity, phase changes over time can be used to estimate the direction and speed of motion of the visual stimulus (Fleet & Jepson, 1990).

In the first example, x_1(t), x_2(t), and x_3(t) serve as an input signal. Only the common amplitude modulation a_1(t) of x_2(t) and x_3(t) represents a slow feature and can be extracted from this signal. Example 2 uses all five normalized simple cell responses, x_1(t), ..., x_5(t), plus the time-delayed versions
of them, x_6(t) := x_1(t − Δt), ..., x_10(t) := x_5(t − Δt). Several different slow features, such as motion and disparity, can be extracted from this richer signal.

Example 1. Consider the normalized simple cell responses x_1(t), x_2(t), and x_3(t) as an input signal for SFA. Figure 5 (top) shows the input signal trajectories already seen in Figure 4, cross sections through the first three components of the learned input-output function g(x) (arguments not varied are not listed and are set to zero; e.g., g_1(x_2, x_3) means g_1(0, x_2, x_3)), and the first three components of the extracted output signal y(t). The first component of the input-output function represents the elliptic shape of the x_3(t)-x_2(t) trajectory correctly and ignores the first input signal component x_1(t). It therefore extracts the amplitude modulation a_1(t) (actually the square of it) as desired and is insensitive to the phase of the stimulus (cf. Figure 5, bottom right). The correlation² r between a_1(t) and −y_1(t) is 0.99. The other input-output function and output signal components are not related to slow features and can be ignored. This becomes clear from the η values of the output signal components (see Figure 5, bottom left). Only η(y_1) is significantly lower than those of the input signal. Phase cannot be extracted, because it is a cyclic variable. A reasonable representation of a cyclic variable would be its sine and cosine values, but this would be almost the original signal and is therefore not a slow feature. When trained and tested on signals of length 16π (2048 data points), the correlation is 0.981 ± 0.004 between a_1(t) and ±y_1(t) (training data) and 0.93 ± 0.04 between a′_1(t) and ±y′_1(t) (test data) (means over 10 runs ± standard deviation).

Example 2. The previous example can be extended to binocular input and the time domain.
The input signal is now given by all normalized simple cell responses described above and the time-delayed versions of them: x1(t), ..., x10(t). Since the input is 10-dimensional, the output signal of a second-degree polynomial SFA can be potentially 65-dimensional. However, singular value decomposition detects two dimensions with zero variance—the nonlinearly expanded signal has two redundant dimensions—so that only 63 dimensions remain. The η values of the 63 output signal components are shown in Figure 6 (top left). Interestingly, they grow almost linearly with index value j. Between 10 and 16 components vary significantly more slowly than the input signal. This means that in contrast to the previous example, there are now several slow features being extracted from the input signal. These include elementary features, such as amplitude (or complex cell response) a1(t) and disparity φ_D(t), but also more complex
² A correlation coefficient r between two signals a(t) and b(t) is defined as r(a, b) := ⟨(a − ⟨a⟩)(b − ⟨b⟩)⟩ / √(⟨(a − ⟨a⟩)²⟩⟨(b − ⟨b⟩)²⟩). Since the signs of output signal components are arbitrary, they will be corrected if required to obtain positive correlations, which is particularly important for averaging over several runs of an experiment.
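The footnote's definition, together with the sign correction it mentions, can be sketched as follows (the function names `corr` and `corr_sign_corrected` are ours, for illustration):

```python
import numpy as np

def corr(a, b):
    """Correlation coefficient as in the footnote:
    <(a-<a>)(b-<b>)> / sqrt(<(a-<a>)^2> <(b-<b>)^2>)."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).mean() / np.sqrt((a * a).mean() * (b * b).mean())

def corr_sign_corrected(a, b):
    """Since the signs of output components are arbitrary, flip the sign
    if needed to obtain a positive correlation before averaging runs."""
    return abs(corr(a, b))
```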
Figure 5 (Example 1): Learning to extract the complex cell response a1(t) from normalized simple cell responses x1(t), x2(t), and x3(t). (Top) First three components of the learned input-output function g(x), which transforms the input signal x(t) into the output signal y(t). All signal components have unit variance, and all graphs range from −4 to +4, including the gray-value scale of the contour plots. The thick contour lines indicate the value 0, and the thin ones indicate ±2 and ±4; white is positive. Time axes range from 0 to 4π. (Bottom left) η values of the input and the output signal components. Only y1 is a relevant invariant. η(a1) is shown at i, j = 0 for comparison. (Bottom right) −y1(t) correlates well with the amplitude a1(t).
ones, such as the phase change φ̇1(t) (which is closely related to the velocity of stimulus movement), the square of the disparity φ_D²(t), or even the product φ_D(t)ȧ1(t) and other nonlinear combinations of the phase and amplitude signals that were used to generate the input signal. If so many slow features are extracted, the decorrelation constraint is not sufficient to isolate the slow features into separate components. For instance, although the motion signal φ̇1(t) is mainly represented by −y8(t) with a correlation of 0.66, part of it is also represented by −y5(t) and −y6(t) with correlations of 0.34 and 0.33, respectively. This is inconvenient but in many cases not critical. For example, for a repeated application of SFA, such as in the following examples, the distribution of slow features over output signal components is irrelevant as long as they are concentrated in the first components. We can therefore ask how well the slow features are encoded by the first 14 output components, for instance. Since the output signals obtained from training data form an orthonormal system, the optimal linear combination to represent a normalized slow signal s(t) is given by

    Ỹ14[s](t) := Σ_{j=1}^{14} ⟨s y_j⟩ y_j(t),        (4.6)

where Ỹ_l indicates a projection operator onto the first l output signal components. Normalization to zero mean and unit variance yields the normalized projected signal Y14[s](t). The correlation of φ̇1(t) with Y14[φ̇1](t) is 0.94. Figure 6 and Table 1 give an overview of some slow feature signals and their respective correlations r with their projected signals. Since the test signals have to be processed without explicit knowledge about the slow feature signal s′ to be extracted, the linear combinations for test signals are computed with the coefficients determined on the training signal, and we write Ỹ′14[s](t) := Σ_{j=1}^{14} ⟨s y_j⟩ y′_j(t). Normalization yields Y′14[s](t).

4.2 Example 3: Repeated SFA. The first two examples were particularly easy because a second-degree polynomial was sufficient to recover the slow features well. The third example is more complex. First generate two random time series, one slowly varying, xs(t), and one fast varying, xf(t) (gaussian white noise, low-pass filtered with σ = 10 data points for xs and σ = 3 data points for xf). Both signals have a length of 512 data points, zero mean, and unit variance. They are then mixed to provide the raw input signal x̃(t) := [xf, sin(2xf) + 0.5xs]^T, which is normalized to provide x(t). The task is to extract the slowly varying signal xs(t). The input-output function required to extract the slow feature xs cannot be well approximated by a polynomial of degree two. One might therefore use polynomials of third or higher degree. However, one can also repeat the learning algorithm, applying it with second-degree polynomials, leading
Figure 6 (Example 2): Learning to extract several slow features from normalized simple cell responses x1, x2, ..., x10. (Top left) η values of the input and the output signal components. The first 14 output components were assumed to carry relevant information about slow features. (Top right, bottom left, and bottom right) The first 14 output components contain several slow features: elementary ones, such as the disparity φ_D and the phase variation φ̇1 (which is closely related to motion velocity), and combinations of these, such as the product of disparity and amplitude change φ_D ȧ1. Shown are the slow feature signals, the optimal linear combinations of output signal components, and the correlation trajectories of the two.
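Equation 4.6 amounts to an ordinary orthogonal projection followed by renormalization. A minimal sketch (the function name `project_onto_outputs` is ours), assuming the output components y_j form a zero-mean, unit-variance, decorrelated system as stated in the text:

```python
import numpy as np

def project_onto_outputs(s, y, l=14):
    """Optimal linear combination of the first l output components to
    represent a normalized slow signal s (cf. equation 4.6):
    Y~_l[s](t) = sum_j <s y_j> y_j(t), assuming the y_j are zero-mean,
    unit-variance, and decorrelated (orthonormal system)."""
    s = (s - s.mean()) / s.std()
    coeffs = [(s * y[:, j]).mean() for j in range(l)]
    proj = sum(c * y[:, j] for c, j in zip(coeffs, range(l)))
    # Normalize to zero mean and unit variance -> Y_l[s](t).
    return (proj - proj.mean()) / proj.std()
```

For test data, the same coefficients determined on the training outputs would be applied to the test outputs y′_j, as described in the text.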
Table 1 (Example 2): η Values and Correlations r for Various Slow Feature Signals s Under Different Experimental Conditions.

                          SFA1             SFA2        SFA2             SFA2             SFA2
            η(s′)         r(s′, Y′10[s])   j* ≤ 14     r(s′, ±y′_j*)    r(s′, Y′14[s])   r(s, Y14[s])
  Data      Testing       Testing          Training    Testing          Testing          Training

s = φ_D     22 ± 2.0      −.03 ± .07       1.0 ± 0.0   .87 ± .02        .88 ± .01        .90 ± .01
s = φ_D²    32 ± 3.0      −.03 ± .07       2.1 ± 0.3   .80 ± .07        .89 ± .03        .91 ± .02
s = φ̇_D     37 ± 1.0      −.02 ± .04       2.8 ± 0.4   .78 ± .11        .89 ± .01        .90 ± .01
s = a1      66 ± 4.0      −.00 ± .08       4.0 ± 0.0   .96 ± .01        .98 ± .00        .99 ± .00
s = φ1      66 ± 2.0      .01 ± .05        6.0 ± 4.2   .04 ± .02        .02 ± .04        .22 ± .06
s = φ̇1      113 ± 2.0     −.01 ± .04       7.9 ± 1.2   .53 ± .12        .85 ± .02        .87 ± .02
s = φ_D ȧ1  114 ± 5.0     −.00 ± .06       8.7 ± 2.1   .58 ± .12        .89 ± .03        .91 ± .03

Note: SFA1 and SFA2 indicate linear and quadratic SFA, respectively. j* is the index of that output signal component y_j* among the first 14 that is best correlated with s on the training data. Y_l[s] is an optimal linear combination of the first l output signal components to represent s (see equation 4.6). All figures are means over 10 simulation runs, with the standard deviation given after the ± sign. Signal length was always 4096. We were particularly interested in the signals φ_D (disparity), a1 (complex cell response), and φ̇1 (indicating stimulus velocity), but several other signals get extracted as well, some of which are shown in the table. Notice that linear SFA with 10 input signal components can generate only 10 output signal components and is obviously not sufficient to extract any of the considered feature signals. For some feature signals, such as φ_D and a1, it is sufficient to use only the single best correlated output signal component. Others are more distributed over several output signal components, for example, φ̇1. Phase φ1 is given as an example of a feature signal that cannot be extracted, since it is a cyclic variable.
to input-output functions of degree two, four, eight, sixteen, and so on. To avoid an explosion of signal dimensionality, only the first few components of the output signal of one SFA are used as an input for the next SFA. In this example, only three components were propagated from one SFA to the next. This cuts down the computational cost of this iterative scheme significantly compared to the direct scheme of using polynomials of higher degree, at least for high-dimensional input signals, for which the polynomials would become so numerous that they are computationally prohibitive. Figure 7 shows the input signal, input-output functions, and output signals for three SFAs in succession (only the first two components are shown). The plotted input-output functions always include the transformations computed by previous SFAs. The approximation of g(x) to the sinusoidal shape of the trajectories becomes better with each additional SFA; compare, for instance, g2,2 with g3,1. The η values shown at the bottom left indicate that only the third SFA extracts a slow feature. The first and second SFA did not yield an η value lower than that of the input signal. For the first, second, and third SFA, xs(t) was best correlated with the third, second (with inverted sign), and first output signal component, with correlation coefficients of 0.59, 0.61, and 0.97, respectively. The respective trajectory plots are shown at the bottom right of Figure 7. Notice that each SFA was trained in an unsupervised manner and without backpropagating error signals. Results for longer signals and multiple simulation runs are given in Table 2. The algorithm can extract not only slowly varying features but also rarely varying ones. To demonstrate this, we tested the system with a binary signal that occasionally switches between two possible values.
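The input signal of Example 3 is easy to reproduce. A sketch under the stated assumptions (gaussian low-pass filtering of white noise with σ = 10 and σ = 3 data points, length 512, mixing x̃(t) := [xf, sin(2xf) + 0.5xs]^T); the helper names `lowpass_gauss` and `normalize` are ours:

```python
import numpy as np

def lowpass_gauss(v, sigma):
    """Low-pass filter by convolution with a gaussian kernel
    (scipy.ndimage.gaussian_filter1d would do the same job)."""
    radius = int(4 * sigma)
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    return np.convolve(v, k / k.sum(), mode='same')

def normalize(v):
    """Zero mean, unit variance."""
    return (v - v.mean()) / v.std()

rng = np.random.default_rng(1)
xs = normalize(lowpass_gauss(rng.standard_normal(512), 10))  # slow feature
xf = normalize(lowpass_gauss(rng.standard_normal(512), 3))   # fast carrier
# Raw input x~(t) := [xf, sin(2 xf) + 0.5 xs]^T, then normalized.
x = np.column_stack([normalize(xf), normalize(np.sin(2 * xf) + 0.5 * xs)])
```

The rarely changing signal xr(t) of Figure 8 would be generated analogously by applying `np.sign` to low-pass filtered noise with σ = 150 before normalization.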
To generate a rarely changing feature signal xr(t), we applied the sign function to low-pass filtered gaussian white noise and normalized the signal to zero mean and unit variance. The σ of the gaussian low-pass filter was chosen to be 150 to make the η values of xr(t) similar to those of xs(t) in Table 2. Figure 8 shows the rarely changing signal xr(t) and the single best correlated output signal components of the three SFAs in succession. For the first, second, and third SFA, xr(t) was best correlated with the third, third, and first output signal component, with corresponding correlation coefficients of 0.43, 0.69, and 0.97. This is similar to the results obtained with the slowly varying feature signal xs(t). Table 2 shows results on training and test data for slowly and rarely varying feature signals of length 4096. They confirm that there is no significant difference between slowly and rarely varying signals, given a comparable η value. In both cases, even two quadratic SFAs in succession do not perform much better than one linear one. It is only the third quadratic SFA that extracts the feature signal with a high correlation coefficient. As in the previous example, results improve if a linear combination of several (three) output signal components is used instead of just the single best one (this is
Table 2 (Example 3): Correlations r for Slowly (cf. Figure 7) and Rarely (cf. Figure 8) Varying Feature Signals xs(t) and xr(t) with Output Signal Components of Several SFAs in Succession.

  Data                          1×SFA1      1×SFA2      2×SFA2      3×SFA2      4×SFA2

Slowly varying signal xs(t) with η̄ = 66.1
  Training  j* ≤ 3              2.0 ± 0.0   3.0 ± 0.0   2.7 ± 0.5   1.6 ± 0.5   1.4 ± 0.5
  Testing   r(x′s, ±y′_j*)      .61 ± .02   .59 ± .02   .65 ± .08   .81 ± .10   .85 ± .11
  Testing   r(x′s, Y′3[xs])     .60 ± .02   .60 ± .02   .69 ± .07   .87 ± .06   .86 ± .10
  Training  r(xs, Y3[xs])       .62 ± .01   .62 ± .01   .74 ± .05   .89 ± .05   .91 ± .05

Rarely changing signal xr(t) with η̄ = 66.8
  Training  j* ≤ 3              2.0 ± 0.0   3.0 ± 0.0   2.7 ± 0.5   1.6 ± 0.5   1.3 ± 0.5
  Testing   r(x′r, y′_j*)       .60 ± .01   .57 ± .06   .59 ± .10   .84 ± .08   .81 ± .17
  Testing   r(x′r, Y′3[xr])     .60 ± .01   .58 ± .06   .67 ± .10   .87 ± .08   .85 ± .17
  Training  r(xr, Y3[xr])       .61 ± .01   .62 ± .02   .73 ± .07   .91 ± .04   .94 ± .03

Note: SFA1 and SFA2 indicate linear and quadratic SFA, respectively. j* is the index of that output signal component y_j* among the first three that is best correlated with s on the training data (in all cases, j* was also optimal for the test data). Y3[s] is an optimal linear combination of the first three output signal components to represent s (see equation 4.6). All figures are means over 10 simulation runs, with the standard deviation given after the ± sign. Signal length was always 4096.
strictly true only for training data, since on test data, the linear combination from the training data is used).

Figure 7 (Example 3): Hidden slowly varying feature signal discovered by three SFAs in succession. (Top left) Input signal. (Top right, middle left, and middle right) One, two, and three SFAs in succession and their corresponding output signals. The first subscript refers to the number of the SFA; the second subscript refers to its component. The slow feature signal xs(t) is the slow up and down motion of the sine curve in the trajectory plot x2(t) versus x1(t). (Bottom left) η values of the input signal components and the various output signal components. η(xs) and η(xf) are shown at i, j = 0 for comparison. Only y3,1 can be considered to represent a slow feature. (Bottom right) Slow feature signal xs(t) and its relation to some output signal components. The correlation of xs(t) with +y1,3(t), −y2,2(t), and +y3,1(t) is 0.59, 0.61, and 0.97, respectively.

4.3 Example 4: Translation Invariance in a Visual System Model.

4.3.1 Network Architecture. Consider now a hierarchical architecture as illustrated in Figure 9. Based on a one-dimensional model retina with 65 sensors, layers of linear SFA modules with convergent connectivity alternate with layers of quadratic SFA modules with direct connectivity. This division into two types of sublayers allows us to separate the contribution of simple spatial averaging from the contribution of nonlinear processing
Figure 8 (Example 3): Hidden rarely changing feature signal discovered by three SFAs in succession.
to the generation of a translation-invariant object representation. It is clear from the architecture that the receptive field size increases from bottom to top, and the units therefore become potentially able to respond to more complex features, two properties characteristic of the visual system (Oram & Perrett, 1994).

4.3.2 Training the Network. In order to train the network to learn translation invariance, the network must be exposed to objects moving translationally through the receptive field. We assume that these objects create fixed one-dimensional patterns moving across the retina with constant speed and the same direction. This ignores scaling, rotation, and so forth, which will be investigated in the next section. The effects of additive noise are studied here. The training and test patterns were low-pass filtered gaussian white noise. The size of the patterns was randomly chosen from a uniform distribution between 15 and 30 units. The low-pass filter was a gaussian with a width σ randomly chosen from a uniform distribution between 2 and 5. Each pattern was normalized to zero mean and unit variance to eliminate trivial differences between patterns, which would make object recognition easier at the end. The patterns were always moved across the retina with a constant speed of 1 spatial unit per time unit. Thus, for a given pattern and without noise, the sensory signal of a single retinal unit is an accurate image of the spatial gray-value profile of the pattern. The time between one pattern being in the center of the retina and the next one was always 150 time units. Thus, there was always a pause between the presentation of two patterns, and there were never two patterns visible simultaneously. The high symmetry of the architecture and the stimuli made it possible to compute the input-output function only for one SFA module per layer.
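The training patterns just described can be generated in a few lines. A sketch under the stated assumptions (size uniform in [15, 30], gaussian filter width uniform in [2, 5], zero mean, unit variance); the function name `make_pattern` is ours:

```python
import numpy as np

def make_pattern(rng):
    """One training/test pattern: low-pass filtered gaussian white noise.
    Size is uniform in [15, 30] units, the gaussian filter width sigma is
    uniform in [2, 5], and the pattern is normalized to zero mean and
    unit variance."""
    size = rng.integers(15, 31)                 # 15 .. 30 inclusive
    sigma = rng.uniform(2, 5)
    raw = rng.standard_normal(size)
    i = np.arange(size)
    # Gaussian low-pass filtering as a dense kernel matrix.
    kernel = np.exp(-0.5 * ((i[:, None] - i[None, :]) / sigma) ** 2)
    p = kernel @ raw
    return (p - p.mean()) / p.std()
```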
The other modules in the layer would have learned the same input-output function, since they saw the same input, just shifted by a time delay determined by the spatial distance of two modules. This cut down computational costs significantly. Notice that this symmetry also results in an implicit weight-sharing constraint, although this was not explicitly implemented (see the next section for more general examples). The sensory signals for some training and test patterns are shown in Figure 10. In the standard parameter setting, each SFA module had nine units; a stimulus with 20 patterns was used for training, and a stimulus with 50 patterns was used for testing; no noise was added. To improve generalization to test patterns, all signals transferred from one SFA module to the next were clipped at ±3.7 to eliminate extreme negative and positive values. The limit of ±3.7 was not optimized for best generalization but chosen such that clipping became visible in the standard display with a range of ±4, which is not relevant for the figures presented here.

Figure 9 (Examples 4 and 5): A hierarchical network of SFA modules as a simple model of the visual system learning translation and other invariances. Different layers correspond to different cortical areas. For instance, one could associate layers 1, 2, 3, and 4 with areas V1, V2, V4, and PIT, respectively. The retinal layer has 65 input units, representing a receptive field that is only a part of the total retina. The receptive field size of SFA modules is indicated at the left of the left-most modules. Each layer is split into an a and a b sublayer for computational efficiency and to permit a clearer functional analysis. The a modules are linear and receive convergent input, from either nine units in the retina or three neighboring SFA modules in the preceding layer. The b modules are quadratic and receive input from only one SFA module. Thus, receptive field size increases only between a b module and an a module, but not vice versa. The number of units per SFA module is variable and the same for all modules. Only the layer 4b SFA module always has nine units, to make output signals more comparable. Notice that each module is independent of the neighboring and the succeeding ones; there is no weight-sharing constraint and no backpropagation of error.
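The clipping applied between modules is a one-liner; a sketch (the helper name `clip_transfer` is ours):

```python
import numpy as np

def clip_transfer(y, limit=3.7):
    """Signals passed from one SFA module to the next are clipped at
    +/- 3.7 to suppress extreme values and improve generalization to
    test patterns."""
    return np.clip(y, -limit, limit)
```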
Figure 10 (Example 4): Sensory signal of the central retinal unit in response to three patterns. Since all patterns move with the same constant speed across the retina, the sensory signals have the same shape in time as the patterns have in space, at least for the noise-free examples. (Left) First three patterns of the training sensory signal. Pattern characteristics are (29, 4.0), (25, 4.5), and (16, 3.1), where the first number indicates pattern size and the second one the width of the gaussian low-pass filter used. (Middle) First three patterns of the test sensory signal without noise. Pattern characteristics are (22, 3.1), (26, 3.8), and (24, 2.5). (Right) Same test sensory signal but with a noise level of 1. With noise, the sensory signals of different retinal units differ in shape and not only by a time delay.
4.3.3 What Does the Network Learn? Figure 11 shows the first four components of the output signals generated by the trained network in response to three patterns of the training and test stimulus:

Instantaneous feedforward processing: Once the network is trained, processing is instantaneous. The output signal can be calculated moment by moment and does not require a continuously changing stimulus. This is important, because it permits processing also of briefly flashed patterns. However, it is convenient for display purposes to show the response to moving patterns.

No overfitting: The test output signal does not differ qualitatively from the training output signal, although it is a bit noisier. This indicates that 20 patterns are sufficient for training to avoid overfitting. However, clipping the signals as indicated above is crucial here. Otherwise, signals that get slightly out of the usual working range are quickly amplified to extreme values.

Longer responses: The response of a layer 4b unit to a moving pattern is longer than the response of a retinal sensor (compare Figures 10 and 11). This is due to the larger receptive field size. For a pattern of size 20, a retinal sensor response has a length of 20, given a stimulus speed of 1 spatial unit per time unit. A layer 4b unit responds to this pattern as soon as it moves into the receptive field and until it leaves
Figure 11 (Example 4): First four output signal components in response to three patterns of (A) the training signal and (B) the test signal. (C) All respective trajectory plots for the test signal. Time axes range from 0 to 3 × 150; all other axes range from −4 to +4. All signals are approximately normalized to zero mean and unit variance.
it, resulting in a response of length 84 = 65 (receptive field size) + 20 (pattern size) − 1.

Translation invariance: The responses to individual patterns have a shape similar to a half or a full sine wave. It may be somewhat surprising that the most invariant response should be half a sine wave and not a constant, as suggested in Figure 1. But the problem with a constant response would be that it would require a sharp onset and offset as the pattern moves into and out of the receptive field. Half a sine wave is a better compromise between signal constancy in the center and smooth onsets and offsets. Thus the output signal tends to be as translation invariant as possible under the given constraints, invariance being defined by equation 2.1.

Where-information: Some components, such as the first and the third one, are insensitive to pattern identity. Component 1 can thus be used to determine whether a pattern is in the center or the periphery of the receptive field. Similarly, component 3 can be used to determine whether a pattern is more on the left or the right side. Taken together, components 1 and 3 represent pattern location, regardless of other aspects of the pattern. This becomes particularly evident from the trajectory plot y′3(t) versus y′1(t). These two components describe a loop in phase space, each point on this loop corresponding to a unique location of the pattern in the receptive field. y1(t) and y3(t) therefore represent where-information.

What-information: Some components, such as the second and fourth one, do distinguish well among different patterns, despite the translation invariance. The response to a certain pattern can be positive or negative, strong or weak. A reasonable representation for pattern identity can therefore be constructed as follows. Take components 1, 2, and 4, and subtract the baseline response. This yields a three-dimensional response vector for each moment in time, which is zero if no pattern is present in the receptive field.
As a pattern moves through the receptive field, the amplitude of the response vector increases and decreases, but its direction tends to change little. The direction of the response vector is therefore a reasonable representation of pattern identity. This can be seen in the three trajectory plots y′2(t) versus y′1(t), y′4(t) versus y′1(t), and y′4(t) versus y′2(t). Ideally the response vector should describe a straight line going from the baseline origin to some extreme point and then back again to the origin. This is what was observed on training data if only a few patterns were used for training. When training was done with 20 patterns, the response to test data formed noisy loops rather than straight lines. We will later investigate quantitatively to what extent this representation permits translation-invariant pattern recognition.
Two aspects of the output signal were somewhat surprising. First, why does the network generate a representation that distinguishes among patterns, even though the only objective of slow feature analysis is to generate a slowly varying output? Second, why do where- and what-information get represented in separate components, although this distinction has not been built into the algorithm or the network architecture? This question is particularly puzzling since we know from the second example that SFA tends to distribute slow features over several components. Another apparent paradox is the fact that where-information can be extracted at all by means of the invariance objective, even though the general notion is that one wants to ignore pattern location when learning translation invariance (cf. section 1). To give an intuitive answer to the first question, assume that the optimal output components would not distinguish among patterns. The experiments suggest that the first component would then have approximately half-sine-wave responses for all patterns, and the second component would have full-sine-wave responses for all patterns, since a full sine wave is uncorrelated to a half sine wave and still slowly varying. But a component with half-sine-wave responses with different positive and negative amplitudes for different patterns can also be uncorrelated to the first component, and it is more slowly varying than the component with full-sine-wave responses, which contradicts the assumption. Thus, the objective of slow variation in combination with the decorrelation constraint leads to components differentiating among patterns. Why where- and what-information get represented in separate components is a more difficult issue.
The first where-component, y1(t), with half-sine-wave responses, is probably distinguished by its low η value, because it is easier for the network to generate smooth responses if they do not distinguish between different patterns, at least for larger numbers of training patterns. Notice that the what-components, y2(t) and y4(t), are noisier. It is unclear why the second where-component, y3(t), emerges so reliably (although not always as the third component), even though its η value is comparable to that of other what-components with half-sine-wave responses. For some parameter regimes, such as fewer patterns or smaller distances between patterns, no explicit where-components emerge. However, more important than the concentration of where-information in isolated components is the fact that the where-information gets extracted at all by the same mechanism as the what-information, regardless of whether explicit where-components emerged. It is interesting to compare the pattern representation of the network with the one sketched in Figure 1. We have mentioned that the sharp onsets and offsets of the sketched representation are avoided by the network, leading to typical responses in the shape of a half or full sine wave. Interestingly, however, if one divides component 2 or 4 by component 1 (all components taken minus their resting value), one obtains signals similar to
that suggested in Figure 1 for representing pattern identity: signals that are fairly constant and pattern specific if a pattern is visible and undetermined otherwise. If one divides component 3 by component 1, one obtains a signal similar to those suggested in Figure 1 for representing pattern location: one that is monotonically related to pattern location and undetermined if no pattern is visible.

4.3.4 How Translation Invariant Is the Representation? We have argued that the direction of the response vector is a good representation of the patterns. How translation invariant is this representation? To address this question, we have measured the angle between the response vector of a pattern at a reference location and at all other valid locations. The location of a pattern is defined by the location of its center (or the right center pixel if the pattern has an even number of pixels). The retina has a width of 65, ranging from −32 to +32, with the central sensor serving as the origin, that is, location 0. The largest patterns, of size 30, are at least partially visible from location −46 up to +47. The standard location for comparison is −15. The response vector is defined as a subset of output components minus their resting values. For the output signal of Figure 11, for instance, the response vector may be defined as r(t) := [y1(t) − y1(0), y2(t) − y2(0), y4(t) − y4(0)]^T if components 1, 2, and 4 are taken into account and if at time t = 0 no pattern was visible and the output was at resting level. If not stated otherwise, all components out of the nine output signal components that were useful for recognition were taken for the response vectors. In Example 4, the useful components were always determined on the training data based on recognition performance. Since at a given time t_pl the stimulus shows a unique pattern p at a unique location l, we can also parameterize the response vector by pattern index p and location l and write r_p(l) := r(t_pl).
The angle between the response vectors of pattern p at the reference location −15 and pattern p′ at a test location l is defined as

    ∠(r_p(−15), r_{p′}(l)) := arccos( (r_p(−15) · r_{p′}(l)) / (‖r_p(−15)‖ ‖r_{p′}(l)‖) ),        (4.7)

where r · r′ indicates the usual inner product and ‖r‖ indicates the Euclidean norm. Figure 12 shows percentiles for the angles of the response vectors at a test location relative to a reference location for the 50 test patterns.
4.3.5 What Is the Recognition Performance? If the ultimate purpose of the network is learning a translation-invariant representation for pattern recognition, we should characterize the network in terms of recognition performance. This can be done by using the angles between response vectors as a simple similarity measure. But instead of considering the raw recognition rates, we will characterize the performance in terms of the ranking of
Figure 12 (Example 4): Percentiles of the angles ∠(r_p(−15), r_p(l)) between the response vectors at a test location l and the reference location −15 for 50 test patterns p. For this graph, components 1, 2, and 4 were used (cf. Figure 11). The thick line indicates the median angle or 50th percentile. The dashed lines indicate the smallest and largest angles. The bottom and top thin lines indicate the 10th and 90th percentiles, respectively. The white area indicates the size of the retina (65 units). Gray areas indicate locations outside the retina. Patterns presented at the edge between white and gray are only half visible to the retina. The average angle over the white area is 12.6 degrees. If the vectors were drawn randomly, the average and median angle would be 90 degrees (except for the reference location, where angles are always zero). Thus, the angles are relatively stable over different pattern locations. Results are very similar for the training patterns, with an average angle of 11.3 degrees.
the patterns induced by the angles, since that is a more rened measure. Take the response vectors of all patterns at the reference location ¡15 as the stored models, which should be recognized. Then compare a response vector rp0 (l) of a pattern p0 at a test location l with all stored models rp (¡15) in terms of the enclosed angles # (rp (¡15) , rp0 ( l) ) , which induce a ranking. If # (rp0 (¡15) , rp0 ( l) ) is smaller than all other angles # (rp (¡15) , rp0 ( l) ) with 6 p0 , then pattern p0 can be recognized correctly at location l, and it is on pD rank 1. If two other patterns have a smaller angle than pattern p0 , then it is on rank 3, and so on. Figure 13 illustrates how the ranking is induced, and Figure 14 shows rank percentiles for all 20 training patterns and 50 test patterns over all valid locations.
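The angle-induced ranking just described can be sketched as follows (an illustration under our own naming, not code from the paper):

```python
import numpy as np

def recognition_rank(stored_models, probe, true_index):
    """Rank of the correct stored model when models are sorted by angle to the probe.

    stored_models: response vectors recorded at the reference location.
    probe: response vector of pattern `true_index` at a test location.
    Rank 1 means the pattern is recognized correctly.
    """
    def angle(a, b):
        c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(c, -1.0, 1.0))

    angles = [angle(m, probe) for m in stored_models]
    order = np.argsort(angles)            # smallest angle (most similar) first
    return int(np.where(order == true_index)[0][0]) + 1
```

A probe vector close in direction to its own stored model comes out at rank 1; if two other models enclose a smaller angle with the probe, the pattern lands on rank 3, as in the text.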
Figure 13 (Examples 4 and 5): Pattern recognition based on the angles of the response vectors. Solid arrows indicate the response vectors for objects at the reference location. Dashed arrows indicate the response vectors for objects shown at a test location. The angles between a dashed vector and the solid vectors induce a ranking of the stored objects. If the correct object is at rank one, the object is recognized correctly, as in the case of Object 1 in the figure.
For the training patterns, there is a large range within which the median rank is 1; at least 50% of the patterns would be correctly recognized. This even holds for location +15, where there is no overlap between the pattern at the test location and the pattern at the reference location. Performance degrades slightly for the test patterns. The average ranks over all test locations within the retina (the white area in the graph) are 1.9 for the training patterns and 6.9 for the test patterns. Since these average ranks depend on the number of patterns, we will normalize them by subtracting 1 and dividing by the number of patterns minus 1. This gives a normalized average rank between 0 and 1, with 0 indicating perfect performance (always rank 1) and 1 indicating worst performance (always last rank). The normalized average ranks for the graphs in Figure 14 are 0.05 and 0.12 for the training and test patterns, respectively; chance levels were about 0.5. Notice that the similarity measure used here is simple and gives only a lower bound for the performance; a more sophisticated similarity measure can only do better.
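The normalization just described is a one-liner; a sketch (our own code):

```python
def normalized_average_rank(average_rank, num_patterns):
    """Map an average rank onto [0, 1]: 0 means always rank 1, 1 means always last."""
    return (average_rank - 1) / (num_patterns - 1)
```

For the values quoted above, (1.9 − 1)/(20 − 1) ≈ 0.05 and (6.9 − 1)/(50 − 1) ≈ 0.12.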
Figure 14 (Example 4): Rank percentiles for the 20 training patterns (top) and 50 test patterns (bottom) depending on the test location. For this graph, components 1, 2, and 4 were used (cf. Figure 11). Conventions are as for Figure 12. Performance is, of course, perfect at the reference location −15, but it is also relatively stable over a large range. The average rank over the white area is 1.9 for the training patterns and 6.2 for the test patterns. Perfect performance would be 1 in both cases, and chance level would be approximately 10.5 and 25.5 for training and test patterns, respectively.
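The chance levels quoted in the caption follow from the expected rank of a uniformly random ranking, (n + 1)/2; a quick check (our own code):

```python
def chance_level_rank(num_patterns):
    """Expected rank if the correct pattern falls uniformly at random in the ranking."""
    return (num_patterns + 1) / 2
```

This gives 10.5 for the 20 training patterns and 25.5 for the 50 test patterns; after the normalization described above, ((n + 1)/2 − 1)/(n − 1) = 0.5 exactly for any n, matching the chance level of about 0.5 quoted earlier.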
Figure 15 (Example 4): Dependence of network performance on the number of training patterns. The standard number was 20. The ordinate indicates the normalized average rank; the abscissa indicates the considered parameter, here the number of training patterns. Gray and black curves indicate performance on training and test data, respectively. Solid curves indicate results obtained with two layer-4b output signal components only; dashed curves indicate results obtained with as many of the first nine components as were useful for recognition. The useful components were determined on the training data based on recognition performance. The average number of components used for the dashed curves is shown above the curves. Each point in the graph represents the mean over five different simulation runs; the standard deviation is indicated by small bars. The curves are slightly shifted horizontally so that the standard deviation bars do not overlap. Twenty training patterns were sufficient, and even five training patterns produced reasonable performance.
4.3.6 How Does Performance Depend on the Number of Training Patterns, Network Complexity, and Noise Level? Using the normalized average rank as a measure of performance for translation-invariant pattern recognition, we can now investigate the dependence on various parameters. Figure 15 shows the dependence on the number of training patterns. One finds that 20 training patterns are enough to achieve good generalization; performance does not degrade dramatically down to five training patterns. This was surprising, since translation-invariant recognition appears to be a fairly complex task.

Figure 16 shows the dependence of network performance on the number of components used in the SFA modules. The standard was nine for all modules. In this experiment, the number of components propagated from one module to the next was stepwise reduced to just one. At the top level, however, usually nine components were used to make results more comparable. In the case of one and two components per module, only two and five components, respectively, were available in the layer 4b module. Performance degraded slowly down to four components per module; below that, performance degraded quickly. With one component, the normalized average rank was at chance level. One can expect that performance would improve slightly with more than nine components per module. With one and two components per module, there were always two clear components in the output signal useful for recognition. With three components per module, the useful information began to mix with components not useful for recognition, so that performance actually degraded from two to three components per module.

Figure 16 (Example 4): Dependence of network performance on the number of components propagated from one module to the next. The standard value was 9. Conventions are as for Figure 15. Performance degraded slowly down to four components per module and more quickly for fewer components.

SFA should not be too sensitive to noise, because noise would yield quickly varying components that are ignored. To test this, we added gaussian white noise with different variances to the training and test signals. A sensory signal with a noise level of 1 is shown in Figure 10. The noise was independent only for the nine adjacent sensors feeding into one SFA module. Because only one module per layer was trained in the training procedure, all modules in one layer effectively saw the same noise, but delayed by 4, 8, 12, and so on time units. However, since the network processed the input instantaneously, it is unlikely that the delayed reoccurrence of noise could be used to improve the robustness. Figure 17 shows the dependence of performance on noise level. The network performance degraded gracefully.

Figure 17 (Example 4): Dependence of network performance on the added noise level. The standard value was 0. Conventions are as for Figure 15. Performance degrades gracefully until it approaches chance level at a noise level of 2. A sensory signal with noise level 1 is shown in Figure 10.

4.3.7 What Is the Computational Strategy for Achieving Translation Invariance? How does the network achieve translation invariance? This question can be answered at least for a layer 1b SFA module by visualizing its input-output function (not including the clipping operation from the 1a to the 1b module). At that level, the input-output function is a polynomial of nine input components, for example, x₋₄, x₋₃, …, x₄, and degree 2; it is a weighted sum over all first- and second-degree monomials. Let us now define a center location and a spread for each monomial. For second-degree monomials, the spread is the difference between the indices of the two input components, and the center location is the average over the two indices. For first-degree monomials, the spread is 0, and the center location is the index. Thus, x₋₃x₀ has a spread of 3 and center location −1.5, x₃x₃ has a spread of 0 and center location 3, and x₋₂ has a spread of 0 and center location −2. Figure 18 shows the weights or coefficient values of the monomials for the first two components of the input-output function learned by the central layer 1b SFA module, including the contribution of the preceding 1a SFA module. All other modules in the same layer learn the same input-output function, just shifted by a multiple of four units. The graphs show that second-degree monomials contributed significantly (signals generated by first- and second-degree monomials have similar variance, so that coefficient values of first- and second-degree monomials can be compared), which indicates that second-order features were extracted. In general, coefficient values related to monomials of the same degree and spread increase smoothly from periphery to center, forming bell-shaped or half-sine-wave-shaped curves. This results in a fade-in/fade-out mechanism applied to the extracted features as patterns move through the receptive field. The general mechanism by which the network learns approximate translation invariance is closely related to higher-order networks, in which translation invariance is achieved by weight sharing among monomials of the same degree and spread (see Bishop, 1995). The difference is that in those networks, the invariance is built in by hand and without any fade-in/fade-out.

Figure 18 (Example 4): Visualization of the first two components, g₁(x) (top) and g₂(x) (bottom), of the input-output function realized by the central SFA module of layer 1b. Black dots indicate the coefficient values of the second-degree monomials at their center location. Dots related to monomials of the same spread are connected by dashed lines, dense dashing indicating a small spread and wide dashing indicating a large spread. The spread can also be inferred from the number of dots connected, since there are nine monomials with spread 0, eight monomials with spread 1, and so on. The thick line indicates the coefficient values of the first-degree monomials. Second-order features get extracted, and coefficient values related to the same spread vary smoothly with center location to achieve a slowly varying output signal (i.e., approximate translation invariance).

What is the relative contribution of the linear a-sublayers and the quadratic b-sublayers to the translation invariance? This can be inferred from the g values plotted across layers. In Figure 19, a- and b-sublayers both contribute to the slow variation of the output signals, although the linear a-sublayers seem to be slightly more important, at least beyond layer 1b. This holds particularly for the first component, which is usually the half-sine-wave-responding where-component, for which simple linear averaging seems to be a good strategy at higher levels.

4.3.8 Is There an Implicit Similarity Measure Between Patterns? The representation of a given pattern is similar at different locations: this is the learned translation invariance. But how does the network compare different patterns at the same location?
What is the implicit similarity measure between patterns, given that we compare the representations generated by the network in terms of the angles between the response vectors? Visual inspection did not give an obvious insight into what the implicit similarity measure would be, and comparisons with several other similarity measures were not conclusive either. For example, the correlation coefficient between the maximum correlation between pairs of patterns over all possible relative shifts on the one hand and the angle between the response vectors (presenting the patterns in the center of the retina and using components 1, 2, and 4) on the other hand was only −0.47, and the respective scatterplot was quite diffuse. There was no obvious similarity measure implicit in the network.

Figure 19 (Example 4): g values of the nine signal components in the different layers, averaged over five simulation runs. Dashed and solid curves indicate the training and test cases, respectively. The bottom curve in each of the two bunches indicates the first component with the lowest g value, the top one the ninth component. A difference of 0.397 = log₁₀(2.5) between the dashed and solid curves can be explained by the fact that there were 2.5 times more test than training patterns. The remaining difference and the fact that the solid curves are not monotonic and cross each other reflect limited generalization. The step from the retina to layer 1a is linear and performs only decorrelation; no information is lost. Linear as well as nonlinear processing contribute to the translation invariance. For the first where-component, the linear convergent processing seems to be more important.

4.3.9 How Well Does the System Generalize to Other Types of Patterns? We have addressed this question with an experiment using two different types of patterns: high-frequency patterns generated with a gaussian low-pass filter with σ = 2 and low-frequency patterns generated with σ = 10 (see Figure 10 for patterns with different values of σ). Otherwise, the training and testing procedures were unchanged. Average angles and normalized average ranks are given in Table 3 and indicate that (1) translation invariance generalized well to the pattern type not used for training, since the average angles did not differ significantly for the two different testing conditions given the same training condition; (2) the invariance was in general better learned with low-frequency patterns than with high-frequency patterns (although at the cost of high variance), since the average angles were smaller for all testing conditions if the network was trained with low-frequency patterns; (3) low-frequency patterns were more easily discriminated than high-frequency patterns, since the normalized average rank was on average lower by 0.04 for the test patterns with σ = 10 than for those with σ = 2; and (4) pattern recognition generalized only to a limited degree, since the normalized average rank increased on average by 0.06 if training was not done on the pattern type used for testing. Notice that recognition performance dropped (point 4) even though the invariance generalized well (point 1). Response vectors for the pattern type not used for training seemed to be invariant but similar for different patterns, so that discrimination degraded.

Table 3 (Example 4): Average Angles and Normalized Average Ranks for the Networks Trained on One Pattern Type (σ = 2 or σ = 10) and Tested on the Training Data or Test Data of the Same or the Other Pattern Type. [The numerical entries could not be unambiguously recovered from the extraction.] Note: All figures are means over five simulation runs with the standard deviation given in small numerals. Boldface numbers indicate performances if a network was tested on the same pattern type for which it was trained. The components useful for recognition were determined on the training data. Training (testing) was done with identical patterns for all networks within a row (column).

4.4 Example 5: Other Invariances in the Visual System Model.

4.4.1 Training the Network. In this section, we extend our analysis to other invariances. To be able to consider not only geometrical transformations but also illumination variations, patterns are now derived from object profiles by assuming a Lambertian surface reflectance (Brooks & Horn, 1989). Let the depth profile of a one-dimensional object be given by q(u) and its surface normal vector by

n(u) := [−q′(u), 1]ᵀ / √(q′(u)² + 1²),  with q′(u) := dq(u)/du.

If a light source shines with unit intensity from direction l = (sin α, cos α) (where the unit vector (0, 1) would point away from the retina and (1, 0) would point toward the right of the receptive field), the pattern of reflected light is i(u) = max(0, n(u) · l). To create an object's depth profile, take gaussian white noise of length 30 pixels, low-pass filter it with cyclic boundary conditions by a gaussian with a width σ randomly drawn from the interval [2, 5], and then normalize it to zero mean and unit variance. This is then taken to be the depth derivative q′(u) of the object. The depth profile itself could be derived by integration but is not needed here. Applying the formula for the Lambertian reflectance given above yields the gray-value pattern i(u). Varying α (here between −60° and +60°) yields different patterns for the same object. In addition, the object can be placed at different locations (centered anywhere in the receptive field), scaled (to a size between 2 and 60 pixels), rotated (which results in a cyclic shift up to a full rotation), and varied in contrast (or light intensity, between 0 and 2). Thus, beside the fact that each pattern is derived from a randomly drawn object profile, it has specific values for the five parameters location, size, cyclic shift, contrast, and illumination angle. If a certain parameter is not varied, it usually takes a standard value, which is the center of the receptive field for the location, a size of 30 pixels, no cyclic shift for rotation, a contrast of 1, and −30° for the light source. Notice that location, size, and contrast have values that can be uniquely inferred from the stimulus if the object is entirely within the receptive field. Cyclic shift and illumination angle are not unique, since identical patterns could be derived for different cyclic shifts and illumination angles if the randomly drawn object profiles happened to be different but related in a specific way, for example if two objects are identical except for a cyclic shift. Between the presentation of two objects, there is usually a pause of 60 time units with no stimulus presentation. Some examples of stimulus patterns changing with respect to one of the parameters are given in Figure 20 (top).

First, the network was trained for only a single invariance at a time. As in the previous section, 20 patterns were usually used for training and 50 patterns for testing. The patterns varied in only one parameter and had standard values for all other parameters. In a second series, the network was trained for two or three invariances simultaneously.
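The Lambertian pattern construction above can be sketched as follows (our own NumPy implementation; the function names and filtering details are our assumptions, not the authors' code):

```python
import numpy as np

def lambertian_pattern(q_prime, alpha_deg):
    """Gray-value pattern i(u) = max(0, n(u) . l) for a 1-D depth derivative q'(u).

    alpha_deg is the illumination angle in degrees; direction (0, 1) points
    away from the retina, (1, 0) toward the right of the receptive field.
    """
    # Surface normal n(u) = [-q'(u), 1]^T / sqrt(q'(u)^2 + 1)
    norm = np.sqrt(q_prime**2 + 1.0)
    n = np.stack([-q_prime, np.ones_like(q_prime)]) / norm
    a = np.radians(alpha_deg)
    light = np.array([np.sin(a), np.cos(a)])   # unit light direction l
    return np.maximum(0.0, light @ n)

def random_depth_derivative(length=30, sigma=3.0, rng=None):
    """Cyclically low-pass-filtered gaussian noise, normalized to zero mean, unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(length)
    u = np.arange(length)
    d = np.minimum(u, length - u)              # cyclic distances from position 0
    kernel = np.exp(-d**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    # Cyclic filtering by explicitly rolling the (symmetric) kernel
    filt = np.array([np.dot(np.roll(kernel, i), noise) for i in range(length)])
    return (filt - filt.mean()) / filt.std()
```

A flat surface (q′ = 0) lit frontally (α = 0°) reflects uniformly with intensity 1, while grazing light (α = 90°) yields zero reflected intensity, consistent with the max(0, n · l) formula.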
Again, the patterns of the training stimulus varied in just one parameter at a time. The other parameters either had their standard values, if they belonged to an invariance not trained for, or they had a random value within the valid range, if they belonged to an invariance trained for. For example, if the network was to be trained for translation and size invariance, each pattern varied in either location or size while having a randomly chosen but constant value for size or location, respectively (cf. Figure 20, bottom). All patterns had standard parameter values for rotation, contrast, and illumination angle. This training procedure comes close to the natural situation in which each pattern occurs many times with different sets of parameters, is computationally convenient, and permits a direct comparison between different experiments, because identical stimuli are used in different combinations. In some experiments, all nonvaried parameters had their standard values; in the example given above, the patterns varying in location would have standard size, and the patterns varying in size would have standard location. In that case, however, the network can learn translation invariance only for patterns of standard size and, simultaneously, size invariance for patterns of standard location. No invariances can be expected for patterns of nonstandard size and location.

Figure 20 (Example 5): (Top) Stimulus patterns of the same object generated by changing object location from the extreme left to the extreme right in steps of 1, object size from 2 to 30 and back to 2 in steps of 2, cyclic shift from 0 to twice the object size in steps of 1 (resulting in two full rotations), contrast (or light intensity) from 1/15 to 2 and back again in steps of 1/15, and illumination angle from −60° to +60° in steps of 2°. (Bottom) Stimulus for training translation and size invariance. Patterns derived from two different objects are shown. The first two patterns have constant but randomly chosen size and vary in location. The second two patterns have constant but randomly chosen location and vary in size. The first and third patterns were derived from the same object, and so were the second and fourth patterns. Cyclic shift (rotation), contrast, and illumination angle have their standard values. The pause between two presentations of an object is only 14 or 24 time units instead of 60 in the graphs, for display purposes.

4.4.2 What Does the Network Learn? Figure 21 shows the first four components of the output signal generated by a network trained for size invariance in response to 10 objects of the training stimulus. In this case, the stimuli consisted of patterns of increasing and decreasing size, like the second example in Figure 20 (top). The response in this case, and also for the other invariances, is similar to that in the case of translation invariance (see Figure 11), with the difference that no explicit where-information is visible. Thus, object recognition based on the response vectors should also be possible for these other invariances.

Figure 21 (Example 5): First four output signal components of the network trained for size invariance in response to 10 objects of the training stimulus with patterns changing in size only. At the bottom, some trajectory plots are also shown. Time axes range from 0 to 10 × 119, where 119 corresponds to the presentation of a single pattern including a pause of 60 time units with no stimulation. All other axes range from −4 to +4. All signals are approximately normalized to zero mean and unit variance.

4.4.3 How Invariant Are the Representations? Figure 22 shows rank percentiles for the five invariances. For each but the top right graph, a network was trained with 20 patterns varying in one viewing parameter and having standard values for the remaining four. Fifty test patterns were used to determine which of the output components were useful for recognition. Fifty different test patterns were then used to determine the rank percentiles shown in the graphs. Notice that the graph shown for translation invariance differs from the bottom graph in Figure 14 because of differences in the reference location, the number of components used for recognition, and the characteristics of the patterns. The recognition performance was fairly good over a wide range of parameter variations, indicating that the networks have learned the invariances well. For varying illumination angle, however, the network did not generalize if the angle had a different sign, that is, if the light comes from the other side. The top right graph shows the performance of a network trained for translation and size invariance simultaneously and tested for translation invariance.

Figure 22 (Example 5): Rank percentiles for different networks and invariances and for 50 test patterns. The top right graph shows percentiles for a network that has been trained for translation and size invariance simultaneously and was tested for translation invariance; all other networks have been trained for only one invariance and tested on the same invariance. The thick lines indicate the median ranks or 50th percentiles. The dashed lines indicate the smallest and largest ranks. The bottom and top thin lines indicate the 10th and 90th percentiles, respectively. Chance level would be a rank of 25. Performance is perfect for the standard view, but also relatively stable over a wide range.

4.4.4 What Are the Recognition Performances? The normalized average ranks were used as a measure of invariant recognition performance, as for translation invariance in section 4.3. Normalized average ranks are listed in Table 4 for single and multiple invariances. The numbers of output signal components used in each experiment are shown in Table 5. Although normalized average ranks for different invariances cannot be directly compared, the figures suggest that contrast and size invariance are
easier to learn than translation invariance, which in turn is easier to learn than rotation and illumination invariance. Contrast invariance can be learned perfectly if enough training patterns are given. This is not surprising, since the input patterns themselves already form vectors that simply move away from the origin in a straight line and back again, as required for invariant object recognition. That contrast invariance requires many training patterns for good generalization is probably because a single pattern changing in contrast spans only one dimension; thus, at least 30 patterns are required to span the relevant space (since patterns have a size of 30). The performance is better for size invariance than for translation invariance, although translation is mathematically simpler, probably because individual components of the input signal change more drastically with translation than with scaling, at least around the point with respect to which objects are being rescaled. Illumination invariance is by far the most difficult of the considered invariances to learn, in part because there is no unique relationship between the illumination angle, the object's depth profile, and the light intensity pattern. This also holds to some degree for rotation invariance, on which the network performs second worst. Illumination invariance seems to be particularly difficult to learn if the illumination angle changes sign, that is, if the light comes from the other side (cf. Figure 22). It is also interesting to look at how well the network has implicitly learned invariances for which it has not been trained. If trained for translation invariance, for instance, the network also learns size invariance fairly well. Notice that this is not symmetric: if trained for size invariance, the network does not learn translation invariance well. A similar relationship holds for contrast and illumination invariances. Learning illumination invariance teaches the network some contrast invariance, but not vice versa.
A comparison of the results listed in the bottom of Table 4 with those listed in the top shows that performance degrades when the network is trained on multiple invariances simultaneously. Closer inspection shows that this is due to at least two effects. First, if trained on translation invariance alone, patterns vary only in location and have standard size. If trained on translation and size invariance, training patterns that vary in location have a random rather than a standard size. Thus, the network has to learn translation invariance not only for standard-size patterns but also for nonstandard-size patterns; size becomes a parameter of the patterns the network can represent, and the space of patterns is much larger. However, since testing is done with patterns of standard size (this is important to prevent patterns from being recognized based on their size), pattern size cannot be used during testing. This effect can be found in a network trained only for translation invariance, but with patterns of random size (compare the corresponding entries in Table 6).
Table 4 (Example 5): Normalized Average Ranks for Networks Trained for One, Two, or Three of the Five Invariances and Tested with Patterns Varying in One of the Five Parameters. Rows are the training conditions: Location; Size; Rotation; Contrast; Illumination; Location and size; Location and rotation; Size and illumination; Rotation and illumination; Location, size, and rotation; Rotation, contrast, and illumination. Columns give training and testing performance for each of the five tested parameters. [The numerical entries could not be unambiguously recovered from the extraction.] Note: All figures are means over three simulation runs with the standard deviation given in small numerals. Boldface numbers indicate performances if the network was tested on an invariance for which it was trained. Training performance was based on the same output signal components as used for testing. Training (testing) was done with identical patterns for all networks within a row (column).
Table 5 (Example 5): Numbers of Output Signal Components Useful for Recognition.

Training ↓ / Testing →       Location    Size       Rotation   Contrast   Illumination
Location                     6.0±0.0     6.0±1.0    4.3±0.6    5.3±0.6    5.0±1.7
Size                         3.7±1.2     6.7±0.6    3.3±0.6    5.3±2.1    4.0±1.0
Rotation                     6.7±0.6     5.7±2.3    8.0±1.0    4.3±1.5    4.7±1.5
Contrast                     3.7±2.9     6.0±2.6    2.3±0.6    8.7±0.6    5.3±2.5
Illumination                 4.7±1.2     5.7±3.1    3.7±1.2    8.7±0.6    6.7±0.6
Location and size            8.0±1.0     7.3±0.6    4.3±0.6    4.3±0.6    4.3±1.2
Location and rotation        6.3±1.2     6.3±0.6    5.0±1.7    4.3±1.5    4.3±0.6
Size and illumination        3.3±0.6     7.3±1.5    4.0±1.0    7.0±1.7    5.7±0.6
[The entries for the remaining three training conditions (Rotation and illumination; Location, size, and rotation; Rotation, contrast, and illumination) could not be unambiguously aligned in the extraction.]

Note: All figures are means over three simulation runs, with the standard deviation given in small numerals. Networks were trained for one, two, or three of the five invariances and tested with patterns varying in one of the five parameters. Boldface numbers indicate experiments where the network was tested on an invariance for which it was trained.
Second, the computational mechanisms by which different invariances are achieved may not be compatible and so would interfere with each other (compare entries (or ) with entries (or ) in Table 6). Further research is required to investigate whether this interference between different invariances is an essential problem or one that can be overcome by using more training patterns and networks of higher complexity and more computational power, and by propagating more components from one SFA module to the next. Table 6 shows that the amount of degradation due to each of the two effects discussed above can vary considerably. Translation invariance does not degrade signicantly if training patterns do not have standard size, but it does if size invariance has to be learned simultaneously. Size invariance, on the other hand, does not degrade if translation invariance has to be learned simultaneously, but it degrades if the patterns varying in size are not at the same (standard) location. Some of these differences can be understood intuitively. For instance, it is clear that the learned translation invariance is largly insensitive to object size but that the learned size invariance is specic to the location at which it has been learned. Evaluating the interdependencies among invariances in detail could be the subject of further research. 4.4.5 How Does Performance Depend on the Number of Training Patterns and Network Complexity? The dependencies on the number of training patterns
762
L. Wiskott and T. Sejnowski
Table 6 (Example 5): Normalized Average Ranks for the Network Trained on One or Two Invariances and Tested on One Invariance.

[Table body not recoverable from the extraction: paired columns for Location, Size, Rotation, and Illumination, holding normalized average ranks (means with standard deviations in small numerals); the row and column icons described in the note below did not survive.]
Note: The icons represent the two-dimensional parameter space of the two invariances considered; all other parameters have standard values. In the top part of the table, for instance, the vertical dimension refers to location and the horizontal dimension refers to size. The icons then have the following meaning: = 20 training (or 50 test) input patterns with standard size and varying location. = 20 (or 50) input patterns with standard location and varying size. = 20 (or 50) input patterns with random but fixed size and varying location. = 20 (or 50) input patterns with random but fixed location and varying size. = input patterns and together—40 (or 100) input patterns in total. = input patterns and together—40 (or 100) input patterns in total. The same scheme also holds for any other pair of two invariances. Two icons separated by a colon indicate a training and a testing stimulus. The experiments of the top half of Table 4 would correspond to icons (or ) if the tested invariance is the same as the trained one (boldface figures) and (or ) otherwise. The experiments of the bottom half of Table 4, where multiple invariances were learned, would correspond to icons (or ) if the tested invariance is one of the trained ones (boldface figures). The remaining experiments in the bottom half cannot be indicated by the icons used here, because the networks are trained on two invariances and tested on a third one.
and network complexity are illustrated in Figure 23. It shows normalized average ranks on training and test stimuli for the five different invariances and numbers of training patterns, as well as different numbers of components propagated from one SFA module to the next. As can be expected, training performance degrades and testing performance improves with the number of training patterns. One can extrapolate that for many training patterns, the performance would be about .050 for location, .015 for size, .055 for rotation, .000 for contrast, and .160 for illumination angle, which confirms the general results we found in Table 4 as to how well invariances can be learned, with the addition that contrast can be learned perfectly given enough training patterns. Testing performance also improves with network complexity, that is, with the number of propagated components, except when overfitting occurs, which is apparently the case at the bottom ends of the dashed curves. Contrast invariance generally degrades with network complexity. This is due to the fact that the input vectors already form a perfect representation for invariant recognition (see above),
Slow Feature Analysis
[Figure 23 plot: normalized average rank on training data (vertical axis, 0 to 0.14) versus normalized average rank on test data (horizontal axis, 0 to 0.25), with curves for location, size, rotation, contrast, illumination, and the diagonal.]
Figure 23 (Example 5): Normalized average ranks on training and test stimuli for the five invariances. Solid lines and filled symbols indicate the dependency on the number of training patterns, which was 10, 20, 40, and 80 from the lower right to the upper left of each curve. Dashed lines and empty symbols indicate the dependency on the number of components propagated from one SFA module to the next one, which was 5, 7, 9, 11, 13, and 15 from the upper end of each curve to the lower end. The direction of increasing number of components can also be inferred from the polarity of the dashing, which follows the pattern 5 - — - — 7 - — . . . - — 15 (i.e., long dashes point toward large numbers of propagated components). If not varied, the number of training patterns was 20 and the number of propagated components was 9. Notice that the curves cross for these standard values. The solid curve for contrast invariance is shifted slightly downward to make it distinguishable from the dashed curve. Each point is an average over three simulation runs; standard deviations for points with standard parameter sets can be taken from Table 4.
which can only be degraded by further processing, and the fact that each pattern spans only one dimension, which means that overfitting occurs easily.

5 Discussion

The new unsupervised learning algorithm, slow feature analysis (SFA), presented here yields a high-dimensional, nonlinear input-output function that extracts slowly varying components from a vectorial input signal. Since the learned input-output functions are nonlinear, the algorithm can be applied repeatedly, so that complex input-output functions can be learned in a hierarchical network of SFA modules with limited computational effort.
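A minimal linear sketch of this closed-form procedure (sphere the input, then take the directions of smallest variance of the time derivative) might look as follows; the toy signal and all names are illustrative, and the nonlinear expansion and hierarchical modules of the full algorithm are omitted:

```python
import numpy as np

def linear_sfa(x, n_components=1):
    """Linear slow feature analysis sketch.

    x: array of shape (T, n), a multidimensional input signal over time.
    Returns the n_components slowest output signals, zero mean and
    (approximately) unit variance, mutually decorrelated.
    """
    # Normalize to zero mean.
    x = x - x.mean(axis=0)
    # Sphere (whiten) the signal so it has unit covariance.
    eigval, eigvec = np.linalg.eigh(np.cov(x.T))
    z = x @ (eigvec / np.sqrt(eigval))
    # PCA on the time-derivative covariance; eigenvectors with the
    # SMALLEST eigenvalues point in the slowest directions.
    dz = np.diff(z, axis=0)
    _, d_eigvec = np.linalg.eigh(np.cov(dz.T))
    return z @ d_eigvec[:, :n_components]  # eigh sorts ascending

# Toy mixture in the spirit of Example 1: a slow sine hidden in
# two linear combinations with a much faster sine.
t = np.linspace(0, 4 * np.pi, 2000)
slow, fast = np.sin(t), np.sin(37 * t)
x = np.stack([slow + fast, 0.5 * slow - fast], axis=1)
y = linear_sfa(x, n_components=1)[:, 0]
# The recovered component correlates strongly with the slow source
# (up to sign).
print(abs(np.corrcoef(y, slow)[0, 1]))
```

With a polynomial expansion of x before the sphering step, the same two linear-algebra operations would yield the nonlinear version described in the text.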
SFA is somewhat unusual in that directions of minimal variance rather than maximal variance are extracted. One might expect that this is particularly sensitive to noise. Noise can enter the system in two ways. First, there may be an additional input component carrying noise. Assume an input signal as in Example 1 plus a component x_4(t) that has a constant value plus some small noise. x_4(t) would seem to be the most invariant component. However, normalizing it to zero mean and unit variance would amplify the noise such that it becomes a highly fluctuating signal, which would be discarded by the algorithm. Second, noise could be superimposed on the input signal that carries the slowly varying features. This could change the Δ values of the extracted signals significantly. However, a slow signal corrupted by noise will usually be slower than a fast signal corrupted by noise of the same amount, so that the slow signal will still be the first one extracted. Only if the noise is unevenly distributed over the potential output-signal components can one expect the slow components not to be correctly discovered. Temporal low-pass filtering might help in dealing with noise of this kind to some extent.

The apparent lack of an obvious similarity measure in the network of section 4.3 might reflect an inability to find it. For instance, the network might focus on features that are not well captured by correlation and not easily detectable by visual inspection. Alternatively, no particular similarity measure may be realized, and similarity may be more a matter of chance. On the negative side, this contradicts the general goal that similar inputs should generate similar representations. However, on the positive side, the network might have enough degrees of freedom to learn any kind of similarity measure in addition to translation invariance if trained appropriately. This could be valuable, because simple correlation is actually not the best similarity measure in the real world.
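The earlier claim that a slow signal corrupted by noise usually remains slower than a fast signal corrupted by the same amount of noise can be checked numerically with the Δ value, here taken as the variance of the time derivative of the unit-variance signal; the signals and parameters below are illustrative:

```python
import numpy as np

def delta_value(y, dt):
    """Delta value of a signal: variance of its time derivative after
    normalizing the signal to zero mean and unit variance."""
    y = (y - y.mean()) / y.std()
    return np.var(np.diff(y) / dt)

rng = np.random.default_rng(0)
t, dt = np.linspace(0, 2 * np.pi, 1000, retstep=True)
noise = 0.1 * rng.standard_normal(t.size)

slow = np.sin(t)        # one oscillation over the interval
fast = np.sin(20 * t)   # twenty oscillations

# Noise raises both Delta values considerably, but the ordering is
# preserved: the noisy slow signal still has the smaller Delta value.
print(delta_value(slow + noise, dt) < delta_value(fast + noise, dt))
```

The same noise term is added to both signals, so the comparison isolates the effect the text describes: the ordering by slowness survives the corruption.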
5.1 Models for Unsupervised Learning of Invariances. Földiák (1991) presented an on-line learning rule for extracting slowly varying features based on models of classical conditioning. This system associated the existing stimulus input with past response activity by means of memory traces. This implemented a tendency to respond similarly to stimuli presented successively in time, that is, a tendency to generate slow feature responses. A weight growth rule and normalization kept the weights finite and allowed some useful information to be extracted. A simple form of decorrelation of output units was achieved by winner-take-all competition. This, however, did not lead to the extraction of different slow features but rather to a one-of-n code, where each output unit codes for a certain value of the slow feature. The activity of a unit was linear in the inputs, and the output was a nonlinear function of the total input and the activities of neighboring units. Closely related models were also presented by Barrow and Bray (1992), O'Reilly and Johnson (1994), and Fukushima (1999) and applied to learning lighting and orientation invariances in face recognition by Stewart Bartlett and Sejnowski (1998). These kinds of learning rules have also been applied to
hierarchical networks with several layers, most clearly in Wallis and Rolls (1997). A more detailed biological model of learning invariances, also based on memory traces, was presented by Eisele (1997). He introduced an additional memory trace to permit association between patterns not only in the backward direction but also in the forward direction. Another family of on-line learning rules for extracting slowly varying features has been derived from information-theoretic principles. Becker and Hinton (1992) trained a network to discover disparity as an invariant variable of random-dot stereograms. Two local networks, called modules, received input from neighboring but nonoverlapping parts of the stereogram, where each module received input from both images. By maximizing the mutual information between the two modules, the system learned to extract the only common aspect of their input: the disparity, given that disparity changed slowly. This is an example of spatial invariance, in contrast to the temporal invariance considered so far. Spatial and temporal invariance are closely related concepts, and algorithms can usually be interchangeably applied to one or the other domain. An application to the temporal domain, for instance, can be found in Becker (1993). For an overview, see Becker (1996). The information-theoretic approach is appealing because the objective function is well motivated. This becomes particularly obvious in the binary case, where the consequent application of the principles is computationally feasible. However, in the case of continuous variables, several approximations need to be made in order to make the problem tractable. The resulting objective function to be maximized is

$$I := \frac{1}{2} \log \frac{V(a+b)}{V(a-b)}, \tag{5.1}$$
where a and b are the outputs of the two modules and V(·) indicates the variance of its argument. If we set a := y(t) and b := y(t−1), assuming discretized time, this objective function is almost identical to the objectives formalized in equations 2.1 through 2.4. Thus, the information-theoretic approach provides a systematic motivation but not a different optimization problem. Zemel and Hinton (1992) have generalized this approach to extract several invariant variables. a and b became vectors, and V(·) became the determinant of the covariance matrix. The output units then preferentially produced decorrelated responses. Stone and Bray (1995) have presented a learning rule that is based on an objective function similar to that of Becker and Hinton (1992) or equation 2.1 and that includes a memory trace mechanism as in Földiák (1991). They define two memory traces,

$$\tilde{y} := \frac{\sum_{t'=1}^{\infty} \exp(-t'/\tau_s)\, y(t-t')}{\sum_{t''=1}^{\infty} \exp(-t''/\tau_s)}, \qquad \bar{y} := \frac{\sum_{t'=1}^{\infty} \exp(-t'/\tau_l)\, y(t-t')}{\sum_{t''=1}^{\infty} \exp(-t''/\tau_l)}, \tag{5.2}$$
one on a short timescale (ỹ, with a small τ_s) and one on a long timescale (ȳ, with a large τ_l). The objective function is defined as

$$F := \frac{1}{2} \log \frac{\langle (y - \bar{y})^2 \rangle}{\langle (y - \tilde{y})^2 \rangle}, \tag{5.3}$$
which is equivalent to equation 5.1 in the limit τ_s → 0 and τ_l → ∞. The derived learning rule performs gradient ascent on this objective function. The examples in Stone and Bray (1995) are linearly solvable, so that only linear units are used. They include an example where two output units are trained. Inhibitory input from the first to the second output unit enforces decorrelation. Examples in Stone (1996) are concerned with disparity estimation, which is not linearly solvable and requires a multilayer network. Backpropagation was used for training, with the error signal given by −F. A similar system derived from an objective function was presented in Peng et al. (1998). Mitchison (1991) presented a learning rule for linear units that is also derived from an objective function like that of equation 2.1. He pointed out that the optimal weight vector is given by the last eigenvector of the matrix ⟨ẋẋᵀ⟩. In the on-line learning rule, weights were prevented from decaying to zero by an explicit normalization, such as Σ_i w_i² = 1. The extracted output signal would therefore depend strongly on the range of the individual input components, which may be arbitrarily manipulated by rescaling the input components. For instance, if there were one zero input component x_i = 0, an optimal solution would be w_i = 1 and w_{i'} = 0 for all i' ≠ i, which would be undesirable. Therefore, it seems preferable to prevent weight decay by controlling the variance of the output signal and not the sum over the weights directly. The issue of extracting several output components was addressed by introducing a different bias for each output component, which would break the symmetry if the weight space for the slowest output components were more than one-dimensional.
However, redundancy can probably be better reduced by the decorrelation constraint of equation 2.4, also used in other systems, with the additional advantage that suboptimal output-signal components can also be extracted in an ordered fashion. Many aspects of the learning algorithm presented here can be found in these previous models. In particular, the optimization problem considered by Becker and colleagues is almost identical to the objective function and constraints used here. The novel aspect of the system presented here is its formulation as a closed-form learning algorithm rather than an on-line learning rule. This is possible because the input signal is expanded nonlinearly, which makes the problem linear in the expanded signal. The solution can therefore be found by sphering and applying principal component analysis to the time-derivative covariance matrix. This has several consequences that distinguish this algorithm from others: The algorithm is simple and guaranteed to find the optimal solution in one shot. Becker and Hinton (1995) have reported problems in finding
the global maxima, and they propose several extensions to avoid this problem, such as switching between different learning rules during training. This is not necessary here. Several slow features can be easily extracted simultaneously. The learning algorithm automatically yields a large set of decorrelated output signals, extracting decorrelated slow features. (This is different from having several output units representing only one slow feature at a time by a one-of-n code.) This makes it particularly easy to consider hierarchical networks of SFA modules, since enough information can be propagated from one layer to the next. The learning algorithm presented here suffers from the curse of dimensionality. The nonlinear expansion makes it necessary to compute large covariance matrices, which soon becomes computationally prohibitive. The system is therefore limited to input signals of moderate dimensionality. This is a serious limitation compared to the on-line learning rules. However, this problem might be alleviated in many cases by hierarchical networks of SFA modules, where each module has only a low-dimensional input, such as up to 12 dimensions. Example 4 shows such a hierarchical system, where a 65-dimensional input signal is broken down by several small SFA modules.

5.2 Future Perspectives. There are several directions in which slow feature analysis as presented here can be investigated and developed further: Comparisons with other learning rules should be extended. In particular, the scaling with input and output dimensionality needs to be quantified and compared. The objective function and the learning algorithm presented here are amenable to analysis. It would be interesting to investigate the optimal responses of an SFA module and the consequences and limitations of using SFA modules in hierarchical networks. Example 2 demonstrates how several slowly varying features can be extracted by SFA.
It also shows that the decorrelation constraint is insufficient to extract slow features in a pure form. It would be interesting to investigate whether SFA can be combined with independent component analysis (Bell & Sejnowski, 1995) to extract the truly independent slow features. Example 5 demonstrates how SFA modules can be used in a hierarchical network for learning various invariances. It seems, however, that learning multiple invariances simultaneously leads to a significant degradation of performance. It needs to be investigated whether this is a principal limitation or can be compensated for by more training patterns and more complex networks.
In the network of Example 4, there was no obvious implicit measure of pattern similarity. More research is necessary to clarify whether there is a hidden implicit measure or whether the network has enough capacity to learn a specific measure of pattern similarity in addition to the invariance. Finally, it has to be investigated whether invariances can also be learned from real-world image sequences, such as from natural scenes, within a reasonable amount of time. Some work in this direction has been done by Wallis and Rolls (1997). They trained a network for translation and other invariances on gray-value images of faces, but they did not test for generalization to new images. The number of images was so small that good generalization could not be expected. It is clear that SFA can be only one component in a more complex self-organizational model. Aspects such as attention, memory, and recognition (more sophisticated than implemented here) need to be integrated to form a more complete system.

Acknowledgments

We are grateful to James Stone for very fruitful discussions about this project. Many thanks go also to Michael Lewicki and Jan Benda for useful comments on the manuscript. At the Salk Institute, L. W. was partially supported by a Feodor-Lynen fellowship by the Alexander von Humboldt-Foundation, Bonn, Germany; at the Innovationskolleg, L. W. has been supported by HFSP (RG 35-97) and the Deutsche Forschungsgemeinschaft (DFG).

References

Barrow, H. G., & Bray, A. J. (1992). A model of adaptive development of complex cortical cells. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks II: Proc. of the Intl. Conf. on Artificial Neural Networks (pp. 881–884). Amsterdam: Elsevier.

Becker, S. (1993). Learning to categorize objects using temporal coherence. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann.

Becker, S. (1996). Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7(1), 7–31.

Becker, S., & Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356), 161–163.

Becker, S., & Hinton, G. E. (1995). Spatial coherence as an internal teacher for a neural network. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architecture and applications (pp. 313–349). Hillsdale, NJ: Erlbaum.
Becker, S., & Plumbley, M. (1996). Unsupervised neural network learning procedures for feature extraction and classification. J. of Applied Intelligence, 6(3), 1–21.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Brooks, M. J., & Horn, B. K. P. (Eds.). (1989). Shape from shading. Cambridge, MA: MIT Press.

Eisele, M. (1997). Unsupervised learning of temporal constancies by pyramidal-type neurons. In S. W. Ellacott, J. C. Mason, & I. J. Anderson (Eds.), Mathematics of neural networks (pp. 171–175). Norwell, MA: Kluwer.

Fleet, D. J., & Jepson, A. D. (1990). Computation of component image velocity from local phase information. Intl. J. of Computer Vision, 5(1), 77–104.

Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.

Fukushima, K. (1999). Self-organization of shift-invariant receptive fields. Neural Networks, 12(6), 791–801.

Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. on Systems, Man, and Cybernetics, 13, 826–834.

Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. of Neurophysiology, 58, 1233–1258.

Konen, W., Maurer, T., & von der Malsburg, C. (1994). A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, 7(6/7), 1019–1030.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

Mitchison, G. (1991). Removing time variation with the anti-Hebbian differential synapse. Neural Computation, 3(3), 312–320.

Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. of Neuroscience, 13(11), 4700–4719.

Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972.

O'Reilly, R. C., & Johnson, M. H. (1994). Object recognition and sensitive periods: A computational analysis of visual imprinting. Neural Computation, 6(3), 357–389.

Peng, H. C., Sha, L. F., Gan, Q., & Wei, Y. (1998). Energy function for learning invariance in multilayer perceptron. Electronics Letters, 34(3), 292–294.

Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 45–76). Cambridge, MA: MIT Press.
Stewart Bartlett, M., & Sejnowski, T. J. (1998). Learning viewpoint invariant face representations from visual experience in an attractor network. Network: Computation in Neural Systems, 9(3), 399–417.

Stone, J. V. (1996). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463–1492.

Stone, J. V., & Bray, A. J. (1995). A learning rule for extracting spatio-temporal invariances. Network: Computation in Neural Systems, 6(3), 429–436.

Theimer, W. M., & Mallot, H. A. (1994). Phase-based binocular vergence control and depth reconstruction using active vision. CVGIP: Image Understanding, 60(3), 343–358.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Wallis, G., & Rolls, E. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.

Zemel, R. S., & Hinton, G. E. (1992). Discovering viewpoint-invariant relationships that characterize objects. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 299–305). San Mateo, CA: Morgan Kaufmann.

Received January 29, 1998; accepted June 1, 2001.
NOTE
Communicated by Jonathan Victor
Information Loss in an Optimal Maximum Likelihood Decoding

Inés Samengo
[email protected]
Centro Atómico Bariloche, (8400) San Carlos de Bariloche, Río Negro, Argentina

The mutual information between a set of stimuli and the elicited neural responses is compared to the corresponding decoded information. The decoding procedure is presented as an artificial distortion of the joint probabilities between stimuli and responses. The information loss is quantified. Whenever the probabilities are only slightly distorted, the information loss is shown to be quadratic in the distortion.
Understanding the way external stimuli are represented at the neuronal level is one central challenge in neuroscience. An experimental approach to this end (Optican & Richmond, 1987; Eskandar, Richmond, & Optican, 1992; Tovée, Rolls, Treves, & Bellis, 1993; Kjaer, Hertz, & Richmond, 1994; Heller, Hertz, Kjaer, & Richmond, 1995; Rolls, Critchley, & Treves, 1996; Treves, Skaggs, & Barnes, 1996; Rolls, Treves, & Tovée, 1997; Treves, 1997; Rolls & Treves, 1998; Rolls, Treves, Robertson, Georges-François, & Panzeri, 1998) consists of choosing a particular set of stimuli s ∈ S that can be controlled by the experimentalist and exposing these stimuli to a subject whose neural activity is being registered. The set of neural responses r ∈ R is then defined as the whole collection of recorded events. It is up to the researcher to decide which entities in the recorded signal are considered as events r. For example, r can be defined as the firing rate in a fixed time window, or as the time difference between two consecutive spikes, or the k first principal components of the time variation of the recorded potentials in a given interval, and so forth. Once the stimulus set S and the response set R have been settled on, the joint probabilities P(r, s) may be estimated from the experimental data. This is usually done by measuring the frequency of the joint occurrence of stimulus s and response r for all s ∈ S and r ∈ R. The mutual information between stimuli and responses reads (Shannon, 1948)
$$I = \sum_{s} \sum_{r} P(r,s)\, \log_2 \left[ \frac{P(r,s)}{P(r)\, P(s)} \right], \tag{1}$$

Neural Computation 14, 771–779 (2002)
© 2002 Massachusetts Institute of Technology
where

$$P(r) = \sum_{s} P(r,s) \tag{2}$$

$$P(s) = \sum_{r} P(r,s). \tag{3}$$
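Equations 1 through 3 can be read directly off a joint probability table: the table determines the marginals, and the marginals and the table determine the information. A minimal sketch (the toy distribution is illustrative):

```python
import numpy as np

def mutual_information(p_rs):
    """Mutual information (in bits) of a joint probability table.

    p_rs[r, s] is the joint probability P(r, s); rows index responses,
    columns index stimuli (equation 1).
    """
    p_r = p_rs.sum(axis=1, keepdims=True)   # P(r), equation 2
    p_s = p_rs.sum(axis=0, keepdims=True)   # P(s), equation 3
    mask = p_rs > 0                          # 0 * log 0 is taken as 0
    return float(np.sum(p_rs[mask] *
                        np.log2(p_rs[mask] / (p_r * p_s)[mask])))

# A deterministic one-to-one mapping between two equiprobable stimuli
# and two responses carries the full stimulus entropy of 1 bit.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # → 1.0
```

For an independent table (e.g., all entries 0.25), the same function returns 0, matching the stated range of I from zero up to the stimulus or response entropy.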
The mutual information quantifies how much can be learned about the identity of the stimulus shown just by looking at the responses. Accordingly, and since I is symmetrical in r and s, its value is also a measure of the amount of information that the stimuli give about the responses. From a theoretical point of view, I is the most appealing quantity characterizing the degree of correlation between stimuli and responses that can be defined. This stems from the fact that I is the only additive functional of P(r, s), ranging from zero (for uncorrelated variables) up to the entropy of stimuli or responses (for a deterministic one-to-one mapping) (Fano, 1961; Cover & Thomas, 1991). However, even if formally sound, the mutual information has a severe drawback when dealing with experimental data. Many times, and specifically when analyzing data of multiunit recordings, the response set R is quite large, its size increasing exponentially with the number of neurons sampled. Therefore, the estimation of P(r, s) from the experimental frequencies may be far from accurate, especially when recording from the vertebrate cortex, where there are long timescales in the variability and statistical structure of the responses. The mutual information I, being a nonlinear function of the joint probabilities, is extremely sensitive to the errors that may be involved in their measured values. As derived in Treves and Panzeri (1995), Panzeri and Treves (1996), and Golomb, Hertz, Panzeri, Treves, and Richmond (1997), the mean error in calculating I from the frequency table of events r and s is linear in the size of the response set. This analytical result has been obtained under the assumption that different responses are classified independently. Although there are situations where such a condition does not hold (Victor & Purpura, 1997), it is widely accepted that the bias grows rapidly with the size of the response set.
Therefore, a common practice when dealing with large response sets is to calculate the mutual information not between S and R but between the stimuli and another set T, each of whose elements t is a function of the true response r, that is, t = t(r) (Treves, 1997; Rolls & Treves, 1998). It is easy to show that if the mapping between r and t is one to one, then the mutual information between S and R is the same as the one between S and T. However, for one-to-one mappings, the number of elements in T is the same as in R. A wiser procedure is to choose a set T that is large enough not to lose the relevant information, but sufficiently small as to avoid significant limited-sampling errors. One possibility is to perform a decoding procedure (Gochin, Columbo, Dorfman, Gerstein, & Gross, 1994;
Rolls et al., 1996; Victor & Purpura, 1996; Rolls & Treves, 1998). In this case, T is taken to coincide with S. To make this correspondence explicit, the set T will be denoted by S′ and its elements t by s′. Each s′ in S′ is taken to be a function of r and is called the predicted stimulus of response r. As stated in Panzeri, Treves, Schultz, and Rolls (1999), this choice for T is the smallest that could potentially preserve the information about the identity of the stimulus. The data processing theorem (Cover & Thomas, 1991) states that since s′ is a function of r alone, and not of the true stimulus s eliciting response r, information about the real stimulus can only be lost, not created, by the transformation r → s′. Therefore, the true information I is always at least as large as the decoded information I_D, the latter being the mutual information between S and S′.¹ In order to have I and I_D as close as possible, it is necessary to choose the best s′ for every r. The procedure consists of identifying, for every elicited response, which of the stimuli was most probably shown. The conditional probability of having shown stimulus s given that the response was r reads

$$P(s|r) = \frac{P(r,s)}{P(r)}. \tag{4}$$

Therefore, the stimulus that has most likely elicited response r is

$$s'(r) = \arg\max_{s} P(s|r) = \arg\max_{s} P(r,s). \tag{5}$$
By means of equation 5, a mapping r → s′ is established: each response has its associated maximum likelihood stimulus. Equation 4 provides the only definition of P(s|r) that strictly follows Bayes' rule, so in this case the decoding is called optimal. There are other alternative ways of defining P(s|r) (Georgopoulos, Schwartz, & Kettner, 1986; Wilson & McNaughton, 1993; Seung & Sompolinsky, 1993; Rolls et al., 1996), some of which have the appealing property of being simple enough to be plausibly carried out by downstream neurons themselves. The purpose here, however, is to quantify how much information is lost when passing from r to s′ using an optimal maximum likelihood decoding procedure. In general, there are several r associated with a given s′. One may therefore partition the response space R into separate classes C(s) = {r / s′(r) = s}, one class for every stimulus. The number of responses in class s′ is N_{s′}. Of course, some classes may be empty. Here, the assumption is made that each r belongs to one and only one class (that is, equation 5 has a unique solution).

¹ It should be kept in mind, however, that when I_D is calculated from actual recordings, its value is typically overestimated because of limited sampling. Therefore, when dealing with real data sets, one may eventually obtain a value for I_D that surpasses the true mutual information I. Nevertheless, whenever the number of elements in S′ is significantly smaller than the number of responses r, the sampling bias in I_D will be bounded by the one obtained in the estimation of I.
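The decoding rule of equations 4 and 5 and the resulting partition of the responses into classes can be sketched numerically as follows; the joint table and all names are illustrative, and the decoded joint table anticipates the definitions given next:

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information (bits) of a joint probability table (equation 1)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

def decode(p_rs):
    """Optimal maximum likelihood decoding (equations 4 and 5).

    p_rs[r, s] is P(r, s). Returns the predicted stimulus s'(r) for each
    response and the joint table P(s', s) obtained by summing P(r, s)
    over each class C(s').
    """
    s_pred = p_rs.argmax(axis=1)        # s'(r) maximizes P(r, s) over s
    n_s = p_rs.shape[1]
    p_decoded = np.zeros((n_s, n_s))    # rows: s', columns: s
    for r, sp in enumerate(s_pred):
        p_decoded[sp] += p_rs[r]
    return s_pred, p_decoded

# Four responses, two stimuli; responses 0 and 1 fall into class C(0),
# responses 2 and 3 into class C(1).
p = np.array([[0.25, 0.05],
              [0.15, 0.05],
              [0.05, 0.20],
              [0.05, 0.20]])
s_pred, p_decoded = decode(p)
I, I_D = mutual_information(p), mutual_information(p_decoded)
# The data processing theorem: decoding can only lose information.
print(s_pred, I >= I_D)
```

Because the two responses within class C(0) have different profiles across stimuli, the decoded table discards structure and I_D falls strictly below I in this example, which is exactly the loss quantified in what follows.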
The joint probability of showing stimulus s and decoding stimulus s′(r) reads

$$P(s',s) = \sum_{r \in C(s')} P(r,s), \tag{6}$$

and the overall probability of decoding s′,

$$P(s') = \sum_{s} P(s',s) = \sum_{r \in C(s')} P(r). \tag{7}$$

Clearly, with these definitions, the decoded information,

$$I_D = \sum_{s} \sum_{s'} P(s',s)\, \log_2 \left[ \frac{P(s',s)}{P(s')\, P(s)} \right], \tag{8}$$
may be calculated, and has in fact been used in several experimental analyses (Rolls et al., 1996; Treves, 1997; Rolls & Treves, 1998; Panzeri et al., 1999). However, to date, no rigorous relationship between I and ID has been established. The derivation of such a relationship is the main purpose here. When a decoding procedure is performed, r is replaced by s0 . Such a mapping allows the calculation of P ( s0 , s ), after which any additional structure, which may eventually have been present in P ( r, s) , is neglected. For example, if two responses r1 and r2 encode the same stimulus s0 , it becomes irrelevant whether, for a given s, P (r1 , s) is much bigger than P ( r2 , s) or, on the contrary, P ( r1 , s ) ¼ P ( r2 , s ). The only thing that matters is the value of the sum of the two: their global contribution to P ( s0 , s) . As a consequence, it seems natural to consider the detailed variation of P ( r, s ) within each class when estimating the information lost in the decoding. In this spirit, and aiming at quantizing such a loss of information, P (r, s) is written as £ ¤ P s0 ( r) , s ( ) C D ( r, s) , (9) P r, s D Ns0 ( r) where D (r, s) D P ( r, s ) ¡ P[s0 ( r) , s] / Ns0 ( r ) . Thus, the joint probability P (r, s) , which in principle may have quite a complicated shape in R space, is separated into two terms. The rst one is at inside every single class C ( s0 ) , and the second is whatever is needed to re-sum P (r, s) . It should be noticed that X D ( r, s) D 0, (10) r2C ( s0 )
for all $s$. Summing equation 9 over $s$,

$$P(r) = \frac{P[s'(r)]}{N_{s'(r)}} + \Delta(r), \tag{11}$$
Optimal Maximum Likelihood Decoding
where

$$\Delta(r) = \sum_{s} \Delta(r, s), \tag{12}$$

and

$$\sum_{r \in C(s')} \Delta(r) = 0. \tag{13}$$
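These definitions can be made concrete on a small invented example (all probability values below are hypothetical; maximum likelihood decoding selects the $s'(r)$ maximizing $P(r \mid s)$, as in equation 5):

```python
import numpy as np

# Hypothetical joint distribution P(r, s): 4 responses (rows), 2 stimuli (columns).
P_rs = np.array([[0.20, 0.05],
                 [0.15, 0.10],
                 [0.05, 0.20],
                 [0.10, 0.15]])
P_s = P_rs.sum(axis=0)                      # P(s)

def mutual_info(J):
    """Mutual information (bits) of a 2-D joint distribution."""
    pa, pb = J.sum(axis=1), J.sum(axis=0)
    nz = J > 0
    return float((J[nz] * np.log2(J[nz] / np.outer(pa, pb)[nz])).sum())

I = mutual_info(P_rs)

# Maximum likelihood decoding: s'(r) maximizes P(r | s) = P(r, s) / P(s),
# partitioning the responses into classes C(s').
s_dec = np.argmax(P_rs / P_s, axis=1)
n_s = P_rs.shape[1]

# Eq. 6: P(s', s) = sum over r in C(s') of P(r, s)
P_dec = np.zeros((n_s, n_s))
for r, sd in enumerate(s_dec):
    P_dec[sd] += P_rs[r]

I_D = mutual_info(P_dec)                    # decoded information, eq. 8

# Eq. 9: Delta(r, s) = P(r, s) - P[s'(r), s] / N_{s'(r)}
N = np.bincount(s_dec, minlength=n_s)       # class sizes N_{s'}
Delta = P_rs - P_dec[s_dec] / N[s_dec, None]

# Eq. 10: Delta(r, s) sums to zero within each class, for every s.
for sp in range(n_s):
    assert np.allclose(Delta[s_dec == sp].sum(axis=0), 0.0)

print(f"I = {I:.4f} bits, I_D = {I_D:.4f} bits")
```

Here $I_D < I$ because $P(r, s)$ is not flat within the classes; the general inequality is derived next.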
Replacing equations 9 and 11 in equation 1, one arrives at

$$I = I_D + \sum_{r} \sum_{s} P(r, s) \log_2 \left( \frac{P(r, s)}{Q(r, s)} \right), \tag{14}$$
where

$$Q(r, s) = \frac{P[s'(r), s]}{N_{s'}} + \Delta(r)\, \frac{P[s'(r), s]}{P(s')} \tag{15}$$
is a properly defined distribution, since it can be shown to be normalized and nonnegative. The term on the right of equation 14 is the Kullback-Leibler divergence (Kullback, 1968) between the distributions $P$ and $Q$, which is guaranteed to be nonnegative. This confirms the intuitive result $I_D \le I$, the equality being valid only when
$$\Delta(r)\, P[s'(r), s] = \Delta(r, s)\, P[s'(r)], \tag{16}$$
for all $r$ and $s$. Equation 14 states the quantitative difference between the full and the decoded information and is the main result here. The amount of lost information is therefore equal to the informational distance between the original probability distribution $P(r, s)$ and a new function $Q(r, s)$. It can be easily verified that

$$I_D = \sum_{s} \sum_{r} Q(r, s) \log_2 \left( \frac{Q(r, s)}{Q(r)\, Q(s)} \right), \tag{17}$$

where

$$Q(r) = \sum_{s} Q(r, s) = P(r), \qquad Q(s) = \sum_{r} Q(r, s) = P(s). \tag{18}$$
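The distorted distribution of equation 15 and the identities of equations 14 and 18 can be verified numerically on another small invented joint distribution (all numbers hypothetical, chosen so that the $\Delta(r)$ term of equation 15 is nonzero):

```python
import numpy as np

# Hypothetical joint distribution P(r, s): 4 responses, 2 stimuli.
P_rs = np.array([[0.18, 0.04],
                 [0.14, 0.12],
                 [0.06, 0.22],
                 [0.11, 0.13]])
P_s = P_rs.sum(axis=0)
s_dec = np.argmax(P_rs / P_s, axis=1)        # maximum likelihood classes
n_s = P_rs.shape[1]
P_dec = np.zeros((n_s, n_s))
for r, sd in enumerate(s_dec):
    P_dec[sd] += P_rs[r]                     # eq. 6
N = np.bincount(s_dec, minlength=n_s)        # class sizes N_{s'}
P_sp = P_dec.sum(axis=1)                     # P(s')

Delta_rs = P_rs - P_dec[s_dec] / N[s_dec, None]   # eq. 9
Delta_r = Delta_rs.sum(axis=1)                    # eq. 12

# Eq. 15: Q(r, s) = P[s'(r), s]/N_{s'} + Delta(r) P[s'(r), s]/P(s')
Q = P_dec[s_dec] / N[s_dec, None] + Delta_r[:, None] * P_dec[s_dec] / P_sp[s_dec, None]

assert np.isclose(Q.sum(), 1.0) and (Q > 0).all()    # Q is a distribution
assert np.allclose(Q.sum(axis=1), P_rs.sum(axis=1))  # eq. 18: Q(r) = P(r)
assert np.allclose(Q.sum(axis=0), P_s)               # eq. 18: Q(s) = P(s)

def mutual_info(J):
    pa, pb = J.sum(axis=1), J.sum(axis=0)
    nz = J > 0
    return float((J[nz] * np.log2(J[nz] / np.outer(pa, pb)[nz])).sum())

I, I_D = mutual_info(P_rs), mutual_info(P_dec)
kl = float((P_rs * np.log2(P_rs / Q)).sum())   # Kullback-Leibler divergence
print(f"I - I_D = {I - I_D:.6f} = KL(P||Q) = {kl:.6f}")   # equal, per eq. 14
```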
Therefore, the decoded information can be interpreted as a full mutual information between the stimuli and the responses, but with a distorted probability distribution $Q(r, s)$. In this context, the difference $I - I_D$ is no more
than the distance between the true distribution $P(r, s)$ and the distorted one $Q(r, s)$.

When is equation 16 fulfilled? Surely, if there is at most one response in each class, $\Delta$ is always zero, and $I = I_D$. Also, if $P(r, s)$ is already flat in each class, there is no information loss. However, if $P(r, s)$ is not flat inside every class but obeys the condition $P(r, s) = P_{s'}(r)\, P(s', s)$ for a suitable $P(s', s)$ and some function $P_{s'}(r)$ that sums to unity within $C(s')$, one can easily show that equation 16 holds. Notice that this case implies that if $r_1$ and $r_2$ belong to $C(s')$, then $P(r_1, s) / P(r_2, s)$ is independent of $s$. In other words, within each class $C(s')$, the different functions $P(r \mid s)$ obtained by varying $s$ differ from one another by a multiplicative constant. These conditions coincide with the ones given by Panzeri et al. (1999) for having an exact decoding in the short time limit. However, in the derivation here, there are no assumptions about the interval in which responses are measured. Therefore, the decoding being exact whenever equation 16 is fulfilled is not a consequence of the short time limit considered by Panzeri et al. (1999), but rather a general property of maximum likelihood decoding.

Next, by making a second-order Taylor expansion of equation 14 in the distortions $\Delta(r, s)$ and $\Delta(r)$, one may show that

$$I = I_D + \sum_{s} \sum_{s'} P(s', s)\, \frac{E(s', s)}{2 \ln 2} + O(\Delta^3), \tag{19}$$
where

$$E(s', s) = \frac{1}{N_{s'}} \sum_{r \in C(s')} \left[ \left( \frac{\Delta(r, s)}{P(s', s)/N_{s'}} \right)^{2} - \left( \frac{\Delta(r)}{P(s')/N_{s'}} \right)^{2} \right]. \tag{20}$$
Therefore, in the small-$\Delta$ limit, the difference between $I$ and $I_D$ is quadratic in the distortions $\Delta(r, s)$ and $\Delta(r)$. This means that if in a given situation these quantities are guaranteed to be small, then the decoded information will be a good estimate of the full information. Equation 20 is equivalent to

$$E(s', s) = \left\langle \left( \frac{P(r, s)}{P(s', s)/N_{s'}} \right)^{2} - \left( \frac{P(r)}{P(s')/N_{s'}} \right)^{2} \right\rangle_{C(s')}, \tag{21}$$

where

$$\langle f(r) \rangle_{C(s')} = \frac{1}{N_{s'}} \sum_{r \in C(s')} f(r).$$
As a consequence, the relevant parameter in determining the size of $E(s', s)$ is the mean value, within $C(s')$, of a function that essentially measures how different the true probability distributions $P(r, s)$ and $P(r)$ are from their flattened versions, $P(s', s)/N_{s'}$ and $P(s')/N_{s'}$.
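The quadratic scaling predicted by equations 19 through 21 can be checked numerically: starting from a distribution that is flat within each class (so that $I = I_D$), adding a distortion of strength $\epsilon$ that sums to zero within classes should produce a loss $I - I_D$ proportional to $\epsilon^2$. In the sketch below, the partition is held fixed and stands in for the decoding classes, and all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(J):
    pa, pb = J.sum(axis=1), J.sum(axis=0)
    nz = J > 0
    return float((J[nz] * np.log2(J[nz] / np.outer(pa, pb)[nz])).sum())

# Invented setup: 6 responses in 3 fixed classes of 2, and 3 stimuli.
s_dec = np.array([0, 0, 1, 1, 2, 2])
N = np.array([2, 2, 2])
P_cls = np.array([[0.20, 0.08, 0.05],       # P(s', s)
                  [0.06, 0.22, 0.06],
                  [0.04, 0.07, 0.22]])
base = P_cls[s_dec] / N[s_dec, None]        # flat within classes: I = I_D exactly

# A distortion pattern obeying eq. 10: zero sum within each class, for every s.
noise = rng.standard_normal(base.shape)
for c in range(3):
    noise[s_dec == c] -= noise[s_dec == c].mean(axis=0)

def loss(eps):
    """Exact information loss I - I_D at distortion strength eps."""
    P = base * (1.0 + eps * noise)          # stays positive for small eps
    P = P / P.sum()
    P_dec = np.zeros((3, 3))
    for r, c in enumerate(s_dec):
        P_dec[c] += P[r]
    return mutual_info(P) - mutual_info(P_dec)

# Per eq. 19 the loss is quadratic in the distortion, so halving eps
# should divide the loss by about four.
l1, l2 = loss(0.02), loss(0.01)
print(l1 / l2)
```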
To summarize, this note presents the maximum likelihood decoding as an artificial (but useful) distortion of the distribution $P(r, s)$ within each class $C(s')$. The decoded information is also shown to be a mutual information, the latter calculated with the distorted probability distribution. The difference between $I$ and $I_D$ is the Kullback-Leibler distance between the true and distorted distributions. As such, it is always nonnegative, and it is easy to identify the conditions for the equality between the two information measures. Finally, for small distortions $\Delta$, the amount of lost information is expressed as a quadratic function in $\Delta$. In short, the aim of the work is to present a formal way of quantifying the effect of an optimal maximum likelihood decoding.

It should be kept in mind that in real situations, where only a limited amount of data is available, the estimation of $P(r \mid s)$ may well involve a careful analysis in itself. Some kind of assumption (for example, gaussian-shaped response variability) is usually required. The validity of the assumptions made depends on the particular data at hand. An inadequate choice for $P(r \mid s)$ may of course lead to a distorted value of $I$, and in fact, the bias may be in either direction. If the choice of $P(r \mid s)$ does not even allow the correct identification of the maximum likelihood stimulus (see equation 5), then the calculated value of $I_D$ will also be distorted. The purpose of this note, however, is to quantify how much information is lost when passing from $r$ to $s'(r)$. No attempt has been made to quantify $I$ or $I_D$ for different estimations of $P(r \mid s)$.

Sometimes $P(s', s)$ is defined in terms of $P(r, s)$ without actually decoding the stimulus to be associated with each response. For example, $P(s', s)$ can be introduced as $\sum_r P(r, s')\, P(r, s) / P^2(r)$ (Treves, 1997). This approach, although formally sound, is not based on an $r \to s'$ mapping and does not allow a partition of $R$ into classes.
Therefore, it is not directly related to the analysis presented here. However, there might be analogous derivations quantifying the information loss in this case as well.

Acknowledgments
I thank Bill Bialek, Anna Montagnini, and Alessandro Treves for very useful discussions. This work has been partially supported by a grant to A. Treves from the Human Frontier Science Program, number RG 01101998B.

References

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Eskandar, E. N., Richmond, B. J., & Optican, L. M. (1992). Role of inferior temporal neurons in visual memory. I. Temporal encoding of information about visual images, recalled images, and behavioural context. J. Neurophysiol., 68, 1277–1295.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications. Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A., & Kettner, R. E. (1986). Neural population coding of movement direction. Science, 233, 1416–1419.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., & Gross, C. G. (1994). Neural ensemble encoding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.
Golomb, D., Hertz, J., Panzeri, S., Treves, A., & Richmond, B. (1997). How well can we estimate the information carried in neuronal responses from limited samples? Neural Comp., 9, 649–655.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network models, information estimation and spatial tuning. J. Comput. Neurosci., 1, 109–139.
Kullback, S. (1968). Information theory and statistics. New York: Dover.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: III. Information theoretic analysis. J. Neurophysiol., 57, 162–178.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999). On decoding the responses of a population of neurons from short time windows. Neural Comput., 11, 1553–1577.
Rolls, E. T., Critchley, H. D., & Treves, A. (1996). Representation of olfactory information in the primate orbitofrontal cortex. J. Neurophysiol., 75(5), 1982–1996.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Rolls, E. T., Treves, A., Robertson, R. G., Georges-François, P., & Panzeri, S. (1998). Information about spatial view in an ensemble of primate hippocampal cells. J. Neurophysiol., 79, 1797–1813.
Rolls, E. T., Treves, A., & Tovée, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in primate temporal visual area. Exp. Brain Res., 114, 149–162.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neural population codes. Proc. Nat. Ac. Sci. USA, 90, 10749–10753.
Shannon, C. E. (1948). A mathematical theory of communication. AT&T Bell Laboratories Technical Journal, 27, 379–423.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. J. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Treves, A. (1997). On the perceptual structure of face space. BioSyst., 40, 189–196.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.
Treves, A., Skaggs, W. E., & Barnes, C. A. (1996). How much of the hippocampus can be explained by functional constraints? Hippocampus, 6, 666–674.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric space analysis. J. Neurophysiol., 76, 1310–1326. Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms and application. Network, 8, 127–164. Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058. Received March 5, 2001; accepted June 1, 2001.
NOTE
Communicated by André Longtin
The Reliability of the Stochastic Active Rotator
K. Pakdaman
[email protected]
Inserm U444, Faculté de Médecine Saint-Antoine, 75571 Paris Cedex 12, France

The reliability of firing of excitable-oscillating systems is studied through the response of the active rotator, a neuronal model evolving on the unit circle, to white gaussian noise. A stochastic return map is introduced that captures the behavior of the model. This map has two fixed points: one stable and the other unstable. Iterates of all initial conditions except the unstable point tend to the stable fixed point for almost all input realizations. This means that to a given input realization, there corresponds a unique asymptotic response. In this way, repetitive stimulation with the same segment of noise realization evokes, after possibly a transient time, the same response in the active rotator. In other words, this model responds reliably to such inputs. It is argued that this results from the nonuniform motion of the active rotator around the unit circle and that similar results hold for other neuronal models whose dynamics can be approximated by phase dynamics similar to those of the active rotator.
Neurons are subject to various forms of internal and external noise (for a review, see Holden, 1976). Such perturbations can alter their response by reducing nonlinear distortions in their input-output relation or enhancing their sensitivity to weak signals (for a review, see Segundo, Vibert, Pakdaman, Stiber, & Diez-Martínez, 1994). The mechanisms of such noise effects have been investigated using neuronal models (Stein, French, & Holden, 1972; Longtin, 1993; Collins, Chow, & Imhoff, 1995; Bulsara, Elston, Doering, Lowen, & Lindenberg, 1996; Shimokawa, Pakdaman, & Sato, 1999a, 1999b; Shimokawa, Pakdaman, Takahata, Tanabe, & Sato, 2000; Shimokawa, Rogel, Pakdaman, & Sato, 1999; Tanabe, Sato, & Pakdaman, 1999). Noise-like inputs have also been used to reveal the firing reliability of neurons. Preparations of well-identified aplysia neurons (Bryant & Segundo, 1976), rat muscle spindles (Kröller, Grüsser, & Weiss, 1988), rat neocortical neurons (Mainen & Sejnowski, 1995), aplysia buccal pacemakers (Hunter, Milton, Thomas, & Cowan, 1998), salamander and rabbit retinal ganglion cells (Berry, Warland, & Meister, 1997), fly ganglion cells (Haag & Borst, 1998), and cat lateral geniculate cells (Reinagel & Reid, 2000) have all exhibited a remarkable reproducibility of discharge times when stimulated repetitively by the same noiselike input. Reliable firing constitutes a prerequisite for nervous systems to operate temporal codes. In this respect,

Neural Computation 14, 781–792 (2002)
© 2002 Massachusetts Institute of Technology
determining the conditions for reliable firing also serves to clarify whether nervous systems can potentially take advantage of it. The main purpose of this work is to present a theoretical analysis of firing reliability and identify some of the key issues involved. Previous analyses of this phenomenon have discussed the importance of the voltage slope at threshold crossing and the frequency content of the signal for reliable firing (Hunter et al., 1998; Cecchi et al., 2000). The approach here differs from these others in that it relies on random dynamical system theory (Arnold, 1998). In fact, it is shown through the presentation of an elementary model that results from this field provide the proper framework for analyzing the reliability of neuronal systems. The model used to illustrate this point is the active rotator (AR) (Shinomoto & Kuramoto, 1986). The AR consists of a system moving around the unit circle whose phase $\Theta$ satisfies

$$\frac{d\Theta}{dt} = f(\Theta) + I(t), \tag{1}$$
where $\Theta \in S^1$ (the unit circle), $f(\Theta) = 1 - a \sin(\Theta)$, and $I(t)$ represents the stimulation. We denote by $W(t, \Theta)$ the state of the AR at time $t$ given the initial condition $\Theta$. The AR is one of the most elementary models of excitable-oscillating systems and is similar to the θ-neuron derived by Gutkin and Ermentrout (1998) as a canonical model for type I membranes. When $|a| < 1$, $\Theta$ rotates around the unit circle, displaying periodic oscillations. Conversely, when $|a| > 1$, there is a pair of equilibrium phases, $\Theta_s$ and $\Theta_u$, on the unit circle, the former locally stable and the latter unstable. Small perturbations of $\Theta_s$ are rapidly damped out, while larger ones, which take the system over $\Theta_u$, result in a return to the stable point along the longer arc. In this way, $\Theta_u$ acts as a threshold. Firing is here defined as $\Theta$ rotating counterclockwise through $2\pi/3$.

The noisy AR, in which $I$ is white gaussian noise, has been used to investigate the influence of random perturbations on coupled oscillators (Stratanovich, 1967). The AR has also been one of the first models in which two topics, the influence of noise on the transition between excitable and oscillating regimes (Sigeti & Horsthemke, 1989) and stochastic resonance (Wiesenfeld, Pierson, Pantazelou, & Moss, 1994), have been studied. Similarly, the related θ-neuron has been used to investigate the mechanisms underlying high interspike interval variability (Gutkin & Ermentrout, 1998). Furthermore, assemblies of interacting ARs have been shown to display synchronous oscillations in the presence of some noise (Shinomoto & Kuramoto, 1986; Kurrer & Schulten, 1995), which strongly influence their response to weak periodic forcing (Tanabe, Shimokawa, Sato, & Pakdaman, 1999). These studies are essentially concerned with noise-induced oscillations and their regularity. They do not treat reliability. This work relies on a different standpoint to address this question. To clarify the basis of the approach, the case of periodic forcing is first described.

When the stimulation $I$ is a $T$-periodic signal, the dynamics of the forced AR are analyzed through the iterates of a stroboscopic map $F_T$, which to $\Theta \in S^1$ associates $\psi = F_T(\Theta) = W(T, \Theta)$, that is, the state of the AR after a time $T$ given the initial condition $\Theta$. The stroboscopic map is an orientation-preserving diffeomorphism on $S^1$, so that its iterates display mainly one of two behaviors (de Melo & van Strien, 1993). For all initial $\Theta_0 \in S^1$, the sequence $\Theta_n = F_T^n(\Theta_0)$ is either (1) asymptotically periodic with a period that does not depend on the initial condition or (2) quasiperiodic with an orbit that is dense in $S^1$. When $\Theta_n$ is periodic, the unit displays a periodic succession of $n$ firings per $m$ input cycles, referred to as $n{:}m$ phase locking. In this case, the units display phase synchrony with respect to the input. To illustrate this, consider an ensemble of ARs starting at randomly distributed initial phases and all receiving the same periodic stimulation. After a transient time, discharges of the units cluster into $m$ groups such that units within each group fire simultaneously, and the discharge times of the different groups are phase shifted with respect to one another. A simple illustration is the case of 1:2 locking, where units fire every other input cycle, so that depending on the initial condition, they will fire in either even or odd cycles. In such situations, there is synchronization with respect to the input, but this does not ensure that all units discharge simultaneously.
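The basic object of this analysis, the stroboscopic map, can be constructed numerically. The sketch below (the sinusoidal forcing $b\cos(2\pi t/T)$ and all parameter values are illustrative assumptions, not taken from the text) integrates the lifted phase over one forcing period and checks that the resulting map is an orientation-preserving circle map:

```python
import numpy as np

# Stroboscopic map F_T of the periodically forced AR, built by RK4 integration
# on the lift (real line). Forcing b*cos(2*pi*t/T) and parameters are illustrative.
a, b, T = 1.1, 0.6, 5.0
n = 5000
dt = T / n

def rhs(theta, t):
    return 1.0 - a * np.sin(theta) + b * np.cos(2.0 * np.pi * t / T)

u = np.linspace(0.0, 2.0 * np.pi, 200)   # grid of initial phases
x, t = u.copy(), 0.0
for _ in range(n):                        # classical RK4, vectorized over the grid
    k1 = rhs(x, t)
    k2 = rhs(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = rhs(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = rhs(x + dt * k3, t + dt)
    x = x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    t += dt

F = x                                     # F_T on the lift
assert (np.diff(F) > 0).all()             # orientation preserving (increasing)
assert np.isclose(F[-1] - F[0], 2.0 * np.pi)   # F(u + 2*pi) = F(u) + 2*pi
```

Iterating this map modulo $2\pi$ then yields the phase-locked or quasiperiodic behaviors described above.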
To distinguish between synchronization with respect to the input and interunit synchrony, that is, simultaneous firing of the units (or, equivalently, alignment of discharge times in raster plots recorded from repetitive stimulation of a single neuron), we say that the firing is absolutely reliable when the latter is satisfied, regardless of synchrony to the input. This definition reflects the experimental protocol used to assess neuronal reliability (see the references in the second paragraph). In this sense, absolutely reliable firing is the exception rather than the rule in response to periodic stimulation, as (1) quasiperiodic firing is unreliable, and (2) in phase-locked regimes, either input parameters have to be set so that the system operates in $n{:}1$ phase locking or initial conditions have to be selected to ensure full interunit synchrony. This situation contrasts with the case of noisy forcing, where the system almost surely achieves absolutely reliable firing, irrespective of model and input parameters. This holds in both the excitable regime ($|a| > 1$) and the oscillating regime ($|a| < 1$), as long as the AR's motion around the circle is nonuniform ($a \neq 0$). The remainder of this note is devoted to the clarification of this point. At first, this result is announced from the point of view of random dynamical system theory. The presentation of this part follows Arnold (1998). Then, the interpretation of this result is developed through the introduction of a stochastic map, which plays a role similar to that of the stroboscopic map.
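Before turning to the formal analysis, the noisy AR itself is easy to simulate. A minimal sketch (Euler-Maruyama scheme; all parameter values are illustrative) counts firings as completed counterclockwise rotations of the lifted phase:

```python
import numpy as np

# Euler-Maruyama integration of the noisy AR (eq. 1 lifted to the real line);
# firings are counted as completed counterclockwise rotations.
# Illustrative parameters: excitable regime (a > 1) with moderate noise.
rng = np.random.default_rng(4)
a, sigma, dt, t_end = 1.1, 0.5, 1e-3, 200.0
n_steps = int(t_end / dt)

x = 0.0
for w in sigma * np.sqrt(dt) * rng.standard_normal(n_steps):
    x += (1.0 - a * np.sin(x)) * dt + w

n_spikes = int(x // (2.0 * np.pi))        # completed rotations = firings
print(f"{n_spikes} firings in {t_end:.0f} time units")
```

Although deterministic dynamics with $a > 1$ would stay at the stable equilibrium, the noise drives repetitive firing.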
When the stimulation $I$ is white gaussian noise of intensity $\sigma$, equation 1 is a stochastic differential equation with an associated Fokker-Planck equation,

$$\frac{\partial p}{\partial t}(t, \Theta) = -\frac{\partial}{\partial \Theta}\left\{ f(\Theta)\, p(t, \Theta) \right\} + \frac{\sigma^2}{2} \frac{\partial^2 p}{\partial \Theta^2}(t, \Theta), \tag{2}$$
with periodic boundary conditions $p(t, \Theta + 2\pi) = p(t, \Theta)$. For all initial densities, the solutions of equation 2 tend eventually to the same stationary distribution, denoted $p_s(\Theta)$ (Stratanovich, 1967). The Lyapunov exponent associated with equation 1 is then given by (Arnold, 1998)

$$\lambda = \lim_{t \to \infty} \frac{1}{t} \int_0^t f'(\Theta(s))\, ds \tag{3}$$
$$= \int_0^{2\pi} f'(\Theta)\, p_s(\Theta)\, d\Theta \tag{4}$$
$$= -\frac{\sigma^2}{2} \int_0^{2\pi} \frac{[p_s'(\Theta)]^2}{p_s(\Theta)}\, d\Theta, \tag{5}$$
where $f'$ and $p_s'$ are, respectively, the derivatives of $f$ and $p_s$. Equation 5 shows that $\lambda < 0$ unless $p_s'(\Theta) = 0$ for all $\Theta$. This means that $\lambda < 0$ for $a \neq 0$, and $\lambda = 0$ when $a = 0$. Negative Lyapunov exponents are indicative of local stability of solutions, in the sense that when the Lyapunov exponent computed along a solution $W(t, \Theta_0)$ going through the initial condition $\Theta_0$ is negative, then solutions going through $\Theta_0 + \delta\Theta$ with $\delta\Theta$ sufficiently small tend asymptotically to $W(t, \Theta_0)$. Thus, local stability is defined in terms of asymptotic stability of solutions with respect to small variations of initial conditions. When the stochastic dynamics are ergodic and are confined to a bounded set, local stability also constrains the global stability with respect to variations in initial conditions (Le Jan, 1987; Crauel, 1990). More precisely, the fact that the Lyapunov exponent is negative implies that when $a \neq 0$, there are two stationary stochastic processes, $\alpha(t)$ and $\beta(t)$, that are solutions of equation 1 and such that all solutions of equation 1, except $\beta(t)$, eventually approach $\alpha(t)$. (In general, for a nonconstant $f$, there are finitely many such processes $\alpha(t)$ and $\beta(t)$.) Conversely, when time is reversed, all solutions except $\alpha(t)$ tend to $\beta(t)$. The two solutions $\alpha(t)$ and $\beta(t)$ are referred to as stochastic equilibria of the system (Arnold, 1998), with $\alpha(t)$ being stable and $\beta(t)$ unstable.

In the deterministic AR, a bifurcation takes place at $|a| = 1$, where the two equilibria coexisting for $|a| > 1$ coalesce and then vanish for $|a| < 1$. The bifurcation at $|a| = 1$ separates the excitable and oscillating regimes. There are no such qualitative changes in the stochastic AR: for all $a \neq 0$, there are exactly one stable and one unstable stochastic equilibrium, even in the oscillating range $|a| < 1$ ($a \neq 0$), where the deterministic AR has no equilibria.

The following paragraphs detail how the results on the random dynamics of the AR imply that the AR receiving a white noise stimulation is reliable. The key to this interpretation is the introduction of a stochastic map, which plays a role similar to that of the stroboscopic map, except that in contrast to the stroboscopic map, which can display various forms of phase locking and quasiperiodicity depending on input and system parameters, the iterates of the stochastic map tend almost surely to a unique fixed point, regardless of the choice of the parameters, thus ensuring absolute reliability. For the construction of the stochastic return map, we introduce the equation

$$\frac{dx}{dt} = f(x) + I(t), \tag{6}$$
which is the same as equation 1, except that $x$ is on the real line, not on $S^1$. We recall that $f(x) = 1 - a \sin(x)$ and $I(t)$ represents white gaussian noise. The construction of the map holds in both excitable and oscillating regimes. For clarity, in the following we use Roman letters $x$, $u$, and $v$ when the dynamics is considered on the real line, and Greek letters $\Theta$, $\psi$ when it is considered on the unit circle (identified with $[0, 2\pi]$, with $\Theta = 0$ and $\Theta = 2\pi$ representing the same point in $S^1$).

Starting from $x(t = 0) = 0$, we denote by $t_1$ the first passage time of $x(t)$ through $2\pi$. In other words, $x(t_1) = 2\pi$, and $x(s) < 2\pi$ for all $s \in [0, t_1)$. For $u \in [0, 2\pi]$, we define $R_{t_1}(u) = x(t_1, u)$, where $t \to x(t, u)$ is the solution of equation 6 with the initial condition $x(0, u) = u$. Then $R_{t_1}$ is a diffeomorphism from $[0, 2\pi]$ onto $[2\pi, 4\pi]$ and satisfies (1) $R_{t_1}(0) = 2\pi$, (2) $R_{t_1}(2\pi) = 4\pi$, and (3) $R_{t_1}(u) > R_{t_1}(v)$ whenever $u > v$; that is, $R_{t_1}$ is strictly increasing. The construction of $R_{t_1}$ is illustrated in Figure 1. For $\Theta = x \in [0, 2\pi]$, we define the stochastic return map $F_{t_1}(\Theta) = R_{t_1}(x) - 2\pi$. The map $F_{t_1}$ is a monotonically increasing diffeomorphism of $[0, 2\pi]$ onto itself, which satisfies $F_{t_1}(0) = 0$ and $F_{t_1}(2\pi) = 2\pi$. The map $F_{t_1}$, when considered from $S^1$ into itself, is the stochastic return map associated with the noisy active rotator. Indeed, $t_1$ is the time at which the state point initially at $\Theta = 0$ accomplishes one counterclockwise rotation around the circle and returns to the same position. In the same way, we denote by $t_2$ ($> t_1$) the time after which this state point accomplishes another counterclockwise rotation around the circle. This rotation is associated with a new return map $F_{T_2}$, with $T_2 = t_2 - t_1$ the duration of the second rotation. The construction can be continued, with $F_{T_n}$ being the return map between the $(n-1)$th and the $n$th counterclockwise rotations around the circle.
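This construction can be reproduced numerically. The sketch below (Euler-Maruyama scheme; parameter values are illustrative) integrates equation 6 for a grid of initial conditions driven by one shared noise realization, stopping at the first passage of $x(t, 0)$ through $2\pi$:

```python
import numpy as np

# Numerical construction of the return map R_{t1}: a grid of initial
# conditions is driven by ONE shared noise realization; integration stops
# when the solution started at 0 first crosses 2*pi. Parameters illustrative.
rng = np.random.default_rng(3)
a, sigma, dt = 1.1, 0.5, 1e-3

u = np.linspace(0.0, 2.0 * np.pi, 100)   # initial conditions x(0, u) = u
x = u.copy()
n = 0
while x[0] < 2.0 * np.pi:                # x[0] is the solution started at 0
    w = sigma * np.sqrt(dt) * rng.standard_normal()
    x = x + (1.0 - a * np.sin(x)) * dt + w    # same increment for every u
    n += 1

R = x                                    # R_{t1}(u) = x(t1, u)
assert (np.diff(R) > 0).all()                      # strictly increasing
assert np.isclose(R[-1] - R[0], 2.0 * np.pi)       # R(2*pi) = R(0) + 2*pi
assert 2.0 * np.pi <= R[0] < 2.0 * np.pi + 0.1     # R(0) = 2*pi up to one-step overshoot
print(f"t1 ~ {n * dt:.2f} time units")
```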
Figure 1: Schematic illustration of the construction of the return map. The abscissa is time in arbitrary units, and the ordinate is the dimensionless variable $x$. The curves represent solutions $x(t, 0)$, $x(t, v)$, $x(t, u)$, and $x(t, 2\pi)$ of equation 6 for the initial conditions $0$, $v$, $u$, and $2\pi$. Given that the right-hand side of equation 6 is $2\pi$-periodic with respect to $x$, we have $x(t, 2\pi) = 2\pi + x(t, 0)$; the upper thick line is obtained by translating upward by $2\pi$ the lower thick line. Thus, $x(t, 2\pi)$ reaches $4\pi$ for the first time at the same $t_1$ at which $x(t, 0)$ reaches $2\pi$ for the first time. In this way, given that equation 6 is one-dimensional so that its solutions cannot cross one another when plotted against time, for any $u$ in $(0, 2\pi)$, $x(t, u)$ is confined between the two thick lines, and, consequently, $x(t_1, u)$ is within $(2\pi, 4\pi)$. In other words, $u \to x(t_1, u)$ maps $[0, 2\pi]$ onto $[2\pi, 4\pi]$. This map is denoted by $R_{t_1}$, and it satisfies $R_{t_1}(v) < R_{t_1}(u)$ for $v < u$.

The $F_{T_n}$ form a family of independent and identically distributed orientation-preserving circle diffeomorphisms. For the analysis of the dynamics, it is more convenient to consider the $F_{T_n}$ as maps of $[0, 2\pi]$ onto itself rather than as circle maps. Each $F_{T_n}$ is strictly increasing on this interval and satisfies $F_{T_n}(0) = 0$ and $F_{T_n}(2\pi) = 2\pi$. Denoting $G_n = F_{T_n} \circ F_{T_{n-1}} \circ \cdots \circ F_{T_1}$, the dynamics of the noisy AR is determined by the iterates $\Theta_n = G_n(\Theta_0)$. These iterates have two fixed points: $G_n(0) = 0$ and $G_n(2\pi) = 2\pi$. Usually the local stability of a fixed point of a map is determined by the slope of the map at that point. The local stability of the fixed point $\Theta = 0$ with respect to small variations in initial conditions can be measured by the following quantity, which extends the notion of the slope of a map at a fixed point (Arnold, 1998):

$$\bar{S} = \lim_{n \to \infty} \left[ \prod_{k=1}^{n} F_{T_k}'(0) \right]^{1/n}. \tag{7}$$
Given that

$$F_{T_k}'(0) = \exp\left\{ \int_{t_{k-1}}^{t_k} f'(\Theta(s))\, ds \right\}, \tag{8}$$

where $\Theta(s)$ denotes the solution of equation 1 with $\Theta(0) = 0$, performing exactly one counterclockwise turn on every interval $[t_{k-1}, t_k]$, we have

$$\bar{S} = \lim_{n \to \infty} \exp\left\{ \frac{1}{n} \int_0^{t_n} f'(\Theta(s))\, ds \right\} \tag{9}$$
$$= \exp\left[ \lambda \bar{T} \right], \tag{10}$$
where $\bar{T} = \lim_{n \to \infty} t_n / n$, so that $1/\bar{T}$ is the mean rate of rotation (i.e., the mean discharge rate), and $\lambda$ is the Lyapunov exponent defined in equation 3. Given that $\lambda < 0$ for $a \neq 0$, we have $0 \le \bar{S} < 1$, which implies that the fixed point $\Theta = 0$ is locally stable. Given that $\Theta = 0$ and $\Theta = 2\pi$ represent the same point in $S^1$, the slope at $\Theta = 2\pi$ is exactly the same as that at $\Theta = 0$, so that both fixed points are locally stable under the action of the $G_n$. This means that starting with some $\Theta_0 \in [0, 2\pi]$ close to zero (resp. $2\pi$), the sequence $\Theta_n = G_n(\Theta_0)$ tends to zero (resp. $2\pi$). Furthermore, since the $G_n$ are all increasing, if $\Theta_n = G_n(\Theta_0) \to 0$ (resp. $2\pi$) as $n \to +\infty$, so does $\Theta_n' = G_n(\Theta_0')$ for all $0 \le \Theta_0' \le \Theta_0$ (resp. $\Theta_0 \le \Theta_0' \le 2\pi$). We take $\Theta^* = \sup\{\Theta_0 : G_n(\Theta_0) \to 0\}$ and $\psi^* = \inf\{\Theta_0 : G_n(\Theta_0) \to 2\pi\}$; then $[0, \Theta^*)$ and $(\psi^*, 2\pi]$ are the basins of attraction of zero and $2\pi$, and we have $\Theta^* \le \psi^*$. In fact, because of the results on the random dynamics of the AR, we have $\Theta^* = \psi^* = \beta(0)$ and, more generally, $G_n(\Theta^*) = \beta(t_n)$. Indeed, equation 1 has the unique unstable stochastic equilibrium $\beta(t)$, and all other solutions aggregate around a single one, $\alpha(t)$. Assuming $\Theta^* < \psi^*$ contradicts this, because it implies that solutions with initial conditions $\Theta$ in the nonempty open interval $(\Theta^*, \psi^*)$ would not converge to the others. Thus, this set must be empty, and $\Theta^* = \psi^* = \beta(0)$. This implies that the iterates of all points on $[0, 2\pi]$, except $\Theta^*$, tend either to zero or $2\pi$, depending on whether they are smaller or larger than $\Theta^*$. Figure 2 illustrates this point. The two panels represent examples of $G_n$ for two different values of $a$, in the excitable and oscillating regimes. In both examples, the iterates become steplike, with plateaus at zero and $2\pi$. The sharp increase between the two flat parts takes place around the iterates of $\Theta^*$.
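The sign of $\lambda$, and hence $\bar{S} = e^{\lambda \bar{T}} < 1$, can be checked by a Monte Carlo evaluation of the time average in equation 3 (Euler-Maruyama scheme; parameter values are illustrative). For $a = 0$, $f' \equiv 0$, so the exponent vanishes identically:

```python
import numpy as np
from math import sin, cos

# Monte Carlo estimate of the Lyapunov exponent via the time average of eq. 3,
# with f'(Theta) = -a*cos(Theta). Illustrative parameters.
def lyapunov(a, sigma, dt=1e-3, n_steps=1_000_000, seed=2):
    rng = np.random.default_rng(seed)
    theta, acc = 0.0, 0.0
    for w in sigma * np.sqrt(dt) * rng.standard_normal(n_steps):
        acc += -a * cos(theta) * dt        # accumulate f'(Theta(s)) ds
        theta += (1.0 - a * sin(theta)) * dt + w
    return acc / (n_steps * dt)

lam = lyapunov(a=1.1, sigma=0.5)    # nonuniform rotation: lambda < 0
lam0 = lyapunov(a=0.0, sigma=0.5)   # uniform rotation: f' = 0, so lambda = 0
print(lam, lam0)
```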
The fact that for the interval maps the iterates of initial conditions tend to zero or $2\pi$ means that for the noisy AR, the corresponding sequence $\Theta_n$ tends to the origin of phases, because zero and $2\pi$ represent the same point on $S^1$. In other words, this point is the unique stable fixed point of the products of stochastic return maps of the AR, and it attracts all orbits except that of $\Theta^*$.

Figure 2: Iterates of the stochastic circle map associated with the AR, for $a = 0.9$ and $\sigma = 0.2$ (left) and $a = 1.1$ and $\sigma = 0.5$ (right). The left panel shows $G_n(\Theta)$ for $n = 1, 3, 5, 10$, and $15$, and the right panel for $n = 1$ and $2$. In both cases, the maps $G_n$ tend to a steplike shape, illustrating the convergence of the iterates of most initial conditions to either zero or $2\pi$.

Finally, the fact that iterates of most initial phases tend to zero means that after some transient time, AR units started with different initial conditions discharge simultaneously: the spike train evoked by the noiselike stimulation is absolutely reliable. In this sense, noiselike stimulations lead to a response that, as far as the long-term dynamics of the iterates of the return map are concerned, resembles 1:1 phase locking. In other words, while the map associated with periodic forcing can have diverse dynamics depending on the model and stimulation parameters, the one derived for white gaussian noise stimulation always converges to a unique fixed point. This difference is not confined to the particular example presented here. In fact, orbits of products of independent and identically distributed orientation-preserving diffeomorphisms converge to a unique fixed point under more general conditions (Kaijser, 1993). This remarkable property suggests that whenever the response of a neuron can be approximated by an orientation-preserving circle map, variable inputs evoke reliable firing, even when regular ones, such as pacemaker stimulation, do not. We have verified this point in a number of examples using neuronal models such as the FitzHugh-Nagumo (Pakdaman & Mestivier, 2001).

In conclusion, this work has presented an analysis of firing reliability in a neuronal model, namely, the active rotator. It showed that for this model, the firing evoked by white gaussian noise stimulation is reliable. The key to this result was that when the parameter $a$ is different from zero, the Lyapunov exponent of the system is negative, which indicates that trajectories converge to one another. The parameter $a$ controls the motion of the system around the circle. Thus, the important property of the system that leads to reliability is the fact that the speed of the movement of the system around the circle is not constant: at some phases, the system moves faster than
at others. Let us provide a heuristic explanation for the relation between the nonuniform motion around the circle and the reliability through one example, that is, when $0 \le a < 1$. When $a = 0$, all points on the circle rotate at the same speed, regardless of their position, even when there is a stimulation. Thus, the distance between two points remains the same over time. In contrast, when $0 < a < 1$, points get close to one another on $(-\pi/2, \pi/2)$, where $f' < 0$, and move apart in $(\pi/2, 3\pi/2)$, where $f' > 0$. Thus, there is an alternation of contraction and expansion regimes. In the absence of stimulation ($I = 0$), these effects balance each other out, so that after one turn around the circle, the distance between two points recovers its initial value. Conversely, when $I$ is a white gaussian noise, the balance between the expansion and the contraction breaks down: the system spends more time in the contracting regions. This is the meaning of equation 4, in which $p_s(\Theta)\, d\Theta$ can be interpreted as the fraction of time the AR spends in $(\Theta, \Theta + d\Theta)$. Because of this difference, the distance between points decreases over time, leading to the eventual aggregation of trajectories around a single one. Interestingly, nonuniform movement around a closed trajectory is common in neuronal systems, because typically the action potential lasts a few milliseconds and its onset corresponds to the expanding regime, while the refractory period can take orders of magnitude longer and corresponds to the contracting regime. In this way, our analysis predicts that such systems respond reliably to noiselike stimuli as long as they remain in the vicinity of the closed trajectory.

The analysis performed in this work establishes that for a typical noiselike stimulation, there exists a single asymptotically stable solution, which attracts the others. Thus, the firing sequence observed is, after possibly a transient time, formed by the discharge times of this specific solution.
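The aggregation of trajectories can be observed directly in simulation: an ensemble of phases driven by one shared noise realization collapses onto a single trajectory, which is the operational meaning of absolute reliability. The sketch below uses the parameters of Figure 2's left panel ($a = 0.9$, $\sigma = 0.2$); grid size, step, and duration are illustrative choices:

```python
import numpy as np

# An ensemble of ARs with different initial phases, all driven by the SAME
# noise realization (Euler-Maruyama). The circular order parameter
# |<exp(i*Theta)>| goes from ~0 (spread out) to ~1 (aligned).
rng = np.random.default_rng(5)
a, sigma, dt, n_steps = 0.9, 0.2, 1e-3, 500_000

theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
r0 = np.abs(np.exp(1j * theta).mean())      # ~0: phases spread around the circle
for w in sigma * np.sqrt(dt) * rng.standard_normal(n_steps):
    theta = theta + (1.0 - a * np.sin(theta)) * dt + w   # shared increment

r1 = np.abs(np.exp(1j * theta).mean())      # ~1: phases aligned
print(f"circular order parameter: before {r0:.3f}, after {r1:.3f}")
```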
This phenomenon accounts for absolute reliability. Neurons are subject to internal and external noise, due, for instance, to spontaneous synaptic activity and fluctuations in the opening and closing of ionic channels (White, Rubinstein, & Kay, 2000). Adding perturbations to the dynamics of the AR, for instance, through an extra white gaussian noise term in equation 1 representing internal noise, is one way to account for these and evaluate their influence on neuronal reliability. Such analyses have been carried out in previous studies of the AR and other neuronal models. The following paragraphs provide a synthetic summary of their results, adapted to the case of the AR, with the notations and concepts introduced in this study.

In terms of the dynamics of the aperiodically forced AR (see equation 1), small perturbations, representing noise, evoke small variations around the stable stochastic equilibrium point a(t). These fluctuations induce some variability in the discharge times. The extent of the variability is a decreasing function of the slope of the stochastic equilibrium at the onset of firing (Freidlin & Wentzell, 1998). Heuristically, this means that the steeper the slope at threshold and the shorter the time interval the system spends in the expanding region, the lower is the perturbation-induced variability in the discharge time. The importance of the slope at threshold in firing reliability has been analyzed by Cecchi et al. (2000).

The influence of large perturbations on firing reliability is the one expected when the AR is in the oscillating regime: reliability decreases with perturbation intensity. The situation is similar when the AR is excitable and the amplitude of the stimulation I(t) is large. However, for an excitable AR receiving a weak input I(t), firing reliability has a nonmonotonic dependence on the perturbation intensity: it increases up to a certain optimal perturbation level and then decreases. This surprising effect comes from the fact that for low-subthreshold perturbations, discharges appear and cluster around time intervals where the stochastic equilibrium a(t) displays large subthreshold oscillations. Descriptions of this phenomenon are given when the signal I(t) is (1) a single excitatory postsynaptic potential, in Pei, Wilkens, and Moss (1996), Tanabe, Sato, and Pakdaman (1999), and Tanabe and Pakdaman (in press); (2) a periodic input, in Shimokawa, Rogel, et al. (1999); and (3) a periodic input or an aperiodic one resembling the one considered in this work, in Tanabe and Pakdaman (in press b).

Finally, besides the analysis of the specific model, another contribution of this work is methodological. In the same way that the standard methods from the geometric theory of dynamical systems have proven to be useful in understanding the complex responses of neurons to periodic stimulation or the complex intrinsic oscillations displayed by some cells, the extension of these methods to stochastic systems—the theory of random dynamical systems—is an invaluable tool for analyzing the response of neurons to other classes of stimulation, such as repeated presentation of the same noiselike inputs.
The example treated in this work aims to illustrate this point by showing how the issue of reliability of firing could be considered as that of the convergence of a random dynamical system to a stochastic fixed point. As a confirmation, we have applied this methodology to other neuronal models, such as the Hodgkin-Huxley model and the FitzHugh-Nagumo model. The detailed results will be reported elsewhere (Pakdaman & Tanabe, in press). Furthermore, our analysis also highlighted that the key measure for reliability is the Lyapunov exponent of the system. When this quantity is negative, the firing tends to be reliable. In this way, we have at our disposal a measure that can be estimated from experimental recordings and can be used in the assessment of firing reliability.
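Since the Lyapunov exponent is singled out as the key measure, the sketch below shows one way it might be estimated numerically: average the derivative of the drift, f′(θ) = −a cos θ, along a single noise-driven trajectory. This is our own illustrative construction; the drift f(θ) = 1 − a sin θ and all parameter values are assumptions rather than the paper's code.

```python
import math
import numpy as np

# Hedged sketch (assumed drift and parameters): estimate the Lyapunov exponent
# of an active rotator theta' = 1 - a*sin(theta) + I(t) as the time average of
# f'(theta(t)) = -a*cos(theta(t)) along one noise-driven trajectory.  A
# negative value indicates convergence of trajectories, i.e., reliable firing.
def lyapunov_estimate(a=0.8, sigma=1.0, dt=0.01, steps=200_000, seed=1):
    rng = np.random.default_rng(seed)
    stimulus = sigma * math.sqrt(dt) * rng.standard_normal(steps)
    theta, acc = 0.0, 0.0
    for n in stimulus:
        acc += -a * math.cos(theta) * dt                 # integral of f'(theta(t)) dt
        theta += (1.0 - a * math.sin(theta)) * dt + n    # Euler-Maruyama step
    return acc / (steps * dt)                            # time average

print(lyapunov_estimate())   # expected to be negative for 0 < a < 1 under noise
```

For a = 0 the drift derivative vanishes identically and the estimate is zero, consistent with the unreliable uniform-rotation case discussed earlier.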
References

Arnold, L. (1998). Random dynamical systems. Berlin: Springer-Verlag.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. U.S.A., 94, 5411–5416.
Bryant, H., & Segundo, J. P. (1976). Spike initiation by transmembrane current: A white noise analysis. J. Physiol., 260, 279–314.
Bulsara, A. R., Elston, T. C., Doering, C. R., Lowen, S. B., & Lindenberg, K. (1996). Cooperative behavior in periodically driven noisy integrate-fire models of neuronal dynamics. Phys. Rev. E, 53, 3958–3969.
Cecchi, G. A., Sigman, M., Alonso, J. M., Martinez, L., Chialvo, D. R., & Magnasco, M. O. (2000). Noise in neurons is message dependent. Proc. Natl. Acad. Sci. U.S.A., 97, 5557–5561.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Crauel, H. (1990). Extremal exponents of random dynamical systems do not vanish. J. Dynam. Differ. Equ., 2, 245–291.
de Melo, W., & van Strien, S. (1993). One dimensional dynamics. Berlin: Springer-Verlag.
Freidlin, M. I., & Wentzell, A. D. (1998). Random perturbations of dynamical systems (2nd ed.). Berlin: Springer-Verlag.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10, 1047–1065.
Haag, J., & Borst, A. (1998). Encoding of visual motion information and reliability in spiking and graded potential neurons. J. Neurosci., 17, 4809–4819.
Holden, A. V. (1976). Models of the stochastic activity of neurones. Berlin: Springer-Verlag.
Hunter, J. D., Milton, J. G., Thomas, P. J., & Cowan, J. D. (1998). Resonance effect for neural spike time reliability. J. Neurophysiol., 80, 1427–1438.
Kaijser, T. (1993). On stochastic perturbations of iterations of circle maps. Physica D, 68, 201–231.
Kröller, J., Grüsser, O. J., & Weiss, L. R. (1988). Observations on phase-locking within the response of primary muscle spindle afferents to pseudo-random stretch. Biol. Cybern., 59, 49–54.
Kurrer, C., & Schulten, K. (1995). Noise-induced synchronous neuronal oscillations. Phys. Rev. E, 51, 6213–6218.
Le Jan, Y. (1987). Équilibre statistique pour les produits de difféomorphismes aléatoires indépendants. Ann. Inst. Henri Poincaré Probab. Statist., 23, 111–120.
Longtin, A. (1993). Stochastic resonance in neuron models. J. Stat. Phys., 70, 309–327.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Pakdaman, K., & Mestivier, D. (2001). External noise synchronizes forced oscillators. Phys. Rev. E, 64, 030901(R).
Pakdaman, K., & Tanabe, S. (2001). Random dynamics of the Hodgkin-Huxley neuron model. Phys. Rev. E, 64, 050902(R).
Pei, X., Wilkens, L., & Moss, F. (1996). Noise-mediated spike timing precision from aperiodic stimuli in an array of Hodgkin-Huxley-type neurons. Phys. Rev. Lett., 77, 4679–4682.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Segundo, J. P., Vibert, J. F., Pakdaman, K., Stiber, M., & Diez-Martínez, O. (1994). Noise and the neurosciences: A long history with a recent revival (and some theory). In K. Pribram (Ed.), Origins: Brain and self organization (pp. 299–331). Hillside, NJ: Erlbaum.
Shimokawa, T., Pakdaman, K., & Sato, S. (1999a). Time-scale matching in the response of a leaky integrate and fire neuron model to periodic stimulus with additive noise. Phys. Rev. E, 59, 3427–3443.
Shimokawa, T., Pakdaman, K., & Sato, S. (1999b). Mean discharge frequency locking in the response of a noisy neuron model to subthreshold periodic stimulation. Phys. Rev. E, 60, 33–36.
Shimokawa, T., Pakdaman, K., Takahata, T., Tanabe, S., & Sato, S. (2000). A first-passage-time analysis of the periodically forced noisy leaky integrate-and-fire model. Biol. Cybern., 83, 327–340.
Shimokawa, T., Rogel, A., Pakdaman, K., & Sato, S. (1999). Stochastic resonance and spike timing precision in an ensemble of leaky integrate and fire neuron models. Phys. Rev. E, 59, 3461–3470.
Shinomoto, S., & Kuramoto, Y. (1986). Phase transitions in active rotator systems. Progr. Theor. Phys., 75, 1105–1110.
Sigeti, D., & Horsthemke, W. (1989). Pseudo regular oscillations induced by external noise. J. Stat. Phys., 54, 1217–1222.
Stein, R. B., French, A. S., & Holden, A. V. (1972). The frequency response, coherence, and information capacity of two neuronal models. Biophys. J., 12, 295–322.
Stratonovich, R. L. (1967). Topics in the theory of random noise (Vol. 2). New York: Gordon and Breach.
Tanabe, S., & Pakdaman, K. (2001). Noise induced transition in neuronal models. Biol. Cybern., 85, 269–280.
Tanabe, S., & Pakdaman, K. (in press). Noise enhanced neuronal reliability. Phys. Rev. E.
Tanabe, S., Sato, S., & Pakdaman, K. (1999). Response of an ensemble of noisy neuron models to a single input. Phys. Rev. E, 60, 7235–7238.
Tanabe, S., Shimokawa, T., Sato, S., & Pakdaman, K. (1999). Response of coupled noisy excitable systems to weak stimulation. Phys. Rev. E, 60, 2182–2185.
White, J. A., Rubinstein, J. T., & Kay, A. R. (2000). Channel noise in neurons. Trends Neurosci., 23, 131–137.
Wiesenfeld, K., Pierson, D., Pantazelou, E., & Moss, F. (1994). Stochastic resonance on a circle. Phys. Rev. Lett., 72, 2125–2129.

Received January 23, 2001; accepted June 18, 2001.
LETTER
Communicated by Misha Tsodyks
A Proposed Function for Hippocampal Theta Rhythm: Separate Phases of Encoding and Retrieval Enhance Reversal of Prior Learning

Michael E. Hasselmo [email protected]
Clara Bodelón [email protected]
Bradley P. Wyble [email protected]

Department of Psychology, Program in Neuroscience and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
The theta rhythm appears in the rat hippocampal electroencephalogram during exploration and shows phase locking to stimulus acquisition. Lesions that block theta rhythm impair performance in tasks requiring reversal of prior learning, including reversal in a T-maze, where associations between one arm location and food reward need to be extinguished in favor of associations between the opposite arm location and food reward. Here, a hippocampal model shows how theta rhythm could be important for reversal in this task by providing separate functional phases during each 100–300 msec cycle, consistent with physiological data. In the model, effective encoding of new associations occurs in the phase when synaptic input from entorhinal cortex is strong and long-term potentiation (LTP) of excitatory connections arising from hippocampal region CA3 is strong, but synaptic currents arising from region CA3 input are weak (to prevent interference from prior learned associations). Retrieval of old associations occurs in the phase when entorhinal input is weak and synaptic input from region CA3 is strong, but when depotentiation occurs at synapses from CA3 (to allow extinction of prior learned associations that do not match current input). These phasic changes require that LTP at synapses arising from region CA3 should be strongest at the phase when synaptic transmission at these synapses is weakest. Consistent with these requirements, our recent data show that synaptic transmission in stratum radiatum is weakest at the positive peak of local theta, which is when previous data show that induction of LTP is strongest in this layer.
Neural Computation 14, 793–817 (2002)
© 2002 Massachusetts Institute of Technology
1 Introduction
The hippocampal theta rhythm is a large-amplitude, 3–10 Hz oscillation that appears prominently in the rat hippocampal electroencephalogram during locomotion or attention to environmental stimuli and decreases during immobility or consummatory behaviors such as eating or grooming (Green & Arduini, 1954; Buzsaki, Leung, & Vanderwolf, 1983). In this article, we link the extensive data on physiological changes during theta to the specific requirements of behavioral reversal tasks. Theta rhythm is associated with phasic changes in the magnitude of synaptic currents in different layers of the hippocampus, as shown by current source density analysis of hippocampal region CA1 (Buzsaki, Czopf, Kondakor, & Kellenyi, 1986; Brankack, Stewart, & Fox, 1993; Bragin et al., 1995). As summarized in Figure 1, at the trough of the theta rhythm recorded at the hippocampal fissure, afferent synaptic input from the entorhinal cortex is strong and synaptic input from region CA3 is weak (Brankack et al., 1993), but long-term potentiation at synapses from region CA3 is strong (Holscher, Anwyl, & Rowan, 1997; Wyble, Hyman, Goyal, & Hasselmo, 2001). In contrast, at the peak of the fissure theta rhythm, afferent input from entorhinal cortex is weak, the amount of synaptic input from region CA3 is strong (Brankack et al., 1993), but stimulation of synaptic input from region CA3 induces depotentiation (Holscher et al., 1997).
1.1 Role of Theta Rhythm in Behavior. This article relates these phasic changes in synaptic properties to data showing behavioral effects associated with blockade of theta rhythm. Physiological data suggest that rhythmic input from the medial septum paces theta-frequency oscillations recorded from the hippocampus and entorhinal cortex (Stewart & Fox, 1990; Toth, Freund, & Miles, 1997). This regulatory input from the septum enters the hippocampus via the fornix, and destruction of this fiber pathway attenuates the theta rhythm (Buzsaki et al., 1983). Numerous studies show that fornix lesions cause strong impairments in reversal of previously learned behavior (Numan, 1978; M'Harzi et al., 1987; Whishaw & Tomie, 1997), including reversal of spatial response in a T-maze. The T-maze task is shown schematically in Figure 2. During each trial in a T-maze reversal, a rat starts in the stem of the maze and finds food reward when it runs down the stem and into one arm of the maze. After extensive training of this initial association, the food reward is moved to the opposite arm of the maze, and the rat must extinguish the association between left arm and food and learn the new association between right arm and food. Rats with fornix lesions make more errors after the reversal, continuing to visit the old location, which is no longer rewarded. In the analysis and simulations presented here, we test the hypothesis that phasic changes in the amount of synaptic input and synaptic modification during cycles of the hippocampal theta rhythm could enhance reversal of prior learning.

Figure 1: Schematic representation of the change in dynamics during hippocampal theta rhythm oscillations. (Left) Appropriate dynamics for encoding. At the trough of the theta rhythm in the EEG recorded at the hippocampal fissure, synaptic transmission arising from entorhinal cortex is strong. Synaptic transmission arising from region CA3 is weak, but these same synapses show a strong capacity for long-term potentiation. This allows the afferent input from entorhinal cortex to set patterns to be encoded, while preventing interference from previously encoded patterns on the excitatory synapses arising from region CA3. (Right) Appropriate dynamics for retrieval. At the peak of the theta rhythm, synaptic transmission arising from entorhinal cortex is relatively weak (though strong enough to provide retrieval cues). In contrast, the synaptic transmission arising from region CA3 is strong, allowing effective retrieval of previously encoded sequences. At this phase, synapses undergo depotentiation rather than LTP, preventing further encoding of retrieval activity and allowing forgetting of incorrect retrieval.

Specifically, these phasic changes allow new afferent input (from entorhinal cortex) to be strong when synaptic modification is strong, to encode new associations between place and food reward, without interference caused by synaptic input (from region CA3) representing retrieval of old associations. Synaptic input from region CA3 is strong on a separate phase, when depotentiation at these synapses can allow extinction of old associations.
Figure 2: (A) Schematic T-maze illustrating behavioral response during different stages in the task. Location of food reward is marked with the letter F. (A1) Initial learning of left turn response. (A2) Error after reversal. No food reward is found in the left arm of the maze. (A3) Correct after reversal. After a correct response, food reward is found in the right arm of the maze. (A4) Retrieval choice. Analysis focuses on performance at the choice point in a retrieval trial (r), when network activity reflects memory for food location. (B) Functional components of the network during different trials in a behavioral task. (B1) Initial encoding trials (n). When in the left arm, the rat encodes associations between the left arm CA3 place cell activity (L) and left arm food reward activity (F). Thick lines represent strengthened synapses. (B2) Performance of an error after reversal (e). When in the left arm (L), no food reward is found. This results in weakening of the previous association (thinner line) between left arm place and left arm food reward. (No food reward activity appears because the rat does not find reward in error trials.) (B3) Encoding of correct reversal trial (c). When in the right arm, food reward is found. This results in encoding of an association between right arm CA3 place cell activity (R) and right arm food reward (F) activated by entorhinal input. (B4) Retrieval of correct reversal trial (r). This test focuses on activity observed when the rat is at the choice point (Ch), allowing retrieval of both arm place representations (L and R), and potential retrieval of their food reward associations.
1.2 Overview of the Model. In the model, we focus on the strengthening and weakening of associations between location and food reward. These associations are encoded by modifying synaptic connections between neurons in region CA3 and region CA1 representing location (place cells) and food reward. The place representations are consistent with evidence of hippocampal neurons showing place-selective responses in a variety of spatial environments (McNaughton, Barnes, & O'Keefe, 1983; Skaggs, McNaughton, Wilson, & Barnes, 1996). The food reward representations are consistent with studies showing unit activity selective for the receipt of reward during behavioral tasks (Wiener, Paul, & Eichenbaum, 1989; Otto & Eichenbaum, 1992; Hampson, Simeral, & Deadwyler, 1999; Young, Otto, Fox, & Eichenbaum, 1997). Units have been shown to be selective for the receipt of reward at a particular location (Wiener et al., 1989), and reward-dependent responses have been shown in both region CA1 (Otto & Eichenbaum, 1992) and entorhinal cortex (Young et al., 1997). This model assumes place cell representations already exist and does not focus on their formation or properties, which have been modeled previously (Kali & Dayan, 2000). Instead, the mathematical analysis focuses on encoding and retrieval of associations between place cell activity in each maze arm and the representation of food reward.

The structure and phases in the model are summarized in Figure 1. The analysis presented here suggests that oscillatory changes in magnitude of synaptic transmission, long-term potentiation, and postsynaptic depolarization during theta rhythm cause transitions between two functional phases within each theta cycle. In the encoding phase, entorhinal input is dominant, activating cells in regions CA3 and CA1. At this time, excitatory connections arising from region CA3 have decreased transmission, but these same synapses show enhanced long-term potentiation to form associations between sensory events.
In the retrieval phase, entorhinal input is relatively weak, but still brings retrieval cues into the network. At this time, excitatory synapses arising from region CA3 show strong transmission, allowing retrieval of the previously learned associations. During this retrieval phase, long-term potentiation must be absent in order to prevent encoding of the retrieval activity. Depotentiation or long-term depression (LTD) during this phase allows extinction of prior learned associations. We propose that these two phases appear in a continuous interleaved manner within each 100–300 msec cycle of the theta rhythm.

Section 2 presents mathematical analysis showing how theta oscillations could play an important role in reversal of learned associations. First, we describe the assumptions for modeling synaptic transmission, synaptic modification, and phasic changes in these variables during theta rhythm. Then we describe the assumptions for modeling patterns of activity representing sensory features of the behavioral task in the T-maze. Subsequently, we present a criterion for evaluating the function of the model, in the form of a performance measure based on reversal of learned associations in the
T-maze. Finally, we show how specific phase relationships between network variables provide the best performance of the model. These phase relationships correspond to those observed with physiological recording during hippocampal theta rhythm.

2 Mathematical Analysis of Theta Rhythm
We modeled hippocampal activity when a rat is in one of the two arms of a T-maze during individual trials performed during different stages of a reversal task. A trial refers to the period from the time the rat reaches the end of an arm to its removal from the arm. This period can cover several cycles of theta. A stage of the task refers to multiple trials in which the spatial response of the rat and the food reward location are the same. We represent the pattern of activation of pyramidal cells in region CA1 with the vector a_CA1(t). (See the appendix for a glossary of mathematical terms.) As shown in Figures 1 and 2, the pattern of activity in CA1 is influenced by two main inputs. First is the synaptic input from region CA3, which is a product of the strength of modifiable Schaffer collateral synapses W_CA3(t) and the activity of CA3 pyramidal cells a_CA3(t). The activity pattern in CA3 remains constant across trials within a single behavioral stage of the task but changes in different stages depending on the spatial response of the rat. The weight matrix remains constant within each trial but can change at the end of the trial. Second is the synaptic input from entorhinal cortex, which is a product of the strength of perforant path synapses W_EC (which are not potentiated here) and the activity of projection neurons in layer III of entorhinal cortex a_EC(t). The activity pattern in entorhinal cortex also remains constant within each stage of the task but changes in different stages depending on the location of food reward. The weight matrix W_EC does not change in this model. Individual elements of the vectors a_CA3(t) and a_EC(t) assume only two values, zero or q, to represent the presence or absence of spiking activity in each neuron. However, these influences make the activity of neurons in region CA1, a_CA1(t), take a range of values.
The following equation represents the effects of synaptic transmission on region CA1 activity:

a_CA1(t) = W_EC a_EC(t) + W_CA3(t) a_CA3(t).    (2.1)
The structure of the model is summarized in Figures 2 and 3.

2.1 Phasic Changes in Synaptic Input During Theta. This article focuses on oscillatory modulation of equation 2.1 to represent phasic changes during theta rhythm. Experiments have shown phasic changes in the amount of synaptic input in stratum radiatum of region CA1 (Rudell & Fox, 1984; Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995; Wyble, Linster, & Hasselmo, 2000). These changes in amount of synaptic input could
depend on the magnitude of transmitter release or on other factors, such as the firing rate of afferent neurons. Both factors have the same functional influence in the current model and are represented as follows. The synaptic weights, W_CA3(t), are multiplied by a sine wave function, h_CA3(t), which has phase φ_CA3 and is scaled to vary between 1 and (1 − X), as shown in equation 2.2:

h_CA3(t) = (X/2) sin(t + φ_CA3) + (1 − X/2).    (2.2)
The parameter X represents the magnitude of change in synaptic currents in different layers and was constrained to values between 0 and 1 (0 < X < 1). The simulation also includes phasic changes in the magnitude of entorhinal input, h_EC(t), as shown in equation 2.3:

h_EC(t) = (X/2) sin(t + φ_EC) + (1 − X/2).    (2.3)
These changes in entorhinal input are consistent with current source density (CSD) data on stratum lacunosum-moleculare (s. l-m) in region CA1 (Brankack et al., 1993; Bragin et al., 1995) and phasic activity of entorhinal neurons (Stewart, Quirk, Barry, & Fox, 1992). The analysis presented here shows how the best performance of the model is obtained with the relative phases of these variables that are shown in Figures 1 and 3. With inclusion of these phasic changes, equation 2.1 becomes

a_CA1(t) = h_EC(t) W_EC a_EC(t) + h_CA3(t) W_CA3(t) a_CA3(t).    (2.4)
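The phasic gating of equations 2.2 through 2.4 can be sketched numerically. The snippet below is an illustrative sketch only: the number of cells, activity probability, weight values, and the choice of antiphase gating for EC and CA3 are our assumptions, not parameters from the paper.

```python
import numpy as np

def h(t, phi, X):
    """Oscillatory gain scaled to vary between 1 and 1 - X (eqs. 2.2 and 2.3)."""
    return (X / 2.0) * np.sin(t + phi) + (1.0 - X / 2.0)

rng = np.random.default_rng(0)
n_cells, q = 8, 1.0
a_EC  = q * (rng.random(n_cells) < 0.3)      # binary activity vectors (0 or q)
a_CA3 = q * (rng.random(n_cells) < 0.3)
W_EC  = np.eye(n_cells)                      # perforant path, not modified here
W_CA3 = 0.5 * np.ones((n_cells, n_cells))    # Schaffer collaterals (arbitrary)

X, phi_EC, phi_CA3 = 0.9, 0.0, np.pi         # EC and CA3 gated in antiphase
for t in (np.pi / 2, 3 * np.pi / 2):         # encoding-like vs. retrieval-like phase
    # eq. 2.4: phase-gated sum of entorhinal and CA3 synaptic input onto CA1
    a_CA1 = h(t, phi_EC, X) * (W_EC @ a_EC) + h(t, phi_CA3, X) * (W_CA3 @ a_CA3)
    print(round(h(t, phi_EC, X), 3), round(h(t, phi_CA3, X), 3), a_CA1)
```

With these assumed phases, the EC gain peaks at 1.0 while the CA3 gain bottoms out at 1 − X = 0.1, and vice versa half a cycle later, which is the antiphase arrangement the surrounding text describes.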
2.2 Phasic Changes in LTP During Theta. Long-term potentiation in the hippocampus has repeatedly been shown to depend on the phase of stimulation relative to theta rhythm in the dentate gyrus (Pavlides, Greenstein, Grudman, & Winson, 1988; Orr, Rao, Stevenson, Barnes, & McNaughton, 1999) and region CA1 (Huerta & Lisman, 1993; Holscher et al., 1997; Wyble et al., 2001). In region CA1, experiments have shown that LTP can be induced by stimulation on the peak, but not the trough, of the theta rhythm recorded in stratum radiatum in slice preparations (Huerta & Lisman, 1993), urethane-anesthetized rats (Holscher et al., 1997), and awake rats (Wyble et al., 2001). Note that theta recorded locally in stratum radiatum will be about 180 degrees out of phase with theta recorded at the fissure. Based on these data, the rate of synaptic modification of the Schaffer collateral synapses W_CA3(t) in the model was modulated in an oscillatory manner given by
h_LTP(t) = sin(t + φ_LTP).    (2.5)
This long-term modification requires postsynaptic depolarization in region CA1, a_CA1(t), and presynaptic spiking activity from region CA3, a_CA3(t). For simplicity, we assume that the synaptic weights during each new trial
Figure 3: Summary of experimental data on changes in synaptic potentials and long-term potentiation at different phases of the theta rhythm. Dotted lines indicate the phase of theta with best dynamics for encoding (E) and the phase with best dynamics for retrieval (R). Entorhinal input: Synaptic input from entorhinal cortex in stratum lacunosum-moleculare is strongest at the peak of theta recorded in stratum radiatum, equivalent to the trough of theta recorded at the hippocampal fissure (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995). Soma: Intracellular recording in vivo shows the depolarization of the soma in phase with CA3 input (Kamondi et al., 1998). However, dendritic depolarization is in phase with entorhinal input (Kamondi et al., 1998). LTP: Long-term potentiation of synaptic potentials in stratum radiatum is strongest at the peak of local stratum radiatum theta rhythm in slice preparations (Huerta & Lisman, 1993) and urethane-anesthetized rats (Holscher et al., 1997). Strong entorhinal input and strong LTP in stratum radiatum are associated with the encoding phase (E). CA3 input: Synaptic potentials in stratum radiatum are strongest at the trough of the local stratum radiatum theta rhythm, as demonstrated with current source density analysis (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995) and studies of evoked potentials (Rudell & Fox, 1984; Wyble et al., 2000). Strong synaptic transmission in stratum radiatum is present during the retrieval phase (R).
remain constant at their initial values and change only at the end of the trial. This is consistent with delayed expression of LTP after its initial induction. We also assume that synapses do not grow beyond a maximum strength. We can compute the change in synaptic weight at the end of a single behavioral trial m + 1, which depends on the integral of activity in the network during that trial (from the time at the start of that trial, t_m, to the time at the end of that trial, t_{m+1}). For mathematical simplicity, we integrate across full cycles of theta:

ΔW_CA3(t_{m+1}) = ∫_{t_m}^{t_{m+1}} h_LTP(t) a_CA1(t) a_CA3(t)^T dt.    (2.6)
The dependence on pre- and postsynaptic activity could be described as Hebbian (Gustafsson, Wigstrom, Abraham, & Huang, 1987). However, the additional oscillatory components of the equation go beyond strict Hebbian properties and can cause the synaptic weight to weaken if pre- and postsynaptic activity occurs during a negative phase of the oscillation. This article focuses on how the quality of retrieval performance in region CA1 depends on the phase relationship of the oscillations in synaptic transmission (in equation 2.4) with the oscillations in synaptic modification (in equation 2.6). Combining these equations gives us

ΔW_CA3(t_{m+1}) = ∫_{t_m}^{t_{m+1}} h_LTP(t) [h_EC(t) W_EC a_EC(t) + h_CA3(t) W_CA3(t_m) a_CA3(t)] a_CA3(t)^T dt.    (2.7)
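The phase dependence of the weight update in equation 2.6 can be illustrated by numerically integrating over one full theta cycle. The discretization below is our own sketch (scalar activities, an initially unpotentiated synapse, and the value X = 0.9 are assumptions for the illustration, not the paper's parameters).

```python
import numpy as np

def weight_update(phi_LTP, phi_CA3, phi_EC=0.0, X=0.9, n=2048):
    """Integrate eq. 2.6 over one theta cycle for a single synapse (sketch)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    dt = 2.0 * np.pi / n
    h_LTP = np.sin(t + phi_LTP)                           # eq. 2.5
    h_EC  = (X / 2) * np.sin(t + phi_EC)  + (1 - X / 2)   # eq. 2.3
    h_CA3 = (X / 2) * np.sin(t + phi_CA3) + (1 - X / 2)   # eq. 2.2
    a_EC = a_CA3 = 1.0         # one active EC cell and one active CA3 cell
    W_CA3 = 0.0                # start from an unpotentiated synapse
    a_CA1 = h_EC * a_EC + h_CA3 * W_CA3 * a_CA3           # eq. 2.4, scalar form
    return np.sum(h_LTP * a_CA1 * a_CA3) * dt             # eq. 2.6 over one cycle

print(weight_update(phi_LTP=0.0, phi_CA3=np.pi))    # > 0: net potentiation
print(weight_update(phi_LTP=np.pi, phi_CA3=np.pi))  # < 0: net depotentiation
```

With these assumptions the update reduces analytically to (X/2)π cos(φ_LTP): LTP modulation in phase with the entorhinal input yields net strengthening, while LTP modulation in antiphase yields net weakening, matching the sign-dependence on phase described in the text.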
Note that we allow theta-phasic modulation of W_EC, but we are not modeling Hebbian synaptic modification of W_EC. Therefore, after this point, we will not include that matrix of connectivity but will represent patterns of input from entorhinal cortex with single static vectors, as if the synapses were an identity matrix.

2.3 Modeling Learning of Associations in a Reversal Task. The mathematical analysis focuses on the performance of the network in the T-maze reversal task, in which learning of an association between one location and food must be extinguished and replaced by a new association. Our analysis focuses on synaptic changes occurring during initial learning of an association (e.g., between left arm and food), subsequent extinction of this association when it is invalid, and encoding of a new association (e.g., between right arm and food). Thus, we focus on the mnemonic aspects of this task. Changes in the synaptic matrix W_CA3(t) include the learning during three sequential stages of the task, whereas for the analysis of performance,
a fourth stage is also needed. These stages are described as follows: 1. Initial learning. During this period, the rat runs to the left arm of the T-maze and nds food reward in that arm (see Figure 2, A1 and B1 ). Synaptic modication encodes associations between left arm place (represented by CA3 activity) and left arm food reward (represented by entorhinal activity) on the n trials with time ranging from t D 0 to t D tn . During this behavioral stage, activity in CA3 and EC is constant ( n ) to indicate the activity during all n trials. We use the superindex ± ² ± ² ( )
( )
n n during this stage in region CA3 aCA3 and in entorhinal cortex aEC .
2. Reversal learning with erroneous responses. After stage 1, the food reward is changed to the right arm. In the initial stage after this reversal, the rat runs to the left arm place but does not find the food reward (see Figure 2, A2 and B2). This causes extinction or weakening of the associations between left arm place and left arm food reward. We denote with the superindex $(e)$ the activity during this stage. The left arm place representation in CA3 is the same as during initial learning, and for simplicity we assume the CA3 activity vector for initial trials has unity dot product with the activity vector for erroneous reversal trials: $\left(a_{CA3}^{(n)}\right)^T a_{CA3}^{(e)} = 1$. Because the rat does not encounter food reward on erroneous trials, the EC activity vector representing food reward is zero: $a_{EC}^{(e)} = 0$. This period contains $e$ trials and ends at $t_{n+e}$. This stage could vary in number of trials in individual rats.
3. Reversal learning with correct responses. During this period, the rat runs to the right arm and finds food (see Figure 2, A3 and B3). In this stage, the superindex will be $(c)$. Synaptic modification encodes associations between right arm place $\left(a_{CA3}^{(c)}\right)$ and right arm food reward $\left(a_{EC}^{(c)}\right)$. This period contains $c$ trials and ends at $t_{n+e+c}$. Note that we make the natural assumption that the CA3 activity representing left arm place cell responses during initial and erroneous trials has a zero dot product with the right arm place cell responses during correct reversal trials: $\left(a_{CA3}^{(n)}\right)^T a_{CA3}^{(c)} = \left(a_{CA3}^{(e)}\right)^T a_{CA3}^{(c)} = 0$. We also assume that the entorhinal input for food in the right arm has zero dot product with the input for food in the left arm, $\left(a_{EC}^{(c)}\right)^T a_{EC}^{(n)} = 0$, and that the dot product of food reward input from entorhinal cortex for the same food location is unity: $\left(a_{EC}^{(c)}\right)^T a_{EC}^{(c)} = \left(a_{EC}^{(n)}\right)^T a_{EC}^{(n)} = 1$.
4. Retrieval at choice point on subsequent trials. The analysis of performance in the network focuses on retrieval activity at the choice point of the T-maze (see Figure 2, A4 and B4 ). This stage is described further in section 2.4.
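The dot-product assumptions itemized in stages 1 through 3 can be checked with a minimal numerical sketch. The one-hot vectors and variable names below are hypothetical stand-ins of our own, not part of the model itself; they merely satisfy the stated overlap conditions:

```python
import numpy as np

# Hypothetical one-hot codes satisfying the stated dot-product assumptions:
# non-overlapping left/right place codes in CA3 and left/right reward inputs from EC.
a_CA3_n = np.array([1.0, 0.0])   # left arm place code (initial trials)
a_CA3_e = a_CA3_n                # same place code on erroneous reversal trials
a_CA3_c = np.array([0.0, 1.0])   # right arm place code (correct reversal trials)
a_EC_n  = np.array([1.0, 0.0])   # left arm food reward input
a_EC_e  = np.zeros(2)            # no reward input on erroneous trials
a_EC_c  = np.array([0.0, 1.0])   # right arm food reward input

assert a_CA3_n @ a_CA3_e == 1    # unity overlap across stages 1 and 2
assert a_CA3_n @ a_CA3_c == 0    # non-overlapping place codes
assert a_EC_c @ a_EC_n == 0      # non-overlapping reward codes
assert a_EC_c @ a_EC_c == a_EC_n @ a_EC_n == 1
```

Any orthonormal pair of vectors would serve equally well; the one-hot choice just makes the overlaps easy to read.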
A Proposed Function for Hippocampal Theta Rhythm
The association between left arm place and food reward, learned on the initial $n$ trials (see Figure 2, A1 and B1), is represented by the matrix $W_{CA3}(t_n)$. Each entry of this matrix does not grow beyond a maximum, which is a factor of $K$ times the pre- and postsynaptic activity. This matrix represents the association between left arm place cells activated during these trials in region CA3 $\left(a_{CA3}^{(n)}\right)$ and food reward representation cells in region CA1 activated by entorhinal input $\left(a_{CA1} = a_{EC}^{(n)}\right)$. At the end of the $n$ trials, this matrix becomes:

$$W_{CA3}(t_n) = K\, a_{EC}^{(n)} \left(a_{CA3}^{(n)}\right)^T. \quad (2.8)$$
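Equation 2.8 is a rank-one outer product, so retrieval by matrix-vector multiplication recovers the stored reward pattern, while any orthogonal place vector retrieves nothing. A minimal sketch, assuming $K = 1$ and hypothetical one-hot vectors of our own choosing:

```python
import numpy as np

K = 1.0
a_EC_n  = np.array([1.0, 0.0])   # hypothetical left-arm food input
a_CA3_n = np.array([1.0, 0.0])   # hypothetical left-arm place code

# Equation 2.8: W_CA3(t_n) = K a_EC^(n) (a_CA3^(n))^T
W = K * np.outer(a_EC_n, a_CA3_n)

# Spread of retrieval activity: left-arm place recalls left-arm food;
# the orthogonal right-arm place vector retrieves nothing.
assert np.allclose(W @ a_CA3_n, K * a_EC_n)
assert np.allclose(W @ np.array([0.0, 1.0]), 0.0)
```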
After reversal, the extinction of the association between left arm and reward occurs on erroneous trials (see Figure 2, A2 and B2). The learning during each trial is represented by summing up integrals of equation 2.7 over each trial from $n+1$ to $n+e$. As noted above, the lack of reward is represented by an absence of entorhinal input, $a_{EC}^{(e)} = 0$. However, retrieval of prior learning can occur during these erroneous trials due to left arm food representations activated by the spread of activity from left arm place cells in region CA3 over the matrix (see equation 2.8) from the first $n$ trials (recall that $\left(a_{CA3}^{(n)}\right)^T a_{CA3}^{(e)} = 1$). This appears in the following equation:

$$W_{CA3}(t_{n+e}) = W_{CA3}(t_n) + \sum_{k=n+1}^{n+e} \int_{t_k}^{t_{k+1}} h_{LTP}(t) \left[ h_{CA3}(t)\, W_{CA3}(t_k)\, a_{CA3}^{(e)} \right] \left(a_{CA3}^{(e)}\right)^T dt. \quad (2.9)$$
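The sign of the update in equation 2.9 over one theta cycle depends on the phase offset between $h_{LTP}$ and $h_{CA3}$: with the two modulations 180 degrees out of phase, the previously strengthened weight weakens. A sketch of a single erroneous trial, assuming the sinusoidal modulation forms used later in the single-cycle analysis (with $X = 1$) and hypothetical one-hot vectors:

```python
import numpy as np

# One erroneous trial of equation 2.9 with h_LTP and h_CA3 180 degrees out of
# phase; the prior weight from equation 2.8 (K = 1) should weaken.
phi_LTP, phi_CA3 = 0.0, np.pi            # assumed phases, fully out of phase
a_e = np.array([1.0, 0.0])               # hypothetical left-arm place vector
W = np.outer(np.array([1.0, 0.0]), a_e)  # W_CA3(t_n), K = 1

N = 100000
dt = 2 * np.pi / N
t = np.arange(N) * dt                    # one theta cycle
h_LTP = np.sin(t + phi_LTP)
h_CA3 = 0.5 * np.sin(t + phi_CA3) + 0.5  # (X/2) sin + (1 - X/2) with X = 1

# Integrate h_LTP(t) h_CA3(t) [W a_e] a_e^T over the cycle
scalar = np.sum(h_LTP * h_CA3) * dt
W_new = W + scalar * np.outer(W @ a_e, a_e)

assert scalar < 0                        # net depression over the cycle
assert W_new[0, 0] < W[0, 0]             # the old association weakens
```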
In contrast with incorrect trials, correct trials after reversal are indexed with the numbers between $n+e+1$ and $n+e+c$ (see Figure 2, A3 and B3), where $c$ represents correct postreversal trials. The desired learning during each trial $c$ consists of the association between place cells active in region CA3 when the rat is at the end of the right arm $\left(a_{CA3}^{(c)}\right)$ and the food representation activated by entorhinal input $\left(a_{EC}^{(c)}\right)$. As noted above, $\left(a_{CA3}^{(n)}\right)^T a_{CA3}^{(c)} = 0$, so the spread across connections modified in initial learning is $W_{CA3}(t_n)\, a_{CA3}^{(c)} = K\, a_{EC}^{(n)} \left(a_{CA3}^{(n)}\right)^T a_{CA3}^{(c)} = 0$. We assume these trials occur sequentially after the error trials described above. However, with the assumption of nonoverlapping place representations in CA3, even if the erroneous and correct trials were intermixed, the same results would apply. The learning on these correct trials is

$$W_{CA3}(t_{n+e+c}) = W_{CA3}(t_{n+e}) + \sum_{k=n+e+1}^{n+e+c} \int_{t_k}^{t_{k+1}} h_{LTP}(t) \left[ h_{EC}(t)\, a_{EC}^{(c)} + h_{CA3}(t)\, W_{CA3}(t_k)\, a_{CA3}^{(c)} \right] \left(a_{CA3}^{(c)}\right)^T dt. \quad (2.10)$$
2.4 Performance Measure. The behavior of the network was analyzed with a performance measure $M$, which measures how much the network retrieves the new association between right arm and food reward relative to retrieval of the old association between left arm and food reward. We focus on a retrieval trial $r$, with place cell activity in region CA3 given by $a_{CA3}^{(r)}$, but before the rat has encountered the food reward, so that no food input enters the network: $a_{EC}^{(r)} = 0$. For the performance measure, we focus on retrieval of memory when the rat is at the choice point (the T-junction). At the choice point, activity spreads along associations between place cells within CA3, such that the place cells in the left and right arms are activated. These place cells can then activate any associated food representation in region CA1 (see Figure 2, A4 and B4). Thus, we use an activity pattern in region CA3, $a_{CA3}^{(r)}$, which overlaps with the region CA3 place activity for both the left arm $\left(\left(a_{CA3}^{(n)}\right)^T a_{CA3}^{(r)} = 1\right)$ and the right arm $\left(\left(a_{CA3}^{(c)}\right)^T a_{CA3}^{(r)} = 1\right)$. This assumes that activity dependent on future expectation can be induced when the rat is still in the stem or at the T-junction. This is consistent with evidence for selectivity of place cells dependent on future turning response (Wood, Dudchenko, Robitsek, & Eichenbaum, 2000; Frank, Brown, & Wilson, 2000). Note that when the rat is at the choice point on trial $r$ (see Figure 2, A4 and B4), the CA1 activity will be represented by an equation that uses the weight matrix shown in equation 2.10:

$$a_{CA1}(t) = h_{EC}(t)\, a_{EC}^{(r)} + h_{CA3}(t)\, W_{CA3}(t_{n+e+c})\, a_{CA3}^{(r)}. \quad (2.11)$$
The performance measure then takes the retrieval activity $a_{CA1}(t)$ and evaluates its similarity (dot product) with the food representation on the correct reversal trial, $a_{EC}^{(c)}$, minus its similarity (dot product) with the now-incorrect memory of food from initial trials, $a_{EC}^{(n)}$. Due to the oscillations of CA3 input to CA1 during this period, we use the maximum, $\max[\,]$, of this measure as an indicator of performance, but similar results would be obtained with integration over time during retrieval. In the model, the pattern of food reward activity in region CA1, evaluated during the choice point period on this trial $r$, is taken as guiding the choice on this trial (the choice will depend on the relative strength of the activity representing left arm food reward versus the strength of activity representing right arm food reward). Thus, this equation measures the tendency that the response guided by region CA1 activity on this trial will depend on memory of associations from correct postreversal trials, $a_{EC}^{(c)}$ (left side of the equation), versus memory for prior strongly learned associations on initial trials, $a_{EC}^{(n)}$ (right side of the equation):

$$M = \max_{t \in (t_r, t_{r+e})} \left[ \left(a_{EC}^{(c)}\right)^T a_{CA1}(t) - \left(a_{EC}^{(n)}\right)^T a_{CA1}(t) \right]. \quad (2.12)$$

To see how this measure of behavior changes depending on the phase of oscillatory variables, we can replace the matrices in equation 2.11 with the components from equations 2.8 through 2.10 and obtain the postsynaptic activity for inclusion in equation 2.12. For the purpose of simplification, we will consider retrieval after a single error trial and a single correct trial. This allows removal of the summation signs and use of the end point of a single error trial ($t_{n+e} = t_{n+1}$) and the end point of a single correct trial ($t_{n+e+c} = t_{n+2}$). We move the constant values outside the integrals. In simplifying this equation, we use the assumptions about activity in different stages itemized at the start of sections 2.3 and 2.4. These assumptions allow the performance measure to be reduced to an interaction of the oscillatory terms:

$$M = \max_{t \in (t_r, t_{r+e})} \left[ h_{CA3}(t) \left( \int_{t_{n+e}}^{t_{n+e+c}} h_{LTP}(t)\, h_{EC}(t)\, dt - K - \int_{t_n}^{t_{n+e}} h_{LTP}(t)\, h_{CA3}(t)\, dt \right) \right]. \quad (2.13)$$
Similarly, we can consider the analytical conditions that would allow the previously learned associations to be erased. To accomplish this during the error stage, we desire that the previously strengthened connections should go toward zero, so we need

$$\int_{t_k}^{t_{k+1}} h_{LTP}(t) \left[ h_{CA3}(t)\, W_{CA3}(t_k)\, a_{CA3}^{(e)} \right] \left(a_{CA3}^{(e)}\right)^T dt < 0.$$

If we ignore the constant vectors and focus on a single theta cycle, we obtain

$$\int_0^{2\pi} \sin(t + \varphi_{LTP})\, \frac{X}{2} \sin(t + \varphi_{CA3})\, dt + \int_0^{2\pi} \sin(t + \varphi_{LTP}) \left(1 - \frac{X}{2}\right) dt.$$

The second integral vanishes over a full cycle, and applying the product-to-sum identity to the first integral gives

$$\frac{X}{4} \int_0^{2\pi} \left[ \cos(\varphi_{LTP} - \varphi_{CA3}) - \cos(2t + \varphi_{CA3} + \varphi_{LTP}) \right] dt = \frac{X}{2}\, \pi \cos(\varphi_{LTP} - \varphi_{CA3}).$$
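This closed form can be checked numerically over one theta cycle; the phase values below are arbitrary assumptions chosen only for the check:

```python
import numpy as np

# Check: int_0^{2pi} sin(t+phi_LTP) (X/2) sin(t+phi_CA3) dt
#        = (X/2) * pi * cos(phi_LTP - phi_CA3),
# and the constant-offset term integrates to zero over a full cycle.
X, phi_LTP, phi_CA3 = 1.0, 0.7, 2.0      # arbitrary assumed values
N = 100000
dt = 2 * np.pi / N
t = np.arange(N) * dt                    # uniform grid over one cycle

numeric = np.sum(np.sin(t + phi_LTP) * (X / 2) * np.sin(t + phi_CA3)) * dt
closed_form = (X / 2) * np.pi * np.cos(phi_LTP - phi_CA3)
offset_term = np.sum(np.sin(t + phi_LTP) * (1 - X / 2)) * dt

assert abs(numeric - closed_form) < 1e-8   # product-to-sum result
assert abs(offset_term) < 1e-8             # a sine averages to zero
```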
This equation will be negative for

$$\frac{\pi}{2} < \varphi_{LTP} - \varphi_{CA3} < \frac{3\pi}{2}.$$

Thus, for the old synapses to decay, the oscillations of CA3 input and LTP should be out of phase. A similar analysis shows that the synapses strengthened by the correct association in equation 2.10 will grow for

$$-\frac{\pi}{2} < \varphi_{LTP} - \varphi_{EC} < \frac{\pi}{2},$$

that is, if the entorhinal input is primarily in phase with synaptic modification. These same techniques can be used to solve equation 2.13, which over a single theta cycle becomes:

$$M = \frac{X}{2}\, \pi \cos(\varphi_{LTP} - \varphi_{EC}) - K - \frac{X}{2}\, \pi \cos(\varphi_{LTP} - \varphi_{CA3}). \quad (2.14)$$

Figure 4 shows this equation for $X = K = 1$. Note that with changes in the value of $X$ (the magnitude of synaptic transmission in equations 2.2 and 2.3), the performance measure changes amplitude but has the same overall shape of dependence on the relative phase of network variables. Increases in the variable $M$ correspond to more retrieval of correct associations between correct right arm location and food after the reversal and a decreased retrieval of incorrect associations between left arm location and food. As illustrated in Figure 4, this function is maximal when $h_{LTP}(t)$ and $h_{CA3}(t)$ have a phase difference of 180 degrees ($\pi$) and $h_{LTP}(t)$ and $h_{EC}(t)$ have a phase difference near zero. In this framework, the peak phase of $h_{LTP}(t)$ could be construed as an encoding phase, and the peak phase of $h_{CA3}(t)$ could be seen as a retrieval phase. These are the phase relationships shown in Figures 1 and 3. This analysis shows how a particular behavioral constraint, the ability to reverse a learned behavior, requires specific phase relationships between physiological variables during theta rhythm.

2.5 Relation Between Theoretical Phases and Experimental Data. The phase relationship between oscillating variables that provides the best performance in equation 2.14 (see Figure 4) corresponds to the phase relationships observed in physiological experiments on synaptic currents (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995) and LTP (Huerta & Lisman, 1993; Holscher et al., 1997; Wyble et al., 2001), as summarized in Figures 1 and 3.
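Equation 2.14 can be evaluated over a grid of phase differences to verify where the maximum lies; a sketch assuming $X = K = 1$, the values used for Figure 4:

```python
import numpy as np

# Equation 2.14 on a grid of phase differences (X = K = 1).
X, K = 1.0, 1.0
d_EC = np.linspace(-np.pi, np.pi, 361)    # phi_LTP - phi_EC
d_CA3 = np.linspace(-np.pi, np.pi, 361)   # phi_LTP - phi_CA3
EC, CA3 = np.meshgrid(d_EC, d_CA3)
M = (X / 2) * np.pi * np.cos(EC) - K - (X / 2) * np.pi * np.cos(CA3)

i, j = np.unravel_index(np.argmax(M), M.shape)
# Best performance: EC input in phase with LTP (zero difference),
# CA3 input 180 degrees out of phase with LTP.
assert abs(EC[i, j]) < 1e-9
assert abs(abs(CA3[i, j]) - np.pi) < 1e-9
assert abs(M.max() - (np.pi - 1.0)) < 1e-6
```

Plotting `M` as a surface over `EC` and `CA3` reproduces the shape of dependence shown in Figure 4.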
In particular, our model suggests that the period of strongest potentiation at excitatory synapses arising from region CA3 should correspond to the period of weakest synaptic input at these same synapses. These synapses terminate in stratum radiatum of region CA1. Experimental data have shown that stimulation of stratum radiatum can best induce LTP at the peak of local theta in urethane anesthetized rats (Holscher et al., 1997).
Figure 4: Performance of the network depends on the relative phase of oscillations in synaptic transmission and long-term potentiation. This graph shows how performance in T-maze reversal changes as a function of the phase difference between LTP and synaptic input from entorhinal cortex (EC phase) and of the phase difference between LTP and synaptic input from region CA3 (CA3 phase). This is a graphical representation of equation 2.14. Best performance occurs when EC input is in phase (zero degrees) with LTP but region CA3 synaptic input is 180 degrees out of phase with LTP.
To analyze whether this corresponds to the time of weakest synaptic transmission, we have reanalyzed recent data on the size of evoked synaptic potentials during theta rhythm in urethane anesthetized rats (Wyble et al., 2000), using the slope of stratum radiatum theta measured directly before the evoked potential. As shown in Figure 5, the smallest slope of evoked potentials was noted near the peak of stratum radiatum theta, which is the time of peak LTP induction (Holscher et al., 1997). Consistent with the time of weakest evoked synaptic transmission, previous studies show the smallest synaptic currents in stratum radiatum at this phase of theta (Brankack et al., 1993). The analysis presented here provides a functional rationale for this paradoxical phase difference between magnitude of synaptic input and magnitude of long-term potentiation in stratum radiatum. Thus, the requirement for best performance in the model is consistent with physiological data showing that LTP of the CA3 input is strongest
Figure 5: Data showing the magnitude of the initial slope of evoked synaptic potentials (EPSP slope) in stratum radiatum of hippocampal region CA1 relative to the slope of the local theta field potential (local EEG slope) recorded from the same electrode just before recording of the evoked potential. These data are reanalyzed from a previous study correlating slope of potentials with hippocampal fissure theta (Wyble et al., 2000). The data demonstrate that the initial slope of evoked potentials is smallest near the peak of local theta, the same phase when induction of long-term potentiation is maximal (Holscher et al., 1997; Wyble et al., 2001).
when synaptic input at these synapses is weakest. The time of best LTP induction in stratum radiatum is also the time of least pyramidal cell spiking in region CA1 (Fox, Wolfson, & Ranck, 1986; Skaggs et al., 1996). The lack of pyramidal cell spiking could result from the fact that the pyramidal cell soma is hyperpolarized during this phase, while the proximal dendrites are simultaneously depolarized (Kamondi, Acsady, Wang, & Buzsaki, 1998). This suggests that Hebbian synaptic modification at the Schaffer collaterals at the peak of the theta rhythm in stratum radiatum depends on postsynaptic depolarization caused by perforant pathway input from entorhinal cortex rather than associative input from Schaffer collaterals, and should not require somatic spiking activity. This is consistent with data on NMDA receptor responses to local depolarization and Hebbian long-term potentiation in the absence of spiking (Gustafsson et al., 1987). Experimental data already show that LTP is not induced at the trough of theta rhythm recorded in stratum radiatum (Huerta & Lisman, 1993; Holscher et al., 1997). In fact, consistent with the process of extinction of prior learning during retrieval, stimulation at the trough of local theta results
in LTD (Huerta & Lisman, 1995) or depotentiation (Holscher et al., 1997). These data suggest that some process must suppress Hebbian long-term potentiation in this phase of theta, because this is the phase of strongest Schaffer collateral input (Brankack et al., 1993) and maximal CA1 spiking activity (Fox et al., 1986; Skaggs et al., 1996). The lack of LTP may result from lack of dendritic depolarization in this phase (Kamondi et al., 1998). As shown in Figure 6, the loss of theta rhythm after fornix lesions could underlie the impairment of reversal behavior in the T-maze caused by these lesions (M'Harzi et al., 1986). This figure summarizes how separate phases of encoding and retrieval allow effective extinction of the initially learned association and learning of the new association, whereas loss of theta results in retrieval of the initially learned association during encoding, which prevents extinction of this association.

2.6 Changes in Postsynaptic Membrane Potential During Theta. The above results extend previous proposals that separate encoding and retrieval phases depend on strength of synaptic input (Hasselmo, Wyble, & Wallenstein, 1996; Wallenstein & Hasselmo, 1997; Sohal & Hasselmo, 1998). This same model can also address the hypothesis that encoding and retrieval can be influenced by phasic changes in the membrane potential of the dendrite (Hasselmo et al., 1996; Paulsen & Moser, 1998) and soma (Paulsen & Moser, 1998). Data show that during theta rhythm, there are phasic changes in inhibition focused at the soma (Fox et al., 1986; Fox, 1989; Kamondi et al., 1998). This phenomenon can be modeled by assuming that the CA1 activity described above represents dendritic membrane potential in pyramidal cells $\left(a_{CA1}^d\right)$, and a separate variable represents pyramidal cell soma spiking $\left(a_{CA1}^s\right)$ (Golding & Spruston, 1998).
Spiking at the soma is influenced by input from the dendrites $\left(a_{CA1}^d\right)$, multiplied by oscillatory changes in soma membrane potential $(h_{soma})$ induced by shunting inhibition:

$$a_{CA1}^s = h_{soma}\, a_{CA1}^d. \quad (2.15)$$
If these oscillatory changes in soma membrane potential $h_{soma}$ are in phase with region CA3 input, then this enhances the output of retrieval activity. If they are out of phase with entorhinal input $h_{EC}$ and LTP $h_{LTP}$, this will help prevent interference due to network output during encoding, allowing dendritic depolarization to activate NMDA channels without causing spiking activity at the soma. This corresponds to the crossing of a threshold for synaptic modification without crossing a firing threshold for generating output. Use of a modification threshold enhances the influence of even relatively small values for the modulation of synaptic transmission (the parameter $X$ in equations 2.2, 2.3, and 2.14). A modification threshold would select a segment of each cycle during which postsynaptic activity is above threshold. Outside of this segment, the modification rate would be zero. To simplify the analysis of the integrals, we will take the limiting case as
the modification threshold approaches 1. In this case, the postsynaptic component of potentiation becomes a delta function with amplitude one and phase corresponding to the time of the positive maximum of the sine wave for synaptic transmission, $\delta(t + \varphi_{CA3} - \pi/2)$. The integral for one component of equation 2.13 then becomes:

$$\int_0^{2\pi} \sin(t + \varphi_{LTP})\, \delta(t + \varphi_{CA3} - \pi/2)\, dt = \sin(\pi/2 + \varphi_{LTP} - \varphi_{CA3}) = \cos(\varphi_{LTP} - \varphi_{CA3}).$$
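The delta-function limit can be checked by replacing the delta with a narrow normalized Gaussian centered at the positive maximum of the transmission sine wave, $t = \pi/2 - \varphi_{CA3}$; the phase values below are arbitrary assumptions for the check:

```python
import numpy as np

# Approximate the delta function with a narrow normalized Gaussian and verify
# that the sifted value equals cos(phi_LTP - phi_CA3).
phi_LTP, phi_CA3 = 0.3, 1.1        # arbitrary assumed phases
t0 = np.pi / 2 - phi_CA3           # positive maximum of sin(t + phi_CA3)
sigma = 1e-3                       # narrow width approximating the delta
t = np.linspace(t0 - 8 * sigma, t0 + 8 * sigma, 20001)
delta_approx = np.exp(-(t - t0) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

dt = t[1] - t[0]
numeric = np.sum(np.sin(t + phi_LTP) * delta_approx) * dt
assert abs(numeric - np.cos(phi_LTP - phi_CA3)) < 1e-4
```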
This results in a performance measure very similar to equation 2.14, with a maximum only slightly smaller than that in equation 2.14, but with no dependence on the magnitude of $X$:

$$M = \cos(\varphi_{LTP} - \varphi_{EC}) - K - \cos(\varphi_{LTP} - \varphi_{CA3}). \quad (2.16)$$
The disjunction between dendritic and somatic depolarization has been observed experimentally (Golding & Spruston, 1998; Kamondi et al., 1998; Paulsen & Moser, 1998). Phasic changes in disinhibition could underlie phasic changes in population spike induction in stratum pyramidale during theta (Rudell, Fox, & Ranck, 1980; Leung, 1984), though population spikes in anesthetized animals appear to depend more on strength of synaptic input (Rudell & Fox, 1984).

3 Discussion
Figure 6: Schematic representation of the encoding and extinction of associations in the model during normal theta (with theta) and after loss of theta due to fornix lesions (no theta). Circles represent populations of units in region CA3 representing arm location (left = L, right = R), and populations in region CA1 activated by food location. Arrows labeled with F represent sensory input about food reward arriving from entorhinal cortex. Filled circles represent populations activated by afferent input from entorhinal cortex or by retrieval activity spreading from region CA3. Initial learning: During initial encoding trials, simultaneous activity of left arm place cells and food reward input causes strengthening of connections between CA3 and CA1 (thicker line). This association can be retrieved in both normal and fornix lesioned rats. Error after reversal: With normal theta, when food reward is not present, the region CA1 neuron is not activated during encoding, so no strengthening occurs, and during the retrieval phase, depotentiation causes the connection storing the association between left arm and food to weaken (thinner line). In contrast, without theta, the activity spreading along the strengthened connection causes postsynaptic activity, which allows strengthening and prevents weakening of the association between left arm and food. Correct reversal: With normal theta, a new stronger association is formed between right arm and food. Without theta, the new and old associations are similar in strength. Retrieval choice: At the choice point on subsequent trials, retrieval of the association between right arm and food dominates with normal theta, whereas without theta the associations have similar strength.

The model presented here shows how theta rhythm oscillations could enhance the reversal of prior learning by preventing retrieval of previously encoded associations during new encoding. We developed expressions for the phasic changes in synaptic input and long-term potentiation during theta rhythm and for sensory input during different stages of behavior in a T-maze task. A performance measure evaluated memory for recent associations between location and food in this task by measuring how much retrieval activity resembles the new correct food location, as opposed to the initial (now incorrect) food location. As shown in Figure 4, the best memory for recent associations in this task occurs with the following phase relationships between oscillations of different variables: (1) sensory input from entorhinal cortex should be in phase (zero degrees difference) with oscillatory changes in long-term potentiation of the synaptic input from region CA3, and (2) retrieval activity from region CA3 should be out of phase (180 degree difference) with the oscillatory changes in long-term potentiation at these same synapses arising from region CA3. As shown in Figures 1 and 3, this range of best performance appears consistent with physiological data on the phase dependence of synaptic transmission (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995; Wyble et al., 2000) and long-term potentiation (Huerta & Lisman, 1993; Holscher et al., 1997; Wyble et al., 2001). This model predicts that recordings during tasks requiring repetitive encoding and retrieval of stimuli should show specific phase relationships to theta. These tasks include delayed matching tasks and delayed alternation (Numan, Feloney, Pham, & Tieber, 1995; Otto & Eichenbaum, 1992; Hampson, Jarrard, & Deadwyler, 1999). In these tasks, hippocampal neurons should show spiking activity at the peak of local theta in stratum radiatum for information being encoded and spiking activity near the trough of local theta for information being retrieved.
This is consistent with theta phase precession (O'Keefe & Recce, 1993; Skaggs et al., 1996), in which the retrieval activity (before the animal enters its place field) appears at different phases of theta than the stimulus-driven encoding activity (after the animal has entered the place field). The encoding of new information will be most rapid if theta rhythm shifts phase to allow encoding dynamics during the time when new information is being processed. This is consistent with evidence that the theta rhythm shows phase locking to the onset of behaviorally relevant stimuli for stimulus acquisition (Macrides, Eichenbaum, & Forbes, 1982; Semba & Komisaruk, 1984; Givens, 1996). Phase locking could involve medial septal activity shifting the hippocampal theta into an encoding phase during the initial appearance of a behaviorally relevant stimulus, thereby speeding the process of encoding. In tasks requiring access to information encoded within the hippocampus, this information will become accessible during the retrieval phase, which could bias generation of responses to the retrieval phase. This requirement is consistent with the theta phase locking of behavioral response times (Buno & Velluti, 1977). The mechanisms described here could also be important for obtaining relatively homogeneous strength of encoding within a distributed environment (Kali & Dayan, 2000). In particular, the model contains an ongoing interplay between synaptic weakening during retrieval and synaptic
strengthening during encoding. Without a prior representation, retrieval does not occur and synaptic weakening is absent, allowing pure strengthening of relevant synapses. However, for highly familiar portions of the environment, considerable retrieval will occur during the trough of theta, which will balance the new encoding at the peak. With a change in environment (e.g., insertion of a barrier or removal of food reward), the retrieval of prior erroneous associations in the model would no longer be balanced by reencoding of these representations, and the erroneous representations could be unlearned. (If retrieval occurs without external input, a return loop from entorhinal cortex would be necessary to prevent loss of the memory.) Without the separation of encoding and retrieval on different phases of theta, memory for the response performed during a trial cannot be distinguished from retrieval of a previous memory used to guide performance during that trial. In addition to the model of reversal learning presented above, the same principles could account for impaired performance caused by medial septal lesions in tasks such as spatial delayed alternation (Numan & Quaranta, 1990) and delayed alternation of a go/no-go task (Numan et al., 1995), which place demands on the response memory system. In contrast, medial septal lesions do not impair performance of a task requiring working memory for a sensory stimulus (Numan & Klis, 1992). Fornix lesions also impair delayed nonmatch to sample in a spatial T-maze in rats (Markowska, Olton, Murray, & Gaffan, 1989) and monkeys (Murray, Davidson, Gaffan, Olton, & Suomi, 1989), where animals must remember their most recent response in the face of interference from multiple previous responses. Performance of a spatial delayed match to sample task is also impaired by hippocampal lesions (Hampson, Jarrard, et al., 1999). 
This analysis can guide further experiments testing the predictions about specific phase relationships required by behavioral constraints in reversal and delayed alternation tasks. These behavioral constraints are ethologically relevant; they would arise in any foraging behavior where food sites are fixed across some time periods but vary at longer intervals.

Appendix: Glossary of Mathematical Terms

$a_{EC}^{(n)}$ = entorhinal activity. This is a fixed vector during each stage of the task. The stage is designated by superscript: $n$ = initial learning, $e$ = error after reversal, $c$ = correct after reversal, $r$ = retrieval test.

$a_{CA3}^{(n)}$ = region CA3 activity. Superscript designates stage of task.

$a_{CA1}(t)$ = region CA1 activity. Varies with time. $t_n$ = end of initial period, $t_{n+e}$ = end of erroneous trials, $t_{n+e+c}$ = end of correct trials.

$W_{CA3}(t)$ = synaptic input matrix from region CA3.

$W_{EC}$ = synaptic input matrix from entorhinal cortex.

$h_{CA3}(t)$ = oscillatory modulation of synaptic transmission from CA3 during theta rhythm, with phase $\varphi_{CA3}$.
$h_{EC}(t)$ = oscillatory modulation of synaptic transmission from entorhinal cortex during theta rhythm, with phase $\varphi_{EC}$.

$h_{LTP}(t)$ = oscillatory modulation of rate of synaptic modification during theta rhythm, with phase $\varphi_{LTP}$.

$M$ = performance measure in reversal task dependent on relative phase of oscillatory modulation.
Acknowledgments
We appreciate the assistance on experiments of Christiane Linster and Bradley Molyneaux. We also appreciate critical comments by Marc Howard, James Hyman, Terry Kremin, Ana Nathe, Anatoli Gorchetchnikov, Norbert Fortin, and Seth Ramus. In memory of Patricia Tillberg Hasselmo. This work was supported by NIH MH61492, NIH MH60013, and NSF IBN9996177.
References

Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsaki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. Journal of Neuroscience, 15, 47–60.

Brankack, J., Stewart, M., & Fox, S. E. (1993). Current source density analysis of the hippocampal theta rhythm: Associated sustained potentials and candidate synaptic generators. Brain Research, 615(2), 310–327.

Buno, W., Jr., & Velluti, J. C. (1977). Relationships of hippocampal theta cycles with bar pressing during self-stimulation. Physiol. Behav., 19(5), 615–621.

Buzsaki, G., Czopf, J., Kondakor, I., & Kellenyi, L. (1986). Laminar distribution of hippocampal rhythmic slow activity (RSA) in the behaving rat: Current-source density analysis, effects of urethane and atropine. Brain Res., 365(1), 125–137.

Buzsaki, G., Leung, L. W., & Vanderwolf, C. H. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res., 287(2), 139–171.

Fox, S. E. (1989). Membrane potential and impedance changes in hippocampal pyramidal cells during theta rhythm. Exp. Brain Res., 77, 283–294.

Fox, S. E., Wolfson, S., & Ranck, J. B. J. (1986). Hippocampal theta rhythm and the firing of neurons in walking and urethane anesthetized rats. Exp. Brain Res., 62, 495–508.

Frank, L. M., Brown, E. N., & Wilson, M. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27(1), 169–178.

Givens, B. (1996). Stimulus-evoked resetting of the dentate theta rhythm: Relation to working memory. Neuroreport, 8, 159–163.
Golding, N. L., & Spruston, N. (1998). Dendritic sodium spikes are variable triggers of axonal action potentials in hippocampal CA1 pyramidal neurons. Neuron, 21(5), 1189–1200.

Green, J. D., & Arduini, A. A. (1954). Hippocampal electrical activity and arousal. J. Neurophysiol., 17, 533–577.

Gustafsson, B., Wigstrom, H., Abraham, W. C., & Huang, Y. Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7(3), 774–780.

Hampson, R. E., Jarrard, L. E., & Deadwyler, S. A. (1999). Effects of ibotenate hippocampal and extrahippocampal destruction on delayed-match and nonmatch-to-sample behavior in rats. J. Neurosci., 19(4), 1492–1507.

Hampson, R. E., Simeral, J. D., & Deadwyler, S. A. (1999). Distribution of spatial and nonspatial information in dorsal hippocampus. Nature, 402(6762), 610–614.

Hasselmo, M. E., Wyble, B. P., & Wallenstein, G. V. (1996). Encoding and retrieval of episodic memories: Role of cholinergic and GABAergic modulation in the hippocampus. Hippocampus, 6(6), 693–708.

Holscher, C., Anwyl, R., & Rowan, M. J. (1997). Stimulation on the positive phase of hippocampal theta rhythm induces long-term potentiation that can be depotentiated by stimulation on the negative phase in area CA1 in vivo. J. Neurosci., 17(16), 6470–6477.

Huerta, P. T., & Lisman, J. E. (1993). Heightened synaptic plasticity of hippocampal CA1 neurons during a cholinergically induced rhythmic state. Nature, 364, 723–725.

Huerta, P. T., & Lisman, J. E. (1995). Bidirectional synaptic plasticity induced by a single burst during cholinergic theta oscillation in CA1 in vitro. Neuron, 15(5), 1053–1063.

Kali, S., & Dayan, P. (2000). The involvement of recurrent connections in area CA3 in establishing the properties of place fields: A model. J. Neurosci., 20(19), 7463–7477.

Kamondi, A., Acsady, L., Wang, X. J., & Buzsaki, G. (1998). Theta oscillations in somata and dendrites of hippocampal pyramidal cells in vivo: Activity-dependent phase-precession of action potentials. Hippocampus, 8(3), 244–261.

Leung, L.-W. S. (1984). Model of gradual phase shift of theta rhythm in the rat. J. Neurophysiol., 52, 1051–1065.

Macrides, F., Eichenbaum, H. B., & Forbes, W. B. (1982). Temporal relationship between sniffing and the limbic theta rhythm during odor discrimination reversal learning. J. Neurosci., 2, 1705–1717.

Markowska, A. L., Olton, D. S., Murray, E. A., & Gaffan, D. (1989). A comparative analysis of the role of fornix and cingulate cortex in memory: Rats. Exp. Brain Res., 74(1), 187–201.

McNaughton, B. L., Barnes, C. A., & O'Keefe, J. (1983). The contributions of position, direction, and velocity to single unit activity in the hippocampus of freely-moving rats. Exp. Brain Res., 52, 41–49.

M'Harzi, M., Palacios, A., Monmaur, P., Willig, F., Houcine, O., & Delacour, J. (1987). Effects of selective lesions of fimbria-fornix on learning set in the rat. Physiol. Behav., 40, 181–188.
816
M. E. Hasselmo, C. Bodel Âon, and B. P. Wyble
Murray, E. A., Davidson, M., Gaffan, D., Olton, D. S., & Suomi, S. (1989). Effects of fornix transection and cingulate cortical ablation on spatial memory in rhesus monkeys. Exp. Brain Res., 74(1), 173–186. Numan, R. (1978). Cortical-limbic mechanisms and response control: A theoretical re-view. Physiol. Psych., 6, 445–470. Numan, R., Feloney, M. P., Pham, K. H., & Tieber, L. M. (1995). Effects of medial septal lesions on an operant go/no-go delayed response alternation task in rats. Physiol. Behav., 58, 1263–1271. Numan, R., & Klis, D. (1992). Effects of medial septal lesions on an operant delayed go/no-go discrimination in rats. Brain Res. Bull., 29, 643–650. Numan, R., & Quaranta, J. R. Jr. (1990).Effects of medial septal lesions on operant delayed alternation in rats. Brain Res., 531, 232–241. O’Keefe, J., & Recce, M. L. (1993).Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330. Orr, G., Rao, G., Stevenson, G. D., Barnes, C. A., & McNaughton, B. L. (1999). Hippocampal synaptic plasticity is modulated by the theta rhythm in the fascia dentata of freely behaving rats. Soc. Neurosci. Abstr., 25, 2165 (864.14). Otto, T., & Eichenbaum, H. (1992). Neuronal activity in the hippocampus during delayed non-match to sample performance in rats: Evidence for hippocampal processing in recognition memory. Hippocampus, 2(3), 323–334. Paulsen, O., & Moser, E. I. (1998). A model of hippocampal memory encoding and retrieval: GABAergic control of synaptic plasticity. Trends Neurosci., 21(7), 273–278. Pavlides, C., Greenstein, Y. J., Grudman, M., & Winson, J. (1988). Long-term potentiation in the dentate gyrus is induced preferentially on the positive phase of theta-rhythm. Brain Res., 439(1–2), 383–387. Rudell, A. P., & Fox, S. E. (1984). Hippocampal excitability related to the phase of the theta rhythm in urethanized rats. Brain Res., 294, 350–353. Rudell, A. P., Fox, S. E., & Ranck, J. B. J. (1980). 
Hippocampal excitability phaselocked to the theta rhythm in walking rats. Exp. Neurol., 68, 87–96. Semba, K., & Komisaruk, B. R. (1984). Neural substrates of two different rhythmical vibrissal movements in the rat. Neuroscience, 12(3), 761–774. Skaggs, W. E., McNaughton, B. L., Wilson, M. A., & Barnes, C. A. (1996). Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus, 6, 149–172. Sohal, V. S., & Hasselmo, M. E. (1998). Changes in GABAB modulation during a theta cycle may be analogous to the fall of temperature during annealing. Neural Computation, 10, 889–902. Stewart, M., & Fox, S. E. (1990). Do septal neurons pace the hippocampal theta rhythm? Trends Neurosci., 13, 163–168. Stewart, M., Quirk, G. J., Barry, M., & Fox, S. E. (1992). Firing relations of medial entorhinal neurons to the hippocampal theta rhythm in urethane anesthetized and walking rats. Exp. Brain Res., 90(1), 21–28. Toth, K., Freund, T. F., & Miles, R. (1997). Disinhibition of rat hippocampal pyramidal cells by GABAergic afferent from the septum. J. Physiol., 500, 463– 474.
A Proposed Function for Hippocampal Theta Rhythm
817
Wallenstein, G. V., & Hasselmo, M. E. (1997). GABAergic modulation of hippocampal population activity: Sequence learning, place eld development and the phase precession effect. J. Neurophysiol., 78(1), 393–408. Whishaw, I. Q., & Tomie, J. A. (1997). Perseveration on place reversals in spatial swimming pool tasks: Further evidence for place learning in hippocampal rats. Hippocampus, 7(4), 361–370. Wiener, S. I., Paul, C. A., & Eichenbaum, H. (1989). Spatial and behavioral correlates of hippocampal neuronal activity. J. Neurosci., 9(8), 2737–2763. Wood, E. R., Dudchenko, P. A., Robitsek, R. J., & Eichenbaum, H. (2000). Hippocampal neurons encode information about different types of memory episodes occurring in the same location. Neuron, 27(3), 623–633. Wyble, B. P., Hyman, J. M., Goyal, V., & Hasselmo, M. E. (2001). Phase relationship of LTP inducation and behavior to theta rhythm in the rat hippocampus. Soc. Neurosci. Abstr., 27. Wyble, B. P., Linster, C., & Hasselmo, M. E. (2000). Size of CA1-evoked synaptic potentials is related to theta rhythm phase in rat hippocampus. J. Neurophysiol., 83(4), 2138–2144. Young, B. J., Otto, T., Fox, G. D., & Eichenbaum, H. (1997). Memory representation within the parahippocampal region. J. Neurosci., 17, 5183–5195. Received September 28, 2000; accepted June 15, 2001.
LETTER
Communicated by Andrew Barto
Self-Organization in the Basal Ganglia with Modulation of Reinforcement Signals

Hiroyuki Nakahara, [email protected]
Shun-ichi Amari, [email protected]
Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan

Okihide Hikosaka, [email protected]
Department of Physiology, School of Medicine, Juntendo University, 2-1-1 Hongo, Bunkyo, Tokyo 113-0033, Japan
Self-organization is one of the fundamental brain computations for forming efficient representations of information. Experimental support for this idea has largely been limited to the developmental and reorganizational formation of neural circuits in the sensory cortices. We now propose that self-organization may also play an important role in short-term synaptic changes in reward-driven voluntary behaviors. It has recently been shown that many neurons in the basal ganglia change their sensory responses flexibly in relation to rewards. Our computational model proposes that the rapid changes in striatal projection neurons depend on a subtle balance between Hebb-type mechanisms of excitation and inhibition, which are modulated by reinforcement signals. Simulations based on the model are shown to produce various types of neural activity similar to those found in experiments.

1 Introduction
The basal ganglia (BG) are well known to contribute to sequential motor and cognitive behaviors (Knopman & Nissen, 1991; Graybiel, 1995). Almost the entire cerebral cortex projects to the BG, and the BG project mainly back to the frontal cortex through the thalamus and to the superior colliculus. A striking fact about the BG is a vast convergent projection from the cerebral cortex to the striatum, a major input zone of the BG (Oorschot, 1996), and another convergent projection from the striatum to the output nuclei of the BG, that is, the globus pallidus internal segment (GPi) and the substantia nigra pars reticulata (SNr). Given this fact, it is expected that the BG have efficient representations of cortical inputs to interact effectively with the cerebral cortex (Graybiel, Aosaki, Flaherty, & Kimura, 1994). We should note that the majority of the neurons in the striatum, including the projection neurons and some types of interneurons, are GABAergic, most likely working as inhibitory within the striatum as well as to target neurons in projection areas (Kita, 1993; Kawaguchi, Wilson, Augood, & Emson, 1995; Wilson, 1998). The BG, and the striatum in particular, receive rich reinforcement signals from dopaminergic (DA) neurons, which originate in the substantia nigra pars compacta (SNc). It has been observed that the neural responses in the striatum are strongly modulated by DA neurons or reward conditions (Aosaki et al., 1994; Schultz, Apicella, Romo, & Scarnati, 1995; Kawagoe, Takikawa, & Hikosaka, 1998). DA neurons exhibit a phasic activity when an unexpected reward occurs or when a conditioning stimulus appears that allows the subject to anticipate a coming reward (Schultz, 1998). Driven by these reinforcement signals carried by DA neurons, the striatum has been

Neural Computation 14, 819–844 (2002). © 2002 Massachusetts Institute of Technology.
Figure 1: Facing page. (A) Example of the memory-guided saccade task in a one-direction-rewarded condition (1DR). In the experiments, there were two reward conditions: 1DR and ADR (all-directions-rewarded condition). The task procedure in each trial is the same as in a memory-guided saccade task in both conditions, except for the reward conditions: a task trial started with the onset of a central fixation point, which the monkeys had to fixate. A cue stimulus (spot of light) then came on at one of four directions. After the fixation point turned off, the monkeys had to make a saccade to the cued location. In ADR, all directions are rewarded after a saccade to the cued location in each trial. In 1DR, throughout a block of the experiment (60 trials), only one direction was rewarded among the four directions. In the example shown here, the right direction is rewarded. Even for nonrewarded directions in 1DR, the monkeys had to make correct saccades; otherwise, the same trial was repeated. 1DR was performed in four blocks, in each of which a different direction was rewarded. Other than the actual reward, no indication was given to the monkeys as to which direction was to be rewarded. (B) Examples of the three types of neural responses observed in the experiment, shown over four 1DR blocks (taken from Kawagoe et al., 1998). Data obtained in each block of 1DR (left) are shown as a polar diagram indicating the magnitudes of the responses for the four cue directions. The rewarded direction is indicated by a bull's-eye mark. (Top) Flexible type (the most frequently observed type) changes its preferred direction quickly to the rewarded direction in each 1DR block. (Middle) Conservative type maintains its preferred direction (rightward for this neuron) across 1DR blocks (at least in two of four 1DR blocks), while the response toward each direction is enhanced when it is rewarded in each 1DR block. (Bottom) Reverse type (less frequently observed than the other two types) shows the smallest response to the rewarded direction in each 1DR block.
considered to undergo heterosynaptic plasticity (Calabresi, Maj, Pisani, Mercuri, & Bernardi, 1992; Wickens & Kötter, 1995) and to contribute to skill memory formation through the cortico-basal ganglia loops (Marsden, 1980; Alexander, Crutcher, & DeLong, 1990; Knowlton, Mangels, & Squire, 1996; Hikosaka et al., 1999; Nakahara, Doya, & Hikosaka, 2001). In a recent experiment (Kawagoe et al., 1998), reward-modulated changes of neural responses in the caudate, a part of the striatum, were investigated in a systematic manner, using asymmetrically rewarded memory-guided saccade tasks, in which one of four directions was randomly chosen as the saccade target in a trial and the monkeys had to make a memory-guided saccade in that trial (see Figure 1A). In this task, there were two reward
[Figure 1 appears here: (A) the task schematic, with one rewarded direction (Reward) and three nonrewarded directions (No Reward); (B) example responses of the Flexible, Conservative, and Reverse types.]
conditions: the all-directions-rewarded condition (ADR), where all four directions were rewarded in a block of experiments after a correct saccade in a trial, and the one-direction-rewarded condition (1DR), where only one fixed direction was rewarded after a correct saccade in a trial (the other three directions were not rewarded throughout a block, even when a correct saccade was made). In the visual and memory-related periods, the preferred directions of the caudate neural responses observed in ADR (Kawagoe et al., 1998) were typically found to be contralateral, as previously reported (Hikosaka, Sakamoto, & Usui, 1989), so that the caudate neurons exhibited spatial-directional selective responses. The caudate neural responses, however, were strongly modulated by the rewarded directions in the 1DR blocks (see Figure 1B). Three typical patterns were found in the response changes: flexible, conservative, and reverse type patterns (see Figure 1B; their definitions are provided in the legend). When a 1DR block is altered, the caudate responses change within ten to a few tens of trials and develop much more slowly than the DA activities (Kawagoe et al., 1998; Kawagoe, Takikawa, & Hikosaka, 1999) (see section 2.1). Such changes are supposed to be caused by synaptic plasticity under the influence of the DA activity (Calabresi et al., 1992; Wickens & Kötter, 1995; Reynolds & Wickens, 2000), while the DA-induced changes in the internal states of the striatal neurons may play a role as well (Surmeier, Song, & Yan, 1996). These results suggest that
Figure 2: Facing page. (A) Schematic diagram of the relationship of the cerebral cortex and the caudate. Spatial information is conveyed from the cerebral cortex to the caudate, while reinforcement signals are provided via dopaminergic projections (DA). Black and white arrowheads indicate excitatory and inhibitory connections, respectively. (B) Scheme of self-organization in the caudate. Directional information is topographically represented in the cerebral cortex with two components: one reflecting each direction selectively ($N_i$) and the other reflecting some overlap between different directions ($M$). Common inputs ($M$) are shared by the cortical representations for different directions ($x_i$); in the figure, cortical representations for two directions are shown. Reinforcement signals ($a$) carried by dopamine (DA) neurons influence cortical inputs. An integrated inhibitory input to caudate neurons is denoted by $x_0$. (C) Schematic example of a conservative-type neuron at equilibrium, responding to its preferred (but nonrewarded) direction ($x_1$) and a rewarded direction ($x_2$). This neuron does not fire to $x_3$ ($x_4$ is dropped here for simplicity). After the learning converges, the synaptic weight converges to $\bar{w} = \frac{1}{2}(x_1 + x_2)$. By assuming $\|\bar{w}\| = 1$, each of the inner products between the input ($x_i$) and the weight ($\bar{w}$), with and without reinforcement-signal modulation, is indicated on the right. The inhibitory effect is summarized by $\lambda$, which is also indicated on the right. Since $\bar{w}\cdot x_3 < \lambda < \bar{w}\cdot x_1,\ \bar{w}\cdot x_2(1 + a_2)$, the neuron responds to $x_1$, and to $x_2$ with the reinforcement signal, but not to $x_3$. If $a_2$ is not provided, the neuron does not respond to $x_2$. The dashed cone region indicates the receptive region of this neuron. The neuron responds to all of the inputs in this region, which can be modulated by reinforcement signals.
the caudate neurons rapidly alter their response properties, guided by the DA activity, when the reward condition changes (see the next section and section 4). Importantly, the pattern of changes in a response can vary in each neuron and is classified roughly into one of three categories. We propose that the caudate neurons self-organize their responses under the control of both the intrinsic cortical inputs representing each cue direction and the DA reinforcement signals reflecting the reward conditions (Schultz, 1998; see Figure 2A). We therefore study a simplified self-organization model with
[Figure 2 appears here: (A) cortex-to-caudate spatial-information and DA reinforcement-signal pathways; (B) cortical inputs $x_i$ and the inhibitory input $x_0$ converging on a caudate neuron under DA modulation; (C) the receptive region of a neuron, with the inner products $\bar{w}\cdot x_1$, $(1+a_2)\,\bar{w}\cdot x_2$, and $\bar{w}\cdot x_3$ indicated.]
reinforcement signals (see Figure 2B). It is possible to analyze its behaviors rigorously by giving conditions that guarantee the requested behaviors (Kawagoe et al., 1998). Our model includes plastic changes in both the excitatory and inhibitory synapses, and their subtle balance generates a variety of phenomena, as seen in experiments. Under each condition, we can determine which pattern of neural responses appears, using the parameters of both the reinforcement signal and the inhibitory effect with respect to the cortical representations. Our model can explain these typical patterns of modulation by the same simple mechanism in a unified way, which would help us predict the relationship between the neural response modulations and the cortico-striatal and nigro-striatal connections of the neurons.

2 A Theoretical Model

2.1 A Self-Organization Neuron Model. The emergence of various types of self-organization has been studied previously in a unified manner (Amari, 1977, 1983; Amari & Takeuchi, 1978). Our model of the striatal neurons is a new version in that the internal state of a neuron, $u(t)$ at time $t$, is enhanced by the reinforcement signal $a(t)$,

$$u(t) = \mathbf{w}(t)\cdot\mathbf{x}(t)\,\{1 + a(t)\} - w_0(t)\,x_0(t), \tag{2.1}$$
where the vector $\mathbf{x}$ stands for the excitatory cortical inputs to a neuron in the striatum, corresponding to the cue signal, which is one of the four directions, while $x_0$ summarizes a population of the inhibitory inputs as a single variable in this model and is assumed to be a constant for simplicity in the later analysis. Here, $\mathbf{w}$ denotes the weights of the cortico-striatal connections of this neuron for $\mathbf{x}$, and $w_0$ denotes the weight for the inhibitory input $x_0$. The term $a(t)$ denotes the effect of the reinforcement signal, carried by dopaminergic (DA) neurons, on the striatal projection neuron firing rates. Experimentally, the DA effects on striatal firing rates have been found to be facilitatory (i.e., $a > 0$ in our model) or inhibitory ($a < 0$). Facilitatory and inhibitory effects may be mediated by D1 and D2 receptors, respectively (Gerfen, 1992; Cepeda, Buchwald, & Levine, 1993). Alternatively, both effects can be mediated by the bistable nature of D1 receptors (Nicola, Surmeier, & Malenka, 2000; Gruber, 2000). The phasic DA activities occur in general to an unexpected reward or to a conditioning stimulus preceding a reward $r$, once the conditional relationship is well established (Schultz, Apicella, & Ljungberg, 1993). While this phasic DA activity is hypothesized to provide a reward prediction error, the striatal neurons are considered to change their neural activities using this reinforcement signal from the DA neurons (Schultz, 1998). Correspondingly, experiments in 1DR and ADR (Kawagoe et al., 1999) have shown that there is a phasic DA activity locked to the rewarded direction cue in each 1DR (ADR) block. When a new 1DR block is started, the phasic DA activity quickly shifts to the rewarded cue within a few trials and stays unchanged throughout the block, possibly having a very small, and negligible, decay toward the end of the block (Kawagoe et al., 1999). Accordingly, we denote the phasic DA activity by $a$ and assume
that $a$ changes without delay between experimental blocks and stays the same throughout a block. The caudate neurons also change their activities when a 1DR block is altered. Interestingly, the changes in the caudate neural activities are much slower, taking a few tens of trials, than those in the DA activity. This suggests that the plastic caudate activities may be induced by synaptic changes in the cortico-caudate projection, under the DA modulation, rather than by direct DA-induced changes in the caudate response properties, although the latter effect may also play a role, particularly in initiating such plastic caudate activities (see sections 3.1 and 4). Let us now look into the time course of the DA activity in a single trial. The phasic DA activity ($a$) starts to rise roughly around 100 milliseconds after the cue presentation, when it is rewarded. Next, it starts to drop, first rapidly, until around 400 milliseconds, and then gradually to the resting level around 700 milliseconds, during which a very weak DA activity, still larger than at the resting level, exists (Kawagoe et al., 1999; also see Schultz et al., 1993; Schultz, Romo, Ljungberg, Mirenowicz, Hollerman, & Dickinson, 1995). To distinguish it from the phasic peak activity $a$, we denote this weak DA activity by $a_0$, which can be regarded as nearly, but not exactly, zero (i.e., at the resting level) (therefore, $a_0 \approx 0$). In contrast, the post-cue caudate neural activities, depending on each neuron, are not confined to the time of the phasic DA period but extend into the time of the weak DA activity (and even later) (Kawagoe et al., 1998, 1999). In other words, the plastic changes in the caudate neural activities possibly carry over not only into the time of $a$ but also into the time of $a_0$ at least, suggesting that the DA-modulated synaptic changes induce the plastic changes in the caudate neural activities. Note that when $a(t) = 0$, the neuron model in equation
2.1 is equivalent to the primitive self-organization model (Amari & Takeuchi, 1978). When the reinforcement signal exists (i.e., $|a| > 0$), the internal state of the neuron $u(t)$ increases or decreases and controls the self-organizing process, so that the reinforcement signal works in a modulatory manner. The output $y(t)$ of the neuron is given by the transfer function

$$y(t) = f\{u(t)\}. \tag{2.2}$$
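As a minimal sketch of equations 2.1 and 2.2 (the parameter values are illustrative, not from the paper), the internal state and the step-function output can be written as:

```python
import numpy as np

def internal_state(w, x, a, w0, x0):
    """u = w . x (1 + a) - w0 * x0  (equation 2.1)."""
    return np.dot(w, x) * (1.0 + a) - w0 * x0

def step_output(u):
    """Step transfer function: y = 1 if u > 0, else 0 (one choice for equation 2.2)."""
    return 1 if u > 0 else 0

# Illustrative values: with phasic DA modulation (a > 0) the same input
# drives the neuron above threshold; without it the neuron stays silent.
w, x = np.array([0.5, 0.5]), np.array([1.0, 0.0])
u_rewarded = internal_state(w, x, a=0.3, w0=0.55, x0=1.0)
u_baseline = internal_state(w, x, a=0.0, w0=0.55, x0=1.0)
```

Because $a$ acts multiplicatively on the feedforward drive, the same weights can produce different outputs depending on the reinforcement signal.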
To obtain an analytical solution, we first choose $f(\cdot)$ as the step function, given as $1(u) = 1$ (if $u > 0$), and 0 otherwise. This is the simplest choice, allowing us to derive an explicit analytical solution for the dynamical behavior of learning in the model. This simplification implies that the state of the neuron is binary; that is, the neuron fires ($y = 1$) or does not ($y = 0$). Later, we let $f(\cdot)$ be the sigmoid function $f(a) = 1/(1 + e^{-ca})$, where $a$ is a real value and $c$ is a scaling parameter. The dynamics of a Hebbian-type learning rule is chosen to show changes in synaptic efficacies,

$$\begin{cases} \tau\,\dot{\mathbf{w}}(t) = -\mathbf{w}(t) + c\,y(t)\,\mathbf{x}(t) \\ \tau\,\dot{w}_0(t) = -w_0(t) + c_0\,y(t)\,x_0, \end{cases} \tag{2.3}$$

where $\dot{\mathbf{w}}$ and $\dot{w}_0$ represent the time derivatives $d\mathbf{w}/dt$ and $dw_0/dt$, and $c$ and $c_0$ are learning rates. In equation 2.3, $y(t)$ depends on $a(t)$, because $y(t) = f\{u(t)\}$
and $u(t)$ is a function of $\mathbf{x}(t)$ and $a(t)$. In other words, DA signals, particularly the phasic ones, initiate and guide the learning process, where the phasic activity ($a$) quickly changes to the weaker one ($a_0 \approx 0$) in the time course of a trial. In the first equation, the second term $y\mathbf{x}$ is Hebbian, indicating that the weight increases in proportion to the input $\mathbf{x}$ when the output $y$ of the neuron is positive. The above model is different from the ordinary Hebb-type neuron in that the inhibitory weight $w_0$ is also modifiable, as shown in the second equation (Amari & Takeuchi, 1978; Amari, 1983). The intrinsic mechanism of the model is based on the balance between the excitatory and inhibitory effects, mediated by modifiable weights under the initial modulation of the reinforcement signal. To understand the behavior of models of this kind, we need to examine their dynamical behaviors of learning, typically the equilibrium states and their stability under a stationary environment from which the input signals $\mathbf{x}(t)$ are supplied. Many stable equilibrium states exist in general, and this multistability is important to allow an ensemble of neurons, exposed to the same environment, to differentiate with different neural responses and capture various features of the environment.

2.2 Analysis of Learning Dynamics. When inputs $\mathbf{x}_i$ are presented with probabilities $p_i \equiv p(\mathbf{x}_i)$ ($i = 1, 2, 3, 4$) and reinforcement signals $a(\mathbf{x}_i)$, which quickly change into $a_0(\mathbf{x}_i)$, the averaged learning equation is given by

$$\begin{cases} \tau\,\dot{\mathbf{w}} = -\mathbf{w}(t) + c \sum_i p_i y_i \mathbf{x}_i \\ \tau\,\dot{w}_0 = -w_0(t) + c_0 \sum_i p_i y_i x_0. \end{cases} \tag{2.4}$$

The synaptic weights converge to the equilibrium state $(\bar{\mathbf{w}}, \bar{w}_0)$, satisfying $\dot{\mathbf{w}} = \dot{w}_0 = 0$:

$$\begin{cases} \bar{\mathbf{w}} = c \sum_i p_i y_i \mathbf{x}_i \\ \bar{w}_0 = c_0 \sum_i p_i y_i x_0. \end{cases} \tag{2.5}$$
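The averaged dynamics 2.4 can be integrated numerically to watch the weights settle into an equilibrium of the form 2.5. The following is a sketch under assumed toy inputs (two overlapping binary vectors and a constant reinforcement signal on the first one); none of the numbers come from the paper:

```python
import numpy as np

def simulate_averaged_learning(xs, ps, a_of, tau=1.0, c=1.0, c0=1.0,
                               x0=1.0, dt=0.1, steps=2000):
    """Euler integration of equation 2.4 with the step output of 2.1-2.2."""
    w = np.full(xs[0].shape, 0.8)   # arbitrary positive initial weights
    w0 = 0.1
    for _ in range(steps):
        ys = [1 if np.dot(w, x) * (1 + a_of(i)) - w0 * x0 > 0 else 0
              for i, x in enumerate(xs)]
        dw = (-w + c * sum(p * y * x for p, y, x in zip(ps, ys, xs))) / tau
        dw0 = (-w0 + c0 * sum(p * y * x0 for p, y in zip(ps, ys))) / tau
        w, w0 = w + dt * dw, w0 + dt * dw0
    return w, w0

# Two inputs sharing one common component (cf. the representation in section 2.3).
xs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
w, w0 = simulate_averaged_learning(xs, ps=[0.5, 0.5],
                                   a_of=lambda i: 0.2 if i == 0 else 0.0)
# Both inputs keep firing for these parameters, so the equilibrium (2.5) is
# w = (x1 + x2)/2 and w0 = x0, i.e. w -> (1, 0.5, 0.5) and w0 -> 1.
```

Because both inputs stay inside the receptive field throughout this run, the fixed point is simply the probability-weighted average of the inputs, which is exactly what equation 2.5 states.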
Equation 2.5 is not the explicit solution for $\bar{\mathbf{w}}$ and $\bar{w}_0$, since $y_i$ on the right-hand side depends on $\bar{\mathbf{w}}$ and $\bar{w}_0$, so it is the equation to be solved for them. After the weights have converged by learning, the internal state of the neuron for an input $\mathbf{x}$ with accompanying reinforcement signal $a$ (or $a_0$) is

$$\bar{u} = \bar{\mathbf{w}}\cdot\mathbf{x}\,(1 + a) - \bar{w}_0 x_0. \tag{2.6}$$

If $\bar{u} > 0$ for input $\mathbf{x}$, this neuron fires. In order to analyze the characteristics of this neuron, we define its receptive field $R$ as

$$R = \{\mathbf{x} \mid \bar{u}(\mathbf{x}) > 0\}, \tag{2.7}$$

and, using $R$,

$$p_R = \sum_{\mathbf{x}_i \in R} p_i, \tag{2.8}$$

$$\mathbf{w}_R = \Big(\sum_{\mathbf{x}_i \in R} p_i \mathbf{x}_i\Big) \Big/ p_R. \tag{2.9}$$
Intuitively speaking, $p_R$ stands for the probability mass of the signals in region $R$, and $\mathbf{w}_R$ stands for the center of gravity of region $R$. Recall that $y_i = 1(u(\mathbf{x}_i))$, so that $\sum_i y_i \mathbf{x}_i$ reduces to the sum over all $\mathbf{x}$ in the receptive field $R$. Hence, equation 2.6 can be rewritten as

$$\bar{u}_j = \bar{u}(\mathbf{x}_j) = c\,p_R\,(1 + a_j)\left(\mathbf{w}_R\cdot\mathbf{x}_j - \frac{c_0}{c\,(1 + a_j)}\,x_0^2\right), \tag{2.10}$$

where $a_j = a(\mathbf{x}_j)$. When $\mathbf{x}_j$ is rewarded, $a_j = a$ in the beginning and becomes $a_0$ later. Therefore, by defining

$$\lambda \equiv \frac{c_0}{c}\,x_0^2, \tag{2.11}$$

the condition for a neuron to fire in response to $\mathbf{x}_j$ is given by $K(\mathbf{x}_j) > 0$, where

$$K(\mathbf{x}_j) \equiv \mathbf{w}_R\cdot\mathbf{x}_j - \frac{\lambda}{1 + a_j}, \tag{2.12}$$
and the condition to be silent in response to $\mathbf{x}_j$ is given by $K(\mathbf{x}_j) \le 0$. The two conditions are necessary but not sufficient. Note that in the later period of a trial, that is, when $a_j = a_0$, the neuron still fires, and $K(\mathbf{x}_j) > 0$ for $a_j = a_0 \approx 0$. Thus, equation 2.12 provides a mathematical criterion for studying the receptive region generated by learning. We can see that $\lambda$ is the important parameter that controls the size of $R$. As $\lambda$ increases, the receptive field becomes smaller. To ensure sufficiency, we need to check that

$$K(\mathbf{x}) > 0 \ \text{for all}\ \mathbf{x} \in R, \quad\text{and}\quad K(\mathbf{x}) \le 0 \ \text{for all}\ \mathbf{x} \notin R.$$
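The criterion 2.12, together with the centroid definitions 2.8 and 2.9, gives a direct way to test whether a candidate receptive field $R$ is self-consistent. A sketch (the inputs and numbers are illustrative, not from the paper):

```python
import numpy as np

def K(x, w_R, lam, a):
    """Firing criterion (2.12): K(x) = w_R . x - lam / (1 + a)."""
    return np.dot(w_R, x) - lam / (1.0 + a)

def consistent_receptive_field(xs, ps, R, lam, a_of):
    """Sufficiency check: K > 0 exactly for the inputs inside R,
    with w_R the probability-weighted centroid of R (equations 2.8, 2.9)."""
    pR = sum(ps[i] for i in R)
    w_R = sum(ps[i] * xs[i] for i in R) / pR
    return all((K(xs[i], w_R, lam, a_of(i)) > 0) == (i in R)
               for i in range(len(xs)))

# Four directions with M = 2 common and N_i = 2 specific components each,
# so x_i . x_i = 4 and x_i . x_j = 2 (cf. section 2.3).
xs = [np.concatenate([np.ones(2)] +
                     [np.ones(2) if j == i else np.zeros(2) for j in range(4)])
      for i in range(4)]
ps = [0.25] * 4
a_of = lambda i: 0.5 if i == 0 else 0.0   # direction 0 is rewarded
ok = consistent_receptive_field(xs, ps, {0}, lam=3.0, a_of=a_of)
```

With `lam=3.0` the field $R = \{x_0\}$ is consistent ($K(x_0) = 4 - 3/1.5 = 2 > 0$ and $K(x_j) = 2 - 3 \le 0$); lowering `lam` to 1.5 lets the nonrewarded directions fire too, so the same $R$ fails the check, illustrating how $\lambda$ controls the size of $R$.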
This is because $\mathbf{w}_R$ depends on $R$, and $R$ is defined as the set of inputs $\mathbf{x}$ by which the neuron should fire (Amari & Takeuchi, 1978; Amari, 1983).

2.3 Analysis of Caudate Neurons. The emergence of the three response types of neurons is explained here. There are two types of 1DR blocks in the actual experiment: an exclusive 1DR and a relative 1DR (Kawagoe et al., 1998). We discuss only the former here, because our analysis can easily be applied to the latter with slight modifications. The cortical inputs $\{\mathbf{x}_i\}_{i=1}^4$, corresponding to the four cue directions, are given with probability $p(\mathbf{x}_i) = \frac{1}{4}$ in the experiment (Kawagoe et al., 1998). In the exclusive 1DR, only one of the four directions is a rewarded direction (say, $\mathbf{x}_i$ with the reward $r_i = r$, $r > 0$); the other three directions are not rewarded ($r_j = 0$, $j \ne i$). When one block of experiments starts, we have $a_i = a$ ($|a| > 0$; $a$ becomes $a_0$ shortly in each trial) and $a_j = 0$ ($j \ne i$), respectively, as the effect of the phasic reinforcement signal by DA neurons on the striatal projection neuron (see section 2.1). We now describe how the four cue directions are represented in the cortical signal $\mathbf{x}$ to a caudate neuron in our model. The signal consists of a bundle of inputs in which some parts are common to all of the cue directions, just representing the appearance of a cue, and the other parts include specific, exclusive
information for each direction. Let $M$ be the number of common inputs, and let $N_i$ be the number of inputs specific to $\mathbf{x}_i$. We rearrange the components such that the common $M$ inputs appear first. We then have the following representation, with the $M$ common components followed by blocks of $N_1, N_2, N_3, N_4$ direction-specific components:

$$\begin{aligned} \mathbf{x}_1 &= (1, \ldots, 1,\ 1, \ldots, 1,\ 0, \ldots, 0,\ 0, \ldots, 0,\ 0, \ldots, 0) \\ \mathbf{x}_2 &= (1, \ldots, 1,\ 0, \ldots, 0,\ 1, \ldots, 1,\ 0, \ldots, 0,\ 0, \ldots, 0) \\ \mathbf{x}_3 &= (1, \ldots, 1,\ 0, \ldots, 0,\ 0, \ldots, 0,\ 1, \ldots, 1,\ 0, \ldots, 0) \\ \mathbf{x}_4 &= (1, \ldots, 1,\ 0, \ldots, 0,\ 0, \ldots, 0,\ 0, \ldots, 0,\ 1, \ldots, 1). \end{aligned} \tag{2.13}$$
The inner products among $\{\mathbf{x}_i\}_{i=1}^4$ are given by

$$\mathbf{x}_i\cdot\mathbf{x}_j = \begin{cases} M + N_i & (i = j) \\ M & (i \ne j). \end{cases} \tag{2.14}$$
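The representation 2.13 and its inner products 2.14 are easy to reproduce programmatically; the following sketch builds the binary vectors for arbitrary $M$ and $N_i$ (the values used are illustrative):

```python
import numpy as np

def make_cortical_inputs(M, Ns):
    """Binary vectors of equation 2.13: M shared components followed by
    N_i components specific to direction i."""
    dim = M + sum(Ns)
    xs, offset = [], M
    for N in Ns:
        x = np.zeros(dim)
        x[:M] = 1.0                   # components common to all cues
        x[offset:offset + N] = 1.0    # components specific to this cue
        offset += N
        xs.append(x)
    return xs

xs = make_cortical_inputs(M=3, Ns=[2, 4, 2, 2])
# Inner products reproduce (2.14): M + N_i on the diagonal, M elsewhere.
```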
This definition of the cortical representations for each direction is only for presentational simplicity. The following theorems can be proved, as is evident in their proofs, with a general definition of inner products, $\mathbf{x}_i\cdot\mathbf{x}_j = g_{ij}$, if we wish. Through this general form of the inner product, the magnitude of the cortical representation for each direction is determined, and furthermore, the proximity between the cortical representations for different directions is determined. These properties are essential to determine each response type together with the other two parameters, $\lambda$ and $a$, as shown in the three theorems below, where equation 2.14 is used.

Theorem 1. A neuron behaves as the flexible type in all four 1DR blocks if its parameters satisfy

$$M + \frac{N_{\max}}{2} \le \lambda < \min\{M(1 + a),\ (M + N_{\min})(1 + a_0)\}, \tag{2.15}$$

where $N_{\max} = \max_{i\in I} N_i$, $N_{\min} = \min_{i\in I} N_i$, and $I = \{1, 2, 3, 4\}$.
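Condition 2.15 is straightforward to evaluate for given parameters. A sketch with illustrative numbers (not taken from the paper); an empty admissible interval simply means no flexible regime exists for those parameters:

```python
def satisfies_flexible_condition(M, Ns, lam, a, a0):
    """Condition (2.15) of theorem 1 for the flexible type."""
    Nmax, Nmin = max(Ns), min(Ns)
    return M + Nmax / 2 <= lam < min(M * (1 + a), (M + Nmin) * (1 + a0))

# With M = 2, N_i = 2, a = 0.8, a0 = 0.05 the admissible interval is
# [3, 3.6), so lam = 3.2 qualifies while lam = 2.5 does not.
```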
Proof. Without loss of generality, we assume that the rewarded direction is $\mathbf{x}_1$ and, hence, $a_1 = a$ ($a > 0$) in the beginning and $a_j = 0$ ($j \ne 1$). We assume that the initial weight $\mathbf{w}$ is large enough that the neuron is excited by $\mathbf{x}_1$. We search for the condition that $R = \{\mathbf{x}_1\}$ is an equilibrium state of the equation, implying that the neuron is excited only by $\mathbf{x}_1$. If this is the case, $p_R = \frac{1}{4}$ and $\mathbf{w}_R^1 = \mathbf{x}_1$. In equation 2.12, we further require $K(\mathbf{x}_1) > 0$ and $K(\mathbf{x}_i) \le 0$ ($i \ne 1$), even when $a$ is reduced to $a_0$ in the process. These are equivalently rewritten as $M \le \lambda < (M + N_1)(1 + a_0)$. Now suppose that the reward direction is changed to $\mathbf{x}_2$ in the next block. We need to obtain the condition to ensure that the neuron changes its behavior
to respond only to $\mathbf{x}_2$. Noting that the initial state of the weight vector of the neuron in the new block is the same as the final equilibrium state of the weights in the previous block, we require $\mathbf{w}_R^1\cdot\mathbf{x}_2 - \frac{\lambda}{1 + a} > 0$ for the neuron to respond to $\mathbf{x}_2$ in the beginning of the new block, where the reward direction has changed and $a_2 = a$. This gives $\lambda < M(1 + a)$. Even when this condition is satisfied, it may happen that this neuron responds to both $\mathbf{x}_1$ and $\mathbf{x}_2$ in the beginning of the second block, though $\mathbf{x}_1$ is no longer rewarded, that is, $a_1 = 0$. However, this changes as learning takes place in the new situation. We then need a condition such that while the neuron may initially respond to both $\mathbf{x}_1$ and $\mathbf{x}_2$, the neuron stops responding to $\mathbf{x}_1$ before reaching the equilibrium state of $R = \{\mathbf{x}_2\}$. This condition is given by

$$M + \frac{N_2}{2} \le \lambda.$$

By summarizing the above conditions and taking all 1DR blocks into account, the theorem is proved.

The theorems for the other two types are given as follows.

Theorem 2. A neuron behaves as the conservative type in all four 1DR blocks if its parameters satisfy
$$M + \frac{N'_{\max}}{2} \le \lambda < \min\left\{M(1 + a),\ \Big(M + \frac{N'_{\min}}{2}\Big)(1 + a_0),\ M + \frac{N_1}{2}\right\}, \tag{2.16}$$
where we assume that $\mathbf{x}_1$ is the preferred direction (i.e., $N_1 = N_{\max}$), and we have $N'_{\max} = \max_{i\in I'} N_i$, $N'_{\min} = \min_{i\in I'} N_i$, and $I' = \{2, 3, 4\}$.

Theorem 3. A neuron behaves as the reverse type in all four 1DR blocks if its parameters satisfy
$$\Big(M + \frac{N_{\max}}{4}\Big)(1 + a_0) \le \lambda < M, \tag{2.17}$$
where $N_{\max} = \max_{i\in I} N_i$ and $a < a_0 < 0$. See the appendix for the proofs of theorems 2 and 3. Figure 3A demonstrates each response type proved in the three theorems above. Note that a binary output neuron is used in the theorems, since the step function, as the transfer function of the neurons, allows neurons only to fire or to be silent. Hence, each response type is defined accordingly: Flexible-type neurons respond only to the rewarded direction in each 1DR block; conservative-type
Figure 3: The three types of simulated neural responses shown over four 1DR blocks. The same parameter setting is used in A and B except for the transfer functions. (A) The step function is used as the transfer function to generate a neural output. (B) The sigmoid function is used as the transfer function.
neurons respond to both the rewarded and their intrinsic preferred directions in each 1DR block; reverse-type neurons respond to nonrewarded directions in each 1DR block. The three types proved in the theorems are chosen to stand for the most typical types of neural response changes across 1DR blocks observed in experiments. At the same time, within the rich variety of experimental neural responses, there are some types that are not included in the three types and other types that can be considered as subtypes of one of the three types. For example, a very few neurons behave as a reverse-conservative type, which has a larger response to each direction when it is not rewarded, while its intrinsic preferred direction is somewhat maintained in each 1DR block. Other neurons behave as a super-conservative type, which responds almost only to its intrinsic preferred direction in any 1DR block. It is possible to prove the conditions for these types, but we focus on the typical three types for clarity in this study.

2.4 Fine Characteristics of Neural Responses. The step function allows us to obtain exact analytical solutions; however, simulation results based on the step function differ from experimental ones in that the neurons can only be in a binary mode: firing or silent (Figure 3A). To simulate experimental results further, an analog sigmoid function can be employed, by which a normalized mean firing rate can be represented (Figure 3B). With this alteration of the transfer function, qualitative behaviors are expected to be similar, because the sigmoid function becomes similar to the step function as its steepness increases. Once the sigmoid function is used, the difference in the magnitudes of the responses ($y = f(u)$) can be reflected in the difference in the internal states (i.e., $u$).
Hence, by adjusting the proximity of the cortical inputs (i.e., the inner product in equation 2.14, or gij in general), the fine characteristics of the 1DR responses can be represented. In other words, the directional selectivity of the caudate neural responses can be reflected (see Figure 3B; for example, the flexible type is set as having the leftward preferred direction). For flexible-type neurons in experiments, the response to each direction in ADR tends to be smaller than the corresponding rewarded response in 1DR. This tendency is observed in the model. We first note that the total amount of reward in one block was the same in ADR and 1DR in the experiments (Kawagoe et al., 1998). In other words, the amount of reward for each reward direction is four times larger in any 1DR block than in ADR. Under this condition, our model shows the observed tendency that neural responses become less distinctive in ADR than in 1DR. Given the step function as the transfer function, the condition for neurons of the flexible type to have responses in ADR as well as in 1DR is summarized by
M + N/2 < λ < min{ M(1 + α), (M + N/4)(1 + α′/4) },
where we set N = Ni for simplicity. The difference of the internal states in 1DR and ADR (ū1DR and ūADR) is given as

ū1DR − ūADR = (3c/16)(Nα + 4λ − 4M).
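As a sanity check on the difference ū1DR − ūADR = (3c/16)(Nα + 4λ − 4M), the sketch below verifies numerically that it is positive when the flexible-type condition M + N/2 < λ < min{M(1 + α), (M + N/4)(1 + α′/4)} holds. All parameter values are illustrative assumptions of our own, and alpha0 stands for α′.

```python
# Illustrative parameter values (our assumptions, not fitted ones).
c, M, N = 1.0, 1.0, 0.4
alpha, alpha0 = 0.5, 0.4
lam = 1.205

# The flexible-type condition with ADR responses, as stated in the text.
assert M + N / 2 < lam < min(M * (1 + alpha), (M + N / 4) * (1 + alpha0 / 4))

diff = (3.0 * c / 16.0) * (N * alpha + 4.0 * lam - 4.0 * M)
print(diff > 0)  # prints True: the 1DR internal state exceeds the ADR one
```

The sign is forced by the condition itself: λ > M + N/2 gives 4λ − 4M > 2N > 0, and Nα > 0 for a facilitatory reinforcement signal.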
From the above two equations, we can conclude ū1DR − ūADR > 0; that is, we find the above-mentioned tendency in our model. However, the observed tendency might simply be due to the difference in the amount of reward per trial. But preliminary experimental results suggest that when a two-directional version of the 1DR task is employed, the response amplitude for one direction is influenced by the other chosen direction, indicating an interactive effect between the different directional inputs on the response amplitudes (Takikawa, Kawagoe, & Hikosaka, 2001). Hence, the tendency may not be a simple result of the difference in the amount of reward per trial.

3 Remarks

3.1 Neuron Model and Learning Rule. The neuron model in this study is given by equations 2.1 and 2.2, while the learning rule is given by equation 2.3. In this formulation, the reinforcement signal is not directly evident in the learning rule but has an indirect effect on the learning through y(t) = f(u) = f[w(t) · x(t){1 + a(t)} − w0(t)x0(t)]. Accordingly, the DA-induced modulation a influences the neural state u first. It then works as a modulator, indirectly affecting the learning process of the synaptic weights w and w0. DA-induced changes in a neural state (see equation 2.1) eventually lead to selective changes of the synaptic efficacy through the learning process in the model (see equation 2.3). This may illuminate the issue of whether short-latency DA responses serve reinforcement learning or attentional switching (Redgrave, Prescott, & Gurney, 1999). In our model, when a new phasic DA response occurs to an unexpected reward, the phasic DA activity directly affects the striatal neural response properties (attentional switching) and consequently initiates a new learning process ("reinforcement learning").

3.2 Emergence of Three Neural Response Types.
The results in the previous section demonstrate that all three types of neuronal behavior emerge from the same mechanism, depending on the values of the underlying parameters (see Figure 3A). Table 1 summarizes the conditions for each type of neuron. All of the conditions are expressed by two factors, the term λ and the reinforcement signal α, with respect to the cortical representations (i.e., M, Ni). The term λ = (c′/c)x0² (see equation 2.11) is composed of the learning rates for the excitatory and inhibitory weights (i.e., c and c′) and the magnitude of the inhibitory input x0, and, roughly speaking, it summarizes the efficacy of learning in the inhibitory effect relative to that in the excitatory effect (see Figures 2B and 2C and the next section). The term α indicates the effect of the reinforcement signal, carried by DA neurons, on the striatal firing rate. As shown in Table 1, the analysis indicates
Table 1: Conditions for Different Types of the Caudate Neurons.

Neuron Type    Condition
Flexible       M + Nmax/2 ≤ λ < min{M(1 + α), (M + Nmin)(1 + α′)};  α ≥ α′ > 0
Conservative   M + N′max/2 ≤ λ < min{M(1 + α), (M + N′min/2)(1 + α′), M + N1/2};  α ≥ α′ > 0  (N1 > Ni, i ∈ I′)
Reverse        (M + Nmax/4)(1 + α′) ≤ λ < M;  α ≤ α′ < 0

Notes: Nmax ≡ max_{i∈I} Ni, Nmin ≡ min_{i∈I} Ni, I ≡ {1, 2, 3, 4}; N′max ≡ max_{i∈I′} Ni, N′min ≡ min_{i∈I′} Ni, I′ ≡ {2, 3, 4}; λ ≡ (c′/c)x0²; xi · xj ≡ M + Ni (i = j), M (i ≠ j).
that the reinforcement signal α should work as facilitatory (α > 0) for the flexible and conservative types and as inhibitory (α < 0) for the reverse type. In our simplified definition of the inner product of the cortical representations (see equations 2.14 and 2.13), M represents the overlap between the cortical inputs, while Ni represents a part specific to each directional input. More generally, M + Ni corresponds to the square of the magnitude of each direction (i.e., |xi|²) and M corresponds to xi · xj = |xi||xj| cos θij (i ≠ j), where θij is the angle between xi and xj. In Table 1, therefore, terms such as M + Ni, M + Ni/2, and so on, express how the proximity of the cortical representations, or their overlap, influences the emergence of each type. For example, a critical difference between the flexible and conservative types is that the preferred cue direction input (x1 for N1 in Table 1) satisfies λ < M + N1/2 for the conservative type but M + N1/2 < λ for the flexible type, while there are some other differing conditions between the two types. Now, we can write M + N1/2 = (1/2)(‖x1‖² + x1 · xi), where i ∈ {2, 3, 4}. Hence, the term M + N1/2 is related to both the magnitude of the preferred cue direction input x1 and the angle between x1 and the other cue direction inputs. When the cortical representations {xi} are chosen as in equation 2.13, the conditions in Table 1 explicitly relate each response type to the magnitudes of the cortical inputs and their overlap, which is essentially the number of cortico-striatal connections reflecting each directional input (see equation 2.13).
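The conditions of Table 1, as we read them, can be encoded as simple inequality tests. The sketch below is our own illustration (the function name and all parameter values are assumptions): M is the overlap, N[i] the direction-specific parts with index 0 denoting the preferred direction x1, lam the inhibitory-learning term λ, and alpha, alpha0 standing for α and α′.

```python
def response_type(M, N, lam, alpha, alpha0):
    # Classify a model neuron by the Table 1 inequalities (our reading of them).
    Nmax, Nmin = max(N), min(N)
    Npmax, Npmin = max(N[1:]), min(N[1:])  # primed terms over I' = {2, 3, 4}
    if alpha >= alpha0 > 0:
        # Flexible: responds to whichever direction is currently rewarded.
        if M + Nmax / 2 <= lam < min(M * (1 + alpha),
                                     (M + Nmin) * (1 + alpha0)):
            return "flexible"
        # Conservative: keeps responding to its intrinsic preferred direction x1.
        if (N[0] > Npmax and
                M + Npmax / 2 <= lam < min(M * (1 + alpha),
                                           (M + Npmin / 2) * (1 + alpha0),
                                           M + N[0] / 2)):
            return "conservative"
    # Reverse: responds to the nonrewarded directions (inhibitory signal).
    if alpha <= alpha0 < 0 and (M + Nmax / 4) * (1 + alpha0) <= lam < M:
        return "reverse"
    return "other"

print(response_type(M=0.5, N=[0.3, 0.3, 0.3, 0.3], lam=0.66,
                    alpha=0.4, alpha0=0.2))  # prints "flexible"
```

Sweeping lam and alpha through such a classifier reproduces the kind of region structure shown in Figure 4.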
Since the cortical representations in equation 2.13 and their inner products in equation 2.14 used in Table 1 are only for presentation purposes, in more general terms, the conditions in Table 1 suggest that each response type is determined by the magnitudes and the proximity of the directional inputs conveyed through the cortico-striatal projections (i.e., via the inner product gij) in relation to λ and α. Our model may be generic enough to capture microscopic properties such as small-cluster properties (Hikosaka et al., 1989; Flaherty &
Graybiel, 1994; Jaeger, Kita, & Wilson, 1994; Kincaid, Zheng, & Wilson, 1998) with a more detailed analysis of the cortical input proximities, along with 1DR experiments.
3.3 Effect of Inhibitory Weight Modifiability. The term λ, roughly speaking, summarizes the efficacy of the inhibitory learning dynamics. If λ is too small, all neurons start to fire for any directional input at equilibrium, whereas all neurons become silent for any directional input at equilibrium if λ is too large. For an ensemble of neurons to acquire discriminative power by self-organization, λ should be set appropriately so that the neurons can start to respond to different stimuli with differential response properties, governed by their cortical input proximities and reinforcement signals. Thus, while λ = (c′/c)x0² is kept constant for each neuron in our simple model, neurons with different values of λ lead to different response types. As an example, let us consider how a neuron of the flexible type can maintain its response type in the transition between 1DR blocks. In the transition, a neuron of the flexible type should start to respond to a new rewarded direction at the beginning of the second block (see the proof of theorem 1). In this situation, the inhibitory effect w̄0x0 in a neural state u is given as w̄0x0 = c′pR x0² = cpR λ. Thus, if λ is so large that it makes u = w̄ · x(1 + α) − w̄0x0 < 0, then the neuron fails to respond to the rewarded direction in the transition. In this sense, λ should be set relatively small (Jaeger et al., 1994). On the other hand, if λ is so small that it allows a neuron to respond to both the previous reward direction and the current reward direction even at its equilibrium, then the neuron fails to maintain the response property of the flexible type after the transition. Figure 4 shows an example of the correspondence of the three neural response types with different parameter values.
In each graph, the regions for the flexible, conservative, and reverse types are drawn in light gray, gray, and dark gray, respectively, and are determined by λ and α for fixed Nmax and N (or N/Nmax); in other words, we set N1 = Nmax, say, and then Nj = N (j ≠ 1) for simplicity (see the figure legend). In this figure, when Nmax is large, there is a small common input M, and when N/Nmax becomes smaller, the preferred direction input gets larger relative to the other direction inputs. There is a nonlinear effect determining the region of each response type. Depending on Nmax and N, different values of α and λ provide each response type. We note that the example shown in Figure 4 is just one example, chosen to indicate the coexistence of the three types under some parameter regime. For example, there can be a parameter regime where only the flexible and reverse types, but not the conservative type, exist. The analysis in this study can predict possible changes in the self-organization of neural responses when inhibitory effects are altered. For example, decreasing λ leads neurons to lose their directionally selective responses of the flexible and conservative types (as exemplified in Figure 4). Decreasing λ can be achieved in several ways, for example, by decreasing the inhibitory input (x0). We await experimental examination of this issue.
[Figure 4 appears here: four panels plotting the response-type regions as functions of λ and α, for (Nmax = 0.3, N/Nmax = 0.7), (Nmax = 0.3, N/Nmax = 0.3), (Nmax = 0.7, N/Nmax = 0.7), and (Nmax = 0.7, N/Nmax = 0.3).]
Figure 4: Examples of correspondences of the three neural response types with different parameter regions. Flexible, conservative, and reverse type regions are drawn in light gray, gray, and dark gray, respectively. Graphs are generated by assuming Nmax > N′max = Nmin = N′min = N (in other words, given Ni = Nmax, Nj = N (j ≠ i)) and α = α′, and by normalizing the conditions by M + Nmax = 1. Hence, we reduced the three conditions in Table 1 to those of the following parameters: λ, α, Nmax, N. In each graph, the regions of each type are drawn with respect to λ and α for fixed Nmax and N (or N/Nmax).
4 Discussion

This study theoretically investigated, as a model of striatal neurons in the basal ganglia, a simple self-organizing neuron model under the modulation of reinforcement signals. This model is motivated by experimental observations: directionally selective responses of the striatal neurons, abundant reinforcement signals carried by dopaminergic neurons, a rich source of inhibitory neurons in the striatum, and a vast convergence of the cortico-basal ganglia projection. A theoretical analysis shows that the model explains, in a unified manner, how seemingly different neural response patterns observed in experiments (Kawagoe et al., 1998) can emerge from the same model with different parameter values. Due to the choice of a simple model, our analysis can explicitly relate each response type to two factors, the reinforcement signal (α) and the inhibitory effect (λ), in conjunction with the magnitudes and the proximity of the cortical input representations (M, Ni).
Various types of self-organization rules have been proposed, mainly to investigate cerebral cortical formation (von der Malsburg, 1973; Amari & Takeuchi, 1978; Willshaw & von der Malsburg, 1979; Bienenstock, Cooper, & Munro, 1982; Obermayer, Ritter, & Schulten, 1990; Földiák, 1991). Experimental support has been largely obtained from the developmental formation of neural circuits in the sensory cortices (Wiesel & Hubel, 1963; LeVay, Stryker, & Shatz, 1978) and from the reorganization of the sensory cortical maps (Buonomano & Merzenich, 1998). The BCM theory in particular has been a very successful model of developmental formation (Bienenstock et al., 1982; Kirkwood, Rioult, & Bear, 1996). All of these rules use the Hebbian synapse as a mathematical core, although controversy remains in the details. Their differences lie in the way each stabilizes learning, such as competition among synapses (von der Malsburg, 1973), a floating threshold (Bienenstock et al., 1982), or others. Unlike the others, the modifiability of inhibitory synapses (Amari & Takeuchi, 1978) plays a fundamental role in regulating the self-organization in our model (see Figure 2C). Without the reinforcement signal α in equation 2.1, previous studies (Takeuchi & Amari, 1979; Amari, 1980, 1983) have indicated that this inhibitory synapse modifiability, along with the excitatory synapse modifiability, allows efficient input representations, such as a receptive field self-organized with respect to various features of environmental inputs, the formation of topographic maps, patch structures, and so on. Hence, the current model with the reinforcement signal α can be expected to construct a cortical input representation effectively modulated by reinforcement signals, although a rigorous study remains to be done. This property is useful in the face of the strong anatomical convergence in the cortico-basal projections (Oorschot, 1996).
Any divergent manner of the corticostriatal projections within this convergence (Parthasarathy, Schall, & Graybiel, 1992; Graybiel et al., 1994) may provide further combinatorial benefits for the efficiency of the self-organization. Generally, it is important to construct efficient representations of the state information (inputs) under the modulation of reinforcement signals (Dayan, 1991). Notable experiments have indicated that such a remapping under the influence of a diffuse neuromodulator occurs in the sensory cortex (Bakin & Weinberger, 1996; Kilgard & Merzenich, 1998; Sachdev, Lu, Wiley, & Ebner, 1998). Our model can be a primitive model for this issue. Abundant inhibitory sources exist in the striatum (Kawaguchi et al., 1995). The plasticity of inhibitory synapses is found in the hippocampus (Nusser, Hajos, Somogyi, & Mody, 1998), the cerebellum (Kano, Rexhausen, Dreessen, & Konnerth, 1992), the cerebral cortex (Komatsu & Iwakiri, 1993; Komatsu, 1996), and other areas (Kano, 1995); recent findings have suggested an important role of inhibitory connections in self-organization even in the cortex, for example, the early visual cortex (Hensch et al., 1998; Fagiolini & Hensch, 2000). To our knowledge, there is no direct evidence of modifiable inhibitory synapses in the striatum, which remains an important topic for future study. We have not specified a neural origin of the inhibitory effects. According to experimental findings, a neural origin of inhibitory effects could be either inhibitory interneurons (Jaeger et al., 1994; Bennett & Bolam, 1994; Koós & Tepper, 1999) or the collaterals of projection neurons (Groves, 1983; Wickens, 1993). If
the former is the case, an inhibitory effect may work in a feedforward manner under the influence of cortical projections. If the latter is the case, an inhibitory effect may work as feedback and mutual inhibition, possibly in a manner similar to a winner-take-all mechanism (Groves, 1983; Wickens, 1993). It is possible to extend the current analysis with recurrent connections; in this perspective, this study can be regarded as an extension of the winner-take-all mechanism. The current analysis of our model, however, treats the effect of modifiable excitatory and inhibitory weights in a feedforward manner, because we do not commit ourselves to specifying the inhibitory effect as the collaterals. In this sense, our model shares the feature of a reward-modulated feedforward network with previous work by Barto and his colleagues (Barto, Sutton, & Brouwer, 1981; Barto, 1985). A third possibility is that inhibitory interneurons and the collaterals of projection neurons (Groves, 1983; Wickens, 1993) jointly work as an inhibitory source (Wickens, 1997). A recent demonstration of a weak collateral interaction between projection neurons also suggests this possibility (Tunstall, Kean, Wickens, & Oorschot, 2001). In the future, it will be important to relate our model parameters to the experimentally estimated quantitative nature of these inhibitory resources (Jaeger et al., 1994; Wickens, 1997; Tunstall et al., 2001) as well as of the cortical input magnitudes and proximity (Oorschot, 1996; Kincaid et al., 1998). Generally, the DA modulation of the caudate neurons can be considered in two aspects: the synaptic efficacy (Calabresi et al., 1992; Wickens & Kötter, 1995) and the response property (Nicola et al., 2000). Caudate response changes are much slower than DA response changes over trials in one 1DR block. In addition, caudate response changes possibly occur beyond the DA phasic response in the time course of a single trial.
Hence, we considered that the DA-modulated synaptic efficacy is involved in the emergence of the three response types and investigated how the DA-modulated synaptic plasticity can lead to these response types, while we considered the DA effect on the caudate firing rates to be complementary (see section 3.1). Yet this DA effect on the firing rates may play a larger role in accounting for the three response types (Gruber, 2000). Enhanced caudate activities long after the DA phasic response can be due, at least partially, to a prolonged DA effect that possibly persists longer than the period of the phasic DA response (Gonon, 1997; Durstewitz, Seamans, & Sejnowski, 2000). The combined effect of D1 and D2 receptors on the caudate firing rates may further help shape the nature of the three response types (Gerfen, 1992; Cepeda et al., 1993; Nicola et al., 2000). How the two DA modulations, on the synaptic efficacy and on the response property, are integrated remains to be investigated. The dynamical aspect of the interaction between these modulations, possibly with the regulation of different DA receptors, is of particular interest because their timescales are presumably different. Finally, in our model, we treated only the current reinforcement signal, not delayed ones. To enjoy the full power of reinforcement learning, it is important to extend our model to include delayed reinforcement signals (Barto, 1995; Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996; Berns & Sejnowski, 1998; Schultz, Dayan, & Montague, 1997; Trappenberg, Nakahara, & Hikosaka, 1998; Nakahara, Trappenberg, Hikosaka, Kawagoe, & Takikawa, 1998; Monchi
& Taylor, 1999; Hikosaka et al., 1999). For example, it is possible to maintain the same response property in our model even when the magnitude of α is reduced, provided that the receptive region R stays the same, with K(x) > 0 for x ∈ R and K(x) < 0 for x ∉ R (see section 2.2). This kind of property is useful in transferring the effect of a delayed reinforcement signal to a preceding stimulus. We are currently investigating this issue.

Appendix

A.1 Proof of Theorem 2. Recall that the order of the 1DR blocks is randomized in the experiment. Provided that the direction of x1 is the most preferred direction of the neuron, we need to consider how the behaviors of a neuron change in the three cases of transition of reward directions over 1DR blocks: (1) from the 1DR block where the reward direction is x1 to another, say x2; (2) from the 1DR block where the reward direction is xj (j = 2, 3, 4), say x2, to another 1DR block where the reward direction is not x1, say x3; and (3) from the 1DR block where the reward direction is not x1, say x3, to the 1DR block where the reward direction is x1. Note also that, as in the proof of theorem 1, we assume that the neuron is excited by x1 initially. We begin with the condition guaranteeing that the neuron is responsive to only x1 at the equilibrium. In this case, the receptive region is R1 = {x1}, and hence pR1 = 1/4 and wR1 = x1. In equation 2.12, we require K(x1) > 0 and K(xi) ≤ 0 (i ≠ 1), which yield M ≤ λ < (M + N1)(1 + α′). In the equilibrium states of the other three 1DR blocks (here, we treat the case where x2 is the reward direction), the receptive region should be R2 = {x1, x2}. In this case, pR2 = 1/2 and wR2 = (1/2)(x1 + x2). We require K(xi) > 0 (i = 1, 2) and K(xi) ≤ 0 (i = 3, 4), which leads to the following condition:
M ≤ λ < min{ M + N1/2, M + N2/2 } (1 + α′).
As for the condition guaranteeing the transition from R = {x1} to R = {x1, x2} as the reward direction changes, the neuron should respond to x2 in addition to x1 as the reinforcement signal is accompanied by x2. Hence, we need to have wR1 · x2 − λ/(1 + α) > 0, which is equivalent to λ < M(1 + α). For case 2, the neuron has to respond to x3 at the beginning of the next block, that is, wR2 · x3 − λ/(1 + α) > 0, which is also equivalent to λ < M(1 + α). When this condition is satisfied, the neuron responds to all of x1, x2, x3 when a new 1DR block starts. However, R3 = {x1, x2, x3} is not stable under a certain condition, as stated in the following. In this case, competition should occur among these three inputs to let the neural response to the nonrewarded input x2 become silent. For R3 = {x1, x2, x3}, pR3 = 3/4 and wR3 = (1/3)(x1 + x2 + x3). Therefore, the response
to x2 becomes silent when wR3 · x2 − λ ≤ 0, or equivalently,

M + N2/3 ≤ λ.
In case 3, the neuron responds to both x1 and x3 in the first block, and in the next block, the neuron changes to respond only to x1. To ensure this requirement, we impose the condition under which the equilibria of R4 = {x1, x3} in the next block are not stable, which is written as

M + N3/2 ≤ λ.
By summing up all these conditions and taking all combinations of 1DR blocks into account, we obtain

M + N′max/2 ≤ λ < min{ M(1 + α), (M + N′min/2)(1 + α′), M + N1/2 },

where N′max = max_{i∈I′} Ni, N′min = min_{i∈I′} Ni, and I′ = {2, 3, 4}.
A.2 Proof of Theorem 3. It is sufficient to consider the condition for equilibrium states in one 1DR block and for the transition from one block to another. Suppose that the reward direction is x1. A reverse-type neuron should have R = {x2, x3, x4} as the equilibrium. This implies pR = 3/4 and wR = (1/3)(x2 + x3 + x4). In equation 2.12, we further require K(x1) ≤ 0 and K(xi) > 0 (i ≠ 1), which are together equivalently rewritten as

M(1 + α′) ≤ λ < M + Ni/3, where i = 2, 3, 4.
At the start of the next block (where the reward direction is, say, x2), we first require that the neuron start to respond to x1, to which the reinforcement signal α is no longer given. This requires wR · x1 − λ > 0, or equivalently, λ < M. Provided that this condition is satisfied, the neuron responds to all of the directions in this block. Hence, we impose another condition under which R = {x1, x2, x3, x4} is not a stable equilibrium, so that the neuron stops firing to the input x2. This condition can be rewritten, given pR = 1 and wR = (1/4)(x1 + x2 + x3 + x4), as

wR · x2 − λ/(1 + α′) ≤ 0,

or equivalently,

(M + N2/4)(1 + α′) ≤ λ.
Since the equilibrium condition of this block is the same as that of the first block, summing up the above conditions leads to the theorem.
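The inner-product algebra used repeatedly in these proofs, for example wR · x2 = M + N2/4 when wR = (1/4)(x1 + x2 + x3 + x4), can be checked numerically. The explicit construction below (a shared component plus orthogonal private components) is our own assumption; it realizes xi · xj = M (i ≠ j) and |xi|² = M + Ni.

```python
import math

M = 0.5
N = [0.1, 0.2, 0.3, 0.4]  # illustrative values of our own choosing

def x(i):
    # xi = sqrt(M) e0 + sqrt(Ni) e_{i+1} in R^5 gives the required
    # inner products: xi . xj = M for i != j and xi . xi = M + Ni.
    v = [math.sqrt(M), 0.0, 0.0, 0.0, 0.0]
    v[i + 1] = math.sqrt(N[i])
    return v

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Mean weight over the full receptive region R = {x1, x2, x3, x4}.
wR = [sum(x(i)[k] for i in range(4)) / 4.0 for k in range(5)]

# wR . x2 = M + N2/4, as used in the proof of theorem 3.
print(abs(dot(wR, x(1)) - (M + N[1] / 4.0)) < 1e-12)  # prints True
```

The same construction verifies the other equilibrium weights, for example wR3 · x2 = M + N2/3 for wR3 = (1/3)(x1 + x2 + x3).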
Acknowledgments

We thank R. Kawagoe and Y. Takikawa for providing us with experimental details and H. Itoh for his technical assistance. H. N. is supported by grant-in-aid 13210154 from the Ministry of Education, Science, Sports and Culture.
References

Alexander, G. E., Crutcher, M. D., & DeLong, M. R. (1990). Basal ganglia-thalamocortical circuits: Parallel substrates for motor, oculomotor, "prefrontal" and "limbic" functions. In H. B. M. Uylings, C. G. Van Eden, J. P. C. De Bruin, M. A. Corner, & M. G. P. Feenstra (Eds.), Progress in brain research (Vol. 85, pp. 119–146). Amsterdam: Elsevier.
Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185.
Amari, S. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364.
Amari, S. (1983). Field theory of self-organizing neural nets. IEEE Transactions on Systems, Man and Cybernetics, SMC-13(9 & 10), 741–748.
Amari, S., & Takeuchi, A. (1978). Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 29, 127–136.
Aosaki, T., Tsubokawa, H., Ishida, A., Watanabe, K., Graybiel, A. M., & Kimura, M. (1994). Responses of tonically active neurons in the primate's striatum undergo systematic changes during behavioral sensorimotor conditioning. Journal of Neuroscience, 6, 3969–3984.
Bakin, J. S., & Weinberger, N. M. (1996). Induction of a physiological memory in the cerebral cortex by stimulation of the nucleus basalis. Proceedings of the National Academy of Sciences, 93, 11219–11224.
Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuronlike computing elements. Human Neurobiology, 4, 229–256.
Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 215–232). Cambridge, MA: MIT Press.
Barto, A. G., Sutton, R. S., & Brouwer, P. S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40, 201–211.
Bennett, B. D., & Bolam, J. P. (1994). Synaptic input and output of parvalbumin-immunoreactive neurons in the neostriatum of the rat. Neuroscience, 62, 707–719.
Berns, G.
S., & Sejnowski, T. J. (1998). A computational model of how the basal ganglia produce sequences. Journal of Cognitive Neuroscience, 10(1), 108–121.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Buonomano, D. V., & Merzenich, M. M. (1998). Cortical plasticity: From synapses to maps. Annual Review of Neuroscience, 21, 149–186.
Calabresi, P., Maj, R., Pisani, A., Mercuri, N. B., & Bernardi, G. (1992). Long-term synaptic depression in the striatum: Physiological and pharmacological characterization. Journal of Neuroscience, 12, 4224–4233.
Cepeda, C., Buchwald, N. A., & Levine, M. S. (1993). Neuromodulatory actions of dopamine in the neostriatum are dependent upon the excitatory amino acid receptor subtypes activated. Proceedings of the National Academy of Sciences of the United States of America, 90, 9576–9580.
Dayan, P. (1991). Navigating through temporal difference. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 464–470). San Mateo, CA: Morgan Kaufmann.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. Journal of Neurophysiology, 83, 1733–1750.
Fagiolini, M., & Hensch, T. K. (2000). Inhibitory threshold for critical-period activation in primary visual cortex. Nature, 404(6774), 183–186.
Flaherty, A. W., & Graybiel, A. M. (1994). Input-output organization of the sensorimotor striatum in the squirrel monkey. Journal of Neuroscience, 2, 599–610.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Gerfen, C. R. (1992). The neostriatal mosaic: Multiple levels of compartmental organization. Trends in Neurosciences, 15(4), 133–139.
Gonon, F. (1997). Prolonged and extrasynaptic excitatory action of dopamine mediated by D1 receptors in the rat striatum in vivo. Journal of Neuroscience, 17, 5972–5978.
Graybiel, A. M. (1995). Building action repertoires: Memory and learning functions of the basal ganglia. Current Opinion in Neurobiology, 5, 733–741.
Graybiel, A. M., Aosaki, T., Flaherty, A., & Kimura, M. (1994). The basal ganglia and adaptive motor control. Science, 265, 1826–1831.
Groves, P. M. (1983).
A theory of the functional organization of the neostriatum and the neostriatal control of voluntary movement. Brain Research, 286(2), 109–132.
Gruber, A. (2000). A computational study of D1 induced modulation of medium spiny neuron response properties. Unpublished master's thesis, Northwestern University.
Hensch, T. K., Fagiolini, M., Mataga, N., Stryker, M. P., Baekkeskov, S., & Kash, S. F. (1998). Local GABA circuit control of experience-dependent plasticity in developing visual cortex. Science, 282(5393), 1504–1508.
Hikosaka, O., Nakahara, H., Rand, M. K., Sakai, K., Lu, X., Nakamura, K., Miyachi, S., & Doya, K. (1999). Parallel neural networks for learning sequential procedures. Trends in Neurosciences, 22(10), 464–471.
Hikosaka, O., Sakamoto, M., & Usui, S. (1989). Functional properties of monkey caudate neurons. II. Visual and auditory responses. Journal of Neurophysiology, 61, 799–813.
Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Jaeger, D., Kita, H., & Wilson, C. J. (1994). Surround inhibition among projection neurons is weak or nonexistent in the rat neostriatum. Journal of Neurophysiology, 72(5), 2555–2558.
Kano, M. (1995). Plasticity of inhibitory synapses in the brain: A possible memory mechanism that has been overlooked. Neuroscience Research, 21, 177–182.
Kano, M., Rexhausen, U., Dreessen, J., & Konnerth, A. (1992). Synaptic excitation produces a long-lasting rebound potentiation of inhibitory synaptic signals in cerebellar Purkinje cells. Nature, 356, 601–604.
Kawagoe, R., Takikawa, Y., & Hikosaka, O. (1998). Expectation of reward modulates cognitive signals in the basal ganglia. Nature Neuroscience, 1(5), 411–416.
Kawagoe, R., Takikawa, Y., & Hikosaka, O. (1999). Change in reward-predicting activity of monkey dopamine neurons: Short-term plasticity. Society for Neuroscience Abstracts, 25, 1162.
Kawaguchi, Y., Wilson, C. J., Augood, S. J., & Emson, P. C. (1995). Striatal interneurones: Chemical, physiological and morphological characterization. Trends in Neurosciences, 18(12), 527–535.
Kilgard, M. P., & Merzenich, M. M. (1998). Cortical map reorganization enabled by nucleus basalis activity. Science, 279, 1714–1718.
Kincaid, A. E., Zheng, T., & Wilson, C. J. (1998). Connectivity and convergence of single corticostriatal axons. Journal of Neuroscience, 18(12), 4722–4731.
Kirkwood, A., Rioult, M. C., & Bear, M. F. (1996). Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381(6582), 526–528.
Kita, K. (1993). GABAergic circuits of the striatum. In G. W. Arbuthnott & P. C. Emson (Eds.), Chemical signalling in the basal ganglia (pp. 51–72). Amsterdam: Elsevier.
Knopman, D., & Nissen, M. J. (1991). Procedural learning is impaired in Huntington's disease: Evidence from the serial reaction time task. Neuropsychologia, 29(3), 245–254.
Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in humans. Science, 273, 1399–1402.
Komatsu, Y. (1996). GABAb receptors, monoamine receptors, and postsynaptic inositol trisphosphate-induced CA2+ release are involved in the induction of long-term potentiation at visual cortical inhibitory synapses. Journal of Neuroscience, 16(20), 6342–6352. Komatsu, Y., & Iwakiri, M. (1993).Long-term modication of inhibitory synaptic transmission in developing visual cortex. NeuroReport, 4(7), 907–910. Ko Âos, T., & Tepper, J. M. (1999). Inhibitory control of neostriatal projection neurons by GABAergic interneurons. Nature Neuroscience, 2(5), 467–472. LeVay, S., Stryker, M. P., & Shatz, C. J. (1978). Ocular dominance columns and their development in layer IV of the cat’s visual cortex: A quantitative study. Journal of Comparative Neurology, 179(1), 223–244. Marsden, C. D. (1980). The enigma of the basal ganglia and movement. Trends in Neuroscience, pp. 284–287. Monchi, O., & Taylor, J. G. (1999).A hard wired model of coupled frontal working memories for various tasks. Information Sciences, 113(3), 221–243.
Self-Organization in the Basal Ganglia
843
Montague, R., Dayan, P., & Sejnowski, T. J. (1996). Framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947. Nakahara, H., Doya, K., & Hikosaka, O. (2001). Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuo-motor sequences—a computational approach. Journal of Cognitive Neuroscience, 13: 5, 626–647. Nakahara, H., Trappenberg, T., Hikosaka, O., Kawagoe, R., & Takikawa, Y. (1998). Computational analysis on reward-modulated activities of caudate neurons. Society for Neuroscience Abstracts, 28, 1651. Nicola, S. M., Surmeier, J., & Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23, 185–215. Nusser, Z., Hajos, N., Somogyi, P., & Mody, I. (1998).Increased number of synaptic GABA(A) receptors underlies potentiation at hippocampal inhibitory synapses. Nature, 395(6698), 172–177. Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proceedings of the National Academy of Sciences, 87(21), 8345–8349. Oorschot, D. E. (1996). Total number of neurons in the neostriatal, pallidal, subthalamic, and substantia nigral nuclei of the rat basal ganglia: A stereological study using the cavalieri and optical disector methods. Journal of Comparative Neurology, 366, 580–599. Parthasarathy, H. B., Schall, J. D., & Graybiel, A. M. (1992). Distributed but convergent ordering of corticostriatal projections: Analysis of the frontal eye eld and the supplementary eye eld in the macaque monkey. Journal of Neuroscience, 12, 4468–4488. Redgrave, P., Prescott, T. J., & Gurney, K. (1999). Is the short-latency dopamine response too short to signal reward error? Trends in Neuroscience, 22(4), 146– 151. Reynolds, J. N. J., & Wickens, J. R. (2000). 
Substantia nigra dopamine regulates synaptic plasticity and membrane potential uctuations in the rat neostriatum, in vivo. Neuroscience, 99, 199–203. Sachdev, R. N. S., Lu, S.-M., Wiley, R. G., & Ebner, F. F. (1998). Role of the basal forebrain cholinergic projection in somatosensory cortical plasticity. Journal of Neurophysiology, 79, 3216–3228. Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27. Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13(3), 900– 913. Schultz, W., Apicella, P., Romo, R., & Scarnati, E. (1995). Context-dependent activity in primate striatum reecting past and future behavioral events. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 11–27). Cambridge, MA: MIT Press. Schultz, W., Dayan, P., & Montague, R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
844
H. Nakahara, S. Amari, and O. Hikosaka
Schultz, W., Romo, R., Ljungberg, T., Mirenowicz, J., Hollerman, J. R., & Dickinson, A. (1995). Reward-related signals carried by dopamine neurons. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 231–248). Cambridge, MA: MIT Press. Surmeier, D. J., Song, W.-J., & Yan, Z. (1996). Coordinated expression of dopamine receptors in neostriatal medium spiny neurons. Journal of Neuroscience, 16, 6579–6591. Takeuchi, A., & Amari, S. (1979). Formation of topographic maps and columnar microstructures. Biological Cybernetics, 35, 63–72. Takikawa, Y., Kawagoe, R., & Hikosaka (2001). Reward-dependent spatial selection of anticipatory activity in monkey caudate neurons. Manuscript submitted for publication. Trappenberg, T., Nakahara, H., & Hikosaka, H. (1998). Modeling reward dependent activity pattern of caudate neurons. In International Conference on Articial Neural Network (ICANN98) (pp. 973–978). Sk¨ovde, Sweden. Tunstall, M. J., Kean, A., Wickens, J. R., & Oorschot, D. (2001). Inhibitory interaction between spiny projection neurons of the striatum: a physiological and anatomical study. In Abstracts for International Basal Ganglia Society VIIth International Triennial Meeting (p. 38). Waitangi, New Zealand. von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100. Wickens, J. (1993). A theory of the striatum. Oxford: Pergamon Press. Wickens, J. (1997). Basal ganglia: Structure and computations. Network: Computation in Neural Systems, 8, R77–R109. Wickens, J., & Kotter, ¨ R. (1995). Cellular models of reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 187–214). Cambridge, MA: MIT Press. Wiesel, T. N., & Hubel, D. H. (1963). Single-cell responses in striate cortex of kittens deprived of vision in one eye. Journal of Neurophysiology, 26, 1003– 1017. Willshaw, D. 
J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philos. Trans. R. Soc. Lond. B. Biol. Sci., 287(1021), 203–243. Wilson, C. J. (1998). Basal ganglia. In G. M. Shepherd (Ed.), The synaptic organization of the brain (4th ed., pp. 279–316). Oxford: Oxford University Press. Received August 30, 2000; accepted June 25, 2001.
LETTER
Communicated by Richard Zemel
Population Computation of Vectorial Transformations Pierre Baraduc [email protected] Emmanuel Guigon [email protected] INSERM U483, Université Pierre et Marie Curie, 75005 Paris, France
Many neurons of the central nervous system are broadly tuned to some sensory or motor variables. This property allows one to assign to each neuron a preferred attribute (PA). The width of tuning curves and the distribution of PAs in a population of neurons tuned to a given variable define the collective behavior of the population. In this article, we study the relationship between the nature of the tuning curves, the distribution of PAs, and the computational properties of linear neuronal populations. We show that noise-resistant distributed linear algebraic processing and learning can be implemented by a population of cosine tuned neurons assuming a nonuniform but regular distribution of PAs. We extend these results analytically to the noncosine tuning and uniform distribution case and show with a numerical simulation that the results remain valid for a nonuniform regular distribution of PAs for broad noncosine tuning curves. These observations provide a theoretical basis for modeling general nonlinear sensorimotor transformations as sets of local linearized representations.
1 Introduction
Many problems of the nervous system can be cast in terms of linear algebraic calculus. For instance, changing the frame of reference of a vector is an elementary linear operation in the process of coordinate transformations for posture and movement (Soechting & Flanders, 1992; Redding & Wallace, 1997). More generally, coordinate transformations are nonlinear operations that can be linearized locally (Jacobian) and become a simpler linear problem (see the discussion in Bullock, Grossberg, & Guenther, 1993). Vectorial calculus is also explicitly or implicitly used in models of sensorimotor transformations for reaching and navigation (Grossberg & Kuperstein, 1989; Burnod, Grandguillaume, Otto, Ferraina, Johnson, & Caminiti, 1992; Touretzky, Redish, & Wan, 1993; Redish & Touretzky, 1994; Georgopoulos, 1996). Neural Computation 14, 845–871 (2002)
© 2002 Massachusetts Institute of Technology
Although linear processing is only a rough approximation of generally nonlinear computations in the nervous system, it is worth studying for at least two reasons (Baldi & Hornik, 1995): (1) it displays an unexpected wealth of behaviors, and (2) a thorough understanding of the linear regime is necessary to tackle nonlinear cases for which general properties are difficult to derive analytically. The problem of the neural representation of vectorial calculus can be expressed in terms of two spaces: a low-dimensional space corresponding to the physical space of the task and a high-dimensional space defined by activities in a population of neurons (termed neuronal space) (Hinton, 1992; Zemel & Hinton, 1995). In this framework, a desired operation in the physical space (vectorial transformation) is translated into a corresponding operation in the neuronal space, the result of which can be taken back into the original space for interpretation. The goal of this article is to describe a set of mathematical properties of neural information processing that guarantee appropriate calculation of vectorial transformations by populations of neurons (i.e., that computations in the physical and neuronal spaces are equivalent). An appropriate solution relies on three mechanisms: a decoding-encoding method that translates information between the spaces, a mechanism that favors the stability of operations in the neuronal space, and an unsupervised learning algorithm that builds neuronal representations of physical objects. We will show that these mechanisms are closely related to common properties of neural computation and learning: the distribution of tuning selectivities in the population of neurons and the width of the tuning curves, the pattern of lateral connections between the neurons, and the distribution of input-output patterns used to build synaptic interactions between neuronal populations.
In this article, we present a theory that unifies these three mechanisms and properties (generally considered separately; Mussa-Ivaldi, 1988; Sanger, 1994; but see Zhang, 1996; Pouget, Zhang, Deneve, & Latham, 1998) into a unique mathematical framework based on the neuronal population vector (PV; Georgopoulos, Kettner, & Schwartz, 1988) in order to explain how neuronal populations can perform vectorial calculus. In contrast to our extensive knowledge of the representation of information by populations of tuned neurons, little attention has been devoted to the learning processes in these populations. Here we show how Hebbian and unsupervised error-correcting rules can be used in association with lateral connections to allow the learning of linear maps on the basis of input-output correlations provided by the environment. In this context, we reveal a trade-off between the width of the tuning curves and the uniformity of the distribution of preferred directions. Finally, a statistical approach validates our hypotheses in realistic networks of a few thousand noisy neurons. A particular application of this theoretical framework is the computing of distributed representations of transpose or inverse Jacobian matrices which play a central role in kinematic and dynamic transformations (Hinton, 1984;
Mussa-Ivaldi, Morasso, & Zaccaria, 1988; Crowe, Porrill, & Prescott, 1998). Recent results highlight the relevance of this theory to the understanding of the elaboration of directional motor commands for reaching movements (Baraduc, Guigon, & Burnod, 1999).

2 Notations and Definition
In the following text, we consider a population of N neurons. We note ℰ = R^N the neuronal space and E = R^D (typically D = 2, 3) the physical space. Lowercase letters (e.g., x) are vectors of the neuronal space. Uppercase letters (e.g., X) are vectors of the physical space. Matrices are indicated by uppercase bold letters: roman for E (e.g., M), calligraphic for ℰ (e.g., ℳ), and italic for D × N matrices (e.g., E). A dot (·) stands for the dot product in E or ℰ. Each neuron j is tuned to a D-dimensional vectorial parameter; that is, it has a preferred attribute in E, which is noted Ej, and its firing rate is given by

xj = fj(X · Ej, bj),  (2.1)

where fj is the tuning function of the neuron, X a unit vector of the physical space (Georgopoulos, Schwartz, & Kettner, 1986), and bj a vector of parameters. The assumption is made that the distributions of these parameters and the distribution of PAs are independent (Georgopoulos et al., 1988). In the particular case of cosine tuning, the firing rate of neuron j is

xj = X · Ej + bj,  (2.2)

where bj is the mean firing rate of the neuron (Georgopoulos et al., 1986). We note E the D × N matrix of vectors Ej. The PAs are considered either as a set of fixed vectors or as realizations of a random variable with a given distribution P_E (in this latter case, index i is removed). The mean is denoted by ⟨·⟩ and the variance by V.

3 Cosine Tuning and Vectorial Processing in Neural Networks
As a simple case of distributed computation, in this section, we derive conditions that are sufficient to represent and learn vectorial transformations between populations of cosine-tuned neurons. The case of other tuning functions will be treated later (section 4) in the light of this approach.

3.1 Encoding-Decoding Method: Distributed Representation of Vectors. Here we address the representation of vectors by distributed activity patterns in populations of cosine-tuned neurons. We show that a condition on the distribution of preferred attributes is sufficient to faithfully recover information from the activity of the population. This condition is mathematically exact for populations of infinite size but still leads to accurate representations for populations of biologically reasonable size (e.g., > 10³).
The firing frequency of the population in response to the presentation of a vector X of the physical space is x = EᵀX + b, where x and b are vectors in ℰ (equation 2.2 written in matrix notation). Based on some hypotheses on E and b, the vector X can be decoded by computing a population vector (Georgopoulos et al., 1988; Mussa-Ivaldi, 1988; Sanger, 1994). The population vector can be defined by

X* = (1/N) Σ_i (xi − bi) Ei = (1/N) E(x − b).

A perfect reconstruction (X* ∝ X) is obtained if the PAs are such that (Mussa-Ivaldi, 1988; Sanger, 1994)

E Eᵀ ∝ I_D,  (3.1)

where I_D is the D × D identity matrix. In a population of neurons, the offset b could be deduced from the activity of the network over a sufficiently long period of time and subtracted via an inhibition mechanism (e.g., global inhibition if all neurons have the same mean firing rate). However, we will consider here the general case:

X* = Q X + (1/N) E b,

where Q = (1/N) E Eᵀ. We make the assumption that the components of the PAs have zero mean, are uncorrelated, and have equal variance σ_E² (regularity condition). From our hypothesis, mean firing rates b are independent of the distribution of PAs. Then Q converges in probability toward σ_E² I_D (see section A.1). Using similar arguments, we can demonstrate that (1/N) E b converges in probability toward 0. In the following, we call a family of tuning properties {Ei, bi} that satisfies the regularity condition a regular basis. We use the term basis to indicate that a regular family can be used as a basis. However, it is not a basis in a mathematical sense. If X ∈ E, x = EᵀX + b is called the distributed representation of X, or simply a population code.

3.1.1 Finite N. The preceding equalities hold only in the limit N → +∞. To ascertain whether the proposed computational scheme has any relevance to biology, we need to quantify the distortions introduced when populations of finite size are used. Without loss of generality, we can suppose the input to be X = (1, 0, . . . , 0). The variance of the decoded output, normalized by 1/σ_E², is in this case
V(Ex / (Nσ_E²)) = (1/(N²σ_E⁴)) V(E Eᵀ X) = (δ²/σ_E⁴, 1, . . . , 1) / N,

where δ² = V(E_{i1}²).
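The encoding-decoding scheme above can be sketched numerically. The following is an illustrative simulation under our own assumptions (unit-norm PAs drawn uniformly on the three-dimensional sphere, arbitrary mean firing rates); the sizes and variable names are ours, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3

# Preferred attributes: uniform on the 3D sphere. This is a regular basis:
# components have zero mean, are uncorrelated, and have equal variance.
E = rng.standard_normal((D, N))
E /= np.linalg.norm(E, axis=0)

b = rng.uniform(5.0, 15.0, N)        # mean firing rates, independent of the PAs

X = np.array([1.0, 0.0, 0.0])        # physical vector to encode

x = E.T @ X + b                      # population code, x_j = X . E_j + b_j
X_star = E @ (x - b) / N             # population vector decoding

# X_star is proportional to X; compare directions.
X_hat = X_star / np.linalg.norm(X_star)
angle_deg = np.degrees(np.arccos(np.clip(X_hat @ X, -1.0, 1.0)))
print(angle_deg)                     # small angular error for N = 1000
```

With N = 1000 the decoded direction deviates from X by only a small angle, in line with the finite-N analysis of this section.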
Is this variance small enough in practice? For a uniform distribution of PAs on a three-dimensional sphere, δ²/σ_E⁴ = 13/5, which results in an angular variance of less than 0.55 degree for N = 1000. For a distribution of PAs (of the same norm) clustered along the axes, δ²/σ_E⁴ = 2; hence, an angular variance of less than 0.48 degree if N = 1000. This suggests that this encoding scheme is reasonably accurate with small populations of neurons. The regularity condition thus guarantees that encoded information can be recovered from actual firing rates with an arbitrary precision in a sufficiently large population of neurons. The regularity condition includes a zero-mean assumption for PA components, which is not used in Sanger (1994). Any departure from this requirement translates the output vectors by a constant amount, which needs to be small in practice. The zero-mean assumption is not a major constraint, since most experimentally measured distributions of selectivity are roughly symmetrical (see section 6). In this sense, our definition of regularity is more general than the previous ones (Mussa-Ivaldi, 1988; Sanger, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998), as it allows a proper probabilistic treatment when the mean firing rate is nonzero.

3.2 Distributed Representation of Linear Transformations. The preceding section has shown how a correspondence can be established between vectors in external space and population activity. In this section, we extend this correspondence to linear mappings and define the notion of input and output preferred attributes. Consider a linear map from E to F, which are real physical vectorial spaces. Let M be its matrix on the canonical bases and E, F be regular bases in ℰ (N_E neurons) and ℱ (N_F neurons), respectively. We define
ℳ = (1/(N_E N_F)) Fᵀ M E  (3.2)
as the matrix of the distributed linear map. In the limit N_E, N_F → +∞, and assuming that σ_E = σ_F = 1, we have Q_E = Q_F = I_D. Then ℳ operates on the distributed representations as M does in the original space. Let x be the distributed representation of a vector X ∈ E (i.e., x = EᵀX). Taking Y = M X, we have ℳx = Fᵀ M E Eᵀ X = Fᵀ(M X) = FᵀY. Thus, y = ℳx is the distributed representation of Y. If we assume that the vectorial input (resp. output) is represented by the collective activity of a population of neurons xj (resp. yi), and that a weight matrix ℳ links the input and output layers, then the network realizes the transformation M on the distributed vectors. It is immediate that F ℳ Eᵀ = M. Thus, the distributed map can be read using the classical population vector analysis.
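Equation 3.2 and the readout F ℳ Eᵀ = M can be illustrated with a small simulation. This is a sketch under our own assumptions (gaussian PA components, so that σ_E = σ_F = 1, and an arbitrary map M of our choosing); with finite populations the identities hold only approximately:

```python
import numpy as np

rng = np.random.default_rng(1)
NE, NF, D = 2000, 2000, 3

# Regular bases: PA components with zero mean, uncorrelated, unit variance.
E = rng.standard_normal((D, NE))
F = rng.standard_normal((D, NF))

M = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])       # an arbitrary linear map (illustration)

Mdist = F.T @ M @ E / (NE * NF)        # distributed map (equation 3.2)

X = np.array([1.0, 2.0, -0.5])
x = E.T @ X                            # input population code (b = 0)
y = Mdist @ x                          # the operation in neuronal space
Y_star = F @ y                         # decoded output, approximates M X

M_read = F @ Mdist @ E.T               # population vector readout of the map
```

The readout F ℳ Eᵀ recovers M up to finite-size fluctuations, which is the sense in which computations in the physical and neuronal spaces are equivalent.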
3.2.1 Finite N_E and N_F. As in the preceding section, it must be checked whether this distributed computation is still precise enough in the case of finite populations. To answer this question while keeping the derivations simple, we assume N_E = N_F = N, σ_E = σ_F = σ, and take the identity mapping for the transformation M. The variance of the (normalized) decoded output Y/σ² writes in this case:

V(Fℐx / σ²) = V((1/(N²σ⁴)) F Fᵀ E Eᵀ X)
= (1/σ⁸) [V(Q_F) ⟨Q_E⟩² + ⟨Q_F⟩² V(Q_E) + V(Q_F) V(Q_E)] X²
= (1/N) [(δ²/σ⁴) {(1/D)(I_{D,D} − I_D) + I_D} + (1/D) I_{D,D} + · · ·] X²,
where I_{m,n} is an m × n matrix of ones and the ellipsis stands for terms dominated by 1/N². Here the notation Q² means the matrix of ij-component Q²ij. For D = 3 and N = 1000, in the case of a uniform distribution, the preceding equation translates into an angular variance of 0.84 degree; in the clustered case, the variance is 0.74 degree. Our scheme of distributed computation is thus viable with small populations of neurons. Consequently, in the following sections, derivations will be made for infinite populations with E Eᵀ = I_D, which allows us to write equalities instead of proportionalities. We will thereby ignore the σ²/N term, except in the study of the effect of noise (see section 3.3.2). We will also assume that b = 0, which makes proofs more straightforward. The general case is considered in section A.3.

3.2.2 Selectivities of Output Units. In a network that computes y = ℳx, how can one characterize the behavior of an output unit that fires with yi? This output unit i can be described by its intrinsic PA Fi in the output space F. However, this vector is independent of the mapping that occurs between the input and output spaces and thus does not fully define the role of the unit. In fact, two vectors can be associated with the output unit i. The first is the vector of E for which the unit is most active (input PA). Since unit i fires with input X as Fiᵀ M X, it is cosine tuned to the input, and its input PA is the column vector Mᵀ Fi. The second vector (output PA) is M† Fi, where M† is the Moore-Penrose inverse of M. In the case where the output layer is considered as a motor layer whose effects can be measured in the input space through M†, the output PA can be interpreted in an intuitive way. Indeed, the effect in sensory space of the isolated stimulation of unit i is precisely the vector M† Fi of E. Thus, the output PA corresponds to projective properties of the cell, while the input PA is related to receptive properties.
Note that in general, the input and output PAs of a unit do not coincide (Zhang et al., 1998).
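The distinction between the two PAs can be made concrete with a small numerical example (our own illustration; the map and the unit's intrinsic PA below are made up for this purpose):

```python
import numpy as np

M = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.0]])       # a 3D -> 2D map (illustrative)

Fi = np.array([1.0, 1.0])             # intrinsic PA of output unit i (made up)

input_pa = M.T @ Fi                   # input direction maximizing the unit's firing
output_pa = np.linalg.pinv(M) @ Fi    # effect in E of stimulating the unit alone

# The two vectors generally point in different directions, although the
# stimulation effect output_pa maps back onto Fi: M @ output_pa == Fi.
print(input_pa, output_pa)
```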
3.2.3 Weight and Activity Profiles. The distributed representation ℳ has interesting structural properties. The transpose of the ith row of ℳ is (Fiᵀ M E)ᵀ = Eᵀ(Mᵀ Fi) ∈ Im Eᵀ. In the same way, the jth column of ℳ is Fᵀ(M Ej) ∈ Im Fᵀ. Thus, the profile of the weight rows (resp. columns) is identical to the profile of the input (resp. output) activities. Later we will consider the case where the entries of the matrix ℳ are activities rather than static weights. We show below that "cosine" lateral connections between rows (EᵀE) and columns (FᵀF) stabilize population codes in ℰ and ℱ, respectively. Thus, lateral connections can help to build an exact matrix of activities from an underspecified initial state.

3.3 Neuronal Noise and Stabilization of Representations. Noise has a strong impact on population coding (Salinas & Abbott, 1994; Abbott & Dayan, 1999). Therefore, it is important to understand how noise affects the reliability of our computational scheme. We will consider here two forms of noise: additive gaussian and Poisson noise.
3.3.1 Additive Gaussian Noise. Assume that a gaussian noise g is added to the population code x. How does this noise affect the encoding-decoding scheme; that is, how large is the variance of the decoded quantity? If g is independently distributed, we can show that the variance of the extra term due to the noise (Eg/N) is proportional to 1/N (see section A.2). Conversely, if the additive noise is correlated among neurons, as seems to be the case in experimental preparations (Gawne & Richmond, 1993; Zohary, Shadlen, & Newsome, 1994), it is easy to demonstrate that

V(Eg/N) = (1 − c) σ_g² σ_E² / N,  (3.3)
where σ_g² is the variance and c the correlation coefficient of the noise. Thus, for this correlated noise as for the uncorrelated one, the variance of the encoded quantity decreases with a 1/N factor. Besides, the decoding error decreases as a function of c, as does the minimum unbiased decoding error (Abbott & Dayan, 1999). In fact, the correlations act to decrease the total entropy of the system. The 1/N reduction of variance demonstrated for additive gaussian noise no longer holds with multiplicative noise; in such a case, an active (nonlinear) mechanism of noise control may be needed.

3.3.2 Poisson Noise. In the case of an uncorrelated Poisson noise, V(gi) = xi. It is straightforward to show that the variance of the noise term is inferior to x_max/N, where x_max is the highest firing rate in the population (see section A.2). Thus, as for the gaussian noise, the variance decreases linearly with the number of neurons. Correlations in the noise alter this behavior, and the variance becomes dominated by a term independent of N. This term
852
Pierre Baraduc and Emmanuel Guigon
can be computed for a few special cases of PA distribution; for example, we have

V((Eg)i / N) ≤ 0.035 c x_max  (3.4)

for a uniform 3D distribution and

V((Eg)i / N) ≤ 0.22 c x_max  (3.5)
for PAs clustered along the 3D axes (see section A.2). A reduction of the variance in the correlated case is thus obtained through the distributed coding, even if scaling N does not result in any additional benefit. It can also be noted that uniform distributions of PAs seem more advantageous as far as noise issues are concerned. To sum up, for the two types of noise treated here, the variability in the decoded quantity is inferior to the variability affecting individual neurons. For gaussian or uncorrelated Poisson noise, using large populations of cells limits even more the noise in the decoded vectors, as noise amplitude depends on 1/√N. This is not the case with correlated Poisson noise, and more powerful nonlinear methods could be employed (see, e.g., Zhang, 1996; Pouget et al., 1998).

3.3.3 Stabilizing Distributed Representations. The reduction of the noise in the decoded vector shown in the preceding sections can inspire ways to limit the noise inside a population. We show here that filtering the population activity through the matrix W_E = EᵀE/N has this desirable effect. Before proving this fact, we first note that W_E is the distributed representation of I_D in ℰ (see equation 3.2). Matrix W_E is a projection of ℰ (Strang, 1988). If we note ℰ_p the image of ℰ by W_E, then ℰ_p is a D-dimensional subspace of ℰ. Elements of ℰ_p are population codes since they can be written Eᵀ(E x₀), x₀ ∈ ℰ. In fact, the operation of W_E is a decoding-reencoding process. As the variance of the decoded vector coordinates is inferior to the neuronal variance (preceding sections), we can expect from W_E good properties regarding noise control. To demonstrate them, we write W_E(x + g) = W_E x + EᵀE g/N. The term W_E x is in general different from x (except if x ∈ ℰ_p), but it preserves part of the information on x, since the population vectors of x and W_E x are the same.
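A short simulation can illustrate this filtering effect (our own sketch; the noise level, seed, and population size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 1000, 3

E = rng.standard_normal((D, N))      # regular basis, unit component variance
W = E.T @ E / N                      # lateral weights W_E = E^T E / N

X = np.array([0.5, -1.0, 2.0])
x = E.T @ X                          # noise-free population code

g = rng.normal(0.0, 1.0, N)          # additive gaussian noise, sigma_g = 1
x_noisy = x + g
x_filtered = W @ x_noisy             # one decode-reencode step

# The filtered noise W g has far smaller standard deviation than g itself
# (roughly a factor sqrt(D/N)), while the population vector of x is preserved.
print(np.std(W @ g), np.std(g))
```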
The variance of W_E g is the first diagonal of (EᵀE Q EᵀE)/N², where Q is the correlation matrix of the noise. Building on the results of the previous sections, it is easy to demonstrate that equations similar to equations 3.3, 3.4, and 3.5 apply. For additive gaussian noise, we find that

V(W_E g) = ((1 − c) σ_g² σ_E² / N) I_N.
Thus, the effect of W_E is to limit gaussian noise in the population. For Poisson noise, the formulas of section 3.3.2 generalize in the same way, leading to a decrease in the variance of the neuronal activity that is proportional to 1/N for uncorrelated noise and independent of N in the correlated case. Moreover, even if ⟨g⟩ ≠ 0, for independent noise, we get ⟨W_E g⟩ = 0 in the limit N → +∞. This property, due to the fact that W_E has balanced weights, can be used to sort out the relevant information from a superposition of uncorrelated codes. The matrix W_E can be viewed as a weight matrix, either of feedforward connections between two populations of N_E neurons or of lateral interactions inside a population, and extracts the population code of any input pattern in a single step. However, if W_E deviates slightly from the definition, it is no longer a projection, and iterations of W_E are likely to diverge or fade. A simple way to prevent divergence is to use a saturating nonlinearity (e.g., a sigmoid). A more realistic solution is to adjust the shape of a nonsaturating nonlinearity to guarantee a stable behavior (Yang & Dillon, 1994; Zhang, 1996). In particular, an appropriate scaling of the gain of the neurons (maximum of the derivative of the nonlinearity) to the largest eigenvalue of W_E leads to the existence of a Lyapunov function for continuous network dynamics. If the distribution of PAs is uniform, W_E is a circulant matrix (Davis, 1979). Iterations of a circulant matrix can extract the first Fourier component of the input, provided the first Fourier coefficient of the matrix is greater than 1 and all other coefficients are strictly less than 1 (Pouget et al., 1998). Here, W_E corresponds to the special case where the first Fourier coefficient is 1 and all others are zero. We could as well consider W_F = FᵀF as a matrix of recurrent connections on the output layer to suppress noise on this layer.

3.4 Learning Distributed Representations of Linear Transformations.
Up to now, we have demonstrated that a correspondence between external and neural spaces could be established and maintained. This correspondence permits a faithful neural representation of external vectors and mappings. It now remains to be shown whether a distributed representation of a linear mapping can be built from examples using a local synaptic learning rule. We prove below that it is indeed possible, provided the training examples satisfy a part of the regularity condition.

3.4.1 Hebbian Learning of Linear Mappings. Let M be a linear transformation and (X^ν, Y^ν = M X^ν) be training pairs in E × F, ν = 1, . . . , N_ex. Hebbian learning writes
ℳ*_ij ∝ Σ_{ν=1}^{N_ex} y_i^ν x_j^ν,
854
Pierre Baraduc and Emmanuel Guigon
where (x^ν, y^ν) are the distributed representations of the training samples. Then,

ℳ* ∝ Σ_ν Fᵀ Y^ν (X^ν)ᵀ E ∝ Fᵀ M (Σ_ν X^ν (X^ν)ᵀ) E ∝ Fᵀ M E

if the training examples satisfy

Σ_ν X^ν (X^ν)ᵀ ∝ I_dim E.  (3.6)
In this case, the matrix ℳ* is proportional to the required matrix. Thus, any distributed linear transformation can be learned modulo a scaling factor by Hebbian associations between input and output activities if the components of the training inputs are uncorrelated and have equal variances (zero mean is not required). In practice, to control for the weight divergence implied by standard Hebbian procedures, the following stochastic rule is used:
Δℳ*_ij ∝ (y_i^ν x_j^ν − ℳ*_ij).  (3.7)
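The stochastic rule of equation 3.7 can be simulated directly. In the sketch below (our own construction: random regular bases, an arbitrary 2D map, learning rate and iteration count chosen for illustration, and training inputs with uncorrelated unit-variance components, which satisfy equation 3.6 on average), the learned matrix converges toward Fᵀ M E up to a scaling factor:

```python
import numpy as np

rng = np.random.default_rng(3)
NE, NF, D = 300, 300, 2
eta = 0.02                                # learning rate (illustrative)

E = rng.standard_normal((D, NE))          # regular input basis
F = rng.standard_normal((D, NF))          # regular output basis
M = np.array([[2.0, 1.0],
              [-1.0, 0.5]])               # target linear map (illustrative)

Mhat = np.zeros((NF, NE))                 # learned distributed map
for _ in range(3000):
    X = rng.standard_normal(D)            # training input, <X X^T> = I
    x, y = E.T @ X, F.T @ (M @ X)         # distributed training pair
    Mhat += eta * (np.outer(y, x) - Mhat) # stochastic Hebbian rule (eq. 3.7)

# Read the learned map back into physical space; it is proportional to M,
# so rescale it to the norm of M before comparing.
M_read = F @ Mhat @ E.T
M_read *= np.linalg.norm(M) / np.linalg.norm(M_read)
print(M_read)
```

Because the inputs have zero mean and identity covariance, ⟨y xᵀ⟩ = Fᵀ M E, which is what the running average Mhat approaches.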
3.4.2 Nonregular Distribution of Examples and Tuning Properties. Regularity may be a restricting condition in some situations. Distributions of PAs are not necessarily regular, or it may not be possible to guarantee that training examples are regularly distributed. This latter case can occur when learning a (generally ill-defined) linear mapping from samples of its inverse (Kuperstein, 1988; Burnod et al., 1992; Bullock et al., 1993). We denote M₁ the inverse mapping. Training consists of choosing an output pattern y^ν, calculating the corresponding input pattern x^ν = Eᵀ M₁ F y^ν, and then using (x^ν, y^ν) as examples. If the y^ν are regular, Hebbian learning leads to the representation of M₁ᵀ but not M₁⁻¹ (or a generalized inverse if M₁ is singular or noninjective). An appropriate solution to this problem is obtained if the learning takes place only for the ℳij receiving maximal x input, that is, ℳ_{i jmax(ν)}, where jmax(ν) = arg maxⱼ x_j^ν. If the vectors x^ν have the same mean norm, we can assimilate x^ν whose largest coordinate is the jth to ej (the distributed representation of Ej). Then the jth column of ℳ writes

ℳ_{·j} = Σ_{Eᵀ M₁ F y^ν = ej} y^ν = Fᵀ ( Σ_{Y^ν ∈ M₁⁻¹(Ej)} Y^ν ).  (3.8)

It is clear that the latter sum is an element of M₁⁻¹(Ej). Section A.4 shows that when the Fi are regular, the sum converges toward M₁† Ej. The matrix ℳ is then exactly the distributed representation of the Moore-Penrose inverse of M₁. Informally, this winner-take-all learning rule works by equalizing
learning over input vectors, whatever their original distribution. In practice, a soft competitive approach can be used (e.g., to speed up the learning), but the proportion of winners must be kept low in the presence of strong anisotropies. It must be noted that this applies only if the vectors $x^\nu$ have the same norm on average. If this condition is not fulfilled, a correction by $1 / \|x^\nu\|^2$ must be applied. This rule developed in a Hebbian context naturally extends to the parameter-dependent case.

3.4.3 Learning Parameter-Dependent Linear Mappings. We now treat the more general case where a linear mapping depends on a parameter. Typically, such a mapping arises as a local linear approximation (Jacobian) of a nonlinear transformation (see Bullock et al., 1993). Consider a nonlinear mapping $y = \varphi(\theta)$ (e.g., $\varphi$ is the inverse kinematic transformation for an arm; $\theta$ are the cartesian coordinates of the arm end point and $y$ the joint angles). Linearization around $\theta_0$ gives $\dot{y} = M(\theta_0)\,\dot{\theta}$, $M$ being the Jacobian of $\varphi$. If the value $y_0 = \varphi(\theta_0)$ is given, the nonlinear mapping can be computed by incrementally updating $y$ with $\dot{y} = M \dot{\theta}$ along any path starting at $\theta_0$. Thus, the problem reduces to computing a parameter-dependent linear mapping, which can be written, using previous notations, as $Y = M(P) X$, where $P$ is a parameter. We denote by $\mathcal{P}$ the physical space of parameters and $P$ the space of the neuronal representation of parameters (e.g., $\mathcal{P}$ is the two-dimensional space of joint angles and $P$ can be a set of postural signals). A solution to this problem is to consider the coefficients $M_{ij}$ corresponding to the distributed representation of $M$ not as weights, but as activities of neurons modulated by the parameter $P \in \mathcal{P}$, and to assume a multiplicative interaction between $M_{ij}$ and $x_j$. In the simplest case where $P$ modulates the coefficients linearly, this can be written
$$y = M x \qquad \text{and} \qquad M = V p \quad \left(\text{i.e., } M_{ij} = \sum_k V_{ijk}\, p_k \right), \tag{3.9}$$
where $V$ is a set of weights defined over $E \times F \times P$ and $p \in P$. Then the mapping is learned by retaining for each neuron of layer $M$ the relationship between the input $p$ and the desired output $M^*_{ij} = \sum_\nu y^\nu_i x^\nu_j$. Thus, the weights $V$ can be obtained by
$$\Delta V_{ijk} \propto \left( y^\nu_i x^\nu_j - M_{ij} \right) p^\nu_k, \tag{3.10}$$
which is a stochastic error-correcting learning rule. Contrary to the standard delta rule, equation 3.10 does not require an external teacher, as the reference signal is computed internally. Moreover, the connectivity $V_{ijk}$ can be far from complete, as lateral connections between $M_{ij}$ units can help to form the desired activity profile (see section 3.3.3; Baraduc et al., 1999). Note that if
the parameter $P$ is coded by a population of cosine-tuned neurons (i.e., $p$ is a distributed representation of $P$), then equation 3.10 simplifies to a Hebbian rule:
$$\Delta V_{ijk} \propto y^\nu_i x^\nu_j\, p^\nu_k.$$
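To make this Hebbian special case concrete, here is a sketch that is entirely our own construction: the sizes, the uniformly spaced preferred phases of the parameter population, and the choice of a rotation by $\phi$ (a mapping that is linear in $P = (\cos\phi, \sin\phi)$) are illustrative assumptions. It accumulates $V_{ijk} = \sum_\nu y^\nu_i x^\nu_j p^\nu_k$ and then evaluates the parameter-dependent map at a new parameter value:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, D, Nex = 300, 16, 2, 1500     # neurons, parameter units, dim, examples
E = rng.standard_normal((N, D))     # input PAs
F = rng.standard_normal((N, D))     # output PAs
psi = np.linspace(0, 2 * np.pi, K, endpoint=False)
Ep = np.c_[np.cos(psi), np.sin(psi)]   # cosine-tuned parameter population

def M_of(phi):                      # physical mapping, linear in (cos, sin)
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s], [s, c]])

phis = rng.uniform(0, 2 * np.pi, Nex)
Xs = rng.standard_normal((Nex, D))
xs = Xs @ E.T                                                  # x^nu
ys = np.array([M_of(p) @ X for p, X in zip(phis, Xs)]) @ F.T   # y^nu
ps = np.c_[np.cos(phis), np.sin(phis)] @ Ep.T                  # p^nu

# Hebbian accumulation V_ijk = sum_nu y_i x_j p_k (reshaped matmul for speed)
G = (xs[:, :, None] * ps[:, None, :]).reshape(Nex, N * K)
V = (ys.T @ G).reshape(N, N, K)

# Evaluate at a new parameter value and input, then decode
phi, X = 0.8, np.array([1.0, -0.3])
Mij = V @ (Ep @ np.array([np.cos(phi), np.sin(phi)]))   # M_ij = sum_k V_ijk p_k
Y_hat = F.T @ (Mij @ (E @ X))
target = M_of(phi) @ X
cos = Y_hat @ target / (np.linalg.norm(Y_hat) * np.linalg.norm(target))
print(round(cos, 2))                # expected close to 1
```

Averaging over many $(\phi^\nu, X^\nu)$ pairs, the gain-field units $M_{ij} = \sum_k V_{ijk} p_k$ come to implement $M(P)$ up to a scaling factor, which is why the test compares directions only.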
In a more general case, the activities $M_{ij}$ can depend on $p$ via a perceptron or a multilayer perceptron. The learning rule, equation 3.10, can then be transformed to include a transfer function and possibly be the first step of an error backpropagation.

4 Generalization to Other Tuning Functions
It can be asked whether the mechanisms and properties of distributed computation proposed here depend on the specific cosine tuning that has been assumed (see equation 2.2). We now show that these results can be extended to a broad class of tuning functions (see equation 2.1), if we assume that the $E_i$ have a uniform distribution. Following Georgopoulos et al. (1988), we use a continuous formalism (see also Mussa-Ivaldi, 1988). Given the previous assumptions, the uniformity guarantees that the population vector points in the same direction as the encoded vector (Georgopoulos et al., 1988):
$$\iint f(X \cdot E, b)\, E \; dP_E\, dP_b = X. \tag{4.1}$$
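A discrete analogue of equation 4.1 is easy to verify. In the sketch below (our own assumptions: a 2D circle, uniformly spaced PAs, and an arbitrary exponential tuning function), whatever nonlinearity is applied to $X \cdot E$, the population vector of a uniform population stays aligned with the encoded vector:

```python
import numpy as np

N = 1000
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
E = np.c_[np.cos(theta), np.sin(theta)]   # uniform unit PAs on the circle

f = lambda u: np.exp(2.0 * u)             # an arbitrary nonlinear tuning of X.E

X = np.array([1.0, 0.4])
X /= np.linalg.norm(X)                    # encoded vectors taken as unit vectors
x = f(E @ X)                              # distributed representation
pv = E.T @ x / N                          # discrete version of eq. 4.1

cos = pv @ X / np.linalg.norm(pv)         # X is unit, so this is the cosine
print(round(cos, 6))                      # the direction is preserved
```

Only the direction is exact; the norm of the population vector depends on the shape of $f$, which is why the continuous statement is about pointing "in the same direction".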
The independence of $b$ and $E$ allows writing (Georgopoulos et al., 1988)
$$\iint f(X \cdot E, b)\, E \; dP_E\, dP_b = \int \left( \int f(X \cdot E, b)\, E \; dP_{E|b} \right) dP_b.$$
Thus, any demonstration made with constant $b$ can be easily generalized to varying $b$. Accordingly, we remove $b$ in the following calculations.

4.1 Encoding-Decoding Method.
4.1.1 Distributed Representation of Vectors. The distributed representation of a vector $X$ in $E$ is no more a vector but a function $x = x(E) = f(X \cdot E)$. According to our hypothesis, the vector $X$ can be recovered from its distributed representation $x$ (see equation 4.1). The dot product of the distributed representations of two vectors $X$ and $Z$ in $E$ is defined by
$$h(X, Z) = \int f(X \cdot E)\, f(Z \cdot E)\; dP_E. \tag{4.2}$$
We first observe that $h$ can be manipulated as a tuning function. A vector can be reconstructed from tuning curve functions (see equation 4.1), as well
as from $h$:
$$\int h(X, E)\, E\; dP_E = \iint f(X \cdot E')\, f(E \cdot E')\, E\; dP_E\, dP_{E'} = \int f(X \cdot E') \underbrace{\int f(E \cdot E')\, E\; dP_E}_{E'}\; dP_{E'} = X. \tag{4.3}$$
This property is immediate in the cosine case since $h = f =$ dot product. In the general case, it can be shown that $h(X, Z)$ is a function of $X \cdot Z$ and that if $f$ is nondecreasing, so is $h$ (see section A.5).

4.1.2 Distributed Representation of Linear Transformations. There is a theoretical form (no longer a matrix, but a function) for the distributed representation of a linear mapping $M$. It is defined by
$$M(E, F) = g(F^T M E) \qquad \text{and} \qquad y(F) = \int M(E, F)\, x(E)\; dP_E, \tag{4.4}$$
where $y(F)$ is the distributed output corresponding to the distributed input $x(E) = f(X \cdot E)$ of a physical vector $X$. This exact counterpart of the cosine case (see equation 2.2) is easily demonstrated by showing that $\int_F y(F)\, F\; dP_F = Y$, with $Y = M X$.

4.2 Stabilizing Distributed Representations. In the same way, there is a straightforward generalization of matrix $W_E$ (see section 3.3.3) defined by
$$W_E(E, E') = f(E \cdot E').$$
However, unlike the cosine case, these theoretical forms are not particularly useful since they are not in general similar to versions obtained by learning. Thus, in the following section, we derive and use Hebbian versions $M^*$ and $W^*_E$ of $M$ and $W_E$.

4.3 Learning Distributed Representation of Linear Transformations.
The learning rules for the fixed or the parameter-dependent mapping still apply. We use a continuous formalism for both tuning functions and training examples. A straightforward derivation proves that the distributed transformation can be learned as before through input-output correlations. It can be shown that the distributed map corresponding to a linear transformation $M$ between vectorial spaces $E$ and $F$ is represented by the function
$$M^*(E, F) = \int f(X^\nu \cdot E)\, g(M X^\nu \cdot F)\; dP_\nu, \tag{4.5}$$
where $P_\nu$ is the distribution of training examples (see appendix A.6). It can be seen that $M^*(E, F)$ is a function of $E \cdot F$ using the method developed for equation 4.2. Next we define $W^*_E$ as the distributed representation of the identity mapping on $E$ obtained by learning (see equation 4.5):
$$W^*_E(E, E') = \int f(X^\nu \cdot E)\, f(X^\nu \cdot E')\; dP_\nu.$$
From equation 4.2, we see that $W^*_E = h(E, E')$. The function $W^*_E$ can be used as a feedforward or lateral interaction function. Any input distribution $x(E)$ is transformed as
$$W^*_E\, x(E) = \int h(E, E')\, x(E')\; dP_{E'}. \tag{4.6}$$
If $x$ is the distributed representation of a vector $X$ of the physical space, it is immediately clear that
$$W^*_E\, x(E) = \int h(E, E')\, f(X \cdot E')\; dP_{E'},$$
which is a nondecreasing function of $X \cdot E$ (see the method in section A.5). $W^*_E$ modifies the profile of activity but changes neither the preferred attribute nor the center of mass of a population code. If $x$ is any distribution, the result of equation 4.6 depends on the shape of the dot product function (and thus on the tuning function, since the two are tightly related; see section 4.4). $W^*_E$ is a Fredholm integral operator with kernel $h$. If the kernel is degenerate (that is, if it can be written as a finite sum of basis functions, e.g., a Fourier series), $\operatorname{Im} W^*_E$ is a finite dimensional space generated by these functions. Thus, $W^*_E$ suppresses all other harmonics. The case of a cosine distribution of lateral interactions (see section 3) corresponds to a two-dimensional space generated by the cos and sin functions (and $W^*_E$ is a projection). A gaussian distribution of weights, which contains a few significant harmonics, is known empirically to suppress noise efficiently (Douglas, Koch, Mahowald, Martin, & Suarez, 1995; Salinas & Abbott, 1996). However, $W^*_E$ is not in general a projection, which is problematic if $W^*_E$ represents a transform through recurrent connections.
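A discrete sketch of this filtering behavior (all choices ours: ring size, von Mises kernel, noise level): on a ring of neurons, a symmetric kernel $h$ acts as a circulant matrix whose real eigenvalues rescale each Fourier harmonic of the activity profile, so higher harmonics are damped while the population-vector direction is untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)

# Symmetric interaction kernel h(E_i . E_j) = exp(K cos(theta_i - theta_j))
K = 3.0
W = np.exp(K * np.cos(theta[:, None] - theta[None, :])) / N

# Noisy population profile centered on direction alpha
alpha = 1.2
x = np.exp(2.0 * np.cos(theta - alpha)) + 0.5 * rng.standard_normal(N)
y = W @ x                                  # one pass through the interactions

pv_angle = lambda a: np.angle(np.sum(a * np.exp(1j * theta)))
spec = lambda a: np.abs(np.fft.fft(a))

print(pv_angle(y) - pv_angle(x))           # ~0: preferred attribute unchanged
print(spec(x)[8] / spec(x)[1], spec(y)[8] / spec(y)[1])  # 8th harmonic damped
```

Because the kernel here has all its Fourier coefficients nonzero, the operator is not a projection, which is exactly the difficulty noted above for recurrent use.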
Solutions in the discrete spatial case have been discussed (see section 3.3.3) and extend to this case (Zhang, 1996). In particular, the scaling of the largest eigenvalue is possible since $W^*_E$ has a largest eigenvalue, which is equal to $\|W^*_E\|$. After learning, the output neurons are not tuned to input vectors as they are during the learning phase; that is, the $g(M X \cdot F)$ are not their tuning functions. Indeed, the activity of an output neuron is
$$y(F) = \int_\nu h(X, X^\nu)\, g(M X^\nu \cdot F)\; dP_\nu, \tag{4.7}$$
which is generally not equal to $g(M X \cdot F)$. However, using the same reasoning as for equation 4.2 (see section A.5), we can show that $y(F) = \tilde{g}(M X \cdot F)$. It follows that the $y$ are still broadly tuned to $M X$; moreover, the PAs in input space keep the same expression $F^T M$ as in the cosine case.

4.4 Numerical Results for 2D Circular Normal Tuning Functions. Contrary to the cosine case, learning with tuning function $g$ leads to a different output tuning $\tilde{g}$. Is this change important? How similar are these two tuning curves? We illustrate here the differences among the intrinsic tuning functions ($f$ and $g$), the dot product ($h$), and the output tuning function ($\tilde{g}$), using circular normal (CN) tuning functions (Mardia, 1972) in $\mathbb{R}^2$. These functions have a profile similar to a gaussian while being periodic. Their general expression is $f(\cos\theta) = A e^{K \cos\theta} + B$. We used the following version for both input and output tuning,
$$f(u) = g(u) = \frac{e^{K u} - e^{-K}}{e^{K} - e^{-K}},$$
where $K$ controls the width at half-height. Thus, $f$ and $g$ take values between 0 and 1 if the coded vectors and the PAs are unit vectors. With these assumptions, $h = f * f$, where $*$ is the convolution, and thus their respective Fourier coefficients verify $\hat{h}_n = \hat{f}_n^2$. Interestingly, the distribution of the Fourier coefficients of CN functions is such that $h$, once normalized between 0 and 1, is very close to a broader CN function $h_{CN}$. In our numerical simulations, the relative error was
$$\frac{\| h_{normalized} - h_{CN} \|}{\| h_{normalized} \|} < 2\%,$$
where $\|h\|$ denotes the $L^2$-norm of $h$. However, the convolution leads to a widening of $h$ compared to $f$ (see Figure 1A), since it favors the largest Fourier coefficients, which happen to be the first for CN functions. This broadening effect is maximal for $f$ of width $\approx 110$ deg (see Figure 1B). Since $\tilde{g} = h * g$ (see equation 4.7), $\tilde{g}$ is still broader than $h$ (see Figure 1B). These results show that feedforward or recurrent neural processing preserves the general shape of intrinsic tuning functions but increases their width. After about two to five feedforward steps, the tuning of output neurons ($\tilde{g}$) is close to a cosine.

5 Deviations for Nonuniform Distributions of PAs
Figure 1: (A) Shape of the intrinsic tuning curve of input and output neurons ($f$, dotted line), the distributed dot product ($h$, gray line), and input tuning of output neurons ($\tilde{g}$, solid line). The width (at half-height) of $f$ was 60° ($K = 5.2$). The curves for $h$ and $\tilde{g}$ were constructed from the first 20 Fourier coefficients of $f$. (B) Width (at half-height) of $h$ (gray line) and $\tilde{g}$ (solid line) as a function of the width of $f$ ($K = .01$–$45$).

The preceding results on noncosine tuning curves were obtained for a uniform distribution of PAs, whereas a weaker constraint (regularity condition) was sufficient in the cosine case. Here we explore numerically to what extent the population computation can be accurate for a regular nonuniform distribution of PAs. In relation to electrophysiological data (Oyster & Barlow, 1967; Lacquaniti, Guigon, Bianchi, Ferraina, & Caminiti, 1995; Wylie, Bischof, & Frost, 1998), such a distribution was assumed clustered along preferred axes (here in 2D). To express the clustering along the axis $\theta = 0$, the probability density of a vector $E = (\cos\theta, \sin\theta)$ was assumed to follow $dP_E / d\theta \propto \exp(-\theta^2 / V)$ for $\theta \in\, ]-\pi/4; \pi/4]$. The same density was used modulo $\pi/2$ for the directions $\theta = \pi/2$, $\pi$, and $3\pi/2$. The resulting densities for four different values of $V$, from $V = 3$ (moderately clustered distribution) to $V = 10^{-12}$ (PAs aligned on the axes), are plotted in the inset of Figure 2.
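The qualitative effect can be reproduced with a short simulation. The sketch below is a simplified version of this protocol (our choices: random rather than exactly regular sampling of the clustered density, circular normal tuning as in section 4.4, and arbitrary values of $V$ and $K$); it shows that broad tuning protects the population vector against clustering of the PAs:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_clustered_pas(n, V):
    """PAs clustered along the axes: offset ~ exp(-t^2 / V) on ]-pi/4, pi/4]."""
    t = rng.normal(0.0, np.sqrt(V / 2), 8 * n)
    t = t[np.abs(t) <= np.pi / 4][:n]          # crude truncation to the sector
    th = t + rng.integers(0, 4, t.size) * np.pi / 2
    return np.c_[np.cos(th), np.sin(th)]

def mean_angular_error(E, K, trials=2000):
    """Mean |angle| between encoded directions and population-vector decodes."""
    f = lambda u: (np.exp(K * u) - np.exp(-K)) / (np.exp(K) - np.exp(-K))
    a = rng.uniform(0, 2 * np.pi, trials)
    X = np.c_[np.cos(a), np.sin(a)]
    pv = f(X @ E.T) @ E
    err = np.angle(np.exp(1j * (np.arctan2(pv[:, 1], pv[:, 0]) - a)))
    return np.mean(np.abs(err))

E_cl = sample_clustered_pas(1000, V=0.3)       # strongly clustered PAs
err_narrow = mean_angular_error(E_cl, K=20.0)  # narrow tuning
err_broad = mean_angular_error(E_cl, K=1.0)    # broad, near-cosine tuning
print(err_narrow, err_broad)                   # broad tuning: smaller error
```

With near-cosine tuning the four-fold symmetry of the axis-clustered distribution keeps the second moment of the PAs isotropic, so the decode is nearly unbiased; narrow tuning breaks that linearity and pulls the population vector toward the nearest axis.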
Figure 2: Precision of the distributed computation as measured by the discrepancy between the decoded input and output in the case of the distributed identity mapping. The error in the transformation is plotted as a function of the tuning width of f and the clustering of the 2D basis vectors E around the axes. The inset shows the four distributions of E that have been tested (see the text). For each condition, the error was calculated as the mean absolute difference between encoded and decoded vectors over 1000 trials (i.e. 1000 randomly chosen encoded vectors).
To illustrate how the scheme of distributed computation proposed here behaves in these conditions, we measured the errors induced by the distributed computation $W_E$ of the identity function. The population was sampled exactly regularly, so that heavy computations involving large numbers of neurons could be avoided. Assuming tuning functions to be circular normal, we computed the angular difference between the decoded input and output vectors for different distributions of $E$ and different tuning widths. The results shown in Figure 2 were obtained by computing the identity on 1000 random vectors on a regular population of 1000 neurons. As expected, the most uniform distribution behaves best, generating very small errors. The deviation of the population vector increases with the clustering of the basis vectors. However, the more the tuning curves broaden, the less pronounced this effect is. In particular, if the tuning width is greater than 100 degrees, the directional error in the population vector is always less than 5 degrees. We conclude that the distributed computation of linear
mappings is still possible with minimal error in the case of clustered PAs when tuning curves are sufficiently broad.

6 Discussion
This article has addressed the calculation of vectorial transformations by populations of cosine-tuned neurons in a linear framework. We have shown that appropriate distributed representations of these transformations were made possible by simple and common properties of neural computation and learning: decoding with the population vector, regular distributions of tuning selectivities and input-output training examples, Hebbian learning, and cosine-tuned lateral interactions between neurons. We have analytically extended this result to the noncosine broadly tuned case for uniform distributions and numerically to regular nonuniform distributions. The use of the population vector may appear problematic because it is in general not an optimal decoding method (Salinas & Abbott, 1994). Statistical optimality is clearly an important theoretical issue (Snippe, 1996; Pouget et al., 1998), but it is unclear whether it is also a relevant concept for computation in the nervous system. As emphasized by several authors (Paradiso, 1988; Salinas & Abbott, 1994), the use of a large number of cells to estimate a parameter is likely to overcome variability in single-cell behavior. In fact, accuracy (small bias and low variance compared to the coding range) may be more important than optimality. Furthermore, the main difficulty with the PV method is its poor behavior when used for biased distributions of preferred directions (Glasius, Komoda, & Gielen, 1997) or populations of sharply tuned neurons (Seung & Sompolinsky, 1993). We have restricted our theory to regular or uniform distributions and broadly tuned neurons. For regular distributions, the PV method is an optimal linear estimator (Salinas & Abbott, 1994). Broadly tuned neurons allow the PV method to approach the maximum likelihood method for Poisson noise (Seung & Sompolinsky, 1993). The question arises whether electrophysiological data actually satisfy the regularity condition.
This is clearly the case for uniform distributions (Georgopoulos et al., 1986; Schwartz, Kettner, & Georgopoulos, 1988; Caminiti, Johnson, Galli, Ferraina, & Burnod, 1991). However, not all distributions are uniform (Hubel & Wiesel, 1962; Oyster & Barlow, 1967; van Gisbergen, van Opstal, & Tax, 1987; Cohen, Prud’homme, & Kalaska, 1994; Prud’homme & Kalaska, 1994; Lacquaniti et al., 1995; Rosa & Schmid, 1995; Wylie et al., 1998), and it remained to be checked if these distributions are regular. A particular distribution is a clustering of PAs along preferred axes (Oyster & Barlow, 1967; Cohen et al., 1994; Prud’homme & Kalaska, 1994; Lacquaniti et al., 1995; Wylie et al., 1998; see also Soechting & Flanders, 1992). Populations of neurons in posterior parietal cortex of monkeys have such a distribution of PAs and satisfy the regularity condition (p < 0.01, unpublished observations from the data of Battaglia-Mayer et al., 2000). The same was seen
in anterior parietal cortex (E. Guigon, unpublished observations from the data of Lacquaniti et al., 1995). This latter observation indicates that vectorial computation can occur not only in uniformly distributed neuronal populations, but also at the different levels of a sensorimotor transformation where neurons are closely related to receptors or actuators (Soechting & Flanders, 1992). The regularity condition allows basic operations of linear algebra to be implemented in a distributed fashion. A similar principle was first proposed by Touretzky et al. (1993). They introduced an architecture called a sinusoidal array, which encodes a vector as distributed activity across a neuronal population (see equation 2.2), and they used this architecture to solve reaching and navigation tasks (Touretzky et al., 1993; Redish, Touretzky, & Wan, 1994; Redish & Touretzky, 1994). However, in their formulation, vector rotation (which is a linear transformation) was implemented in a specific way, using either shifting circuits (Touretzky et al., 1993) or repeated vector addition (Redish et al., 1994). In our framework, vector rotation can be represented by a distributed linear transformation like any morphism (see section 3.2). We derived closely related results for a broad class of tuning functions (see equation 2.1), although under more restrictive hypotheses (uniform distribution of PAs). A theoretically unbiased population vector can be constructed from a nonuniformly distributed population of neurons by adjusting the distribution of tuning strength (Germain & Burnod, 1996) or tuning widths (Glasius et al., 1997). However, these methods cannot be used to release the uniformity constraint since the hypothesis of independence of the PAs and the parameter distribution is violated. A particular example of a nonuniform distribution of PAs is their clustering along axes (Oyster & Barlow, 1967; Cohen et al., 1994; Prud'homme & Kalaska, 1994; Lacquaniti et al., 1995; Wylie et al., 1998).
In this case, although the operation of the network is only exact for pure cosine tuning curves, we have shown numerically that a good approximate computation is still possible if the tuning is sufficiently broad. Salinas and Abbott (1995) derived a formal rule to learn the identity mapping in dimension 1 (i.e., $x \to x$ through uniformly distributed examples). Their demonstration relies on the fact that the tuning curves and synaptic connections depend only on the magnitude of the difference between preferred attributes. Our results generalize this idea to arbitrary linear mappings in any dimension. The generalized constraint is that the tuning curves and connections depend on the scalar product of preferred attributes, which includes the one-dimensional case. Salinas and Abbott (1995) also provided a solution to $(x, y) \to x + y$ in dimension 1. However, their method may not be generalizable to higher dimensions. In fact, this transformation is not a (bi)linear transformation and is not easily accounted for by our theory (except in the cosine case; see also Touretzky et al., 1993). Interestingly, when one asks how information is read out from distributed maps of directional
signals, vector averaging and winner-take-all are more likely decision processes than vector summation (Salzman & Newsome, 1994; Zohary, Scase, & Braddick, 1996; Groh, Born, & Newsome, 1997; Lisberger & Ferrera, 1997; Recanzone, Wurtz, & Schwarz, 1997). An important application of our theory is learning a coordinate transformation from its Jacobian. This problem can be solved formally as an ensemble of position-dependent linear mappings (Baraduc et al., 1999). However, unlike previous models (Burnod et al., 1992; Bullock et al., 1993), it is not required that position information be coded in a topographic manner. Arbitrary codes for position can be used provided that the mapping between the position and the distributed representation of the Jacobian (see equation 3.9) is correctly learned. The most interesting point is that neurons of the network display realistic firing properties, which resemble those of parietal and motor cortical neurons. These results render the theory presented here attractive for modeling sensorimotor transformations.

Appendix

A.1 Convergence of Q in Probability for Regular PAs. Here we show that $Q = \frac{1}{N} E E^T$ converges in probability toward the identity matrix (up to a multiplicative constant) if the distribution of the PAs $E_i^T$ is regular. The $k$th ($1 \le k \le D$) diagonal term of $Q$ is
$$Q_{kk} = \frac{1}{N} \sum_{i=1}^{N} E_{ik}^2,$$
which tends in probability toward $\sigma_E^2$ when $N$ tends to infinity. Indeed,
$$V(Q_{kk}) = \frac{1}{N^2} \sum_{i=1}^{N} V(E_{ik}^2) = \frac{V(E_{1k}^2)}{N}.$$
The off-diagonal element $Q_{kl}$ ($k \ne l$) of $Q$ is $Q_{kl} = \frac{1}{N} \sum_{i=1}^{N} E_{ik} E_{il}$; hence,
$$\lim_{N \to \infty} \langle Q_{kl} \rangle = 0 \qquad \text{and} \qquad \lim_{N \to \infty} V(Q_{kl}) = \lim_{N \to \infty} V(E_{ik} E_{il}) / N = 0.$$
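A quick numerical check of this convergence (a sketch that assumes, for concreteness, PAs with i.i.d. standard normal components, so that $\sigma_E^2 = 1$):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 3

def deviation(N):
    """Max entrywise deviation of Q = (1/N) E E^T from sigma_E^2 I_D."""
    E = rng.standard_normal((N, D))        # row i holds the PA E_i (sigma_E = 1)
    Q = E.T @ E / N
    return np.abs(Q - np.eye(D)).max()

d1, d2 = deviation(100), deviation(100_000)
print(d1, d2)                              # the deviation shrinks as N grows
```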
Thus, $Q$ converges in probability toward $\sigma_E^2 I_D$.

A.2 Correlated Noise. Writing $Q$ the correlation matrix of the noise, the variance $V(E g / N)$ of the read-out vector can be expressed as the first diagonal of the matrix
$$V = \frac{1}{N^2}\, E Q E^T.$$
For an independently distributed gaussian noise, $Q$ is proportional to the identity matrix and $V(E g / N) \propto 1 / N$. In the case of correlated gaussian noise,
$$Q = \sigma_g^2 \left[ I_N + c\, (I_{N,N} - I_N) \right],$$
and we get
$$V = \frac{1 - c}{N^2}\, \sigma_g^2\, E E^T = \frac{(1 - c)\, \sigma_g^2\, \sigma_E^2}{N}\, I_D.$$
For Poisson noise, the noise correlation matrix is $Q_{ij} = (1 - c)\, \delta_{ij}\, x_i + c \sqrt{x_i x_j}$. If there is no correlation ($c = 0$), the matrix $V$ is $E\, \mathrm{diag}(x_i)\, E^T / N^2$, and its $i$th diagonal term writes
$$\tilde{V}_{ii} = \frac{1}{N^2} \sum_k x_k E_{ki}^2 \;\le\; \frac{x_{max}}{N}\, \frac{1}{N} \sum_k E_{ki}^2 \;\le\; \frac{x_{max}}{N}.$$
For nonzero $c$, the term
$$V_c = \frac{c}{N^2}\, \mathrm{diag}\Big( E \big[ \sqrt{x_i x_j}\, \big]_{ij}\, E^T \Big) = \frac{c}{N^2} \big( E \sqrt{x}\, \big)^2$$
must be added. This term is independent of $N$ and can be evaluated numerically for a few types of distribution of $E^T$. For instance, if the PAs are uniformly distributed in 3D space and the minimum firing rate equals zero, and assuming that the norms of the $E_i$ and their directions are independently distributed,
$$V_c = c \|E\| \left[ \frac{1}{4\pi} \int_{-1}^{1} \int_0^{2\pi} \sqrt{1+s}\, \Big( \sqrt{1-s^2} \cos\theta,\; \sqrt{1-s^2} \sin\theta,\; s \Big)\; ds\, d\theta \right]^2 \le c\, x_{max} \left[ \frac{1}{2} \int_{-1}^{1} s \sqrt{1+s}\; ds \right]^2 \le \frac{8}{225}\, c\, x_{max};$$
hence, the upper bound of equation 3.4 (here $\|E\|$ denotes the mean norm of the $E_i$ vectors). The derivation of equation 3.5 is left to the reader. The demonstration of the properties of $W_E$ is analogous.

A.3 General Cosine Tuning. In most of the sections on coding and decoding, the baseline term $b$ was 0. We now show how the results change for a nonzero baseline. We can use an approach similar to that in section 3.1 to show that
$$\frac{1}{N} \sum_i x_i y_i \;\longrightarrow\; X \cdot Y + \hat{b} \qquad \text{in probability as} \qquad N \to \infty,$$
where $x$, $y$ are distributed representations of the physical vectors $X$, $Y$, and $\hat{b} = \lim_{N\to\infty} b^T b / N$ depends only on $b$. Thus, the scalar product of two vectors is easily deduced from the scalar product of their distributed representations. The expression for the matrix $W_E$ (see section 3.3.3), which we write $W$ for simplicity, transforms to
$$W' = W + \frac{b\, \mathbb{1}_N^T}{N \bar{b}},$$
where $\bar{b}$ is the mean over $i$ of the $b_i$ (and $\mathbb{1}_N$ the $N$-vector of ones). It can be checked that $W'$ is a projection on the affine subspace $E_p + b$ and possesses the same properties as $W$. Learning a linear transformation amounts to calculating the matrix
$$M^* = \sum_\nu \big( F^T Y^\nu + b_F \big) \big( E^T X^\nu + b_E \big)^T,$$
where $b_E$ and $b_F$ denote the mean activity of input and output neurons, respectively. If the training inputs satisfy the regularity condition, we have
$$M^* = \underbrace{F^T M \Big( \sum_\nu X^\nu (X^\nu)^T \Big) E}_{k M} + \underbrace{F^T M \Big( \sum_\nu X^\nu \Big) b_E^T}_{0} + \underbrace{b_F \Big( \sum_\nu X^\nu \Big)^T E}_{0} + N_{ex}\, b_F\, b_E^T,$$
where $k$ is a proportionality constant defined by equation 3.6. The regularity condition leads to $k = r^2 N_{ex} / D_E$, where $r$ is the mean norm of the input examples and $D_E = \dim E$. Hence,
$$M^* \propto M + \frac{r^2}{D_E}\, b_F\, b_E^T.$$
Thus, appropriate mapping occurs, although there is no guarantee that the output baseline activity will equal the baseline activity of the training patterns.

A.4 Nonuniform $X^\nu$: Convergence Toward the Moore-Penrose Inverse. When the $F_i$ are uniformly distributed in $F$, learning from the examples of a noninvertible mapping $M_1$ between output and input converges toward the distributed representation of its Moore-Penrose inverse.
Start from equation 3.8, and take $Y^\nu \in M_1^{-1}(E_j)$. As the Moore-Penrose inverse of $M_1$ is zero on the kernel of $M_1$, we can write $Y^\nu = M_1^\dagger E_j + K^\nu$, where $K^\nu \in \ker M_1$. If we assume that the $y^\nu$ are uniformly distributed in $F_p$ (index $p$ is defined in section 3.3.3), then the distribution of $Y^\nu$ is uniform. It follows that the distribution of the $K^\nu$ is symmetric with respect to zero. Hence,
$$\sum_{M_1 Y^\nu = E_j} Y^\nu \propto M_1^\dagger E_j.$$
The proportionality factor is identical for all $j$ if equation 3.7 is used, which completes the proof.

A.5 Distributed Dot Product. We show now that $h(X, Z)$ is a function of $X \cdot Z$, assuming that the encoded vectors are unit vectors. We note $S$ the unit sphere of $E$ and define
$$S_a(X) = \{ U \in S \mid U \cdot X = a \}.$$
Then
$$h(X, Z) = \int_a \int_{S_a(X)} f(a)\, f\big( Z \cdot (a X + E_\perp) \big)\; dP_E\, da = \int_a f(a) \underbrace{\int_{S_a(X)} f\big( a\, Z \cdot X + Z \cdot E_\perp \big)\; dP_E}_{h_a(X, Z)}\; da,$$
where $E_\perp$ is the projection of $E$ on the subspace orthogonal to $X$. Let us define $S_{au} = \{ E \in S \mid Z \cdot E_\perp = u \text{ and } X \cdot E = a \}$, and write
$$h_a(X, Z) = \int_u f(a\, Z \cdot X + u) \int_{S_{au}} dP_E = \int_u f(a\, Z \cdot X + u)\; dP_u,$$
which depends only on $X \cdot Z$. This is the required result. Moreover, if $f$ is nondecreasing (which is generally the case for a tuning function), it is immediate that $h$ is nondecreasing.

A.6 Hebbian Learning of Distributed Maps: General Case. The following derivation shows that Hebbian learning of linear mappings can still be achieved.
Using equation 4.4, we obtain
$$\int_F y(F)\, F\; dP_F = \int_E \int_F \int_\nu f(X^\nu \cdot E)\, g(M X^\nu \cdot F)\, x(E)\, F\; dP_E\, dP_F\, dP_\nu$$
$$= \int_\nu \int_F \underbrace{\left[ \int_E f(X^\nu \cdot E)\, f(X \cdot E)\; dP_E \right]}_{h(X, X^\nu)} g(M X^\nu \cdot F)\, F\; dP_F\, dP_\nu$$
$$= \int_\nu h(X, X^\nu) \underbrace{\left[ \int_F g(M X^\nu \cdot F)\, F\; dP_F \right]}_{M X^\nu}\; dP_\nu = \int_\nu h(X, X^\nu)\, M X^\nu\; dP_\nu.$$
If we assume that the distribution of training examples has the same properties as the distribution of PAs, then $Y = M X$ (using equation 4.3). This proves that the vector represented in the output activities is correct.

Acknowledgments
We thank Yves Burnod for fruitful discussions, Alexandre Pouget and an anonymous reviewer for helpful comments, and Marc Maier and Pierre Fortier for revising our English.

References

Abbott, L., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comp., 11, 91–101.
Baldi, P., & Hornik, K. (1995). Learning in linear networks: A survey. IEEE Trans. Neural Netw., 6(4), 837–858.
Baraduc, P., Guigon, E., & Burnod, Y. (1999). Where does the population vector of motor cortical cells point during arm reaching movements? In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 83–89). Cambridge, MA: MIT Press. Available on-line at: http://www.snv.jussieu.fr/guigon/nips99.pdf.
Battaglia-Mayer, A., Ferraina, S., Mitsuda, T., Marconi, B., Genovesio, A., Onorati, P., Lacquaniti, F., & Caminiti, R. (2000). Early coding of reaching in the parietooccipital cortex. J. Neurophysiol., 83(4), 2374–2391.
Bullock, D., Grossberg, S., & Guenther, F. (1993). A self-organizing neural model of motor equivalence reaching and tool use by a multijoint arm. J. Cogn. Neurosci., 5, 408–435.
Burnod, Y., Grandguillaume, P., Otto, I., Ferraina, S., Johnson, P., & Caminiti, R. (1992). Visuo-motor transformations underlying arm movements toward visual targets: A neural network model of cerebral cortical operations. J. Neurosci., 12, 1435–1453.
Caminiti, R., Johnson, P., Galli, C., Ferraina, S., & Burnod, Y. (1991). Making arm movements within different parts of space: The premotor and motor cortical representation of a coordinate system for reaching to visual targets. J. Neurosci., 11, 1182–1197.
Cohen, D., Prud’homme, M., & Kalaska, J. (1994). Tactile activity in primate primary somatosensory cortex during active arm movements: Correlation with receptive field properties. J. Neurophysiol., 71, 161–172.
Crowe, A., Porrill, J., & Prescott, T. (1998). Kinematic coordination of reach and balance. J. Mot. Behav., 30(3), 217–233.
Davis, P. (1979). Circulant matrices. New York: Wiley.
Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Gawne, T., & Richmond, B. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Georgopoulos, A. (1996). On the translation of directional motor cortical commands to activation of muscles via spinal interneuronal systems. Cogn. Brain Res., 3(2), 151–155.
Georgopoulos, A., Kettner, R., & Schwartz, A. (1988). Primate motor cortex and free arm movements to visual targets in 3-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci., 8, 2928–2937.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Germain, P., & Burnod, Y. (1996). Computational properties and autoorganization of a population of cortical neurons. In Proc. International Conference on Neural Networks, ICNN’96 (pp. 712–717). Piscataway, NJ: IEEE.
Glasius, R., Komoda, A., & Gielen, C. (1997).
The population vector, an unbiased estimator for non-uniformly distributed neural maps. Neural Netw., 10, 1571–1582.
Groh, J., Born, R., & Newsome, W. (1997). How is a sensory map read out? Effects of microstimulation in visual area MT on saccades and smooth pursuit eye movements. J. Neurosci., 17(11), 4312–4330.
Grossberg, S., & Kuperstein, M. (1989). Neural dynamics of adaptive sensory-motor control (Exp. ed.). Elmsford, NY: Pergamon Press.
Hinton, G. (1984). Parallel computations for controlling an arm. J. Mot. Behav., 16(2), 171–194.
Hinton, G. (1992). How neural networks learn from experience. Sci. Am., 267(3), 145–151.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. (Lond.), 160, 106–154.
Kuperstein, M. (1988). Neural model of adaptive hand-eye coordination for single postures. Science, 239, 1308–1311.
870
Pierre Baraduc and Emmanuel Guigon
Lacquaniti, F., Guigon, E., Bianchi, L., Ferraina, S., & Caminiti, R. (1995). Representing spatial information for limb movement: Role of area 5 in the monkey. Cereb. Cortex, 5(5), 391–409. Lisberger, S., & Ferrera, V. (1997). Vector averaging for smooth pursuit eye movements initiated by two moving targets in monkeys. J. Neurosci., 17(19), 7490– 7502. Mardia, K. (1972). Statistics of directional data. London: Academic Press. Mussa-Ivaldi, F. (1988). Do neurons in the motor cortex encode movement direction? An alternative hypothesis. Neurosci. Lett., 91, 106–111. Mussa-Ivaldi, F., Morasso, P., & Zaccaria, R. (1988). Kinematic networks. A distributed model for representing and regularizing motor redundancy. Biol. Cybern., 60, 1–16. Oyster, C., & Barlow, H. (1967). Direction-selective units in rabbit retina: Distribution of preferred directions. Science, 155, 841–842. Paradiso, M. (1988). A theory for use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern., 58, 35– 49. Pouget, A., Zhang, K., Deneve, S., & Latham, P. (1998). Statistically efcient estimation using population coding. Neural Comput., 10(2), 373–401. Prud’homme, M., & Kalaska, J. (1994). Proprioceptive activity in primate primary somatosensory cortex during active arm reaching movements. J. Neurophysiol., 72(5), 2280–2301. Recanzone, G., Wurtz, R., & Schwarz, U. (1997). Responses of MT and MST neurons to one and two moving objects in the receptive eld. J. Neurophysiol., 78(6), 2904–2915. Redding, G., & Wallace, B. (1997). Adaptive spatial alignment. Hillsdale, NJ: Erlbaum. Redish, A., & Touretzky, D. (1994). The reaching task: Evidence for vector arithmetic in the motor system? Biol. Cybern., 71(4), 307–317. Redish, A., Touretzky, D., & Wan, H. (1994). The sinusoidal array: A theory of representation for spatial vectors. In F. Eeckman (Ed.), Computation and neural systems (pp. 269–274). Boston: Kluwer. Rosa, M., & Schmid, L. (1995). 
Magnication factors, receptive eld image and point-image size in the superior colliculus of ying foxes: Comparison with primary visual cortex. Exp. Brain Res., 102, 551–556. Salinas, E., & Abbott, L. (1994). Vector reconstruction from ring rates. J. Comput. Neurosci., 1, 89–107. Salinas, E., & Abbott, L. (1995). Transfer of coded information from sensory to motor networks. J. Neurosci., 15(10), 6461–6474. Salinas, E., & Abbott, L. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. U.S.A., 93(21), 11956–11961. Salzman, C., & Newsome, W. (1994). Neural mechanisms for forming a perceptual decision. Science, 264, 231–237. Sanger, T. (1994). Theoretical considerations for the analysis of population coding in motor cortex. Neural Comput., 6, 29–37. Schwartz, A., Kettner, R., & Georgopoulos, A. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. I. Relations
Population Computation of Vectorial Transformations
871
between single cell discharge and direction of movement. J. Neurosci., 8, 2913– 2927. Seung, H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. U.S.A., 90, 10749–10753. Snippe, H. (1996). Parameter extraction from population codes: A critical assessment. Neural Comput., 8(3), 511–529. Soechting, J., & Flanders, M. (1992). Moving in three-dimensional space: Frames of reference, vector, and coordinate systems. Annu. Rev. Neurosci., 15, 167–191. Strang, G. (1988). Linear algebra and its applications (3rd ed.). San Diego: Harcourt Brace Jovanovich. Touretzky, D., Redish, A., & Wan, H. (1993).Neural representation of space using sinusoidal arrays. Neural Comput., 5, 869–884. van Gisbergen, J., van Opstal, A., & Tax, A. (1987). Collicular ensemble coding of saccades based on vector summation. Neuroscience, 21, 541–555. Wylie, D., Bischof, W., & Frost, B. (1998). Common reference frame for neural coding of translational and rotational optic ow. Nature, 392, 278–282. Yang, H., & Dillon, T. (1994). Exponential stability and oscillation of Hopeld graded response neural network. IEEE Trans. Neural Netw., 5(5), 719–729. Zemel, R., & Hinton, G. (1995). Learning population codes by minimizing description length. Neural Comput., 7(3), 549–564. Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16(6), 2112–2126. Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unied framework with application to hippocampal place cells. J. Neurophysiol., 79(2), 1017–1044. Zohary, E., Scase, M., & Braddick, O. (1996). Integration across directions in dynamic random dot displays: Vector summation or winner take all? Vision Res., 36(15), 2321–2331. Zohary, E., Shadlen, M., & Newsome, W. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. 
Nature, 370, 140– 143. Received February 17, 2000; accepted July 5, 2001.
LETTER
Communicated by Michael Eisele
Redistribution of Synaptic Efcacy Supports Stable Pattern Learning in Neural Networks Gail A. Carpenter [email protected] Boriana L. Milenova [email protected] Department of Cognitive and Neural Systems, Boston University, Boston, Massachusetts 02215, U.S.A.
Markram and Tsodyks, by showing that the elevated synaptic efficacy observed with single-pulse long-term potentiation (LTP) measurements disappears with higher-frequency test pulses, have critically challenged the conventional assumption that LTP reflects a general gain increase. This observed change in frequency dependence during synaptic potentiation is called redistribution of synaptic efficacy (RSE). RSE is here seen as the local realization of a global design principle in a neural network for pattern coding. The underlying computational model posits an adaptive threshold rather than a multiplicative weight as the elementary unit of long-term memory. A distributed instar learning law allows thresholds to increase only monotonically, but adaptation has a bidirectional effect on the model postsynaptic potential. At each synapse, threshold increases implement pattern selectivity via a frequency-dependent signal component, while a complementary frequency-independent component nonspecifically strengthens the path. This synaptic balance produces changes in frequency dependence that are robustly similar to those observed by Markram and Tsodyks. The network design therefore suggests a functional purpose for RSE, which, by helping to bound total memory change, supports a distributed coding scheme that is stable with fast as well as slow learning. Multiplicative weights have served as a cornerstone for models of physiological data and neural systems for decades. Although the model discussed here does not implement detailed physiology of synaptic transmission, its new learning laws operate in a network architecture that suggests how recently discovered synaptic computations such as RSE may help produce new network capabilities such as learning that is fast, stable, and distributed.
Neural Computation 14, 873–888 (2002)
© 2002 Massachusetts Institute of Technology
1 Introduction
The traditional experimental interpretation of long-term potentiation (LTP) as a model of synaptic plasticity is based on a fundamental hypothesis: “Changes in the amplitude of synaptic responses evoked by single-shock extracellular electrical stimulation of presynaptic fibres are usually considered to reflect a change in the gain of synaptic signals, and are the most frequently used measure for evaluating synaptic plasticity” (Markram & Tsodyks, 1996, p. 807). LTP experiments tested only with low-frequency presynaptic inputs implicitly assume that these observations may be extrapolated to higher frequencies. Paired action-potential experiments by Markram and Tsodyks (1996) call into question the LTP gain-change hypothesis by demonstrating that adaptive changes in synaptic efficacy can depend dramatically on the frequency of the presynaptic test pulses used to probe these changes. In that preparation, following an interval of pre- and postsynaptic pairing, neocortical pyramidal neurons are seen to exhibit LTP, with the amplitude of the post-pairing response to a single test pulse elevated to 166% of the pre-pairing response. If LTP were a manifestation of a synaptic gain increase, the response to each higher-frequency test pulse would also be 166% of the pre-pairing response to the same presynaptic frequency. Although the Markram–Tsodyks data do show an amplified response to the initial spike in a test train (EPSPinit), the degree of enhancement of the stationary response (EPSPstat) declines steeply as test pulse frequency increases (see Figure 1). In fact, post-pairing amplification of EPSPstat disappears altogether for 23 Hz test trains and then, remarkably, reverses sign, with test trains of 30–40 Hz producing post-pairing stationary response amplitudes that are less than 90% the size of pre-pairing amplitudes. Pairing is thus shown to induce a redistribution rather than a uniform enhancement of synaptic efficacy.
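The gain-increase hypothesis has a simple computational signature: if pairing merely multiplied a synaptic gain, the post-pairing/pre-pairing ratio of the response would be flat across test frequencies. A minimal sketch of this null model (our illustration; the frequency-dependence function `f` is an arbitrary placeholder, not taken from the data):

```python
# Hypothetical gain-model synapse: response T = f(I) * w * I, where f(I)
# is some fixed frequency dependence and w is a synaptic gain. If pairing
# multiplies w by a constant (here 1.66, the 166% single-pulse LTP figure),
# the before/after ratio of T is identical at every test frequency.

def response(I, w, f=lambda I: 1.0 / (1.0 + 0.05 * I)):
    return f(I) * w * I

w_before, gain = 1.0, 1.66
for I in [2.0, 10.0, 23.0, 40.0]:  # test frequencies (Hz)
    ratio = response(I, gain * w_before) / response(I, w_before)
    print(f"{I:4.1f} Hz: post/pre = {ratio:.2f}")  # 1.66 at every frequency
```

This flat 166% line is exactly what the measured EPSPstat ratios contradict.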
As Markram, Pikus, Gupta, and Tsodyks (1998) point out, redistribution of synaptic efficacy has profound implications for modeling as well as experimentation: “Incorporating frequency-dependent synaptic transmission into artificial neural networks reveals that the function of synapses within neural networks is exceedingly more complex than previously imagined” (p. 497). Neural modelers have long been aware that synaptic transmission may exhibit frequency dependence (Abbott, Varela, Sen, & Nelson, 1997; Carpenter & Grossberg, 1990; Grossberg, 1968), but most network models have not so far needed this feature to achieve their functional goals. Rather, the assumption that synaptic gains, or multiplicative weights, are fixed on the timescale of synaptic transmission has served as a useful cornerstone for models of adaptive neural processes and related artificial neural network systems. Even models that hypothesize synaptic frequency dependence would still typically have predicted the constant upper dashed line in Figure 1 (see section 2.1), rather than the change in frequency dependence
Figure 1: Relative amplitude of the stationary postsynaptic potential EPSPstat as a function of presynaptic spike frequency (I) (adapted from Markram & Tsodyks, 1996, Figure 3c, p. 809). In the Markram–Tsodyks pairing paradigm, sufficient current to evoke 4–8 spikes was injected, pre- and post-, for 20 msec; this procedure was repeated every 20 sec for 10 min. Data points show the EPSPstat after pairing as a percentage of the control EPSPstat before pairing, for I = 2, 5, 10, 23, 30, 40 Hz, plus the low-frequency “single-spike” point, shown as a weighted average of the measured data: 2 × 0.25 and 17 × 0.067 Hz. If pairing had produced no adaptation, EPSPstat would be a function of I that was unaffected by pairing, as represented by the lower dashed line (100% of control). If pairing had caused an increase in a gain, or multiplicative weight, then EPSPstat would equal the gain times a function of I, which would produce the upper dashed line (166% of control). Markram and Tsodyks fit their data with an exponential curve, approximately (1 + 0.104[e^((I − 14.5)/7.23) − 1]) × 100%, which crosses the neutral point at I = 14.5 Hz.
observed in the Markram-Tsodyks redistribution of synaptic efficacy (RSE) experiments. A “bottom-up” modeling approach might now graft a new process, such as redistribution of synaptic efficacy, onto an existing system. While such a step would add complexity to the model’s dynamic repertoire, it may be difficult to use this approach to gain insight into the functional advantages of
the added element. Indeed, adding the Markram–Tsodyks effect to an existing network model of pattern learning would be expected to alter drastically the dynamics of input coding—but what could be the benefit of such an addition? A priori, such a modification even appears to be counterproductive, since learning in the new system would seem to reduce pattern discrimination by compressing input differences and favoring only low-frequency inputs. A neural network model called distributed ART (dART) (Carpenter, 1996, 1997; Carpenter, Milenova, & Noeske, 1998) features RSE at the local synaptic level as a consequence of implementing system design goals at the pattern processing level. Achieving these global capabilities, not the fitting of local physiological data, was the original modeling goal. This “top-down” approach to understanding the functional role of learned changes in synaptic potentiation suggests by example how the apparently paradoxical phenomenon of RSE may actually be precisely the element needed to solve a critical pattern coding problem at a higher processing level. The dART network seeks to combine the advantages of multilayer perceptrons, including noise tolerance and code compression, with the complementary advantages of adaptive resonance theory (ART) networks (Carpenter & Grossberg, 1987, 1993; Carpenter, Grossberg, & Reynolds, 1991; Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992). ART and dART models employ competitive learning schemes for code selection, and both are designed to stabilize learning. However, because ART networks use a classical steepest-descent paradigm called instar learning (Grossberg, 1972), these systems require winner-take-all coding to maintain memory stability with fast learning. A new learning law called the distributed instar (dInstar) (see section 2.1) allows dART code representations to be distributed across any number of network nodes.
The dynamic behavior of an individual dART synapse is seen in the context of its role in stabilizing distributed pattern learning rather than as a primary hypothesis. RSE here reflects a trade-off between changes in frequency-dependent and frequency-independent postsynaptic signal components, which support a trade-off between pattern selectivity and nonspecific path strengthening at the network level (see Figure 2). Models that implement distributed coding via gain adaptation alone tend to suffer catastrophic forgetting and require slow or limited learning. In dART, each increase in frequency-independent synaptic efficacy is balanced by a proportional decrease in frequency-dependent efficacy. With each frequency-dependent element assumed to be stronger than its paired frequency-independent element, the net result of learning is redistribution rather than nonspecific enhancement of synaptic efficacy. The system uses this mechanism to achieve the goal of a typical competitive learning scheme, enhancing network response to a given pattern while suppressing the response to mismatched patterns. At the same time, the dART network learning laws are designed to preserve prior codes. They do so by formally replacing the multiplicative weight with a dynamic weight (Carpenter, 1994), equal to the rectified difference between target node activation and an adaptive threshold, which embodies the long-term memory of the system. The dynamic weight permits adaptation only at the most active coding nodes, which are limited in number due to competition at the target field. Replacing the multiplicative weight with an adaptive threshold as the unit of long-term memory thus produces a coding system that may be characterized as quasi-localist (Carpenter, 2001) rather than localist (winner-take-all) or fully distributed. Adaptive thresholds, which are initially zero, become increasingly resistant to change as they become larger, a property that is essential for code stability. Both ART and dART also employ a preprocessing step called complement coding (Carpenter, Grossberg, & Rosen, 1991), which presents to the learning system both the original external input pattern and its complement. The system thus allocates two thresholds for coding each component of the original input, a device that is analogous to on-cell/off-cell coding in the early visual system. Each threshold can only increase, and, as in the Markram-Tsodyks RSE experiments, each model neuron can learn to enhance only low-frequency signals. Nevertheless, by treating the high-frequency and low-frequency components of the original input pattern symmetrically, complement coding allows the network to encode a full range of input features. Elements of the dART network that are directly relevant to the discussion of Markram-Tsodyks RSE during pairing experiments will now be defined quantitatively.

2 Results

2.1 Distributed ART Model Equations. A simple, plausible model of synaptic transmission might hypothesize a postsynaptic depolarization T in response to a presynaptic firing rate I as T = w_eff · I, where the effective weight w_eff might decrease as the frequency I increases. Specifically, if
T = [f(I) · w] · I, where w is constant on a short timescale, then the ratio of T before versus after pairing would be independent of I. An LTP experiment that employs only single-shock test pulses relies on such a hypothesis for in vivo extrapolation and therefore implicitly predicts the upper dashed line in Figure 1. However, this synaptic computation is completely at odds with the Markram-Tsodyks measurements of adaptive change in frequency dependence. The net postsynaptic depolarization signal T at a dART model synapse is a function of two formal components with dual computational properties: a frequency-dependent component S, which is a function of the current presynaptic input I, and a frequency-independent component H, which is independent of I. Both components depend on the postsynaptic voltage y
[Figure 2 diagram: labels read “dual computational properties,” “frequency-dependent,” “frequency-independent,” “shrink to fit,” “amplify,” “atrophy due to disuse,” “gain,” and “synaptic dynamic balance.”]

Figure 2: During dART learning, active coding nodes tend simultaneously to become more selective with respect to a specific pattern and to become more excitable with respect to all patterns. This network-level trade-off is realized by a synaptic-level dynamic balance between frequency-dependent and frequency-independent signal components. During learning, “disused” frequency-dependent elements, at synapses where the dynamic weight exceeds the input, are converted to frequency-independent elements. This conversion will strengthen the signal transmitted by the same path input (or by a smaller input), which will subsequently have the same frequency-dependent component but a larger frequency-independent component. Network dynamics also require that an active frequency-dependent (pattern-specific) component contribute more than the equivalent frequency-independent (nonspecific) component, which is realized as the hypothesis that parameter a is less than 1 in equation 2.1. This hypothesis ensures that among those coding nodes that would produce no new learning for a given input pattern, nodes with learned patterns that most closely match the input are most strongly activated.
and the adaptive threshold τ:

Frequency dependent: S = I ∧ [y − τ]⁺
Frequency independent: H = y ∧ τ
Total postsynaptic signal: T = S + (1 − a)H = I ∧ [y − τ]⁺ + (1 − a)(y ∧ τ).    (2.1)
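A direct transcription of equation 2.1 can make these components concrete; a sketch in our own Python (the function names are ours, and `min`/`max` play the roles of ∧ and [·]⁺):

```python
# Postsynaptic signal components of equation 2.1.
# min(a, b) implements a ∧ b; max(a, 0) implements the rectification [a]+.

def S(I, y, tau):
    """Frequency-dependent component: S = I ∧ [y − τ]+."""
    return min(I, max(y - tau, 0.0))

def H(y, tau):
    """Frequency-independent component: H = y ∧ τ."""
    return min(y, tau)

def T(I, y, tau, a):
    """Total postsynaptic signal: T = S + (1 − a)H, with 0 < a < 1."""
    return S(I, y, tau) + (1.0 - a) * H(y, tau)

# The dynamic weight [y − τ]+ caps S: S grows linearly with I up to the cap.
print(S(0.2, 1.0, 0.4))  # 0.2: linear regime (I below the cap 0.6)
print(S(0.9, 1.0, 0.4))  # 0.6: saturated at the dynamic weight
```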
In equation 2.1, a ∧ b ≡ min{a, b} and [a]⁺ ≡ max{a, 0}. Parameter a is assumed to be between 0 and 1, corresponding to the network hypothesis that the pattern-specific component contributes more to postsynaptic activation than the nonspecific component, all other things being equal. The dynamic weight, defined formally as [y − τ]⁺, specifies an upper bound on the size of S; for smaller I, the frequency-dependent component is directly proportional to I. Note that this model does not assign a specific physiological interpretation to the postsynaptic signal T. In particular, T cannot simply be proportional to the transmitted signal, since T does not equal 0 when I = 0. The adaptive threshold τ, initially 0, increases monotonically during learning, according to the dInstar learning law:

dInstar: dτ/dt = [[y − τ]⁺ − I]⁺ = [y − τ − I]⁺.    (2.2)
The distributed instar represents a principle of atrophy due to disuse, whereby a dynamic weight that exceeds the current input “shrinks to fit” that input (see Figure 2). When the coding node is embedded in a competitive network, the bound on total network activation across the target field causes dynamic weights to impose an inherited bound on the total learned change any given input can induce, with fast as well as slow learning. Note that τ remains constant if y is small or τ is large, and that

dτ/dt = [y − τ]⁺ − [y − τ]⁺ ∧ I = y − y ∧ τ − [y − τ]⁺ ∧ I = y − H − S.
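The atrophy-due-to-disuse dynamics can be checked by integrating equation 2.2 numerically. A sketch under our own discretization (forward Euler; step size and duration are arbitrary choices, not from the paper):

```python
# dInstar threshold dynamics (equation 2.2): dτ/dt = [y − τ − I]+.
# With the node fully active (y = 1) and a sustained input I < 1, the
# threshold rises until the dynamic weight [y − τ]+ has shrunk to fit
# the input, i.e. τ → y − I; it can never decrease.

def run_dinstar(I, y=1.0, tau0=0.0, dt=0.01, steps=5000):
    tau, trace = tau0, [tau0]
    for _ in range(steps):
        tau += dt * max(y - tau - I, 0.0)  # rectified: monotone increase
        trace.append(tau)
    return tau, trace

tau_final, trace = run_dinstar(I=0.3)
print(round(tau_final, 3))  # converges to y − I = 0.7
print(all(b >= a for a, b in zip(trace, trace[1:])))  # True: monotone
```

If the dynamic weight already fits the input (τ ≥ y − I), the rectification keeps dτ/dt = 0, so prior learning is preserved.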
When a threshold increases, the frequency-independent, or nonspecific, component H (see equation 2.1) becomes larger for all subsequent inputs, but the input-specific component S becomes more selective. For a high-frequency input, a nonspecifically increased component is neutralized by a decreased frequency-dependent component. The net computational effect of a threshold increase (e.g., due to pairing) is an enhancement of the total signal T subsequently produced by small presynaptic inputs, but a smaller enhancement, or even a reduction, of the total signal produced by large inputs.

2.2 Distributed ART Model Predicts Redistribution of Synaptic Efficacy. Figure 3 illustrates the frequency-dependent and frequency-independent components of the postsynaptic signal T and shows how these two competing elements combine to produce the change in frequency dependence observed during pairing experiments. In this example, model elements, defined by equation 2.1, are taken to be piecewise linear, although
this choice is not unique. In fact, the general dART model allows a broad range of form factors that satisfy qualitative hypotheses. The model presented here has been chosen for minimality, including only those components needed to produce computational capabilities, and for simplicity of functional form. Throughout, the superscript b (before) denotes values measured before the pairing experiment, and the superscript a (after) denotes values measured after the pairing experiment. The graphs show each system variable as a function of the presynaptic test frequency (I). Variable I is scaled by a factor Ī (in Hz), which converts the dimensionless input (see equation 2.1) to frequency in the experimental range. The dimensionless model input corresponds to the experimental test frequency divided by Ī.
In the dART network, postsynaptic nodes are embedded in a field where strong competition typically holds a pattern of activation as a working memory code that is largely insensitive to fluctuations of the external inputs. When a new input arrives, an external reset signal briefly overrides internal competitive interactions, which allows the new pattern to determine its own unbiased code. This reset process is modeled by momentarily setting all postsynaptic activations y = 1. The resulting initial signals T then lock in the subsequent activation pattern, as a function of the internal dynamics of the competitive field. Thereafter, signal components S and H depend on y, which is small at most nodes due to normalization of total activation across the field. The Markram-Tsodyks experiments use isolated cells, so network
Figure 3: Facing page. dART model and Markram–Tsodyks data. (A) The dART postsynaptic frequency-dependent component S increases linearly with the presynaptic test spike frequency I, up to a saturation point. During pairing, the model adaptive threshold τ increases, and the saturation point of the graph of S is proportional to (1 − τ). The saturation point therefore declines as the coding node becomes more selective. Pairing does not alter the frequency-dependent response to low-frequency inputs: S^a = S^b for small I. For high-frequency inputs, S^a is smaller than S^b by a quantity Δ, which is proportional to the amount by which τ has increased during pairing. (B) The dART frequency-independent component H, which is a constant function of the presynaptic input I, increases by Δ during pairing. (C) Combined postsynaptic signal T = S + (1 − a)H, where 0 < a < 1. At low presynaptic frequencies, pairing causes T to increase (T^a = T^b + (1 − a)Δ), because of the increase in the frequency-independent signal component H. At high presynaptic frequencies, pairing causes T to decrease (T^a = T^b − aΔ). (D) For presynaptic spike frequencies below the post-pairing saturation point of S^a, T^a is greater than T^b. For frequencies above the pre-pairing saturation point of S^b, T^a is less than T^b. The interval of intermediate frequencies contains the neutral point where T^a = T^b. Parameters for the dART model were estimated by minimizing the chi-squared statistic (Press, Teukolsky, Vetterling, & Flannery, 1994): χ² = Σ_{i=1}^{N} ((y_i − ŷ_i)/σ_i)², where y_i and σ_i are the mean value and standard deviation of the ith measurement point, respectively, while ŷ_i is the model’s prediction for that point. Four parameters were used: the threshold before pairing (τ^b = 0.225), the threshold after pairing (τ^a = 0.39), a presynaptic input scale (Ī = 33.28 Hz), and the weighting coefficient (a = 0.6), which determines the contribution of the frequency-dependent component S relative to the frequency-independent component H. The components of the dimensionless postsynaptic signal T = S + (1 − a)H, for a system with a single node in the target field (y = 1), are S = (I/Ī) ∧ (1 − τ) and H = τ. The dART model provides a good fit of the experimental data on changes in synaptic frequency dependence due to pairing (χ²(3) = 1.085, p = 0.78).
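Given the fitted values, the three regimes of Figure 3D can be recomputed directly from equation 2.1. A sketch (our code; only the four parameter values quoted above are taken from the fit):

```python
# Recompute the RSE curve T^a/T^b from the fitted dART parameters:
# enhancement at low test frequencies, a neutral point near 22 Hz,
# and reversal (ratio below 90%) at high frequencies.
TAU_B, TAU_A = 0.225, 0.39  # adaptive threshold before/after pairing
I_BAR, A = 33.28, 0.6       # input scale (Hz), weighting coefficient

def T(I_hz, tau):
    S = min(I_hz / I_BAR, 1.0 - tau)  # frequency-dependent (y = 1)
    H = tau                           # frequency-independent (y = 1)
    return S + (1.0 - A) * H

def rse_ratio(I_hz):
    return T(I_hz, TAU_A) / T(I_hz, TAU_B)

# Saturation points of S before and after pairing (Hz):
print(round(I_BAR * (1 - TAU_B), 1), round(I_BAR * (1 - TAU_A), 1))  # 25.8 20.3
for I in [2, 23, 40]:
    print(f"{I:2d} Hz: {100 * rse_ratio(I):5.1f}% of control")
```

At 2 Hz the ratio is well above 100%, near 23 Hz it is close to 100%, and at 40 Hz it falls below 90%, matching the measured reversal.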
properties are not tested, and Figures 3 and 4 plot dART model equations 2.1 with y = 1. Figure 3A shows the frequency-dependent component of the postsynaptic signal before pairing (S^b) and after pairing (S^a). The frequency-dependent component is directly proportional to I, up to a saturation point, which is proportional to (1 − τ). Tsodyks and Markram (1997) have observed a similar phenomenon: “The limiting frequencies were between 10 and 25 Hz. . . . Above the limiting frequency the average postsynaptic depolarization from resting membrane potential saturates as presynaptic firing rates increase” (p. 720). The existence of such a limiting frequency confirms a prediction of the phenomenological model of synaptic transmission proposed by Tsodyks and Markram (1997), as well as the prediction of distributed ART (Carpenter, 1996, 1997). The dART model also predicts that pairing lowers the saturation point as the frequency-dependent component becomes more selective. Figure 3B illustrates that the frequency-independent component in the dART model is independent of I and that it increases during training. Moreover, the increase in this component (Δ ≡ H^a − H^b = τ^a − τ^b) balances the decrease in the frequency-dependent component at large I, where S^b − S^a = Δ. Figure 3C shows how the frequency-dependent and frequency-independent components combine in the dART model to form the net postsynaptic signal T. Using the simplest form factor, the model synaptic signal is taken to be a linear combination of the two components: T = S + (1 − a)H (see equation 2.1). For small I (below the post-pairing saturation point of S^a), pairing causes T to increase, since S remains constant and H increases. For large I (above the pre-pairing saturation point of S^b), pairing causes T to decrease: because (1 − a) < 1, the frequency-independent increase is more than offset by the frequency-dependent decrease.
The neutral frequency, at which the test pulse I produces the same postsynaptic depolarization before and after pairing, lies between these two intervals. Figure 3D combines the graphs in Figure 3C to replicate the Markram-Tsodyks data on changes in frequency dependence, which are redrawn on this plot. The graph of T^a/T^b is divided into three intervals, determined by the saturation points of S before pairing (I = Ī(1 − τ^b) = 25.8 Hz) and after pairing (I = Ī(1 − τ^a) = 20.3 Hz) (see Figure 3A). The neutral frequency lies between these two values. System parameters of the dART model were chosen, in Figure 3, to obtain a quantitative fit to the Markram-Tsodyks (1996) results concerning changes in synaptic potentiation, before pairing versus after pairing. In that preparation, the data exhibit the reversal phenomenon where, for high-frequency test pulses, post-pairing synaptic efficacy falls below its pre-pairing value. Note that dART system parameters could also be chosen to fit data that might show a reduction, but not a reversal, of synaptic efficacy. This might occur, for example, if the test pulse frequency of the theoretical reversal point
Figure 4: Transitional RSE ratios (T^a/T^b, % of control, versus presynaptic spike frequency in Hz). The dART model predicts that if postsynaptic responses were measured at intermediate numbers of pairing intervals, the location of the neutral point, where pairing leaves the ratio T^a/T^b unchanged, would move to the left on the graph. That is, the cross-over point would occur at lower frequencies I.
were beyond the physiological range. Across a wide parameter range, the qualitative properties illustrated here are robust and intrinsic to the internal mechanisms of the dART model. Analysis of the function T = S + (1 − a)H suggests how this postsynaptic signal would vary with presynaptic spike frequency if responses to test pulses were measured at transitional points in the adaptation process (see Figure 4), after fewer than the 30 pairing intervals used to produce the original data. In particular, the saturation point where the curve modeling T^a/T^b flattens out at high presynaptic spike frequency depends on only the state of the system before pairing, so this location remains constant as adaptation proceeds. On the other hand, as the number of pairing intervals increases, the dART model predicts that the neutral point, where the curve crosses the 100% line and T^a = T^b, moves progressively to the left. That is, as the degree of LTP amplification of low-frequency inputs grows, the set of presynaptic frequencies that produce any increased synaptic efficacy shrinks.
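The leftward drift of the neutral point can be made explicit: setting T^a = T^b in equation 2.1 on the interval between the two saturation points gives I_neutral = Ī(1 − aτ^a − (1 − a)τ^b), which decreases as the post-pairing threshold τ^a grows. A sketch checking this numerically (our code; the intermediate threshold values are hypothetical stages of pairing, and the other parameters are those of Figure 3):

```python
# Neutral point of the RSE curve as a function of the post-pairing
# threshold tau_a: I_neutral = I_BAR * (1 − A*tau_a − (1 − A)*TAU_B).
I_BAR, A, TAU_B = 33.28, 0.6, 0.225

def neutral_point(tau_a):
    return I_BAR * (1.0 - A * tau_a - (1.0 - A) * TAU_B)

# Hypothetical threshold values at successive stages of pairing:
stages = [0.25, 0.30, 0.39]
print([round(neutral_point(t), 1) for t in stages])  # [25.3, 24.3, 22.5]
```

The neutral frequency falls monotonically toward the final fitted value of about 22.5 Hz as τ^a approaches 0.39.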
3 Discussion

3.1 Redistribution of Synaptic Efficacy Supports Stable Pattern Learning. Markram and Tsodyks (1996) report measurements of the initial, transient, and stationary components of the excitatory postsynaptic potential in neocortical pyramidal neurons, bringing to a traditional LTP pairing paradigm a set of nontraditional test stimuli that measure postsynaptic responses at various presynaptic input frequencies. The dART model analysis of these experiments focuses on how the stationary component of the postsynaptic response is modified by learning. This analysis places aspects of the single-cell observations in the context of a large-scale neural network for stable pattern learning. While classical multiplicative models are considered highly plausible, having succeeded in organizing and promoting the understanding of volumes of physiological data, nearly all such models failed to predict adaptive changes in frequency dependence. Learning laws in the dART model operate on a principle of atrophy due to disuse, which allows the network to mold parallel distributed pattern representations while protecting stored memories. The dynamic balance of competing postsynaptic computational components at each synapse dynamically limits memory change, enabling stable fast learning with distributed code representations in a real-time neural model. To date, other competitive learning systems have not realized this combination of computational capabilities. Although dART model elements do not attempt to fit detailed physiological measurements of synaptic signal components, RSE is the computational element that sustains stable distributed pattern coding in the network. As described in section 2.2, the network synapse balances an adaptive increase in a frequency-independent component of the postsynaptic signal against a corresponding frequency-dependent decrease. Local models of synaptic transmission designed to fit the Markram-Tsodyks data are reviewed in section 3.2. These models do not show, however, how adaptive changes in frequency dependence might be implemented in a network with useful computational functions.
In the dART model, the synaptic location of a frequency-independent bias term, realized as an adaptive threshold, leads to dual postsynaptic computations that mimic observed changes in postsynaptic frequency dependence, before versus after pairing. However, producing this effect was not a primary design goal; in fact, model specification preceded the data report. Rather, replication of certain aspects of the Markram-Tsodyks experiments was a secondary result of seeking to design a distributed neural network that does not suffer catastrophic forgetting. The dInstar learning law (see equation 2.2) allows thresholds to change only at highly active coding nodes. This rule stabilizes memory because total activation across the target field is assumed to be bounded, so most of the system's memory traces remain constant in response to a typical input pattern. Defining
long-term change in terms of dynamic weights thus allows significant new information to be encoded quickly at any future time, but also protects the network's previous memories at any given time. In contrast, most neural networks with distributed codes suffer unselective forgetting unless they operate with restrictions such as slow learning. The first goal of the dART network is the coding process itself. In particular, as in a typical coding system, two functionally distinct input patterns need to be able to activate distinct patterns at the coding field. The network accomplishes this by shrinking large dynamic weights just enough to fit the current pattern (see Figure 2). Increased thresholds enhance the net excitatory signal transmitted by this input pattern to currently active coding nodes because learning leaves all frequency-dependent responses to this input unchanged while causing frequency-independent components to increase wherever thresholds increase. On the other hand, increased thresholds can depress the postsynaptic signal produced by a different input pattern, since a higher threshold in a high-frequency path would now cause the frequency-dependent component to be depressed relative to its previous size. If this depression is great enough, it can outweigh the nonspecific enhancement of the frequency-independent component. Local RSE, as illustrated in Figure 3, is an epiphenomenon of this global pattern learning dynamic. A learning process represented as a simple gain increase would only enhance network responses. Recognizing the need for balance, models dating back at least to the McCulloch-Pitts neuron (McCulloch & Pitts, 1943) have included a nodal bias term. In multilayer perceptrons such as backpropagation (Rosenblatt, 1958, 1962; Werbos, 1974; Rumelhart, Hinton, & Williams, 1986), a single bias weight is trained along with all the pattern-specific weights converging on a network node.
The dART model differs from these systems in that each synapse includes both frequency-dependent (pattern-specific) and frequency-independent (nonspecific bias) processes. All synapses then contribute to a net nodal bias. The total increased frequency-independent bias is locally tied to increased pattern selectivity. Although the adaptation process is unidirectional, complement coding, by representing both the original input pattern and its complement, provides a full dynamic range of coding computations.
3.2 Local Models of the Markram-Tsodyks Data. During dInstar learning, the decrease in the frequency-dependent postsynaptic component S balances the increase in the frequency-independent component H. These qualitative properties subserve necessary network computations. However, model perturbations may have similar computational properties, and system components do not uniquely imply a physical model. Models that focus more on the Markram-Tsodyks paradigm with respect to the detailed biophysics of the local synapse, including transient dynamics, are now reviewed.
Gail A. Carpenter and Boriana L. Milenova
In the Tsodyks-Markram (1997) model, the limiting frequency, beyond which EPSP_stat saturates, decreases as a depletion rate parameter U_SE (utilization of synaptic efficacy) increases. In this model, as in dART, pairing lowers the saturation point (see Figure 3C). Tsodyks and Markram discuss changes in presynaptic release probabilities as one possible interpretation of system parameters such as U_SE. Abbott et al. (1997) also model some of the same experimental phenomena discussed by Tsodyks and Markram, focusing on short-term synaptic depression. In other model analyses of synaptic efficacy, Markram, Pikus, Gupta, and Tsodyks (1998) and Markram, Wang, and Tsodyks (1998) add a facilitating term to their 1997 model in order to investigate differential signaling arising from a single axonal source. Tsodyks, Pawelzik, and Markram (1998) investigate the implications of these synaptic model variations for a large-scale neural network. Using a mean-field approximation, they "show that the dynamics of synaptic transmission results in complex sets of regular and irregular regimes of network activity" (p. 821). However, their network is not constructed to carry out any specified function; neither is it adaptive. Tsodyks et al. (1998) conclude, "An important challenge for the proposed formulation remains in analyzing the influence of the synaptic dynamics on the performance of other, computationally more instructive neural network models. Work in this direction is in progress" (pp. 831–832). Because the Markram-Tsodyks RSE data follow from the intrinsic functional design goals of a complete system, the dART neural network model begins to meet this challenge.

Acknowledgments
This research was supported by grants from the Air Force Office of Scientific Research (AFOSR F49620-01-1-0397), the Office of Naval Research and the Defense Advanced Research Projects Agency (ONR N00014-95-1-0409 and ONR N00014-1-95-0657), and the National Institutes of Health (NIH 20-3164304-5).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Carpenter, G. A. (1994). A distributed outstar network for spatial pattern learning. Neural Networks, 7, 159–168.
Carpenter, G. A. (1996). Distributed activation, search, and learning by ART and ARTMAP neural networks. In Proceedings of the International Conference on Neural Networks (ICNN'96): Plenary, Panel and Special Sessions (pp. 244–249). Piscataway, NJ: IEEE Press.
Carpenter, G. A. (1997). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, 10, 1473–1494.
Carpenter, G. A. (2001). Neural network models of learning and memory: Leading questions and an emerging framework. Trends in Cognitive Sciences, 5, 114–118.
Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.
Carpenter, G. A., & Grossberg, S. (1990). ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, 3, 129–152.
Carpenter, G. A., & Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neuroscience, 16, 131–137.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 698–713.
Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565–588.
Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759–771.
Carpenter, G. A., Milenova, B. L., & Noeske, B. W. (1998). Distributed ARTMAP: A neural network for fast distributed supervised learning. Neural Networks, 11, 793–813.
Grossberg, S. (1968). Some physiological and biochemical consequences of psychological postulates. Proc. Natl. Acad. Sci. USA, 60, 758–765.
Grossberg, S. (1972). Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes. Kybernetik, 10, 49–57.
Markram, H., Pikus, D., Gupta, A., & Tsodyks, M. (1998). Potential for multiple mechanisms, phenomena and algorithms for synaptic plasticity at single synapses. Neuropharmacology, 37, 489–500.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci. USA, 95, 5323–5328.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 9, 127–147.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1994). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics. Washington, DC: Spartan Books.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.

Received July 28, 1999; accepted July 5, 2001.
LETTER
Communicated by Hagai Attias
Mean-Field Approaches to Independent Component Analysis

Pedro A.d.F.R. Højen-Sørensen
[email protected]
Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark

Ole Winther
[email protected]
Department of Mathematical Modelling and Center for Biological Sequence Analysis, Department of Biotechnology, Technical University of Denmark, DK-2800 Lyngby, Denmark

Lars Kai Hansen
[email protected]
Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark

We develop mean-field approaches for probabilistic independent component analysis (ICA). The sources are estimated from the mean of their posterior distribution, and the mixing matrix (and noise level) is estimated by maximum a posteriori (MAP). The latter requires the computation of (a good approximation to) the correlations between sources. For this purpose, we investigate three increasingly advanced mean-field methods: the variational (also known as naive mean field) approach, linear response corrections, and an adaptive version of the Thouless, Anderson, and Palmer (1977) (TAP) mean-field approach, which is due to Opper and Winther (2001). The resulting algorithms are tested on a number of problems. On synthetic data, the advanced mean-field approaches are able to recover the correct mixing matrix in cases where the variational mean-field theory fails. For handwritten digits, sparse encoding is achieved using nonnegative source and mixing priors. For speech, the mean-field method is able to separate in the underdetermined (overcomplete) case of two sensors and three sources. One major advantage of the proposed method is its generality and algorithmic simplicity. Finally, we point out several possible extensions of the approaches developed here.

1 Introduction
Reconstruction of statistically independent source signals from linear mixtures is an active research field with numerous important applications (for

Neural Computation 14, 889–918 (2002)
© 2002 Massachusetts Institute of Technology
background and references, see Lee, 1998; Girolami, 2000). Blind signal separation in the face of additive noise typically involves four estimation problems: estimation of source signals, source distribution, mixing coefficients, and noise distribution. A full Bayesian treatment of the combined estimation problem is possible but requires extensive Monte Carlo sampling (Belouchrani & Cardoso, 1995); therefore, several authors have proposed variational (also known as mean field or ensemble) approaches in which the posterior distributions are either approximated by factorized gaussians or integrals over the posteriors are evaluated by saddle-point approximations (Attias, 1999; Belouchrani & Cardoso, 1995; Lewicki & Sejnowski, 2000; Lappalainen & Miskin, 2000; Hansen, 2000; Knuth, 1999). The resulting algorithm is an expectation-maximization (EM)-like procedure with the four estimations performed sequentially. One important problem with these approximations arises from the assumed posterior independence of sources. In particular, variational mean-field theory using factorized trial distributions treats only "self-interactions" correctly, while producing trivial second moments, that is, $\langle S_i S_j \rangle = \langle S_i \rangle \langle S_j \rangle$ for $i \neq j$. This is a poor approximation when estimating the mixing matrix and noise distribution since these estimates will typically depend on correlations. Recently, Kappen and Rodríguez (1998) pointed out that for Boltzmann machines this naive mean-field (NMF) approximation, introduced in this context by Peterson and Anderson (1987), may fail completely in some cases. They went on to propose an efficient learning algorithm based on linear response (LR) theory. LR theory gives a recipe for computing an improved approximation to the covariances directly from the solution to the NMF equations (Parisi, 1988). In this article, we give a general presentation of LR theory and apply it to the probabilistic ICA problem.
We also briefly outline the supposedly more accurate adaptive TAP mean-field theory (Opper & Winther, 2001) and compare this method to the NMF and LR approaches. Whereas estimates of correlations obtained from variational mean-field theory and its linear response correction in general differ, adaptive TAP is constructed such that it is consistent with linear response theory. We expect that advanced mean-field methods such as LR and TAP can be useful in the many contexts within neural computation where variational mean-field theory has already proven useful, for instance, for sigmoid belief networks (Saul, Jaakkola, & Jordan, 1996). In our experience, the main difference between variational mean field and the advanced methods lies in the estimates of correlations (often needed in algorithms of the EM type) and the calculation of the likelihood of the data. We will not discuss the latter here, however (see Opper & Winther, 2001, for a general method for computing the likelihood from the covariance matrix). In ICA simulations, we find that the variational approach can fail, typically by ignoring some of the sources and consequently overestimating the noise
covariance. The LR and TAP approaches, on the other hand, succeed in all cases studied. However, we do not find a significant improvement using TAP (which is also somewhat more computationally intensive), suggesting that LR is close to being the optimal mean-field approach for the probabilistic ICA model. The derivations of the mean-field equations are valid for a general source prior (without temporal correlation) and tractable for priors that can be folded analytically with a gaussian distribution. This includes mixtures of gaussians, Laplacian, and binary distributions. For other priors, one has to evaluate an extensive number of one-dimensional integrals numerically. Alternatively, one can construct computationally tractable ICA algorithms using priors that are defined only implicitly. To illustrate this point, we define one such algorithm, which approximately corresponds to the prior having a power law tail. To underline the flexibility and computational power of the probabilistic ICA framework and its mean-field implementation, we give two quite different real-world examples of recent interest that can be solved straightforwardly in this framework. The first example is that of separating speech in the overcomplete setting of two sensors and three sources (Lewicki & Sejnowski, 2000) using a heavy-tailed source prior such as a Laplacian or the (approximate) power law prior described above. The second real-world problem considered is that of feature extraction in images. For images, it is natural to work with a nonnegativity constraint for the mixing matrix and sources, as in Lee and Seung (1999) and Miskin and MacKay (2000). In the probabilistic framework, this type of prior knowledge is readily built into the mixing matrix and source priors.
Throughout this article, we confine ourselves to fixed source priors. There are, however, no theoretical problems in extending the EM algorithm to estimate hyperparameters (see Attias, 1999, for an example of such source prior parameter estimation). In section 2, the basic probabilistic ICA model and the associated learning problem are stated. Section 3 concerns the inference part of the learning problem; we will see that variational mean-field theory, linear response theory, and the adaptive TAP approach can be seen as stepwise more refined ways of estimating correlations. Applying the advanced mean-field methods to independent component analysis is the main contribution of this article. Another contribution is the generality of the framework. In section 4, we examine various types of explicitly given source priors, which leads us to define an implicitly given source prior. The impatient or application-minded reader might consult section 4.1, which shows a table summarizing all priors considered. Section 5 shows some simulation results on both synthetic and real-world data. Finally, obvious ways to extend this work are outlined in the conclusion in section 6. The pseudocode for the algorithm is outlined in appendix A, and some additional priors not directly used are given in appendix B.
2 Probabilistic ICA
We formulate the ICA problem as follows (Hansen, 2000). The measurements are a collection of N temporal D-dimensional signals $X = \{X_{dt}\}$, $d = 1, \ldots, D$ and $t = 1, \ldots, N$, where $X_{dt}$ denotes the measurement at the dth sensor at time t. Similarly, let $S = \{S_{mt}\}$, $m = 1, \ldots, M$, denote a collection of M mutually statistically independent sources, where $S_{mt}$ is the mth source at time t. The measured signal X is assumed to be an instantaneous linear mixing of the sources corrupted with additive white gaussian noise $\Gamma$, that is,

$$X = AS + \Gamma, \tag{2.1}$$

where A is a (time-independent) mixing matrix and the noise is assumed to be without temporal correlations and with time-independent covariance matrix $\Sigma$, that is, $\langle \Gamma_{dt} \Gamma_{d't'} \rangle = \delta_{tt'} \Sigma_{dd'}$. We thus have the following likelihood for parameters and sources:

$$P(X \mid A, \Sigma, S) = (\det 2\pi\Sigma)^{-N/2}\, e^{-\frac{1}{2} \operatorname{Tr}\, (X - AS)^{\mathsf T} \Sigma^{-1} (X - AS)}. \tag{2.2}$$
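As a quick numerical sanity check, equation 2.2 can be evaluated directly. The sketch below (with illustrative array names of our own choosing, not from the paper's software) confirms that the true mixing matrix assigns the data a higher log-likelihood than a mismatched one:

```python
import numpy as np

def log_likelihood(X, A, Sigma, S):
    """log P(X | A, Sigma, S) of eq. 2.2 for X = A S + gaussian noise."""
    N = X.shape[1]
    R = X - A @ S                                      # residuals
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma) # log det(2 pi Sigma)
    return -0.5 * N * logdet - 0.5 * np.trace(R.T @ np.linalg.inv(Sigma) @ R)

rng = np.random.default_rng(0)
D, M, N = 3, 2, 100
A = rng.normal(size=(D, M))
S = rng.normal(size=(M, N))
Sigma = 0.1 * np.eye(D)
X = A @ S + rng.multivariate_normal(np.zeros(D), Sigma, size=N).T
ll_true = log_likelihood(X, A, Sigma, S)
ll_zero = log_likelihood(X, np.zeros_like(A), Sigma, S)
# The generating A explains X far better than A = 0.
```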
The aim of ICA is to recover the unknown quantities (the sources S, the mixing matrix A, and the noise covariance $\Sigma$) from the observed data. The main difficulty is associated with estimating the source signals. The estimation problems for the mixing matrix and the noise covariance matrix are relatively simple, given the sufficient source statistics. Hence, our primary objective is to improve on the estimate of sufficient statistics from the posterior distribution of the sources. The mixing matrix A and the noise covariance $\Sigma$ are then estimated by maximum a posteriori (MAP) (or maximum likelihood II, ML-II). This naturally leads to an EM-type algorithm where the expectation step amounts to finding the posterior mean and covariances of the sources and the maximization step is the MAP/ML-II estimation. Mean-field methods, especially the advanced ones, are well suited for the nontrivial expectation step. Given the likelihood, equation 2.2, the posterior distribution of the sources is readily given by

$$P(S \mid X, A, \Sigma) = \frac{P(X \mid A, \Sigma, S)\, P(S)}{P(X \mid A, \Sigma)}, \tag{2.3}$$
where P(S) is a prior on the sources, which might include temporal correlations (although we will postpone this problem to a future contribution).

2.1 Estimation of Mixing Matrix and Noise Covariance. The likelihood of the parameters is given by

$$P(X \mid A, \Sigma) = \int dS\, P(X \mid A, \Sigma, S)\, P(S). \tag{2.4}$$
The problem of estimating the mixing matrix and noise covariance now amounts to finding the saddle points of the likelihood, equation 2.4, with respect to the mixing matrix and noise covariance. We note that the saddle points will be given in terms of averages over the source posterior. These calculations of mean sufficient statistics with respect to the posterior are the main challenge for mean-field approaches since the sources will be coupled through the observations. The mixing matrix A will be estimated by MAP and the noise by ML-II for convenience:

$$A^{\mathrm{MAP}} = \operatorname*{argmax}_{A} P(A \mid X, \Sigma) \tag{2.5}$$

$$\Sigma^{\mathrm{MLII}} = \operatorname*{argmax}_{\Sigma} P(X \mid A, \Sigma), \tag{2.6}$$
where the posterior of A is given by $P(A \mid X, \Sigma) \propto P(X \mid A, \Sigma)\, P(A)$, where P(A) is the prior on A. For the optimization in equations 2.5 and 2.6, we need the derivatives of the likelihood term:

$$\frac{\partial}{\partial A} \log P(X \mid A, \Sigma) = \Sigma^{-1} \left( X \langle S \rangle^{\mathsf T} - A \langle S S^{\mathsf T} \rangle \right) \tag{2.7}$$

$$\frac{\partial}{\partial \Sigma} \log P(X \mid A, \Sigma) = \frac{1}{2} \Sigma^{-1} \langle (X - AS)(X - AS)^{\mathsf T} \rangle \Sigma^{-1} - \frac{N}{2} \Sigma^{-1}, \tag{2.8}$$
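Equation 2.7 is easy to verify by finite differences when the posterior is collapsed to a point ($\langle S \rangle = S$, $\langle SS^{\mathsf T} \rangle = SS^{\mathsf T}$). The check below is ours, not part of the paper:

```python
import numpy as np

def log_lik_A(X, A, Sigma_inv, S):
    """A-dependent part of log P(X | A, Sigma, S), eq. 2.2, at a fixed S."""
    R = X - A @ S
    return -0.5 * np.trace(R.T @ Sigma_inv @ R)

def grad_A(X, A, Sigma_inv, S):
    """Eq. 2.7 evaluated with a point posterior: <S> = S, <S S^T> = S S^T."""
    return Sigma_inv @ (X @ S.T - A @ (S @ S.T))

rng = np.random.default_rng(1)
D, M, N = 3, 2, 20
X = rng.normal(size=(D, N))
A = rng.normal(size=(D, M))
S = rng.normal(size=(M, N))
Sigma_inv = np.eye(D) / 0.5

G = grad_A(X, A, Sigma_inv, S)
G_num = np.zeros_like(A)
eps = 1e-6
for d in range(D):
    for m in range(M):
        Ap, Am = A.copy(), A.copy()
        Ap[d, m] += eps
        Am[d, m] -= eps
        G_num[d, m] = (log_lik_A(X, Ap, Sigma_inv, S)
                       - log_lik_A(X, Am, Sigma_inv, S)) / (2.0 * eps)
# G and G_num agree to numerical precision.
```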
where $\langle \cdot \rangle = \langle \cdot \rangle_{S \mid A, \Sigma, X}$ denotes the posterior average with respect to the sources given the mixing matrix and noise covariance. Equating equation 2.8 to zero leads to the well-known result for $\Sigma$:

$$\Sigma^{\mathrm{MLII}} = \frac{1}{N} \langle (X - AS)(X - AS)^{\mathsf T} \rangle. \tag{2.9}$$
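In an implementation, the average in equation 2.9 is evaluated from the posterior moments supplied by the E-step, using $\langle (X - AS)(X - AS)^{\mathsf T} \rangle = XX^{\mathsf T} - A\langle S \rangle X^{\mathsf T} - X\langle S \rangle^{\mathsf T} A^{\mathsf T} + A\langle SS^{\mathsf T} \rangle A^{\mathsf T}$. A minimal sketch of this expansion (function and variable names are ours):

```python
import numpy as np

def sigma_ml2(X, A, S_mean, SS_mean):
    """Eq. 2.9 with <(X-AS)(X-AS)^T> expanded in posterior moments <S>, <SS^T>."""
    N = X.shape[1]
    return (X @ X.T - A @ S_mean @ X.T - X @ S_mean.T @ A.T
            + A @ SS_mean @ A.T) / N

# Consistency check against a point posterior (<S> = S, <SS^T> = S S^T):
rng = np.random.default_rng(1)
D, M, N = 3, 2, 50
X = rng.normal(size=(D, N))
A = rng.normal(size=(D, M))
S = rng.normal(size=(M, N))
Sigma_hat = sigma_ml2(X, A, S, S @ S.T)
R = X - A @ S
Sigma_direct = R @ R.T / N        # direct evaluation of eq. 2.9 at S
```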
In the particular case of measurements with independently and identically distributed noise, we can simplify the covariance to $\Sigma = \sigma^2 I$; hence $\sigma^2 = \operatorname{Tr} \Sigma^{\mathrm{MLII}} / D$, where D is the number of sensors.

For A, we consider two factorized priors, $P(A) = \prod_{dm} P(A_{dm})$: a zero-mean gaussian $P(A_{dm}) \propto \exp(-\alpha_{dm} A_{dm}^2 / 2)$ and the Laplace distribution $P(A_{dm}) \propto \exp(-\beta_{dm} |A_{dm}|)$. Furthermore, we consider optimizing $A_{dm}$ both unconstrained and constrained to be nonnegative. Clearly, the MAP approach offers a flexibility for encoding prior knowledge about A that is not available in the maximum likelihood II approach; one can encode sparseness (Hyvärinen & Karthikesh, 2000) and nonnegativeness (for images and text; see section 5 and Lee & Seung, 1999; Miskin & MacKay, 2000).

2.1.1 Unconstrained Mixing Matrices. A straightforward calculation gives the following iterative equation for the MAP estimate of A:

$$A^{(k+1)} = \left( X \langle S \rangle^{\mathsf T} - \Sigma \left( \alpha A^{(k)} + \beta\, \mathrm{sign}(A^{(k)}) \right) \right) \langle S S^{\mathsf T} \rangle^{-1}, \tag{2.10}$$
where we have included both priors and set $\alpha_{dm} = \alpha$ and $\beta_{dm} = \beta$. This equation can be solved explicitly for the gaussian prior with equal noise variance on all sensors ($\beta = 0$ and $\Sigma = \sigma^2 I$):

$$A = X \langle S \rangle^{\mathsf T} \left( \langle S S^{\mathsf T} \rangle + \alpha \sigma^2 I \right)^{-1}. \tag{2.11}$$

The ML-II estimate is the special case obtained by setting $\alpha = 0$.

2.1.2 Nonnegative Mixing Matrices. To enforce nonnegative A, we introduce a set of nonnegative Lagrange multipliers $\Lambda_{dm} \geq 0$ and maximize the modified cost $\log P(A \mid X, \Sigma) + \operatorname{Tr} \Lambda^{\mathsf T} A$. Solving for the Lagrange multipliers, we get

$$\Lambda = \Sigma^{-1} \left( A \langle S S^{\mathsf T} \rangle - X \langle S \rangle^{\mathsf T} \right) + \alpha A + \beta. \tag{2.12}$$
We can write down an iterative update rule for $A_{dm} > 0$ using the Kuhn-Tucker condition $\Lambda_{dm} A_{dm} = 0$ (Luenberger, 1984), together with the result for the Lagrange multipliers:

$$A_{dm}^{(k+1)} = \frac{\left[ \Sigma^{-1} X \langle S \rangle^{\mathsf T} \right]_{dm}}{\left[ \Sigma^{-1} A^{(k)} \langle S S^{\mathsf T} \rangle \right]_{dm} + \alpha A_{dm}^{(k)} + \beta}\; A_{dm}^{(k)}. \tag{2.13}$$
In the case of no prior knowledge ($\alpha = 0$ and $\beta = 0$), we get an update rule similar to the image space reconstruction algorithm used in positron emission tomography (see, e.g., Pierro, 1993, for references) or the more recently proposed nonnegative matrix factorization procedure of Lee and Seung (1999).

3 Mean-Field Theory
We present three different mean-field approaches that give us estimates of the source second-moment matrix of increasing quality. First, we derive mean-field equations using the standard variational mean-field theory. Next, using linear response theory, we obtain directly from the variational solution improved estimates of $\langle SS^{\mathsf T} \rangle$ needed for estimating A and $\Sigma$. Finally, we present the adaptive TAP approach of Opper and Winther (2001), which goes beyond the simple factorized trial distribution of variational mean-field theory to give a theory that is self-consistent to within linear response corrections. From mean-field theory, we also get an approximation to the likelihood $P(X \mid A, \Sigma)$, which can be used for model selection (Hansen, 2000).¹ In appendix A, we summarize all mean-field equations and give an EM-type recipe for solving them.

¹ The variational approximation is a lower bound to the exact likelihood, whereas the TAP and LR approximations (not given here) are not bounds, but hopefully are more accurate.
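To orient the reader before the derivations, the overall EM-type recipe alternates a mean-field E-step with the M-step updates 2.9 and 2.11. The sketch below uses a unit gaussian source prior, for which the E-step is exact, so the loop reduces to a standard factor-analysis-style iteration; it illustrates the algorithmic skeleton only and is not the paper's appendix-A pseudocode:

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, N = 5, 2, 2000
A_true = rng.normal(size=(D, M))
X = A_true @ rng.normal(size=(M, N)) + np.sqrt(0.1) * rng.normal(size=(D, N))

A = rng.normal(size=(D, M))
sigma2 = 1.0
for _ in range(200):
    # E-step: for a unit gaussian source prior the posterior is exactly gaussian.
    J = A.T @ A / sigma2                     # eq. 3.3 with Sigma = sigma2 * I
    C = np.linalg.inv(np.eye(M) + J)         # per-time-step posterior covariance
    S_mean = C @ A.T @ X / sigma2            # posterior mean <S>
    SS_mean = N * C + S_mean @ S_mean.T      # <S S^T>
    # M-step: eq. 2.11 with alpha = 0, then eq. 2.9 reduced to isotropic noise.
    A = X @ S_mean.T @ np.linalg.inv(SS_mean)
    sigma2 = np.trace(X @ X.T - A @ S_mean @ X.T) / (N * D)
# sigma2 converges toward the true noise variance (0.1 here).
```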
The following derivation is valid for any source prior without temporal correlations. Specific source priors are discussed in section 4. Although equations for the mean-field estimates of the mean and covariance of the sources are written with equality in this section, it is to be understood that they are only approximations.

3.1 Variational Approach. We adopt a standard variational mean-field theoretic approach and approximate the posterior distribution, $P(S \mid X, A, \Sigma)$, in a family of product distributions $Q(S) = \prod_{mt} Q(S_{mt})$.² For a gaussian likelihood $P(X \mid A, \Sigma, S)$, the optimal choice of $Q(S_{mt})$ is given by a gaussian times the prior (Csató, Fokoué, Opper, Schottky, & Winther, 2000):

$$Q(S_{mt}) \propto P(S_{mt})\, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}}. \tag{3.1}$$

To simplify the notation in the following, we will parameterize the likelihood as

$$P(X \mid A, \Sigma, S) = P(X \mid J, h, S) = \frac{1}{C}\, e^{-\frac{1}{2} \operatorname{Tr}(S^{\mathsf T} J S) + \operatorname{Tr}(h^{\mathsf T} S)}, \tag{3.2}$$
where $\log C = \frac{N}{2} \log \det 2\pi\Sigma + \frac{1}{2} \operatorname{Tr} X^{\mathsf T} \Sigma^{-1} X$, and the M × M interaction matrix J and the field h (having the same dimensions as S) are given by

$$J = A^{\mathsf T} \Sigma^{-1} A \tag{3.3}$$

$$h = A^{\mathsf T} \Sigma^{-1} X. \tag{3.4}$$
Note that h acts as an external field from which all moments of the sources can be obtained. This is the key property that we will make use of in the next section when we derive the linear response corrections. The starting point of the variational derivation of mean-field equations is the Kullback-Leibler divergence between the product distribution Q(S) and the true source posterior:

$$\mathrm{KL} = \int dS\, Q(S) \log \frac{Q(S)}{P(S \mid X, A, \Sigma)} = \log P(X \mid A, \Sigma) - \log P(X \mid A, \Sigma, \mathrm{NMF}) \tag{3.5}$$

$$\log P(X \mid A, \Sigma, \mathrm{NMF}) = \sum_{mt} \log \int dS_{mt}\, P(S_{mt})\, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}} + \frac{1}{2} \sum_{mt} (\lambda_{mt} - J_{mm}) \langle S_{mt}^2 \rangle + \operatorname{Tr}(h - \gamma)^{\mathsf T} \langle S \rangle + \frac{1}{2} \operatorname{Tr} \langle S^{\mathsf T} \rangle (\mathrm{diag}(J) - J) \langle S \rangle - \ln C, \tag{3.6}$$

² Note that $Q(S_{mt})$ is also the variational mean-field approximation to the marginal distribution $\int \prod_{m' \neq m,\, t' \neq t} dS_{m't'}\, P(S \mid X, A, \Sigma)$.
where $P(X \mid A, \Sigma, \mathrm{NMF})$ is the naive mean-field approximation to the likelihood and diag(J) is the diagonal matrix of J. The Kullback-Leibler divergence is zero when P = Q and positive otherwise; the parameters of Q should consequently be chosen so as to minimize KL. The saddle points define the mean-field equations:³

$$\frac{\partial \mathrm{KL}}{\partial \langle S \rangle} = 0: \quad \gamma = h - (J - \mathrm{diag}(J)) \langle S \rangle \tag{3.7}$$

$$\frac{\partial \mathrm{KL}}{\partial \langle S_{mt}^2 \rangle} = 0: \quad \lambda_{mt} = J_{mm}. \tag{3.8}$$

The remaining two equations depend explicitly on the source prior, P(S):

$$\frac{\partial \mathrm{KL}}{\partial \gamma_{mt}} = 0: \quad \langle S_{mt} \rangle = \frac{\partial}{\partial \gamma_{mt}} \log \int dS_{mt}\, P(S_{mt})\, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}} \equiv f(\gamma_{mt}, \lambda_{mt}) \tag{3.9}$$

$$\frac{\partial \mathrm{KL}}{\partial \lambda_{mt}} = 0: \quad \langle S_{mt}^2 \rangle = -2 \frac{\partial}{\partial \lambda_{mt}} \log \int dS_{mt}\, P(S_{mt})\, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}}. \tag{3.10}$$
The variational mean $f(\gamma_{mt}, \lambda_{mt})$ plays a crucial role in defining the mean-field algorithm since all dependence on the prior is implicit in f (and in $\partial f / \partial \gamma$ as well for the advanced methods). Combining equations 3.8, 3.7, and 3.9, we see that the variational mean is given in terms of a set of coupled fixed-point equations, which depends on the interaction matrix J and external field h. (For details on how to solve this set of fixed-point equations, see appendix A.) An analysis of different strategies for iterating the mean-field equations in the context of Potts spin glasses can be found in Peterson and Söderberg (1989). In section 4, we calculate $f(\gamma_{mt}, \lambda_{mt})$ for some of the prior distributions found in the ICA literature. We will primarily consider source priors that can be integrated analytically against the gaussian kernel, and hence avoid numerical integration.

³ The requirement that we should be at a local minimum of $\log P(X \mid A, \Sigma, \mathrm{NMF})$ is fulfilled when the covariance matrix, equation 3.12, is positive definite. To test whether we are at the global minimum is harder. However, when the model is well matched to the data, we expect the problem to be convex.
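As a concrete illustration of the coupled fixed-point structure, the damped iteration below uses the gaussian prior (for which, as noted in section 3.2, $f(\gamma, \lambda) = \gamma / (1 + \lambda)$) on a single time step; at convergence the mean-field solution satisfies $(I + J)\langle S \rangle = h$. This toy sketch is ours, not the appendix-A solver:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 4
B = rng.normal(size=(M, M))
J = B @ B.T / 10.0                 # a modest symmetric PSD interaction matrix
h = rng.normal(size=M)             # external field, single time step

lam = np.diag(J).copy()            # eq. 3.8: lambda_m = J_mm
s = np.zeros(M)                    # current estimate of <S>
for _ in range(1000):
    gamma = h - (J - np.diag(np.diag(J))) @ s    # eq. 3.7
    s = 0.5 * s + 0.5 * gamma / (1.0 + lam)      # eq. 3.9, damped for stability
# At the fixed point, <S> solves (I + J) <S> = h.
```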
3.2 Linear Response Theory. So far we have not discussed how to obtain mean-field approximations to the covariances

$$\chi^{tt'}_{mm'} \equiv \langle S_{mt} S_{m't'} \rangle - \langle S_{mt} \rangle \langle S_{m't'} \rangle.$$

Since variational mean-field theory uses a factorized trial distribution, the covariances among variables are trivially predicted to be zero. However, using linear response theory, we can improve the variational mean-field solution. As mentioned earlier, h acts as an external field. This makes it possible to calculate the means and covariances as derivatives of $\log P(X \mid J, h)$:

$$\langle S_{mt} \rangle = \frac{\partial \log P(X \mid J, h)}{\partial h_{mt}} \tag{3.11}$$

$$\chi^{tt'}_{mm'} = \frac{\partial^2 \log P(X \mid J, h)}{\partial h_{m't'}\, \partial h_{mt}} = \frac{\partial \langle S_{mt} \rangle}{\partial h_{m't'}}. \tag{3.12}$$
These relations are exact when using the exact likelihood. However, we can also use the NMF likelihood through the mean-eld equations 3.7 through tt0 3.9 to derive an approximate equation for Âmm 0: 0
tt Âmm 0 D
D
@f (c mt , lmt ) @c mt @c mt
@hm0 t0
³
@f (c mt , lmt )
¡
@c mt
´
X 6 m m00 ,m00 D
Jmm00 Âmtt00 m0 C dmm0
dtt0 .
(3.13)
As a direct consequence of the lack of temporal correlations in the present tt0 t setting, the Â-matrix factorizes in time: Âmm 0 D d tt 0 Âmm0 . We can straightfort wardly solve for Âmm0 , h t (¤ Âmm 0 D
t C
J ) ¡1
i mm0
,
(3.14)
where we have dened the diagonal matrix ³
¤
t
D diag (L 1t , . . . , L Mt ) ,
L mt ´
@f (c mt , lmt ) @c mt
´ ¡1
¡ Jmm . (3.15) @hS
i
t,NMF For comparison, the naive mean-eld result is Âmm D dmm0 @h mt , which 0 mt follows directly from equation 3.10. Why is the covariance matrix obtained by linear response more accurate? Here, we give an argument that can be found in Parisi’s book on statistical eld theory (Parisi, 1988). Let us assume (as always implicit in any meaneld theory) that the approximate and exact distribution is close in some sense, that is, P (S | X , A , § ) D Q ( S ) C e where e is small. Then by direct
898
P. Højen-Sørensen, O. Winther, and L. Hansen
application of the factorized distribution, we have $\chi^{t,\mathrm{exact}}_{mm'} = \chi^{t,\mathrm{NMF}}_{mm'} + O(\epsilon)$. By exploiting the nonnegativity of KL, equation 3.5, we can also prove that the linear response estimate $\chi^{t,\mathrm{LR}}_{mm'} = \partial^2 \log P(X \mid J, h, \mathrm{NMF}) / \partial h_{m't}\,\partial h_{mt}$ has an error of only $O(\epsilon^2)$. Since $\mathrm{KL} \ge 0$, the NMF theory log-likelihood gives a lower bound on the log-likelihood, and consequently, the linear term vanishes in the expansion of the log-likelihood: $\log P(X \mid J, h) = \log P(X \mid J, h, \mathrm{NMF}) + O(\epsilon^2)$. This shows that as long as $\epsilon$ is small, estimates of moments obtained by linear response are more accurate than using the trial distribution directly. For a specific case, it is possible to demonstrate the improvement directly. Consider the gaussian prior (see note 4), $P(S_{mt}) \propto \exp(-S_{mt}^2/2)$. In this case, the variational mean field, equation 3.9, is given by $f(\gamma, \lambda) = \gamma/(1+\lambda)$. Thus, the variational mean-field theory predicts $\chi^{t,\mathrm{NMF}}_{mm'} = \delta_{mm'}\,\partial\langle S_{mt}\rangle/\partial h_{mt} = 1/(1+\lambda_{mt}) = 1/(1+J_{mm})$. However, the linear response estimate, equation 3.14, gives $\chi^{t,\mathrm{LR}}_{mm'} = [(I+J)^{-1}]_{mm'}$ and hence reconstructs the full covariance matrix, identical to the exact result obtained by direct integration.
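The gaussian-prior example is easy to check numerically. The sketch below (plain Python; the 2x2 coupling matrix `J` is a hypothetical example, not taken from the article) compares the naive mean-field variances 1/(1+J_mm) with the linear response covariance (I+J)^{-1}, which for this prior coincides with the exact posterior covariance:

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

# Hypothetical symmetric coupling matrix (stand-in for A^T Sigma^{-1} A).
J = [[0.5, 0.3], [0.3, 0.8]]

# For the gaussian prior P(S) ~ exp(-S^2/2), Lambda_mm = 1/f' - J_mm = 1,
# so the linear response estimate (eq. 3.14) is (I + J)^{-1}, which is
# also the exact posterior covariance.
chi_lr = inv2([[1.0 + J[0][0], J[0][1]],
               [J[1][0], 1.0 + J[1][1]]])

# Naive mean field only produces the factorized variances 1/(1 + J_mm).
chi_nmf = [1.0 / (1.0 + J[0][0]), 1.0 / (1.0 + J[1][1])]

assert abs(chi_lr[0][0] - 1.8 / 2.61) < 1e-12   # exact diagonal variance
assert abs(chi_nmf[0] - chi_lr[0][0]) > 0.02    # NMF misses off-diagonal feedback
```

As expected, NMF underestimates the variance because it ignores the feedback through the off-diagonal couplings.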
3.3 Adaptive TAP Approach. So far we have derived two different estimates of the covariance matrix from variational mean-field theory: $\chi^{t,\mathrm{NMF}}_{mm'} = \delta_{mm'}\,\partial\langle S_{mt}\rangle/\partial h_{mt}$ and $\chi^{t,\mathrm{LR}}_{mm'} = [(\Lambda_t + J)^{-1}]_{mm'}$. Obviously, there is no guarantee that the two estimates are identical. Variational mean-field theory is thus not self-consistent to within linear response corrections. The adaptive TAP approach (Opper & Winther, 2001), on the other hand, goes beyond the factorized trial distribution and requires self-consistency for the covariances estimated by linear response. It is beyond the scope of this article to rederive the adaptive TAP mean-field theory. Consult Opper and Winther (2001) for a derivation valid for a model with quadratic interactions and a general variable prior, the model considered in this article. We have chosen to present and test adaptive TAP because it offers the most advanced (and, we hope, the most precise) mean-field approximation for this type of model. The self-consistency is achieved by introducing a set of $MT$ additional mean-field (or variational) parameters, the variances $\lambda_{mt}$ in the marginal distribution, equation 3.1, such that the diagonal term $\chi^{t,\mathrm{TAP}}_{mm}$ obeys

$$\frac{\partial\langle S_{mt}\rangle}{\partial h_{mt}} = \left[(\Lambda_t + J)^{-1}\right]_{mm}, \qquad (3.16)$$

where $\Lambda_{mt}$ and $\gamma_{mt}$ now depend on $\lambda_{mt}$:

$$\Lambda_{mt} = \left(\chi^{t}_{mm}\right)^{-1} - \lambda_{mt}, \qquad (3.17)$$
4 A gaussian source prior is not suitable for doing source separation. We merely use it here to show that the linear response correction in this case recovers the exact result.
Mean-Field Approaches to Independent Component Analysis
$$\gamma_{mt} = h_{mt} - \sum_{m'} \left(J_{mm'} - \lambda_{m't}\,\delta_{mm'}\right)\langle S_{m't}\rangle. \qquad (3.18)$$
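Equations 3.18 and 3.9 define the fixed-point iteration for the means that is used in the E-step (see appendix A). A minimal naive mean-field sketch for a gaussian prior, where $f(\gamma, \lambda) = \gamma/(1+\lambda)$ and $\lambda_{mt} = J_{mm}$, converges to the solution of $(I+J)\langle S\rangle = h$; the 3x3 `J` and `h` below are hypothetical example values for a single time slice:

```python
# Damped fixed-point iteration of eqs. 3.18/3.9, gaussian prior, M = 3.
# J and h are hypothetical example values.
J = [[1.0, 0.3, 0.1],
     [0.3, 0.8, 0.2],
     [0.1, 0.2, 1.2]]
h = [0.5, -0.2, 1.0]

mean = [0.0, 0.0, 0.0]           # <S>, initialized at zero
eta = 0.5                        # learning rate (damping)
for _ in range(200):
    for m in range(3):
        lam = J[m][m]            # variational choice: lambda_m = J_mm
        # eq. 3.18: the lambda_m delta_mm' term cancels the self-coupling
        gamma = h[m] - sum(J[m][k] * mean[k] for k in range(3) if k != m)
        f = gamma / (1.0 + lam)  # gaussian-prior mean function
        mean[m] += eta * (f - mean[m])

# The fixed point satisfies (I + J)<S> = h, the exact gaussian posterior mean.
for m in range(3):
    residual = mean[m] + sum(J[m][k] * mean[k] for k in range(3)) - h[m]
    assert abs(residual) < 1e-8
```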
To recover the variational mean-field equations, 3.15 and 3.7, we let $\lambda_{mt} = J_{mm}$.

4 Source Models
In this section we calculate, for various source priors, the variational mean $f$, equation 3.9, and the derivative $\partial f/\partial\gamma$ needed for the linear response correction and adaptive TAP. The priors that we are considering are all chosen such that the variational mean can be calculated using tables of standard integrals (Gradshteyn & Ryzhik, 1980). It turns out to be convenient to introduce the gaussian kernel $D$ with unit variance and its associated cumulative distribution function (cdf) $W$ in order to keep the following expressions of a manageable size:

$$D(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right), \qquad W(x) = \int_{-\infty}^{x} D(t)\,dt, \qquad (4.1)$$

$$D'(x) = -x\,D(x), \qquad W'(x) = D(x). \qquad (4.2)$$
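Both functions are straightforward to implement with the standard error function; a small sketch, with equation 4.2 checked by central differences:

```python
import math

def D(x):
    """Gaussian kernel with unit variance, eq. 4.1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def W(x):
    """Cumulative distribution function of D, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# eq. 4.2: D'(x) = -x D(x) and W'(x) = D(x), verified numerically.
h = 1e-6
for x in (-1.3, 0.0, 0.7):
    assert abs((D(x + h) - D(x - h)) / (2 * h) + x * D(x)) < 1e-6
    assert abs((W(x + h) - W(x - h)) / (2 * h) - D(x)) < 1e-6
```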
4.1 Summary of Source Priors. Table 1 summarizes the variational means and response functions corresponding to the priors described in this article. This is by no means a complete list of all priors for which it is possible to calculate these quantities (e.g., the Rayleigh distribution is one such prior).

4.2 Mixture of Gaussians Source Prior. In this section, we consider a general mixture of gaussians,

$$p(S \mid \mu, \sigma) = \sum_{i=1}^{N_m} p_i\, p(S \mid \mu_i, \sigma_i), \qquad S \in \mathbb{R}, \qquad (4.3)$$

where each of the $N_m$ individual mixture components is parameterized by

$$p(S \mid \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-(S-\mu_i)^2/2\sigma_i^2}. \qquad (4.4)$$
Using this source prior, the generative ICA model becomes the independent factor analysis model proposed in Attias (1999). Since the main scope of this article is reliably inferring mean sufficient statistics with respect to the sources, we will, contrary to Attias (1999), always regard the source parameters as fixed; we are at no time adapting the source priors to
Table 1: Variational Mean and Response Function Corresponding to Various Source Priors.

Source Prior       | $P(S)$                                              | Mean Function $f(\gamma,\lambda) = \langle S\rangle$                    | Response Function $\partial\langle S\rangle/\partial\gamma$
Binary             | $\frac{1}{2}\delta(S-1) + \frac{1}{2}\delta(S+1)$   | $\tanh(\gamma)$                                                          | $1 - \langle S\rangle^2$
Gaussian mixture   | Equation 4.3                                        | Equations 4.5 and 4.7                                                    | (see text)
Gaussian           | $\frac{1}{\sqrt{2\pi}}\exp(-S^2/2)$                 | $\gamma/(1+\lambda)$                                                     | $1/(1+\lambda)$
Heavy tail         | Not analytic                                        | $\frac{\gamma}{\lambda} - \frac{\alpha\gamma}{\alpha\lambda+\gamma^2}$   | $\frac{1}{\lambda} + \frac{\alpha(\gamma^2-\lambda\alpha)}{(\lambda\alpha+\gamma^2)^2}$
Uniform            | $\frac{1}{b-a}\,H(S-a)H(b-S)$                       | Equation B.14                                                            | Equation B.15
Laplace            | $\frac{1}{2}\exp(-|S|)$                             | Equation 4.9                                                             | Equation 4.11
Positive gaussian  | $\sqrt{2/\pi}\,\exp(-S^2/2)H(S)$                    | Equation B.9                                                             | Equation B.11
Exponential        | $\exp(-S)H(S)$                                      | Equation 4.13                                                            | Equation 4.14

Notes: The priors range from negative kurtosis (binary, gaussian mixture) through zero (gaussian) to positive kurtosis (heavy tail, Laplace); the positive gaussian and exponential rows express nonnegative priors. The step function is defined as $H(S) = 1$ for $S > 0$ and zero otherwise.
data. However, it is straightforward to extend the proposed methodology to allow for this possibility, for example, in an EM setting where the improved mean-field solutions are used in the posterior expectation of the complete log-likelihood. Trivial but tedious calculations show that the variational mean $f$ of a mixture of gaussians is given by

$$f = \frac{\sum_{i=1}^{N_m} \kappa_i e^{\xi_i}\, \dfrac{\gamma\sigma_i^2 + \mu_i}{\lambda\sigma_i^2 + 1}}{\sum_{i=1}^{N_m} \kappa_i e^{\xi_i}}, \qquad (4.5)$$

where we have introduced

$$\kappa_i = \frac{p_i}{\sqrt{\lambda\sigma_i^2 + 1}} \qquad \text{and} \qquad \xi_i = \frac{1}{2}\left(\frac{(\gamma\sigma_i + \mu_i/\sigma_i)^2}{\lambda\sigma_i^2 + 1} - \left(\frac{\mu_i}{\sigma_i}\right)^2\right). \qquad (4.6)$$

The derivatives with respect to $\gamma$ are easy to obtain but are omitted in the interest of space. For the special case of a mixture of two gaussians ($N_m = 2$) with common variance $\sigma^2$ and means $\mu_i = \pm\mu$, we get

$$f = \frac{1}{\lambda\sigma^2 + 1}\left(\gamma\sigma^2 + \mu\tanh\left(\frac{\gamma\mu}{\lambda\sigma^2 + 1}\right)\right). \qquad (4.7)$$
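As a numerical sanity check, the general mixture mean (equations 4.5 and 4.6, as transcribed here) can be compared against the symmetric special case, equation 4.7, and against brute-force quadrature of equation 3.9; a sketch with illustrative parameter values:

```python
import math

def f_mog(gamma, lam, p, mu, sig):
    """Variational mean of a gaussian mixture prior, eqs. 4.5-4.6 (as transcribed)."""
    num = den = 0.0
    for pi, mi, si in zip(p, mu, sig):
        kap = pi / math.sqrt(lam * si * si + 1.0)
        xi = 0.5 * ((gamma * si + mi / si) ** 2 / (lam * si * si + 1.0)
                    - (mi / si) ** 2)
        w = kap * math.exp(xi)
        num += w * (gamma * si * si + mi) / (lam * si * si + 1.0)
        den += w
    return num / den

def f_bigauss(gamma, lam, mu, s2):
    """Symmetric two-component special case, eq. 4.7."""
    return (gamma * s2 + mu * math.tanh(gamma * mu / (lam * s2 + 1.0))) / (lam * s2 + 1.0)

def post_mean(prior, gamma, lam, lo=-12.0, hi=12.0, n=4001):
    """Brute-force trapezoidal quadrature of eq. 3.9 (verification only)."""
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        s = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        q = prior(s) * math.exp(gamma * s - 0.5 * lam * s * s)
        num += w * s * q
        den += w * q
    return num / den

mu, s2 = 1.5, 0.5
sig = math.sqrt(s2)
# unnormalized prior; the ratio in eq. 3.9 is unaffected by normalization
prior = lambda s: 0.5 * (math.exp(-(s - mu) ** 2 / (2 * s2))
                         + math.exp(-(s + mu) ** 2 / (2 * s2)))
for gamma, lam in [(0.3, 1.0), (-1.0, 2.0), (2.0, 0.7)]:
    a = f_mog(gamma, lam, [0.5, 0.5], [mu, -mu], [sig, sig])
    assert abs(a - f_bigauss(gamma, lam, mu, s2)) < 1e-10   # eq. 4.5 -> eq. 4.7
    assert abs(a - post_mean(prior, gamma, lam)) < 1e-5     # matches quadrature
# sigma^2 -> 0, mu = 1 recovers the binary source: f = tanh(gamma)
assert abs(f_bigauss(0.8, 1.0, 1.0, 1e-12) - math.tanh(0.8)) < 1e-9
```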
For $\sigma^2 = 0$ and $\mu = 1$, we recover the variational mean for the binary source $P(S) = \frac{1}{2}\delta(S-1) + \frac{1}{2}\delta(S+1)$: $f = \tanh(\gamma)$. This particular choice of the bigaussian source distribution, equation 4.7, which is also known as the symmetric Pearson mixture density, was proposed in Girolami (1998) as a simple way of achieving a negative-kurtosis (subgaussian) density function. To become familiar with the $f$-function and its derivative, consider the variational mean of the bigaussian with $\sigma^2 = 1$ shown in Figures 1a and 1b for two values of $\mu$: $\mu = 1$, for which the density function is unimodal, and $\mu = 4$, for which the density function is significantly bimodal. We see that the more bimodal the source distribution is, the more compact the region of high curvature becomes. By introducing additional mixture components, it is possible to shape the region of high curvature, as illustrated in Figure 1g in the case of a mixture of five gaussians.

4.3 Laplace Source Prior. Although a subgaussian distribution may be a reasonable source prior for some applications, such as telecommunications (discrete priors; see van der Veen, 1997) or processing of functional magnetic resonance images (Petersen, Hansen, Kolenda, Rostrup, & Strother, 2000), there is a large class of interesting real-world signals, such as speech, with heavier tails than the gaussian distribution. We therefore need to consider source priors that have positive kurtosis (supergaussian). One such choice, which has been widely used in the ICA community, is $P(S) = 1/(\pi\cosh S)$ (Bell & Sejnowski, 1995; MacKay, 1996). Using this prior, however, it is not possible to calculate the variational mean analytically. Instead, we consider the Laplace or double exponential distribution, which is very similar. The Laplace density is given by
$$p(S) = \frac{\eta}{2}\, e^{-\eta|S|}, \qquad S \in \mathbb{R},\ \eta > 0. \qquad (4.8)$$
The variational mean can be calculated as

$$f = \frac{1}{\sqrt{\lambda}}\,\frac{\xi_+\kappa_+ + \xi_-\kappa_-}{\kappa_+ + \kappa_-}, \qquad (4.9)$$

where we have introduced

$$\xi_\pm = \frac{\gamma \mp \eta}{\sqrt{\lambda}} \qquad \text{and} \qquad \kappa_\pm = W(\pm\xi_\pm)\,D(\xi_\mp). \qquad (4.10)$$
Using equations 4.1 and 4.2, the derivative is found to be

$$\frac{\partial f}{\partial \gamma} = \frac{1}{\lambda}\left(1 - \xi_+\xi_- + \frac{(\xi_+ - \xi_-)\,D(\xi_+)D(\xi_-)}{\kappa_+ + \kappa_-}\right) + \frac{f}{\sqrt{\lambda}}\,\frac{\xi_+\kappa_- + \xi_-\kappa_+}{\kappa_+ + \kappa_-}. \qquad (4.11)$$
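The Laplace-prior mean and response functions can be cross-checked numerically. A sketch (all parameter values below are illustrative) validates them against quadrature of equation 3.9 and a finite difference:

```python
import math

def D(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def W(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f_laplace(gamma, lam, eta):
    """Variational mean for the Laplace prior, eqs. 4.9-4.10 (as transcribed)."""
    r = math.sqrt(lam)
    xp, xm = (gamma - eta) / r, (gamma + eta) / r
    kp, km = W(xp) * D(xm), W(-xm) * D(xp)
    return (xp * kp + xm * km) / (r * (kp + km))

def df_laplace(gamma, lam, eta):
    """Response function, eq. 4.11 (as transcribed)."""
    r = math.sqrt(lam)
    xp, xm = (gamma - eta) / r, (gamma + eta) / r
    kp, km = W(xp) * D(xm), W(-xm) * D(xp)
    f = (xp * kp + xm * km) / (r * (kp + km))
    return ((1.0 - xp * xm + (xp - xm) * D(xp) * D(xm) / (kp + km)) / lam
            + f * (xp * km + xm * kp) / (r * (kp + km)))

def post_mean(prior, gamma, lam, lo=-12.0, hi=12.0, n=4001):
    """Trapezoidal quadrature of eq. 3.9 (verification only)."""
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        s = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        q = prior(s) * math.exp(gamma * s - 0.5 * lam * s * s)
        num += w * s * q
        den += w * q
    return num / den

eta = 1.0
prior = lambda s: 0.5 * eta * math.exp(-eta * abs(s))
for gamma, lam in [(0.0, 1.0), (0.8, 1.5), (-1.2, 0.6)]:
    assert abs(f_laplace(gamma, lam, eta) - post_mean(prior, gamma, lam)) < 1e-4
    step = 1e-4
    fd = (f_laplace(gamma + step, lam, eta)
          - f_laplace(gamma - step, lam, eta)) / (2 * step)
    assert abs(df_laplace(gamma, lam, eta) - fd) < 1e-6
```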
Figure 1: The variational mean $f$ (left columns) and its derivative $f'$ (right columns) as a function of $\gamma$ and $\lambda$. (a, b) The bigaussian case with $\sigma^2 = 1$ for $\mu_i = \pm 1$ and $\mu_i = \pm 4$, respectively. (c, d) The Laplacian prior for decay rates $\eta = 1/2$ and $\eta = 2$, respectively. (e, f) The exponential prior for decay rates $\eta = 1/2$ and $\eta = 2$, respectively. (g) The variational mean $f$ and the derivative $f'$ of a mixture of five gaussians with mixing proportions $p_i = 1/5$, means $\mu_i = \{-4, -1, 0, 1, 4\}$, and standard deviations $\sigma_i = \{1, 2, 4, 2, 1\}$. (h) The heavy-tailed prior, equation 4.17, with $\alpha = 1$.
Figures 1c and 1d show the variational mean and its derivative for a slowly decaying ($\eta = 0.5$) and a quickly decaying ($\eta = 2$) Laplacian prior. The Laplacian prior has, contrary to the bigaussian source, its region of high curvature at numerically large values of $\gamma$.

4.4 Exponential Source Prior. Some application domains naturally restrict the possible range of the hidden sources and the mixing matrix due
to the physical interpretation of these quantities in the generative model. This is, for instance, the case when the measured signal is known to be a positive superposition of latent counting numbers or intensities. Positivity constraints are relevant in parts-based representations of natural images, deconvolution of the power spectrum of nuclear magnetic resonance (NMR) spectrometers, and latent semantic analysis in text mining (Lee & Seung, 1999). In this section, we consider the exponential source prior parameterized by

$$p(S) = \eta\, e^{-\eta S}, \qquad S \in \mathbb{R}_+,\ \eta > 0, \qquad (4.12)$$
which gives

$$f = \frac{1}{\sqrt{\lambda}}\,\frac{\xi\, W(\xi) + D(\xi)}{W(\xi)}, \qquad (4.13)$$

$$\frac{\partial f}{\partial \gamma} = \frac{1}{\lambda} - \frac{D(\xi)}{\sqrt{\lambda}\, W(\xi)}\, f, \qquad (4.14)$$

with

$$\xi = \frac{\gamma - \eta}{\sqrt{\lambda}}. \qquad (4.15)$$
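A direct implementation of these equations needs a guard for very negative $\xi$, where the cdf in the denominator underflows; the text below motivates the substitution $D(\xi)/W(\xi) \to -\xi$ (equation 4.16), under which $f \to 0$. A sketch applying that prescription (the underflow threshold of $-8$ is an implementation-specific assumption for double precision via `math.erf`):

```python
import math

def D(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def W(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f_exp(gamma, lam, eta):
    """Variational mean for the exponential prior, eqs. 4.13-4.15 (as transcribed)."""
    xi = (gamma - eta) / math.sqrt(lam)
    if xi < -8.0:
        # W(xi) computed via erf underflows to 0 here; eq. 4.16 gives
        # D(xi)/W(xi) -> -xi, hence f -> 0 (and df/dgamma -> 1/lam).
        return 0.0
    return (xi + D(xi) / W(xi)) / math.sqrt(lam)

def post_mean(prior, gamma, lam, lo=0.0, hi=20.0, n=8001):
    """Trapezoidal quadrature of eq. 3.9 (verification only)."""
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        s = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        q = prior(s) * math.exp(gamma * s - 0.5 * lam * s * s)
        num += w * s * q
        den += w * q
    return num / den

eta = 1.0
prior = lambda s: eta * math.exp(-eta * s)      # support s >= 0
for gamma, lam in [(0.5, 1.0), (2.0, 0.8), (-1.0, 1.3)]:
    assert abs(f_exp(gamma, lam, eta) - post_mean(prior, gamma, lam)) < 1e-4
assert f_exp(-20.0, 1.0, eta) == 0.0            # guarded limit, no division by zero
```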
Figures 1e and 1f show the variational mean and its derivative for the exponential source prior. It is verified that the exponential variational mean is nonnegative. At this point, we make some brief comments on algorithmic issues that arise when the normal cdf $W$ appears in the denominator of the variational mean. Special care has to be taken when $\xi \to -\infty$, for example, when $\gamma - \eta < 0$ and $\lambda$ is small, that is, for small self-interactions. Using l'Hospital's rule together with equations 4.1 and 4.2, it is seen that

$$\frac{D(\xi)}{W(\xi)} \to -\xi \qquad \text{for } \xi \to -\infty, \qquad (4.16)$$

which implies that the variational mean $f \to 0$ and its derivative $\partial f/\partial\gamma \to 1/\lambda$ for $\xi \to -\infty$. In section 5.4, we will use this prior to learn a set of sparse, localized basis functions in images. The source priors considered thus far are just some examples of priors for which the variational mean can be computed analytically. In appendix B, we state some additional examples of priors for which this calculation can be carried out analytically.

4.5 Power Law Tail Prior. In the previous sections, we have considered only source priors for which it was possible to carry out the integration, equation 3.9, analytically. For arbitrary source priors, however, the one-dimensional integral may be solved using standard approaches for numerical integration. Alternatively, we could simply use the insight gained in
the previous sections, where we considered the functional form of the variational mean of various source priors, to come up with computationally tractable $f$-functions directly. To give an example of this, we will construct an $f$ that for large $|\gamma|/\sqrt{\lambda}$ corresponds to a distribution with a power law tail, $P(S) \propto |S|^{-\alpha}$. In this limit, the integral in equation 3.9 is dominated by its saddle point. The saddle-point value of $S$ is

$$S_0 = \frac{\gamma}{2\lambda}\left(1 + \sqrt{1 - \frac{4\alpha\lambda}{\gamma^2}}\right) \approx \frac{\gamma}{\lambda} - \frac{\alpha}{\gamma}.$$

This gives the behavior of the mean function for large $\gamma$. We can now straightforwardly construct a mean function that has this asymptotic behavior and is well defined for small values of $\gamma$:

$$f = \frac{\gamma}{\lambda} - \frac{\alpha\gamma}{\alpha\lambda + \gamma^2}. \qquad (4.17)$$
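A quick check of equation 4.17 and its intended asymptotics (the parameter values are illustrative):

```python
def f_heavy(gamma, lam, alpha):
    """Heavy-tail mean function, eq. 4.17: well defined for all gamma."""
    return gamma / lam - alpha * gamma / (alpha * lam + gamma * gamma)

# Large |gamma|: f approaches the saddle-point value gamma/lam - alpha/gamma.
for g in (50.0, 200.0, -120.0):
    assert abs(f_heavy(g, 1.0, 1.0) - (g - 1.0 / g)) < 1e-3
# Small gamma: no singularity, unlike the bare saddle-point expression.
assert f_heavy(0.0, 2.0, 1.0) == 0.0
```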
Figure 1h shows the heavy-tail $f$-function as a function of $\gamma$ and $\lambda$. Figure 2 shows, for fixed $\lambda = 1$, the variational mean and derivative for some of the unconstrained source priors considered so far. For $\gamma \to \infty$, the gaussian and the uniform (improper) priors give, respectively, the lower and upper values of $f$ among the priors considered. The variational means and derivatives for the priors considered in this paper are summarized in Table 1.

5 Simulations
In this section, we compare the performance of the different mean-field approaches described in the previous sections: NMF, LR correction, and TAP. To begin, we conduct two experiments with artificially generated data. The source priors used in these experiments are equal to the source prior that generated the data set. We consider both the complete case, in which two binary sources are mixed into two sensors, and the overcomplete case of three continuous sources mixed into two sensors. Finally, we apply the linear response corrected mean-field approach to perform ICA on two real-world data sets: speech signals and parts of the MNIST handwritten digit database.

5.1 Synthetic Binary Sources in a Complete Setting. Independent component analysis of binary sources has been considered in data transmission using binary modulation schemes such as MSK or biphase codes (van der Veen, 1997). Here, we consider a binary source $S = \{\pm 1\}$ with prior distribution $\frac{1}{2}[\delta(S-1) + \delta(S+1)]$. In this case we recover the well-known mean-field equations $\langle S\rangle = \tanh(\gamma)$. Figure 3a shows the column vectors of the mixing matrix and 1000 samples generated from the ICA generative model using a fairly low noise variance, $\sigma^2 = 0.3$. Ideally, the noiseless measurements would consist of the four combinations (with sign) of the columns in the mixing matrix. However, due to the noise, the measurements will be scattered around these prototype observations (shown as + in Figure 3a). Figure 3b
Figure 2: (a) Variational mean and (b) derivative as a function of $\gamma$ for various source priors and fixed $\lambda = 1$. Legends: [- -] gaussian with unit mean and variance; [$\cdots$] mixture of two gaussians with zero mean and standard deviations 1 and 2; [- -] mixture of two gaussians with unit variance and means at $\pm 3$; [solid] and [--] Laplacian with $\eta = 1$ and $\eta = 1/4$, respectively; [$\cdot$-$\cdot$-] heavy tail with $\alpha = 1/2$; [- -] uniform (improper) distribution.
shows, for each of the mean-field approaches, the variance as a function of iteration number. At these moderate noise variances, an improvement in the convergence rate is obtained by using the linear response corrected mean-field solution. The adaptive TAP approach, on the other hand, is seen to have a slower convergence rate, and only a marginal improvement in the estimated noise variance and mixing matrix is obtained. This is due to the fact that this approach is critically sensitive to how well the variational parameters have been determined.
Figure 3: Binary source recovery for a low noise variance, $\sigma^2 = 0.3$. (a) 1000 measurements (scatter plot), $\pm$ the column vectors of the true mixing matrix (the solid axes), and the measurement prototypes (+) for the noiseless case. (b) Estimated variance for NMF, LR, and TAP as a function of iteration. The thick solid line is the true empirical noise variance; the empirical variance is the variance of the 1000 random noise contributions. The trajectories of the fixed-point iteration using (c) NMF, (d) LR, and (e) adaptive TAP. The initial condition is marked $\times$ and the final point $\circ$. The dashed lines are the true mixing matrix.
Figures 3c–e show, for the different mean-field approaches, the trajectories of the fixed-point iterations. All the methods use the same initial conditions ($\times$), and the final point in the trajectory is marked $\circ$. The dashed lines are $\pm$ the column vectors of the true mixing matrix. In this case, there is no significant difference in the mixing matrix estimated using the different mean-field approaches. We now increase the noise variance to $\sigma^2 = 1$. In this case, it is hard to identify the prototype signals from the measured data (see Figure 4a). The naive mean-field approach fails in recovering the mixing matrix. Figure 4c
Figure 4: Binary source recovery for a high noise variance, $\sigma^2 = 1$. (a) 1000 measurements (scatter plot), $\pm$ the column vectors of the true mixing matrix (the solid axes), and the measurement prototypes (+) for the noiseless case. (b) The estimated variance for NMF, LR, and TAP as a function of iteration. The thick solid line is the true empirical noise variance. The trajectories of the fixed-point iteration using (c) NMF, (d) LR, and (e) adaptive TAP. The initial condition is marked $\times$ and the final point $\circ$. The dashed lines are the true mixing matrix.
shows that one of the directions in the mixing matrix vanishes during the fixed-point iterations, which results in the noise variance being overestimated (see Figure 4b). However, the linear response corrected mean-field approach and adaptive TAP recover the true mixing matrix.

5.2 Continuous Sources in an Overcomplete Setting. In this section, the problem is to recover more sources than sensors; in particular, we consider mixing three sources into two sensors. The source used in this experiment is the symmetric Pearson mixture, equation 4.7, with $\mu = 1$. A total of
2000 samples was generated from the generative model (see Figure 5a), and the three mean-field approaches were used to learn the mixing matrix. The trajectories plotted in Figure 5c show that the naive mean-field approach fails in recovering the mixing matrix. Similar to the binary case with high variance, one of the directions in the mixing matrix vanishes (see Figure 5). Only the dominant direction in the data space is captured, whereas the two remaining directions collapse into one "mean" direction. However, both the linear response corrected and the adaptive TAP mean-field approaches succeed in estimating the mixing matrix. The adaptive TAP result appears to be slightly worse than the LR solution. The reason is that adaptive TAP stops after approximately 2200 iterations (for the particular value of ftol used in this experiment; see appendix A), whereas LR continues for a little more than 4000 iterations. The slow convergence of the adaptive TAP approach is essentially due to the estimation of the additional variational parameters. We will restrict ourselves to the LR approach in the following real-world examples, since NMF has turned out to fail in some cases and TAP is considerably more computationally expensive while giving comparable performance.
The fact that natural speech has a heavy-tailed distribution makes this overcomplete problem somewhat easier in the sense that the hidden directions of the mixing matrix reveal themselves clearly in the scatter plot. The linear response corrected mean-eld approach was used in performing ICA with the computationally tractable variational mean, equation 4.17, with a D 1. The initial mixing matrix was randomly picked (shown as the dotted axis in Figure 6a). Figure 6b shows the convergence of the algorithm in terms of the angle between the estimated directions and the true directions (the dashed lines in Figure 6a). Figure 6a shows that the algorithm converges rapidly to a mixing matrix that is very close to the one that actually mixed the speech signals. Figure 7 shows each of the inferred sources plotted against each of the true sources. We see that the three recovered sources are nicely correlated with exactly one of the true sources and (more or less) uncorrelated with the remaining sources. Notice that any relabeling of the sources and corresponding perturbation of the columns of the mixing matrix leaves the solution of the ICA problem invariant. 5.4 Local Feature Extraction with Sparse Positive Encoding. In this section, we apply the linear response corrected ICA algorithm to the problem of
Figure 5: Overcomplete continuous source recovery with $\sigma^2 = 1$. (a) 2000 measurements (scatter plot) and $\pm$ the column vectors (shown at four times scale) of the true mixing matrix (the solid axes). (b) The estimated variance for NMF, LR, and TAP as a function of iteration. The thick solid line is the true empirical noise variance. The trajectories of the fixed-point iteration using (c) NMF, (d) LR, and (e) adaptive TAP. The initial condition is marked $\times$ and the final point $\circ$. The dashed lines are the true mixing matrix.
finding a small set of localized images representing parts of the digit images in the MNIST handwritten digit database. For illustration purposes, we consider only a small subset of the database: the first 500 cases of the handwritten digit 3. This experiment is identical to that reported in Miskin and MacKay (2000). It is natural to consider positivity constraints on latent variables (say, pixels) when dealing with images. However, such constraints are usually ignored by most of the commonly used preprocessing models; for example, the principal component analysis (PCA) generative model amounts simply to sequentially finding orthogonal directions (components) with maximum variance in the data space. Ignoring such constraints is problematic since
Figure 6: Overcomplete speech separation (3-in-2) using the heavy-tailed $f$-function, equation 4.17, with $\alpha = 1$; see Figure 1h. (a) Scatter plot of 1 sec of the mixed speech (8 kHz), the true $A$ (dashed lines), the initial $A$ (black dotted), and the estimated $A$. (b) Estimated angle as a function of iteration. The horizontal lines illustrate the true angles at 0 and $\pm 45$ degrees.
Figure 7: Overcomplete speech separation (3-in-2) using the heavy-tailed $f$-function, equation 4.17, with $\alpha = 1$. Scatter plots of the ICA-estimated sources $S^{(i)}_{\mathrm{est}}$ versus the true sources $S^{(i)}_{\mathrm{true}}$, $i = 1, 2, 3$.
for an unconstrained model to yield positive digit images, there has to be an interaction between positive and negative regions in different components, and it is therefore not obvious what the set of components represents visually. To illustrate these points, we conduct two ICA experiments using the exponential prior $P(S) = e^{-S}$, $S \in \mathbb{R}_+$. In the first experiment, we do not constrain the mixing matrix, whereas in the second experiment, the mixing matrix is constrained to be positive. For both experiments, we assume that there are 25 hidden images. Figure 8a shows the 25 hidden images obtained using ICA with positively constrained sources but an unconstrained mixing matrix. Although the sources in this case are positively constrained, the fact that hidden images are allowed to be subtracted in order to obtain a positive image leads to nonlocal hidden images, which are hard to interpret visually. Figure 8b shows the 25 hidden images obtained by performing ICA with the positivity constraint enforced on the mixing matrix. In this case, the hidden images clearly represent local features, in particular, the different handwriting styles and strokes in the various parts of the written digit.
6 Conclusion
In this article, we have presented a probabilistic (Bayesian) approach to ICA. Sources are estimated by their posterior mean, and maximum a posteriori estimates are used for the mixing matrix and the noise covariance. By this procedure, we derived an EM-type algorithm. The expectation step is carried out using different mean-field (MF) approaches: variational (also known as ensemble learning or naive MF), linear response, and adaptive TAP. The MF theories produce estimates of posterior source correlations of increasing quality. These are needed in the maximization step for the estimates of the mixing matrix and the noise. The importance of a good estimate of the correlations is seen in specific examples where the simplest variational approach fails. The general applicability of the formalism and its MF implementation is demonstrated on local feature extraction in images (using a nonnegative mixing matrix and source priors) and on overcomplete separation of speech (using heavy-tailed source priors). The good performance of the mean-field approach supports the belief that we get fair estimates of the posterior means and covariances. However, a rigorous test requires either explicit numerical integration, which is possible only for low-dimensional problems, or Monte Carlo sampling, which may also be inaccurate in complex cases. In the following, we discuss a number of possible extensions of this work. One obvious extension is the modeling of temporal correlations. The most general formulation of the model with temporal correlations leads to the consideration of the junction tree algorithm.
Figure 8: Feature extraction of the handwritten digit 3 using an exponential prior with $\eta = 1$ and (a) an unconstrained mixing matrix and (b) a positively constrained mixing matrix.
Optimization of the hyperparameters of the prior can be performed by extending the current EM algorithm. The mean-field approach can also be used to derive leave-one-out estimators (Opper & Winther, 2000, 2001), which can be used both for optimization of hyperparameters and for model selection. Model selection can also be performed using the (approximate mean-field) likelihood of either an independent test set or the training set, together with an asymptotic model selection criterion such as the Bayesian information criterion (BIC) (Schwarz, 1978). Finally, it could be interesting to relax some of the basic requirements of the model. The first is the statistical independence of the sources. Our formalism can be extended to treat a priori gaussian correlations between the (nongaussian) sources. We should be able to estimate these correlations effectively by, for example, the linear response technique. Second, the model can be extended to nonlinear mixing by, for example, introducing a sigmoidal squashing of the mixed signal. With some increase in the computational complexity, this situation can also be included in the mean-field framework (Opper & Winther, 2001).
Appendix A: Algorithmic Recipe

Table 2 gives an EM recipe for solving the mean-field equations and the equations for the mixing matrix and the noise covariance. The table specifies which equations have been used. Here, we give the equations for adaptive TAP.

Table 2: Pseudocode for the Mean-Field ICA Algorithms.

Initialization (equations 3.3, 3.4, and 3.8):
  $J := A^T \Sigma^{-1} A$
  $h := A^T \Sigma^{-1} X$
  $\langle S\rangle := 0$ (or small random values if 0 is a fixed point)
  for $m := 1, \ldots, M$ and $t := 1, \ldots, N$: $\lambda_{mt} := J_{mm}$
  $N_{\langle S\rangle} := 20$, $N_\lambda := 10$, $N_A := 10$, $N_\Sigma := 1$, $\mathrm{ftol} := 10^{-5}$
do:
  Expectation step:
  for $N_{\langle S\rangle}$ iterations (equations 3.18 and 3.9):
    for $m := 1, \ldots, M$ and $t := 1, \ldots, N$:
      $\gamma_{mt} := h_{mt} - \sum_{m'} (J_{mm'} - \lambda_{m't}\,\delta_{mm'})\langle S_{m't}\rangle$
      $\delta\langle S_{mt}\rangle := f(\gamma_{mt}, \lambda_{mt}) - \langle S_{mt}\rangle$
    $\langle S\rangle := \langle S\rangle + \eta\,\delta\langle S\rangle$
  for $N_\lambda$ iterations (equations 3.17 and 3.16):
    for $m := 1, \ldots, M$ and $t := 1, \ldots, N$:
      $\Lambda_{mt} := 1/f'(\gamma_{mt}, \lambda_{mt}) - \lambda_{mt}$
    for $m := 1, \ldots, M$ and $t := 1, \ldots, N$:
      $\delta\lambda_{mt} := \left(\left[(\Lambda_t + J)^{-1}\right]_{mm}\right)^{-1} - 1/f'(\gamma_{mt}, \lambda_{mt})$
      $\lambda_{mt} := \lambda_{mt} + \delta\lambda_{mt}$
  for $t := 1, \ldots, N$ (equation 3.14): $\chi^t := (\Lambda_t + J)^{-1}$
  Maximization step:
  for $N_A$ iterations (equation 2.13 or 2.10):
    for $d := 1, \ldots, D$ and $m := 1, \ldots, M$:
      $\delta A_{dm} := \dfrac{[\Sigma^{-1} X \langle S\rangle^T]_{dm}}{[\Sigma^{-1} A \langle S S^T\rangle]_{dm} + a A_{dm} + b}\, A_{dm} - A_{dm}$
      $A_{dm} := A_{dm} + \delta A_{dm}$
  for $N_\Sigma$ iterations (equation 2.9):
    $\delta\Sigma := \frac{1}{N}\langle (X - AS)(X - AS)^T\rangle - \Sigma$
    $\Sigma := \Sigma + \delta\Sigma$
  $J := A^T \Sigma^{-1} A$
  $h := A^T \Sigma^{-1} X$
while $\max(|\delta\langle S_{mt}\rangle|^2, |\delta\lambda_{mt}|^2, |\delta A_{dm}|^2, |\delta\Sigma_{dd'}|^2) > \mathrm{ftol}$

Linear response theory is obtained by omitting the updating step for $\lambda_{mt}$, that is, by setting $N_\lambda := 0$. Furthermore, setting $\chi^t_{mm'} := \delta_{mm'} f'(\gamma_{mt}, \lambda_{mt})$ instead of $\chi^t := (\Lambda_t + J)^{-1}$ leads to the naive mean-field algorithm. In the table, we have given the update rule for the nonnegative mixing matrix, equation 2.13. For the unconstrained mixing matrix, the unconstrained update rule, equation 2.10, should be used. Note that we use a greedy update step for all variables but the means $\langle S\rangle$. Adaptive TAP is especially sensitive to the choice of the learning rate $\eta$. It is therefore made adaptive such that it is increased by a factor of 1.1 if the sum of the squared deviations $\sum_{mt} |\delta\langle S_{mt}\rangle|^2$ decreases compared to the previous update; otherwise, it is decreased by a factor of 2. Our experience with the TAP equations also indicates that running with a variable number of updates of $\langle S\rangle$ could be helpful. However, in the simulations described here, we kept the number of iterations fixed.

Appendix B: Some Additional Analytical Source Priors
In this appendix, we derive the variational mean and response function for some additional analytical source priors that have not been directly used in this article. We show these calculations in some detail since they are of the same type as the ones carried out in deriving the variational means of the sources in section 4.

B.1 Positively Constrained Gaussian Source Prior. Calculating the variational mean, equation 3.9, in general involves calculating an integral of the form

$$\int dS\, P(S)\, e^{-\frac{1}{2}\lambda S^2 + \gamma S}, \qquad (B.1)$$

where $P(S)$ is the source prior. The source priors considered in this article are all of such a form that this integral can be reparameterized into an integral over a gaussian kernel. For this reason, it is useful to have at hand an expression for the integral of a gaussian kernel:

$$\int_{-\infty}^{x} dS\, e^{-\frac{1}{2}\lambda S^2 + \gamma S} = \left(\sqrt{2\pi}\, D\!\left(\frac{\gamma}{\sqrt{\lambda}}\right)\right)^{-1} \int_{-\infty}^{x} dS\, e^{-\frac{1}{2}\lambda (S - \gamma/\lambda)^2} \qquad (B.2)$$

$$= \left(\sqrt{\lambda}\,\sqrt{2\pi}\, D\!\left(\frac{\gamma}{\sqrt{\lambda}}\right)\right)^{-1} \int_{-\infty}^{\xi} dS\, e^{-\frac{1}{2} S^2} \qquad (B.3)$$

$$= \frac{W(\xi)}{\sqrt{\lambda}\, D(\gamma/\sqrt{\lambda})}, \qquad (B.4)$$

where $\xi = \sqrt{\lambda}\,(x - \gamma/\lambda)$. The first equality follows from completing squares and introducing the gaussian pdf, equation 4.1. The second equality follows
by changing the integration variable, whereas the final equality follows by introducing the gaussian cdf, equation 4.2. We can now calculate the following integral:

$$\int_{0}^{+\infty} dS\, e^{-\frac{1}{2}\lambda S^2 + \gamma S} = \int_{-\infty}^{+\infty}(\cdot) - \int_{-\infty}^{0}(\cdot) = \frac{1 - W(-\gamma/\sqrt{\lambda})}{\sqrt{\lambda}\, D(\gamma/\sqrt{\lambda})} = \frac{W(\gamma/\sqrt{\lambda})}{\sqrt{\lambda}\, D(\gamma/\sqrt{\lambda})}. \qquad (B.5)$$

Suppose we are interested in calculating the variational mean of a density having equation B.5 as the partition function. It is remembered that any factor of proportionality independent of $\gamma$ is not needed in calculating the variational mean, that is,

$$f(\gamma, \lambda) = \left(\frac{W}{D}\right)^{-1} \frac{W'_\gamma D - W D'_\gamma}{D^2} = \left(\frac{W}{D}\right)^{-1} \frac{D^2/\sqrt{\lambda} + (\gamma/\lambda)\, W D}{D^2} \qquad (B.6)$$

$$= \frac{\gamma}{\lambda} + \frac{D(\gamma/\sqrt{\lambda})}{\sqrt{\lambda}\, W(\gamma/\sqrt{\lambda})}, \qquad (B.7)$$

where the primes denote differentiation with respect to $\gamma$ and all functions are evaluated at $\gamma/\sqrt{\lambda}$.
We can now return to the problem of calculating the variational mean of a positively constrained gaussian parameterized by

$$p(S \mid \mu, \sigma) \propto e^{-(S - \mu)^2/2\sigma^2}, \qquad S \in \mathbb{R}_+, \qquad (B.8)$$

where $\mu$ and $\sigma^2$ are the mean and variance, respectively. Multiplying the source prior onto the gaussian kernel and identifying terms, it is seen that the product can be written as a gaussian kernel with $\lambda := \lambda + 1/\sigma^2$ and $\gamma := \gamma + \mu/\sigma^2$. Substituting back into equation B.7, we directly obtain the variational mean,

$$f(\gamma, \lambda) = \frac{\gamma + \mu/\sigma^2}{\lambda + 1/\sigma^2} + \frac{D(\kappa)}{\sqrt{\lambda + 1/\sigma^2}\, W(\kappa)}, \qquad (B.9)$$

where we have introduced

$$\kappa = \frac{\gamma + \mu/\sigma^2}{\sqrt{\lambda + 1/\sigma^2}}, \qquad (B.10)$$
and the response function can be readily derived:

$$\frac{\partial f}{\partial \gamma} = \frac{1}{\lambda + 1/\sigma^2}\left(1 - \kappa\,\frac{D(\kappa)}{W(\kappa)} - \left(\frac{D(\kappa)}{W(\kappa)}\right)^2\right). \qquad (B.11)$$
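Equations B.9 through B.11 can again be validated against quadrature of equation 3.9 and a finite difference; a sketch with illustrative parameter values:

```python
import math

def D(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def W(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f_posgauss(gamma, lam, mu, s2):
    """Variational mean for the positively constrained gaussian, eqs. B.9-B.10."""
    lt = lam + 1.0 / s2
    gt = gamma + mu / s2
    k = gt / math.sqrt(lt)
    return gt / lt + D(k) / (math.sqrt(lt) * W(k))

def df_posgauss(gamma, lam, mu, s2):
    """Response function, eq. B.11."""
    lt = lam + 1.0 / s2
    k = (gamma + mu / s2) / math.sqrt(lt)
    r = D(k) / W(k)
    return (1.0 - k * r - r * r) / lt

def post_mean(prior, gamma, lam, lo=0.0, hi=15.0, n=8001):
    """Trapezoidal quadrature of eq. 3.9 (verification only)."""
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        s = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        q = prior(s) * math.exp(gamma * s - 0.5 * lam * s * s)
        num += w * s * q
        den += w * q
    return num / den

mu, s2 = 1.0, 2.0
prior = lambda s: math.exp(-(s - mu) ** 2 / (2.0 * s2))   # unnormalized, s >= 0
for gamma, lam in [(0.3, 0.5), (-0.5, 1.2)]:
    assert abs(f_posgauss(gamma, lam, mu, s2) - post_mean(prior, gamma, lam)) < 1e-4
    step = 1e-4
    fd = (f_posgauss(gamma + step, lam, mu, s2)
          - f_posgauss(gamma - step, lam, mu, s2)) / (2 * step)
    assert abs(df_posgauss(gamma, lam, mu, s2) - fd) < 1e-6
```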
We now consider the variational mean and the response function associated with the uniform source prior.
B.2 Uniform Source Prior. In this section we consider the uniform prior,
P (S) D
1 , b ¡a
S 2 [aI b],
(B.12)
where b ≥ a. By reusing the calculations from section B.1, we directly obtain
\[
\int_{a}^{b} dS\, e^{-\frac{1}{2}\lambda S^2 + \gamma S}
= \int_{-\infty}^{b} (\cdot) - \int_{-\infty}^{a} (\cdot)
= \frac{\Phi(\kappa_b) - \Phi(\kappa_a)}{\sqrt{\lambda}\, D(\gamma/\sqrt{\lambda})}, \tag{B.13}
\]
where \(\kappa_x = \sqrt{\lambda}(x - \gamma/\lambda) = \sqrt{\lambda}\, x - \gamma/\sqrt{\lambda}\). Here, we have again left out the normalizing constant since it is of no importance in calculating the variational mean,
\[
f(\gamma, \lambda) = \frac{\gamma}{\lambda} + \frac{1}{\sqrt{\lambda}}\, \frac{D(\kappa_a) - D(\kappa_b)}{\Phi(\kappa_b) - \Phi(\kappa_a)}, \tag{B.14}
\]
and the response function,
\[
\frac{\partial f}{\partial \gamma} = \frac{1}{\lambda} \left( 1 + \frac{\kappa_a D(\kappa_a) - \kappa_b D(\kappa_b)}{\Phi(\kappa_b) - \Phi(\kappa_a)} - \left( \frac{D(\kappa_a) - D(\kappa_b)}{\Phi(\kappa_b) - \Phi(\kappa_a)} \right)^2 \right). \tag{B.15}
\]
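The uniform-prior formulas B.14 and B.15 admit the same kind of numerical check (our own sketch, with hypothetical helper names):

```python
import math

def D(x):  # gaussian pdf, equation 4.1
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):  # gaussian cdf, equation 4.2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kappas(gamma, lam, a, b):
    # kappa_x = sqrt(lam) * x - gamma / sqrt(lam), defined below B.13
    sl = math.sqrt(lam)
    return sl * a - gamma / sl, sl * b - gamma / sl

def variational_mean_uniform(gamma, lam, a, b):
    """Equation B.14: mean of the gaussian kernel restricted to [a, b]."""
    ka, kb = kappas(gamma, lam, a, b)
    return gamma / lam + (D(ka) - D(kb)) / (math.sqrt(lam) * (Phi(kb) - Phi(ka)))

def response_uniform(gamma, lam, a, b):
    """Equation B.15: the corresponding response function."""
    ka, kb = kappas(gamma, lam, a, b)
    dPhi = Phi(kb) - Phi(ka)
    return (1.0 + (ka * D(ka) - kb * D(kb)) / dPhi
            - ((D(ka) - D(kb)) / dPhi) ** 2) / lam

def numeric_mean(gamma, lam, a, b, n=100000):
    """Mean of the gaussian kernel over [a, b], by the midpoint rule."""
    h = (b - a) / n
    pts = [a + (i + 0.5) * h for i in range(n)]
    w = [math.exp(-0.5 * lam * S * S + gamma * S) for S in pts]
    return sum(S * wi for S, wi in zip(pts, w)) / sum(w)
```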
This appendix showed some examples of the calculations needed in deriving the variational mean and response functions for the source priors considered in this article.

Acknowledgments
We thank Michael Jordan and Manfred Opper for helpful discussions. This research is supported by the Danish Research Councils through the THOR Center for Neuroinformatics and by the Center for Biological Sequence Analysis.
Received December 27, 2000; accepted June 25, 2001.
LETTER
Communicated by Bob Williamson
Neural Networks with Local Receptive Fields and Superlinear VC Dimension

Michael Schmitt
[email protected]
Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany

Local receptive field neurons comprise such well-known and widely used unit types as radial basis function (RBF) neurons and neurons with center-surround receptive field. We study the Vapnik-Chervonenkis (VC) dimension of feedforward neural networks with one hidden layer of these units. For several variants of local receptive field neurons, we show that the VC dimension of these networks is superlinear. In particular, we establish the bound Ω(W log k) for any reasonably sized network with W parameters and k hidden nodes. This bound is shown to hold for discrete center-surround receptive field neurons, which are physiologically relevant models of cells in the mammalian visual system, for neurons computing a difference of gaussians, which are popular in computational vision, and for standard RBF neurons, a major alternative to sigmoidal neurons in artificial neural networks. The result for RBF neural networks is of particular interest since it answers a question that has been open for several years. The results also give rise to lower bounds for networks with fixed input dimension. Regarding constants, all bounds are larger than those known thus far for similar architectures with sigmoidal neurons. The superlinear lower bounds contrast with linear upper bounds for single local receptive field neurons also derived here.

1 Introduction
The receptive field of a neuron is the region of the input domain giving rise to stimuli to which the neuron responds by changing its behavior. Neuron models can be classified according to whether these stimuli are contained in some bounded region or may come from afar—in other words, whether their receptive field is local or not. Prominent examples of local receptive field models are neurons with center-surround receptive field and neurons computing radial basis functions (RBFs), whereas sigmoidal neurons represent a widely used model type having a nonlocal receptive field. The impressive computational and learning capabilities of neural networks, being significantly higher than those of single neurons, are well established by experimental findings in biology, innumerable successful applications in Neural Computation 14, 919–956 (2002)
© 2002 Massachusetts Institute of Technology
practice, and substantial formal arguments in the theories of computation, approximation, and learning. An evident question is to what extent these network capabilities depend on the receptive field type of the neurons. An extensively studied measure for quantifying the computational and learning capabilities of formal systems is the Vapnik-Chervonenkis (VC) dimension. It characterizes the expressiveness of a neural network and is mostly given in terms of the number of network parameters and the network size. A well-known fact is that the VC dimension of sigmoidal neural networks is significantly more than linear. This has been shown for networks growing in depth (Koiran & Sontag, 1997; Bartlett, Maiorov, & Meir, 1998) and for constant-depth networks with the number of hidden layers being two (Maass, 1994) and one (Sakurai, 1993). The two latter results for networks of constant depth are of particular significance since they deal with neural architectures as they are used in practice, where one rarely allows the number of hidden layers to grow indefinitely. Moreover, the case of one hidden layer is of even greater interest, because single neurons almost always have a VC dimension that is linear in the input dimension and, hence, linear in the number of model parameters.¹ This has been found for sigmoidal neurons (Cover, 1965; Haussler, 1992) (see also Anthony & Bartlett, 1999) and for several other models, such as higher-order sigmoidal neurons with restricted degree (Anthony, 1995), for the neuron computing Boolean monomials (Natschläger & Schmitt, 1996), and for the product unit (Schmitt, 2000). Thus, the fact that networks with the minimal number of one hidden layer have superlinear VC dimension corroborates the enormous computational capabilities arising when sigmoidal neurons cooperate in networks. In this article, we study networks with one hidden layer of local receptive field neurons.
We show for several types of receptive fields that the VC dimension of these networks is superlinear. First, we consider discrete models of cells with center-surround receptive field (CSRF). The first real neurons to be identified as having a receptive field with center-surround organization are the ganglion cells in the visual system of the cat. The recording experiments of Kuffler (1953) from the optic nerve revealed the pure on- and off-type responses within specific areas of ganglion cell receptive fields and their concentric, antagonistic center-surround organization. Also, other neurons of the mammalian visual system such as the bipolar cells of the retina and cells in the lateral geniculate nucleus are known to have center-surround receptive fields (see, e.g., Tessier-Lavigne, 1991; Nicholls, Martin, & Wallace, 1992). CSRF neurons play an important role in algorithmic experiments with self-organizing networks. The question how center-surround receptive fields can emerge in artificial networks by adjusting their parameters has been investigated using unsupervised (Linsker, 1986, 1988; Atick
¹ A notable exemplar of a single neuron having superlinear VC dimension is the model of a spiking neuron studied by Maass and Schmitt (1999).
& Redlich, 1993; Schmidhuber, Eldracher, & Foltin, 1996) as well as supervised learning mechanisms (Joshi & Lee, 1993; Yasui, Furukawa, Yamada, & Saito, 1996). It is found that cells similar to those of the first few stages of the mammalian visual system develop when applying simple learning rules and using training data from realistic visual scenes. Neural networks with center-surround receptive fields have also been fabricated in analog VLSI hardware. The silicon retinas constructed by Mead and Mahowald (1988) (see also Mead, 1989), Ward and Syrzycki (1995), and Liu and Boahen (1996) consist of neuromorphic cells performing operations of biological receptive fields. The second type of receptive field neuron we consider is called the difference-of-gaussians (DOG) neuron and is a continuous version of the above model. It also has its origin in neurobiology. Extending the work of Kuffler (1953), Rodieck (1965) was probably the first to introduce the DOG as a quantitative model for the functional responses of ganglion cells. The importance and the physiological plausibility of the DOG for satisfactorily fitting functions to experimental data from retinal ganglion cell recordings is also demonstrated by the work of Enroth-Cugell and Robson (1966). The DOG is generally accepted as a mathematical description of the behavior of several cell types in the retino-cortical pathway,² such as the abovementioned bipolar cells, ganglion cells, and cells in the lateral geniculate nucleus (Marr & Hildreth, 1980; Marr, 1982; Glezer, 1995). Models based on DOG functions may even provide better descriptions of experimental data than other common models of visual processing, as shown by Hawken and Parker (1987) in their study of the monkey primary visual cortex. The third and last type of local receptive field neuron in this study is the RBF neuron, specifically the standard, that is, gaussian RBF neuron.
RBF networks are among the major neural network types used in practice (see, e.g., Bishop, 1995; Ripley, 1996). They are appreciated because of their powerful capabilities in function approximation and learning that are also theoretically well founded. A series of articles specifically deals with showing that under rather mild conditions, RBF networks can uniformly approximate continuous functions on compact domains arbitrarily closely (Hartman, Keeler, & Kowalski, 1990; Park & Sandberg, 1991, 1993; Mhaskar, 1996). Even before RBF networks were considered as artificial neural networks, they were a well-established method for multivariable interpolation and function approximation. A comprehensive account of the approximation theory of RBFs up to 1990 is given by Powell (1992). The connections between approximation theory and learning in adaptive networks of RBF neurons were initially explored by Broomhead and Lowe (1988) and Poggio

² In particular, Marr and Hildreth (1980) prove that under certain conditions, the DOG operator closely approximates the Laplacian-of-gaussians, also known as Marr filter, which they show to be well suited to detect intensity changes and, especially, edges in images.
and Girosi (1990). Moody and Darken (1989) studied learning algorithms for RBF networks that can be implemented as real-time adaptive systems. They showed that combined supervised and unsupervised learning methods can be computationally faster in RBF networks than the gradient-based methods devised for sigmoidal networks. (See Howlett and Jain, 2001a, 2001b, and Yee & Haykin, 2001, for recent developments in RBF neural networks.) There has been previous work on the VC dimension of RBF networks. Bartlett and Williamson (1996) show that the VC dimension and the related pseudodimension of RBF networks with discrete inputs is O(W log(WD)), where W is the number of network parameters and the inputs take on values from {−D, …, D}. The best-known upper bound for RBF networks with unconstrained inputs is due to Karpinski and Macintyre (1997) and is O(W²k²), where k denotes the number of network nodes. Holden and Rayner (1995) address the generalization capabilities of networks having RBF units with fixed parameters and establish a linear upper bound on the VC dimension. Anthony and Holden (1994) consider fully adaptable RBF networks with adjustable hidden and output node parameters. Referring to the lower bound Ω(W log W) for sigmoidal networks due to Maass (1994), they write, “we leave as an open question whether it is possible to obtain a lower bound similar to that recently proved by Maass (1993, 1994) for certain feedforward networks” (p. 104). The work of Erlich, Chazan, Petrack, and Levy (1997) together with a result of Lee, Bartlett, and Williamson (1995) gives a linear lower bound (see also Lee, Bartlett, & Williamson, 1997). Although there exists already a large collection of VC dimension bounds for neural networks, it has not been known thus far whether the VC dimension of RBF neural networks is superlinear.
Major reasons for this might be that previous results establishing superlinear bounds are based on methods geared to sigmoidal³ neurons or consider networks having an unrestricted number of layers⁴ (Sakurai, 1993; Maass, 1994; Koiran & Sontag, 1997; Bartlett et al., 1998). In this article, we prove that the VC dimension of RBF networks is indeed superlinear, thus answering the question of Anthony and Holden (1994) quoted above. Precisely, we show that every network with n input nodes, W parameters, and one hidden layer of k RBF neurons, where k ≤ 2^((n+2)/2), has VC dimension⁵ Ω(W log k). Thus, the cooperative network effect observed in sigmoidal networks is also existent in RBF networks. This result also has implications for the complexity of learning with RBF networks, all the more since it entails the same lower bound for the related notions of pseudodimension and fat-shattering dimension. We do not state these consequences explicitly here but refer readers to Anthony and Bartlett (1999) instead. Before establishing the lower bound for RBF networks, however, we show that the bound Ω(W log k) holds for the VC dimension of DOG networks. From this, the bound for RBF networks is then immediately obtained. The result for DOG networks, in turn, is derived from the superlinear lower bound for discrete CSRF networks that we establish first. Thus, this work creates a link between these three neuron models not only by focusing on their common receptive field property but also by the logical requisite of the successive proofs of the VC dimension bounds. We introduce definitions and notation in section 2. The two subsequent sections contain the derivations of the superlinear lower bounds. In section 3, we consider networks of discrete local receptive field neurons, specifically the binary and ternary CSRF neuron and a discrete variant of the RBF neuron, the binary RBF neuron. In section 4, we study networks of DOG neurons and standard RBF neural networks. We note that all bounds derived in both these sections have larger constant factors than those known for sigmoidal networks of constant depth thus far. In particular, we obtain the bound (W/5) log(k/4) for binary CSRF neurons and for DOG neurons, and the bound (W/12) log(k/8) for ternary CSRF neurons, binary RBF neurons, and gaussian RBF neurons. For comparison, sigmoidal networks are known with one hidden layer and VC dimension at least (W/32) log(k/4), and with two hidden layers and VC dimension at least (W/132) log(k/16) (see Anthony & Bartlett, 1999, section 6.3).

³ For a quite general definition of a sigmoidal neuron that does not capture RBF neurons, see Koiran and Sontag (1997).
⁴ We point out that it might be possible to obtain superlinear lower bounds for local receptive field networks pursuing the approaches of Koiran and Sontag (1997) and Bartlett et al. (1998), but only at the expense of allowing arbitrary depth. In particular, this has no relevance for standard RBF networks.
⁵ Note that this result also gives rise to the lower bound Ω(W log W) by choosing a network with k = n hidden units.
The results in sections 3 and 4 also give rise to lower bounds for local receptive field networks when the input dimension is fixed. In section 5, we present upper and lower bounds for single neurons. In particular, we show that the VC dimension of binary RBF and CSRF neurons and of ternary CSRF neurons is linear. Further, we derive such a result for the pseudodimension of the gaussian RBF neuron. Finally, in section 6, we return to networks of discrete local receptive field neurons and establish the upper bound O(W log k) for all discrete variants of local receptive field neurons considered here. This implies that the lower bounds for these networks are asymptotically optimal. We conclude with section 7 discussing the results and presenting some open questions. An appendix gives the derivation of a bound for a specific class of functions defined in terms of halfspaces and required for a result in section 5.1.

2 Definitions
We first introduce the types of neurons and networks with local receptive fields that we study. Then we give the definitions of the VC dimension and other basic concepts.
2.1 Networks of Local Receptive Field Neurons. We start with discrete neurons. Let ‖u‖ denote the Euclidean norm of vector u. A binary CSRF neuron computes the function \(g_{\mathrm{bCSRF}}: \mathbb{R}^{2n+2} \to \{0, 1\}\) defined as
\[
g_{\mathrm{bCSRF}}(c, a, b, x) = \begin{cases} 1 & \text{if } a \le \|x - c\| \le b, \\ 0 & \text{otherwise}, \end{cases}
\]
with input variables x₁, …, xₙ, and parameters c₁, …, cₙ, a, b, where b > a > 0. The vector (c₁, …, cₙ) is called the center of the neuron, and a, b are its center radius and surround radius, respectively. We also refer to this neuron as binary off-center on-surround neuron and call for given parameters c, a, b the set \(\{x: g_{\mathrm{bCSRF}}(c, a, b, x) = 1\}\) the surround region of the neuron. A ternary CSRF neuron is defined by means of the function \(g_{\mathrm{tCSRF}}: \mathbb{R}^{2n+2} \to \{-1, 0, 1\}\) with
\[
g_{\mathrm{tCSRF}}(c, a, b, x) = \begin{cases} 1 & \text{if } a \le \|x - c\| \le b, \\ -1 & \text{if } \|x - c\| < a, \\ 0 & \text{otherwise}. \end{cases}
\]
This neuron is also called ternary off-center on-surround neuron. Finally, a binary RBF neuron computes the function \(g_{\mathrm{bRBF}}: \mathbb{R}^{2n+1} \to \{0, 1\}\) satisfying
\[
g_{\mathrm{bRBF}}(c, b, x) = \begin{cases} 1 & \text{if } \|x - c\| \le b, \\ 0 & \text{otherwise}. \end{cases}
\]
Figure 1 shows the receptive field of these neurons for the case n = 2. We emphasize that the output values in these definitions are meant to be symbolic and represent discrete levels of neural activity. So a 1 corresponds to a state where the neuron is highly active, whereas −1 indicates low activity. The value 0 signifies that the neuron is silent. Furthermore, the specific assignment of values to the activity levels is not relevant for the results derived in this article. For instance, any two distinct nonzero values A < B instead of −1, 1 can be chosen for the ternary CSRF neuron without affecting the validity of the lower bounds. The same holds for the binary CSRF and binary RBF neuron, where the value 1 can be replaced by any other nonzero value. The assignment of output values to points lying on a radius also allows some freedom. For instance, we could alternatively require that for ‖x − c‖ = a, we have \(g_{\mathrm{bCSRF}}(c, a, b, x) = 0\) or \(g_{\mathrm{tCSRF}}(c, a, b, x) = -1\). The same is true for ‖x − c‖ = b. The VC dimension bounds do not rely on the values for the radii and hence still hold for these and other cases. We have defined here only off-center on-surround variants of CSRF neurons.
In neurobiology, models of on-center off-surround cells are equally important (see, e.g., Tessier-Lavigne, 1991; Nicholls et al., 1992). In these neurons, the activity in the surround region is lower than in the center.
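The three discrete neuron types translate directly into code; a minimal sketch (function and helper names are ours, not from the article):

```python
import math

def norm(x, c):
    # Euclidean distance ||x - c||
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def g_bCSRF(c, a, b, x):
    """Binary CSRF neuron: 1 on the surround region a <= ||x - c|| <= b."""
    return 1 if a <= norm(x, c) <= b else 0

def g_tCSRF(c, a, b, x):
    """Ternary CSRF neuron: -1 inside the center, 1 on the surround, 0 outside."""
    r = norm(x, c)
    if a <= r <= b:
        return 1
    return -1 if r < a else 0

def g_bRBF(c, b, x):
    """Binary RBF neuron: 1 on the ball ||x - c|| <= b."""
    return 1 if norm(x, c) <= b else 0
```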
Figure 1: Receptive field of discrete neurons with center c, center radius a, and surround radius b. Output values A = 0, B = 1 correspond to a binary CSRF neuron, A = −1, B = 1 yields a ternary CSRF neuron, and a binary RBF neuron is given by A = B = 1. Outside the regions labeled A or B, the output is always 0.
Since we are considering networks that are weighted combinations of neurons, such a definition is redundant here. An on-center off-surround neuron in a network can be replaced by an off-center on-surround neuron, and vice versa, by simply multiplying the weight outgoing from it with a negative number. The two types of continuous local receptive field neurons considered in this article are defined as follows. A gaussian RBF neuron computes the function \(g_{\mathrm{RBF}}: \mathbb{R}^{2n+1} \to \mathbb{R}\) defined as
\[
g_{\mathrm{RBF}}(c, \sigma, x) = \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right),
\]
with input variables x₁, …, xₙ, and parameters c₁, …, cₙ and σ > 0. Here (c₁, …, cₙ) is the center and σ > 0 the width. A DOG neuron is defined as a function \(g_{\mathrm{DOG}}: \mathbb{R}^{2n+4} \to \mathbb{R}\) computed by the weighted difference of two RBF neurons with equal centers, that is,
\[
g_{\mathrm{DOG}}(c, \sigma, \tau, \alpha, \beta, x) = \alpha\, g_{\mathrm{RBF}}(c, \sigma, x) - \beta\, g_{\mathrm{RBF}}(c, \tau, x),
\]
where σ, τ > 0. Examples of \(g_{\mathrm{RBF}}\) and \(g_{\mathrm{DOG}}\) for input dimension 2 are shown in Figure 2.
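The two continuous neurons admit an equally small sketch (again with our own names; σ, τ are the widths and α, β the weights of the DOG difference):

```python
import math

def sqdist(x, c):
    # squared Euclidean distance ||x - c||^2
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def g_RBF(c, sigma, x):
    """Gaussian RBF neuron: exp(-||x - c||^2 / sigma^2)."""
    return math.exp(-sqdist(x, c) / sigma ** 2)

def g_DOG(c, sigma, tau, alpha, beta, x):
    """DOG neuron: weighted difference of two RBFs with equal centers."""
    return alpha * g_RBF(c, sigma, x) - beta * g_RBF(c, tau, x)
```

With α > β and τ > σ, the DOG traces the familiar Mexican-hat profile: positive at the center and negative in an annulus around it.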
Figure 2: Receptive field functions computed by a gaussian radial basis function neuron (left) and a difference-of-gaussians neuron (right).
The neural networks we are studying are of the feedforward type and have one hidden layer. They compute functions of the form \(f: \mathbb{R}^{W+n} \to \mathbb{R}\), where W is the number of network parameters, n the number of input nodes, and f is defined as
\[
f(w, y, x) = w_0 + w_1 h_1(y, x) + \cdots + w_k h_k(y, x).
\]
The k hidden nodes may compute any of the functions defined above, that is, \(h_1, \ldots, h_k \in \{g_{\mathrm{bCSRF}}, g_{\mathrm{tCSRF}}, g_{\mathrm{bRBF}}, g_{\mathrm{RBF}}, g_{\mathrm{DOG}}\}\). The parameters of the hidden nodes are gathered in y, from which each node selects its own parameters. The network has a linear output node with parameters w₀, …, w_k, also known as the output weights. The parameter −w₀ is also called the output threshold. For simplicity, we sometimes refer to all network parameters as weights. If \(h_i = g_{\mathrm{RBF}}\) for i = 1, …, k, we have the standard form of a gaussian RBF neural network.

2.2 Vapnik-Chervonenkis Dimension of Neural Networks. A dichotomy of a set S ⊆ ℝⁿ is a pair (S₀, S₁) of subsets such that S₀ ∩ S₁ = ∅ and S₀ ∪ S₁ = S. A class F of functions mapping ℝⁿ to {0, 1} is said to shatter S if every dichotomy (S₀, S₁) of S is induced by some f ∈ F, in the sense that f satisfies f(S₀) ⊆ {0} and f(S₁) ⊆ {1}. The function sgn: ℝ → {0, 1} satisfies sgn(x) = 1 if x ≥ 0, and sgn(x) = 0 otherwise.
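Put together, the one-hidden-layer network function f and the sgn thresholding can be sketched as follows (our own code, using gaussian RBF hidden nodes for concreteness):

```python
import math

def g_RBF(c, sigma, x):
    # gaussian RBF hidden node
    return math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / sigma ** 2)

def network(w, hidden, x):
    """f(w, y, x) = w0 + w1*h1(y, x) + ... + wk*hk(y, x), linear output node.
    `hidden` lists one (center, width) parameter pair per RBF hidden node."""
    return w[0] + sum(wi * g_RBF(c, s, x)
                      for wi, (c, s) in zip(w[1:], hidden))

def sgn(v):
    """sgn as defined in the text: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0
```

A set is shattered when, for each of its dichotomies, some parameter setting makes `sgn(network(...))` output 1 exactly on the points of S₁.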
Definition 1. Let N be a neural network and F be the class of functions computed by N. The Vapnik-Chervonenkis (VC) dimension of N is the cardinality of the largest set shattered by the class {sgn ∘ f : f ∈ F}.
The pseudodimension and the fat-shattering dimension are generalizations of the VC dimension that apply in particular to real-valued function classes. The lower bounds for local receptive field networks presented in this article are stated for the VC dimension, but they also hold for the pseudodimension and the fat-shattering dimension. The bound on the pseudodimension follows from the fact that the VC dimension of a neural network is by definition not larger than its pseudodimension. The bound on the fat-shattering dimension is implied because the output weights of the neural networks considered here can be scaled arbitrarily. The definition of the pseudodimension will be given in section 5.2, where we establish a linear upper bound for the single gaussian RBF neuron. We refer the reader to Anthony and Bartlett (1999) for a definition of the fat-shattering dimension and results about the relationship between these three notions of dimension.

2.3 Further Concepts and Notation. An (n−1)-dimensional hyperplane in ℝⁿ is represented by a vector (w₀, …, wₙ) ∈ ℝⁿ⁺¹ and defined as the set
{x ∈ ℝⁿ : w₀ + w₁x₁ + ⋯ + wₙxₙ = 0}. An (n−1)-dimensional hypersphere in ℝⁿ is given by a center c ∈ ℝⁿ and a radius r > 0, and defined as the set {x ∈ ℝⁿ : ‖x − c‖ = r}. We clearly distinguish the hypersphere from a ball, which is defined as the set {x ∈ ℝⁿ : ‖x − c‖ ≤ r}. We also consider hyperplanes and hyperspheres in ℝⁿ with a dimension k < n − 1. In this case, a k-dimensional hyperplane is the intersection of two (k+1)-dimensional hyperplanes, assuming that the intersection is nonempty. Similarly, the nonempty intersection of two (k+1)-dimensional hyperspheres yields a k-dimensional hypersphere, provided that the intersection is not a single point. We use “ln” to denote the natural logarithm and “log” for the logarithm to base 2.

3 Superlinear Lower Bounds for Networks of Discrete Neurons
In this section we establish superlinear lower bounds for networks consisting of discrete versions of local receptive field neurons. Crucial is the result for binary CSRF networks, presented in section 3.2, from which the bounds
Figure 3: Positive and negative examples for sets in spherically general position (see definition 2). The set {x₁, x₂, x₃} has a line passing through its points, and the set {x₁, x₂, x₄, x₅} lies on a circle. Hence, any set that includes one of these sets (or both) is not in spherically general position since they violate conditions 1 and 2, respectively. A positive example is the set {x₂, x₃, x₄, x₅, x₆}.
for ternary CSRF networks and binary RBF networks in sections 3.3 and 3.4, respectively, follow. First, however, we introduce a geometric property of certain finite sets of points.

3.1 Geometric Preliminaries
Definition 2. A set S of m points in ℝⁿ is said to be in spherically general position if the following two conditions are satisfied:
1. For every k ≤ min(n, m − 1) and every (k+1)-element subset P ⊆ S, there is no (k−1)-dimensional hyperplane containing all points in P.

2. For every l ≤ min(n, m − 2) and every (l+2)-element subset Q ⊆ S, there is no (l−1)-dimensional hypersphere containing all points in Q.

The definition is illustrated by Figure 3 showing six points in ℝ². The entire set is not in spherically general position, as witnessed by the line and the circle. It is easy, but may take a while, to verify that the set {x₂, x₃, x₄, x₅, x₆} is indeed in spherically general position. Sets satisfying only condition 1 are commonly referred to as being in general position (see, e.g., Cover, 1965; Nilsson, 1990). Thus, a set in spherically general position is particularly in general position. (The converse does not always hold, as can be seen from Figure 3: The set {x₁, x₂, x₄, x₅} meets condition 1 but not condition 2.) For establishing the superlinear lower bounds
on the VC dimension, we require sets in spherically general position with sufficiently many elements. It is easy to show that for any dimension n, there exist arbitrarily large such sets. The proof of the following proposition provides a method for constructing them.

Proposition 1. For every n, m ≥ 1 there exists a set S ⊆ ℝⁿ of m points in spherically general position.
Proof. We perform induction on m. Clearly, every point trivially satisfies conditions 1 and 2. Assume that some set S ⊆ ℝⁿ of cardinality m has been constructed. Then by the induction hypothesis, for every k ≤ min(n, m), every k-element subset P ⊆ S does not lie on a hyperplane of dimension less than k − 1. Hence, every P ⊆ S, |P| = k ≤ min(n, m), uniquely specifies a (k−1)-dimensional hyperplane H_P that includes P. The induction hypothesis implies further that no point in S∖P lies on H_P. Analogously, for every l ≤ min(n, m − 1), every (l+1)-element subset Q ⊆ S does not lie on a hypersphere of dimension less than l − 1. Thus, every Q ⊆ S, |Q| = l + 1 ≤ min(n, m − 1) + 1, uniquely determines an (l−1)-dimensional hypersphere B_Q containing all points in Q and none of the points in S∖Q. To obtain a set of cardinality m + 1 in spherically general position, we observe that the union of all hyperplanes and hyperspheres considered above—that is, the union of all H_P and all B_Q for all subsets P and Q—has Lebesgue measure 0. Hence, there is some point s ∈ ℝⁿ not contained in any hyperplane H_P and not contained in any hypersphere B_Q. By adding s to S, we then obtain a set of cardinality m + 1 in spherically general position.

3.2 Networks of Binary CSRF Neurons. The following theorem is the main step in establishing the superlinear lower bound.
Theorem 1. Let h, q, m ≥ 1 be arbitrary natural numbers. Suppose N is a network with one hidden layer consisting of binary CSRF neurons, where the number of hidden nodes is h + 2^q and the number of input nodes is m + q. Assume further that the output node is linear. Then there exists a set of cardinality hq(m + 1) shattered by N. This holds even if the output weights of N are fixed to 1.
930
Michael Schmitt

Proof. Before starting with the details, we give a brief outline. The main idea is to imagine the set we want to shatter as being composed of groups of vectors, where the groups are distinguished by means of the first $m$ components and the remaining $q$ components identify the group members. We catch these groups by hyperspheres such that each hypersphere is responsible for up to $m+1$ groups. The condition of spherically general position will ensure that this operation works. The hyperspheres are then expanded to become surround regions of off-center on-surround neurons. To induce a dichotomy of the given set, we split the groups. We do this for each group using the $q$ last components in such a way that the points with designated output 1 stay within the surround region of the respective neuron and the points with designated output 0 are expelled from it. In order for this to succeed, we have to make sure that the displaced points do not fall into the surround region of some other neuron. The verification of the split operation will constitute the major part of the proof.

Let us first choose the vectors. By means of proposition 1, we select a set $\{s_1, \ldots, s_{h(m+1)}\} \subseteq \mathbb{R}^m$ in spherically general position. Let $e_1, \ldots, e_q$ denote the unit vectors in $\mathbb{R}^q$, that is, those with a 1 in exactly one component and 0 elsewhere. We define the set $S$ by
\[ S = \{s_i : i = 1, \ldots, h(m+1)\} \times \{e_j : j = 1, \ldots, q\}. \]
Clearly, $S$ is a subset of $\mathbb{R}^{m+q}$ and has cardinality $hq(m+1)$. It remains to show that $S$ is shattered by $\mathcal{N}$. Let $(S_0, S_1)$ be some arbitrary dichotomy of $S$. Consider an enumeration $M_1, \ldots, M_{2^q}$ of all subsets of the set $\{1, \ldots, q\}$. Let the function $f : \{1, \ldots, h(m+1)\} \to \{1, \ldots, 2^q\}$ be defined by $M_{f(i)} = \{j : s_i e_j \in S_1\}$, where $s_i e_j$ denotes the vector resulting from the concatenation of $s_i$ and $e_j$. We use $f$ to define a partition of $\{s_1, \ldots, s_{h(m+1)}\}$ into sets $T_k$ for $k = 1, \ldots, 2^q$ by $T_k = \{s_i : f(i) = k\}$.
Neural Networks with Local Receptive Fields
931

We further partition each set $T_k$ into subsets $T_{k,p}$ for $p = 1, \ldots, \lceil |T_k|/(m+1) \rceil$, where each subset $T_{k,p}$ has cardinality $m+1$ except if $m+1$ does not divide $|T_k|$, in which case there is exactly one subset of cardinality less than $m+1$. Since there are at most $h(m+1)$ elements $s_i$, the partitioning of all $T_k$ results in no more than $h$ subsets of cardinality $m+1$. Further, the fact that $k \le 2^q$ permits at most $2^q$ subsets of cardinality less than $m+1$. Thus, there are no more than $h + 2^q$ subsets $T_{k,p}$. We employ one hidden node $H_{k,p}$ for each subset $T_{k,p}$; thus, we get by with $h + 2^q$ hidden nodes in $\mathcal{N}$ as claimed. Since $\{s_1, \ldots, s_{h(m+1)}\}$ is in spherically general position, there exists for each $T_{k,p}$ an $(m-1)$-dimensional hypersphere containing all points in $T_{k,p}$ and no other point. If $|T_{k,p}| = m+1$, this hypersphere is unique; if $|T_{k,p}| < m+1$, there is a unique $(|T_{k,p}|-2)$-dimensional hypersphere that can be extended to an $(m-1)$-dimensional hypersphere that does not contain any further point. (Note that we require condition 1 of definition 2; otherwise, no hypersphere of dimension $|T_{k,p}|-2$ including all points of $T_{k,p}$ might exist.) Clearly, if $|T_{k,p}| = 1$, we can also extend this single point to an $(m-1)$-dimensional hypersphere not including any further point. Suppose that $(c_{k,p}, r_{k,p})$, with center $c_{k,p}$ and radius $r_{k,p}$, represents the hypersphere associated with subset $T_{k,p}$. It is obvious from the construction above that all radii satisfy $r_{k,p} > 0$. Further, since the subsets $T_{k,p}$ are pairwise disjoint, there is some $\varepsilon > 0$ such that every point $s_i \in \{s_1, \ldots, s_{h(m+1)}\}$ and every just defined hypersphere $(c_{k,p}, r_{k,p})$ satisfy
\[ \text{if } s_i \notin T_{k,p} \text{ then } \bigl|\, \|s_i - c_{k,p}\| - r_{k,p} \,\bigr| > \varepsilon. \tag{3.1} \]
In other words, $\varepsilon$ is smaller than the distance between any $s_i$ and any hypersphere $(c_{k,p}, r_{k,p})$ that does not contain $s_i$. Without loss of generality, we assume that $\varepsilon$ is sufficiently small such that
\[ \varepsilon \le \min_{k,p} r_{k,p}. \tag{3.2} \]
The parameters of the hidden nodes are adjusted as follows. We define the center $\hat c_{k,p} = (\hat c_{k,p,1}, \ldots, \hat c_{k,p,m+q})$ of hidden node $H_{k,p}$ by assigning the vector $c_{k,p}$ to the first $m$ components and specifying the remaining ones by
\[ \hat c_{k,p,m+j} = \begin{cases} 0 & \text{if } j \in M_k, \\ -\varepsilon^2/4 & \text{otherwise,} \end{cases} \]
for $j = 1, \ldots, q$. We further define new radii $\hat r_{k,p}$ by
\[ \hat r_{k,p} = \sqrt{ r_{k,p}^2 + (q - |M_k|) \left(\frac{\varepsilon}{2}\right)^4 + 1 } \]
and choose some $\gamma > 0$ satisfying
\[ \gamma \le \min_{k,p} \frac{\varepsilon^2}{8 \hat r_{k,p}}. \tag{3.3} \]
The center and surround radii $\hat a_{k,p}, \hat b_{k,p}$ of the hidden nodes are then specified as
\[ \hat a_{k,p} = \hat r_{k,p} - \gamma, \qquad \hat b_{k,p} = \hat r_{k,p} + \gamma. \]
Note that $\hat a_{k,p} > 0$ holds, because $\varepsilon^2 < \hat r_{k,p}^2$ implies $\gamma < \hat r_{k,p}$. This completes the assignment of parameters to the hidden nodes $H_{k,p}$. We now derive two inequalities concerning the relationship between $\varepsilon$ and $\gamma$ that we need in the following. First, we estimate $\varepsilon^2/2$ from below by
\[ \frac{\varepsilon^2}{2} > \frac{\varepsilon^2}{4} + \frac{\varepsilon^2}{64} > \frac{\varepsilon^2}{4} + \frac{\varepsilon^4}{(8\hat r_{k,p})^2} \quad \text{for all } k, p, \]
where the last inequality is obtained from $\varepsilon^2 < \hat r_{k,p}^2$. Using equation 3.3 for both terms on the right-hand side, we get
\[ \frac{\varepsilon^2}{2} > 2\hat r_{k,p}\gamma + \gamma^2 \quad \text{for all } k, p. \tag{3.4} \]
Second, from equation 3.2, we get
\[ -r_{k,p}\varepsilon + \frac{\varepsilon^2}{2} < -\frac{\varepsilon^2}{4} \quad \text{for all } k, p, \]
and equation 3.3 yields
\[ -\frac{\varepsilon^2}{4} \le -2\hat r_{k,p}\gamma \quad \text{for all } k, p. \]
Putting the last two inequalities together and adding $\gamma^2$ to the right-hand side, we obtain
\[ -r_{k,p}\varepsilon + \frac{\varepsilon^2}{2} < -2\hat r_{k,p}\gamma + \gamma^2 \quad \text{for all } k, p. \tag{3.5} \]
We next establish three facts about the hidden nodes.

Claim i. Let $s_i e_j$ be some point and $T_{k,p}$ some subset where $s_i \in T_{k,p}$ and $j \in M_k$. Then hidden node $H_{k,p}$ outputs 1 on $s_i e_j$.

According to the definition of $\hat c_{k,p}$, if $j \in M_k$, we have
\[ \|s_i e_j - \hat c_{k,p}\|^2 = \|s_i - c_{k,p}\|^2 + (q - |M_k|) \left(\frac{\varepsilon}{2}\right)^4 + 1. \]
The condition $s_i \in T_{k,p}$ implies $\|s_i - c_{k,p}\|^2 = r_{k,p}^2$, and thus
\[ \|s_i e_j - \hat c_{k,p}\|^2 = r_{k,p}^2 + (q - |M_k|) \left(\frac{\varepsilon}{2}\right)^4 + 1 = \hat r_{k,p}^2. \]
It follows that $\|s_i e_j - \hat c_{k,p}\| = \hat r_{k,p}$, and since $\hat a_{k,p} < \hat r_{k,p} < \hat b_{k,p}$, point $s_i e_j$ lies within the surround region of node $H_{k,p}$. Hence, claim i is shown.

Claim ii. Let $s_i e_j$ and $T_{k,p}$ satisfy $s_i \in T_{k,p}$ and $j \notin M_k$. Then hidden node $H_{k,p}$ outputs 0 on $s_i e_j$.
From the assumptions, we get here
\[ \|s_i e_j - \hat c_{k,p}\|^2 = \|s_i - c_{k,p}\|^2 + (q - |M_k| - 1) \left(\frac{\varepsilon}{2}\right)^4 + \left(1 + \frac{\varepsilon^2}{4}\right)^2 = r_{k,p}^2 + (q - |M_k|) \left(\frac{\varepsilon}{2}\right)^4 + 1 + \frac{\varepsilon^2}{2} = \hat r_{k,p}^2 + \frac{\varepsilon^2}{2}. \]
Employing equation 3.4 on the right-hand side results in
\[ \|s_i e_j - \hat c_{k,p}\|^2 > \hat r_{k,p}^2 + 2\hat r_{k,p}\gamma + \gamma^2. \]
Hence, taking square roots, we have $\|s_i e_j - \hat c_{k,p}\| > \hat r_{k,p} + \gamma$, implying that $s_i e_j$ lies outside the surround region of $H_{k,p}$. Thus, claim ii follows.

Claim iii. Let $s_i e_j$ be some point and $T_{k,p}$ some subset such that $s_i \in T_{k,p}$. Then every hidden node $H_{k',p'}$ with $(k', p') \ne (k, p)$ outputs 0 on $s_i e_j$.

Since $s_i \in T_{k,p}$ and $s_i$ is not contained in any other subset $T_{k',p'}$, condition 3.1 implies
\[ \|s_i - c_{k',p'}\|^2 > (r_{k',p'} + \varepsilon)^2 \quad \text{or} \quad \|s_i - c_{k',p'}\|^2 < (r_{k',p'} - \varepsilon)^2. \tag{3.6} \]
We distinguish between two cases: whether $j \in M_{k'}$ or not.

Case 1. If $j \in M_{k'}$, then by the definition of $\hat c_{k',p'}$ we have
\[ \|s_i e_j - \hat c_{k',p'}\|^2 = \|s_i - c_{k',p'}\|^2 + (q - |M_{k'}|) \left(\frac{\varepsilon}{2}\right)^4 + 1. \]
From this, using equation 3.6 and the definition of $\hat r_{k',p'}$, we obtain
\[ \|s_i e_j - \hat c_{k',p'}\|^2 > \hat r_{k',p'}^2 + 2 r_{k',p'}\varepsilon + \varepsilon^2 \quad \text{or} \quad \|s_i e_j - \hat c_{k',p'}\|^2 < \hat r_{k',p'}^2 - 2 r_{k',p'}\varepsilon + \varepsilon^2. \tag{3.7} \]
We derive bounds for the right-hand sides of these inequalities as follows. From equation 3.4, we have
\[ \varepsilon^2 > 4\hat r_{k',p'}\gamma + 2\gamma^2, \]
which, after adding $2 r_{k',p'}\varepsilon$ to the left-hand side and halving the right-hand side, gives
\[ 2 r_{k',p'}\varepsilon + \varepsilon^2 > 2\hat r_{k',p'}\gamma + \gamma^2. \tag{3.8} \]
From equation 3.2, we get $\varepsilon^2/2 < r_{k',p'}\varepsilon$, that is, the left-hand side of equation 3.5 is negative. Hence, we may double it to obtain from equation 3.5,
\[ -2 r_{k',p'}\varepsilon + \varepsilon^2 < -2\hat r_{k',p'}\gamma + \gamma^2. \]
Using this and equation 3.8 in 3.7 leads to
\[ \|s_i e_j - \hat c_{k',p'}\|^2 > (\hat r_{k',p'} + \gamma)^2 \quad \text{or} \quad \|s_i e_j - \hat c_{k',p'}\|^2 < (\hat r_{k',p'} - \gamma)^2. \]
And this is equivalent to $\|s_i e_j - \hat c_{k',p'}\| > \hat b_{k',p'}$ or $\|s_i e_j - \hat c_{k',p'}\| < \hat a_{k',p'}$, meaning that $H_{k',p'}$ outputs 0.

Case 2. If $j \notin M_{k'}$, then
\[ \|s_i e_j - \hat c_{k',p'}\|^2 = \|s_i - c_{k',p'}\|^2 + (q - |M_{k'}|) \left(\frac{\varepsilon}{2}\right)^4 + 1 + \frac{\varepsilon^2}{2}. \]
As a consequence of this, together with equation 3.6 and the definition of $\hat r_{k',p'}$, we get
\[ \|s_i e_j - \hat c_{k',p'}\|^2 > \hat r_{k',p'}^2 + 2 r_{k',p'}\varepsilon + \varepsilon^2 + \frac{\varepsilon^2}{2} \quad \text{or} \quad \|s_i e_j - \hat c_{k',p'}\|^2 < \hat r_{k',p'}^2 - 2 r_{k',p'}\varepsilon + \varepsilon^2 + \frac{\varepsilon^2}{2}, \]
from which we derive, using for the second inequality $\varepsilon \le r_{k',p'}$ from equation 3.2,
\[ \|s_i e_j - \hat c_{k',p'}\|^2 > \hat r_{k',p'}^2 + r_{k',p'}\varepsilon + \frac{\varepsilon^2}{2} \quad \text{or} \quad \|s_i e_j - \hat c_{k',p'}\|^2 < \hat r_{k',p'}^2 - r_{k',p'}\varepsilon + \frac{\varepsilon^2}{2}. \tag{3.9} \]
Finally, from equation 3.4, we have
\[ r_{k',p'}\varepsilon + \frac{\varepsilon^2}{2} > 2\hat r_{k',p'}\gamma + \gamma^2, \]
and, employing this together with equation 3.5, we obtain from equation 3.9
\[ \|s_i e_j - \hat c_{k',p'}\|^2 > (\hat r_{k',p'} + \gamma)^2 \quad \text{or} \quad \|s_i e_j - \hat c_{k',p'}\|^2 < (\hat r_{k',p'} - \gamma)^2, \]
which holds if and only if $\|s_i e_j - \hat c_{k',p'}\| > \hat b_{k',p'}$ or $\|s_i e_j - \hat c_{k',p'}\| < \hat a_{k',p'}$. This shows that $H_{k',p'}$ outputs 0 also in this case. Thus, claim iii is established.

We complete the network $\mathcal{N}$ by connecting every hidden node with weight 1 to the output node, which then computes the sum of the hidden node output values. We finally show that we have indeed obtained a network that induces the dichotomy $(S_0, S_1)$. Assume that $s_i e_j \in S_1$. Claims i, ii, and iii imply that there is exactly one hidden node $H_{k,p}$, namely, the one satisfying $k = f(i)$ by the definition of $f$, that outputs 1 on $s_i e_j$. Hence, the network outputs 1 as well. On the other hand, if $s_i e_j \in S_0$, it follows from claims ii and iii that none of the hidden nodes outputs 1. Therefore, the network output is 0. Thus, $\mathcal{N}$ shatters $S$ with output threshold $1/2$, and the proof is completed.

The construction in the previous proof was based on the assumption that the difference between center radius and surround radius, given by the value $2\gamma$, can be made sufficiently small. This may require constraints on the precision of computation that are not available in natural or artificial systems. It is possible, however, to obtain the same result even if there is a lower bound on the difference of the radii. One simply has to scale the elements of the shattered set by a sufficiently large factor.

We apply the result now to obtain a superlinear lower bound for the VC dimension of networks with center-surround receptive field neurons. By $\lfloor x \rfloor$ we denote the largest integer less than or equal to $x$.

Corollary 1. Suppose $\mathcal{N}$ is a network with one hidden layer of $k$ binary CSRF neurons and input dimension $n \ge 2$, where $k \le 2^n$, and assume that the output
node is linear. Then $\mathcal{N}$ has VC dimension at least
\[ \left\lfloor \frac{k}{2} \right\rfloor \cdot \left\lfloor \log\frac{k}{2} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \right). \]
This holds even if the weights of the output node are not adjustable.

Proof. We use theorem 1 with $h = \lfloor k/2 \rfloor$, $q = \lfloor \log(k/2) \rfloor$, and $m = n - \lfloor \log(k/2) \rfloor$. The condition $k \le 2^n$ guarantees that $m \ge 1$. Then there is a set of cardinality
\[ hq(m+1) = \left\lfloor \frac{k}{2} \right\rfloor \cdot \left\lfloor \log\frac{k}{2} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \right) \]
that is shattered by the network specified in theorem 1. Since the number of hidden nodes is $h + 2^q \le k$ and the input dimension is $m + q = n$, the network satisfies the required conditions. Furthermore, it was shown in the proof of theorem 1 that all weights of the output node can be fixed to 1. Hence, they need not be adjustable.

VC dimension bounds for neural networks are often expressed in terms of the number of weights and the network size. In the following, we give a lower bound of this kind.

Corollary 2. Consider a network $\mathcal{N}$ with input dimension $n \ge 2$, one hidden layer of $k$ binary CSRF neurons, where $k \le 2^{n/2}$, and a linear output node. Let $W = k(n+2) + k + 1$ denote the number of weights. Then $\mathcal{N}$ has VC dimension at least
\[ \frac{W}{5} \log\frac{k}{4}. \]
This holds even in the case when the weights of the output node are fixed.

Proof. According to corollary 1, $\mathcal{N}$ has VC dimension at least $\lfloor k/2 \rfloor \cdot \lfloor \log(k/2) \rfloor \cdot (n - \lfloor \log(k/2) \rfloor + 1)$. The condition $k \le 2^{n/2}$ implies
\[ n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \ge \frac{n+4}{2}. \]
We may assume that $k \ge 5$. (The statement is trivial for $k \le 4$.) It follows, using $\lfloor k/2 \rfloor \ge (k-1)/2$ and $k/10 \ge 1/2$, that
\[ \left\lfloor \frac{k}{2} \right\rfloor \ge \frac{2k}{5}. \]
Finally, we have
\[ \left\lfloor \log\frac{k}{2} \right\rfloor \ge \log\frac{k}{2} - 1 = \log\frac{k}{4}. \]
Hence, $\mathcal{N}$ has VC dimension at least $(n+4)(k/5)\log(k/4)$, which is at least as large as the claimed bound $(W/5)\log(k/4)$.

In the networks considered thus far, the input dimension was assumed to be variable. It is an easy consequence of theorem 1 that even when $n$ is constant, the VC dimension still grows linearly in terms of the network size.

Corollary 3. Assume that the input dimension is fixed, and consider a network $\mathcal{N}$ with one hidden layer of binary CSRF neurons and a linear output node. Then the VC dimension of $\mathcal{N}$ is $\Omega(k)$ and $\Omega(W)$, where $k$ is the number of hidden nodes and $W$ the number of weights. This even holds in the case of fixed output weights.
Proof. Choose $m, q \ge 1$ such that $m + q \le n$, and let $h = k - 2^q$. Since $n$ is constant, $hq(m+1)$ is $\Omega(k)$. Thus, according to theorem 1, there is a set of cardinality $\Omega(k)$ shattered by $\mathcal{N}$. Since the number of weights is $k(n+3)+1$, which is $O(k)$, the lower bound $\Omega(W)$ also follows.

3.3 Networks of Ternary CSRF Neurons. The results from the previous section now easily allow deriving similar bounds for ternary CSRF neurons. The following statement is the counterpart of theorem 1.
Theorem 2. Suppose $\mathcal{N}$ is a network with one hidden layer consisting of ternary CSRF neurons and a linear output node. Let $2(h + 2^q)$ be the number of hidden nodes and $m + q$ the number of input nodes, where $h, m, q \ge 1$. Then there exists a set of cardinality $hq(m+1)$ shattered by $\mathcal{N}$. This holds even for fixed output weights.
Proof. The idea is to use the same set $S$ as in the proof of theorem 1 and to simulate the behavior of a binary neuron by two ternary neurons. Let $\hat{\mathcal{N}}$ be the network constructed in the proof of theorem 1. Assume first that we have off-center on-surround neurons available for the construction of $\mathcal{N}$. For every hidden node $\hat H$ of $\hat{\mathcal{N}}$, we introduce two hidden nodes $H, H'$ for $\mathcal{N}$, defining their parameters as follows: Node $H$ gets the same center and radii as $\hat H$. Node $H'$ also gets the same center, but for the center radius we choose 0, and the surround radius is defined to be the center radius of $\hat H$. Formally, $H'$ can be regarded as an off-center on-surround neuron without center region. It is easy to see that on any input vector from $S$, the sum of the output values of $H$ and $H'$ is equal to the output value of $\hat H$. Here we use the fact that no point in $S$ lies on the radii of any hidden node of $\hat{\mathcal{N}}$. Hence, by what was shown in theorem 1, a dichotomy $(S_0, S_1)$ is accomplished by the sum of the output values of the hidden nodes, being 0 for elements of $S_0$ and 1 for elements of $S_1$. In case we are dealing with on-center off-surround neurons, we use the property that they are negatives of off-center on-surround neurons. Thus, defining the corresponding output weight to be $-1$ instead of 1, we obtain the same result.

In analogy to the previous section, we are now able to infer three lower bounds: a superlinear bound in terms of input dimension and network size, a superlinear bound in terms of weight number and network size, and a linear bound for fixed input dimension.

Corollary 4. Suppose $\mathcal{N}$ is a network with input dimension $n \ge 2$, one hidden layer of $k \le 2^{n+1}$ ternary CSRF neurons, and a linear output node. Then $\mathcal{N}$ has
VC dimension at least
\[ \left\lfloor \frac{k}{4} \right\rfloor \cdot \left\lfloor \log\frac{k}{4} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{4} \right\rfloor + 1 \right), \]
even for fixed output weights.

Proof. From $k \le 2^{n+1}$ it follows that $\lfloor \log(k/4) \rfloor \le n - 1$. Hence, applying theorem 2 with $h = \lfloor k/4 \rfloor$, $q = \lfloor \log(k/4) \rfloor$, $m = n - \lfloor \log(k/4) \rfloor$, and observing that $2(h + 2^q) \le k$, we immediately obtain the claimed result.
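To get a sense of how this bound grows, it can be evaluated directly. A small sketch (ours, not from the paper; the function name and sample values are illustrative):

```python
# Evaluate the corollary 4 lower bound for ternary CSRF networks:
#   floor(k/4) * floor(log2(k/4)) * (n - floor(log2(k/4)) + 1),
# valid for n >= 2 and k <= 2^(n+1).  The growth in k is superlinear.
import math

def cor4_bound(n: int, k: int) -> int:
    assert n >= 2 and 4 <= k <= 2 ** (n + 1)
    q = math.floor(math.log2(k / 4))
    return (k // 4) * q * (n - q + 1)

for k in (64, 256, 1024, 2048):
    print(f"n=10, k={k}: bound = {cor4_bound(10, k)}")
```

For fixed $n$, the bound grows like $k \log k$ over the admissible range of $k$, which is the superlinear behavior the corollary asserts.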
Corollary 5. Consider a network $\mathcal{N}$ with input dimension $n \ge 2$, one hidden layer of $k$ ternary CSRF neurons, where $k \le 2^{(n+2)/2}$, and a linear output node. Let $W = k(n+3)+1$ denote the number of weights. Then $\mathcal{N}$ has VC dimension at least
\[ \frac{W}{12} \log\frac{k}{8}, \]
even for fixed output weights.
Proof. Since $k \le 2^{(n+2)/2}$, we have $n - \lfloor \log(k/4) \rfloor + 1 \ge (n+4)/2$. Further, $\lfloor k/4 \rfloor \ge (k-3)/4$ implies for $k \ge 9$ that $\lfloor k/4 \rfloor \ge k/6$. (The statement is trivial for $k \le 8$.) Using these estimates in the bound of corollary 4 together with $\lfloor \log(k/4) \rfloor \ge \log(k/8)$ gives the result.
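The two floor estimates invoked in this proof are easy to confirm mechanically (a quick check, ours):

```python
# Check the estimates from the proof of corollary 5 over a range of k:
#   floor(k/4) >= k/6              for k >= 9, and
#   floor(log2(k/4)) >= log2(k/8)  (an instance of floor(x) >= x - 1).
import math

for k in range(9, 5000):
    assert k // 4 >= k / 6
    assert math.floor(math.log2(k / 4)) >= math.log2(k / 8)
print("both estimates hold for 9 <= k < 5000")
```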
Corollary 6. Assume that the input dimension $n \ge 2$ is fixed, and consider a network $\mathcal{N}$ with one hidden layer of ternary CSRF neurons and a linear output node. Then the VC dimension of $\mathcal{N}$ is $\Omega(k)$ and $\Omega(W)$, where $k$ is the number of hidden nodes and $W$ the number of weights. This holds even for fixed output weights.
Proof. The result can be deduced from theorem 2 by analogy with corollary 3.

3.4 Networks of Binary RBF Neurons. Finally, we consider the third variant of a discrete local receptive field neuron and show that networks of binary RBF neurons also respect the bounds established above for ternary CSRF neurons.
Theorem 3. Suppose $\mathcal{N}$ is a network with one hidden layer consisting of binary RBF neurons and a linear output node. Let $n \ge 2$ be the input dimension, $k$ the number of hidden nodes, and assume that $k \le 2^{n+1}$. Then $\mathcal{N}$ has VC dimension at least
\[ \left\lfloor \frac{k}{4} \right\rfloor \cdot \left\lfloor \log\frac{k}{4} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{4} \right\rfloor + 1 \right). \]
Let $W$ denote the number of weights and assume that $k \le 2^{(n+2)/2}$. Then the VC dimension of $\mathcal{N}$ is at least
\[ \frac{W}{12} \log\frac{k}{8}. \]
For fixed input dimension $n \ge 2$, the VC dimension of $\mathcal{N}$ satisfies the bounds $\Omega(k)$ and $\Omega(W)$. All these bounds are valid even when the output weights are fixed.

Proof. The main idea is to employ two binary RBF neurons for the simulation of one binary CSRF neuron. This is easy to achieve. Given a neuron of the latter type, we provide the two RBF neurons with its center and assign its center radius to the first and its surround radius to the second neuron. If we give output weights $-1$ and $1$ to the first and second neuron, respectively, then it is clear that on points not lying on the radii, the summed output of the weighted RBF neurons is equivalent to the output of the CSRF neuron.
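This simulation is concrete enough to test numerically. A sketch (ours; `rbf` and `csrf` are illustrative names, with the binary RBF neuron taken as a ball indicator and the binary CSRF neuron as an annulus indicator):

```python
# Simulate one binary CSRF neuron (output 1 on the annulus a <= ||x-c|| <= b)
# by two weighted binary RBF neurons (ball indicators) with weights -1 and 1.
# Points lying exactly on a radius are skipped, mirroring the proof's
# requirement that no shattered point lie on a radius.
import math
import random

def rbf(x, c, r):
    # binary RBF neuron: 1 inside the ball of radius r around c
    return 1.0 if math.dist(x, c) <= r else 0.0

def csrf(x, c, a, b):
    # binary off-center on-surround neuron: 1 on the annulus
    return 1.0 if a <= math.dist(x, c) <= b else 0.0

random.seed(1)
c, a, b = (0.0, 0.0, 0.0), 0.5, 1.5
for _ in range(1000):
    x = tuple(random.gauss(0, 1) for _ in range(3))
    d = math.dist(x, c)
    if abs(d - a) < 1e-9 or abs(d - b) < 1e-9:
        continue  # skip points lying (numerically) on a radius
    assert csrf(x, c, a, b) == -1.0 * rbf(x, c, a) + 1.0 * rbf(x, c, b)
print("two weighted binary RBF neurons reproduce the CSRF output")
```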
Thus, we can carry out a construction similar to that in the proof of theorem 2, obtaining a network of twice the original size such that both networks shatter the set from theorem 1. We recall that the parameters can be chosen such that no point of this set lies on any radius. The consequences stated in corollaries 4 to 6 for ternary CSRF neurons then follow immediately for binary RBF neurons.

4 Superlinear Lower Bounds for Networks of Continuous Neurons
We now turn toward networks of continuous local receptive field neurons. In this section, we first establish lower bounds for networks of DOG neurons. Their derivation mainly builds on constructions and results from the previous section. The bounds for RBF networks are then easily obtained.

4.1 Networks of DOG Neurons. We begin by deriving a result in analogy with theorem 1.

Theorem 4. Let $h, q, m \ge 1$ be arbitrary natural numbers. Suppose $\mathcal{N}$ is a network with $m + q$ input nodes, one hidden layer of $h + 2^q$ DOG neurons, and a linear output node. Then there is a set of cardinality $hq(m+1)$ shattered by $\mathcal{N}$.
Proof. We use ideas and results from the proof of theorem 1. In particular, we show that the set constructed there can be shattered by a network of new model neurons, the so-called extended gaussian neurons, which we introduce below. Then we demonstrate that a network of these extended gaussian neurons can be simulated by a network of DOG neurons, which establishes the statement of the theorem.

We define an extended gaussian neuron with $n$ inputs to compute the function $\tilde g : \mathbb{R}^{2n+2} \to \mathbb{R}$ with
\[ \tilde g(c, \sigma, a, x) = 1 - \left( a \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) - 1 \right)^2, \]
where $x_1, \ldots, x_n$ are the input variables and $c_1, \ldots, c_n$, $a$, and $\sigma > 0$ are real-valued parameters. Thus, the computation of an extended gaussian neuron is performed by scaling the output of a gaussian RBF neuron with $a$, squaring the difference to 1, and comparing this value with 1.

Let $S \subseteq \mathbb{R}^{m+q}$ be the set of cardinality $hq(m+1)$ constructed in the proof of theorem 1. In particular, $S$ has the form $S = \{s_i e_j : i = 1, \ldots, h(m+1);\ j = 1, \ldots, q\}$. We have also defined in that proof binary CSRF neurons $H_{k,p}$ as hidden nodes in terms of parameters $\hat c_{k,p} \in \mathbb{R}^{m+q}$, which became the centers of the
neurons, and $\hat r_{k,p} \in \mathbb{R}$, which gave the center radii $\hat a_{k,p} = \hat r_{k,p} - \gamma$ and the surround radii $\hat b_{k,p} = \hat r_{k,p} + \gamma$ using some $\gamma > 0$. The number of hidden nodes was not larger than $h + 2^q$. We replace the CSRF neurons by extended gaussian neurons $G_{k,p}$ with parameters $c_{k,p}, \sigma_{k,p}, a_{k,p}$ defined as follows. Assume some $\sigma > 0$ that will be specified later. Then we let
\[ c_{k,p} = \hat c_{k,p}, \qquad \sigma_{k,p} = \sigma, \qquad a_{k,p} = \exp\left( \frac{\hat r_{k,p}^2}{\sigma^2} \right). \]
These hidden nodes are connected to the output node with all weights being 1. We call this network $\mathcal{N}'$ and claim that it shatters $S$. Consider some arbitrary dichotomy $(S_0, S_1)$ of $S$ and some $s_i e_j \in S$. Then node $G_{k,p}$ computes
\[ \begin{aligned} \tilde g(c_{k,p}, \sigma_{k,p}, a_{k,p}, s_i e_j) &= 1 - \left( a_{k,p} \exp\left( -\frac{\|s_i e_j - c_{k,p}\|^2}{\sigma_{k,p}^2} \right) - 1 \right)^2 \\ &= 1 - \left( \exp\left( \frac{\hat r_{k,p}^2}{\sigma^2} \right) \cdot \exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2}{\sigma^2} \right) - 1 \right)^2 \\ &= 1 - \left( \exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2 - \hat r_{k,p}^2}{\sigma^2} \right) - 1 \right)^2. \end{aligned} \tag{4.1} \]
Suppose first that $s_i e_j \in S_1$. It was shown by claims i, ii, and iii in the proof of theorem 1 that there is exactly one hidden node $H_{k,p}$ that outputs 1 on $s_i e_j$. In particular, claim i established that this node satisfies $\|s_i e_j - \hat c_{k,p}\|^2 = \hat r_{k,p}^2$. Hence, according to equation 4.1, node $G_{k,p}$ outputs 1. We note that this holds for all values of $\sigma$. Further, the derivations of claims ii and iii yielded that those nodes $H_{k,p}$ that output 0 on $s_i e_j$ satisfy
\[ \|s_i e_j - \hat c_{k,p}\|^2 > (\hat r_{k,p} + \gamma)^2 \quad \text{or} \quad \|s_i e_j - \hat c_{k,p}\|^2 < (\hat r_{k,p} - \gamma)^2. \tag{4.2} \]
This implies for the computation of $G_{k,p}$ that in equation 4.1, we can make the expression
\[ \exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2 - \hat r_{k,p}^2}{\sigma^2} \right) \]
as close to 0 as necessary by choosing $\sigma$ sufficiently small. Since this does not affect the node that outputs 1, network $\mathcal{N}'$ computes a value close to 1 on $s_i e_j$. On the other hand, for the case $s_i e_j \in S_0$, it was shown in theorem 1 that all nodes $H_{k,p}$ output 0. Thus, each of them satisfies condition 4.2, implying that if $\sigma$ is sufficiently small, each node $G_{k,p}$, and hence $\mathcal{N}'$, outputs a value close to 0. Altogether, $S$ is shattered by thresholding the output of $\mathcal{N}'$ at $1/2$.

Finally, we show that $S$ can be shattered by a network $\mathcal{N}$ of the same size with DOG neurons as hidden nodes. The computation of an extended gaussian neuron can be rewritten as
\[ \begin{aligned} \tilde g(c, \sigma, a, x) &= 1 - \left( a \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) - 1 \right)^2 \\ &= 1 - \left( a^2 \exp\left( -\frac{2\|x - c\|^2}{\sigma^2} \right) - 2a \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) + 1 \right) \\ &= 2a \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) - a^2 \exp\left( -\frac{2\|x - c\|^2}{\sigma^2} \right) \\ &= g_{\mathrm{DOG}}\left(c, \sigma, \sigma/\sqrt{2}, 2a, a^2, x\right). \end{aligned} \]
Hence, the extended gaussian neuron is equivalent to a weighted difference of two gaussian neurons with center $c$, widths $\sigma$ and $\sigma/\sqrt{2}$, and weights $2a$ and $a^2$, respectively. Thus, the extended gaussian neurons can be replaced by DOG neurons, which completes the proof.

We note that the network of extended gaussian neurons constructed in the previous proof has all output weights fixed, whereas the output weights of the DOG neurons, that is, the parameters $a$ and $b$ in the notation of section 2.1, are calculated from the parameters of the extended gaussian neurons and therefore depend on the particular dichotomy to be implemented. (It is trivial for a DOG network to have an output node with fixed weights since the DOG neurons have built-in output weights.) We are now able to deduce a superlinear lower bound on the VC dimension of DOG networks.

Corollary 7. Suppose $\mathcal{N}$ is a network with one hidden layer of DOG neurons and a linear output node. Let $\mathcal{N}$ have $k$ hidden nodes and input dimension $n \ge 2$, where $k \le 2^n$. Then $\mathcal{N}$ has VC dimension at least
\[ \left\lfloor \frac{k}{2} \right\rfloor \cdot \left\lfloor \log\frac{k}{2} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \right). \]
Let $W$ denote the number of weights and assume that $k \le 2^{n/2}$. Then the VC dimension of $\mathcal{N}$ is at least
\[ \frac{W}{5} \log\frac{k}{4}. \]
For fixed input dimension, the VC dimension of $\mathcal{N}$ is $\Omega(k)$ and $\Omega(W)$.

Proof. The results are implied by theorem 4 in the same way as corollaries 1 through 3 follow from theorem 1.

4.2 Networks of Gaussian RBF Neurons. We can now give the answer to the question of Anthony and Holden (1994) quoted in section 1.
Theorem 5. Suppose $\mathcal{N}$ is a network with one hidden layer of gaussian RBF neurons and a linear output node. Let $k$ be the number of hidden nodes and $n$ the input dimension, where $n \ge 2$ and $k \le 2^{n+1}$. Then $\mathcal{N}$ has VC dimension at least
\[ \left\lfloor \frac{k}{4} \right\rfloor \cdot \left\lfloor \log\frac{k}{4} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{4} \right\rfloor + 1 \right). \]
Let $W$ denote the number of weights and assume that $k \le 2^{(n+2)/2}$. Then the VC dimension of $\mathcal{N}$ is at least
\[ \frac{W}{12} \log\frac{k}{8}. \]
For fixed input dimension $n \ge 2$, the VC dimension of $\mathcal{N}$ satisfies the bounds $\Omega(k)$ and $\Omega(W)$.
Proof. Clearly, a DOG neuron can be simulated by two gaussian RBF neurons. Thus, by virtue of theorem 4, there is a network $\mathcal{N}$ with $m + q$ input nodes and one hidden layer of $2(h + 2^q)$ gaussian RBF neurons that shatters some set of cardinality $hq(m+1)$. Choosing $h = \lfloor k/4 \rfloor$, $q = \lfloor \log(k/4) \rfloor$, and $m = n - \lfloor \log(k/4) \rfloor$, we obtain, similarly to corollary 4, the claimed lower bound in terms of $n$ and $k$. Furthermore, the stated bound in terms of $W$ and $k$ follows by analogy to the reasoning in corollary 5. Finally, the bound for fixed input dimension is obvious, as in the proof of corollary 3.

Some RBF networks studied theoretically or used in practice have no adjustable width parameters (for instance, Broomhead & Lowe, 1988; Powell, 1992). Therefore, a natural question is whether the previous result also holds for networks with fixed width parameters. The values of the width parameters for theorem 5 arise from the widths of DOG neurons specified in theorem 4. The two width parameters of each DOG neuron have the form $\sigma$ and $\sigma/\sqrt{2}$, where $\sigma$ is common to all DOG neurons and is required only to be sufficiently small. Hence, we can choose a single $\sigma$ that is sufficiently small for all dichotomies to be induced. Thus, for the RBF network, we not only have that the width parameters can be fixed, but even that there need to be only two different width values, depending solely on the architecture and not on the particular dichotomy.

Corollary 8. Let $\mathcal{N}$ be a gaussian RBF network with $n$ input nodes and $k$ hidden nodes satisfying the conditions of theorem 5. Then there exists a real number $\sigma_{k,n} > 0$ such that the VC dimension bounds stated in theorem 5 hold for $\mathcal{N}$ with each RBF neuron having fixed width $\sigma_{k,n}$ or $\sigma_{k,n}/\sqrt{2}$.
With regard to theorem 5, we further remark that $k$ has been previously established as a lower bound for RBF networks by Anthony and Holden (1994). Further, theorem 19 of Lee et al. (1995), in connection with the result of Erlich et al. (1997), implies the lower bound $\Omega(nk)$, and hence $\Omega(k)$ for fixed input dimension. By means of theorem 5, we are now able to present a lower bound that is even superlinear in $k$.

Corollary 9. Let $n \ge 2$ and $\mathcal{N}$ be the network with $k = 2^{n+1}$ hidden gaussian RBF neurons. Then $\mathcal{N}$ has VC dimension at least
\[ \frac{k}{3} \log\frac{k}{8}. \]

Proof. Since $k = 2^{n+1}$, we may substitute $n = \log k - 1$ in the first bound of theorem 5. Hence, the VC dimension of $\mathcal{N}$ is at least
\[ \left\lfloor \frac{k}{4} \right\rfloor \cdot \left\lfloor \log\frac{k}{4} \right\rfloor \cdot \left( \log k - \left\lfloor \log\frac{k}{4} \right\rfloor \right) \ge 2 \left\lfloor \frac{k}{4} \right\rfloor \cdot \left\lfloor \log\frac{k}{4} \right\rfloor. \]
As in the proof of corollary 5, we use that $\lfloor k/4 \rfloor \ge k/6$ and $\lfloor \log(k/4) \rfloor \ge \log(k/8)$. This yields the claimed bound.

5 Bounds for Single Neurons
In this section, we consider the three discrete variants of a local receptive field neuron and the gaussian RBF neuron. We show that their VC dimension is at most linear. Furthermore, this bound is asymptotically tight.

5.1 Discrete Neurons. We assume in the following that the output of the ternary CSRF neuron is thresholded at $1/2$ or any other fixed value from the interval $(0, 1]$, to obtain output values in $\{0, 1\}$. Thus, we can treat the binary and ternary CSRF neuron similarly. (If the threshold is chosen from the interval $[-1, 0]$, this corresponds to a negated binary RBF neuron and, hence, has the VC dimension of the latter.)

Theorem 6. The VC dimension of a binary RBF neuron with $n$ inputs is equal to $n+1$. The VC dimension of a (binary and ternary) center-surround neuron with $n$ inputs is at least $n+1$ and at most $4n+5$.
Proof. The class of functions computed by a binary RBF neuron with $n$ inputs can be identified with the class of balls in $\mathbb{R}^n$. Dudley (1979) shows that the VC dimension of this class is equal to $n+1$ (see also Wenocur & Dudley, 1981; Assouad, 1983). This gives the result for the RBF neuron. Clearly, a binary and ternary center-surround neuron can simulate the RBF neuron by adjusting the center radius to 0. This implies the lower bound $n+1$.

For the upper bound, consider a center-surround neuron with $n$ inputs and assume, without loss of generality, that its output is binary. Let $c = (c_1, \ldots, c_n) \in \mathbb{R}^n$ be the center and $a, b \in \mathbb{R}$ the radii. Then if $f : \mathbb{R}^n \to \{0, 1\}$ is the function computed by the neuron, on some input vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, it satisfies
\[ \begin{aligned} f(x) = 1 \iff\ & \|x - c\| \ge a \ \text{ and } \ \|x - c\| \le b \\ \iff\ & (x_1 - c_1)^2 + \cdots + (x_n - c_n)^2 \ge a^2 \ \text{ and } \\ & (x_1 - c_1)^2 + \cdots + (x_n - c_n)^2 \le b^2 \\ \iff\ & \|x\|^2 - 2c_1 x_1 - \cdots - 2c_n x_n \ge a^2 - c_1^2 - \cdots - c_n^2 \ \text{ and } \\ & -\|x\|^2 + 2c_1 x_1 + \cdots + 2c_n x_n \ge -b^2 + c_1^2 + \cdots + c_n^2. \end{aligned} \]
Each of the last two inequalities defines a halfspace in $\mathbb{R}^{n+1}$, both with weights $1, -2c_1, \ldots, -2c_n$ or the negative thereof, and with thresholds $a^2 - c_1^2 - \cdots - c_n^2$ or $-b^2 + c_1^2 + \cdots + c_n^2$, respectively. Thus, the number of dichotomies induced by a binary center-surround neuron on some finite subset of $\mathbb{R}^n$ is not larger than the number of dichotomies induced by intersections of parallel halfspaces on some subset of $\mathbb{R}^{n+1}$ with the same cardinality, where the additional input component is obtained as $\|x\|^2$ for every input vector. Hence, a set shattered by a center-surround neuron in $\mathbb{R}^n$ gives rise to a set of the same cardinality shattered by intersections of parallel halfspaces in $\mathbb{R}^{n+1}$. According to theorem 9, which is given in the appendix, the VC dimension of the class of intersections of parallel halfspaces in $\mathbb{R}^{n+1}$ is at most $4n+5$. This entails the bound for the center-surround neuron. In contrast to the RBF neuron, the exact values for the VC dimension of center-surround neurons are not yet known.
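The change of variables in this proof can be checked numerically: adding the coordinate $z = \|x\|^2$ turns the annulus test into the conjunction of two linear threshold tests. A sketch (ours, with illustrative names):

```python
# The lifting from the proof of theorem 6: with z = ||x||^2 as an extra
# coordinate, membership in the annulus a <= ||x - c|| <= b equals the
# intersection of two parallel halfspaces in R^(n+1), both with normal
# (-2c, 1) up to sign.
import math
import random

def annulus(x, c, a, b):
    return a <= math.dist(x, c) <= b

def lifted_halfspaces(x, c, a, b):
    z = sum(t * t for t in x)                 # z = ||x||^2
    dot = sum(t * u for t, u in zip(x, c))
    c2 = sum(u * u for u in c)
    lower = z - 2 * dot >= a * a - c2         # encodes ||x - c||^2 >= a^2
    upper = -z + 2 * dot >= -b * b + c2       # encodes ||x - c||^2 <= b^2
    return lower and upper

random.seed(2)
c, a, b = [0.3, -0.2, 0.7], 0.5, 1.5
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(3)]
    assert annulus(x, c, a, b) == lifted_halfspaces(x, c, a, b)
print("annulus membership equals an intersection of two parallel halfspaces")
```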
5.2 Gaussian RBF Neurons. It is easy to see that a thresholded gaussian RBF neuron, that is, one with a fixed output threshold, is equivalent to a binary RBF neuron. Hence, by theorem 6, its VC dimension is equal to the number of its parameters. The pseudodimension generalizes the VC dimension to real-valued function classes and is defined as follows.

Definition 3. Let $F$ be a class of functions mapping $\mathbb{R}^n$ to $\mathbb{R}$. The pseudodimension of $F$ is the cardinality of the largest set $S \subseteq \mathbb{R}^{n+1}$ shattered by the class $\{(x, y) \mapsto \operatorname{sgn}(f(x) - y) : f \in F\}$.
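As a small illustration of this definition (ours, not from the paper): for a gaussian RBF neuron with $n = 1$ input, two points $(x, y)$ can be shattered by the class $\{(x, y) \mapsto \operatorname{sgn}(f(x) - y)\}$, witnessing pseudodimension at least $2 = n + 1$:

```python
# Illustration (ours) of definition 3 for a 1-input gaussian RBF neuron
# f(x) = exp(-(x - c)^2 / sigma^2): two points (x, y) are shattered because
# every sign pattern of [f(x) > y] is realized by some choice of (c, sigma).
import math

def f(x, c, sigma):
    return math.exp(-((x - c) ** 2) / sigma ** 2)

points = [(0.0, 0.5), (1.0, 0.5)]     # (x, y) pairs in R^(n+1)
params = {
    (1, 1): (0.5, 10.0),              # wide bump: above both thresholds
    (0, 0): (5.0, 0.1),               # distant narrow bump: below both
    (1, 0): (0.0, 0.5),               # bump over x = 0 only
    (0, 1): (1.0, 0.5),               # bump over x = 1 only
}
for pattern, (c, sigma) in params.items():
    realized = tuple(int(f(x, c, sigma) > y) for x, y in points)
    assert realized == pattern
print("all four dichotomies of the two points are realized")
```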
The pseudodimension is a stronger notion than the VC dimension in that an upper bound on the pseudodimension of some function class also yields the same bound on the VC dimension of the thresholded class, whereas the converse need not be true. Thus, there is no general way of inferring the pseudodimension of a gaussian RBF neuron from the VC dimension of a binary RBF neuron. Nevertheless, the pseudodimension of a single gaussian RBF neuron is linear, as we show now.

Theorem 7. The pseudodimension of a gaussian RBF neuron with $n$ inputs is at least $n+1$ and at most $n+2$.
Proof. The lower bound easily follows from the facts that a thresholded gaussian RBF neuron can simulate any binary RBF neuron, that a binary RBF neuron has VC dimension n + 1 (see theorem 6), and that the VC dimension is a lower bound for the pseudodimension. We obtain the upper bound as follows: According to a well-known result (see, e.g., Haussler, 1992, theorem 5), since the function z ↦ exp(−z) is continuous and strictly decreasing, the pseudodimension of the class

    \left\{ x \mapsto \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) : c \in \mathbb{R}^n, \sigma \in \mathbb{R} \setminus \{0\} \right\}

is equal to the pseudodimension of the class

    \left\{ x \mapsto \frac{\|x - c\|^2}{\sigma^2} : c \in \mathbb{R}^n, \sigma \in \mathbb{R} \setminus \{0\} \right\},

which is, by definitions 1 and 3, equal to the VC dimension of the class

    \left\{ (x, y) \mapsto \operatorname{sgn}\left( \frac{\|x - c\|^2}{\sigma^2} - y \right) : c \in \mathbb{R}^n, \sigma \in \mathbb{R} \setminus \{0\} \right\}.
Neural Networks with Local Receptive Fields
947
This class can also be written as

    \left\{ (x, y) \mapsto \operatorname{sgn}\left( \|x\|^2 - 2 c \cdot x + \|c\|^2 - \sigma^2 y \right) : c \in \mathbb{R}^n, \sigma \in \mathbb{R} \setminus \{0\} \right\}.

Each function in this class has the form (x, y) ↦ sgn(f(x) + g(x, y)) with f(x) = ‖x‖² and g being an affine function in n + 1 variables. Hence, the VC dimension of this class cannot be larger than the VC dimension of the class {sgn(f + g) : g affine, g : R^{n+1} → R}. Wenocur and Dudley (1981) show that if G is a d-dimensional vector space of real-valued functions, then {sgn(f + g) : g ∈ G} has VC dimension d (see also Anthony and Bartlett, 1999, theorem 3.5). Thus, the upper bound follows since the class of affine functions in n + 1 variables is a vector space of dimension n + 2.

6 Upper Bounds for Networks of Discrete Neurons
The following result shows that one-hidden-layer networks of discrete local receptive field neurons have a VC dimension bounded by O(W log k). This implies that the lower bounds established in section 3 are asymptotically tight. For the proof, we employ a method from a similar result for threshold networks.

Theorem 8. Suppose N is a network with one hidden layer of binary RBF neurons and a linear output node. Let k denote the number of hidden nodes and W = nk + 2k + 1 the number of weights. Then the VC dimension of N is at most 2W log((2k + 2)/ln 2). If the hidden nodes are binary or ternary CSRF neurons, the VC dimension is at most 2W log((4k + 2)/ln 2).
Proof. A binary RBF neuron with n inputs has VC dimension n + 1 (see theorem 6). The output node of N is a linear neuron with k inputs; thus, it has VC dimension k + 1. Since each node of N has a VC dimension equal to the number of its parameters, it follows by reasoning similarly as in theorem 6.1 of Anthony and Bartlett (1999) that the number of dichotomies induced by N on a set of cardinality m, where m ≥ W, is at most (em(k + 1)/W)^W. (Note that N has k + 1 computation nodes.) This implies that the VC dimension is at most 2W log((2k + 2)/ln 2). Consider now the case that the hidden nodes are CSRF neurons. Clearly, a weighted binary or ternary CSRF neuron can be simulated by a weighted combination of two binary RBF neurons. Thus, N can be simulated by a network N′ with 2k binary RBF neurons as hidden nodes. Now observe that each CSRF neuron gives rise to two RBF neurons with the same center. Thus, although the number of nodes and connections in N′ has increased,
the number of parameters is the same as in N. In other words, N′ is a network with equivalences among its weights. Combining a method due to Shawe-Taylor (1995) for networks with equivalences with the abovementioned derivation by Anthony and Bartlett (1999), we obtain that N′ induces at most (em(2k + 1)/W)^W dichotomies on a set of cardinality m. This results in a VC dimension not larger than 2W log((4k + 2)/ln 2).

7 Conclusion
Local receptive fields occur in many kinds of biological and artificial neural networks. We have studied here several models of local receptive field neurons and have established superlinear VC dimension lower bounds for networks with one hidden layer. Although compared with the previously known linear bounds, at first sight the gain by a logarithmic factor seems exiguous, there are at least two arguments showing that it constitutes a significant improvement. First, the VC dimension is a rather coarse measure. Increasing it by one amounts to doubling the number of functions computed by the network. Second, in a network with the VC dimension linearly bounded from above by the number of weights, each weight can be considered responsible for a particular input vector. Superlinearity implies that each weight manages to get hold of a number of input vectors that increases with the network size. Thus, in networks with superlinear VC dimension, the neurons have found a very effective way to cooperate and coordinate their computations.

The VC dimension yields bounds on the complexity of learning for several models of learnability. For instance, bounds on the computation time or the number of examples required for learning can often be expressed in terms of the VC dimension. If the VC dimension provides a lower bound in a model of learning, then the superlinear lower bounds given here yield new lower bounds on the complexity of learning using local receptive field neural networks. Of course, if the VC dimension serves as an upper bound in a model, there is no immediate consequence. But then one may be encouraged to find other measures that more tightly quantify the complexity of learning in these models.

For the discrete versions of local receptive field neurons, we have shown that the superlinear lower bounds for networks are asymptotically tight. The currently available methods for RBF and DOG networks give rise only to the upper bound O(W²k²) for these networks.
This bound, however, is also valid for networks of unrestricted depth and of sigmoidal neurons. The problems of narrowing the gaps between upper and lower bounds for RBF and sigmoidal networks with one hidden layer therefore seem to be closely related. We have also established tight linear bounds for the VC dimension of single discrete neurons and for the pseudodimension of the gaussian RBF neuron. The VC and pseudodimension of the DOG neuron can be shown to be at most quadratic. We conjecture also that the DOG neuron has linear VC
and pseudodimension, but the methods currently available do not seem to permit an answer. In the constructions of the sets being shattered, we have permitted arbitrary real vectors. It is not hard to see that rational numbers suffice. It would be interesting to know what happens for even more restrictive inputs such as, for instance, Boolean vectors. We have also allowed that the centers of the local receptive field neurons can be placed anywhere in the input domain. We do not know if the results hold when the centers may not freely float around.

The superlinear bounds involve constant factors that are the largest known for any standard neural network with one hidden layer. This fact could be interpreted as evidence that the cooperative computational capabilities of local receptive field neurons are even higher than those of other neuron types. This statement, however, must be taken with a grain of salt since the constants in these bounds are not yet known to be tight.

Gaussian units are just one type of RBF neuron. The method we have developed for obtaining superlinear lower bounds is of a quite general nature. We therefore expect it to be applicable to other RBF networks as well. The main clue in the result for RBF networks was first to consider CSRF and DOG networks. With this idea, we have established a new kind of link between neurophysiological models and artificial neural networks. This link extends the paradigm of neural computation by demonstrating that models originating from neuroscience lead not only to powerful computing mechanisms but can also be essential in theory, that is, in proofs concerning the computational power of those mechanisms.

Appendix: A VC Dimension Upper Bound for Intersections and Unions of Parallel Halfspaces
We consider the function classes defined by intersections and unions of parallel halfspaces and derive an upper bound on the VC dimension of these classes. This bound was used in theorem 6. A general way of bounding the VC dimension of classes that are constructed from finite intersections and unions has been established by Blumer, Ehrenfeucht, Haussler, and Warmuth (1989). In particular, they show that if we form a new class having as members intersections of s functions from a class with VC dimension d, then the VC dimension of the new class is less than 2ds log(3s) (Blumer et al., 1989, lemma 3.2.3). The following new calculation for parallel halfspaces results in a bound with improved constants.

Theorem 9. The function class consisting of intersections of parallel halfspaces in R^n has VC dimension at most 4n + 1. The same holds for the class of unions of parallel halfspaces.
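The improvement over the generic construction can be tabulated directly. The sketch below (an illustrative comparison, assuming the standard fact that halfspaces in R^n have VC dimension d = n + 1, with s = 2 parallel halfspaces intersected) evaluates the generic Blumer et al. bound against the bound of theorem 9:

```python
import math

def blumer_bound(d: int, s: int) -> float:
    # Generic bound for intersections of s concepts from a class of
    # VC dimension d (Blumer et al., 1989, lemma 3.2.3): 2*d*s*log2(3s).
    return 2 * d * s * math.log2(3 * s)

def theorem9_bound(n: int) -> int:
    # Improved bound for intersections of parallel halfspaces in R^n.
    return 4 * n + 1

for n in (1, 2, 5, 10):
    d = n + 1  # VC dimension of (unrestricted) halfspaces in R^n
    print(n, round(blumer_bound(d, s=2), 1), theorem9_bound(n))
```

For every n the generic bound exceeds 10(n + 1), so the parallel-halfspace calculation improves the constant by more than a factor of 2.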
Proof. The proof is given for intersections; the result then follows for unions by duality. Clearly, it is sufficient to consider intersections of only two parallel halfspaces. We use R^{n+2} to represent the joint parameter domain for the halfspaces. The first n components defining the weights are shared by both halfspaces. The components n + 1 and n + 2 correspond to their separate thresholds. The main step is to derive an upper bound on the number of dichotomies induced on any set S ⊆ R^n of cardinality m. We assume without loss of generality that S is in general position. (If not, then the elements can be perturbed to obtain a set in general position with a number of dichotomies no less than for the original set. See, e.g., Anthony and Bartlett, 1999, p. 34.) First, we give an upper bound on the number of dichotomies induced by pairs of parallel halfspaces where each halfspace is nontrivial. Here, we say that a halfspace is trivial if it induces one of the dichotomies (∅, S) or (S, ∅). Such a bound is obtained in terms of the number of connected components into which the parameter domain is partitioned by certain hyperplanes arising from the elements of S. Every input vector (s_1, . . . , s_n) ∈ S gives rise to the two hyperplanes

    \{x \in \mathbb{R}^{n+2} : s_1 x_1 + \cdots + s_n x_n - x_{n+1} = 0\},
    \{x \in \mathbb{R}^{n+2} : s_1 x_1 + \cdots + s_n x_n - x_{n+2} = 0\},

that is, their representations in R^{n+2} are the vectors (s_1, . . . , s_n, −1, 0) and (s_1, . . . , s_n, 0, −1), respectively. All hyperplanes are homogeneous; that is, they pass through the origin. It is clear that for every connected component of R^{n+2} arising from this partition, the two functions induced on S by the pair of halfspaces represented in this way are the same for all vectors belonging to this component. Thus, the number of connected components provides an upper bound on the number of induced dichotomies. A well-known result attributed to Schläfli (1901) states that m homogeneous hyperplanes in general position partition R^{n+1} into exactly

    2 \sum_{i=0}^{n} \binom{m-1}{i}  \qquad (A.1)

connected components (see also Anthony and Bartlett, 1999, lemma 3.3). Hence, the set S, giving rise to 2m hyperplanes, partitions R^{n+2} into at most

    2 \sum_{i=0}^{n+1} \binom{2m-1}{i}  \qquad (A.2)
connected components. Not all of them, however, represent pairs of nontrivial halfspaces. For every trivial dichotomy of the first halfspace, we can have as many dichotomies induced by the second halfspace as there are dichotomies possible by a single halfspace in R^{n+1} on a set of cardinality m; the same holds for every trivial dichotomy of the second halfspace. Hence, using Schläfli's (1901) count, equation A.1, we may subtract from equation A.2 the amount of

    \left( 4 \sum_{i=0}^{n} \binom{m-1}{i} \right) + \left( 4 \sum_{i=0}^{n} \binom{m-1}{i} \right) - 4.  \qquad (A.3)
The term −4 at the end results from the fact that the four combinations of trivial halfspaces are counted by both sums. Note also that the number given in equation A.3 is not a bound but is precise since S is in general position. Up to this point, the pairs of halfspaces also include redundant combinations where the intersection is empty or one halfspace is a subset of the other. Clearly, for every nonredundant pair, there are three redundant ones. Therefore, we can exclude the latter, dividing the number by 4. Thus, an upper bound for the number of pairs of nontrivial halfspaces with nonempty intersections is obtained by subtracting equation A.3 from A.2 and taking one-fourth of the result, giving

    \left( \frac{1}{2} \sum_{i=0}^{n+1} \binom{2m-1}{i} \right) - \left( 2 \sum_{i=0}^{n} \binom{m-1}{i} \right) + 1.  \qquad (A.4)
That the intersection of two halfspaces is nonempty does not imply that the dichotomy induced on S is nonempty. Therefore, we are allowed to exclude these cases. Each pair of nontrivial halfspaces with empty intersection on S gives rise to two nontrivial dichotomies that can be induced by a single halfspace. Thus, we may subtract half the number of nontrivial dichotomies induced by a single halfspace, which is

    \left( \sum_{i=0}^{n} \binom{m-1}{i} \right) - 1.  \qquad (A.5)
Finally, we take those pairs into account where at least one halfspace induces a trivial dichotomy. In this case, the dichotomy can be induced by a single halfspace; that is, we may add the amount given by equation A.1. All in all, an upper bound is provided by equation A.4, minus A.5, plus A.1, yielding

    \left( \frac{1}{2} \sum_{i=0}^{n+1} \binom{2m-1}{i} \right) - \left( \sum_{i=0}^{n} \binom{m-1}{i} \right) + 2.  \qquad (A.6)
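Schläfli's count, equation A.1, can be sanity-checked numerically for small cases: on m points in general position, homogeneous halfspaces in R^d induce exactly 2 Σ_{i=0}^{d−1} (m−1 choose i) distinct sign patterns, one per connected component of the partitioned parameter space. The sketch below is a random-sampling check, not a proof; the point set and sample size are illustrative assumptions:

```python
import math
import random

def schlafli_count(m: int, d: int) -> int:
    # Dichotomies induced on m points in general position by homogeneous
    # halfspaces in R^d (equation A.1 with the parameter space R^d).
    return 2 * sum(math.comb(m - 1, i) for i in range(d))

random.seed(0)
m, d = 4, 2
# m points in general position for homogeneous hyperplanes in R^2:
# distinct directions, no two collinear with the origin.
angles = [0.1, 0.9, 1.7, 2.6]
points = [(math.cos(a), math.sin(a)) for a in angles]

patterns = set()
for _ in range(200_000):
    w = (random.gauss(0, 1), random.gauss(0, 1))
    patterns.add(tuple(w[0] * x + w[1] * y > 0 for (x, y) in points))

print(len(patterns), schlafli_count(m, d))  # the two counts agree
```

Because each of the 2m arcs of weight directions has appreciable measure here, random sampling finds every region; the formula itself, of course, is exact.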
Assuming 1 ≤ n + 1 ≤ 2m − 1 without loss of generality, we use the estimates

    \frac{1}{2} \sum_{i=0}^{n+1} \binom{2m-1}{i} < \frac{1}{2} \left( \frac{e(2m-1)}{n+1} \right)^{n+1}

(see, e.g., Anthony and Bartlett, 1999, theorem 3.7) and

    \sum_{i=0}^{n} \binom{m-1}{i} \geq 2,
whence we obtain that the number of dichotomies is less than

    \frac{1}{2} \left( \frac{e(2m-1)}{n+1} \right)^{n+1}.
Now suppose that S is shattered. Then all 2^m dichotomies must be induced, which implies that

    2^m < 2^n \left( \frac{e(2m-1)}{2(n+1)} \right)^{n+1}.
Taking logarithms, this is equivalent to

    m < n + (n+1) \log \left( \frac{e(2m-1)}{2(n+1)} \right).  \qquad (A.7)
It is well known that all real numbers a, b > 0 satisfy the inequality ln a ≤ ab + ln(1/b) − 1 (see, e.g., Anthony and Bartlett, 1999, appendix A.1.1). Substituting a = (2m − 1)/2 and b = (ln 2)/(2(n + 1)) yields

    \ln \left( \frac{2m-1}{2} \right) \leq \frac{(2m-1)\ln 2}{4(n+1)} + \ln \left( \frac{2(n+1)}{e \ln 2} \right),
implying

    (n+1) \log \left( \frac{2m-1}{2} \right) \leq \frac{m}{2} - \frac{1}{4} + (n+1) \log \left( \frac{2(n+1)}{e \ln 2} \right).
Using this in inequality A.7, it follows that

    m < \frac{m}{2} + n + (n+1) \log \left( \frac{2}{\ln 2} \right) - \frac{1}{4},

which is equivalent to

    m < n \left( 2 + 2 \log \frac{2}{\ln 2} \right) + 2 \log \frac{2}{\ln 2} - \frac{1}{2}.

This implies that m ≤ 4n + 1. Hence, the cardinality of any set shattered by intersections of parallel halfspaces, and thus the VC dimension of this class, is not larger than this number.

Acknowledgments
This work has been supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150.

References

Anthony, M. (1995). Classification by polynomial surfaces. Discrete Applied Mathematics, 61, 91–103.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press.
Anthony, M., & Holden, S. B. (1994). Quantifying generalization in linearly weighted neural networks. Complex Systems, 8, 91–114.
Assouad, P. (1983). Densité et dimension. Annales de l'Institut Fourier, 33(3), 233–282.
Atick, J. J., & Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Computation, 5, 45–60.
Bartlett, P. L., Maiorov, V., & Meir, R. (1998). Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10, 2159–2173.
Bartlett, P. L., & Williamson, R. C. (1996). The VC dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8, 625–628.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36, 929–965.
Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326–334.
Dudley, R. M. (1979). Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics, 31, 306–308.
Enroth-Cugell, C., & Robson, J. G. (1966). The contrast sensitivity of retinal ganglion cells of the cat. Journal of Physiology, 187, 517–552.
Erlich, Y., Chazan, D., Petrack, S., & Levy, A. (1997). Lower bound on VC-dimension by local shattering. Neural Computation, 9, 771–776.
Glezer, V. D. (1995). Vision and mind: Modeling mental functions. Mahwah, NJ: Erlbaum.
Hartman, E. J., Keeler, J. D., & Kowalski, J. M. (1990). Layered neural networks with gaussian hidden units as universal approximations. Neural Computation, 2, 210–215.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.
Hawken, M. J., & Parker, A. J. (1987). Spatial properties of neurons in the monkey striate cortex. Proceedings of the Royal Society of London, Series B, 231, 251–288.
Holden, S. B., & Rayner, P. J. W. (1995). Generalization and PAC learning: Some new results for the class of generalized single-layer networks. IEEE Transactions on Neural Networks, 6, 368–380.
Howlett, R. J., & Jain, L. C. (Eds.). (2001a). Radial basis function networks 1: Recent developments in theory and applications. Berlin: Springer-Verlag.
Howlett, R. J., & Jain, L. C. (Eds.). (2001b). Radial basis function networks 2: New advances in design. Berlin: Springer-Verlag.
Joshi, A., & Lee, C.-H. (1993). Backpropagation learns Marr's operator. Biological Cybernetics, 70, 65–73.
Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54, 169–176.
Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54, 190–198.
Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16, 37–68.
Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1995). Lower bounds on the VC dimension of smoothly parameterized function classes. Neural Computation, 7, 1040–1053.
Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1997). Correction to "Lower bounds on VC-dimension of smoothly parameterized function classes." Neural Computation, 9, 765–769.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proceedings of the National Academy of Sciences USA, 83, 7508–7512.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.
Liu, S.-C., & Boahen, K. (1996). Adaptive retina with center-surround receptive field. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 678–684). Cambridge, MA: MIT Press.
Maass, W. (1993). Bounds on the computational power and learning complexity of analog neural nets. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing (pp. 335–344). New York: ACM Press.
Maass, W. (1994). Neural nets with super-linear VC-dimension. Neural Computation, 6, 877–884.
Maass, W., & Schmitt, M. (1999). On the complexity of learning for spiking neurons with temporal coding. Information and Computation, 153, 26–46.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Freeman.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207, 187–217.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Mead, C. A., & Mahowald, M. A. (1988). A silicon model of early visual processing. Neural Networks, 1, 91–97.
Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8, 164–177.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Natschläger, T., & Schmitt, M. (1996). Exact VC-dimension of Boolean monomials. Information Processing Letters, 59, 19–20. (Erratum: Information Processing Letters, 60, 107, 1996.)
Nicholls, J. G., Martin, A. R., & Wallace, B. G. (1992). From neuron to brain: A cellular and molecular approach to the function of the nervous system (3rd ed.). Sunderland, MA: Sinauer Associates.
Nilsson, N. J. (1990). The mathematical foundations of learning machines. San Mateo, CA: Morgan Kaufmann.
Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246–257.
Park, J., & Sandberg, I. W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5, 305–316.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
Powell, M. J. D. (1992). The theory of radial basis function approximation in 1990. In W. Light (Ed.), Advances in numerical analysis II: Wavelets, subdivision algorithms, and radial basis functions (pp. 105–210). Oxford: Clarendon Press.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rodieck, R. W. (1965). Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 5, 583–601.
Sakurai, A. (1993). Tighter bounds of the VC-dimension of three layer networks. In Proceedings of the World Congress on Neural Networks (Vol. 3, pp. 540–543). Hillsdale, NJ: Erlbaum.
Schläfli, L. (1901). Theorie der vielfachen Kontinuität. Zürich: Zürcher & Furrer.
Schmidhuber, J., Eldracher, M., & Foltin, B. (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8, 773–786.
Schmitt, M. (2000). On the complexity of computing and learning with multiplicative neural networks. Neural Computation, in press.
Shawe-Taylor, J. (1995). Sample sizes for threshold networks with equivalences. Information and Computation, 118, 65–72.
Tessier-Lavigne, M. (1991). Phototransduction and information processing in the retina. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science (3rd ed., pp. 400–418). Englewood Cliffs, NJ: Prentice Hall.
Ward, V., & Syrzycki, M. (1995). VLSI implementation of receptive fields with current-mode signal processing for smart vision sensors. Analog Integrated Circuits and Signal Processing, 7, 167–179.
Wenocur, R. S., & Dudley, R. M. (1981). Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33, 313–318.
Yasui, S., Furukawa, T., Yamada, M., & Saito, T. (1996). Plasticity of center-surround opponent receptive fields in real and artificial neural systems of vision. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 159–165). Cambridge, MA: MIT Press.
Yee, P., & Haykin, S. (2001). Regularized radial-basis function networks: Theory and applications. New York: Wiley.
Received March 27, 2001; accepted June 4, 2001.
ARTICLE
Communicated by Dan Tranchina
A Population Study of Integrate-and-Fire-or-Burst Neurons

A.R.R. Casti
[email protected]

A. Omurtag
[email protected]

A. Sornborger
[email protected]
Laboratory of Applied Mathematics, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.

E. Kaplan
[email protected]
Laboratory of Applied Mathematics and Department of Ophthalmology, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.

B. Knight
[email protected]
Laboratory of Applied Mathematics, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.

J. Victor
[email protected]
Department of Neurology and Neuroscience, Weill Medical College of Cornell University, New York, NY 10021, U.S.A.

L. Sirovich
[email protected]
Laboratory of Applied Mathematics, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
Any realistic model of the neuronal pathway from the retina to the visual cortex (V1) must account for the bursting behavior of neurons in the lateral geniculate nucleus (LGN). A robust but minimal model, the integrate-and-fire-or-burst (IFB) model, has recently been proposed for individual LGN neurons. Based on this, we derive a dynamic population model and study a population of such LGN cells. This population model, the first simulation of its kind evolving in a two-dimensional phase space, is used to study the behavior of bursting populations in response to diverse stimulus conditions.
Neural Computation 14, 957–986 (2002)
© 2002 Massachusetts Institute of Technology
958
A.R.R. Casti et al.
1 Introduction

The lateral geniculate nucleus (LGN) is the main gateway to the visual cortex. Unlike the receptive fields of cortical neurons, the receptive fields of LGN relay cells have essentially the same center-surround structure as their retinal afferents. This observation, coupled with the roughly equal numbers of retinal ganglion cells and LGN cells, led to the view that the LGN is a passive relay that simply transmits a faithful replica of the input to the cortex. Anatomical data, however, show that retinal afferents comprise just 10% to 20% of the input to the LGN (Sherman & Guillery, 1996). The majority of LGN input originates in the cortex and parabrachial region of the brainstem, with some inhibitory contributions coming from local interneurons and from cells of the thalamic reticular nucleus, neither of which makes direct connections with higher levels in the visual pathway. This complex connectivity implies that the LGN serves as more than a passive relay. Further evidence of the LGN's active role in visual processing is the dual modality of its so-called relay cells. When sufficiently depolarized, an LGN cell will fire in a tonic mode, characterized by a train of regularly spaced spikes, with a nearly linear response over a range of time-dependent stimuli that typify real-world input (0–10 Hz) (Mukherjee & Kaplan, 1998). In tonic mode, relay cells are well modeled by classic integrate-and-fire dynamics (Tuckwell, 1988). Because of the primarily linear input-output relationship when an LGN neuron fires in tonic mode, this state imparts to the cortex a relatively faithful rendition of the visual stimulus over the entire duration in which it is coded by the retinal ganglion cells. Relay cells may also respond in a distinctly different mode, the burst mode, in which a cell fires bursts of 2 to 10 spikes, with an interspike interval of about 4 ms or less (Guido, Lu, & Sherman, 1992), followed by a refractory period on the order of 100 ms before the next burst.
Burst behavior is precipitated by a low-threshold calcium spike that occurs when the cell is hyperpolarized sufficiently to deinactivate calcium T-channels (Jahnsen & Llinas, 1982). The calcium spike is termed low threshold because it is initiated at much more negative membrane potentials than those that initiate conventional spikes, which are mediated by activated sodium and potassium channels. After their deinactivation, the calcium channels may be activated by current injection or depolarizing excitatory postsynaptic potentials. If the cell is driven past threshold, it then fires a train of action potentials for about 20 to 50 ms until the T-channels inactivate, after which the burst cycle begins anew, provided the cell is again hyperpolarized below the deinactivation threshold. The burst mode is a nonlinear response since it is intermittent, whether driven by continuous or steplike input. Further, because of the all-or-none character of the low-threshold calcium spike, the dynamics of the burst response is essentially the same in both amplitude and the number of spikes, regardless of whether the cell is driven just above its firing threshold or well past it. Thus, in contrast to the transmission of visual input in tonic
mode, the LGN in burst mode transmits a modified version of the visual stimulus to the cortex. The time dependence of the input and differences in driving strength appear to have little influence on the details of the burst activity, which is shaped primarily by the intrinsic dynamics of the relay neuron. The functional significance of the LGN's two distinct modes of operation may be related to attentional demands. Relay cells of fully awake animals tend to fire in tonic mode, which suggests that a faithful transmission of retinal activity is the primary need of the alert animal. Rhythmic bursting, on the other hand, is common for low arousal states (Livingstone & Hubel, 1981) and during paroxysmal events such as epileptic seizures (Steriade & Contreras, 1995). Arrhythmic bursting has been observed (Guido & Weyand, 1995) in both lightly anaesthetized and fully awake cats subjected to a visual stimulus. Taken together, these results indicate that the burst mode is not solely associated with a sleep state or functional disorders, and the variability of bursting patterns may signal distinct events to the cortex. Another difference between the tonic and burst modes is their temporal frequency response (Mukherjee & Kaplan, 1995). Relatively depolarized cells, which fire in tonic mode when stimulated, act as low-pass filters. The transfer function of bursting cells, on the other hand, has a low-frequency cutoff. Because real-world stimuli typically contain low frequencies, the implication is that the job of the tonic mode is to transmit more faithfully the details of synaptic events responding to a specific visual stimulus. Sudden changes in retinal input, however, have more power at high frequencies, so the role of the burst mode may be to disable the geniculate relay and tell the cortex that nothing is happening when bursting rhythmically, but to signal sharp changes in the visual field with arrhythmic bursts.
Furthermore, there is evidence that the LGN switches between burst and tonic mode depending on the attentional needs of higher levels of the visual pathway (Sherman, 1996). This hypothesis provides a potential explanation for the intricate feedback connections arriving from the cortex and the brainstem, a connectivity pattern that is likely not accidental. Clearly, any realistic model of the visual pathway must account for the dual behavior of the LGN. In this work, we elaborate on the features of a recent detailed model (Smith, Cox, Sherman, & Rinzel, 2000) of thalamic relay cells. Our principal purpose is to demonstrate the utility of a particular approach, the population dynamics method, in the simulation of a population of LGN cells that accurately mirrors the burst and tonic modes over a wide range of stimulus conditions. Although it is well known that the discharge of individual LGN neurons can be tightly correlated with the activity of individual retinal ganglion cells (Troy & Robson, 1992), this aspect of the physiology will not be considered, and the input to an LGN cell, at least implicitly, will be modeled by Poisson arrivals. We consider the behavior of an assembly of LGN cells in the same spirit in which Smith et al. (2000) consider individual LGN neuron responses to external driving.
The population method presented here illustrates a higher-dimensional extension of previous models (Knight, 1972; Knight, Manin, & Sirovich, 1996; Knight, Omurtag, & Sirovich, 2000; Nykamp & Tranchina, 2000; Omurtag, Knight, & Sirovich, 2000; Sirovich, Knight, & Omurtag, 2000) used in the simulation of a population of integrate-and-re cells. A model with an additional underlying dimension is necessary to capture the dependence of relay cells on the calcium channel conductance, or on any other ion channels that may be operative. To illustrate the population method for relay cells, we employ a two-dimensional state space model for individual LGN cells as suggested by Smith et al. (2000). In section 2, we review this single-cell model and formulate the corresponding population equation. In section 3, we present simulations of a population driven by steady and step input currents and compare the validity of these results against a direct simulation of 104 neurons, each of which follows the integrate-andre-or-burst dynamics. The appendix contains the details of the numerical techniques used. 2 Integrate-and-Fire-or-Burst Population Equations Smith et al. (2000) have presented a model of relay neurons in the LGN of a cat. Their model extends the classic integrate-and-re dynamics to a two-dimensional system that involves a membrane voltage, V, and a gating variable, h, which models a slow Ca2 C current. The slow current, usually referred to as the T-channel current, allows bursts of impulses to be red by a cell which recovers from hyperpolarization (Jahnsen & Llinas, 1982). Thalamic cells that are sufciently depolarized (roughly V > ¡60 mV) do not burst, but exhibit integrate-and-re behavior when a depolarizing current is applied. Both of these features are built into the integrate-and-re-or-burst (IFB) model that we use to model the activity of each neuron of a population. 
After a brief review of the single-cell equations, we derive the kinetic equation for a population of IFB cells.

2.1 Dynamical Equations for an IFB Neuron. The IFB model introduced in Smith et al. (2000) includes a Hodgkin-Huxley-type equation for the voltage $V$,
\[ C \frac{dV}{dt} = I - I_L - I_T. \tag{2.1} \]
Here, $I$ is an applied current, $I_L$ is a leakage current of the form
\[ I_L = g_L (V - V_L), \tag{2.2} \]
and the current $I_T$ couples the dynamics of $V$ to the calcium conductance variable $h$,
\[ I_T = g_T m_\infty h (V - V_T), \tag{2.3} \]
A Population Study of Integrate-and-Fire-or-Burst Neurons
where $m_\infty(V)$ is an activation function for the Ca$^{2+}$ channel, and $V_L$ and $V_T$ are the reversal potentials for the leakage and calcium ions. For simplicity, $m_\infty$ is represented as a Heaviside function:
\[ m_\infty(V) = H(V - V_h) = \begin{cases} 1 & (V > V_h) \\ 0 & (V < V_h). \end{cases} \tag{2.4} \]
The dynamics of the Ca$^{2+}$ current, which typically varies on long timescales relative to the time course of a fast sodium spike ($\lesssim 4$ ms), is given in the IFB model by
\[ \frac{dh}{dt} = \begin{cases} -h/\tau_h^- & (V > V_h) \\ (1-h)/\tau_h^+ & (V < V_h), \end{cases} \tag{2.5} \]
so $h$ always approaches either zero or one. The parameter $V_h$ divides the $V$ axis into a hyperpolarizing region ($V < V_h$), where the calcium current is deinactivated, and a nonhyperpolarizing region ($V > V_h$) in which the calcium current is inactivated. The timescale $\tau_h^-$ sets the duration of a burst event, and $\tau_h^+$ controls the rate of deinactivation. In accordance with the literature, $\tau_h^+ \gg \tau_h^-$ (Smith et al., 2000). The leakage reversal potential $V_L$ ($\approx -65$ mV) sets the rest voltage in the absence of a stimulus, and for mammalian LGN cells in vitro, $V_L$ typically lies below the potential $V_h$ ($\approx -60$ mV) at which burst behavior is observed (Jahnsen & Llinas, 1982). The reversal potential $V_T$ ($\approx 120$ mV) for the calcium ions is relatively large and causes rapid depolarization once the T-channels are activated. On crossing the firing-threshold voltage, $V_\theta \approx -35$ mV, the cell fires and the membrane potential is reset to $V_r > V_h$, where $V_r \approx -50$ mV is typical. Consequently, the five voltage parameters of the IFB model satisfy the relation
\[ V_L < V_h < V_r < V_\theta < V_T. \tag{2.6} \]
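As an illustration, the single-cell dynamics of equations 2.1 through 2.5 can be integrated with a simple Euler scheme. The following sketch is not the authors' code; the function name, the integration step, and the treatment of the drive as either a smooth current or Poisson jumps of size $\varepsilon$ are our own choices, with the parameter values transcribed from Table 1 (times in ms, voltages in mV).

```python
import numpy as np

# Parameters transcribed from Table 1 (times in ms, voltages in mV,
# conductances in mS/cm^2, capacitance in uF/cm^2).
P = dict(tau_minus=20.0, tau_plus=100.0, g_L=0.035, g_T=0.07,
         V_T=120.0, V_L=-65.0, V_theta=-35.0, V_r=-50.0, V_h=-60.0, C=2.0)

def simulate_ifb(I=0.08, eps=1.5, T=1000.0, dt=0.05, seed=0, stochastic=True):
    """Euler integration of the IFB equations; returns the spike times (ms)."""
    rng = np.random.default_rng(seed)
    sigma0 = I / (P['C'] * eps)           # mean arrival rate, sigma0*eps = I/C
    V, h = -61.0, 0.01                    # initial condition as in Figure 1
    spikes = []
    for k in range(int(T / dt)):
        I_L = P['g_L'] * (V - P['V_L'])
        # m_infty = H(V - V_h): the T-current is active only above V_h (eq 2.4)
        I_T = P['g_T'] * h * (V - P['V_T']) if V > P['V_h'] else 0.0
        V += dt * (-I_L - I_T) / P['C']
        if stochastic:                    # Poisson jumps of size eps
            V += eps * rng.poisson(sigma0 * dt)
        else:                             # smooth (noiseless) drive
            V += dt * I / P['C']
        if V > P['V_h']:                  # inactivation branch of eq 2.5
            h -= dt * h / P['tau_minus']
        else:                             # deinactivation branch of eq 2.5
            h += dt * (1.0 - h) / P['tau_plus']
        if V >= P['V_theta']:             # fire and reset
            spikes.append(k * dt)
            V = P['V_r']
    return np.array(spikes)
```

With a noiseless suprathreshold drive ($I > I_{crit} = 1.05$) the routine settles into tonic firing, while the weak stochastic drive used in Figure 1 produces occasional calcium-triggered bursts.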
The calcium variable $h$ ranges between 0 and 1. For relatively large values of $h$, a cell will burst due to the calcium channel coupling term and will produce fast trajectories in $V$ with rapid resets. For relatively low values of $h$, given a large enough input current, tonic firing occurs. Figure 1 illustrates these dynamic features of the IFB model. In this simulation, the neuron was driven by Poisson-distributed synaptic events that increased the membrane potential by $\varepsilon = 1.5$ mV with each occurrence. The driving rate was chosen too small for the cell to fire in tonic mode. Instead, the cell drifted randomly near the threshold $V_h = -60$ mV for Ca$^{2+}$ deinactivation. Bursts occurred
Figure 1: Simulation of a single IFB neuron driven by Poisson-distributed synaptic events with fixed mean arrival rate $\sigma^0 = I/(C\varepsilon)$, with $I = 0.08\ \mu\mathrm{A/cm^2}$, $\varepsilon = 1.5$ mV, and all other parameters as in Table 1. (Top) Time series for $V$ (solid line) and $h$ (dashed line). The horizontal line demarcates the threshold $V = V_h$. (Bottom) Phase-plane orbit. The asterisk locates the initial condition $(V_0, h_0) = (-61, 0.01)$. In the time span shown, there were two burst events, the first containing three spikes and the other four.
whenever the calcium channel deinactivated and enough clustered synaptic events, in collaboration with the T-channel current, conspired to drive the neuron past the firing threshold.

2.2 Population Dynamics. Our aim is to study the behavior of a population of excitatory IFB neurons. A general approach to this problem, in the limit of a continuous distribution of neurons, is presented in Knight et al. (1996), Knight (2000), Nykamp and Tranchina (2000), and Omurtag, Knight, et al. (2000). Based on the development in the cited references, the number of neurons in a state $\mathbf{v} = (V, h)$ at time $t$ is described by a probability density, $\rho(V, h, t)$, whose dynamics respects conservation of probability. The evolution equation for $\rho$ thus takes the form of a conservation law,
\[ \frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial \mathbf{v}} \cdot \mathbf{J}. \tag{2.7} \]
The probability flux,
\[ \mathbf{J} = \mathbf{J}_S + \mathbf{J}_s \equiv (J_V, J_h), \tag{2.8} \]
is split into two parts. The first part is a streaming flux,
\[ \mathbf{J}_S = \mathbf{F}(\mathbf{v})\,\rho, \tag{2.9} \]
due to the direction field of the single-neuron dynamical system, equations 2.1 through 2.5, where
\[ \mathbf{F}(\mathbf{v}) = \left( -C^{-1} [I_L + I_T],\; \frac{1-h}{\tau_h^+}\, H(V_h - V) - \frac{h}{\tau_h^-}\, H(V - V_h) \right) \equiv (F_V, F_h). \tag{2.10} \]
The applied current enters our analysis as a stochastic arrival term in equation 2.7, with an arrival rate $\sigma(t)$, so that each arrival elevates the membrane voltage by an increment $\varepsilon$. Written in terms of an excitation flux, this is expressed as
\[ \mathbf{J}_s = \hat{\mathbf{e}}_V\, \sigma(t) \int_{V-\varepsilon}^{V} \rho(\tilde{V}, h, t)\, d\tilde{V}, \tag{2.11} \]
where $\hat{\mathbf{e}}_V$ is a unit vector pointing along the voltage direction in the $(V, h)$ phase space. This states that the probability current in the voltage direction, across the voltage $V$, comes from all population members whose voltages range below $V$ by an amount not exceeding the jump voltage $\varepsilon$. The assumptions underlying equation 2.11 are detailed in Omurtag, Knight, and Sirovich (2000). With the flux definitions 2.9 and 2.11, the population density $\rho(V, h, t)$ evolves according to
\[ \frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial \mathbf{v}} \cdot [\mathbf{F}(\mathbf{v})\rho] - \sigma(t)\,[\rho(V, h, t) - \rho(V - \varepsilon, h, t)]. \tag{2.12} \]
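On a uniform voltage grid, the loss and gain terms of equation 2.12 amount to subtracting the density and adding back a copy shifted upward by the jump $\varepsilon$. A minimal sketch (the function and its argument names are illustrative, and $\varepsilon$ is assumed to be an integer multiple of the grid spacing):

```python
import numpy as np

def jump_step(rho, sigma, eps, dV, dt):
    """One explicit-Euler application of the synaptic-arrival terms of eq 2.12
    along the V axis: loss sigma*rho(V), gain sigma*rho(V - eps)."""
    k = int(round(eps / dV))        # jump measured in grid cells; assumes k >= 1
    shifted = np.zeros_like(rho)    # density arriving from eps below
    shifted[k:] = rho[:-k]
    return rho + sigma * dt * (shifted - rho)
```

Applied to a density concentrated well inside the grid, the step conserves total probability and shifts the mean voltage upward by $\sigma\, dt\, \varepsilon$, the drift that reappears in the diffusion approximation of section 2.5.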
Although a realistic synaptic arrival initiates a continuous conductance change, this effect is well approximated by a jump of size $\varepsilon$ in the membrane potential. Thus, we see in equation 2.12 a loss term proportional to $\rho(V, h, t)$ and a gain term proportional to $\rho(V - \varepsilon, h, t)$ due to synaptic events. Hereafter, this is referred to as the finite-jump model. For purposes of exposition, we restrict attention to excitatory current inputs. We note that (like Smith et al., 2000) we are investigating the dynamics of the thalamic cell alone, and not the integrated dynamics of the retinal ganglion cell and LGN cell pair. This involves one simplifying departure
from the actual physiological situation. For the input current $I$, Smith et al. (2000) use a specified smooth function of time. Our population equation 2.12 goes a bit further by including the stochastic nature of synaptic arrivals, which are treated as uncorrelated with LGN cell activity. This should be in fair accord with the physiology for nonretinal input that is many-to-one, but it is only a first approximation for the one-to-one retinal input.

2.3 Population Firing Rate. A response variable of interest is the average firing rate of individual neurons in the population, $r(t)$. In general, two input sources drive any cell of a particular population: synaptic events arriving at a rate $\sigma^0(t)$ that arise from the external neural input and synaptic events resulting from feedback within the population. We will ignore feedback in the interest of simplicity and take the input to be purely of external origin (see Omurtag, Knight, and Sirovich, 2000, for the more general treatment). The population will be assumed homogeneous, so that each neuron is driven equally at the input rate $\sigma = \sigma^0(t)$. In the case of stochastic external input, $\sigma^0(t)$ is the ensemble mean arrival rate of external nerve impulses,
\[ \sigma^0(t) = \lim_{\Delta t \downarrow 0} \frac{1}{\Delta t} \left\langle \int_t^{t + \Delta t} dt' \sum_{n=1}^{\infty} \delta(t' - t_n^0) \right\rangle, \tag{2.13} \]
where $\delta(t)$ is the Dirac delta function, $\langle \cdot \rangle$ denotes an ensemble average, and $\{t_n^0\}_{n=1}^{\infty}$ is a set of spike arrival times. The population firing rate, $r(t)$, is determined by the rate at which cells cross the threshold $V_\theta$. This may be expressed as the voltage-direction flux at threshold integrated over all calcium channel activation states $h$,
\[ r(t) = \int_0^1 dh\, J_V(V_\theta, h, t) = \int_0^1 dh\, F_V(V_\theta, h)\, \rho(V_\theta, h, t) + \sigma^0(t) \int_0^1 \int_{V_\theta - \varepsilon}^{V_\theta} dh\, d\tilde{V}\, \rho(\tilde{V}, h, t). \tag{2.14} \]
2.4 Boundary Conditions. Boundary conditions on $\rho$ are chosen to conserve total probability over the phase-space domain $\mathcal{D}$,
\[ \int_{\mathcal{D}} \rho(\mathbf{v}, t)\, d\mathbf{v} = 1. \tag{2.15} \]
From equation 2.7 and an application of the divergence theorem, it follows that the boundary flux integrates to zero,
\[ \oint_{\partial \mathcal{D}} \hat{\mathbf{n}} \cdot \mathbf{J}\, dS = 0, \tag{2.16} \]
where $\hat{\mathbf{n}}$ is a boundary normal vector. Our domain of interest is the box $\mathcal{D} = \{V_L \le V \le V_\theta,\; 0 \le h \le 1\}$, for which it is appropriate to choose the no-flux boundary conditions
\[ \hat{\mathbf{n}} \cdot \mathbf{J} = 0 \quad \text{at } V = V_L,\; h = 0, 1, \tag{2.17} \]
so that the outward flux vanishes at each boundary face except at the threshold boundary $V = V_\theta$. To handle the voltage offset term, $\rho(V - \varepsilon, h, t)$, in the population equation 2.12, which requires evaluation of the density at points outside the box, it is natural to choose $\rho$ to vanish at all points outside $\mathcal{D}$, and we set
\[ \rho(\mathbf{v}, t) = 0, \quad V \notin \mathcal{D}. \tag{2.18} \]
In addition, there is a reset condition that reintroduces the flux at $V = V_\theta$ back into the domain at the reset $V = V_r$. This may be incorporated directly into the population equation 2.7 by means of a delta function,
\[ \frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial \mathbf{v}} \cdot \mathbf{J} + J_V(V_\theta, h, t)\, \delta(V - V_r). \tag{2.19} \]
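On a grid, the delta-function source of equation 2.19 translates into re-injecting the threshold flux at the reset column, row by row in $h$. A sketch (a hypothetical helper of our own; `rho` is assumed indexed as `[h, V]`):

```python
import numpy as np

def apply_threshold_reset(rho, flux_theta, i_reset, dV, dt):
    """Re-inject the threshold flux J_V(V_theta, h, t) at V = V_r (eq 2.19).
    flux_theta holds one flux value per h row; the reset is pointwise in h."""
    out = rho.copy()
    out[:, i_reset] += dt * flux_theta / dV   # density gained in the reset cell
    return out
```

The probability added in each $h$ row equals $dt$ times that row's threshold flux, so total probability is conserved once the matching outflow at $V_\theta$ is accounted for.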
It is worth noting that the threshold flux is reset pointwise in $h$. This is consistent with the slow temporal variation of the calcium channel, and we assume that $h$ does not change appreciably during the time between a cell firing a spike and its subsequent fast relaxation to reset. Under these conditions, with the total flux $\mathbf{J}$ suitably redefined to include the delta function source, probability is conserved and equation 2.16 is satisfied. Equation 2.19, with the boundary conditions 2.17 and 2.18, forms the complete mathematical specification of the model for a single population.

2.5 Diffusion Approximation. An approximation that lends itself to simpler analysis is the small-jump limit, $\varepsilon \to 0$. Upon expanding the density, $\rho$, in the excitation flux equation 2.11 through second-order terms, one obtains
\[ \rho(\tilde{V}, h, t) = \rho(V, h, t) + \delta V\, \frac{\partial \rho}{\partial V}(V, h, t) + O(\delta V^2) \qquad (\tilde{V} = V + \delta V) \tag{2.20} \]
\[ \hat{\mathbf{e}}_V \cdot \mathbf{J}_s = \sigma(t) \int_{-\varepsilon}^{0} d(\delta V) \left[ \rho(V, h, t) + \delta V\, \frac{\partial \rho}{\partial V}(V, h, t) + \cdots \right] \approx \sigma \left[ \varepsilon\, \rho(V, h, t) - \frac{\varepsilon^2}{2} \frac{\partial \rho}{\partial V}(V, h, t) \right]. \tag{2.21} \]
Substitution of equation 2.21 into 2.7 gives the diffusion approximation (the Fokker-Planck equation),
\[ \frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial \mathbf{v}} \cdot \left[ (\mathbf{F} + \sigma \varepsilon\, \hat{\mathbf{e}}_V)\, \rho \right] + \frac{\sigma \varepsilon^2}{2} \frac{\partial^2 \rho}{\partial V^2} + J_V(V_\theta, h, t)\, \delta(V - V_r), \tag{2.22} \]
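Holding $h$ fixed, the $V$-direction part of equation 2.22 is a drift-diffusion equation with drift $F_V + \sigma\varepsilon$ and diffusion coefficient $\sigma\varepsilon^2/2$. An illustrative explicit discretization of our own (interior points only; the reset flux and boundary conditions are omitted):

```python
import numpy as np

def fokker_planck_step(rho, F_V, sigma, eps, dV, dt):
    """One centered-difference step of the advection-diffusion part of eq 2.22
    in V, with h held fixed; stability needs sigma*eps**2*dt/(2*dV**2) < 1/2."""
    drift = F_V + sigma * eps               # total drift F_V + sigma*eps
    D = 0.5 * sigma * eps ** 2              # diffusion coefficient
    out = rho.copy()
    adv = (drift[2:] * rho[2:] - drift[:-2] * rho[:-2]) / (2.0 * dV)
    dif = (rho[2:] - 2.0 * rho[1:-1] + rho[:-2]) / dV ** 2
    out[1:-1] += dt * (-adv + D * dif)
    return out
```

For a density supported away from the grid edges and a constant drift, one step conserves probability and advances the mean voltage by exactly $(F_V + \sigma\varepsilon)\, dt$, matching the drift term of the continuum equation.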
Table 1: Model Parameters.

  $\tau_h^-$:    $2 \times 10^{-2}$ sec
  $\tau_h^+$:    $10^{-1}$ sec
  $g_L$:         $3.5 \times 10^{-2}$ mS/cm$^2$
  $g_T$:         $7 \times 10^{-2}$ mS/cm$^2$
  $V_T$:         $120$ mV
  $V_L$:         $-65$ mV
  $V_\theta$:    $-35$ mV
  $V_r$:         $-50$ mV
  $V_h$:         $-60$ mV
  $C$:           $2\ \mu$F/cm$^2$
where the approximate threshold flux is
\[ J_V(V_\theta, h, t) = \left[ F_V(V_\theta, h) + \sigma \varepsilon \right] \rho(V_\theta, h, t) - \frac{\sigma \varepsilon^2}{2} \frac{\partial \rho}{\partial V}(V_\theta, h, t). \tag{2.23} \]
The diffusion approximation is also useful for comparison with the finite-jump model, and this will be done in section 3.2 (also see Sirovich et al., 2000). A further simplification of the diffusion approximation is valid in the $\varepsilon \to 0$ limit with $\sigma \varepsilon$ finite. This special limit reduces equation 2.22 to a pure advection equation for which exact solutions can be derived (see section 3.1). Physically, this approximation is appropriate when input spike rates are extremely large and the evoked postsynaptic potentials are small. This would be the case, for example, when DC input overwhelms stochastic fluctuations.

3 Results

In Table 1 we show the parameter values used in our simulations, following Smith et al. (2000), whose choices are based on experimental data from cat LGN cells. In particular, when the cell is in burst mode, the parameter subspace near these values produces 2 to 10 spikes per burst. We tested the population simulation with a variety of external inputs. In all cases, we compared the accuracy against a direct numerical simulation (DNS), in which equations 2.1 and 2.5 were numerically integrated for a discrete network of $10^4$ neurons, each driven by Poisson-distributed action potentials involving the same voltage elevation, $\varepsilon$, and the same mean arrival rate, $\sigma^0$, as in the population simulation. As explained in the appendix, a population simulation is sensitive to finite grid effects that do not affect the DNS, and for this reason we treat the DNS as the standard for comparison. As in Omurtag, Knight, and Sirovich (2000), the two approaches converge for a sufficiently fine mesh. For instance, as seen in Figure 6, the firing-rate response curve to a step input current generated by the DNS overshoots the converged mean firing rate of
the population simulation by about 20%. These simulations took roughly the same amount of computation time, but we note that as the number of neurons of the DNS increases, the population simulation becomes far more computationally efficient. Since the fluctuations in the DNS about the true solution scale inversely with the square root of the number of neurons, one would need four times as many neurons to reduce the error by half. For an uncoupled population, the computation time would then increase by a factor of about 4. The increase in computation time for the DNS is even greater when the neurons are coupled, whereas the population simulation demands no extra computation time.

3.1 An Exactly Solvable Case. To compare the results of the population simulation with an exactly solvable problem, we first considered the case of constant driving, $\sigma^0 \varepsilon = I/C$, in the diffusion approximation 2.22 of the population equation. We further suppose that the diffusive term, $\frac{\sigma \varepsilon^2}{2} \frac{\partial^2 \rho}{\partial V^2}$, is negligible, a justifiable simplification if the voltage increment $\varepsilon$ arising from a random spike input is small but $\sigma^0 \varepsilon$ is finite. This case, which may also be viewed as a population model for which the external driving is noiseless, then reduces to the pure advection equation,
\[ \frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial V} \left[ \left( F_V + \frac{I}{C} \right) \rho \right] - \frac{\partial}{\partial h} (F_h\, \rho) + \delta(V - V_r)\, J_V(V_\theta, h, t). \tag{3.1} \]
Here $\mathbf{F} = (F_V, F_h)$ is defined by equation 2.10. This equation can be solved exactly by the method of characteristics. In the absence of diffusive smearing of the density field by stochastic effects, the population density $\rho$ traces the single-neuron orbit defined by equations 2.1 and 2.5. Upon dividing these equations, we obtain the characteristic trace equation,
\[ \frac{dV}{dh} = \frac{F_V + I/C}{F_h}. \tag{3.2} \]
Integrating equation 3.2 in the Ca$^{2+}$-inactivated region ($V > V_h$) gives
\[ V = \left( \frac{h}{h_0} \right)^{\gamma_L} e^{\gamma_T (h - h_0)}\, V_0 + h^{\gamma_L} e^{\gamma_T h}\, \gamma_T^{\gamma_L} \left[ \frac{\tau_h^- (I + g_L V_L)}{C} \big( \Gamma(-\gamma_L, \gamma_T h) - \Gamma(-\gamma_L, \gamma_T h_0) \big) + V_T \big( \Gamma(1 - \gamma_L, \gamma_T h) - \Gamma(1 - \gamma_L, \gamma_T h_0) \big) \right], \tag{3.3} \]
where $\gamma_T = g_T \tau_h^- / C$, $\gamma_L = g_L \tau_h^- / C$, $\Gamma(a, z) = \int_z^{\infty} t^{a-1} e^{-t}\, dt$ is the incomplete gamma function (Abramowitz & Stegun, 1972), $h_0$ defines the starting point of a
Figure 2: Characteristic traces (see equations 3.3 and 3.4) with an input current $I = 0.05\ \mu\mathrm{A/cm^2}$, which leads to a burst of four spikes. The asterisk in the upper left corner at $(V_{eq}, h_{eq}) = (-63.6, 1)$ indicates the fixed point to which the neuron will eventually settle. The reset condition has been added manually.
particular characteristic curve, and $V_0 = V(h_0)$. In the Ca$^{2+}$-deinactivated region, the dynamics of $V$ and $h$ are uncoupled, and we find
\[ V = V_L + \frac{I}{g_L} + \left( V_0 - V_L - \frac{I}{g_L} \right) \left( \frac{h - 1}{h_0 - 1} \right)^{g_L \tau_h^+ / C}. \tag{3.4} \]
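Equation 3.4 is easy to evaluate directly; the helper below (our naming, with Table 1 defaults) also recovers the fixed point quoted in Figure 2, since the trace approaches $V_L + I/g_L$ as $h \to 1$.

```python
import numpy as np

def trace_deinactivated(h, V0, h0, I, g_L=0.035, V_L=-65.0,
                        tau_plus=100.0, C=2.0):
    """Characteristic curve of eq 3.4, valid in the deinactivated region V < V_h."""
    V_inf = V_L + I / g_L     # voltage approached as h -> 1
    return V_inf + (V0 - V_inf) * ((h - 1.0) / (h0 - 1.0)) ** (g_L * tau_plus / C)
```

For $I = 0.05$, the limiting value is $-65 + 0.05/0.035 \approx -63.6$ mV, in agreement with the equilibrium point marked in Figure 2.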
Characteristic lines for the case of a calcium-driven burst event, using the initial point $(V_0, h_0) = (V_h, 1)$, are shown in Figure 2. Because the input is nonstochastic and subthreshold, a neuron driven with this small current ($I = 0.05\ \mu\mathrm{A/cm^2}$) equilibrates at the fixed point $(V_{eq}, h_{eq}) = (-63.6, 1)$ (marked by the asterisk). Before the equilibrium is reached, a burst of four spikes, preceded by a low-threshold calcium spike, is fired before the calcium channels fully deinactivate. Stochastic input, explored in section 3.2, has the effect of increasing the average firing rate of a population of IFB cells due to additional depolarizing input from random, excitatory spike inputs.
However, the number of spikes per burst event generally remains the same because the IFB dynamics are dominated by the calcium current when $V > V_h$ and $h > 0$ (see Figure 5 for comparison). If the input current is large, each neuron fires in a classic integrate-and-fire (Omurtag, Knight, and Sirovich, 2000) tonic mode and the calcium channel equilibrates to the inactivated state $h = 0$. In this case, the equilibrium potential, $V_{eq}$, of the average cell is given by setting $dV/dt = 0$ in equation 2.1, with the result
\[ V_{eq} = \frac{I}{g_L} + V_L. \tag{3.5} \]
The population is driven past threshold when $I > I_{crit} = g_L (V_\theta - V_L)$, in which case each neuron of the population fires a periodic train of spikes with a time-averaged firing rate that eventually equilibrates to a constant value $\bar{r}$,
\[ \bar{r} = \frac{1}{T} \int_{t-T}^{t} r(s)\, ds, \tag{3.6} \]
with $r(t)$ given by
\[ r(t) = \int_0^1 dh\, F_V(V_\theta, h)\, \rho(V_\theta, h, t). \tag{3.7} \]
This time-averaged firing rate $\bar{r}$ is independent of the reference time, $t$, provided $t$ is chosen large enough for each member of the population to have equilibrated to its limit cycle. The interval $T$ is chosen large enough to encompass many traversals of a given cell from its reset potential through the threshold. To test the accuracy of the population simulation, we compared the time-averaged population firing rate, $\bar{r}$, with the single-neuron firing rate, $f(I)$, which is obtained from equation 2.1 upon integrating through the interval $[V_r, V_\theta]$:
\[ f(I) = \left[ \frac{C}{g_L} \ln \left( \frac{V_r - V_L - I/g_L}{V_\theta - V_L - I/g_L} \right) \right]^{-1}. \tag{3.8} \]
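Equation 3.8, together with the critical current $I_{crit} = g_L(V_\theta - V_L)$, gives the tonic rate directly. A sketch with Table 1 defaults (our helper; times are in ms, so the returned rate is in kHz):

```python
import numpy as np

def f_rate(I, g_L=0.035, C=2.0, V_L=-65.0, V_r=-50.0, V_theta=-35.0):
    """Single-neuron tonic firing rate of eq 3.8 (kHz), ignoring the T-current."""
    I_crit = g_L * (V_theta - V_L)   # 1.05 uA/cm^2 for the Table 1 values
    if I <= I_crit:
        return 0.0                   # subthreshold: no deterministic firing
    T = (C / g_L) * np.log((V_r - V_L - I / g_L) / (V_theta - V_L - I / g_L))
    return 1.0 / T
```

For $I = 1.33\ \mu\mathrm{A/cm^2}$ (the poststep current of section 3.3.2) this gives roughly 17 Hz, within the range plotted in Figure 3.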
In Figure 3 we plot the exact result (see equation 3.8) and the simulation results, and see that they are in excellent agreement.

3.2 Constant Stochastic Input. Irregularity in the arrival times of the external driving introduces stochasticity into the population evolution, which is modeled either as diffusion (see equation 2.22) or as finite jumps in the membrane voltage (see equation 2.12). In either case, there is a finite chance that a cell will be driven through the threshold, $V_\theta$, even if the mean current
Figure 3: Comparison of the exact analytical firing rate (see equation 3.8) (solid line) for a neuron driven by a nonstochastic current with the population firing rate obtained from simulation (asterisks). The population firing rate of the finite-jump model with stochastic driving ($\varepsilon = 0.5$ mV) at an equivalent Poisson rate is included for comparison (circles). Horizontal axis: applied current $I$ ($\mu$A/cm$^2$). Vertical axis: firing rate $r(t)$ (Hz).
is not strong enough to push the average cell through threshold. Thus, we expect to see higher firing rates compared to the case of purely deterministic driving (see section 3.1). The population results are in accord with this expectation, as shown in Figure 3. A notable feature of stochastic input is a nonvanishing firing rate for driving currents below the threshold. In Figure 3, this effect appears in the equilibrium firing-rate curve as a bump peaked at $I = 0.175\ \mu\mathrm{A/cm^2}$ for the parameter values stated in Table 1. This bump is a consequence of low-threshold calcium spiking events. If the cell's resting potential lies near the calcium channel activation threshold, $V_{eq} \approx V_h$, which occurs if the input rate satisfies $\sigma^0 \approx g_L (V_h - V_L)/(C\varepsilon)$ ($\sigma^0 \approx 0.0875$ ms$^{-1}$), then random walks in voltage, in cooperation with activated calcium currents, occasionally drive neurons
Figure 4: Comparison of the time-independent equilibrium distributions for a population firing in tonic mode. The figure is plotted in the $h = 0$ plane. Solid line: finite-jump numerical solution. Dash-dot line: diffusion approximation. The numerically generated distribution, equation 2.22, and the analytical solution, equation 3.9, for the diffusion approximation are imperceptibly different. Parameters: $\sigma^0 = 0.5$ ms$^{-1}$, $\varepsilon = 1.5$ mV, 150 grid points in $V$ and $h$ (all other parameters given by Table 1).
through the firing threshold $V_\theta$. If the average resting membrane potential lies too far above $V_h$ but still well below the firing threshold, then the calcium currents are rarely deinactivated for a sufficient duration to trigger the low-threshold calcium spike. With a fixed Poisson arrival rate, the population always achieves a time-independent equilibrium whose characteristic features hinge on whether the population is firing in tonic or burst mode. The tonic mode for any individual LGN cell is typified by an uninterrupted sequence of independently generated spikes, all occurring in the calcium-inactivated state (McCormick & Feeser, 1990). The equilibrium profile for a tonic-spiking population thus lies in a plane cutting through $h = 0$ in the two-dimensional phase space (see Figure 4). By contrast, a cell in burst mode fires clusters of calcium-triggered
Figure 5: Density plot of the numerically generated equilibrium solution for a population of continuously bursting cells ($\log \rho$ is plotted in the $V$-$h$ plane with the color scale indicated on the right). Parameters: $50 \times 50$ grid resolution, $\varepsilon = 1$ mV, $\sigma^0 = 0.025$ ms$^{-1}$, and all other parameters as in Table 1. The more jagged features of the density distribution are numerical artifacts owing to the modest resolution.
spikes followed by a refractory period on the order of 100 ms. The burst cycle repeats provided that any depolarizing input is small enough to allow the cell to rehyperpolarize below the calcium deinactivation threshold potential. Consequently, the density profile for a repetitively bursting population is spread throughout the phase space (see Figure 5).
3.2.1 Tonic Spiking. Since tonic-spiking cells have inactivated calcium currents ($h = 0$), we may obtain an analytical expression for the equilibrium solution by solving the time-independent population equation in the absence of $h$-dependent dynamics. For ease of comparison with the numerical results, we focus on the more analytically tractable diffusion approximation,
equation 2.22, which has the equilibrium solution
\[ \rho_{eq}(V) = \frac{2 C J_\theta}{I \varepsilon}\, e^{b V - \frac{a}{2} V^2} \int_V^{V_\theta} e^{-b s + \frac{a}{2} s^2}\, H(s - V_r)\, ds, \tag{3.9} \]
where $a = 2 g_L / (I \varepsilon)$, $b = \frac{2}{\varepsilon} \left( 1 + \frac{g_L V_L}{I} \right)$, and $H(V - V_r)$ is the Heaviside function. The equilibrium firing rate, $J_\theta = J_V(V_\theta, 0, t)$, is determined self-consistently from the normalization condition, equation 2.15. The exact solution, equation 3.9, is virtually identical to the simulation result. It is also seen that the diffusion approximation, when compared to the finite-jump case, has the effect of smoothing the population distribution. The displacement of the peak toward a lower voltage occurs because the diffusion approximation, obtained by a truncated Taylor series of the density $\rho$, does not correctly capture the boundary layer near $V = V_\theta$ unless the input current is close to the threshold for tonic spiking. This issue, and others related to the equilibrium profile for a population of integrate-and-fire neurons, has been explored more extensively in Sirovich et al. (2000).
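The profile of equation 3.9 can be evaluated by simple quadrature up to the overall constant $2CJ_\theta/(I\varepsilon)$, which normalization then fixes. The sketch below is our own helper, exercised only near the Table 1 parameter values, since the exponents can grow large elsewhere.

```python
import numpy as np

def rho_eq_unnormalized(V, I=1.2, eps=1.0, g_L=0.035, V_L=-65.0,
                        V_r=-50.0, V_theta=-35.0):
    """Equilibrium profile of eq 3.9 up to the factor 2*C*J_theta/(I*eps)."""
    a = 2.0 * g_L / (I * eps)
    b = (2.0 / eps) * (1.0 + g_L * V_L / I)
    s = np.linspace(V, V_theta, 2001)          # quadrature nodes on [V, V_theta]
    ds = (V_theta - V) / (s.size - 1)
    integrand = np.exp(-b * s + 0.5 * a * s ** 2) * (s >= V_r)
    integral = ds * (0.5 * (integrand[0] + integrand[-1]) + integrand[1:-1].sum())
    return np.exp(b * V - 0.5 * a * V ** 2) * integral
```

The profile is positive below threshold and vanishes at $V = V_\theta$, reproducing the qualitative shape of the tonic-mode distribution in Figure 4 once normalized.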
3.2.2 Burst Firing. For the population to exhibit continuous bursting under statistically steady input, the driving must be small enough that the average neuron spends most of its time sufficiently hyperpolarized below (or near) the Ca$^{2+}$ activation threshold at $V = V_h$. This situation is achieved when the input spike rate satisfies $\sigma^0 < g_L (V_h - V_L)/(C\varepsilon)$. The representative features of a population in burst mode are illustrated in Figure 5, which shows a numerical simulation of the population equation 2.19 driven by random, finite voltage jumps. Most of the neuron density equilibrates near the fixed point $(V_{eq}, h_{eq}) = (-63.6, 1)$ of the single-neuron system under the driving condition of the same average current, but steady instead of in jumps. Thus, in Figure 5, one sees the majority of neurons residing in the upper-leftmost portion of the phase space, with deinactivated calcium currents ($h = 1$) but not quite enough input to push them forward through the activation threshold very often. However, because the input is noisy, some cells may randomly walk past $V_h$. This is reflected in the equilibrium profile by the faint stripes in the right half of the phase space (compare with the single-neuron orbit, driven by DC input, of Figure 2). These characteristic stripes indicate bursting cells with four spikes per burst. As Figure 5 shows, a significant number of cells, after going through a burst cycle, also temporarily get stuck in a calcium-inactivated state ($h = 0$) near the activation threshold $V_h$, again owing to stochastic drift.

3.3 Stepped Input Current. An experiment that probes the dynamical behavior of a neuron population involves the stimulation of cells by a step input from one contrast level to another (Mukherjee & Kaplan, 1995). Such experiments give insight into the approach to equilibrium and the degree of
Figure 6: Firing-rate comparison of the direct simulation (solid jagged curve) with population dynamics simulations (dashed and dotted curves) at various grid resolutions in $h$ for a mean current step from $I = 0.4$ to $I = 1.2\ \mu\mathrm{A/cm^2}$. The step increase began at $t = 200$ ms and terminated at $t = 400$ ms.
nonlinear response under various changes in contrast. For LGN neurons, it is of particular interest to understand the connection between the stimulus, or the absence of one, and the extent to which visual input is faithfully tracked by relay cell activity and sent forth to the cortex. In this section, we examine the firing activity one might observe in such an experiment for a population of LGN cells. The input is stochastic, with a mean driving rate $\sigma^0$ that steps from one constant value to another. The population dynamics for this sort of input to integrate-and-fire cells operating in tonic mode has been explored in Knight et al. (2000).

3.3.1 Current Step Between Two Calcium-Inactivated States. Figure 6 compares the results of a direct numerical simulation with the IFB population model for a mean current step from $I = 0.4$ to $I = 1.2\ \mu\mathrm{A/cm^2}$ (the mean input rate steps from $\sigma^0 = 0.2$ ms$^{-1}$ to $\sigma^0 = 0.6$ ms$^{-1}$). In this case, the prestep
equilibrium potential of the average neuron is $V_{eq} \approx -53.6$ mV, which lies several millivolts above the threshold $V_h$ at which the calcium current is deinactivated. Once the current jumps to $1.2\ \mu\mathrm{A/cm^2}$, the neurons fire in tonic mode since the associated equilibrium $V_{eq} \approx -30.7$ mV is well above the sodium spiking threshold, $V_\theta = -35$ mV. The response to a range of step inputs that does not promote burst firing is primarily linear. Upon activation of the step at $t = 200$ ms, and after a delay of about 30 ms while the membrane potential moves toward threshold, the firing rate $r(t)$ mimics the input with a step response, aside from a minor overshoot at the peak firing rate. The linearity of the input-output relationship of the tonic mode was verified by a spectral analysis; the transfer function was approximately constant, with only slight deviations at lower frequencies. Further simulations at various input levels revealed that although the input-output relation was not exactly linear, it was far more so than when the population fired in burst mode (as discussed below). The three curves generated by the population simulation correspond to varying grid resolution in the calcium coordinate $h$. The population model and the DNS achieve increasing agreement as the resolution of the population simulation is increased. The reason that the equilibrated firing rate of the population model lies slightly above the mean of the DNS is attributable to finite grid effects, which are further discussed in the appendix.
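A spectral check of this kind can be reproduced with a short post-processing routine. The following is an illustrative sketch, not the authors' analysis code: it estimates the transfer function as the ratio of response to stimulus spectra at the frequencies actually driven by the stimulus.

```python
import numpy as np

def transfer_function(stimulus, response, dt):
    """Ratio of response to stimulus Fourier amplitudes at the frequencies
    present in the stimulus (bins with negligible power are masked out)."""
    S = np.fft.rfft(stimulus - stimulus.mean())
    R = np.fft.rfft(response - response.mean())
    freqs = np.fft.rfftfreq(stimulus.size, dt)
    mask = np.abs(S) > 1e-9 * np.abs(S).max()   # keep only driven bins
    return freqs[mask], R[mask] / S[mask]
```

An approximately constant transfer function across frequencies and input levels, as in the tonic regime, indicates faithful tracking of the input; the burst regime yields ratios that change with the input level.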
3.3.2 Current Step from a Calcium-Deinactivated State to a Calcium-Inactivated State. In Figure 7 is a comparison of the firing rates of the population model at various resolutions with the DNS for a current step from $I = 0.1$ to $I = 1.33\ \mu\mathrm{A/cm^2}$. These current values correspond to an equilibrium potential before the step of $V_{eq} \approx -62.1$ mV (in the absence of Ca$^{2+}$ dynamics), so the calcium channels are initially deinactivated, and a poststep equilibrium potential of $V_{eq} = -27$ mV, which is well above the firing threshold. Compared to the previous case, one expects much higher firing rates at the onset of the current step because a large fraction of cells are poised to burst. This is indeed reflected in the sharper firing-rate peak at the current jump, relative to the equilibrium firing rate to which the population relaxes (compare with Figure 6). Consistent with physiological experiments and analysis of temporal modulation transfer functions (Mukherjee & Kaplan, 1995), the bursting LGN cells in the IFB population nonlinearly modify retinal input and pass to the visual cortex a significantly altered rendition of the stimulus. Numerical simulations at different input levels, all corresponding to the burst mode, verified that the transfer function was not constant in each case. This behavior of the population activity reflects that of the single-neuron IFB model, which Smith et al. (2000) have explored quantitatively for sinusoidal input; we refer to their work for the details.
Figure 7: Firing-rate comparison of the direct simulation (solid jagged curve) with the IFB population model (dashed and dotted curves) at various grid resolutions in $V$ for a current step from $I = 0.1$ to $I = 1.33\ \mu\mathrm{A/cm^2}$. The onset of the step occurred at $t = 200$ ms.
It is interesting to note that the agreement between the DNS and the population simulation is less favorable for coarse grid resolutions (50 grid points in both $V$ and $h$) than in Figure 6. This is a consequence of finite grid influences that cause some of the population to drift spuriously through the calcium activation transition at $V_h$, an effect that is much more manifest whenever the equilibrium potential of a typical neuron in the population initially lies near $V_h$. This issue is elaborated on in the appendix. Figure 8 demonstrates the increase in accuracy when the prestep voltage equilibrium is well removed from $V_h$. Here, the input was a current step from $I = 0$ to $I = 1.33\ \mu\mathrm{A/cm^2}$, corresponding to a prestep equilibrium point $(V_{eq}, h_{eq}) = (V_L, 1)$ with $I = 0$. The neurons initially equilibrate in the upper-left corner of the phase space, far enough away from $V_h$ to preclude significant finite-grid-influenced drift through the Ca$^{2+}$ activation threshold. As in Figure 6, the agreement between the direct simulation and the population result is excellent at a moderate resolution.
Figure 8: Comparison of firing rates for a step current from $I = 0$ to $1.33\ \mu\mathrm{A/cm^2}$. Solid curve: direct simulation with $\varepsilon = 1$ mV. Dashed curve: population dynamics simulation on a $200 \times 100$ grid.
4 Discussion

To capture the dynamical range of LGN cells that may fire in a burst or tonic mode, a neuron model of two or more state variables is required, since at least two fundamental timescales comprise the intrinsic dynamics of the burst mode: the relatively long interval between bursts (around 100 ms) and the short interspike interval (about 4 ms) of the sodium action potentials that ride the low-threshold calcium spike. Using the single-cell IFB model of Smith et al. (2000) as a springboard, this work presents the first simulation of a population equation with a two-dimensional phase space. Most previous studies using the population method focused on the single state-space-variable integrate-and-fire model (Knight, 1972; Knight et al., 1996, 2000; Nykamp & Tranchina, 2000; Omurtag, Knight, and Sirovich, 2000; Sirovich et al., 2000). Here, the computationally efficient simulations of an LGN population under a variety of stimulus conditions were seen to be in excellent agreement with direct numerical simulations of an analogous discrete network (see Figures 3, 6, 7, and 8), as well as with special analytical solutions (see Figures 3 and 4). Although the role of the intrinsic variability of LGN cells in visual processing is still unknown, a large body of experimental evidence indicates that the dual response mode (burst or tonic) has a significant effect on the faithfulness with which a retinal stimulus is transmitted to cortex. This fact may be related to attentional demands. In alert animals, the burst mode, being a more nonlinear response, could serve the purpose of signaling sudden changes in stimulus conditions (Guido & Weyand, 1995; Sherman, 1996). The tonic mode, a nearly linear response mode, presumably takes over when the cortex demands transmission of the details. The population model of IFB neurons mirrors these qualitative features of the dual response modes. For a population initially hyperpolarized below $V_h$, where calcium channels are deinactivated, the response to a step input was large and nonlinear (see Figures 7 and 8). When the population was initialized with inactivated calcium currents at membrane potentials above $V_h$, the firing-rate response tracked the stimulus much more faithfully, indicating a primarily linear response (which a spectral analysis verifies explicitly). Further, an initially hyperpolarized IFB population driven by a large current ($I > 1.05\ \mu\mathrm{A/cm^2}$ for the parameters of Table 1) will fire a burst of calcium-activated spikes and then settle into a tonic firing mode. This is consistent with the experiment of Guido and Weyand (1995), in which relay cells of an awake cat fired in burst mode at stimulus onset during the early fixation phase and then switched to a tonic firing pattern thereafter. The IFB model, being of low dimension, thus shows great promise as an efficient means of simulating realistic LGN activity in models of the visual pathway. Previous simulations of the early stages of visual processing (retina $\to$ LGN $\to$
V1) typically did not incorporate LGN dynamics (however, see Tiesinga & JosÂe, 2000). Typically, the LGN input used is a convolved version of the retinal stimulus, which is then relayed to the cortex (McLaughlin, Shapley, Shelley, & Wielaard, 2000; Nykamp & Tranchina, 2000; Omurtag, Kaplan, Knight, & Sirovich, 2000; Somers, Nelson, & Sur, 1995). In such simulations, no account is made of the intrinsic variability of the LGN cells or the effects of feedback from the cortex or other areas. Because the convolution of retinal input with a lter is a linear operation, most models effectively simulate LGN cells in their tonic mode. Yet the burst response mode of relay cells is certainly an important feature of LGN cells whatever the arousal level of the animal in question. Although this may be of less concern in feedforward models used to study orientation tuning in V1, for instance, it is a necessary consideration of any model of cortical activity when stimulus conditions promote strong hyperpolarization for signicant durations, which may arise realistically from variable levels of alertness, or for simulations in which the stimulus is weak.
A dynamically faithful model of relay cell activity, dictated by experiment, is likely needed to assess the role of the massive feedback connections of the cortico-thalamic pathway. There is some evidence that feedback from a layer 6 neuron in V1 to the LGN may play a role in orientation tuning by synchronizing the spiking of relay cells within its receptive field (Sillito, Jones, Gerstein, & West, 1994). This suggests the possibility that cortical cells provide very specific afferent connections to reinforce the activity of the LGN neurons that excited them in the first place. Anatomical support for this has been provided by the experiments of Murphy, Duckett, and Sillito (1999), who observed that the corticofugal axons, though sparse, exhibit localized clustering of their boutons into elongated anatomical "hot spots" that synapse upon a relatively large number of target cells in the LGN and reticular nucleus. They also demonstrated a high correlation between the major axis of the elongated array of boutons and the orientation preference of the cortical cells from which they originated. A plausible conjecture is that the cortico-thalamic feedback serves the purpose of enhancing the response to salient features such as edge orientations in the retinal input. If so, then a more dynamically realistic LGN model than those used to date is called for. In any event, the relative importance of feedback to the LGN, as opposed to intracortical connectivity and feedforward convergence, say, in the tuning of cortical cells to various modalities such as orientation and spatial frequency, is an important issue. Future work with the population method is underway to simulate a simplified version of the thalamocortical loop, with realistic dynamical models for the LGN, layers 4 and 6 of the primary visual cortex, and the inhibitory interneurons of the thalamic reticular nucleus.
The aim will be to study the functional nature of the circuitry that connects the various levels of the early visual pathway and to investigate, in particular, the role that feedback plays in visual pattern analysis.

Appendix: Numerical Methods

A.1 Direct Simulation. The state variable v = (V, h) for each cell in the network is governed by the ordinary differential equation (ODE)

dv/dt = F(v) + ε ê_V Σ_k δ(t − t_k),   (A.1)

where ê_V is the unit vector in the V direction and F(v) is defined by equation 2.10. The solution that corresponds to the streaming motion alone, dv/dt = F(v) ≡ (F_V(v, h), F_h(v, h)), is formally denoted by the time evolution operator e^{Q⁽¹⁾(t)}:

v(t) = e^{Q⁽¹⁾(t)} v(0).   (A.2)
The solution that corresponds to the synaptic input alone can similarly be written as

v(t) = e^{Q⁽²⁾(t)} v(0).   (A.3)

Equation A.3 has a simple, explicit form for the case of fixed finite jumps:

e^{Q⁽²⁾(t)} v(0) = v(0) + ε n(t) ê_V.   (A.4)
Here, n(t) is the number of impulses that have arrived during time t. It is an integer chosen randomly from the Poisson distribution, P_n(t) = (σt)ⁿ e^{−σt}/n!, with mean arrival rate σ. We are interested in solutions of the form

v(t) = e^{(Q⁽¹⁾(t) + Q⁽²⁾(t))} v(0).   (A.5)

According to the Baker-Campbell-Hausdorff lemma (Sakurai, 1994), a second-order accurate splitting of the exponential operator, equation A.5, is

v(t + Δt) = e^{Q⁽¹⁾(Δt/2)} e^{Q⁽²⁾(Δt)} e^{Q⁽¹⁾(Δt/2)} v(t) + O(Δt³).   (A.6)
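As a concrete illustration, the split-step update of equation A.6 can be sketched as follows. The direction field below is a hypothetical stand-in for the IFB field of equation 2.10 (leaky drift in V, Heaviside-gated recovery in h), and all parameter values are illustrative rather than those of Table 1; the discontinuity handling described next is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical IFB-like parameters: leak reversal, Ca activation threshold,
# time constants, jump size epsilon, and Poisson arrival rate sigma.
V_L, V_h, tau_V, tau_h, eps, sigma = -65.0, -60.0, 10.0, 100.0, 1.0, 0.5

def F(v):
    """Stand-in streaming field: leaky drift in V, gated recovery in h."""
    V, h = v
    dV = (V_L - V) / tau_V
    dh = (1.0 - h) / tau_h if V < V_h else -h / tau_h  # deinactivation below V_h
    return np.array([dV, dh])

def stream(v, dt):
    """Second-order Runge-Kutta (midpoint) step for dv/dt = F(v)."""
    return v + dt * F(v + 0.5 * dt * F(v))

def split_step(v, dt):
    """One step of the splitting e^{Q1(dt/2)} e^{Q2(dt)} e^{Q1(dt/2)}."""
    v = stream(v, 0.5 * dt)
    n = rng.poisson(sigma * dt)          # number of impulses, cf. eq. A.4
    v = v + np.array([eps * n, 0.0])     # fixed finite jumps in V only
    return stream(v, 0.5 * dt)

v = np.array([V_L, 1.0])
for _ in range(1000):                    # integrate 100 ms at dt = 0.1 ms
    v = split_step(v, 0.1)
```

The Poisson draw in the middle sub-step is exactly the explicit form of the synaptic evolution operator in equation A.4.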
We use a second-order Runge-Kutta method for the streaming operator e^{Q⁽¹⁾}. Thus, the above operator-splitting method provides an efficient second-order accurate numerical scheme for integrating equation A.1. However, the streaming direction field of the IFB neuron contains discontinuous changes in the voltage variable V due to the Heaviside calcium-channel activation function, equation 2.4, and the pointwise (in h) reset condition. We now describe how we handle these discontinuities numerically with the same second-order accuracy. Suppose the discontinuity occurs when v = v_d. After a time step Δt, if v_d is found to lie between v(t) and v(t + Δt), then the integration is performed from the current state up to the discontinuity. Let the state at time t be denoted (v, h). Define v̄ = ½(v_d + v), and let Δt̄ denote the time it takes to move from the present state (v, h) to the state (v̄, h̄), which lies between the present state and the point where the trajectory crosses the discontinuity, (v_d, h*). The time to reach the discontinuity is Δt*. By Taylor expanding the direction field at (v, h) we find

Δt̄ = (v̄ − v) / F_V(v, h) + O(Δt²).   (A.7)

We remark that F_V(v, h) = 0 almost never occurs in the phase plane for the cases of interest. The solution for the calcium channel ODE, equation 2.5, is then written

h̄ = e^{Q_h⁽²⁾(Δt̄)} h = e^{Q_h⁽²⁾((v_d − v)/(2 F_V(v, h)))} h + O(Δt²).   (A.8)
Next, after the direction field is Taylor expanded about (v̄, h̄), one can show that

v_d = v + F_V(v̄, h̄) Δt* + O(Δt³),   (A.9)

whence

Δt* = (v_d − v) / F_V(½(v_d + v), e^{Q_h⁽²⁾((v_d − v)/(2 F_V(v, h)))} h) + O(Δt³).   (A.10)

Consequently, the point where the trajectory crosses the discontinuity is found to be

(v_d, h*) = (v_d, e^{Q_h⁽²⁾(Δt*)} h + O(Δt³)).   (A.11)
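A minimal numerical sketch of this crossing-time estimate, using a scalar linear field for which the exact crossing time is known in closed form; the field and values are hypothetical, chosen only to exercise the midpoint formula with the h-dependence frozen:

```python
import math

# Scalar stand-in for the streaming field F_V: linear relaxation toward V_L.
V_L, tau = 0.0, 10.0
def F_V(V):
    return (V_L - V) / tau

V0, V_d = -10.0, -5.0                 # start below the "discontinuity" at V_d

# Evaluate the field at the midpoint v_bar = (V_d + V0)/2 (cf. eq. A.7) and
# estimate the full crossing time as in eq. A.10 with h held fixed.
v_bar = 0.5 * (V_d + V0)
dt_star = (V_d - V0) / F_V(v_bar)

# Exact crossing time for the linear field, for comparison.
exact = tau * math.log((V0 - V_L) / (V_d - V_L))
```

For this (large) step the midpoint estimate lands within a few percent of the exact crossing time; for the small Δt of an actual integration step, the error is O(Δt³) as stated in equation A.9.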
A.2 Population Simulation. Equation 2.7 is linear in ρ. However, due to the boundary conditions in the V-direction (population exiting the phase space at threshold resurfaces in the middle of the grid) and the tendency of the population to pile up at h = 0 and h = 1, it is necessary to use relatively sophisticated methods to integrate the equations. In particular, if we simply discretize the grid and expand the derivative terms to second order, oscillations due to the discretization become amplified when the population piles up at the h boundaries and population flux is reintroduced into the grid by the reset boundary condition. The oscillations then cause the population density to take on negative values in regions of the phase space. For this reason, we employed a second-order total-variation-diminishing (TVD) scheme. The undesirable oscillations in finite-difference-based schemes used to evolve advection equations may be overcome by the TVD algorithm (Hirsch, 1992). The scheme that we use for the advective term in the population equation 2.12 is a second-order upwind method, for which we describe here the one-dimensional version. The evolution of the conserved variable, ρ, is governed by

dρ_i/dt = −(1/Δx) δ⁻[ f*⁽ᴿ⁾_{i+1/2} + ½ ψ⁺_{i−1/2} (f_i − f*⁽ᴿ⁾_{i−1/2}) + ½ ψ⁻_{i+3/2} (f_{i+1} − f*⁽ᴿ⁾_{i+3/2}) ],   (A.12)

where an i subscript indicates the ith grid zone, f_i = (Fρ)_i is a numerical approximation of the streaming flux (where F is a component of the direction field), and the difference operator δ⁻[u_i] = u_i − u_{i−1}. The parameters ψ⁺ and ψ⁻ are flux limiters that are dynamically adjusted to control spurious oscillations arising from large streaming flux gradients, which in our case occur at lines of discontinuity (V = V_r and V = V_h) and along lines at which the population tends to pile up (h = 0 and h = 1). The flux limiter used was Roe's Superbee limiter, which is defined by

ψ⁺_{i−1/2} = ψ[(f_{i+1} − f*⁽ᴿ⁾_{i+1/2}) / (f_i − f*⁽ᴿ⁾_{i−1/2})],   (A.13)

ψ⁻_{i+3/2} = ψ[(f_i − f*⁽ᴿ⁾_{i+1/2}) / (f_{i+1} − f*⁽ᴿ⁾_{i+3/2})],   (A.14)

ψ[r] = max[0, min(2r, 1), min(r, 2)],   (A.15)

where the first-order Roe flux is defined as

f*⁽ᴿ⁾_{i+1/2} = ½(f_i + f_{i+1}) − ½(ρ_{i+1} − ρ_i) a_{i+1/2},   (A.16)

a_{i+1/2} = ½(F_i + F_{i+1}).   (A.17)
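The one-dimensional scheme of equations A.12 through A.17 can be sketched as follows; the grid size, advection speed, and initial profile are illustrative, and a periodic grid stands in for the paper's reset boundary condition.

```python
import numpy as np

def superbee(r):
    # Roe's Superbee limiter, eq. A.15: psi[r] = max[0, min(2r, 1), min(r, 2)]
    return np.maximum(0.0, np.maximum(np.minimum(2.0 * r, 1.0),
                                      np.minimum(r, 2.0)))

def tvd_step(rho, F, dx, dt):
    """One forward-Euler step of the limited second-order upwind scheme of
    equations A.12-A.17 on a periodic grid (index shifts via np.roll)."""
    f = F * rho                                # streaming flux f_i = (F rho)_i
    fp, rhop = np.roll(f, -1), np.roll(rho, -1)
    a = 0.5 * (F + np.roll(F, -1))             # a_{i+1/2}, eq. A.17
    fR = 0.5 * (f + fp) - 0.5 * (rhop - rho) * a   # Roe flux, eq. A.16

    dminus = f - np.roll(fR, 1)                # f_i - f*_{i-1/2}
    dplus = fp - np.roll(fR, -1)               # f_{i+1} - f*_{i+3/2}
    tiny = 1e-30                               # guard against 0/0 ratios
    r_plus = (fp - fR) / np.where(dminus == 0.0, tiny, dminus)   # eq. A.13
    r_minus = (f - fR) / np.where(dplus == 0.0, tiny, dplus)     # eq. A.14
    H = fR + 0.5 * superbee(r_plus) * dminus + 0.5 * superbee(r_minus) * dplus
    return rho - dt / dx * (H - np.roll(H, 1))     # eq. A.12

# Advect a square pulse at unit speed for 100 steps at CFL number 0.5; the
# limiter suppresses the oscillations a plain second-order scheme produces.
nx = 200
dx = 1.0 / nx
dt = 0.5 * dx
x = (np.arange(nx) + 0.5) * dx
rho = np.where((x > 0.2) & (x < 0.4), 1.0, 0.0)
F = np.ones(nx)
for _ in range(100):
    rho = tvd_step(rho, F, dx, dt)
```

Because the update is in flux form, mass is conserved to round-off, and the limited fluxes keep the density free of new extrema and of negative values, which is exactly the property the population equation requires.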
This method eliminates undesirable oscillations due to discretization and reduces negative densities to values on the order of the numerical round-off error. We found one undesirable feature with this method. If the direction field points in opposite directions between grid points n and n + 1, the populations on either side diverge, one tending to +∞ and the other to −∞. To cure this problem, we set the Roe flux, equation A.16, to zero at the grid boundary between grid points n and n + 1. With this modification, the method works extremely well for the advective terms in equations 2.12 and 2.22. The other terms of the population equation, corresponding to the stochastic input, were also discretized to second order. For the finite-jump model, equation 2.12, the source term σ(t)ρ(V − ε, h, t) was discretized by linear interpolation. The diffusion term in the diffusion approximation, equation 2.22, was discretized by a second-order central difference.

A.3 Discrepancies Between the Direct Simulation and the Population Simulation. We commented in section 3.3 that the accuracy of the population simulation with a step input current is sensitive to the prestep voltage equilibrium of the neuron. In Figure 7, for example, we noted that a fine resolution (300 grid points in V) was necessary to achieve reasonable agreement in the firing rates of the direct simulation and the TVD simulation of the population equation, 2.12. For other cases in which the prejump membrane potential equilibrium was well away from V_h, we generally found that the population simulation with roughly 100 grid points in each direction converged to a direct simulation of sufficiently many neurons. We now comment on why this is so.

In cases where the prejump state of the average neuron is far away from V_h, the main source of numerical error arose from the resolution in the recovery variable h. Since all the neurons equilibrate at h = 0 before the current step, the TVD algorithm imposes large dissipation at the bottom boundary of the phase space in order to avoid grid-scale oscillations in the density ρ. As the grid becomes finer in the h-direction, corresponding to diminished flux limiting and less numerical dissipation, the predicted firing rates of the population model converge to that of the direct simulation (see Figure 6). Population model firing rates err slightly on the high side owing to the finite accuracy, as some neurons achieve their equilibrium at a value of h slightly larger than zero. These neurons feel a slightly larger depolarizing input, due to a nonvanishing T-channel current, and thus reach the firing-rate threshold at the onset of the current step faster than their counterparts, which lie initially at h = 0.

When V_eq ≈ V_h before the current step, the TVD simulations are more sensitive to the resolution in the voltage variable. As seen in Figure 7, with a coarse resolution of 50 grid points in V, the firing rate of the population simulation is off by approximately 56% at peak firing. As the resolution is increased to 300 grid points in V, the maximal error drops to about 8%. These observations are attributable to the spurious drift of neurons through V_h owing to finite grid effects, something from which the direct simulation does not suffer. We also observed that the calculated firing rates were relatively insensitive to the resolution in h, which here was chosen to be 50 grid points. Comparison of the equilibrium profiles in Figure 9 gives a better understanding of the firing-rate discrepancies between the direct simulation and the population dynamics simulation shown in Figure 7.
In the direct simulation, stochastic voltage jumps across V_h cause some neurons to equilibrate at h = 0 and some at h = 1, as reflected in the top panel of Figure 9. Similar peaks in the density arise in the population simulation. However, as the lower panel of Figure 9 shows, many more neurons lie near h = 0 relative to the direct simulation. Consequently, more neurons are poised to run through a burst cycle at the onset of the step current in the direct simulation, which is why the peak firing rate associated with the bursting event is greater than that calculated in the population model. This error decreases as the V-direction mesh of the population code is made finer. Increases in resolution beyond 50 grid points in the h-direction did not alter the results significantly. To deal effectively with these finite grid issues, one option is to employ a variable phase-space mesh that has finer resolution near points of discontinuity along the voltage axis (at V_h and V_r), and near the grid lines at h = 0 and h = 1 where the population tends to pile up. We leave such refinements to future work.
[Figure: two surface plots of ρ versus V and h — direct simulation (top) and population simulation on a 50 × 50 grid (bottom)]
Figure 9: Equilibrium population distributions corresponding to a mean driving current I = 0.1 μA/cm². (Top) Direct simulation with stochastic driving (fixed jumps of size ε = 1 mV). (Bottom) Population dynamics simulation (50 × 50 grid). Note that many more neurons equilibrate near h = 0 in the population dynamics simulation relative to that of the direct simulation.
Acknowledgments This work has been supported by NIH/NEI EY 11276, NIH/NIMH MH50166, NIH/NEI EY01867, NIH/NEI EY9314, DARPA MDA 972-01-1-0028, and ONR N00014-96-1-0492. E. K. is the Jules and Doris Stein Research to Prevent Blindness Professor at the Ophthalmology Department at Mount Sinai.
References

Abramowitz, M. A., & Stegun, I. A. (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables. Washington, DC: U.S. Government Printing Office.
Guido, W., Lu, S., & Sherman, S. M. (1992). Relative contributions of burst and tonic responses to the receptive field properties of lateral geniculate neurons in the cat. J. Neurophysiol., 68, 2199–2211.
Guido, W., & Weyand, T. G. (1995). Burst responses in lateral geniculate neurons of the awake behaving cat. J. Neurophysiol., 74, 1782–1786.
Hirsch, C. (1992). Numerical computation of internal and external flows. New York: Wiley.
Jahnsen, H., & Llinas, R. (1982). Electrophysiology of mammalian thalamic neurons in vitro. Nature, 297, 406–408.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Knight, B. W. (2000). Dynamics of encoding in neuron populations: Some general mathematical features. Neural Comp., 12, 473–518.
Knight, B. W., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations. In E. C. Gerf (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in System Applications. Lille, France: Cité Scientifique.
Knight, B. W., Omurtag, A., & Sirovich, L. (2000). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Comp., 12, 1045–1055.
Livingstone, M. S., & Hubel, D. H. (1981). Effects of sleep and arousal on the processing of visual information in the cat. Nature, 291, 554–561.
McCormick, D. A., & Feeser, H. R. (1990). Functional implications of burst firing and single spike activity in lateral geniculate relay neurons. Neurosci., 39, 103–113.
McLaughlin, D., Shapley, R., Shelley, M., & Wielaard, D. J. (2000). A neuronal network model of macaque primary visual cortex (V1): Orientation selectivity and dynamics in the input layer 4Cα. PNAS, 97, 8087–8092.
Mukherjee, P., & Kaplan, E. (1995). Dynamics of neurons in the cat lateral geniculate nucleus: In vivo electrophysiology and computational modeling. J. Neurophysiol., 74, 1222–1243.
Mukherjee, P., & Kaplan, E. (1998). The maintained discharge of neurons in the cat lateral geniculate nucleus: Spectral analysis and computational modeling. Vis. Neurosci., 15, 529–539.
Murphy, P., Duckett, S., & Sillito, A. (1999). Feedback connections to the lateral geniculate nucleus and cortical response properties. Science, 286, 1552–1554.
Nykamp, D., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–50.
Omurtag, A., Kaplan, E., Knight, B. W., & Sirovich, L. (2000). A population approach to cortical dynamics with an application to orientation tuning. Network: Comput. Neural Syst., 11, 247–260.
Omurtag, A., Knight, B. W., & Sirovich, L. (2000). On the simulation of large populations of neurons. J. Comp. Neurosci., 8, 51–63.
Sakurai, J. J. (1994). Modern quantum mechanics. Reading, MA: Addison-Wesley.
Sherman, S. M. (1996). Dual response modes in lateral geniculate neurons: Mechanisms and functions. Vis. Neurosci., 13, 205–213.
Sherman, S. M., & Guillery, R. W. (1996). Functional organization of thalamocortical relays. J. Neurophysiol., 76, 1367–1395.
Sillito, A. M., Jones, H. E., Gerstein, G. L., & West, D. C. (1994). Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex. Nature, 369, 479–482.
Sirovich, L., Knight, B. W., & Omurtag, A. (2000). Dynamics of neuronal populations: The equilibrium solution. SIAM J. Appl. Math., 60, 2009–2028.
Smith, G. D., Cox, C. L., Sherman, S. W., & Rinzel, J. (2000). Fourier analysis of sinusoidally driven thalamocortical relay neurons and a minimal integrate-and-fire-or-burst model. J. Neurophysiol., 83, 588–610.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.
Steriade, M., & Contreras, D. (1995). Relations between cortical and thalamic cellular events during transition from sleep patterns to paroxysmal activity. J. Neurosci., 15, 623–642.
Tiesinga, P. H. E., & José, J. V. (2000). Synchronous clusters in a noisy inhibitory neural network. J. Comp. Neurosci., 9, 49–65.
Troy, J. B., & Robson, J. G. (1992). Steady discharges of X and Y retinal ganglion cells of cat under photopic illuminance. Vis. Neurosci., 9, 535–553.
Tuckwell, H. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.

Received February 13, 2001; accepted August 10, 2001.
NOTE
Communicated by Ad Aertsen
Stable Propagation of Activity Pulses in Populations of Spiking Neurons

Werner M. Kistler
[email protected]
Swiss Federal Institute of Technology Lausanne, 1015 Lausanne EPFL, Switzerland, and Neuroscience Institute, Department of Anatomy, FGG, Erasmus University Rotterdam, 3000 DR Rotterdam, The Netherlands

Wulfram Gerstner
wulfram.gerstner@epfl.ch
Swiss Federal Institute of Technology Lausanne, 1015 Lausanne EPFL, Switzerland
We investigate the propagation of pulses of spike activity in a neuronal network with feedforward couplings. The neurons are of the spike-response type with a firing probability that depends linearly on the membrane potential. After firing, neurons enter a phase of refractoriness. Spike packets are described in terms of the moments of the firing-time distribution so as to allow for an analytical treatment of the evolution of the spike packet as it propagates from one layer to the next. Analytical results and simulations show that, depending on the synaptic coupling strength, a stable propagation of the packet with constant waveform is possible. Crucial for this observation is neither the existence of a firing threshold nor a sigmoidal gain function (both are absent in our model) but the refractory behavior of the neurons.
Neural Computation 14, 987–997 (2002) © 2002 Massachusetts Institute of Technology

1 Introduction

Recently, the propagation of sharp pulses of spike activity through various types of neuronal networks has attracted a lot of attention. There are basically two complementary scenarios in which a temporally precise transmission of spikes has been investigated in model studies: spatially extended networks with distance-dependent couplings and layered feedforward structures of pools of neurons. Spatially extended networks have properties similar to those of excitable media and exhibit, for example, solitary waves of spike activity (Kistler, Seitz, & van Hemmen, 1998; Ermentrout, 1998; Bressloff, 1999; Kistler, 2000). Layered feedforward networks, also known as synfire chains (Abeles, 1991), can be seen as a discretized version of the former, where the smooth propagation of a wave of activity is replaced by a discrete transmission of spikes from one layer to the next. Similar to solitary waves in spatially extended networks, the transmission function for spikes can produce an attractive fixed point for the shape of the firing-time distribution in each layer. Such a "spike packet" can propagate in a stable way from one layer to the next (Abeles, 1991; Aertsen, Diesmann, & Gewaltig, 1996; Maršálek, Koch, & Maunsell, 1997; Gewaltig, 2000; Diesmann, Gewaltig, & Aertsen, 1999).

Similarly to Abeles (1991) and Diesmann et al. (1999), we consider in this article a chain of M pools of identical neurons with feedforward coupling. Each neuron is described by the spike response model, a generalization of the integrate-and-fire model. The spike train of neuron i is formalized as a sum of δ functions, S_i(t) = Σ_f δ(t − t_i^f), where the firing times t_i^f of neuron i are labeled by an upper index f. The membrane potential u_i of a given neuron is the linear response to pre- and postsynaptic action potentials,

u_i(t, t̂_i) = Σ_{j, j≠i} v_ij ∫₀^∞ dt′ ε(t′) S_j(t − t′) + η(t − t̂_i).   (1.1)
Here, the response kernel ε describes the form of an elementary postsynaptic potential, v_ij is the synaptic coupling strength, and η is a (negative) afterpotential that accounts for the reset of the membrane potential after the last spike at t̂_i = max{t_i^f | t_i^f < t, f = 1, 2, …} and for refractoriness (Gerstner & van Hemmen, 1992; Gerstner, Ritz, & van Hemmen, 1993; Kistler, Gerstner, & van Hemmen, 1997). In the absence of synaptic input, u_i = 0 corresponds to the resting potential of the neuron. The influence of the last-but-one and earlier spikes is neglected so that spike triggering can be described by an input-dependent renewal process (Cox, 1962). Noise is implemented in the model by a stochastic spike-triggering mechanism. New spikes are defined through a stochastic process that depends on the value of the membrane potential. The probability that a spike will occur in the infinitesimal interval [t, t + dt) is

prob{spike in [t, t + dt)} = f[u_i(t, t̂_i)] dt.   (1.2)

The function f is called the escape rate (or hazard function) (Plesser & Gerstner, 2000). For simplicity, we assume a semilinear dependence of the firing probability on the membrane potential,

f(u) = [u]₊,   (1.3)

with [u]₊ = u if u > 0 and [u]₊ = 0 elsewhere. f(0) = 0 implies that the neuron is not spontaneously active. This completes the definition of our single-neuron model. If we assume that neuron i has fired its last action potential at time t̂_i, we can calculate the probability s_i(t, t̂_i) that it will "survive" without firing until time t > t̂_i,

s_i(t, t̂_i) = exp{−∫_{t̂_i}^t f[u_i(t′, t̂_i)] dt′}   (1.4)

(cf. Cox, 1962; Gerstner, 2000). The probability density for the next firing time is thus

p_i(t, t̂_i) = −(∂/∂t) s_i(t, t̂_i) = f[u_i(t, t̂_i)] exp{−∫_{t̂_i}^t f[u_i(t′, t̂_i)] dt′}.   (1.5)
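The escape-rate mechanism of equations 1.2 through 1.5 is easy to simulate directly. In the sketch below the membrane potential is clamped to a constant u0 > 0, so the survivor function of equation 1.4 reduces to s(t) = exp(−u0 t) and the mean firing time is 1/u0; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

u0 = 2.0          # clamped membrane potential -> escape rate, in 1/ms
dt = 1e-3         # time step (ms)
n_trials = 2000

def first_spike_time(u, t_max=20.0):
    """Draw a firing time from prob{spike in [t, t+dt)} = f(u) dt (eq. 1.2)
    with the semilinear escape rate f(u) = [u]_+ of eq. 1.3."""
    t = 0.0
    while t < t_max:
        if rng.random() < max(u, 0.0) * dt:
            return t
        t += dt
    return t_max

times = np.array([first_spike_time(u0) for _ in range(n_trials)])
mean_isi = times.mean()   # for constant u0 the theory gives 1/u0 = 0.5 ms
```

Comparing the empirical mean against 1/u0 is a quick check that the per-step Bernoulli draw really samples the density of equation 1.5.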
We consider M pools containing N neurons each that are connected in a purely feedforward manner; neurons from pool n project only to pool n + 1, and there are no synapses between neurons from the same pool. We assume all-to-all connectivity between two successive pools with uniform synaptic weights v_ij = v/N. The membrane potential of a neuron i from pool n + 1 is thus

u_i(t, t̂_i) = (v/N) Σ_{j∈Γ(n)} ∫₀^∞ ε(t′) S_j(t − t′) dt′ + η(t − t̂_i)
            = v ∫₀^∞ ε(t′) A_n(t − t′) dt′ + η(t − t̂_i),   (1.6)

with i ∈ Γ(n + 1), Γ(n) the index set of all neurons that belong to pool n, and A_n(t) = N⁻¹ Σ_{j∈Γ(n)} S_j(t) the population activity of pool n. Integration of A_n over a short interval of time thus gives the portion of neurons from pool n that fire an action potential during this interval. The coupling strength v between two successive pools describes the amplitude of the resulting postsynaptic potential if all neurons in the presynaptic pool were to fire synchronously. A single action potential thus produces only a weak postsynaptic potential that, according to equation 1.3, has only a low chance of triggering the neuron. The spike trains S_i and the population activity A_n are random variables. Each pool is supposed to contain a large number of neurons (N ≫ 1), so that we can replace the population activity A_n in equation 1.6 by its expectation value Ā_n, which is given by a normalization condition (Gerstner, 2000),

∫_{−∞}^t s_n(t, t̂) Ā_n(t̂) dt̂ = 1 − s_n(t).   (1.7)
Here, s_n(t) ≡ s_n(t, −∞) accounts for those neurons that have been quiescent in the past (i.e., have not fired up to time t). The strong law of large numbers ensures that the population activity A_n converges in probability to Ā_n (in the weak topology) as the number of neurons in the pool goes to infinity,

prob{ lim_{N→∞} ∫_{−∞}^∞ A_n(t) w(t) dt = ∫_{−∞}^∞ Ā_n(t) w(t) dt } = 1,   (1.8)

for any test function w ∈ C^∞(ℝ) (cf. Lamperti, 1996).
2 Pulse Propagation

Simulation studies (Diesmann et al., 1999) and analytic calculations (Gewaltig, 2000) suggest that a pronounced refractory behavior is required in order to obtain a stable propagation of a spike packet from one layer to the next. If neurons were allowed to fire more than once within one spike packet, the number of spikes per packet, and thus the width of the packet, would grow in each step. Therefore, we use a strong and long-lasting afterpotential η so that each neuron can fire only once during each pulse. The survivor function thus equals unity for the duration t_AP of the afterpotential: s_n(t, t̂) = 1 for 0 < t − t̂ < t_AP, with t_AP large as compared to the typical pulse width. Let us denote by T_n the moment when a pulse packet arrives at pool n. We assume that for t < T_n, all neurons in layer n have been inactive: A_n(t) = 0 for t < T_n. Differentiation of equation 1.7 with respect to t (and dropping bars in order to keep notation simple) leads to

A_n(t) = −(∂/∂t) s_n(t) = f[u_n(t)] exp{−∫_{−∞}^t f[u_n(t′)] dt′},   (2.1)

with

u_n(t) = v ∫₀^∞ ε(t′) A_{n−1}(t − t′) dt′.   (2.2)
Equation 2.1 provides an explicit expression for the firing-time distribution A_n(t) in layer n as a function of the time course of the membrane potential. The membrane potential u_n(t) in turn depends on the time course of the activity A_{n−1}(t) in the previous layer, as shown in equation 2.2. Note that both equations are independent of the network size N; their derivation, however, relies on the strong law of large numbers, so that N ≫ 1 is implicitly assumed. Both equations 2.1 and 2.2 can easily be integrated numerically; an analytic treatment, however, is difficult even if a particularly simple form of the response kernel ε is chosen. Following Diesmann et al. (1999), we therefore concentrate on the first few moments of the firing-time distribution in order to characterize the transmission properties. More precisely, we approximate the firing-time distribution A_{n−1}(t) by a gamma distribution and calculate, in step i, the zeroth, first, and second moment of the resulting membrane potential in the following layer n. In step ii, we use these results to approximate the time course of the membrane potential by a gamma distribution and calculate the moments of the corresponding firing-time distribution in layer n. We thus obtain an analytical expression for the amplitude and the variance of the spike packet in layer n as a function of the amplitude and variance of the spike packet in the previous layer.

In step i, we assume that the activity A_{n−1}(t) in layer n − 1 is given by a gamma distribution with parameters α_{n−1} and λ_{n−1}, that is,

A_{n−1}(t) = a_{n−1} γ_{α_{n−1}, λ_{n−1}}(t).   (2.3)
Here, a_{n−1} is the portion of neurons of layer n − 1 that contribute to the spike packet, γ_{α,λ}(t) = t^{α−1} e^{−t/λ} H(t) / [Γ(α) λ^α] the density function of the gamma distribution, Γ the complete gamma function, and H the Heaviside step function with H(t) = 1 for t > 0 and H(t) = 0 else. The mean μ and the variance σ² of a gamma distribution with parameters α and λ are μ = αλ and σ² = αλ², respectively. The membrane potential u_n(t) in the next layer results from a convolution of A_{n−1} with the response kernel ε. This is the only point where we have to refer explicitly to the shape of the ε kernel. For simplicity, we use a normalized α function,

ε(t) = (t/τ²) e^{−t/τ} H(t) ≡ γ_{2,τ}(t),   (2.4)

with time constant τ. The precise form of ε is not important; similar results hold for a different choice of ε. In the present context, spikes are mostly triggered during the rising phase of the (excitatory) postsynaptic potential. We therefore set τ = 1 ms for fast AMPA-mediated potentials rather than describe the passive membrane time constant, which is about one order of magnitude larger. We want to approximate the time course of the membrane potential by a gamma distribution γ_{α̃_n, λ̃_n}. The parameters¹ α̃_n and λ̃_n are chosen so that the first few moments of the distribution are identical to those of the membrane potential, that is,

u_n(t) ≈ ã_n γ_{α̃_n, λ̃_n}(t),   (2.5)
with

    ∫_0^∞ t^n u_n(t) dt = ∫_0^∞ t^n ã_n γ_{α̃_n,λ̃_n}(t) dt,    n ∈ {0, 1, 2}.    (2.6)
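The moment matching in equation 2.6 can be checked numerically. The sketch below (an editorial illustration, not code from the article) evaluates the gamma density and confirms that the α-function kernel γ_{2,τ} carries a mean of 2τ and a variance of 2τ², the shifts that enter equation 2.7 below.

```python
import math

import numpy as np

def gamma_density(t, alpha, lam):
    """Gamma density: t^(alpha-1) exp(-t/lam) / (Gamma(alpha) lam^alpha) for t > 0."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    pos = t > 0
    out[pos] = (t[pos] ** (alpha - 1) * np.exp(-t[pos] / lam)
                / (math.gamma(alpha) * lam ** alpha))
    return out

def trapezoid(y, x):
    """Simple trapezoidal quadrature."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# The alpha-function kernel eps = gamma_{2,tau} with tau = 1 ms:
tau = 1.0
t = np.linspace(0.0, 60.0, 100001)
eps = gamma_density(t, 2.0, tau)
mean = trapezoid(t * eps, t)
var = trapezoid((t - mean) ** 2 * eps, t)
print(round(mean, 3), round(var, 3))  # mean = 2*tau, variance = 2*tau^2
```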
¹ We use a tilde to identify parameters that describe the time course of the membrane potential. Parameters without a tilde refer to the firing-time distribution.
Werner M. Kistler and Wulfram Gerstner
As far as the first two moments are concerned, a convolution of two distributions reduces to a mere summation of their means and variances. Therefore, the convolution of A_{n−1} with ε basically translates the center of mass by 2τ and increases the variance by 2τ². Altogether, amplitude, center of mass, and variance of the time course of the membrane potential in layer n are

    ã_n = v a_{n−1},
    μ̃_n = μ_{n−1} + 2τ,    (2.7)
    σ̃_n² = σ_{n−1}² + 2τ²,

respectively. The parameters α̃_n and λ̃_n of the gamma distribution are directly related to mean and variance: α̃_n = μ̃_n²/σ̃_n², λ̃_n = σ̃_n²/μ̃_n.

In step ii, we calculate the firing-time distribution that results from a membrane potential with time course given by a gamma distribution as in equation 2.5. We use the same strategy as in step i; that is, we calculate the first few moments of the firing-time distribution and approximate it by the corresponding gamma distribution,

    A_n(t) ≈ a_n γ_{α_n,λ_n}(t).    (2.8)
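As a sketch (editorial, with illustrative numbers), the mapping of equation 2.7, together with the moment-matching relations for α̃_n and λ̃_n, is a few lines of code:

```python
def packet_to_membrane(a_prev, mu_prev, var_prev, v, tau=1.0):
    """Equation 2.7: parameters of the membrane potential in layer n
    from the spike packet in layer n-1."""
    a_tilde = v * a_prev               # amplitude: a~_n = v a_{n-1}
    mu = mu_prev + 2.0 * tau           # mean shifted by 2 tau
    var = var_prev + 2.0 * tau ** 2    # variance increased by 2 tau^2
    alpha = mu ** 2 / var              # shape:  alpha~_n = mu~^2 / sigma~^2
    lam = var / mu                     # scale:  lambda~_n = sigma~^2 / mu~
    return a_tilde, alpha, lam

# Sharp initial packet of Figure 1A: gamma with alpha_0 = 10, lambda_0 = 0.1
# (mu_0 = 1, sigma_0^2 = 0.1), amplitude a_0 = 1, coupling v = 2.
a_tilde, alpha, lam = packet_to_membrane(1.0, 1.0, 0.1, v=2.0)
print(a_tilde, alpha, lam)
```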
The zeroth moment of A_n(t) (the portion of neurons in layer n that participates in the activity pulse) can be cast in a particularly simple form; the expressions for higher-order moments, however, contain integrals that have to be evaluated numerically. For amplitude, center of mass, and variance of A_n(t), we find

    a_n = 1 − e^{−ã_n},
    μ_n = m_n^{(1)},    (2.9)
    σ_n² = m_n^{(2)} − [m_n^{(1)}]²,

with

    m_n^{(k)} = (1 − e^{−ã_n})^{−1} ∫_0^∞ u_n(t) exp[−∫_{−∞}^t u_n(t′) dt′] t^k dt
              = ã_n λ̃_n^k / [(1 − e^{−ã_n}) Γ(α̃_n)] ∫_0^∞ exp[−t − ã_n Γ(α̃_n, 0, t)/Γ(α̃_n)] t^{k−1+α̃_n} dt    (2.10)

being the kth moment of the firing-time distribution (see equation 2.1) that results from a gamma-shaped time course of the membrane potential. Here Γ(z, t₁, t₂) = ∫_{t₁}^{t₂} t^{z−1} e^{−t} dt is the generalized incomplete gamma function. The last equality in equation 2.10 has been obtained by substituting ã_n γ_{α̃_n,λ̃_n}(t) for u_n(t).
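The first line of equation 2.10 lends itself to direct quadrature. The following sketch (editorial; parameter values are illustrative) evaluates the moments of the firing-time distribution for a gamma-shaped membrane potential:

```python
import math

import numpy as np

def firing_moments(a_tilde, alpha, lam, t_max=50.0, n=200001):
    """Moments m^(k), k = 1, 2, of the firing-time density
    u(t) exp(-int_{-inf}^t u), for u(t) = a~ gamma_{alpha,lam}(t)."""
    t = np.linspace(0.0, t_max, n)
    dt = t[1] - t[0]
    g = t ** (alpha - 1.0) * np.exp(-t / lam) / (math.gamma(alpha) * lam ** alpha)
    u = a_tilde * g
    U = np.cumsum(u) * dt                 # running integral of u
    dens = u * np.exp(-U)                 # firing-time density
    norm = 1.0 - math.exp(-a_tilde)       # zeroth moment: a_n = 1 - e^{-a~}
    m1 = np.sum(dens * t) * dt / norm
    m2 = np.sum(dens * t ** 2) * dt / norm
    return m1, m2

m1, m2 = firing_moments(2.0, 30.0 / 7.0, 0.7)
print(round(m1, 3), round(m2 - m1 ** 2, 3))  # mean and variance of firing times
```

Because the density is weighted by the decreasing survival factor exp(−∫u), the mean firing time comes out earlier than the mean of u itself, which is the refractoriness effect discussed in section 3.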
Propagation of Activity Pulses
A combination of equations 2.7 and 2.9 yields explicit expressions for the parameters (a_n, μ_n, σ_n) of the firing-time distribution in layer n as a function of the parameters in the previous layer. The mapping (a_{n−1}, μ_{n−1}, σ_{n−1}) → (a_n, μ_n, σ_n) is closely related to the neural transmission function for pulse-packet input, as discussed by Diesmann et al. (1999). Particularly interesting is the iteration that describes the amplitude of the spike packet,

    a_n = 1 − e^{−v a_{n−1}},    (2.11)

which is independent of the shape of the spike packet. If v ≤ 1, the mapping a_{n−1} → a_n has a single (globally attractive) fixed point at a = 0. In this case, no stable propagation of spike packets is possible, since any packet will finally die out.² For v > 1, a second fixed point in the open interval (0, 1) emerges through a pitchfork bifurcation. The new fixed point is stable, and its basin of attraction contains the open interval (0, 1). The fact that the all-off state at a = 0 is unstable for v > 1 is related to the fact that there is no real firing threshold in our model.

Figure 1 shows examples of the propagation of a spike packet for various synaptic coupling strengths and initial conditions. Theoretical predictions based on equations 2.7 and 2.9 are compared to simulations of a network with N = 1000 neurons per layer. In each subfigure, a series of bar charts shows the firing-time distribution of neurons from layers n = 0 to n = 5. The flow field illustrates the evolution of the amplitude and the width of the spike packet as it propagates from one layer to the next. In Figure 1A, a small coupling strength has been chosen (v = 1), so that iteration 2.11 has only a single fixed point at a = 0. Therefore, any spike packet will die out, whatever the initial firing-time distribution in layer n = 0. Figure 1B is another example, for v = 2. Here, iteration 2.11 has a stable fixed point at a ≈ 0.80, and both simulations and theory show that this fixed point corresponds to a stable propagation of spike packets from one layer to the next. Finally, in Figure 1C (v = 4), we demonstrate that the iteration (a_{n−1}, μ_{n−1}, σ_{n−1}) → (a_n, μ_n, σ_n) converges to a mere translation of the spike packet with an approximately fixed waveform. This waveform is globally attractive, so that even a weak and broadly tuned initial firing distribution will become sharper and form a narrow spike packet.
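The amplitude iteration of equation 2.11 is easy to explore directly (a small editorial sketch; the values of v mirror Figure 1):

```python
import math

def iterate_amplitude(a0, v, steps=60):
    """Iterate the amplitude map a_n = 1 - exp(-v a_{n-1})  (equation 2.11)."""
    a = a0
    for _ in range(steps):
        a = 1.0 - math.exp(-v * a)
    return a

print(iterate_amplitude(1.0, v=0.8))   # v < 1: packet dies out (a -> 0)
print(iterate_amplitude(0.2, v=2.0))   # v = 2: stable fixed point near 0.80
print(iterate_amplitude(0.2, v=4.0))   # v = 4: fixed point near 0.98
```

The fixed points agree with the values quoted for Figures 1B and 1C, and they are reached from any initial amplitude in (0, 1).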
3 Discussion

Any information processing scheme that relies on the precise timing of action potentials obviously requires a means to transmit spikes without

² The decay of the activity is exponential in n if v < 1; for v = 1, the decay is polynomial in n.
destroying their temporal structure. In this article, we have shown analytically that despite the noise in the spike-generating mechanism, packets of (almost) synchronous spikes can propagate in a feedforward structure from one layer to the next so that their width is preserved, provided that the synaptic coupling strength is sufficiently large.

Our approach is closely related to the concept of synfire chains (Abeles, 1991). While Abeles stresses the importance of a nonlinear transfer function,³ our results are not based on a nonlinear transfer function but are a direct consequence of the refractory behavior of the neurons. Noise and broad postsynaptic potentials tend to smear out initially sharp spike packets. If, however, the synaptic coupling is strong enough, then postsynaptic neurons will start firing during the rising phase of their membrane potential. If, in addition, these neurons show pronounced refractory behavior, then firing will cease even before the postsynaptic potentials have reached their maximum. With respect to precise timing, refractoriness thus counteracts the effects of noise and synaptic transmission.

As a consequence of the linear transfer function, no bistability between asynchronously firing neurons and a propagating pulse could be observed in our model. Depending on the synaptic coupling strength, there is always only one stable fixed point: either the all-off state or a propagating pulse.

Figure 1: Facing page. Propagation of spike packets through a feedforward network. (A) Evolution of the firing-time distribution of a spike packet as it propagates from one layer to the next (n = 0, 1, . . . , 4). The neurons in layer n = 0 are driven by an external input that creates a sharp initial spike packet given by a gamma distribution with α₀ = 10 and λ₀ = 0.1. Initial amplitude is a₀ = 1. The bars (bin width 0.2) represent the results of a simulation with N = 1000 neurons per layer; the solid line is the firing-time distribution as predicted by the theory; cf. equations 2.7 and 2.9. The "flow field" to the right characterizes the transmission function for spike packets in terms of their amplitude a_n and width σ_n = √α_n λ_n. Open symbols connected by a dashed line represent the simulations shown to the left; filled symbols connected by solid lines represent the corresponding theoretical trajectories. Neurons between layers are only weakly coupled (v = 1), so the packet fades out. Time is given in units of the time constant τ. (B) Same as in A but with increased coupling strength v = 2. There is an attractive fixed point of the flow field at a = 0.80 and σ = 2.9 that corresponds to the stable waveform of the spike packet. (C) Similar plots as in A and B but with a strong coupling strength (v = 4). Initial stimulation is weak (a₀ = 0.2) and broad (σ₀ = 2). As the packet propagates through a few layers, it quickly reaches a stable waveform with amplitude a = 0.98 and σ = 1.5.

³ The transfer function of Abeles can be retrieved if we replace our equation 1.3 by f(u) ∝ ∫_ϑ^∞ ρ(u′ − u) du′, where the membrane potential density ρ(u′ − u) is approximated by a gaussian with mean u and variance σ²; cf. Abeles (1991, sections 4.5, 7.1–7.3).
This seems to be a severe limitation for the computational usefulness of the system, because in the latter case even a single action potential can ultimately lead to a full-size pulse. Note, however, that this statement holds true only in the limit N → ∞. Due to the intrinsic probabilistic properties of our model, finite-size effects become important as soon as the activity A_n is no longer large compared to N^{−1}. In a finite network, a few initial action potentials will lead to a full-size pulse only with a certain probability smaller than one, depending on the size of the network and the distance from the bifurcation. Recent simulation studies (Diesmann et al., 1999) have confirmed that a slightly more general model with a nonlinear transfer function can indeed exhibit bistability, where neurons are either firing asynchronously at a low rate or participating in the transmission of a sharp spike packet.
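These finite-size fluctuations can be illustrated with a toy Monte Carlo version of the amplitude map (an editorial sketch, not the simulation used in the article): each neuron fires at most once, with total firing probability 1 − exp(−ã_n), so a finite layer realizes equation 2.11 only up to binomial fluctuations of order N^{−1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_amplitude(a_prev, v, n_neurons):
    """One layer of the amplitude map with a finite population.

    The integrated hazard of each neuron is a~ = v * a_prev, so each
    neuron fires (once) with probability 1 - exp(-a~); the layer
    amplitude is the fraction of neurons that fired.
    """
    p_fire = 1.0 - np.exp(-v * a_prev)
    return float((rng.random(n_neurons) < p_fire).mean())

# With N = 1000 neurons per layer and v = 2, the amplitude fluctuates
# around the deterministic fixed point a = 0.80 of equation 2.11.
a = 0.2
for _ in range(10):
    a = layer_amplitude(a, v=2.0, n_neurons=1000)
print(round(a, 2))
```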
References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A., Diesmann, M., & Gewaltig, M.-O. (1996). Propagation of synchronous spiking activity in feedforward neural networks. J. Physiol. Paris, 90, 243–247.
Bressloff, P. C. (1999). Synaptically generated wave propagation in excitable neural media. Phys. Rev. Lett., 82, 2979–2982.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Ermentrout, B. (1998). The analysis of synaptically generated traveling waves. J. Comput. Neurosci., 5, 191–208.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gewaltig, M.-O. (2000). Evolution of synchronous spike volleys in cortical networks: Network simulations and continuous probabilistic models. Doctoral dissertation, Shaker Verlag, Aachen, Germany.
Kistler, W. M. (2000). Stability properties of solitary waves and periodic wave trains in a two-dimensional network of spiking neurons. Phys. Rev. E, 62(6), 8834–8837.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Kistler, W. M., Seitz, R., & van Hemmen, J. L. (1998). Modeling collective excitations in cortical tissue. Physica D, 114, 273–295.
Lamperti, J. (1996). Probability: A survey of the mathematical theory. New York: Wiley.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Plesser, H. E., & Gerstner, W. (2000). Noise in integrate-and-fire neurons: From stochastic input to escape rates. Neural Comput., 12, 367–384.

Received October 31, 2000; accepted July 30, 2001.
LETTER
Communicated by Peter Dayan
Population Coding and Decoding in a Neural Field: A Computational Study

Si Wu
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan, and Department of Computer Science, Sheffield University, U.K.

Shun-ichi Amari
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan

Hiroyuki Nakahara
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan, and Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan

This study uses a neural field model to investigate computational aspects of population coding and decoding when the stimulus is a single variable. A general prototype model for the encoding process is proposed, in which neural responses are correlated, with strength specified by a gaussian function of their difference in preferred stimuli. Based on the model, we study the effect of correlation on the Fisher information, compare the performances of three decoding methods that differ in the amount of encoding information being used, and investigate the implementation of the three methods by using a recurrent network. This study not only rediscovers main results in the existing literature in a unified way, but also reveals important new features, especially when the neural correlation is strong. As the neural correlation of firing becomes larger, the Fisher information decreases drastically. We confirm that as the width of correlation increases, the Fisher information saturates and no longer increases in proportion to the number of neurons. However, we prove that as the width increases further (wider than √2 times the effective width of the tuning function), the Fisher information increases again, and it increases without limit in proportion to the number of neurons. Furthermore, we clarify the asymptotic efficiency of the maximum likelihood inference (MLI) type of decoding methods for correlated neural signals. It shows that when the correlation covers a nonlocal range of the population (excepting the uniform correlation and when the noise is extremely small), the MLI type of method, whose decoding error satisfies a Cauchy-type distribution, is not asymptotically efficient. This implies that the variance is no longer adequate to measure decoding accuracy.

Neural Computation 14, 999–1026 (2002)    © 2002 Massachusetts Institute of Technology
1 Introduction
Population coding is a method of representing stimuli by the joint activities of a number of neurons. Experimental studies have revealed that this coding paradigm is widely used in the sensory and motor areas of the brain. For example, in the visual area MT, neurons are tuned to the direction of motion (Maunsell & Van Essen, 1983). In response to an object moving in a particular direction, many neurons in MT fire, with a noise-corrupted and bell-shaped activity pattern across the population. The moving direction of the object is retrieved from the population activity, making it immune to the fluctuations in any single neuron's signal.

From the theoretical point of view, population coding, which concerns how information is represented in the brain, is a prerequisite for more complex issues of brain function. It is also one of the few mathematically well-formulated problems in neuroscience, in the sense that it grasps the essential features of neural coding and yet is simple enough for theoretical analysis. There has been extensive work to understand population coding theoretically (Abbott & Dayan, 1999; Brunel & Nadal, 1998; Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Nakahara, Wu, & Amari, 2001; Paradiso, 1988; Pouget, Zhang, Deneve, & Latham, 1998; Salinas & Abbott, 1994; Seung & Sompolinsky, 1993; Wu, Nakahara, Murata, & Amari, 2000; Wu, Nakahara, & Amari, 2001; Yoon & Sompolinsky, 1999): for example, how much information is included in population coding, what the optimal decoding accuracy is given an encoding model, how to construct a simple and yet accurate enough decoding strategy, how to implement a decoding method in a biological network, and what the effect of correlation is on decoding accuracy. These issues have been studied by using various models for the encoding process. In this article, based on a unified encoding model in a neural field, we systematically study all of the above computational aspects.

This article is more than a review of the literature; it elucidates the published results in a more transparent way. Furthermore, we explore important new features concerning how the behavior of the Fisher information changes as the width of correlation increases. The efficiencies of various decoding methods are also evaluated.

To study population coding, a prototype model for the encoding process needs to be constructed. We consider the general case in which neural activities are correlated, as seen in experimental data (Fetz, Toyama, & Smith, 1991; Gawne & Richmond, 1993; Lee, Port, Kruse, & Georgopoulos, 1998; Zohary, Shadlen, & Newsome, 1994), under the assumption of gaussian additive noise. The uncorrelated case is handled as a special one in which the correlation length is zero. The correlation model we consider is a continuous neural field (Amari, 1977; Giese, 1999) in which neurons are pairwise correlated, with a strength given by a gaussian function of their difference in preferred stimuli. Depending on the width of the gaussian function, the correlation
form varies to include no correlation, local correlation, short-range correlation, wide-range correlation, and uniform correlation. Compared with other prototype models in the literature (Abbott & Dayan, 1999; Johnson, 1980; Snippe & Koenderink, 1992; Yoon & Sompolinsky, 1999), this one has the advantage of simplifying the calculations and provides a clear picture of the results. As we show, due to the properties of the gaussian function and the continuous extension of the model, many calculations can be done more easily in the Fourier domain. It is also possible to apply the proposed neural field method to the more general case of firing-rate-dependent variances, including multiplicative and Poisson-type noises, as treated in Wilke and Eurich (in press).

Based on the proposed encoding model, we first calculate the Fisher information. The inverse of the Fisher information, the Cramér-Rao bound, defines the optimal accuracy for an unbiased estimator to achieve. When no correlation exists, that is, when the correlation width is zero, the Fisher information increases in proportion to the number of neurons, or the neural density in the field. As the width of correlation increases, the Fisher information decreases rapidly. Abbott and Dayan (1999) and Yoon and Sompolinsky (1999) found that if the neural correlation covers a nonlocal range of the population, the Fisher information does not increase but saturates even when the number of neurons increases. The same behavior is observed in a simpler way in this article. We also show that as the range of correlation becomes wider, that is, when the width of correlation becomes larger than √2 times the effective length of the tuning function, the Fisher information increases again in proportion to the number of neurons. This is an interesting new finding. Such a phenomenon is observed more generally in the case of multiplicative noise (Wilke & Eurich, in press), which we have also confirmed by our method, although it is not described in this article.

We then study three decoding methods and compare their performances. All of them are formulated as the maximum likelihood inference (MLI) type (including the conventional center of mass method), whereas they differ in their knowledge of the true encoding scheme. It turns out that a decoding method that keeps the knowledge of the tuning function but neglects the neural correlation is a good compromise between computational complexity and decoding accuracy, supporting the findings in Wu, Nakahara, et al. (2000) and Wu et al. (2001).

In previous work, we have pointed out that the MLI type of decoding method may not be asymptotically efficient for some strong correlation structures (Wu, Nakahara, et al., 2000; Wu et al., 2001). In this article, we prove that this is indeed the case when the correlation covers a nonlocal range of the population (excepting the uniform correlation and when the noise is extremely small). Here, the estimated or decoded position of the stimulus is no longer subject to the gaussian distribution, but to a Cauchy-type distribution in which the mean and variance diverge. In other words, the standard paradigm of statistical estimation does not hold. This is also a
new finding in population coding, and we discuss the consequences of this property.

We also investigate the network implementation of the three decoding methods, following the idea of Pouget et al. (Deneve, Latham, & Pouget, 1999; Pouget & Zhang, 1997; Pouget et al., 1998). A recurrent neural field is constructed in such a way that its steady state has a shape similar to the tuning function and is noise free. The peak position of the steady state gives the estimator of the stimulus (Amari, 1977; Giese, 1999). When there is no external input to the network, the system is neutrally stable on a line attractor. A small input will cause the state to drift to the position corresponding to the estimator in the three methods. Throughout the article, the effect of correlation on decoding accuracy is discussed as a by-product of the calculations.

In this study, we consider only the case in which the stimulus is a single variable. Population coding also works in more complex cases, as studied by Eurich, Wilke, and Schwegler (2000), Treue, Hol, and Rauber (2000), Zemel, Dayan, and Pouget (1998), Zhang and Sejnowski (1999), and Zohary (1992); these require a high-dimensional neural field and are not within the scope of the work presented here.

The article is organized as follows. In section 2, we introduce a discrete encoding model, which is extended to a continuous version in section 3. In section 4, the Fisher information for the encoding model is calculated. In section 5, we compare three decoding methods. Their network implementations are discussed in section 6. Conclusions and a discussion are given in section 7.

2 Encoding Model
We begin with a discrete encoding model. Consider an ensemble of N neurons coding a variable x, which represents the position of the stimulus. Let us denote by c_i the preferred stimulus of the ith neuron and by r_i the response of the neuron, so that r = {r_i}, for i = 1, . . . , N, denotes the population activity. The neural responses are correlated, and the ith neuron's activity is given by

    r_i = f_i(x) + σ ε_i,    i = 1, . . . , N,    (2.1)

where f_i(x) is the tuning function of the ith neuron, representing the mean value of the response when stimulus x is applied, and σ ε_i is noise. In this study, we consider only the gaussian tuning function, that is,

    f_i(x) = [1/(√(2π) a)] e^{−(c_i − x)²/2a²},    (2.2)

where the parameter a is the tuning width.
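A minimal sketch of the encoding model (editorial; parameter values are illustrative) samples a population response from equations 2.1 and 2.2, here with independent noise for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)

def tuning(c, x, a=1.0):
    """Gaussian tuning function of equation 2.2."""
    return np.exp(-(c - x) ** 2 / (2.0 * a ** 2)) / (np.sqrt(2.0 * np.pi) * a)

N, sigma, x_true = 101, 0.05, 0.3
c = np.linspace(-5.0, 5.0, N)                            # preferred stimuli on a grid
r = tuning(c, x_true) + sigma * rng.standard_normal(N)   # eq. 2.1 with iid noise

# The bell-shaped activity profile peaks near the true stimulus.
print(c[np.argmax(r)])
```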
The parameter σ represents the noise intensity, and ε_i is a gaussian random variable with mean 0 and variance 1. We decompose ε_i as

    ε_i = ε_i′ + ε_i″,    (2.3)

where the ε_i′ are independent of all the others, with zero mean and variance 1 − b, while ε_i″ and ε_j″ are correlated. We assume the gaussian correlation

    ⟨ε_i″ ε_j″⟩ = b e^{−(c_i − c_j)²/2B²},    (2.4)

where ⟨·⟩ denotes expectation over many trials. The noise then satisfies

    ⟨ε_i⟩ = 0,    (2.5)
    ⟨ε_i ε_j⟩ = A_ij,    (2.6)

where the covariance matrix is given by

    A_ij = (1 − b) δ_ij + b e^{−(c_i − c_j)²/2B²}.    (2.7)

The correlation strength b satisfies 0 ≤ b ≤ 1, and the width B is called the effective correlation length. The model captures the fact that the correlation strength between neurons decreases with the dissimilarity in their preferred stimuli |c_i − c_j|. The encoding process of population coding is fully specified by the conditional probability density of r when stimulus x is given,

    Q(r|x) = [1/√((2πσ²)^N det(A))] exp[−(1/2σ²) Σ_ij (A^{−1})_ij (r_i − f_i(x))(r_j − f_j(x))].    (2.8)
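The covariance matrix of equation 2.7 is straightforward to build and sample from; the sketch below (editorial) also shows that each neuron keeps unit noise variance, since (1 − b) + b = 1 on the diagonal:

```python
import numpy as np

def covariance(c, b=0.5, B=1.0):
    """Equation 2.7: A_ij = (1-b) delta_ij + b exp(-(c_i-c_j)^2 / 2B^2),
    with correlation strength b and effective correlation length B."""
    d = c[:, None] - c[None, :]
    return (1.0 - b) * np.eye(len(c)) + b * np.exp(-d ** 2 / (2.0 * B ** 2))

c = np.linspace(-5.0, 5.0, 101)
A = covariance(c)
print(A[0, 0], round(A[0, 1], 3))   # unit variance; nearby neurons correlated

# Correlated noise for equation 2.1: eps ~ N(0, A) via a Cholesky factor.
rng = np.random.default_rng(2)
eps = np.linalg.cholesky(A) @ rng.standard_normal(len(c))
```

The matrix is positive definite for b < 1 because the gaussian kernel part is positive semidefinite and the diagonal term adds 1 − b, so the Cholesky factorization always succeeds.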
3 From Discrete to Continuous
Because we are interested only in the case when the number N of neurons is large (more accurately, when the neuron density is large), it is useful to extend the discrete model to the continuous case. Mathematically, there are many benefits to working with a continuous neural field model (Amari, 1977; Giese, 1999). For example, as shown later, many operations can be done much more easily in the Fourier domain due to the continuous extension.
Let us consider a one-dimensional neural field in which neurons are located with uniform density ρ. The activity of the neuron at position c is denoted by r(c). When stimulus x is applied, the neural response is given by

    r(c) = f(c − x) + σ ε(c),    (3.1)

where the quantities r(c), f(c − x), and ε(c) are the counterparts of r_i, f_i, and ε_i in the discrete version. The tuning function f(c − x) has the same form as f_i(x), except that c_i is replaced by c. The noise term ε(c) satisfies

    ⟨ε(c)⟩ = 0,    (3.2)
    ⟨ε(c) ε(c′)⟩ = h(c, c′)/ρ²,    (3.3)

where h(c, c′)/ρ² is the covariance function of the noise. We assume that the covariance function has the same form as in the discrete case,

    h(c, c′) = D₁ (1 − b) δ(c − c′) + D₂ b e^{−(c − c′)²/2B²},    (3.4)

where δ(c − c′) is the delta function. In order to determine the coefficients D₁ and D₂, we use the correspondence principle: the covariance matrix A_ij and the correlation function h(c, c′) correspond to each other through the quadratic form for an arbitrary vector k = (k_i) and its continuous version k(c),

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} k(c) h(c, c′) k(c′) dc dc′ = Σ_ij k_i A_ij k_j.    (3.5)

Substituting equations 2.7 and 3.4 into 3.5, we get (see appendix A)

    h(c, c′) = ρ (1 − b) δ(c − c′) + ρ² b e^{−(c − c′)²/2B²}.    (3.6)

Therefore, the continuous form of the encoding process is

    Q(r|x) = (1/Z) exp{−(ρ²/2σ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} [r(c) − f(c − x)] h^{−1}(c, c′) [r(c′) − f(c′ − x)] dc dc′},    (3.7)

where r = {r(c)}, and Z is the normalization factor. The function h^{−1}(c, c′) is the inverse kernel of h(c, c′), satisfying

    ∫_{−∞}^{∞} h^{−1}(c, c′) h(c′, c″) dc′ = δ(c − c″).    (3.8)
The above correlation model contains a number of important forms, depending on the correlation length:

No correlation: When the correlation length B = 0, neurons are uncorrelated, with the covariance function h(c, c′) = ρ (1 − b) δ(c − c′), or the correlation matrix A_ij = (1 − b) δ_ij.

Local correlation: When the correlation length is of order 1/ρ, that is, B = m/ρ for a fixed m, neurons are correlated only within m neighboring neurons. When the density ρ is large, neurons are correlated only extremely locally.

Short-range correlation: When the correlation length is much longer than 1/ρ but shorter than √2 times the width of the tuning function, that is, 1/ρ ≪ B < √2 a, neurons are correlated over a short range.¹

Wide-range correlation: When B ≥ √2 a, neurons are correlated over a wide range.

Uniform correlation: When the correlation length B → ∞, neurons are uniformly correlated with strength b, that is, h(c, c′) = ρ (1 − b) δ(c − c′) + ρ² b, or A_ij = (1 − b) δ_ij + b.

4 The Fisher Information
The Fisher information is a useful measure in the study of population coding. Knowing the Fisher information, we have an idea of the minimum error one may achieve. The Fisher information for the encoding model Q(r|x) is defined as

    I_F(x) = −∫ Q(r|x) [d² ln Q(r|x)/dx²] dr.    (4.1)

From equation 3.7, we get

    I_F(x) = (ρ²/σ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f′(c − x) h^{−1}(c, c′) f′(c′ − x) dc dc′,    (4.2)

where f′(c − x) = df(c − x)/dx. Note that I_F(x) does not depend on x, because of the homogeneity of the field. The Fourier transform of a function g(t) is defined as

    F[g(t)] = (1/√(2π)) ∫_{−∞}^{∞} e^{−iωt} g(t) dt.    (4.3)
¹ We consider only the case a ≫ 1/ρ, as suggested by experiments; that is, neurons are widely tuned to the stimulus.
By using the Fourier transform, equation 4.2 becomes

    I_F(x) = [ρ²/(2πσ²)] ∫_{−∞}^{∞} ω² F(ω)²/H(ω) dω,    (4.4)

where F(ω) = F[f(c − x)] and H(ω) = F[h(c − c′)]. Here, we use the relations F[f′(c − x)] = iω F(ω) and F[h^{−1}(c, c′)] = 1/H(ω). From equations 2.2 and 3.6,

    F(ω) = e^{−a²ω²/2},    (4.5)
    H(ω) = ρ (1 − b) + ρ² √(2π) b B e^{−B²ω²/2},    (4.6)

where we put x = 0 without loss of generality. Therefore,

    I_F(x) = [ρ²/(2πσ²)] ∫_{−∞}^{∞} ω² e^{−a²ω²} / [ρ (1 − b) + ρ² √(2π) b B e^{−B²ω²/2}] dω.    (4.7)
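Equation 4.7 can be evaluated by direct numerical integration. The sketch below (editorial; parameters chosen to match Figure 1) reproduces both the closed form ρ/[4√π a³σ²(1 − b)] at B = 0 and the dip-and-rise of the Fisher information as the correlation width grows:

```python
import numpy as np

def fisher_info(B, rho=100.0, a=1.0, sigma=1.0, b=0.5):
    """Numerical evaluation of equation 4.7."""
    w = np.linspace(-20.0, 20.0, 200001)
    dw = w[1] - w[0]
    H = (rho * (1.0 - b)
         + rho ** 2 * np.sqrt(2.0 * np.pi) * b * B * np.exp(-(B * w) ** 2 / 2.0))
    integrand = w ** 2 * np.exp(-(a * w) ** 2) / H
    return rho ** 2 / (2.0 * np.pi * sigma ** 2) * float(np.sum(integrand)) * dw

# B = 0: no correlation; matches rho / (4 sqrt(pi) a^3 sigma^2 (1 - b)).
print(round(fisher_info(0.0), 2))   # about 28.2 for rho = 100
# A moderate width suppresses I_F; a large width lets it grow again (Figure 1).
print(round(fisher_info(1.0), 2), round(fisher_info(5.0), 2))
```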
Let us analyze the behavior of the Fisher information in the various regimes:

No correlation: The Fourier transform of the covariance function is H(ω) = ρ (1 − b), and the Fisher information I_F(x) = ρ/[4√π a³ σ² (1 − b)] increases in proportion to the neuron density ρ. Since each neuron carries independent information about x and the total Fisher information is their sum, I_F increases in proportion to ρ.

Local correlation: H(ω) = ρ (1 − b + √(2π) m b e^{−m²ω²/2ρ²}) is of order ρ. In this case, the total Fisher information I_F is not a simple sum over the component neurons, but the correlations decay so quickly that the total information is still proportional to ρ.

Short-range correlation: For large ρ, H(ω) is of order ρ², so that we have

    H(ω) ≈ ρ² √(2π) b B e^{−B²ω²/2}.    (4.8)

Hence, I_F is written as

    I_F ≈ [1/((2π)^{3/2} b B σ²)] ∫ ω² e^{(−a² + B²/2) ω²} dω.    (4.9)

Since B < √2 a, the integral converges to a constant. Hence, I_F remains finite even when ρ goes to infinity. This result agrees with Abbott and Dayan (1999) and Yoon and Sompolinsky (1999) for the additive noise distribution.
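The saturation is easy to see numerically (an editorial sketch; B = 1 and the other parameters match Figure 2c): increasing ρ fivefold barely changes I_F, which approaches the ρ-independent value implied by equation 4.9.

```python
import numpy as np

def fisher_info_short(rho, B=1.0, a=1.0, sigma=1.0, b=0.5):
    # Equation 4.7, evaluated numerically for a short-range correlation width.
    w = np.linspace(-20.0, 20.0, 200001)
    dw = w[1] - w[0]
    H = (rho * (1.0 - b)
         + rho ** 2 * np.sqrt(2.0 * np.pi) * b * B * np.exp(-(B * w) ** 2 / 2.0))
    return (rho ** 2 / (2.0 * np.pi * sigma ** 2)
            * float(np.sum(w ** 2 * np.exp(-(a * w) ** 2) / H)) * dw)

# Fivefold increase in density, nearly unchanged Fisher information.
print(round(fisher_info_short(200.0), 3), round(fisher_info_short(1000.0), 3))
```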
Wide-range correlation: When B ≥ √2 a, the integral in equation 4.9 diverges, and we cannot use this approximation. If we evaluate the integral more accurately by taking the term of order 1/ρ into account, we have

    I_F = [1/(2πσ²)] ∫ ω² e^{−a²ω²} / [(1/ρ)(1 − b) + √(2π) b B e^{−B²ω²/2}] dω
        = [ρ/(2πσ²)] ∫ ω² / [(1 − b) e^{a²ω²} + √(2π) ρ b B e^{(a² − B²/2) ω²}] dω.    (4.10)

Hence, I_F increases with ρ for large ρ (see Figure 2d).
Uniform correlation: H(ω) = ρ (1 − b), and I_F(x) = ρ/[4√π a³ σ² (1 − b)] is proportional to ρ, with the same value as in the uncorrelated case.² In this case, the noise is decomposed as (Wu, Nakahara, et al., 2000; Wu et al., 2001)

    ε(c) = ε′(c) + ε″,    (4.11)

where ε′(c) is independent noise, ⟨ε′(c) ε′(c′)⟩ = (1 − b) δ(c − c′)/ρ, and ε″ is common to all neurons, ⟨(ε″)²⟩ = b. The term ε″ increases all r(c) by a common amount σ ε″, which has no effect on decoding x.
Special case with b = 1: In order to make the situation clear, we consider this special case. When B < √2 a, the Fisher information is constant, not depending on ρ, as in the case of short-range correlation. However, when B ≥ √2 a, the Fisher information diverges to ∞, as is seen from equation 4.7. This implies that x can be decoded accurately. In this case, the noise ε(c) is so strongly correlated among neurons in an interval of length B that ε(c) can be regarded as constant in this range. Since the range of width B covers the effective range a of the tuning function, the noise merely shifts the response to r(c) = f(c − x) + σ ε″, so that we can decode x from r(c) as if there were no noise.

Figure 1 shows how the Fisher information behaves as the correlation width changes (with fixed neural density): it first decreases drastically while the correlation width is small and then increases again when the width is large. Figure 2 shows the asymptotic behavior of the Fisher information in the different correlation cases as the density ρ increases.

When a multiplicative noise model is used, the Fisher information always increases with the neural density ρ (Wilke & Eurich, in press). This is important because multiplicative noise is believed to be more biologically plausible. Our method can be extended to such a case.

² This conclusion is different from that in Abbott and Dayan (1999), where the uncorrelated case is defined as A_ij = δ_ij (by setting b = 0) instead of the A_ij = (1 − b) δ_ij used here (by setting B = 0).

Figure 1: The Fisher information as a function of the correlation width B, for neural densities ρ = 50, 100, and 200. The parameters are a = 1, σ = 1, and b = 0.5.

5 Population Decoding
The Fisher information tells us only the optimal decoding accuracy that an unbiased estimator can achieve. When a practical decoding method is considered, its performance needs to be evaluated individually, depending on the decoding model. We compare three decoding methods in this study. All are formulated as the MLI type, that is, as maximizers of a likelihood function, but they differ in the probability models used for decoding.

5.1 Three Decoding Methods. An MLI-type estimator x̂ is obtained through maximization of the presumed log likelihood ln P(r|x), that is, by solving

\[
\nabla \ln P(\mathbf{r}\mid\hat{x}) = 0,
\qquad (5.1)
\]
Population Coding and Decoding in a Neural Field    1009

Figure 2: The asymptotic behavior of the Fisher information in different correlation cases. The parameters are a = 1, σ = 1, and b = 0.5. (a) No correlation (b = 0) and uniform correlation (b = 100, approximated as infinity); the two curves coincide. (b) Limited-range correlations, with b = 1/ρ and b = 2/ρ, respectively. (c) Short-range correlations, b = 0.8 and 1. (d) Wide-range correlations, b = 2 and 2.5.
where ∇k(x) denotes dk(x)/dx. P(r|x) is called the decoding model, which can be different from the real encoding model Q(r|x). This is because the decoding system usually does not know the encoding system exactly. Moreover, a simple and robust decoding model is computationally desirable. We consider three decoding models, defined as follows:

1. The conventional MLI, referred to as FMLI (MLI based on the faithful model), uses all of the encoding information; that is, the decoding model is the true encoding model,

\[
P_F(\mathbf{r}\mid x) = Q(\mathbf{r}\mid x).
\qquad (5.2)
\]
2. The UMLI method (MLI based on an unfaithful model) (Wu, Nakahara, et al., 2000) uses the information on the shape of the tuning function but neglects the neural correlation, so that the probability density

\[
P_U(\mathbf{r}\mid x) = \frac{1}{Z_U}\exp\left\{-\frac{\rho}{2\sigma^2}\int_{-\infty}^{\infty}\left[r(c) - f(c - x)\right]^2 dc\right\}
\qquad (5.3)
\]

is used for decoding.

3. The center-of-mass (COM) method does not use any information about the encoding process; instead, it assumes an incorrect but simple tuning function. It also disregards correlations, using

\[
P_C(\mathbf{r}\mid x) = \frac{1}{Z_C}\exp\left\{-\frac{\rho}{2\sigma^2}\int_{-\infty}^{\infty}\left[r(c) - \tilde{f}(c - x)\right]^2 dc\right\},
\qquad (5.4)
\]

where \(\tilde{f}(c - x) = -(x - c)^2 + \text{const}\) is used as a presumed tuning function.³ It is easy to check that the third method is equivalent to the conventional COM decoding strategy (Baldi & Heiligenberg, 1988; Georgopoulos et al., 1982; Wu, Nakahara, et al., 2000),⁴ with the solution given by

\[
\hat{x} = \frac{\int_{-\infty}^{\infty} c\, r(c)\, dc}{\int_{-\infty}^{\infty} r(c)\, dc}.
\qquad (5.5)
\]

We should point out that the use of an unfaithful model (e.g., UMLI or COM) has an important implication. When experimental scientists reconstruct the stimulus from recorded data, they in fact use an unfaithful model, since the real encoding process is never known. Furthermore, real neural correlations are often complex and may change constantly over time, so it is hard, if possible at all, for the brain to store and use all of this information. MLI based on an unfaithful model (that is, neglecting part of the information) is a key to overcoming this problem.

5.2 Performance of Decoding and Asymptotic Efficiency. Since the three methods are of the same type, their decoding errors can be calculated in similar ways. We show only the derivation of the decoding error for UMLI; for FMLI, see appendix B.

³ In the strict sense, \(\tilde{f}(c - x)\) cannot be a tuning function, since its value diverges as (x − c) goes to infinity. This does not matter in practice, however, since (x − c) is always restricted to a finite region when COM is used. See Snippe (1996), Wu, Nakahara, et al. (2000), and Wu et al. (2001).

⁴ This explains why COM performs comparably well to MLI when neurons are uncorrelated and a cosine tuning function is used (Salinas & Abbott, 1994): the cosine function, for example, cos[(x − c)/T], has an approximately quadratic form when (x − c) is restricted to a small region.
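To make the estimators concrete, the following sketch (our illustration, not the authors' code; the grid-search minimizer and all parameter values are choices of ours) implements the COM estimator of equation 5.5 and UMLI as a least-squares fit of the Gaussian tuning shape, applied to simulated uncorrelated responses r_i = f(c_i − x) + σξ_i.

```python
import numpy as np

a, sigma, L = 1.0, 0.1, 3.0           # tuning width, noise level, field half-width (our choices)
c = np.linspace(-L, L, 601)            # preferred stimuli
f = lambda u: np.exp(-u**2 / (2 * a**2)) / (np.sqrt(2 * np.pi) * a)  # normalized Gaussian tuning

rng = np.random.default_rng(0)
x_true = 0.2
r = f(c - x_true) + sigma * rng.standard_normal(c.size)   # uncorrelated encoding model

# COM, equation 5.5: a one-shot ratio of moments.
x_com = np.sum(c * r) / np.sum(r)

# UMLI, equation 5.3: maximizing ln P_U is a least-squares fit of the true tuning shape.
grid = np.linspace(-1.0, 1.0, 2001)
sq_err = [np.sum((r - f(c - x))**2) for x in grid]
x_umli = grid[np.argmin(sq_err)]

print(x_com, x_umli)   # both close to x_true; UMLI is typically the more accurate
```

COM needs no optimization at all, which is the computational appeal noted later in this section; UMLI pays for its better accuracy with a one-dimensional search.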
For convenience, two notations are introduced: E_Q[k(r, x)] and V_Q[k(r, x)] denote, respectively, the mean and variance of k(r, x) with respect to the distribution Q(r|x). Suppose x̂ is close enough to x. We expand ∇ln P_U(r|x̂) at x,

\[
\nabla \ln P_U(\mathbf{r}\mid\hat{x}) \simeq \nabla \ln P_U(\mathbf{r}\mid x) + \nabla\nabla \ln P_U(\mathbf{r}\mid x)(\hat{x} - x).
\qquad (5.6)
\]

Since the estimator x̂ satisfies ∇ln P_U(r|x̂) = 0,

\[
\frac{1}{\rho}\nabla\nabla \ln P_U(\mathbf{r}\mid x)(\hat{x} - x) \simeq -\frac{1}{\rho}\nabla \ln P_U(\mathbf{r}\mid x).
\qquad (5.7)
\]

We put

\[
R = \frac{1}{\rho}\nabla \ln P_U(\mathbf{r}\mid x) = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\left[r(c) - f(c - x)\right]f'(c - x)\, dc = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\varepsilon(c)\, f'(c - x)\, dc,
\qquad (5.8)
\]

\[
S = \frac{1}{\rho}\nabla\nabla \ln P_U(\mathbf{r}\mid x) = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\left[r(c) - f(c - x)\right]f''(c - x)\, dc + \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\left[f'(c - x)\right]^2 dc = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\varepsilon(c)\, f''(c - x)\, dc + D,
\qquad (5.9)
\]

where

\[
D = \frac{1}{4\sqrt{\pi}\, a^3 \sigma^2}.
\qquad (5.10)
\]

Then the estimating equation is

\[
S(\hat{x} - x) = -R
\qquad (5.11)
\]

or

\[
\hat{x} - x = -\frac{R}{S}.
\qquad (5.12)
\]

Here, both R and S are random variables depending on ε(c). It is easy to show that

\[
E_Q[R] = 0,
\qquad (5.13)
\]

\[
E_Q[S] = \frac{1}{4\sqrt{\pi}\, a^3 \sigma^2}.
\qquad (5.14)
\]

Their variances are given, by using Fourier transforms, as

\[
V_Q[R] = \frac{1}{2\pi\rho^2\sigma^2}\int_{-\infty}^{\infty}\omega^2 F(\omega)^2 H(\omega)\, d\omega,
\qquad (5.15)
\]

\[
V_Q[S] = \frac{1}{2\pi\rho^2\sigma^2}\int_{-\infty}^{\infty}\omega^4 F(\omega)^2 H(\omega)\, d\omega.
\qquad (5.16)
\]
These show that R is a zero-mean gaussian random variable. The random variable S is composed of two terms, the first one being random and the second term, D, being a fixed constant (see the right-hand side of equation 5.9).

Remark. The above procedure is the standard way to analyze the asymptotic error of estimation. In the standard statistical model with repeated independent observations, that is, the independent and identically distributed (i.i.d.) case, R is gaussian, while the constant term D dominates over the random one because of the law of large numbers. In particular, when the faithful model is used (FMLI), D is the Fisher information, and so is V_Q[R]. Therefore, the asymptotic error is given by 1/N times the inverse of the Fisher information, where N is the number of observations. This shows that the Cramér-Rao bound is asymptotically attained. Such an estimator is said to be asymptotically efficient or Fisher efficient. When an unfaithful model is applied (still in the i.i.d. case), V_Q[R] and D are different. Hence, Fisher efficiency is not attained in general (Akahira & Takeuchi, 1981; Murata, Yoshizawa, & Amari, 1994). However, the asymptotic gaussianity of the estimator and the 1/N convergence of the error are guaranteed. In this article, we say that such an estimator is quasi-asymptotically efficient or quasi-Fisher efficient.

Apart from the above two standard cases, population coding includes a third one (because of correlation), in which the estimator is not subject to the gaussian distribution, although the asymptotic error may be small (Wu et al., 2001). This unusual case occurs when the random term in S is not negligible compared with the constant D. In such a situation, the estimator is represented by a ratio of two gaussian random variables, so it is subject to a Cauchy-type distribution. The 1/N convergence of the error does not hold either. Such an estimator is said to be non-Fisherian, since the Cramér-Rao paradigm using the Fisher information does not hold. This is a new fact in the population coding literature.

Let us return to calculating the UMLI decoding error. From equations 5.10 and 5.16, we see that the constant term in S dominates over the random one
in two cases:

1. H(ω) is of order ρ. In this case, the random term is of O(1/ρ), and D is of O(1).

2. H(ω) is of order ρ², but the noise variance σ² is sufficiently small. In this case, the random term is of O(1/σ), and the constant term D is of O(1/σ²).

The first case corresponds to the uncorrelated, local-range, and uniformly correlated cases. The second case holds for the short- and wide-range correlations with small noise.⁵ In these cases, we may neglect the random term in S, so that we have asymptotically

\[
S \approx D,
\qquad (5.17)
\]

\[
\hat{x} - x \approx -\frac{R}{D},
\qquad (5.18)
\]

which is normally distributed with zero mean and variance

\[
E_Q[(\hat{x} - x)^2] = \frac{8 a^6 \sigma^2}{\rho^2}\int_{-\infty}^{\infty}\omega^2 F(\omega)^2 H(\omega)\, d\omega.
\qquad (5.19)
\]

The above equation holds asymptotically when ρ → ∞ and H(ω) is of order ρ, or when σ → 0.

Consider now the cases of short- and wide-range correlations when the noise is strong. In these cases, since H(ω) is of order ρ², the random and constant terms in the variable S are of the same order. It is then rather difficult to analyze the behavior of x̂. Equation 5.12 shows that x̂ − x is a ratio of two gaussian random variables, so that its distribution is of the Cauchy type, which means that UMLI is not Fisherian. We may note that in practice the neural field is finite, so the variance of the decoding error does not diverge even in the Cauchy case; but the variance is then no longer an adequate measure of the decoding accuracy.

A similar analysis is applicable to FMLI (see appendix B) and COM. It turns out that they have the same asymptotic behaviors. Table 1 summarizes the asymptotic behaviors of the three decoding methods (the special case of weak noise, in which the MLI type of method is always approximately asymptotically or quasi-asymptotically efficient, is not included) and of the Fisher information in the different correlation cases.
5 Note that the weak noise assumption is used in Deneve et al. (1999) and Pouget et al. (1998) to get their results.
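The non-Fisherian regime can be illustrated numerically: when the denominator of equation 5.12 is dominated by its random part rather than by the constant D, the estimator is a ratio of two gaussians and its tails are far heavier than gaussian. A small sketch (ours, with arbitrary unit variances for both terms):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
R = rng.standard_normal(n)      # random numerator, as in equation 5.12
S = rng.standard_normal(n)      # fully random denominator: the non-Fisherian regime
est_cauchy = -R / S             # ratio of two gaussians: Cauchy-type distribution
est_gauss = -R / 1.0            # denominator dominated by a constant D = 1: gaussian regime

# For a standard Cauchy variable, P(|X| > 10) is about 0.064;
# for a standard gaussian it is astronomically small.
print(np.mean(np.abs(est_cauchy) > 10), np.mean(np.abs(est_gauss) > 10))
```

The heavy-tailed sample also shows why the empirical variance is no longer an adequate accuracy measure in this regime: a few extreme trials dominate it.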
Table 1: The Asymptotic Behaviors of the Fisher Information and the Three MLI-Type Decoding Methods.

                              I_F           FMLI     UMLI     COM
No correlation                ∝ ρ           FE       QFE      QFE
Local correlation             ∝ ρ           FE       QFE      QFE
Short-range correlation       Saturating    Non-F    Non-F    Non-F
Wide-range correlation        ∝ ρ           Non-F    Non-F    Non-F
Uniform correlation           ∝ ρ           FE       QFE      QFE

Notes: The special case of weak noise is excluded. FE = Fisher efficient; QFE = quasi-Fisher efficient; Non-F = non-Fisherian.
Table 2: Comparing the Decoding Errors of FMLI, UMLI, and COM When H(ω) Is of Order ρ.

         b → ∞                       b = 0                       b = m/ρ
FMLI     4√π a³σ²(1 − b)/ρ           4√π a³σ²(1 − b)/ρ           4√π a³σ²[1 + (√(2π) m − 1)b]/ρ
UMLI     4√π a³σ²(1 − b)/ρ           4√π a³σ²(1 − b)/ρ           4√π a³σ²[1 + (√(2π) m − 1)b]/ρ
COM      18 a³σ²(1 − b)/ρ            18 a³σ²(1 − b)/ρ            18 a³σ²[1 + (√(2π) m − 1)b]/ρ
When FMLI and COM are asymptotically efficient, their decoding errors are calculated to be

\[
E_Q[(\hat{x} - x)^2_{\mathrm{FMLI}}] \sim \frac{2\pi\sigma^2}{\rho^2 \int_{-\infty}^{\infty} \omega^2 F(\omega)^2 / H(\omega)\, d\omega} = \frac{1}{I_F},
\qquad (5.20)
\]

\[
E_Q[(\hat{x} - x)^2_{\mathrm{COM}}] \sim \frac{\sigma^2}{\rho^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} c\, h(c, c')\, c'\, dc\, dc',
\qquad (5.21)
\]
respectively.

5.2.1 Performance Comparison. We compare the performances of the three decoding methods. Table 2 lists their decoding errors when H(ω) is of order ρ.⁶ We see that UMLI and FMLI have comparable performances; both are much better than COM.

Figure 3 compares the decoding errors of FMLI, UMLI, and COM in the case of nonlocal-range correlation and weak noise (note that FMLI, UMLI, and COM are asymptotically or quasi-asymptotically efficient in this case

⁶ Two conditions are used to obtain the results in Table 2. (1) To calculate the decoding error of COM, we need to restrict the preferred stimulus c to a finite range [−L, L]; otherwise, the error diverges. This restriction amounts to sampling only those neurons that are active enough. The approximation benefits only COM and does not much affect the results of FMLI and UMLI (see Wu, Nakahara, et al., 2000; Wu et al., 2001). In this article, we choose L = 3a. (2) To get the results for the case of b = m/ρ, we use the condition a ≫ 1/ρ.
Figure 3: Comparing the decoding errors of FMLI, UMLI, and COM in the case of strong correlation and weak noise. The unit of decoding error in the figure is σ². Parameters a = 1 and b = 0.5. (a) Decoding errors as a function of the neuron density ρ, with b = 1. (b) Decoding errors as a function of the correlation length b, with neural density ρ = 50.
because of weak noise). The figure shows that UMLI has a larger error than FMLI and a lower error than COM. When the correlation covers a short range of the population, the decoding errors of the three methods saturate as the neural density increases (see Figure 3a); this is similar to the behavior of the Fisher information. For a fixed neural density ρ, the decoding errors of the three methods first increase with the correlation length and then decrease (see Figure 3b). This is understandable, since the two extremes correspond to the cases in which neurons are either uncorrelated or uniformly correlated.

The computational complexity of the three methods can be roughly compared as follows. Consider maximizing the log likelihood of UMLI and FMLI by the standard gradient-descent method. The amounts of computation for obtaining the derivative of the log likelihood of UMLI and FMLI are proportional to N and N², respectively; UMLI is significantly simpler than FMLI when N is large. For COM, due to the quadratic form of the presumed tuning function, the estimation can be done in one shot by equation 5.5. Therefore, UMLI is a good compromise between decoding accuracy and computational complexity.
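The b = 0 column of Table 2 can be checked by direct simulation. The sketch below is our illustration (the parameter values and the grid-search optimizer are arbitrary choices of ours): it draws uncorrelated responses, decodes each trial with UMLI and COM, and compares the empirical variances with the predictions 4√π a³σ²/ρ and 18a³σ²/ρ.

```python
import numpy as np

a, sigma, L, N = 1.0, 0.1, 3.0, 600
c = np.linspace(-L, L, N)
rho = N / (2 * L)
f = lambda u: np.exp(-u**2 / (2 * a**2)) / (np.sqrt(2 * np.pi) * a)

rng = np.random.default_rng(2)
grid = np.linspace(-0.3, 0.3, 601)
tuning = f(c[None, :] - grid[:, None])           # precomputed tuning curves on the grid

err_umli, err_com = [], []
for _ in range(300):
    r = f(c) + sigma * rng.standard_normal(N)    # true stimulus x = 0, no correlation
    err_umli.append(grid[np.argmin(np.sum((r - tuning)**2, axis=1))])
    err_com.append(np.sum(c * r) / np.sum(r))

var_umli, var_com = np.var(err_umli), np.var(err_com)
th_umli = 4 * np.sqrt(np.pi) * a**3 * sigma**2 / rho   # Table 2, b = 0
th_com = 18 * a**3 * sigma**2 / rho
print(var_umli, th_umli)
print(var_com, th_com)
```

In a typical run, both empirical variances land close to their theoretical values, with COM roughly 2.5 times worse than UMLI, as the ratio 18/(4√π) predicts.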
6 Network Implementation
So far, we have been concerned only with the accuracy and simplicity of decoding methods. To be realistic, it is essential for the strategy to be biologically achievable. We investigate implementing the three methods with a recurrent network, following the ideas in Deneve et al. (1999), Pouget and Zhang (1997), and Pouget et al. (1998).

UMLI is studied first. Let us consider a fully connected, one-dimensional, homogeneous neural field, in which c denotes the position coordinate. Let U_c denote the (average) internal state of the neurons at c and W_{c,c′} the recurrent connection weight between the neurons at c and those at c′. We propose that the dynamics of neural excitation are governed by

\[
\frac{dU_c}{dt} = -U_c + \int W_{c,c'} O_{c'}\, dc' + I_c,
\qquad (6.1)
\]

where

\[
O_c = \frac{U_c^2}{1 + \mu \int U_{c'}^2\, dc'}.
\qquad (6.2)
\]

Here O_c is the activity of the neurons at c, and I_c is the external input arriving at c. The recurrent interaction is assumed to be

\[
W_{c,c'} = e^{-(c - c')^2 / 2a^2}.
\qquad (6.3)
\]

We first look at the network dynamics when there is no external input, I_c = 0. It is not difficult to check that the network has a one-parameter family of nontrivial steady states,

\[
\tilde{O}_c = A e^{-(c - z)^2 / 2a^2},
\qquad (6.4)
\]

with the corresponding internal state

\[
\tilde{U}_c = B e^{-(c - z)^2 / 4a^2},
\qquad (6.5)
\]

in which z is a free parameter and the coefficients A and B are easily determined. The parameter z denotes the peak position of the population activity {Õ_c}. Note that the stable state has a shape similar to that of the tuning function (see equation 2.2) in this case. Equation 6.4 or 6.5, parameterized by z, defines a line attractor for the network, on which the system is neutrally stable. This is due to the translation invariance of the network interactions (Amari, 1977; Deneve et al., 1999; Giese, 1999; Seung, 1996; Zhang, 1996).
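The relaxation of equations 6.1 to 6.3 can be sketched by Euler integration on a discretized field. This is our illustration, not the paper's simulation: the constants (μ, time step, initial amplitude, input strength) are choices of ours. With the divisive normalization of equation 6.2 the bump amplitude has an unstable lower branch, so the initial state must be scaled above it for the network to settle on a nontrivial bump.

```python
import numpy as np

a, mu, sigma, eps = 1.0, 0.01, 0.1, 0.05
c = np.linspace(-3.0, 3.0, 101)
dc = c[1] - c[0]
W = np.exp(-(c[:, None] - c[None, :])**2 / (2 * a**2))      # equation 6.3

f = np.exp(-c**2 / (2 * a**2)) / (np.sqrt(2 * np.pi) * a)    # tuning, true stimulus x = 0
rng = np.random.default_rng(3)
r = f + sigma * rng.standard_normal(c.size)                  # noisy population activity

U = 4.0 * r                                                  # initial state set from the data
dt, steps = 0.05, 600
for _ in range(steps):
    O = U**2 / (1 + mu * np.sum(U**2) * dc)                  # equation 6.2
    U = U + dt * (-U + W @ O * dc + eps * r)                 # equation 6.1, small persistent input

O = U**2 / (1 + mu * np.sum(U**2) * dc)
z = c[np.argmax(O)]                                          # decoded stimulus: peak of the bump
print(z)   # close to the true stimulus 0
```

After relaxation the noisy initial profile is replaced by a smooth bump of the attractor shape (equation 6.4), whose peak position plays the role of the estimator, as in Figure 4a.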
The decoded estimator x̂ is given by the peak position of the final population activity O_c, that is, by the value of z. To achieve this, one considers a transient external input, I_c ∼ r_c δ(t), which stimulates the network so that its initial state is set equal to the original noisy data, O_c(0) = r_c (Pouget et al., 1998; Deneve et al., 1999). After relaxation, the system reaches the desired state, in which the noise is smoothed out and the peak position gives x̂. However, if I_c disappears, the attractor of the field is only neutrally stable, so that the peak position may fluctuate. Hence, we consider that a small input I_c = εr_c persists after the initial state is set.⁷ We further assume that the input is sufficiently small (ε is small enough) that the change it brings to the form of the stable state (see equation 6.4) is negligible.

Rather than getting trapped in complicated mathematical calculations, we adopt an approximate but simple way of understanding the above network dynamics, backed up by simulation results. Since the steady state of the network is assumed always to lie on the line attractor, with its position determined by the input independently of the initial value, we can, without risk of losing any information, regard the network as being on the line attractor from the beginning, at a random position, and evolving along it until it reaches a stable position. This modified dynamic picture simplifies the relationship between O_c and U_c to

\[
O_c = D U_c^2,
\qquad (6.6)
\]

where equations 6.4 and 6.5 are used, and D = A/B². For the dynamics specified by equations 6.1 and 6.6, there exists a Lyapunov function (Cohen & Grossberg, 1983),

\[
L = -\frac{1}{2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} W_{c,c'}\, O_c O_{c'}\, dc\, dc' + \int_{-\infty}^{\infty}\int_0^{U_c} z g'(z)\, dz\, dc - \varepsilon\int_{-\infty}^{\infty} r_c O_c\, dc,
\qquad (6.7)
\]

with the function g(z) = Dz². We consider a two-step perturbation procedure to minimize the Lyapunov function. Minimizing the first two terms of equation 6.7, which are
⁷ This mechanism was proposed in Amari (1977) and Pouget and Zhang (1997).
Figure 4: Performance of the recurrent network. The parameters are a = 1, μ = 0.5, σ² = 0.01, and b = 0.5. (a) Typical states of the network before and after relaxation (tuning function, initial state, and steady state, plotted against the preferred stimulus); the steady state is scaled to match the tuning function. (b) Comparing the decoding errors of the network estimation and UMLI. The results are obtained after averaging over 100 trials.
of order 1, determines that the stable state lies on the line attractor. Minimizing the third term, which is of order ε, determines the peak position of the stable state. A justification of this two-step perturbation procedure is given in appendix C. The two-step optimization is equivalent to

\[
\min_{z}\; -\int_{-\infty}^{\infty} r_c O_c\, dc,
\qquad \text{subject to } O_c = A e^{-(c - z)^2 / 2a^2}.
\qquad (6.8)
\]
Recall that the solution of UMLI is given by

\[
\max_{x}\; \ln P_U(\mathbf{r}\mid x) = \frac{\rho}{\sigma^2}\int_{-\infty}^{\infty} r_c f(c - x)\, dc + \text{terms not depending on } x.
\qquad (6.9)
\]

Comparing equation 6.8 with equations 6.9 and 2.2, we see that the final state of the network gives the same estimator as UMLI.

The above result is confirmed by a simulation experiment (see Figure 4), performed with 101 neurons uniformly distributed in the region [−3, 3] and the true stimulus set to zero. Figure 4a shows the
typical behaviors of the recurrent network before and after relaxation. The steady state of the network is smooth after the noise in the initial state has been cleaned out. To quantify the agreement between the two methods, we measure a quantity t, defined as t = ⟨(x̂ − ẑ)²⟩ / √(V_Q[x̂] V_Q[ẑ]), where x̂ and ẑ denote the estimates of UMLI and the network, respectively, V_Q[x̂] and V_Q[ẑ] are their variances, and ⟨·⟩ represents averaging over many trials. This quantity measures the statistical difference between the two estimators: the smaller the value of t, the better the two methods agree. In the extreme case of x̂ = ẑ on every trial, t is zero; if the two estimators are completely independent of each other and have zero means, as in this example, t is 2. Figure 4b shows that t is quite small in all correlation cases (the largest value is 0.032), which means that the network estimation can be regarded as the same as that of UMLI.

In a similar way, we can show that FMLI can be implemented by the same recurrent neural field (since both methods use the same tuning function to fit the data) but with a different external input, I_c = ε∫h⁻¹(c, c′) r_{c′} dc′ (see appendix D). This result differs from that in Deneve et al. (1999), where FMLI and UMLI are implemented by using different recurrent interactions. To implement COM, the external input is the same as that in UMLI (since both methods discard the correlation), but a different form of network interaction is needed to ensure that the line attractor has the same shape as the corresponding tuning function.

It is interesting to compare the complexity of the three methods in terms of network implementation. Obviously, FMLI is more complicated than UMLI, since it uses the neural correlation; COM and UMLI are at around the same level.
7 Conclusion and Discussion
We have proposed a new unified encoding field model for population coding, in which neural responses are correlated through a gaussian correlation function. This model serves as a good prototype for theoretical study. It has the advantages of simplifying the calculations and providing a clear picture of the results, which are often obscured when other models are used.

Based on the proposed model, we calculate the Fisher information and elucidate its asymptotic behavior for various correlation lengths. We confirm that as the correlation comes to cover a short range of the population, the Fisher information saturates and no longer increases in proportion to the number of neurons. Moreover, we prove that as the correlation comes to cover a wide range of the population, the Fisher information again increases without limit in proportion to the number of neurons. This finding, together with others, needs experimental confirmation of the range of neural correlation in the brain.
Three decoding methods are compared in this study. All are formulated as the MLI type, including the conventional COM method, but they differ in how much knowledge of the encoding process they use. It turns out that UMLI, which uses the shape of the tuning function but neglects the neural correlation, stands out for its good balance between computational complexity and decoding accuracy. Furthermore, we investigate the network implementation of the three methods and show that UMLI and FMLI can be achieved by the same recurrent network with different external inputs.

We also clarify the asymptotic efficiency (Fisher efficiency) of the MLI type of decoding methods for correlated signals. It is proved that when the neural correlation covers a nonlocal range of the population (excluding the cases of uniform correlation and weak noise), the MLI type of method is non-Fisherian. The Cramér-Rao bound is not achievable in this case, and hence one should be careful when carrying out analysis based on the Fisher information. It is also important to be aware of the quasi-asymptotic efficiency of MLI based on unfaithful models. Only when quasi-asymptotic efficiency is ensured can one calculate the decoding errors of UMLI and COM by equations 5.19 and 5.21, respectively. Otherwise, the decoding errors are subject to the Cauchy type of distribution and are difficult to quantify.

An interesting case not studied in this article is multiplicative correlation, in which the correlation strength depends on the firing rates (Abbott & Dayan, 1999; Nakahara & Amari, in press; Wilke & Eurich, in press). To cope with this situation, we can, for example, extend the correlation matrix A_ij in equation 2.7 to a new one, A′_ij = f_i^a(x) A_ij f_j^a(x). In the case of a = 1 and b → ∞, this reduces to A′_ij = f_i(x)[(1 − b)δ_ij + b] f_j(x), which is the case studied by Abbott and Dayan (1999) and Wu et al. (2000).
The continuous version of A′_ij in a neural field is h′(c, c′, x) = f^a(x − c) h(c, c′) f^a(x − c′), where h(c, c′) is given by equation 3.6, and its inverse is (h′)⁻¹(c, c′, x) = f^{−a}(x − c) h⁻¹(c, c′) f^{−a}(x − c′). We can calculate the Fisher information and the performance of the MLI type of decoding methods much as we have done in this article. The results will be reported in a future publication.

Finally, we should point out that a nonlocal-range correlation does not imply the failure of MLI-type methods. In addition to the counterexample of uniform correlation, another is multiplicative correlation, for example, A_ij = [δ_ij + c(1 − δ_ij)] f_i(x) f_j(x). It has been proved that the MLI-type method is asymptotically efficient in this case, although the correlation covers the whole range of the population (Wu, Chen, & Amari, 2000). There is an intuitive way to understand the reason (and similarly for uniform correlation). The idea is to decompose the fluctuations of the neural responses into two parts, r_i − f_i(x) = σ f_i(x)(c + ε_i), where c and {ε_i}, for i = 1, ..., N, are independent random variables having zero mean and variances c and 1 − c, respectively. The neurons are correlated through the common factor c,
which can be calculated as c = Σ_i [r_i − f_i(x)] / [σ Σ_i f_i(x)]. This information can be used when MLI is performed.

Appendix A: The Covariance Function of the Continuous Encoding Model

Without loss of generality, we consider the preferred stimuli c_i to be uniformly distributed in a range [−L/2, L/2], that is,

\[
c_i = -\frac{L}{2} + \frac{L}{N}\, i,
\qquad \text{for } i = 1, \ldots, N.
\qquad (A.1)
\]
By choosing a particular form of {k_i}, for example, k_i = 1 for i = 1, ..., N, equation 3.3 becomes

\[
\int_{-L/2}^{L/2}\int_{-L/2}^{L/2} h(c, c')\, dc\, dc' = \sum_{ij} A_{ij},
\qquad (A.2)
\]

which has an intuitive meaning: the total correlation is preserved after the continuous extension. From equation 3.2,

\[
\int_{-L/2}^{L/2}\int_{-L/2}^{L/2} h(c, c')\, dc\, dc' = D_1 (1 - b) L + D_2\, b \sqrt{2\pi}\, b L.
\qquad (A.3)
\]

From equation 2.6,

\[
\sum_{ij} A_{ij} = N(1 - b) + N^2 b \sqrt{2\pi}\, b / L,
\qquad (A.4)
\]

in the large-N limit. Combining equations A.3 and A.4, we get

\[
h(c, c') = \rho (1 - b)\,\delta(c - c') + \rho^2 b\, e^{-(c - c')^2 / 2b^2},
\qquad (A.5)
\]

where ρ = N/L is the neuron density.
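The matching of total correlation in equations A.2 to A.4 can be checked numerically. The sketch below (ours; b_w is our symbol for the correlation width, to keep it distinct from the strength b) builds the discrete matrix with a Gaussian correlation profile and compares its grand sum with the large-N expression of equation A.4.

```python
import numpy as np

N, L, b, b_w = 500, 6.0, 0.5, 0.2          # b: correlation strength, b_w: correlation width
c = -L / 2 + L / N * np.arange(1, N + 1)    # preferred stimuli, equation A.1

# Discrete correlation matrix with a Gaussian profile, as in the encoding model.
A = (1 - b) * np.eye(N) + b * np.exp(-(c[:, None] - c[None, :])**2 / (2 * b_w**2))

total = A.sum()
large_N = N * (1 - b) + N**2 * b * np.sqrt(2 * np.pi) * b_w / L   # equation A.4
print(total, large_N)   # agree up to boundary effects of order b_w / L
```

The small discrepancy comes from truncating the Gaussian tails at the ends of the field, which the infinite-range integral of equation A.3 ignores.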
Appendix B: The Performance of FMLI
The performance of FMLI can be analyzed in the same way as for UMLI. Expanding ∇ln Q(r|x̂) at x and using the condition ∇ln Q(r|x̂) = 0, we obtain an estimating equation,

\[
\hat{x} - x = -\frac{R_F}{S_F},
\qquad (B.1)
\]

where the variables R_F and S_F are defined as

\[
R_F = \frac{1}{\rho^2}\nabla \ln Q(\mathbf{r}\mid x) = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varepsilon(c)\, h^{-1}(c, c')\, f'(c' - x)\, dc\, dc',
\qquad (B.2)
\]

\[
S_F = \frac{1}{\rho^2}\nabla\nabla \ln Q(\mathbf{r}\mid x) = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varepsilon(c)\, h^{-1}(c, c')\, f''(c' - x)\, dc\, dc' + D_F,
\qquad (B.3)
\]

where

\[
D_F = \frac{1}{\sigma^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f'(c - x)\, h^{-1}(c, c')\, f'(c' - x)\, dc\, dc'.
\qquad (B.4)
\]

It is easy to check that the mean values of R_F and S_F are zero and D_F, respectively. Their variances and the value of D_F are given, by using Fourier transforms, as

\[
V_Q[R_F] = \frac{1}{2\pi\rho^2\sigma^2}\int_{-\infty}^{\infty}\omega^2 F(\omega)^2 / H(\omega)\, d\omega,
\qquad (B.5)
\]

\[
V_Q[S_F] = \frac{1}{2\pi\rho^2\sigma^2}\int_{-\infty}^{\infty}\omega^4 F(\omega)^2 / H(\omega)\, d\omega,
\qquad (B.6)
\]

\[
D_F = \frac{1}{2\pi\sigma^2}\int_{-\infty}^{\infty}\omega^2 F(\omega)^2 / H(\omega)\, d\omega.
\qquad (B.7)
\]

These show that R_F is a zero-mean gaussian variable. The random variable S_F is composed of two terms, the first one being random and the second one, D_F, being a constant. The constant term dominates over the random one in two cases:

1. H(ω) is of order ρ. In this case, the random term is O(1/ρ^{3/2}), and the constant one is O(1/ρ).

2. H(ω) is of order ρ², but the noise variance σ² is sufficiently small. In this case, the random term is O(1/σ), and the constant term is O(1/σ²).

The first case corresponds to the uncorrelated, local-range, and uniformly correlated cases. The second case holds for the short- and wide-range correlations with small noise. In these cases, we may neglect the random term in S_F, so that we have asymptotically

\[
S_F \approx D_F,
\qquad
\hat{x} - x \approx -\frac{R_F}{D_F},
\qquad (B.8)
\]

which is normally distributed with zero mean and variance

\[
E_Q[(\hat{x} - x)^2_{\mathrm{FMLI}}] \sim \frac{2\pi\sigma^2}{\rho^2\int_{-\infty}^{\infty}\omega^2 F(\omega)^2 / H(\omega)\, d\omega}.
\qquad (B.9)
\]
In the cases of short- and wide-range correlations when the noise is strong, the random and constant terms in S_F are of the same order. Equation B.1 shows that the distribution of x̂ − x is then of the Cauchy type: FMLI is not asymptotically efficient in this case.

Appendix C: Minimizing the Lyapunov Function
When ε = 0, the solution minimizing a Lyapunov function of the form 6.7 (only the first two terms are concerned) is given by

\[
U_c = \tilde{U}_c = B e^{-(c - z)^2 / 4a^2},
\qquad
O_c = \tilde{O}_c = A e^{-(c - z)^2 / 2a^2},
\qquad (C.1)
\]

for any value of z. When a small input εr_c (ε → 0) is added, the solution becomes uniquely determined and has a form slightly deviating from equation C.1, which can be approximated (to first order in ε) as

\[
U_c \approx \tilde{U}_c + \varepsilon E_c,
\qquad
O_c \approx \tilde{O}_c + \varepsilon\, (2 D \tilde{U}_c E_c),
\qquad (C.2)
\]

where εE_c and ε(2DŨ_cE_c) denote the deviations from the line attractor. Substituting equation C.2 into 6.7, we get

\[
L = -\frac{1}{2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} W_{c,c'}\, \tilde{O}_c \tilde{O}_{c'}\, dc\, dc' + \int_{-\infty}^{\infty}\int_0^{\tilde{U}_c} z g'(z)\, dz\, dc - \varepsilon\int_{-\infty}^{\infty} r_c \tilde{O}_c\, dc + \text{terms of order } \varepsilon^2 \text{ and higher}.
\qquad (C.3)
\]

Note that the deviation of the waveform of the solution from the line attractor contributes to the Lyapunov function only at order ε² or higher. The contribution at first order in ε, however, is determined by the overlap between r_c and Õ_c (the third term in equation C.3). Minimizing this term determines the position of the stable state on the line attractor and gives the estimate of the stimulus. The above procedure can be carried out formally in the two-step perturbation procedure described in the text.

Appendix D: The Network Implementation of FMLI
For FMLI, we consider that it is implemented by using the same recurrent neural field as for UMLI, but with a different external input,

\[
I_c = \varepsilon\int_{-\infty}^{\infty} h^{-1}(c, c')\, r_{c'}\, dc'.
\qquad (D.1)
\]

Following the same lines as for UMLI, we obtain a Lyapunov function,

\[
L = -\frac{1}{2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} W_{c,c'} O_c O_{c'}\, dc\, dc' + \int_{-\infty}^{\infty}\int_0^{U_c} z g'(z)\, dz\, dc - \varepsilon\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} O_c\, h^{-1}(c, c')\, r_{c'}\, dc\, dc',
\qquad (D.2)
\]

which can be approximately minimized by the two-step perturbation procedure, with the solution determined by

\[
\min_{z}\; -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} O_c\, h^{-1}(c, c')\, r_{c'}\, dc\, dc',
\qquad \text{subject to } O_c = A e^{-(c - z)^2 / 2a^2}.
\qquad (D.3)
\]

Recall that the solution of FMLI is given by

\[
\max_{x}\; \ln P_F(\mathbf{r}\mid x) = \frac{\rho^2}{\sigma^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(c - x)\, h^{-1}(c, c')\, r_{c'}\, dc\, dc' + \text{terms not depending on } x.
\qquad (D.4)
\]

Comparing equations D.3 and D.4, we see that the network estimate is the same as that of FMLI.

Acknowledgments
We thank the two anonymous reviewers for their valuable comments. S. W. acknowledges helpful discussions with Peter Dayan. H. N. is supported by Grants-in-Aid 11780589 and 13210154 from the Ministry of Education, Japan.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Akahira, M., & Takeuchi, K. (1981). Asymptotic efficiency of statistical estimation: Concepts and high order asymptotic efficiency. Berlin: Springer-Verlag.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Cohen, M., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. SMC, 13, 815–826.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2, 740–745.
Eurich, C. W., Wilke, S. D., & Schwegler, H. (2000). Neural representation of multi-dimensional stimuli. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 115–121). Cambridge, MA: MIT Press.
Fetz, E., Toyama, K., & Smith, W. (1991). Synaptic interactions between cortical neurons. In A. Peters & E. G. Jones (Eds.), Cerebral cortex, 9. New York: Plenum Press.
Gawne, T. J., & Richmond, B. J. (1993). How independent are messages carried by adjacent inferior temporal cortical neurons? J. Neuroscience, 13, 2758–2771.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2, 1527–1537.
Giese, M. A. (1999). Dynamic neural field theory for motion perception. Norwell, MA: Kluwer Academic.
Johnson, K. O. (1980). Sensory discrimination: Neural processes preceding discrimination decision. J. Neurophysiology, 43, 1793–1815.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neuroscience, 18, 1161–1170.
Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiology, 49, 1127–1147.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: Determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Nakahara, H., & Amari, S. (in press). Attention modulation of neural tuning through peak and base rate in correlated firing. Neural Networks.
Nakahara, H., Wu, S., & Amari, S. (2001). Attention modulation of neural tuning through peak and base rate. Neural Computation, 13, 2031–2047.
1026
Si Wu, Shun-ichi Amari, and Hiroyuki Nakahara
Paradiso, M. A. (1988). A theory for use of visual orientation information which exploits the columnar structure for striate cortex. Biological Cybernetics, 58, 35–49. Pouget, A., & Zhang, K. (1997). Statistically efcient estimation using cortical lateral connections. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural processing systems, 9 (pp. 97–103). Cambridge, MA: MIT Press. Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efcient estimation using population coding. Neural Computation, 10, 373–401. Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from ring rates. J. Comp. Neurosci., 1, 89–107. Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Acad. Sci. USA, 93, 13339–13344. Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753. Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529. Snippe, H. P., & Koenderink, J. J. (1992). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67, 183–190. Treue, S., Hol, K., & Rauber, H.-J. (2000). Seeing multiple directions of motionphysiology and psychophysics. Nature Neuroscience, 3, 270–276. Wilke, S. D., & Eurich, C. W. (in press). Representational accuracy of stochastic neural population. Neural Computation. Wu, S., Chen, D., & Amari, S. (2000). Unfaithful population decoding. In Proceedings of the International Joint Conference on Neural Networks. Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation an unfaithful model. Neural Computation, 13, 775–797. Wu, S., Nakahara, H., Murata, N., & Amari, S. (2000). Population decoding based on an unfaithful model. In S. A. Solla, T. K. Leen, & K.-R. Muller ¨ (Eds.), Advances in neural information systems, 12 (pp. 192–198). Cambridge, MA: MIT Press. Yoon, H., & Sompolinsky, H. (1999). 
The effect of correlations on the Fisher information of population codes. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems,11 (pp. 167–173). Cambridge, MA: MIT Press. Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpolation of population codes. Neural Computation, 10, 403–430. Zhang, K-C. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neuroscience,16, 2112–2126. Zhang, K., & Sejnowski, T. J. (1999). Neural tuning: To sharpen or broaden. Neural Computation, 11, 75–84. Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272. Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neural discharge rate and its implications for psychophysical performance. Nature, 370, 140–143. Received February 2, 2001; accepted September 18, 2001.
LETTER
Communicated by Bard Ermentrout
The Influence of Limit Cycle Topology on the Phase Resetting Curve
Sorinel A. Oprisan
[email protected]
Carmen C. Canavier
[email protected]
Department of Psychology, University of New Orleans, New Orleans, LA 70148, U.S.A.
Understanding the phenomenology of phase resetting is an essential step toward developing a formalism for the analysis of circuits composed of bursting neurons that receive multiple, and sometimes overlapping, inputs. If we are to use phase-resetting methods to analyze these circuits, we can either generate phase-resetting curves (PRCs) for all possible inputs and combinations of inputs, or we can develop an understanding of how to construct PRCs for arbitrary perturbations of a given neuron. The latter strategy is the goal of this study. We present a geometrical derivation of phase resetting of neural limit cycle oscillators in response to short current pulses. A geometrical phase is defined as the distance traveled along the limit cycle in the appropriate phase space. The perturbations in current are treated as displacements in the direction corresponding to membrane voltage. We show that for type I oscillators, the direction of a perturbation in current is nearly tangent to the limit cycle; hence, the projection of the displacement in voltage onto the limit cycle is sufficient to give the geometrical phase resetting. In order to obtain the phase resetting in terms of elapsed time or temporal phase, a mapping between geometrical and temporal phase is obtained empirically and used to make the conversion. This mapping is shown to be an invariant of the dynamics. Perturbations in current applied to type II oscillators produce significant normal displacements from the limit cycle, so the difference in angular velocity at displaced points compared to the angular velocity on the limit cycle must be taken into account. Empirical attempts to correct for differences in angular velocity (amplitude versus phase effects in terms of a circular coordinate system) during relaxation back to the limit cycle achieved some success in the construction of phase-resetting curves for type II model oscillators.
The ultimate goal of this work is the extension of these techniques to biological circuits comprising type II neural oscillators, which appear frequently in identified central pattern-generating circuits.
Neural Computation 14, 1027–1057 (2002) © 2002 Massachusetts Institute of Technology
1 Introduction
In order to gain an understanding of the central pattern generators (CPGs) involved in rhythmic motor activity such as locomotion (Arshavsky, Beloozerova, Orlovsky, Panchin, Yu, & Pavlova, 1985; Beer, Chiel, & Gallagher, 1999; Canavier et al., 1997; Chiel, Beer, & Gallagher, 1999; Collins & Stewart, 1993; Collins & Richmond, 1994; Golubitsky, Stewart, Buono, & Collins, 1998) and respiration (Abramovich-Sivan & Akselrod, 1998a), it is necessary to understand the phase-resetting behavior of the neural oscillators that comprise such circuits. We have examined the structure of phase resetting in neural oscillators in order to construct phase-resetting curves (PRCs) in response to single and multiple perturbations of arbitrary amplitude and duration. Neural oscillators that produce periodic rhythms such as burst firing can be modeled effectively as limit cycle oscillators (Baxter et al., 2000), and such burst-firing neurons are frequently components of CPGs (Chiel et al., 1999; Beer et al., 1999; Collins & Richmond, 1994; Kopell & Ermentrout, 1988). In this study, we ignore action potential generation and instead focus on the underlying oscillations whose plateaus comprise the bursts and whose troughs comprise the interburst hyperpolarizations. Since physiological inputs to real neurons in a CPG circuit frequently have a duration that is a significant fraction of the cycle period (Canavier, Baxter, Clarke, & Byrne, 1999), this work is a theoretical first step that will be extended to include inputs of significant duration. A CPG is a network of neurons in the central nervous system that is capable of producing rhythmic output (Chiel et al., 1999; Beer et al., 1999; Golubitsky et al., 1998; Kopell & Ermentrout, 1988) in the absence of input from higher centers and sensory feedback. Networks of nonlinear oscillators have attracted a great deal of research due to their theoretical and practical importance (Canavier et al., 1999; Ermentrout, 1985, 1986, 1994; Izhikevich, 2000).
When acting as a component of a network, nonlinear oscillators can be characterized by their PRC (Abramovich-Sivan & Akselrod, 1998a, 1998b; Dror, Canavier, Butera, Clark, & Byrne, 1999; Ermentrout, 1996; Pinsker, 1977). The PRC can be obtained by measuring the change in the period of the limit cycle when a pulselike perturbation is applied at different points in the cycle (phases). The power of PRC methods has been clearly demonstrated by Winfree (1980) and Murray (1993), as well as by Canavier et al. (1997, 1999). Most theoretical work on coupled oscillators has used normal form, or canonical reduction, methods (Ermentrout, 1994, 1996; Hoppensteadt & Izhikevich, 1997). However, such methods involve restrictive assumptions that can limit their application to circuits composed of physiological neurons. In this article, we explore the theoretical aspects of PRC construction that is based on a geometrical understanding of the phase-space dynamics as originally developed by Winfree (Glass & Mackey, 1988; Hoppensteadt & Izhikevich, 1997; Murray, 1993; Winfree, 1980) and does not require restrictive assumptions or any information that cannot be easily obtained experimentally from physiological neurons.

A convenient way to study periodic behavior is to characterize a limit cycle oscillator by its phase. The term phase has been used in the literature in several different ways. Here we will use Winfree's definition (1980): the elapsed time measured from an arbitrary reference divided by the intrinsic period. We denote this term the temporal phase to distinguish it from the geometrical (or distance-based) phase defined below. In contrast to Winfree (1980), Murray (1993) and Hoppensteadt and Izhikevich (1997) defined the phase as the angle that parameterizes the unit circle. For a constant angular velocity on a circular limit cycle, the two definitions are equivalent. Given that limit cycles associated with nonlinear oscillators are usually not circular and can have highly variable angular velocity, we propose another definition that we call the geometrical phase. The geometrical phase can be defined as the distance measured from an arbitrary origin along the limit cycle divided by the intrinsic length of the limit cycle. Both definitions give quantities that are positive and normalized to unity. We stress that the temporally based definition of the phase (Winfree, 1980) has the same meaning as the distance-based one only for so-called phase oscillators, that is, nonlinear oscillators running with a constant phase speed, or angular velocity. As an illustrative example, we will use the Morris-Lecar oscillator, originally formulated to model action potentials in barnacle muscle (Morris & Lecar, 1981; Rinzel & Lee, 1986; Ermentrout, 1996). We will use it in a different context, as a model for the envelope of a burst-firing waveform described above.
A limit cycle oscillation requires a minimum of two variables (Hoppensteadt & Izhikevich, 1997; Guckenheimer & Holmes, 1983; Murray, 1993; Winfree, 1980); in the Morris-Lecar model, these variables are the membrane potential (V) and a negative feedback variable (w) that corresponds to a gating variable for an ionic channel. The limit cycle oscillator can exhibit a range of dynamic activity that depends on the characteristic timescales of the two variables (Rinzel & Lee, 1986; Rinzel & Ermentrout, 1998). If they are comparable in magnitude, the qualitative dynamics of a phase oscillator arise, and the two variables tend to vary smoothly in unison. If w varies much more slowly than V, a relaxation oscillator can be produced in which alternating rapid increases and decreases in V (jumps) are followed by slow relaxation in w. The qualitative characterization of a limit cycle oscillator as a phase oscillator, a relaxation oscillator, or an intermediate determines the relationship between time elapsed (temporal phase) and distance traversed along the limit cycle (geometrical phase). A mapping between geometrical phase and temporal phase was found to provide insight into the shape of the phase-resetting curve. The Morris-Lecar model can exhibit two types of neural excitability. Neural excitability is the capability of a neuron to fire repetitively in response to an externally applied current. For type I excitability, the onset of repetitive firing occurs at arbitrarily low frequency, whereas for type II excitability, the onset of repetitive firing occurs at a nonzero frequency (Hodgkin, 1948). The type of excitability is determined by the bifurcation structure: in type I, oscillations arise via a saddle node bifurcation, whereas in type II, they arise via a Hopf bifurcation (Rinzel & Ermentrout, 1998). The phase response characteristics of a neuron depend on the type of excitability it exhibits (Ermentrout, 1996): in general, type I neurons exhibit either only delays or only advances in response to a perturbation of a given sign, whereas type II neurons can exhibit both advances and delays to a given perturbation, depending on its timing. The phase response of the type I neuron is due to its essentially monotonic increase in membrane potential, except for a very rapid hyperpolarization phase of the action potential. Similar results apply to integrate-and-fire neurons for the same reason (Mirollo & Strogatz, 1990). Our analysis assumed that for a neural oscillator, perturbations generally occur in the form of an increase or decrease in transmembrane current. The first derivative of the membrane potential is proportional to the transmembrane current, whereas the transmembrane current does not appear in the derivatives of the other state variables, which for neural oscillators are generally gating variables. In oscillators in which the frequency of the temporal waveform is simply scaled by an increase in transmembrane current level, such as type I or integrate-and-fire models, the direction (advance or delay) of a brief perturbation can be determined using only the sign of the current perturbation and the sign of the first derivative of the membrane potential (the instantaneous transmembrane current) at the time of the perturbation. By convention, a depolarizing (hyperpolarizing) current applied externally has a positive (negative) sign and produces a positive (negative) change or increase (decrease) in the membrane potential.
Hence, if the signs are the same, there is an advance, whereas if they are different, there is a delay. For an ionic current, such as a synaptic current, the sign convention is opposite. Furthermore, the magnitude of the instantaneous geometrical phase shift due to a brief perturbation is proportional to the magnitude of the time integral of the perturbing current. Once the geometrical phase shift due to a perturbation has been calculated, the mapping from geometrical to temporal phase can be used to construct the desired PRC for oscillators. When
Figure 1: Facing page. Geometric versus temporal phase. Two-dimensional phase-space plots of the limit cycles for (A) a hypothetical phase oscillator, (C) Morris-Lecar (type I oscillator), and (E) FitzHugh-Nagumo (type II oscillator) relaxation oscillators. (B, D, F) The corresponding plots of the normalized geometrical distance (s) versus temporal phase ($\varphi$). For a phase oscillator, the figurative point travels with constant angular velocity around the limit cycle, and therefore the normalized geometrical distance (s) increases linearly (B). For the relaxation oscillators (D, F), the figurative point travels with variable speed along the limit cycle. (A) The four quadrants of a limit cycle; (F) the conversion from geometrical to temporal phase.
transmembrane current levels have more complex effects than simply scaling the frequency, additional calculations are required to compensate for effects on the amplitude of the temporal waveform.

2 A Geometric Analysis of Phase Resetting

2.1 A Mapping from Temporal to Geometrical Phase. The phenomenology of phase resetting can be best understood by visualizing a limit cycle, such as those shown in Figures 1A, 1C, and 1E, represented for convenience in a two-dimensional plane. Consider a system of N differential equations:

\[
\frac{du_k}{dt} = f_k(u), \tag{2.1}
\]
where $u_1 = V$. We used the infinitesimal Euclidean distance,

\[
ds = \sqrt{\sum_{k=1}^{N} \bigl(f_k(u)\bigr)^2}\,dt, \tag{2.2}
\]

to define the geometrical phase along the limit cycle (Oprisan & Canavier, 2000a). Therefore, the length of a closed orbit is

\[
L = \int_0^{P_i} \sqrt{\sum_{k=1}^{N} \bigl(f_k(u)\bigr)^2}\,dt,
\]

where $P_i$ is the intrinsic period of the motion. The normalized geometrical (distance-based) phase along a limit cycle is given by

\[
s = \frac{1}{L} \int_0^{t} \sqrt{\sum_{k=1}^{N} \bigl(f_k(u)\bigr)^2}\,dt. \tag{2.3}
\]
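Numerically, equations 2.2 and 2.3 amount to accumulating the Euclidean speed $\|f(u)\|$ along one period of the orbit. A minimal sketch, using a hypothetical unit-circle phase oscillator in place of the Morris-Lecar equations (for such an oscillator the geometrical and temporal phase should coincide, as in Figure 1B):

```python
import math

def geometrical_phase(f, u0, period, n_steps=20000):
    """Integrate du/dt = f(u) with RK4 and accumulate the arc length
    ds = ||f(u)|| dt (eq. 2.2); return the normalized geometrical phase
    s(t) of eq. 2.3, sampled once per time step."""
    dt = period / n_steps
    u = list(u0)
    s = [0.0]
    for _ in range(n_steps):
        k1 = f(u)
        k2 = f([ui + 0.5 * dt * ki for ui, ki in zip(u, k1)])
        k3 = f([ui + 0.5 * dt * ki for ui, ki in zip(u, k2)])
        k4 = f([ui + dt * ki for ui, ki in zip(u, k3)])
        u = [ui + dt / 6.0 * (a + 2 * b + 2 * c + d)
             for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]
        s.append(s[-1] + math.sqrt(sum(fi * fi for fi in f(u))) * dt)
    return [si / s[-1] for si in s]  # normalize by the orbit length L

# Hypothetical phase oscillator: unit circle traversed at unit speed.
circle = lambda u: [-u[1], u[0]]
s = geometrical_phase(circle, [1.0, 0.0], 2.0 * math.pi)
```

For a relaxation oscillator, the same routine would return a strongly nonlinear s, as in Figures 1D and 1F.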
The geometrical phase s is a monotonically increasing function of both the elapsed time t and the temporal phase $\varphi = t/P_i$, where $P_i$ is the intrinsic period of the limit cycle motion (see equations 2.3 and 3.2). The function $s(\varphi)$ gives an invertible mapping between temporal and geometrical phase. Some examples of typical mappings are shown in Figures 1B, 1D, and 1F. For a phase oscillator, the geometrical and temporal phase are identical (see Figure 1B), whereas Figures 1D and 1F show that the mapping for relaxation oscillators can be highly variable and idiosyncratic. A remarkable property of the geometrical to temporal phase mapping is its invariance. This mapping can be obtained using equation 2.3 if the evolution equations are known or by performing a time series delay embedding
to reconstruct the attractor (Takens, 1981; Abarbanel, Brown, Sidorowich, & Tsimring, 1993) first and then integrating along the numerically computed trajectory. We obtained a geometrical to temporal phase mapping for the Morris-Lecar model at the fixed parameter settings given in the caption of Figure 2. We used direct integration of the differential equations, as well as attractor reconstruction using delay embedding with two different delays, and in every case the agreement was very good (see Figure 2; see appendix A for a detailed proof of invariance under certain assumptions).

2.2 Phase Resetting During a Perturbation. A geometrical perspective on phase resetting is given in Figure 3. A perturbation (the dark solid line in
Figure 2: Dynamical invariance of the reconstructed geometrical to temporal phase mapping. The geometrical to temporal phase mapping was obtained by integrating the differential equations of the Morris-Lecar model oscillator in the type I excitability case. Using only the dimensionless membrane potential record, the phase-space attractor was reconstructed using two different time lags ($\tau = 100$, dotted line, and $\tau = 200$, dot-dashed line), and for each reconstructed attractor, the geometrical to temporal phase mapping was computed.
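The delay-embedding route to the mapping can be sketched numerically. The record below is a synthetic sinusoidal "membrane potential" (a hypothetical stand-in for the Morris-Lecar trace of Figure 2), and the lags are arbitrary choices; the two reconstructed mappings should agree up to a small lag-dependent distortion:

```python
import math

def embed_phase(v, lag):
    """Two-dimensional delay embedding (v[n], v[n - lag]) of a periodic record,
    followed by the normalized cumulative arc length along the reconstructed
    orbit (the delay-embedding analogue of eq. 2.3)."""
    n = len(v)
    orbit = [(v[i], v[(i - lag) % n]) for i in range(n)]
    s, total = [0.0], 0.0
    for i in range(1, n):
        dx = orbit[i][0] - orbit[i - 1][0]
        dy = orbit[i][1] - orbit[i - 1][1]
        total += math.hypot(dx, dy)
        s.append(total)
    return [si / total for si in s]

# Synthetic one-period "membrane potential" record (not the actual ML trace).
N = 4000
v = [math.sin(2.0 * math.pi * i / N) for i in range(N)]
s1 = embed_phase(v, lag=100)
s2 = embed_phase(v, lag=200)
# Approximate invariance: the two reconstructed mappings nearly coincide.
err = max(abs(a - b) for a, b in zip(s1, s2))
```

With real data, the lag and embedding dimension would be chosen by the usual time-series criteria (Takens, 1981; Abarbanel et al., 1993).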
Figure 3: Topology of phase resetting. The phase-space plot of a two-dimensional stable limit cycle. A perturbation applied at time $t_0$ for a duration of $t_1$ perturbs the trajectory (dark line). After the perturbation, the trajectory requires a time $t_2$ to return to the limit cycle.
Figure 3) applied at time $t_0$ has a duration of $t_1$. At the end of the perturbation, the trajectory may be at some distance from the limit cycle, and it relaxes back to the limit cycle (dashed line) during the time period $t_2$. The temporal phase resetting is the normalized difference in elapsed time along the perturbed and unperturbed paths over the time period $t_1 + t_2$. While $t_1$ is easily determined, $t_2$ is not. Fortunately, for the simple case of type I excitability, it is possible to neglect the resetting that occurs during $t_2$, but we will first propose expressions that give the geometrical phase resetting for both $t_1$ and $t_2$ (see Oprisan & Canavier, 2001) and then recover the temporal phase resetting by using the inverse mapping from geometrical to temporal phase. The main idea is to obtain the projection of the perturbed trajectory onto the unperturbed trajectory. Physiological perturbations generally arise in the form of a perturbation in current, which affects only $\dot{V}$, or $f_1$ (see equation 3.1), and no other derivative (see also Nishii, 1999). The evolution equation of the membrane potential has the general form $\dot{V} = -\frac{\sum I_{ionic}}{C_m} + \frac{I_{ext}}{C_m}$, where $\sum I_{ionic}$ stands for the sum of all the ionic currents, $C_m$ is the membrane capacitance, and $I_{ext}$ is an externally applied current. The projection of the vector representing the perturbation $(i(t)/C_m, 0, 0, \ldots)$, where $i(t)$ is the perturbing current as a function of time, onto the vector containing the time derivatives $f_k(u)$ of the state variables evaluated along the limit cycle would produce the normalized infinitesimal distance traveled along the limit cycle:

\[
ds_1 = \frac{1}{L}\,\frac{i}{C_m}\,\frac{f_1(t)}{\lVert f(t)\rVert}\,dt. \tag{2.4}
\]
Thus, the geometrical phase resetting during the perturbation of duration $t_1$ would be

\[
\Delta s_1 = \frac{1}{L} \int_0^{t_1} \frac{i}{C_m}\,\frac{f_1(t)}{\lVert f(t)\rVert}\,dt. \tag{2.5}
\]

If we assume that the perturbation has a very short duration, it is not necessary to integrate $f_1(t)/\lVert f(t)\rVert$ over the course of the perturbation; rather, we assume this quantity remains constant for the duration of the perturbation (Glass & Mackey, 1988; Guckenheimer & Holmes, 1983; Hoppensteadt & Izhikevich, 1997; Murray, 1993; Pavlidis, 1973). For simplicity of presentation, we assume that $i$ is also constant for the duration $t_1$ of the perturbation:

\[
\Delta s_1 = \frac{1}{L}\,\frac{t_1 i}{C_m}\,\frac{f_1(t_0)}{\lVert f(t_0)\rVert}. \tag{2.6}
\]
Figure 4 illustrates the relationships among the vectors described above for a two-dimensional case, for a phase advance in Figure 4A and a delay in Figure 4B. In each case, the vector $\vec{s}_0$ represents the unperturbed limit cycle, the vector $\vec{s}^{\,*}$ represents the perturbed limit cycle, and $\Delta s$ is the geometrical resetting before normalization by $L$, which then gives $\Delta s_1$. The derivatives with respect to voltage are given by $f_1$ evaluated on the limit cycle at the point of the perturbation and by $f_1^{*}$ during the perturbation. The derivative with respect to the second variable, $f_2$, is the same for the perturbed and unperturbed cases. The vectors are defined as follows: $\vec{s}_0 = [\,t_1 f_1(t_0),\ f_2(t_0)\,]$ and $\vec{s}^{\,*} = [\,t_1 f_1(t_0) + \Delta V,\ f_2(t_0)\,]$. The vector $\vec{s}_0$ would be traversed by the trajectory in the absence of a perturbation, and the vector $\vec{s}^{\,*}$ would be traversed in its presence. The direction of the perturbation is given by the difference $f_1^{*} t_1 - f_1 t_1$, which is equal to $i t_1 / C_m$, or $\Delta V$. One can obtain the normal displacement $\Delta h$ from the tangential displacement using similar triangles: $\Delta h = \frac{f_2(t_0)}{f_1(t_0)} \Delta s_1$. This relationship becomes useful in the next section. The temporal phase resetting is recovered using
\[
\Delta \varphi_1 = s^{-1}(\Delta s_1). \tag{2.7}
\]
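Equation 2.7 inverts the monotonic mapping between geometrical and temporal phase. With the mapping tabulated numerically, the inversion reduces to a table lookup with linear interpolation; a minimal sketch, using a hypothetical identity mapping (i.e., a phase oscillator) as the tabulated data:

```python
def interp(xs, ys, x):
    """Piecewise-linear interpolation of ys(xs) at x (xs strictly increasing)."""
    for i in range(1, len(xs)):
        if xs[i] >= x:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])
    return ys[-1]

def temporal_resetting(phi, s, phi0, ds1):
    """Recover the temporal resetting (eq. 2.7): add Delta s_1 to the
    geometrical phase at phi0, then read off the horizontal displacement
    needed to reach the new value of s (lower portion of Figure 1F)."""
    s0 = interp(phi, s, phi0)            # geometrical phase at the perturbation
    phi_new = interp(s, phi, s0 + ds1)   # invert the monotonic mapping s(phi)
    return phi_new - phi0

# For a phase oscillator, s(phi) is the identity, so the temporal and
# geometrical resettings coincide.
grid = [i / 100.0 for i in range(101)]
dphi = temporal_resetting(grid, grid, phi0=0.30, ds1=0.05)
```

For a relaxation oscillator, `phi` and `s` would instead hold the tabulated mapping obtained from equation 2.3, and the same lookup yields $\Delta\varphi_1$.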
The recovery can be accomplished graphically as shown in the lower portion of Figure 1F. The calculated $\Delta s$ is added to the geometrical phase at the point at which the perturbation is received. The change in temporal phase is given by the horizontal displacement required to reach the new value of $s$.

2.3 Phase Resetting During Relaxation Back to the Limit Cycle After a Perturbation. As shown in Figure 3, it is possible for the trajectory to be at some normal displacement from the unperturbed limit cycle after the perturbation has ended, so that additional phase resetting is incurred during $t_2$, the relaxation back to the limit cycle. As we will show in the
Figure 4: Idealized trajectory of a perturbation. The linearized tangent displacement along the unperturbed limit cycle $\vec{s}_0$ and its perturbed counterpart $\vec{s}^{\,*}$. The perturbed trajectory $\vec{s}^{\,*}$ is induced by a perturbation in transmembrane current: (A) the depolarizing case and (B) the hyperpolarizing case. The tangent geometrical phase shift $\Delta s$ and the transverse displacement $\frac{f_2}{f_1}\Delta s$ can easily be written in terms of the unperturbed $f$ and perturbed $f^{*}$ vector fields.
Table 1: Phase Resetting Due to Normal Displacement from the Limit Cycle.

Quadrant   f1   f2   ΔV   Δs1   f2/f1   Δh        Δs2
I          −    −    +    −     +       Inside    −
I          −    −    −    +     +       Outside   +
II         +    −    +    +     −       Inside    −
II         +    −    −    −     −       Outside   +
III        +    +    +    +     +       Outside   +
III        +    +    −    −     +       Inside    −
IV         −    +    +    −     −       Outside   +
IV         −    +    −    +     −       Inside    −
next section, for some types of oscillators, this additional phase resetting ($\Delta s_2$ and $\Delta \varphi_2$) can be neglected. For the case in which there is significant normal displacement that cannot be ignored, we have begun to calculate the appropriate corrections by initially assuming that the linear velocity of a trajectory just inside or just outside the limit cycle is approximately the same as that of a point on the limit cycle. However, in general, the angular velocity will be greater inside the limit cycle and smaller outside the limit cycle. Hence, trajectories that are horizontally displaced inside the limit cycle will be advanced, and those outside will be delayed. As a first approximation, we made the assumption that $\Delta s_2$ is proportional to the normal distance $\Delta h$ between the perturbed and unperturbed trajectories: $\Delta s_2 = K \Delta h$. The normal distance can be calculated as

\[
\Delta h = \frac{\sqrt{\sum_{k=2}^{N} \bigl(f_k(u)\bigr)^2}}{f_1}\,\Delta s_1. \tag{2.8}
\]
For the two-dimensional case, the expression

\[
\Delta h = \frac{f_2}{f_1}\,\Delta s_1, \tag{2.9}
\]
always gives the correct sign for the resetting. A limit cycle can be divided into four quadrants (see Figure 1A): in I, $f_1 < 0$ and $f_2 < 0$; in II, $f_1 > 0$ and $f_2 < 0$; in III, $f_1 > 0$ and $f_2 > 0$; and in IV, $f_1 < 0$ and $f_2 > 0$. The sign of $\Delta s_1$ is given by $\mathrm{sgn}(f_1 \Delta V)$, and the sign of $\Delta s_2$ is given by $\mathrm{sgn}\bigl(\frac{f_2}{f_1} \Delta s_1\bigr)$; in every case, a normal displacement outside the limit cycle gives a positive $\Delta s_2$, which corresponds to an advance, and a displacement inside the limit cycle gives a negative $\Delta s_2$, or a delay (see Table 1). The remaining problem is to assign a value to the proportionality constant $K$. The constant of proportionality $K$ between the normal displacement and its corresponding additional tangential displacement was found by calculating the time required for relaxation to the limit cycle, assuming the temporal
phase resetting was proportional to the relaxation time, converting this temporal resetting to geometrical resetting, and finding the appropriate $K$. First, we determined the maximum normal distance ($\Delta h_{\max}$) between the unperturbed limit cycle and a perturbed limit cycle with a current perturbation of $i$. The normalized time $\Delta \varphi_{\max}$ (with respect to the intrinsic period of the unperturbed limit cycle) necessary for the figurative point to relax from the maximum displacement sufficiently close to the unperturbed limit cycle was determined. $\Delta \varphi_{\max}$ was converted to $\Delta s_{\max}$ using the temporal to geometrical mapping. The constant $K$ is the ratio $\Delta s_{\max} / \Delta h_{\max}$. Since the time required to relax back to the limit cycle ($t_2$ in Figure 3) is theoretically infinite, we used instead the time required to relax to within a neighborhood of the limit cycle defined by the limits of our computational accuracy (Izhikevich, 2000). A plot of the natural logarithm of the normal displacement versus time (not shown) was linear until this limit was reached; then it became flat. The time to reach the end of the linear region was normalized to give $\Delta \varphi_{\max}$. Once again, the temporal phase resetting is recovered as
\[
\Delta \varphi_2 = s^{-1}(\Delta s_2), \tag{2.10}
\]

and the total phase resetting is given by $\Delta \varphi = \Delta \varphi_1 + \Delta \varphi_2$.

3 Results Using the Morris-Lecar Model as an Example
The Morris-Lecar model has the advantages of simplicity (it has only two state variables) and versatility (it can exhibit either type I or type II excitability, depending on the values of its parameters). We used the following general dimensionless form of the nonlinear oscillator,

\[
\dot{V} = f(V, w) + I_{ext} = f_1(V, w; I_{ext}), \qquad
\dot{w} = g(V, w) = f_2(V, w; I_{ext}), \tag{3.1}
\]

where $V$ stands for the fast variable, $w$ for the slow variable, and $I_{ext}$ for the control parameter (externally applied current). Two-dimensional reduced models like Morris-Lecar (Morris & Lecar, 1981; Ermentrout, 1994) and FitzHugh-Nagumo (FitzHugh, 1969) have similar dimensionless forms. The ionic currents can be written in terms of the membrane potential $V$ and the slow variable(s) (see appendix B). The expression for geometrical phase resetting due to tangential displacement during a perturbation (see equation 2.6) becomes
\[
\Delta s_1 = \frac{1}{L}\,\frac{t_1 i}{C_m}\,\frac{f_1(t_0)}{\sqrt{f_1(t_0)^2 + f_2(t_0)^2}}, \tag{3.2}
\]

where $f_1 = \dot{V}$ and $f_2 = \dot{w}$.
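Equation 3.2 reduces the geometrical resetting to a pointwise evaluation along the orbit. A minimal sketch of that evaluation follows; the two-variable system below is a hypothetical stand-in whose limit cycle is the unit circle, not the dimensionless Morris-Lecar model of appendix B, and the pulse parameters are arbitrary:

```python
import math

def ds1_profile(f1, f2, orbit, L, t1, i_amp, Cm=1.0):
    """Pointwise evaluation of eq. 3.2 along sampled points (V, w) of the
    limit cycle: Delta s_1 = (1/L) * (t1*i/Cm) * f1 / sqrt(f1^2 + f2^2)."""
    out = []
    for V, w in orbit:
        a, b = f1(V, w), f2(V, w)
        out.append((t1 * i_amp / (Cm * L)) * a / math.hypot(a, b))
    return out

# Hypothetical two-variable oscillator standing in for the reduced model:
# its limit cycle is the unit circle, with f1 = Vdot and f2 = wdot.
f1 = lambda V, w: -w
f2 = lambda V, w: V
orbit = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
         for k in range(360)]
L = 2 * math.pi  # arc length of the unit circle
prof = ds1_profile(f1, f2, orbit, L, t1=0.01, i_amp=0.1)
```

Consistent with the analysis, the resulting $\Delta s_1$ profile scales linearly with both the pulse duration $t_1$ and the pulse amplitude $i$.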
In order to evaluate the success of the PRC construction methods described in the preceding section, we numerically generated actual PRCs for the Morris-Lecar model using an explicit fourth-order Runge-Kutta method to solve the nonlinear equations B.1. The simulations were compiled under C/UNIX and run on a Sun Enterprise 450 Ultra Server. First, the differential equations were numerically integrated until the solution converged to the limit cycle, and the period $P_i$ of the steady solution was computed. Then a steplike current pulse was applied at different temporal phases ($\varphi$) during the cycle with respect to an arbitrary fixed reference point on the limit cycle (we used the maximum of the membrane potential as a reference point, but different points can be used; Canavier et al., 1997, 1999). The perturbed period $P$ of the membrane potential is measured, and the quantity $F(\varphi) = P/P_i - 1$ is computed for each value of $\varphi$. $F(\varphi)$ was used for consistency with our previous articles and is equal to $-\Delta\varphi$. These actual PRCs were used for comparison with the theoretical PRCs. For precise continuity near the reference point, the second perturbed period after the pulse is applied should also be tabulated and the effect on both periods summed to produce $F(\varphi)$ (Canavier et al., 1997, 1999), but the second perturbed period was not examined in this article.

3.1 Phase Resetting Induced in Type I and Type II Oscillators by a Single Input. The actual PRC for a type I oscillator (parameters given in appendix B) receiving an excitatory input is shown in Figure 5A (dotted line) and compared with the PRC constructed according to the methods described in the preceding section (dashed line). Only the contribution from $\Delta s_1$ was included in the calculation; any contribution from $\Delta s_2$ was neglected. It can be seen from the common plot of actual and predicted PRCs that our geometrical method gives a very good estimate of the PRC for the type I oscillator. As we expected based on the analytic result (see equation 3.2), the amplitude of the PRC is proportional to the pulse duration $t_1$ (see Figure 5B) and also to the pulse amplitude (see Figure 5C). The shape of the geometrical phase shift curve can be anticipated by examining equation 3.2 together with the waveform of the membrane potential (see Figure 6C). Since the membrane potential is always increasing except in a very short time interval, we would expect a depolarizing external stimulus to produce a negative phase shift (an advance), as is indeed the case in Figure 5A. However, there is a very small temporal window in which the membrane potential is decreasing, and a perturbation applied at this time would produce a phase shift of the opposite sign. As a practical matter, such an anomalous phase shift would be difficult to observe in a physiological type I oscillator, but in a model, we have the precision to detect it. Hence, Figure 7 illustrates how this region of anomalous phase shift narrows as the bifurcation that generates type I excitability is approached. A numerical computation of the ratio of the time interval during which the membrane potential is decreasing to the period of the signal is shown in Figure 7. The time
Sorinel A. Oprisan and Carmen C. Canavier
Influence of Limit Cycle Topology
Figure 6: Effect of bias current on limit cycle topology and angular velocity (frequency). Two-dimensional limit cycles for the (A) Morris-Lecar type I and (B) type II excitability cases. The curves labeled 1 and 2 were obtained at different values of I_ext. For type I excitability, the topology of the attractor is not very sensitive to changes in the control parameter I_ext (A), whereas the period of the limit cycle motion is sensitive to the same change in I_ext (C). On the other hand, the topology of the attractor obtained via a type II excitability mechanism (Hopf bifurcation) is dramatically influenced by a change in the control parameter I_ext (B), but the period is not (D).
Figure 5: Facing page. Predicted (analytic) versus actual (numerically computed) type I phase resetting. (A) The numeric (continuous curve) and analytically mapped (dotted curve) PRCs for the type I excitability case, obtained using equation 3.2 and the graphic mapping procedure. The perturbation duration was τ = P_i/100. The steady dimensionless current was I_ext = 0.0695, and the pulse amplitude was 10% of I_ext. A summary of numerical simulations confirms that the PRC amplitude scales with the product of current amplitude (B) and its duration (C), as theoretically predicted by equation 3.2.
Figure 7: Deviation of type I model from resetting curve with a single sign. The deviation is plotted as the fraction of the cycle over which the phase resetting has the opposite sign from the idealized type I PRC, in which all phase resets in response to a given input have the same predictable sign. When the control parameter approaches the bifurcation point (criticality), the relative magnitude of the time interval over which the membrane potential decreases becomes negligible.
interval over which the membrane potential decreases from its maximum to its minimum is almost constant over a very large control parameter range, but as the bifurcation (criticality) is approached, the period increases, so the ratio decreases. Theoretically, in the limiting case of criticality, the period of the membrane potential oscillation becomes infinite because the phase trajectory approaches the separatrix. On the other hand, using our observation that the time interval over which the membrane potential decreases from its maximum to its minimum is nearly constant over a very large range of I_ext and Ermentrout's (1996) formula for the period of the membrane oscillations, T ∝ (I_ext − I_c)^{−1/2}, where I_c is the critical value of the control
parameter at the saddle-node bifurcation point, it is possible to derive an analytic expression for the plot in Figure 7. Therefore, the computationally determined ratio can be satisfactorily fitted by √(I_ext − I_c). These two observations lead to the conclusion that the relative magnitude of the positive part of the PRC (which is related to the relative contribution of the decreasing membrane potential to the whole signal duration) decreases to zero as we approach criticality. Rinzel and Ermentrout (1998) make the same point: “In only a very small interval of time can the phase be delayed, and this is a general property of membranes that become oscillatory through a saddle node bifurcation” (p. 286). The temporal waveform given in Figure 6C explains the general shape of the PRC, and the effect of a perturbation in current on both the phase plane representation (see Figure 6A) and the temporal waveform (see Figure 6C) explains why the phase resetting due to relaxation after a normal displacement is not significant for oscillators that exhibit type I excitability. An important topological feature of the associated limit cycle attractor shown in the phase-plane representation of Figure 6A is its invariance with respect to changes in the applied current. On the other hand, the period of the limit cycle motion is highly sensitive to such changes (see Figure 6C). For type I excitability, a perturbation in current produces a negligible normal displacement from the limit cycle. Therefore, in this case, the geometrical phase difference occurs solely because of the different speeds of the figurative points along the limit cycle. Quantitatively, the normal displacement from the unperturbed limit cycle is proportional to f_2(t_0)/f_1(t_0). For the type I oscillators that we examined in this study, we found that the average value of this ratio is about 2.8 × 10⁻⁴ using the dimensionless (normalized) forms of V and w, whereas for type II oscillators, the average value was 2.8.
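The (I_ext − I_c)^{−1/2} period scaling can be checked numerically on the theta-neuron normal form of a saddle-node-on-invariant-circle bifurcation, a stand-in for the Morris-Lecar type I case used here only for illustration: for that model I_c = 0 and the period is analytically π/√I (Ermentrout, 1996). A minimal Python sketch:

```python
import math

def theta_period(I, h=1e-3):
    """Numerically measure the oscillation period of the theta neuron
    d(theta)/dt = 1 - cos(theta) + (1 + cos(theta)) * I   (I > Ic = 0),
    as the time for theta to advance from 0 to 2*pi (RK4 integration)."""
    def f(th):
        return 1.0 - math.cos(th) + (1.0 + math.cos(th)) * I
    th, t = 0.0, 0.0
    while True:
        k1 = f(th)
        k2 = f(th + 0.5 * h * k1)
        k3 = f(th + 0.5 * h * k2)
        k4 = f(th + h * k3)
        step = h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        if th + step >= 2.0 * math.pi:
            # linear interpolation of the crossing of theta = 2*pi
            return t + h * (2.0 * math.pi - th) / step
        th, t = th + step, t + h

# If T scales as (I - Ic)^(-1/2) with Ic = 0, then T * sqrt(I) is a
# constant (here pi) for every I above the bifurcation.
scaled = [theta_period(I) * math.sqrt(I) for I in (0.01, 0.04, 0.16)]
```

Quadrupling the distance from criticality should halve the period, mirroring the divergence of the period (and the shrinking anomalous-phase-shift fraction) described above.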
Thus, the normal displacement induced by the same perturbation is four orders of magnitude smaller for type I oscillators than for type II oscillators. This explains the excellent results obtained for type I using only Δs_1, which represents the tangential displacement during a perturbation. Since the normal displacement is significant for type II oscillators (see Figure 6B), we expect that we will have to include the contribution Δs_2 from relaxation after normal displacement. Type II excitability arises via a Hopf bifurcation. In contrast to type I excitability, the geometry of the associated limit cycle attractor is very sensitive to control parameter changes (see Figure 6B), but the period of the motion changes more slowly than for type I excitability (see Figure 6D). As we expected, calculating the PRC using only the contribution of Δs_1 (dotted line, Figure 8A) did not produce a good fit to the actual PRC. For the two-dimensional case, the expression for Δs_2 becomes
$$\Delta s_2 = K\,\frac{f_2(t_0)}{f_1(t_0)}\,\Delta s_1, \tag{3.3}$$
for type II parameter settings (given in appendix B). Adding the contributions of both Δs_1 and Δs_2 produced a better fit (see Figure 8B), but not the nearly perfect agreement achieved for type I. Possible future directions for improving the fit are given in section 4.

3.2 Phase Resetting Induced in a Type I Oscillator by Multiple Inputs per Cycle. We generalized the above methodology to the case in which the
neural oscillator receives more than one input during a cycle (Oprisan & Canavier, 2000b). Only the simpler case of type I excitability is addressed, so that we could assume the trajectory returns to the limit cycle between successive perturbations. We considered the simplified case of two identical current pulses, meaning that τ_1 = τ_2 and ΔI_ext,1 = ΔI_ext,2 (see Figure 9A). Here, τ_1 and τ_2 denote the durations of two distinct pulses rather than the duration of the perturbation and the associated relaxation back to the limit cycle as in the preceding sections. t_1 denotes the time at which the first current pulse is applied (see Figure 9A). As a consequence of this current perturbation, the geometrical phase shift induced at time t_1 is given by equation 3.2. Another current pulse is applied at time t_2. The geometrical phase shift is again given by equation 3.2, but the temporal phase at which the second pulse is applied is t_1/P_i + Δφ_1, where Δφ_1 is the temporal phase shift corresponding to the geometrical phase shift previously determined for the first pulse. Therefore, if we denote by F^{[1]}(φ) the single-pulse PRC as it was defined in the preceding section, then for two identical current pulses acting during the same cycle, the PRC becomes
$$F^{[2]}(\varphi) = F^{[1]}(\varphi) + F^{[1]}\!\left(\varphi + F^{[1]}(\varphi) + \frac{t_2 - t_1}{P_i}\right), \tag{3.4}$$
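Equation 3.4 is a straightforward composition of the single-pulse PRC with itself and can be sketched directly. The single-pulse PRC F1 below is a hypothetical smooth curve used only for illustration, not the Morris-Lecar PRC:

```python
import math

def two_pulse_prc(F1, phi, dt_over_Pi):
    """Equation 3.4: the two-pulse PRC built from the single-pulse PRC F1.
    dt_over_Pi = (t2 - t1)/Pi is the pulse separation in units of the
    intrinsic period; the second pulse is evaluated at the phase already
    shifted by the resetting produced by the first pulse."""
    return F1(phi) + F1(phi + F1(phi) + dt_over_Pi)

# Hypothetical single-pulse PRC (illustration only):
F1 = lambda phi: -0.05 * math.sin(2.0 * math.pi * phi)

# Two identical pulses separated by 20% of the intrinsic period:
F2 = [two_pulse_prc(F1, phi / 10.0, 0.2) for phi in range(10)]
```

When F1 vanishes identically, so does the composed two-pulse curve, and setting the separation to zero reduces equation 3.4 to two coincident pulses.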
where P_i is the intrinsic period of the oscillation for the steady external current value I_ext,1, and (t_1 + t_2)/P_i < 1 in order to deliver both current pulses during the same cycle (Ermentrout & Kopell, 1991b). This analysis can also be generalized to more than two inputs per cycle with arbitrary amplitude-duration ratios. We performed extensive numerical simulations to verify our predictions using a Morris-Lecar model neuron. Figure 10 summarizes the agreement between the predicted two-pulse PRC (continuous curve)
Figure 8: Facing page. Predicted (analytic) versus actual (numerically computed) type II phase resetting. (A) Uncorrected and (B) corrected for relaxation, the numeric (continuous curve) and analytic (dotted curve) PRCs for type II excitability. The duration of the perturbing pulse was τ = P_i/100. The steady dimensionless current was I_ext = 0.25, and the pulse amplitude was 10% of I_ext. The anomalous region is due to the finite fraction of the cycle during which the membrane potential is rapidly decreasing, whereas it increases during the remainder of the cycle.
Figure 9: Generalization of the resetting topology to multiple pulses. (A) Two brief current pulses with the same amplitude ΔI_ext = I_ext,2 − I_ext,1 and duration τ_1 = τ_2 acting during the same cycle. (B) The figurative point is accelerated along the unperturbed limit cycle when the perturbation is first applied (at t_1) for the interval τ_1. When the first perturbation switches off, the figurative point again moves along the unperturbed limit cycle until a second pulse is applied at time t_2. The figurative point is again accelerated along the limit cycle and moves along it. When the second perturbation terminates, the figurative point again moves along the unperturbed limit cycle. Mapping the geometrical phase differences into temporal phase differences, we found the PRCs for multiple inputs per cycle (see B).
Figure 10: Predicted versus actual phase resetting for the multiple-pulse protocol. Common plot of the predicted (continuous line) and actual numerically computed (dotted line) PRCs with two identical current pulses applied during the same cycle. The pulse amplitude was 1% of the steady external current, and its duration was 1% of the intrinsic period P_i. The only variable was the distance between current pulses: (A) 20% of P_i, (B) 40% of P_i, (C) 60% of P_i, and (D) 80% of P_i.
and its numerically computed actual counterpart (dotted curve) for variable distances between the two current pulses. The agreement between the actual and predicted values is very good. We also found that the shape of the current pulse is not critical. Numerical simulations using a presynaptic input from an identical Morris-Lecar model neuron rather than a square pulse gave results in agreement with equation 3.4. Another recent study (Foss & Milton, 2000) also showed that the shape of the presynaptic input does not dramatically change the response of the postsynaptic neuron.

4 Discussion and Conclusion
The improvement in the theoretical understanding of the PRC is significant. First, we used an unambiguous definition of geometrical (distance-based) phase that applies to every nonlinear oscillator, even in high-dimensional spaces. Second, we provided a geometrical conceptual basis for phase resetting. The phase-space analysis that we performed revealed complex problems posed by phase-resetting mechanisms and suggested analytic approaches to solving these problems. A main result of this article is a general method that allows the analytic construction of the PRC in addition to previously established computational methods. The relationship of our approach to previous approaches to this problem is given in appendix C. Although we referred to a particular model (Morris-Lecar) in order to compare our findings with numerically reported ones, the method we described is generally applicable. The theoretical predictions have been confirmed by numerical solution. For type I excitability, the agreement is quite satisfactory and generalizes to any arbitrary combination of pulses. For type II excitability, some disagreement persists, and further theoretical work is required. One source of error in the prediction of type II PRCs is the uncertainty in the tangential displacement due to relaxation after normal displacement. This tangential displacement was assumed to be linearly proportional to the normal displacement. However, for a very simple circular limit cycle with radius r_0, a constant angular velocity at all points on the limit cycle, and a constant linear velocity at all points near the limit cycle, it can be shown that a normal displacement of Δr produces a tangential displacement proportional to ln((r_0 + Δr)/r_0). Since limit cycles are not in general circular, the quantities r_0 and Δr are not defined for arbitrary limit cycles such as those shown in Figure 6. Nevertheless, perhaps a better approximation of the dependence of tangential displacement (geometrical phase) on normal displacement can be made by defining analogous quantities for arbitrary limit cycles. The ultimate goal of this work is to apply the results to the analysis of circuits composed of physiological oscillators, including bursting neurons.
The fundamental assumption of phase-resetting analysis is that the only effect of a perturbation is to lengthen or shorten the cycle in which the perturbation is received. When a neuron is a component of a network, the burst in that component neuron serves as the perturbation received by the neurons on which it makes synaptic connections. Thus, an alteration in burst duration or amplitude may have an impact on the resetting induced in the target neurons. To measure the magnitude of the signal shape change induced by a perturbation, we evaluated the relative change in the maximum amplitude of the signal. For type I excitability, the maximum amplitude of the membrane potential is negligibly affected by perturbations, as one would expect from the coincidence of the perturbed and unperturbed limit cycles. However, burst duration may be affected. Similarly, again as one would expect, for type II excitability, we observed that the maximum amplitude of the membrane potential changes significantly, in a manner strongly correlated with the PRC. Preliminary work (Oprisan & Canavier, 2000a, 2000b) shows that these effects of perturbations do not significantly compromise our ability to analyze circuits of coupled Morris-Lecar oscillators using their PRC characteristics. The work presented in this article will help us to analyze physiological circuits without having to generate PRCs experimentally for every combination of number, timing, amplitude, and duration of perturbations that might result from the oscillatory activity of the circuit, but rather to generalize from a small number of PRCs, we hope as few as one per neuron. How can this method be extended to physiological neurons? Although we do not have access a priori to the equations that govern the electrical activity of a given neuron, this method can still apply to physiological neurons because we have observed that the geometrical to temporal phase relationship that we obtained here using equation 2.3 is similar to the one we can numerically extract using the membrane potential record. More generally, we can replace the described Euclidean-based approach (which requires analytic expressions for the vector field) by a computational geometrical phase construction based entirely on a delayed embedding using the fast variable record (Takens, 1981). In order to obtain a slow variable, filtering techniques may be required. Once we have obtained an appropriate phase-plane reconstruction, the geometrical to temporal phase mapping can be obtained, and then the predicted phase resetting induced by external perturbations. We predict (Oprisan & Canavier, 2000a, 2000b) that these two steps can give us the PRC for neurons (such as cortical neurons) that are type I oscillators (Wilson & Bower, 1989; Traub & Miles, 1991). Experimentally obtained PRCs can then be used to confirm the validity of our theoretical analysis. In addition, these methods can be extended to type II neural oscillators, which appear frequently in physiological central pattern generators, by experimentally estimating the proportionality constant K to correct for the phase resetting that occurs during relaxation after normal displacement from the limit cycle.

Appendix A: The Dynamical Invariance of the Mapping Between the Geometrical Phase and the Temporal Phase
The dynamical invariance of the geometrical to temporal phase curve, which is a homeomorphism from the limit cycle to the unit circle, is especially important for our short-term objective, which is the recovery of the PRC from a phase-space reconstruction of the attractor. A sketch of the proof that the geometrical to temporal phase curve is an invariant is as follows. Consider a one-dimensional time series, such as the membrane potential record,
$$x[1], x[2], x[3], \ldots, \tag{A.1}$$
where the x[i] are evenly sampled at intervals of Δt, the time delay τ is an integral multiple of Δt, and D is the embedding dimension of the appropriate phase space. Then the following D-dimensional vectors describe the attractor:
$$y[i] = \left(x[i],\, x[i+\tau],\, x[i+2\tau],\, \ldots,\, x[i+(D-1)\tau]\right). \tag{A.2}$$
The Euclidean distance between two successive points y[i] and y[i+1] along the reconstructed attractor is
$$R^2_{i,i+1}(D) = \sum_{k=0}^{D-1}\left(x[i+1+k\tau] - x[i+k\tau]\right)^2, \tag{A.3}$$
which gives for the length of the closed trajectory $L(D) = \sum_{i=1}^{N} R_{i,i+1}(D)$, where N is the number of points along the closed trajectory and x[i] = x[i+N]. The geometrical distance (before normalization) after j < N steps along the limit cycle is
$$s_j(D) = \sum_{i=1}^{j} R_{i,i+1}(D). \tag{A.4}$$
Let D be the minimum embedding dimension of the attractor. Reconstructing the attractor in a 2D-dimensional phase space, we obtain
$$R^2_{i,i+1}(2D) = \sum_{k=0}^{2D-1}\left(x[i+1+k\tau] - x[i+k\tau]\right)^2. \tag{A.5}$$
This can be rewritten as
$$\begin{aligned}
R^2_{i,i+1}(2D) &= \sum_{k=0}^{D-1}\left(x[i+1+k\tau] - x[i+k\tau]\right)^2 + \sum_{k=D}^{2D-1}\left(x[i+1+k\tau] - x[i+k\tau]\right)^2 \\
&= R^2_{i,i+1}(D) + \sum_{j=0}^{D-1}\left(x[i+1+(D+j)\tau] - x[i+(D+j)\tau]\right)^2 \\
&= R^2_{i,i+1}(D) + \sum_{j=0}^{D-1}\left(x[i+1+D\tau+j\tau] - x[i+D\tau+j\tau]\right)^2.
\end{aligned} \tag{A.6}$$
The lag time τ should be as small as possible to capture the shortest changes in the data, yet it should be large enough to generate the maximum possible independence between the components of the vectors in the phase space. Since the time span Dτ of each vector in the phase space represents the duration of a state of the system, it should be at most equal to the period of the dominant frequency: Dτ = N (both τ and N are integer multiples of the time step Δt). Therefore, x[i + Dτ] = x[i + N] = x[i], where the last equality
means that the period of the attractor is N. Using the above observation, the above expression simplifies to
$$R^2_{i,i+1}(2D) = 2R^2_{i,i+1}(D). \tag{A.7}$$
Thus, the geometrical distance s_j(D) after j steps along the limit cycle and the total length of the limit cycle L(D) show the same scaling with respect to the embedding dimension: s_k(2D)/L(2D) = s_k(D)/L(D). Our numerical computations show that the general scaling relationship is s_k(a·D) = s_k(D)·a^c, with c slowly dependent on k and saturating at c ≈ 0.5, which also supports our conjecture that the mapping between geometrical and temporal phase is a dynamical invariant.

Appendix B: The Dimensionless Morris-Lecar Model Neuron
We use the Morris-Lecar model as an example of a simple neural model. Despite its simplicity, it produces a facsimile of the membrane potential envelope of a neural oscillator. The dimensionless equations are (Morris & Lecar, 1981; Rinzel & Lee, 1986; Rinzel & Ermentrout, 1998):
$$\begin{aligned}
\frac{dV}{dt} &= I_{ext}(t) - g_{Ca}\, m_\infty(V)\,(V - V_{Ca}) - g_K\, w\,(V - V_K) - g_L\,(V - V_L) = f_1(V, w), \\
\frac{dw}{dt} &= \phi\,\lambda(V)\left(w_\infty(V) - w\right) = f_2(V, w),
\end{aligned} \tag{B.1}$$
with $m_\infty(V) = \frac{1}{2}\left(1 + \tanh\frac{V - V_1}{V_2}\right)$, $w_\infty(V) = \frac{1}{2}\left(1 + \tanh\frac{V - V_3}{V_4}\right)$, and $\lambda(V) = \cosh\frac{V - V_3}{2V_4}$. For type I excitability, we have (Rinzel & Ermentrout, 1998) the following values: V_1 = −0.01; V_2 = 0.15; V_3 = 0.1; V_4 = 0.145; V_K = −0.7; V_L = −0.5; V_Ca = 1.0; g_Ca = 1.33; g_K = 2.0; g_L = 0.5; φ = 0.33; I = 0.0705. For type II excitability, the parameters have the same values except V_3 = 0.017; V_4 = 0.25; g_Ca = 2.2; g_K = 4.0; g_L = 1.0; φ = 0.4; I = 0.4.

Appendix C: Comparison with Previous Studies
There are similarities between our method of predicting phase resetting and the one given in the appendix of Ermentrout and Kopell (1991a, 1991b), but there are some significant differences as well. Our derivation uses geometrical intuition, whereas that of Ermentrout and Kopell is analytic. In Ermentrout and Kopell (1991a, 1991b), a system of N state variables representing oscillator k that is coupled to oscillator j as a function of time,
$$\frac{du_k}{dt} = f_k(u_k) + g(u_k, u_j), \tag{C.1}$$
is converted by a change of variables to a system that is a function of phase (θ) around the limit cycle solution to the original equations, as well as N − 1 variables that measure normal distance from the limit cycle. These latter N − 1 variables can be ignored by assuming a strong attraction to the limit cycle, which renders them negligibly small (as in a type I oscillator). Ermentrout and Kopell further assume that the phase variables of two coupled oscillators are never perturbed independently, which reduces the system to a single phase variable. The vector U_k′ contains the derivatives, evaluated along the limit cycle, of each of the original state variables with respect to phase (dU_k/dθ):
$$\omega\,\frac{du_k}{dt} = U_k'(\theta)\,\frac{d\theta_k}{dt}, \tag{C.2}$$
where θ(t) = ωt and dθ = ω dt. Combining equations C.1 and C.2 yields
$$\omega^{-1}\,U_k'(\theta)\,\frac{d\theta_k}{dt} = f_k(U_k) + g(U_k, U_j). \tag{C.3}$$
Multiplying both sides of equation C.3 by the inverse of U_k′, which is $U_k'^{T}/\|U_k'\|^2$, and using the fact that $f_k(U_k) = U_k'(\theta(t))$, the following expression results:
$$\frac{d\theta_k}{dt} = \omega + \omega\,\frac{U_k'^{T}}{\|U_k'\|^2}\, g(U_k, U_j). \tag{C.4}$$
A similar expression appears in Hoppensteadt and Izhikevich (1997, p. 255). By analogy with equation A4 of Ermentrout and Kopell (1991a, 1991b), we can define $h_k(\theta_k, \theta_j)$ as $\frac{U_k'}{\|U_k'\|^2}\, g(U_k, U_j)$. Both our derivation and that of Ermentrout and Kopell (1991a, 1991b) take advantage of the fact that the coupling is in one variable (V) only, hence the particular form of the perturbation. In a system of two coupled Morris-Lecar oscillators, they assumed
$$g(U_k, U_j) = \left[\,g_{ex}(V_k(\theta))\,(V_{ex} - V_j(\theta)),\; 0\,\right]. \tag{C.5}$$
The nonzero term is simply the scaled instantaneous current and is equivalent to the scaled instantaneous current in our derivation, I/C_M. We assumed a constant pulse, but any arbitrary pulse can be used, and the integral of that scaled current over the perturbing pulse can be substituted for Iτ/C_M, or ΔV, in our formulation. Thus, the final expression to be used for comparison with our results is
$$\frac{d\theta}{dt} = \omega + \omega\, i\,\frac{V'(\theta)}{V'^2(\theta) + w'^2(\theta)}, \tag{C.6}$$
where i is the instantaneous scaled current from equation C.5 and V′(θ) is the only surviving element of U_k′. Just as in our results described below, this gives the phase resetting due to a perturbation by projecting the perturbation in voltage onto the limit cycle. However, the intuition behind this reasoning is described only in this article. The most logical way to compare our results with those of Ermentrout and Kopell is to use the product form of the coupling, $h_k(\theta_k, \theta_j) = P(\theta)R(\theta)$, and to set
$$P(\theta) = i \quad\text{and}\quad R(\theta) = \frac{V'(\theta)}{V'^2(\theta) + w'^2(\theta)}. \tag{C.7}$$
This is not how Ermentrout and Kopell (1991a, 1991b) defined R(θ) and P(θ): they included the presynaptic voltage terms in R(θ) and all postsynaptic terms in P(θ). Since they used a synaptic current for the perturbation rather than a square pulse, the current term g_ex(V(θ_j))(V_ex − V(θ_k)) was not equivalent to R(θ), and the P(θ) term included (V_ex − V(θ_k)). On the other hand, we separated the terms into a term representing the dynamics near the original unperturbed limit cycle (R(θ)) and a term representing the perturbation in voltage (P(θ)). If one assumes that i is nonzero for a very brief period of time and that, for simplicity, the integral of i over this brief interval is 1 (like a delta function), then one can take R(θ) to be constant at its value at the time the perturbation is initiated. The θ on the left-hand side of equation C.6 varies with a constant angular velocity over most of the limit cycle (everywhere except when it receives a perturbation; see Figure 11). The temporal phase resetting in response to an instantaneous pulse is then a scaled version of R(θ) (see Figure 11):
$$F(\theta) = -\omega^{-1} R(\theta). \tag{C.8}$$
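Equations C.7 and C.8 are easy to check on a toy circular limit cycle V(θ) = cos θ, w(θ) = sin θ (an illustration, not the Morris-Lecar cycle), for which V′² + w′² = 1 and hence R(θ) = −sin θ. A Python sketch using finite-difference derivatives with respect to phase:

```python
import math

OMEGA = 2.0 * math.pi
M = 1000                      # samples of the limit cycle over one period

# Toy circular limit cycle, sampled as a function of phase theta:
theta = [2.0 * math.pi * k / M for k in range(M)]
V = [math.cos(th) for th in theta]
w = [math.sin(th) for th in theta]

def d_dtheta(y, k):
    """Central-difference derivative with respect to phase (periodic)."""
    h = 2.0 * math.pi / M
    return (y[(k + 1) % M] - y[(k - 1) % M]) / (2.0 * h)

# R(theta) = V'/(V'^2 + w'^2)  (equation C.7), F = -R/omega  (equation C.8)
R = [d_dtheta(V, k) / (d_dtheta(V, k) ** 2 + d_dtheta(w, k) ** 2)
     for k in range(M)]
F = [-r / OMEGA for r in R]

# For this cycle V' = -sin and w' = cos, so R(theta) should equal -sin(theta).
err = max(abs(R[k] + math.sin(theta[k])) for k in range(M))
```

The same finite-difference construction applies to any sampled limit cycle, which is what makes the comparison with a reconstructed attractor (discussed in section 4) practical.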
In our derivation, we define a new quantity s, the geometrical phase, which is distance along the limit cycle as a function of time. This function, unlike the temporal phase θ(t) used by Ermentrout and Kopell (1991a) and by Hoppensteadt and Izhikevich (1997), has a variable derivative along the limit cycle:
$$s = s(t) = \frac{1}{L}\int_0^t \|\dot{u}(t)\|\,dt, \tag{C.9}$$
where L is the length of the limit cycle in the appropriate coordinate system and the dot indicates the derivative with respect to time. Thus, $ds = \|\dot{u}(t)\|\,dt$. We show that for Morris-Lecar,
$$ds = \frac{\dot V(t)}{\sqrt{\dot V^2(t) + \dot w^2(t)}}. \tag{C.10}$$
Figure 11: Relationship of the PRC F(θ) to a change in angular displacement, R(θ). This plot of the constant angular velocity of the temporal phase indicates the resetting produced by Ermentrout and Kopell's (1991a, 1991b) R(θ).
From equation C.9, we see that for Morris-Lecar,
$$ds = \sqrt{\dot V^2(t) + \dot w^2(t)}\,dt. \tag{C.11}$$
Furthermore, the temporal phase resetting (φ) can be obtained using the inverse of s(t):
$$d\varphi = s^{-1}(s). \tag{C.12}$$
We can recover the result of Ermentrout and Kopell (1991a) for the temporal phase resetting R(θ) given in equation C.8 by equating R(θ) with the instantaneous change in temporal phase dt due to a delta function perturbation and then substituting equation C.12 into C.11:
$$dt = \frac{V'(t)}{V'^2(t) + w'^2(t)}, \tag{C.13}$$
noting that in Ermentrout and Kopell, time t and phase φ are related by a constant. The expression given in equation C.13 is the adjoint, or infinitesimal, PRC, as given in equation 3.2 of Hansel, Mato, and Meunier (1995) and equation A21 of Ermentrout and Kopell (1991b), provided that variations in
amplitude (perturbations normal to the limit cycle) are ignored. Izhikevich (2000) also obtained a similar result, but in the limit as the derivative of the slow variable (w′²(t) in C.13) goes to zero. Although previous work in the literature contains expressions that are mathematically similar to the ones presented in this article, it does not contain the geometrical intuition presented here. This intuition allows for the possibility of extending PRC analysis to type II model neurons and even to physiological type II neurons for which we do not know the equations of state but for which we can reconstruct an attractor.

Acknowledgments
This work was supported by Whitaker Foundation grant TF-98-0033 and NSF grant IBN-0118117.

References

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. Sh. (1993). The analysis of observed chaotic data in physical systems. Rev. Mod. Phys., 65(4), 1331–1392.
Abramovich-Sivan, S., & Akselrod, S. (1998a). A PRC based model of a pacemaker cell: Effect of vagal activity and investigation of the respiratory sinus arrhythmia. J. Theor. Biol., 192, 219–243.
Abramovich-Sivan, S., & Akselrod, S. (1998b). A phase response curve based model: Effect of vagal and sympathetic stimulation and interaction on a pacemaker cell. J. Theor. Biol., 192, 567–579.
Arshavsky, Yu. I., Beloozerova, I. N., Orlovsky, G. N., Panchin, Yu. V., & Pavlova, G. A. (1985). Control of locomotion in marine mollusc Clione limacina I. Efferent activity during actual and fictitious swimming. Exp. Brain Res., 58, 255–262.
Baxter, R. J., Canavier, C. C., Lechner, H. A., Butera, R. J., DeFranceschi, A. A., Clark, J. W., & Byrne, J. H. (2000). Coexisting stable oscillatory states in single cell and multicellular neuronal oscillators. In D. Levine, V. Brown, & T. Shirley (Eds.), Oscillations in neural systems (pp. 51–77). Hillsdale, NJ: Erlbaum.
Beer, R. D., Chiel, H. J., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking II. General principles and individual variability. J. Comp. Neurosci., 7(2), 119–147.
Canavier, C. C., Baxter, D. A., Clark, J. W., & Byrne, J. H. (1999). Control of multistability in ring circuits of oscillators. Biol. Cybern., 80, 87–102.
Canavier, C. C., Butera, R. J., Dror, R. O., Baxter, D. A., Clark, J. W., & Byrne, J. H. (1997). Phase response characteristics of model neurons determine which patterns are expressed in a ring circuit model of gait generator. Biol. Cybern., 77, 367–380.
Chiel, H. J., Beer, R. D., & Gallagher, J. C. (1999). Evaluation and analysis of model CPGs for walking: I. Dynamical models. J. Comp. Neurosci., 7, 1–20.
Collins, J. J., & Richmond, S. A. (1994). Hard-wired central pattern generators for quadrupedal locomotion. Biol. Cybern., 71, 375–385.
Collins, J. J., & Stewart, I. N. (1993). Coupled nonlinear oscillators and the symmetries of animal gaits. J. Nonlin. Sci., 3, 349–392.
Dror, R. O., Canavier, C. C., Butera, R. J., Clark, J. W., & Byrne, J. H. (1999). A mathematical criterion based on phase response curves for the stability of a ring network of oscillators. Biol. Cybern., 80, 11–23.
Ermentrout, G. B. (1985). The behavior of rings of coupled oscillators. J. Math. Biol., 23, 55–74.
Ermentrout, G. B. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ermentrout, G. B. (1994). Neural modelling and neural networks. Oxford: Pergamon Press.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Ermentrout, G. B., & Kopell, N. (1991a). Oscillator death in systems of coupled neural oscillators. SIAM J. Appl. Math., 50(1), 125–146.
Ermentrout, G. B., & Kopell, N. (1991b). Multiple pulse interactions and averaging in coupled neural oscillators. J. Math. Biol., 29, 195–217.
FitzHugh, R. (1969). Mathematical models of excitation and propagation in nerve. In H. P. Schwan (Ed.), Biological engineering (pp. 1–10). New York: McGraw-Hill.
Foss, J., & Milton, J. (2000). Multistability in recurrent neural loops arising from delay. J. Neurophysiology, 84, 975–985.
Glass, L., & Mackey, M. (1988). From clocks to chaos: The rhythms of life. Princeton, NJ: Princeton University Press.
Golubitsky, M., Stewart, I., Buono, P.-L., & Collins, J. J. (1998). A modular network for legged locomotion. Physica D, 115, 56–72.
Guckenheimer, J., & Holmes, P. (1983). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. Berlin: Springer-Verlag.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchronization in excitatory neural networks.
Neural Computation, 7, 307–337.
Hodgkin, A. L. (1948). The local electric changes associated with repetitive action in a non-medullated axon. J. Physiology, 107, 165–181.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Izhikevich, E. M. (2000). Phase equations for relaxation oscillators. SIAM J. Appl. Math., 60, 1789–1804.
Kopell, N., & Ermentrout, G. B. (1988). Coupled oscillators and the design of central pattern generators. Math. Biosci., 90, 87–109.
Mirollo, R. I., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Murray, J. D. (1993). Mathematical biology (2nd ed.). New York: Springer-Verlag.
Nishii, J. (1999). Learning model for coupled neural oscillators. Network: Comp. Neural Syst., 10, 213–226.
Oprisan, S. A., & Canavier, C. C. (2000a). Phase response curve via multiple time scales analysis of the limit cycle behavior of type I and type II excitability. Biophys. J., 78, 218A.
Oprisan, S. A., & Canavier, C. C. (2000b). Analysis of central pattern generator circuits using phase resetting. Soc. Neurosci. Abstr., 26, 2000.
Oprisan, S. A., & Canavier, C. C. (2001). The structure of instantaneous phase resetting in a neural oscillator. InterJournal of Complex Systems, manuscript 386. Available on-line at: www.interjournal.org/.
Pavlidis, T. (1973). Biological oscillators: Their mathematical analysis. Orlando, FL: Academic Press.
Pinsker, H. M. (1977). Aplysia bursting neurons as endogenous oscillators: I. Phase-response curves for pulsed inhibitory synaptic input. J. Neurophysiology, 40, 527–543.
Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (pp. 251–292). Cambridge, MA: MIT Press.
Rinzel, J., & Lee, Y. S. (1986). On different mechanisms for membrane potential bursting. In H. G. Othmer (Ed.), Nonlinear oscillations in biology and chemistry. New York: Springer-Verlag.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L. S. Young (Eds.), Proceedings of the Symposium on Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
Traub, R. D., & Miles, R. (1991). Neural networks of the hippocampus. Cambridge: Cambridge University Press.
Wilson, M. A., & Bower, J. M. (1989). The simulation of large scale neural networks. In Methods in neural modeling. Cambridge, MA: MIT Press.
Winfree, A. (1980). The geometry of biological time. New York: Springer-Verlag.

Received March 5, 2001; accepted July 25, 2001.
LETTER
Communicated by Steven Nowlan
Fields as Limit Functions of Stochastic Discrimination and Their Adaptability

Philip Van Loocke
[email protected]
Lab for Applied Epistemology, University of Ghent, 9000 Ghent, Belgium

For a particular type of elementary function, stochastic discrimination is shown to have an analytic limit function. Classifications can be performed directly by this limit function instead of by a sampling procedure. The limit function has an interpretation in terms of fields that originate from the training examples of a classification problem. Fields depend on the global configuration of the training points. The classification of a point in input space is known when the contributions of all fields are summed. Two modifications of the limit function are proposed. First, for nonlinear problems like high-dimensional parity problems, fields can be quantized. This leads to classification functions with perfect generalization for high-dimensional parity problems. Second, fields can be provided with adaptable amplitudes. The classification corresponding to a limit function is taken as an initialization; subsequently, amplitudes are adapted until an error function for the test set reaches a minimal value. It is illustrated that this increases the performance of stochastic discrimination. Due to the nature of the fields, generalization improves even if the amplitude of every training example is adaptable.

1 Introduction
In recent years, different types of random discrimination methods for classification have been studied (Kleinberg, 1996, 2000; Ho, 1995; Berlind, 1994; Van Loocke, 2000, 2001). Such methods generate a huge number of elementary functions (which correspond to units in the present context) to define a classification. They run counter to the intuition that generalization is best when the number of units in a network is kept relatively low. The philosophy behind this approach is that the errors generated by an individual function are smoothed out by the other functions. Every individual function has a poor performance on a problem. Therefore, random discrimination methods have been called weak methods. Each time a new function is included, however, the statistical properties of the sum of the functions tend to improve. On an intuitive level, the statistical fact behind the effectiveness of random discrimination can be formulated as follows. Consider a training set
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1059–1070 (2002)
1060
Philip Van Loocke
with N input-output pairs, and suppose that the vector v ≡ (v(u_1), …, v(u_N)) is the vector composed of all output values; v(u_i) is the output value corresponding to the ith training input u_i. The input space is denoted I and has n dimensions. Since we examine only classification problems in this article, the output space O can be assumed to be one-dimensional.¹ Suppose that a set S of vectors with positive correlation with v is generated. Then if S is sufficiently large and if the vectors of S are randomly distributed around v, v will be the average of the vectors in S. Suppose that a set of elementary functions f (with f : I → O) is considered. For every elementary function f, the vector v_f composed of the output values for all training instances can be formed. If the vectors v_f have positive correlation with v and if they are randomly distributed around v, the average of the elementary functions will converge to v when the number of elementary functions increases. When this supposition is turned into a practical method, a series of functions f such that v_f has varying correlation with v is generated. Every time a function is met for which the correlation between v_f and v is positive, the function is selected, and v_f is put in the set S. The latter is extended until a satisfying classification is obtained. The mathematics of random discrimination has been studied in detail by Kleinberg (1996, 2000). We recapitulate only the core mathematical fact. Suppose that a classification problem divides an input space I into two classes A and B. Suppose that A and B correspond to outputs 1 and 0. Consider an elementary function f that defines a partition into two classes A_f and B_f. A point q ∈ I is mapped on 1 by f if q ∈ A_f; it is mapped on 0 if q ∈ B_f.
Suppose that different functions f are generated in such a way that two criteria are met: (1) if a point q belongs to A, then p(q ∈ A_f) > p(q ∈ B_f), and (2) when different functions f are considered, the chance that a point q is a member of A_f is independent of q (meaning that the parts receiving output value 1 cover I uniformly when different functions f are considered), and suppose that the same is true for B_f. Then the variable

X(q, f) = (f(q) − p(q ∈ A_f | q ∈ B)) / (p(q ∈ A_f | q ∈ A) − p(q ∈ A_f | q ∈ B))

has average 1 for a point q belonging to A and average 0 for a point belonging to B (see Kleinberg, 1996, 2000). It follows from the central limit theorem that the sum over different elementary functions f,

Y(q) = Σ_{k=1 to t} X(q, f_k) / t,

is normally distributed with average 1 if q belongs to A and with average 0 if q belongs to B. The variance of this variable is 1/t times the variance of X(q, f), so that for high values of t, Y(q) can be considered as the output of an efficient classifier system.

¹ We confine ourselves to examples that are binary classification problems. The generalization of this situation to classifications in more than two classes is straightforward and can be carried out as described in Kleinberg (2000).
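As a quick sanity check of this fact, the following Python sketch (our own illustration; the probabilities p_a = 0.6 and p_b = 0.5 are arbitrary assumed values, not from the text) simulates random weak classifiers and averages the normalized variable X(q, f):

```python
import random

# For a weak classifier f, let p_a = p(q in A_f | q in A) and
# p_b = p(q in A_f | q in B), with p_a > p_b.  The normalized variable
#   X(q, f) = (f(q) - p_b) / (p_a - p_b)
# averages to 1 for a point of class A and to 0 for a point of class B.

def average_x(p_hit: float, p_a: float, p_b: float, t: int, rng) -> float:
    """Average X(q, f) over t random weak classifiers; f(q) equals 1 with
    probability p_hit (p_a for a class-A point, p_b for a class-B point)."""
    total = 0.0
    for _ in range(t):
        f_q = 1.0 if rng.random() < p_hit else 0.0
        total += (f_q - p_b) / (p_a - p_b)
    return total / t

rng = random.Random(0)
p_a, p_b, t = 0.6, 0.5, 20000
y_class_a = average_x(p_a, p_a, p_b, t, rng)  # point q in A
y_class_b = average_x(p_b, p_a, p_b, t, rng)  # point q in B
print(y_class_a, y_class_b)  # close to 1 and 0, respectively
```

With t = 20,000 samples the two averages land close to 1 and 0, in line with the central limit theorem argument above.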
Fields as Limit Functions
1061
Suppose that the input space I is discrete and that an elementary function f randomly partitions I into two classes, A_f and B_f. Suppose that a function is sampled if the number of vectors belonging to A_f that are in A is larger than the number of input vectors of B_f that are in A. For random partitions, the uniformity condition is met, so that sampling and averaging lead to an exact classification of all training examples. But random functions do not generalize, since for two points q and q′ that do not both belong to the training set, f(q) and f(q′) are uncorrelated, however close they are. Therefore, random discrimination methods use elementary functions that are smooth. As a consequence, the uniformity condition is generally not met. Kleinberg (1996, 2000) has shown that for some choices of f, weakening this condition still leads to good performance on training sets while allowing for good generalization properties. In Kleinberg's work, the functions f take the following form.² Consider a point q in I that belongs to a training couple. A rectangular parallelepiped is constructed as follows. For every dimension, a random number between q and the maximal value for this dimension and a random number between q and the minimal value for this dimension are generated. The parallelepiped is obtained as the set of points of I with coordinates that are between these numbers for all dimensions. The parallelepiped and its complement define A_f and B_f. Different functions result for different choices of q and the random numbers that define the boundaries of the parallelepipeds. In this article, we introduce a different set of elementary functions. Basically, they are defined by randomly selecting a number of training vectors and allowing these vectors to split up the input space in terms of a nearest-neighbor condition.
The special feature of these functions is that when the number of functions sampled increases, the average function can be written in a relatively simple analytical form L(q). This form can be used directly to perform the classification task. Basically, the classification of an input vector q requires one to rank all training vectors u_i in terms of their distance to q. A power of this rank determines the contribution of u_i to the output of q (see section 2). Then we consider two possibilities to modify the form L(q). The first is aimed at problems that are strongly nonlinear. It adds a factor to L(q) that accounts for the spherically differentiated predictive value of a training vector. For every training vector u, the input space is divided into concentric hyperspheres with center u and with increasing radius, and the relative number of training vectors with the same classification as u for the region between two successive hyperspheres is determined. This number determines the extent to which u is allowed to contribute to the outputs of the points located in this region (see section 3). As a second modification of the algorithm, every training item receives an amplitude. The amplitude specifies the weight that is given to the training item in the limit function. Initially, all amplitudes are put to one. An incremental procedure is used to search for amplitudes that minimize the error on the training set. The nature of the limit function guards against overtraining, even if the number of parameters that can be adapted equals the number of training vectors (see section 4). The resulting classification method is applied to two types of problems. Section 5 considers two generative binary problems: an eight-dimensional linear problem and an eight-dimensional parity problem. It is shown that the method gives perfect classifications for training sets and also generalizes perfectly, even in the case of parity problems. In the latter case, perfect generalization is obtained without the need to adapt amplitudes. In section 6, the method is applied to two real-world data sets. When compared to the 10 random discrimination methods evaluated in Kleinberg (2000), the method increases performance for both data sets. Intuitively, the approach explored here considers every training example as the source of a field in the input space. In every point q of the input space, this field is determined by the output associated with a source and by a power of the local rank of the source. The resulting quantities are summed, which yields the total local field in q. If it is suspected that the resulting classifier is not yet optimal, then the fields are provided with amplitudes, and these are adapted by an incremental procedure that minimizes an error function. Section 7 briefly discusses this intuition.

² In Van Loocke (2000, 2001), elementary functions of a sigmoid type have been shown to be efficient at nonlinear problems such as high-dimensional parity problems. These functions were applied to binary problems only.

2 A Limit Function for a Particular Type of Stochastic Discrimination
We define an elementary function f as follows. Select randomly k training items (u_{i_s}, v(u_{i_s})) from the training set (s = 1, …, k); u_{i_s} has n continuous or discrete components, and v(u_{i_s}) is a binary number (k < N; as explained below, different values of k result in different types of limit functions). From this section on, the binary class-specifying values are +1 and −1 instead of 1 and 0. Then f is determined by a nearest-neighbor condition: for every point q of the input space, the point u_c in {u_{i_1}, …, u_{i_k}} that is closest to q is determined, and f is defined by f(q) = v(u_c). If k = 2, then f draws a hyperplane in the input space. The hyperplane is orthogonal to the line connecting u_{i_1} and u_{i_2} and passes through the middle of the line segment [u_{i_1}, u_{i_2}]. All points on the side of u_{i_1} receive the same output as u_{i_1}, and the output of all points on the side of u_{i_2} is v(u_{i_2}). For k = 3, the input space is divided into three parts with linear separation, and so on. A function f in general does not necessarily obey the criterion that more items of A are mapped on +1 than on −1 or that more items of B are mapped on −1 instead of on +1. (A and B are defined as in section 1: A is the set of input training points mapped on +1, and conversely for B.) The function f maps k training points correctly with certainty. The extent to which the other N − k training points are mapped correctly depends on the configuration of the latter in
the input space. The higher k is, the higher is the chance that f obeys the criterion that at least half of the training items are classified correctly. We include all functions in the sampling procedure even though this criterion may occasionally not be obeyed for low values of k. In this sense, we are more lenient toward our sampling functions than in the situation described in section 1. Consider a point q in the input space and a point u that appears as the first argument of a training couple. Suppose that a large set of functions f is sampled for a fixed value of k. By definition, the chance ρ(q, u) that q is directed by u is the chance that u is the point used to specify the output of q. This chance is equal to the chance that u appears in a selected set {u_{i_1}, …, u_{i_k}} and that u is the point of these k points closest to q. The former chance is the same for every point u and will be denoted C. In order to obtain the chance that a point u in the set {u_{i_1}, …, u_{i_k}} is the closest point to q, it can be assumed without loss of generality that u coincides with u_{i_1}. The chance that q is closer to u_{i_1} than to u_{i_2}, and closer to u_{i_1} than to u_{i_3}, …, and closer to u_{i_1} than to u_{i_k} is equal to the (k − 1)th power of the chance that q is closer to u_{i_1} than to an arbitrary other point of the training set. Suppose that the input points of the training set are ranked in terms of their distance to q, and that u is the point of the training set that is the rth closest to q. This rank is a function of q and u: r ≡ r(q, u). Suppose that all training points have different distances to q. Then the training set has N + 1 − r points that are at least as far from q as u, so that the chance that u is the closest point is equal to (N + 1 − r(q, u)) / N. Hence, we obtain

ρ(q, u) = C × ((N + 1 − r(q, u)) / N)^(k−1).
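This power law can be checked numerically. The short Python sketch below (our own illustration; it draws k distinct ranks per elementary function, so the match with the power-law weights is approximate and improves with N) compares the empirical distribution of the directing rank with weights proportional to ((N + 1 − r)/N)^(k−1):

```python
import random
from collections import Counter

# Among k training points drawn at random, the chance that the point of
# rank r (rth closest to q) is the nearest of the k is approximately
# proportional to ((N + 1 - r) / N) ** (k - 1) for large N.

N, k, trials = 200, 3, 300000
rng = random.Random(1)

counts = Counter()
for _ in range(trials):
    sample = rng.sample(range(1, N + 1), k)  # k distinct ranks
    counts[min(sample)] += 1                 # the nearest of the k directs q

weights = [((N + 1 - r) / N) ** (k - 1) for r in range(1, N + 1)]
total = sum(weights)
max_dev = max(abs(counts[r] / trials - weights[r - 1] / total)
              for r in range(1, 11))  # compare the ten closest ranks
print(max_dev)  # small: the power law matches the simulation
```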
For this type of elementary function f, a point q of the input space is directed by a single training point only. Hence, the chance that q is directed by a point belonging to A is equal to the sum of the chances that it is directed by the individual elements of A; the same is true for B. Hence:

p(f(q) = 1) = Σ_{u∈A} ρ(q, u)
and
p(f(q) = −1) = Σ_{u∈B} ρ(q, u).
Suppose now that a very large number of functions f is sampled and that the sign of their average determines the classification of q. Then q will be mapped on 1 if p(f(q) = 1) > p(f(q) = −1); else, it will be mapped on −1. This can be written in a slightly different form: L(q) ≡ sgn⟨f⟩ takes the form

L(q) = 1 if Σ_{u∈A} ρ(q, u) − Σ_{u∈B} ρ(q, u) > 0
L(q) = −1 if Σ_{u∈A} ρ(q, u) − Σ_{u∈B} ρ(q, u) < 0.
Since for points u of A, v(u) = 1, and conversely for B, these expressions can be summarized in a single expression:

L(q) = sgn(Σ_u ρ(q, u) × v(u)),
where the sum ranges over the entire training set. Due to the presence of the signum function, the constant C in the expression for ρ(q, u) can be omitted, and we obtain

L(q) = sgn(Σ_u v(u) × ((N + 1 − r(q, u)) / N)^(k−1)).

Different values of k result in different limit functions. The above expression entails that the higher k is, the stronger is the influence exerted by the closest training points on q. Therefore, if k is taken too high, L(q) acts too much like an ordinary nearest-neighbor classifier, and configurations of training points in the broader neighborhood of q cannot exert influence on the classification. On the other hand, if k is given the lowest possible value (k = 2), then the functions f of which L(q) is the limit function have a higher chance of not being weak classifiers. As illustrated in sections 5 and 6, k = 3 appears to be an adequate compromise between these situations. Instead of generating and averaging high numbers of elementary functions f, one can turn directly to the limit function L(q) to perform a classification. This yields a more accurate application of the method, and it also significantly reduces computation time if large numbers of elementary functions are needed. In the context of this paper, an equally important fact about having an explicit expression for L(q) is that it can be modified in two natural ways that can lead to better classifications.

3 Radial Differentiation and Quantization of Spherical Fields
The expression for L(q) can be read as follows. Each point u in I that belongs to a training item exerts an influence in all points of the input space. This influence consists of a positive stimulation to take the same output as u. The point u can be considered the source of this influence. If the training set has a uniform distribution around u, then the influence of u decreases as a function of the distance only. If this condition does not hold, u may exert high influence over a long range in one direction and have faster decaying influence in another direction if the latter has a higher density of training points. The function L(q) is limited in its ability to grasp strongly nonlinear structure. Consider, for instance, a parity problem. If a (binary) point receives the strongest stimulation from its closest (binary) neighbors, then L will have a strong tendency to produce the wrong output, since these neighbors have different parity. For this type of problem, L(q) can be complemented in a natural way with a factor that corresponds to a spherical quantization of the fields. Suppose that for every training point u, ν concentric hyperspheres are considered, with an increasing diameter given by s × μ, where μ is the diameter of the smallest hypersphere (s = 1, …, ν). The relative number of items in the smallest hypersphere with the same classification as u is denoted κ(u, 1). The relative number of items located between the (s − 1)th
and the sth hypersphere and with the same classification as u is κ(u, s). The latter number can be used as a quality score for the contribution of u to the sth spherical band around u (if the sth band has no training items, the quality score is put equal to 0.5). Then the following modification of L(q) results. For every point q of the input space that has to be classified and for every training point u, the spherical band around u to which q belongs is determined. The contribution of u to the classification of q is weighted by the quality score of this band. Suppose that q is located in the s(q)th band around u. Then the function L(q) can be modified into K(q) with

K(q) = sgn(Σ_u v(u) × κ(u, s(q)) × ((N + 1 − r(q, u)) / N)^(k−1)).

The introduction of the factor κ(u, s(q)) allows this method to grasp structural properties that are reducible to sums of spherical regularities around training points. The factor can be seen as a coefficient in an analysis in terms of spherical bands. Unlike in Fourier analysis, for instance, the coefficients multiply not orthogonal functions but another factor that codetermines the weight of v(u). Intuitively, u becomes the source of a spherically symmetric field that is quantized in ν bands but is directionally invariant. This field multiplies the nonsymmetrical factor derived in the previous section. There are other ways in which L can be modified with spherically defined differentiation. Consider a noisy training point u in the input space. For such a point, the closest training point with a different output value will on average be closer than for nonnoisy training points. Therefore, one can determine the largest hypersphere for which the training points have the same output as u and increase the influence of u within this hypersphere. Then, nonnoisy points will tend to be associated with larger hyperspheres in which they exert increased influence. This modification can be stated formally as follows.
Suppose that ω(u, q) takes the value 1 for every point q outside the hypersphere associated with u and the value α, with α > 1, for points inside the hypersphere. Then one can consider the alternative classification function K′(q) with

K′(q) = sgn(Σ_u v(u) × ω(u, q) × ((N + 1 − r(q, u)) / N)^(k−1)).

The enhancement of K′(q) relative to L(q) should be expected to be limited. In effect, L(q) itself suppresses noise, since in every region that corresponds clearly to one of the two classes, noisy points tend to be more sparsely distributed than nonnoisy points, so that their influence is often canceled by the majority of nonnoisy points in the neighborhood of this region. This fact turns out to be true for the two practical examples studied in section 6, for which K′(q) yields only an occasional and small improvement relative to L(q).
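For concreteness, the plain limit function of section 2 can be sketched in a few lines of Python (our own illustration; function names are ours, and the toy one-dimensional data set is invented):

```python
import math

# Sketch of the limit function of section 2:
#   L(q) = sgn( sum_u v(u) * ((N + 1 - r(q, u)) / N) ** (k - 1) ),
# where r(q, u) is the rank of training point u by distance to q.

def limit_classify(q, train_x, train_y, k=3):
    """Classify q with the limit function L(q); train_y values are +1/-1."""
    n = len(train_x)
    dists = [math.dist(q, u) for u in train_x]
    order = sorted(range(n), key=lambda i: dists[i])  # rank 1 = closest
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position
    total = sum(train_y[i] * ((n + 1 - rank[i]) / n) ** (k - 1)
                for i in range(n))
    return 1 if total > 0 else -1

# Toy example: six 1-D points, linearly separated classes.
train_x = [(-3.0,), (-2.0,), (-1.0,), (1.0,), (2.0,), (3.0,)]
train_y = [-1, -1, -1, 1, 1, 1]
print(limit_classify((-2.5,), train_x, train_y))  # -> -1
print(limit_classify((2.5,), train_x, train_y))   # -> 1
```

On this linearly separated toy set, points left of the origin are classified −1 and points right of it +1, with the rank weights ((N + 1 − r)/N)^(k−1) letting the whole configuration of training points, not just the nearest neighbor, contribute.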
4 Adaptable Amplitudes
For fixed values of k, μ, and α, the functions L(q), K(q), and K′(q) are not adaptable. They do not contain parameters that can be systematically varied in order to enhance classifications. A simple way to insert parameters is to provide every training example with an amplitude a(u). Then for K(q), the new classification function reads

MK(q) = sgn(Σ_u v(u) × a(u) × κ(u, s(q)) × ((N + 1 − r(q, u)) / N)^(k−1)),
and similarly for L(q) and K′(q) (the versions of the latter with modifiable amplitudes are denoted ML(q) and MK′(q)). Suppose that initially all amplitudes are put to one. Then the initial classification function MK(q) coincides with K(q). Subsequently, amplitudes are allowed to deviate from one if this enhances the performance on the training set. This procedure can be continued until the classification error on a test set starts to increase. The algorithm by which amplitudes are adapted requires introducing a continuous error function as well as continuous outputs. To this end, the sign function in the expressions for MK(q), ML(q), and MK′(q) is replaced by the familiar squashing function f(x) = 1 / (1 + exp(−ε × x)); this yields, in the case of MK(q), the continuous value C(q):

C(q) = 1 / (1 + exp(−ε × Σ_u v(u) × a(u) × κ(u, s(q)) × ((N + 1 − r(q, u)) / N)^(k−1))).
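The squashed output and the accept-if-better amplitude search described below can be sketched as follows (a hedged illustration only: function names and parameter values are our assumptions, the mapping of the ±1 targets to {0, 1} inside the squared error is our choice, and the spherical-band factor is omitted, so this corresponds to the ML(q) variant):

```python
import math
import random

def rank_weights(q, train_x, k):
    """Weights ((N + 1 - r(q, u)) / N) ** (k - 1) from distance ranks."""
    n = len(train_x)
    order = sorted(range(n), key=lambda i: math.dist(q, train_x[i]))
    w = [0.0] * n
    for position, i in enumerate(order, start=1):
        w[i] = ((n + 1 - position) / n) ** (k - 1)
    return w

def continuous_output(q, train_x, train_y, amp, k, eps):
    """Squashed output 1 / (1 + exp(-eps * sum_u v(u) a(u) w(q, u)))."""
    w = rank_weights(q, train_x, k)
    s = sum(train_y[i] * amp[i] * w[i] for i in range(len(train_x)))
    return 1.0 / (1.0 + math.exp(-eps * s))

def train_error(train_x, train_y, amp, k, eps):
    # Targets mapped from {-1, +1} to {0, 1} to match the squashed output
    # (our assumption about the target encoding).
    return sum(((train_y[i] + 1) / 2 -
                continuous_output(train_x[i], train_x, train_y, amp, k, eps)) ** 2
               for i in range(len(train_x)))

def adapt_amplitudes(train_x, train_y, k=2, eps=0.5, lam=2, kap=2.0,
                     steps=200, seed=0):
    """Perturb lam random amplitudes by a random amount in [-kap, kap];
    keep the change only if the training error decreases."""
    rng = random.Random(seed)
    amp = [1.0] * len(train_x)  # all amplitudes start at one
    err = train_error(train_x, train_y, amp, k, eps)
    for _ in range(steps):
        idx = rng.sample(range(len(amp)), lam)
        trial = list(amp)
        for i in idx:
            trial[i] += rng.uniform(-kap, kap)
        trial_err = train_error(train_x, train_y, trial, k, eps)
        if trial_err < err:  # accept only improvements
            amp, err = trial, trial_err
    return amp, err

# Usage on an invented toy set: the accepted-only updates never raise the error.
train_x = [(-3.0,), (-2.0,), (-1.0,), (1.0,), (2.0,), (3.0,)]
train_y = [-1, -1, -1, 1, 1, 1]
start = train_error(train_x, train_y, [1.0] * 6, 2, 0.5)
amp, final = adapt_amplitudes(train_x, train_y)
print(start, final)
```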
Then an error function is defined for the training set by E = Σ_u (v(u) − C(u))². In the examples in sections 5 and 6, the amplitudes are adapted according to a simple random schema. At every time step, λ amplitudes are randomly selected. A random number between −κ and κ is added to each of these amplitudes, and if this leads to a lower value of the error function, the change in amplitudes is effectively carried through (this corresponds to a Hooke-Jeeves search in a classical optimization context; see Reklaitis, Ravindran, & Ragsdell, 1983). As will become apparent in the examples, the insertion of amplitude parameters leads to better generalization. At first sight, this may seem counterintuitive because of the relatively large number of parameters that are inserted. In the case of a backpropagation network, for instance, the total number of connections and biases is usually kept significantly below the total number of training examples. Networks with too many units tend to generalize poorly because they overfit the training data. Consider a network with at least one hidden layer. For every unit that is added to the first hidden layer, another hyperplane separation of the input space is included in the function calculated by the network. Therefore, the more units there are at this layer, the higher is the chance that neighboring points in input space are mapped on output values corresponding to different classifications. Nonlinear functions calculated by networks with at least one hidden
layer become less smooth and more volatile when more units are included (Reed & Marks, 1999). For the present method, the situation is different. If amplitudes are made to differ from one, what changes is that some training points are allowed to exert higher influence, whereas others will have lower impact on their environment. This does not introduce additional boundaries (like hyperplanes in the case of backpropagation) into the classification function, so that the volatility of the latter does not necessarily increase.

5 Illustration with Two Generative Problems
In this section, we confine ourselves to a binary eight-dimensional input space. For every dimension, the possible values are +1 and −1. The input space contains 2^8 = 256 different points. To start, consider a linear separation problem. For every element u ≡ (u_1, u_2, …, u_8) of the training set, the output value v(u) is determined in accordance with the linear functional prescription w(u) = sgn(Σ_{i=1 to 8} c_i u_i), where the c_i are random numbers between −1 and +1. The input space is randomly divided into a training set and a test set of equal size. The classification function ML(q) with adaptable amplitudes was used to solve this problem. In the run illustrated in Figures 1 and 2, L(q) initially makes 11 errors on the training set and 9 errors on the test set. The initial values of E are 112.5 and 112.9 on the training set and the test set, respectively. After 18,000 iterations of the adaptation procedure for amplitudes, the error decreased to 8.8 on the training set and 20.3 on the test set (see Figure 1). Classification was perfect on both the training set and the test set after 9900 iterations (see Figure 2) (all simulations were made for κ = 2). This illustrates that noiseless, linear problems are easily solved by the present method. The initial classification errors produced by L(q) are typically lower than 10%, and amplitude adaptation makes them steadily decrease to zero. Next, consider an eight-dimensional parity problem. We use K(q) to solve this problem (the parameter μ was set to 0.06). When the input space was randomly divided into a training set and a test set of equal size, K(q) appeared to perform perfectly on both the training set and the test set, so that no further amplitude adaptation was required. The fact that this method generalizes on parity problems suggests that it can grasp relations of a structural kind and that it is not confined to noise suppression.

6 Illustration with Two Practical Problems
We applied the ML(q) classification procedure to two practical problems that were also evaluated in Kleinberg (2000). First, we took a credit scoring problem for which 690 examples are provided. This total set of examples was divided randomly 10 times into a training set and a test set of equal
Figure 1: Evolution of E for the training set and the test set (ε = 0.005, κ = 2, and λ = 5).
Figure 2: Evolution of classification errors on the training set and the test set.
size. The parameter values for our runs were ε = 0.01, κ = 2, and λ = 5. The smallest fractions of erroneously classified test items for each of the 10 partitions were 0.1101, 0.1043, 0.1101, 0.113, 0.1217, 0.113, 0.1391, 0.1217, 0.1072, and 0.1217, which gives an average of 0.1162. This is to be compared with the fractions of errors for the 10 random discrimination methods investigated in Kleinberg (2000) (these fractions reflect averages over 10 separations of the example set into a training and a test set too), which were between 0.124 and 0.158. No further increase in performance was noticed when ML(q) was replaced by MK(q) or MK′(q). Second, ML(q) was applied to a diabetes classification problem. The total number of examples was 768, and again this set was split 10 times into a training and a test set. The parameter values for all our runs were again ε = 0.01, κ = 2, and λ = 5. The smallest values of the fractions of errors on the test set were 0.2057, 0.2187, 0.2395, 0.2135, 0.2317, 0.2369, 0.2239, 0.2265, 0.2473, and 0.2213, with an average of 0.2265. This is to be compared with the values for the 10 methods studied by Kleinberg (2000), which were between 0.244 and 0.284. For this problem, MK′(q) occasionally appears to give a lower error for α = 2, but the average effect is marginal, of the order of 0.3%. Also this time, turning to MK(q) did not increase the performance, which suggests that these problems do not contain strong nonlinearities. For these two practical examples, ML(q) and MK′(q) appear to perform better than random discrimination methods.
The initial values of the errors (i.e., the values produced by L(q) and K′(q)) are typically slightly higher than the values produced by random discrimination.³ But this is countered by the fact that the present method allows for adaptation and that adaptation generally does not harm the smoothness of the output function, so that generalization properties are enhanced when the performance on a training set is enhanced.

7 Discussion
Conceptually, the classification function L(q) has a straightforward relation to random discrimination methods. But the limit function also has an interpretation in terms of fields. Every training example is the source of a field. The strength of the field that originates at an example decays as a function of the configuration of all other examples. For a uniform distribution of other examples in the neighborhood of an example, the field produced by
³ The main reason for this is as follows. Suppose that a sampling procedure is repeated several times with finite numbers of elementary functions. Then the average of the sampled functions can show statistical fluctuations around the theoretical average for an infinite number of elementary functions. If the run that corresponds with the best performance is selected, better values than for the theoretical limit function may be obtained. This was the procedure followed in Kleinberg (2000), where 10 runs were made for each division of the example set into a training set and a test set.
the latter decays with radial symmetry. Symmetry or lack of symmetry of the examples in I relative to a particular example is directly reflected in the field emerging from the latter. Once the sum of the fields in a point is calculated, its classification is known. The strength of the fields can be optimized in terms of an error function. In terms of procedure, ML(q), MK(q), and MK′(q) can be seen as incremental methods that start from a fairly good initialization, corresponding to L(q), K(q), and K′(q), respectively. The quality of the initialization entails that the incremental procedure to be followed can be kept simple. The adaptability of amplitudes is an advantage relative to random discrimination methods. Further, the straightforward quantization of fields in K(q) yields a method that solves highly nonlinear problems like parity problems with remarkable efficiency. Generalization for such problems does not require further adaptation of amplitudes. This contrasts with backpropagation. In the classic parallel distributed processing work (Rumelhart, Hinton, & Williams, 1986), a network configuration with symmetry that solves parity problems is shown, but for high-dimensional problems, backpropagation generally does not find it. (In our experience, it does not find a solution at all for dimensionalities equal to eight or higher.)

References

Berlind, R. (1994). An alternative method of stochastic discrimination with applications to pattern recognition. Unpublished doctoral dissertation, State University of New York at Buffalo.

Ho, T. (1995). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (pp. 272–282). Montreal, Canada.

Kleinberg, E. (1996). An overtraining-resistant stochastic modeling method for pattern recognition. Annals of Statistics, 24, 2319–2349.

Kleinberg, E. (2000). On the algorithmic implementation of stochastic discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 473–490.
Reed, R., & Marks, R. (1999). Neural smithing: Supervised learning in feedforward artificial neural networks. Cambridge, MA: MIT Press.

Reklaitis, G., Ravindran, A., & Ragsdell, K. (1983). Engineering optimization: Methods and applications. New York: Wiley.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.

Van Loocke, P. (2000). Fractals in cellular systems as solutions for cognitive problems. Fractals, 8(1), 7–14.

Van Loocke, P. (2001). Meta-patterns and higher-order meta-patterns in cellular systems. Artificial Intelligence Review, 16, 49–60.

Received September 2000; accepted July 27, 2001.
LETTER
Communicated by Shimon Edelman
Learning to Recognize Three-Dimensional Objects

Dan Roth [email protected]
Ming-Hsuan Yang [email protected]
Narendra Ahuja [email protected]
Department of Computer Science and the Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
A learning account for the problem of object recognition is developed within the probably approximately correct (PAC) model of learnability. The key assumption underlying this work is that objects can be recognized (or discriminated) using simple representations in terms of syntactically simple relations over the raw image. Although the potential number of these simple relations could be huge, only a few of them are actually present in each observed image, and a fairly small number of those observed are relevant to discriminating an object. We show that these properties can be exploited to yield an efficient learning approach in terms of sample and computational complexity within the PAC model. No assumptions are needed on the distribution of the observed objects, and the learning performance is quantified relative to its experience. Most important, the success of learning an object representation is naturally tied to the ability to represent it as a function of some intermediate representations extracted from the image. We evaluate this approach in a large-scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Library. Experimental results exhibit good generalization and robustness properties of the SNoW-based method relative to other approaches. SNoW's recognition rate degrades more gracefully when the training data contains fewer views, and it shows similar behavior in some preliminary experiments with partially occluded objects.
1 Introduction
Neural Computation 14, 1071–1103 (2002). © 2002 Massachusetts Institute of Technology

The role of learning in computer vision research has become increasingly significant. Statistical learning theory has had an influence on many applications: classification and object recognition, grouping and segmentation, illumination modeling, scene reconstruction, and others. The rising role of learning methods, made possible by significant improvements in computing power and storage, is largely motivated by the realization that explicit modeling of complex phenomena in a messy world cannot be done without a significant role for learning. Learning is required for model and knowledge acquisition, as well as to support generalization and avoid brittleness. However, many statistical and probabilistic learning methods require making explicit assumptions, for example, on the distribution that governs the occurrences of instances in the world. For visual inference problems such as recognition, categorization, and detection, making these assumptions seems unrealistic.

This work develops a distribution-free learning theory account of an archetypical visual recognition problem: object recognition. The problem is viewed as that of learning a representation of an object that, given a new image, is used to identify the target object in it. The learning account is developed within the probably approximately correct (PAC) model of learnability (Valiant, 1984). This framework allows us to quantify success relative to the experience with previously observed objects, without making assumptions on the distribution, and to study the theoretical limits of what can be learned from images in terms of the expressivity of the intermediate representation used by the learning process. That is, learnability guarantees that objects sampled from the same distribution as the one that governs the experience of the learner are very likely to be recognized correctly. In addition, the framework gives guidelines for developing practical algorithmic solutions to the problem.
Earlier work discussed the possibility of identifying the theoretical limits of what can be learned from images (Shvaytser, 1990) and found that learning in terms of the raw representation of the images is computationally intractable. Attempts to explain this focused on the dependence of learnability on the representation of the object (Edelman, 1993) but failed to provide a satisfactory explanation for it or a practical solution. The approach developed here builds on suggestions made in Kushilevitz and Roth (1996) and relies heavily on the development of a feature-efficient learning approach (Littlestone, 1988; Roth, 1998; Carlson, Cumby, Rosen, & Roth, 1999). This is a learning process capable of learning quickly and efficiently in domains in which the number of potential features is very large, but any concept of interest actually depends on a fairly small number of them. At the heart of the learning approach are two assumptions that we abstract as follows:

Representational: There exists a (possibly infinite) collection M of "explanations" such that an object o can be represented as a simple function of elements in M.
Procedural: There exists a process that, given an image in which the target object o occurs, efficiently generates "explanations" in M that are present in the image, such that, with high probability, at least one of them is in the representation of o.

Under these assumptions, we prove that there exists an efficient algorithm that, given a collection of images labeled as positive or negative examples of the target object, can learn a good representation of the object. That is, it can learn a representation that with high probability will make correct predictions on future images that contain (or do not contain) the object. Furthermore, we show that under these conditions, the learned representations are robust under realistic noisy conditions. A significant nonassumption of our approach is that it has no prior knowledge of the distribution of images, nor is it trying to estimate it. The learned representation is guaranteed to perform well when tested on images sampled from the distribution¹ that governed the data observed in its learning experience. Section 2 describes this framework in detail.

The framework developed here is very general. The explanations alluded to above can represent a variety of computational processes and information sources that operate on the image. They can depend on local properties of the image, the relative positions of primitives in the image, and even external information sources or context variables. Thus, the theoretical support given here applies also to an intermediate learning stage in a hierarchical process. In order to generate the explanations efficiently, this work assumes that they are syntactically simple in terms of the raw image. However, the explanations might as well be syntactically simple in terms of previously learned or inferred predicates, giving rise to hierarchical representations. The main assumptions of the framework are discussed in section 3, where we also provide some concrete examples of the type of representations used.
For this framework to contribute to a practical solution, there needs to be a computational approach that is able to learn efficiently (in terms of both computation and number of examples) in the presence of a large number of potential explanations. Our evaluation of the theoretical framework makes use of the SNoW learning architecture (Roth, 1998; Carlson et al., 1999), which is tailored for this kind of task. The SNoW system (available online at http://L2R.cs.uiuc.edu/˜cogcomp/) is described in section 4. This is followed by a comparison of SNoW with support vector machines in section 5. In section 6, we review and compare our method with previous work on view-based object recognition. We then describe our experiments comparing SNoW to other approaches in section 7. In these experiments, SNoW

¹ It will be clear from the technical discussion that the distribution is over the intermediate representation (features) generated given the images.
is shown to exhibit a high level of recognition and robustness. We find that the SNoW-based approach compares favorably with other approaches and behaves more robustly in the presence of fewer views in the training data. We conclude with some directions for future work in section 8.

2 Learning Framework
We study learnability within the standard PAC model (Valiant, 1984) and the mistake-bound model (Littlestone, 1988). Both learning models assume the existence of a concept class C, a class of {0, 1}-valued functions over an instance space X with an associated complexity parameter (typically, X's dimensionality) n, and some unknown target concept f_T ∈ C that we are trying to learn. In the mistake-bound model, an example x ∈ X is presented at each learning stage; the learning algorithm is asked to predict f_T(x) and is then told whether the prediction was correct. Each time the learning algorithm makes an incorrect prediction, we charge it one mistake. We say that C is mistake-bound learnable if there exists a polynomial-time prediction algorithm A (possibly randomized) that for all f_T ∈ C and any sequence of examples is guaranteed to make at most polynomially many (in n) mistakes. We say that C is expected mistake-bound learnable if there exists A, as above, such that the expected number of mistakes it makes for all f_T ∈ C and any sequence of examples is at most polynomially many (in n). Note that the expectation is taken over the random choices made by A; no probability distribution is associated with the sequences.

In learning an unknown target function f_T ∈ C in the PAC model, we assume that there is a fixed but arbitrary and unknown distribution D over the instance space X. The learning algorithm sees examples drawn independently according to D together with their labels (positive or negative). Then it is required to predict the value of f_T on another example drawn according to D. Denote by h(x) the prediction of the algorithm on the example x ∈ X. The error of the algorithm with respect to f_T and D is measured by error(h) = Pr_{x∈D}{f_T(x) ≠ h(x)}. We say that C is PAC learnable if there exists a polynomial-time learning algorithm A and a polynomial p(·, ·, ·) such that for all n ≥ 1, all target concepts f_T ∈ C, all distributions D over X, and all ε > 0 and 0 < δ ≤ 1, if the algorithm A is given p(n, 1/ε, 1/δ) examples, then with probability at least 1 − δ, A's hypothesis h satisfies error(h) ≤ ε. It can be shown that if a concept class C is learnable in the expected mistake-bound model (and thus in the mistake-bound model), then it is PAC learnable (Haussler, Kearns, Littlestone, & Warmuth, 1988).

The agnostic learning model (Haussler, 1992; Kearns, Schapire, & Sellie, 1994), a variant of the PAC learning model, might be more relevant to practical learning; it applies when one does not want to assume that the labeled training examples (x, l) arise from a target concept of an a priori known structure f_T ∈ C. In this model, one assumes that data elements (x, l) are sampled according to some arbitrary distribution D on X × {0, 1}. D may
simply reflect the distribution of the data as they occur "in nature" without assuming that the labels are generated according to some "rule." As in the PAC model, the goal is to find an approximately correct h in some class H of hypotheses. In terms of generalization bounds, the models are similar, and therefore we will discuss the PAC/mistake-bound case here. In practice, learning is done on a set of training examples, and its performance is then evaluated on a set of previously unseen examples. The hope that a classifier learned on a training set will perform well on previously unseen examples is based on the basic theorem of learning theory (Valiant, 1984; Vapnik, 1995); stated informally, it guarantees that if the training data and the test data are sampled from the same distribution, good performance on a large enough training sample guarantees good performance on the test data (implying good "true" error),² where the difference between the performance on the training data and that on the test data is parameterized using a parameter that measures the richness of the hypothesis class H. For completeness, we simply cite the following uniform convergence result:

Theorem 1 (Haussler, 1992).
If the size of the training sample S is at least

m(ε, δ) = (1/ε²) ( k·VC(H) + ln(2/ε) + ln(1/δ) ),
then with probability at least 1 − δ, the learned hypothesis h ∈ H satisfies error_D(h) < error_S(h) + ε, where k is some constant and VC(H) is the VC dimension of the class H (Vapnik, 1982), a combinatorial parameter that measures the richness of H.

2.1 Learning Scenario. Let I be an input space of images. Our goal is to learn a definition such as apple: I → {0, 1} that, when evaluated on a given image, outputs 1 when there is an apple in the image and 0 otherwise. It is clear, though, that this target function is very complex in terms of the input space; in particular, it may depend on relational information and even quantified predicates. Many years of research in learning theory, however, have shown that efficient learnability of complex functions is not feasible (Angluin, 1992). In the learning scenario described here, therefore, learning will not take effect directly in terms of the raw input. Rather, we will learn the target definitions in terms of an intermediate representation that will

² In this sense, the evaluation done in section 7 after training on a small training set is not as optimal as, say, a face detection study (Yang, Roth, & Ahuja, 2000c) done on a large training set. However, the theory quantifies the dependence of the performance on the size of the training data, and the experimental study exhibits how different algorithmic approaches fare with a relatively small number of examples.
be generated from the input image. This will allow us to learn a simpler functional description, quantifying learnability in terms of the expressivity of the intermediate representation as well as the function learned on top of it; in particular, it would make explicit the requirements from an intermediate representation so that learning is possible.

A relation³ over the instance space I is a function χ: I → {0, 1}. χ can be viewed as an indicator function over I, defining the subset of those elements mapped to 1 by χ. A relation χ is active in I ∈ I if χ(I) = 1. Given an instance, we would like to transform it so that it is represented as the collection of the relations that are active in it. We would like to do that, though, without the need to write down explicitly all possible relations that could be active in the domain ahead of time. This is important, in particular, over infinite domains or in on-line situations where the domain elements are not known in advance and therefore it is impossible to write down all possible relations. An efficient way to do that is given by the construct of relation-generating functions (Cumby & Roth, 2000).

Definition 1. Let X be an enumerable collection of relations over I. A relation-generating function (RGF) is a mapping G: I → 2^X that maps I ∈ I to the set of all elements in X that satisfy χ(I) = 1. If there is no χ ∈ X for which χ(I) = 1, G(I) = ∅.
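As a toy illustration of definition 1 (illustrative names only; the paper's actual RGFs, built from edge detectors, are described in section 3), an RGF can be realized as a function that maps an image to identifiers of the relations active in it, without ever enumerating the full relation space:

```python
# Sketch of a relation-generating function (RGF), in the spirit of
# definition 1: G maps an instance I to the set of relations chi with
# chi(I) = 1. The relation family here is "a run of `length` on-pixels
# going down from (r, c)" over a binary image; names are hypothetical.

def vertical_edge_rgf(image, length=3):
    """Return identifiers of all active vertical-run relations in `image`.

    `image` is a list of rows of 0/1 pixels. The relation space is
    enumerable but never written out in full: relations are generated
    only when they are active in the given instance (data-driven).
    """
    rows, cols = len(image), len(image[0])
    active = set()
    for r in range(rows - length + 1):
        for c in range(cols):
            if all(image[r + k][c] == 1 for k in range(length)):
                active.add(("v_edge", r, c, length))
    return active  # G(I); the empty set when no relation fires

img = [[0, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 0, 1]]
print(vertical_edge_rgf(img))  # {('v_edge', 0, 1, 3)}
```

Note that the cost of evaluating G depends only on the image size, not on the (potentially huge) number of relations in X.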
RGFs can be thought of as a way to define kinds of relations, or to parameterize over a large space of relations. Only when presented with an instance I ∈ I is a concrete relation generated. For example, G may be the RGF that corresponds to all vertical edges of size 3 in an image. Given an image, a list of all these edges that are present in the image is produced. It will be clear in section 2.3 that RGFs can be noisy to a certain extent, since the framework can tolerate some amount of noise at the feature level.

2.2 Learning Approach. We now present a mistake-bound algorithm for a class of functions that can be represented as DNF formulas over the space X of all relations. As indicated, this implies a PAC learning algorithm, but the proof for the mistake-bound case is simpler. In section 7, we will learn a more general function: a linear threshold function over conjunctions of relations in X. We discuss later how the theoretical results can be expanded to this case.
Definition 2. Let X be a set of relations that can be generated by a set of RGFs. Let M be a collection of monomials (conjunctions) over the elements of X and
³ In the machine learning literature, a relation is sometimes called a feature. We use the term relation here to emphasize that it is a Boolean predicate and that, in principle, it could be a higher-order predicate; that is, it could take variables (Cumby & Roth, 2000). In this article, features are simple functions (e.g., monomials) over relations.
p(n), q(n), and g(n) be polynomials. Let C_M be the class of all functions that are disjunctions of at most p(n) monomials in M. Following Kushilevitz and Roth (1996), we call C_M polynomially explainable if there exists an efficient (polynomial-time) algorithm B such that for every function f ∈ C_M and every positive example of f as input, B outputs at most q(n) monomials (not necessarily all of them in M) such that with probability at least 1/g(n), at least one of them appears in f (the probability is taken over the coin flips of the possibly probabilistic algorithm B).

It should be clear that the class of polynomially explainable DNFs is a strict generalization of the class of, say, k-DNF, for any fixed k. This difference might be important in the current context. It might be possible that some structural constraints govern the generation of the monomials, placing them in M, but that the size of the monomials is not fixed. As a canonical example of the difference, consider the following. Let S_1, ..., S_t be subsets of {x_1, ..., x_n}, where t is polynomial in n. Let M be any collection of monomials with the property that for every m ∈ M, the set of variables in m is S_i for some 1 ≤ i ≤ t (i.e., any set S_i may correspond to at most 2^|S_i| monomials in M, by choosing for each x_j ∈ S_i whether x_j or its negation appears in the monomial). If there exists an efficient algorithm B that on input n enumerates these sets (and possibly some more), then C_M is polynomially explainable. The reason is that although M may contain an exponential number of monomials, given an example (x_1, ..., x_n), only one element in each of the S_i's might correspond to it. That is, the data-driven nature of the process allows working with families of features that could be very large, provided that only a reasonable number of them (polynomial, in our definition) is active in each observation.
Example 1. As a concrete instantiation, consider the class of all DNF formulas in which the variables in each monomial have consecutive indices; for example, f = x_1 x̄_2 x_3 x_4 ∨ x̄_4 x̄_5 x_6 ∨ x_8 x_9. Clearly, f ∉ k-DNF for any constant k. However, it is easy to enumerate the (n choose 2) < n² sets S_{i,j} (1 ≤ i ≤ j ≤ n) defined by S_{i,j} = {x_i, x_{i+1}, ..., x_j} and, as above, although the number of corresponding monomials is exponential, given an example, only one is relevant in each S_{i,j}. A similar situation occurs when learning representations of visual objects. In this case, the sets might be defined by structural constraints, and each pixel will take a constant number of values (albeit larger than 2, as in the above examples), but only one of these will be relevant given an example.
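A small sketch of the enumeration in example 1 (illustrative code, not from the paper): although each index set supports exponentially many monomials over all sign assignments, a concrete example singles out at most one monomial per set.

```python
def consecutive_sets(n):
    """Enumerate the index sets S_{i,j} = {i, i+1, ..., j}, 1 <= i <= j <= n.

    There are n(n+1)/2 of them, so the enumeration is polynomial even
    though the associated monomial family is exponentially large.
    """
    return [tuple(range(i, j + 1)) for i in range(1, n + 1) for j in range(i, n + 1)]

def relevant_monomial(example, index_set):
    """The single monomial over an index set consistent with `example`.

    `example` maps variable index -> 0/1. For each x_k in the set we pick
    x_k if it is 1 in the example and its negation otherwise, so only one
    of the 2^|S| monomials per set can be satisfied by this example.
    """
    return tuple(("x%d" % k) if example[k] else ("~x%d" % k) for k in index_set)

print(len(consecutive_sets(4)))  # 10 sets for n = 4
example = {1: 1, 2: 0, 3: 1, 4: 1}
print(relevant_monomial(example, (2, 3, 4)))  # ('~x2', 'x3', 'x4')
```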
We note that in principle, it is possible to abstract the generation of the conjunctions into the RGFs (see definition 1). However, we would like to emphasize the generation of conjunctions over simple relations and the possibility of learning on top of it, given arguments in the literature for its
effectiveness and potential biological plausibility (Fleuret & Geman, 1999; Ullman & Soloviev, 1999). The elements generated by the algorithm B, the explanations, are what we will later call the "features" supplied to the learning algorithm. Definition 2 implements the assumptions that we abstracted in section 1. The algorithm B is the procedural part. B keeps a small set of syntactically simple definitions, the RGFs, and given an image, it outputs those instantiations that are present in the image. All we require here is that with nonnegligible probability, it outputs at least one "relevant" explanation (and possibly many that are irrelevant). The class C_M implements our representational assumption; we assume that each object has a simple (disjunctive) representation over the relational monomials. This assumption can be verified only experimentally, as we do later in this article. We emphasize that f itself is not given to the algorithm B. Also note that a function f in the class C_M may have a few equivalent representations as a disjunction of monomials in M. The definition requires the output of the algorithm B to satisfy the above property, independent of which of these representations of f is considered. The importance of this will become clear when we analyze the learning algorithm below.

Theorem 2. If C_M is polynomially explainable, then C_M is expected mistake-bound learnable. Furthermore, if C_M is polynomially explainable by an algorithm B that always outputs at least one term of f (i.e., g(n) ≡ 1), then C_M is mistake-bound learnable.
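The update rule behind theorem 2 (a Blum-style disjunction learner; the explanation-generating algorithm B is passed in as a parameter, and all names here are illustrative) can be sketched as:

```python
def learn_disjunction(examples, B):
    """Mistake-bound learner for a disjunction of monomials (cf. theorem 2).

    `examples` yields (active_features, label) pairs; `B` maps a positive
    example to candidate monomials, at least one of which (with
    probability >= 1/g(n)) appears in the target. A monomial is modeled
    as a frozenset of feature ids; it is satisfied when it is a subset of
    the example's active-feature set.
    """
    h = set()        # hypothesis: disjunction of monomials; empty = FALSE
    mistakes = 0
    for active, label in examples:
        pred = any(m <= active for m in h)
        if pred == label:
            continue            # correct prediction: no update
        mistakes += 1
        if label:               # mistake on a positive: add B's explanations
            h |= set(B(active))
        else:                   # mistake on a negative: drop satisfied monomials
            h = {m for m in h if not m <= active}
    return h, mistakes

# Toy target: f = (a AND b) OR (c). This B returns all singletons and
# pairs of the active features, so it always outputs at least one term
# of f on a positive example (g(n) = 1).
from itertools import combinations
def B(active):
    feats = sorted(active)
    return [frozenset(s) for r in (1, 2) for s in combinations(feats, r)]

data = [({"a", "b"}, True), ({"a"}, False), ({"c", "d"}, True), ({"d"}, False)]
h, m = learn_disjunction(data, B)
print(any(mono <= {"a", "b"} for mono in h))  # True: positive examples covered
```

On this toy run every example triggers an update, matching the analysis in the proof: mistakes on positives add monomials, mistakes on negatives prune them, and target monomials are never removed.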
Proof. The algorithm is similar to an algorithm presented in Blum (1992),
which learns a disjunction in the infinite attribute model. The algorithm maintains a hypothesis h, which is a disjunction of monomials. Initially, h contains no monomials (i.e., h ≡ FALSE). Upon receiving an example e, the algorithm predicts h(e); if the prediction is correct, h is not updated. Otherwise, upon a mistaken prediction, it proceeds as follows:

If e is positive: Execute B (the algorithm guaranteed by the assumption that C_M is polynomially explainable) on the example e, and add the monomials it outputs to h.

If e is negative: Remove from the hypothesis h all the monomials that are satisfied by e (there must be at least one).

The analysis of the algorithm is straightforward for the case g(n) ≡ 1 and more subtle in general. To analyze the algorithm, we first fix a representation for f as a disjunction of monomials in M (in case f has more than one possible representation, choose one arbitrarily; we can work with any representation of f that uses only monomials in M). Now note that an active monomial (i.e., a monomial that appears in this representation of the target
function f) is never removed from h. Therefore, since on a positive example e the algorithm B is guaranteed to output, with probability at least 1/g(n), at least one monomial that appears in f, the expected number of mistakes made on positive examples is at most p(n)·g(n). This also implies that the expected total number of monomials included in h during the execution of the algorithm is not more than p(n)·q(n)·g(n).⁴ Each mistake on a negative example results in removing at least one of the monomials included in h but not in f. The expected number of such mistakes is therefore at most p(n)·q(n)·g(n). The expected total number of mistakes made by the algorithm is O(p(n)·q(n)·g(n)). Finally, note that in the case g(n) ≡ 1, we get a truly mistake-bound algorithm, whose number of mistakes is bounded by p(n)·q(n).

The algorithm used in practice, in SNoW, is conceptually similar. The main difference is that the hypothesis h used is a general linear threshold function over elements in M rather than a disjunction, which is a restricted linear threshold function. Algorithmically, rather than dropping elements from h, their weights are updated. The details of this process (see section 4) are crucial for our approach to be computationally feasible for large-scale domains and for robustness. In order to expand the theoretical results to this case, we appeal to the results in Littlestone (1988), modified to the case of the infinite attribute domain. If the target function is indeed a disjunction over elements in M, our results hold directly, with a much improved mistake bound that depends mostly on the number of elements in M that are actually relevant to f. If f is a general linear function over M, this behavior still holds with an additional dependence on the margin between positive and negative examples, as measured in the M space (δ in Littlestone, 1988; see details there).

2.3 Robustness.
Any realistic learning framework needs to support different kinds of noise in its input. Several kinds of noise have been studied in the literature in the context of PAC learning, and algorithms of the type we consider here have been shown to be robust to them. The most studied type of noise is classification noise (Kearns & Li, 1993), in which the examples are assumed to be given to the learning algorithm with labels that are flipped with some probability smaller than 1/2. Learning in our framework can be shown to be robust to this kind of noise, as well as to the more realistic case of attribute noise, in which the description of the input itself is corrupted to a certain degree. We believe that this is the type of noise that is more relevant in the current case. First, learning is done in terms of

⁴ Note that if B was guaranteed only to give a monomial that appears in some representation of f, then this bound is false (as it could be the case that the active monomials in different executions of B belong to different representations of f). This explains the requirement in the definition that may seem too strong.
the output of the RGFs, which may introduce some noise. Second, attribute noise is related to occlusion noise, which is important in object recognition. Specifically, attribute noise can be used to model the type of noise that usually occurs when other objects appear in the image, behind or in front of the target object. This is formalized next using the notion of domination. Let f_1, f_2 be two concepts. We say that f_1 is k-dominated by f_2 if each f_1 example can be obtained from an f_2 example by flipping the (binary) value of at most k of the active relations. In this case, f_2 k-dominates f_1. The labels of the examples, however, are generated according to the original concept, before the noise is introduced.

Theorem 3. If a class C_M is learnable by virtue of being polynomially explainable, then it is learnable even if examples of the target class are cluttered by k-domination attribute noise, for any constant k.
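A minimal sketch of what k-domination means operationally (illustrative names; not the paper's code): an adversary flips the value of a bounded number of relations while the label stays that of the clean concept.

```python
import random

def k_dominate(active, universe, k, rng=random):
    """Corrupt a sparse example by flipping k distinct relation values
    (the definition allows up to k; this sketch always uses exactly k).

    `active` is the set of active relation ids in the clean example, and
    `universe` the pool of relation ids the adversary may touch. Flipping
    an active relation deactivates it; flipping an inactive one activates
    it (e.g., an occluding object adding spurious edges). The label of
    the example is NOT changed: it is generated by the clean concept.
    """
    noisy = set(active)
    for rel in rng.sample(sorted(universe), k):
        noisy.symmetric_difference_update({rel})
    return noisy

clean = {"edge1", "edge7"}
noisy = k_dominate(clean, universe={"edge1", "edge7", "edge9"}, k=1)
print(len(noisy ^ clean))  # 1: the noisy example differs in one relation
```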
Proof. The proof is an extension of the arguments in Littlestone (1991) regarding robustness to attribute noise to the case of the infinite attribute model. It basically shows that the same algorithm works in the noisy case, only that the number of flipped relations affects the number of examples that the algorithm is required to see before it stops making mistakes.

3 From Theory to Practice
Several issues need to be addressed in order to exhibit the practicality of our learning framework. The first is the availability of a variety of RGFs that can be used to extract primitive visual patterns from data under different conditions and that are expressive enough so that a simple function defined on top of them is enough to discriminate an object. A basic assumption underlying this work is that this is not hard to do. In this work, we illustrate the approach by using simple edge detectors (clearly too simple for a realistic situation). The second issue is the composition of complex, albeit simply defined, relations from primitive ones. This is crucial since it allows the representation of complex functions in terms of the instantiated relations by learning simple functional descriptions over their compositions. A language that supports composition of restricted families of conjunctions and can encode structural relations in images (e.g., above, to the left of) is discussed in Cumby and Roth (2000). The current work uses only general conjunctions and restricts only their size. Again, our working assumption is that even this simple representation is enough to generate a family of discriminating features.

Specifically, the above discussion amounts to assuming that it is (1) easy to extract simple relations over images (short vertical and horizontal edges in our case); (2) easy to extract simple monomials over these (each of our features represents the presence of some short conjunction of edges in the image); and (3) a linear threshold function (generalizing the simple disjunction in the theorems) over these features can be used to discriminate an object from other objects. We represent each object, therefore, as a linear function over the conjunctive features.

Figure 1: The short vertical and horizontal edges are extracted from an object image. These edges (and the conjunctions of them) represent the features of the object.

In Figure 1, we exemplify the feature-based representation extracted for three objects when the features used are only short vertical and horizontal conjunctions (no conjunctions of those are used here). Our theory does not commit to this simplified and clearly insufficient representation. Nevertheless, such simple features capture sufficient information to recognize the 100 objects in the Columbia Object Image Library (COIL-100) database, as will be demonstrated. Finally, the issue of the learnability of these representations is crucial in our approach. In learning situations in vision, the number of relation compositions (features) that could potentially affect each decision is very large, but typically only a small number of them is actually relevant to a decision. Beyond correctness, a realistic learning approach therefore needs to be feature efficient (Littlestone, 1988) in that its learning complexity (the number of examples required for the learning algorithm to converge to a function that is a good discriminator) depends on the number of relevant features and not on the global number of features in the domain. Equivalently, this can be phrased as the dependence of the true error on the error observed on the training data and the number of examples observed during training. For a feature-efficient algorithm, the number of training examples required in order to generalize well (to have true error that is not far from the error on the training data) is relatively small (Kivinen & Warmuth, 1995b).
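A rough sketch of the pipeline just described (hypothetical edge test, thresholds, and feature encodings; not the paper's actual RGFs or conjunction operator): short edges are extracted, pairs of edges are composed into conjunctive features, and each object is scored by a linear function over the resulting sparse feature set.

```python
def short_edges(gray, thresh=32, length=3):
    """Extract short vertical and horizontal edge features from a
    grayscale image (list of rows of ints). A simple intensity-contrast
    test stands in for a real edge detector."""
    rows, cols = len(gray), len(gray[0])
    feats = set()
    for r in range(rows):
        for c in range(cols):
            # vertical edge: strong contrast between columns c and c+1
            # over `length` consecutive rows starting at row r
            if r + length <= rows and c + 1 < cols and all(
                abs(gray[r + k][c] - gray[r + k][c + 1]) >= thresh
                for k in range(length)
            ):
                feats.add(("v", r, c))
            # horizontal edge: strong contrast between rows r and r+1
            # over `length` consecutive columns starting at column c
            if c + length <= cols and r + 1 < rows and all(
                abs(gray[r][c + k] - gray[r + 1][c + k]) >= thresh
                for k in range(length)
            ):
                feats.add(("h", r, c))
    return feats

def conjunctions(feats, size=2):
    """Compose edge relations into conjunctive features (elements of M)."""
    from itertools import combinations
    return {tuple(sorted(pair)) for pair in combinations(sorted(feats), size)}

def score(weights, active):
    """Linear function over the sparse active-feature list; unseen
    features contribute zero weight."""
    return sum(weights.get(f, 0.0) for f in active)

img = [[0, 0, 255, 255]] * 4  # left half dark, right half bright
print(sorted(short_edges(img)))  # [('v', 0, 1), ('v', 1, 1)]
```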
A realistic learning approach in this domain should also allow the use of variable input size, for two reasons. First, learning is done in terms of relations that are generated from the image in a data-driven way, making it impossible, or impractical, for a learning approach to write down explicitly, in advance, all possible relations and features. Similarly, dealing with a variable input size also allows an on-line learning scenario. Second, for computational efficiency, since only a few of the many potential features are active in any instance, using a variable input size allows the complexity of evaluating the learning hypothesis on an instance to depend on the number of active features in the input rather than the total number in the domain.

Given that, the learning approach used in this work is the one developed within the SNoW learning architecture (Roth, 1998; Carlson et al., 1999). SNoW is specifically tailored for learning in domains in which the potential number of features taking part in decisions is very large but may be unknown a priori, as in the infinite attribute learning model (Blum, 1992). Specifically, as input, the algorithm receives labeled instances <(x, l)>, where an instance x ∈ {0, 1}^∞ is presented as a list of all the active features in it and the label is a member of a discrete set of values (e.g., object identifiers). Given a domain instance (an image), a set of preexisting RGFs is evaluated on it and generates a collection of relations that are active in this image; these in turn may be composed to generate complex features, the elements of M. A list (of unique identifiers) of active elements in M is presented to the learning procedure, and learning is done at this level. Specifically, given an image, the learning algorithm receives as input a description of it in terms of the elements of M that are active in it.
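The flavor of this kind of learner can be conveyed by a minimal positive-Winnow sketch in the spirit of Littlestone (1988), over sparse active-feature lists (parameter names and default values are illustrative, not SNoW's): weights exist only for features that have actually been seen, so both memory and per-example cost scale with the number of active features rather than the size of the domain.

```python
class SparseWinnow:
    """Feature-efficient linear threshold learner over sparse examples.

    Weights are allocated lazily (on first sight of a feature), so the
    domain never has to be enumerated, as in the infinite attribute
    model. Promotion/demotion is multiplicative (Winnow-style).
    """
    def __init__(self, threshold=4.0, alpha=2.0):
        self.w = {}                 # feature id -> weight
        self.threshold = threshold
        self.alpha = alpha

    def predict(self, active):
        # setdefault allocates the initial weight 1.0 on first sight
        return sum(self.w.setdefault(f, 1.0) for f in active) >= self.threshold

    def update(self, active, label):
        pred = self.predict(active)
        if pred == label:
            return                  # no mistake: no update
        factor = self.alpha if label else 1.0 / self.alpha
        for f in active:            # touch only the active features
            self.w[f] *= factor

learner = SparseWinnow()
for _ in range(20):                 # toy target: examples with "a" are positive
    learner.update({"a", "b"}, True)
    learner.update({"b", "c"}, False)
print(learner.predict({"a", "b"}), learner.predict({"b", "c"}))  # True False
```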
As required by the theorems already presented, at least some elements in this description should be relevant to the functional description of the target object, but many others may not be. The learning algorithm quickly determines, by adjusting the weights of different features, the appropriate linear combination of features that can be used as a good discriminator. We note that conceptually similar approaches have been developed by several researchers (Fleuret & Geman, 1999; Amit & Geman, 1999; Tieu & Viola, 2000; Mel & Fiser, 2000). In all cases, the approach is based on generating a fairly large number of features, typically conjunctions of primitive features, and hoping that a learning approach could learn a reliable discriminator as a function of these. These approaches made use of simple statistical methods (Mel & Fiser, 2000) or learning algorithms such as decision trees and AdaBoost (Fleuret & Geman, 1999; Tieu & Viola, 2000). Our approach differs in that (1) given some reasonable assumptions, we suggest a theoretical framework in which justifications can be developed and in which performance can be quantified as a function of the expressivity of the features used, and (2) it makes use of a different computational paradigm that we believe to be more appropriate in this domain. We use a feature-efficient learning method that provides the opportunity to learn high-order
Learning to Recognize Three-Dimensional Objects
1083
Figure 2: (Top) Images of the same object taken at different view points. We use SNoW to learn the representations of each object: one subnetwork of features for the pill box (b) and the other for the mug (f). (a, e) respectively depict the features with top weights. Given a test image (b) and the learned representations (a, e), (c) shows the active features in the pill box subnetwork (a), and (d) shows the active features in the mug subnetwork (e). Given the test image (f) and the learned representations (a, e), (g) shows the active features in the mug subnetwork (e), and (h) shows the active features in the pill box subnetwork (a).
discriminators efficiently by increasing the dimensionality of the feature space (in a data-driven way) and still learn a linear function in the augmented feature space. This eliminates the need to perform explicit feature selection as in the other methods (it is done automatically and efficiently in the learning stage) or to deal with each feature separately (as in AdaBoost). In Figure 2, we use two objects to illustrate the learned representation after training on a collection of examples, as well as the features in the learned representation that are active in a new example.
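To make the representation concrete, here is a small sketch (ours; the feature descriptors and image encoding are invented for illustration) that maps a tiny binary image to the sparse list of active feature ids for length-3 horizontal and vertical edges, with ids allocated on demand in the data-driven, infinite-attribute style described above.

```python
# Sketch (not the authors' code): encode a binary image as a sparse list
# of active feature ids, where each feature is a short horizontal or
# vertical edge (3 consecutive "on" pixels), as in the Figure 1 examples.
# Ids are allocated the first time a feature is observed, mimicking the
# variable-size, data-driven input the text describes.

def active_features(image, feature_ids):
    """image: list of equal-length '0'/'1' strings.
    feature_ids: dict mapping a feature descriptor to a unique id; it
    grows as new features are first seen. Returns sorted active ids."""
    h, w = len(image), len(image[0])
    active = set()
    for r in range(h):
        for c in range(w):
            if c + 2 < w and image[r][c:c + 3] == '111':
                desc = ('horiz', r, c)
                if desc not in feature_ids:        # allocate on first sight
                    feature_ids[desc] = len(feature_ids)
                active.add(feature_ids[desc])
            if r + 2 < h and all(image[r + k][c] == '1' for k in range(3)):
                desc = ('vert', r, c)
                if desc not in feature_ids:
                    feature_ids[desc] = len(feature_ids)
                active.add(feature_ids[desc])
    return sorted(active)

ids = {}
img = ["1110",
       "1000",
       "1000"]
print(active_features(img, ids))   # → [0, 1]
```

The example image activates one horizontal edge and one vertical edge; a later image would reuse existing ids and allocate fresh ones only for features never seen before.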
The images of a pill box object class and a mug object class taken at different view points (5 degrees apart) are shown at the top of Figure 2. We extract small edges (of length 3) from each example and use these as the representation of each instance in training a SNoW system: one subnetwork for the pill box object class and the other for the mug object class (see also Figure 1 for examples of the input representation to SNoW). Figures 2a and 2e depict the dominant features in the representations learned for the pill box object class and the mug object class (the weights of the features are not shown; we show only those that have relatively high weights). Given the pill box test image shown in Figure 2b and the two learned subnetworks, Figure 2c shows the features of the test image that are active in the pill box subnetwork. Again, weights are not shown, but even so, this can be viewed as evidence that the test image belongs to the pill box object class. Figure 2d shows the features of the same test image that are active in the mug subnetwork; here the evidence that the test image belongs to the mug object class is weak. Given the mug test image shown in Figure 2f and the two learned subnetworks, Figures 2g and 2h show the features of the test image that are active in the mug and the pill box subnetworks, respectively. It seems clear, even with the simple features shown here (only short edges) and without the weight information on the features, that the learned representations can discriminate images of objects taken from one target class from those taken from a different class of objects.

4 The SNoW Learning Architecture
The SNoW (Sparse Network of Winnows⁵) learning architecture is a sparse network of linear units over a common predefined or incrementally learned feature space. Nodes in the input layer of the network typically represent relations over the input instance and are used as the input features. Each linear unit, called a target node, represents a concept of interest over the input. In the current application, target nodes represent a definition of an object in terms of the elements of M, features extracted from the 2D image input. An input instance is mapped into a set of features active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of) the input features. Let A_t = {i_1, . . . , i_m} be the set of features that are active in an example and are linked to the target node t. Then the linear unit corresponding to t is active iff

    Σ_{i ∈ A_t} w_t^i > θ_t,
⁵ To winnow: to separate chaff from grain.
where w_t^i is the weight on the edge connecting the ith feature to the target node t, and θ_t is the threshold for the target node t. Each SNoW unit may include a collection of subnetworks, one for each of the target relations but all using the same feature space. A given example is treated autonomously by each target unit; an example labeled t may be treated as a positive example by the t unit and as a negative example by the rest of the target nodes in its subnetwork. At decision time, a prediction for each subnetwork is derived using a winner-take-all policy. In this way, SNoW may be viewed as a multiclass predictor. In the application described here, we may have one unit with target subnetworks for all the target objects, or we may define different units, each with two competing target objects. SNoW's learning policy is on-line and mistake driven. Several update rules can be used within SNoW; the most successful, and the only one used in this work, is a variant of Littlestone's Winnow update rule (Littlestone, 1988), a multiplicative update rule that is tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNoW. That is, (1) input features are allocated in a data-driven way (an input node for the feature i is allocated only if the feature i was active in some input vector), and (2) a link (i.e., a nonzero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t. One of the important properties of the sparse architecture is that the complexity of processing an example depends only on the number of features active in it, n_a, and is independent of the total number of features, n_t, observed over the lifetime of the system. This is important in domains in which the total number of features is very large but only a small number of them is active in each example.
The Winnow update rule has, in addition to the threshold θ_t at the target t, two update parameters: a promotion parameter α > 1 and a demotion parameter 0 < β < 1. These are used to update the current representation of the target t (the set of weights w_t^i) only when a mistake in prediction is made. Let A_t = {i_1, . . . , i_m} be the set of active features linked to the target node t. If the algorithm predicts 0 (that is, Σ_{i ∈ A_t} w_t^i ≤ θ_t) and the received label is 1, the active weights in the current example are promoted in a multiplicative fashion: ∀i ∈ A_t, w_t^i ← α · w_t^i.
If the algorithm predicts 1 (that is, Σ_{i ∈ A_t} w_t^i > θ_t) and the received label is 0, the active weights in the current example are demoted: ∀i ∈ A_t, w_t^i ← β · w_t^i. All other weights are unchanged. The key feature of the Winnow update rule is that the number of examples required to learn a linear function grows linearly with the number
n_r of relevant features and only logarithmically with the total number of features. Specifically, in the sparse model, the number of examples required before converging to a linear separator that separates the data (provided it exists) scales with O(n_r log n_a), where n_a is the number of active features observed. This property seems crucial in domains in which the number of potential features is vast but a relatively small number of them is relevant. Winnow is known to learn efficiently any linear threshold function (in general, the number of examples scales inversely with the margin; Littlestone, 1988) and to be robust in the presence of various kinds of noise and in cases where no linear threshold function can make perfect classifications, while still maintaining its dependence on the number of total and relevant attributes (Littlestone, 1991; Kivinen & Warmuth, 1995a). Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. Figures 3, 4, and 5 provide more details on the SNoW learning architecture. Essentially, the SNoW learning architecture inherits its generalization properties from the update rule being used, the Winnow rule in this case, but there are a few differences worth mentioning relative to simply using the basic update rule. First, SNoW allows the use of a variable input size via the infinite attribute domain. Second, SNoW is more expressive than the basic Winnow rule. The basic Winnow update rule makes use of positive weights only. Standard augmentations, for example, via the duplication trick (Littlestone, 1988), are infeasible in high-dimensional spaces since they diminish the gain from using variable-size examples (half of the features become active).
More sophisticated approaches, such as using the balanced version of Winnow, apply only to the case of two classes, while SNoW is a multiclass classifier. Other extensions offered in SNoW relative to the standard update rule include an involved feature pruning method and a prediction confidence mechanism (Carlson, Rosen, & Roth, 2001).
F = Z⁺ = {0, 1, . . .}                    /* Set of potential features */
T = {t_1, . . . , t_k} ⊂ F                /* Set of targets */
F_t ⊆ F                                   /* Set of features linked to target t */
tNET = {[(i, w_t^i) : i ∈ F_t], θ_t}      /* The representation of the target t */
activation : T → ℝ                        /* Activation level of a target t */
SNoW = {tNET : t ∈ T}                     /* The SNoW network */
e = {i_1, . . . , i_m} ⊂ F                /* An example: a list of active features */

Figure 3: SNoW: Objects and notation.
Training Phase: SNoW-Train(SNoW, e)

  Initially: F_t = ∅ for all t ∈ T.
  For each t ∈ T:
    1. UpdateArchitecture(t, e)
    2. Evaluate(t, e)
    3. UpdateWeights(t, e)

Evaluation Phase: SNoW-Evaluation(SNoW, e)

  For each t ∈ T:
    Evaluate(t, e)
  MakeDecision(SNoW, e)

Figure 4: SNoW: Training and evaluation. Training is the learning phase in which the network is constructed and weights are adjusted. Evaluation is the phase in which the network is evaluated, given an observation. This is a conceptual distinction; in principle, one can run in an on-line mode, in which training is done continuously, even when the network is used for evaluating examples.
Procedure Evaluate(t, e)
  activation(t) = Σ_{i ∈ e} w_t^i

Procedure UpdateWeights(t, e)
  If (activation(t) > θ_t) & (t ∉ e):          /* predicted positive on a negative example */
    for each i ∈ e: w_t^i ← w_t^i · β
  If (activation(t) ≤ θ_t) & (t ∈ e):          /* predicted negative on a positive example */
    for each i ∈ e: w_t^i ← w_t^i · α

Procedure UpdateArchitecture(t, e)
  If t ∈ e: for each i ∈ e \ F_t, set w_t^i = w   /* Link feature to target; set initial weight */
  Otherwise: do nothing

Procedure MakeDecision(SNoW, e)
  Predict winner = arg max_{t ∈ T} activation(t)   /* Winner-take-all prediction */

Figure 5: SNoW: Main procedures.
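The procedures of Figures 3 through 5 can be collected into a minimal sketch (our reconstruction, not the released SNoW package; theta, alpha, beta, and the initial weight w0 are illustrative values). Weights live in per-target dictionaries, so both evaluation and updates touch only the active features, as the sparse architecture prescribes.

```python
# A toy SNoW unit (our reconstruction): one Winnow-updated subnetwork
# per target, with features linked to a target only after appearing in
# an example carrying that target's label.

class SNoW:
    def __init__(self, targets, theta=1.0, alpha=1.5, beta=0.5, w0=0.1):
        self.theta, self.alpha, self.beta, self.w0 = theta, alpha, beta, w0
        self.net = {t: {} for t in targets}   # t -> {feature id: weight}

    def activation(self, t, e):               # Procedure Evaluate
        return sum(self.net[t].get(i, 0.0) for i in e)

    def train(self, e, label):                # one mistake-driven pass
        for t in self.net:
            if t == label:                    # UpdateArchitecture: link new features
                for i in e:
                    self.net[t].setdefault(i, self.w0)
            act = self.activation(t, e)
            if act > self.theta and t != label:      # demote: false positive
                for i in e:
                    if i in self.net[t]:
                        self.net[t][i] *= self.beta
            elif act <= self.theta and t == label:   # promote: false negative
                for i in e:
                    self.net[t][i] *= self.alpha

    def predict(self, e):                     # MakeDecision: winner-take-all
        return max(self.net, key=lambda t: self.activation(t, e))

net = SNoW(["pillbox", "mug"])
for _ in range(20):                  # sparse examples: lists of active ids
    net.train([1, 2, 3], "pillbox")
    net.train([3, 4, 5], "mug")
print(net.predict([1, 2]))           # → pillbox
```

Because updates are mistake driven, weights stop changing once each target's activation clears the threshold on its own examples, and prediction cost depends only on the features active in the test example.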
The SNoW learning architecture has been used successfully on a variety of large-scale problems in the natural language domain (Roth, 1998; Munoz, Punyakanok, Roth, & Zimak, 1999; Golding & Roth, 1999) and only recently has been applied to problems in the visual domain (Yang et al., 2000c). Finally, we note that, although not crucial to the main topic of this article, the SNoW learning architecture and the general approach of generating features for it could also be defended with neural plausibility in mind. First, SNoW represents objects in units that have only positive, excitatory connections, corresponding to the physiological fact that the firing rates of neurons cannot be negative; inhibitory effects of features arise from their occurrences in the representations of other objects. Moreover, a consequence of the weight update rule in the SNoW unit is that synapses do not change sign. Second, generating features as conjunctions of simple binary detectors (which can be SNoW units themselves) has been suggested before as an effective representation with potential biological plausibility (Fleuret & Geman, 1999; Ullman & Soloviev, 1999).

5 Discussion of Learning Methods
In this article, our learning approach is compared mostly to another linear learning approach: support vector machines (SVM). Below, we present the SVM method in some detail.

5.1 Support Vector Machines. SVM (Vapnik, 1995; Cortes & Vapnik, 1995) is a general-purpose learning method for pattern recognition and regression problems that is based on the theory of structural risk minimization. According to this principle, as described in section 2, a function that describes the training data well and belongs to a set of functions with low VC dimension will generalize well (that is, will guarantee a small expected recognition error for unseen data points) regardless of the dimensionality of the input space (Vapnik, 1995). Based on this principle, the SVM is a systematic approach to finding a linear function (a hyperplane) that belongs to a set of functions of this form with the lowest VC dimension. Linear classifiers are used for computational purposes and, in addition, have the nice property that it is possible to quantify the VC dimension of this function class explicitly in terms of the minimal distance (margin) between positive and negative points (assuming the data is linearly separable). For expressivity, SVMs provide nonlinear function approximations by mapping the input vectors into a high-dimensional feature space in which a linear hyperplane that separates the data exists. The method can also be extended to cases where the best hyperplane in the resulting high-dimensional space does not quite separate all the data points. Given a set of samples (x_1, y_1), (x_2, y_2), . . . , (x_l, y_l), where x_i ∈ ℝ^N is the input vector and y_i ∈ {−1, 1} is its label, an SVM aims to find a separating hyperplane with the property that the distance it has from points of either
class (margin distance) is maximized. Vapnik (1995) shows that maximizing the margin distance is equivalent to minimizing the VC dimension and therefore contributes to better generalization. The problem of finding the optimal hyperplane is thus posed as a constrained optimization problem and solved using quadratic programming techniques. The optimal hyperplane, which determines the class label of a data point x ∈ ℝ^N, is of the form

    f(x) = sgn( Σ_{i=1}^{l} y_i α_i · k(x, x_i) + b ),
where k(·, ·) is a kernel function, used to map the original instance space to a high-dimensional space, b is a bias term, and sgn is the function that outputs +1 on positive inputs and −1 otherwise. Constructing an optimal hyperplane is equivalent to determining the nonzero α_i's. Sample vectors x_i that correspond to a nonzero α_i are called the support vectors (SVs) of the optimal hyperplane. The hope, when using this method, is to find a small number of SVs, thereby producing a compact classifier. The use of kernel functions avoids the need to blow up the dimensionality explicitly in order to reach a state in which the sample is linearly separable. If the kernel is of the form k(x, x_i) = Φ(x) · Φ(x_i) for some nonlinear function Φ: ℝ^N → ℝ^M, the computation can be done in the original low-dimensional space rather than in the M-dimensional space, although the hyperplane is constructed in ℝ^M. For a linear SVM, the kernel function is simply the dot product of vectors in the input space. Several kernel functions, such as polynomial functions and radial basis functions, have this property (Mercer's theorem) and can be used in nonlinear SVMs, allowing the construction of a variety of learning machines, some of which coincide with classical architectures. However, this also results in a drawback, since one needs to find the "right" kernel function when using SVMs. It is interesting to observe that although the use of kernel functions seems to be one of the advantages of SVMs from a theoretical point of view, many experimental studies have used linear SVMs, which were found to perform better than higher-order kernels (Pontil & Verri, 1998). This might be due to the fact that higher-order kernels imply in general worse generalization bounds and thus require more examples to generalize well.

5.2 SNoW and SVMs.
It is worthwhile to discuss the similarities and differences between the computational approaches we experiment with and to develop expectations about differences in the results. At a conceptual level, the two learning methods are very similar. They both search for a linear function that best separates the training data. Both are based on the same inductive principle: performing well on the training data
with a classifier of low expressivity will result in good generalization on data sampled from the same distribution. Both methods work by blowing up the original instance space to a high-dimensional space and attempting to find a linear classifier in the new space. This gives rise to one significant difference between the methods. SVMs are a close relative of additive update algorithms like the perceptron (Rosenblatt, 1958; Freund & Schapire, 1998). For these algorithms, the dimensionality increase is done via the kernel functions and thus need not be done explicitly. The multiplicative update rule used in SNoW does not allow the use of kernels, and the dimensionality increase has to be done explicitly. Computationally, this could be significant. However, SNoW allows for the use of a variable input space, and since the feature space is sparse, it turns out that SNoW is actually more efficient than current SVM implementations. This advantage is significant when the examples are sparse (the number of active features in each example is small), and it disappears when there are many active features in each example, where the kernel-based methods are advantageous computationally. In addition, RGFs, which are the notion equivalent to kernels, could allow for more general transformations than those allowed by kernels (although in this work, we use conjunctions, which are polynomial kernels). A second issue has to do with the way the two methods determine the coefficients of the linear classifier and the implications this has for their generalization abilities. In SVMs, the weights are determined based on a global optimization criterion that aims at maximizing the margin, using a quadratic programming scheme. The generalization bounds achieved this way are related to those achieved by the perceptron (Graepel, Herbrich, & Williamson, 2001).
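The kernel shortcut mentioned above can be checked directly in a few lines (a toy illustration; the explicit map Φ is ours): for 2D inputs, the quadratic kernel k(u, v) = (u · v)² equals an ordinary dot product after the map Φ(x) = (x₁², √2·x₁x₂, x₂²), so the high-dimensional computation never has to be carried out explicitly.

```python
# Toy check (ours) of the kernel identity k(x, x_i) = Phi(x) . Phi(x_i)
# for the quadratic kernel on 2D inputs.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit quadratic feature map for a 2D input (for illustration)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def k(u, v):
    return dot(u, v) ** 2   # computed entirely in the original 2D space

u, v = (1.0, 2.0), (3.0, 0.5)
print(abs(k(u, v) - dot(phi(u), phi(v))) < 1e-9)  # → True
```

An additive learner such as the perceptron can therefore train with k alone; the multiplicative Winnow update has no such shortcut, which is the trade-off described in the text.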
SNoW makes use of an on-line algorithm that attempts to minimize the number of mistakes on the training data; the loss function used to determine the weight update rule can be traced to the maximum entropy principle (Kivinen & Warmuth, 1995a). The implications are that while, in the limit, SVMs might find the optimal linear separator, SNoW has significant advantages in sparse spaces, those in which only a few of the features are actually relevant (Kivinen & Warmuth, 1995a). We could expect, therefore, that in domains with these characteristics, if the number of training examples is limited, SNoW will generalize better (and in general will have better learning curves). In the limit, when sufficient examples are given, the methods will be comparable. Finally, there is one practical issue: SVMs are binary classifiers, while SNoW can be used as a multiclass classifier. To get a fair comparison, however, we use SNoW here as a binary classifier as well, as described below.

6 View-Based Object Recognition
The appearance of an object is the combined effect of its shape, reflectance properties, pose, and illumination in the scene. While shape and reflectance
are intrinsic properties that do not change for a rigid object, pose and illumination vary from one scene to another. View-based recognition methods attempt to use data observed under different poses and illumination conditions to learn a compact model of the object's appearance; this, in turn, is used to resolve the recognition problem from view points that were not observed previously. A number of view-based schemes have been developed to recognize 3D objects. Poggio and Edelman (1990) show that 3D objects can be recognized from the raw intensity values in 2D images (we call this representation here a pixel-based representation) using a network of generalized radial basis functions. Each radial basis function generalizes and stores one of the example views and computes a weighting factor to minimize a measure of the error between the network's prediction and the desired output for each of the training examples. They argue and demonstrate that the full 3D structure of an object can be estimated if enough 2D views of the object are provided. This work has been extended to object categorization (Riesenhuber & Poggio, 2000). (See also Edelman, 1999, for more details.) Turk and Pentland (1991) demonstrate that human faces can be represented and recognized by "eigenfaces." Representing a face image as a vector of pixel values, the eigenfaces are the eigenvectors associated with the largest eigenvalues, which are computed from a covariance matrix of the sample vectors. An attractive feature of this method is that the eigenfaces can be learned from the sample images in pixel-based representation without any feature selection. The eigenspace approach has since been used in vision tasks from face recognition to object tracking. Murase and Nayar (1995; Nayar, Nene, & Murase, 1996) develop a parametric eigenspace method to recognize 3D objects directly from their appearance.
For each object of interest, a set of images in which the object appears in different poses is obtained as training examples. Next, the eigenvectors are computed from the covariance matrix of the training set. The set of images is projected to a low-dimensional subspace spanned by a subset of the eigenvectors, in which the object is represented as a manifold. A compact parametric model is constructed by interpolating the points in the subspace. In recognition, the image of a test object is projected to the subspace, and the object is recognized based on the manifold it lies on. Using a subset of COIL 100, they show that 3D objects can be recognized accurately from their appearance in real time. In contrast to these algebraic methods, general-purpose learning methods such as SVMs have also been used for this problem. Schölkopf (1997) applies SVMs to recognize 3D objects from 2D images and demonstrates the potential of this approach in visual learning. Pontil and Verri (1998) also use SVMs for 3D object recognition and experimented with a subset of the COIL 100 data set. Their training set consisted of 36 images (one for every 10 degrees) for each of the 32 objects they chose, and the test sets consisted of the remaining 36 images for each object. For 20 random selections of 32
objects from the COIL 100, their system achieves a perfect recognition rate (but see the comments on that in section 7). Recently, Roobaert and Van Hulle (1999) also used a subset of the COIL 100 database to compare the performance of SVMs with different pixel-based input representations. Instead of using the whole appearance of an object for object recognition, several methods have used local features in visual learning. Le Cun et al. (1995) apply a convolutional neural network with local features to handwritten digit recognition, with very good results. They also demonstrate that their learning method is able to extract salient local features from example images without complicated and elaborate algorithms. The idea of estimating joint statistics of local features has been used in recent work. Amit and Geman (1999) use conjunctions of edges as local image features and apply tree classifiers to recognize handwritten digits. Schneiderman and Kanade (1998) use a naive Bayes classifier to model the joint distribution of features from face images and use the learned model for face detection with success. Viola and colleagues (De Bonet & Viola, 1998; Rikert, Jones, & Viola, 1999; Tieu & Viola, 2000) assume that the appearance of an object in one image is generated by a sparse set of visual local causes (i.e., features). Their method computes a large set of selective features from examples to capture local causal structure and applies a variation of AdaBoost (Freund & Schapire, 1997) with gaussian models to learn a hypothesis of an object.
Other examples include object recognition using high-dimensional iconic representations (Rao & Ballard, 1995), multidimensional histograms (Schiele, 2000), local curve features (Nelson & Selinger, 1998), conjunctions of local features (Mel, 1997), SVMs with local features using wavelets (Papageorgiou & Poggio, 2000), local principal component analysis (Penev & Atick, 1996), and independent component analysis (Donato, Bartlett, Hager, Ekman, & Sejnowski, 1999). The current work is most related conceptually to a collection of other works that make use of local features and their joint statistics (De Bonet & Viola, 1998; Rikert et al., 1999; Nelson & Selinger, 1998; Amit & Geman, 1999; Schiele, 2000). As in these works, the key assumption underlying our work is that objects can be recognized (or discriminated) using simple representations in terms of syntactically simple relations over the raw image. Based on these assumptions, this work provides a learning theory account for the problem of object recognition within the PAC model of learnability. Moreover, the computational approach developed and supported here is different from previous approaches and more suitable, we believe, to realistic visual learning situations. Although the number of these simple relations could be huge, at the basis of our computational approach is the belief that only a few of them are actually present in each observed image and a fairly small number of those observed is relevant to discriminating an object. Under these assumptions, our framework has several theoretical advantages that we described in the previous sections. For our framework to contribute to a practical solution,
there also needs to be a computational approach that is able to learn efficiently (in terms of both computation and number of examples) in the presence of a large number of potential explanations. Our evaluation of the theoretical framework makes use of the SNoW learning architecture (Roth, 1998; Carlson et al., 1999), which is tailored for these kinds of tasks. Next, we use the COIL 100 data set for a quantitative experimental evaluation.

7 Experimental Evaluation
We use the COIL 100 database in all the experiments below (COIL is available on-line at www.cs.columbia.edu/CAVE). The data set consists of color images of 100 objects, where the images of the objects were taken at pose intervals of 5 degrees, for 72 poses per object. The images were also normalized such that the larger of the two object dimensions (height and width) fits the image size of 128 × 128 pixels. Figure 6 shows the images of the 100 objects taken in frontal view (i.e., zero pose angle). The 32 highlighted objects in Figure 6 are considered more difficult to recognize in Pontil and Verri (1998); we use all 100 objects, including these, in our experiments. Each color image is converted to a gray-scale image of 32 × 32 pixels for our experiments.

7.1 Ground Truth of the COIL 100 Data Set. At first glance, it seems difficult to recognize the objects in the COIL data set because it consists of a large number of objects with varying pose, texture, shape, and size. Since each object has 72 images of different poses (5 degrees apart), many view-based recognition methods use 36 of them (10 degrees apart) for training and the remaining images for testing. However, it turns out that under these dense sampling conditions, the recognition problem is not difficult (even when only gray-level images are used). In this case, instances that belong to the same object are very close to each other in the image space (where each data point represents an image of an object in a certain pose). We verified this by experimenting with a simple nearest-neighbor classifier (using the Euclidean distance), resulting in an average recognition rate of 98.50% (54 errors out of 3,600 tests). Figure 7 shows some of the objects misclassified by the nearest-neighbor method. In principle, one may want to avoid using the nearest-neighbor method, since it requires a lot of memory for storing templates and its recognition time complexity is high.
The goal here is simply to show that this method is comparable to the complex SVM approaches (Pontil & Verri, 1998; Roobaert & Van Hulle, 1999) for the case of dense sampling. Therefore, the densely sampled version of the recognition problem is not appropriate for comparisons among different methods. It is interesting to see that the pairs of objects that the nearest-neighbor method misclassifies have similar geometric configurations and similar poses. A close inspection shows that most of the recognition errors are made between the three packs of chewing gum, bottles, and cars. Other
Figure 6: Columbia Object Image Library (COIL 100) consists of 100 objects of varying poses 5 degrees apart. The objects are shown in row order; the highlighted ones are those considered more difficult to recognize.
dense sampling cases are easier for this method. Consequently, the set of selected objects in an experiment has a direct effect on the recognition rate. This needs to be taken into account when evaluating results that use only a subset of the 100 objects (typically 20 to 30) from the COIL data set for experiments. Table 1 shows the recognition rates of nearest-neighbor classifiers in several experiments in which 36 poses of each object are used as templates and the remaining 36 poses are used for tests. Given this baseline experiment, we decided to perform our experimental comparisons in cases in which the number of views of objects available in training is limited. Some of our preliminary results were presented in Yang, Roth, and Ahuja (2000a, 2000b).
Figure 7: Mismatched objects using the nearest-neighbor method. (x : a, y : b) means that object x with view angle a is recognized as object y with view angle b. The figure shows some of the 54 errors (out of 3,600 test samples) made by the nearest-neighbor classifier when there are 36 views per object in the training set.

Table 1: Recognition Rates of the Nearest-Neighbor Classifier.

                      30 Objects Randomly   32 Objects Selected by     The 100 Objects
                      Selected from COIL    Pontil and Verri (1998),   in COIL
                                            Shown in Figure 6
  Errors/tests        14/1080               46/1152                    54/3600
  Recognition rate    98.70%                96.00%                     98.50%
7.2 Experiment Setups. Applying SNoW to 3D object recognition requires specifying the architecture used and the representation chosen for the input images. To perform object recognition, we associate a target unit with each target object. This target learns a definition of the object in terms of the input features extracted from the image. We could define either a single SNoW unit that contains target subnetworks for all 100 different target objects or different units, each with several (e.g., two) competing target objects. Statistically, this approach is advantageous (Hastie & Tibshirani, 1998), although it clearly requires a lot more computation. The architecture selected affects the training time, where learning a definition for object a makes use of negative examples of other objects that are part of the same unit. More important, it makes a difference in testing; rather than two competing objects for a decision, there may be a hundred. The chances for a spurious mistake caused by an incidental view point are clearly much higher. It also
Table 2: Experimental Results of Three Classifiers Using the 100 Objects in the COIL-100 Data Set.

    Number of Views per Object    36           18           8            4
                                  3600 Tests   5400 Tests   6400 Tests   6800 Tests
    SNoW                          95.81%       92.31%       85.13%       81.46%
    Linear SVM                    96.03        91.30        84.80        78.50
    Nearest neighbor              98.50        87.54        79.52        74.63
has significant advantages in terms of space complexity and the appeal of the evaluation mode. SVMs are two-class classifiers that, for a c-class pattern recognition problem, usually need to train c(c − 1)/2 binary classifiers. Since we compare the performance of the proposed SNoW-based method with SVMs, in order to maintain a fair comparison we have to perform it in the one-against-one scheme. That is, we use SNoW units of size two. To classify a test instance, tournament-like pair-wise competition between all the machines is performed, and the winner determines the label of the test instance. Table 2 shows the recognition rates of the SVM- and SNoW-based methods using the one-against-one scheme. (That is, we trained (100 choose 2) = 4950 classifiers for each method and evaluated 99 (= 50 + 25 + 12 + 6 + 3 + 2 + 1) classifiers on each test instance.)

7.3 Results Using Pixel-Based Representation. Table 2 shows the recognition rates of the SNoW-based method, the SVM-based method (using the linear dot product for the kernel function), and the nearest-neighbor classifier using the COIL 100 data set. The important parameter we vary here is the number of views (v) observed during training; the rest of the views (72 − v) are used for testing. The experimental results show that the SNoW-based method performs as well as the SVM-based method when many views of the objects are present during training and outperforms the SVM-based method when the number of views is limited. Although it is not surprising to see that the recognition rate decreases as the number of views available during training decreases, it is worth noticing that both SNoW and SVM are capable of recognizing 3D objects in the COIL 100 data set with satisfactory performance if enough views (e.g., more than 18) are provided. Also, they seem to be fairly robust even if only a limited number of views (e.g., 8 and 4) are used for training; the performance of both methods degrades gracefully.
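The tournament-style one-against-one evaluation described above can be sketched as follows. Here `pairwise_decide` is a stand-in for the trained binary classifiers (one per pair of labels); the function names are assumptions for illustration.

```python
def tournament_predict(classes, pairwise_decide):
    """Tournament-like pair-wise competition over candidate labels.

    classes: list of candidate labels
    pairwise_decide(a, b): returns the winning label for that pair
    Runs rounds of pairwise matches until one label remains; with an
    odd number of survivors, the last one gets a bye to the next round.
    """
    survivors = list(classes)
    n_evals = 0
    while len(survivors) > 1:
        nxt = []
        for i in range(0, len(survivors) - 1, 2):
            nxt.append(pairwise_decide(survivors[i], survivors[i + 1]))
            n_evals += 1
        if len(survivors) % 2 == 1:  # bye for the odd one out
            nxt.append(survivors[-1])
        survivors = nxt
    return survivors[0], n_evals

# With 100 classes this evaluates 50 + 25 + 12 + 6 + 3 + 2 + 1 = 99
# classifiers, matching the count in the text. A toy decider that
# always prefers the larger label stands in for a real classifier.
winner, evals = tournament_predict(range(100), lambda a, b: max(a, b))
print(winner, evals)  # -> 99 99
```

Note that only 99 of the 4950 trained classifiers are evaluated per test instance, which is the space/time appeal mentioned above.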
An additional potential advantage of the SNoW architecture is that it does not learn discriminators but rather can learn a representation for each object,
Table 3: Recognition Rates of SNoW Using Two Learning Paradigms.

    Number of Views per Object    36        18        8         4
    One-against-one               95.81%    92.31%    85.13%    81.46%
    One-against-all               90.52     84.50     81.85     76.00
which can then be used for prediction in the one-against-all scheme or to build hierarchical representations. See Figure 2 for examples. However, as is shown in Table 3, this implies a significant degradation in the performance. Finding a way to make better predictions in the one-against-all scheme is one of the important issues for future investigation, to exploit the advantages of this approach better.

7.4 Results Using Edge-Based Representation. For each 32 × 32 edge map, we extract horizontal and vertical edges (of length at least 3 pixels) and then encode as our features conjunctions of two of these edges. The number of potential features of this sort is (2048 choose 2) = 2,096,128; however, only an average of 1822 of these is active for objects in the COIL 100 data set. To reduce the computational cost, the feature vectors were further pruned, and only the 512 most frequently occurring features were retained in each image. Table 4 shows the performance of the SNoW-based method when conjunctions of edges are used to represent objects. As before, we vary the number of views of an object (v) during training and use the rest of the views (72 − v) for testing. The results indicate that conjunctions of edges provide useful information for object recognition and that SNoW is able to learn very good object representations using these features. The experimental results also exhibit the relative advantage of this representation when the number of views per object is limited.

7.5 Simple Occlusion. This section presents some preliminary studies of object recognition in the presence of occlusion. Our current modeling of occlusion is fairly simplistic relative to studies such as Nelson and Selinger (1998) and Schiele (2000). However, given that our theoretical paradigm supports recognition in the presence of occlusion, we wanted to experiment with this in the current setting. We select a set of 10 objects [6] from the COIL 100 data set and add in artificial occlusions for experiments.
In the data set, each object has 36 images (10 degrees apart) for training and the remaining 36 images for tests (also 10

[6] More specifically, the objects are selected from the set of objects on which the nearest-neighbor classifier makes the most mistakes: objects 8, 13, 23, 27, 31, 42, 65, 78, 80, 91.
Table 4: Experimental Results of Four Classifiers Using the 100 Objects in the COIL-100 Data Set.

    Number of Views per Object        36           18           8            4
                                      3600 Tests   5400 Tests   6400 Tests   6800 Tests
    SNoW with conjunction of edges    96.25%       94.13%       89.23%       88.28%
    SNoW with intensity values        95.81        92.31        85.13        81.46
    Linear support vector machine     96.03        91.30        84.80        78.50
    Nearest neighbor                  98.50        87.54        79.52        74.63
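The edge-conjunction representation evaluated in Table 4 can be sketched as follows. The segment encoding, the run-extraction details, and the function names are assumptions for illustration; only the overall scheme (horizontal/vertical edges of length at least 3, pairs of edges as active features) comes from the text.

```python
from itertools import combinations

def edge_segments(edge_map, min_len=3):
    """Find horizontal and vertical runs of 'on' pixels of length >=
    min_len in a binary edge map (list of lists of 0/1). Each segment
    is encoded as (orientation, row_or_col, start, length)."""
    h, w = len(edge_map), len(edge_map[0])
    segs = []
    for r in range(h):                       # horizontal runs
        c = 0
        while c < w:
            if edge_map[r][c]:
                start = c
                while c < w and edge_map[r][c]:
                    c += 1
                if c - start >= min_len:
                    segs.append(('h', r, start, c - start))
            else:
                c += 1
    for c in range(w):                       # vertical runs
        r = 0
        while r < h:
            if edge_map[r][c]:
                start = r
                while r < h and edge_map[r][c]:
                    r += 1
                if r - start >= min_len:
                    segs.append(('v', c, start, r - start))
            else:
                r += 1
    return segs

def conjunction_features(segs):
    """Active features are unordered pairs (conjunctions) of segments."""
    return {frozenset([a, b]) for a, b in combinations(segs, 2)}

# A 5x5 map with one horizontal and one vertical edge of length 3.
m = [[0] * 5 for _ in range(5)]
for c in range(1, 4):
    m[0][c] = 1          # horizontal edge in row 0
for r in range(2, 5):
    m[r][4] = 1          # vertical edge in column 4
segs = edge_segments(m)
print(len(segs), len(conjunction_features(segs)))  # -> 2 1
```

In the paper's setting the active conjunctions would additionally be pruned to the 512 most frequent features per image.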
Figure 8: Object images with and without occlusion as well as their edge maps.
degrees apart). The object images are occluded by a strip controlled by four parameters (a, p, l, g), where a denotes the angle of the strip, p the percentage of occluded image area, l the location of the center of the strip, and g the intensity value of the strip. Figure 8 shows some object images and the occluded object images for {a, p, l, g} = {45°, 15%, (16, 16), 0}. Our trained SNoW classifier is tested against this data set using the edge-based representation. Table 5 shows the experimental results with and without occlusions on this set of 10 objects with 36 views. The recognition performance degrades only slightly, from 92.03% to 88.78%. Note that the objects are those on which the nearest-neighbor classifier makes the most mistakes.

Table 5: Experimental Results of SNoW Classifier on Occluded Images with 36 Views per Object.

            Recognition Rate      Recognition Rate
            Without Occlusion     With Occlusion
    SNoW    92.03%                88.78%

Although our theoretical framework supports noise tolerance, it is clear that the limited expressivity of the features used in this experimental study
limits the ability to tolerate occlusion; we nevertheless find our results promising.

8 Conclusion
The main contribution of this work is proposing a learning framework for visual learning and exhibiting its feasibility. In this approach, learnability can be rigorously studied without making assumptions on the distribution of the observed objects; instead, via the PAC model, the performance of the learned hypothesis naturally depends on its prior experience. An important feature of the approach is that learning is studied not directly in terms of the raw data but rather with respect to intermediate representations extracted from it, and can thus be quantified in terms of the ability to generate expressive intermediate representations. In particular, it makes explicit the requirements from these representations to allow learnability. We believe that research in vision should concentrate on the study of these intermediate representations.

We evaluated the approach and demonstrated its feasibility in a large-scale experiment in the context of learning for object recognition. Our experiments also allowed us to perform a fair comparison between two successful and related learning methods and to study them in the context of object recognition. We have illustrated our approach in a large-scale experimental study in which we use the SNoW learning architecture to learn representations for the objects in COIL 100. Although it is clear that object recognition in isolation is not the ultimate goal, this study shows the potential of this computational approach as a basis for studying and supporting more realistic visual inferences. We note that for a fair comparison among different methods, we have used the pixel-based representation in the experiments. The experimental results suggest that the edge-based representation used is more effective and robust and should be the starting point for future research. There is no question that the RGFs used in this work are not general enough to support more challenging recognition problems; the intention was merely to exhibit the general approach.
We believe that pursuing the direction of using complex intermediate representations will benefit future work on recognition and, in particular, robust recognition under realistic conditions. The framework developed here is very general. The explanations used as features by the learning algorithm can represent a variety of computational processes and information sources that operate on the image. They can depend on local properties of the image, the relative positions of primitives in the image, and even external information sources or context variables. Thus, the theoretical support given here applies also to an intermediate learning stage in a hierarchical process. In order to generate the explanations efficiently, this work assumes that they are syntactically simple in terms of the raw image. However, the explanation might as well be syntactically
simple in terms of previously learned or inferred predicates, giving rise to a hierarchical representation. We believe that the key future research question suggested by this line of work is that of incorporating more visual knowledge into instantiations of this framework and, in particular, using it to generate better explanations.
Acknowledgments
D. R. is supported by National Science Foundation grants IIS-9801638, IIS-9984168, and ITR-IIS-0085836. M.-H. Y. was supported by a Ray Ozzie Fellowship and Office of Naval Research (ONR) grant N00014-96-1-0502 and currently is with Honda Fundamental Research Labs, Mountain View, California. N. A. is supported by ONR grant N00014-96-1-0502.
References

Amit, Y., & Geman, D. (1999). A computational model for visual selection. Neural Computation, 11(7), 1691–1715.
Angluin, D. (1992). Computational learning theory: Survey and selected bibliography. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing (pp. 351–369). Victoria, B.C., Canada.
Blum, A. (1992). Learning Boolean functions in an infinite attribute space. Machine Learning, 9(4), 373–386.
Carlson, A., Cumby, C., Rosen, J., & Roth, D. (1999). The SNoW learning architecture (Tech. Rep. No. UIUCDCS-R-99-2101). Urbana, IL: UIUC Computer Science Department.
Carlson, A. J., Rosen, J., & Roth, D. (2001). Scaling up context sensitive text correction. In Proceedings of the Thirteenth Innovative Applications of Artificial Intelligence Conference. Seattle, WA.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20(3), 273–297.
Cumby, C., & Roth, D. (2000). Relational representations that facilitate learning. In Proceedings of the International Conference on the Principles of Knowledge Representation and Reasoning (pp. 425–434). Breckinridge, CO.
De Bonet, J., & Viola, P. (1998). Texture recognition using a non-parametric multi-scale statistical model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 641–647). Santa Barbara, CA.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 974–989.
Edelman, S. (1993). On learning to recognize 3-D objects from examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(8), 833–837.
Edelman, S. (1999). Representation and recognition in vision. Cambridge, MA: MIT Press.
Fleuret, F., & Geman, D. (1999). Graded learning for object detection. In Proceedings of the IEEE Workshop on Statistical and Computational Theories of Vision. Fort Collins, CO.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Freund, Y., & Schapire, R. (1998). Large margin classification using the perceptron algorithm. In Proceedings of the Annual ACM Workshop on Computational Learning Theory (pp. 209–217). Madison, WI.
Golding, A. R., & Roth, D. (1999). A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1–3), 107–130. (Special Issue on Machine Learning and Natural Language.)
Graepel, T., Herbrich, R., & Williamson, R. C. (2001). From margin to sparsity. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 210–216). Cambridge, MA: MIT Press.
Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 507–513). Cambridge, MA: MIT Press.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1), 78–150.
Haussler, D., Kearns, M., Littlestone, N., & Warmuth, M. K. (1988). Equivalence of models for polynomial learnability. In Proceedings of the 1988 Workshop on Computational Learning Theory (pp. 42–55). Cambridge, MA.
Kearns, M., & Li, M. (1993). Learning in the presence of malicious error. SIAM Journal on Computing, 22(4), 807–837.
Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17(2/3), 115–142.
Kivinen, J., & Warmuth, M. K. (1995a). Additive versus exponentiated gradient updates for linear prediction. In Proceedings of the Annual ACM Symposium on Theory of Computing (pp. 209–218). Las Vegas, NV.
Kivinen, J., & Warmuth, M. K. (1995b). The perceptron algorithm vs. Winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. In Proceedings of 8th Annual Conference on Computational Learning Theory (pp. 289–296). Santa Cruz, CA.
Kushilevitz, E., & Roth, D. (1996). On learning visual concepts and DNF formulae. Machine Learning, 24(1), 65–85.
Le Cun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Müller, U., Säckinger, E., Simard, P., & Vapnik, V. (1995). Comparison of learning algorithms for handwritten digit recognition. In Proceedings of International Conference on Artificial Neural Networks (pp. 53–60). Paris, France.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory (pp. 147–156). Santa Cruz, CA.
Mel, B. W. (1997). Seemore: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9, 777–804.
Mel, B. W., & Fiser, J. (2000). Minimizing binding errors using learned conjunctive features. Neural Computation, 12, 247–278.
Munoz, M., Punyakanok, V., Roth, D., & Zimak, D. (1999). A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 168–178). College Park, MD.
Murase, H., & Nayar, S. K. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14, 5–24.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Real-time 100 object recognition system. In Proceedings of IEEE International Conference on Robotics and Automation. Minneapolis, MN.
Nelson, R. C., & Selinger, A. (1998). Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38(15/16), 2469–2488.
Papageorgiou, C., & Poggio, T. (2000). A trainable system for object detection. International Journal of Computer Vision, 38(1), 15–33.
Penev, P., & Atick, J. (1996). Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7(3), 477–500.
Poggio, T., & Edelman, S. (1990). A network that learns to recognize 3D objects. Nature, 343, 263–266.
Pontil, M., & Verri, A. (1998). Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6), 637–646.
Rao, R. P. N., & Ballard, D. H. (1995). An active vision architecture based on iconic representations. Artificial Intelligence, 78(1–2), 461–505.
Rikert, T., Jones, M., & Viola, P. (1999). A cluster-based statistical model for object detection. In Proceedings of the Seventh IEEE International Conference on Computer Vision (pp. 1046–1053). Kerkyra, Greece.
Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.
Roobaert, D., & Hulle, M. V. (1999). View-based 3D object recognition with support vector machines. In IEEE International Workshop on Neural Networks for Signal Processing (pp. 77–84). Madison, WI.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–407.
Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (pp. 806–813). Madison, WI.
Schiele, B. (2000). Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36(1), 31–50.
Schneiderman, H., & Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 45–51). Santa Barbara, CA.
Schölkopf, B. (1997). Support vector learning. Munich: Oldenbourg Verlag.
Shvaytser, H. (1990). Learnable and nonlearnable visual concepts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), 459–466.
Tieu, K., & Viola, P. (2000). Boosting image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 228–235). Hilton Head, SC.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Ullman, S., & Soloviev, S. (1999). Computation of pattern invariance in brain-like structures. Neural Networks, 12, 1021–1036.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yang, M.-H., Roth, D., & Ahuja, N. (2000a). Learning to recognize 3D objects with SNoW. In Proceedings of the Sixth European Conference on Computer Vision (Vol. 1, pp. 439–454). Dublin, Ireland.
Yang, M.-H., Roth, D., & Ahuja, N. (2000b). View-based 3D object recognition using SNoW. In Proceedings of the Fourth Asian Conference on Computer Vision (Vol. 2, pp. 830–835). Taipei, Taiwan.
Yang, M.-H., Roth, D., & Ahuja, N. (2000c). A SNoW-based face detector. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 855–861). Cambridge, MA: MIT Press.

Received August 23, 2000; accepted July 27, 2001.
LETTER
Communicated by John Platt
A Parallel Mixture of SVMs for Very Large Scale Problems

Ronan Collobert
[email protected] and [email protected]
Dalle Molle Institute for Perceptual Artificial Intelligence, 1920 Martigny, Switzerland, and Université de Montréal, DIRO, Montréal, Québec, Canada

Samy Bengio
[email protected]
Dalle Molle Institute for Perceptual Artificial Intelligence, 1920 Martigny, Switzerland

Yoshua Bengio
[email protected]
Université de Montréal, DIRO, Montréal, Québec, Canada

Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.

1 Introduction
Recently a lot of work has been done around support vector machines (Vapnik, 1995), mainly due to their impressive generalization performance on classification problems when compared to other algorithms such as artificial neural networks (Cortes & Vapnik, 1995; Osuna, Freund, & Girosi, 1997). However, SVMs need resources that are at least quadratic in the number of training examples in order to solve a quadratic optimization problem, and it is thus hopeless to try to solve problems having millions of examples using classical SVMs. In order to overcome this drawback, we propose in this article to use a mixture of several SVMs, each of them trained on only a part of the data set. The idea of an SVM mixture is not new, although previous attempts such as Kwok's article (1998) on support vector mixtures trained

Neural Computation 14, 1105–1114 (2002) © 2002 Massachusetts Institute of Technology
the SVMs not on part of the data set but on the whole data set and hence could not overcome the time complexity problem for large data sets. We propose here a simple method to train such a mixture and will show that in practice, this method is much faster than training only one SVM and leads to results that are at least as good as one SVM. We conjecture that the training time complexity of the proposed approach with respect to the number of examples is subquadratic for large data sets. Moreover, this mixture can be easily parallelized, which could improve the training time significantly. In the next section, we briefly introduce the SVM model for classification. In section 3, we present our mixture of SVMs, followed in section 4 by some comparisons to related models. In section 5, we show some experimental results, first on a toy data set and then on a large real-life data set. A brief conclusion follows.

2 Introduction to Support Vector Machines
SVMs (Vapnik, 1995) have been applied to many classification problems, generally yielding good performance compared to other algorithms. The decision function is of the form

    y = sign( \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b ),    (2.1)

where x \in R^d is the d-dimensional input vector of a test example, y \in {−1, 1} is a class label, x_i \in R^d is the input vector for the ith training example, N is the number of training examples, K(x, x_i) is a positive definite kernel function, and \alpha = {\alpha_1, ..., \alpha_N} and b are the parameters of the model. Training an SVM consists of finding \alpha that minimizes the objective function

    Q(\alpha) = −\sum_{i=1}^{N} \alpha_i + (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j),    (2.2)

subject to the constraints

    \sum_{i=1}^{N} \alpha_i y_i = 0    (2.3)

and

    0 \le \alpha_i \le C  for all i.    (2.4)
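Evaluating the decision function of equation 2.1 can be sketched numerically as follows. This is a minimal illustration with an RBF kernel (one common choice, defined in equation 2.5); the parameter values below are made up for illustration, not a trained model.

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / sigma^2), as in equation 2.5.
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def svm_decision(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    """y = sign(sum_i y_i * alpha_i * K(x, x_i) + b), equation 2.1.
    Only examples with alpha_i > 0 (the support vectors) contribute."""
    s = sum(yi * ai * kernel(x, xi)
            for xi, yi, ai in zip(support_x, support_y, alpha) if ai > 0)
    return 1 if s + b >= 0 else -1

# Made-up parameters for illustration (not a trained SVM).
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1])
alpha = np.array([0.5, 0.5])
print(svm_decision(np.array([0.9, 0.9]), X, y, alpha, b=0.0))  # -> 1
print(svm_decision(np.array([0.1, 0.0]), X, y, alpha, b=0.0))  # -> -1
```

The training step itself (minimizing equation 2.2 under equations 2.3 and 2.4) is the quadratic program whose cost motivates the mixture approach of this article.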
The kernel K(x, x_i) can have different forms, such as the radial basis function (RBF),

    K(x_i, x_j) = exp( −||x_i − x_j||^2 / \sigma^2 ),    (2.5)

with parameter \sigma. Therefore, to train an SVM, we need to solve a quadratic optimization problem where the number of parameters is N. This makes the use of SVMs for large data sets difficult: computing K(x_i, x_j) for every training pair would require O(N^2) computation, and solving may take up to O(N^3). Note, however, that current state-of-the-art algorithms appear to have a training time complexity scaling much closer to O(N^2) than O(N^3) (Collobert & Bengio, 2001; Joachims, 1999; Platt, 1999).

3 Mixtures of SVMs
In this section, we introduce a new type of mixture of SVMs. The proposed model should minimize the following cost function,

    C = \sum_{i=1}^{N} [ h( \sum_{m=1}^{M} w_m(x_i) s_m(x_i) ) − y_i ]^2,    (3.1)
where M is the number of experts in the mixture, s_m(x_i) is the output of the mth expert given input x_i, w_m(x_i) is the weight for the mth expert given by a "gater" module also taking x_i as input, and h is a transfer function that could be, for example, the hyperbolic tangent for classification tasks. Here, each expert is an SVM, and we took a neural network for the gater in our experiments. To train this model, we propose a very simple algorithm:

1. Divide the training set into M random subsets of size near N/M.

2. Train each expert separately over one of these subsets.

3. Keeping the experts fixed, train the gater to minimize the cost, equation 3.1, on the whole training set.

4. Reconstruct M subsets. For each example (x_i, y_i), sort the experts in descending order according to the values w_m(x_i), and assign the example to the first expert in the list that has fewer than (N/M + c) examples,[1] in order to ensure a balance between the experts.
[1] Where c is a small, positive constant.
5. If a termination criterion is not fulfilled (such as a given number of iterations or a validation error going up), go to step 2.

Note that step 2 of this algorithm can be easily implemented in parallel, as each expert can be trained separately on a different computer. Note also that step 3 can be an approximate minimization (as usually done when training neural networks).

4 Related Models
The idea of mixture models is quite old and has given rise to very popular algorithms, such as the well-known mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991), where the cost function is similar to equation 3.1 but where the gater and the experts are trained simultaneously, using gradient descent or expectation-maximization, on the whole data set (and not on subsets). Hence, such an algorithm is quite demanding in terms of resources when the data set is large if training time scales like O(N^p) with p > 1. In the more recent SVM mixture model (Kwok, 1998), the author shows how to replace the experts (typically neural networks) by SVMs and gives a learning algorithm for this model. Once again, the resulting mixture is trained jointly on the whole data set and hence does not overcome the quadratic barrier when the data set is large. In another divide-and-conquer approach (Rida, Labbi, & Pellegrini, 1999), the authors propose to divide the training set using an unsupervised algorithm to cluster the data (typically a mixture of gaussians), train an expert (such as an SVM) on each subset of the data corresponding to a cluster, and finally recombine the outputs of the experts. Here, the algorithm does indeed separately train the experts on small data sets, like the present algorithm, but there is no notion of a loop reassigning the examples to experts according to the gater's prediction of how well each expert performs on each example. Our experiments suggest that this element is essential to the success of the algorithm.

5 Experiments
In this section, we present two sets of experiments comparing the new mixture of SVMs to other machine learning algorithms.

5.1 A Toy Problem. In the first series of experiments, we tested the mixture on an artificial toy problem where we generated 1000 training examples and 10,000 test examples. The problem had two nonlinearly separable classes and two input dimensions. Figure 1 shows the decision surfaces obtained first by a linear SVM, then by a gaussian SVM, and finally by the
SVMs for Very Large Scale Problems
1109
Figure 1: Comparison of the decision surfaces obtained by (a) a linear SVM, (b) a gaussian SVM, and (c) a linear mixture of two linear SVMs, on a two-dimensional classification toy problem.
proposed mixture of SVMs. Moreover, in the proposed mixture, the gater was a simple linear function and there were two linear SVMs in the mixture. This artificial problem thus shows clearly that the algorithm seems to work and is able to combine, even linearly, very simple models in order to produce a nonlinear decision surface.

5.2 A Large-Scale Realistic Problem: Forest. For a more realistic problem, we did a series of experiments on part of the UCI Forest data set.[2] Since this is a classification problem with seven classes, we modified it into a binary classification problem where the goal was to separate class 2 from the other

[2] The Forest data set is available on the UCI web site at the following address: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info.
Table 1: Comparison of Performances.

                              Error (%)               Time (minutes)    Total Number of
                          Train   Valid   Test        1 cpu   50 cpu    Support Vectors
    One MLP (a)           17.56   18.12   18.15          12                  100
    One SVM               16.03   16.68   16.76        3231               42,451
    Uniform SVM
      mixture (b)         19.69   19.90   20.31          85       2       52,846
    Gated SVM mixture      5.91    8.90    9.28         237      73       31,703

    Notes: (a) 100 hidden units. (b) The gater always outputs the same value for each expert.
six classes. The data set had more than 500,000 examples, and this enabled us to prepare a series of experiments as follows:

- We kept a test set of 50,000 examples to compare the best mixture of SVMs to other learning algorithms.
- We used a validation set of 10,000 examples to select the best mixture of SVMs, varying the number of experts and the number of hidden units in the gater.
- We trained our models on different training sets, using from 100,000 to 400,000 examples.

The mixtures had from 10 to 50 expert SVMs with gaussian kernel, and the gater was a multilayer perceptron (MLP) with between 25 and 500 hidden units. Because the number of examples was quite large, we selected the internal training parameters, such as the σ of the gaussian kernel of the SVMs or the learning rate of the gater, using a held-out portion of the training set. We compared our models to:

- A single MLP, where the number of hidden units was selected by cross-validation between 25 and 250 units.
- A single SVM, where the parameter of the kernel was also selected by cross-validation.
- A mixture of SVMs where the gater was replaced by a constant vector, assigning the same weight value to every expert.

Table 1 gives the results of a first series of experiments with a fixed training set of 100,000 examples. We selected among the variants of the gated SVM mixture not only using performance on the validation set but also taking into account the time to train the model. The selected model had 50 experts and a gater with 150 hidden units. A model with 500 hidden units would have given a performance of 8.1% over the test set but would have taken 621 minutes on one machine (and 388 minutes on 50 machines). As
SVMs for Very Large Scale Problems
1111
Figure 2: Comparison of the validation error of different mixtures of SVMs with various numbers of hidden units and experts.
can be seen, the gated SVM outperformed all models in terms of training, validation, and test error. It was also much faster, even on one machine, than the single SVM, and since the mixture can easily be parallelized (each expert can be trained separately), we also give the time it took to train on 50 machines. It is interesting to note that the total number of support vectors of the gated SVM mixture was smaller than the number of support vectors of the single SVM. In a first attempt to understand these results, we can at least say that the power of the model lies not only in the gater, since a single MLP had quite poor performance on the test set; not only in the use of SVMs, since a single SVM was not as good as the gated mixture; and not only in dividing the problem into many subproblems, since the uniform mixture also performed badly. It appears to be the combination of all these elements. We also ran a series of experiments to see the influence of the number of hidden units of the gater as well as the number of experts in the mixture. Figure 2 shows the validation error of different mixtures of SVMs, where the number of hidden units varied from 25 to 500 and the number of experts varied from 10 to 50. There is a clear performance improvement when the number of hidden units is increased, while the improvement with
1112
Ronan Collobert, Samy Bengio, and Yoshua Bengio
Figure 3: Comparison of the training time of the same mixture of SVMs (50 experts, 150 hidden units in the gater) trained on different training set sizes, from 100,000 to 400,000.
additional experts exists but is less clear. Note, however, that the training time also increases rapidly with the number of hidden units and slightly decreases with the number of experts if one uses one computer per expert. In order to determine how the algorithm scaled with respect to the number of examples, we then compared the same mixture of experts (50 experts, 150 hidden units in the gater) on different training set sizes. Figure 3 shows the training time of the mixture of SVMs trained on training sets of sizes from 100,000 to 400,000. It seems that at least in this range and for this particular data set, the mixture of SVMs scales linearly with respect to the number of examples and not quadratically, like a classical SVM. It is interesting to see, for instance, that the mixture of SVMs was able to solve a problem of 400,000 examples in less than 7 hours (on 50 computers), while it would have taken more than one month to solve the same problem with a single SVM.
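The divide-and-conquer loop behind these timings can be sketched in a few lines. In the toy NumPy sketch below, the SVM experts are replaced by regularized linear least-squares fits and the gater MLP by a linear-softmax model (both cheap stand-ins for the authors' actual components, not their implementation), so only the structure of the algorithm is illustrated: partition the data, train the experts (in parallel), train the gater on the experts' outputs, and reassign the examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: y = sign(sin(x)), which a single linear model cannot fit
# but a gated mixture of local linear experts can.
N, n_experts = 400, 4
X = np.linspace(0.05, 4 * np.pi - 0.05, N)[:, None]
y = np.where(np.sin(X[:, 0]) > 0, 1.0, -1.0)
Phi = np.hstack([X, np.ones((N, 1))])          # inputs with a bias column

def train_expert(idx):
    """Regularized linear least squares -- a cheap stand-in for an SVM expert."""
    A, t = Phi[idx], y[idx]
    return np.linalg.solve(A.T @ A + 1e-6 * np.eye(2), A.T @ t)

def gater(G):
    """Linear gater followed by a softmax over the experts."""
    logits = Phi @ G
    logits -= logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Step 1: initial partition of the training set (here simply by position).
assign = (np.arange(N) * n_experts) // N
G = 0.01 * rng.normal(size=(2, n_experts))

for _ in range(5):                             # outer loop of the algorithm
    # Step 2: train each expert on its current subset (trivially parallelizable).
    experts = [train_expert(assign == m) for m in range(n_experts)]
    S = np.column_stack([Phi @ w for w in experts])    # expert outputs

    # Step 3: a few gradient steps on the gater for the squared mixture error.
    for _ in range(500):
        W = gater(G)
        f = (W * S).sum(axis=1)                # mixture output
        err = (f - y)[:, None]
        # gradient of 0.5 * sum((f - y)^2) w.r.t. the gater parameters
        G -= 0.1 * Phi.T @ (err * W * (S - f[:, None])) / N

    # Step 4: reassign every example to the expert the gater favors most.
    assign = gater(G).argmax(axis=1)

W = gater(G)
mixture_acc = np.mean(np.sign((W * S).sum(axis=1)) == y)
```

The simplified reassignment rule (hard argmax of the gater) mirrors the role of the loop in the paper: expert training and example assignment are alternated until the partition stabilizes.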
Figure 4: Comparison of the training and validation errors of the mixture of SVMs as a function of the number of training iterations.
Finally, Figure 4 shows the evolution of the training and validation errors of a mixture of 50 SVMs gated by an MLP with 150 hidden units, during five iterations of the algorithm. This shows convincingly that the outer loop of the algorithm is essential to obtain good performance.

6 Conclusion
In this article, we have presented a new algorithm to train a mixture of SVMs that gave very good results compared to classical SVMs in terms of training time, generalization performance, and sparseness. Moreover, the algorithm appears to scale linearly with the number of examples, at least between 100,000 and 400,000 examples. Furthermore, the proposed algorithm has also been tested on another task with a similar number of examples and also yielded performance improvements. These results are extremely encouraging and suggest that the proposed method could allow training SVM-like models for very large multimillion-example data sets in a reasonable time. If training of the neural network gater with
stochastic gradient takes time that grows much less than quadratically, as we believe to be the case for very large data sets (to reach a "good enough" solution), then the whole method is clearly subquadratic in training time with respect to the number of training examples.

Acknowledgments
R. C. thanks the Swiss National Science Foundation for financial support (project FN2100-061234.00). Y. B. thanks the NSERC funding agency and NCM2 network for support.

References

Collobert, R., & Bengio, S. (2001). SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1, 143–160.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods. Cambridge, MA: MIT Press.
Kwok, J. T. (1998). Support vector mixture for classification and regression problems. In Proceedings of the International Conference on Pattern Recognition (ICPR) (pp. 255–258). Brisbane, Queensland, Australia.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 130–136). San Juan, Puerto Rico.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods. Cambridge, MA: MIT Press.
Rida, A., Labbi, A., & Pellegrini, C. (1999). Local experts combination through density decomposition. In International Workshop on AI and Statistics (Uncertainty '99). San Mateo, CA: Morgan Kaufmann.
Vapnik, V. N. (1995). The nature of statistical learning theory (2nd ed.). New York: Springer-Verlag.

Received May 15, 2001; accepted August 13, 2001.
LETTER
Communicated by John Platt
Bayesian Framework for Least-Squares Support Vector Machine Classifiers, Gaussian Processes, and Kernel Fisher Discriminant Analysis

T. Van Gestel ([email protected])
J. A. K. Suykens ([email protected])
G. Lanckriet ([email protected])
A. Lambrechts ([email protected])
B. De Moor ([email protected])
J. Vandewalle ([email protected])
Katholieke Universiteit Leuven, Department of Electrical Engineering ESAT-SISTA, B-3001 Leuven, Belgium

The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) in the work of MacKay. Nevertheless, the training of MLPs suffers from drawbacks like the nonconvex optimization problem and the choice of the number of hidden units. In support vector machines (SVMs) for classification, as introduced by Vapnik, a nonlinear decision boundary is obtained by mapping the input vector first in a nonlinear way to a high-dimensional kernel-induced feature space in which a linear large margin classifier is constructed. Practical expressions are formulated in the dual space in terms of the related kernel function, and the solution follows from a (convex) quadratic programming (QP) problem. In least-squares SVMs (LS-SVMs), the SVM problem formulation is modified by introducing a least-squares cost function and equality instead of inequality constraints, and the solution follows from a linear system in the dual space. Implicitly, the least-squares formulation corresponds to a regression formulation and is also related to kernel Fisher discriminant analysis. The least-squares regression formulation has advantages for deriving analytic expressions in a Bayesian evidence framework, in contrast to the classification formulations used, for example, in gaussian processes (GPs).
The LS-SVM formulation has clear primal-dual interpretations, and without the bias term, one explicitly constructs a model that yields the same expressions as have been obtained with GPs for regression. In this article, the Bayesian evidence framework is combined with the LS-SVM classifier formulation. Starting from the feature space formulation, analytic expressions are obtained in the dual space on the different levels of Bayesian inference, while posterior class probabilities are obtained by marginalizing over the model parameters. Empirical results obtained on 10 public domain data sets show that the LS-SVM classifier designed within the Bayesian evidence framework consistently yields good generalization performances.

Neural Computation 14, 1115–1147 (2002) © 2002 Massachusetts Institute of Technology

1 Introduction
Bayesian probability theory provides a unifying framework to find models that are well matched to the data and to use these models for making optimal decisions. Multilayer perceptrons (MLPs) are popular nonlinear parametric models for both regression and classification. In MacKay (1992, 1995, 1999), the evidence framework was successfully applied to the training of MLPs using three levels of Bayesian inference: the model parameters, regularization hyperparameters, and network structure are inferred on the first, second, and third level, respectively. The moderated output is obtained by marginalizing over the model- and hyperparameters using a Laplace approximation in a local optimum. Whereas MLPs are flexible nonlinear parametric models that can approximate any continuous nonlinear function over a compact interval (Bishop, 1995), the training of an MLP suffers from drawbacks like the nonconvex optimization problem and the choice of the number of hidden units. In support vector machines (SVMs), the classification problem is formulated and represented as a convex quadratic programming (QP) problem (Cristianini & Shawe-Taylor, 2000; Vapnik, 1995, 1998). A key idea of the nonlinear SVM classifier is to map the inputs to a high-dimensional feature space where the classes are assumed to be linearly separable. In this high-dimensional space, a large margin classifier is constructed. By applying the Mercer condition, the classifier is obtained by solving a finite dimensional QP problem in the dual space, which avoids the explicit knowledge of the high-dimensional mapping and uses only the related kernel function. In Suykens and Vandewalle (1999), a least-squares type of SVM classifier (LS-SVM) was introduced by modifying the problem formulation so as to obtain a linear set of equations in the dual space. This is done by taking a least-squares cost function, with equality instead of inequality constraints.
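The kernel functions and the Mercer condition invoked here are easy to probe numerically. A small sketch (plain NumPy, arbitrary toy data; the linear, polynomial, and RBF kernels are the ones listed in section 2) builds each Gram matrix and checks the symmetry and positive semidefiniteness that Mercer's theorem guarantees:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))   # arbitrary toy sample

def linear_kernel(X1, X2):
    return X1 @ X2.T

def poly_kernel(X1, X2, d=3):
    # polynomial kernel of degree d: (x_i^T x + 1)^d
    return (X1 @ X2.T + 1.0) ** d

def rbf_kernel(X1, X2, sigma=1.5):
    # RBF kernel: exp(-||x - x_i||^2 / sigma^2)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

grams = [k(X, X) for k in (linear_kernel, poly_kernel, rbf_kernel)]
# Mercer's condition: each Gram matrix is symmetric positive semidefinite.
symmetric = all(np.allclose(K, K.T) for K in grams)
min_eig = min(np.linalg.eigvalsh(K).min() for K in grams)
```

The MLP kernel tanh(κ x_i^T x + θ) is deliberately left out of the check: as the text notes later, it satisfies the Mercer condition only for some choices of κ and θ.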
The training of MLP classifiers is often done by using a regression approach with binary targets for solving the classification problem. This is also implicitly done in the LS-SVM formulation and has the advantage of allowing the derivation of analytic expressions within a Bayesian evidence framework, in contrast with the classification approaches used, for example, in GPs. As in ordinary ridge regression (Brown, 1977), no regularization is applied on the bias term in SVMs and LS-SVMs, which results in a centering in the kernel-induced feature space and allows relating the LS-SVM formulation to kernel Fisher discriminant analysis (Baudat & Anouar, 2000; Mika, Rätsch, & Müller, 2001). The corresponding eigenvalues of the centered kernel matrix are obtained from kernel PCA (Schölkopf, Smola, & Müller, 1998). When no bias term is used in the LS-SVM formulation, similar expressions are obtained as with kernel ridge regression and gaussian processes (GPs) for regression (Gibbs, 1997; Neal, 1997; Rasmussen, 1996; Williams, 1998). In this article, a Bayesian framework is derived for the LS-SVM formulation starting from the SVM and LS-SVM feature space formulation, while the corresponding analytic expressions in the dual space are similar, up to the centering, to the expressions obtained for GPs. The primal-dual interpretations and equality constraints of LS-SVMs have also allowed extending the LS-SVM framework to recurrent networks and optimal control (Suykens & Vandewalle, 2000; Suykens, Vandewalle, & De Moor, 2001). The regression formulation allows deriving analytic expressions in order to infer the model parameters, hyperparameters, and kernel parameters on the corresponding three levels of Bayesian inference, respectively. Posterior class probabilities of the LS-SVM classifier are obtained by marginalizing over the model parameters within the evidence framework. In section 2, links between kernel-based classification techniques are discussed. The three levels of inference are described in sections 3, 4, and 5. The design strategy is explained in section 6. Empirical results are discussed in section 7.

2 Kernel-Based Classification Techniques
Given a binary classification problem with classes $\mathcal{C}_+$ and $\mathcal{C}_-$, with corresponding class labels $y = \pm 1$, the classification task is to assign a class label to a given new input vector $x \in \mathbb{R}^n$. Applying Bayes' formula, one can calculate the posterior class probability:

$$P(y \mid x) = \frac{p(x \mid y)\,P(y)}{p(x)}, \qquad (2.1)$$

where $P(y)$ is the (discrete) a priori probability of the classes and $p(x \mid y)$ is the (continuous) probability of observing $x$ when corresponding to class label $y$. The denominator $p(x)$ follows from normalization. The class label is then assigned to the class with maximum posterior probability:

$$y(x) = \operatorname{sign}[g_0(x)] = \operatorname{sign}[P(y = +1 \mid x) - P(y = -1 \mid x)] \qquad (2.2)$$

or

$$y(x) = \operatorname{sign}[g_1(x)] = \operatorname{sign}[\log(p(x \mid y = +1)P(y = +1)) - \log(p(x \mid y = -1)P(y = -1))]. \qquad (2.3)$$
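For a concrete one-dimensional instance of these decision rules, the NumPy sketch below (made-up class means, a shared variance, and priors) verifies that the posterior difference $g_0$ of equation 2.2 and the log-ratio $g_1$ of equation 2.3 always agree in sign, and that for equal-variance gaussian class densities $g_1$ reduces to a linear function of $x$:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two 1-D gaussian classes with equal variance; priors P(+1)=0.6, P(-1)=0.4.
m_plus, m_minus, var = 1.0, -1.0, 1.5
P_plus, P_minus = 0.6, 0.4

x = rng.normal(size=200)            # arbitrary evaluation points

def pdf(x, m):
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# g0 (equation 2.2): difference of posterior class probabilities.
post_plus = pdf(x, m_plus) * P_plus / (pdf(x, m_plus) * P_plus
                                       + pdf(x, m_minus) * P_minus)
g0 = post_plus - (1.0 - post_plus)

# g1 (equation 2.3): difference of log joint densities.
g1 = np.log(pdf(x, m_plus) * P_plus) - np.log(pdf(x, m_minus) * P_minus)

# Equal-covariance case: g1 collapses to the linear discriminant w*x + b.
w = (m_plus - m_minus) / var
b = -w * (m_plus + m_minus) / 2 + np.log(P_plus) - np.log(P_minus)
lin = w * x + b
```

Here `lin` equals `g1` exactly, which is the linear discriminant derived next in the text.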
Given $g_0(x)$, one obtains the posterior class probabilities from $P(y = +1 \mid x) = \frac{1}{2}(1 + g_0(x))$ and $P(y = -1 \mid x) = \frac{1}{2}(1 - g_0(x))$ (Duda & Hart, 1973). When the densities $p(x \mid y = +1)$ and $p(x \mid y = -1)$ have a multivariate normal distribution with the same covariance matrix $\Sigma$ and corresponding means $m_+$ and $m_-$, respectively, the Bayesian decision rule, equation 2.3, becomes the linear discriminant function,

$$y(x) = \operatorname{sign}[w^T x + b], \qquad (2.4)$$

with $w = \Sigma^{-1}(m_+ - m_-)$ and $b = -w^T(m_+ + m_-)/2 + \log(P(y = +1)) - \log(P(y = -1))$ (Bishop, 1995; Duda & Hart, 1973). In practice, the class covariance matrix $\Sigma$ and the means $m_+$ and $m_-$ are not known, and the linear classifier $w^T x + b$ has to be estimated from given data $D = \{(x_i, y_i)\}_{i=1}^N$ that consist of $N_+$ positive and $N_-$ negative labels. The corresponding sets of indices with positive and negative labels are denoted by $I_+$ and $I_-$, with the full index set $I$ equal to $I = I_+ \cup I_- = \{1, \dots, N\}$. Some well-known algorithms to estimate the discriminant vector $w$ and bias term $b$ are Fisher discriminant analysis, the support vector machine (SVM) classifier, and a regression approach with binary targets $y_i = \pm 1$. However, when the class densities are not normally distributed with the same covariance matrix, the optimal decision boundary typically is no longer linear (Bishop, 1995; Duda & Hart, 1973). A nonlinear decision boundary in the input space can be obtained by applying the kernel trick: the input vector $x \in \mathbb{R}^n$ is mapped in a nonlinear way to the high (possibly infinite) dimensional feature vector $\varphi(x) \in \mathbb{R}^{n_f}$, where the nonlinear function $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_f}$ is related to the symmetric, positive definite kernel function,

$$K(x_1, x_2) = \varphi(x_1)^T \varphi(x_2), \qquad (2.5)$$

from Mercer's theorem (Cristianini & Shawe-Taylor, 2000; Smola, Schölkopf, & Müller, 1998; Vapnik, 1995, 1998). In this high-dimensional feature space, a linear separation is made. For the kernel function $K$, one typically has the following choices: $K(x, x_i) = x_i^T x$ (linear kernel); $K(x, x_i) = (x_i^T x + 1)^d$ (polynomial kernel of degree $d \in \mathbb{N}$); $K(x, x_i) = \exp\{-\|x - x_i\|_2^2 / \sigma^2\}$ (RBF kernel); or $K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP kernel). Notice that the Mercer condition holds for all $\sigma \in \mathbb{R}$ and $d$ values in the RBF (resp. the polynomial) case, but not for all possible choices of $\kappa, \theta \in \mathbb{R}$ in the MLP case. Combinations of kernels can be obtained by stacking the corresponding feature vectors. The classification problem now is assumed to be linear in the feature space, and the classifier takes the form

$$y(x) = \operatorname{sign}[w^T \varphi(x) + b], \qquad (2.6)$$
where $w$ and $b$ are obtained by applying the kernel version of the above-mentioned algorithms, where typically a regularization term $w^T w / 2$ is introduced in order to avoid overfitting (large margin $2/\sqrt{w^T w}$) in the high (and possibly infinite) dimensional feature space. On the other hand, the classifier, equation 2.6, is never evaluated in this form, and the Mercer condition, equation 2.5, is applied instead. In the remainder of this section, the links between the different kernel-based classification algorithms are discussed.

2.1 SVM Classifiers. Given the training data $\{(x_i, y_i)\}_{i=1}^N$ with input data $x_i \in \mathbb{R}^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM classifier, according to Vapnik's original formulation (Vapnik, 1995, 1998), incorporates the following constraints ($i = 1, \dots, N$):

$$\begin{cases} w^T \varphi(x_i) + b \ge +1, & \text{if } y_i = +1 \\ w^T \varphi(x_i) + b \le -1, & \text{if } y_i = -1, \end{cases} \qquad (2.7)$$

which is equivalent to $y_i[w^T \varphi(x_i) + b] \ge 1$ ($i = 1, \dots, N$). The classification problem is formulated as follows:

$$\min_{w,b,\xi} \mathcal{J}_1(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^N \xi_i \qquad (2.8)$$

$$\text{subject to } \begin{cases} y_i\left[w^T \varphi(x_i) + b\right] \ge 1 - \xi_i, & i = 1, \dots, N \\ \xi_i \ge 0, & i = 1, \dots, N. \end{cases} \qquad (2.9)$$
This optimization problem is solved in its dual form, and the resulting classifier, equation 2.6, is evaluated in its dual representation. The variables $\xi_i$ are slack variables that are needed in order to allow misclassifications in the set of inequalities (e.g., due to overlapping distributions). The positive real constant $C$ should be considered as a tuning parameter in the algorithm. More details on SVMs can be found in Cristianini and Shawe-Taylor (2000), Smola et al. (1998), and Vapnik (1995, 1998). Observe that no regularization is applied on the bias term $b$.

2.2 LS-SVM Classifiers. In Suykens and Vandewalle (1999), the SVM classifier formulation was modified basically as follows:

$$\min_{w,b,e_c} \mathcal{J}_{2c}(w, e_c) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_{c,i}^2 \qquad (2.10)$$
subject to

$$y_i\left[w^T \varphi(x_i) + b\right] = 1 - e_{c,i}, \qquad i = 1, \dots, N. \qquad (2.11)$$

Besides the quadratic cost function, an important difference with standard SVMs is that the formulation now consists of equality instead of inequality constraints. The LS-SVM classifier formulation, equations 2.10 and 2.11, implicitly corresponds to a regression interpretation with binary targets $y_i = \pm 1$. By multiplying the error $e_{c,i}$ with $y_i$ and using $y_i^2 = 1$, the sum squared error term $\sum_{i=1}^N e_{c,i}^2$ becomes

$$\sum_{i=1}^N e_{c,i}^2 = \sum_{i=1}^N (y_i e_{c,i})^2 = \sum_{i=1}^N e_i^2 = \sum_{i=1}^N \left(y_i - \left(w^T \varphi(x_i) + b\right)\right)^2, \qquad (2.12)$$

with

$$e_i = y_i - \left(w^T \varphi(x_i) + b\right). \qquad (2.13)$$

Hence, the LS-SVM classifier formulation is equivalent to

$$\mathcal{J}_2(w, b) = \mu E_W + \zeta E_D, \qquad (2.14)$$

with

$$E_W = \frac{1}{2} w^T w, \qquad (2.15)$$

$$E_D = \frac{1}{2} \sum_{i=1}^N e_i^2 = \frac{1}{2} \sum_{i=1}^N \left(y_i - \left[w^T \varphi(x_i) + b\right]\right)^2. \qquad (2.16)$$

Both $\mu$ and $\zeta$ should be considered as hyperparameters in order to tune the amount of regularization versus the sum squared error. The solution of equation 2.14 depends only on the ratio $\gamma = \zeta/\mu$. Therefore, the original formulation (Suykens & Vandewalle, 1999) used only $\gamma$ as tuning parameter. The use of both parameters $\mu$ and $\zeta$ will become clear in the Bayesian interpretation of the LS-SVM cost function, equation 2.14, in the next sections. Observe that no regularization is applied to the bias term $b$, which is the preferred form for ordinary ridge regression (Brown, 1977). The regression approach with binary targets is a common approach for training MLP classifiers and also for the simpler case of linear discriminant analysis (Bishop, 1995). Defining the MSE error between $w^T \varphi(x) + b$ and the
Bayes discriminant $g_0(x)$ from equation 2.2,

$$\mathrm{MSE} = \int \left[w^T \varphi(x) + b - g_0(x)\right]^2 p(x)\, dx, \qquad (2.17)$$
it has been shown (Duda & Hart, 1973) that minimization of $E_D$ in equation 2.14 is asymptotically ($N \to \infty$) equivalent to minimizing equation 2.17. Hence, the regression formulation with binary targets yields asymptotically the best approximation to the Bayes discriminant, equation 2.2, in the least-squares sense (Duda & Hart, 1973). Such an approximation typically gives good results but may be suboptimal since the misclassification risk is not directly minimized. The solution of the LS-SVM regressor is obtained after constructing the Lagrangian $\mathcal{L}(w, b, e; \alpha) = \mathcal{J}_2(w, e) - \sum_{i=1}^N \alpha_i \{y_i - [w^T \varphi(x_i) + b] - e_i\}$, where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. The conditions for optimality are:

$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 & \to\; w = \sum_{i=1}^N \alpha_i \varphi(x_i) \\[6pt] \dfrac{\partial \mathcal{L}}{\partial b} = 0 & \to\; \sum_{i=1}^N \alpha_i = 0 \\[6pt] \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 & \to\; \alpha_i = \gamma e_i, \quad i = 1, \dots, N \\[6pt] \dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 & \to\; y_i - \left[w^T \varphi(x_i) + b\right] - e_i = 0, \quad i = 1, \dots, N. \end{cases} \qquad (2.18)$$

As in standard SVMs, we never calculate $w$ or $\varphi(x_i)$. Therefore, we eliminate $w$ and $e$, yielding a linear Karush-Kuhn-Tucker system instead of a QP problem:

$$\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix} \qquad (2.19)$$

with

$$Y = [y_1; \dots; y_N], \quad 1_v = [1; \dots; 1], \quad e = [e_1; \dots; e_N], \quad \alpha = [\alpha_1; \dots; \alpha_N], \qquad (2.20)$$

and where Mercer's condition, equation 2.5, is applied within the kernel matrix $\Omega \in \mathbb{R}^{N \times N}$,

$$\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j). \qquad (2.21)$$
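As a concrete illustration, the linear system of equation 2.19 can be assembled and solved directly. The NumPy sketch below uses made-up two-class data, an RBF kernel, and an arbitrary value of γ; since the targets $y_i = \pm 1$ enter the right-hand side directly in this regression form, the classifier is evaluated as $\operatorname{sign}(\sum_i \alpha_i K(x, x_i) + b)$, and the optimality conditions $\sum_i \alpha_i = 0$ and $\alpha_i = \gamma e_i$ from equation 2.18 can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two gaussian clouds with labels -1 and +1 (toy data).
N = 40
X = np.vstack([rng.normal(-1.0, 0.6, size=(N // 2, 2)),
               rng.normal(+1.0, 0.6, size=(N // 2, 2))])
Y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])

gamma = 10.0                       # regularization ratio gamma = zeta / mu
sigma = 1.0

def K(X1, X2):                     # RBF kernel exp(-||x - x'||^2 / sigma^2)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

# Assemble and solve the KKT system of equation 2.19.
Omega = K(X, X)
A = np.empty((N + 1, N + 1))
A[0, 0] = 0.0
A[0, 1:] = 1.0                                 # 1_v^T
A[1:, 0] = 1.0                                 # 1_v
A[1:, 1:] = Omega + np.eye(N) / gamma          # Omega + gamma^{-1} I_N
sol = np.linalg.solve(A, np.concatenate([[0.0], Y]))
b, alpha = sol[0], sol[1:]

e = Y - (Omega @ alpha + b)        # training errors of the regression form
pred = np.sign(Omega @ alpha + b)  # classifier evaluated on the training set
train_acc = np.mean(pred == Y)
```

Note that only the dual variables α and b are ever computed; $w$ and the feature map never appear, exactly as the text describes.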
The LS-SVM classifier is then constructed as follows:

$$y(x) = \operatorname{sign}\left[\sum_{i=1}^N \alpha_i y_i K(x, x_i) + b\right]. \qquad (2.22)$$

In numerical linear algebra, efficient algorithms exist for solving large-scale linear systems (Golub & Van Loan, 1989). The system, equation 2.19, can be reformulated into two linear systems with positive definite data matrices, so as to apply iterative methods such as the Hestenes-Stiefel conjugate gradient algorithm (Suykens, 2000). LS-SVM classifiers can be extended to multiple classes by defining additional output variables. Although sparseness is lost due to the use of a 2-norm, a sparse approximation of the LS-SVM can be obtained by sequentially pruning the support value spectrum (Suykens, 2000) without loss of generalization performance.

2.3 Gaussian Processes for Regression. When one uses no bias term $b$ in the regression formulation (Cristianini & Shawe-Taylor, 2000; Saunders, Gammerman, & Vovk, 1998), the support values $\alpha^*$ are obtained from the linear system,

$$\left(\Omega + \frac{\mu}{\zeta} I_N\right) \alpha^* = Y. \qquad (2.23)$$

The output of the LS-SVM regressor for a new input $x$ is given by

$$\hat{y}(x) = \sum_{i=1}^N \alpha_i^* K(x, x_i) = \theta(x)^T \alpha^*, \qquad (2.24)$$

with $\theta(x) = [K(x, x_1); \dots; K(x, x_N)]$. For classification purposes, one can use the interpretation of an optimal least-squares approximation, equation 2.17, to the Bayesian decision rule, and the class label is assigned as follows: $y = \operatorname{sign}[\hat{y}(x)]$. Observe that the result of equation 2.24 is equivalent to the gaussian process (GP) formulation (Gibbs, 1997; Neal, 1997; Rasmussen, 1996; Sollich, 2000; Williams, 1998; Williams & Barber, 1998) for regression. In GPs, one assumes that the data are generated as $y_i = \hat{y}(x_i) + e_i$. Given $N$ data points $\{(x_i, y_i)\}_{i=1}^N$, the predictive mean for a new input $x$ is given by $\hat{y}(x) = \theta_C(x)^T C_N^{-1} Y$, with $\theta_C(x) = [C(x, x_1); \dots; C(x, x_N)]$ and the matrix $C_N \in \mathbb{R}^{N \times N}$ with $C_{N,ij} = C(x_i, x_j)$, where $C(x_i, x_j)$ is the parameterized covariance function,

$$C(x_i, x_j) = \frac{1}{\mu} K(x_i, x_j) + \frac{1}{\zeta} \delta_{ij}, \qquad (2.25)$$

with $\delta_{ij}$ the Kronecker delta and $i, j = 1, \dots, N$. The predictive mean is obtained as

$$\hat{y}(x) = \frac{1}{\mu} \theta(x)^T \left(\frac{1}{\mu} \Omega + \frac{1}{\zeta} I_N\right)^{-1} Y. \qquad (2.26)$$

By combination of equations 2.23 and 2.24, one also obtains equation 2.26. The regularization term $E_W$ is related to the covariance matrix of the inputs, while the error term $E_D$ yields a ridge regression estimate in the dual space (Saunders et al., 1998; Suykens & Vandewalle, 1999; Suykens, 2000). With the results of the next sections, one can also show that the expression for the variance in GPs is equal to the expressions for the LS-SVM without the bias term. Compared with the GP classifier formulation, the regression approach allows the derivation of analytic expressions on all three levels of inference. In GPs, one typically uses combinations of kernel functions (Gibbs, 1997; Neal, 1997; Rasmussen, 1996; Williams & Barber, 1998), while a positive constant is added when there is a bias term in the regression function. In Neal (1997), the hyperparameters of the covariance function $C$ and the variance $1/\zeta$ of the noise $e_i$ are obtained from a sampled posterior distribution of the hyperparameters. Evidence maximization is used in Gibbs (1997) to infer the hyperparameters on the second level. In this article, the bias term $b$ is inferred on the first level, while $\mu$ and $\zeta$ are obtained from a scalar optimization problem on the second level. Kernel parameters are determined on the third level of inference. Although the results from the LS-SVM formulation without bias term and gaussian processes are identical, LS-SVMs explicitly formulate a model in the primal space. The resulting support values $\alpha_i$ of the model give further insight into the importance of each data point and can be used to obtain sparseness and detect outliers. The explicit use of a model also allows defining, in a straightforward way, the effective number of parameters $\gamma_{\mathrm{eff}}$ in section 4. In the LS-SVM formulation, the bias term is considered a model parameter and is obtained on the first level of inference.
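The equivalence between the bias-free LS-SVM of equations 2.23 and 2.24 and the GP predictive mean of equation 2.26 amounts to pulling the factor $1/\mu$ inside the matrix inverse. The NumPy sketch below (made-up regression data and toy values of μ, ζ, and the RBF kernel width) checks it numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))                       # toy training inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=25)    # toy regression targets
Xtest = rng.normal(size=(5, 3))

mu, zeta, sigma = 0.5, 20.0, 1.0                   # arbitrary hyperparameters

def K(X1, X2):
    # RBF kernel exp(-||x - x'||^2 / sigma^2)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

Omega = K(X, X)

# LS-SVM without bias term, equation 2.23: (Omega + (mu/zeta) I_N) alpha* = Y.
alpha_star = np.linalg.solve(Omega + (mu / zeta) * np.eye(len(X)), Y)
yhat_lssvm = K(Xtest, X) @ alpha_star              # equation 2.24

# GP predictive mean, equation 2.26, with covariance (1/mu) K + (1/zeta) delta.
yhat_gp = (K(Xtest, X) / mu) @ np.linalg.solve(Omega / mu
                                               + np.eye(len(X)) / zeta, Y)
```

Both predictors return identical values on the test points, which is the identity the text states.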
As in ordinary ridge regression (Brown, 1977), no regularization is applied on the bias term $b$ in LS-SVMs, and a zero-mean training set error is obtained from equation 2.18: $\sum_{i=1}^N e_i = 0$. It will become clear that the bias term also results in a centering of the Gram matrix in the feature space, as is done in kernel PCA (Schölkopf et al., 1998). The corresponding eigenvalues can be used to derive improved generalization bounds for SVM classifiers (Schölkopf, Shawe-Taylor, Smola, & Williamson, 1999). The use of the unregularized bias term also allows the derivation of explicit links with kernel Fisher discriminant analysis (Baudat & Anouar, 2000; Mika et al., 2001).

Figure 1: Two gaussian distributed classes with the same covariance matrix are separated by the hyperplane $w^T \varphi(x) + b = 0$ in the feature space. The class centers of classes $-1$ and $+1$ are located on the hyperplanes $w^T \varphi(x) + b = -1$ and $w^T \varphi(x) + b = +1$, respectively. The projections of the features onto the linear discriminant result in gaussian distributed errors with variance $\zeta^{-1}$ around the targets $-1$ and $+1$.

2.4 Regularized Kernel Fisher Discriminant Analysis. The main concern in Fisher discriminant analysis (Bishop, 1995; Duda & Hart, 1973) is to find a linear discriminant $w$ that yields an optimal discrimination between the two classes $\mathcal{C}_+$ and $\mathcal{C}_-$ depicted in Figure 1. A good discriminant maximizes the distance between the projected class centers and minimizes the overlap between both distributions. Given the estimated class centers $\hat{m}_+ = \sum_{i \in I_+} \varphi(x_i)/N_+$ and $\hat{m}_- = \sum_{i \in I_-} \varphi(x_i)/N_-$, one maximizes the squared distance $(w^T(\hat{m}_+ - \hat{m}_-))^2$ between the projected class centers and minimizes the regularized scatter $s$ around the class centers,

$$s = \sum_{i \in I_+} \left(w^T(\varphi(x_i) - \hat{m}_+)\right)^2 + \sum_{i \in I_-} \left(w^T(\varphi(x_i) - \hat{m}_-)\right)^2 + \gamma^{-1} w^T w, \qquad (2.27)$$

where the regularization term $\gamma^{-1} w^T w$ is introduced so as to avoid overfitting in the high-dimensional feature space. The scatter $s$ is minimized so as to obtain a small overlap between the classes. The feature space expression for the regularized kernel Fisher discriminant is then found by maximizing

$$\max_w \mathcal{J}_{\mathrm{FDA}}(w) = \frac{\left(w^T(\hat{m}_+ - \hat{m}_-)\right)^2}{s} = \frac{w^T(\hat{m}_+ - \hat{m}_-)(\hat{m}_+ - \hat{m}_-)^T w}{w^T S_{W,\gamma} w}, \qquad (2.28)$$
with $S_{W,\gamma} = \sum_{i \in I_+} (\varphi(x_i) - \hat{m}_+)(\varphi(x_i) - \hat{m}_+)^T + \sum_{i \in I_-} (\varphi(x_i) - \hat{m}_-)(\varphi(x_i) - \hat{m}_-)^T + \gamma^{-1} I_{n_f}$. The solution to the generalized Rayleigh quotient, equation 2.28, follows from a generalized eigenvalue problem in the feature space, $(\hat{m}_+ - \hat{m}_-)(\hat{m}_+ - \hat{m}_-)^T w = \lambda S_{W,\gamma} w$, from which one obtains

$$w = S_{W,\gamma}^{-1}(\hat{m}_+ - \hat{m}_-). \qquad (2.29)$$

As the mapping $\varphi$ is typically unknown, practical expressions need to be derived in the dual space, for example, by solving a generalized eigenvalue problem (Baudat & Anouar, 2000; Mika et al., 2001). Also the SVM formulation has been related to Fisher discriminant analysis (Shashua, 1999). The bias term $b$ is not determined by Fisher discriminant analysis. Fisher discriminant analysis is typically used as a first step, which yields the optimal linear discriminant between the two classes. The bias term $b$ has to be determined in a second step so as to obtain an optimal classifier. It can be easily shown in the feature space that the LS-SVM regression formulation, equation 2.14, yields the same discriminant vector $w$. Defining $\Upsilon = [\varphi(x_1), \dots, \varphi(x_N)] \in \mathbb{R}^{n_f \times N}$, the conditions for optimality in the primal space are

$$\begin{bmatrix} \Upsilon \Upsilon^T + \gamma^{-1} I_{n_f} & \Upsilon 1_v \\ 1_v^T \Upsilon^T & N \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} \Upsilon Y \\ 1_v^T Y \end{bmatrix}. \qquad (2.30)$$

From the second condition, we obtain $b = -w^T(N_+ \hat{m}_+ + N_- \hat{m}_-)/N + (N_+ - N_-)/N$. Substituting this into the first condition, one obtains

$$w = 2\,\frac{N_+ N_-}{N} \left(S_{W,\gamma} + \frac{N_+ N_-}{N} (\hat{m}_+ - \hat{m}_-)(\hat{m}_+ - \hat{m}_-)^T\right)^{-1} (\hat{m}_+ - \hat{m}_-),$$

which yields, up to a scaling constant, the same discriminant vector $w$ as equation 2.29, since $(\hat{m}_+ - \hat{m}_-)(\hat{m}_+ - \hat{m}_-)^T w \propto (\hat{m}_+ - \hat{m}_-)$. In the regression formulation, the bias $b$ is determined so as to obtain an optimal least-squares approximation, equation 2.17, for the discriminant function, equation 2.2.

3 Probabilistic Interpretation of the LS-SVM Classifier (Level 1)
A probabilistic framework is related to the LS-SVM classifier. The outline of our approach is similar to the work of Kwok (1999, 2000) for SVMs, but there are significant differences concerning the Bayesian interpretation of the cost function and the algebra involved for the computations in the feature space. First, Bayes' rule is applied in order to obtain the LS-SVM cost function. The moderated output is obtained by marginalizing over $w$ and $b$.

3.1 Inference of the Model Parameters w and b. Given the data points $D = \{(x_i, y_i)\}_{i=1}^N$ and the hyperparameters $\mu$ and $\zeta$ of the model $\mathcal{H}$, the model
parameters w and b are estimated by maximizing the posterior p ( w, b | D, log m , log f , H ). Applying Bayes’ rule at the rst level (Bishop, 1995; MacKay, 1995), we obtain1 p ( w, b | D, log m , log f , H ) D
p (D | w, b, log m , log f , H ) p ( w, b | log m , log f , H ) , p ( D | log m , log f , H )
(3.1)
where the evidence p (D | log m , log f , H ) is a normalizing constant such that the integral over all possible w and b values is equal to 1. We assume a separable gaussian prior, which is independent of the hyperparameter f , that is, p (w, b | log m , log f , H ) D p ( w | log m , H ) p ( b | log sb , H ) , where sb ! 1 to approximate a uniform distribution. By the choice of the regularization term EW in equation 2.15, we obtain for the prior with sb ! 1: Á ! ± m ² nf ± m ² 1 b2 2 T exp ¡ w w q exp ¡ 2 p ( w, b | log m , H ) D 2p 2 2sb 2p sb2
_
± m ² nf ± m ² 2 exp ¡ wT w . 2p 2
(3.2)
To simplify the notation, the step of taking the limit $\sigma_b \to \infty$ is already made in the remainder of this article. The probability $p(\mathcal{D} \mid w, b, \log\mu, \log\zeta, \mathcal{H})$ is assumed to depend only on $w$, $b$, $\zeta$, and $\mathcal{H}$. We assume that the data points are independent, $p(\mathcal{D} \mid w, b, \log\zeta, \mathcal{H}) = \prod_{i=1}^N p(x_i, y_i \mid w, b, \log\zeta, \mathcal{H})$. In order to obtain the least-squares cost function, equation 2.16, from the LS-SVM formulation, it is assumed that the probability of a data point is proportional to

$$p(x_i, y_i \mid w, b, \log\zeta, \mathcal{H}) \propto p(e_i \mid w, b, \log\zeta, \mathcal{H}), \tag{3.3}$$

where the normalizing constant is independent of $w$ and $b$. A gaussian distribution is taken for the errors $e_i = y_i - (w^T \varphi(x_i) + b)$ from equation 2.13:

$$p(e_i \mid w, b, \log\zeta, \mathcal{H}) = \sqrt{\frac{\zeta}{2\pi}} \exp\left(-\frac{\zeta e_i^2}{2}\right). \tag{3.4}$$
¹ The notation $p(\cdot \mid \cdot, \log\mu, \log\zeta, \cdot)$ used here is somewhat different from the notation $p(\cdot \mid \cdot, \mu, \zeta, \cdot)$ used in MacKay (1995). We prefer this notation since $\mu$ and $\zeta$ are (positive) scale parameters (Gull, 1988). By doing so, a uniform notation over the three levels of inference is obtained. The change in notation does not affect the results.

An appealing way to interpret this probability is depicted in Figure 1. It is assumed that $w$ and $b$ are determined in such a way that the class centers $\hat{m}_-$ and $\hat{m}_+$ are mapped onto the targets $-1$ and $+1$, respectively. The projections $w^T \varphi(x) + b$ of the class elements $\varphi(x)$ of the multivariate gaussian distributions are then normally distributed around the corresponding targets with variance $1/\zeta$. One can then write $p(x_i, y_i \mid w, b, \zeta, \mathcal{H}) = p(x_i \mid y_i, w, b, \zeta, \mathcal{H}) P(y_i) = p(e_i \mid w, b, \zeta, \mathcal{H}) P(y_i)$, where the errors $e_i = y_i - (w^T \varphi(x_i) + b)$ are obtained by projecting the feature vector $\varphi(x_i)$ onto the discriminant function $w^T \varphi(x_i) + b$ and comparing them with the target $y_i$. Given the binary targets $y_i \in \{-1, +1\}$, the error $e_i$ is a function of the input $x_i$ in the classifier interpretation. Assuming a multivariate gaussian distribution of the feature vector $\varphi(x_i)$ in the feature space, the errors $e_i$ are also gaussian distributed, as is depicted in Figure 1 (Bishop, 1995; Duda & Hart, 1973). However, the assumptions that $w^T \hat{m}_- + b = -1$ and $w^T \hat{m}_+ + b = +1$ may not always hold and will be relaxed in the next section. By combining equations 3.2 and 3.4 and neglecting all constants, Bayes' rule, equation 3.1, for the first level of inference becomes
$$p(w, b \mid \mathcal{D}, \log\mu, \log\zeta, \mathcal{H}) \propto \exp\left(-\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^N e_i^2\right) = \exp(-J_2(w, b)). \tag{3.5}$$
The maximum a posteriori estimates $w_{MP}$ and $b_{MP}$ are then obtained by minimizing the negative logarithm of equation 3.5. In the dual space, this corresponds to solving the linear set of equations 2.19. The quadratic cost function, equation 2.14, yields a gaussian posterior in $w$ and $b$:

$$p(w, b \mid \mathcal{D}, \mu, \zeta, \mathcal{H}) = \frac{1}{\sqrt{(2\pi)^{n_f + 1} \det Q}} \exp\left(-\frac{1}{2} g^T Q^{-1} g\right), \tag{3.6}$$
with² $g = [w - w_{MP};\, b - b_{MP}]$ and $Q = \mathrm{covar}(w, b) = \mathcal{E}(g g^T)$, taking the expectation over $w$ and $b$. The covariance matrix $Q$ is related to the Hessian $H$ of equation 2.14:

$$Q = H^{-1} = \begin{bmatrix} H_{11} & H_{12} \\ H_{12}^T & H_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \dfrac{\partial^2 J_2}{\partial w^2} & \dfrac{\partial^2 J_2}{\partial w \partial b} \\[1ex] \dfrac{\partial^2 J_2}{\partial b \partial w} & \dfrac{\partial^2 J_2}{\partial b^2} \end{bmatrix}^{-1}. \tag{3.7}$$

When using MLPs, the cost function is typically nonconvex, and the covariance matrix is estimated using a quadratic approximation in the local optimum (MacKay, 1995).

² The Matlab notation $[X; Y]$ is used, where $[X; Y] = [X^T\ Y^T]^T$.
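The first-level computation can be made concrete with a small numerical sketch. The snippet below is not from the paper: it assumes an RBF kernel and the standard LS-SVM dual linear system of the form referenced as equation 2.19, with function names, the toy data, and the parameter values chosen purely for illustration.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma, sigma):
    # Solve the dual KKT system (equation 2.19):
    # [ 0      1^T          ] [b]     [0]
    # [ 1  Omega + I/gamma  ] [alpha] [y]
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]          # alpha, b

def lssvm_latent(X, alpha, b, Xtrain, sigma):
    # Latent output sum_i alpha_i K(x, x_i) + b; the class label is its sign
    return rbf_kernel(X, Xtrain, sigma) @ alpha + b
```

The first KKT row enforces $\sum_i \alpha_i = 0$, which is the dual counterpart of leaving the bias term $b$ unregularized.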
3.2 Class Probabilities for the LS-SVM Classifier. Given the posterior probability, equation 3.6, of the model parameters $w$ and $b$, we will now integrate over all $w$ and $b$ values in order to obtain the posterior class probability $P(y \mid x, \mathcal{D}, \mu, \zeta, \mathcal{H})$. First, it should be remarked that the assumptions $w^T \hat{m}_+ + b = +1$ and $w^T \hat{m}_- + b = -1$ may not always be satisfied. This typically occurs when the training set is unbalanced ($N_+ \neq N_-$). In this case, the discriminant shifts toward the a priori most likely class so as to yield the optimal least-squares approximation (see equation 2.17). Therefore, we will use

$$p(x \mid y = \pm 1, w, b, \log\zeta_\pm, \mathcal{H}) = \sqrt{\frac{\zeta_\pm}{2\pi}} \exp\left(-\frac{\zeta_\pm \left(w^T(\varphi(x) - \hat{m}_\pm)\right)^2}{2}\right) = \sqrt{\frac{\zeta_\pm}{2\pi}} \exp\left(-\frac{\zeta_\pm e_\pm^2}{2}\right) \tag{3.8}$$

with $e_\pm = w^T(\varphi(x) - \hat{m}_\pm)$ by definition, where $\zeta_\pm^{-1}$ is the variance of $e_\pm$. The subscript $\pm$ is used to denote either $+$ or $-$, since analogous expressions are obtained for the classes $\mathcal{C}_+$ and $\mathcal{C}_-$, respectively. In this article, we assume $\zeta_+ = \zeta_- = \zeta_*$.
Since e is a linear combination of the gaussian distributed w, marginalizing over w will yield a gaussian distributed e with mean me and variance se2 . The expression for the mean is O ) D me D wTMP (Q ( x ) ¡ m
N X i D1
O d. ai K ( x, xi ) ¡ m
(3.9)
PN P O d D N1 with m i D 1 ai j2I K ( xi , xj ), while the corresponding expression for the variance is O ]T Q11 [Q ( x ) ¡ m O ]. se2 D [Q ( x ) ¡ m
(3.10)
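As a sanity check on equation 3.9, the following sketch (not from the paper; names are illustrative) evaluates $m_{e\pm}$ from the dual variables and verifies, for an explicit linear kernel where $\varphi(x) = x$, that it equals $w_{MP}^T(x - \hat{m}_\pm)$:

```python
import numpy as np

def predictive_mean(alpha, k_x, K, idx_c):
    # Equation 3.9: m_e = sum_i alpha_i K(x, x_i) - m_hat_d, with
    # m_hat_d = (1/N_c) sum_i sum_{j in I_c} alpha_i K(x_i, x_j)
    m_hat_d = (alpha @ K[:, idx_c]).sum() / len(idx_c)
    return alpha @ k_x - m_hat_d

# Check against the primal form w^T (x - m_hat) for a linear kernel
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
alpha = rng.normal(size=8)
idx_plus = [0, 1, 2, 3]                       # indices of class +
x = rng.normal(size=3)
m_dual = predictive_mean(alpha, X @ x, X @ X.T, idx_plus)
w = X.T @ alpha                               # w_MP = sum_i alpha_i phi(x_i)
m_primal = w @ (x - X[idx_plus].mean(axis=0))
```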
The expression for the upper left $n_f \times n_f$ block $Q_{11}$ of the covariance matrix $Q$ is derived in appendix A. By using matrix algebra and applying the Mercer condition, we obtain

$$\sigma_{e\pm}^2 = \mu^{-1} K(x, x) - 2\mu^{-1} N_\pm^{-1} \sum_{i \in I_\pm} K(x, x_i) + \mu^{-1} N_\pm^{-2}\, 1_v^T \Omega(I_\pm, I_\pm) 1_v - \left[\theta(x)^T - \tfrac{1}{N_\pm} 1_v^T \Omega(I_\pm, :)\right] M U_G \left[\mu^{-1} I_{N_{eff}} - (\mu I_{N_{eff}} + \zeta D_G)^{-1}\right] U_G^T M \left[\theta(x) - \tfrac{1}{N_\pm} \Omega(:, I_\pm) 1_v\right], \tag{3.11}$$

where $1_v$ is a vector of appropriate dimensions with all elements equal to one and where we used the Matlab index notation $X(I_a, I_b)$, which selects the corresponding rows $I_a$ and columns $I_b$ of the matrix $X$. The vector $\theta(x) \in \mathbb{R}^N$ and the matrices $U_G \in \mathbb{R}^{N \times N_{eff}}$ and $D_G \in \mathbb{R}^{N_{eff} \times N_{eff}}$ are defined as follows:

$$\theta_i(x) = K(x, x_i), \quad i = 1, \ldots, N \tag{3.12}$$

$$U_G(:, i) = \lambda_{G,i}^{-1/2} \nu_{G,i}, \quad i = 1, \ldots, N_{eff} \le N - 1 \tag{3.13}$$

$$D_G = \mathrm{diag}([\lambda_{G,1}, \ldots, \lambda_{G,N_{eff}}]), \tag{3.14}$$

where $\nu_{G,i}$ and $\lambda_{G,i}$ are the solutions to the eigenvalue problem (see equation A.4)

$$M \Omega M \nu_{G,i} = \lambda_{G,i} \nu_{G,i}, \quad i = 1, \ldots, N_{eff} \le N - 1, \tag{3.15}$$

with $V_G = [\nu_{G,1}, \ldots, \nu_{G,N_{eff}}] \in \mathbb{R}^{N \times N_{eff}}$. The vector $Y$ and the matrix $\Omega$ are defined in equations 2.20 and 2.21, respectively, while $M \in \mathbb{R}^{N \times N}$ is the idempotent centering matrix $M = I_N - \frac{1}{N} 1_v 1_v^T$ with rank $N - 1$. The number of nonzero eigenvalues is denoted by $N_{eff} < N$. For rather large data sets, one may choose to reduce the computational requirements and approximate the variance $\sigma_{e\pm}^2$ by using only the most significant eigenvalues ($\lambda_{G,i} \gg \mu/\zeta$) in the above expressions. In this case, $N_{eff}$ denotes the number of most significant eigenvalues (see appendix A for details).

The conditional probabilities $p(x \mid y = +1, \mathcal{D}, \log\mu, \log\zeta, \log\zeta_*, \mathcal{H})$ and $p(x \mid y = -1, \mathcal{D}, \log\mu, \log\zeta, \log\zeta_*, \mathcal{H})$ are then equal to

$$p(x \mid y = \pm 1, \mathcal{D}, \log\mu, \log\zeta, \log\zeta_*, \mathcal{H}) = \left(2\pi(\zeta_*^{-1} + \sigma_{e\pm}^2)\right)^{-1/2} \exp\left(-\frac{m_{e\pm}^2}{2(\zeta_*^{-1} + \sigma_{e\pm}^2)}\right) \tag{3.16}$$

with $\pm$ either $+$ or $-$, respectively.
with either C or ¡, respectively. By applying Bayes’ rule, equation 2.1, the following class probabilities of the LS-SVM classier are obtained: P ( y | x, D, log m , log f , log f¤ , H ) D
P ( y ) p (x | y, D, log m , log f , log f¤ , H ) , p (x | D, log m , log f , log f¤ , H )
(3.17)
where the denominator p ( x | D, log m , log f , log f¤ , H ) D P ( y D C 1) p ( x | y D C 1, D, log m , log f , log f¤ , H ) C P ( y D ¡1) p ( x | y D ¡1, D, log m , log f , log f¤ , H ) follows from normalization. Substituting expression 3.16 for D C and D ¡ into expression 3.17, a quadratic expression is obtained since
1130
T. Van Gestel, et al.
6 f¤¡1 C s 2 . When s 2 ’ s 2 , one can dene s 2 D f¤¡1 C se2¡ D eC e¡ eC e
one obtains the linear discriminant function " N X O d¡ mO d C C m ai K ( x, xi ) ¡ y ( x ) D sign 2 i D1 C
f ¡1 C se2
O d C ¡ mO d¡ m
log
¶ P ( y D C 1) . P ( y D ¡1)
q
se2C se2¡ , and
(3.18)
The second and third terms in equation 3.18 correspond to the bias term $b$ in the LS-SVM classifier formulation, equation 2.22, where the bias term was determined to obtain an optimal least-squares approximation to the Bayes discriminant. The decision rules, equations 3.17 and 3.18, allow taking prior class probabilities into account in a more elegant way. This also allows adjusting the bias term for classification problems with different prior class probabilities in the training and test set. Due to the marginalization over $w$, the bias term correction is also a function of the input $x$, since $\sigma_e^2$ is a function of $x$.

The idea of a (static) bias term correction has also been applied in Evgeniou, Pontil, Papageorgiou, and Poggio (2000) in order to improve the validation set performance. In Mukherjee et al. (1999), the probabilities $p(e \mid y, w_{MP}, b_{MP}, \mu, \zeta, \mathcal{H})$ were estimated using leave-one-out cross-validation given the obtained SVM classifier, and the corresponding classifier decision was made in a similar way as in equation 3.18. A simple density estimation algorithm was used, and no gaussian assumptions were made, while no marginalization over the model parameters was performed. A bias term correction was also applied in the softmax interpretation for the SVM output (Platt, 1999) using a validation set.

Given the asymptotically optimal least-squares approximation, one can approximate the class probabilities $P(y = +1 \mid x, w, b, \mathcal{D}, \log\mu, \log\zeta, \mathcal{H}) = (1 + g_0(x))/2$, replacing $g_0(x)$ by $w_{MP}^T \varphi(x) + b_{MP}$ for the LS-SVM formulation. However, such an approach does not yield true probabilities that are between 0 and 1 and sum to 1. Using a softmax function (Bishop, 1995; MacKay, 1992, 1995), one obtains $P(y = +1 \mid x, w, b, \mathcal{D}, \log\mu, \log\zeta, \mathcal{H}) = (1 + \exp(-(w^T \varphi(x) + b)))^{-1}$ and $P(y = -1 \mid x, w, b, \mathcal{D}, \log\mu, \log\zeta, \mathcal{H}) = (1 + \exp(+(w^T \varphi(x) + b)))^{-1}$.
In order to marginalize over the model parameters in the logistic functions, one can use the approximate expressions of MacKay (1992, 1995) in combination with the expression for the moderated output of the LS-SVM regressor derived in Van Gestel et al. (2001). In the softmax interpretation for SVMs (Platt, 1999), no marginalization over the model parameters is applied, and the bias term is determined on a validation set. Finally, because of the equivalence between classification costs and prior probabilities (Duda & Hart, 1973), the results for the moderated output of the LS-SVM classifier can be extended in a straightforward way in order to take different classification costs into account.
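The class probability computation of equations 3.16 and 3.17 reduces to comparing two one-dimensional gaussian likelihoods. A minimal sketch (not from the paper; the moderated means and variances are assumed to be given, and the function name is illustrative):

```python
import numpy as np

def class_posterior(m_e_pos, s2_pos, m_e_neg, s2_neg, inv_zeta_star,
                    prior_pos=0.5):
    # Equation 3.16: p(x | y = +/-1) is gaussian in m_e with total
    # variance 1/zeta* + sigma_e^2
    var_pos = inv_zeta_star + s2_pos
    var_neg = inv_zeta_star + s2_neg
    lik_pos = np.exp(-m_e_pos**2 / (2*var_pos)) / np.sqrt(2*np.pi*var_pos)
    lik_neg = np.exp(-m_e_neg**2 / (2*var_neg)) / np.sqrt(2*np.pi*var_neg)
    # Bayes' rule, equation 3.17: combine likelihoods with class priors
    num = prior_pos * lik_pos
    return num / (num + (1 - prior_pos) * lik_neg)
```

Changing `prior_pos` is exactly the prior-class-probability adjustment discussed above: with equal likelihoods, the posterior simply reproduces the prior.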
For large-scale data sets, the computation of the eigenvalue decomposition (see equation 3.15) may require long computations, and one may choose to compute only the largest eigenvalues and corresponding eigenvectors using an expectation-maximization approach (Rosipal & Girolami, 2001). This will result in an increased variance, as explained in appendix A. An alternative approach is to use the "cheap and cheerful" approach described in MacKay (1995).

4 Inference of the Hyperparameters μ and ζ (Level 2)
Bayes' rule is applied on the second level of inference to infer the most likely $\mu_{MP}$ and $\zeta_{MP}$ values from the given data $\mathcal{D}$. The differences with the expressions obtained in MacKay (1995) are due to the fact that no regularization is applied on the bias term $b$ and that all practical expressions are obtained in the dual space by applying the Mercer condition. Up to a centering, these expressions are similar to the expressions obtained with GP for regression. By combination of the conditions for optimality, the minimization problem in $\mu$ and $\zeta$ is reformulated into a scalar minimization problem in $\gamma = \zeta/\mu$.

4.1 Inference of μ and ζ. In the second level of inference, Bayes' rule is applied to infer the most likely $\mu$ and $\zeta$ values from the data:
$$p(\log\mu, \log\zeta \mid \mathcal{D}, \mathcal{H}) = \frac{p(\mathcal{D} \mid \log\mu, \log\zeta, \mathcal{H})\, p(\log\mu, \log\zeta \mid \mathcal{H})}{p(\mathcal{D} \mid \mathcal{H})} \propto p(\mathcal{D} \mid \log\mu, \log\zeta, \mathcal{H}). \tag{4.1}$$
Because the hyperparameters $\mu$ and $\zeta$ are scale parameters (Gull, 1988), we take a uniform distribution in $\log\mu$ and $\log\zeta$ for the prior $p(\log\mu, \log\zeta \mid \mathcal{H}) = p(\log\mu \mid \mathcal{H})\, p(\log\zeta \mid \mathcal{H})$ in equation 4.1. The evidence $p(\mathcal{D} \mid \mathcal{H})$ is again a normalizing constant, which will be needed in level 3. The probability $p(\mathcal{D} \mid \log\mu, \log\zeta, \mathcal{H})$ is equal to the evidence in equation 3.1 of the previous level. Substituting equations 3.2, 3.4, and 3.6 into 4.1, we obtain:

$$p(\log\mu, \log\zeta \mid \mathcal{D}, \mathcal{H}) \propto \frac{\sqrt{\mu^{n_f} \zeta^N} \exp(-J_2(w, b))}{\sqrt{\det H} \exp(-\frac{1}{2} g^T H g)} \propto \frac{\sqrt{\mu^{n_f} \zeta^N}}{\sqrt{\det H}} \exp(-J_2(w_{MP}, b_{MP})),$$

where $J_2(w, b) = J_2(w_{MP}, b_{MP}) + \frac{1}{2} g^T H g$ with $g = [w - w_{MP};\, b - b_{MP}]$. The expression for $\det H$ is given in appendix B and is equal to $\det H = N \mu^{n_f - N_{eff}} \zeta \prod_{i=1}^{N_{eff}} (\mu + \zeta\lambda_{G,i})$, where the $N_{eff}$ eigenvalues $\lambda_{G,i}$ are the nonzero eigenvalues of $M\Omega M$. Taking the negative logarithm of $p(\log\mu, \log\zeta \mid \mathcal{D}, \mathcal{H})$, the optimal parameters $\mu_{MP}$ and $\zeta_{MP}$ are found as the solution to the minimization problem:

$$\min_{\mu, \zeta} J_3(\mu, \zeta) = \mu E_W(w_{MP}) + \zeta E_D(w_{MP}, b_{MP}) + \frac{1}{2} \sum_{i=1}^{N_{eff}} \log(\mu + \zeta\lambda_{G,i}) - \frac{N_{eff}}{2} \log\mu - \frac{N - 1}{2} \log\zeta. \tag{4.2}$$
In appendix B it is also shown that the level 1 cost function evaluated in $w_{MP}$ and $b_{MP}$ can be written as $\mu E_W(w_{MP}) + \zeta E_D(w_{MP}, b_{MP}) = \frac{1}{2} Y^T M (\mu^{-1} M\Omega M + \zeta^{-1} I_N)^{-1} M Y$. The cost function $J_3$ from equation 4.2 can be written as

$$\min_{\mu, \zeta} J_3(\mu, \zeta) = \frac{1}{2} Y^T M \left(\frac{1}{\mu} M\Omega M + \frac{1}{\zeta} I_N\right)^{-1} M Y + \frac{1}{2} \log\det\left(\frac{1}{\mu} M\Omega M + \frac{1}{\zeta} I_N\right) - \frac{1}{2} \log\frac{1}{\zeta}, \tag{4.3}$$
where the last term is due to the extra bias term $b$ in the LS-SVM formulation. Neglecting the centering matrix $M$, the first two terms in equation 4.3 correspond to the level 2 cost function used in GP (Gibbs, 1997; Rasmussen, 1996; Williams, 1998). Hence, the use of the unregularized bias term $b$ in the SVM and LS-SVM formulation results in a centering matrix $M$ in the obtained expressions compared to GP. The eigenvalues $\lambda_{G,i}$ of the centered Gram matrix are also used in kernel PCA (Schölkopf et al., 1998) and can also be used to infer improved error bounds for SVM classifiers (Schölkopf et al., 1999). In the Bayesian framework, the capacity is controlled by the prior. The effective number of parameters (Bishop, 1995; MacKay, 1995) is equal to $\gamma_{eff} = \sum_i \lambda_{i,u}/\lambda_{i,r}$, where $\lambda_{i,u}$ and $\lambda_{i,r}$ are the eigenvalues of the Hessians of the unregularized cost function ($J_u = \zeta E_D$) and the regularized cost function ($J_r = \mu E_W + \zeta E_D$), respectively. For the LS-SVM, the effective number of parameters is equal to
$$\gamma_{eff} = 1 + \sum_{i=1}^{N_{eff}} \frac{\zeta_{MP} \lambda_{G,i}}{\mu_{MP} + \zeta_{MP} \lambda_{G,i}} = 1 + \sum_{i=1}^{N_{eff}} \frac{\gamma_{MP} \lambda_{G,i}}{1 + \gamma_{MP} \lambda_{G,i}}, \tag{4.4}$$

with $\gamma = \zeta/\mu$. The term $+1$ is obtained because no regularization on the bias term $b$ is applied. Notice that since $N_{eff} \le N - 1$, the effective number of parameters $\gamma_{eff}$ can never exceed the number of given training data points, $\gamma_{eff} \le N$, although we may choose a kernel function $K$ with possibly $n_f \to \infty$ degrees of freedom in the feature space.
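Equation 4.4 is straightforward to evaluate once the eigenvalues of $M\Omega M$ are available. A small sketch (not from the paper; the function name is illustrative):

```python
import numpy as np

def effective_parameters(eigvals, gamma):
    # Equation 4.4: gamma_eff = 1 + sum_i gamma*lam_i / (1 + gamma*lam_i),
    # where lam_i are the nonzero eigenvalues of the centered Gram matrix
    lam = np.asarray(eigvals, dtype=float)
    return 1.0 + np.sum(gamma * lam / (1.0 + gamma * lam))
```

Each eigendirection contributes a value between 0 and 1, so $1 \le \gamma_{eff} \le 1 + N_{eff} \le N$, matching the bound stated above.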
The gradient of the cost function $J_3(\mu, \zeta)$ is (MacKay, 1992):

$$\frac{\partial J_3}{\partial\mu} = E_W(w_{MP}) + \frac{1}{2} \sum_{i=1}^{N_{eff}} \frac{1}{\mu + \zeta\lambda_{G,i}} - \frac{N_{eff}}{2\mu} \tag{4.5}$$

$$\frac{\partial J_3}{\partial\zeta} = E_D(w_{MP}, b_{MP}) + \frac{1}{2} \sum_{i=1}^{N_{eff}} \frac{\lambda_{G,i}}{\mu + \zeta\lambda_{G,i}} - \frac{N - 1}{2\zeta}. \tag{4.6}$$
Putting the partial derivatives 4.5 and 4.6 equal to zero, we obtain the following relations in the optimum of the level 2 cost function: $2\mu_{MP} E_W(w_{MP}) = \gamma_{eff} - 1$ and $2\zeta_{MP} E_D(w_{MP}, b_{MP}) = N - \gamma_{eff}$. The last equality can be viewed as the Bayesian estimate $\zeta_{MP}^{-1} = \sum_{i=1}^N e_i^2 / (N - \gamma_{eff})$ of the variance of the noise $e_i$. While this yields an implicit expression for the optimal $\zeta_{MP}$ for the regression formulation, this may not be equal to the variance $\zeta_*$, since the targets $\pm 1$ do not necessarily correspond to the projected class centers. Therefore, we will use the estimate $\zeta_*^{-1} = (N - \gamma_{eff})^{-1} \left(\sum_{i \in I_+} e_{i,+}^2 + \sum_{i \in I_-} e_{i,-}^2\right)$ in the remainder of this article. Combining both relations, we obtain that for the optimal $\mu_{MP}$, $\zeta_{MP}$, and $\gamma_{MP} = \zeta_{MP}/\mu_{MP}$:

$$2\mu_{MP} \left[E_W(w_{MP}) + \gamma_{MP} E_D(w_{MP}, b_{MP})\right] = N - 1. \tag{4.7}$$
4.2 A Scalar Optimization Problem in γ = ζ/μ. We reformulate the optimization problem, equation 4.2, in $\mu$ and $\zeta$ into a scalar optimization problem in $\gamma = \zeta/\mu$. Therefore, we first replace that optimization problem by an optimization problem in $\mu$ and $\gamma$. We use the fact that $E_W(w_{MP})$ and $E_D(w_{MP}, b_{MP})$ in the optimum of equation 2.14 depend only on $\gamma$. Since equation 4.7 also holds in the optimum, the search for the optimum can be restricted to this curve in the $(\mu, \gamma)$ space. By elimination of $\mu$ from equation 4.7, the following minimization problem is obtained in a straightforward way:

$$\min_\gamma J_4(\gamma) = \sum_{i=1}^{N-1} \log\left(\lambda_{G,i} + \frac{1}{\gamma}\right) + (N - 1) \log\left[E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP})\right] \tag{4.8}$$

with $\lambda_{G,i} = 0$ for $i > N_{eff}$. The derivative $\partial J_4/\partial\gamma$ is obtained in a similar way as $\partial J_3/\partial\mu$:

$$\frac{\partial J_4}{\partial\gamma} = -\sum_{i=1}^{N-1} \frac{1}{\gamma + \lambda_{G,i}\gamma^2} + (N - 1) \frac{E_D(w_{MP}, b_{MP})}{E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP})}. \tag{4.9}$$
Due to the second logarithmic term, this cost function is not convex, and it is useful to start from different initial values for $\gamma$. The condition for optimality ($\partial J_4/\partial\gamma = 0$) is

$$\gamma_{MP} = \frac{N - \gamma_{eff}}{\gamma_{eff} - 1} \frac{E_W(w_{MP})}{E_D(w_{MP}, b_{MP})}. \tag{4.10}$$
We also need the expressions for $E_D$ and $E_W$ in equations 4.8 and 4.9. It is explained in appendix B that these terms can be expressed in terms of the output vector $Y$ and the eigenvalue decomposition of the centered kernel matrix $M\Omega M$:

$$E_D(w_{MP}, b_{MP}) = \frac{1}{2\gamma^2} Y^T M V_G (D_G + \gamma^{-1} I_{N_{eff}})^{-2} V_G^T M Y \tag{4.11}$$

$$E_W(w_{MP}) = \frac{1}{2} Y^T M V_G D_G (D_G + \gamma^{-1} I_{N_{eff}})^{-2} V_G^T M Y \tag{4.12}$$

$$E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP}) = \frac{1}{2} Y^T M V_G (D_G + \gamma^{-1} I_{N_{eff}})^{-1} V_G^T M Y. \tag{4.13}$$
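The eigenvalue-based expressions 4.11 and 4.12 can be checked numerically. The sketch below is not from the paper: a random positive semidefinite matrix stands in for a kernel matrix, and the names are illustrative. It computes $E_D$ and $E_W$ from the eigendecomposition of $M\Omega M$ and verifies the identity of equation 4.13 against a direct matrix inverse.

```python
import numpy as np

def level2_terms(Y, Omega, gamma):
    # Eigendecomposition of the centered kernel matrix M*Omega*M (eq. 3.15)
    N = len(Y)
    M = np.eye(N) - np.ones((N, N)) / N
    lam, V = np.linalg.eigh(M @ Omega @ M)
    keep = lam > 1e-10                     # keep the N_eff nonzero eigenvalues
    lam, V = lam[keep], V[:, keep]
    c = V.T @ (M @ Y)                      # V_G^T M Y
    E_D = 0.5 / gamma**2 * np.sum(c**2 / (lam + 1.0/gamma)**2)   # eq. 4.11
    E_W = 0.5 * np.sum(lam * c**2 / (lam + 1.0/gamma)**2)        # eq. 4.12
    return E_D, E_W

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
Omega = A @ A.T                            # random PSD stand-in for a Gram matrix
Y = np.array([1., -1., 1., -1., 1., -1.])
gamma = 2.0
E_D, E_W = level2_terms(Y, Omega, gamma)
```

Once the eigendecomposition is available, every evaluation of $J_4(\gamma)$ reduces to these cheap vector operations.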
When the eigenvalue decomposition, equation 3.15, is calculated, the optimization, equation 4.8, involves only vector products that can be evaluated very quickly. Although the eigenvalues $\lambda_{G,i}$ have to be calculated only once, their calculation in the eigenvalue problem, equation 3.15, becomes computationally expensive for large data sets. In this case, one can choose to calculate only the largest eigenvalues in equation 3.15 using an expectation-maximization approach (Rosipal & Girolami, 2001), while the linear system, equation 2.19, can be solved using the Hestenes-Stiefel conjugate gradient algorithm (Suykens, 2000). The obtained $\alpha$ and $b$ can also be used to derive the alternative expressions $E_D = \frac{1}{2\gamma^2} \sum_{i=1}^N \alpha_i^2$ and $E_W = \frac{1}{2} \sum_{i=1}^N \alpha_i \left(y_i - \frac{\alpha_i}{\gamma} - b_{MP}\right)$ instead of using equations 4.11 and 4.12.

5 Bayesian Model Comparison (Level 3)
After determination of the hyperparameters $\mu_{MP}$ and $\zeta_{MP}$ on the second level of inference, we still have to select a suitable model $\mathcal{H}$. For SVMs, different models correspond to different kernel functions $K$, for example, a linear kernel or an RBF kernel with tuning parameter $\sigma$. We describe how to rank different models $\mathcal{H}_j$ ($j = 1, 2, \ldots$, corresponding to, e.g., RBF kernels with different tuning parameters $\sigma_j$) in the Bayesian evidence framework (MacKay, 1999). By applying Bayes' rule on the third level, we obtain the posterior for the model $\mathcal{H}_j$:

$$p(\mathcal{H}_j \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{H}_j)\, p(\mathcal{H}_j). \tag{5.1}$$
At this level, no evidence or normalizing constant is used since it is computationally infeasible to compare all possible models $\mathcal{H}_j$. The prior $p(\mathcal{H}_j)$ over all possible models is assumed to be uniform here. Hence, equation 5.1 becomes $p(\mathcal{H}_j \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{H}_j)$. The likelihood $p(\mathcal{D} \mid \mathcal{H}_j)$ corresponds to the evidence (see equation 4.1) of the previous level. A separable gaussian prior $p(\log\mu_{MP}, \log\zeta_{MP} \mid \mathcal{H}_j)$ with error bars $\sigma_{\log\mu}$ and $\sigma_{\log\zeta}$ is assumed for all models $\mathcal{H}_j$. To estimate the posterior analytically, it is assumed (MacKay, 1999) that the evidence $p(\log\mu, \log\zeta \mid \mathcal{D}, \mathcal{H}_j)$ can be very well approximated by using a separable gaussian with error bars $\sigma_{\log\mu|\mathcal{D}}$ and $\sigma_{\log\zeta|\mathcal{D}}$. As in section 4, the posterior $p(\mathcal{D} \mid \mathcal{H}_j)$ then becomes (MacKay, 1995, 1999)

$$p(\mathcal{D} \mid \mathcal{H}_j) \propto p(\mathcal{D} \mid \log\mu_{MP}, \log\zeta_{MP}, \mathcal{H}_j)\, \frac{\sigma_{\log\mu|\mathcal{D}}\, \sigma_{\log\zeta|\mathcal{D}}}{\sigma_{\log\mu}\, \sigma_{\log\zeta}}. \tag{5.2}$$
Ranking of models according to model quality $p(\mathcal{D} \mid \mathcal{H}_j)$ is thus based on the goodness of fit $p(\mathcal{D} \mid \log\mu_{MP}, \log\zeta_{MP}, \mathcal{H}_j)$ from the previous level and the Occam factor $\frac{\sigma_{\log\mu|\mathcal{D}}\, \sigma_{\log\zeta|\mathcal{D}}}{\sigma_{\log\mu}\, \sigma_{\log\zeta}}$ (Gull, 1988; MacKay, 1995, 1999). Following a similar reasoning as in MacKay (1999), the error bars $\sigma_{\log\mu|\mathcal{D}}$ and $\sigma_{\log\zeta|\mathcal{D}}$ can be approximated as follows: $\sigma_{\log\mu|\mathcal{D}}^2 \simeq \frac{2}{\gamma_{eff} - 1}$ and $\sigma_{\log\zeta|\mathcal{D}}^2 \simeq \frac{2}{N - \gamma_{eff}}$. Using equations 4.1 and 4.7 in 5.2 and neglecting all constants yields

$$p(\mathcal{D} \mid \mathcal{H}_j) \propto \sqrt{\frac{\mu_{MP}^{N_{eff}}\, \zeta_{MP}^{N-1}}{(\gamma_{eff} - 1)(N - \gamma_{eff}) \prod_{i=1}^{N_{eff}} (\mu_{MP} + \zeta_{MP}\lambda_{G,i})}}. \tag{5.3}$$
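Equation 5.3 is conveniently evaluated in the log domain. A hedged sketch (the function name is illustrative; the inputs are assumed to come from the level 2 optimization):

```python
import numpy as np

def log_model_evidence(mu, zeta, eigvals, N):
    # log p(D | H_j) up to an additive constant, from equation 5.3
    lam = np.asarray(eigvals, dtype=float)
    n_eff = len(lam)
    gamma_eff = 1.0 + np.sum(zeta * lam / (mu + zeta * lam))   # equation 4.4
    return 0.5 * (n_eff * np.log(mu) + (N - 1) * np.log(zeta)
                  - np.log(gamma_eff - 1.0) - np.log(N - gamma_eff)
                  - np.sum(np.log(mu + zeta * lam)))
```

Because only differences of log evidence matter for ranking models, the neglected constants are irrelevant.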
6 Design and Application of the LS-SVM Classifier

In this section, the theory of the previous sections is used to design the LS-SVM classifier in the Bayesian evidence framework. The obtained classifier is then used to assign class labels and class probabilities to new inputs $x$ by using the probabilistic interpretation of the LS-SVM classifier.

6.1 Design of the LS-SVM Classifier in the Evidence Framework. The design of the LS-SVM classifier consists of the following steps:

1. The inputs are normalized to zero mean and unit variance (Bishop, 1995). The normalized training data are denoted by $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, with $x_i$ the normalized inputs and $y_i \in \{-1, 1\}$ the corresponding class labels.

2. Select the model $\mathcal{H}_j$ by choosing a kernel type $K_j$ (possibly with a kernel parameter, e.g., $\sigma_j$ for an RBF kernel). For this model $\mathcal{H}_j$, the optimal hyperparameters $\mu_{MP}$ and $\zeta_{MP}$ are estimated on the second level of inference. This is done as follows: (a) Estimate the $N_{eff}$ important eigenvalues (and eigenvectors) of the eigenvalue problem, equation 3.15, to obtain $D_G$ (and $V_G$). (b) Solve the scalar optimization problem, equation 4.8, in $\gamma = \zeta/\mu$ with cost function 4.8 and gradient 4.9. (c) Use the optimal $\gamma_{MP}$ to calculate $\mu_{MP}$ from equation 4.7, while $\zeta_{MP} = \mu_{MP}\gamma_{MP}$. Calculate the effective number of parameters $\gamma_{eff}$ from equation 4.4.

3. Calculate the model evidence $p(\mathcal{D} \mid \mathcal{H}_j)$ from equation 5.3.

4. For a kernel $K_j$ with tuning parameters, refine the tuning parameters. For example, for the RBF kernel with tuning parameter $\sigma_j$, refine $\sigma_j$ such that a higher model evidence $p(\mathcal{D} \mid \mathcal{H}_j)$ is obtained. This can be done by maximizing the model evidence with respect to $\sigma_j$, evaluating the model evidence for the refined kernel parameter starting from step 2a.

5. Select the model $\mathcal{H}_j$ with maximal model evidence $p(\mathcal{D} \mid \mathcal{H}_j)$. Go to step 2, unless the best model has been selected.

For a kernel function without tuning parameters, like the linear kernel or a polynomial kernel with (already) fixed degree $d$, steps 2 and 4 are trivial, since no tuning parameter of the kernel has to be chosen in step 2 and no refining of the tuning parameter is needed in step 4. The model evidence obtained at step 4 can then be used to rank the different kernel types and select the most appropriate kernel function.

6.2 Decision Making with the LS-SVM Classifier. The designed LS-SVM classifier $\mathcal{H}_j$ is now used to calculate class probabilities. By combination of these class probabilities with Bayesian decision theory (Duda & Hart, 1973), class labels are assigned in an optimal way. The classification is done in the following steps:
1. Normalize the input in the same way as the training data $\mathcal{D}$. The normalized input vector is denoted by $x$.

2. Assuming that the parameters $\alpha$, $b_{MP}$, $\mu_{MP}$, $\zeta_{MP}$, $\gamma_{MP}$, $\gamma_{eff}$, $D_G$, $U_G$, $N_{eff}$ are available from the design of the model $\mathcal{H}_j$, one can calculate $m_{e+}$, $m_{e-}$, $\sigma_{e+}^2$, and $\sigma_{e-}^2$ from equations 3.9 and 3.11, respectively. Compute $\zeta_*$ from $\zeta_*^{-1} = (N - \gamma_{eff})^{-1} \left(\sum_{i \in I_+} e_{i,+}^2 + \sum_{i \in I_-} e_{i,-}^2\right)$.

3. Calculate $p(x \mid y = +1, \mathcal{D}, \log\mu_{MP}, \log\zeta_{MP}, \log\zeta_*, \mathcal{H})$ and $p(x \mid y = -1, \mathcal{D}, \log\mu_{MP}, \log\zeta_{MP}, \log\zeta_*, \mathcal{H})$ from equation 3.16.

4. Calculate $P(y \mid x, \mathcal{D}, \mathcal{H}_j)$ from equation 3.17 using the prior class probabilities $P(y = +1)$ and $P(y = -1)$. When these prior class probabilities are not available, compute $P(y = +1) = N_+/N$ and $P(y = -1) = N_-/N$.

5. Assign the class label to the class with maximal posterior $P(y \mid x, \mathcal{D}, \mathcal{H}_j)$.
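The design procedure of section 6.1 can be sketched end to end for an RBF kernel. This is a hedged, illustrative implementation, not the authors' code: it uses a coarse grid search over $\log\gamma$ in place of a proper scalar minimizer, a hard eigenvalue cutoff in place of the appendix A treatment, and made-up function and variable names throughout.

```python
import numpy as np

def design_lssvm_rbf(X, Y, sigmas):
    """Evidence-based model selection (section 6.1, steps 2-5), sketched."""
    N = len(Y)
    M = np.eye(N) - np.ones((N, N)) / N
    best = None
    for sigma in sigmas:
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        Omega = np.exp(-d2 / (2 * sigma ** 2))
        lam, V = np.linalg.eigh(M @ Omega @ M)     # step 2a, equation 3.15
        keep = lam > 1e-10
        lam, V = lam[keep], V[:, keep]
        c2 = (V.T @ (M @ Y)) ** 2
        lam_full = np.concatenate([lam, np.zeros(N - 1 - len(lam))])

        def J4(g):                                 # step 2b, equation 4.8
            E_D = 0.5 / g**2 * np.sum(c2 / (lam + 1/g)**2)
            E_W = 0.5 * np.sum(lam * c2 / (lam + 1/g)**2)
            return np.sum(np.log(lam_full + 1/g)) + (N - 1)*np.log(E_W + g*E_D)

        grid = np.logspace(-4, 4, 161)             # coarse search over gamma
        g = grid[np.argmin([J4(g) for g in grid])]
        E_D = 0.5 / g**2 * np.sum(c2 / (lam + 1/g)**2)
        E_W = 0.5 * np.sum(lam * c2 / (lam + 1/g)**2)
        mu = (N - 1) / (2 * (E_W + g * E_D))       # step 2c, equation 4.7
        zeta = mu * g
        gam_eff = 1 + np.sum(zeta*lam / (mu + zeta*lam))      # equation 4.4
        log_ev = 0.5 * (len(lam)*np.log(mu) + (N - 1)*np.log(zeta)
                        - np.log(gam_eff - 1) - np.log(N - gam_eff)
                        - np.sum(np.log(mu + zeta*lam)))      # equation 5.3
        if best is None or log_ev > best["log_ev"]:           # steps 3-5
            best = dict(log_ev=log_ev, sigma=sigma, mu=mu, zeta=zeta)
    return best
```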
Figure 2: Contour plot of the posterior class probability $P(y = +1 \mid x, \mathcal{D}, \mathcal{H})$ for the rsy data set. The training data are marked with $+$ and $\times$ for class $y = +1$ and $y = -1$, respectively.
7 Examples
The synthetic binary classification benchmark data set from Ripley (1996) is used to illustrate the theory of this article. Randomized test set performances of the Bayesian LS-SVM are reported on 10 binary classification data sets.

7.1 Design of the Bayesian LS-SVM: A Case Study. We illustrate the design of the LS-SVM classifier within the evidence framework on the synthetic data set (rsy) from Ripley (1996). The data set consists of a training and test set of $N = 250$ and $N_{test} = 1000$ data points, respectively. There are two inputs ($n = 2$), and each class is an equal mixture of two normal distributions with the same covariance matrices. Both classes have the same prior probability $P(y = +1) = P(y = -1) = 1/2$. The training data are visualized in Figure 2, with class $+1$ and class $-1$ depicted by $+$ and $\times$, respectively. In a first step, both inputs $x^{(1)}$ and $x^{(2)}$ were normalized to zero mean and unit variance (Bishop, 1995). For the kernel function $K$ of the model $\mathcal{H}$, an RBF kernel with parameter $\sigma$ was chosen. Assuming a flat prior on the value of $\log\sigma$, the optimal $\sigma_{MP}$ was selected by maximizing $p(\mathcal{D} \mid \mathcal{H}_j) = p(\mathcal{D} \mid \log\sigma_j)$, given by equation 5.3. The maximum likelihood is obtained for $\sigma_{MP} = 1.3110$. This yields a test set performance of 90.6% for both LS-SVM classifiers. We also trained a gaussian process for classification with the Flexible Bayesian Modeling toolbox (Neal, 1997). A logistic model with constant term and RBF kernel in the covariance function yielded an average test set performance of 89.9%, which is not significantly different from the LS-SVM result given the standard deviation from Table 2. This table is discussed further in the next section. In the logistic model, the parameters are directly optimized with respect to the output probability using sampling techniques. The LS-SVM classifier formulation assumes a gaussian distribution on the errors between the projected feature vectors and the targets (or class centers), which allows deriving analytic expressions on the three levels of inference.

Figure 3: The posterior class probability $P(y = +1 \mid x, \mathcal{D}, \mathcal{H})$ as a function of the inputs $x^{(1)}$ and $x^{(2)}$ for the rsy data set.

The evolution of the posterior class probabilities $P(y = +1 \mid x, \mathcal{D}, \mathcal{H})$ is plotted in Figure 3 for $x^{(1)} \in [-1.5, 1.5]$ and $x^{(2)} \in [-0.5, 1.5]$. The corresponding contour plot is given in Figure 2, together with the location of the training points. Notice how the uncertainty on the class labels increases as the new input $x$ is farther away from the training data: the value $z_{MP}$ decreases while the variance $\sigma_{z,t}^2$ increases when moving away from the training data. We also intentionally unbalanced the test set by defining a new test set from the original set: the negatively and positively labeled instances of
the original set are repeated three times and once in the new set, respectively. This corresponds to prior class probabilities $P(y = -1) = 0.75$ and $P(y = +1) = 0.25$. Not taking these class probabilities into account, a test set accuracy of 90.9% is obtained, while one achieves a classification performance of 92.5% when the prior class probabilities are taken into account.

7.2 Empirical Results on Binary Classification Data Sets. The test set classification performance of the Bayesian (Bay) LS-SVM classifier with RBF kernel was assessed on 10 publicly available binary classification data sets. We compare the results with LS-SVM and SVM classification and GP regression (LS-SVM without bias term) where the hyperparameter and kernel parameter are tuned by 10-fold cross-validation (CV10). The BUPA Liver Disorders (bld), the Statlog German Credit (gcr), the Statlog Heart Disease (hea), the Johns Hopkins University Ionosphere (ion), the Pima Indians Diabetes (pid), the Sonar (snr), and the Wisconsin Breast Cancer (wbc) data sets were retrieved from the UCI benchmark repository (Blake & Merz, 1998). The synthetic data set (rsy) and the Leptograpsus crabs (cra) data set are described in Ripley (1996). The Titanic data (tit) was obtained from Delve. Each data set was split up into a training set (2/3) and a test set (1/3), except for the rsy data set, where we used $N = 250$ and $N_{test} = 1000$. Each data set was randomized 10 times in order to reduce possible sensitivities in the test set performances to the choice of training and test set. For each randomization, the design procedure from section 6.1 was used to estimate $\mu$ and $\zeta$ from the training data for the Bayesian LS-SVM, while selecting $\sigma$ from a candidate set $S = [\sigma_1, \sigma_2, \ldots, \sigma_j, \ldots, \sigma_{N_s}]$ using model comparison. The classification decision (LS-SVM BayM) is made by the Bayesian decision rule, equation 3.17, using the moderated output and is compared with the classifier, equation 2.22, which is denoted by (LS-SVM Bay).
A 10-fold cross-validation (LS-SVM, SVM, and GP CV10) procedure was used to select the parameters³ $\gamma$ or $C$ and $\sigma$ yielding the best CV10 performance from the set $S$ and an additional candidate set $[\gamma_1, \gamma_2, \ldots, \gamma_{N_\gamma}]$. The same sets were used for each algorithm. In a second step, more refined sets were defined for each algorithm⁴ in order to select the optimal parameters. The classification decisions were obtained from equation 2.4 with the corresponding $w$ and $b$ determined in the dual space for each algorithm. We also designed the GP regressor within the evidence framework for a GP with RBF kernel (GP Bay) and for a GP with RBF kernel and an additional bias term $b$ in the kernel function (GPb Bay). In Table 1, we report the average test set performance and sample standard deviation on ten randomizations for each data set (De Groot, 1986).

³ Notice that the parameter $C$ of the SVM plays a similar role as the parameter $\gamma$ of the LS-SVM.
⁴ We used the Matlab SVM toolbox (Cawley, 2000), while the GP CV10 results were obtained from the linear system, equation 2.23.
Table 1: Comparison of the 10 Times Randomized Test Set Performances of LS-SVMs, GPs, and SVMs.

        n     N  Ntest  Ntot  LS-SVM      LS-SVM      LS-SVM      SVM         GP          GPb         GP
                              (BayM)      (Bay)       (CV10)      (CV10)      (Bay)       (Bay)       (CV10)
 bld    6   230   115    345  69.4 (2.9)  69.4 (3.1)  69.4 (3.4)  69.2 (3.5)  69.2 (2.7)  68.9 (3.3)  69.7 (4.0)
 cra    6   133    67    200  96.7 (1.5)  96.7 (1.5)  96.9 (1.6)  95.1 (3.2)  96.4 (2.5)  94.8 (3.2)  96.9 (2.4)
 gcr   20   666   334   1000  73.1 (3.8)  73.5 (3.9)  75.6 (1.8)  74.9 (1.7)  76.2 (1.4)  75.9 (1.7)  75.4 (2.0)
 hea   13   180    90    270  83.6 (5.1)  83.2 (5.2)  84.3 (5.3)  83.4 (4.4)  83.1 (5.5)  83.7 (4.9)  84.1 (5.2)
 ion   33   234   117    351  95.6 (0.9)  96.2 (1.0)  95.6 (2.0)  95.4 (1.7)  91.0 (2.3)  94.4 (1.9)  92.4 (2.4)
 pid    8   512   256    768  77.3 (3.1)  77.5 (2.8)  77.3 (3.0)  76.9 (2.9)  77.6 (2.9)  77.5 (2.7)  77.2 (3.0)
 rsy    2   250  1000   1250  90.2 (0.7)  90.2 (0.6)  89.6 (1.1)  89.7 (0.8)  90.2 (0.7)  90.1 (0.8)  89.9 (0.8)
 snr   60   138    70    208  76.7 (5.6)  78.0 (5.2)  77.9 (4.2)  76.3 (5.3)  78.6 (4.9)  75.7 (6.1)  76.6 (7.2)
 tit    3  1467   734   2201  78.8 (1.1)  78.7 (1.1)  78.7 (1.1)  78.7 (1.1)  78.5 (1.0)  77.2 (1.9)  78.7 (1.2)
 wbc    9   455   228    683  95.9 (0.6)  95.7 (0.5)  96.2 (0.7)  96.2 (0.8)  95.8 (0.7)  93.7 (2.0)  96.5 (0.7)

 Average performance          83.7        83.9        84.1        83.6        83.7        83.2        83.7
 Average rank                 2.3         2.5         2.5         3.8         3.2         4.2         2.6
 Probability of a sign test   1.000       0.754       1.000       0.344       0.754       0.344       0.508

Notes: Both CV10 and the Bayesian (Bay) framework were used to design the LS-SVMs. For the Bayesian LS-SVM, the class label was assigned using the moderated output (BayM). An RBF kernel was used for all models. The model GPb has an extra bias term in the kernel function. The average performance, average rank, and probability of equal medians using the sign test taken over all domains are reported in the last three rows. Best performances are underlined and denoted in boldface, performances not significantly different at the 5% level are denoted in boldface, and performances significantly different at the 1% level are emphasized. No significant differences are observed between the different algorithms.
The best average test set performance was underlined and denoted in boldface for each data set. Boldface type is used to tabulate performances that are not significantly different at the 5% level from the top performance using a two-tailed paired t-test. Statistically significant underperformances at the 1% level are emphasized. Other performances are tabulated using normal type. Since the observations are not independent, we remark that the t-test is used here only as a heuristic approach to show that the average accuracies on the 10 randomizations can be considered to be different. Ranks are assigned to each algorithm starting from 1 for the best average performance. Averaging over all data sets, a Wilcoxon signed rank test of equality of medians is carried out on both average performance (AP) and average ranks (AR). Finally, the significance probability of a sign test ($P_{ST}$) is reported comparing each algorithm to the algorithm with best performance (LS-SVM CV10). These results are denoted in the same way as the performances on each individual data set. No significant differences are obtained between the different algorithms.

Comparing SVM CV10 with GP and LS-SVM CV10, it is observed that similar results are obtained with all three algorithms, which means that the loss of sparseness does not result in a degradation of the generalization performance on these data sets. It is also observed that the LS-SVM and
Bayesian LS-SVM Classifier
1141
GP designed within the evidence framework yield consistently comparable results when compared with CV10, which indicates that the gaussian assumptions of the evidence framework hold well for the natural domains at hand. Estimating the bias term b in the GP kernel function by Bayesian inference on level 3 yields comparable but different results from the LS-SVM formulation, where the bias term b is obtained on the first level. Finally, it is observed that assigning the class label from the moderated output, equation 3.17, also yields comparable results with respect to the classifier 2.22, but the latter formulation does yield an analytic expression to adjust the bias term for different prior class probabilities, which is useful, for example, in the case of unbalanced training and test sets or in the case of different classification costs.
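The bias-term adjustment for different prior class probabilities can be illustrated with a generic Bayes decision rule. The block below is an illustrative sketch, not the authors' code: `decide` and its arguments are hypothetical names, and the rule shown is the standard log-odds threshold shift, not the specific LS-SVM expression of equation 2.22.

```python
import math

def decide(log_odds_balanced, prior_pos=0.5, cost_fp=1.0, cost_fn=1.0):
    """Assign a class label from a balanced-prior log-odds score,
    shifting the decision threshold for unbalanced priors and/or
    asymmetric misclassification costs (generic Bayes decision rule)."""
    prior_neg = 1.0 - prior_pos
    # The threshold moves by the log prior ratio plus the log cost ratio.
    threshold = math.log(prior_neg / prior_pos) + math.log(cost_fp / cost_fn)
    return 1 if log_odds_balanced > threshold else -1

# With balanced priors the threshold is 0 ...
assert decide(0.3) == 1
# ... but a rare positive class (10% prior) raises the bar for the same score.
assert decide(0.3, prior_pos=0.1) == -1
```

The same mechanism covers different classification costs: penalizing false positives more heavily raises the threshold in exactly the same way as a smaller positive-class prior.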
8 Conclusion
In this article, a Bayesian framework has been related to the LS-SVM classifier formulation. This least-squares formulation was obtained by modifying the SVM formulation and implicitly corresponds to a regression problem with binary targets. The LS-SVM formulation is also related to kernel Fisher discriminant analysis. Without the bias term in the LS-SVM formulation, the dual space formulation is equivalent to GPs for regression. The least-squares regression approach to classification allows deriving analytic expressions on the different levels of inference. On the first level, the model parameters are obtained from solving a linear Karush-Kuhn-Tucker system in the dual space. The regularization hyperparameters are obtained from a scalar optimization problem on the second level, while the kernel parameter is determined on the third level by model comparison. Starting from the LS-SVM feature space formulation, the analytic expressions obtained in the dual space are similar to the expressions obtained for GPs. The use of an unregularized bias term in the LS-SVM formulation results in a zero-mean training error and an implicit centering of the Gram matrix in the feature space, also used in kernel PCA. The corresponding eigenvalues can be used to obtain improved bounds for SVMs. Within the evidence framework, these eigenvalues are used to control the capacity by the regularization term. Class probabilities are obtained within the defined probabilistic framework by marginalizing over the model parameters and hyperparameters. By combining the posterior class probabilities with an appropriate decision rule, class labels can be assigned in an optimal way. Comparing LS-SVM, SVM classification, and GP regression with binary targets on 10 normalized public domain data sets, no significant differences in performance are observed.
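The first level of inference summarized above can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: `rbf`, `solve`, `lssvm_fit`, and `lssvm_predict` are hypothetical names, and the linear KKT system is the standard LS-SVM formulation (Suykens & Vandewalle, 1999) referred to in the text.

```python
import math

def rbf(x, z, sigma2=1.0):
    """RBF kernel between two tuples of floats."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma2))

def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting (pure Python)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(X, y, gamma=10.0):
    """Level-1 inference: solve the (N+1)x(N+1) linear KKT system
        [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    N = len(X)
    A = [[0.0] + [1.0] * N]
    for i in range(N):
        A.append([1.0] + [rbf(X[i], X[j]) + (1.0 / gamma if i == j else 0.0)
                          for j in range(N)])
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]  # bias b, dual variables alpha

def lssvm_predict(X, alpha, b, x):
    """Latent output sum_i alpha_i K(x_i, x) + b; take its sign for the label."""
    return sum(a * rbf(xi, x) for a, xi in zip(alpha, X)) + b

# Toy 1-D two-class problem.
X = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
y = [-1, -1, 1, 1]
b, alpha = lssvm_fit(X, y)
assert lssvm_predict(X, alpha, b, (-1.5,)) < 0 < lssvm_predict(X, alpha, b, (1.5,))
```

Note the contrast with the SVM quadratic program: training reduces to one linear solve, at the price of losing sparseness, which, as reported above, did not degrade generalization on the benchmark domains.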
The gaussian assumptions made in the LS-SVM formulation and the related Bayesian framework allow obtaining analytical expressions on all levels of inference using reliable numerical techniques and algorithms.
1142
T. Van Gestel, et al.
Appendix A: Derivations Level 1 Inference
In the expression for the variances $\sigma_{e_+}^2$ and $\sigma_{e_-}^2$, the upper left $n_f \times n_f$ block of the covariance matrix $Q = H^{-1}$ is needed. Therefore, we first calculate the inverse of the block Hessian $H$. Using $\Upsilon = [\varphi(x_1), \ldots, \varphi(x_N)]$, with $V = \Upsilon^T \Upsilon$, the expressions for the block matrices in the Hessian, equation 3.7, are $H_{11} = \mu I_{n_f} + \zeta \Upsilon \Upsilon^T$, $H_{12} = \zeta \Upsilon 1_v$, and $H_{22} = N\zeta$. Calculating the inverse of the block matrix, the inverse Hessian is obtained as follows:

$$H^{-1} = \left( \begin{bmatrix} I_{n_f} & X \\ 0 & 1 \end{bmatrix} \begin{bmatrix} H_{11} - H_{12} H_{22}^{-1} H_{12}^T & 0 \\ 0 & H_{22} \end{bmatrix} \begin{bmatrix} I_{n_f} & 0 \\ X^T & 1 \end{bmatrix} \right)^{-1} \tag{A.1}$$

$$= \begin{bmatrix} (\mu I_{n_f} + \zeta G)^{-1} & -(\mu I_{n_f} + \zeta G)^{-1} H_{12} H_{22}^{-1} \\ -H_{22}^{-1} H_{12}^T (\mu I_{n_f} + \zeta G)^{-1} & H_{22}^{-1} + H_{22}^{-1} H_{12}^T (\mu I_{n_f} + \zeta G)^{-1} H_{12} H_{22}^{-1} \end{bmatrix}, \tag{A.2}$$
with $G = \Upsilon M \Upsilon^T$, $X = H_{12} H_{22}^{-1}$, and where $M = I_N - \frac{1}{N} 1_v 1_v^T$ is the idempotent centering matrix with rank $N - 1$. Observe that the above derivation results in a centered Gram matrix $G$, as is also done in kernel PCA (Schölkopf et al., 1998). The eigenvalues of the centered Gram matrix can be used to derive improved bounds for SVM classifiers (Schölkopf et al., 1999). In the Bayesian framework of this article, the eigenvalues of the centered Gram matrix are also used on levels 2 and 3 of Bayesian inference to determine the amount of weight decay and select the kernel parameter, respectively.

The inverse $(\mu I_{n_f} + \zeta G)^{-1}$ will be calculated using the eigenvalue decomposition of the symmetric matrix $G = G^T = P D_{G,f} P^T = P_1 D_G P_1^T$, with $P = [P_1 \; P_2]$ a unitary matrix and with the diagonal matrix $D_G = \mathrm{diag}([\lambda_{G,1}, \ldots, \lambda_{G,N_{\mathrm{eff}}}])$ containing the $N_{\mathrm{eff}}$ nonzero diagonal elements of the full-size diagonal matrix $D_{G,f} \in \mathbb{R}^{n_f \times n_f}$. The matrix $P_1$ corresponds to the eigenspace of the nonzero eigenvalues, and the null space is denoted by $P_2$. There are maximally $N - 1$ eigenvalues $\lambda_{G,i} > 0$, and their corresponding eigenvectors $\nu_{G,i}$ are a linear combination of the columns of $\Upsilon M$: $\nu_{G,i} = c_{G,i} \Upsilon M v_{G,i}$, with $c_{G,i}$ a normalization constant such that $\nu_{G,i}^T \nu_{G,i} = 1$. The eigenvalue problem we need to solve is the following:

$$\Upsilon M \Upsilon^T (\Upsilon M v_{G,i}) = \lambda_{G,i} (\Upsilon M v_{G,i}). \tag{A.3}$$
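The two properties of the centering matrix $M$ used throughout this derivation can be checked numerically. A small pure-Python sketch (toy feature vectors, hypothetical helper `matmul`; not from the original work) verifies that $M$ is idempotent and that, since $M 1_v = 0$, the rows and columns of the centered Gram matrix $M V M$ sum to zero.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

N = 4
# Centering matrix M = I_N - (1/N) 1 1^T.
M = [[(1.0 if i == j else 0.0) - 1.0 / N for j in range(N)] for i in range(N)]

# M is idempotent: M @ M == M (a projector of rank N-1).
MM = matmul(M, M)
assert all(abs(MM[i][j] - M[i][j]) < 1e-12 for i in range(N) for j in range(N))

# Any Gram matrix V = Phi^T Phi gets centered: rows of M V M sum to 0.
phi = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]  # toy feature vectors
V = [[sum(a * b for a, b in zip(phi[i], phi[j])) for j in range(N)]
     for i in range(N)]
G = matmul(matmul(M, V), M)
assert all(abs(sum(row)) < 1e-9 for row in G)
```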
Multiplication of equation A.3 to the left with $M \Upsilon^T$ and applying the Mercer condition yields $MVMVM v_{G,i} = \lambda_{G,i} MVM v_{G,i}$, which is a generalized eigenvalue problem of dimension $N$. An eigenvector $v_{G,i}$ corresponding to a nonzero eigenvalue $\lambda_{G,i}$ is also a solution of

$$MVM v_{G,i} = \lambda_{G,i} v_{G,i}, \tag{A.4}$$
since $MVMVM v_{G,i} = \lambda_{G,i} MVM v_{G,i} \neq 0$. By applying the normality condition of $\nu_{G,i}$, which corresponds to $c_{G,i} = 1 / \sqrt{v_{G,i}^T MVM v_{G,i}}$, one finally obtains $P_1 = [\nu_{G,1} \ldots \nu_{G,N_{\mathrm{eff}}}]$, where $\nu_{G,i} = \frac{1}{\sqrt{v_{G,i}^T MVM v_{G,i}}} \Upsilon M v_{G,i}$; equivalently, $P_1 = \Upsilon M U_G$, with $U_G(:, i) = \lambda_{G,i}^{-1/2} v_{G,i}$, $i = 1, \ldots, N_{\mathrm{eff}}$. The remaining $n_f - N_{\mathrm{eff}}$ dimensional orthonormal null space $P_2$ of $G$ cannot be explicitly calculated, but using the fact that $[P_1 \; P_2]$ is a unitary matrix, we have $P_2 P_2^T = I_{n_f} - P_1 P_1^T$. Observe that this is different from Kwok (1999, 2000), where the space $P_2$ is neglected. This yields $(\mu I_{n_f} + \zeta G)^{-1} = P_1 ((\mu I_{N_{\mathrm{eff}}} + \zeta D_G)^{-1} - \mu^{-1} I_{N_{\mathrm{eff}}}) P_1^T + \mu^{-1} I_{n_f}$. By defining $h(x) = \Upsilon^T \varphi(x)$ and applying the Mercer condition in equation 3.10, one finally obtains expression 3.11.

For large $N$, the calculation of all eigenvalues $\lambda_{G,i}$ and corresponding eigenvectors $\nu_{G,i}$, $i = 1, \ldots, N$, may require long computations. One may expect that little error is introduced by setting to zero the $N - r_G$ smallest eigenvalues, $\mu \gg \zeta \lambda_{G,i}$, of $G = G^T \geq 0$. This corresponds to an optimal rank $r_G$ approximation of $(\mu I_{n_f} + \zeta G)^{-1}$. Indeed, calling $G_R$ the rank $r_G$ approximation of $G$, we obtain $\min_{G_R} \|(\mu I_{n_f} + \zeta G)^{-1} - (\mu I_{n_f} + \zeta G_R)^{-1}\|_F$. Using the Sherman-Morrison-Woodbury formula (Golub & Van Loan, 1989), this becomes $\|(\mu I_{n_f} + \zeta G)^{-1} - (\mu I_{n_f} + \zeta G_R)^{-1}\|_F = \|\frac{\zeta}{\mu}(\mu I_{n_f} + \zeta G)^{-1} G - \frac{\zeta}{\mu}(\mu I_{n_f} + \zeta G_R)^{-1} G_R\|_F$. The optimal rank $r_G$ approximation for $(\mu I_{n_f} + \zeta G)^{-1} G$ is obtained by setting its smallest eigenvalues to zero. Using the eigenvalue decomposition of $G$, these eigenvalues are $\frac{\zeta \lambda_{G,i}}{\mu + \zeta \lambda_{G,i}}$. The smallest eigenvalues of $(\mu I_{n_f} + \zeta G)^{-1} G$ correspond to the smallest eigenvalues of $G$. Hence, the optimal rank $r_G$ approximation is obtained by setting the smallest $N - r_G$ eigenvalues to zero. Also notice that $\sigma_z^2$ is increased by putting $\lambda_{G,i}$ equal to zero; a decrease of the variance would introduce a false amount of certainty in the output.

Appendix B: Derivations Level 2 and 3 Inference
First, an expression for $\det(H)$ is given using the eigenvalues of $G$. By block diagonalizing equation 3.7, $\det(H)$ is not changed (see, e.g., Horn & Johnson, 1991). From equation A.1, we obtain $\det H = N\zeta \det(\mu I_{n_f} + \zeta G)$. The determinant is the product of the eigenvalues; this yields $\det H = N\zeta\, \mu^{n_f - N_{\mathrm{eff}}} \prod_{i=1}^{N_{\mathrm{eff}}} (\mu + \zeta \lambda_{G,i})$. Due to the regularization term $\mu E_W$, the Hessian is regular. The inverse exists, and we can write $\det H^{-1} = 1 / \det H$. Using equation 2.30, the error $e_i = y_i - (w_{MP}^T \varphi(x_i) + b_{MP})$ can also be written as $e_i = y_i - \hat{m}_Y - w_{MP}^T (\varphi(x_i) - \hat{m}_\Upsilon)$, with $\hat{m}_Y = \sum_{i=1}^{N} y_i / N$ and $\hat{m}_\Upsilon = \sum_{i=1}^{N} \varphi(x_i) / N$. Since $w_{MP} = (\Upsilon M \Upsilon^T + \gamma^{-1} I_{n_f})^{-1} \Upsilon M Y$, the error term
$E_D(w_{MP}, b_{MP})$ is equal to

$$E_D(w_{MP}, b_{MP}) = \frac{1}{2} (Y - \hat{m}_Y 1_v)^T \left( I_N - M \Upsilon^T \left( \Upsilon M \Upsilon^T + \gamma^{-1} I_{n_f} \right)^{-1} \Upsilon M \right)^2 (Y - \hat{m}_Y 1_v)$$
$$= \frac{1}{2\gamma^2}\, Y^T M V_G \left( D_G + \gamma^{-1} I_{N_{\mathrm{eff}}} \right)^{-2} V_G^T M Y, \tag{B.1}$$

where we used the eigenvalue decomposition of $G = \Upsilon M \Upsilon^T$. In a similar way, one can obtain the expression for $E_W$ in the dual space starting from $w_{MP} = (\Upsilon M \Upsilon^T + \gamma^{-1} I_{n_f})^{-1} \Upsilon M Y$:

$$E_W(w_{MP}) = \frac{1}{2}\, Y^T M V_G D_G \left( D_G + \gamma^{-1} I_{N_{\mathrm{eff}}} \right)^{-2} V_G^T M Y. \tag{B.2}$$

The sum $E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP})$ is then equal to

$$E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP}) = \frac{1}{2}\, Y^T M V_G \left( D_G + \gamma^{-1} I_{N_{\mathrm{eff}}} \right)^{-1} V_G^T M Y = \frac{1}{2}\, Y^T M \left( MVM + \gamma^{-1} I_N \right)^{-1} M Y, \tag{B.3}$$
which is the same expression as obtained with GP when no centering $M$ is applied to the outputs $Y$ and the feature vectors $\Upsilon$.

Acknowledgments. We thank the two anonymous reviewers for constructive comments and also thank David MacKay for helpful comments related to the second and third levels of inference. T. Van Gestel and J. A. K. Suykens are a research assistant and a postdoctoral researcher, respectively, with the Fund for Scientific Research-Flanders (FWO-Vlaanderen) at the K.U. Leuven. This work was partially supported by grants and projects from the Flemish government (Research Council KULeuven: Grants, GOA-Mefisto 666; FWO-Vlaanderen: Grants, res. proj. G.0240.99, G.0256.97, and comm. (ICCoS and ANMMM); AWI: Bil. Int. Coll.; IWT: STWW Eureka SINOPSYS, IMPACT); from the Belgian federal government (Interuniv. Attr. Poles: IUAP-IV/02, IV/24; Program Dur. Dev.); and from the European Community (TMR Netw. (Alapedes, Niconet); Science: ERNSI). The scientific responsibility is assumed by its authors.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available on-line: www.ics.uci.edu/~mlearn/MLRepository.html.
Brown, P. J. (1977). Centering and scaling in ridge regression. Technometrics, 19, 35–36.
Cawley, G. C. (2000). MATLAB support vector machine toolbox (v0.54b). Norwich, Norfolk, U.K.: University of East Anglia, School of Information Systems. Available on-line: http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
De Groot, M. H. (1986). Probability and statistics (2nd ed.). Reading, MA: Addison-Wesley.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Evgeniou, T., Pontil, M., Papageorgiou, C., & Poggio, T. (2000). Image representations for object detection using kernel classifiers. In Proc. Fourth Asian Conference on Computer Vision (ACCV 2000) (pp. 687–692). Taipei, Taiwan.
Gibbs, M. N. (1997). Bayesian gaussian processes for regression and classification. Unpublished doctoral dissertation, University of Cambridge.
Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Gull, S. F. (1988). Bayesian inductive inference and maximum entropy. In G. J. Erickson & R. Smith (Eds.), Maximum-entropy and Bayesian methods in science and engineering (Vol. 1, pp. 73–74). Norwell, MA: Kluwer.
Horn, R. A., & Johnson, C. R. (1991). Topics in matrix analysis. Cambridge: Cambridge University Press.
Kwok, J. T. (1999). Moderating the outputs of support vector machine classifiers. IEEE Trans. on Neural Networks, 10, 1018–1031.
Kwok, J. T. (2000). The evidence framework applied to support vector machines. IEEE Trans. on Neural Networks, 11, 1162–1173.
MacKay, D. J. C. (1992). The evidence framework applied to classification networks.
Neural Computation, 4(5), 698–714.
MacKay, D. J. C. (1995). Probable networks and plausible predictions—A review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469–505.
MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5), 1035–1068.
Mika, S., Rätsch, G., & Müller, K.-R. (2001). A mathematical programming approach to the kernel Fisher algorithm. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 591–597). Cambridge, MA: MIT Press.
Mukherjee, S., Tamayo, P., Mesirov, J. P., Slonim, D., Verri, A., & Poggio, T. (1999). Support vector machine classification of microarray data (CBCL Paper 182/AI Memo 1676). Cambridge, MA: MIT.
Neal, R. M. (1997). Monte Carlo implementation of gaussian process models for Bayesian regression and classification (Tech. Rep. No. CRG-TR-97-2). Toronto: Department of Computer Science, University of Toronto.
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge, MA: MIT Press.
Rasmussen, C. (1996). Evaluation of gaussian processes and other methods for nonlinear regression. Unpublished doctoral dissertation, University of Toronto, Canada.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rosipal, R., & Girolami, M. (2001). An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13(3), 505–510.
Saunders, C., Gammerman, A., & Vovk, K. (1998). Ridge regression learning algorithm in dual variables. In Proc. of the 15th Int. Conf. on Machine Learning (ICML-98) (pp. 515–521). Madison, WI.
Schölkopf, B., Shawe-Taylor, J., Smola, A., & Williamson, R. C. (1999). Kernel-dependent support vector error bounds. In Proc. of the 9th Int. Conf. on Artificial Neural Networks (ICANN-99) (pp. 304–309). Edinburgh, UK.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Shashua, A. (1999). On the equivalence between the support vector machine for classification and sparsified Fisher's linear discriminant. Neural Processing Letters, 9, 129–139.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Sollich, P. (2000). Probabilistic methods for support vector machines. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Suykens, J. A. K. (2000).
Least squares support vector machines for classification and nonlinear modelling. Neural Network World, 10, 29–48.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
Suykens, J. A. K., & Vandewalle, J. (2000). Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems-I, 47, 1109–1114.
Suykens, J. A. K., Vandewalle, J., & De Moor, B. (2001). Optimal control by least squares support vector machines. Neural Networks, 14, 23–35.
Van Gestel, T., Suykens, J. A. K., Baestaens, D.-E., Lambrechts, A., Lanckriet, G., Vandaele, B., De Moor, B., & Vandewalle, J. (2001). Predicting financial time series using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 12, 809–812.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Williams, C. K. I. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning and inference in graphical models. Norwell, MA: Kluwer Academic Press.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.

Received July 6, 2000; accepted September 12, 2001.
LETTER
Communicated by Sara Solla
Neural Network Pruning with Tukey-Kramer Multiple Comparison Procedure Donald E. Duckro [email protected] Dennis W. Quinn dennis.quinn@at.edu Samuel J. Gardner III Air Force Institute of Technology, Department of Mathematics and Statistics, Wright-Patterson Air Force Base, Ohio 45433, U.S.A.
Reducing a neural network's complexity improves the ability of the network to generalize to future examples. Like an overfitted regression function, neural networks may miss their target because of the excessive degrees of freedom stored up in unnecessary parameters. Over the past decade, the subject of pruning networks produced nonstatistical algorithms like Skeletonization, Optimal Brain Damage, and Optimal Brain Surgeon as methods to remove connections with the least salience. The method proposed here uses the bootstrap algorithm to estimate the distribution of the model parameter saliences. Statistical multiple comparison procedures are then used to make pruning decisions. We show this method compares well with Optimal Brain Surgeon in terms of ability to prune and the resulting network performance.
1 Introduction
Large neural networks are easy to fit to a data set but may be less than adequate in their ability to predict future examples. The more complex the neural network model, the easier it is to essentially "memorize" the data (i.e., obtain a near-perfect fit). This is similar to the situation in an overspecified linear statistical model, and the associated problems (larger variances, poor performance) are observed for both classes of models. In order to find the best set of parameters and variables to use in a given model, some model selection procedure must often be used. In the neural network literature, these procedures are often called pruning algorithms. Popular algorithms include Skeletonization, Optimal Brain Damage, and Optimal Brain Surgeon. In this article, we propose a novel method, based on statistical inference procedures, to perform neural network pruning. Section 2 gives the necessary background for an understanding of neural networks Neural Computation 14, 1149–1168 (2002) © 2002 Massachusetts Institute of Technology
and network pruning and the statistical procedures we incorporate into our method. Section 3 presents the Tukey-Kramer multiple comparison pruning algorithm. Section 4 presents examples and simulation results comparing our proposed method with Optimal Brain Surgeon. Many of the terms used in the neural network community may be unfamiliar to the statistician. Wherever necessary, the corresponding statistical nomenclature is included (emphasized in parentheses) in the text.

2 Background

2.1 Neural Networks Overview. Neural networks (NNs) represent a wide class of flexible nonlinear models. This article focuses on feedforward neural networks as predictors (regression, discrimination), but the results may be applied to several types of NNs. A comprehensive introduction to NNs is given by Bishop (1995). Equations 2.1 through 2.3 illustrate the general form of a feedforward NN's output:
$$y_k = f_k(x_1, x_2, \ldots, x_d), \qquad k = 1, \ldots, O, \tag{2.1}$$

where

$$f_k(x_1, x_2, \ldots, x_d) = f_o\!\left( w_{0k}^{(2)} + \sum_{j=1}^{l} w_{jk}^{(2)} a_j \right) \tag{2.2}$$

and

$$a_j = f_h\!\left( w_{0j}^{(1)} + \sum_{i=1}^{d} w_{ij}^{(1)} x_i \right). \tag{2.3}$$
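Equations 2.1 through 2.3 can be sketched directly in code. The block below is an illustrative forward pass, not the authors' implementation; `forward` and the weight layout are hypothetical names, with both activation functions taken as the logistic used throughout the article.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, w1, b1, w2, b2):
    """One forward pass of the feedforward NN of equations 2.1-2.3:
    hidden activations a_j (eq. 2.3), then outputs y_k (eq. 2.2).
    w1[j][i] plays the role of w_ij^(1), w2[k][j] of w_jk^(2)."""
    a = [logistic(b1[j] + sum(w1[j][i] * x[i] for i in range(len(x))))
         for j in range(len(w1))]
    return [logistic(b2[k] + sum(w2[k][j] * a[j] for j in range(len(a))))
            for k in range(len(w2))]

# With all weights and biases zero: a_j = f_h(0) = 0.5 and y_k = f_o(0) = 0.5.
x = [0.2, -1.0, 3.0]                  # d = 3 inputs
w1, b1 = [[0.0] * 3] * 2, [0.0] * 2   # l = 2 hidden nodes
w2, b2 = [[0.0] * 2], [0.0]           # O = 1 output
assert forward(x, w1, b1, w2, b2) == [0.5]
```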
The inputs (independent variables), $x_i$, and the outputs (dependent variables), $y_k$, are connected together by two activation functions, $f_o$ and $f_h$. These functions are often nonlinear and monotonic. Throughout this article, we will use $f_o(u) = f_h(u) = (1 + e^{-u})^{-1}$. The $w$'s are known as weights (parameters). The subscripts on the activation functions indicate whether they are used in the hidden (h) or the output (o) layer of the network. In the hidden layer of the network, there are $l$ hidden nodes, and for each of the $O$ output variables, there is an output node $f_k$. Using this terminology, one can picture the NN as a diagram of connected inputs, hidden nodes, and outputs. An example of a network diagram for a feedforward NN with one output, three hidden nodes, and multiple inputs is given in Figure 1 (see section 2.5). From a training set $X = \{(y_i^T, x_i^T) = (y_1, y_2, \ldots, y_O, x_1, x_2, \ldots, x_d)_i,\ i = 1, \ldots, N\}$, the weights $w_{ij}^{(m)}$ are learned (estimated) using an iterative algorithm
designed to minimize an error function, $E$. For a regression-prediction problem, the typical error function is the sum of squared errors (SSE), given by

$$E(w) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^T (y_i - \hat{y}_i), \tag{2.4}$$
where $w$ is the vector of the weights used in the NN and $\hat{y}_i = (\hat{y}_{1,i}, \ldots, \hat{y}_{O,i})^T$ are the output layer predictions for the $i$th exemplar (training sample point) with the estimates for the weights substituted in (i.e., $\hat{y}_{k,i} = f_k(x_i; w)$). The SSE is a summation of the squared deviations of each observation of the training set from the predicted output. A gradient-descent method is used to find the vector $\hat{w}$ that minimizes the least-squares criterion. This method is derived from the Taylor series expansion of $E(w)$ around the minimizing vector $w_{opt}$, given by

$$E(w) \approx E(w_{opt}) + (w - w_{opt})^T \left[ \frac{\partial E(w)}{\partial w} \right]_{w = w_{opt}} + \frac{1}{2} (w - w_{opt})^T H (w - w_{opt}), \tag{2.5}$$
where $H$ is the Hessian matrix of $E(w)$, $H_{ij} = \left[ \frac{\partial^2}{\partial w_i \partial w_j} E(w) \right]_{w = w_{opt}}$. (Note that for simplicity, we may refer to the weights with a single subscript only.) Given an initial starting value for $w$, one can iterate with the Widrow-Hoff gradient-descent method (Widrow & Hoff, 1960),

$$w \leftarrow w - \eta \frac{\partial E}{\partial w}, \tag{2.6}$$
where $\eta$ is a predetermined or adaptive learning rate. This method is also known as backpropagation. The iterations continue until a stopping criterion is satisfied (e.g., maximum number of iterations, mean square error target, change in mean squared error target). NN model simplification involves removing superfluous weights so as to achieve better generalization (predictive ability). White's (1989) development of statistical inference applied to network architecture provides a means to test which weights are superfluous. Although he did not specifically reference Wald (1943), White's hypothesis testing is based on the chi-square distribution that Wald used for tests involving several parameters. Pruning (parameter selection) algorithms have been developed to trim large, quick-learning networks into sufficient minimal networks. Devoid of heuristic information to dictate the order of parameter elimination, pruning algorithms employ simple strategies to select parameters for elimination. The simplest strategy is the elimination of the smallest weights. However,
the significance of a weight to an NN is not the size of the weight but its contribution to the ability of the NN to match the data. This is analogous to multiple regression (see Neter, Kutner, Nachtsheim, & Wasserman, 1996), in which a model parameter remains in a model or is eliminated from the model based on the performance of the full model (which contains the parameter in question) compared to the reduced model (which is identical to the full model except that the parameter in question has been set to zero). Pruning decisions are often based on a salience measure: the effect the pruning decision has on the training error. Formally, the salience of a single weight $w_p$ is defined as

$$L_p = \min_{\mathrm{s.t.}\ w_p = 0} [E(w)] - \min [E(w)]. \tag{2.7}$$

When equation 2.5 has been minimized and the best estimate of the weights has been found, the gradient term is negligible. Substituting the Taylor series approximation into equation 2.7 gives

$$L_p \approx \frac{1}{2} \min_{\mathrm{s.t.}\ w_p = 0} [\Delta w^T H \Delta w], \tag{2.8}$$

where $\Delta w = (w - w_{opt})$. This yields $L_p = \frac{1}{2} H_{pp} w_{opt,p}^2$. Optimal Brain Damage (OBD; Le Cun, Denker, & Solla, 1990) uses this measure of salience to decide which weight should be eliminated. Specifically, the NN is trained, and each of the individual saliences is computed. The weight that has the smallest salience is set to zero, and the NN is retrained with that constraint. This process is repeated (retaining all constraints found through the pruning process) until a stopping criterion is met. The OBD procedure offers variations to this algorithm to include the deletion of some low-saliency parameters at each iteration, but the protocol is not well defined from a statistical perspective. The assumption that the low-salient parameters are irrelevant is consistent with White's hypotheses, which in turn are consistent with the Wald test.

Optimal Brain Surgeon (OBS) (Hassibi & Stork, 1993) is a refinement of OBD, allowing for a full Hessian matrix $H$. Retraining is computationally burdensome for OBD, whereas OBS prunes weights using a constrained optimization approach and avoids retraining. If one sets $w_p = 0$ for a given $p$, then the optimum $\Delta w$ that satisfies equation 2.8 is given by

$$\Delta w = -w_{opt,p} \frac{1}{[H^{-1}]_{pp}} H^{-1} e_p, \tag{2.9}$$
where $e_p$ is the unit vector in the weight space corresponding to weight $w_p$. The corresponding salience is then

$$L_p = \frac{1}{2} \frac{w_{opt,p}^2}{[H^{-1}]_{pp}}. \tag{2.10}$$
OBS uses this measure of salience to make the single-step pruning decision. However, in contrast with OBD, the weight update formula of OBS allows for quick pruning because retraining is not necessary. One simply uses the weight update formula and recalculates $H^{-1}$ in each pruning cycle. The single-step pruning based on the lowest-saliency parameter is consistent with White's hypotheses of irrelevance. Hassibi and Stork (1993) suggest that several weights may be deleted simultaneously, and Hassibi, Stork, and Wolff (1994) consider several extensions of OBS. However, neither Hassibi and Stork (1993) nor Hassibi et al. (1994) provide a detailed weight update formula for the deletion of more than one weight.

2.2 Tukey Multiple Comparison Procedure. To prune the nonlinear neural net model of equations 2.1 through 2.3, we now consider the linear statistical model for the saliences
$$L_{ij} = \mu_i + \epsilon_{ij}, \qquad i = 1, \ldots, P, \qquad j = 1, \ldots, n_i, \tag{2.11}$$
where the random variable $L_i$ represents the observed salience of the $i$th weight, the $\mu_i$ are constant parameters representing the expected value of the salience, and the $\epsilon_{ij}$ are random error terms, or noise, in the $j$th observation of each salience. There are $P$ weights and $n_i$ samples of the saliency $L_i$. The Tukey Multiple Comparison Procedure considers the set of all pairwise comparisons under the null hypothesis that the saliences are the same and the alternative hypothesis that the saliences are different: $H_0: \mu_i = \mu_j$, $H_a: \mu_i \neq \mu_j$, where $i, j = 1, 2, \ldots, P$ with $i \neq j$. This procedure relies on the studentized range distribution, where the studentized range is the ratio of two dispersion measures: the range and the standard deviation. This statistic allows us to quantify the relative differences in salience means. Let $L_1, \ldots, L_P$ be independently distributed $N(\mu, \sigma^2)$, and suppose the variance estimate $s^2$ is based on $\nu$ degrees of freedom independent of the $L_i$ such that $\nu s^2 / \sigma^2$ is chi-squared distributed, $\chi^2(\nu)$. Then the studentized range with parameters $(P, \nu)$ is defined as

$$q = \frac{\max L_i - \min L_i}{s} \sim q(P; \nu). \tag{2.12}$$
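The pairwise comparison of salience means built on this distribution can be sketched numerically. The block below is a hedged illustration (hypothetical `pairwise_q`, toy salience samples, not the authors' code) computing the pairwise statistic $q^*$ of equation 2.13 with the pooled variance $s^2$.

```python
import math

def pairwise_q(samples):
    """Tukey-Kramer pairwise statistics q* (eq. 2.13) from salience samples:
    samples[i] is the list of observed saliences for weight i."""
    P = len(samples)
    means = [sum(s) / len(s) for s in samples]
    nu = sum(len(s) for s in samples) - P       # pooled degrees of freedom
    s2 = sum((x - means[i]) ** 2
             for i, s in enumerate(samples) for x in s) / nu
    q = {}
    for i in range(P):
        for j in range(i + 1, P):
            se = math.sqrt(s2 * (1.0 / len(samples[i]) + 1.0 / len(samples[j])))
            q[(i, j)] = math.sqrt(2.0) * (means[i] - means[j]) / se
    return q

# Two similar saliences and one clearly larger one.
q = pairwise_q([[1.0, 1.2, 0.9], [1.1, 1.0, 1.2], [5.0, 5.3, 4.9]])
assert abs(q[(0, 1)]) < abs(q[(0, 2)])  # (0, 1) look equal; (0, 2) do not
```

Each $|q^*|$ would then be compared against a critical value $q_{1-\alpha}(P; \nu)$ from a studentized range table to decide which salience means can be grouped as "equivalent."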
For multiple observations of the salience of a connection, a salience mean $\bar{L}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} L_{ij}$ is calculated, and the standard deviation of the means' differences for the pair $(i, j)$ is denoted by $s(\bar{L}_i - \bar{L}_j)$. If the $\epsilon_{ij}$ are independent and identically distributed (i.i.d.) $N(0, \sigma^2)$, then an appropriate test statistic for each pairwise comparison of salience means is

$$q^* = \frac{\sqrt{2}\, (\bar{L}_i - \bar{L}_j)}{s(\bar{L}_i - \bar{L}_j)} = \frac{\sqrt{2}\, (\bar{L}_i - \bar{L}_j)}{\sqrt{s^2 \left( \frac{1}{n_i} + \frac{1}{n_j} \right)}}, \tag{2.13}$$
where $s^2 = \frac{1}{\nu} \sum_{i=1}^{P} \sum_{j=1}^{n_i} (L_{ij} - \bar{L}_i)^2$. For nonequal sample sizes, $\nu = \sum_{i=1}^{P} n_i - P$; for equal sample sizes $n_i = B$, it follows that $\nu = BP - P$. In equation 2.13, the $(\bar{L}_i - \bar{L}_j)$ term does not cancel because it represents an arithmetic quantity in the numerator, while in the denominator, it is the argument of the function $s$. Using the test statistic $q^*$ of equation 2.13, the null hypothesis $H_0: \mu_i = \mu_j$ is rejected when $|q^*| > q_{1-\alpha}(P; \nu)$, and we conclude that salience means are different with a family confidence coefficient of at least $1 - \alpha$ for all pairwise comparisons. The significance level $\alpha$ is sometimes considered the level of risk of rejecting a null hypothesis when in fact it is true. The family significance level for the pairwise comparison test is exactly $\alpha$ when all the sample sizes $n_i$ are equal. When the sample sizes are not equal, the procedure is sometimes called the Tukey-Kramer Procedure (Tukey, 1991; Neter et al., 1996). In this case, the family significance level is less than $\alpha$, and all decisions are more conservative. The above test assumes that the mean estimates $\bar{L}_i$ are independent. In order to account for correlated sample means (as will be observed in section 2.5), Kramer (1957) and Brown (1984) developed multiple range tests and confidence intervals for dependent means. The estimated variances and covariances of the means, denoted $c_{ii} s^2$ and $c_{ij} s^2$, are used to adjust the critical value for the test in a way that provides simultaneous confidence intervals of the form

$$\mu_i - \mu_j \in [\bar{L}_i - \bar{L}_j \pm Z_{ij}\, q_{1-\alpha}(P; \nu)], \tag{2.14}$$
where $Z_{ij}^2 = (c_{ii} s^2 - 2 c_{ij} s^2 + c_{jj} s^2) / 2$, which have confidence level at least $1 - \alpha$. A method for grouping “equivalent” means is to determine which group of ordered sample means differs by no more than $Z_{ij}\, q_{1-\alpha}(P; \nu)$. Mean estimates of each connection's salience are not readily available from network training. Section 2.3 introduces a procedure to facilitate their estimation.

2.3 Bootstrap Training. Estimates for the salience means ($\bar{L}_i$), variances
($c_{ii} s^2$), and covariances ($c_{ij} s^2$), which appear in equation 2.14, require multiple observations of each connection's salience. Efron (1982) has shown that the bootstrap algorithm finds a distribution that can be used as an approximation of the sampling distribution for a given statistic. Let $U(X)$ be a random variable of interest, such as a salience, where $X = \{x_1, x_2, \ldots, x_N\}$ indicates the entire i.i.d. sample from distribution $F$, and $N$ is the size of
the training set. On the basis of having observed $X$, we wish to estimate some aspect of $U$'s sampling distribution, $E_F g(U)$. A three-part algorithm is used:

1. Fit the empirical cumulative distribution function (CDF) of $F$, $\hat{F}$: mass $1/N$ at each $x_i$, $i = 1, 2, \ldots, N$.

2. Draw a bootstrap sample from $\hat{F}$, $X^* = \{x_1^*, x_2^*, \ldots, x_N^*\} \overset{\mathrm{iid}}{\sim} \hat{F}$, and calculate $g(U^*) = g(U(X^*))$.

3. Independently repeat step 2 a large number of times ($B$), obtaining bootstrap replications $g(U^{*1}), g(U^{*2}), \ldots, g(U^{*B})$, and calculate $\bar{g}(U^*) = \sum_{i=1}^{B} g(U^{*i}) / B$, the bootstrap estimate of $E_F g(U)$.

Efron's bootstrap algorithm has been applied to neural networks in the past to build confidence intervals for the network output. In this article, we use the algorithm to construct mean estimates of the network's saliences. Our goal is to provide a method for grouping the least significant salience with similarly small saliences and eliminating their corresponding weights from the model. The determination of “similar” and “small” requires an estimate of the means and covariances of the sampling distribution of the saliences. Using the bootstrap procedure to estimate the means and covariances of saliences is what we term bootstrap training. To accelerate this procedure, it was discovered that a savings in the number of calculations needed for bootstrap training could be achieved by first training the NN with the original training set and using the estimated weight parameters as the initial estimate for training on each bootstrap sample. This also serves another purpose in that it helps to overcome problems with parameter identifiability in the NN. A neural network such as the one presented in Figure 1 suffers from the property that there are multiple “equivalent” networks that will give the same results: any permutation of the hidden-layer nodes provides an equivalent network. The focus of this research is not so much to estimate the specific parameters in the NN, but rather to estimate the complexity of the network needed to provide good generalization. By using the initial training set to find a good starting point for each bootstrap training iteration, we keep the location of the weight parameters and their saliences in the same neighborhood for each iteration, avoiding the identifiability problem.

2.4 Stopping Criteria.
Because both the NN training process and the pruning process are iterative, it is important to determine an appropriate
1156
Donald E. Duckro, Dennis W. Quinn, and Samuel J. Gardner III
stopping criterion. For the training portion, a maximum number of iterations or a minimum change in $E(w)$ is used. For pruning, one could consider several rules for stopping, including pruning to a predetermined complexity, a maximum number of pruning iterations, or an error rule. The error rule can search for the minimum validation error of the test set (Hassibi et al., 1994). The error rule can stop pruning when the additional error is no longer much smaller than the present total error (Hassibi & Stork, 1993). In this article, we consider an error rule that stops the pruning process when an "F-statistic" criterion has been met.

The F-statistic is frequently used to test the appropriateness of model complexity. For instance, Neter et al. (1996, pp. 120, 230, 268, 353, 911) use the F-statistic to test for the appropriateness of model complexity for linear models. In these tests, $SS_r$ is the SSE for the reduced model, $SS_f$ is the SSE for the full model, $P_f$ is the number of parameters in the full model, and $k$ is the number of parameters being considered for elimination from the full model. The number of degrees of freedom for the full model is $df_f = N - P_f$, and the number of degrees of freedom for the reduced model is $df_r = N - (P_f - k)$. To compare the reduced model with the full model, the null hypothesis is that the reduced model explains the data as well as the full model. The F-statistic for this test is
$$F^* = \frac{(SS_r - SS_f) / (df_r - df_f)}{SS_f / df_f} = \frac{(SS_r - SS_f) / k}{SS_f / (N - P_f)}. \qquad (2.15)$$
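As a quick numerical sketch of equation 2.15 (the sums of squares below are hypothetical, not taken from the article's experiments):

```python
def f_statistic(ss_reduced, ss_full, k, n, p_full):
    """F* of equation 2.15 for comparing a reduced model against a full model.

    ss_reduced -- error sum of squares (SSE) of the reduced model
    ss_full    -- error sum of squares (SSE) of the full model
    k          -- number of parameters eliminated from the full model
    n          -- number of observations
    p_full     -- number of parameters in the full model
    """
    return ((ss_reduced - ss_full) / k) / (ss_full / (n - p_full))

# Dropping two parameters doubles the error sum of squares in this toy case:
f_star = f_statistic(ss_reduced=20.0, ss_full=10.0, k=2, n=14, p_full=4)
print(f_star)  # (10/2) / (10/10) = 5.0
```

A large $F^*$ (above the chosen critical value) says the eliminated parameters carried real explanatory power.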
A large difference in the error sum of squares results in $F^*$ being greater than a critical value; in this case, we reject the null hypothesis and conclude that the reduced model does not explain the data as well as the full model does. Otherwise, for small $F^*$, we accept the reduced model as being statistically equivalent to the full model. This is analogous to White's (1989) development of the irrelevant input and hidden unit hypotheses and use of the Wald test to dismiss weights close to zero.

In our pruning, the goal is to obtain the simplest possible NN, so we prune until the pruned model is no different from using the mean of the data as the model. At that point, we reject the most recent pruning, revert to the previous NN, and stop. For this approach, the full model is the most recent pruned model, and the reduced model is the model that uses the mean of the data. Consequently, the F-statistic of equation 2.15 uses $SS_f = SSE$, the usual error sum of squares of equation 2.4, and $SS_r = SSTO$, the usual total sum of squares about the mean, calculated similarly to equation 2.4 except that the mean, $\bar{y}$, replaces the prediction, $\hat{y}_i$. The degrees of freedom are $df_f = N - P_f$ and $df_r = N - 1$. Therefore, the F-statistic is
$$F^* = \frac{(SSTO - SSE) / (P_f - 1)}{SSE / (N - P_f)} = \frac{SSR / (P_f - 1)}{SSE / (N - P_f)} = \frac{MSR}{MSE}, \qquad (2.16)$$
Neural Network Pruning
1157
where SSR is the regression sum of squares. Consequently, the F-statistic reduces to the usual ratio of the regression mean square and the mean square error. For the range of degrees of freedom we are using, the critical value $F = 3$ generally satisfies a significance level of $\alpha = 0.05$. If $F^* < 3$, we stop pruning and revert back to the previous NN, because our latest pruned model is no different from the model that uses the mean of the data to predict. This method provides a consistent error rule to stop pruning, as opposed to an unstructured rule of thumb, and its implementation supports the comparison of different pruning algorithms.

2.5 Monk's Problems. The Monk's problems (Thrun, Wnek, Michalski, Mitchell, & Cheng, 1991) are excellent benchmark Boolean tests for pruning algorithms. Articles that have used the Monk's problems as benchmarks include Hadzikadic and Bohren (1997), Hassibi and Stork (1993), Hassibi et al. (1994), Le Cun et al. (1990), Setiono (1997), van de Laar and Heskes (1999), and Young and Downs (1998). Monk's problems concern the classification of robots exhibiting six different features. Each feature has two, three, or four possibilities, as follows:
Head shape ∈ {round, square, octagon}
Body shape ∈ {round, square, octagon}
Is smiling ∈ {yes, no}
Is holding ∈ {sword, balloon, flag}
Coat color ∈ {red, yellow, green, blue}
With necktie ∈ {yes, no}.
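The feature space above is small enough to enumerate directly. The sketch below (dictionary keys are ad hoc names, not from the article) confirms the combinatorics and previews the problem 1 rule:

```python
from itertools import product

# Ad hoc encoding of the six robot features listed above.
FEATURES = {
    "head_shape": ["round", "square", "octagon"],
    "body_shape": ["round", "square", "octagon"],
    "is_smiling": ["yes", "no"],
    "is_holding": ["sword", "balloon", "flag"],
    "coat_color": ["red", "yellow", "green", "blue"],
    "has_necktie": ["yes", "no"],
}

# Cartesian product of all feature values: every describable robot.
robots = [dict(zip(FEATURES, values)) for values in product(*FEATURES.values())]
print(len(robots))  # 432 distinct robots

# Monk's problem 1: head shape equals body shape, or coat color is red.
def monk1(robot):
    return robot["head_shape"] == robot["body_shape"] or robot["coat_color"] == "red"

print(sum(monk1(r) for r in robots))  # 216 robots satisfy the rule
```

The 216 positives follow from inclusion-exclusion: 144 robots with matching head and body shapes, plus 108 red-coated robots, minus the 36 counted twice.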
These features can describe 432 different robots. The three problems involve the training of a neural network to discern certain distinctions of the robots:

1. Head shape and body shape are the same, or coat color is red.
2. Two of the six features have the first value: round, yes, sword, red.
3. Holding a sword and coat color is red, or coat color is not blue and body shape is not octagon.

For the Monk's problems, the neural network has 17 binary inputs, one for each of the feature options. The choice of the number of hidden nodes in the network is arbitrary. One expects a network with a larger number of hidden nodes to learn faster, but also that the subsequent model would be a poor generalizer. The illustrated pruned network in Figure 1, with three hidden nodes and one output, y, began with 58 connections (weights) between the input layer, three hidden nodes, and one output node. The 58 connections include four biases, which can be interpreted as connections to
Figure 1: Neural network for Monk’s problem 1 after three-fourths of the connection weights have been pruned. Before pruning, there were 58 connections in the network. The pruned network contains only 15 of the original 58 weights. The network nodes use logistic sigmoid activation functions.
an external unit always in the +1 state, as shown in Figure 1. This network will estimate the posterior probability of a robot meeting the criterion for a given Monk's problem, and we would classify it as such if y > .5.

To illustrate the bootstrap training procedure, we train this network to perform the first Monk's problem. A random sample of size 124 was selected from the complete set of robots as a training set. The NN was trained using these 124 points, and initial weight estimates were obtained. Bootstrap training was performed with B = 128 bootstrap replications (resampling from the 124 points). Figure 2 is the histogram of the logarithm of the saliences for the bias (intercept) weight in the output layer. The bootstrap salience distribution is unimodal with some skewness, similar to the universal shape of all saliences for the input-to-hidden weights of a two-layer network as found by Gorodkin, Hansen, Lautrup, and Solla (1997). The experimental results presented in Figure 2 compare well with the theoretical predictions presented in Figures 1 and 2 of Gorodkin et al. (1997). In the case of bootstrap training, a new resampled training set creates each sample for the salience distribution, while Gorodkin et al. collect observations at each step of one training set. The distribution of bootstrap resamplings of each salience is also consistent with the near-normality expectation of Efron (1982) and supports the use of multiple comparison procedures.

Figure 3 was produced using 128 bootstrap replications of two of the saliences of a fully connected network. In this figure, we see that the bootstrap saliences of $w_{11}^{(1)}$ and $w_{21}^{(1)}$ appear to be correlated. By the end of pruning, only one of the weights corresponding to these correlated saliences survives,
Figure 2: Bootstrap distribution of 128 samplings of the logarithm of the salience $L_0^{(2)}$ for $w_0^{(2)}$, the bias to the output node. The bar chart counts the number of samplings of the logarithm of the bias salience that occur over the range of values observed.
as Figure 1 indicates. The information conveyed by these weights remains important, but it is redundant to have both connections corresponding to these weights in the final network. In general, salience correlations can appear between connections sharing common nodes and inputs, as well as between connections further apart, such as $w_{22}^{(1)}$ and $w_{63}^{(1)}$, or $w_{51}^{(1)}$ and $w_{21}^{(2)}$ (the connection of input 5 to hidden node 1 correlated with the connection of hidden node 2 and the output). The correlation is supported by the observation of Hassibi and Stork (1993) that every problem they investigated had a strongly nondiagonal Hessian. In determining which weight saliences are close to zero and close to each other, one must also take into account their correlation.

3 Tukey-Kramer Multiple Comparison Pruning
The common theme among many pruning algorithms is to remove weights that have the “least salience.” This is done in an iterative algorithm, and as seen in OBS and OBD, weights may be removed one at a time until the
Figure 3: A scatter plot of the 128 samplings of the bootstrap saliences $L_{11}^{(1)}$ and $L_{21}^{(1)}$ for the weights between input 1 and hidden node 1, $w_{11}^{(1)}$, and between input 2 and hidden node 1, $w_{21}^{(1)}$. The plot suggests a correlation between saliences that needs to be accounted for when making a statistical inference.
stopping criterion for pruning has been reached. It is typical that OBS and OBD will remove several weights. This introduces two questions: How does one determine if a small salience is truly "small"? And is there a way to justify the elimination of several weights at one time, saving pruning iterations? We believe we answer both of these questions in this article. We propose that using bootstrap training on an NN, we can estimate the distribution of the weight saliences and, in batch, eliminate those weights whose saliences are simultaneously found to be close to zero and not differing significantly. Bootstrap training allows us to compute mean salience estimates,

$$\bar{L}_p = \frac{\sum_{b=1}^{B} L_p^{*(b)}}{B},$$

and salience covariance estimates,

$$\mathrm{cov}[\bar{L}_p, \bar{L}_q] = \frac{1}{B - 1} \sum_{b=1}^{B} \left( L_p^{*(b)} - \bar{L}_p \right) \left( L_q^{*(b)} - \bar{L}_q \right).$$
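In numpy, the two estimates above reduce to a few lines. This is a sketch that assumes the B salience vectors have already been computed, one per bootstrap-trained network:

```python
import numpy as np

def bootstrap_salience_moments(saliences):
    """Mean and covariance of bootstrap salience replications.

    saliences -- array of shape (B, P): one row of P saliences per
                 bootstrap-trained network.
    """
    L = np.asarray(saliences, dtype=float)
    B = L.shape[0]
    mean = L.sum(axis=0) / B               # L-bar_p
    centered = L - mean
    cov = centered.T @ centered / (B - 1)  # cov[L-bar_p, L-bar_q]
    return mean, cov

# Tiny check with B = 4 replications of P = 2 saliences.
L = [[1.0, 2.0], [3.0, 2.0], [1.0, 4.0], [3.0, 4.0]]
mean, cov = bootstrap_salience_moments(L)
print(mean)  # [2. 3.]
```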
One can then use these estimates in the Tukey-Kramer multiple comparison procedure to form confidence intervals for the mean saliences, grouping together the smallest saliences that simultaneously do not differ significantly from each other. The weights corresponding to these saliences are then set to zero, and the NN is retrained with that constraint. We may continue this process to determine once again which weights in the new model affect the error insignificantly. The advantage over other pruning methods is that this procedure makes a decision to eliminate several weights at one time, perhaps reducing the number of pruning cycles required. In addition, the incorporation of statistical inference preserves any weight of small magnitude having salience significantly different from the least salience or zero. As we stated in section 2, this is consistent with the way parameters are removed from multiple regression models. This new approach is what we term Tukey-Kramer multiple comparison pruning (TKMCP).

4 Examples

4.1 Monk's Problem 1. Referring to the first Monk's problem in section 2.5, ideally we would like to select a network structure and complexity adequate for discerning the desirable characteristics, in this case, determining if a robot has the same head and body shape or is wearing a red coat. To illustrate the capability of TKMCP, a random training sample of 124 robots is drawn from the 432 legal examples. The remaining robots are used to estimate prediction accuracy. Bootstrap training is performed with B = 32 on this random training sample, using a bootstrap sample size of 124. Means and covariances of the weight saliences are estimated from the bootstrap distribution. After 18 pruning cycles of TKMCP, the number of neural network connections was reduced to roughly a quarter, from 58 to 15 weights. In six of the cycles, TKMCP found an insignificant difference at α = .05 between the least mean salience (the weight subject to pruning) and up to seven other saliences in the cycle.
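A sketch of the grouping step follows. The critical value q_crit is left as an input (in practice it would come from the studentized range distribution), and nothing here is the authors' exact implementation; the point is that the standard error of a difference of correlated means must use the covariance term:

```python
import numpy as np

def prune_group(mean_sal, cov_sal, q_crit):
    """Indices of saliences not significantly larger than the least salience.

    mean_sal -- (P,) bootstrap mean saliences
    cov_sal  -- (P, P) bootstrap covariance of the mean saliences
    q_crit   -- critical value for the pairwise comparison (assumed given)
    """
    m = np.argmin(mean_sal)
    group = []
    for p in range(len(mean_sal)):
        # Standard error of the difference of two correlated means.
        se = np.sqrt(cov_sal[p, p] + cov_sal[m, m] - 2.0 * cov_sal[p, m])
        if p == m or abs(mean_sal[p] - mean_sal[m]) <= q_crit * se:
            group.append(p)
    return group

# Hypothetical saliences: the first two are indistinguishably small,
# the third is clearly larger and should survive.
mean_sal = np.array([0.10, 0.12, 5.00])
cov_sal = np.diag([0.01, 0.01, 0.01])
print(prune_group(mean_sal, cov_sal, q_crit=3.0))  # [0, 1]
```

The weights whose indices land in the group would then be zeroed together, and the network retrained, which is the batch elimination described above.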
By zeroing the weights with corresponding "small" saliences (not significantly different), 25 pruning cycles were avoided when compared to the OBS procedure as presented by Hassibi and Stork (1993) and Hassibi et al. (1994). In looking at Figure 1, note that the pruned network retains connections related directly to the Monk's problem 1 definition. Head and torso shape and the color red are the only inputs relevant to decide whether the robot answers the description of a robot with the same head as torso or wearing the color red. The results nearly mirror the network pruned by OBS (Hassibi & Stork, 1993). Retraining the network with the 43 weights eliminated produces a classification error of 8.33% on the remaining robots, similar to what we achieved using OBS without multiple comparison
Table 1: Average Number of Weights Eliminated in the First Iteration of MCP for Three Different-Size Neural Networks.

Inputs   Hidden   Weights   Ave. Pruned
17       3        58        5.45
17       5        96        7.79
17       10       191       60.74
pruning. While TKMCP was as successful as OBS in trimming the network, the computational time required to collect enough bootstrap resamples for mean weight saliency estimation was much larger for MCP than for OBS. The floating-point operations (flops) for MCP were 40.1 gigaflops compared to 2.75 gigaflops for OBS. This order-of-magnitude difference may in part be attributed to the relatively small size of the initial network. (The computational burden for OBS comes primarily from the necessity of computing the inverse Hessian matrix, which has squared dimension equal to the number of weights in the model.) Also, reducing the number of bootstrap resampling iterations would benefit TKMCP's computation time, at the expense of perhaps slightly poorer pruning decisions at each pruning cycle.

To demonstrate the capability of TKMCP in a more favorable arrangement, consider an initial network of 10 hidden nodes. With 191 weights in the network, the training set must be expanded from 124 exemplars to something greater than the number of weights. In this case, 248 exemplars were drawn for the training set. After 32 repetitions of bootstrap training (i.e., B = 32), TKMCP performed a single subset elimination on those weights whose saliences were not significantly different from the least salience at α = .05. In 119 trials, TKMCP eliminated 30 to 90 weights, with approximately 61 weights on average in a single iteration. Table 1 shows a comparison of the average number of weights pruned on the first iteration of TKMCP for various network sizes. In competition with TKMCP (with the 10-hidden-node network), OBS was permitted to eliminate the same number of weights for each of the 119 trials. In this situation, TKMCP equaled OBS accuracy in all but two of the trials. Meanwhile, TKMCP required 13% fewer flops than OBS (an average of 113 gigaflops for TKMCP versus 131 gigaflops for OBS).

4.2 Simulation Study. A simulation to compare OBS and TKMCP on networks of varying size and complexity was conducted.
Feedforward NNs with varying numbers of inputs and hidden nodes and one output are examined. Because OBS is well grounded in NN pruning theory, the goal of this study is to demonstrate that TKMCP produces networks that are at least as good as OBS in terms of predictive ability and that TKMCP prunes to a suitable complexity when compared to OBS.
Table 2: F Test Results for the Input (IN), Hidden Node (HN), and Algorithm (A) Factors and Interactions for the Pruning Ratio Response.

Factor     F*      p
IN         174.2   < .0001
HN         6.6     0.0122
HN*IN      11.5    0.0011
A          8.2     0.0056
IN*A       2.3     0.1585
HN*A       0.3     0.5656
IN*HN*A    0.7     0.3942

Notes: $F^*$ for each of the factors is compared with the critical value $F = 7$ for the family significance level of $\alpha = 0.05$ ($\alpha_i = .01$). The p-value indicates the probability of being wrong in accepting the effect as relevant to the model. The main effects of input and algorithm and the interaction of hidden node with input meet the criterion as factors to model the pruning ratio.
The design for this study is a $2^3$ factorial design, with the design variables being the number of inputs (IN = 4 or 16), the number of hidden nodes (HN = 3 or 9), and the pruning algorithm (A = TKMCP or OBS). To create a truth model from which objective comparisons can be made, a real feedforward NN with 4 or 16 inputs, 3 or 9 hidden nodes, and 1 output is constructed. The weights $w = (w_{ij}^{(m)})$ are generated by sampling independently from a $N(0, 1)$ distribution, and then 25% of the input-to-hidden weights are randomly set to zero. This creates a sparser network, creating a need for pruning in the estimation step. A training sample $T = \{t_1, \ldots, t_N\}$ is then generated by independently sampling an input $x_i$ from a $d$-variate ($d$ = 4 or 16) normal distribution, $x_i \sim N_d(0, I)$, and calculating $t_i = (y(x_i, w), x_i)$. Training samples of size $N = 256$ are used. We then train an NN model with the same number of inputs and hidden nodes using $T$ as the training set and prune with either OBS or TKMCP to come up with a reduced network. The measure of quality used in this study is the ratio of the total number of weights in the true model to the number of weights found by the pruning algorithm (PR). The measure of predictive ability of the final network is the MSE for the test data set.

Upon completion of an analysis of variance (ANOVA) for the pruning ratio response, we compile the results of the F test in Table 2. The response is influenced by the network architecture as well as by the OBS and TKMCP algorithms. The initial size of the network (e.g., the HN*IN interaction) is a significant factor affecting algorithm performance. Both main effects (IN and HN) in Figures 4 and 5 would remain in the model to support the interaction effect. Table 2 clearly shows an effect of the algorithms on the pruning ratio response. Figures 6 and 7 show a weak interaction between the algorithms and the other two factors, which is consistent with the ANOVA. The aggre-
Figure 4: A plot of the pruning ratio (PR) for the two levels of model input (IN). The straight line connects the mean response of PR for IN levels 4 and 16. IN at levels 4 and 16 has an effect on PR response, as is apparent from the changing mean. The plot contains the data from both levels of hidden units and the OBS and TKMCP algorithms.
Figure 5: The effect that hidden node (HN) at levels 3 and 9 has on the pruning ratio response is less pronounced compared to IN in Figure 4. This is consistent with its lower $F^*$ and higher p-value in Table 2. The plot contains data from both levels of input and the OBS and TKMCP algorithms.
Figure 6: The interaction of inputs at levels 4 and 16 with the OBS and TKMCP algorithms is shown by the slightly nonparallel track of the respective pruning ratio means at all levels of hidden nodes. The $F^*$ and p-value do not support the inclusion of this interaction.
Figure 7: The interaction of hidden nodes at levels 3 and 9 with the OBS and TKMCP algorithms is shown by the slightly nonparallel track of the respective pruning ratio means at all levels of inputs. The $F^*$ and p-value do not support the inclusion of this interaction.
Table 3: Comparison of TKMCP and OBS Performance.

                        Average Training MSE    Average Test MSE
Hidden Nodes   Inputs   TKMCP     OBS           TKMCP     OBS
3              4        .0027     .0020         .0086     .0094
3              16       .0026     .0024         .0077     .0079
9              4        .0028     .0018         .0095     .0109
9              16       .0295     .0215         .0716     .0787
Notes: With 10 trials for each network architecture in the experiment, the averages of the mean square error of the final pruned networks suggest an insignificant difference between OBS and TKMCP in final network performance. It appears that OBS performs better on the training set but worse on the test set of exemplars, which suggests that TKMCP improves the pruned network's ability to generalize on future examples.
gate of Figures 6 and 7 would place the mean response of the pruning ratio slightly higher for TKMCP, suggesting somewhat more aggressive pruning, particularly for the larger networks. In terms of making a statistical conclusion with the associated ANOVA tables, the results for the pruning ratio meet the traditional homoscedasticity requirements, that is, the desirable condition of equal error variances for all cases. In contrast, if the error had varied as a function of the connection within the network, then we would have had heteroscedasticity, and the homoscedasticity requirement would have been violated.

Final network performance can be measured by a number of error rules. In keeping with the theme of section 2.4, we consider the MSE of the final network using the training and test data sets. The training data should measure the network's ability to understand the present data, and the test data should reflect the ability to generalize to future data. The results in Table 3 indicate that we have achieved our goal: although TKMCP produces worse performance than OBS on the training set, it produces a better generalizing network than OBS.

5 Conclusion
Neural networks are popular and useful flexible nonlinear models used extensively in engineering applications. Statistical understanding of these models is growing, and we believe this article sheds light on the issues and identifies potential directions for future work in model selection and pruning procedures and practice. Tukey-Kramer multiple comparison pruning has been shown to be a viable pruning algorithm when compared to Optimal Brain Surgeon, adding distinctive statistical inference for multiple weight elimination and the stopping criterion. Future work in this area should continue to refine algorithms with statistical advantages and com-
putational efficiency, incorporating appropriate distributions of the entire collection of neural network weights and saliences to augment the use of mean estimates.

Acknowledgments
D. D. currently works for the Defense Resources Management Institute, Naval Postgraduate School, Monterey, CA, and S. G. now works for the Air Force Research Laboratory Human Effectiveness Division, Wright-Patterson AFB, OH. The views expressed in this article are those of the authors and do not reflect the official policy or position of the U.S. Air Force, Department of Defense, or U.S. government. We thank the referees for making valuable suggestions that improved the article. We also thank Tom Reid and Tony White for their inputs during the final revisions of this article.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Brown, L. D. (1984). A note on the Tukey-Kramer procedure for pairwise comparisons of correlated means. In T. J. Santner & A. C. Tamhane (Eds.), Design of experiments: Ranking and selection (pp. 1–6). New York: Marcel Dekker.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics.
Gorodkin, J., Hansen, L. K., Lautrup, B., & Solla, S. A. (1997). Universal distribution of saliencies for pruning in layered neural networks. International Journal of Neural Systems, 8, 489–498.
Hadzikadic, M., & Bohren, B. F. (1997). Learning to predict: INC2.5. IEEE Transactions on Knowledge and Data Engineering, 9(1), 168–173.
Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Hassibi, B., Stork, D. G., & Wolff, G. (1994). Optimal Brain Surgeon: Extensions and performance comparisons. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 263–270). San Mateo, CA: Morgan Kaufmann.
Kramer, C. Y. (1957). Extension of multiple range tests to group correlated adjusted means. Biometrics, 13, 13–18.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal Brain Damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear statistical models. Chicago: Irwin.
Setiono, R. (1997). A penalty function approach for pruning feedforward neural networks. Neural Computation, 9, 185–204.
Thrun, S. B., Wnek, J., Michalski, R. S., Mitchell, T., & Cheng, J. (1991). The Monk's problems: A performance comparison of different algorithms (Tech. Rep. No. CMU-CS-91-197). Pittsburgh: Carnegie Mellon University, Computer Science Department.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.
van de Laar, P., & Heskes, T. (1999). Pruning using parameter and neuronal metrics. Neural Computation, 11, 977–993.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record, 4, 94–104.
Young, S., & Downs, T. (1998). CARVE—a constructive algorithm for real-valued examples. IEEE Transactions on Neural Networks, 9(6), 1180–1190.

Received August 23, 2000; accepted September 6, 2001.
LETTER
Communicated by Andrew Brown
Products of Gaussians and Probabilistic Minor Component Analysis

C. K. I. Williams
[email protected]
Division of Informatics, University of Edinburgh, Edinburgh EH1 2QL, U.K.

F. V. Agakov
[email protected]
System Engineering Research Group, Chair of Manufacturing Technology, Friedrich-Alexander-University Erlangen-Nuremberg, 91058 Erlangen, Germany
Recently, Hinton introduced the products of experts architecture for density estimation, where individual expert probabilities are multiplied and renormalized. We consider products of gaussian "pancakes" equally elongated in all directions except one and prove that the maximum likelihood solution for the model gives rise to a minor component analysis solution. We also discuss the covariance structure of sums and products of gaussian pancakes or one-factor probabilistic principal component analysis models.

1 Introduction
Recently, Hinton (1999) introduced a new product of experts (PoE) model for combining expert probabilities, where the probability $p(x \mid \theta)$ is computed as a normalized multiplication of the probabilities $p_i(x \mid \theta_i)$ of individual experts with parameters $\theta_i$:

$$p(x \mid \theta) = \frac{\prod_{i=1}^{m} p_i(x \mid \theta_i)}{\int \prod_{i=1}^{m} p_i(x' \mid \theta_i)\, dx'}. \qquad (1.1)$$
Under the PoE paradigm, experts may assign high probabilities to irrelevant regions of the data space as long as these probabilities are small under other experts (Hinton, 1999). Here, we consider a product of constrained gaussians in the d-dimensional data space, whose probability contours resemble d-dimensional gaussian "pancakes" (GP), contracted in one dimension and equally elongated in the other $d - 1$ dimensions. We refer to the model as a product of gaussian pancakes (PoGP) and show that it provides a probabilistic technique for minor component analysis (MCA). This MCA solution contrasts with

Neural Computation 14, 1169–1182 (2002) © 2002 Massachusetts Institute of Technology
probabilistic PCA (PPCA) (Tipping & Bishop, 1999; see also Roweis, 1998), which is a probabilistic method for PCA based on a factor analysis model with isotropic noise. The key difference is that PPCA is a model for the covariance matrix of the data, while PoGP is a model for the inverse covariance matrix of the data.

In section 2, we discuss products of gaussians. In section 3, we consider products of gaussian pancakes, discuss analytic solutions for maximum likelihood estimators for the parameters, and provide experimental evidence that the analytic solution is correct. Section 4 discusses the covariance structure of sums and products of gaussian pancakes and one-factor PPCA models.

2 Products of Gaussians
If each expert in equation 1.1 is a gaussian, $p_i(x \mid \theta_i) \sim N(\mu_i, C_i)$, the resulting distribution of the product may be expressed as

$$p(x \mid \theta) \propto \exp\left\{ -\frac{1}{2} \sum_{i=1}^{m} (x - \mu_i)^T C_i^{-1} (x - \mu_i) \right\}.$$

By completing the quadratic term in the exponent, it may easily be shown that $p(x \mid \theta) \sim N(\mu_S, C_S)$, where

$$C_S^{-1} = \sum_{i=1}^{m} C_i^{-1} \in \mathbb{R}^{d \times d}, \qquad \mu_S = C_S \left( \sum_i C_i^{-1} \mu_i \right) \in \mathbb{R}^{d}. \qquad (2.1)$$
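Equation 2.1 says that expert precisions (inverse covariances) add under a product of gaussians. A quick numerical sketch with two made-up experts in two dimensions:

```python
import numpy as np

# Two hypothetical gaussian experts (covariances and means are illustrative).
C1 = np.array([[2.0, 0.0], [0.0, 0.5]])
C2 = np.array([[1.0, 0.3], [0.3, 1.0]])
mu1 = np.array([1.0, 0.0])
mu2 = np.array([0.0, 2.0])

# Product-of-gaussians parameters of equation 2.1: precisions add,
# and the product mean is a precision-weighted combination.
prec = np.linalg.inv(C1) + np.linalg.inv(C2)  # C_S^{-1}
C_S = np.linalg.inv(prec)
mu_S = C_S @ (np.linalg.inv(C1) @ mu1 + np.linalg.inv(C2) @ mu2)

# The combined covariance is "tighter" than either expert alone:
# C1 - C_S is positive semidefinite.
print(np.all(np.linalg.eigvalsh(C1 - C_S) >= -1e-12))  # True
```

Adding a positive definite precision can only shrink the covariance, which is why multiplying experts sharpens the distribution rather than broadening it.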
To simplify the following derivations, we will assume that $p_i(x \mid \theta_i) \sim N(0, C_i)$ and thus that $p(x \mid \theta) \sim N(0, C_S)$; $\mu_S = 0$ can be obtained by translation of the coordinate system. The log likelihood for independently and identically distributed (i.i.d.) data under the PoG is then expressed as

$$L(C_S) = -\frac{N}{2} d \ln 2\pi + \frac{N}{2} \ln |C_S^{-1}| - \frac{N}{2} \mathrm{tr}[C_S^{-1} S]. \qquad (2.2)$$
Here N is the number of sample points, and S is the sample covariance matrix with the assumed zero mean, that is,

$$S = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T, \qquad (2.3)$$
where $x_i$ denotes the ith data point.

3 Products of Gaussian Pancakes
In this section we describe the covariance structure of a GP expert and a product of GP experts and discuss the maximum likelihood (ML) solution for the parameters of the PoGP model.
Figure 1: Probability contour of a gaussian pancake in $\mathbb{R}^3$.

3.1 Covariance Structure of a GP Expert. Consider a d-dimensional gaussian whose probability contours are contracted in direction $\hat{w}$ and equally elongated in directions $v_1, \ldots, v_{d-1}$ (see Figure 1). Its inverse covariance may be written as
$$C^{-1} = \sum_{i=1}^{d-1} v_i v_i^T \beta_0 + \hat{w} \hat{w}^T \beta_{\hat{w}} \in \mathbb{R}^{d \times d}, \qquad (3.1)$$
where $v_1, \ldots, v_{d-1}, \hat{w}$ form a $d \times d$ matrix of normalized eigenvectors of the covariance $C$, and $\beta_0 = \sigma_0^{-2}$, $\beta_{\hat{w}} = \sigma_{\hat{w}}^{-2}$ define the inverse variances in the directions of elongation and contraction, respectively, so that $\sigma_0^2 \geq \sigma_{\hat{w}}^2$. Expression 3.1 can be rewritten in a more compact form as

$$C^{-1} = \beta_0 I_d + (\beta_{\hat{w}} - \beta_0) \hat{w} \hat{w}^T = \beta_0 I_d + w w^T, \qquad (3.2)$$
where $w = \hat{w} \sqrt{\beta_{\hat{w}} - \beta_0}$ and $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix. Notice that according to the constraint considerations, $\beta_0 < \beta_{\hat{w}}$, and all elements of $w$ are real valued. We see that the covariance $C$ of the data of a GP expert can be uniquely determined by the weight vector $w$, collinear with the direction of contraction,
and the variance in the direction of elongation, $\sigma_0^2 = \beta_0^{-1}$. We can further notice the similarity of equation 3.2 to the expression for the covariance of the data of a one-factor probabilistic principal component analysis model, $C = \sigma^2 I_d + w w^T$ (Tipping & Bishop, 1999), where $\sigma^2$ is the variance of the factor-independent spherical gaussian noise. The only difference is that for the constrained gaussian model it is the inverse covariance matrix, rather than the covariance matrix, that has the structure of a rank-1 update to a multiple of $I_d$.

3.2 Covariance of the PoGP Model. We now consider a product of m GP experts, each contracted in a single dimension. We refer to the model as a (1, m) PoGP, where 1 represents the number of directions of contraction of each expert. We also assume that all experts have identical means. From equations 2.1 and 3.1, the inverse covariance of the resulting (1, m) PoGP model is expressed as

$$C_S^{-1} = \sum_{i=1}^{m} C_i^{-1} = \beta_S I_d + W W^T \in \mathbb{R}^{d \times d}, \qquad (3.3)$$

where the columns of $W \in \mathbb{R}^{d \times m}$ correspond to the weight vectors of the m PoGP experts, and $\beta_S = \sum_{i=1}^{m} \beta_0^{(i)} > 0$. Comparing equation 3.3 with the m-factor PPCA, we can make the conjecture that, in contrast with the PPCA model, where the ML weights correspond to principal components of the data covariance (Tipping & Bishop, 1999), the weights W of the PoGP model define a projection onto the m minor eigenvectors of the sample covariance in the visible d-dimensional space, while the distortion term $\beta_S I_d$ explains larger variations.^1 The proof of this conjecture is discussed in the appendix. For the covariance described by equation 3.3, we find that

$$p(x) \propto \exp\left\{ -\frac{1}{2} x^T \left( \beta_S I_d + \sum_{i=1}^{m} a_i \hat{w}_i \hat{w}_i^T \right) x \right\}, \qquad (3.4)$$
O i is a unit vector in the direction of wi and ai D | wi | 2 . This distribution where w can be given a maximum entropy interpretation. From Cover and Thomas (1991, equation 11.4), the maximum entropy distribution obeying P the constraints E[ fi (x ) ] D ri , i D 1, . . . , c is of the form p (x ) / exp ¡ ciD 1 li fi ( x) . Hence, we see that equation 3.4 can be interpreted as a maximum entropy O Ti x) 2 ], i D 1, . . . , m and on E[xT x]. distribution with constraints on E[ (w 1 Because equation 3.3 has the form of a factor analysis decomposition but for the inverse covariance matrix, we sometimes refer to PoGP as the rotcaf model.
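The construction in equations 3.3 and 3.4 is easy to check numerically. The following sketch (our own illustration; variable names and parameter values are arbitrary, not taken from the original experiments) builds the $(1, m)$ PoGP inverse covariance in both forms and confirms that they coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2

# Hypothetical expert parameters: weight vectors w_i (columns of W) and the
# pooled distortion term beta_Sigma = sum_i beta_0^(i) > 0.
W = rng.standard_normal((d, m))
beta_sigma = 1.5

# Equation 3.3: inverse covariance of the (1, m) PoGP model.
C_inv = beta_sigma * np.eye(d) + W @ W.T

# Equation 3.4 rewrites the same matrix via unit directions w_hat_i and
# squared lengths a_i = |w_i|^2.
a = np.sum(W**2, axis=0)
W_hat = W / np.sqrt(a)
C_inv_34 = beta_sigma * np.eye(d) + (W_hat * a) @ W_hat.T

assert np.allclose(C_inv, C_inv_34)
assert np.all(np.linalg.eigvalsh(C_inv) > 0)   # a valid precision matrix

def log_density_unnorm(x):
    """Unnormalized log p(x) of the zero-mean PoGP, equation 3.4."""
    return -0.5 * x @ C_inv @ x
```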
Products of Gaussians and Probabilistic Minor Component Analysis
1173
3.3 Maximum-Likelihood Solution for PoGP. In the appendix, it is shown that the likelihood 2.2 is maximized when

$$W_{ML} = U_m \left( \Lambda^{-1} - \beta_\Sigma^{ML} I_m \right)^{1/2} R^T, \qquad \beta_\Sigma^{ML} = \frac{d - m}{\sum_{i=m+1}^{d} \lambda_i}, \tag{3.5}$$
where $U_m$ is the $d \times m$ matrix of the $m$ minor components of the sample covariance $S$, $\Lambda$ is the $m \times m$ matrix of the corresponding eigenvalues, and $R$ is an arbitrary $m \times m$ orthogonal rotation matrix. As in PPCA (Tipping & Bishop, 1999), the distortion term accounts for the variance in the directions lying outside the space spanned by $W_{ML}$. Thus, the maximum likelihood solution for the weights of the $(1, m)$ PoGP model corresponds to $m$ scaled and rotated minor eigenvectors of the sample covariance $S$ and leads to a probabilistic model of minor component analysis. As in the PPCA model, the number of experts $m$ is assumed to be lower than the dimension of the data space $d$.

3.4 Experimental Confirmation. In order to confirm the analytic results, we have performed experiments comparing the derived ML parameters of the PoGP model with the convergence points of scaled conjugate gradient (SCG) optimization (Bishop, 1995; Møller, 1993) performed on the log likelihood 2.2. We considered different data sets in data spaces of dimension $d$ varying from 2 to 25, with $m = 1, \ldots, d - 1$ constrained gaussian experts. For each choice of $d$ and $m$, we looked at three types of sample covariance matrices, resulting in different types of solutions for the weights $W \in \mathbb{R}^{d \times m}$. In the first case, $S$ was set to have $s \leq d - m$ identical largest eigenvalues, and we expected all expert weights to be retained in $W$. In the second case, $S$ was set up in such a way that $d - m < s < d$, so that the variance in some directions could be explained by the noise term and some columns of $W$ could be equal to zero. Finally, we considered the degenerate case in which the sample covariance was a multiple of the identity matrix $I_d$, and expected $W = 0$ (see equations 3.5 and 3.2). For all of the considered cases, we performed 30 runs of scaled conjugate gradient optimization of the likelihood, started at different points of the parameter space $\{W, \beta_\Sigma\}$. To ensure that $\beta_\Sigma$ was nonnegative, we parameterized it as $\beta_\Sigma = e^\xi$, $\xi \in \mathbb{R}$.
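Independently of the SCG runs, equation 3.5 can also be verified directly with a few lines of numpy (a sketch on synthetic data; all names and sizes below are our own choices, not the authors' experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, N = 6, 2, 5000

# Synthetic data with a generic (full-rank) sample covariance.
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))
S = X.T @ X / N

lam, U = np.linalg.eigh(S)                    # eigenvalues in ascending order
# Equation 3.5: retain the m minor eigenvectors; the remaining eigenvalues
# are pooled into the distortion term.
beta_ml = (d - m) / lam[m:].sum()
W_ml = U[:, :m] @ np.diag(np.sqrt(1.0 / lam[:m] - beta_ml))   # take R = I_m

C_inv = beta_ml * np.eye(d) + W_ml @ W_ml.T

# In each retained (minor) direction the model covariance matches S exactly:
for i in range(m):
    assert np.allclose(C_inv @ U[:, i], U[:, i] / lam[i])
```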
For each data set, the SCG algorithm converged to approximately the same value of the log likelihood (with the largest observed standard deviation of the likelihood $\sim 10^{-12}$). The largest observed absolute error between the analytic parameters $\{W^{(A)}, \beta_\Sigma^{(A)}\}$ and the SCG parameters $\{W^{(SCG)}, \beta_\Sigma^{(SCG)}\}$ satisfied

$$\max_{i,j} \left[ \left| W^{(A)} - W^{(SCG)} R \right| \right]_{ij} \lesssim 10^{-8}, \qquad \left| \beta_\Sigma^{(A)} - \beta_\Sigma^{(SCG)} \right| \lesssim 10^{-8},$$
where $R \in \mathbb{R}^{m \times m}$ is a rotation matrix. Moreover, each numeric solution resulted in the expected type of $W^{(SCG)}$ for each type of sample covariance. The experiments therefore confirm that the method of scaled conjugate gradients converges to a point in the parameter space that yields the same value of the log likelihood as the analytically obtained solution (up to the expected arbitrary rotation factor for the weights).

3.5 Intuitive Interpretation of the PoGP Model. An intuitive interpretation of the PoGP model is as follows: Each gaussian pancake imposes an approximate linear constraint in $x$ space, namely that $x$ should lie close to a particular hyperplane. The conjunction of these constraints is given by the product of the gaussian pancakes. If $m \ll d$, it makes sense to define the resulting gaussian distribution in terms of the constraints. However, if there are many constraints ($m > d/2$), it can be more efficient to describe the directions of large variability using a PPCA model rather than the directions of small variability using a PoGP model. This issue is discussed by Xu, Krzyzak, and Oja (1991) in what they call the dual subspace pattern recognition method, where both PCA and MCA models are used (although their work does not use explicit probabilistic models such as PPCA and PoGP).

4 Related Covariance Structures
We have discussed the covariance structure of a PoGP model. For real-valued densities, an alternative method for combining $m$ experts is to sum the random vectors produced by each expert, $x = \sum_{i=1}^{m} x_i$; let us call this a sum of experts (SoE) model. (Note that the SoE model refers to the sum of the random vectors, while the PoE model refers to the product of their densities. The distribution of the sum of random variables is obtained by the convolution of their densities.) For zero-mean gaussians, the covariance of the SoE is simply the sum of the individual covariances, in contrast to the PoE, where the overall inverse covariance is the sum of the individual inverse covariances. Hence, we see that the PPCA model is in fact an SoE model, where each expert is a one-factor PPCA model. This leads us to ask what the covariance structures are for a product of one-factor PPCA models and a sum of gaussian pancakes. These questions are discussed below.

4.1 Covariance of the PoPPCA Model. We consider a product of $m$ one-factor PPCA models, denoted as $(1, m)$ PoPPCA. Geometrically, the probability contours of a one-factor PPCA model in $\mathbb{R}^d$ are $d$-dimensional hyperellipsoids elongated in one dimension and contracted in the other $d - 1$ directions. The covariance matrix of 1-PPCA is expressed as

$$C = ww^T + \sigma^2 I \in \mathbb{R}^{d \times d}, \tag{4.1}$$
where the weight vector $w \in \mathbb{R}^d$ defines the direction of elongation and $\sigma^2$ is the variance in the directions of contraction. Its inverse covariance is

$$C^{-1} = \beta I_d - \frac{\beta^2 ww^T}{1 + w^T w \beta} = \beta I_d - \beta\gamma\, ww^T, \tag{4.2}$$
where $\beta = \sigma^{-2}$ and $\gamma = \beta / (1 + \|w\|^2 \beta)$; $\beta$ and $\gamma$ are the inverse variances in the directions of contraction and elongation, respectively. Plugging equation 4.2 into 2.1, we obtain

$$C_\Sigma^{-1} = \sum_{i=1}^{m} C_i^{-1} = \beta_\Sigma I_d - WBW^T, \qquad B = \mathrm{diag}(\beta_1\gamma_1, \beta_2\gamma_2, \ldots, \beta_m\gamma_m), \tag{4.3}$$
where $W = [w^{(1)}, \ldots, w^{(m)}] \in \mathbb{R}^{d \times m}$ is the weight matrix with columns corresponding to the weights of the individual experts, $\beta_\Sigma$ is the sum of the inverse noise variances of all experts, and $B$ may be thought of as a squared scaling factor on the columns of $W$. We can rewrite expression 4.3 as

$$C_\Sigma^{-1} = \beta_\Sigma I_d - \tilde W \tilde W^T \in \mathbb{R}^{d \times d}, \tag{4.4}$$

where $\tilde W = W B^{1/2}$ are implicitly scaled weights.

4.2 Maximum-Likelihood Solution for PoPPCA. Our studies show that the ML solution for the $(1, m)$ PoPPCA model can be equivalent to the ML solution for $m$-factor PPCA, but only when rather restrictive conditions apply. Consider the simplified case when the noise variance $\beta^{-1}$ is the same for each one-factor PPCA model. Then equation 4.3 reduces to

$$C_\Sigma^{-1} = m\beta I_d - \beta W \Gamma W^T, \qquad \Gamma = \mathrm{diag}(\gamma_1, \gamma_2, \ldots, \gamma_m). \tag{4.5}$$
An $m$-factor PPCA model has covariance $\sigma^2 I_d + WW^T$ and thus, by the Woodbury formula (see, e.g., Press, Teukolsky, Vetterling, & Flannery, 1992), it has inverse covariance $\beta I_d - \beta W(\sigma^2 I_m + W^T W)^{-1} W^T$. The maximum likelihood solution for an $m$-PPCA model is similar to equation 3.5, that is, $\hat W = U(\Lambda - \sigma^2 I_m)^{1/2} R^T$, but now $\Lambda$ is a diagonal matrix of the $m$ principal eigenvalues, and $U$ is a matrix of the corresponding eigenvectors. If we choose $R^T = I_m$, then the columns of $\hat W$ are orthogonal, and the inverse covariance of the maximum likelihood $m$-PPCA model has the form $\beta I_d - \beta \hat W \Gamma \hat W^T$. Comparing this to equation 4.5 (and setting $W = \hat W$), we see that the difference is that the first term on the right-hand side of equation 4.5 is $\beta m I_d$, while for $m$-PPCA it is $\beta I_d$.
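The algebra of equations 4.2 through 4.4 can be checked mechanically. The sketch below is our own illustration (expert parameters are arbitrary); it builds each expert's precision from equation 4.2, sums them, and compares against the scaled-weight form of equation 4.4:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 3

Ws = rng.standard_normal((d, m))          # expert weight vectors w^(i) (columns)
betas = rng.uniform(0.5, 2.0, size=m)     # beta_i = sigma_i^{-2}

# Equation 4.2: inverse covariance of a single one-factor PPCA expert.
def ppca_precision(w, beta):
    gamma = beta / (1.0 + w @ w * beta)
    return beta * np.eye(d) - beta * gamma * np.outer(w, w)

C_sum_inv = sum(ppca_precision(Ws[:, i], betas[i]) for i in range(m))

# Equations 4.3-4.4: the same quantity via the scaled weight matrix W~.
beta_sigma = betas.sum()
gammas = betas / (1.0 + np.sum(Ws**2, axis=0) * betas)
W_tilde = Ws * np.sqrt(betas * gammas)    # columns scaled by (beta_i gamma_i)^{1/2}
C_44_inv = beta_sigma * np.eye(d) - W_tilde @ W_tilde.T

assert np.allclose(C_sum_inv, C_44_inv)

# Sanity check of equation 4.2 against direct inversion of equation 4.1.
C_direct = np.linalg.inv(np.outer(Ws[:, 0], Ws[:, 0]) + np.eye(d) / betas[0])
assert np.allclose(C_direct, ppca_precision(Ws[:, 0], betas[0]))
```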
In section 3.4 and appendix C.3 of Agakov (2000), it is shown that (for $m \geq 2$) we obtain the $m$-factor PPCA solution when

$$\bar\lambda \leq \lambda_i < \frac{m}{m-1}\,\bar\lambda, \qquad i = 1, \ldots, m, \tag{4.6}$$

where $\bar\lambda$ is the mean of the $d - m$ discarded eigenvalues and $\lambda_i$ is a retained eigenvalue; it is the smaller eigenvalues that are discarded. We see that the covariance must be nearly spherical for this condition to hold. For covariance matrices satisfying equation 4.6, this solution was confirmed by numerical experiments similar to those in section 3.4, as detailed in Agakov (2000, section 3.5). To see why this is true intuitively, observe that $C_i^{-1}$ for each one-factor PPCA expert will be large (with value $\beta$) in all directions except one. If the directions of contraction for each $C_i^{-1}$ are orthogonal, we see that the sum of the inverse covariances will be at least $(m - 1)\beta$ in a contracted direction and $m\beta$ in a direction in which no contraction occurs. The above shows that for certain types of sample covariance matrix, the $(1, m)$ PoPPCA solution is not equivalent to the $m$-factor PPCA solution. Notice, however, that an $m$-PPCA can be modeled rather trivially by an $m$-factor $k$-expert ($k > 1$) PoPPCA [an $(m, k)$ PoPPCA] by taking the product of a single $m$-PPCA expert with $k - 1$ "large" spherical gaussians, with all experts having identical means.²

4.3 Sums of Gaussian Pancakes. A gaussian pancake has $C^{-1} = \beta_0 I_d + ww^T$. By arguments analogous to those above, we find that a sum of gaussian pancakes has covariance $C_{SGP} = \sigma^2 I_d - \tilde W \tilde W^T$, where $\tilde W$ is a rescaled $W$ with a somewhat different definition than in equation 4.4. Analogous to section 4.2, we would expect that an ML solution for the sum of gaussian pancakes would give an MCA solution when the sample covariance is near spherical, but that the solution would not have a simple relationship to the eigendecomposition of $S$ when this is not the case.

5 Discussion
We have considered the product of $m$ gaussian pancakes. The analytic derivations for the optimal model parameters have confirmed the initial hypothesis that the PoGP gives rise to minor component analysis. We have also confirmed by experiment that the analytic solutions correspond to the ones obtained by applying optimization methods to the log likelihood.

² Recently, Marks and Movellan have shown that the product of $m$ one-factor factor analysis models is equivalent to an $m$-factor factor analyzer (T. K. Marks and J. R. Movellan, Diffusion Networks, Products of Experts, and Factor Analysis. Proceedings of the 3rd International Conference on Independent Component Analysis and Blind Source Separation, 2001).
We have shown that $(1, m)$ PoGP can be viewed as a probabilistic MCA model. MCA can be used, for example, for signal extraction in digital signal processing (Oja, 1992), dimensionality reduction, and data visualization. Extraction of the minor component is also used in the Pisarenko harmonic decomposition method for detecting sinusoids in white noise (see, e.g., Proakis & Manolakis, 1992, p. 911). Formulating minor component analysis as a probabilistic model simplifies comparison of the technique with other dimensionality-reduction procedures, permits extending MCA to a mixture of MCA models (which would be modeled as a mixture of products of gaussian pancakes), permits using PoGP in classification tasks (if each PoGP model defines a class-conditional density), and leads to a number of other advantages over nonprobabilistic MCA models (see the discussion of the advantages of PPCA over PCA in Tipping & Bishop, 1999). In section 4, we discussed the relationship of sums and products of gaussian models and showed that the sum of gaussian pancakes and the $(1, m)$ PoPPCA model have rather low representational power compared to PPCA or PoGP. In this article, we have considered sums and products of gaussians with gaussian pancake or PPCA structure. It is possible to apply the sum and product operations to models with other covariance structures; for example, Williams and Felderhof (2001) consider sums and products of tree-structured gaussians and study their relationship to AR and MA processes.

Appendix A: ML Solutions for PoGP
In this appendix, we derive the ML equations 3.5. In section A.1, we derive the conditions that must hold at a stationary point of the likelihood. In section A.2, these stationarity conditions are expressed in terms of the eigenvectors of $S$. In section A.3, it is shown that to maximize the likelihood, the $m$ minor eigenvectors must be retained. The derivations in this appendix are the analogs for the PoGP model of those in appendix A of Tipping and Bishop (1999) for the PPCA model.

A.1 Derivation of ML Equations. From equation 2.2, we can specify the ML conditions for a parameter $\Theta$ of a gaussian:

$$\frac{\partial L}{\partial \Theta} = \frac{N}{2}\left[ \frac{\partial \ln |C_\Sigma^{-1}|}{\partial \Theta} - \frac{\partial\, \mathrm{tr}(C_\Sigma^{-1} S)}{\partial \Theta} \right] = 0. \tag{A.1}$$
In our case $\Theta = \{W, \xi\}$, where $W$ is the weight matrix and $\xi$ is the log of the distortion term $\beta_\Sigma$. Using the rules for matrix differentiation given in Magnus and Neudecker (1999), we obtain

$$\frac{\partial L}{\partial W} = N(C_\Sigma W - SW), \qquad \frac{\partial L}{\partial \xi} = \beta_\Sigma N\, \mathrm{tr}(C_\Sigma - S). \tag{A.2}$$
(See Williams & Agakov, 2001, for further details.) Dropping the constant factors, we obtain the ML equations for $W$ and $\xi$:

$$C_\Sigma W - SW = 0, \qquad \mathrm{tr}(C_\Sigma - S) = 0. \tag{A.3}$$
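Both conditions in equation A.3 can be verified numerically at the analytic solution of equation 3.5 (a sketch with a synthetic sample covariance; the names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 6, 2
A = rng.standard_normal((d, d))
S = A @ A.T                                  # generic sample covariance

lam, U = np.linalg.eigh(S)                   # ascending eigenvalues
beta_ml = (d - m) / lam[m:].sum()            # equation 3.5
W_ml = U[:, :m] @ np.diag(np.sqrt(1.0 / lam[:m] - beta_ml))

C_sigma = np.linalg.inv(beta_ml * np.eye(d) + W_ml @ W_ml.T)

assert np.allclose(C_sigma @ W_ml, S @ W_ml)     # first condition of A.3
assert abs(np.trace(C_sigma - S)) < 1e-8         # second condition of A.3
```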
Note that in order to find the maximum-likelihood solution for the weight matrix $W$ and the term $\beta_\Sigma$, both equations in A.3 should hold simultaneously.

A.2 Stationary Points of the Log Likelihood. There are three classes of solutions to the equation $C_\Sigma W - SW = 0$:

$$W = 0; \qquad S = C_\Sigma; \qquad SW = C_\Sigma W,\; W \neq 0,\; S \neq C_\Sigma. \tag{A.4}$$
The first of these, $W = 0$, is uninteresting and corresponds to a minimum of the log likelihood. In the second case, the model covariance is equal to the sample covariance, and $C_\Sigma$ is exact. In the third case, while $SW = C_\Sigma W$, $S \neq C_\Sigma$, and the model covariance is said to be approximate. By analogy with Tipping and Bishop (1999), we will consider the singular value decomposition of the weight matrix and establish dependencies between the left singular vectors of $W = U L R^T$ and the eigenvectors of the sample covariance $S$. Here $U = [u_1, u_2, \ldots, u_m] \in \mathbb{R}^{d \times m}$ is a matrix of left singular vectors of $W$ with columns constituting an orthonormal basis, $L = \mathrm{diag}(l_1, l_2, \ldots, l_m) \in \mathbb{R}^{m \times m}$ is a diagonal matrix of the singular values of $W$, and $R \in \mathbb{R}^{m \times m}$ defines an arbitrary rigid rotation of $W$.

A.2.1 Exact Model Covariance. Considering the nonsingularity of $S$ and $C_\Sigma$, we find $C_\Sigma = S \Rightarrow C_\Sigma^{-1} = S^{-1}$. As $C_\Sigma^{-1} = \beta_\Sigma I_d + WW^T$, we obtain

$$WW^T = U L^2 U^T = S^{-1} - \beta_\Sigma I_d. \tag{A.5}$$
This has the known solution $W = U_m(\Lambda^{-1} - \beta_\Sigma I_m)^{1/2} R^T$, where $U_m$ is the matrix of the $m$ eigenvectors of $S$ with the smallest eigenvalues and $\Lambda$ is the corresponding diagonal matrix of eigenvalues. The sample covariance must be such that its largest $d - m$ eigenvalues are all equal to $\beta_\Sigma^{-1}$; the other $m$ eigenvalues are matched explicitly.

A.2.2 Approximate Model Covariance. Applying the matrix inversion lemma (see, e.g., Press et al., 1992) to expression 3.3 for the inverse covariance of the PoGP gives

$$C_\Sigma = \beta_\Sigma^{-1}\left( I_d - W(\beta_\Sigma I_m + W^T W)^{-1} W^T \right) \in \mathbb{R}^{d \times d}. \tag{A.6}$$
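Equation A.6 is the matrix inversion (Woodbury) lemma applied to equation 3.3, and it admits a short numerical check (our own, with arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(6)
d, m = 5, 2
W = rng.standard_normal((d, m))
beta_sigma = 0.7

C_inv = beta_sigma * np.eye(d) + W @ W.T
# Equation A.6: the same covariance via the matrix inversion lemma.
C_lemma = (np.eye(d)
           - W @ np.linalg.inv(beta_sigma * np.eye(m) + W.T @ W) @ W.T) / beta_sigma

assert np.allclose(np.linalg.inv(C_inv), C_lemma)
```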
Solving equation A.3 for the approximate model covariance results in

$$C_\Sigma W = SW \;\Rightarrow\; C_\Sigma U L = S U L. \tag{A.7}$$

Substituting equation A.6 for $C_\Sigma$, we obtain

$$\begin{aligned} C_\Sigma U L &= \left(\beta_\Sigma^{-1} I_d - \beta_\Sigma^{-1} W(\beta_\Sigma I_m + W^T W)^{-1} W^T\right) U L \\ &= \left(\beta_\Sigma^{-1} I_d - \beta_\Sigma^{-1} U L R^T(\beta_\Sigma I_m + R L^2 R^T)^{-1} R L U^T\right) U L \\ &= U\left(\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1} L R^T(\beta_\Sigma I_m + R L^2 R^T)^{-1} R L\right) L \\ &= U\left(\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1}(\beta_\Sigma L^{-2} + I_m)^{-1}\right) L. \end{aligned} \tag{A.8}$$

Thus,

$$S U L = U\left(\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1}(\beta_\Sigma L^{-2} + I_m)^{-1}\right) L. \tag{A.9}$$
Notice that the term $\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1}(\beta_\Sigma L^{-2} + I_m)^{-1}$ on the right-hand side of equation A.9 is a diagonal matrix (i.e., just a scaling factor of $U$). Equation A.9 defines the matrix form of the eigenvector equation, with both sides postmultiplied by the diagonal matrix $L$. If $l_i \neq 0$, then equation A.9 implies that

$$C_\Sigma u_i = S u_i = \lambda_i u_i, \tag{A.10}$$

$$\lambda_i = \beta_\Sigma^{-1}\left( 1 - (\beta_\Sigma l_i^{-2} + 1)^{-1} \right), \tag{A.11}$$

where $u_i$ is an eigenvector of $S$ and $\lambda_i$ is its corresponding eigenvalue. The scaling factor $l_i$ of the $i$th retained expert can be expressed as

$$l_i = (\lambda_i^{-1} - \beta_\Sigma)^{1/2} \;\Rightarrow\; \lambda_i \leq \frac{1}{\beta_\Sigma}. \tag{A.12}$$
This result resembles the solution for the scaling factors in the PPCA case (cf. Tipping & Bishop, 1999). However, in contrast to the PPCA solution, where $l_i = (\lambda_i - \sigma^2)^{1/2}$, we notice that $\lambda_i$ and $\sigma^2$ are replaced by their inverses. Obviously, if $l_i = 0$, then $u_i$ is arbitrary. If $l_i = 0$, we say that the direction corresponding to $u_i$ is discarded; that is, the variance in that direction is explained merely by noise. Otherwise, we say that $u_i$ is retained. All potential solutions for $W$ may then be expressed as

$$W = U_m(D - \beta_\Sigma I_m)^{1/2} R^T, \tag{A.13}$$

where $R \in \mathbb{R}^{m \times m}$ is an arbitrary rotation matrix, $U_m = [u_1\, u_2 \ldots u_m] \in \mathbb{R}^{d \times m}$ is a matrix whose columns correspond to $m$ eigenvectors of $S$, and $D = \mathrm{diag}(d_1, d_2, \ldots, d_m) \in \mathbb{R}^{m \times m}$ with

$$d_i = \begin{cases} \lambda_i^{-1} & \text{if } u_i \text{ is retained;} \\ \beta_\Sigma & \text{if } u_i \text{ is discarded.} \end{cases} \tag{A.14}$$
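Equations A.13 and A.14 can be exercised numerically: a discarded direction contributes a zero column to $W$, while retained directions are matched exactly by the model. This is our own sketch with a synthetic $S$:

```python
import numpy as np

rng = np.random.default_rng(9)
d, m = 6, 3
A = rng.standard_normal((d, d))
S = A @ A.T
lam, U = np.linalg.eigh(S)            # ascending eigenvalues

beta_sigma = (d - m) / lam[m:].sum()  # noise level consistent with A.19 (r = m)

# Equation A.14: retain the first two minor directions, discard the third.
retained = [True, True, False]
D = np.diag([1.0 / lam[i] if retained[i] else beta_sigma for i in range(m)])

# Equation A.13 with R = I_m; the discarded direction yields a zero column.
W = U[:, :m] @ np.sqrt(D - beta_sigma * np.eye(m))
assert np.allclose(W[:, 2], 0)

# Retained directions are matched exactly by the model covariance.
C_inv = beta_sigma * np.eye(d) + W @ W.T
for i in range(2):
    assert np.allclose(C_inv @ U[:, i], U[:, i] / lam[i])
```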
A.3 Properties of the Optimal Solution. Below we show that for the PoGP model, the discarded eigenvalues must be adjacent within the sorted spectrum of $S$ and that for the maximum likelihood solution, the smallest eigenvalues should be retained. Let $r \leq m$ be the number of eigenvectors of $S$ retained in $W_{ML}$. Since the SVD representation of a matrix is invariant under simultaneous permutations of the order of the left and right singular vectors (e.g., Golub & Van Loan, 1997), we can assume without any loss of generality that the first eigenvectors $u_1, \ldots, u_r$ are retained and the remaining eigenvectors $u_{r+1}, \ldots, u_m$ are discarded. In order to investigate the nature of the eigenvectors retained in $W_{ML}$, we will express the log likelihood, equation 2.2, of a gaussian through the eigenvectors and eigenvalues of its covariance matrix. From the expression for the PoGP weights, A.13, and the form of the model's inverse covariance, 3.3, $C_\Sigma^{-1}$ can be expressed as follows:

$$C_\Sigma^{-1} = \sum_{i=1}^{r} u_i u_i^T \lambda_i^{-1} + \sum_{i=r+1}^{d} u_i u_i^T \beta_\Sigma. \tag{A.15}$$
Since determinants and traces can be expressed as products and sums of eigenvalues, respectively, we see that

$$\ln |C_\Sigma^{-1}| = \ln\left( \beta_\Sigma^{(d-r)} \prod_{i=1}^{r} \lambda_i^{-1} \right) = -\sum_{i=1}^{r} \ln(\lambda_i) + (d - r) \ln \beta_\Sigma, \tag{A.16}$$

$$\mathrm{tr}(C_\Sigma^{-1} S) = r + \beta_\Sigma \sum_{i=r+1}^{d} \lambda_i. \tag{A.17}$$
By substituting these expressions into the form of $L$ for a gaussian, equation 2.2, we get

$$L(W_{ML}) = -\frac{N}{2}\left[ d \ln(2\pi) + \sum_{i=1}^{r} \ln(\lambda_i) - (d - r) \ln \beta_\Sigma + r + \beta_\Sigma \sum_{i=r+1}^{d} \lambda_i \right]. \tag{A.18}$$
Differentiating equation A.18 with respect to $\beta_\Sigma$ and equating the result to zero gives

$$\beta_\Sigma^{ML} = \frac{d - r}{\sum_{i=r+1}^{d} \lambda_i}. \tag{A.19}$$
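The claim made at the start of section A.3, that the smallest eigenvalues should be retained, can be probed by brute force: evaluate equation A.18 at the optimal $\beta_\Sigma$ of equation A.19 for every feasible choice of retained eigenvalues (feasible meaning the constraint of equation A.12, $\lambda_i \leq 1/\beta_\Sigma$, holds). This is our own sketch; the eigenvalues are random:

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
d, r, N = 6, 2, 1.0                      # N merely scales the log likelihood

lam = np.sort(rng.uniform(0.5, 5.0, size=d))   # hypothetical eigenvalues of S

def loglik(retained):
    """Equation A.18 evaluated at beta_Sigma from equation A.19."""
    discarded = [i for i in range(d) if i not in retained]
    beta = (d - r) / lam[discarded].sum()
    return -(N / 2) * (d * np.log(2 * np.pi) + np.log(lam[retained]).sum()
                       - (d - r) * np.log(beta) + r + beta * lam[discarded].sum())

def feasible(retained):
    """Constraint A.12: every retained eigenvalue must satisfy lam_i <= 1/beta."""
    discarded = [i for i in range(d) if i not in retained]
    return lam[list(retained)].max() <= lam[discarded].sum() / (d - r)

candidates = [c for c in itertools.combinations(range(d), r) if feasible(c)]
best = max(candidates, key=lambda c: loglik(list(c)))
assert best == (0, 1)    # the two smallest eigenvalues maximize the likelihood
```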
We see that, assuming nonzero noise, this solution makes sense only when $d > m \geq r$; that is, the dimension of the input space should exceed the number of experts. Then we obtain

$$L(W_{ML}, \beta_\Sigma^{ML}) = -\frac{N}{2}\left[ \sum_{i=1}^{r} \ln(\lambda_i) + (d - r)\ln\frac{\sum_{i=r+1}^{d} \lambda_i}{d - r} + r + (d - r) \right] + c \tag{A.20}$$

$$= -\frac{N}{2}\left[ (d - r)\ln\frac{\sum_{j=r+1}^{d} \lambda_j}{d - r} - \sum_{j=r+1}^{d} \ln(\lambda_j) \right] + c', \tag{A.21}$$
where $c, c'$ are constants, and we have used the fact that $\sum_{i=1}^{d} \ln(\lambda_i) = \ln |S|$ is a constant. Let $A$ denote the term in the square brackets of equation A.21. Clearly $L$ is maximized with respect to the $\lambda$ terms when $A$ is minimized. We will now investigate the value of $L(W_{ML}, \beta_\Sigma^{ML})$ as different eigenvalues are discarded or retained, so as to identify the maximum likelihood solution. The expression for $A$ is identical (up to an unimportant scaling by $(d - r)$) to expression A.5 in Tipping and Bishop (1999). Thus, their conclusion that the retained eigenvalues should be adjacent within the spectrum of sorted eigenvalues of the sample covariance $S$ is also valid in our case.³ This result, together with the constraint on the retained eigenvalues, equation A.12, and the noise parameter, A.19, yields

$$\lambda_i \leq \frac{\sum_{j=r+1}^{d} \lambda_j}{d - r}, \qquad \forall i = 1, \ldots, r; \tag{A.22}$$
i.e., only the smallest eigenvalues should be retained. It can be shown (see appendix A.3.2 in Williams & Agakov, 2001) that the log likelihood is maximized when $r = m$.

³ An alternative derivation of this result can be found in appendix B.4 of Agakov (2000).

Acknowledgments. Much of the work on this article was carried out as part of the M.Sc. project of F. A. at the Division of Informatics, University of Edinburgh. C. W. thanks Sam Roweis, Geoff Hinton, and Zoubin Ghahramani for helpful conversations on the rotcaf model during visits to the Gatsby Computational Neuroscience Unit. We also thank the two anonymous referees for their comments, which helped to improve the article. F. A. gratefully acknowledges the support of the Royal Dutch Shell Group of Companies for his M.Sc. studies in Edinburgh through a Centenary Scholarship.

References

Agakov, F. (2000). Investigations of gaussian products-of-experts models. Unpublished master's thesis, University of Edinburgh. Available on-line: http://www.dai.ed.ac.uk/homes/felixa/all.ps.gz.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Golub, G. H., & Van Loan, C. F. (1997). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN 99) (pp. 1–6).
Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and econometrics (2nd ed.). New York: Wiley.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.
Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5, 927–935.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Proakis, J. G., & Manolakis, D. G. (1992). Digital signal processing: Principles, algorithms and applications. New York: Macmillan.
Roweis, S. (1998). EM algorithms for PCA and SPCA. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 626–632). Cambridge, MA: MIT Press.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. J. Roy. Statistical Society B, 61(3), 611–622.
Williams, C. K. I., & Agakov, F. V. (2001). Products of gaussians and probabilistic minor components analysis (Tech. Rep. No. EDI-INF-RR-0043). Edinburgh: Division of Informatics, University of Edinburgh. Available on-line: http://www.informatics.ed.ac.uk/publications/report/0043.html.
Williams, C. K. I., & Felderhof, S. N. (2001). Products and sums of tree-structured gaussian processes. In Proceedings of the ICSC Symposium on Soft Computing 2001 (SOCO 2001).
Xu, L., Krzyzak, A., & Oja, E. (1991). Neural nets for dual subspace pattern recognition method. International Journal of Neural Systems, 2(3), 169–184.

Received April 20, 2001; accepted September 17, 2001.
LETTER
Communicated by Steven Nowlan
Optimization of the Kernel Functions in a Probabilistic Neural Network Analyzing the Local Pattern Distribution

I. Galleske
gingo@zipi..upm.es
J. Castellanos
jcastellanos@.upm.es
Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain
This article proposes a procedure for the automatic determination of the elements of the covariance matrix of the gaussian kernel function of probabilistic neural networks. Two matrices, a rotation matrix and a matrix of variances, can be calculated by analyzing the local environment of each training pattern. Their combination forms the covariance matrix of each training pattern. This automation has two advantages: First, it frees the neural network designer from specifying the complete covariance matrix, and second, it results in a network with better generalization ability than the original model. A variation of the famous two-spiral problem and real-world examples from the UCI Machine Learning Repository show not only a classification rate better than that of the original probabilistic neural network but also that this model can outperform other well-known classification techniques.
1 Introduction
Applications of neural network theory can be found in many areas: pattern recognition, function approximation, control, classification, and others. This article focuses on the classification problem. The neural network, that is, the classification problem solver, tries to find out to which of several possible classes an unknown pattern has to be assigned. Probabilistic neural networks (PNN), which are a network realization of the Bayes decision theory, are used for this purpose. This theory tries to separate the complete multidimensional input space into different regions, each belonging to one possible class. Classification then consists in deciding in which region a pattern is located. The probability density underlying each class can be found using kernel estimators (Parzen, 1962; Specht, 1990). These kernel estimators can be calculated in a one-pass learning process, resulting in rapid training. This PNN type has a second advantage over other types: a new training pattern can be added to an already trained network without
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1183–1194 (2002)
1184
I. Galleske and J. Castellanos
retraining it from scratch. These advantages make this network type a good candidate for real-time problems (Chen & You, 1992). Specht (1988) proposes many different kernel types in his publications and uses multidimensional gaussian functions as kernel estimators. These kernel estimators imply the construction of multidimensional gaussian functions. Specht states that the variance of the gaussian functions, called a smoothing parameter, does not play an important role (Specht, 1988). Other researchers say that the correct value is of importance; a wrong choice of this parameter could result in misclassification errors (Specht, 1991; Yang & Chen, 1998). This value is the only variable in the model of the PNN and therefore should not be guessed; a method to construct a network from the pattern set (Berthold & Diamond, 1998) and ways to calculate these matrices (Musavi, Chan, Hummels, & Kalantri, 1993; Galleske & Castellanos, 1997) have been developed. This article presents a new recursive algorithm to calculate the necessary covariance matrix of the kernel functions. Calculating the parameters needs additional computing power and results in a longer training process. Nevertheless, the learning process is still much faster than many other approaches to solving a classification problem. The disadvantage of the longer training time is well compensated in two ways. First, the developer of the probabilistic neural network is freed from guessing the covariance matrix and takes advantage of an automatic calculation. Second, and perhaps even more important, a better generalization ability, that is, better performance on patterns not seen before, is achieved. This article begins with an introduction to the probabilistic neural network, a network realization of the Bayes decision theory. This includes its architecture and the corresponding learning procedure. This information leads to an explanation of the new, modified version of the PNN.
We explain in detail how the two matrices, the rotation matrix and the matrix of variances, are calculated in various dimensions. Finally, we give the results of some experiments, which demonstrate the better generalization ability of the model we propose in this article. 2 Theoretical Foundations of Probabilistic Neural Networks
The general PNN consists of three layers of neurons (see Figure 1). The $n$ input units indicate the $n$-dimensional problem space. The next layer of units (pattern layer) consists of $p$ units, $p$ being the number of training patterns of all classes. These two layers are completely connected, and the weights leading to unit $p_i$ are determined by its position in the $n$-dimensional input space. The one-pass learning of the PNN consists of setting the weights, leading to the pattern unit, to the coordinates of the pattern. The pattern units feed their output without exception only into that class unit of the next layer that represents the class it belongs to. These connections
Optimization of the Kernel Functions
1185
Figure 1: Architecture of the standard probabilistic neural network.
cannot be learned and have their weights fixed to the reciprocal value of the number of units of the corresponding class. The example in Figure 1 shows a two-class classifier; if there were more classes, the network would have more units on the class layer. Among all units of the class layer, the one with the highest activation is chosen (in a pure connectionist model, a decision network can perform this task). The input, that is, the new pattern, is classified to the class that this unit represents. All output functions of the network are equal to the activation, $o(i) = a(i)$, for all units. The class units have the summation function $a(h) = \sum_{i=k}^{l} o(i)$, where the pattern units $k$ to $l$ belong to class $h$. The activation functions of the units in the pattern layer represent the most important concept of the PNN, because they finally define the complete density function of the underlying problem. Here, the use of the multidimensional gaussian function for each function is legitimated, as proved in the fundamental article by Parzen (1962). Using this activation function for all pattern units, the complete PNN calculates for each class $h$ the following function:

$$a(h) = \sum_{i=k}^{l} o(i) = \sum_{i=k}^{l} \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (X - X_i)^T \Sigma_i^{-1} (X - X_i) \right), \tag{2.1}$$

where the patterns from $k$ to $l$ belong to class $h$, $X$ is the pattern to classify, $\Sigma_i$ is the matrix of covariances of pattern $i$, and $X_i$ is the position of training
pattern $i$. In other words, around each training pattern, a multidimensional gaussian function is located, and these functions sum to an estimate of the probability density for a given classification problem. The general problem to solve is to find the best estimate of the gaussian matrix $\Sigma_i$ for every training pattern $i$. The original PNN uses the same matrix $\Sigma$ for all training patterns, leading to an architecture that is very fast to train. Although the training time of this architecture is short, it has the disadvantage that the generalization ability is not optimized. To enhance the generalization ability, an algorithm is proposed here that calculates for each pattern $i$ its own matrix $\Sigma_i$, subdividing it into four matrices. In the following, $i$ will be omitted for notational simplicity. Generally, the multidimensional gaussian functions have a constant potential surface (CPS; points of the same probability) that is a hyperellipsoid. The form of this CPS is determined by the parameter $\Sigma$. The idea proposed in this article is to give the ellipsoid the freedom to rotate by any angle, with the aim of making it as big as possible to achieve better network generalization. To realize this generic rotation, the covariance matrix $\Sigma$ has to be subdivided into four matrices: $\Sigma = R^T S^{-1} S^{-1} R$. $S$ is a diagonal matrix, with its nonzero elements indicating the lengths of the principal axes of the ellipsoid. $R$ is a simple rotation matrix, which rotates the ellipsoid to the desired position. The next section explains in detail how the unknown rotation matrix $R$ and the matrix of principal axes $S$ are calculated.
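A minimal sketch of the two ingredients just described: a per-pattern covariance assembled from a rotation matrix and principal axis lengths, plugged into the class activation of equation 2.1. Everything below (the names, the toy data, and the convention that the covariance eigenvalues are the squared axis lengths) is our own illustration, not the authors' implementation:

```python
import numpy as np

def covariance_from_axes(R, axes):
    """Per-pattern covariance built from a rotation matrix and principal
    axis lengths (assumed convention: covariance eigenvalues are the
    squared axis lengths)."""
    return R.T @ np.diag(np.asarray(axes, dtype=float) ** 2) @ R

def class_activation(x, patterns, Sigmas):
    """Class-unit activation a(h) of equation 2.1: a sum of multivariate
    gaussian kernels centered on the class's training patterns."""
    n = x.shape[0]
    total = 0.0
    for Xi, Si in zip(patterns, Sigmas):
        diff = x - Xi
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Si))
        total += np.exp(-0.5 * diff @ np.linalg.inv(Si) @ diff) / norm
    return total

# Toy two-class example with hypothetical data: each pattern gets a rotated,
# elongated kernel; classification picks the larger class activation.
theta = np.pi / 4
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
Sigma = covariance_from_axes(R, [2.0, 0.5])     # rotated anisotropic kernel

rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], 0.3, size=(15, 2))   # class A cluster
B = rng.normal([4.0, 4.0], 0.3, size=(15, 2))   # class B cluster
Sig_A = [Sigma] * 15
Sig_B = [Sigma] * 15

x = np.array([0.3, -0.2])
label = "A" if class_activation(x, A, Sig_A) > class_activation(x, B, Sig_B) else "B"
```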
3 Constructing the Covariance Matrix \Sigma
In the described model, the two matrices R and S for each training pattern of all classes have to be estimated from the training sample. For a classification problem in an n-dimensional space, the following variables have to be calculated: the n variances s_1, ..., s_n, which are the constituents of the variance matrix S, and the rotation matrix R. We present a recursive formula for estimating the unknown parameters of this model in a general n-dimensional input space; first, s_n and the last row vector of R have to be calculated. With the knowledge of s_n, the calculation of s_{n-1} follows, and with the knowledge of the last vector of R, the second-to-last row vector can be estimated, and so forth, until s_n, s_{n-1}, ..., s_2 are known and s_1 can be calculated, as well as the first row vector of R. The method for finding the unknown parameters is constructive and has to be executed for all training patterns of all classes. For the description of the algorithm, it is sufficient to make a conjunction of classes: class A contains all of the patterns under observation; the patterns of all other classes are joined into a general class B. As a notational convenience, X_i^A denotes in the following the training pattern X_i that is the ith training pattern of class A.
Optimization of the Kernel Functions
1187
Figure 2: In X_i^A, a new coordinate system (v_1, v_2, v_3) is constructed, and X_k^B is projected "elliptically" to P.
For each training pattern X_i^A of class A, it is necessary to find the closest training pattern of the opposite class (denoted X_j^B). As a distance measure, there is no reason not to use the simple Euclidean distance. The vector v_n, which connects these training patterns, pointing from X_i^A to X_j^B, provides one parameter, s_n, the smallest of all principal axes. The variances of training patterns of different classes should not overlap too much, so to prevent the overlap of a certain part of the probability mass, s_n is set to a fraction (p > 1) of the length of the connection vector: a_n = |v_n|, s_n = (1/p)|v_n|. So the first parameter of the ellipsoid, the smallest principal axis, is defined. From v_n, we make use not only of the length but also of the direction: we take v_n as the first constituent of the rotation matrix R. After normalization to unit length, it will be the last row vector of R. Afterward, the second principal axis a_{n-1} is found using the following procedure. The general equation of a hyperellipsoid in an n-dimensional space is

\frac{x_1^2}{a_1^2} + \frac{x_2^2}{a_2^2} + \cdots + \frac{x_i^2}{a_i^2} + \cdots + \frac{x_n^2}{a_n^2} = 1. \qquad (3.1)
a_{n-1} is found by setting a_1 = a_2 = \cdots = a_{n-1}. This simplification is possible because the principal axes are not yet fixed in the (n-1)-dimensional subspace that is perpendicular to a_n and can therefore be rotated by any angle. The calculation of a_{n-2} with the known variables a_n, a_{n-1} is possible by setting a_{n-2} = a_{n-3} = \cdots = a_1. This procedure continues until all of a_n to a_2 are known and the calculation of the last parameter, a_1, the largest principal axis, is possible. Generally, to calculate a certain a_i, 1 < i < n, the variables a_{i+1} to a_n are known, and the variables a_1 to a_i are set equal. The three-dimensional case serves as an illustration. In Figure 2, v_3 (a_3 = |v_3|) was calculated using the Euclidean distance and points from X_i^A to X_j^B. v_3 can be interpreted as the x_3-axis of a new coordinate system originated
in X_i^A, where the x_2- and x_1-axes lie on the plane P but are not defined completely, as they can still be rotated freely. To calculate v_2, the x_2-axis, we set b = a_2 = a_1 in formula 3.1 (representing the free rotation on P) and resolve

b = \left[\frac{a_3^2 (x_1^2 + x_2^2)}{a_3^2 - x_3^2}\right]^{1/2}

for all X_k^B transformed into the new coordinate system with x_3 < a_3. b then gives the length of the new x_2-axis, where X_k^B is projected "elliptically" to P. In other words, an ellipse is constructed through X_j^B and X_k^B (see Figure 2). This method allows estimating the a_i, but as a distance measure, the Euclidean distance is used only for calculating a_n. For all other a_i, the following recurrent formula has to be used (see the appendix for its derivation):

b = \begin{cases} \left[\dfrac{\left(\prod_{j=i+1}^{n} a_j^2\right)\sum_{j=1}^{i} x_j^2}{\prod_{j=i+1}^{n} a_j^2 - \sum_{k=i+1}^{n}\left(\prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2\right) x_k^2}\right]^{1/2}, & \text{if the denominator is positive} \\ \infty, & \text{else} \end{cases} \qquad (3.2)
where i starts from n - 1 and decrements to 1 to calculate all the principal axes of the ellipsoid. The variances are again proportional to the lengths of the semi-axes of the hyperellipsoid: s_1 = (1/p)a_1, s_2 = (1/p)a_2, ..., s_n = (1/p)a_n (p > 1; for example, p = 2). The connection vectors v_i between X_i^A and the X_j^B found using the above distance measure are used to construct the rotation matrix R, projecting v_i "elliptically" onto the subspace that is perpendicular to the known vectors v_{i+1}, ..., v_n. The last vector, v_1, again has to be chosen in such a way that the resulting matrix R has a positive determinant. The normalized vectors v_i constitute the row vectors of the matrix R. This terminates the procedure, because all unknown parameters that have to be estimated (s_1, ..., s_n, which constitute the matrix S, and the matrix R) are known.

3.1 Computational Aspects. The central idea of the proposed approach is the partition of the covariance matrix \Sigma into four matrices, \Sigma = R^T S^{-1} S^{-1} R, and the separate calculation of the elements R and S. But looking at formula 2.1, which relates the probability to the gaussian function in a multidimensional space,

f(X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - X_p)^T \Sigma^{-1} (X - X_p)\right), \qquad (3.3)
we can see that the covariance matrix \Sigma, which was estimated previously, is not needed explicitly. What matters for calculating the probability distribution is the determinant |\Sigma| and the inverse matrix \Sigma^{-1}.
To calculate |\Sigma| = |R^T S^{-1} S^{-1} R|, the matrices R^T and S^{-1} have to be computed from the matrices R and S, which are known because they were estimated as explained in the previous section. S, the matrix of the variances, is diagonal, with nonzero elements only on its diagonal; it is inverted simply by taking the reciprocals of those nonzero elements. Calculating |\Sigma| then reduces to a multiplication of R, S^{-1}, and R^T, all of which are easily obtained. The other component needed for calculating the gaussian function is the inverse covariance matrix \Sigma^{-1} = (R^T S^{-1} S^{-1} R)^{-1}. It is calculated without inverting \Sigma directly, making use of the equality R^{-1} = R^T, which holds for rotation matrices as they are applied here:

\Sigma^{-1} = (R^T S^{-1} S^{-1} R)^{-1} = R^{-1} S S (R^T)^{-1} = R^T S S R. \qquad (3.4)
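Equation 3.4 can be checked numerically. The sketch below takes the decomposition literally for a hypothetical 2-D case (the rotation angle t and variances s are made up): it builds \Sigma = R^T S^{-1} S^{-1} R, obtains \Sigma^{-1} = R^T S S R without any explicit matrix inversion, and verifies that their product is the identity.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

t = 0.3                              # hypothetical rotation angle
s = [2.0, 0.5]                       # hypothetical variances s1, s2
R = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]
RT = [list(row) for row in zip(*R)]  # R^T, which equals R^-1 here
S = [[s[0], 0.0], [0.0, s[1]]]
S_inv = [[1.0 / s[0], 0.0], [0.0, 1.0 / s[1]]]

# Covariance Sigma = R^T S^-1 S^-1 R ...
Sigma = matmul(matmul(RT, matmul(S_inv, S_inv)), R)
# ... and its inverse via eq. 3.4, with no matrix inversion at all.
Sigma_inv = matmul(matmul(RT, matmul(S, S)), R)

prod = matmul(Sigma, Sigma_inv)
assert all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < 1e-9
           for i in range(2) for j in range(2))

# |Sigma| = |S^-1|^2, since |R| = 1: handy for the normaliser in 3.3.
det_Sigma = (1.0 / (s[0] * s[1])) ** 2
```

The determinant shortcut in the last line follows because a rotation matrix has determinant 1, so only the diagonal factors contribute.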
4 Experimental Results
The method described was implemented to compare the generalization ability of this model with the original PNN and the support vector machine (SVM), as an example of a classifier of similar complexity. A variation of the famous two-spiral problem, well known in the area of backpropagation training and originally proposed by Alexis P. Wieland in a post to the connectionist mailing list in August 1988, and two examples taken from the UCI Machine Learning Repository (Blake & Merz, 1998), the BUPA liver disorder database and the Pima Indian diabetes diagnosis, were used as benchmark tests.

4.1 Two Spirals. The original task, as proposed by Wieland, is to classify points in the (x, y)-plane into one of two classes. The classes have the form of a spiral, making the problem difficult to learn for a standard backpropagation network that uses global basis functions. The PNN seemed to offer a better solution because it uses localized functions. In this example, the training patterns, formerly located in the two-dimensional (x, y)-plane, now expand to three-dimensional space. The two spirals are not an extension of the (x, y)-plane into the z-direction but build similar spirals in the (x, z)- and (y, z)-planes. This classification problem, even less appropriate for a linear classifier, has to be solved by a PNN in the standard configuration, by the SVM, and by the model proposed here. Whereas in the original experiment with the two spirals in two dimensions the network had to learn 194 training patterns, now in three dimensions the number of patterns exploded to 1200 examples, 600 for each class. All models were trained with a subset of the 1200 patterns that constitute the spirals. We used reduction steps of 5%, 10%, 15%, 20%, and 25%. The training set was obtained from the complete example by admitting each pattern with a probability of 0.95 (0.9, 0.85, 0.8, 0.75, respectively). The reason that we chose this method of selecting the training pattern set for
Table 1: Absolute Classification Errors in 10 Experiments of the Two-Spiral Problem.

Reduction            5%      10%     15%     20%     25%
PNN (variance 1.0)   456.0   468.6   469.3   477.4   484.3
PNN (variance 0.1)   28.9    58.7    77.9    109.7   132.6
PNN (variance 0.01)  17.6    30.5    43.3    61.6    73.5
SVM                  7.2     15.3    26.1    37.9    51.1
Proposed model       1.2     2.2     3.7     8.8     12.3
our test case is that it yields a random distribution: there will be holes in the spirals that may be small (only a single training pattern deleted) and perhaps easy to generalize over, but in the larger reduction experiments there will also be big holes of two, three, or even more missing patterns, which the networks will surely have more problems completing. The test set consisted of the complete 1200 patterns. The summary of this experiment is shown in Table 1. From the first three rows, one can observe that the generalization ability of the PNN is better the smaller the variance is. The reason for this classification error lies in the high pattern density at the center of the spiral, which occupies and overwrites the low-density region of the outer arms. Although the SVM performs much better (fourth row), reducing the classification error considerably, the model proposed here takes this property into account, leading to very few misclassifications. Another observation is that the relative quality of the models is independent of the reduction steps.

4.2 BUPA Liver Disorder. So as not to depend solely on the results of the preceding synthetic experiment, we also applied the different classification networks to real-world problems taken from the UCI Machine Learning Repository (Blake & Merz, 1998). The first example is the BUPA liver disorder database. The available 345 instances were divided into a training set of 150 randomly chosen patterns and a test set of 345 patterns. Table 2 summarizes the results over 10 experiments with different training sets. The first five columns give the absolute numbers of misclassifications of the original PNN with different variances of the covariance matrix. To find the best results for the SVM, various kernel functions were compared; radial basis functions were finally used because their generalization ability was superior to that of other kernels.
The last column shows the results of the model proposed in this article. Although all results are quite close, the SVM performs better than the standard PNN, and the model proposed in this article outperforms both.
Table 2: Absolute Classification Errors in 10 Experiments of the Models in the BUPA Liver Disorder Problem.

Model                     Errors
PNN (variance 0.01)       85.3
PNN (variance 0.1)        82.9
PNN (variance 1.0)        82.1
PNN (variance 10)         81.1
PNN (variance 20)         80.9
Support vector machine    78.0
Proposed model            75.9
Table 3: Absolute Classification Errors in 10 Experiments of the Different Models in the PIMA Indian Diabetes Diagnosis Problem.

Model                     Errors
PNN (variance 0.01)       212.6
PNN (variance 0.1)        213.4
PNN (variance 1.0)        213.7
PNN (variance 10)         214.0
PNN (variance 20)         216.4
Support vector machine    71.8
Proposed model            56.1
4.3 PIMA Indians Diabetes Diagnosis. The second real-world example also demonstrates the effectiveness of the new model. The test setup consisted of 10 runs with 768 test patterns and 576 training patterns (75% of the test set). Table 3 shows the absolute classification error over the 10 runs for the original PNN, the SVM with a radial basis function kernel (because it had the best generalization ability), and the new model. As in the previous experiments, the proposed model generalizes better than the original PNN and the SVM.

5 Conclusion
Estimating the covariance matrix is a problem in the PNN model. This work presents a technique that overcomes this limitation and automatically calculates the covariance matrix for every training pattern of the network by looking only at its local environment. To this end, the matrix \Sigma was divided into submatrices that were estimated separately, making the calculation of each one simple. One synthetic and two real-world (BUPA liver disorder, PIMA Indians diabetes) experiments were executed to compare the generalization ability of the proposed model with the original PNN and the SVM, a model of similar complexity. In all tests, the classification accuracy of the new model was better, as shown by an enhanced ability to generalize to unseen patterns. The results are very encouraging, so future work will try to reduce the network size of the PNN using some kind of clustering technique to take advantage of the calculated covariance matrices, but without losing the enhanced ability to generalize to unseen patterns.
Appendix: Derivation of the Distance Measure
For the method to calculate the distance between two training patterns in the n-dimensional problem space, two distance measures are used. For the first dimension, the simple Euclidean distance measure is used, but from the second dimension on, a special elliptical distance is applied. The derivation of this distance measure is presented here. For calculating the distance between two training patterns X_a^A and X_b^B in dimension i, 1 \le i < n, the following equation holds (see section 3):

\frac{x_1^2}{b^2} + \cdots + \frac{x_{i-1}^2}{b^2} + \frac{x_i^2}{b^2} + \frac{x_{i+1}^2}{a_{i+1}^2} + \cdots + \frac{x_{n-1}^2}{a_{n-1}^2} + \frac{x_n^2}{a_n^2} = 1. \qquad (A.1)
x_1 to x_n are the coordinates of the pattern X_b^B transformed into the coordinate system located at X_a^A. a_n to a_{i+1} are known variables from earlier calculations. The variables a_1 to a_i are replaced by a single placeholder b, which has to be calculated. Transferring all summands that do not contain b to the right side of the equation, combining the left side, and extending all fractions on the right side with \prod_{j=i+1}^{n} a_j^2 yields

\frac{x_1^2 + \cdots + x_i^2}{b^2} = \frac{\prod_{j=i+1}^{n} a_j^2 - \frac{\prod_{j=i+1}^{n} a_j^2}{a_{i+1}^2}\, x_{i+1}^2 - \cdots - \frac{\prod_{j=i+1}^{n} a_j^2}{a_{n-1}^2}\, x_{n-1}^2 - \frac{\prod_{j=i+1}^{n} a_j^2}{a_n^2}\, x_n^2}{\prod_{j=i+1}^{n} a_j^2}. \qquad (A.2)
To reduce the right side to a common denominator, each a_k^2 can be cancelled by leaving out the corresponding factor in the \prod_{j=i+1}^{n} a_j^2 of the numerator:

\frac{\sum_{j=1}^{i} x_j^2}{b^2} = \frac{\prod_{j=i+1}^{n} a_j^2 - \prod_{\substack{j=i+1 \\ j \neq i+1}}^{n} a_j^2\, x_{i+1}^2 - \cdots - \prod_{\substack{j=i+1 \\ j \neq n-1}}^{n} a_j^2\, x_{n-1}^2 - \prod_{\substack{j=i+1 \\ j \neq n}}^{n} a_j^2\, x_n^2}{\prod_{j=i+1}^{n} a_j^2}. \qquad (A.3)

Examining the numerator, we find that each product \prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2 leaves out exactly the a whose index k matches the x_k it multiplies: \prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2\, x_k^2. Therefore, we combine the summands in the following way:

\frac{\sum_{j=1}^{i} x_j^2}{b^2} = \frac{\prod_{j=i+1}^{n} a_j^2 - \sum_{k=i+1}^{n}\left(\prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2\right) x_k^2}{\prod_{j=i+1}^{n} a_j^2}. \qquad (A.4)
Resolving this equation for b, the unknown variable, gives the distance between the two training patterns:

b = \left[\frac{\left(\prod_{j=i+1}^{n} a_j^2\right)\sum_{j=1}^{i} x_j^2}{\prod_{j=i+1}^{n} a_j^2 - \sum_{k=i+1}^{n}\left(\prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2\right) x_k^2}\right]^{1/2}. \qquad (A.5)
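Equation A.5 translates into a short routine. The sketch below uses 0-based indexing (axes a[i] through a[n-1] are already fixed; the first i axes are the ones set equal to the unknown b); the function name and calling convention are assumptions of this sketch, not the authors'.

```python
import math

def elliptical_distance(x, a, i):
    """b from eq. A.5: x holds the opposite-class pattern's coordinates
    in the local coordinate system; entries a[0..i-1] are ignored,
    because those axes are the ones set equal to b."""
    n = len(x)
    prod = 1.0
    for j in range(i, n):
        prod *= a[j] ** 2
    num = prod * sum(x[j] ** 2 for j in range(i))
    # prod / a[k]**2 is the product over j != k from eq. A.4's numerator.
    den = prod - sum((prod / a[k] ** 2) * x[k] ** 2 for k in range(i, n))
    return math.sqrt(num / den)

# 2-D check: with a_2 = 2 fixed and x = (1, 1), eq. A.1 demands
# 1/b^2 + 1/4 = 1, i.e. b = sqrt(4/3).
b = elliptical_distance([1.0, 1.0], [0.0, 2.0], 1)
print(abs(b - math.sqrt(4.0 / 3.0)) < 1e-12)  # -> True
```

Substituting the returned b back into equation A.1 places the transformed point exactly on the ellipsoid, which is the defining property of this distance.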
References

Berthold, M., & Diamond, J. (1998). Constructive training of probabilistic neural networks. Neurocomputing, 19, 167–183.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine: University of California, Irvine, Department of Information and Computer Sciences. Available on-line: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Chen, C., & You, G. (1992). ISBN recognition using a modified probabilistic neural network (PNN). Proceedings of the 11th IAPR International Conference on Pattern Recognition, 2, 419–421.
Galleske, I., & Castellanos, J. (1997). Probabilistic neural networks with rotated kernel functions. In Proceedings of the 7th International Conference on Artificial Neural Networks ICANN'97 (pp. 379–384). Lausanne, Switzerland.
Musavi, M. T., Chan, K., Hummels, D., & Kalantri, K. (1993). On the generalization ability of neural network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 659–663.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.
Specht, D. (1988). Probabilistic neural networks for classification, mapping, or associative memory. Proceedings, IEEE International Conference on Neural Networks, 1, 525–532.
Specht, D. (1990). Probabilistic neural networks. Neural Networks, 3, 109–118.
Specht, D. (1991). Generalization accuracy of probabilistic neural networks compared with backpropagation networks. IJCNN-91-Seattle: International Joint Conference on Neural Networks, 1, 887–892.
Yang, Z., & Chen, S. (1998). Robust maximum likelihood training of heteroscedastic probabilistic neural networks. Neural Networks, 11, 739–747.

Received November 6, 2000; accepted September 27, 2001.
LETTER
Communicated by Stephen José Hanson
Methods for Binary Multidimensional Scaling

Douglas L. T. Rohde
[email protected]
School of Computer Science, Carnegie Mellon University, and the Center for the Neural Basis of Cognition, Mellon Institute, Pittsburgh, PA 15213

Multidimensional scaling (MDS) is the process of transforming a set of points in a high-dimensional space to a lower-dimensional one while preserving the relative distances between pairs of points. Although effective methods have been developed for solving a variety of MDS problems, they mainly depend on the vectors in the lower-dimensional space having real-valued components. For some applications, the training of neural networks in particular, it is preferable or necessary to obtain vectors in a discrete, binary space. Unfortunately, MDS into a low-dimensional discrete space appears to be a significantly harder problem than MDS into a continuous space. This article introduces and analyzes several methods for performing approximately optimized binary MDS.

1 Introduction
Recent approaches to artificial intelligence and machine learning have come to rely increasingly on data-driven methods that involve large vector spaces. One application of high-dimensional vectors that is particularly relevant today is in representing the contents of large collections of documents, such as all texts available on the Internet. The similarity structure in these vector spaces can be exploited to perform a variety of useful tasks, including searching, clustering, and classification (see, e.g., Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Berry, Dumais, & O'Brien, 1994). Other popular applications of vector spaces include representing the content of images (Beatty & Manjunath, 1997) and the meanings of words (Lund & Burgess, 1996; Burgess, 1998; Clouse, 1998). However, it is often inefficient, if not intractable, to perform complex analyses directly in high-dimensional vector spaces. If one could reduce the set of high-dimensional vectors to a set of vectors in a much lower-dimensional space while preserving their similarity structure, operations could be performed more efficiently on the smaller space, with the potential added benefit of improved results due to reduced noise and greater generalization. Scaling to a space with just one, two, or three dimensions also permits easy visualization of the resulting space, which can lead to a better understanding of its overall structure.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1195–1232 (2002)
1196
Douglas L. T. Rohde
In order to place the current work in a historical framework, let us briefly trace the development of modern multidimensional scaling (MDS) techniques. Most applications of MDS, particularly in the psychological domains, have been in the analysis of human similarity ratings. Thus, rather than beginning with points in a high-dimensional vector space, a more common starting point has been a matrix of pairwise comparisons of a set of items. Various types of comparisons might be used, including similarity or dissimilarity judgments, confusion probabilities, or interaction rates. One problem introduced by the use of measures of this sort is that it is not clear how best to scale these ratings so that they correspond directly to distances in the vector space. Subjects' ratings may be quite skewed and are likely to be nonmetric. Perhaps the earliest explicit and practical MDS method was that of Torgerson (1952), which grew out of the work of Richardson (1938) and Young and Householder (1938), among others. Torgerson used a one-dimensional scaling technique to convert dissimilarity ratings to target distances and then attempted to find a set of points whose pairwise Euclidean distances best matched the target distances according to mean-squared error. The initial scaling function might simply be a linear transformation or could be a nonlinear function, such as an exponential. Although this technique can be quite effective, its formal requirements are too strong for many applications, and a serious drawback is that the proper scaling method is difficult to determine and may vary from one problem to the next. The next major advance was made by Shepard (1962), who suggested that rather than attempting to match scaled target distances directly, the goal of MDS should be to obtain a monotone relationship between the actual point distances and the original dissimilarities.
Thus, the dissimilarities need not be scaled, and their values are actually discarded altogether; all that is retained is their relative ordering. Torgerson's earlier approach came to be known as metric MDS and this new technique as nonmetric MDS. However, Shepard did not provide a mathematically explicit definition of what constitutes a solution. Kruskal (1964a, 1964b) further developed the method by explicitly defining a function, known as stress, relating the pairwise distances and the ranking of dissimilarities. Stress essentially involves the scaled sum-squared error between the pairwise distances and the best-fitting monotonic transformation of the original dissimilarity ratings. Iterative gradient descent was used to find the configuration with minimal stress. The basic technique developed by Shepard and Kruskal has remained the standard for most applications of MDS to psychological phenomena (Shepard, Romney, & Nerlove, 1972; Borg & Groenen, 1997). Although possibly slower, gradient descent techniques have the advantage over matrix algebra methods in that they can more easily tolerate missing or sparse data and can be used to minimize any differentiable measure of stress. Nonmetric MDS is quite effective when similarity ratings involve unknown distortions. However, relying on rank order sometimes discards information that cannot be recovered (Torgerson, 1965). This may be particularly true in cases where the structure of the data involves a number of tight clusters that are well separated (Shepard, 1966). Thus, metric methods may be more suitable for some types of data, but for the vast majority of problems of practical interest, nonmetric methods are likely to be as good, and perhaps better.

A common feature of the MDS techniques discussed thus far is that they rely on the final vector space having real-valued components. However, some applications require vectors with discrete, usually binary components. That is, the vectors should lie at the corners of a unit hypercube. An important application of this type is the development of representations for training neural networks. Increasingly, neural networks that serve as cognitive models are trained using inputs or targets derived from real data rather than artificially generated vector sets. But those data sets may involve vectors of high dimensionality, possibly in the tens or hundreds of thousands, and it would be computationally intractable to train a network using such large vectors. Furthermore, neural networks with thresholded outputs, particularly recurrent attractor networks (Pearlmutter, 1989; Plaut & Shallice, 1993), often learn better when their vector targets use binary components. It is harder for the network to drive output units accurately to intermediate levels of activation than to drive them to fully active or inactive states. Thus, it is sometimes necessary to scale a set of high-dimensional vectors to a relatively low-dimensional, binary vector space. Binary MDS (BMDS) is a much harder problem than standard MDS. In fact, it has been shown that embedding a metric distance space in a bit space with minimal distortion is NP-complete (Deza & Laurent, 1997).
Thus, there is a good chance that no polynomial-time algorithm exists to compute an optimal set of BMDS vectors. However, it may still be possible to compute efficiently an approximation to the optimal BMDS solution. This article presents several methods for performing approximately optimal BMDS. The solutions fall into two broad classes: those that perform the optimization directly in bit space and so-called hybrid methods that perform the optimization in a real-valued space before converting to a bit space.

[Footnote 1: Actually, it is NP-complete to decide whether a metric distance space is ℓ1-embeddable with no distortion. It follows that finding a minimally distorted embedding of a metric space into a binary space under an ℓ1 distance measure (such as Hamming distance) is NP-complete. However, the original set of distances in the BMDS problems considered here is more restricted than a metric space because they represent distances between pairs of points. Thus, the known proof may not apply. Nevertheless, it seems likely that deciding whether an ℓ2- or ℓ1-embeddable metric space is embeddable in an ℓ1-space of lower dimensionality is also an NP-complete problem.]

The first hybrid method is somewhat similar to Torgerson's linear-algebraic approach to MDS. It begins by computing the singular value decomposition of the matrix of initial vectors. The right singular vectors are then converted to bits using a unary encoding, with more bits assigned to vectors having larger singular values. Two other hybrid methods are based on Shepard and Kruskal's technique for MDS. Gradient descent is performed in a real-valued vector space using the stress cost function before the vector components are converted to bits based on their signs. One of these variants is metric in that it uses the actual target values in computing the stress. The other method uses a monotonic transformation of the target values rather than the values themselves. The major problem with hybrid techniques is that important information can be lost in the discretization: a very good real-valued solution may turn into a very bad binary solution. An alternative approach is to perform the bulk of the optimization directly in bit space. The only known prior algorithm for computing BMDS directly in bit space is that of Clouse and Cottrell (1996) and Clouse (1998). Their method began by creating a set of bit vectors for each item by thresholding values from the original high-dimensional vector. They then performed a random walk by repeatedly selecting bits at random, computing whether flipping the bit would improve the overall cost, and doing so only when beneficial. Once it was determined that no improvements could be made by flipping any one bit, the algorithm terminated. One drawback of the Clouse and Cottrell method is that it is computationally inefficient. Without good record keeping, it is costly to determine whether a bit flip is advantageous. As the number of remaining good bits diminishes, the algorithm becomes less and less efficient, because many bits must be tested before any progress can be made. The alternative, exhaustively testing all bits before deciding which to flip, is not much better. Thus, the algorithm does not scale well to larger problems.
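Clouse and Cottrell's bit-flipping search can be sketched as follows. This toy version re-evaluates the full cost for every candidate flip each round, which is exactly the inefficiency discussed in the text; the function name and cost-function interface are assumptions of the sketch, not the original implementation.

```python
import random

def bitflip_walk(bits, cost, seed=0):
    """Flip single bits only when doing so lowers `cost`, visiting
    bits in random order; stop when no single flip helps.
    `bits` is a list of 0/1 lists; `cost` scores the whole set."""
    rng = random.Random(seed)
    improved = True
    while improved:
        improved = False
        positions = [(i, j) for i in range(len(bits))
                     for j in range(len(bits[i]))]
        rng.shuffle(positions)
        for i, j in positions:
            before = cost(bits)
            bits[i][j] ^= 1
            if cost(bits) < before:
                improved = True       # keep the flip
            else:
                bits[i][j] ^= 1       # revert it
    return bits

# Toy cost: Hamming distance to a known target configuration, so the
# walk should recover the target exactly.
target = [[1, 0, 1], [0, 1, 1]]
cost = lambda b: sum(b[i][j] != target[i][j]
                     for i in range(2) for j in range(3))
print(bitflip_walk([[0, 0, 0], [1, 1, 1]], cost) == target)  # -> True
```

The record-keeping improvement amounts to caching, for every bit, the cost change its flip would cause, so `cost` need not be recomputed from scratch on each test.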
The first fully binary method presented here is an improved version of Clouse and Cottrell's algorithm. By using careful record keeping, it is able to keep track of the change in cost that would result from flipping any bit and to quickly find the bit that would result in the greatest improvement. Although there is some cost to the record keeping, it is more than made up for by the fact that the algorithm need never test a bit only to discover that flipping it would be counterproductive. A more effective, though less efficient, version of this algorithm minimizes the sum of the squared differences between actual and target distances rather than the sum of the absolute differences. The final method introduced in this study constructs the bit vectors sequentially, choosing the first bit in each vector, followed by the second bit in each vector, and so on. The algorithm then iterates several times, reassigning the bits in each dimension given the other bits previously chosen. Although simple and relatively easy to implement, this method is quite fast and effective. The next section introduces the tasks and metrics used to evaluate the BMDS methods. Each method is then described in further detail, including the details of its implementation, advantages and disadvantages, and some possible variations. Finally, the methods are evaluated in terms of performance and running time. It is hoped that this study will prove useful to researchers interested in immediate applications of binary multidimensional scaling and that it will inspire future advances in these methods.

2 Evaluation Metrics and Example Tasks
Before describing the actual BMDS algorithms, we begin by defining the scaling task more explicitly. The input is a set of N real-valued vectors of dimensionality M, representing N items. The output is a set of N bit vectors with dimensionality D. The goal is for the relative distances between the final vectors to reflect the relative distances between the initial vectors as closely as possible. To make this more concrete, we must define the functions measuring pairwise distance in the original and final spaces and a measure of how well these two sets of distances agree. There are a number of reasonable distance metrics for the original space, four of which are shown in Table 1. Euclidean distance and city-block distance are standard choices. However, they are dependent on the dimensionality, M, and the scaling of the vectors, making them somewhat inconvenient. Cosine is another reasonable choice. It is scale invariant and is confined to a fixed range, [−1, 1], which is more convenient than measures that depend on dimensionality and average value magnitude. A fourth possibility, and the one used in this study, is to base the distance measure on Pearson's correlation. Like cosine, it is scale invariant and is confined to the fixed range [−1, 1]. Computationally, correlation and cosine are identical except that cosine is calculated using the actual vector components, while correlation is based on the differences between vector components and their mean. If components are evenly distributed between positive and negative values, their mean is usually close to zero, and cosine and correlation are quite similar. But if components are constrained to be nonnegative, cosine will be positive while correlation continues to use the full [−1, 1] range. Correlation is thus a good initial choice for many scaling problems. In practice, using correlation has led to better results than using the other three measures with several different tasks and BMDS algorithms.
In order to turn correlation into a distance measure, it is scaled by −0.5 and shifted by 0.5, so that a correlation of 1.0 becomes a distance of 0 and a correlation of −1.0 becomes a distance of 1.0. The set of all pairwise correlation distances was then scaled by a constant factor to achieve a mean value of 0.5, although this has little practical effect because the mean distance tended to be very close to 0.5 before scaling. This linearly transformed correlation will be referred to as correlation distance. Although it was not done here, one could scale the resulting values using an exponential with an exponent greater than 1 to increase the influence of larger distances or less than 1 to enhance the smaller distances.
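As a concrete illustration, the correlation-distance transform can be sketched as follows. This is my own NumPy rendering, not the paper's code; the function name is hypothetical.

```python
import numpy as np

def correlation_distance(X):
    """Pairwise correlation distances for the rows of X (N items x M dims):
    correlation 1 -> distance 0, correlation -1 -> distance 1, then rescaled
    so the mean pairwise distance is 0.5 (usually a very small adjustment)."""
    D = 0.5 - 0.5 * np.corrcoef(X)            # Pearson correlation -> distance
    off_diag = D[~np.eye(len(D), dtype=bool)]  # exclude the zero diagonal
    return D * (0.5 / off_diag.mean())
```

The final rescaling implements the constant-factor adjustment described above; dropping it changes little in practice.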
1200
Douglas L. T. Rohde
Table 1: Some Candidate Distance Measures.

City-block:   $\sum_k |x_k - y_k|$

Euclidean:    $\sqrt{\sum_k (x_k - y_k)^2}$

Cosine:       $0.5 - 0.5 \frac{\sum_k x_k y_k}{\sqrt{\sum_k x_k^2 \sum_k y_k^2}}$

Correlation:  $0.5 - 0.5 \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2 \sum_k (y_k - \bar{y})^2}}$
The simplest and most reasonable choice for the distance metric in bit space seems to be Hamming (city-block) distance. That is, the distance between two vectors is the number of bits on which they differ. Note that for bit vectors, Euclidean distance is just the square root of the Hamming distance. Likewise, if the bit vectors tend to have a roughly equal number of 1s and 0s, which seems to be the case with most of these BMDS methods in practice, Hamming distance is closely approximated by correlation distance (when scaled by D). The third function that we must specify evaluates the agreement between the correlation distances in the original space and the Hamming distances in the final binary space. It is not entirely clear which measure is best. One obvious choice is to use Kruskal's stress (Kruskal, 1964a):

$$\text{Metric stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - t_{ij})^2}{\sum_{i<j} d_{ij}^2}},$$

where i and j together iterate over all pairs of items, $d_{ij}$ is the Hamming distance of i and j's bit vectors, and $t_{ij}$ is the correlation distance of i and j's initial vectors, scaled by the dimensionality of the bit vectors, D. An alternative form of this measure, and the one that Kruskal actually used, is nonmetric stress. In this case, the actual distances are not directly compared to the target distances but to the best monotonic transformation of the target distances:

$$\text{Nonmetric stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}},$$

where $\hat{d}_{ij}$ are those values that achieve minimal stress, under the constraint that the $\hat{d}_{ij}$ have the same rank order as the corresponding $t_{ij}$. Nonmetric stress is a better measure if one is concerned only with preserving the rank-order relationship between pairwise distances. But if one is concerned with directly matching the target distances, metric stress is preferable.
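In code, metric stress for a set of bit vectors against D-scaled target distances might look like this. A minimal sketch; the function name and array layout are my own assumptions.

```python
import numpy as np

def metric_stress(B, T):
    """Kruskal's metric stress: B is an N x D array of bits, T is an N x N
    matrix of target distances (correlation distances already scaled by D)."""
    iu = np.triu_indices(len(B), k=1)                      # all pairs i < j
    d = np.abs(B[:, None, :] - B[None, :, :]).sum(-1)[iu]  # Hamming distances
    t = T[iu]
    return np.sqrt(((d - t) ** 2).sum() / (d ** 2).sum())
```

A perfect match between Hamming and target distances yields a stress of 0.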
Binary Multidimensional Scaling
1201
An alternative method of evaluating the final vectors is to compute Pearson's correlation between the original set of pairwise distances among vectors and those of the final vectors. This measure is referred to here as goodness and is defined mathematically as follows:

$$\text{Goodness} = \frac{\sum_{i<j} (d_{ij} - \bar{d})(t_{ij} - \bar{t})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \sum_{i<j} (t_{ij} - \bar{t})^2}}.$$
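Because goodness is just a Pearson correlation over the upper-triangle distances, it is compact to compute; a sketch (names mine):

```python
import numpy as np

def goodness(D_actual, T_target):
    """Pearson correlation between the upper-triangle pairwise distances of
    the actual (Hamming) and target distance matrices."""
    iu = np.triu_indices(len(D_actual), k=1)
    return np.corrcoef(D_actual[iu], T_target[iu])[0, 1]
```

Note that, as stated in the text, applying a linear transformation to either set of distances leaves goodness unchanged.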
Note that the optimal stress value is 0 and the optimal goodness value is 1; better vectors should result in lower stress but higher goodness. Goodness has the property that it is unaffected by linear transformations of the distances, so scaling and shifting the target distances has no effect on goodness. Because metric stress, nonmetric stress, and goodness do not always agree on which is the best set of vectors, all three measures are reported in the analysis in section 9. In section 9.3, the practical differences between these measures are discussed.

2.1 Example Tasks. Two BMDS tasks were used in testing the algorithms presented here. The first, known as the Exemplar task, was completely artificial. It consisted of 4000 vectors of dimensionality 1000, generated in the following way. First, 10 random bit vectors of length 50 were produced. Each of the 4000 vectors was created by taking one of the 10 exemplars, flipping each bit with 10% probability, and then doing a random projection to real-valued 1000-dimensional space. The resulting vector set has a basically simple similarity structure with a good deal of randomness superimposed. Because the vectors occupied a 50-dimensional bit space prior to the random projection, the vectors should be quite compressible. The second task, the Word task, involves 5000 vectors of length 4000 representing word meanings. The vectors were generated using a method similar to HAL (Lund & Burgess, 1996). Word co-occurrences were gathered over a large corpus of Usenet text. Raw co-occurrence counts were converted to a ratio of the conditional probability of one word occurring in the neighborhood of another to the word's overall probability of occurrence. The first 4000 values of each word's vector, reflecting its co-occurrences with the 4000 other most frequent words, were taken as the word's meaning vector. This set has a more complex similarity structure than the Exemplar set.
Note that these problems are considerably larger than those to which MDS is typically applied, which generally involve no more than a few hundred items. Clouse and Cottrell (1996) reported an example task involving 233 words. Because any reasonable MDS algorithm will likely have a running time that is at least O(N^2), the tasks studied here are effectively several hundred times more complex. Of critical concern will be not only the ability to achieve low stress or high goodness but also the running time of the various algorithms.
3 The Singular Value Decomposition (SVD) Method
The singular value decomposition (SVD) is the foundation for the latent semantic analysis technique for document indexing and retrieval (Deerwester et al., 1990). It has also recently received considerable attention for its use in efficient clustering methods (Frieze, Kannan, & Vempala, 1998). It therefore seems natural to consider designing a BMDS algorithm using SVD. This first method, which is based on computing the SVD of the item vector matrix, is somewhat related to the metric MDS technique of Torgerson (1952). Any real matrix, A, has a unique SVD, which consists of three matrices, U S V, whose product is the original matrix. The first of these, U, is composed of orthonormal columns known as the left singular vectors, and the last, V, is composed of orthonormal rows known as the right singular vectors. S is diagonal and contains the singular values. The singular vectors reflect principal components of A, and each pair has a corresponding value, the magnitude of which is related to the variance accounted for by the vector. If A is symmetric and positive semidefinite, the left and right singular vectors will be identical and equivalent to its eigenvectors, and the singular values will be its eigenvalues. Nonbinary multidimensional scaling can be performed using the SVD as follows. Let A be the M × N matrix whose columns are the original item vectors. The SVD is computed, and the right singular vectors are sorted by decreasing singular value. Only the first D vectors and values are retained. The new representation of item i is the vector composed of the ith value in each of the D highest right singular vectors, scaled by its corresponding singular value.

3.1 Discretization. In order to perform binary MDS, the values must be converted to bits. One could simply use the first D right singular vectors and assign a single bit to each component. But this would not be very effective because the vectors with highest value contain most of the useful variance.
Furthermore, there are often fewer than D nonzero singular values. Thus, we may need to assign more than one bit to each right singular vector. The method found to be most effective is to assign the bits roughly in proportion to the magnitudes of the singular values. A deterministic procedure is used to accomplish this. The first bit is assigned to the vector with the largest singular value. Its value is then halved. The second bit is assigned to the vector with the singular value that is now the largest. Once two bits have been assigned to a vector, its value is set to one-third of its original value. For three bits, its value is one-fourth of the original, and so on. Assume, for example, that we had three singular vectors with values 12, 9, and 5, and we were to assign 5 bits (D = 5). The first bit goes to the first vector, and its value is reduced to 6. The second bit goes to the second vector, and its value is reduced to 4.5. The third bit goes to the first vector again, because it is once again the largest value. Its value is set to 4 (12/3).
Bit 4 goes to the third vector, because it is now the largest, and its value is reduced to 2.5. The remaining values are 4, 4.5, and 2.5, and the last bit goes to the second vector. In the end, the first two vectors have been assigned two bits and the last vector one. The bits are then given values using a unary encoding. If a vector has three bits assigned to it, the possible codes are 000, 001, 011, and 111. This may not seem to be the most efficient use of the bits, but it is the only method that has the appropriate similarity structure between codes. The codes are assigned to items as follows. Each right singular vector has N components corresponding to the N items. These are sorted and partitioned evenly into the same number of bins as there are unary codes. With three bits, there would be four bins. The items in the first bin would receive the bits 000, those in the second bin would receive the bits 001, and so on.

3.2 Running Time. A major problem with the SVD method is that computing the SVD is quite slow. Because the matrices are dense, computing the SVD takes Θ(N²(N + M)) time. If N is larger than M, the matrix A can be transposed and the left singular vectors used, giving a running time of Θ(M²(M + N)). However, it is no faster to run the SVD algorithm with small D than with large D. Using a 500 MHz 21264 Alpha processor, it takes 25 minutes to compute the SVD on the Exemplar task. On the Word problem, with N = 5000 and M = 4000, it runs for nearly a day.

3.3 Variants. If one knows that a certain subset of items is representative of the others, it is possible to speed up the SVD computation by using only that subset to generate the singular vectors. Although this might enable the SVD method to tackle much larger problems, it hinders performance. As we will see in section 9, the SVD method of BMDS does quite poorly to begin with. Several alternative discretization methods have been tested but were not found to be as effective.
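The deterministic bit-assignment procedure can be captured in a few lines: dividing a vector's original singular value by one more than its current bit count reproduces the halving/thirding schedule described above. The function name is mine; this is a sketch of the procedure as described, not the paper's code.

```python
def assign_bits(singular_values, D):
    """Assign D bits to singular vectors roughly in proportion to their
    singular values: each bit goes to the current maximum working value,
    which then becomes original_value / (bits_assigned + 1)."""
    working = list(singular_values)
    bits = [0] * len(working)
    for _ in range(D):
        k = max(range(len(working)), key=lambda i: working[i])
        bits[k] += 1
        working[k] = singular_values[k] / (bits[k] + 1)
    return bits
```

On the worked example above (values 12, 9, and 5 with D = 5), this yields two bits each for the first two vectors and one bit for the third.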
4 The Metric Gradient Descent (MGD) Method
The second hybrid method uses gradient descent to optimize the item vectors in a real-valued space before discretizing them. It is similar to more traditional MDS in its use of the stress cost function and gradient descent but differs in that it is metric: the stress measure uses linearly transformed target distances rather than monotonically transformed targets.

4.1 Initialization. The first step of the gradient descent method is to create an initial set of N real-valued vectors of dimensionality D. The vectors could be assigned randomly, as is typically done in standard MDS. However, this tends to result in an unnecessarily long minimization process. A better approach is to use a fast method to create moderately good initial vectors. One nice way to produce good initial vectors is with a random projection.
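A minimal sketch of such a random-projection initialization, anticipating the details spelled out in the prose that follows. The gaussian basis and per-item correlations are as described in the text; the function name and seed handling are my own assumptions.

```python
import numpy as np

def random_projection(X, D, seed=0):
    """Project N items (rows of X, dimensionality M) to D dimensions by
    correlating each item with D random gaussian basis vectors."""
    rng = np.random.default_rng(seed)
    basis = rng.standard_normal((D, X.shape[1]))
    Xc = X - X.mean(axis=1, keepdims=True)          # center for correlation
    Bc = basis - basis.mean(axis=1, keepdims=True)
    norms = (np.sqrt((Xc ** 2).sum(1))[:, None]
             * np.sqrt((Bc ** 2).sum(1))[None, :])
    return (Xc @ Bc.T) / norms                      # N x D correlations in [-1, 1]
```

Each output component is a Pearson correlation, so every initial coordinate already lies in [−1, 1].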
First, D random basis vectors of dimensionality M are generated. The elements of the basis vectors are drawn independently from a gaussian distribution. For each item, the correlation between its original vector, having dimensionality M, and each of the D basis vectors is computed, and these D correlations form the components of the item's initial vector in the smaller space. This random projection is reasonably fast, Θ(NMD), and preserves much of the information in the original vectors, especially if D is large. Even without the minimization phase, the random projection goes a long way toward solving the BMDS problem. However, there is still considerable room for improvement to justify the more expensive optimization process.

4.2 The Cost Function and Its Derivative. The goal of the optimization phase is to minimize the stress between the actual vector distances and the scaled target vector distances:

$$S = \sqrt{\frac{S^*}{T^*}} = \sqrt{\frac{\sum_{i<j} (d_{ij} - b\,t_{ij})^2}{\sum_{i<j} d_{ij}^2}},$$

where i and j together iterate over all pairs of items, $d_{ij}$ is the city-block distance of i and j's vectors in the new space, $t_{ij}$ is the correlation distance of i and j's initial vectors, and b is an adjustable scaling factor. At the start of each step of the iteration, the value of b that results in minimal stress is computed. This method is commonly known as ratio MDS, as it seeks to minimize the discrepancy between actual and target distance ratios. The optimal value of b is given by

$$b = \frac{\sum_{i<j} d_{ij} t_{ij}}{\sum_{i<j} t_{ij}^2}.$$

Next, the derivative of the stress with respect to each of the ND vector components is computed. This derivative is given by the following formula:

$$\frac{\partial S}{\partial i_k} = \sum_j \left( \frac{d_{ij} - b\,t_{ij}}{\sqrt{S^* T^*}} - \frac{\sqrt{S^*}\, d_{ij}}{T^* \sqrt{T^*}} \right) \frac{\partial d_{ij}}{\partial i_k}.$$

When using city-block distance, the derivative of distance $d_{ij}$ with respect to component $i_k$ is simply $\mathrm{sgn}(i_k - j_k)$, or 1 if $i_k > j_k$ and −1 otherwise.

4.3 Component Updates. Once the ND derivatives have been accumulated over all item pairs, the vector components are updated by taking a small step down the direction of steepest descent. The size of the step is scaled by a learning rate parameter, a. Following Kruskal (1964b), it is also scaled by the root mean square (RMS) value of all vector components and
the inverse of the RMS derivative. The reason for this scaling is that it reduces the need to otherwise adapt the learning rate to the size of the overall problem. The component update formula is as follows:

$$i_k = i_k - a \, \frac{\sqrt{\sum_{i,b} i_b^2}}{\sqrt{\sum_{i,b} (\partial S / \partial i_b)^2}} \, \frac{\partial S}{\partial i_k}.$$

In order to prevent the vector components from shrinking or growing to a point where precision is lost, they are normalized following each update to maintain an RMS value of approximately 1. At the end of the minimization, the real-valued components are converted to bits based on their signs. Negative components become 0s, and positive components become 1s.

4.4 Learning-Rate Adaptation and Stopping Criterion. Despite the learning-rate scaling factors, it is still necessary to adapt the learning rate as the minimization proceeds. In general, a larger learning rate is used initially and then progressively reduced as a minimum is approached. The initial value of the learning rate was 0.2, which has proven to be a good choice for tasks of widely varying size. The general problem of automatically adapting the learning rate during a gradient descent to achieve the best performance is an interesting and difficult one. The following method is based on observations of what experienced humans do when adjusting a learning rate by hand. It is by no means optimal, but it does seem to work quite well for any gradient descent that has a smooth, fairly stable error function, such as the current problem or when training neural networks under batch presentation. The learning rate and termination criterion are based on two measures, known as progress and instability. Progress is the percentage change in overall stress following the last weight update and is defined as
$$\text{Progress} = P_t = \frac{S_{t-1} - S_t}{S_{t-1}},$$
where $S_t$ is the current stress and $S_{t-1}$ is the previous stress. A positive P value indicates that the stress is being reduced, which is good. If P ever becomes negative, the learning rate is immediately scaled by 0.75. This normally results in a return to positive progress on the next update. Instability is a time-averaged measure of the consistency of the progress and is defined as follows:

$$\text{Instability} = I_t = 0.5 \cdot I_{t-1} + \left( \frac{P_{t-1} - P_t}{P_{t-1}} \right).$$

Steady progress results in low instability. Whenever the learning rate changes, instability is reset to 10. If progress is high and the instability
is low, things are proceeding well, and no changes are needed. Unstable progress often indicates that the learning rate is a bit too high, but it is usually not worth lowering the rate unless negative progress is made. The only case where it is generally a good idea to increase the learning rate is when progress is slow and stable. Thus, whenever the progress is less than 0.02 (2%) and the instability is less than 0.2, the learning rate is scaled by 1.2. The minimization terminates when the progress remains below 0.001 (0.1%) for 10 consecutive updates. On the example tasks used here, the algorithm generally terminates between 50 and 250 updates, depending on the values of N and D.

4.5 Running Time. As with any other gradient-descent technique, it is difficult to predict in advance how many updates will be required. In general, the length of the settling process increases a bit with larger N and D. The cost for each update is Θ(N²D). Therefore, the algorithm tends to be somewhat worse than quadratic in N and somewhat worse than linear in D. The running time is evaluated empirically in section 9.4.

4.6 Variants. Gradient descent methods are extremely flexible and permit endless variation. In addition to the stress cost function and the city-block distance function, summed and sum-squared cost have also been tested, as well as Euclidean distance. None of these alternatives performed as well with metric gradient descent on the current tasks as did stress and city-block distance. Furthermore, various methods of scaling the target distances have also been tested. Rather than scaling the targets by an adaptive factor, b, they could simply be used in their raw form of correlation distances scaled by D. Alternately, one could transform the distances by $a + b\,t_{ij}$, where a and b are both adjusted to minimize the overall cost. This is known as an interval scale. Using an interval scale provided more flexibility but proved slightly less effective on the current problems.
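Returning briefly to section 4.4, the learning-rate schedule and stopping rule can be sketched as a small controller. The class name and state bookkeeping are mine; the constants 0.2, 0.75, 1.2, 0.02, 0.2, 0.001, and the instability reset of 10 come from the text, and the instability update follows the formula as printed.

```python
class RateAdapter:
    """Adapts a gradient-descent learning rate from stress measurements."""

    def __init__(self, rate=0.2):
        self.rate = rate
        self.prev_stress = None
        self.prev_progress = None
        self.instability = 10.0
        self.slow_updates = 0

    def step(self, stress):
        """Feed the latest stress; returns True when minimization should stop."""
        if self.prev_stress is not None:
            progress = (self.prev_stress - stress) / self.prev_stress
            if self.prev_progress not in (None, 0.0):
                self.instability = (0.5 * self.instability
                                    + (self.prev_progress - progress)
                                    / self.prev_progress)
            if progress < 0:                                  # stress rose: back off
                self.rate *= 0.75
                self.instability = 10.0
            elif progress < 0.02 and self.instability < 0.2:  # slow but stable
                self.rate *= 1.2
                self.instability = 10.0
            self.slow_updates = self.slow_updates + 1 if progress < 0.001 else 0
            self.prev_progress = progress
        self.prev_stress = stress
        return self.slow_updates >= 10
```

For example, a stress increase from 1.0 to 1.1 immediately scales the rate by 0.75, and ten consecutive updates with progress below 0.001 signal termination.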
One could loosen the restrictions on the target distances by using the optimal monotonic transformation, as in Kruskal and Shepard's nonmetric MDS technique (but that is the subject of the next algorithm). It may be possible to improve the rate of convergence with a better automated procedure for adjusting the learning rate. Another promising addition would be the use of a momentum term on the component update step, as is often done in training neural networks. Momentum adds a fraction of the previous step in vector space to the current step, which can often speed learning. Finally, as one might expect, a major problem with this method of BMDS is that significant information is lost in the discretization step. Although the city-block distances between pairs of vectors may match the target distances very well, those city-block distances will not accurately reflect the Hamming distances of the corresponding bit vectors unless the real-valued
components are close to −1 or 1.² Thus, it may be beneficial to introduce an additional cost term that penalizes components for being far away from −1 or 1. A simple polarizing cost function is the absolute value of the distance between the value and −1 or 1, whichever is closer. This would add a constant-size term to the value on each update, having the effect of pulling the value toward the closer of 1 and −1. Like the learning rate, the size of that step is an adjustable parameter. A somewhat more sophisticated polarizing cost function is $(i_k^4 - 2 i_k^2 + 1)/4$. This is shaped like a smooth W. It has concave upward minima at 1 and −1 and a concave downward maximum at 0. At values larger in magnitude than 1, it increases rapidly, heavily penalizing large values. It is convenient because it has no discontinuities, which can disrupt the gradient descent. The derivative of this function is simply $i_k^3 - i_k$. Thus, at each weight step, $i_k^3 - i_k$, multiplied by the cost parameter, is subtracted from each value. Experimentation with these cost functions indicates that they are quite effective in speeding the convergence of the gradient descent. However, when using metric gradient descent, they do not seem to do much to improve the quality of the resulting vectors.

5 Ordinal Gradient Descent (OGD) Method
This next method is based more closely on Shepard and Kruskal's gradient descent technique for MDS (Shepard, 1962; Kruskal, 1964a). Rather than using linearly scaled target values in computing the stress, the best-fitting monotonic transformation of the target values is used. Thus, the method is nonmetric, or ordinal, in that the actual target values are not important, only their rank ordering. Except where noted, all aspects of this method are identical to those for MGD, including the initialization step, the update step, and the learning-rate adjustment.

5.1 Nonmetric Stress. Ordinal gradient descent uses as its cost function the nonmetric stress measure,

$$S = \sqrt{\frac{S^*}{T^*}} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}},$$

where $\hat{d}_{ij}$ are those values that minimize the stress, under the constraint that the $\hat{d}_{ij}$ have the same rank order as the corresponding $t_{ij}$. The derivative of this function with regard to component $i_k$ is the same as for metric stress, except that $b\,t_{ij}$ is replaced by $\hat{d}_{ij}$.
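The best-fitting monotone values $\hat{d}$ can be computed with a pool-adjacent-violators pass, a standard least-squares monotone regression that serves the same role as the up-down algorithm used in the paper. This sketch assumes the actual distances have already been sorted by increasing target value.

```python
def monotone_fit(d):
    """Return the nondecreasing sequence d_hat minimizing sum((d - d_hat)^2),
    for distances d listed in increasing order of their targets."""
    blocks = []                       # each block holds [mean, count]
    for v in d:
        blocks.append([v, 1])
        # Merge backward whenever a block violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    out = []
    for m, n in blocks:
        out.extend([m] * n)
    return out
```

For instance, the out-of-order run in [1, 3, 2, 4] is pooled to its mean, giving [1, 2.5, 2.5, 4].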
² If city-block distances between vectors with components that are expected to fall close to −1 or 1 are to be compared to Hamming distances of bit vectors formed from the signs of the components, the city-block distances must actually be scaled by 1/2.
The optimal $\hat{d}_{ij}$ values are computed on each iteration using the up-down algorithm of Kruskal (1964b). In terms of running time, this process is linear in the number of distance values, O(N²), and is thus significantly less costly than computing the component derivatives.

5.2 Sigmoidal Components. A major problem with the gradient descent method is the distortion introduced in discretizing the real values into bits. If the real values are either exactly −1 or 1, the city-block distance between real-valued vectors will exactly correspond to the Hamming distance of the bit vectors, and there will be no loss of information. However, if the real values are much less than or greater than 1 in magnitude, the discretization will introduce noise. This led to the idea of adding a cost function that encourages the real values to be close to 1 or −1. However, such cost functions were not found to be very helpful in practice. An alternative is to transform the vector components using a sigmoid, or logistic, function, which limits values to the range [0, 1] and makes it easier for the gradient descent to achieve nearly discrete values (Rumelhart, Hinton, & Williams, 1986). The sigmoid function has the following formula:

$$s(i_k) = \frac{1}{1 + e^{-g i_k}}.$$

This function is shaped like a flattened S. If $i_k = 0$, $s(i_k) = 0.5$. As $i_k$ increases above 0, $s(i_k)$ approaches 1. As $i_k$ decreases below 0, $s(i_k)$ approaches 0. The parameter g is known as the gain and controls how rapidly the sigmoid approaches its limits. The advantage of the sigmoid is that it is quite easy for the gradient descent to drive the effective vector coordinates, $s(i_k)$, close to 1 or 0 by driving the actual coordinates to large positive or negative magnitudes. When using the sigmoid-transformed components, the vector distance function and its partial derivative become

$$d_{ij} = \sum_k |s(i_k) - s(j_k)|,$$

$$\frac{\partial d_{ij}}{\partial i_k} = g\, s(i_k)(1 - s(i_k))\, \mathrm{sgn}(s(i_k) - s(j_k)).$$
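A direct transcription of the sigmoid-transformed distance and its derivative (gain g; function names are mine):

```python
import numpy as np

def sig(x, g=1.0):
    """Logistic sigmoid with gain g, mapping reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-g * x))

def sig_distance(vi, vj, g=1.0):
    """City-block distance between sigmoid-transformed vectors."""
    return np.abs(sig(vi, g) - sig(vj, g)).sum()

def sig_distance_grad(vi, vj, g=1.0):
    """Partial derivatives of sig_distance with respect to vi's components."""
    si, sj = sig(vi, g), sig(vj, g)
    return g * si * (1.0 - si) * np.sign(si - sj)
```

A finite-difference check confirms the analytic derivative wherever the corresponding components of the two vectors differ.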
5.3 Polarizing Cost. The sigmoid is helpful only if a good proportion of the vector components actually grow fairly large (and thus approach 0 or 1 when put through the sigmoid). In order to encourage this, an extra polarizing term is added to the cost function. In the absence of the sigmoid, such a cost function must be somewhat complex, as it should have the effect of pulling the values toward −1 or 1 but not beyond them. Two such functions were mentioned in section 4.6. But when the sigmoid is used, the cost function need only push the raw values away from 0. Therefore, the simple
linear cost function is used. This has the effect of adding a small constant to the positive values and subtracting a small constant from the negative values on each weight update. In evaluating this method, a constant of 0.05 was used. Because the sigmoid avoids the problem of overly large components and the polarizing term prevents an overall shrinking of components, there is no need to renormalize the values periodically.

5.4 Running Time. Because they involve an exponential, computing the sigmoids exactly can substantially slow down this algorithm. Therefore, a fast approximation to the sigmoid function was used. The sigmoids of 1024 values evenly distributed in the range [−16, 16] were computed in advance and stored in a table. In computing the sigmoid of a new value, the sigmoids of the two closest values in the table are linearly interpolated. This method is quite fast and is accurate to within 2 × 10⁻⁷ of the correct value. Despite being somewhat more complex than MGD, the asymptotic running time of this method remains Θ(N²D) per update. Because it tends to converge faster than the metric method, however, the overall running time is somewhat less.

5.5 Variants. The gain term in the sigmoid function determines how sharp the sigmoid is and thus how polarized the values are. A higher gain draws the resulting values closer to 0 or 1 and thus reduces the noise introduced in the discretization. However, a high gain can also impede learning in the minimization phase. Thus, one might think of starting with a small gain and gradually increasing it as the minimization progresses. However, attempts to do this resulted in no improvement over using a fixed gain of 1.0. Using other fixed gain values also seemed to make little difference.

6 Greedy Bit-Flip (GBF) Method
This next method is quite similar to that of Clouse and Cottrell (1996), but the algorithm has been altered to achieve an asymptotically faster running time. Like the gradient descent techniques, this method performs a gradual minimization. However, rather than working in a real-valued space, it operates directly in bit space. The optimization proceeds by flipping individual bits, in an attempt to minimize the linear cost function:

$$\text{Cost} = \sum_{i<j} |d_{ij} - t_{ij}|,$$

where $d_{ij}$ is the Hamming distance between the bit vectors for items i and j and $t_{ij}$ is the correlation distance between their original vectors, scaled by D, the number of bits per vector. The advantage of the linear cost function is that the contribution of individual bits in a vector to the cost is often
independent of one another, which allows the algorithm to be much more efficient. In the next method, GBFS, we consider the effect of using squared rather than absolute cost.

6.1 Initialization. The initial bit vectors are formed using a random projection, as in the gradient descent methods. However, rather than using the actual correlations with the basis vectors as the components of the initial vectors, these correlations are converted to bits based on their sign. Negative correlations become 0s, and positive correlations become 1s. This method has the property that 1s and 0s are expected equally often.

6.2 Minimization. The minimization phase is conceptually very simple. It operates by repeatedly flipping individual bits in the N × D matrix of bit vectors, provided that those flips lead to immediate improvements in the overall cost function defined above. The bit that is flipped is always the one that leads to the greatest immediate improvement in the cost; hence, the name greedy. The Clouse and Cottrell (1996) algorithm differed in that it selected bits at random and then tested whether flipping the selected bit would decrease the cost. This is fairly inefficient near the end of the minimization process, when there are very few bits worth flipping. The key to performing the minimization quickly is to keep track at all times of the change in overall cost that would result from flipping each bit. This is referred to as the bit's gain. A positive gain means the overall cost would be reduced by changing the bit. All bits with positive gains are stored in an implicit heap (Williams, 1964). This standard priority queue data structure allows the bit with the highest gain to be accessed in constant time. Whenever the gain of a bit is changed (because it or another bit is flipped), the heap must be adjusted. In theory, this adjustment takes O(log(ND)) time. However, the O(log(ND)) bound is very loose.
Because the gain changes tend to be small, heap updates rarely involve more than a few steps through the heap. Furthermore, because only bits with positive gains are maintained in the heap and the vast majority of bits have negative gains, especially toward the end of the process, there are usually far fewer than ND bits actually in the heap. Therefore, the gain update step is, in practice, quite close to a constant-time operation. Along with the gain heap, we also maintain the target distance, $t_{ij}$, and the current distance, $d_{ij}$, for each pair of vectors. If the actual distance is at least 1 less than the target distance, we would like to make the two vectors more different. Therefore, for any dimension k, if bits $i_k$ and $j_k$ are the same, the overall cost would be reduced by 1 if we flipped either of those bits. If $i_k$ and $j_k$ are different, the cost would increase by 1 if we flipped either of those bits. If the actual distance is at least 1 larger than the target distance, the contribution of these two vectors to the overall cost will be reduced by 1 if
we flip any bit that makes them more similar and increased by 1 if we flip any bit that makes them more different. If flipping a bit causes $d_{ij}$ to change from larger than $t_{ij}$ to smaller than $t_{ij}$, the change in cost will be $1 - 2(d_{ij} - t_{ij})$. If $d_{ij}$ grows larger than $t_{ij}$ with a bit flip, the change in cost is $1 - 2(t_{ij} - d_{ij})$. Of course, the gain for flipping a particular bit for item i is not dependent on just one other item, j. It is summed over all other items. When any bit is flipped, the gains for some other bits must be adjusted. One factor that improves the efficiency of the GBF algorithm is that we do not need to update the gain for every other bit. As long as we are using the linear cost model, the gain for most other bits remains unchanged. If the bit $i_k$ has just been flipped, we will definitely need to update the gain for the other bits in vector i. We will also need to update bit k for all of the other vectors. However, we do not necessarily need to update the other bits for the other vectors. We need to do so only if $|t_{ij} - d_{ij}| < 1$, either before or after the flip.

6.3 Running Time. Each of the bit updates takes constant time (assuming a constant heap update), so the cost of flipping a bit is somewhere between Θ(N + D) and Θ(ND), depending on how many other bits must be updated. In practice, all the bits of a vector are usually updated roughly one-third of the time. However, most of these cases occur toward the end of the minimization as the actual distances grow close to the target distances. Therefore, the bulk of the minimization occurs in a relatively short time, and if time were a factor, the minimization could be stopped well before it is complete without significant degradation in the resulting vectors. Unfortunately, as with the gradient descent methods, it is not possible to predict exactly how long the minimization process will take. In theory, there could be an exponentially large number of flips before the algorithm terminates.
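To make the search concrete, here is a deliberately naive version of the greedy minimization that recomputes every bit's gain on each step by brute force; the real algorithm reaches the same result far faster by caching gains in the implicit heap. All names are mine, and this is a sketch for small toy problems only.

```python
import numpy as np

def greedy_bitflip(B, T, max_flips=10000):
    """Flip the single best bit of B (N x D, ints) until no flip reduces
    Cost = sum over i < j of |Hamming(i, j) - T[i, j]|."""
    B = B.copy()
    iu = np.triu_indices(len(B), k=1)

    def cost(B):
        d = np.abs(B[:, None, :] - B[None, :, :]).sum(-1)  # Hamming distances
        return np.abs(d - T)[iu].sum()

    c = cost(B)
    for _ in range(max_flips):
        best, best_c = None, c
        for i in range(B.shape[0]):
            for k in range(B.shape[1]):
                B[i, k] ^= 1                # trial flip
                trial = cost(B)
                B[i, k] ^= 1                # undo
                if trial < best_c:          # strictly better: positive gain
                    best, best_c = (i, k), trial
        if best is None:                    # no bit has positive gain
            break
        B[best] ^= 1
        c = best_c
    return B, c
```

On a tiny target where items 0 and 2 should be identical and item 1 maximally distant, the greedy search drives the cost to zero from an all-zeros start.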
One could, of course, terminate early once a minimum gain, a minimum number of bits in the gain heap, or a time limit has been reached. Or one could test the bit vectors periodically and stop when significant further progress seems unlikely. In practice, however, the algorithm tends to terminate on its own in a consistent number of flips, varying by at most a few percent between trials.

6.4 Variants. One concern in performing greedy minimization (always flipping the bit that provides the greatest immediate gain) is that the algorithm may be more likely to fall into bad local minima. A better option may be to select random bits from the gain heap, as Clouse and Cottrell (1996) effectively did. In practice, random selection leads to very slightly better solutions than greedy optimization on most, but not all, tasks. However, it tends to require about twice as many flips because they are, on average, less effective. Another possibility is to start with a completely random initial configuration rather than one produced by the random projection technique. Again,
Douglas L. T. Rohde
an optimization starting from a random initial configuration tends to take about twice as long. When D is large, there is little or no difference in performance. Interestingly, when D is small, starting from a random initial configuration leads to significantly better results on the Exemplar task but much worse results on the Word task. An additional thought is that rather than flipping bits one at a time, several bits could be flipped at once. This would make the algorithm more like a discrete version of the gradient descent methods, in which all vector components are updated simultaneously. Perhaps flipping several bits at once would add noise that could help propel the minimization past local minima. Several variants of this idea were tested. In the first, a random subset of the bits in the gain heap was flipped simultaneously, each bit in the heap being chosen with a probability that ranged from 5% to 25% across trials. Once there were fewer than about 10 bits in the heap, it was necessary to revert to the single-bit method or the minimization would never terminate. A second variant flipped the n bits with the highest gain, where n was a specified fraction of the total number of bits with positive gain. Unfortunately, these methods produced results very similar to the single-bit method but were somewhat slower.

7 Greedy Bit-Flip with Squared Cost (GBFS) Method
GBFS is identical to GBF, except that a squared cost function is used, as recommended by Clouse and Cottrell (1996). This imposes a greater penalty on vector pairs that are very far from their target distances. The disadvantage of the squared cost function is that we cannot assume independence when updating bit gains. Thus, when a bit is flipped, we must update the gains for all ND bits, slowing the algorithm considerably. If one wishes to run the algorithm until a local minimum is reached, this method will be more efficient than the Clouse and Cottrell (1996) technique, because the latter suffers from inefficiency when few bad bits remain. However, if the algorithm is to be terminated well short of convergence, the Clouse and Cottrell (1996) method will most likely be faster because it avoids the overhead of maintaining the gain heap.

8 Greedy Max Cut (GMC) Method
The final algorithm, known as the greedy max cut (GMC) method, also operates primarily in bit space. It starts with an empty N × D matrix of bit vectors and iterates through the columns of bits, choosing the value of the first bit for each vector, then the second bit for each vector, and so on. Once all of the bits have been chosen, it iterates through the columns of the matrix again several times, adjusting the bits where necessary.
Aside from the difference in initialization, this algorithm can be distinguished from GBF and GBFS, and from the earlier method of Clouse and Cottrell (1996), primarily in how it identifies bits that need to be flipped. Rather than selecting bits greedily or at random, the GMC algorithm systematically tests each bit in the matrix. This requires simpler record keeping than GBF, as we no longer need to maintain the gain of each bit. Like the Clouse and Cottrell method, GMC maintains only the current Hamming distances and target distances between all pairs of vectors. As in GBFS, the cost function minimized in this algorithm is the sum-squared difference between the actual Hamming distances, d_ij, and the correlation distances of the original vectors, scaled by D:

Cost = Σ_{i<j} (d_ij − t_ij)².
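Written out, this cost is simply the following (a minimal sketch in Python; the names are mine, and `tgt` is assumed to already hold the scaled correlation distances t_ij):

```python
import numpy as np

def squared_cost(bits, tgt):
    """Cost = sum over i < j of (d_ij - t_ij)^2, where d_ij is the
    Hamming distance between bit vectors i and j."""
    n = bits.shape[0]
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sum(bits[i] != bits[j])   # Hamming distance d_ij
            cost += (d - tgt[i, j]) ** 2
    return cost
```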
8.1 Filling the Columns. At all times, the algorithm maintains the current set of Hamming distances between vector pairs. It begins with vectors of all 0s. It then cycles through the D columns of the bit-vector matrix, filling each column so as to minimize the overall cost. This is called the "greedy max cut" method because the process of filling the columns chooses each bit so as to greedily reduce the overall cost and is related to the well-known maximum cut graph partitioning problem. Consider a complete, undirected graph with N vertices, corresponding to the N items, with the weight of edge ij equal to 1 − 2(t_ij − d_ij). The problem of finding the assignment of bits to column k that minimizes the squared cost is equivalent to finding the partitioning of the graph that maximizes the weight on edges crossing the partition; the items on the same side of the partition receive the same value for bit k. This is the maximum cut problem, which is known to be NP-complete (Karp, 1972). (Technically, our bit assignment problem does not exactly reduce to maximum cut, because the latter normally permits only positive edge weights, whereas our graph has both positive and negative weights. Although this form of reduction does not prove that the bit assignment problem is NP-hard, it is suggestive of the difficulty of the problem.) Therefore, it seems likely that no algorithm exists to produce an optimal solution to the problem in polynomial time. However, several fast approximation algorithms are known that will produce solutions to the maximum cut problem guaranteed to be within a certain percentage of the optimal value. The earliest such approximation algorithm is that of Sahni and Gonzales (1976), which guaranteed that the solution found would be at least half of the optimal value. The method employed here, in its first pass, is quite similar to the Sahni and Gonzales algorithm. When filling column k, the first item receives a random bit. Each subsequent item is given the bit value that results in a lower overall cost, computed over the preceding items. The contribution to the overall cost of item i that
results from selecting the value i_k = 0 will be

C⁰_ik = Σ_{j<i, j_k=0} (d_ij − t_ij)² + Σ_{j<i, j_k=1} (d_ij + 1 − t_ij)²,

where j iterates over all items for which bit k has already been chosen. The first summation includes only those items for which bit j_k is 0, and the second summation includes only those items for which bit j_k is 1. Thus, if i_k = 0, the distance to items for which j_k = 1 will increase by one. Similarly, the cost for choosing i_k = 1 will be

C¹_ik = Σ_{j<i, j_k=0} (d_ij + 1 − t_ij)² + Σ_{j<i, j_k=1} (d_ij − t_ij)².

If C⁰_ik < C¹_ik, i_k is set to 0, and otherwise to 1. In practice, it would be inefficient to compute these entire expressions. It is better just to compute their difference, which is given by

C⁰_ik − C¹_ik = Σ_{j<i} (2j_k − 1)(2(d_ij − t_ij) + 1).
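The first pass over a column can be sketched as follows (Python; the names are mine, and `dist` is assumed to hold the current Hamming distances over the columns filled so far, updated in place as bits are chosen):

```python
import numpy as np

def fill_column(bits, dist, tgt, k, rng):
    """Greedy first-pass assignment of column k.  `bits` is N x D,
    `dist` the current pairwise Hamming distances (updated in place),
    and `tgt` the scaled target distances."""
    n = bits.shape[0]
    bits[0, k] = rng.integers(2)            # first item gets a random bit
    for i in range(1, n):
        jk = bits[:i, k]                    # bits already chosen in this column
        # C0 - C1 = sum_{j<i} (2*j_k - 1) * (2*(d_ij - t_ij) + 1)
        diff = np.sum((2 * jk - 1) * (2 * (dist[i, :i] - tgt[i, :i]) + 1))
        bits[i, k] = 0 if diff < 0 else 1
        # distances grow by one to items whose bit k differs from i's
        changed = jk != bits[i, k]
        dist[i, :i][changed] += 1
        dist[:i, i][changed] += 1
```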
8.2 Adjusting the Columns. This single-pass assignment of the bits in a column can achieve only a rough approximation to the optimal partitioning. It can be further refined by iterating through the items and flipping their bits when doing so results in a lowered cost. In this adjustment phase, the bits are not cleared in advance, and the cost function for item i is computed over all other items, not just the preceding ones. Thus, the value of each bit in the column is reconsidered given the values of the other bits. Subsequent adjustments will refine the assignment further, but the number of bits changed each time gradually decreases. It is useful to perform at least two of these primary adjustments to each column before filling the next column. Once all of the columns are filled, it is helpful to cycle through them several more times, readjusting the bits in each one when the cost improves. These are known as secondary adjustments. To clarify: the primary adjustments occur to each column before the next column is filled; the secondary adjustments occur once all of the columns have been filled. In evaluating this algorithm, two primary adjustments and eight secondary adjustments were used.

8.3 Running Time. Unlike the gradient descent or bit-flipping methods, the running time of this algorithm is easily predicted. Filling or adjusting each column requires Θ(N²) operations. If the total number of primary and secondary adjustments is a, the running time of the algorithm is Θ(N²D(a + 1)).
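A single adjustment pass over a column might look like this (a sketch with invented names; `dist` is assumed to hold the current pairwise Hamming distances and `tgt` the scaled targets):

```python
import numpy as np

def adjust_column(bits, dist, tgt, k):
    """One adjustment pass over column k: flip each bit if doing so
    lowers the squared cost, computed over all other items."""
    n = bits.shape[0]
    flips = 0
    for i in range(n):
        jk = np.delete(bits[:, k], i)       # column k, excluding item i
        d = np.delete(dist[i], i)
        t = np.delete(tgt[i], i)
        # flipping bit i_k changes d_ij by +1 where bits agree, -1 where they differ
        delta = np.where(jk == bits[i, k], 1, -1)
        change = np.sum((d + delta - t) ** 2 - (d - t) ** 2)
        if change < 0:                      # flipping lowers the cost
            bits[i, k] ^= 1
            row = dist[i] + np.insert(delta, i, 0)
            dist[i] = row
            dist[:, i] = row
            flips += 1
    return flips
```

Repeating this until `flips` drops to zero, or for a fixed number of passes, corresponds to the primary and secondary adjustments described above.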
8.4 Variants. The obvious parameters affecting this method are the numbers of primary and secondary adjustments. The first few adjustments result in significant improvement, but further adjustments have greatly diminishing returns. There is a trade-off in balancing the number of primary and secondary adjustments. One could rely on all-primary or all-secondary adjustments, but performance is better with some of each. Holding the total number of adjustments fixed, it is generally best to use two or three primary adjustments and a greater number of secondary ones. As described, the GMC algorithm is completely deterministic. Multiple runs on the same set of items will result in effectively the same vectors, although some columns will have all of their bits reversed because the first bit in each column was selected randomly. It seems plausible that this method could tend to hit local minima because the bits are always updated in the same order in the adjustment phases. A reasonable variant would be to adjust the columns in randomly permuted order. However, experiments with this on the Word task found equivalent or slightly worse performance than the simpler deterministic method. Another possibility is to alter the order of traversal in the secondary adjustment phase. Rather than traversing the columns, one might traverse the rows. This has the effect of adjusting a single point relative to all of the other points before moving on to the next point. In contrast, column traversal adjusts all points along a single dimension before considering the next dimension. One might expect these two adjustment methods to produce different results, but in practice, they seem to result in virtually identical performance. An alternative is to select bits for possible adjustment at random, as in the Clouse and Cottrell (1996) algorithm. Again, one might expect this to avoid local minima better.
However, equating for the number of bits tested, random flipping has proved to be slightly, but not much, worse than either of the systematic updating methods. Thus, in its secondary phase, GMC does not differ significantly from the earlier method. The most important difference between the algorithms is the method of initializing the bit matrix. Replacing the primary bit-assignment phase of GMC with a random assignment of bits, but still equating for the total number of bits tested, results in significantly worse performance. Replacing it with a random projection, as in GBF, also degrades performance, but less so than random initialization. This algorithm can easily be adapted to any cost function that has the form of a sum over independent contributions from each item pair. It could efficiently handle the stress measure if the numerator and denominator were stored and updated incrementally. It would be more difficult to adapt the algorithm to a nonmetric cost function. One would need a fast, incremental version of the up-down algorithm to maintain the optimal monotonic transformation of the vector distances as bits are flipped. Even when amortized over multiple flips, this may be more than a constant-time operation and would thus add to the asymptotic complexity.
Finally, as in the GBF and GBFS methods, it is easy to add a term to the cost function that severely penalizes duplicate vectors and thus ensures that each item has a unique representation, which is sometimes necessary. 9 Performance Analysis
In this section, we present comparisons of the six implemented algorithms: SVD, MGD, OGD, GBF, GBFS, and GMC. The methods were applied to the Exemplar and Word tasks with varying bit-vector dimensionalities, D, and were evaluated using three measures of the agreement between the original pairwise distances and the final bit-vector Hamming distances: goodness, metric stress, and nonmetric stress. The average running times of the algorithms were also measured and are reported in section 9.4. Three trials of each condition were run on a 500 MHz 21264 Alpha processor and the results averaged. The only exceptions were the few trials of SVD and GBFS that lasted more than 10 hours, which were run only once. In general, the results of the methods were all very consistent across trials. The SVD and GMC algorithms are deterministic and thus achieve the same results every time. The other algorithms achieve goodness or stress ratings that vary by about 1% between trials for small values of D and by less than 0.2% for D ≥ 100. Because of the small variance, the limited number of trials, and the general clutter of the figures, error bars are not shown.

9.1 The Exemplar Task. The average goodness ratings of the six algorithms on the Exemplar task, as a function of D, are shown in Figure 1. With the exception of SVD, goodness increases monotonically with the size of the vectors. SVD is clearly the worst of the methods. Interestingly, in that case, goodness actually decreases for D values over 20, presumably because, with larger D, the SVD method begins to rely on less important singular vectors, which convey little useful information. There is no clear winner between OGD and MGD: OGD may be better for lower D but worse for higher D. The three best algorithms, according to goodness, are the bit-space optimizing methods: GBF, GBFS, and GMC. GBF does not do as well for small D.
With the exception of D = 10, GMC achieves the best goodness in every case, although the differences with GBF and GBFS are inconsequential for large D. The metric and nonmetric stress on the Exemplar task are shown in Figures 2 and 3, respectively. SVD does so poorly according to metric stress (from 18.6 for D = 10 to 0.34 for D = 200) that it does not appear on the graph. SVD also does quite poorly according to nonmetric stress, although there it is at least comparable to the other methods. Interestingly, although it did not according to goodness, the performance of SVD monotonically improves with D according to the stress measures. MGD is mediocre according to both measures. OGD does rather poorly according to metric stress. This should not be too surprising, since OGD is
Figure 1: Goodness on the Exemplar task.
only optimizing nonmetric stress, and its resulting pairwise distances are unlikely to match the target distances directly. However, OGD is still not as good as the bit-space methods even on nonmetric stress. GBFS and GMC are fairly indistinguishable according to either measure, although, with the exception of D = 10, GMC is slightly better in all cases. GBF is worse than the other two for D = 10 but is nearly as good at higher dimensionality.

9.2 The Word Task. The Word task has a much more complex similarity structure than the artificial Exemplar task and thus may provide a better measure of the ability of the BMDS methods on other scaling problems involving natural data. Because it has 5000 items and the original vectors have 4000 dimensions rather than 1000, the Word task is computationally harder as well. The goodness measure is shown in Figure 4, and the stress measures are depicted in Figures 5 and 6. Once again, SVD does quite poorly. It makes little improvement in goodness with higher dimensionality. It is off the metric stress scale, except for
Figure 2: Metric stress on the Exemplar task.
the case of D = 200, and its nonmetric stress is much worse than that of the other methods. Again, however, it does make steady improvement with higher D according to the stress measures, but not according to goodness. OGD is the clear winner on the goodness scale, especially for small D. This is surprising, since it performed quite poorly on the Exemplar task. According to metric stress, OGD again does not do very well, but it is the best method by nonmetric stress for D ≤ 50. Nevertheless, based on nonmetric stress, it is not clearly dominant over the other methods as it is on the goodness scale. The next section addresses this apparent disparity between goodness and the stress measures. As measured by goodness, MGD does quite well, but it does not approach the performance of OGD. According to stress, MGD is just average. GMC and GBFS have quite similar performance, but GMC is consistently a little better on all three measures at all values of D. GBF is reasonably good at high values of D, but its performance on all measures drops off for small D.
Figure 3: Nonmetric stress on the Exemplar task.
9.3 Stress versus Goodness. The stress and goodness measures do not always agree. At times, one method will perform better than another according to goodness but worse according to stress. For example, on the Word task with 50 bits, OGD achieves an average goodness of 0.843, while the GMC vectors have a goodness of only 0.741. However, the GMC vectors have a metric stress of 0.109, which is better than the 0.133 of the OGD vectors. According to nonmetric stress, GMC and OGD are very similar (0.104 versus 0.102). Because the measures do not always agree, it is important to choose an evaluation measure appropriate for a given task. So let us briefly compare the properties of these measures. We begin by trying to understand what aspects of the GMC and OGD solutions might have led to the disagreement among evaluation measures. One common way to evaluate MDS results visually is through the use of a Shepard diagram, a scatter plot with one point for every pair of items. The horizontal axis represents the distance between the original pair of vectors
Figure 4: Goodness on the Word task.
(or the subjects' similarity ratings, if that is the starting point). The vertical axis represents the actual distance between the vectors in the reduced space. Ideally, the points should fall on the identity line. However, because the Word task involves 5000 items, a standard Shepard plot would contain 12.5 million points. So many points are hard to handle, and estimating their density is difficult. Therefore, the modified version of the Shepard plot used here is something like a 2D histogram. The graph is partitioned into a grid of cells, and the number of points falling into each cell is counted. The columns of cells are then normalized so that the values in each column sum to 1, and each cell is plotted as a circle whose area is proportional to the normalized cell value. The ideal graph will be linear and have little vertical spread. These "Shepard histograms" for one run of the GMC and OGD algorithms are shown in Figures 7 and 8. This method of normalizing columns of cells disguises the fact that the vast majority of pairs have target distances close to 25, indicating that their
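The construction just described is easy to reproduce numerically (a sketch; `np.histogram2d` performs the binning, the column normalization follows the description above, and the function name is mine):

```python
import numpy as np

def shepard_histogram(target, actual, bins=20):
    """Counts of (target distance, actual distance) pairs on a grid,
    with each target-distance column normalized to sum to 1, as in the
    modified Shepard plots described in the text."""
    hist, xedges, yedges = np.histogram2d(target, actual, bins=bins)
    col_sums = hist.sum(axis=1, keepdims=True)  # one sum per target-distance column
    col_sums[col_sums == 0] = 1.0               # leave empty columns as zeros
    return hist / col_sums, xedges, yedges
```

Each normalized cell value would then be drawn as a circle whose area is proportional to that value.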
Figure 5: Metric stress on the Word task.
original vectors were uncorrelated. Therefore, most of the data points in the graph actually fall in this middle range. This can be seen by performing the normalization across all cells, not just along the individual columns. Figure 9 displays the same data as Figure 8, but with normalization across all cells. The cells representing very small or very large correlation distances are barely visible because they contain so few points. Figure 7 shows that the GMC algorithm finds a fairly linear solution. However, the solution appears to have a negative zero crossing and is somewhat warped for small target distances. Thus, vectors that were originally quite similar tend to be even more similar in the bit space. There is also greater variance in the columns on the left, although this partially results from the fact that they contain very few points. The plot of the OGD solution in Figure 8 is noticeably less linear, having a more pronounced sigmoidal shape. Like the GMC solution, small distances tend to be too small in the final space. However, in this case, large distances
Figure 6: Nonmetric stress on the Word task.
tend to be exaggerated as well. This partially explains why OGD does better according to goodness, or correlation, worse according to metric stress, and almost equivalently according to nonmetric stress. Nonmetric stress is essentially indicative of the variance of the columns in the Shepard diagram but is insensitive to the mean value in each column. In this case, the two methods have fairly similar variance, resulting in similar nonmetric stress. Metric stress, on the other hand, measures the disparity between the points and the identity function. Any deviation from this line, whether for large or small target distances, contributes equally to the stress. Metric stress is thus sensitive to both noise and monotonic distortions, the latter having a relatively strong effect. Thus, the OGD solution has a higher metric stress. The correlation measure is a bit harder to characterize analytically, but some simple empirical tests involving a few artificial data sets show that correlation is actually fairly insensitive to monotonic distortions. First, a set
Figure 7: Distribution of pairwise bit vector Hamming distances versus original distances for a run of the GMC method on the Word task with 50 bits. Cells are normalized by columns.
of random numbers evenly distributed between 0 and 1 was generated. A second set of numbers was produced by transforming each value in the first set, and the correlation and stress of the number pairs were measured. Monotonic distortions have a greater effect on stress. If the second set is composed of the square roots of the first set, there is little effect on the correlation, which is 0.98, but the stress rises to 0.098. If cube roots are used, the correlation is still fairly high, 0.959, but the stress is 0.229. Finally, if a sigmoid centered around 0.5 with a gain of 2 is applied to the numbers to create an S-shaped transformation as in Figure 8, the stress is moderate, 0.067, but the correlation is virtually unchanged at 0.9998. In contrast, adding noise to the values has a larger effect on correlation than on stress. If 0.15 is either added to or subtracted from each value with equal chance, the correlation drops to 0.887, but the stress is again 0.067. Thus, in this case of noise, the stress is equal to or lower than it was with the
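These tests are straightforward to repeat (a sketch; the exact stress normalization and the parameterization of the sigmoid here are my assumptions, so the resulting numbers will only roughly match those reported):

```python
import numpy as np

def metric_stress(t, d):
    """Assumed stress form: root of the sum-squared deviation from the
    targets, normalized by the sum of squared targets."""
    return np.sqrt(np.sum((d - t) ** 2) / np.sum(t ** 2))

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, 100000)          # values evenly distributed on (0, 1)
for name, f in [("square root", np.sqrt),
                ("cube root", np.cbrt),
                ("sigmoid", lambda x: 1.0 / (1.0 + np.exp(-2.0 * (x - 0.5))))]:
    d = f(t)                               # monotonic distortion of the values
    r = np.corrcoef(t, d)[0, 1]
    print(f"{name}: correlation {r:.3f}, stress {metric_stress(t, d):.3f}")
```

The correlations remain high under all three distortions, while the stress grows with the severity of the warping, in line with the pattern described above.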
Figure 8: Distribution of pairwise bit vector Hamming distances versus original distances for a run of the OGD method on the Word task with 50 bits. Cells are normalized by columns.
monotonic transformations, but the correlation is much worse. The implication is that the correlation measure is fairly ordinal in its behavior. Indeed, measuring the correlation on the example tasks using monotonically transformed targets, d̂_ij, rather than the actual targets, t_ij, generally results in only a slight improvement.

9.4 Running Time. Finally, we turn to the issue of the running time of the various algorithms. Regardless of how good its results may be, an algorithm is useful only if it can solve a given task in an acceptable time frame. The running times of SVD and GMC are fairly easy to analyze because they are deterministic. The method used here to compute the SVD is Θ(N²(N + M) + ND). The ND term is for assigning the bits and is inconsequential. If M < N, the matrix is inverted, resulting in a Θ(M²(M + N)) algorithm.
Figure 9: The same data as in Figure 8, normalized across all cells rather than by columns.
The other algorithms require that all pairwise distances between the vectors be computed, which is a Θ(N²M) process. However, because that step is common to all of them, it was done in advance and the distances stored; it is therefore not included in the measured running times. It is worth noting that although both are Θ(N²M), computing the vector distances is much quicker than computing the SVD because of its much smaller constant. Following the distance computation, the GMC algorithm is Θ(N²D), assuming the number of adjustments is treated as a constant. The gradient descent and bit-flipping algorithms, on the other hand, are difficult to analyze because it is not clear when they will terminate. They are suspected to be roughly Θ(N²D), however, and we turn to some empirical measures to verify this. Figures 10 and 11 show the scaled running times of the methods as a function of D on the Exemplar and Word tasks. Because the times were
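The precomputation of the target distances from the original vectors can be sketched as follows (only the Θ(N²M) correlation step is taken from the text; the exact mapping from correlation to bit units is my assumption):

```python
import numpy as np

def target_distances(X, D):
    """Pairwise correlation distances of the N x M data matrix X, mapped
    to bit units: t_ij = D * (1 - corr(x_i, x_j)) / 2, so perfectly
    correlated rows get distance 0 and perfectly anticorrelated rows
    get distance D (scaling assumed)."""
    C = np.corrcoef(X)            # N x N correlations, a Theta(N^2 M) step
    return D * (1.0 - C) / 2.0
```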
Figure 10: Running time as a function of D on the Exemplar task (scaled by 1/D).
expected to be roughly linear in D, they were all divided by D in producing the graphs; a flat line therefore indicates a truly linear algorithm. SVD, because its running time is essentially constant in D, has decreasing curves in both graphs. SVD was so slow on the Word task, however, that it appears on that graph only for D = 200. MGD and OGD seem to be fairly linear in D. Both are a bit slower for very small D on the Exemplar task, possibly because they had difficulty settling on a good solution with so few bits. OGD was significantly quicker on the Word task but slower on the Exemplar task. The GBF method appears to be somewhat worse than linear, its curves noticeably rising on the right side. GBFS, on the other hand, is essentially quadratic in D. The algorithm that is consistently fastest, other than the ineffective SVD method, is GMC. On the Word task with 200 bits per vector, GMC is almost three and a half times faster than the next fastest method, GBF.

Figure 11: Running time as a function of D on the Word task (scaled by 1/D).

Although GMC is known to be truly linear in D, its scaled running times actually decrease with larger D. This reflects the fact that lower-order terms, such as the N² cost of loading the pairwise distances, have a relatively diminishing effect on the overall time. It also indicates that the other methods that appeared to be linear, with flat lines, may actually be slightly worse. Figure 12 depicts the running times of the methods for varying numbers of items, N, on the Word task. In this case, the running times have been divided by N². GMC is known to be quadratic in N. Therefore, the slight rise in its line is due either to lower-order terms or to caching inefficiencies resulting from the increased memory requirements of the larger problems. GBF has a similar profile and is thus nearly quadratic in N as well. GBFS and MGD, on the other hand, are clearly worse than N². OGD may be slightly worse than quadratic, but it is not clear. SVD ranged from 4.3 times slower than MGD for N = 500 to 9.5 times slower for N = 5000.
Figure 12: Running time as a function of N on the Word task (scaled by 1/N²).
10 Discussion
Although multidimensional scaling techniques have been studied for over half a century, binary multidimensional scaling, which was inspired by the need to develop representations usable in training neural networks, is a relatively new and intriguing problem. This study has introduced and evaluated several algorithms for performing binary multidimensional scaling. It is hoped that the better methods will prove useful to researchers in their current forms and that the discussion of the less effective methods in this article will help to direct future attempts to improve on these techniques.

10.1 SVD. Although it is so useful in other types of scaling problems, the SVD method is not a good choice for BMDS. It consistently achieved the worst performance, and for the Word task, this came at the greatest cost in terms of running time. Although it is possible that improved discretization
methods could achieve better BMDS performance using the SVD, there is still the issue of the running time. Unless either the number or the dimensionality of the original vectors is quite small, simply computing the SVD is prohibitively expensive.

10.2 MGD and OGD. The gradient descent methods, which are the most closely related to techniques currently in use in standard MDS, show some promise for BMDS. They have the advantage that they can be used with any differentiable cost function and are thus extremely flexible. Although they were slower than GBF and GMC on these tasks, most of the improvement in their results occurs early in the gradient descent, and the process can be cut short with relatively little effect on performance. On the Exemplar task, OGD and MGD were not as good as the bit-space methods by any measure. However, they performed very well according to the goodness measure on the Word task, especially OGD. OGD was also the best according to nonmetric stress for small D on the Word task. Unless one is concerned about using a nonmetric method, OGD seems to be a better choice than MGD. It generally achieves superior performance and also converges more quickly. This is partially due to the fact that it is nonmetric and partially due to its use of sigmoidally transformed values in computing the vector distances. The time to convergence of MGD could, however, be reduced with the use of a polarizing cost function. Innumerable variants of these methods are possible, and it is likely that both could be improved with further work.

10.3 GBF and GBFS. The GBFS method is essentially the one used by Clouse and Cottrell (1996), although the current implementation begins with a random projection and operates in a greedy fashion rather than by flipping random bits with positive gain. GBFS consistently produces very good solutions. Unfortunately, it suffers from being quadratic in D and more than quadratic in N.
Because it uses a linear cost function, the GBF method is able to cut corners and run much more quickly, with only a modest loss in performance on the Exemplar task. On the Word task, GBF does not do as well and has particular trouble with small D. Nevertheless, both GBF and GBFS appear to be strictly worse than GMC, in terms of running time as well as performance. Although the speed of both methods could be improved by terminating the optimization early, this would hurt performance.

10.4 GMC. The GMC algorithm currently seems to be the best overall choice for BMDS. It is the fastest of the algorithms and produces the best or nearly the best results according to the stress measures; it also achieved the best goodness scores on the Exemplar task. But it should be noted that OGD did achieve better goodness ratings on the Word task, and thus may
1230
Douglas L. T. Rohde
be preferable in cases where nonmetric solutions are acceptable and the similarity structure is relatively complex. The GMC algorithm has a number of other advantages. Its running time is well understood: unlike the gradient descent and bit-flipping methods, GMC runs for a consistent and predictable amount of time. As with GBF and GBFS, but unlike the gradient descent methods, GMC can be modified to produce only unique vectors by adding a term to the cost function. This may be a requirement for some applications of BMDS; for example, it could be problematic if two different words were assigned exactly the same meaning. GMC can also be used with a variety of cost functions, although it is not quite as flexible as the gradient descent methods in this regard because the cost must be incrementally calculable. Finally, GMC is very easy to implement. Unlike the gradient descent methods, there are no learning rates or other parameters to adjust or complex derivatives to compute. Unlike the bit-flipping methods, the algorithm is simple and straightforward, with minimal record keeping.

10.5 Conclusion. With the exception of SVD and possibly GBFS, the binary multidimensional scaling methods presented here are capable of handling problems of reasonably high complexity. However, even GMC, with a running time of Θ(N²(M + D)), will not scale up well to problems with hundreds of thousands of items or dimensions. To solve such large problems, more efficient, though perhaps less effective, techniques will be needed. One possibility is to use a limited set of R reference items. All items are positioned relative to those in the reference set but not necessarily relative to one another. If the dimensionality of the final space is not too large, the reference vectors may sufficiently constrain the positions of the other items relative to one another to produce a good overall solution.
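The reference-item idea in the last paragraph can be sketched as follows. This is a hedged illustration under stated assumptions, not a method from the article: codes for the reference items are taken as already fixed, each remaining item is fit by greedy bit flips against the references only, and all names are hypothetical.

```python
import random

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def fit_to_references(dissim, n_items, ref_codes, D, rng):
    # Each non-reference item gets a D-bit code whose Hamming distances
    # to the reference codes approximate its dissimilarities to the
    # reference items; pairs of non-reference items are never compared,
    # so the cost per item is linear in the number of references.
    codes = dict(ref_codes)
    for i in range(n_items):
        if i in codes:
            continue
        v = [rng.randint(0, 1) for _ in range(D)]
        cost = lambda: sum((hamming(v, c) - dissim(i, r)) ** 2
                           for r, c in ref_codes.items())
        improved = True
        while improved:
            improved = False
            for k in range(D):
                before = cost()
                v[k] ^= 1
                if cost() < before:
                    improved = True
                else:
                    v[k] ^= 1
        codes[i] = v
    return codes
```

Because each kept flip strictly lowers a bounded cost, the inner loop terminates; the total work grows with N·R·D rather than N².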
Variants of this idea are possible with all of the algorithms presented here, although not always conjointly with the uniqueness constraint offered by the bit-space methods. Code for any of these methods can be obtained by contacting the author. Acknowledgments
This research was supported by NIMH NRSA Grant 5-T32-MH19983 for the study of Computational and Behavioral Approaches to Cognition. Special thanks to David Plaut for his comments and guidance, Santosh Vempala for his brilliance, and Daniel Clouse and Stephen J. Hanson for their helpful comments. References Beatty, M., & Manjunath, B. S. (1997). Dimensionality reduction using multidimensional scaling for image search. In Proceedings of the IEEE International Conference on Image Processing (Vol. 2, pp. 835–838). Santa Barbara, CA.
Berry, M. W., Dumais, S. T., & O'Brien, G. W. (1994). Using linear algebra for intelligent retrieval (Tech. Rep. No. CS-94-270). Knoxville, TN: University of Tennessee. Borg, I., & Groenen, P. (1997). Modern multidimensional scaling. New York: Springer-Verlag. Burgess, C. (1998). From simple associations to the building block of language: Modeling meaning in memory with the HAL model. Behavior Research Methods, Instruments, and Computers, 30, 188–198. Clouse, D. S. (1998). Representing lexical semantics with context vectors and modeling lexical access with attractor networks. Unpublished doctoral dissertation, University of California, San Diego. Clouse, D. S., & Cottrell, G. W. (1996). Discrete multi-dimensional scaling. In Proceedings of the 18th Annual Conference of the Cognitive Science Society (pp. 290–294). Mahwah, NJ: Erlbaum. Deerwester, S. C., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41, 391–407. Deza, M. M., & Laurent, M. (1997). Geometry of cuts and metrics. Berlin: Springer-Verlag. Frieze, A., Kannan, R., & Vempala, S. (1998). Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science. Los Alamitos, CA: IEEE Computer Society Press. Karp, R. M. (1972). Reducibility among combinatorial problems. In R. Miller and J. Thatcher (Eds.), Complexity of computer computations (pp. 85–103). New York: Plenum Press. Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27. Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115–129. Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28, 203–208. Pearlmutter, B. A. (1989).
Learning state space trajectories in recurrent neural networks. Neural Computation, 1, 263–269. Plaut, D. C., & Shallice, T. (1993). Deep dyslexia: A case study of connectionist neuropsychology. Cognitive Neuropsychology, 10, 377–500. Richardson, M. W. (1938). Multidimensional psychophysics. Psychological Bulletin, 35, 659. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press. Sahni, S., & Gonzales, T. (1976). P-complete approximation problems. Journal of the ACM, 23, 555–565. Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27, 125–139, 219–246.
Shepard, R. N. (1966). Metric structures in ordinal data. Journal of Mathematical Psychology, 3, 287–315. Shepard, R. N., Romney, A. K., & Nerlove, S. B. (1972). Multidimensional Scaling, Volume 1: Theory. New York: Seminar Press. Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419. Torgerson, W. S. (1965). Multidimensional scaling of similarity. Psychometrika, 30, 379–393. Williams, J. W. J. (1964). Algorithm 232: Heapsort. Communications of the ACM, 7, 347–348. Young, G., & Householder, A. S. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3, 19–22. Received November 30, 2000; accepted September 5, 2001.
ARTICLE
Communicated by Philip Sabes
Cosine Tuning Minimizes Motor Errors Emanuel Todorov [email protected] Gatsby Computational Neuroscience Unit, University College London, London, U.K.
Cosine tuning is ubiquitous in the motor system, yet a satisfying explanation of its origin is lacking. Here we argue that cosine tuning minimizes expected errors in force production, which makes it a natural choice for activating muscles and neurons in the final stages of motor processing. Our results are based on the empirically observed scaling of neuromotor noise, whose standard deviation is a linear function of the mean. Such scaling predicts a reduction of net force errors when redundant actuators pull in the same direction. We confirm this prediction by comparing forces produced with one versus two hands and generalize it across directions. Under the resulting neuromotor noise model, we prove that the optimal activation profile is a (possibly truncated) cosine—for arbitrary dimensionality of the workspace, distribution of force directions, correlated or uncorrelated noise, with or without a separate cocontraction command. The model predicts a negative force bias, truncated cosine tuning at low muscle cocontraction levels, and misalignment of preferred directions and lines of action for nonuniform muscle distributions. All predictions are supported by experimental data.

1 Introduction

Neurons are commonly characterized by their tuning curves, which describe the average firing rate f(x) as a function of some externally defined variable x. The question of what constitutes an optimal tuning curve for a population code (Hinton, McClelland, & Rumelhart, 1986) has attracted considerable attention. In the motor system, cosine tuning has been well established for motor cortical cells (Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Kettner, Schwartz, & Georgopoulos, 1988; Kalaska, Cohen, Hyde, & Prud'homme, 1989; Caminiti, Johnson, Galli, Ferraina, & Burnod, 1991) as well as individual muscles (Turner, Owens, & Anderson, 1995; Herrmann & Flanders, 1998; Hoffman & Strick, 1999).
¹ When a subject exerts isometric force or produces movements, each cell and muscle is maximally active for a particular direction of force or movement (called its preferred direction), and its activity falls off with the cosine of the angle between the preferred and actual directions.

Neural Computation 14, 1233–1260 (2002) © 2002 Massachusetts Institute of Technology

The robustness of cosine
tuning suggests that it must be optimal in some meaningful sense, yet a satisfactory explanation is lacking. In this article, we argue that cosine tuning in the motor system is indeed optimal in the most meaningful sense one can imagine: it minimizes the net effect of neuromotor noise, resulting in minimal motor errors. The argument developed here is specific to the motor system. Since it deviates from previous analyses of optimal tuning, we begin by clarifying the main differences.

1.1 Alternative Approaches to Optimal Tuning. The usual approach (Hinton et al., 1986; Snippe, 1996; Zhang & Sejnowski, 1999b; Pouget, Deneve, Ducom, & Latham, 1999) is to equate the goodness of a tuning function f with how accurately the variable x can be reconstructed from a population of noisy responses μ₁ + ε₁, …, μₙ + εₙ, where μᵢ = f(x − cᵢ) is the mean response of neuron i with receptive field center cᵢ. This approach to the analysis of empirically observed tuning is mathematically appealing but involves hard-to-justify assumptions. In the absence of data on higher-order correlations, and in the interest of analytical tractability, oversimplified noise models have to be assumed.² In contrast, when the population responses are themselves the outputs of a recurrent network, the joint noise distribution is likely to be rather complex. Ignoring that complexity can lead to absurd conclusions, such as an apparent increase of information (Pouget et al., 1999). Since the actual reconstruction mechanisms used by the nervous system, as well as their outputs, are rarely observable, one has to rely on theoretical limits (i.e., the Cramér-Rao bound), ignoring possible biological constraints and noise originating at the reconstruction stage. Optimality criteria that may arise from the need to perform computation (and not just represent or transmit information) are also ignored.
Even if these assumptions are accepted, it was recently shown (Zhang & Sejnowski, 1999b) that the optimal tuning width is biologically implausible: as narrow as possible³ when x is one-dimensional, irrelevant when x is two-dimensional, and as broad as possible when x is more than two-dimensional. Thus, empirical observations such as cosine tuning are difficult to interpret as being optimal in the usual sense.
² The noise terms ε₁, …, εₙ are usually modeled as independent or homogeneously correlated Poisson variables.
³ The finite number of neurons in the population prevents infinitely sharp tuning (i.e., the entire range of x has to be covered), but that is a weak constraint, since a given area typically contains large numbers of neurons.
Cosine Tuning Minimizes Motor Errors
1235
In this article, we pursue an alternative approach. The optimal tuning function f* is still defined as the one that maximizes the accuracy of the reconstruction μ₁ + ε₁, …, μₙ + εₙ → x̂. However, we do not speculate that the input noise distribution has any particular form or that the reconstruction is optimal. Instead, we use knowledge of the actual reconstruction mechanisms and measurements of the actual output x̂, which in the motor system is simply the net muscle force.⁴ That allows us to infer a direct mapping μ₁, …, μₙ → Mean(x̂), Var(x̂) from the means of the inputs to the mean and variance of the output. Once such a mapping is available, the form of the input noise and the amount of information about x that could in principle have been extracted become irrelevant to the investigation of optimal tuning.

1.2 Optimal Tuning in the Motor System. We construct the mapping μ₁, …, μₙ → Mean(x̂), Var(x̂) based on two sets of observations, relating (1) the mean activations to the mean of the net force and (2) the mean to the variance of the net force. Under isometric conditions, individual muscles produce forces in proportion to the rectified and filtered electromyogram (EMG) signals (Zajac, 1989; Winter, 1990), and these forces add mechanically to the net force.⁵ Thus, the mean of the net force is simply the vector sum of the mean muscle activations μ₁, …, μₙ multiplied by the corresponding force vectors u₁, …, uₙ (defining the constant lines of action). If the output cells of primary motor cortex (M1) contribute additively to the activation of muscle groups (Todorov, 2000), a similar additive model may apply for μ₁, …, μₙ corresponding to mean firing rates in M1. In the rest of the article, μ₁, …, μₙ will denote the mean activation levels of abstract force generators, which correspond to individual muscles or muscle groups. The relevance to M1 cell tuning is addressed in section 6.
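The additive force model just described can be stated compactly: with fixed unit force vectors uᵢ, the mean net force is the vector sum Σᵢ μᵢ uᵢ. A minimal sketch in the plane (the 2D setting and the function name are illustrative choices, not from the article):

```python
import math

def mean_net_force(activations, angles):
    # Mean net force = vector sum of mean activations times each
    # generator's fixed unit force vector (its line of action).
    fx = sum(m * math.cos(a) for m, a in zip(activations, angles))
    fy = sum(m * math.sin(a) for m, a in zip(activations, angles))
    return fx, fy
```

Two equally activated generators with opposite lines of action produce zero mean net force, which is why cocontraction changes the noise and impedance without changing the mean force.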
Numerous studies of motor tremor have established that the standard deviation of the net force increases linearly with its mean. This has been demonstrated when tonic isometric force is generated by muscle groups (Sutton & Sykes, 1967) or individual muscles (McAuley, Rothwell, & Marsden, 1997). The same scaling holds for the magnitude of brief force pulses (Schmidt, Zelaznik, Hawkins, Frank, & Quinn, 1979). This general finding
⁴ We focus predominantly on isometric force tasks and extend our results to movement velocity and displacement tuning in the last section. Thus, the output (reconstruction) is defined as net force (i.e., the vector sum of all individual muscle forces) in the relevant workspace.
⁵ The contributions of different muscles to end-point force are determined by the Jacobian transformation and the tendon insertion points. Each muscle has a line of action (force vector) in end-point space, as well as in joint space. Varying the activation level under isometric conditions affects the force magnitude, but the force direction remains fixed.
1236
Emanuel Todorov
is also confirmed indirectly by the EMG histograms of various muscles, which lie between a gaussian and a Laplace distribution, both centered at 0 (Clancy & Hogan, 1999). Under either distribution, the rectified signal |x| has standard deviation proportional to the mean.⁶ The above scaling law leads to a neuromotor noise model in which each generator contributes force with standard deviation σ linear in the mean μ: σ = aμ. This has an interesting consequence. Suppose we had two redundant generators pulling in the same direction and wanted them to produce net force μ. If we activated only one of them at level μ, the net variance would be σ² = a²μ². If we activated both generators at level μ/2, the net variance (assuming uncorrelated noise) would be σ² = a²μ²/2, which is two times smaller. Thus, it is advantageous to activate all generators pulling in the direction of the desired net force. What about generators pulling in slightly different directions? If all of them are recruited simultaneously, the noise in the net force direction will still decrease, but at the same time, extra noise will be generated in orthogonal directions. So the advantage of activating redundant actuators decreases with the angle away from the net force direction. The main technical contribution of this article is to show that it decreases as a cosine, that is, cosine tuning minimizes expected motor errors. Note that the above setting of the optimal tuning problem is in open loop; the effects of activation level on feedback gains are not explicitly considered. Such effects should be taken into account because coactivation of opposing muscles may involve interesting trade-offs: it increases both neuromotor noise and system impedance and possibly modifies sensory inputs (due to α–γ coactivation). We incorporate these possibilities by assuming that an independent cocontraction command C may be specified, in which case the net activity of all generators is constrained to equal C.
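The two-generator variance argument can be checked with a short simulation. Gaussian noise is assumed here purely for illustration; the text only requires that each generator's standard deviation scale linearly with its mean.

```python
import random, statistics

def net_force_samples(levels, a=0.2, n=200_000, seed=0):
    # Each generator contributes its mean level plus noise with
    # standard deviation a * level; contributions add mechanically.
    rng = random.Random(seed)
    return [sum(m + rng.gauss(0.0, a * m) for m in levels)
            for _ in range(n)]

one_sd = statistics.stdev(net_force_samples([10.0]))       # one generator at mu
two_sd = statistics.stdev(net_force_samples([5.0, 5.0]))   # two generators at mu/2
# the ratio one_sd / two_sd approaches sqrt(2), as claimed above
```

Splitting the same mean force over k uncorrelated generators reduces the net standard deviation by a factor of √k, which is exactly the effect tested experimentally in the next section.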
As shown below, the optimal tuning curve is a cosine regardless of whether C is specified. The optimal setting of C itself will be addressed elsewhere. In the next section, we present new experimental results confirming the reduction of noise due to redundancy. The rest of the article develops the mathematical argument for cosine tuning rigorously, under quite general assumptions.

2 Actuator Redundancy Decreases Neuromotor Noise

The empirically observed scaling law σ = aμ implies that activating redundant actuators should reduce the overall noise level. This effect forms the basis of the entire model, so we decided to test it experimentally. Ideally, we would ask subjects to produce specified forces by activating one versus two

⁶ For the Laplace distribution p_s(x) ∝ exp(−|x|/s), the mean of |x| is s and the variance is s². For the zero-mean gaussian with standard deviation s, the mean of |x| is s√(2/π) and the variance is s²(1 − 2/π).
synergistic muscles and compare the corresponding noise levels. But human subjects have little voluntary control over which muscles they activate, so instead we used the two hands as redundant force generators: we compared the force errors for the same level of net instructed force produced with one hand versus both hands. The two comparisons are not identical, since the neural mechanisms coordinating the two hands may be different from those coordinating synergistic muscles of one limb. In particular, one might expect coordinating the musculature of both hands to be more difficult, which would increase the errors in the two-hands condition (opposite to our prediction). Thus, we view the results presented here as strong supporting evidence for the predicted effect of redundancy on neuromotor noise.
2.1 Methods. Eight subjects produced isometric forces of specified magnitude (3–33 N) by grasping a force transducer disk (Assurance Technologies F/T Gamma 65/5, 500 Hz sampling rate, 0.05 N resolution) between the thumb and the other four fingers. The instantaneous force magnitude produced by the subject was displayed with minimum delay as a vertical bar on a linear 0–40 N scale. Each of 11 target magnitudes was presented in a block of three trials (5 sec per trial, 2 sec between trials), and the subjects were asked to maintain the specified force as accurately as possible. The experiment was repeated twice: with the dominant hand and with both hands grasping the force transducer. Since forces were measured along the forward axis, the two hands can be considered mechanically identical (i.e., redundant) actuators. To balance possible learning and fatigue effects, the order of the 11 force magnitudes was randomized separately for each subject (subsequent analysis revealed no learning effects). Half of the subjects started with both hands, the other half with the dominant hand. The first 2 seconds of each trial were discarded; visual inspection confirmed that this initial period contained the force transient associated with reaching the desired force level. The remaining 3 seconds (1500 sample points) of each trial were used in the data analysis.
2.2 Results. The average standard deviations are shown in Figure 1B for each force level and hand condition. In agreement with previous results, the standard deviation in both conditions was a convincingly linear function of the instructed force level. As predicted, the force errors in the two-hands condition were smaller, and the ratio of the two slopes was 1.42 ± 0.25 (95% confidence interval), which is indistinguishable from the predicted value of √2 ≈ 1.41. Two-way (2 conditions × 11 force levels) ANOVA with replications (eight subjects) indicated that both effects were highly significant (p < 0.0001), and there was no interaction effect (p = 0.57). Plotting standard deviation versus mean (rather than instructed) force produced very similar results.
[Figure 1 appears here: (A) constant error (force bias, N) and (B) variable error (standard deviation, N), each plotted against instructed force (3–33 N) for the dominant-hand and both-hands conditions, with linear fits.]
Figure 1: The last 3 seconds of each trial were used to estimate the bias (A) and standard deviation (B) for each instructed force level and hand condition. Averages over subjects and trials, with standard error bars, are shown in the figure. The standard deviation estimates were corrected for sensor noise, measured by placing a 2.5 kg object on the sensor and recording for 10 seconds.
The nonzero intercept in our data was smaller than previous observations but still significant. It is not due to sensor noise (as previously suggested), because we measured that noise and subtracted its variance. One possible explanation is that because of cocontraction, some force fluctuations are present even when the mean force is 0. Figure 2A shows the power spectral density of the fluctuations in the two conditions, separated into low (3–15 N) and high (21–33 N) force levels. The scaling is present at all frequencies, as observed previously (Sutton & Sykes, 1967). Both increasing actuator redundancy and decreasing the force level have similar effects on the spectral density. To identify possible differences between frequency bands, we low-pass-filtered the data at 5 Hz and bandpass-filtered at 5–25 Hz. As shown in Figure 2B, the noise in both frequency bands obeys the same scaling law: standard deviation linear in the mean, with higher slope in the one-hand condition. The slopes in the two frequency bands are different, and interestingly, the intercept we saw before is restricted to low frequencies. If the nonzero intercept is indeed due to cocontraction, Figure 2B implies that the cocontraction signal (i.e., common central input to opposing muscles) fluctuates at low frequencies. We also found small but highly significant negative biases (defined as the difference between measured and instructed force) that increased with the instructed force level and were higher in the one-hand condition (see Figure 1A). This effect cannot be explained with perceptual or memory
[Figure 2 appears here: (A) power spectral densities (average power, dB) versus frequency (0–25 Hz) for low and high forces in the two hand conditions, together with the average sensor noise; (B) standard deviation (N) versus instructed force (3–33 N) in the 0–5 Hz and 5–25 Hz frequency bands.]
Figure 2: (A) The power spectral density of the force fluctuations was estimated using blocks of 500 sample points with 250-point overlap. Blocks were mean-detrended and windowed using a Hanning window, and the squared magnitudes of the Fourier coefficients were averaged separately for each frequency. This was done separately for each instructed force level, and then the low (3–15 N) and high (21–33 N) force levels were averaged. The data from some subjects showed a sharper peak around 8–10 Hz, but that is smoothed out in the average plot. There appears to be a qualitative change in the way average power decreases at about 5 Hz. (B) The data were low-pass-filtered at 5 Hz (fifth-order Butterworth filter) and also bandpass-filtered at 5–25 Hz. The standard deviation for each force level and hand condition was estimated separately in each frequency band.
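The spectral estimate described in the caption (mean-detrended, Hanning-windowed 500-point blocks with 250-point overlap, squared FFT magnitudes averaged per frequency) can be sketched as follows; this is a generic reimplementation of that recipe, not the authors' code.

```python
import numpy as np

def block_psd(x, block=500, overlap=250):
    # Average squared FFT magnitudes over mean-detrended,
    # Hanning-windowed, overlapping blocks of the signal.
    step = block - overlap
    window = np.hanning(block)
    spectra = []
    for start in range(0, len(x) - block + 1, step):
        seg = x[start:start + block]
        seg = (seg - seg.mean()) * window      # mean-detrend, then window
        spectra.append(np.abs(np.fft.rfft(seg)) ** 2)
    return np.mean(spectra, axis=0)
```

At the 500 Hz sampling rate used in the experiment, 500-point blocks make frequency bin k correspond to k Hz, giving 1 Hz resolution over the 0–25 Hz range plotted in the figure.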
limitations, since subjects received real-time visual feedback on a linear force scale. A similar effect is predicted by optimal force production: if the desired force level is μ* and we specify mean activation μ for a single generator, the expected squared error is (μ − μ*)² + a²μ², which is minimal for μ = μ*/(1 + a²) < μ*. Thus, the optimal bias is negative, larger in the one-hand condition, and its magnitude increases with μ*. The slopes in Figure 1A are substantially larger than predicted, which is most likely due to a trade-off between error and effort (see section 5.2). Other possible explanations include an inaccurate internal estimate of the noise magnitude and a cost function that penalizes large fluctuations more than the squared-error cost does (see section 5.4). Summarizing the results of the experiment, the neuromotor noise scaling law observed previously (Sutton & Sykes, 1967; Schmidt et al., 1979) was replicated. Our prediction that redundant generators reduce noise was confirmed. Thus, we feel justified in assuming that each generator contributes force whose standard deviation is a linear function of the mean.
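The negative-bias prediction above follows from one line of calculus; here is a hedged numerical check (the function names are illustrative, not from the article):

```python
def expected_sq_error(mu, mu_star, a):
    # Expected squared error for a single generator with mean
    # activation mu: bias term plus noise variance (a * mu) ** 2.
    return (mu - mu_star) ** 2 + (a * mu) ** 2

def optimal_activation(mu_star, a):
    # Setting the derivative 2*(mu - mu_star) + 2*a*a*mu = 0
    # gives mu = mu_star / (1 + a * a), which is below mu_star:
    # the optimal mean force undershoots the target.
    return mu_star / (1.0 + a * a)
```

Because the noise variance grows with the activation, a slightly weaker force buys a more than compensating reduction in fluctuations, so the error-minimizing bias is negative.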
3 Force Production Model

Consider a set V of force generators producing force (torque) in D-dimensional Euclidean space R^D. Generator a ∈ V produces force proportional to its instantaneous activation, always in the direction of the unit vector u(a) ∈ R^D.⁷ The dimensionality D can be only 2 or 3 for end-point force but much higher for joint torque (e.g., 7 for the human arm). The central nervous system (CNS) specifies the mean activations μ(a), which are always nonnegative. The actual force contributed by each generator is (μ(a) + z(a)) u(a). The neuromotor noise z is a set of zero-mean random variables whose probability distribution p(z|μ) depends on μ (see below), with p(z(a) < −μ(a)) = 0 since muscles cannot push. Thus, the net force r(μ) ∈ R^D is the random variable⁸

r(μ) = |V|⁻¹ Σ_{a∈V} (μ(a) + z(a)) u(a).
Given a desired net force vector f ∈ R^D and optionally a cocontraction command (net activation) C = |V|⁻¹ Σ_a μ(a), the task of the CNS is to find the activation profile μ(a) ≥ 0 that minimizes the expected force error under p(z|μ).⁹ We will define error as the squared Euclidean distance between the desired force f and the actual force r (alternative cost functions are considered in section 5.4). Note that both direction and magnitude errors are penalized, since both are important for achieving the desired motor objectives. The expected error is the sum of variance V and squared bias B:

E_{z|μ}[(r − f)ᵀ(r − f)] = trace(Cov_{z|μ}[r, r]) + (r̄ − f)ᵀ(r̄ − f),   (3.1)

with variance term V = trace(Cov_{z|μ}[r, r]) and squared bias B = (r̄ − f)ᵀ(r̄ − f),
where the mean force is r̄ = |V|⁻¹ Σ_a μ(a) u(a), since E[z(a)] = 0 for each a. We first focus on minimizing the variance term V for a specified mean force r̄ and then perform another minimization with respect to (w.r.t.) r̄. Exchanging the order of the trace, E, and Σ operators, the variance term becomes:

V = |V|⁻² trace( E_{z|μ}[ Σ_{a∈V} Σ_{b∈V} z(a) z(b) u(a) u(b)ᵀ ] )
  = |V|⁻² Σ_{a∈V} Σ_{b∈V} Cov_{z|μ}[z(a), z(b)] u(b)ᵀ u(a).

⁷ a will be used interchangeably as an index over force generators in the discrete case and as a continuous index specifying direction in the continuous case.
⁸ The scaling constant |V|⁻¹ simplifies the transition to a continuous V later: |V|⁻¹ Σ_a ··· will be replaced with |S_D|⁻¹ ∫ ··· da, where |S_D| is the surface area of the unit sphere in R^D. It does not affect the results.
⁹ μ(a) is the activation profile of all generators at one point in time, corresponding to a given net force vector. In contrast, a tuning curve is the activity of a single generator when the net force direction varies. When μ(a) is symmetric around the net force direction, it is identical to the tuning curve of all generators. This symmetry holds in most of our results, except for nonuniform distributions of force directions. In that case, we will compute the tuning curve explicitly (see Figure 4).
To evaluate V, we need a definition of the noise covariance Cov_{z|μ}[z(a), z(b)] for any pair of generators a, b. Available experimental results only suggest the form of the expression for a = b: since the standard deviation is a linear function of the mean force, Cov_{z|μ}[z(a), z(a)] is a quadratic polynomial in μ(a). This will be generalized to a quadratic polynomial across directions as

Cov_{z|μ}[z(a), z(b)] = ( λ₁ μ(a)μ(b) + λ₂ (μ(a) + μ(b))/2 ) (δ_ab + λ₃).
The δ_ab term is a delta function, corresponding to independent noise in each force generator. The correlation term λ₃ corresponds to fluctuations in some shared input to all force generators. A correlation term dependent on the angle between u(a) and u(b) is considered in section 5.3. Substituting in the above expression for V and defining U ≜ |V|⁻¹ Σ_a u(a), which is 0 when the force directions are uniformly distributed, we obtain

V = |V|⁻² Σ_{a∈V} (λ₁ μ(a)² + λ₂ μ(a)) + λ₁λ₃ r̄ᵀr̄ + λ₂λ₃ r̄ᵀU.   (3.2)
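The algebra leading to equation 3.2 can be verified numerically against the double-sum expression for V under the covariance model above; the planar example and the function names here are illustrative.

```python
import math, random

def V_double_sum(mu, u, l1, l2, l3):
    # Direct evaluation of |V|^-2 * sum_a sum_b Cov[z(a), z(b)] u(b)'u(a),
    # with Cov[z(a), z(b)] = (l1*mu_a*mu_b + l2*(mu_a+mu_b)/2) * (delta_ab + l3)
    # and unit force vectors u(a) in the plane.
    n = len(mu)
    total = 0.0
    for a in range(n):
        for b in range(n):
            cov = (l1 * mu[a] * mu[b] + l2 * (mu[a] + mu[b]) / 2.0) \
                  * ((1.0 if a == b else 0.0) + l3)
            total += cov * (u[a][0] * u[b][0] + u[a][1] * u[b][1])
    return total / n ** 2

def V_closed_form(mu, u, l1, l2, l3):
    # Equation 3.2: |V|^-2 * sum_a (l1*mu^2 + l2*mu) + l1*l3*r'r + l2*l3*r'U.
    n = len(mu)
    r = [sum(m * ui[k] for m, ui in zip(mu, u)) / n for k in (0, 1)]
    U = [sum(ui[k] for ui in u) / n for k in (0, 1)]
    diag = sum(l1 * m * m + l2 * m for m in mu) / n ** 2
    return (diag + l1 * l3 * (r[0] ** 2 + r[1] ** 2)
            + l2 * l3 * (r[0] * U[0] + r[1] * U[1]))
```

The two expressions agree because the δ_ab term picks out u(a)ᵀu(a) = 1, while the λ₃ terms factor into the resultants r̄ and U.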
The optimal activation profile can be computed in two steps: (1) for given r̄ in equation 3.2, find the constrained minimum V*(r̄) w.r.t. μ; (2) substitute in equation 3.1 and find the minimum of V*(r̄) + B(r̄) w.r.t. r̄. Thus, the shape of the optimal activation profile emerges in step 1 (see section 4), while the optimal bias is found in step 2 (see section 5.1).

4 Cosine Tuning Minimizes Expected Motor Error

The minimization problem given by equation 3.2 is an instance of a more general minimization problem, described next. We first solve that general problem and then specialize the solution to equation 3.2. The set V we consider can be continuous or discrete. The activation function (vector) μ ∈ R(V): V → R is nonnegative for all a ∈ V. Given an arbitrary positive weighting function w ∈ R(V), projection functions g_{1,…,N} ∈ R(V), resultant lengths r_{1,…,N} ∈ R, and offset λ ∈ R, we will solve the following
minimization problem w.r.t. μ:

Minimize     ⟨μ + λ, μ + λ⟩
Subject to   μ(a) ≥ 0 and ⟨μ, g_{1,…,N}⟩ = r_{1,…,N}      (4.1)
Where        ⟨u, v⟩ ≜ |V|⁻¹ Σ_{a∈V} u(a) v(a) w(a).
The generalized dot product is symmetric, linear in both arguments, and positive definite (since w(a) > 0 by definition). Note that the dot product is defined between activation profiles rather than force vectors. The solution to this general problem is given by the following result (see the appendix):¹⁰

Theorem 1. If μ*(a) = ⌊Σ_n αₙ gₙ(a) − λ⌋ satisfies ⟨μ*, g_{1,…,N}⟩ = r_{1,…,N} for some α_{1,…,N} ∈ R, then μ* is the unique constrained minimum of ⟨μ + λ, μ + λ⟩.

Thus, the unique optimal solution to equation 4.1 is a truncated linear combination of the projection functions g_{1,…,N}, assuming that a set of N constants α_{1,…,N} ∈ R satisfying the N constraints ⟨μ*, g_{1,…,N}⟩ = r_{1,…,N} exists. Although we have not been able to prove their existence in general, for the concrete problems of interest these constants can be found by construction (see below). Note that the analytic form of μ* does not depend on the arbitrary weighting function w used to define the dot product (the numerical values of the constants α_{1,…,N} can, of course, depend on w). In the case Σ_n αₙ gₙ(a) ≥ λ for all a ∈ V, the constants α_{1,…,N} satisfy the system of linear equations Σ_n αₙ ⟨gₙ, g_k⟩ = r_k + ⟨λ, g_k⟩ for k = 1, …, N. It can be solved by inverting the matrix G_{nk} ≜ ⟨gₙ, g_k⟩; for the functions g_{1,…,N} considered in section 4.3, the matrix G is always invertible.

4.1 Application to Force Generation. We now clarify what this general result has to do with our problem. Recall that the goal is to find the nonnegative activation profile μ(a) ≥ 0 that minimizes equation 3.2 for a given net force r̄ = |V|⁻¹ Σ_a μ(a) u(a) and optionally cocontraction C = |V|⁻¹ Σ_a μ(a). Omitting the last two terms in equation 3.2, which are constant, we have to minimize Σ_a (λ₁ μ(a)² + λ₂ μ(a)) = λ₁ Σ_a (μ(a) + λ)² + const, where λ ≜ λ₂ / (2λ₁).
Choosing the weighting function w(a) = 1 and assuming λ₁ > 0 as the experimental data indicate (λ₁ is the slope of the regression line in Figure 1B), this is equivalent to minimizing the dot product ⟨μ + λ, μ + λ⟩. Let e_{1,...,D} be an orthonormal basis of R^D, with respect to which r has coordinates r^T e₁, ..., r^T e_D and u(a) has coordinates u(a)^T e₁, ..., u(a)^T e_D. Then we can define r_{1,...,D} ≜ r^T e₁, ..., r^T e_D, r_{D+1} ≜ C, g_{1,...,D}(a) ≜ u(a)^T e₁, ...,
Footnote 10: ⌊x⌋ = x for x ≥ 0 and 0 otherwise. Similarly, ⌈x⌉ = x for x < 0 and 0 otherwise.
Cosine Tuning Minimizes Motor Errors
u(a)^T e_D, g_{D+1}(a) ≜ 1, and N ≜ D + 1 or D depending on whether the cocontraction command is specified. With these definitions, the problem is in the form of equation 4.1, theorem 1 applies, and we are guaranteed that the unique optimal activation profile is μ*(a) = ⌊Σ_n a_n g_n(a) − λ⌋ as long as we can find a_{1,...,N} ∈ R for which μ* satisfies all constraints.

Why is that function a cosine? The function g_n(a) = u(a)^T e_n is the cosine of the angle between unit vectors u(a) and e_n. A linear combination of D-dimensional cosines is also a cosine: Σ_n a_n g_n(a) = Σ_n a_n u(a)^T e_n = u(a)^T (Σ_n a_n e_n), and thus μ*(a) = ⌊u(a)^T E − λ⌋ for E = Σ_n a_n e_n. When C is specified, we have μ*(a) = ⌊u(a)^T E + a_{D+1} − λ⌋ since g_{D+1}(a) = 1. Note that if we are given E, the constants a are simply a_n = E^T e_n since the basis e_{1,...,D} is orthonormal.

To summarize the results so far, we showed that the minimum generalized length ⟨μ + λ, μ + λ⟩ of the nonnegative activation profile μ(a) subject to linear equality constraints ⟨μ, g_{1,...,N}⟩ = r_{1,...,N} is achieved for a truncated linear combination ⌊Σ_n a_n g_n(a) − λ⌋ of the projection functions g_{1,...,N}. Given a mean force r and optionally a cocontraction command C, this generalized length is proportional to the variance of the net muscle force, with the projection functions being cosines. Since a linear combination of cosines is a cosine, the optimal activation profile is a truncated cosine.

In the rest of this section, we compute the optimal activation profile in two special cases. In each case, all we have to do is construct, by whatever means, a function of the specified form that satisfies all constraints. Theorem 1 then guarantees that we have found the unique global minimum.

4.2 Uniform Distribution of Force Directions in R^D. For convenience, we will work with a continuous set of force generators V = S^D, the unit sphere embedded in R^D.
The summation signs will be replaced by integrals, and the force generator index a will be assumed to cover S^D uniformly, that is, the distribution of force directions is uniform. The normalization constant becomes the surface area^11 of the unit sphere |S^D| = 2π^(D/2)/Γ(D/2). The unit vector u(a) ∈ R^D corresponds to point a on S^D. The goal is to find a truncated cosine function that satisfies the constraints |S^D|^{-1} ∫_{S^D} μ*(a)u(a) da = r and optionally |S^D|^{-1} ∫_{S^D} μ*(a) da = C. We will look for a solution with axial symmetry around r, that is, a μ*(a) that depends only on the angle between the vectors r and u(a) rather than the actual direction u(a). This problem can be transformed into a problem on the circle in R² by correcting for the area of S^D being mapped into each point on the circle.

Footnote 11: The first few values of |S^D| are |S^{1,...,7}| = (2, 2π, 4π, 2π², (8/3)π², π³, (16/15)π³). Numerically, |S^D| decreases for D > 7.
The set of unit vectors u ∈ R^D at angle a away from a given vector r ∈ R^D is a sphere in R^(D−1) with radius |sin(a)|. Therefore, for any function f: S^D → R with axial symmetry around r, we have ∫_{S^D} f = ∫_{−π}^{π} f(a) (1/2)|S^(D−1)| |sin(a)|^(D−2) da, where the correction factor (1/2)|S^(D−1)| |sin(a)|^(D−2) is the surface area of an R^(D−1) sphere with radius |sin(a)|. Without loss of generality, r can be assumed to point along the positive x-axis: r = [R 0]. For given dimensionality D, define the weighting function w_D(a) ≜ (1/2)|S^(D−1)| |sin(a)|^(D−2) for a ∈ [−π; π], which as before defines the dot product ⟨u, v⟩_D ≜ |S^D|^{-1} ∫_{−π}^{π} u(a)v(a)w_D(a) da. The projection functions on the circle in R² are g₁(a) = cos(a), g₂(a) = sin(a), and optionally g₃(a) = 1. Thus, μ* has to satisfy ⟨μ*, cos⟩_D = R, ⟨μ*, sin⟩_D = 0, and optionally ⟨μ*, 1⟩_D = C. Below, we set a₂ = 0 and find constants a₁, a₃ ∈ R for which the function μ*(a) = ⌊a₁cos(a) + a₃⌋ satisfies those constraints. Since ⟨⌊a₁cos(a) + a₃⌋, sin⟩_D = 0 for any a₁, a₃, we are concerned only with the remaining two constraints. Note that ⟨μ*, cos⟩_D ≤ ⟨μ*, 1⟩_D, and therefore R ≤ C whenever C is specified. Also, from the definition of w_D(a) and the identity D cos²(a) sin^(D−2)(a) = (sin^(D−1)(a) cos(a))′ + sin^(D−2)(a), it follows that ⟨cos, 1⟩_D = 0, ⟨1, 1⟩_D = 1, and ⟨cos, cos⟩_D = 1/D.
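These identities are easy to confirm numerically. The sketch below is my own verification, not the authors' code; it evaluates the generalized dot product by midpoint quadrature (the names sphere_area and dot_D are mine) and checks ⟨cos, 1⟩_D = 0, ⟨1, 1⟩_D = 1, and ⟨cos, cos⟩_D = 1/D for D = 2, ..., 7.

```python
import math
import numpy as np

def sphere_area(D):
    # surface area |S^D| of the unit sphere embedded in R^D
    return 2 * math.pi ** (D / 2) / math.gamma(D / 2)

def dot_D(u, v, D, n=200000):
    # generalized dot product <u, v>_D with weight
    # w_D(a) = (1/2) |S^(D-1)| |sin(a)|^(D-2), midpoint rule on [-pi, pi]
    h = 2 * math.pi / n
    a = -math.pi + (np.arange(n) + 0.5) * h
    w = 0.5 * sphere_area(D - 1) * np.abs(np.sin(a)) ** (D - 2)
    return float(np.sum(u(a) * v(a) * w) * h / sphere_area(D))

one = lambda a: np.ones_like(a)
for D in range(2, 8):
    assert abs(dot_D(np.cos, one, D)) < 1e-6             # <cos, 1>_D = 0
    assert abs(dot_D(one, one, D) - 1) < 1e-3            # <1, 1>_D = 1
    assert abs(dot_D(np.cos, np.cos, D) - 1 / D) < 1e-3  # <cos, cos>_D = 1/D
```

For D = 2 the weight reduces to w_2(a) = 1 and the dot product is the ordinary average over the circle, matching the choice w(a) = 1 made in section 4.1.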
4.2.1 Specified Cocontraction C. We are looking for a function of the form μ*(a) = ⌊a₁cos(a) + a₃⌋ that satisfies the equality constraints. For a₃ ≥ a₁, this function is a full cosine. Using the above identities, we find that the constraints R = ⟨μ*, cos⟩_D = a₁⟨cos, cos⟩_D + a₃⟨1, cos⟩_D and C = ⟨μ*, 1⟩_D = a₁⟨cos, 1⟩_D + a₃⟨1, 1⟩_D are satisfied when a₁ = DR and a₃ = C. Thus, the optimal solution is a full cosine when C/R ≥ D (corresponding to a₃ ≥ a₁). When C/R < D, a full cosine solution cannot be found; thus, we look for a truncated cosine solution. Let the truncation point be a = ±t, that is, a₃ = −a₁cos(t). To satisfy all constraints, t has to be the root of the trigonometric equation

    C/R = D [sin(t)^(D−1)/(D − 1) − cos(t) I_D(t)] / [I_D(t) − cos(t) sin(t)^(D−1)/(D − 1)],

where I_D(t) ≜ ∫_0^t sin(a)^(D−2) da. That integral can be evaluated analytically for any fixed D. Once t is computed numerically, the constant a₁ is given by a₁ = R D (|S^D|/|S^(D−1)|) (I_D(t) − cos(t) sin(t)^(D−1)/(D − 1))^{-1}. It can be verified that the above trigonometric equation has a unique solution for any value of C/R in the interval (1, D). Values smaller than 1 are inadmissible because R ≤ C.
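Since the root has to be found numerically anyway, here is a minimal sketch of how one might do it (my own illustration, not the paper's code; the helper names I_D, cr_of_t, and solve_t are mine). It relies on the uniqueness claim above, treating the left-hand side as a monotone function of t that runs from 1 at t → 0 to D at t = π.

```python
import math
import numpy as np

def I_D(t, D, n=20000):
    # I_D(t) = integral of sin(a)^(D-2) on [0, t], midpoint rule
    h = t / n
    a = (np.arange(n) + 0.5) * h
    return float(np.sum(np.sin(a) ** (D - 2)) * h)

def cr_of_t(t, D):
    # C/R implied by truncation point t (trigonometric equation, section 4.2.1)
    num = math.sin(t) ** (D - 1) / (D - 1) - math.cos(t) * I_D(t, D)
    den = I_D(t, D) - math.cos(t) * math.sin(t) ** (D - 1) / (D - 1)
    return D * num / den

def solve_t(CR, D):
    # bisection on (0, pi); valid for C/R in (1, D)
    lo, hi = 1e-6, math.pi - 1e-9
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cr_of_t(mid, D) < CR:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = solve_t(1.5, 2)  # example: C/R = 1.5 in D = 2
assert abs(cr_of_t(t, 2) - 1.5) < 1e-9
```

For D = 2 the integral is simply I_2(t) = t, so the equation can also be checked by hand: at t = π it gives C/R = 2 = D, the full-cosine boundary.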
[Figure 3 appears here: two panels, (A) "C specified" and (B) "C unspecified", each plotting tuning width t (deg, 0 to 180) for D = 2, ..., 7; the x-axes are (C/R − 1)/(D − 1) in panel A and −λ/RD in panel B.]
Figure 3: (A) The truncation point (optimal tuning width) t computed from equation 4.2 for D = 2, ..., 7. Since the solution is a truncated cosine when 1 < C/R < D, it is natural to scale the x-axis of the plot as (C/R − 1)/(D − 1), which varies from 0 to 1 regardless of the dimensionality D. For C/R ≥ D, we have the full cosine solution, which technically corresponds to t = 180. (B) The truncation point (optimal tuning width) t computed from equation 4.3 for D = 2, ..., 7. Since the solution is a truncated cosine when −λ/R < D, it is natural to scale the x-axis of the plot as −λ/RD, which varies from −∞ to 1 regardless of the dimensionality D. For −λ/R ≥ D, we have the full cosine solution: t = 180.
Summarizing the solution,

    μ*(a) = DR cos(a) + C          : C/R ≥ D
    μ*(a) = a₁⌊cos(a) − cos(t)⌋    : C/R < D.        (4.2)
In Figure 3A we have plotted the optimal tuning width t in different dimensions, for the truncated cosine case C/R < D.

4.2.2 Unspecified Cocontraction C. In this case, a₃ = −λ, that is, μ*(a) = ⌊a₁cos(a) − λ⌋. For −λ ≥ a₁, the solution is a full cosine, and a₁ = DR as before. When −λ < DR, the solution is a truncated cosine. Let the truncation point be a = ±t. Then a₁ = λ/cos(t), and t has to be the root of the trigonometric equation

    λ/cos(t) = R D (|S^D|/|S^(D−1)|) (I_D(t) − cos(t) sin(t)^(D−1)/(D − 1))^{-1}.

Note that a₁ as a function of t is identical to the previous case when C was fixed, while the equation for t is different.
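The unspecified-C case can be solved the same way. The sketch below is my own illustration (helper names are mine): it treats −λ/R as a function of t, using λ = a₁cos(t) and the expression for a₁ given in the text, and inverts it by bisection under the assumption that it increases monotonically from −∞ toward D on (0, π), consistent with the boundary −λ/R = D of the full-cosine regime.

```python
import math
import numpy as np

def sphere_area(D):
    # |S^D| = 2 pi^(D/2) / Gamma(D/2)
    return 2 * math.pi ** (D / 2) / math.gamma(D / 2)

def I_D(t, D, n=20000):
    h = t / n
    a = (np.arange(n) + 0.5) * h
    return float(np.sum(np.sin(a) ** (D - 2)) * h)

def neg_lam_over_R(t, D):
    # -lambda/R implied by truncation point t, via lambda = a1 cos(t)
    a1_over_R = D * sphere_area(D) / sphere_area(D - 1) / (
        I_D(t, D) - math.cos(t) * math.sin(t) ** (D - 1) / (D - 1))
    return -math.cos(t) * a1_over_R

def solve_t(target, D):
    # bisection on (0, pi); valid for -lambda/R < D
    lo, hi = 1e-6, math.pi - 1e-9
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if neg_lam_over_R(mid, D) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = solve_t(1.0, 2)
assert abs(neg_lam_over_R(t, 2) - 1.0) < 1e-9
```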
Summarizing the solution:

    μ*(a) = DR cos(a) − λ          : −λ/R ≥ D
    μ*(a) = a₁⌊cos(a) − cos(t)⌋    : −λ/R < D.        (4.3)
In Figure 3B we have plotted the optimal tuning width t in different dimensions, for the truncated cosine case −λ/R < D. Comparing the curves in Figures 3A and 3B, we notice that in both cases the optimal tuning width is rather large (it is advantageous to activate multiple force generators), except in Figure 3A for C/R ≈ 1. From the triangle inequality, a solution μ(a) exists only when C ≥ R, and C = R implies μ(a ≠ 0) = 0. Thus, a small C "forces" the activation profile to become a delta function. But as soon as that constraint is relaxed, the width of the optimal solution increases sharply. Note also that in both figures, the dimensionality D makes little difference after appropriate scaling of the abscissa.

4.3 Arbitrary Distribution of Force Directions in R². For a uniform distribution of force directions, it was possible to replace the term Σ_a (μ(a) + λ)² with ∫_{S^D} (μ(a) + λ)² da in equation 3.2. If the distribution is not uniform but instead is given by some density function w(a), we have to take that function into account and find the activation profile μ* that minimizes ∫_{S^D} (μ*(a) + λ)² w(a) da subject to |S^D|^{-1} ∫_{S^D} μ*(a)u(a)w(a) da = r and optionally |S^D|^{-1} ∫_{S^D} μ*(a)w(a) da = C. Theorem 1 still guarantees that the optimal μ* is a truncated cosine, assuming we can find a truncated cosine satisfying the constraints. It is not clear how to do that for arbitrary dimensionality D and arbitrary density w, so we address only the case D = 2. For arbitrary w(a) and D = 2, the solution is of the form μ*(a) = ⌊a₁cos(a) + a₂sin(a) + a₃⌋. When C is not specified, we have a₃ = −λ. Here we evaluate these parameters only when C is specified and large enough to ensure a full cosine solution. The remaining cases can be handled using techniques similar to the previous sections.
Expanding w(a) in a Fourier series, w(a) = u₀/2 + Σ_{n=1}^∞ (u_n cos(na) + v_n sin(na)), and solving the system of linear equations given by the constraints, we obtain

    [a₁]   [(u₀ + u₂)/2    v₂/2          u₁]^{-1} [2R]
    [a₂] = [v₂/2           (u₀ − u₂)/2   v₁]      [0 ]
    [a₃]   [u₁             v₁            u₀]      [2C].
The optimal μ* depends only on the Fourier coefficients of w(a) up to order 2; the higher-order terms do not affect the minimization problem. In
the previous sections, w(a) was equal to the dimensionality correction factor |sin(a)|^(D−2), in which case only u₀ and u₂ were nonzero, the above matrix became diagonal, and thus we had a₁ ∝ R, a₂ = 0, a₃ ∝ C.

Note that μ*(a) is the optimal activation profile over the set of force generators for fixed mean force. In the case of uniformly distributed force directions, this also described the tuning function of an individual force generator for varying mean force direction, since μ*(a) was centered at 0 and had the same shape regardless of force direction. That is no longer true here. Since w(a) can be asymmetric, the directions of the force generator and the mean force matter (as illustrated in Figures 4A and 4B). The tuning functions of several force generators at different angles from the peak of w(a) are plotted in Figures 4A and 4B. The direction of maximal activation rotates away from the generator force direction and toward the short axis of w(a), for generators whose force direction lies in between the short and long axes of w(a). This effect has been observed experimentally for planar arm movements, where the distribution of muscle lines of action is elongated along the hand-shoulder axis (Cisek & Scott, 1998). In that case, muscles are maximally active when the net force is rotated away from their mechanical line of action, toward the short axis of the distribution. The same effect is seen in wrist muscles, where the distribution of lines of action is again asymmetric (Hoffman & Strick, 1999). The tuning modulation (difference between the maximum and minimum of the tuning curve) also varies systematically, as shown in Figures 4A and 4B. Such effects are more difficult to detect experimentally, since that would require comparisons of the absolute values of signals recorded from different muscles or neurons.

5 Some Extensions

5.1 Optimal Force Bias. The optimal force bias can be found by solving equation 3.1: minimize V*(r) + B(r) w.r.t. r.
We will solve it analytically only for a uniform distribution of force directions in R^D and when the minimum in equation 3.2 is a full cosine. It can be shown using equations 4.1 and 4.2 that for both C specified and unspecified, the μ*-dependent variance term in equation 3.2 is λ₁DR²/|S^D|. It is clear that the optimal mean force r is parallel to the desired force f, and all we have to find is its magnitude R = ‖r‖. Then to solve equation 3.1, we have to minimize w.r.t. R the expression (λ₁D/|S^D|)R² + λ₁λ₃R² + (R − ‖f‖)². Setting the derivative to 0, the minimum is achieved for

    ‖r‖ = ‖f‖ / (1 + λ₁λ₃ + λ₁D/|S^D|).

Thus, for positive λ₁, λ₃ the optimal mean force magnitude ‖r‖ is smaller than the desired force magnitude ‖f‖, and the optimal bias ‖r‖ − ‖f‖ increases linearly with ‖f‖.
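The closed form can be checked against a brute-force grid search. The sketch below is my own quick verification (the parameter values are arbitrary examples, not from the paper):

```python
import math

def sphere_area(D):
    # |S^D| = 2 pi^(D/2) / Gamma(D/2)
    return 2 * math.pi ** (D / 2) / math.gamma(D / 2)

def optimal_R(f_norm, lam1, lam3, D):
    # closed-form minimizer derived in section 5.1
    return f_norm / (1 + lam1 * lam3 + lam1 * D / sphere_area(D))

def cost(R, f_norm, lam1, lam3, D):
    # expression minimized with respect to R in section 5.1
    S = sphere_area(D)
    return lam1 * D / S * R ** 2 + lam1 * lam3 * R ** 2 + (R - f_norm) ** 2

# brute-force check on a grid (example parameter values)
f_norm, lam1, lam3, D = 1.0, 0.2, 0.5, 3
R_grid = [i * 1e-4 for i in range(20000)]
R_best = min(R_grid, key=lambda R: cost(R, f_norm, lam1, lam3, D))
assert abs(R_best - optimal_R(f_norm, lam1, lam3, D)) < 1e-3
assert optimal_R(f_norm, lam1, lam3, D) < f_norm  # optimal force is negatively biased
```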
[Figure 4 appears here: optimal generator tuning curves plotted against desired minus generator direction (deg, −180 to 180); panel (A) w(a) = 1 + 0.8 cos(2a), panel (B) w(a) = 1 + 0.25 cos(a); polar insets show w(a) and the generator directions.]
Figure 4: Optimal tuning of generators whose force directions point at different angles relative to the peak of the distribution w(a). 0 corresponds to the peak of w(a), and 90 (180, respectively) corresponds to the minimum of w(a). The polar plots show w(a), and the lines inside indicate the generator directions plotted in each figure. We used R = 1, C = 5. (A) w(a) has a second-order harmonic. In this case, the direction of maximal activation for generators near 45 rotates toward the short axis of w(a). The optimal tuning modulation increases for generators near 90. (B) w(a) has a first-order harmonic. In this case, the rotation is smaller, and the tuning curves near the short axis of w(a) shift upward rather than increasing their modulation.
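The 3 × 3 system of section 4.3 is easy to exercise numerically. The sketch below is my own illustration (the name solve_full_cosine is mine): it solves the system for the Figure 4A density w(a) = 1 + 0.8 cos(2a), which in the series convention w(a) = u₀/2 + Σ_n (u_n cos(na) + v_n sin(na)) means u₀ = 2, u₂ = 0.8 and all other coefficients zero, and then verifies the force and cocontraction constraints by quadrature, assuming the full-cosine regime.

```python
import math
import numpy as np

def solve_full_cosine(u0, u1, u2, v1, v2, R, C):
    # 3x3 linear system relating (a1, a2, a3) to the Fourier
    # coefficients of the density w(a), full-cosine regime
    M = np.array([[(u0 + u2) / 2, v2 / 2, u1],
                  [v2 / 2, (u0 - u2) / 2, v1],
                  [u1, v1, u0]])
    return np.linalg.solve(M, np.array([2 * R, 0.0, 2 * C]))

# Figure 4A density: w(a) = 1 + 0.8 cos(2a), i.e. u0 = 2, u2 = 0.8
R, C = 1.0, 5.0
a1, a2, a3 = solve_full_cosine(2.0, 0.0, 0.8, 0.0, 0.0, R, C)

# verify the constraints by midpoint quadrature
n = 100000
a = -math.pi + (np.arange(n) + 0.5) * (2 * math.pi / n)
w = 1 + 0.8 * np.cos(2 * a)
mu = a1 * np.cos(a) + a2 * np.sin(a) + a3
da = 2 * math.pi / n
assert abs(np.sum(mu * np.cos(a) * w) * da / (2 * math.pi) - R) < 1e-6
assert abs(np.sum(mu * np.sin(a) * w) * da / (2 * math.pi)) < 1e-6
assert abs(np.sum(mu * w) * da / (2 * math.pi) - C) < 1e-6
assert mu.min() >= 0  # full cosine: no truncation needed
```

With this symmetric density the matrix is diagonal, so a₂ = 0 and there is no rotation of the preferred direction for a generator aligned with the mean force; the rotation effect in Figure 4 appears when the generator direction is oblique to the axes of w(a).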
5.2 Error-Effort Trade-off. Our formulation of the optimal control problem facing the motor system assumed that the only quantity being minimized is error (see equation 3.1). It may be more sensible, however, to minimize a weighted sum of error and effort, because avoiding fatigue in the current task can lead to smaller errors in tasks performed in the future. Indeed, we have found evidence for error+effort minimization in movement tasks (Todorov, 2001). To allow this possibility here, we consider a modified cost function of the form

    E_{z|μ}[(r − f)^T (r − f)] + b |V|^{-2} Σ_{a∈V} μ(a)².
The only change resulting from the inclusion of the activation penalty term is that the variance V previously given by equation 3.2 now becomes

    V = |V|^{-2} Σ_{a∈V} ((λ₁ + b)μ(a)² + λ₂μ(a)) + λ₁λ₃ r^T r + λ₂λ₃ r^T U.
Thus, the results in section 4 remain unaffected (apart from the substitution λ₁ ← λ₁ + b), and the optimal tuning curve is the same as before. The only
effect of the activation penalty is to increase the force bias. The optimal ‖r‖ computed in section 5.1 now becomes

    ‖r‖ = ‖f‖ / (1 + λ₁λ₃ + (λ₁ + b)D/|S^D|).
Thus, the optimal bias ‖r‖ − ‖f‖ increases with the weight b of the activation penalty. This can explain why the experimentally observed bias in Figure 1A was larger than predicted by minimizing error alone.

5.3 Nonhomogeneous Noise Correlations. Thus far, we allowed only homogeneous correlations (λ₃) among noise terms affecting different generators. Here, we consider an additional correlation term (λ₄) that varies with the angle between two generators. The noise covariance model Cov_{z|μ}[z(a), z(b)] now becomes

    (λ₁μ(a)μ(b) + λ₂ (μ(a) + μ(b))/2) (δ_{ab} + λ₃ + 2λ₄ u(b)^T u(a)).

We focus on the case when the force generators are uniformly distributed in a two-dimensional work space (D = 2), the mean force is r = [R 0] as before, the cocontraction level C is specified, and C/R ≥ D. Using the identities u(b)^T u(a) = cos(a − b) and 2cos²(a − b) = 1 + cos(2a − 2b), the force variance V previously given by equation 3.2 now becomes

    V = (1/4π²) ∫ (λ₁μ(a)² + λ₂μ(a)) da + λ₁λ₃ r^T r + (λ₁λ₄/4)(p₂² + q₂²) + λ₁λ₄C² + λ₂λ₄C,

where p₂ = (1/π)∫ μ(a)cos(2a) da and q₂ = (1/π)∫ μ(a)sin(2a) da are the second-order coefficients in the Fourier series μ(a) = p₀/2 + Σ_{n=1}^∞ (p_n cos(na) + q_n sin(na)). The integral term in V can be expressed as a function of the Fourier coefficients using Parseval's theorem. The constraints on μ(a) are (1/2π)∫ μ(a) da = C, (1/2π)∫ μ(a)cos(a) da = R, and (1/2π)∫ μ(a)sin(a) da = 0. These constraints specify the p₀, p₁, and q₁ Fourier coefficients. Collecting all unconstrained terms in V yields
    V = (λ₁/4)(λ₄ + 1/π)(p₂² + q₂²) + (λ₁/4π) Σ_{n=3}^∞ (p_n² + q_n²) + const(C, r).
Since the parameter λ₁ corresponding to the slope of the regression line in Figure 1B is positive, the above expression is a sum of squares with positive weights when λ₄ > −1/π. The unique minimum is then achieved when p_{2,...,∞} = q_{2,...,∞} = 0, and therefore the optimal tuning curve is μ(a) = DR cos(a) + C as before.
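The Parseval step used above can be checked directly. The sketch below is my own verification (the test coefficients are arbitrary): it confirms that (1/2π)∫ μ(a)² da equals p₀²/4 + ½Σ_n (p_n² + q_n²) for a profile with known harmonics.

```python
import math
import numpy as np

# activation profile with known Fourier coefficients (arbitrary test values):
# mu(a) = p0/2 + p1 cos(a) + p2 cos(2a) + q3 sin(3a)
p0, p1, p2, q3 = 6.0, 2.0, 0.7, -0.4

n = 200000
a = -math.pi + (np.arange(n) + 0.5) * (2 * math.pi / n)
mu = p0 / 2 + p1 * np.cos(a) + p2 * np.cos(2 * a) + q3 * np.sin(3 * a)

# (1/2pi) integral of mu^2  versus  p0^2/4 + (1/2) sum (p_n^2 + q_n^2)
lhs = float(np.sum(mu ** 2) * (2 * math.pi / n) / (2 * math.pi))
rhs = p0 ** 2 / 4 + 0.5 * (p1 ** 2 + p2 ** 2 + q3 ** 2)
assert abs(lhs - rhs) < 1e-9
```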
If nonhomogeneous correlations are present, one would expect muscles pulling in similar directions to be positively correlated (λ₄ > 0), as simultaneous EMG recordings indicate (Stephens, Harrison, Mayston, Carr, & Gibbs, 1999). This justifies the assumption λ₄ > −1/π.

5.4 Alternative Cost Functions. We assumed that the cost being minimized by the motor system is the square error of the force output. While a square error cost is common to most optimization models in motor control (see section 6), it is used for analytical convenience without any empirical support. This is not a problem for phenomenological models that simply look for a quantity whose minimum happens to match the observed behavior. But if we are to construct more principled models and claim some correspondence to a real optimization process in the motor system, it is necessary to confirm the behavioral relevance of the chosen cost function.

How can we proceed in the absence of such empirical confirmation? Our approach is to study alternative cost functions, obtain model predictions through numerical simulation, and show that the particular cost function being chosen makes little difference. Throughout this section, we assume that C is specified, the work space is two-dimensional, and the target force (without loss of generality) is f = [R 0]. The cost function is now Cost_p(μ) = E(‖r − f‖^p). We find numerically the optimal activations μ_{1,...,15} for 15 uniformly distributed force generators. The noise terms z_{1,...,15} are assumed independent, with probability distribution matching the experimental data. In order to generate such noise terms, we combined the data for each instructed force level (all subjects, one-hand condition), subtracted the mean, divided by the standard deviation, and pooled the data from all force levels. Samples from the distribution of z_i were then obtained as z_i = λ₁μ_i s, where s was sampled with replacement from the pooled data set. The scaling constant was set to λ₁ = 0.2.
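The noise-sampling procedure just described can be sketched as follows. This is my own illustration with synthetic stand-in data (the paper pools real force recordings; the generation of `levels` below is hypothetical, and the function name sample_noise is mine).

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the experimental force traces at three instructed force
# levels (hypothetical synthetic data; the paper pools real recordings)
levels = [rng.normal(loc=f, scale=0.1 * f, size=500) for f in (2.0, 4.0, 8.0)]

# normalize each level to zero mean and unit standard deviation, then pool
pooled = np.concatenate([(x - x.mean()) / x.std() for x in levels])

lam1 = 0.2  # scaling constant used in section 5.4

def sample_noise(mu, rng):
    # z_i = lam1 * mu_i * s, with s drawn with replacement from the pool
    s = rng.choice(pooled, size=len(mu))
    return lam1 * np.asarray(mu) * s

z = sample_noise([1.0, 2.0, 4.0], rng)
```

The multiplicative scaling z_i = λ₁μ_i s reproduces the empirical law that the noise standard deviation grows in proportion to the mean activation.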
It could not be easily estimated from the data (because subjects used multiple muscles), but varying it from 0.2 to 0.1 did not affect the results presented here, as expected from section 4. To find the optimal activations, we initialized μ_{1,...,15} randomly and minimized the Monte Carlo estimate of Cost_p(μ) using BFGS gradient descent with numerically computed gradients (fminunc in the Matlab Optimization Toolbox). The constraints μ_i ≥ 0 and (1/15)Σ_i μ_i = C were enforced by scaling and using the absolute value of μ_i inside the estimation function. A small cost proportional to |(1/15)Σ_i μ_i − C| was added to resolve the scaling ambiguity. To speed up convergence, a fixed 15 × 100,000 random sample from the experimental data was used in each minimization run. The average of the optimal tuning curves found in 40 runs of the algorithm (using different starting points and random samples) is plotted in
[Figure 5 appears here: optimal activation plotted against angle (deg, −180 to 180) for p = 0.5, p = 2, p = 4; the upper set of curves uses R = 1, C = 4, the lower set R = 1, C = 1.275.]
Figure 5: The average of 40 optimal tuning curves, for p = 0.5 and p = 4. The different tuning curves found in multiple runs were similar. The solution for p = 2 was computed using the results in section 4.
Figure 5, for p = 0.5 and p = 4. The optimal tuning curve with respect to the quadratic cost (p = 2) is also shown. For both full and truncated cosine solutions, the choice of cost function made little difference. We have repeated this analysis with gaussian noise and obtained very similar results. It is in principle possible to compare the different curves in Figure 5 to experimental data and try to identify the true cost function used by the motor system. However, the differences are rather small compared to the noise in empirically observed tuning curves, so this analysis is unlikely to produce unambiguous results.

5.5 Movement Velocity and Displacement Tuning. The above analysis explains cosine tuning with respect to isometric force. To extend our results to dynamic conditions and address movement velocity and displacement tuning, we have to take into account the fact that muscle force production is state dependent. For a constant level of activation, the force produced by a muscle varies with its length and rate of change of length (Zajac, 1989), decreasing in the direction of shortening. The only way the CNS can generate a desired net muscle force during movement is to compensate for this dependence: since muscles pulling in the direction of movement are shortening, their force output for fixed neural input drops, and so their neural input has to increase. Thus, muscle activation has to correlate with movement velocity and displacement (Todorov, 2000).

Now consider a short time interval in which neural activity can change, but all lengths, velocities, and forces remain roughly constant. In this setting, the analysis from the preceding sections applies, and the optimal tuning curve with respect to movement velocity and displacement is again a (truncated) cosine. While the relationship between muscle force and activation can be different in each time interval, the minimization problem itself remains the same; thus, each solution belongs to the family of truncated cosines described above. The net muscle force that the CNS attempts to generate in each time interval can be a complex function of the estimated state of the limb and the task goals. This complexity, however, does not affect our argument: we are not asking how the desired net muscle force is computed but how it can be generated accurately once it has been computed. The quasi-static setting considered here is an approximation, which is justified because the neural input is low-pass-filtered before generating force (the relationship between EMG and muscle force is well modeled by a second-order linear filter with time constants around 40 msec; Winter, 1990), and lengths and velocities are integrals of the forces acting on the limb, so they vary even more slowly compared to the neural input. Replacing this approximation with a more detailed model of optimal movement control is a topic for future work.

6 Discussion

In summary, we developed a model of noisy force production where optimal tuning is defined in terms of expected net force error. We proved that the optimal tuning curve is a (possibly truncated) cosine, for a uniform distribution w(a) of force directions in R^D and for an arbitrary distribution w(a) of force directions in R². When both w(a) and D are arbitrary, the optimal tuning curve is still a truncated cosine, provided that a truncated cosine satisfying all constraints exists. Although the analytical results were obtained under the assumptions of quadratic cost and homogeneously correlated noise, it was possible to relax these assumptions in special cases. Redefining optimal tuning in terms of error+effort minimization did not affect our conclusions.
The model makes three novel and somewhat surprising predictions. First, the model predicts a relationship between the shape of the tuning curve μ(a) and the cocontraction level C. According to equation 4.2, when C is large enough, the optimal tuning curve μ(a) = DR cos(a) + C is a full cosine, which scales with the magnitude of the net force R and shifts with C. But when C is below the threshold value DR, the optimal tuning curve is a truncated cosine, which becomes sharper as C decreases. Thus, we would expect to see sharper-than-cosine tuning curves in the literature. Such examples can indeed be found in Turner et al. (1995) and Hoffman and Strick (1999). A more systematic investigation in M1 (Amirikian & Georgopoulos, 2000) revealed that the tuning curves of most cells were better fit by sharper-than-cosine functions, presumably because of the low cocontraction level. We recently tested the above prediction using both M1 and EMG data and found that cells and muscles that appear to have higher contributions to
the cocontraction level also have broader tuning curves, whose average is indistinguishable from a cosine (Todorov et al., 2000). This prediction can be tested more directly by asking subjects to generate specified net forces and simultaneously achieve different cocontraction levels.

Second, under nonuniform distributions of force directions, the model predicts a misalignment between preferred and force directions, while the tuning curves remain cosine. This effect has been observed by Cisek and Scott (1998) and Hoffman and Strick (1999). Note that a nonuniform distribution of force directions does not necessitate misalignment; instead, the asymmetry can be compensated by using skewed tuning curves.

Third, our analysis shows that optimal force production is negatively biased; the bias is larger when fewer force generators are active and increases with mean force. The measured bias was larger than predicted from error minimization alone, which suggests that the motor system minimizes a combination of error and effort, in agreement with results we have recently obtained in movement tasks (Todorov, 2001).

The model for the first time demonstrates how cosine tuning could result from optimizing a meaningful objective function: accurate force production. Another model proposed recently (Zhang & Sejnowski, 1999a) takes a very different approach. It assumes a universal rule for encoding motion information in both sensory and motor areas,^12 which gives rise to cosine tuning. Its main advantage is that tuning for movement direction can be treated in the same framework in all parts of the nervous system, regardless of whether the motion signal is related to a body part or an external object perceived visually. But that model has two disadvantages: (1) it cannot explain cosine tuning with direction of force and displacement in the motor system, and (2) cosine tuning is explained with a new encoding rule that remains to be verified experimentally.
If the new encoding rule is confirmed, it would provide a mechanistic explanation of cosine tuning that does not address the question of optimality. In that sense, the model of Zhang and Sejnowski (1999a) can be seen as being complementary to ours.

Footnote 12: Assume each cell has a "hidden" function W(x) and encodes movement in x ∈ R^D by firing in proportion to dW(x(t))/dt. From the chain rule, dW/dt = ∂W/∂x · dx/dt = ∇W · ẋ. This is the dot product of a cell-specific "preferred direction" ∇W and the movement velocity vector ẋ; thus, cosine tuning for movement velocity.

6.1 Origins of Neuromotor Noise. The origin and scaling properties of neuromotor noise are of central importance in stochastic optimization models of the motor system. The scaling law relating the mean and standard deviation of the net force was derived experimentally. What can we say about the neural mechanisms responsible for this type of noise? Very little, unfortunately. Although a number of studies on motor tremor have analyzed the peaks in the power spectrum and how they are affected by different experimental manipulations, no widely accepted view of their origin has emerged (McAuley et al., 1997). Possible explanations include noise in the central drive, oscillations arising in spinal circuits, effects of afferent input, and mechanical resonance.

One might expect the noise in the force output to reflect directly the noise in the descending M1 signals, in agreement with the finding that magnetoencephalogram fluctuations recorded over M1 are synchronous with EMG activity in contralateral muscles (Conway et al., 1995). On the level of single cells and muscles, however, this relationship is quite complicated. Cells in M1 (and most other areas of cortex) are well modeled as Poisson processes with coefficients of variation (CV) around 1 (Lee, Port, Kruse, & Georgopoulos, 1998). For a Poisson process, the spike count in a fixed interval has variance (rather than standard deviation) linear in the mean. The firing patterns of motoneurons are nothing like Poisson processes. Instead, motoneurons fire much more regularly, with CVs around 0.1 to 0.2 (DeLuca, 1995). Furthermore, muscle force is controlled to a large extent by recruiting new motor units, so noise in the force output may arise from the motor unit recruitment mechanisms, which are not very well understood. Other physiological mechanisms likely to affect the output noise distribution are recurrent feedback through Renshaw cells (which may serve as a decorrelating mechanism; Maltenfort, Heckman, & Rymer, 1998), as well as plateau potentials (caused by voltage-activated calcium channels) that may cause sustained firing of motoneurons in the absence of synaptic input (Kiehn & Eken, 1997). Also, muscle force is not just a function of motoneuronal firing rate but depends significantly on the sequence of interspike intervals (Burke, Rudomin, & Zajac, 1976).^13
Thus, although the mean firing rates of M1 cells seem to contribute additively to the mean activations of muscle groups (Todorov, 2000), the small-timescale fluctuations in M1 and muscles have a more complex relationship.

The motor tremor illustrated in Figure 1B should not be thought of as being the only source of noise. Under dynamic conditions, various calibration errors (such as inaccurate internal estimates of muscle fatigue, potentiation, length, and velocity dependence) can have a compound effect resembling multiplicative noise. This may be why the errors observed in dynamic force tasks (Schmidt et al., 1979) as well as reaching without vision (Gordon, Ghilardi, Cooper, & Ghez, 1994) are substantially larger than what the slopes in Figure 1B would predict.

6.2 From Muscle Tuning to M1 Cell Tuning. Since M1 cells are synaptically close to motoneurons (in some cases, the projection can even be monosynaptic; Fetz & Cheney, 1980), their activity would be expected to
Footnote 13: Because of this nonlinear dependence, muscle force would be much noisier if motoneurons had Poisson firing rates, which may be why they fire so regularly.
reflect properties of the motor periphery. The defining feature of a muscle is its line of action (determined by the tendon insertion points), in the same way that the defining feature of a photoreceptor is its location on the retina. A fixed line of action implies a preferred direction, just like a fixed retinal location implies a spatially localized receptive field. Thus, given the properties of the periphery, the existence of preferred directions in M1 is no more surprising than the existence of spatially localized receptive fields in V1.^14 Of course, directional tuning of muscles does not necessitate similar tuning in M1, in the same way that cells in V1 do not have to display spatial tuning; one can imagine, for example, a spatial Fourier transform in the retina or lateral geniculate nucleus that completely abolishes the spatial tuning arising from photoreceptors. But perhaps the nervous system avoids such drastic changes in representation, and tuning properties that arise (for whatever reason) in one area "propagate" to other densely connected areas, regardless of the direction of connectivity. Using this line of reasoning and the fact that muscle activity has to correlate with movement velocity and displacement in order to compensate for muscle visco-elasticity (see section 5.5), we have previously explained a number of seemingly contradictory phenomena in M1 without the need to evoke abstract encoding principles (Todorov, 2000). This article adds cosine tuning to that list of phenomena.

We showed here that because of the multiplicative nature of motor noise, the optimal muscle tuning curve is a cosine. This makes cosine tuning a natural choice for motor areas that are close to the motor periphery. Motor areas that are further removed from motoneurons have less of a reason to display cosine tuning. Cerebellar Purkinje cells, for example, are often tuned for a limited range of movement speeds, and their tuning curves can be bimodal (Coltz, Johnson, & Ebner, 1999).
6.3 Optimization Models in Motor Control. A number of earlier optimization models explain aspects of motor behavior as emerging from the minimization of some cost functional. The speed-accuracy trade-off known as Fitts' law has been modeled in this way (Meyer, Abrams, Kornblum, Wright, & Smith, 1988; Hoff, 1992; Harris & Wolpert, 1998). The reaching movement trajectory that minimizes expected end-point error is computed under a variety of assumptions about the control system (intermittent versus continuous, open loop versus closed loop) and the noise scaling properties (velocity- versus neural-input-dependent). While each model has advantages and disadvantages in fitting existing data, they all capture the basic logarithmic relationship between target width and movement duration.
14 From this point of view, orientation tuning in V1 is surprising because it does not arise directly from peripheral properties. An equally surprising and robust phenomenon in M1 has not yet been found.
This robustness with respect to model assumptions suggests that Fitts' law indeed emerges from error minimization. Another set of experimental results that optimization models have addressed are kinematic regularities observed in hand movements (Morasso, 1981; Lacquaniti, Terzuolo, & Viviani, 1983). While a number of physically relevant cost functions (e.g., minimum time, energy, force, impulse) were investigated (Nelson, 1983), better reconstruction of the bell-shaped speed profiles of reaching movements was obtained (Hogan, 1984) by minimizing squared jerk (the derivative of acceleration). Recently, the most accurate reconstructions of complex movement trajectories were also obtained by minimizing, under different assumptions, the derivative of acceleration (Todorov & Jordan, 1998) or torque (Nakano et al., 1999). While these fits to experimental data are rather satisfying, the seemingly arbitrary quantity being minimized is less so. The stochastic optimization model of Harris and Wolpert (1998) takes a more principled approach: it minimizes expected end-point error assuming that the standard deviation of neuromotor noise is proportional to the mean neural activation. Shouldn't that result in minimizing force and acceleration, which, as Nelson (1983) showed, yields unrealistic trajectories? It should, if muscle activation and force were identical, but they are not; instead, muscle force is a low-pass-filtered version of activation (Winter, 1990). As a result, the neural signal under dynamic conditions contains terms related to the derivative of force, and so the model of Harris and Wolpert (1998) effectively minimizes a cost that includes jerk or torque change along with other terms. It will be interesting to find tasks where maximizing accuracy and maximizing smoothness make different predictions and test which prediction is closer to observed trajectories. The noise model used by Harris and Wolpert (1998) is identical to ours under isometric conditions.
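The bell-shaped speed profile of minimum-jerk reaching has a well-known closed form for a straight point-to-point movement (the quintic polynomial of the minimum-jerk literature). A minimal numerical sketch, with our own variable names and parameter choices:

```python
import numpy as np

# Minimum-jerk position profile for a point-to-point reach:
# x(t) = x0 + (xf - x0) * (10 s^3 - 15 s^4 + 6 s^5), with s = t/T.
# Its derivative is the symmetric "bell-shaped" speed profile.
def min_jerk(x0, xf, T, n=1001):
    t = np.linspace(0.0, T, n)
    s = t / T
    x = x0 + (xf - x0) * (10 * s**3 - 15 * s**4 + 6 * s**5)
    v = np.gradient(x, t)          # numerical speed profile
    return t, x, v

t, x, v = min_jerk(0.0, 0.2, 0.5)  # a 20 cm reach lasting 0.5 s
assert np.isclose(x[0], 0.0) and np.isclose(x[-1], 0.2)
assert abs(np.argmax(np.abs(v)) - len(t) // 2) <= 1   # peak speed at midpoint
```

The peak speed occurs at the movement midpoint and exceeds the average speed, reproducing the bell shape that the text attributes to squared-jerk minimization.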
During movement, it is not known whether noise magnitude is better fit by mean force (as in the present model) or muscle activation (as in Harris & Wolpert, 1998). Our conclusions should not be sensitive to such differences, since we do not rely on muscle low-pass filtering to explain cosine tuning. Nevertheless, it is important to establish experimentally the properties of neuromotor noise during movement.

Appendix

The crucial fact underlying the proof of theorem 1 is that the linear span $L$ of the functions $g_{1,\ldots,N}$ is orthogonal to the hyperplane $P$ defined by the equality constraints in equation 4.1.

Lemma 1. For any $a_{1,\ldots,N} \in \mathbb{R}$ and $u, v \in R(\Omega) \cap P$ satisfying $\langle u, g_n \rangle = \langle v, g_n \rangle = r_n$ for $n = 1, \ldots, N$, the $R(\Omega)$ function $l(\alpha) = \sum_n a_n g_n(\alpha)$ is orthogonal to $u - v$, that is, $\langle u - v, l \rangle = 0$.
Figure 6: Illustration of the functions $\mu^*(\alpha)$, $\Delta(\alpha)$, and $\tilde\mu(\alpha)$ in theorem 1, case 2, with $\sum_n a_n g_n(\alpha) = \cos(\alpha)$. The shaded region is the set $\zeta$ where $\cos(\alpha) < 0$. The key point is that $\Delta(\alpha)\,\tilde\mu(\alpha) \le 0$ for all $\alpha$.
Proof. $\langle u - v, l \rangle = \langle u, \sum_n a_n g_n \rangle - \langle v, \sum_n a_n g_n \rangle = \sum_n a_n \bigl( \langle u, g_n \rangle - \langle v, g_n \rangle \bigr) = \sum_n a_n (r_n - r_n) = 0$.
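Lemma 1 is easy to check numerically: discretize $\alpha$, draw random $u$ and $v$, correct $v$ so that its inner products with the $g_n$ match those of $u$, and test orthogonality. The discretization, basis choice, and variable names below are ours, not the paper's:

```python
import numpy as np

# Numerical sketch of lemma 1 on a uniform grid over [-pi, pi).
rng = np.random.default_rng(0)
alpha = np.linspace(-np.pi, np.pi, 400, endpoint=False)
w = (2 * np.pi) / len(alpha)                  # quadrature weight for <f, h>
G = np.stack([np.cos(alpha), np.sin(alpha), np.ones_like(alpha)])  # g_1..g_3

def inner(f, h):
    return w * np.sum(f * h)

u = rng.standard_normal(len(alpha))
v = rng.standard_normal(len(alpha))
# Correct v inside span(g_n) so that <v, g_n> = <u, g_n> for every n.
M = np.array([[inner(gm, gn) for gn in G] for gm in G])
r_u = np.array([inner(u, gn) for gn in G])
r_v = np.array([inner(v, gn) for gn in G])
c = np.linalg.solve(M, r_u - r_v)
v = v + c @ G

a = rng.standard_normal(3)
l = a @ G                                     # l = sum_n a_n g_n
assert abs(inner(u - v, l)) < 1e-10           # orthogonality, as in lemma 1
```

After the correction, $u$ and $v$ satisfy the same equality constraints, and the linear combination $l$ is orthogonal to their difference up to floating-point error.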
The quantity $\langle \mu + \lambda, \mu + \lambda \rangle$ we want to minimize is a generalized length, the solution $\mu$ is constrained to the hyperplane $P$ orthogonal to $L$, and $L$ contains the origin $0$. Thus, we would intuitively expect the optimal solution $\mu^*$ to be close to the intersection of $P$ and $L$, that is, to resemble a linear combination of $g_{1,\ldots,N}$. The nonnegativity constraint on $\mu$ introduces complications that are handled in case 2 (see Figure 6). The proof of theorem 1 is the following:

Proof of Theorem 1. Let $\mu = \mu^* + \Delta$ for some $\Delta \in R(\Omega)$ be another function satisfying all constraints in equation 4.1. Using the linearity and symmetry of the dot product, $\langle \mu + \lambda, \mu + \lambda \rangle = \langle \mu^* + \lambda, \mu^* + \lambda \rangle + 2\langle \mu^* + \lambda, \Delta \rangle + \langle \Delta, \Delta \rangle$. The term $\langle \Delta, \Delta \rangle$ is always nonnegative and becomes 0 only when $\Delta(\alpha) = 0$ for all $\alpha$. Thus, to prove that $\mu^*$ is the unique optimal solution, it is sufficient to show that $\langle \mu^* + \lambda, \Delta \rangle \ge 0$. We have to distinguish two cases, depending on whether the term in the truncation brackets is positive for all $\alpha$:

Case 1. Suppose $\sum_n a_n g_n(\alpha) \ge \lambda$ for all $\alpha \in \Omega$, that is, $\mu^*(\alpha) = \sum_n a_n g_n(\alpha) - \lambda$. Then $\langle \mu^* + \lambda, \Delta \rangle = \langle \sum_n a_n g_n, \mu - \mu^* \rangle = 0$ from lemma 1.

Case 2. Consider the function $\tilde\mu(\alpha) = \min\{\sum_n a_n g_n(\alpha) - \lambda, 0\}$, which has the property that $\tilde\mu + \mu^* = \sum_n a_n g_n - \lambda$. With this definition and using lemma 1, $0 = \langle \sum_n a_n g_n, \mu - \mu^* \rangle = \langle \tilde\mu + \mu^* + \lambda, \Delta \rangle = \langle \tilde\mu, \Delta \rangle + \langle \mu^* + \lambda, \Delta \rangle$. Then $\langle \mu^* + \lambda, \Delta \rangle = -\langle \tilde\mu, \Delta \rangle$, and it is sufficient to show that $\langle \tilde\mu, \Delta \rangle \le 0$. Let $\zeta \subset \Omega$ be the subset of $\Omega$ on which $\sum_n a_n g_n(\alpha) < \lambda$. Then $\tilde\mu(\alpha \in \zeta) < 0$ and $\tilde\mu(\alpha \notin \zeta) = 0$. Since $\mu = \mu^* + \Delta$ satisfies $\mu \ge 0$ and by definition
$\mu^*(\alpha \in \zeta) = 0$, we have $\Delta(\alpha \in \zeta) \ge 0$. The dot product $\langle \tilde\mu, \Delta \rangle$ can be evaluated by parts on the two sets $\alpha \in \zeta$ and $\alpha \notin \zeta$. Since $\tilde\mu(\alpha)\Delta(\alpha) \le 0$ for $\alpha \in \zeta$, and $\tilde\mu(\alpha)\Delta(\alpha) = 0$ for $\alpha \notin \zeta$, it follows that $\langle \tilde\mu, \Delta \rangle \le 0$.

Acknowledgments

I thank Zoubin Ghahramani and Peter Dayan for their in-depth reading of the manuscript and numerous suggestions.

References

Amirikian, B., & Georgopoulos, A. (2000). Directional tuning profiles of motor cortical cells. Neuroscience Research, 36, 73–79.
Burke, R. E., Rudomin, P., & Zajac, F. E. (1976). The effect of activation history on tension production by individual muscle units. Brain Research, 109, 515–529.
Caminiti, R., Johnson, P., Galli, C., Ferraina, S., & Burnod, Y. (1991). Making arm movements within different parts of space: The premotor and motor cortical representation of a coordinate system for reaching to visual targets. Journal of Neuroscience, 11(5), 1182–1197.
Cisek, P., & Scott, S. H. (1998). Cooperative action of mono- and bi-articular arm muscles during multi-joint posture and movement tasks in monkeys. Society for Neuroscience Abstracts, 164.4.
Clancy, E., & Hogan, N. (1999). Probability density of the surface electromyogram and its relation to amplitude detectors. IEEE Transactions on Biomedical Engineering, 46(6), 730–739.
Coltz, J., Johnson, M., & Ebner, T. (1999). Cerebellar Purkinje cell simple spike discharge encodes movement velocity in primates during visuomotor tracking. Journal of Neuroscience, 19(5), 1782–1803.
Conway, B. A., Halliday, D. M., Farmer, S. F., Shahani, U., Maas, P., Weir, A. I., & Rosenberg, J. R. (1995). Synchronization between motor cortex and spinal motoneuronal pool during the performance of a maintained motor task in man. J. Physiol. (Lond.), 489, 917–924.
DeLuca, C. J. (1995). Decomposition of the EMG signal into constituent motor unit action potentials. Muscle and Nerve, 18, 1492–1493.
Fetz, E. E., & Cheney, P. D. (1980). Postspike facilitation of forelimb muscle activity by primate corticomotoneuronal cells. Journal of Neurophysiology, 44, 751–772.
Georgopoulos, A., Kalaska, J., Caminiti, R., & Massey, J. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2(11), 1527–1537.
Gordon, J., Ghilardi, M. F., Cooper, S., & Ghez, C. (1994). Accuracy of planar reaching movements. Exp. Brain Res., 99, 97–130.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
Herrmann, U., & Flanders, M. (1998). Directional tuning of single motor units. Journal of Neuroscience, 18(20), 8402–8416.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (pp. 77–109). Cambridge, MA: MIT Press.
Hoff, B. (1992). A computational description of the organization of human reaching and prehension. Unpublished doctoral dissertation, University of Southern California.
Hoffman, D. S., & Strick, P. L. (1999). Step-tracking movements of the wrist. IV. Muscle activity associated with movements in different directions. Journal of Neurophysiology, 81, 319–333.
Hogan, N. (1984). An organizing principle for a class of voluntary movements. Journal of Neuroscience, 4(11), 2745–2754.
Kalaska, J. F., Cohen, D. A. D., Hyde, M. L., & Prud'homme, M. (1989). A comparison of movement direction–related versus load direction–related activity in primate motor cortex, using a two-dimensional reaching task. Journal of Neuroscience, 9(6), 2080–2102.
Kettner, R. E., Schwartz, A. B., & Georgopoulos, A. P. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. III. Positional gradients and population coding of movement direction from various movement origins. Journal of Neuroscience, 8(8), 2938–2947.
Kiehn, O., & Eken, T. (1997). Prolonged firing in motor units: Evidence of plateau potentials in human motoneurons? Journal of Neurophysiology, 78, 3061–3068.
Lacquaniti, F., Terzuolo, C., & Viviani, P. (1983). The law relating the kinematic and figural aspects of drawing movements. Acta Psychol., 54, 115–130.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18(3), 1161–1170.
Maltenfort, M. G., Heckman, C. J., & Rymer, Z. W. (1998). Decorrelating actions of Renshaw interneurons on the firing of spinal motoneurons within a motor nucleus: A simulation study. Journal of Neurophysiology, 80, 309–323.
McAuley, J. H., Rothwell, J. C., & Marsden, C. D. (1997). Frequency peaks of tremor, muscle vibration and electromyographic activity at 10 Hz, 20 Hz and 40 Hz during human finger muscle contraction may reflect rhythmicities of central neural firing. Exp. Brain Res., 114, 525–541.
Meyer, D. E., Abrams, R. A., Kornblum, S., Wright, C. E., & Smith, J. E. K. (1988). Optimality in human motor performance: Ideal control of rapid aimed movements. Psychological Review, 95, 340–370.
Morasso, P. (1981). Spatial control of arm movements. Exp. Brain Res., 42, 223–227.
Nakano, E., Imamizu, H., Osu, R., Uno, Y., Gomi, H., Yoshioka, T., & Kawato, M. (1999). Quantitative examinations of internal representations for arm trajectory planning: Minimum commanded torque change model. Journal of Neurophysiology, 81(5), 2140–2155.
Nelson, W. L. (1983). Physical principles for economies of skilled movements. Biological Cybernetics, 46, 135–147.
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. (1999). Narrow versus wide tuning curves: What's best for a population code? Neural Computation, 11, 85–90.
Schmidt, R. A., Zelaznik, H., Hawkins, B., Frank, J. S., & Quinn, J. T. J. (1979). Motor-output variability: A theory for the accuracy of rapid motor acts. Psychological Review, 86(5), 415–451.
Snippe, H. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Stephens, J. A., Harrison, L. M., Mayston, M. J., Carr, L. J., & Gibbs, J. (1999). The sharing principle. In M. D. Binder (Ed.), Peripheral and spinal mechanisms in the neural control of movement (pp. 419–426). Oxford: Elsevier.
Sutton, G. G., & Sykes, K. (1967). The variation of hand tremor with force in healthy subjects. Journal of Physiology, 191(3), 699–711.
Todorov, E. (2000). Direct cortical control of muscle activation in voluntary arm movements: A model. Nature Neuroscience, 3(4), 391–398.
Todorov, E. (2001). Arm movements minimize a combination of error and effort. Neural Control of Movement, 11.
Todorov, E., & Jordan, M. I. (1998). Smoothness maximization along a predefined path accurately predicts the speed profiles of complex arm movements. Journal of Neurophysiology, 80, 696–714.
Todorov, E., Li, R., Gandolfo, F., Benda, B., DiLorenzo, D., Padoa-Schioppa, C., & Bizzi, E. (2000). Cosine tuning minimizes motor errors: Theoretical results and experimental confirmation. Society for Neuroscience Abstracts, 785.6.
Turner, R. S., Owens, J., & Anderson, M. E. (1995). Directional variation of spatial and temporal characteristics of limb movements made by monkeys in a two-dimensional work space. Journal of Neurophysiology, 74, 684–697.
Winter, D. A. (1990). Biomechanics and motor control of human movement. New York: Wiley.
Zajac, F. E. (1989). Muscle and tendon: Properties, models, scaling, and application to biomechanics and motor control. Critical Reviews in Biomedical Engineering, 17(4), 359–411.
Zhang, K., & Sejnowski, T. (1999a). A theory of geometric constraints on neural activity for natural three-dimensional movement.
Journal of Neuroscience, 19(8), 3122–3145.
Zhang, K., & Sejnowski, T. J. (1999b). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received March 23, 2000; accepted October 1, 2001.
NOTE
Communicated by Michael Jordan
SMEM Algorithm Is Not Fully Compatible with Maximum-Likelihood Framework
Akihiro Minagawa [email protected]
Norio Tagawa [email protected]
Toshiyuki Tanaka [email protected]
Graduate School of Engineering, Tokyo Metropolitan University, Hachioji, Tokyo, 192-0397 Japan
The expectation-maximization (EM) algorithm with split-and-merge operations (SMEM algorithm) proposed by Ueda, Nakano, Ghahramani, and Hinton (2000) is a nonlocal searching method, applicable to mixture models, for relaxing the local optimum property of the EM algorithm. In this article, we point out that the SMEM algorithm uses the acceptance-rejection evaluation method, which may pick up a distribution with smaller likelihood, and demonstrate that an increase in likelihood can then be guaranteed only by comparing log likelihoods.
1 Introduction
The expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is widely applied in many fields, including pattern and image recognition, as a general solution for maximum likelihood estimation with hidden variables (Jordan & Jacobs, 1994; McLachlan & Basford, 1988). Although the EM algorithm is an iterative method, its advantage over other methods is that it guarantees local optimality. However, the EM algorithm does not guarantee global optimality, and the solution depends on the initial parameter values. In order to circumvent this problem in practice, some heuristics are required. Ueda, Nakano, Ghahramani, and Hinton (2000) proposed the split-and-merge EM (SMEM) algorithm for mixture models to overcome the local optimality problem of the EM algorithm. This method is a heuristic approach in which a nonlocal search for the solution is incorporated into the EM algorithm by a split-and-merge operation applied to three components of a mixture model.

Neural Computation 14, 1261–1266 (2002) © 2002 Massachusetts Institute of Technology
In this note, we discuss the SMEM algorithm from a theoretical viewpoint within the framework of maximum-likelihood estimation. We demonstrate that the SMEM algorithm, in its current form, is not fully compatible with the maximum-likelihood framework assumed by the EM algorithm; that is, it may throw away the globally optimal solution even after it has been obtained. We then introduce an alternative approach to avoid this problem. It should be mentioned that Bae, Lee, and Lee (2000) apparently used a similar approach, but they did not discuss the problem explicitly or how to cope with it. Therefore, in this article, we make clear precisely what the problem is and how the alternative approach overcomes it.

2 EM Algorithm and SMEM Algorithm for Mixture Model
Given incomplete data, the EM algorithm performs maximization of the likelihood indirectly by increasing the value of the Q-function, which is the conditional expectation of the log-likelihood function of the complete data. Let $\mathcal{X}$ be the set of observed variables, $\mathcal{Y}$ be the set of hidden variables, and $\Theta$ be the set of model parameters. Then the log-likelihood function, which should be maximized, can be written in marginal form with respect to the hidden variables as

$$L(\Theta; \mathcal{X} = X) = \ln \int p(\mathcal{X} = X \mid \mathcal{Y}; \Theta)\, p(\mathcal{Y}; \Theta)\, d\mathcal{Y}. \tag{2.1}$$

In the EM algorithm under appropriate regularity conditions (Wu, 1983), the maximum likelihood estimator can be obtained by making use of the correspondence between the extremum point of equation 2.1 and the stationary point of the iteration of the EM algorithm. The latter is written as

$$\hat\Theta^{(t)} = \arg\max_{\Theta} Q(\Theta; \hat\Theta^{(t-1)}), \tag{2.2}$$

where $Q(\Theta; \hat\Theta^{(t-1)})$ denotes the Q-function, defined as

$$Q(\Theta; \hat\Theta^{(t-1)}) \equiv E_{p(\mathcal{Y} \mid \mathcal{X} = X;\, \hat\Theta^{(t-1)})} \bigl[ \ln p(\mathcal{X} = X \mid \mathcal{Y}; \Theta)\, p(\mathcal{Y}; \Theta) \bigr]. \tag{2.3}$$

The log likelihood and Q-function are related to each other by

$$L(\Theta; \mathcal{X} = X) = E_{p(\mathcal{Y} \mid \mathcal{X} = X;\, \hat\Theta)} \bigl[ \ln p(\mathcal{X} = X, \mathcal{Y}; \Theta) - \ln p(\mathcal{Y} \mid \mathcal{X} = X; \Theta) \bigr] = Q(\Theta; \hat\Theta) + H(\Theta; \hat\Theta), \tag{2.4}$$

where $H(\Theta; \hat\Theta) = -E_{p(\mathcal{Y} \mid \mathcal{X} = X;\, \hat\Theta)} [\ln p(\mathcal{Y} \mid \mathcal{X} = X; \Theta)]$ is just the entropy of the conditional distribution of the hidden variables conditioned by the observed
data. Subtracting $L(\hat\Theta^{(t-1)}; \mathcal{X} = X)$ from $L(\hat\Theta^{(t)}; \mathcal{X} = X)$ gives a difference in Q-function values plus a Kullback-Leibler (KL) divergence, which describes the difference between the self- and cross-entropy terms, as follows:

$$\begin{aligned}
L(\hat\Theta^{(t)}; \mathcal{X} = X) - L(\hat\Theta^{(t-1)}; \mathcal{X} = X)
&= \bigl\{ Q(\hat\Theta^{(t)}; \hat\Theta^{(t-1)}) - Q(\hat\Theta^{(t-1)}; \hat\Theta^{(t-1)}) \bigr\}
 + \bigl\{ H(\hat\Theta^{(t)}; \hat\Theta^{(t-1)}) - H(\hat\Theta^{(t-1)}; \hat\Theta^{(t-1)}) \bigr\} \\
&= Q(\hat\Theta^{(t)}; \hat\Theta^{(t-1)}) - Q(\hat\Theta^{(t-1)}; \hat\Theta^{(t-1)})
 + D\bigl( p(\mathcal{Y} \mid \mathcal{X} = X; \hat\Theta^{(t-1)}) \,\|\, p(\mathcal{Y} \mid \mathcal{X} = X; \hat\Theta^{(t)}) \bigr).
\end{aligned} \tag{2.5}$$
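The consequence of equation 2.5 can be checked numerically. Below is a minimal EM loop for a two-component 1-D gaussian mixture (our own sketch with our own variable names, not the authors' code), asserting at every iteration that the observed-data log likelihood has not decreased:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 100)])

def gauss(x, m, s2):
    return np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

pi_, mu, s2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([4.0, 4.0])
prev_ll = -np.inf
for _ in range(50):
    dens = pi_ * gauss(x[:, None], mu, s2)      # (n, 2) weighted densities
    ll = np.sum(np.log(dens.sum(axis=1)))       # observed-data log likelihood
    assert ll >= prev_ll - 1e-9                 # EM monotonicity (eq. 2.5)
    prev_ll = ll
    r = dens / dens.sum(axis=1, keepdims=True)  # E-step: responsibilities
    nk = r.sum(axis=0)                          # M-step: update parameters
    pi_, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
    s2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

Each EM update increases the Q-function, and the KL term in equation 2.5 guarantees that this increase carries over to the likelihood, which the assertion verifies on every pass.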
Since a KL divergence is always nonnegative, the EM iteration guarantees that the likelihood cannot decrease.

The SMEM algorithm was proposed to overcome the problem of local optimality in the EM algorithm for mixture models by using a nonlocal search. Here, the SMEM algorithm is described in brief. (For a detailed description, refer to Ueda et al., 2000.) First, the ordinary EM algorithm is run, and consequently a stationary point $\Theta^*$, which satisfies $\Theta^* = \arg\max_{\Theta} Q(\Theta; \Theta^*)$, is obtained as the convergence result. Next, the split-and-merge operation is performed on $\Theta^*$. In this operation, three components are selected out of the mixture components as a split-and-merge candidate; one of them is split into two, and the other two are merged. Then two steps, called the partial and the full EM steps, are executed in turn; the partial EM step is applied to the above new three components, and the full EM step is applied to all components of the mixture. The full EM step is therefore equivalent to ordinary EM iteration, and consequently a stationary point $\Theta^{**}$, which satisfies $\Theta^{**} = \arg\max_{\Theta} Q(\Theta; \Theta^{**})$ and is different from $\Theta^*$, is obtained as the convergence result, using the result of the preceding partial EM step as the initial value. The split-and-merge operation and subsequent EM steps are attempted for $C_{\max}$ split-and-merge candidates, which are appropriately ranked. When the Q-function value $Q(\Theta^{**}; \Theta^{**})$ corresponding to the new answer $\Theta^{**}$ is larger than the Q-function value $Q(\Theta^*; \Theta^*)$ corresponding to $\Theta^*$, the answer $\Theta^{**}$ is accepted, and the above procedure is iterated using $\Theta^{**}$. If $Q(\Theta^{**}; \Theta^{**}) < Q(\Theta^*; \Theta^*)$ holds for all $C_{\max}$ answers, $\Theta^*$ at that time is returned as the final result.

3 Problem of SMEM Algorithm and Its Improvement
In the EM algorithm, since the entropy terms are calculated for the same posterior probability, an increase in likelihood is guaranteed: their difference yields a KL divergence, which is always nonnegative. In the SMEM algorithm, by contrast, the Q-functions that are compared at the acceptance-rejection evaluation are calculated by taking expectations with respect to
different posterior probabilities for $\Theta^*$ and $\Theta^{**}$. This means that if

$$Q(\Theta^{**}; \Theta^{**}) - Q(\Theta^*; \Theta^*) = L(\Theta^{**}; \mathcal{X} = X) - L(\Theta^*; \mathcal{X} = X) - \bigl\{ H(\Theta^{**}; \Theta^{**}) - H(\Theta^*; \Theta^*) \bigr\} > 0 \tag{3.1}$$

holds, then $\Theta^{**}$ is accepted as the new answer. Unlike equation 2.5, the subtraction between the two self-entropies on the right-hand side of equation 3.1 does not yield a KL divergence. Therefore, this part is not necessarily positive, and an increase in Q-function value does not correspond to an increase in likelihood in the SMEM algorithm. This shows that the acceptance-rejection judgment of the original SMEM algorithm is not fully compatible with the maximum-likelihood estimation framework. In particular, the global maximizer of the likelihood may be rejected when the entropy of the corresponding conditional distribution is relatively large. From this observation, an evaluation method that guarantees an increase in likelihood, and thus avoids incorrect rejection of the global maximum, is obtained by using the log likelihood itself in place of the Q-function when judging whether each new answer should be accepted.

Since the computational cost of the entropy is of the same order as that of the Q-function used in the original evaluation method, that is, (number of components) × (number of data), and the log likelihood is then easily obtained, the computational complexity of the corrected evaluation method remains at the same level as that of the original evaluation method. Since the difference in the acceptance-rejection judgment between these evaluation methods arises from the entropy term, this difference becomes apparent when the number of mixture components is small or when a major component with a large number of data is split or merged. Otherwise, we expect that the change in the total entropy caused by modifying three components is so small that the original SMEM algorithm using the Q-function converges near the global optimum.
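The two acceptance tests can be sketched as follows for a 1-D gaussian mixture. All function and variable names are ours, not the authors'; the block also checks the decomposition $L = Q(\Theta; \Theta) + H(\Theta; \Theta)$ of equation 2.4:

```python
import numpy as np

def gauss(x, m, s2):
    return np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def loglik(x, pi_, mu, s2):
    # Observed-data log likelihood L(theta): what the corrected method compares.
    return np.sum(np.log((pi_ * gauss(x[:, None], mu, s2)).sum(axis=1)))

def q_value(x, pi_, mu, s2):
    # Q(theta; theta): posterior-weighted complete-data log likelihood,
    # the quantity compared by the original SMEM acceptance test.
    dens = pi_ * gauss(x[:, None], mu, s2)
    r = dens / dens.sum(axis=1, keepdims=True)
    return np.sum(r * np.log(dens))

def accept_corrected(x, new, old):
    return loglik(x, *new) > loglik(x, *old)    # guarantees likelihood ascent

def accept_original(x, new, old):
    return q_value(x, *new) > q_value(x, *old)  # may reject a better model

# Sanity check of equation 2.4: L = Q(theta; theta) + H(theta; theta).
x = np.linspace(-2.0, 8.0, 50)
theta = (np.array([0.6, 0.4]), np.array([0.0, 5.0]), np.array([1.0, 2.0]))
dens = theta[0] * gauss(x[:, None], theta[1], theta[2])
r = dens / dens.sum(axis=1, keepdims=True)
entropy = -np.sum(r * np.log(r))
assert np.isclose(loglik(x, *theta), q_value(x, *theta) + entropy)
```

Because the entropy terms for $\Theta^*$ and $\Theta^{**}$ need not cancel, `accept_original` and `accept_corrected` can disagree, which is exactly the incompatibility discussed above.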
Therefore, the SMEM algorithm with the original evaluation method can be considered an approximation of the optimal estimation of mixture models that ignores this difference, whereas the algorithm with the corrected evaluation method finally returns the value giving the largest likelihood among those encountered during its execution. It should be noted that since the discussion in this article is based on the maximum-likelihood estimation framework assumed by the EM algorithm and is theoretically oriented, the possibility that the original SMEM algorithm with the incompatible evaluation method works well for practical problems is beyond the scope of this article.

4 Numerical Example
Here, a toy example is presented to demonstrate that the inequality relationship of the log likelihood and the value of the Q-function may be reversed. The EM algorithm is executed using two different initial values for a mixture model with three components, as shown in Table 1; two stationary points are obtained.

Table 1: Input Data Used in the Simulation.

  Component    m̄      s̄²     Number
  1             0.0    30.0    300
  2            10.0    30.0    300
  3            60.0    30.0    100

Table 2: Log Likelihood and Q-Function Values of Two Estimated Distributions.

  Estimated Distribution   Log Likelihood   Q-Function
  a                        -2655.1          -2898.7
  b                        -2656.8          -2677.8

Figure 1 shows the target distribution, the input data histogram, and the two estimated mixture distributions (estimated distributions a and b). In estimated distribution a, two components control the left-side peak, whereas in estimated distribution b, one component controls the left-side peak. Table 2 shows the values of both the log likelihood and
Figure 1: Distribution of input data and estimated distributions by different initial values.
Q-function of each estimated distribution. Note that the log likelihood is larger for estimated distribution a, whereas the Q-function is larger for estimated distribution b. This implies that if estimated distribution a is generated by performing the split-and-merge operation on estimated distribution b, distribution a is not accepted by the original SMEM algorithm. On the other hand, with the corrected evaluation method, distribution a is correctly accepted according to its log likelihood.

Acknowledgments
We thank N. Ueda for his helpful comments.

References

Bae, U., Lee, T., & Lee, S. (2000). Blind signal separation in teleconferencing using the ICA mixture model. Electronics Letters, 37(7), 680–682.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. New York: Dekker.
Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12, 2109–2128.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11(1), 95–103.

Received April 13, 2001; accepted October 2, 2001.
NOTE
Communicated by Samy Bengio
A Note on the Decomposition Methods for Support Vector Regression Shuo-Peng Liao [email protected] Hsuan-Tien Lin [email protected] Chih-Jen Lin [email protected] Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
The dual formulation of support vector regression involves two closely related sets of variables. When the decomposition method is used, many existing approaches use pairs of indices from these two sets as the working set. Basically, they select a base set first and then expand it so all indices are pairs. This makes the implementation different from that for support vector classification. In addition, a larger optimization subproblem has to be solved in each iteration. We provide theoretical proofs and conduct experiments to show that using the base set as the working set leads to similar convergence (number of iterations). Therefore, by using a smaller working set while keeping a similar number of iterations, the program can be simpler and more efficient.

1 Introduction
Given a set of data points, $\{(x_1, z_1), \ldots, (x_l, z_l)\}$, where $x_i \in R^n$ is an input and $z_i \in R^1$ is a target output, a major form for solving support vector regression (SVR) is the following optimization problem (Vapnik, 1998):
$$\begin{aligned}
\min \quad & \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} z_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \; i = 1, \ldots, l,
\end{aligned} \tag{1.1}$$

where $C$ is the upper bound, $Q_{ij} \equiv \phi(x_i)^T \phi(x_j)$, $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers associated with the $i$th datum $x_i$, and $\varepsilon$ is the error that users can tolerate. Note that training vectors $x_i$ are mapped into a higher-dimensional

Neural Computation 14, 1267–1281 (2002) © 2002 Massachusetts Institute of Technology
space by the function $\phi$. An important property is that for any optimal solution, $\alpha_i \alpha_i^* = 0$, $i = 1, \ldots, l$. Due to the density of $Q$, currently the decomposition method is the major method to solve equation 1.1 (Smola & Schölkopf, 1998; Keerthi, Shevade, Bhattacharyya, & Murthy, 2000; Laskov, 2002). It is an iterative process in which, in each iteration, the index set of variables is separated into two sets, $B$ and $N$, where $B$ is the working set. In that iteration, variables corresponding to $N$ are fixed, while a subproblem on the variables corresponding to $B$ is minimized. Following approaches for support vector classification, there are some methods for selecting the working set. Many existing approaches for regression first use these methods to find a set of variables, called the base set here; then they expand the base set so all elements are pairs. Here we define the expanded set as the pair set. For example, if $\{\alpha_i, \alpha_j^*\}$ are chosen first, they include $\{\alpha_i^*, \alpha_j\}$ in the working set. Then the following subproblem of four variables $(\alpha_i, \alpha_i^*, \alpha_j, \alpha_j^*)$ is solved:
$$\begin{aligned}
\min \quad & \frac{1}{2} \begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix}^T \begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{bmatrix} \begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix}
 + \bigl( Q_{i,N} (\alpha_N - \alpha_N^*) + z_i \bigr) (\alpha_i - \alpha_i^*) \\
& + \bigl( Q_{j,N} (\alpha_N - \alpha_N^*) + z_j \bigr) (\alpha_j - \alpha_j^*) + \varepsilon (\alpha_i + \alpha_i^* + \alpha_j + \alpha_j^*) \\
\text{subject to} \quad & (\alpha_i - \alpha_i^*) + (\alpha_j - \alpha_j^*) = -\sum_{t \in N} (\alpha_t - \alpha_t^*), \\
& 0 \le \alpha_i, \alpha_j, \alpha_i^*, \alpha_j^* \le C.
\end{aligned} \tag{1.2}$$

Note that $\alpha_N$ and $\alpha_N^*$ are the fixed elements corresponding to $N = \{ t \mid 1 \le t \le l, t \ne i, t \ne j \}$. A reason for selecting pairs is to maintain $\alpha_i \alpha_i^* = 0$, $i = 1, \ldots, l$, throughout all iterations. Hence, the number of nonzero variables during iterations can be kept small. However, it has been shown in Lin (2001a, Theorem 4.1) that for some existing work (e.g., Keerthi et al., 2000; Laskov, 2002), if the base set is used as the working set, the property $\alpha_i \alpha_i^* = 0$, $i = 1, \ldots, l$, still holds. In section 2, we discuss this in more detail. Recently there have been implementations without using pairs of indices, for example, LIBSVM (Chang & Lin, 2001), SVMTorch (Collobert & Bengio, 2001), and mySVM (Rüping, 2000). A question immediately raised concerns the performance of these two approaches: the base approach and the pair approach. The pair approach solves a larger subproblem in each iteration, so there may be fewer iterations; however, a larger subproblem takes more time, so the cost of each iteration is higher. Collobert and Bengio (2001) have stated that working with pairs of variables would force the algorithm to do many computations with null variables until the end of the optimization process. In section 3, we elaborate
on this in a detailed proof. First, we consider approaches with the smallest size of working set (two and four elements for the base and pair approaches, respectively), where the analytic solution of the subproblem is readily available from sequential minimal optimization (SMO) (Platt, 1998). From mathematical explanations, we show that while solving the subproblems of the pair set containing four variables, in most cases only the two variables in the base set are updated. Therefore, the number of iterations of the base and the pair approaches is nearly the same. In addition, for larger working sets, we prove that after some finite number of iterations, the solution of the subproblem using only the base set is already optimal for the subproblem using the pair set. These results give us theoretical justification that it is not necessary to use pairs of variables. In section 4, we conduct experiments to demonstrate the validity of our analysis. Then in section 5, we provide some conclusions and a discussion. There are other decomposition approaches for support vector regression (for example, Flake & Lawrence, 2002). They deal with different situations, which will not be discussed here.

2 Working Set Selection
Here we consider the working set selections from Joachims (1998) and Keerthi, Shevade, Bhattacharyya, and Murthy (2001), which were originally designed for classification, and apply them to SVR. To make SVR resemble the form of classification, we define the following $2l$ by $1$ vectors:

\[ \alpha^{(*)} \equiv \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix} \quad\text{and}\quad y_i \equiv \begin{cases} +1, & i = 1, \dots, l, \\ -1, & i = l+1, \dots, 2l. \end{cases} \tag{2.1} \]

Then the regression problem, 1.1, can be reformulated as

\[ \min\; f(\alpha^{(*)}) = \frac{1}{2} (\alpha^{(*)})^T \begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix} \alpha^{(*)} + \begin{bmatrix} \epsilon e^T + z^T, & \epsilon e^T - z^T \end{bmatrix} \alpha^{(*)}, \]
\[ 0 \le \alpha_i^{(*)} \le C, \quad i = 1, \dots, 2l, \qquad y^T \alpha^{(*)} = 0. \tag{2.2} \]
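The reformulation above maps directly to code. A minimal NumPy sketch (our own function and variable names, not code from any package cited here) builds the $2l \times 2l$ quadratic form and the gradient of equation 2.2:

```python
import numpy as np

def svr_as_classification(Q, z, eps):
    """Build the classification-like form of equation 2.2 from the
    l-by-l kernel matrix Q, the targets z, and the tube width eps.
    Returns (Qbar, p, y) such that f(a) = 0.5 * a @ Qbar @ a + p @ a,
    subject to y @ a = 0 and 0 <= a_i <= C, for the stacked vector
    a = [alpha; alpha*].  (Illustrative sketch, not library code.)"""
    l = Q.shape[0]
    Qbar = np.block([[Q, -Q], [-Q, Q]])            # [[Q, -Q], [-Q, Q]]
    p = np.concatenate([eps + z, eps - z])         # [eps*e + z; eps*e - z]
    y = np.concatenate([np.ones(l), -np.ones(l)])  # labels of equation 2.1
    return Qbar, p, y

def gradient(Qbar, p, a):
    """grad f(a) = Qbar @ a + p for the quadratic objective above."""
    return Qbar @ a + p
```

The gradient routine is the only quantity the working set selections of this section need to evaluate.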
Now $f$ is the objective function of equation 2.2. Practically, we can use the Karush-Kuhn-Tucker (KKT) condition to test whether a given $\alpha^{(*)}$ is an optimal solution of equation 2.2: if there exists a number $b$ such that for all $i = 1, 2, \dots, 2l$,

\[ \nabla f(\alpha^{(*)})_i + b y_i \ge 0 \quad \text{if } \alpha_i^{(*)} = 0, \tag{2.3a} \]
\[ \nabla f(\alpha^{(*)})_i + b y_i \le 0 \quad \text{if } \alpha_i^{(*)} = C, \tag{2.3b} \]
\[ \nabla f(\alpha^{(*)})_i + b y_i = 0 \quad \text{if } 0 < \alpha_i^{(*)} < C, \tag{2.3c} \]

then a feasible $\alpha^{(*)}$ is optimal for equation 2.2. Note that the range of $b$ can be
1270
Shuo-Peng Liao, Hsuan-Tien Lin, and Chih-Jen Lin
determined by

\[ m(\alpha^{(*)}) \equiv \max\Big( \max_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} < C,\; y_t = 1}} -y_t \nabla f(\alpha^{(*)})_t,\;\; \max_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} > 0,\; y_t = -1}} -y_t \nabla f(\alpha^{(*)})_t \Big), \tag{2.4} \]

\[ M(\alpha^{(*)}) \equiv \min\Big( \min_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} > 0,\; y_t = 1}} -y_t \nabla f(\alpha^{(*)})_t,\;\; \min_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} < C,\; y_t = -1}} -y_t \nabla f(\alpha^{(*)})_t \Big). \tag{2.5} \]

That is, a feasible $\alpha^{(*)}$ is an optimal solution if and only if $m(\alpha^{(*)}) \le b \le M(\alpha^{(*)})$, or equivalently,

\[ M(\alpha^{(*)}) - m(\alpha^{(*)}) \ge 0. \tag{2.6} \]
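In code, $m(\alpha^{(*)})$, $M(\alpha^{(*)})$, and the optimality test of equation 2.6 reduce to a few masked reductions over $-y_t \nabla f(\alpha^{(*)})_t$. The sketch below (hypothetical helper names; inputs are the stacked vectors of equation 2.1 as NumPy arrays) also returns the maximal-violating indices used in equation 2.8 and the stopping test of equation 2.10:

```python
import numpy as np

def m_and_M(grad, a, y, C):
    """m(a) and M(a) of equations 2.4 and 2.5.  grad is the gradient of
    f at a; y holds the +/-1 labels of the 2l variables.  (Sketch.)"""
    v = -y * grad                                    # -y_t * grad f(a)_t
    cand_m = ((a < C) & (y == 1)) | ((a > 0) & (y == -1))
    cand_M = ((a > 0) & (y == 1)) | ((a < C) & (y == -1))
    return v[cand_m].max(), v[cand_M].min()

def maximal_violating_pair(grad, a, y, C):
    """Indices (i, j) with i in arg m^k and j in arg M^k (equation 2.8)."""
    v = -y * grad
    cand_m = np.flatnonzero(((a < C) & (y == 1)) | ((a > 0) & (y == -1)))
    cand_M = np.flatnonzero(((a > 0) & (y == 1)) | ((a < C) & (y == -1)))
    return cand_m[np.argmax(v[cand_m])], cand_M[np.argmin(v[cand_M])]

def should_stop(m, M, delta):
    """Stopping criterion of equation 2.10: M^k - m^k >= -delta."""
    return M - m >= -delta
```

A feasible point is optimal exactly when `should_stop(m, M, 0.0)` holds, that is, when $M - m \ge 0$.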
For convenience, we define the candidates of $m(\alpha^{(*)})$ as the set of all indices $t$, $1 \le t \le 2l$, that satisfy $\alpha_t^{(*)} < C, y_t = 1$ or $\alpha_t^{(*)} > 0, y_t = -1$. Similarly, we define the candidates of $M(\alpha^{(*)})$. At the beginning of iteration $k$, let $\alpha^{(*),k} = [\alpha^k, (\alpha^*)^k]^T$ be the vector that we are working on. Then we denote $m^k \equiv m(\alpha^{(*),k})$ and $M^k \equiv M(\alpha^{(*),k})$. Also, let $\arg m^k$ be the subset of indices $t$ in the candidates of $m^k$ such that $-y_t \nabla f(\alpha^{(*),k})_t = m(\alpha^{(*),k})$. Similarly, we define $\arg M^k$. During iterations of the decomposition method, $\alpha^{(*),k}$ is not yet optimal, so

\[ m^k > M^k, \quad \text{for all } k. \tag{2.7} \]

If we would like to select two elements as the working set, intuitively we tend to choose indices $i$ and $j$ that satisfy

\[ i \in \arg m^k \quad \text{and} \quad j \in \arg M^k, \tag{2.8} \]
since they cause the maximal violation of the KKT condition. A systematic way to select a larger working set in each iteration is as follows. If $q$, an even number, is the size of the working set, $q/2$ indices are sequentially selected from the largest $-y_i \nabla f(\alpha^{(*)})_i$ values to smaller ones in the candidate set of $m^k$; that is,

\[ -y_{i_1} \nabla f(\alpha^{(*),k})_{i_1} \ge \cdots \ge -y_{i_{q/2}} \nabla f(\alpha^{(*),k})_{i_{q/2}}, \]

where $i_1 \in \arg m^k$. The other $q/2$ indices are sequentially selected from the smallest $-y_j \nabla f(\alpha^{(*)})_j$ values to larger ones in the candidate set of $M^k$; that is,

\[ -y_{j_{q/2}} \nabla f(\alpha^{(*),k})_{j_{q/2}} \ge \cdots \ge -y_{j_1} \nabla f(\alpha^{(*),k})_{j_1}, \]
where $j_1 \in \arg M^k$. Also, we require

\[ -y_{j_{q/2}} \nabla f(\alpha^{(*),k})_{j_{q/2}} < -y_{i_{q/2}} \nabla f(\alpha^{(*),k})_{i_{q/2}}, \tag{2.9} \]

to ensure that the intersection of these two groups is empty. Thus, if $q$ is large, sometimes the actual number of selected indices may be less than $q$. Note that this is the same as the working set selection in Joachims (1998). However, the original derivation in Joachims was based on the concept of feasible directions for constrained optimization problems, not on the violation of the KKT condition. After the base set of $q$ indices is selected, earlier approaches (Keerthi et al., 2000; Laskov, 2002) expand the set so that all elements in it are pairs. The reason is to keep the property that $\alpha_i^k (\alpha^*)_i^k = 0$, $i = 1, \dots, l$, for all $k$. However, if elements in the base set are used directly, the following theorem has been proved in Lin (2001a, theorem 4.1):

Theorem 1. If the initial solution is zero, then $\alpha_i^k (\alpha^*)_i^k = 0$, $i = 1, \dots, l$, for all $k$.
Hence we know that $\alpha_i^k (\alpha^*)_i^k = 0$ is not a particular advantage of using pairs of indices. Another important issue for the decomposition method is the stopping criterion. From equation 2.6, a natural choice of the stopping criterion is

\[ M^k - m^k \ge -\delta, \tag{2.10} \]

where $\delta$, the stopping tolerance, is a small positive number. For $q = 2$, equation 2.10 is the same as

\[ -y_j \nabla f(\alpha^{(*),k})_j - (-y_i \nabla f(\alpha^{(*),k})_i) \ge -\delta, \tag{2.11} \]
where $i, j$ are selected by equation 2.8. Note that the convergence of the decomposition method under some conditions on the kernel matrix $Q$ is shown in Lin (2001a) for the base approach. Some theoretical justification of the stopping criterion, equation 2.10, for the decomposition method is given in Lin (2001b). No particular convergence proof has been made for the pair approach, but we will assume convergence for our analyses.

3 Number of Iterations
In this section, we discuss the relationship between the solutions of subproblems using the base and the pair approaches. The discussion is divided into two parts. First, we consider approaches with the smallest size of working
set (two and four elements, respectively). We show that for most iterations, the optimal solution of the subproblem using the base set is already optimal for the subproblem using the pair set, so the difference between the numbers of iterations of the base and pair approaches should not be large. Then we consider larger working sets. Although the result is not as elegant as in the first part, we can still show that after enough iterations, the optimal solution of the subproblem using the base set is the same as the optimal solution of the subproblem using the pair set. To start our proof, we state an important property of the difference between the $i$th and $(i+l)$th gradient elements. Consider $\alpha_i$ and $\alpha_i^*$, $1 \le i \le l$. We have

\[ \nabla f(\alpha^{(*)})_{i+l} = -(Q(\alpha - \alpha^*))_i + \epsilon - z_i = -\nabla f(\alpha^{(*)})_i + 2\epsilon. \]

Note that $y_i = 1$ and $y_{i+l} = -1$ as defined in equation 2.1, so

\[ -y_{i+l} \nabla f(\alpha^{(*)})_{i+l} = -y_i \nabla f(\alpha^{(*)})_i + 2\epsilon. \tag{3.1} \]
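This identity follows from the definitions in equation 2.2 alone and is easy to confirm numerically; the self-contained check below uses random data and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
l, eps = 5, 0.5
X = rng.standard_normal((l, 3))
Q = X @ X.T                              # any positive semidefinite matrix
z = rng.standard_normal(l)
a = rng.uniform(0.0, 1.0, 2 * l)         # stacked a = [alpha; alpha*]

# gradient of equation 2.2: grad f(a) = [[Q, -Q], [-Q, Q]] a + [eps+z; eps-z]
Qbar = np.block([[Q, -Q], [-Q, Q]])
grad = Qbar @ a + np.concatenate([eps + z, eps - z])

# equation 3.1: grad f(a)_{i+l} = -grad f(a)_i + 2*eps, for every i
assert np.allclose(grad[l:], -grad[:l] + 2 * eps)
```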
We will use this in later analyses. When $q = 2$, the base set is selected by equation 2.8. It is easy to see that indices $i$ and $i+l$, where $1 \le i \le l$, cannot both be chosen at the same time. For example, if indices $i$ and $i+l$ are both selected by equation 2.8, then by equations 3.1 and 2.7 we must have $i+l \in \arg m^k$ and $i \in \arg M^k$. By equations 2.4 and 2.5, this means $\alpha_{i+l}^{(*),k} > 0$ and $\alpha_i^{(*),k} > 0$, which violates theorem 1. Therefore, $(i, i+l)$ cannot be chosen together.

Also, if one of $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ is selected, the other must be zero. We can prove this by contradiction. Without loss of generality, suppose $\alpha_i^{(*),k}$ is selected and $\alpha_{i+l}^{(*),k}$ is nonzero; then by theorem 1, $\alpha_i^{(*),k}$ must be zero. So by equations 2.4, 2.5, and 2.8, $i \in \arg m^k$. Moreover, by equation 2.4, both $i$ and $i+l$ are in the candidate set of $m^k$, and we must have $-y_i \nabla f(\alpha^{(*),k})_i \ge -y_{i+l} \nabla f(\alpha^{(*),k})_{i+l}$ since $i \in \arg m^k$. But this contradicts equation 3.1. Therefore, if one of $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ is selected, the other must be zero.

If any one of $(i, j)$, $(i, j+l)$, $(i+l, j)$, $(i+l, j+l)$, where $1 \le i, j \le l$, is chosen by equation 2.8, our goal is to examine the difference between solving the two-variable subproblem and the four-variable subproblem of $(i, j, i+l, j+l)$. Without loss of generality, we consider the case where $(i, j)$ is chosen by equation 2.8. Then $(\alpha_i^{(*),k}, \alpha_j^{(*),k}, \alpha_{i+l}^{(*),k}, \alpha_{j+l}^{(*),k})$ are the corresponding variables at iteration $k$; from the earlier discussion, we know $\alpha_{i+l}^{(*),k} = \alpha_{j+l}^{(*),k} = 0$. After a two-variable subproblem on $\alpha_i^{(*)}$ and $\alpha_j^{(*)}$ is solved, we assume the new values are $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$. From equation 3.1 and the KKT condition, it is easy to see that if $\bar\alpha_i^{(*),k} > 0$ and $\bar\alpha_j^{(*),k} > 0$, then $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is already an optimal solution of the four-variable problem, equation 1.2.
Figure 1: Possible situation of plane changes.
Therefore, the only difference arises when there is a "jump" from the $(i, j)$ plane to another plane of two variables and the objective value of equation 1.2 can be further decreased. We illustrate this in Figure 1, where each square represents a plane of two nonzero variables. From the linear constraint

\[ \alpha_i^{(*)} + \alpha_j^{(*)} = -\sum_{t \ne i, j} y_t \alpha_t^{(*)}, \tag{3.2} \]

the two dashed parallel lines in Figure 1 show how the solution plane can change. For example, after $\bar\alpha_j^{(*),k}$ becomes zero, if $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is not an optimal solution of equation 1.2, we may further reduce the objective value by entering the $(i, j^*)$ plane. We will check under what conditions $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is not an optimal solution of equation 1.2.
Since $\alpha_i^{(*)}$ and $\alpha_j^{(*)}$ are adjusted along the line, equation 3.2, we consider the objective value of the subproblem on the $(i, j)$ plane as the following function of a single variable $v$, where $N = \{t \mid 1 \le t \le l,\ t \ne i,\ t \ne j\}$ are the indices of the fixed variables:

\[
g(v) \equiv \frac{1}{2}
\begin{bmatrix} \alpha_i^{(*),k} + v & \alpha_j^{(*),k} - v \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i^{(*),k} + v \\ \alpha_j^{(*),k} - v \end{bmatrix}
+ \big( Q_{i,N} (\alpha_N^k - (\alpha^*)_N^k) + \epsilon + z_i \big) (\alpha_i^{(*),k} + v)
\]
\[
+ \big( Q_{j,N} (\alpha_N^k - (\alpha^*)_N^k) + \epsilon + z_j \big) (\alpha_j^{(*),k} - v)
= \frac{1}{2} (Q_{ii} - 2Q_{ij} + Q_{jj}) v^2 + \big( \nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j \big) v + \text{constant}.
\]

Since $\alpha_i^{(*)}$ is increased from $\alpha_i^{(*),k}$ to $\bar\alpha_i^{(*),k}$, we know

\[ g'(0) = \nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j < 0. \]

Now if $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is not an optimal solution of equation 1.2, we can define a new function $\bar g(v)$, similar to $g(v)$, at $(\bar\alpha_i^{(*),k}, 0)$ of the $(i, j+l)$ plane. If $\bar\alpha_i^{(*),k}$ can be further increased,

\[ \bar g'(0) = \nabla f(\bar\alpha^{(*),k})_i + \nabla f(\bar\alpha^{(*),k})_{j+l} < 0. \tag{3.3} \]

However, from equation 3.1 and $\bar\alpha_i^{(*),k} - \alpha_i^{(*),k} = -(\bar\alpha_j^{(*),k} - \alpha_j^{(*),k})$,

\[
\nabla f(\bar\alpha^{(*),k})_i + \nabla f(\bar\alpha^{(*),k})_{j+l}
= \nabla f(\bar\alpha^{(*),k})_i - \nabla f(\bar\alpha^{(*),k})_j + 2\epsilon
\]
\[
= \nabla f(\alpha^{(*),k})_i + Q_{ii} (\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}) + Q_{ij} (\bar\alpha_j^{(*),k} - \alpha_j^{(*),k})
- \big( \nabla f(\alpha^{(*),k})_j + Q_{ji} (\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}) + Q_{jj} (\bar\alpha_j^{(*),k} - \alpha_j^{(*),k}) \big) + 2\epsilon
\]
\[
= \big( \nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j \big)
+ (\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}) (Q_{ii} - 2Q_{ij} + Q_{jj}) + 2\epsilon. \tag{3.4}
\]

Since $Q$ is positive semidefinite, $Q_{ii} Q_{jj} - Q_{ij}^2 \ge 0$, which implies $Q_{ii} - 2Q_{ij} + Q_{jj} \ge 0$. With $\bar\alpha_i^{(*),k} - \alpha_i^{(*),k} \ge 0$ and equation 3.3, we know that if

\[
\big( \nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j \big)
+ (\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}) (Q_{ii} - 2Q_{ij} + Q_{jj}) + 2\epsilon \ge 0, \tag{3.5}
\]

it is impossible to move $(\bar\alpha_i^{(*),k}, 0)$ on the $(i, j+l)$ plane further. That is, $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is already an optimal solution of equation 1.2. For the other cases, that is, $(i, j+l)$, $(i+l, j)$, and $(i+l, j+l)$, the results are the same. Note that now $1 \le i, j \le l$, so $\nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j$ is actually the value obtained in equation 2.11; that is, it is the number used for checking the stopping criterion. Therefore, we have the following theorem:
Theorem 2. For all iterations in which the violation of the stopping criterion, equation 2.11, is no more than $2\epsilon$, an optimal solution of the two-variable subproblem is already an optimal solution of the corresponding four-variable subproblem.
If $\epsilon$ is not small, the violation of the stopping criterion in most iterations is smaller than $2\epsilon$. In addition, as most decomposition iterations are spent in the final stage
due to slow convergence, this theorem shows conclusively that no matter whether the two-variable or the four-variable approach is used, the difference in the number of iterations should not be large.

For larger working sets ($q > 2$), we may not be able to obtain results as elegant as theorem 2. When $q = 2$, for example, we know exactly the relation between the changes of $\alpha_i^{(*),k}$ and $\alpha_j^{(*),k}$ in one iteration, as $\bar\alpha_i^{(*),k} - \alpha_i^{(*),k} = -(\bar\alpha_j^{(*),k} - \alpha_j^{(*),k})$. However, when $q > 2$, the change in each variable can be different. In the following, we show that if $\{\alpha^{(*),k}\}$ is an infinite sequence, then during the final iterations, after $k$ is large enough, solving the subproblem of the base set is the same as solving the larger subproblem of the pair set.

Next, we describe some properties that will be used in the proof. Assume that the sequence $\{\alpha^{(*),k}\}$ of the base approach converges to an optimal solution $\hat\alpha^{(*)}$. Then we can define

\[ \hat M \equiv M(\hat\alpha^{(*)}) \quad \text{and} \quad \hat m \equiv m(\hat\alpha^{(*)}). \tag{3.6} \]

We also note that equation 2.9 implies that for any index $1 \le i \le 2l$ in the working set of the $k$th iteration,

\[ M^k \le -y_i \nabla f(\alpha^{(*),k})_i \le m^k. \tag{3.7} \]

Now we describe two theorems from Lin (2001b) that are needed for the main proof. These theorems deal with a general framework of decomposition methods for different SVM formulations. We can easily check that the base approach satisfies the required conditions of these two theorems, so they can be applied:

Theorem 3.

\[ \lim_{k \to \infty} m^k - M^k = 0. \tag{3.8} \]

Theorem 4. For any $\hat\alpha_i^{(*)}$, $1 \le i \le 2l$, whose corresponding $-y_i \nabla f(\hat\alpha^{(*)})_i$ is neither $\hat m$ nor $\hat M$, after $k$ is large enough, $\alpha_i^{(*),k}$ is at a bound and is equal to $\hat\alpha_i^{(*)}$.
Immediately, we have a corollary of theorem 3 that is specific to SVR:

Corollary 1. After $k$ is large enough, $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$, for $i = 1, 2, \dots, l$, are never both selected in the base working set.

Proof. By the convergence of $m^k - M^k$ to 0, after $k$ is large enough, $m^k - M^k < 2\epsilon$. If there exists $1 \le i \le l$ such that $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ are both selected in the base working set, then from equation 3.7, $M^k \le -\nabla f(\alpha^{(*),k})_i \le m^k$ and
$M^k \le \nabla f(\alpha^{(*),k})_{i+l} \le m^k$. However, equation 3.1 shows $\nabla f(\alpha^{(*),k})_{i+l} = -\nabla f(\alpha^{(*),k})_i + 2\epsilon$, so $m^k - M^k \ge 2\epsilon$, a contradiction.

Next, we present the main proof of this section, which is an analysis of the infinite sequence $\{\alpha^{(*),k}\}$.

Theorem 5. Assume that $\hat M \ne \hat m + 2\epsilon$. After $k$ is large enough, the solution of any optimization subproblem of the base set is already optimal for the larger subproblem of the pair set.
Proof. If the result is wrong, there is an index $1 \le i \le l$ and an infinite set $\mathcal K$ such that for all $k \in \mathcal K$, $\alpha_i^{(*),k}$ (or $\alpha_{i+l}^{(*),k}$) is selected in the working set, but then $\alpha_{i+l}^{(*),k}$ (or $\alpha_i^{(*),k}$) is also modified. Without loss of generality, we assume that $\alpha_i^{(*),k}$ is selected in the working set but $\alpha_{i+l}^{(*),k}$ is modified infinitely many times. So by theorem 4,

\[ \nabla f(\hat\alpha^{(*)})_{i+l} = \hat m \quad \text{or} \quad \nabla f(\hat\alpha^{(*)})_{i+l} = \hat M. \]

By equation 3.1,

\[ -\nabla f(\hat\alpha^{(*)})_i = \hat m - 2\epsilon < \hat m, \quad \text{or} \quad -\nabla f(\hat\alpha^{(*)})_i = \hat M - 2\epsilon < \hat M. \]

For the second case, by the assumption that $\hat m \ne \hat M - 2\epsilon$, we have $\hat m > -\nabla f(\hat\alpha^{(*)})_i$ or $\hat m < -\nabla f(\hat\alpha^{(*)})_i$. But if $\hat m < -\nabla f(\hat\alpha^{(*)})_i$, we have $\hat m < -\nabla f(\hat\alpha^{(*)})_i < \hat M$, which is impossible for an optimal solution. Hence,

\[ -y_i \nabla f(\hat\alpha^{(*)})_i < \hat m \tag{3.9} \]

holds in both cases. Therefore, we can define

\[ \Delta \equiv \min\big( \epsilon/2,\; (\hat m - (-y_i \nabla f(\hat\alpha^{(*)})_i))/3 \big) > 0. \]

By the convergence of the sequence $\{-y_j \nabla f(\alpha^{(*),k})_j\}$ to $-y_j \nabla f(\hat\alpha^{(*)})_j$, for all $j = 1, \dots, 2l$, after $k$ is large enough,

\[ |y_j \nabla f(\alpha^{(*),k})_j - y_j \nabla f(\alpha^{(*),k+1})_j| \le \Delta \tag{3.10} \]

and

\[ |y_j \nabla f(\alpha^{(*),k})_j - y_j \nabla f(\hat\alpha^{(*)})_j| \le \Delta. \tag{3.11} \]

Suppose that at the $k$th iteration, $j \in \arg M^k$ is selected in the working set and $-y_j \nabla f(\alpha^{(*),k})_j = M^k$. By equations 3.7, 3.10, 3.11, and 3.9,

\[
-y_j \nabla f(\hat\alpha^{(*)})_j \le -y_j \nabla f(\alpha^{(*),k})_j + \Delta = M^k + \Delta
\le -y_i \nabla f(\alpha^{(*),k})_i + \Delta \le -y_i \nabla f(\hat\alpha^{(*)})_i + 2\Delta
\]
\[
\le -y_i \nabla f(\hat\alpha^{(*)})_i + 2\big( \hat m - (-y_i \nabla f(\hat\alpha^{(*)})_i) \big)/3 < \hat m \le \hat M. \tag{3.12}
\]

From theorem 4 and equation 3.12, after $k$ is large enough, $\alpha_j^{(*),k}$ is at a bound and is equal to $\hat\alpha_j^{(*)}$; that is, $\alpha_j^{(*),k} = \alpha_j^{(*),k+1} = \hat\alpha_j^{(*)}$. Since $\alpha_j^{(*),k+1} = \alpha_j^{(*),k}$ is in the candidates of $M^{k+1}$, by equations 3.7, 3.10, and 3.11,

\[
M^{k+1} \le -y_j \nabla f(\alpha^{(*),k+1})_j \le -y_j \nabla f(\alpha^{(*),k})_j + \Delta = M^k + \Delta
\le -y_i \nabla f(\alpha^{(*),k})_i + \Delta \le -y_i \nabla f(\alpha^{(*),k+1})_i + 2\Delta.
\]

Hence, since $2\Delta \le \epsilon$, we get

\[ M^{k+1} \le -y_i \nabla f(\alpha^{(*),k+1})_i + \epsilon. \tag{3.13} \]

On the other hand, $\alpha_{i+l}^{(*),k}$ is modified to $\alpha_{i+l}^{(*),k+1}$, so at least one of them is strictly positive. By the definition of $m^k$,

\[ \nabla f(\alpha^{(*),k})_{i+l} \le m^k, \quad \text{or} \quad \nabla f(\alpha^{(*),k+1})_{i+l} \le m^{k+1}. \tag{3.14} \]

From equations 3.1, 3.7, 3.13, and 3.14, for all large enough $k \in \mathcal K$,

\[ M^k \le m^k - 2\epsilon, \quad \text{or} \quad M^{k+1} \le m^{k+1} - \epsilon. \]

Therefore,

\[ \lim_{k \to \infty} m^k - M^k \ne 0, \]

which contradicts theorem 3.

4 Experiments
We consider two regression problems, abalone (4177 data points) and add10 (9792 data points), from Blake and Merz (1998) and Friedman (1988), respectively. The RBF kernel is used: $Q_{ij} \equiv e^{-\gamma \|x_i - x_j\|^2}$.
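As an aside, the RBF kernel matrix can be formed with a standard vectorized computation; this sketch is generic NumPy, not the authors' LIBSVM-based code, and its default follows the $\gamma = 1/n$ choice described in the text:

```python
import numpy as np

def rbf_kernel(X, gamma=None):
    """Q_ij = exp(-gamma * ||x_i - x_j||^2).  If gamma is None, use
    1/n with n the number of attributes, as in these experiments.
    (Illustrative sketch.)"""
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))       # clip rounding noise
```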
Since our purpose is not to examine the quality of the solutions, we do not perform model selection on the value of $\gamma$. Instead, we fix it to $1/n$, where $n$ is the number of attributes of each data set. For these two problems, $n$ is
Table 1: Problem abalone.

Parameters           Iteration      Iteration      SV     Candidates^a   Jumps^b
                     (2-variable)   (4-variable)
C = 10,  ε = 0.1     18,930         18,790         3967   8438           14
C = 10,  ε = 1       10,173         10,173         2183   1705           0
C = 10,  ε = 10      17             17             7      0              0
C = 100, ε = 0.1     142,190        140,686        3938   63,057         207
C = 100, ε = 1       122,038        119,981        2147   10,187         48
C = 100, ε = 10      261            261            9      0              0

Notes: ^a The number of iterations that violate conditions of theorem 2. ^b The number of plane changes as illustrated in Figure 1.
Table 2: Problem add10.

Parameters           Iteration      Iteration      SV     Candidates     Jumps
                     (2-variable)   (4-variable)
C = 10,  ε = 0.1     28,625         28,955         9254   13,918         1
C = 10,  ε = 1       22,067         21,913         4997   2795           1
C = 10,  ε = 10      116            116            14     0              0
C = 100, ε = 0.1     350,344        350,325        9158   109,644        12
C = 100, ε = 1       227,604        227,604        4265   14,628         0
C = 100, ε = 10      105            105            11     0              0
eight and ten, respectively. Based on our experience, we think this is an appropriate value when the data are scaled to $[-1, 1]$. Tables 1 and 2 present results using different $\epsilon$ and $C$ on the two problems. We consider $\epsilon$ no smaller than 0.1 because for smaller $\epsilon$, the number of support vectors approaches the number of training data. On the other hand, we consider $\epsilon$ up to 10, where the number of support vectors is close to zero. For each parameter set, we present the number of iterations for both the two-variable and four-variable approaches, the number of support vectors, the number of iterations that violate the conditions of theorem 2 (i.e., possible candidates of jumps), and the number of real jumps, as illustrated in Figure 1, when using the four-variable approach. The solution of the four-variable approach is obtained as follows. First, a two-variable problem obtained from equation 2.8 is solved. If at least one variable goes to zero, another two-variable problem has to be solved. As indicated in Figure 1, at most three two-variable problems are needed. For our experiments, both versions of the code are directly modified from LIBSVM (version 2.03). It can be clearly seen that both approaches take nearly the same number of iterations. In addition, the number of jumps while using the four-variable
Table 3: Problem abalone (First 200 Data).

                      q = 10                            q = 20
Parameters            Iteration  Iteration  NAVM^a     Iteration  Iteration  NAVM^a
                      (base)     (pairs)               (base)     (pairs)
C = 10,  ε = 0.1      132        173        77         100        114        111
C = 10,  ε = 1        49         46         3          42         40         10
C = 10,  ε = 10       1          1          0          1          1          0
C = 100, ε = 0.1      1641       1983       184        1168       1086       250
C = 100, ε = 1        807        552        29         258        310        46
C = 100, ε = 10       1          1          0          1          1          0

Note: ^a Number of added variables that are modified for the pair approach.
approach is very small, especially when $\epsilon$ is larger. Furthermore, the number of "candidates" is much larger than the number of real jumps. This means that $(\bar\alpha_i^{(*),k} - \alpha_i^{(*),k})(Q_{ii} - 2Q_{ij} + Q_{jj})$ of equation 3.4 is large enough that equation 3.5 is usually satisfied. Next, we experiment with larger working sets. We use a simple implementation written in MATLAB, so only small problems are tested. We consider the first 200 data points of abalone and add10. Results are in Tables 3 and 4, where we show the number of iterations and the number of added variables that are modified (NAVM) for the pair approach. Note that we use "number of added variables that are modified" instead of "number of jumps," since for larger subproblems we cannot model the change of variables as jumps between planes. The NAVM column is defined as follows. If the pair approach is used, the working set in the $k$th iteration is the union of two sets: $B^k$, the base set, and its extension $\bar B^k$. We check the values of the variables in $\alpha^{(*)}_{\bar B^k}$ before and after solving the subproblem. NAVM is the count of modified variables in $\alpha^{(*)}_{\bar B^k}$ summed over all iterations, so it is at most the sum of $|\bar B^k|$ over all iterations, which is roughly the number of iterations multiplied by the maximal size of the base set, $q$. In Tables 3 and 4, the NAVM column is relatively small compared to the number of iterations multiplied by $q$. This means that in nearly all optimization steps, only variables corresponding to the base set, rather than the extended set, are changed. In addition, from Tables 1 through 4, we find that the pair approach may not lead to fewer iterations, so it is not necessary to use the pair approach for solving SVR.
Table 4: Problem add10 (First 200 Data).

                      q = 10                            q = 20
Parameters            Iteration  Iteration  NAVM^a     Iteration  Iteration  NAVM^a
                      (base)     (pairs)               (base)     (pairs)
C = 10,  ε = 0.1      59         50         13         27         26         33
C = 10,  ε = 1        55         60         0          18         18         0
C = 10,  ε = 10       2          2          0          2          2          0
C = 100, ε = 0.1      2519       2094       112        1317       1216       135
C = 100, ε = 1        944        956        12         236        278        28
C = 100, ε = 10       2          2          0          2          2          0

Note: ^a Number of added variables that are modified for the pair approach.
5 Conclusion and Discussion
From our theoretical proofs, we show that in the final iterations of the decomposition methods, the solution of the subproblem for the base approach is the same as that for the pair approach. This means that extending the base working set to the pair set does not help much, so it is not necessary to use the pair method. Experiments confirm our analysis: the difference between the numbers of iterations of the two approaches is negligible. Moreover, the pair approach solves a larger optimization subproblem in each iteration, which costs more time, so a program using the base approach is more efficient. We mentioned in equation 2.2 that the regression problem can be reformulated to have the same structure as the classification problem, so if we solve SVR using the base approach, it is possible to use the same program for classification with little modification. LIBSVM, for example, uses this strategy. However, if equation 2.2 is applied directly without exploiting as many regression properties as possible, our experience shows that the performance may be somewhat worse than that of software specially designed for regression.

Acknowledgments
This work was supported in part by the National Science Council of Taiwan, grant NSC 89-2213-E-002-106.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases (Tech. Rep.). Irvine: University of California, Department of Information and Computer Science. Available on-line: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines [Computer software]. Available on-line: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Collobert, R., & Bengio, S. (2001). SVMTorch: A support vector machine for large-scale regression and classification problems. Journal of Machine Learning Research, 1, 143–160. Available on-line: http://www.idiap.ch/learning/SVMTorch.html.
Flake, G. W., & Lawrence, S. (2002). Efficient SVM regression training with SMO. Machine Learning, 46, 271–290.
Friedman, J. (1988). Multivariate adaptive regression splines (Tech. Rep. No. 102). Stanford, CA: Laboratory for Computational Statistics, Department of Statistics, Stanford University.
Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2000). Improvements to SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11, 1188–1193.
Keerthi, S. S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637–649.
Laskov, P. (2002). An improved decomposition algorithm for regression support vector machines. Machine Learning, 46, 315–350.
Lin, C.-J. (2001a). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12, 1288–1298.
Lin, C.-J. (2001b). Stopping criteria of decomposition methods for support vector machines: A theoretical justification (Tech. Rep.). Taipei, Taiwan: Department of Computer Science and Information Engineering, National Taiwan University. To appear in IEEE Transactions on Neural Networks.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J.
Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Rüping, S. (2000). mySVM—Another one of those support vector machines [Computer software]. Available on-line: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.
Smola, A. J., & Schölkopf, B. (1998). A tutorial on support vector regression (NeuroCOLT Tech. Rep. TR-1998-030). Egham, Surrey: Royal Holloway College.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Received January 4, 2001; accepted October 10, 2001.
LETTER
Communicated by Thomas Bartol
An Image Analysis Algorithm for Dendritic Spines

Ingrid Y. Y. Koh
[email protected]
Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11794-3600, U.S.A., and Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

W. Brent Lindquist
[email protected]
Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11794-3600, U.S.A.

Karen Zito
[email protected]
Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

Esther A. Nimchinsky
[email protected]
Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

Karel Svoboda
[email protected]
Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

The structure of neuronal dendrites and their spines underlies the connectivity of neural networks. Dendrites, spines, and their dynamics are shaped by genetic programs as well as by sensory experience. Dendritic structures and dynamics may therefore be important predictors of the function of neural networks. Based on new imaging approaches and increases in the speed of computation, it has become possible to acquire large sets of high-resolution optical micrographs of neuron structure at length scales small enough to resolve spines. This advance in data acquisition has not been accompanied by comparable advances in data analysis techniques; the analysis

Neural Computation 14, 1283–1310 (2002) © 2002 Massachusetts Institute of Technology
1284
Ingrid Y. Y. Koh, et al.
of dendritic and spine morphology is still accomplished largely manually. In addition to being extremely time intensive, manual analysis also introduces systematic and hard-to-characterize biases. We present a geometric approach for automatically detecting and quantifying the three-dimensional structure of dendritic spines from stacks of image data acquired using laser scanning microscopy. We present results on the measurement of dendritic spine length, volume, density, and shape classification for both static and time-lapse images of dendrites of hippocampal pyramidal neurons. For spine length and density, the automated measurements in static images are compared with manual measurements. Comparisons are also made between automated and manual spine length measurements for a time-series data set. The algorithm performs well compared to a human analyzer, especially on time-series data. Automated analysis of dendritic spine morphology will enable objective analysis of large morphological data sets. The approaches presented here are generalizable to other aspects of neuronal morphology.

1 Introduction
Recent experiments have revealed that important aspects of cognitive function, such as experience-dependent plasticity (Engert & Bonhoeffer, 1999; Maletic-Savatic, Malinow, & Svoboda, 1999; Toni, Buchs, Nikonenko, Bron, & Muller, 1999; Lendvai, Stern, Chen, & Svoboda, 2000), neural integration (Yuste & Denk, 1995; Wearne, Straka, & Baker, 2000), and learning and memory (Moser, Trommald, & Anderson, 1994), are correlated with variations in dendritic branching morphology and with spine density and distribution. Similarly, age-related deficits in short-term memory (Duan, He, Wicinski, Morrison, & Hof, 2000), important forms of neural dysfunction (Horner, 1993), and mental retardation (Purpura, 1974) have been localized, in part, to dendrites and spines. These findings have motivated extensive efforts to obtain quantitative descriptions of dendritic and spine morphologies, both statically and dynamically. Because of its superior resolution in revealing ultrastructure at synaptic junctions, serial section electron microscopy (SSEM) has been used (Fiala, Feinberg, Popov, & Harris, 1998; Toni et al., 1999) to quantify dendritic spine structures in three dimensions (3D). This is, however, a nonvital form of observation and an extremely labor-intensive histological approach, requiring manual or semiautomatic registration and outlining of the structures on each serial section (Carlbom, Terzopoulos, & Harris, 1994; Spacek, 1994). Modern fluorescence microscope methods, such as confocal laser scanning microscopy (CLSM) (Pawley, 1995) and two-photon excitation laser scanning microscopy (2PLSM) (Denk, Strickler, & Webb, 1990; Denk & Svoboda, 1997; Svoboda, Denk, Kleinfeld, & Tank, 1997), offer many advantages
over SSEM, at the expense of reduced resolution. Sectioning is achieved by limiting the detection (CLSM) or excitation (2PLSM) of fluorescence to a sub-femtoliter focal volume. Optical imaging is rapid and noninvasive. The exquisite selectivity of fluorescence allows the detection of even single molecules against a background of billions of others (Eigen & Rigler, 1994). Optical microscopy thus occupies a unique niche in biology due to its ability to perform observations in intact, living tissue at relatively high resolution. The properties of fluorescence microscopy images are well understood. To image neuronal structure, neurons are labeled with fluorescent molecules that fill the cytoplasm homogeneously. Voxel values report the convolution of the density of fluorescent probes with the point-spread function (PSF) of the imaging system, which is essentially equal to the focal volume and is easily measured (Svoboda, Tank, & Denk, 1996). Studies of morphological plasticity based on CLSM (Moser et al., 1994; Hosokawa, Rusakov, Bliss, & Fine, 1995) and 2PLSM (Engert & Bonhoeffer, 1999; Maletic-Savatic et al., 1999; Lendvai et al., 2000) measurements of spine length and density have been described.
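This image-formation model (voxel values as the dye density convolved with the PSF) can be illustrated with a toy computation. The Gaussian shape and the width values below are idealizations chosen for illustration, not measured PSF parameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# A point source of dye in an otherwise empty volume.
density = np.zeros((32, 32, 32))
density[16, 16, 16] = 1.0

# Idealized PSF: an anisotropic 3D Gaussian, broader along the
# optical (first) axis than laterally.  Sigma values are arbitrary.
image = gaussian_filter(density, sigma=(3.0, 1.0, 1.0))
```

A point source thus images to the PSF itself, which is one way the focal volume can be measured in practice.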
Rusakov and Stewart (1995) applied a medial axis construction (skeletonization) to 2D dendritic images to obtain spine length measurements in 2D and estimated the corresponding 3D measurements using a stereological sampling procedure. As the medial axis is sensitive to surface features, manual screening of the medial axis was required to select among spine, dendrite, and irregular surface-induced features (i.e., artifacts). Since measurements were based solely on the medial axis, no volumetric estimates were obtainable. Watzel et al. (1995) have also suggested the detection of dendritic spines using medial axis-based identification. Their 3D algorithm was restricted to images containing a single dendrite. The dendrite backbone ("centerline") was extracted from the medial axis, and the remaining medial axis spurs branching off the backbone were used to identify candidate spines. A length tolerance was employed to distinguish true spines from artifact spurs. No further analysis beyond that for a single dendritic image was presented. Herzog et al. (1997) employed a 3D reconstruction technique using a parametric model of cylinders with hemispherical ends to fit the shape of the dendrites and the spines. In this method, short spines or spines with thin necks were hard to detect and had
1286
Ingrid Y. Y. Koh, et al.
to be manually added to the model. An approach using neural network recognition of spines (Kilborn & Potter, 1998) has also been suggested. In this article, we present an automated dendritic spine detection and analysis procedure appropriate for 3D images obtained using laser scanning microscopy. It is assumed that the image to be analyzed is of a biphase medium, with one phase being the neuronal cytoplasm (dendritic phase) and the other being the background tissue. The method uses a geometric approach, detecting spines as protrusions; it is highly automatic and contains only a few parameter settings. It can be applied to static images as well as time-series data. There is no limitation on the number or the structure of the dendrites in the image. In addition to spine length and density, volumetric measurements and spine classifications are obtained. The article is divided into the following sections. The algorithms developed to analyze 3D scanning microscopy images are described in section 2. The analysis consists of the following steps. An image is first processed by deconvolution, and the dendritic phase is extracted (section 2.1). The dendrites are identified via their backbones (section 2.2), which are extracted from a medial axis construction. Unlike the work of Rusakov and Stewart (1995) and Watzel et al. (1995), spines are not detected using the medial axis branches emerging from the backbone, as it is difficult to distinguish true spines from artifacts by this procedure. Instead, spines are detected as geometric protrusions relative to the backbone. Our geometric model and its algorithmic implementation are presented in section 2.3. For time-series data, in which the same dendritic branch is imaged over a sequence of time intervals, translational effects in time are corrected for, and individual spines are then traced (section 2.4) through the time-ordered sequence of images.
Finally, morphological characterizations of the population of detected spines are extracted (section 2.5). The imaging setup and the biological preparation used to obtain data for testing and verification of these algorithms are summarized in section 3. Results of the application of these algorithms to the analysis of hippocampal CA1 neurons and a small number of hippocampal CA3 neurons are presented in section 4. The accuracy of the automatic approach is addressed by comparison with manual spine detection results.
2 Image Analysis 2.1 Image Deconvolution and Segmentation. The intrinsic spatial resolution limits of optical microscopy arise from the diffraction of light; light from a point source is ideally imaged to a larger spot characterized by the Airy function. The measured spread resulting from a given optical setup is referred to as the PSF. As a result, the intensity recorded in any voxel (volume element) of a digitized image is a convolution of intensities from its neighborhood.
Deconvolution (Shaw, 1995; Holmes et al., 1995) is typically used to correct aspects of the image degradation due to the PSF. Deconvolution techniques (Lagendijk & Biemond, 1991) employ either theoretical (Gibson & Lanni, 1991) or experimental measures of the PSF. In addition, blind deconvolution methods (Holmes et al., 1995) can be employed that, concurrently with the deconvolution, reconstruct an estimated PSF of the image. However, the presence of noise and the band-limited nature of the PSF limit the improvements attainable by classical deconvolution techniques. Therefore, some blurring will remain even after deconvolution due to a trade-off between sharpening of the image and noise amplification. In addition, the photomultiplier tube (PMT) detectors used in most laser scanning microscopes are noisy. Even in darkness, PMTs produce spontaneous bright pixels (shot noise). We deal with shot noise by applying a median filter (Tukey, 1971) to the image. The median filter is a nonlinear, low-pass filter that replaces the gray-scale value of each voxel $v$ in the digitized image by the median gray-scale value of $v$ and its 26 neighbors. This effectively removes shot noise but not real spines, which, under the typical magnifications employed, have an effective width covering many voxels. We use the iterative reblurring deconvolution algorithm of Kawata and Ichioka (1980), which requires a theoretical or experimentally measured PSF. Briefly, iterative reblurring proceeds as follows. Let $o^{(0)}(x, y, z)$ denote the experimental 3D image and $h(x, y, z)$ an appropriate PSF. Let $*$ and $\star$ denote the convolution and correlation operators (Kawata & Ichioka, 1980). The deconvolved image $\hat{o}^{(k)}(x, y, z)$ in the $k$th iteration is

$$\hat{o}^{(k)} = \hat{o}^{(k-1)} + \left\{ o^{(0)} \star h - \hat{o}^{(k-1)} * (h \star h) \right\}.$$
A nonnegativity constraint is applied to $\hat{o}^{(k)}$ at the end of each iteration. For the images analyzed in section 4, the PSF was measured by imaging a number of subresolution microspheres (100 nm diameter) and averaging their individual PSFs to reduce noise. Figure 1 demonstrates the result of iterative reblurring applied to an example image. We typically employ $k_{\max} = 5$ iterations of deblurring. Segmentation is a generic imaging term for labeling each voxel in a gray-scale (or color) image with an integer identifier designating its "population type." For dendritic morphometry, this requires distinguishing neuron voxels from background tissue voxels. A large number of segmentation algorithms are available (Pal & Pal, 1993). As our dendritic images are processed first by median filtering and deconvolution, we choose to use simple thresholding for this final segmentation step; all voxels of intensity greater than a threshold value are identified as neuron, otherwise as background. In general, a trade-off between selection of dim spines and reduction of noise on the dendrite surface is made in selecting the threshold.
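The three preprocessing steps just described (median filtering, iterative reblurring with a nonnegativity constraint, and threshold segmentation) can be sketched compactly. This is a minimal illustration using SciPy, not the authors' implementation; the function names and the threshold value are assumptions, while the 3 × 3 × 3 median neighborhood and $k_{\max} = 5$ follow the text.

```python
# Sketch of the preprocessing pipeline: median filtering (shot-noise removal),
# Kawata-Ichioka iterative reblurring, and threshold segmentation.
# Function names are illustrative, not from the original code.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import fftconvolve

def correlate3d(a, h):
    """Correlation a STAR h, implemented as convolution with the flipped kernel."""
    return fftconvolve(a, h[::-1, ::-1, ::-1], mode="same")

def iterative_reblur(o0, psf, k_max=5):
    """Iterative reblurring: o^(k) = o^(k-1) + {o^(0) STAR h - o^(k-1) CONV (h STAR h)},
    with a nonnegativity constraint applied after each iteration."""
    o_corr = correlate3d(o0, psf)                    # o^(0) STAR h (fixed term)
    hh = fftconvolve(psf, psf[::-1, ::-1, ::-1])     # h STAR h (full-size kernel)
    o_hat = o0.copy()
    for _ in range(k_max):
        o_hat = o_hat + (o_corr - fftconvolve(o_hat, hh, mode="same"))
        o_hat = np.maximum(o_hat, 0.0)               # nonnegativity constraint
    return o_hat

def preprocess(image, psf, threshold, k_max=5):
    """Median-filter (3x3x3 neighborhood), deconvolve, then threshold-segment."""
    denoised = median_filter(image, size=3)          # removes PMT shot noise
    deblurred = iterative_reblur(denoised, psf, k_max)
    return deblurred > threshold                     # True = dendritic phase
```

A quick sanity check: with a delta-function PSF, both the reblurring term and the correlation term reduce to the identity, so the deconvolution leaves the image unchanged.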
Figure 1: (a) The raw image and the deblurred image after (b) 5 and (c) 20 iterations.
2.2 Dendritic Backbone Extraction. Geometric analysis of a 3D irregularly shaped object is difficult; such analyses typically employ models based on geometrically simple unit objects, such as the cylinders and hemispheres used in the technique of Herzog et al. (1997). Our approach uses the medial axis algorithm of Lee, Kashyap, and Chu (1994) to provide a skeleton from which the backbone of each dendrite in an image can be extracted. Major success in the application of this backbone extraction procedure has been achieved in the analysis of pore space geometry in rock (Lindquist & Venkatarangan, 1999; Lindquist, Venkatarangan, Dunsmuir, & Wong, 2000) and fiber mats (Yang, 2000). Intuitively, the medial axis captures a geometrically faithful skeleton (consisting of curve segments joining at vertices) of an object. In a digitized image, these curve segments consist of linked sequences of voxels, with the vertices being voxels at which these segments join together. An example of the medial axis of a portion of a segmented dendritic image is shown in Figure 2a. (This is a view perpendicular to the optical axis.) The medial axis obtained for the dendritic phase contains the backbone (centerline) of each dendrite as a subset. In addition to the backbone, the medial axis contains spurs and other features that correspond to spine-related or non-spine-related surface features (e.g., incipient dendritic branches); to surface artifacts resulting from digitization effects, segmentation errors, and boundary effects due to the finite imaged volume; or to spurious cell debris. Due to resolution limits, spines emerging near each other may appear to have overlapping tips in the digitized image, resulting in the appearance of small loops in the medial axis (top of the parent branch in Figure 2a). A separate skeleton for each disconnected component of the dendritic phase is also contained in the medial axis (featured in Figure 2a).
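Short spurs on a voxel skeleton of this kind are commonly trimmed by morphological pruning, that is, by repeatedly deleting endpoint voxels (voxels with exactly one 26-connected skeleton neighbor). The sketch below illustrates the idea in pure NumPy; it is a hypothetical stand-in for a trimming step, not the Lee, Kashyap, and Chu construction or the code used in this article.

```python
# Prune short spurs from a voxelized skeleton by repeatedly deleting endpoints
# (voxels with exactly one 26-connected skeleton neighbor). A spur shorter than
# `iterations` voxels is eaten away entirely; long branches (and the backbone)
# only lose their tips. Illustrative sketch only.
import numpy as np

def count_neighbors(skel):
    """Number of 26-connected skeleton neighbors for every voxel."""
    nx, ny, nz = skel.shape
    padded = np.pad(skel.astype(np.int32), 1)
    counts = np.zeros(skel.shape, dtype=np.int32)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if (dx, dy, dz) == (0, 0, 0):
                    continue
                counts += padded[1+dx:1+dx+nx, 1+dy:1+dy+ny, 1+dz:1+dz+nz]
    return counts

def prune_spurs(skel, iterations):
    """Delete endpoint voxels for a fixed number of iterations."""
    skel = skel.astype(bool).copy()
    for _ in range(iterations):
        endpoints = skel & (count_neighbors(skel) == 1)
        if not endpoints.any():
            break
        skel[endpoints] = False
    return skel
```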
From the medial axis, we extract the backbone for each dendrite (see Figure 2b) in two steps. The first step is achieved by removing the medial axis segments corresponding to all disconnected dendritic components and trimming short spurs and loops on the medial axis. Removal of long spurs is problematic, as they may correspond to either filopodia or incipient branches of the dendrite. These long spurs are dealt with in the next step. In the second step, a backbone is traced through each dendritic branch, employing a decision based on minimum deviation angle whenever a vertex on the trimmed skeleton is encountered. If necessary, the number, n, of dendrites in the image can be specified so that only the n longest backbones are retained. Any remaining medial axis segment that is not part of a traced backbone is removed. The final set of dendritic backbones extracted from Figure 2a is shown in Figure 2b. 2.3 Spine Detection. Dendritic spines are short (∼1 μm) protrusions of variable shape that have been characterized (Harris, Jensen, & Tsao, 1992) as falling into one of three types. The type names (stubby, thin, and mushroom) are intended to be descriptive of the geometrical shape of the
Figure 2: (a) The medial axis from a segmented dendritic image, with arrows indicating a loop (L) formed by overlapping spines, separate skeletons for disconnected features (D), and spurious cell debris (S). (b) The two backbones extracted from this medial axis.
spines in each class. Spines, however, exist in a continuum of shape variation over this classification range, and the boundaries established between these three classes are somewhat arbitrary. Thus, in designing an automated spine detection algorithm, we prefer not to search for elements based on prototypes from each class; rather, we prefer a spine model based on a generic protrusion that allows for all three types but rejects nonspine protrusions, such as incipient dendritic branches, as well as surface irregularities on the dendrite. Spine candidates are selected using a protrusion criterion (see equation 2.1). Spine detection is complicated by the fact that in mushroom-shaped spines, very thin necks may be too weak to be detectable. Such spines appear as a detached head-base pair. The protrusion criterion should therefore detect such bases; a separate algorithm is required to detect detached heads. Thus, once the dendritic backbones are isolated, the spine detection algorithm proceeds in three steps: detection of detached spine heads, detection of attached spines (including bases), and merging of spine components. Since a spine may comprise one or more detached pieces and possibly an attached base in the segmented image, the identification of any spine is not finalized until all three steps have been completed. An optional imposition of polygon-based trimming of the image volume allows for the elimination of regions of no interest. 2.3.1 Detached Spine Component Detection. Dendritic phase components disconnected from the backbone-containing dendrites are detected and tentatively identified as detached spine heads. For each detached component, a record is kept of its center of mass, the closest dendritic backbone voxel, and the dendritic surface voxel lying on the line joining the center of mass to the backbone voxel.
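A sketch of this bookkeeping using connected-component labeling (26-connectivity) and a Euclidean distance map, with the maximum-distance tolerance as a parameter. All names, the voxel size, and the use of the component's center of mass for the distance test are illustrative assumptions, not the article's code.

```python
# Sketch of detached-component detection: label dendritic-phase components
# (26-connectivity), treat those outside the backbone-containing dendrites as
# candidate detached spine heads, and drop any whose center of mass lies
# farther than `max_dist_um` from the dendrite. Illustrative only: the article
# measures distance to the nearest dendrite *surface* voxel; the center-of-mass
# distance used here is an approximation for brevity.
import numpy as np
from scipy import ndimage

def detached_heads(phase, dendrite, voxel_um=1.0, max_dist_um=6.0):
    """phase: binary dendritic-phase image; dendrite: binary mask of the
    backbone-containing dendrites. Returns centers of mass (voxel units)
    of accepted detached components."""
    structure = np.ones((3, 3, 3), dtype=bool)       # 26-connectivity
    detached = phase & ~dendrite
    labels, n = ndimage.label(detached, structure=structure)
    # Distance (in voxels) from every voxel to the nearest dendrite voxel.
    dist = ndimage.distance_transform_edt(~dendrite)
    heads = []
    for c in ndimage.center_of_mass(detached, labels, list(range(1, n + 1))):
        idx = tuple(int(round(x)) for x in c)
        if dist[idx] * voxel_um <= max_dist_um:      # within tolerance: keep
            heads.append(c)
    return heads
```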
Detached dendritic phase components that are farther from the nearest dendrite surface voxel than a maximum distance tolerance are interpreted as false-positive signals and ignored. For the images analyzed in section 4, the length tolerance is 6 μm. 2.3.2 Attached Spine Component Detection. Ignoring the detached spine components, every remaining dendrite phase voxel $v$ is labeled with a distance, $d_b(v)$, to its closest backbone voxel. (This is done using a 26-connected "grassfire" (Leymarie & Levine, 1992) propagating through the dendrite outward from the backbone; $d_b(v)$ equals the value of the integer time step of ignition of voxel $v$.) Thus, tips of protrusions on the dendrite surface are identified as those voxels assigned the (locally) largest distances. These tip voxel locations are then processed in descending order of $d_b$. For each tip voxel $S$, a sequence $\{C_i\}$, $i = 1, \ldots, d_b(S)$, of candidate spines is generated. Candidate $C_i$ consists of all voxels $w$ whose distance $d_S(w)$ from $S$ satisfies $d_S(w) \le i$. Figure 3a shows a 2D projected view of the candidates (gray voxels) $C_1, \ldots, C_{12}$ for a tip $S$ having $d_b = 12$. The smaller candidates clearly contain insufficient voxels to represent the spine correctly, whereas the larger
Figure 3: (a) Projected view of 12 spine candidates (shaded gray) with $d_b(S) = 12$. To reduce figure size, the segment of the dendrite to which the spine is attached is shown only with the first candidate, $C_1$. (b, c) Sketches of a spine candidate $C_i$ illustrating the definition of the geometrical quantities used in the spine detection algorithm.
candidates protrude too far below the dendrite surface. The optimal choice of a spine candidate would terminate at the surface of the dendrite. This is achieved by estimating the local thickness of the dendrite. To estimate the local dendrite thickness and choose the optimal candidate, a ring of spine-surface boundary points is determined for each candidate. For clarity of explanation, we first illustrate this algorithm in 2D, assuming the image is projected onto its focal plane. The 3D algorithm is described afterward. We use Figures 3b and 3c, which sketch a candidate $C_i$ for a spine of stubby type, to define the geometrical quantities used in the algorithm and to illustrate how they change with the orientation angle of the spine. For each candidate $C_i$, two surface boundary voxels $P_1^{C_i}$ and $P_2^{C_i}$ and a base voxel $E^{C_i}$ having the furthest penetration into the dendrite are determined. The surface points are used to determine the best measure of the dendrite thickness as follows. A reference candidate $C_R$ is selected to be that candidate of smallest volume having the minimum surface-to-backbone distance, $d_p \equiv \min_i \{ d_b(P_1^{C_i}), d_b(P_2^{C_i}) \}$. The reference candidate $C_R$ and the distance $d_p$ are illustrated in Figure 3b. In an ideal case, $2 d_p$ is the width of the dendrite, and the best candidate for the spine would be $C_j$, where $j = d_b(S) - d_p + 1$. In practice, spines and dendrite surfaces in the images are quite irregular. Therefore, to be accepted as a true spine, we require the candidate to satisfy a heuristic protrusion criterion. Let $D_{S \to P_1 P_2}^{C_i}$ denote the perpendicular distance from $S$ to the line segment $P_1^{C_i} P_2^{C_i}$ of $C_i$, and let $D_{E \to P_1 P_2}^{C_i}$ denote the perpendicular distance from the base voxel $E^{C_i}$ to the line segment $P_1^{C_i} P_2^{C_i}$, as illustrated in Figure 3c. The spine candidate $C_j$ is required to satisfy

$$D_{S \to P_1 P_2}^{C_i} \ge D_{E \to P_1 P_2}^{C_i}. \tag{2.1}$$

Criterion 2.1 clearly requires that favorable spine candidates protrude farther out from the dendrite surface than they protrude into it. The following argument additionally shows that the criterion favors spine-type protrusions. Consider the spine candidate $C_i$ sketched in Figure 3c. Approximate the base of $C_i$ by the arc of a circle with radius $R$. Let $2W$ denote the distance from $P_1^{C_i}$ to $P_2^{C_i}$. Simple trigonometry gives

$$D_{E \to P_1 P_2}^{C_i} = R \cos\theta - \sqrt{R^2 - W^2}, \tag{2.2}$$

$$D_{S \to P_1 P_2}^{C_i} = \sqrt{R^2 - W^2}, \tag{2.3}$$

where $0 \le \theta \le 90^\circ$. Substitution of equations 2.2 and 2.3 into criterion 2.1 gives

$$\frac{2W}{R} \le \sqrt{4 - \cos^2\theta}. \tag{2.4}$$

The right-hand side of equation 2.4 varies over the range $[\sqrt{3}, 2]$. Thus, if we assume that $R$ approximates the spine candidate length and $2W$ is an approximate measure of the spine candidate neck width, then we see qualitatively that the protrusion criterion "naturally" accepts protrusions having neck width less than spine length. Thus, with the possible exception of stubby spines, the criterion accepts protrusions of spine type. Basing the decision (as to whether a protrusion is indeed a spine) on the single candidate $C_j$ is too fragile, given the irregularity of true spines and the digitized nature of the images. In order to increase the recognition capability, we apply the protrusion criterion to a range of candidates $C_i$ for which $i$ is close to $j$. Specifically, we consider candidates having values of $i$ in the range $d_b(S) - d_p \le i \le d_b(S) - (d_p + d_e - 1)/2$, where $d_e \equiv d_b(E^{C_R})$. If one of the candidates $C_i$ with $i$ in this range satisfies criterion 2.1, then the candidate $C_j$ is accepted as a true spine; otherwise, the protrusion is rejected as a spine. The 3D algorithm proceeds in a similar fashion, except that instead of using the projected pair of surface boundary points, the entire ring of surface boundary points in 3D is considered. The minimum surface-point-to-backbone distance is used to find the reference candidate $C_R$ to correct for the local thickness of the dendrite. The distance from $S$ (or $E$) to the ring of voxels is calculated by measuring the perpendicular distance of $S$ (or $E$) to the plane that best fits the ring of voxels in 3D. 2.3.3 Component Elimination. Spines touching the boundary of the imaged region are ignored, as they are incomplete. This is also a useful technique for eliminating debris and other axons or dendrites in the background of the image that are near or touching the dendrites of interest. Our algorithms are written to allow for the imposition of one or more nonoverlapping polygonal areas on the plane of the image slices.
The interior of the union of these polygons is regarded as the region of interest for the spine detection algorithm; any structure exterior to the polygons is ignored. By setting the polygonal edges to cross through unwanted structures, they are also automatically ignored. As mentioned, detached components farther from the dendrite surface than a maximum distance are also eliminated. 2.3.4 Component Merging. As a spine may be identified from multiple detached head and attached base components, a final merging algorithm that accounts for the position and orientation of all possible spine pieces is performed. The merging algorithm considers every component, checking for possible merges with other components. Any merged entity is reconsidered as a new single component and rechecked for possible further merges. Merging can occur between two detached (DD) components (the merged entity is still considered a detached component) or between detached and attached (DA) components (the merged entity is then considered to be an attached component). We employ two criteria for DD or DA type merging. The first criterion is maximum separation: the two components to be merged are required to be close enough (a center-of-mass to center-of-mass separation $\le 3$ μm). The second criterion requires appropriate relative orientation of the two components, as demonstrated in 2D in Figure 4. For DA type merging, the tip $S$ of the attached component $A$ is required to lie within the triangle $D P_1 P_2$. (In 3D, the tip $S$ is required to lie within the cone determined by $D$ and the ring of spine-surface boundary points.) For DD type merging, the average angle subtended by the center of mass of each spine with the surface voxel locations of both spines is required to be less than 30 degrees.

Figure 4: Illustration in 2D of the orientation criterion used to determine whether detached (D) or attached (A) spine components should be merged. (Top) For DA merging, the tip $S$ must lie within the triangle $D P_1 P_2$, where $D$ is the center of mass of the detached component. (Bottom) Detached components $D_1$ and $D_2$ are merged if the angles $\theta_1$ and $\theta_2$ subtended by each component with the two surface voxels $S_1$ and $S_2$ satisfy $(\theta_1 + \theta_2)/2 \le 30^\circ$.
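The two merging tests (center-of-mass separation of at most 3 μm and, for DD merging, a mean subtended angle of at most 30 degrees) can be sketched as follows; the helper functions and coordinate conventions are assumptions, not the article's code.

```python
# Sketch of the DD merging criteria: (1) center-of-mass separation <= 3 um,
# and (2) the average of the angles theta1, theta2 subtended at each
# component's center by the two surface voxels S1, S2 must be <= 30 degrees.
# Names and conventions are illustrative; coordinates are in micrometers.
import numpy as np

def subtended_angle(center, s1, s2):
    """Angle at `center` subtended by surface points s1 and s2 (degrees)."""
    v1 = np.asarray(s1, float) - np.asarray(center, float)
    v2 = np.asarray(s2, float) - np.asarray(center, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def should_merge_dd(d1, d2, s1, s2, max_sep_um=3.0, max_angle_deg=30.0):
    """d1, d2: component centers of mass; s1, s2: associated surface voxels."""
    sep = np.linalg.norm(np.asarray(d1, float) - np.asarray(d2, float))
    if sep > max_sep_um:                       # criterion 1: maximum separation
        return False
    mean_angle = 0.5 * (subtended_angle(d1, s1, s2)
                        + subtended_angle(d2, s1, s2))
    return mean_angle <= max_angle_deg         # criterion 2: orientation
```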
2.4 Image Registration and Spine Tracing. A time sequence of 3D images must be registered to correct for possible translational movement of the specimen. After registration, individual spines are traced and identified through the image sequence. Each consecutive pair of images $F_i$ and $F_{i+1}$ is co-registered using the spines separately identified in each image. The offset $\vec{o} = (o_x, o_y, o_z)$ of $F_{i+1}$ with respect to $F_i$ is allowed to vary within a window $|o_x| \le w_x$, $|o_y| \le w_y$, $|o_z| \le w_z$. Only integer voxel offsets are considered. A common registration method (Pratt, 1991, pp. 651–673) maximizes the cross-correlation of the two images; thus, no decision can be made until the correlation arrays are computed for all offsets. Instead, we use an efficient sequential search method (Barnea & Silverman, 1972) that computes the $l_1$-norm (absolute value sum) image difference,

$$e(\vec{o}) = \sum_x \sum_y \sum_z \left| F_i(x, y, z) - F_{i+1}(x - o_x, y - o_y, z - o_z) \right|,$$

over all offsets $\vec{o}$ in the window for which $e(\vec{o})$ is less than a predetermined threshold value $T$. The offset $\vec{o}$ with minimum $e(\vec{o})$ provides the optimal registration. In practice, $w_x = w_y = w_z = 5$ voxels, and $T$ is the average number of total spine voxels in $F_i$ and $F_{i+1}$. Individual spines are traced through the time series. Two spines at different times are considered to be the same if their percentage overlap (measured in voxels) is larger than 25% of the volume of at least one of them. 2.5 Morphological Characterization. Currently, spine length, density, and volume are computed. Spines are also classified according to their shape. For a detached spine (without any attached component), the spine length is determined by the distance from the recorded associated dendrite surface voxel to the spine voxel farthest from the dendrite. For spines that are fully or partially attached to the dendrite (consisting of a base and one or more detached components), the spine length is determined by the distance from the center of mass of the base boundary points to the farthest spine voxel (possibly detached from the dendrite). For the images analyzed in section 4, the automated spine length measurement is calculated from a 2D projection. The reason is that the manual spine analysis measurements, against which the automatic analysis results are to be compared, are performed in 2D by projecting the 3D stack of image slices along the optical direction. Spine density is computed as the number of spines per unit length of dendritic backbone. For purposes of comparison with manually analyzed images, which are analyzed in 2D projection only, backbone length is also measured from a 2D projection onto the slice plane.
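The offset search of section 2.4 can be sketched as an exhaustive scan over the offset window. For brevity, this version computes each $l_1$ sum in full and uses periodic (np.roll) shifts, whereas the Barnea-Silverman sequential method abandons a candidate offset as soon as its partial sum exceeds the threshold. All names are illustrative.

```python
# Sketch of integer-offset registration by l1-norm minimization over a window
# of +/- `window` voxels per axis. Simplifications relative to the article:
# periodic shifts via np.roll (no boundary handling) and full, rather than
# sequentially abandoned, error sums. Illustrative only.
import numpy as np

def register(f0, f1, window=5, threshold=np.inf):
    """Return the integer offset (ox, oy, oz) minimizing
    e(o) = sum |f0 - roll(f1, o)| over the search window,
    considering only offsets whose error is below `threshold`."""
    best, best_err = (0, 0, 0), np.inf
    rng = range(-window, window + 1)
    for ox in rng:
        for oy in rng:
            for oz in rng:
                shifted = np.roll(f1, (ox, oy, oz), axis=(0, 1, 2))
                err = np.abs(f0 - shifted).sum()
                if err < min(best_err, threshold):
                    best, best_err = (ox, oy, oz), err
    return best
```

With this convention, an image that is a rolled copy of the reference is recovered by the opposite shift, which gives a quick self-test.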
Table 1: Ratio Criteria for the Classification of Stubby, Thin, and Mushroom Spines.

                        dh/dn
L/dn          [0, 1.3)    [1.3, 3)    [3, ∞)
[0, 2/3)      Stubby      Mushroom    Mushroom
[2/3, 2)      Stubby      Stubby      Stubby
[2, 3)        Stubby      Mushroom    Mushroom
[3, 5)        Thin        Mushroom    Mushroom
[5, ∞)        Thin        Thin        Thin
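The decision table above can be encoded directly; a sketch with illustrative names, using the interval edges of Table 1:

```python
# Direct encoding of Table 1: classify a spine from the ratios L/dn and dh/dn.
# Interval edges are those of the table; the function name is illustrative.
def classify_spine(length, d_head, d_neck):
    l_ratio = length / d_neck                       # L / dn
    h_ratio = d_head / d_neck                       # dh / dn
    col = 0 if h_ratio < 1.3 else (1 if h_ratio < 3 else 2)
    if l_ratio < 2 / 3:
        row = ("stubby", "mushroom", "mushroom")
    elif l_ratio < 2:
        row = ("stubby", "stubby", "stubby")
    elif l_ratio < 3:
        row = ("stubby", "mushroom", "mushroom")
    elif l_ratio < 5:
        row = ("thin", "mushroom", "mushroom")
    else:
        row = ("thin", "thin", "thin")
    return row[col]
```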
Spine volume is measured according to the intensity values of the deconvolved gray-scale image. For 2PLSM, the excitation of fluorescence is limited to a sub-femtoliter focal volume (≈ 0.5 × 0.5 × 1.5 μm³) larger than that of individual spines. The intensity value recorded for each voxel in a spine is a sum of the fluorescence from all dye molecules excited within the focal volume. The maximum intensity voxel near the center of a spine is therefore a measure of the volume of the spine. As the larger cross-sectional areas of a dendrite are typically larger than the maximum cross-sectional area of the focal region, the maximum voxel intensity recorded along the dendrite backbone is a measure of the size of the focal volume, assuming the fluorescence is saturated near the center of the dendrite (Svoboda et al., 1996; Sabatini & Svoboda, 2000). We define the spine volume as the ratio of the maximum spine intensity to the maximum dendrite intensity multiplied by an empirically determined focal volume:

Spine volume = (Maximum spine intensity / Maximum dendrite intensity) × focal volume.
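The volume estimate is then a one-line computation; a sketch with illustrative names, using the empirically determined focal volume of section 4 (≈ 0.5 × 0.5 × 1.5 μm³) as a default:

```python
# Intensity-ratio spine-volume estimate. The default focal volume is the
# empirically determined 0.5 x 0.5 x 1.5 um^3 used in section 4; names
# are illustrative.
FOCAL_VOLUME_UM3 = 0.5 * 0.5 * 1.5   # = 0.375 um^3

def spine_volume(max_spine_intensity, max_dendrite_intensity,
                 focal_volume=FOCAL_VOLUME_UM3):
    """Spine volume = (max spine intensity / max dendrite intensity) x focal volume."""
    return (max_spine_intensity / max_dendrite_intensity) * focal_volume
```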
We use the classification of spine shapes (stubby, thin, mushroom) given in Harris et al. (1992; see Figure 3). Spine shape is decided based on spine length (L), head diameter (dh), and neck diameter (dn). In general terms, for thin spines, spine length should be much greater than the neck diameter (L ≫ dn), and head diameter should not greatly exceed neck diameter. For mushroom spines, spine length should not be excessive, and the head diameter should be much greater than the neck diameter (dh ≫ dn). For stubby spines, the neck diameter is similar to the length of the spine (dn ≈ L). The specific criteria adopted in our classification use the ratios L/dn and dh/dn, as summarized in Table 1. 3 Image Acquisition
From a data analysis standpoint, CLSM and 2PLSM provide essentially equivalent challenges. However, to gain an understanding of the dynamics
of neuronal circuits, neurons have to be studied in preparations that are as intact as possible. For many questions of subcellular physiology, the living brain slice offers an attractive compromise between the obvious limitations of cultured dissociated neurons and the experimental difficulties encountered when working with intact animals. One problem with brain slice physiology has been that scattering of light makes traditional optical microscopies, including CLSM, difficult in living tissues. For these reasons, the data analyzed in this article were collected using 2PLSM, which allows high-resolution fluorescence imaging in brain slices up to several hundred microns deep with minimal photodamage (Yuste & Denk, 1995; Svoboda et al., 1996; Denk & Svoboda, 1997; Lendvai et al., 2000). Cultured hippocampal brain slices were prepared as described in Stoppini, Buchs, and Muller (1991) from seven-day-old rats. After five days in vitro, a small subset of neurons was transfected with a plasmid carrying the gene for enhanced green fluorescent protein (GFP). At least two days after transfection, slices were transferred to a perfusion chamber for imaging. Labeled neurons were identified and imaged using a custom-made 2PLSM (Mainen et al., 1999) based on an Olympus Fluoview laser scanning microscope. The light source was a Ti:sapphire laser (Tsunami, Spectra-Physics, Mountain View, CA) running at a wavelength of ≈ 990 nm (repetition frequency 80 MHz; pulse length 150 fs). The average power delivered to the back focal plane of the objective (40×, NA 0.8) varied depending on the imaging depth (range 30–150 mW). Fluorescence was detected in whole-field detection mode with a photomultiplier tube (Hamamatsu Corp., Bridgewater, NJ). 4 Results 4.1 Static Analysis. To validate the automatic spine detection algorithms, an experiment, E1, involving ≈ 200 spine measurements over 15 dendrites of hippocampal CA1 and a small number of CA3 neurons was performed.
The same imaged regions were subjected to both automatic and manual analysis. A total of 174 spines were identified by both methods, an additional 10 spines were identified only by the manual method, and a further 28 spines were identified only by the automatic method. The results of the manual and automated analysis for one of the dendrites in this experiment are illustrated in Figure 5. Twenty-one spines were detected by both methods; three additional, relatively short spines were detected by the automatic method. Figure 6 compares the individual spine lengths, average spine length, and spine density measured for this particular dendrite. Spines 8 and 22 demonstrate the difficulties encountered by both detection methods when two spines appear to overlap. For spine 8, the manual detection identified only the shorter of the two overlapping spines, whereas the automatic method identified only the longer. For spine 22, again two spines appear to overlap and are considered
Figure 5: Comparison of (a) manual and (b) automatic spine detection on a segment of a hippocampal CA1 dendrite (projected view). The automatic procedure detects three more spines (p22, p23, and p24) than the manual procedure. Spines are shaded differently according to their shape classification (stubby, mushroom, or thin).
as a single spine by the automatic method. On the other hand, the manual method failed to identify either of them. Table 2 compares the mean spine length measured by each method for the population of 174 spines detected in common. The mean lengths for those spines detected by only one of the methods are also presented. For the commonly detected spine population, the manual and automatic spine length measurements agree to within one standard deviation, although the standard deviations are large. (The large standard deviation is partly due to averaging over spines of different shape classification.) A paired samples Student's t-test, which determines whether the difference in measurements by the two methods is significant, provides a stronger test of the agreement between the two methods of length measurement. Table 3 summarizes the results of the paired t-test; there is no significant difference between the two methods of length measurement.
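A paired comparison of this kind is available in standard tools; the sketch below runs scipy.stats.ttest_rel on synthetic stand-in data (the per-spine measurements themselves are not reproduced here), so the numbers it produces are not those of Table 3.

```python
# Paired-samples t-test of matched manual vs. automatic length measurements,
# of the kind summarized in Table 3. The data below are synthetic stand-ins
# with assumed parameters, not the article's measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
manual = rng.normal(1.05, 0.6, size=174)               # synthetic lengths (um)
automatic = manual + rng.normal(0.03, 0.1, size=174)   # small per-spine shift

t_stat, p_value = stats.ttest_rel(manual, automatic)
dof = manual.size - 1                                  # 173 degrees of freedom
```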
Figure 6: Comparison of manual and automatic measurements of individual spine length (left), average spine length (center), and spine density (right) for the dendrite shown in Figure 5.

Table 2: Measured Mean Spine Lengths (± Standard Deviation) for the Spines in Experiment E1.

Population Size    Method       Mean Spine Length (μm)
174                Manual       1.05 ± 0.62
174                Automatic    1.08 ± 0.63
10                 Manual       0.75 ± 0.57
28                 Automatic    0.38 ± 0.28
A one-way ANOVA was used to test for any dependence of the observed differences in measured spine length on dendrite origin. The test produces an F statistic value of 0.95 (d_num = 14 and d_den = 159) with a p-value of 0.51. Thus, the differences in spine length measurements between the two methods are uniform across the different dendrites.

Table 3: Paired Samples t-Test Results for Measured Spine Lengths and Densities, Experiment E1.

                      Spine Length       Spine Density
Degrees of freedom    173                14
t statistic           −1.24              −0.87
p-value               0.22, two-sided    0.40, two-sided
Table 4: Independent Samples t-Test Results, Experiment E1.

Method (population)        Manual (n = 174)                             Automatic (n = 174)
Automatic only (n = 28)    df = 200, t = 5.63, p = 6 × 10⁻⁸ (2-sided)   df = 200, t = 5.82, p = 2 × 10⁻⁸ (2-sided)
Manual only (n = 10)       df = 182, t = 1.51, p = 0.13 (2-sided)       df = 182, t = 1.65, p = 0.10 (2-sided)
The mean spine length for the 28 spines detected only by the automatic method indicates a population of shorter spines. The results in Table 4 of an independent samples t-test for these 28 spines show that the difference in spine lengths is significant compared to that obtained by either measurement method for the population of 174 commonly detected spines. The results indicate that the automatic method detects short spines more consistently than the manual method. For the 10 spines detected only by the manual method, the mean length measurement lies midway between those obtained for the common and automatic-only populations. An independent samples t-test (see Table 4) shows no significant differences with the measurements obtained by either method for the 174 commonly detected spines. The results (df = 36, t = 2.65, p = 0.01, two-sided) of an independent samples t-test between these 10 spines and the 28 detected only by the automatic method indicate a significant difference between the spine lengths of these two populations. Visual observation of these 10 spines reveals that 7 of them touched the boundary of the image region and were consequently rejected by the automatic method. The remaining 3 were not resolved by the automatic method, as each touched a neighboring spine (which was detected). This is thus a reflection of the combined effectiveness of the median filter, deconvolution, and simple thresholding algorithms in segmenting the images. Based on visual investigation of the images, we estimate that no more than 2% of the spines were not resolved by the automatic method due to segmentation-related effects. Table 5 compares the mean spine densities separately measured by each method for the 15 dendrites. (For the manual method, this is a total population of 184 spines; for the automatic method, 202 spines.) The density measurements by the two methods agree to within one standard deviation.
For the paired sample of 15 dendrites, a Kolmogorov-Smirnov test shows that the dendrite-by-dendrite difference in the automatic and manual measured densities is very close to normal, so that a paired-dendrite samples t-test can be applied. Column 2 of Table 3 summarizes the result; the automatic spine density measurement is not significantly different from the manual one.
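The independent-samples t statistics quoted above can be reproduced from summary statistics alone. The following is a minimal sketch (not the authors' code); the group means and standard deviations in the example call are hypothetical, chosen only so that the group sizes match the 10-versus-28 comparison, whose df = 36 follows from n1 + n2 − 2.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom for two independent
    samples, computed from summary statistics with a pooled variance."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    t = (mean1 - mean2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, df

# Hypothetical group summaries; n1 = 10, n2 = 28 gives df = 36, as in the
# comparison between the manual-only and automatic-only spine populations.
t, df = pooled_t(2.0, 1.0, 10, 1.0, 1.0, 28)
```

The same pooled formula with n1 = 174 and n2 = 10 yields the df = 182 entries reported for the commonly detected spines.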
1302
Ingrid Y. Y. Koh, et al.
Table 5: Measured Mean Spine Density (± Standard Deviation) for the Dendrites in Experiment E1.

Method      Population Size    Mean Spine Density (μm⁻¹)
Manual      15                 0.45 ± 0.09
Automatic   15                 0.47 ± 0.15
For spine volume measurement and shape classification, we report only automated results because no manual determination is available. A second experiment, E2, was performed under the same experimental conditions to increase the sample size (E1 + E2) to ≈ 700 spines. The spine volumes were calculated from the ratio of the maximum intensity values of the spine to the dendrite, as described in section 2.5, using an empirically determined focal volume of 0.5 × 0.5 × 1.5 μm³. Figure 7 shows the volume-length correlation plot for the spines measured in experiments E1 and E2 according to their determined classification. The mushroom-shaped spines occupy the widest spectrum of lengths and volumes. The ratio of stubby:mushroom:thin spines is 0.54:0.36:0.10. Table 6 summarizes the average spine volume and length measurements in each shape category and presents a comparison with the SSEM results (Harris et al., 1992, Table 4) on rat hippocampus CA1 cells for postnatal day (PND) 15 animals. All automatic measurements are within 1.5 standard deviations of the SSEM results, though our volume results are generally smaller. Note, however, that the automatic results come from cultured neurons and younger-aged animals. In addition, no corrections for any fixation-induced changes were performed in the SSEM study.
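The intensity-ratio volume estimate can be sketched as follows. The exact formula is given in section 2.5 and is not reproduced here; this sketch assumes the spine volume scales as the spine-to-dendrite maximum-intensity ratio times the empirically determined focal volume, with the dendrite taken to fill the focal volume.

```python
# Empirically determined focal volume from the text, in micron^3.
FOCAL_VOLUME = 0.5 * 0.5 * 1.5

def spine_volume(spine_max_intensity, dendrite_max_intensity,
                 focal_volume=FOCAL_VOLUME):
    """Estimated spine volume (micron^3) from the ratio of maximum
    fluorescence intensities (assumption: dendrite fills the focal volume)."""
    return (spine_max_intensity / dendrite_max_intensity) * focal_volume
```

For example, a spine whose peak intensity is one quarter of the dendrite's would be assigned one quarter of the 0.375 μm³ focal volume.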
[Figure 7 panels: spine volume (micron^3, 0–0.3) versus length (microns, 0–3) for stubby, mushroom, and thin spines.]
Figure 7: Spine volume–length scatter plots according to determined spine type for all spines in experiments E1 and E2 .
An Image Analysis Algorithm
1303
Table 6: Comparison of Automated Spine Length and Volume Measurements of Hippocampus Cells in PND 7 Cultured Neurons with the PND 15 SSEM Results of Harris et al. (1992, Table 4).

Measurement     Method      Stubby         Mushroom       Thin
Volume (μm³)    Automatic   0.07 ± 0.04    0.06 ± 0.05    0.06 ± 0.04
                SSEM        0.11 ± 0.07    0.18 ± 0.09    0.05 ± 0.03
Length (μm)     Automatic   0.65 ± 0.37    1.35 ± 0.55    1.38 ± 0.54
                SSEM        0.65 ± 0.38    0.95 ± 0.30    1.40 ± 0.39
4.2 Time-Series Analysis. Time-series data provide the ability to capture dynamic changes in dendritic spine morphology. A series of 50 3D images was taken at 30-second intervals spanning a time period of 25 minutes. Figure 8a shows the number of spines detected in the images as a function of time using the automatic method. On average, 27.5 ± 2.3 spines were detected in each image. We are interested in the question of the frequency of observation of any particular spine over time. In total, 52 spines
[Figure 8 panels: (a) number of spines detected (22–30) versus time (0–25 minutes); (b) detection times (0–25 minutes) versus spine index (1–52).]
Figure 8: (a) Number of spines detected in the image at each time step. (b) Detection history of 52 spines followed for 25 minutes; for example, spine 1 was detected in each image taken over the 25 minute period; spine 14 was seen sporadically over the entire period.
[Figure 9 panels: (a) spine length (microns, 0–4) and spine volume (micron^3, 0–0.3) versus time (0–25 minutes); (b) number of spines (0–15) versus motility (0.0–0.8).]
Figure 9: (a) Measured distributions of spine length and volume as a function of time for the population of 52 spines followed in a time series of images. For each time point, the solid circle indicates the median value; the horizontal hashmarks indicate (from bottom to top) the minimum, first quartile, third quartile, and maximum values. (b) Measured distribution of the spine motility index fitted to an exponential decay function n(m) = n(0) e^(−3.69m), where n is the number of spines and m is the motility.
were detected and traced through the time series. Figure 8b shows how the observations of the 52 spines were distributed in time. The spines were indexed (1 → 52) according to the first time at which they appeared. Thus, 27 spines were observed over the full 25 minutes of image taking; among those, 16 were present at all time points. Figure 9a summarizes the distribution of spine length and volume as a function of time. Both length and volume distributions are skewed, with
smaller lengths and volumes dominating, consistent with the stubby:mushroom:thin spine ratios noted above. The dynamics of the spines are measured using an index for spine motility, which is defined as the summed difference in length of a spine in time divided by the total number of time steps. Figure 9b plots the distribution of motilities for the 52 spines traced in this series. For this limited data set, the number of spines (n) decreases with motility (m) approximately as n(m) = n(0) e^(−3.69m). A comparison between automated and manual length measurements was made for a limited subset of these time-series data; manual length measurements were made on a subset of 10 of the 52 spines. Figure 10 presents comparisons between the automated and manual spine length measurements as a function of time for 5 of the spines, chosen to represent different average lengths. Consistently longer lengths were measured by the manual method for the longer spines (1 and 2). For the medium-length spines (3 and 4), the manual and automated results are very similar. For the short spine (5), some deviations are observed; occasionally, the spine was not detected by the automatic method. A paired t-test was performed on this subset of 10 spines to determine whether the average spine length determined by the automated measurement is significantly different from that determined by the manual measurement. A Kolmogorov-Smirnov test shows that this sample of 10 difference measurements (−0.32 ± 0.17 μm) is very close to normal, so that a paired samples t-test can be applied. The observed t-statistic value is −2.79, with p = 0.02, two-sided, revealing some significance in the averaged difference. While this is contrary to the results in experiment E1, we note that the manual measurements here were made by a different user than in E1. Pearson correlation was therefore used to test whether the automatic and manual measurements are correlated in time.
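The motility index defined above can be computed directly from a spine's length trace. A minimal sketch, which assumes that the "summed difference in length" means the summed absolute change between consecutive time points:

```python
def motility_index(lengths):
    """Motility index: summed absolute change in spine length between
    consecutive time points, divided by the number of time steps."""
    steps = len(lengths) - 1
    total = sum(abs(lengths[k + 1] - lengths[k]) for k in range(steps))
    return total / steps
```

For a spine sampled at lengths [1.0, 1.2, 1.0, 1.4] microns over three time steps, the index is 0.8/3 ≈ 0.27, within the 0.0–0.8 range of the measured distribution in Figure 9b.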
The correlation values (r) obtained for the 10 spines over n = 50 time points range from 0.92 to 0.29, with two-sided p-values ranging from 0.00 to 0.04. Thus, for these 10 spines, significant correlations between the automatic measurements and the manual measurements in time are observed. We therefore conclude the existence of a systematic bias between the manual and automated measurements for this set of data, which we attribute to a change in the user making the manual measurements. The significant Pearson correlations, however, indicate that the manual measurements are duplicating the trends found by the automatic measurements.

5 Discussion
The automatic procedure presented here offers an objective and consistent analysis that requires a minimal amount of supervision and makes accessible 3D morphological characterizations of spine length, volume, shape classification, and spine density. Comparison of results on spine length and density between the manual and automatic approaches on a large number
[Figure 10 panels: spine length (microns) versus time (0–25 minutes) for spines 1–5, manual versus automated traces; r = 0.92 (spine 1), 0.48 (spine 2), 0.44 (spine 3), 0.51 (spine 4), 0.75 (spine 5).]
Figure 10: Lengths of five spines plotted as a function of time showing comparison between manual (closed circles) and automated (open circles) measurements. Pearson correlations (r) for each pair of measurements in time are indicated.
of samples has validated the automatic procedure for both static and time-lapse images. The approach is highly automatic. At present, the parameters that require routine adjustment are the region of interest and the segmentation threshold. Other parameters, used in the deconvolution, backbone extraction, spine component elimination, and tracing algorithms, are empirically determined; they remained the same throughout all of the experiments described. Segmentation is crucial to the analysis due to the relatively low intensity associated with small spines and the high intensity of the dendrites. Choosing a critical threshold is important; we find simple thresholding adequate for most images that have been preprocessed by median filtering and deconvolution.
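The median-filter-plus-threshold step can be illustrated in one dimension; this is only a sketch of the idea, since the pipeline itself operates on 3D image stacks and also includes a deconvolution stage, which is omitted here. The filter suppresses isolated bright noise samples before a global threshold is applied:

```python
def median_filter_1d(signal, width=3):
    """Sliding median with an odd window width and edge replication."""
    h = width // 2
    padded = [signal[0]] * h + list(signal) + [signal[-1]] * h
    return [sorted(padded[i:i + width])[h] for i in range(len(signal))]

def segment(signal, threshold, width=3):
    """Median-filter the signal, then apply a simple global threshold."""
    return [1 if v >= threshold else 0
            for v in median_filter_1d(signal, width)]

# An isolated bright sample (9) is removed by the median filter, while the
# contiguous bright region (8, 8, 8) survives thresholding.
mask = segment([0, 9, 0, 0, 8, 8, 8, 0], threshold=5)
# mask == [0, 0, 0, 0, 1, 1, 1, 0]
```

This illustrates why a single global threshold becomes adequate after filtering: the noise samples that would otherwise cross the threshold have already been removed.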
The automated analysis greatly enhances speed, consistency, and objectivity. Our timing results show that automated analysis of time-lapse data consisting of 50 images tracking a total of 30 to 50 spines takes about 4 hours of CPU time on a 500 MHz Pentium III processor, whereas an experienced user performing manual analysis will take 12 to 16 hours. For static images, the time savings compared to manual analysis may not be as significant if experimental conditions vary so much that new parameters have to be determined for each image to be analyzed. However, automated analysis still provides more detailed, complete, and objective quantification than manual analysis. The automatic procedure assumes that the spines are simply connected to the dendrites. Small looping structures in the medial axis indicate pairs of spines that are too close to be resolved by imaging or segmentation. If it satisfies the protrusion criterion, each such structure will be detected as a single entity. Resolution of such structures as paired spines is currently not implemented. We estimate these occurrences to affect less than 2% of the spine population. Our program enables calculation of spine volume, previously not possible with manual analysis. Volumetric measurements offer insight into the electrical capacity of the spines and the structural and electrophysiological properties of neuronal dendrites. Volume is therefore an important parameter for characterizing dendritic spines. Spine volume has not been reported previously in any of the automatic methods cited in section 1. Our volume measurements agree reasonably well with an SSEM analysis of similarly aged animals, even though the preparation techniques for the specimens are entirely different. The various morphology-based measurement capabilities presented here enable the investigation of the functional significance of dendritic spines and their plasticity under a wide spectrum of experimental and pathological conditions.
Automatic morphometry will significantly improve the scale and accuracy of such studies.

Acknowledgments
We thank Bernardo L. Sabatini for valuable comments and suggestions for the interface of the software. This work is supported by the Pew and Whitaker Foundations and the NIH. K. Z. is a Merck Fellow of the Helen Hay Whitney Foundation.

References

Barnea, D. I., & Silverman, H. F. (1972). A class of algorithms for fast image registration. IEEE Trans. Computers, C-21, 179–186.
Carlbom, I., Terzopoulos, D., & Harris, K. M. (1994). Computer-assisted registration, segmentation, and 3-D reconstruction from images of neuronal tissue sections. IEEE Trans. Med. Imaging, 13, 351–362.
Denk, W., Strickler, J. H., & Webb, W. W. (1990). Two-photon laser scanning microscopy. Science, 248, 73–76.
Denk, W., & Svoboda, K. (1997). Photon upsmanship: Why multiphoton imaging is more than a gimmick. Neuron, 18, 351–357.
Duan, H., He, Y., Wicinski, B., Morrison, J. H., & Hof, P. R. (2000). Age-related dendrite and spine changes in corticocortically projecting neurons in macaque monkeys. Soc. Neurosci. Abstr., 26, 1237.
Eigen, M., & Rigler, R. (1994). Sorting single molecules: Applications to diagnostics and evolutionary biotechnology. Proc. Natl. Acad. Sci. USA, 91, 5740–5747.
Engert, F., & Bonhoeffer, T. (1999). Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature, 399, 66–70.
Fiala, J. C., Feinberg, M., Popov, V., & Harris, K. M. (1998). Synaptogenesis via dendritic filopodia in developing hippocampal area CA1. J. Neurosci., 18, 8900–8911.
Gibson, S. F., & Lanni, F. (1991). Experimental test of an analytical model of aberration in an oil-immersion objective lens used in three-dimensional light microscopy. J. Opt. Soc. Am. A, 8, 1601–1613.
Harris, K. M., Jensen, F. E., & Tsao, B. (1992). Three-dimensional structure of dendritic spines and synapses in rat hippocampus (CA1) at postnatal day 15 and adult ages: Implications for the maturation of synaptic physiology and long-term potentiation. J. Neurosci., 12, 2685–2705.
Herzog, A., Krell, G., Michaelis, B., Wang, J., Zuschratter, W., & Braun, K. (1997). Restoration of three-dimensional quasi-binary images from confocal microscopy and its application to dendritic trees. In C. J. Cogswell, J. Conchello, & T. Wilson (Eds.), Proc. SPIE, Three-Dimensional Microscopy: Image Acquisition and Processing IV (pp. 146–157).
Holmes, T. J., Bhattacharyya, S., Cooper, J. A., Hanzel, D., Krishnamurthi, V., Lin, W., Roysam, B., Szarowski, D. H., & Turner, J. N. (1995). Light microscopic images reconstructed by maximum likelihood deconvolution. In J. B.
Pawley (Ed.), Handbook of biological confocal microscopy (2nd ed., pp. 389–402). New York: Plenum Press.
Horner, C. H. (1993). Plasticity of the dendritic spine. Prog. Neurobiol., 41, 281–321.
Hosokawa, T., Rusakov, D. A., Bliss, T. V. P., & Fine, A. (1995). Repeated confocal imaging of individual dendritic spines in the living hippocampal slice: Evidence for changes in length and orientation associated with chemically induced LTP. J. Neurosci., 15, 5560–5573.
Kawata, S., & Ichioka, Y. (1980). Iterative image restoration for linearly degraded images. I. Basis and II. Reblurring procedure. J. Opt. Soc. Am., 70, 762–772.
Kilborn, K., & Potter, S. M. (1998). Delineating and tracking hippocampal dendritic spine plasticity using neural network analysis of two-photon microscopy. Soc. Neurosci. Abstr., 24, 422–425.
Lagendijk, R. L., & Biemond, J. (1991). Iterative identification and restoration of images. Norwell, MA: Kluwer Academic.
Lee, T. C., Kashyap, R. L., & Chu, C. N. (1994). Building skeleton models via 3-D medial surface/axis thinning algorithms. CVGIP: Graph. Models Image Process., 56, 462–478.
Lendvai, B., Stern, E. A., Chen, B., & Svoboda, K. (2000). Experience-dependent plasticity of dendritic spines in the developing rat barrel cortex in vivo. Nature, 404, 876–881.
Leymarie, R., & Levine, M. D. (1992). Simulating the grassfire transform using an active contour model. IEEE Trans. Pattern Anal. Mach. Intell., 14, 56–75.
Lindquist, W. B., & Venkatarangan, A. (1999). Investigating 3-D geometry of porous media from high resolution images. Phys. Chem. Earth (A), 25, 593–599.
Lindquist, W. B., Venkatarangan, A., Dunsmuir, J., & Wong, T. (2000). Pore and throat size distributions measured from synchrotron X-ray tomographic images of Fontainebleau sandstones. J. Geophys. Research, 105B, 21508–21528.
Mainen, Z. F., Maletic-Savatic, M., Shi, S. H., Hayashi, Y., Malinow, R., & Svoboda, K. (1999). Two-photon imaging in living brain slices. Methods, 18, 231–239.
Maletic-Savatic, M., Malinow, R., & Svoboda, K. (1999). Rapid dendritic morphogenesis in CA1 hippocampal dendrites induced by synaptic activity. Science, 283, 1923–1927.
Moser, M. B., Trommald, M., & Anderson, P. (1994). An increase in dendritic spine density on hippocampal CA1 pyramidal cells following spatial learning in adult rats suggests the formation of new synapses. Proc. Natl. Acad. Sci. USA, 91, 12673–12675.
Pal, N. R., & Pal, S. K. (1993). A review on image segmentation techniques. Pattern Recognition, 26, 1277–1294.
Pawley, J. B. (Ed.). (1995). Handbook of biological confocal microscopy (2nd ed.). New York: Plenum Press.
Pratt, W. K. (1991). Digital image processing (2nd ed.). New York: Wiley-Interscience.
Purpura, D. P. (1974). Dendritic spine dysgenesis and mental retardation. Science, 186, 1126–1128.
Rusakov, D. A., & Stewart, M. G. (1995). Quantification of dendritic spine populations using image analysis and a tilting disector. J. Neurosci. Methods, 60, 11–21.
Sabatini, B. L., & Svoboda, K. (2000). Analysis of calcium channels in single spines using optical fluctuation analysis.
Nature, 408, 589–593.
Shaw, P. J. (1995). Comparison of wide-field/deconvolution and confocal microscopy for 3-D imaging. In J. B. Pawley (Ed.), Handbook of biological confocal microscopy (2nd ed., pp. 373–387). New York: Plenum Press.
Spacek, J. (1994). Design CAD-3D: A useful tool for 3-dimensional reconstructions in biology. J. Neurosci. Methods, 53, 123–124.
Stoppini, L., Buchs, P. A., & Muller, D. (1991). A simple method of organotypic cultures of nervous tissue. J. Neurosci. Methods, 37, 173–182.
Svoboda, K., Denk, W., Kleinfeld, D., & Tank, D. W. (1997). In vivo dendritic calcium dynamics in neocortical pyramidal neurons. Nature, 385, 161–165.
Svoboda, K., Tank, D. W., & Denk, W. (1996). Direct measurement of coupling between dendritic spines and shafts. Science, 272, 716–719.
Toni, N., Buchs, P. A., Nikonenko, I., Bron, C. R., & Muller, D. (1999). LTP promotes formation of multiple spine synapses between a single axon terminal and a dendrite. Nature, 402, 421–425.
Tukey, J. W. (1971). Exploratory data analysis. Reading, MA: Addison-Wesley.
Watzel, R., Braun, K., Hess, A., Scheich, H., & Zuschratter, W. (1995). Detection of dendritic spines in 3-dimensional images. DAGM-Symposium Bielefeld, 160–167.
Wearne, S. L., Straka, H., & Baker, R. (2000). Signal processing in goldfish precerebellar neurons exhibits eye velocity and storage. Soc. Neurosci. Abstr., 26, 7.
Yang, H. (2000). A geometric analysis on 3D fiber networks from high resolution images. In Proc. of the 2000 Int. Nonwovens Tech. Conf. Cary, NC: International Nonwovens and Disposables Association.
Yuste, R., & Denk, W. (1995). Dendritic spines as basic functional units of neuronal integration. Nature, 375, 682–684.

Received January 17, 2001; accepted October 1, 2001.
LETTER
Communicated by Geoffrey Goodhill
Multiplicative Synaptic Normalization and a Nonlinear Hebb Rule Underlie a Neurotrophic Model of Competitive Synaptic Plasticity

T. Elliott [email protected]
N. R. Shadbolt [email protected]
Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton, SO17 1BJ, U.K.

Synaptic normalization is used to enforce competitive dynamics in many models of developmental synaptic plasticity. In linear and semilinear Hebbian models, multiplicative synaptic normalization fails to segregate afferents whose activity patterns are positively correlated. To achieve this, the biologically problematic device of subtractive synaptic normalization must be used instead. Our own model of competition for neurotrophic support, which can segregate positively correlated afferents, was developed in part in an attempt to overcome these problems by removing the need for synaptic normalization altogether. However, we now show that the dynamics of our model decompose into two decoupled subspaces, with competitive dynamics being implemented in one of them through a nonlinear Hebb rule and multiplicative synaptic normalization. This normalization is "emergent" rather than imposed. We argue that these observations permit biologically plausible forms of synaptic normalization to be viewed as abstract and general descriptions of the underlying biology in certain scaleless models of synaptic plasticity.
1 Introduction
Activity-dependent competition between afferent neurons for control of target neurons is a ubiquitous feature of mammalian neuronal development (Purves, 1994). These competitive interactions are thought to lead, for example, to the development of ocular dominance columns (ODCs)—interdigitated domains of control by the left and right eyes—in the primary visual cortex of higher mammals such as Old World monkeys and cats (Hubel & Wiesel, 1962; LeVay, Stryker, & Shatz, 1978; LeVay, Wiesel, & Hubel, 1980). Understanding the mechanisms underlying competition in the nervous system is of central importance to developmental neuroscience, both experimentally and theoretically.

Neural Computation 14, 1311–1322 (2002) © 2002 Massachusetts Institute of Technology
1312
T. Elliott and N. R. Shadbolt
Theoretically, several approaches to competition have been developed (for reviews, see Swindale, 1996; van Ooyen, 2001), including the Bienenstock-Cooper-Munro model (Bienenstock, Cooper, & Munro, 1982), neurotrophic models (e.g., Harris, Ermentrout, & Small, 1997; Elliott & Shadbolt, 1998a, 1998b), covariance models (Sejnowski, 1977; Linsker, 1986a, 1986b, 1986c), and various other approaches (e.g., Swindale, 1980; Fraser & Perkel, 1989; Montague, Gally, & Edelman, 1991; Tanaka, 1991). But perhaps the most popular models are based on linear or semilinear Hebbian rules coupled with various forms of synaptic normalization as a means of enforcing competitive dynamics (von der Malsburg, 1973; Miller, Keller, & Stryker, 1989; Goodhill, 1993). Although multiplicative synaptic normalization (von der Malsburg, 1973) is not biologically implausible, it does not lead to afferent segregation in the presence of positive correlations in the activity patterns between afferent cells. To segregate afferents in the presence of positive correlations, subtractive synaptic normalization must be used instead (Goodhill & Barrow, 1994; Miller & MacKay, 1994). Whether models actually need to be able to segregate positively correlated afferents in order to be biologically relevant is currently a moot question. ODCs in Old World monkeys develop prior to birth and are adultlike at birth (Horton & Hocking, 1996). Recent data have also questioned whether, as previously assumed, ODCs develop in the ferret after eye opening (Crowley & Katz, 1999, 2000). In the cat, however, despite data revealing non-Hebbian developmental processes (Crair, Gillespie, & Stryker, 1998), it is still tenable to assume that ODCs develop after eye opening, and therefore in the presence of positively correlated inter-ocular images. Thus, models of at least cat ODC development must be able to segregate positively correlated afferents in a plausible manner. 
In previous work, we have criticized the use of synaptic normalization on two grounds. First, we argued (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott, Maddison, & Shadbolt, 2001) that synaptic normalization simply describes rather than seeks to explain the nature of competition in the nervous system. Second, we have argued (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001) that subtractive synaptic normalization is biologically implausible, for reasons explained later. Inspired by experimental results implicating neurotrophic factors in activity-dependent synaptic competition (reviewed in McAllister, Katz, & Lo, 1999), and in an attempt to overcome the difficulties of synaptic normalization, we have built a mathematical model based on activity-dependent competition for neurotrophic factors involving anatomical plasticity, and later we extended the model to include simultaneous physiological plasticity (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001). Critically, our neurotrophic model segregates positively correlated afferents, as required for application to the development of ODCs (Elliott & Shadbolt, 1998b). In this article, we present the results of further analysis of our neurotrophic model. We have found that the model's dynamics decouple into two essentially independent subspaces, with the competitive dynamics residing exclusively in one subspace. In this latter subspace, we show that a nonlinear Hebb (or Hebb-like) rule governs synaptic growth. To our surprise, and contradicting widely held beliefs about the capacity of multiplicative synaptic normalization to segregate positively correlated afferents, we find that competition is implemented through multiplicative synaptic normalization. This normalization, however, is not imposed but, in a sense that we shall explain, is "emergent." We then argue that the key feature of our model that allows this Hebb-like, synaptic normalization description is that the dynamics in the competitive subspace are independent of an overall synaptic scale (i.e., the absolute number of synapses). We suggest that many such models may satisfy this property, thus perhaps sanctioning the use of synaptic normalization as an acceptable, abstract characterization of the underlying competitive process.

2 Reformulations of the Model
In this section, we first write down our basic model of anatomical, competitive developmental synaptic plasticity based on competition for neurotrophic support and then state a number of key results that, in the nonlinear Hebb formulation, will be seen to be rather more transparent. We then introduce a change of variables, from which an energy function reformulation of the model will reveal two basically decoupled sets of dynamics: one competitive set leading to afferent segregation and one noncompetitive set determining an overall synaptic scale. Finally, we derive the nonlinear Hebb rule formulation, using the decoupled dynamics of the energy function formulation to extract an "emergent" normalization process.

2.1 The Basic Model. Let afferent cells be labeled by letters such as i and j and target cells be labeled by letters such as x and y. Let the number of synapses between afferent cell i and target cell x be s_xi, and let the activity of afferent cell i be a_i ∈ [0, 1]. Then the basic equation governing the time evolution of s_xi is given by

\frac{ds_{xi}}{dt} = \epsilon\, s_{xi} \left[ \frac{a + a_i}{\sum_j s_{xj}(a + a_j)} \sum_y \Delta_{xy} \left( T_0 + T_1 \frac{\sum_j s_{yj} a_j}{\sum_j s_{yj}} \right) - 1 \right]    (2.1)
(see Elliott & Shadbolt, 1998a, for a detailed derivation and justification). The quantities T_0 and T_1 represent, respectively, an activity-independent and a maximum activity-dependent release of NTFs by target cells; a represents a resting uptake term of NTFs by afferents; and ε is an overall learning rate. The function Δ_xy embodies lateral interactions between target cells x and y. In previous work, we have considered Δ_xy to arise only through the diffusion of NTFs between target cells, so that Δ_xy ≥ 0 ∀x, y (Elliott & Shadbolt, 1998a, 1998b, 1999). However, we can also consider Δ_xy to arise from both excitatory (Δ_xy > 0) and inhibitory (Δ_xy < 0) lateral synaptic interactions between target cells, either enhancing or reducing the release
of NTFs by target cells. In this case, we can ignore NTF receptor dynamics, which in previous work we have also considered (Elliott & Shadbolt, 1998a). For convenience, and without much loss of generality, we will assume that Σ_y Δ_xy = 1 ∀x. The quantity c = T_0/(aT_1) is a critical parameter in our model. Previous work has shown that for Δ_xy = δ_xy (the Kronecker delta), when c < 1 afferent segregation occurs (that is, all but one of the s_xi go to zero at each target cell x), while for c > 1, afferent segregation breaks down (Elliott & Shadbolt, 1998a). For c < 1, segregation occurs for all but perfectly correlated afferent activity patterns (Elliott & Shadbolt, 1998a). In the presence of a general Δ_xy, the critical value of c is reduced below unity and also becomes a function of afferent correlations, so that too-extensive (positive) lateral interactions or too-strong (but not perfect) afferent activity correlations can lead to a breakdown of afferent segregation (unpublished observations). The parameter a also plays a direct role in segregation. As a → ∞, the rate of segregation tends to zero. In the limit, the ratio of the number of synapses supported by any pair of afferents on a given target cell remains fixed (Elliott & Shadbolt, 1998a).

2.2 Energy Function Formulation. Introducing the new variables s_x^+ = Σ_i s_xi and v_xi = s_xi/s_x^+, so that Σ_i v_xi ≡ 1, equation 2.1 can be rewritten as
s_x^+ \frac{dv_{xi}}{dt} = \epsilon T_1 v_{xi} \left[ \frac{\sum_j (a_i - a_j) v_{xj}}{\sum_j (a + a_j) v_{xj}} \right] \sum_{yj} \Delta_{xy} (ac + a_j) v_{yj}    (2.2)

and

\frac{ds_x^+}{dt} + \epsilon s_x^+ = \epsilon T_1 \sum_{yj} \Delta_{xy} (ac + a_j) v_{yj}.    (2.3)

We now average over the ensemble of afferent activity patterns. To achieve this, we assume, for tractability, that for n distinct afferents, i = 1, ..., n, there are just n distinct activity patterns, with pattern number i being defined by a_i = 1 and a_j = p ∀ j ≠ i, with p ∈ [0, 1]. Defining m as the average activity of an afferent, so that nm = 1 + (n − 1)p, and introducing the parameter r = (1 − p)/(a + p), we obtain after some algebra

n s_x^+ \frac{dv_{xi}}{dt} = \epsilon T_1 v_{xi} \left[ a(c - 1) \left( \sum_j \frac{1 + r\delta_{ij}}{1 + r v_{xj}} - n \right) + (1 - p) \sum_j \frac{1 + r\delta_{ij}}{1 + r v_{xj}} \sum_y \Delta_{xy} (v_{yj} - v_{xj}) \right]    (2.4)
and

\frac{ds_x^+}{dt} + \epsilon s_x^+ = \epsilon T_1 (ac + m),    (2.5)
where, in these two equations, we have dropped for notational convenience the ⟨·⟩ brackets on the variables v_xi and s_x^+, these brackets denoting ensemble averaging. Restricting to n = 2 afferents for increased tractability and writing v_x = 2v_xi − 1 for any one of the two afferents i, we then obtain

s_x^+ \frac{dv_x}{dt} = \epsilon T_1 r^2 \frac{1 - v_x^2}{(2 + r)^2 - r^2 v_x^2} \sum_y \hat{\Delta}_{xy} v_y,    (2.6)

\frac{ds_x^+}{dt} = \epsilon [T_1 (ac + m) - s_x^+],    (2.7)

where

\hat{\Delta}_{xy} = (a + m) \Delta_{xy} - (ac + m) \delta_{xy}.    (2.8)
From equations 2.6 and 2.7, we easily obtain a Lyapunov or energy function E, such that dE/dt ≤ 0 always, where

E = E_S + E_C,    (2.9)

with

E_S = \frac{1}{2} \sum_x [T_1 (ac + m) - s_x^+]^2    (2.10)

and

E_C = -\frac{1}{2} \sum_{xy} v_x \hat{\Delta}_{xy} v_y.    (2.11)
In fact, this form for E generalizes rigorously to n > 2 afferents, as we shall demonstrate elsewhere. The stable solutions of equations 2.6 and 2.7, and thus the solutions of the unaveraged equation 2.1 when ε is sufficiently small that the s_xi change slowly compared to the afferent activities a_i, correspond to the minima of E. E cleanly decomposes into two pieces that may be minimized independently. E_S merely sets the overall scale for s_x^+ and is minimized directly by setting s_x^+ = T_1(ac + m) ∀x. The dynamics embodied in this minimization (and therefore the solutions of equations 2.5 or 2.7) are therefore trivial and uninteresting. Minimization of E_C, on the other hand, with v_x ∈ [−1, 1] corresponding to a target cell "spin" variable denoting control by one afferent (negative v_x) or the other (positive v_x), encapsulates the competitive dynamics of the model. The eigenvalues of the matrix Δ̂ entirely determine the character of these dynamics, and E_C is, of course, exactly the Hamiltonian of a spin glass (with v_x ∈ {−1, +1}).
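The competitive dynamics of equation 2.6 can be integrated numerically in the simplest case of a single target cell, where the lateral term reduces to Δ̂ = a(1 − c). A minimal sketch with hypothetical parameter values (a, c, r, the initial bias v0, and the fixed scale s_plus are all illustrative choices, and the factor εT1 is absorbed into the time units):

```python
def segregation(c, a=1.0, r=2.0, s_plus=1.0, v0=0.05, dt=0.01, steps=20000):
    """Euler-integrate equation 2.6 for a single target cell, where the
    lateral term reduces to dhat = a*(1 - c).  The factor eps*T1 is absorbed
    into the time units.  Returns the final spin variable v in [-1, 1]."""
    dhat = a * (1.0 - c)
    v = v0
    for _ in range(steps):
        dv = r**2 * (1.0 - v * v) / ((2.0 + r)**2 - r**2 * v * v) * dhat * v
        v += dt * dv / s_plus
    return v
```

With these hypothetical parameters, c < 1 (so Δ̂ > 0) drives v toward ±1, meaning one afferent captures the target cell, while c > 1 relaxes v back toward 0, matching the segregation condition on c stated earlier.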
2.3 Nonlinear Hebb Rule Formulation. We now return to n afferents in the unaveraged system defined by equations 2.2 and 2.3 and consider one target cell only, so that we examine competition on a pointwise basis. We then may set Δ_xy = δ_xy and drop the x subscript on the s_x^+ and v_xi (and s_xi) variables. Our results, however, easily generalize to multiple target cells with a general Δ_xy. The energy function analysis above reveals that the n independent degrees of freedom in the n s_i variables decompose into n − 1 degrees of freedom in the n scaleless v_i variables (Σ_i v_i = 1, by construction), which entirely capture the competitive dynamics of the model, and one degree of freedom in the s^+ variable, which entirely captures the scaling dynamics of the model. As argued above, these latter scaling dynamics are quite trivial and uninteresting, and we may discard them without any important loss of generality, restricting attention to the (n − 1)-dimensional competitive subspace only. Hence, our basic equation is just
dv_i/dt = v_i [ (a_c + a + c Σ_j v_j a_j) / (a_c + a + Σ_j v_j a_j) ] Σ_j (a_i − a_j) v_j,  (2.12)
with Σ_i v_i = 1, where we have absorbed a factor of 2T1 into a redefinition of time. In this subspace, we have a synaptic growth rule, equation 2.12, which automatically normalizes the v_i such that Σ_i v_i = 1 always. We stress, however, that this normalization is the result of the definition of the v_i variables (v_i = s_i/s_+) and the fact that these variables are key in capturing the competitive dynamics, with the scaling dynamics decoupling. Nevertheless, from a purely mathematical point of view, we may ask what the form of the growth rule underlying equation 2.12 is and how the "emergent" normalization Σ_i v_i = 1 is maintained. Defining

P = (a_c + a + c Σ_j v_j a_j) / (a_c + a + Σ_j v_j a_j),  (2.13)
which is a purely postsynaptic although nonlinear term, and

p_i = a_i v_i,  (2.14)

which is a presynaptic term, we may rewrite equation 2.12 as

dv_i/dt = p_i P − v_i (Σ_j p_j) P.  (2.15)
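The normalization mechanics of equation 2.15 can be illustrated numerically. The sketch below is our own illustration, not code from the paper; it uses a forward Euler step and sets the postsynaptic factor P = 1, the purely presynaptic c = 1 case, since only the subtractive second term matters for the normalization:

```python
import random

def step(v, a, dt=0.01, P=1.0):
    """One Euler step of dv_i/dt = p_i*P - v_i*(sum_j p_j)*P, with p_i = a_i*v_i.
    The second (subtractive) term forces sum_i dv_i/dt = 0 whenever sum_i v_i = 1,
    so the constraint is preserved by the dynamics themselves."""
    p = [ai * vi for ai, vi in zip(a, v)]
    total = sum(p)
    return [vi + dt * (pi - vi * total) * P for vi, pi in zip(v, p)]

random.seed(0)
v = [0.5, 0.3, 0.2]                       # scaleless strengths v_i = s_i/s_+
for _ in range(2000):
    a = [random.random() for _ in v]      # afferent activities in [0, 1]
    v = step(v, a)

print(sum(v))  # remains 1 up to floating-point error
```

Whatever positive factor P multiplies both terms, Σ_i v_i is conserved; P only rescales the speed of the competitive dynamics.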
The first term on the right-hand side (RHS) is a nonlinear Hebb growth rule, where, for the moment, we define a Hebb growth rule as any synaptic
growth rule that is expressible as the product of a presynaptic term and a purely postsynaptic term. Were this the only term on the RHS of equation 2.15, it would induce the unconstrained synaptic growth characteristic of Hebb rules. How, then, is this unconstrained growth forced to remain in the (n − 1)-dimensional subspace in which Σ_i v_i = 1? Were we to impose this through multiplicative synaptic normalization, we would modify the Hebb rule by subtracting from the unconstrained growth term a term proportional to v_i and such that this additional term forces Σ_i dv_i/dt = 0. In our case, this term would be v_i (Σ_j p_j) P, which is exactly the second term on the RHS of equation 2.15. Equation 2.15 therefore represents a nonlinear Hebb rule, with a nonlinear postsynaptic term P, together with multiplicative synaptic normalization. Yet this model segregates afferents for all but perfectly correlated afferent activity patterns (when D_xy = δ_xy), explicitly contradicting the widely held view that multiplicative normalization fails to segregate positively correlated afferents and that subtractive normalization must be used instead (Goodhill & Barrow, 1994; Miller & MacKay, 1994). Furthermore, it contradicts, or at least dramatically reduces the significance of, claims that nonlinear Hebbian dynamics are reducible to linear Hebbian dynamics (Miller, 1990), for linear dynamics will not segregate positively correlated afferents under multiplicative normalization (Goodhill & Barrow, 1994; Miller & MacKay, 1994). We discuss these issues more fully later. The form of P allows us to see more readily the model's behavior in certain limits. When c = 1, we have that P ≡ 1. In this case, equation 2.15 is purely presynaptic, and so we would expect, as in fact observed (Elliott & Shadbolt, 1998a), chaotic oscillations in the v_i as they grow or decay independently of each other (except through Σ_i v_i = 1). In the limit a → ∞ with a_c held fixed (because a_c = T0/T1, T0 and T1 being parameters independent of a), P → 0. Hence, as a grows, the v_i evolve more slowly, so that segregation slows down, and, in the limit, the v_i are constant in time. When c < 1, P < 1 and is a monotone increasing function of Σ_j v_j a_j on [0, ∞) (although the actual range of this sum is just [0, 1]), while for c > 1, P > 1 and is a monotone decreasing function of Σ_j v_j a_j. P should therefore not in general be regarded as a postsynaptic firing rate, but rather as a postsynaptic "plasticity rate," which, of course, depends on the summed postsynaptic response, Σ_j v_j a_j. While we have defined a Hebb rule as any synaptic growth rule that is expressible as the product of a presynaptic and a postsynaptic term, many Hebb rules are often stated in more restricted forms involving only correlations between presynaptic and postsynaptic activity. For c < 1, because P is a monotone increasing function of Σ_j v_j a_j, our Hebb rule definition for our model is equivalent to the more restricted definition. Hence, for c < 1, equation 2.15 is Hebbian in both senses, and because our model segregates afferents only for c < 1, this justifies our use
of our more general definition of a Hebbian model. However, for c > 1, P is monotone decreasing. Therefore, the growth rule in equation 2.15, although Hebbian by our definition, is not Hebbian according to the more restricted definition. Indeed, for c > 1, equation 2.15 would often be regarded as an anti-Hebbian rule, rewarding anticorrelations between presynaptic and postsynaptic activity. We thus see the dynamical importance of the point c = 1: it corresponds to the transition in the model between a (classical) Hebbian rule and a (classical) anti-Hebbian rule.

3 Discussion
We have extended our earlier analysis of our neurotrophic model of anatomical, competitive synaptic plasticity, which can segregate afferents in the presence of positively correlated afferent activity patterns, and have shown that a change of variables reveals two essentially decoupled sets of dynamics. One set determines an overall synaptic scale and can be discarded without any important loss of generality. The other set completely determines the competitive dynamics underlying the model, independent of the synaptic scale. These competitive dynamics can be viewed in two contrasting ways—as the dynamics of a spin glass, which we have not pursued at any length here, or formulated precisely as a nonlinear Hebbian model with synaptic normalization implemented multiplicatively (cf. Wiskott & Sejnowski, 1998). In contrast to many other models of synaptic competition, this normalization is not imposed, but rather is "emergent," in the sense that the interesting, competitive interactions of the model restrict themselves dynamically to a lower-dimensional subspace in which a set of variables can be found that fully capture these dynamics and satisfy a normalization constraint. Critically, although synaptic normalization is implemented multiplicatively in this lower-dimensional subspace, our model can segregate positively correlated afferents. Indeed, in the absence of lateral interactions between target cells (so that D_xy = δ_xy), the model can segregate all but perfectly correlated afferents. The fact that a nonlinear Hebb rule and multiplicative synaptic normalization can segregate even strongly positively correlated afferents is intriguing.
It is well known that multiplicative normalization together with a linear Hebb rule, or a linear Hebb rule coupled with nonlinearities such as a winner-take-all mechanism, cannot segregate positively correlated afferents (von der Malsburg, 1973; Goodhill & Barrow, 1994; Miller & MacKay, 1994) and that subtractive rather than multiplicative normalization must be used (Miller et al., 1989; Goodhill, 1993). The fact that certain nonlinear Hebbian models are reducible to linear Hebbian models (Miller, 1990) has led to the widespread belief that, in general, no Hebbian model, linear or nonlinear, can segregate positively correlated afferents under multiplicative normalization. Our results constitute an explicit and constructive counterexample to these beliefs.
Neurotrophic Models and Synaptic Normalization
We have frequently criticized synaptic normalization for being a mathematical device that simply imposes rather than seeks to illuminate synaptic competition (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001). Even if we accept that synaptic normalization underlies competition in the nervous system and accept that normalization could arise from decay or homeostatic mechanisms controlling synaptic efficacy (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998), it appears to us implausible to assume, as required by subtractive normalization, that these mechanisms should regulate synaptic efficacy in a fashion that is independent of the concentrations of any of the presynaptic or postsynaptic components that together determine synaptic efficacy. If a model of synaptic plasticity requires an arguably implausible form of synaptic normalization to segregate positively correlated afferents, then perhaps that model should be regarded as implausible. Prior to our results above, such a conclusion would have been unpalatable, as it would have left a vacuum in the space of conventional competitive, synaptic normalization-based Hebbian models that can segregate positively correlated afferents. A few other models will segregate positively correlated afferents (e.g., Bienenstock et al., 1982; Harris et al., 1997), but these do not enforce hard constraints via synaptic normalization on summed synaptic efficacy. Our present model, in its nonlinear Hebb reformulation, fills this vacuum by being an explicit, constructive example of a simple Hebbian model that uses multiplicative normalization to achieve competitive dynamics. This model is almost certainly not unique: presumably an infinite number of nonlinear Hebbian models exist that are capable of segregating positively correlated afferents using various forms of synaptic normalization that do not require biologically problematic assumptions.
Our model should therefore be construed, from a mathematical point of view, as merely an existence proof of one such model. How do we reconcile our previous criticisms of synaptic normalization (of any form) as a mathematical device and our presentation of our neurotrophic model as a competing alternative with the results above, showing an underlying multiplicative synaptic normalization in the competitive dynamics of our model? There are two related aspects to our reply. First, the synaptic normalization that we have found was in no way imposed from the outset. After changing variables so as to eliminate an overall synaptic scale, we found that the competitive dynamics reside entirely in the scaleless variables (the v_xi), which by definition satisfy a normalization equation, and, moreover, the dynamics of the scaleless variables are basically independent of the dynamics of the scaling variables (the s_x+); at least, the coupling is completely trivial and uninteresting and can be ignored without any important loss of generality. In this sense, then, we have referred to this normalization as "emergent." Second, mathematically, it will almost always be possible to perform a change of variables and thereby introduce a set of scaleless variables that satisfy a normalization equation. The only case in which this may not be
possible is when the transformation would introduce possibly singular variables, but we shall ignore this possibility. Whether or not the scaling and scaleless dynamics decouple, it will also always be possible to ask whether the scaleless dynamics can be mathematically separated into a general growth term (a Hebb or Hebb-like term, for example) and a term that maintains the normalization equation (the normalization term that sets the derivative of the summed scaleless variables to zero). If the competitive dynamics reside entirely in the scaleless variables and if the scaleless and scaling variables do not interact in any important fashion, then any such model will possess an underlying synaptic growth rule with competition implemented by some form of synaptic normalization. Given that neurons exhibit such properties as homeostasis and gain control, which can be thought of as mechanisms to eliminate dependence on, or to adjust to, synaptic scale, it is possible that many models of synaptic plasticity can be so reformulated. In this view, in those classes of model of synaptic plasticity in which the underlying competitive dynamics are scale independent, "emergent" synaptic normalization is seen to be inevitable, fully capturing, mathematically speaking, the competitive dynamics. Thus, when rooted in a biologically plausible model of synaptic plasticity, synaptic normalization, provided that its emergent form is not implausible, would appear to be an acceptable, abstract, and general description of the underlying biology. Can we reverse this statement and argue that we may impose competitive dynamics on any given synaptic growth rule by enforcing a hard normalization constraint? The answer is probably in the affirmative, provided that such a model is regarded as tentative, awaiting a derivation from an underlying model whose scaleless dynamics correspond to the given synaptic growth rule.
Of course, if it is found that the synaptic growth rule requires an implausible form of synaptic normalization to achieve afferent segregation in the presence of positive correlations, as we believe that linear and semilinear Hebb rules do, then that particular synaptic growth rule should be discarded.
Acknowledgments
T. E. thanks the Royal Society for the support of a University Research Fellowship.

References

Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Crair, M. C., Gillespie, D. C., & Stryker, M. P. (1998). The role of visual experience in the development of columns in cat striate cortex. Science, 279, 566–570.
Crowley, J. C., & Katz, L. C. (1999). Early development of ocular dominance columns. Science, 290, 1321–1324.
Crowley, J. C., & Katz, L. C. (2000). Development of ocular dominance columns in the absence of retinal input. Nature Neurosci., 2, 1124–1130.
Elliott, T., & Shadbolt, N. R. (1998a). Competition for neurotrophic factors: Mathematical analysis. Neural Comp., 10, 1939–1981.
Elliott, T., & Shadbolt, N. R. (1998b). Competition for neurotrophic factors: Ocular dominance columns. J. Neurosci., 18, 5850–5858.
Elliott, T., & Shadbolt, N. R. (1999). A neurotrophic model of the development of the retinogeniculocortical pathway induced by spontaneous retinal waves. J. Neurosci., 19, 7951–7970.
Elliott, T., Maddison, A. C., & Shadbolt, N. R. (2001). Competitive anatomical and physiological plasticity: A neurotrophic bridge. Biol. Cybern., 84, 13–22.
Fraser, S. E., & Perkel, D. H. (1989). Competitive and positional cues in the patterning of nerve connections. J. Neurobiol., 21, 51–72.
Goodhill, G. J. (1993). Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern., 69, 109–118.
Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalisation in competitive learning. Neural Comp., 6, 255–269.
Harris, A. E., Ermentrout, G. B., & Small, S. L. (1997). A model of ocular dominance column development by competition for trophic support. Proc. Natl. Acad. Sci. U.S.A., 94, 9944–9949.
Horton, J. C., & Hocking, D. R. (1996). An adult-like pattern of ocular dominance columns in striate cortex of newborn monkeys prior to visual experience. J. Neurosci., 16, 1791–1807.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
LeVay, S., Stryker, M. P., & Shatz, C. J. (1978). Ocular dominance columns and their development in layer IV of the cat's visual cortex: A quantitative study. J. Comp. Neurol., 179, 223–244.
LeVay, S., Wiesel, T. N., & Hubel, D. H. (1980). The development of ocular dominance columns in normal and visually deprived monkeys. J. Comp. Neurol., 191, 1–51.
Linsker, R. (1986a). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A., 83, 7508–7512.
Linsker, R. (1986b). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A., 83, 8390–8394.
Linsker, R. (1986c). From basic network principles to neural architecture: Emergence of orientation columns. Proc. Natl. Acad. Sci. U.S.A., 83, 8779–8783.
McAllister, A. K., Katz, L. C., & Lo, D. C. (1999). Neurotrophins and synaptic plasticity. Annu. Rev. Neurosci., 22, 295–318.
Miller, K. D. (1990). Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comp., 2, 321–333.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comp., 6, 100–126.
Montague, P. R., Gally, J. A., & Edelman, G. M. (1991). Spatial signaling in the development and function of neural connections. Cereb. Cortex, 1, 199–220.
Purves, D. (1994). Neural activity and the growth of the brain. Cambridge: Cambridge University Press.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Swindale, N. V. (1980). A model for the formation of ocular dominance stripes. Proc. Roy. Soc. Lond. Ser. B, 208, 243–264.
Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network, 7, 161–247.
Tanaka, S. (1991). Theory of ocular dominance column formation: Mathematical basis and computer simulation. Biol. Cybern., 64, 263–272.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Ooyen, A. (2001). Competition in the development of nerve connections: A review of models. Network, 12, R1–R47.
von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Wiskott, L., & Sejnowski, T. J. (1998). Constrained optimization for neural map formation: A unifying framework for weight growth and normalization. Neural Comp., 10, 671–716.

Received May 11, 2001; accepted September 28, 2001.
LETTER
Communicated by Andreas Andreou
Energy-Efficient Coding with Discrete Stochastic Events

Susanne Schreiber
[email protected]
Institute of Biology, Humboldt-University Berlin, 10115 Berlin, Germany, and Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, U.K.

Christian K. Machens
[email protected]

Andreas V. M. Herz
[email protected]
Institute of Biology, Humboldt-University Berlin, 10115 Berlin, Germany

Simon B. Laughlin
[email protected]
Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, U.K.
We investigate the energy efficiency of signaling mechanisms that transfer information by means of discrete stochastic events, such as the opening or closing of an ion channel. Using a simple model for the generation of graded electrical signals by sodium and potassium channels, we find optimum numbers of channels that maximize energy efficiency. The optima depend on several factors: the relative magnitudes of the signaling cost (current flow through channels), the fixed cost of maintaining the system, the reliability of the input, additional sources of noise, and the relative costs of upstream and downstream mechanisms. We also analyze how the statistics of input signals influence energy efficiency. We find that energy-efficient signal ensembles favor a bimodal distribution of channel activations and contain only a very small fraction of large inputs when energy is scarce. We conclude that when energy use is a significant constraint, trade-offs between information transfer and energy can strongly influence the number of signaling molecules and synapses used by neurons and the manner in which these mechanisms represent information.
1 Introduction
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1323–1346 (2002)

Energy and information are intimately related in all forms of signaling. Cellular signaling involves local movements of ions and molecules, shifts in their concentration, and changes in molecular conformation, all of which
require energy. Nervous systems have highly evolved cell signaling mechanisms to gather, process, and transmit information, and the quantities of energy consumed by neural signaling can be significant. In the blowfly retina, the transmission of a single bit of information across one chemical synapse requires the hydrolysis of more than 100,000 ATP molecules (Laughlin, de Ruyter van Steveninck, & Anderson, 1998). The adult human brain accounts for approximately 20% of resting energy consumption (Rolfe & Brown, 1997). Recent calculations suggest that the high rate of energy consumption in cortical gray matter results mainly from the transmission of electrical signals along axons and across synapses (Attwell & Laughlin, 2001). Given that high levels of energy consumption constrain function, it is advantageous for nervous systems to use energy-efficient neural mechanisms and neural codes (Levy & Baxter, 1996; Baddeley et al., 1997; Sarpeshkar, 1998; Laughlin, Anderson, O'Carroll, & de Ruyter van Steveninck, 2000; Schreiber, 2000; Balasubramanian, Kimber, & Berry, 2001; de Polavieja, in press). We set out to investigate the relationship between energy and information at the level of the discrete molecular events that generate cell signals. Ultimately, information is transmitted by the activation and deactivation of signaling molecules. These are generally single proteins or small complexes that respond to changes in electrical, chemical, or mechanical potential. Familiar neural examples are the opening of an ion channel, the binding of a ligand to a receptor, the activation of a G-protein, and vesicle exocytosis. These events involve the expenditure of energy—for example, to restore ions that flow across the membrane, restore G-proteins to the inactive (GGDP) state, and remove and recycle neurotransmitter. The ability of these events to transmit information is limited by their stochasticity (Laughlin, 1989; White, Rubinstein, & Kay, 2000).
This uncertainty reduces reliability and hence the quantity of transmitted information. To increase information, one must increase the number of events used to transmit the signal; this, in turn, increases the consumption of energy. We investigate this fundamental relationship between information and energy in molecular signaling systems by developing a simple model within the context of neural information processing: a population of ion channels that responds to an input by changing their open probability. We derive the quantity of information transmitted by the population of channels and demonstrate how information varies as a function of the properties of the input and the number of channels in the population. We identify optima that maximize the ratio between transmitted information and cost. These optima depend on the input statistics and, as with spike codes (Levy & Baxter, 1996), the ratio between the costs of generating signals (the signaling cost) and the cost of constructing the system, maintaining it in a state of readiness, and providing it with an input (the fixed costs). The article is organized as follows. Section 2 introduces the model for a system that transmits information with discrete stochastic signaling events.
In section 3, we define measures of information transfer, energy consumption, and metabolic efficiency for such a system. In section 4, we analyze the dependence of energy efficiency on the number of stochastic units for gaussian input distributions. The influence of additional noise sources is studied in section 5. In section 6, we look at the energy efficiency of combinations of systems, and in section 7, we derive optimal input distributions and show how energy efficiency depends on the number of stochastic units when the distribution of inputs is optimal. Finally, in section 8 we conclude our investigation with an extensive discussion.

2 The Model
We consider an information transmission system with input x and output k. The system has N identical, independent units that are activated and deactivated stochastically. The input x directly controls the probability that a unit is activated. The number of activated units, k, constitutes the output of the system. Because realistic physical inputs are bounded in magnitude, any given distribution of inputs can be mapped in a one-to-one fashion onto the activation probabilities of units, within the interval [0, 1]. We therefore assume, without loss of generality, that x ∈ [0, 1] is equivalent to the probability of being in an activated state. Consequently, the conditional probability that a given input x activates k units is given by a binomial distribution,

p(k|x) = (N choose k) x^k (1 − x)^(N−k).  (2.1)
The variance σ²_k|x of this binomial probability distribution,

σ²_k|x = Nx(1 − x),  (2.2)

is a measure for the system's transduction accuracy, defining the magnitude of the noise. Note that σ²_k|x depends on both the number of available units N and the input x. In an equivalent interpretation, the model can also be considered as a linear input-output system,

k = Nx + η(N, x),  (2.3)

where Nx is the "deterministic" component of the output and η(N, x) represents the noise due to the stochasticity of the units. The noise distribution corresponds to p(k|x) shifted to have zero mean; its variance is therefore given by σ²_k|x. Thus, we see that the input x specifies a noise-free output N·x, to which the noise η is added, yielding the integer-valued output k.
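Equations 2.1 and 2.2 are easy to check numerically. The sketch below is our own illustration (function and variable names are ours); it computes the exact binomial pmf and its variance:

```python
from math import comb

def p_k_given_x(k, x, N):
    """Equation 2.1: probability that input x activates exactly k of N units."""
    return comb(N, k) * x**k * (1 - x) ** (N - k)

N, x = 10, 0.3
pmf = [p_k_given_x(k, x, N) for k in range(N + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(round(mean, 6))  # N*x = 3.0
print(round(var, 6))   # N*x*(1 - x) = 2.1, as in equation 2.2
```

The variance grows only linearly with N while the signal Nx grows linearly too, which is why adding channels improves the effective signal-to-noise ratio.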
Figure 1: (A) Schematic view of the membrane model. The input x directly determines the probability that sodium channels are open. In contrast to the stochastic sodium channels, potassium channels are considered nonstochastic at this stage of analysis; independent of the input, a constant fraction p_K is open. (B) Schematic view of two model signaling systems with N = 5 and N = 10 sodium channels, respectively. Note that the ratio of sodium to potassium channels is kept constant (here N_K/N = 1).
2.1 Implementing the Model. For concreteness, we have chosen to implement the model in the context of a basic neural signaling mechanism. A membrane contains two populations of voltage-insensitive channels: one controlling inward current and the other outward current (see Figure 1). For convenience, we refer to these as sodium and potassium channels, respectively. There are N sodium channels and N_K potassium channels. In our analysis, an input, x, produces a voltage response by changing the open probability of the set of sodium channels, which take the role of the stochastic signaling units. The input x could be derived from a variety of sources, both external, such as light, and internal, such as synaptically released transmitter. But regardless of its origins, the input is assumed to be unambiguously represented by the open probability of sodium channels. For simplicity, the second set of ion channels—the potassium channels—is considered to be input independent and noise free. Thus, a fixed proportion of potassium channels is kept open, regardless of the size of the input. The input signal specifies the voltage output signal by directly determining the probability that sodium channels are open. Thus, a given input value x, presented in a small time interval Δt, will result in k open sodium channels. Note that because of channel noise, the number of open sodium channels k will vary with different presentations of the same input. The state of the model system, on the other hand, is given by the number of open sodium channels k and translates uniquely into a voltage output if we neglect the influence of membrane capacitance. Conversely, if we know the
output voltage V, we can directly infer the number of open sodium channels k. Both variables are therefore equivalent from an information-theoretic point of view. By working in the channel domain (i.e., by taking the number of open sodium channels k as a measure of the system's output), we can simplify our analysis and avoid the nonlinear relationship of conductance, current, and voltage. Note that to achieve the linear relationship between the input x and the (average) output k, the channels are not voltage sensitive. For simplicity, we do not study the effects of membrane capacitance on the time course of the voltage response, assuming that the signal variation is slow in comparison to the timescale of the effects introduced by membrane capacitance. Nor do we analyze the effects of channel kinetics. By working at a fundamental level, the mapping of the input onto discrete molecular events, we can investigate a simple model of general validity.

3 Calculating the Metabolic Efficiency
We define the metabolic efficiency of a signaling system as the ratio between the amount of transmitted information I and the metabolic energy E required. Efficiency I/E thus can be expressed in bits per energy unit. (For other efficiency measures, see section A.3.) Both information transfer I and metabolic energy (or cost) E depend on the number of available signaling units—the number of channels, N. To investigate the relationship between efficiency and the number of stochastic units, we drive the model system with a fixed distribution of inputs, p_x(x), and vary N, the size of the population of sodium channels used for signaling. This is equivalent to changing the channel density of our model membrane while maintaining the same input. To ensure that systems with different values of N, the number of sodium channels, produce the same mean voltage output in response to a given input, x, the population of potassium channels, N_K, is set to a constant proportion of N. Under these conditions, we can now calculate how the information transmitted, I, and the energy consumed, E, vary for different numbers of channels.

3.1 Information Transfer. Consider the transmission of a single signal. The model system receives an input, x, drawn from the input distribution, p_x(x), and produces a single output, k. According to Shannon, the information transmitted by the system is given by
I[N; p_x(·)] = Σ_{k=0}^{N} ∫_0^1 dx p(k|x) p_x(x) log2[ p(k|x) / p_k(k) ],  (3.1)
and depends on the input distribution p_x(x) and, via the binomial distribution of channel activation p(k|x) (see equation 2.1), also on the number
of available units, N. For a continuously signaling neuron, this is equivalent to the response during a discrete time bin of duration Δt, and I is the rate of transmission in bits/Δt. In this article, we study two different scenarios—gaussian input distributions and input distributions that maximize the information transfer—both in the presence of noise caused by stochastic units (e.g., channel noise).

3.2 Energy. The total amount of energy required for the maintenance of a signaling system and the transmission of a signal is given by
E[N; p_x(·)] = b + ∫_0^1 p_x(x) e(x, N) dx,  (3.2)
where b is the fixed cost, and e(x, N) characterizes the required energy as a function of the input x and the number of stochastic signaling units N. Thus, we classify the metabolic costs into two groups: costs that are independent of N and costs that depend on the total number of channels, N. For simplicity, we assume that the latter costs are dominated by the energy used to generate signals (in this case, restoring the current that flows through ion channels), and we neglect the energy required to maintain channels. The first group of costs, b, relates to costs that have to be met in the absence of signals, such as the synthesis of proteins and lipids. These costs are therefore called fixed costs and are constant with respect to x and N. Because we have set up our systems to produce identical mean values of the voltage output signal given x (by fixing the ratio N/N_K), the function e(x, N) is separable into the variables N and x (see section A.1),

e(x, N) = N ẽ(x),  (3.3)
so that the signaling-related total energy consumption rises linearly with N. The function ẽ(x) increases monotonically between ẽ(0) = 0 and ẽ(1) = e_ch, where e_ch denotes the energy cost associated with a single open sodium channel. (The precise form of ẽ(x) is derived in section A.1.) Rescaling the measurement units of energy, we will from now on set e_ch = 1. Altogether, the total energy reads

$$ E[N; p_x(\cdot)] = b + \langle \tilde{e}(x) \rangle\, N, \qquad (3.4) $$
where ⟨ẽ(x)⟩ is the average signal-related energy requirement of one stochastic unit, the average being taken with respect to the input distribution p_x(x). In the first part of the analysis, where we analyze energy-efficient numbers of channels, we make the simplifying assumption that the average cost per channel, ⟨ẽ(x)⟩, is approximately equal to the mean of the input, x̄. Note that the energy E is defined as a measure of cost for one time unit Δt, just as I is the measure of information transfer in Δt.
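As a concreteness check, equations 3.1 and 3.4 can be evaluated numerically for a binomial channel population. The sketch below is a minimal illustration, not the authors' code; the two-point input distribution and the parameter values are arbitrary, and the cost per channel is approximated by the mean input, as assumed in the text.

```python
import math

def binom_pmf(k, n, p):
    # p(k|x): probability that k of n channels are open at open probability p
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def information_and_energy(N, xs, px, b=0.0):
    """Equation 3.1 (exact, on a discrete input grid) and equation 3.4.

    xs: input values in (0, 1); px: their probabilities (summing to 1).
    The average cost per channel is approximated by the mean input.
    """
    # Output marginal p_k(k) = sum_x p(k|x) p_x(x)
    pk = [sum(binom_pmf(k, N, x) * w for x, w in zip(xs, px))
          for k in range(N + 1)]
    I = 0.0
    for x, w in zip(xs, px):
        for k in range(N + 1):
            pkx = binom_pmf(k, N, x)
            if pkx > 0.0 and pk[k] > 0.0:
                I += w * pkx * math.log2(pkx / pk[k])
    xbar = sum(x * w for x, w in zip(xs, px))
    E = b + xbar * N  # fixed cost plus signal cost rising linearly with N
    return I, E

# Two equally likely inputs, N = 10 channels, fixed cost b = 5
I, E = information_and_energy(10, [0.2, 0.8], [0.5, 0.5], b=5.0)
```

With only two input values, I is bounded by 1 bit; increasing N drives I toward that bound while E keeps growing, which is exactly the tension the efficiency measure I/E captures.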
Energy-Efficient Coding
4 Gaussian Input Distributions

4.1 Information Transfer. We focus on gaussian inputs first because, according to the central limit theorem, they are a reasonable description of signals composed of many independent subsignals, and they also allow an analytic expression of information transfer. To confine the bulk of the gaussian distributions within the input interval [0, 1], the mean, x̄, and the variance, σ_x², are chosen such that the distance from the mean to the interval borders is always larger than 3σ_x. Values falling outside the interval [0, 1] are cut off, and the distribution is then normalized to unity. Numerical simulations (see section A.2) show that the effects of this procedure on the results are negligible. The information transfer I (Shannon & Weaver, 1949) per unit time for a linear system with additive gaussian noise and gaussian inputs is given by
$$ I = \frac{1}{2}\log_2(1 + \mathrm{SNR}), \qquad (4.1) $$
where SNR denotes the signal-to-noise ratio. It is defined as the ratio between the signal variance, σ_x², and the effective noise variance, σ²_{k|x}/N². If two criteria are met (first, the binomial noise η(N, x) can be approximated by a gaussian and, second, within the regime of most likely inputs x, changes in the noise variance σ²_{k|x} are negligible), the following equation gives a reasonably good approximation of the information transfer of our model system:

$$ I = \frac{1}{2}\log_2\left(1 + \frac{N\sigma_x^2}{\bar{x}(1-\bar{x})}\right). \qquad (4.2) $$
This is the case for large N and a restriction to the gaussian inputs described above. Numerical tests (see section A.2) show that the deviation between the real information transfer with N stochastic signaling units and the information transfer given by equation 4.2 is very small.

4.2 Efficiency. For the efficiency, defined as I/E, we obtain the following expression:

$$ \frac{I}{E} = \frac{1}{2(\bar{x}N + b)}\log_2\left(1 + \frac{N\sigma_x^2}{\bar{x}(1-\bar{x})}\right). \qquad (4.3) $$
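The shift of the efficiency optimum with the fixed cost can be reproduced directly from equation 4.3. The following sketch brute-forces the most efficient N for two values of b; the input parameters (σ_x = 0.16, x̄ = 0.5) follow Figure 2, but the search range is an arbitrary choice of ours.

```python
import math

def efficiency(N, b, xbar=0.5, sx=0.16):
    # Equation 4.3: I/E in the gaussian approximation
    snr = N * sx**2 / (xbar * (1.0 - xbar))
    return 0.5 * math.log2(1.0 + snr) / (xbar * N + b)

def optimal_N(b, n_max=20000):
    # Brute-force search for the most energy-efficient channel number
    return max(range(1, n_max), key=lambda N: efficiency(N, b))

n_opt_small_b = optimal_N(500.0)
n_opt_large_b = optimal_N(5000.0)
```

As in Figure 2A, the optimum shifts to larger N as the fixed cost b rises: a more expensive infrastructure pays off only if it is used by more signaling units.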
In Figure 2, efficiency curves I/E are depicted as a function of N for three different values of the fixed cost b. The energy efficiency exhibits an optimum for all curves. Signal transmission with numbers of channels N within the range of the optimum is especially energy efficient. The position of the optimum depends strongly on the size of the fixed cost b relative to the average cost of a single channel, ⟨ẽ(x)⟩ ≈ x̄.

Figure 2: (A) Efficiency I/E as a function of N for three different fixed costs. From top to bottom, the fixed cost b is equivalent to the cost of 500, 2000, and 5000 open channels. The optimal N shifts to larger N with rising fixed cost b (see also the inset), and the efficiency curves become wider. (B) Efficiency curves for fixed costs of b = 50, b = 500, and b = 5000 rescaled, so that the maximum corresponds to (1,1). The inset shows the same data on a logarithmic scale. These rescaled curves, (I/E)_n, are similar in shape, independent of the size of b. For all graphs, the parameters of the input distribution p_x(x) are σ_x = 0.16 and x̄ = 0.5.

Figure 2 displays the dependence of the optimal number of channels N on the fixed cost b. The most efficient number of channels increases approximately linearly with the size of the fixed cost, although close inspection reveals that the slope is steeper for smaller fixed costs and shallower in the range of higher b. If b = 0, the most efficient system capable of transmitting information uses one channel. The average costs of the most energy-efficient population of channels, employing N_opt channels, are given by N_opt x̄. Therefore, the ratio between these signaling costs at x̄ = 0.5 and the fixed cost b is approximately 1:4 in the example depicted. This ratio, however, which can be derived from the slope of the b–N_opt curve in the inset of Figure 2A, strongly depends on the input distribution. The analysis for gaussian input distributions shows that the ratio of N_opt to b increases with decreasing input variance, σ_x² (results not shown). Remarkably, the efficiency curves rise very steeply for small N and, after passing the optimum, decrease only gradually. This characteristic does not strongly depend on the size of b, as shown in Figure 2B. It is thus very uneconomical to use a number of channels sufficiently far below the optimum, whereas above, there is a broad range of high efficiency. In this range, a cell could adjust its number of channels to the amount of information needed (e.g., add more channels) without losing much efficiency. However, as the inset to Figure 2B indicates, increasing the number of channels by a given factor has a similar effect on efficiency as decreasing them by the same factor.

5 Additional Noise Sources
The most energy-efficient number of channels is influenced by the size of additional noise. This might be noise added to the input (additive input noise) or internal noise generated within the signaling system independent of the activation of sodium channels (input-independent internal noise).

5.1 Additive Input Noise. If the input contains gaussian noise of a fixed variance ⟨η_x²⟩ that is not correlated with the signal, the variances of the signal and the noise add, yielding the modified SNR at the output
$$ \mathrm{SNR} = \frac{N\sigma_x^2}{N\langle\eta_x^2\rangle + \bar{x}(1-\bar{x})}. \qquad (5.1) $$
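A small sketch of equation 5.1 makes the ceiling explicit: as N grows, the channel-noise term in the denominator becomes negligible and the SNR saturates at σ_x²/⟨η_x²⟩. The numerical values are those used in Figure 3; the function name is an arbitrary choice of ours.

```python
def snr_noisy_input(N, sx2, eta2, xbar=0.5):
    # Equation 5.1: SNR with additive input noise of variance eta2
    return N * sx2 / (N * eta2 + xbar * (1.0 - xbar))

sx2, eta2 = 0.16**2, 0.04**2
snr_100 = snr_noisy_input(100, sx2, eta2)
snr_huge = snr_noisy_input(10**8, sx2, eta2)  # approaches sx2 / eta2 = 16
```

Because extra channels cannot push the SNR past this limit, input noise makes it efficient to operate with fewer channels, as Figure 3 shows.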
The additional noise η_x decreases the SNR and, consequently, the information transfer. For N → ∞, the SNR converges to σ_x²/⟨η_x²⟩ and thus sets an upper limit to the information transfer. Figure 3 shows that it is more efficient to operate at lower numbers of channels in the presence of additional signal noise.

5.2 Input-Independent Internal Noise. To exemplify internal noise, we consider an additional population of sodium channels that is not influenced by the input x but rather has a fixed open probability p0, though it contributes to the noise. Assuming a fixed ratio N/N′ between the total number of the original input-dependent sodium channels N and the total number of these new input-independent sodium channels N′, the voltage output is determined by the sum of open channels from both populations, k + k′. Because noise from both populations is uncorrelated, the SNR reads
$$ \mathrm{SNR} = \frac{N^2\sigma_x^2}{N^2\langle\eta_x^2\rangle + N\bar{x}(1-\bar{x}) + N'p_0(1-p_0)}, \qquad (5.2) $$
Figure 3: Efficiency I/E as a function of the number of signaling units, N, for the case of a noisy input signal (solid line). For comparison, we also replot the case where there is no input noise (dotted line), as in Figure 2A. For both curves, the variance of the input distribution equals σ_x² = 0.16², x̄ = 0.5, and b = 5000. The noise variance of the signal is ⟨η_x²⟩ = 0.04².
where N′p0(1 − p0) represents the noise variance of the input-independent population of channels. The efficiency of the signaling system is thus further decreased for all N, and the efficiency optimum, N_opt, is shifted to lower values. Other noise sources, such as leak channels and additional synaptic inputs, will also lower the SNR. Therefore, they will reduce efficiency and influence the optimum number of channels.

6 Efficiency in Systems Combining Several Mechanisms
Signaling mechanisms do not act in isolation; they are usually organized into systems in which one mechanism drives another, either within a cell or between cells. The relationship between information transfer and cost of each mechanism determines optimal patterns of investment in signaling units across the whole system, as we will demonstrate with some simple examples. First, consider two signaling mechanisms in series (see Figure 4). Cell 1 uses N1 channels to convert the input x into the output k1, which, in turn, drives cell 2, which uses N2 channels, to produce the output k2. From equation 2.3, we know that k1 = N1x + η1, where N1x is the signal and η1 is the noise generated by the random activity of channels. Because we define an input in terms of a probability distribution of signals, ranging from 0 to 1, the output k1 of cell 1 should be normalized by N1, so that the input to cell 2 is k1/N1. Note that, for simplicity, we are neglecting nonlinearities in signal transfer within a cell, as, for example, in neurotransmitter release. As a consequence, the mean open probability in both cells is the same, but its variance differs. The output of cell 2 is given by k2 = N2x2 + η2, where η2
Figure 4: Schematic view of the cell arrangement. The normalized output of cell 1 serves as input for cell 2 (x2 = k1/N1, x3 = k2/N2). Both cells are subject to channel noise η1 and η2, respectively.
is the additive channel noise of cell 2. Therefore, the information transfer from an input signal x, with mean x̄ and variance σ_x², to the output k2 can be approximated by Shannon's formula as

$$ I = \frac{1}{2}\log_2\left(1 + \frac{N_1 N_2\,\sigma_x^2}{(N_1 + N_2)\,\bar{x}(1-\bar{x})}\right). \qquad (6.1) $$
The cost of information transfer through the two-cell system is

$$ E = e_{ch1}\bar{x}N_1 + e_{ch2}\bar{x}N_2 + b_1 + b_2, \qquad (6.2) $$
where b1 and b2 are the fixed metabolic costs of the cells and e_ch1/2 x̄ the costs per open channel. If we introduce an effective number of channels, N_eff^i = N1N2/(N1 + N2), for the information transfer and N_eff^E = N1 + N2 for the metabolic cost, the equations for I and E correspond to those of the single-cell case, equations 4.2 and 3.4, respectively. For simplicity, the cost per open channel is set to unity for both cells. Because N_eff^E ≥ N_eff^i for all nonnegative N1 and N2, the information transfer increases more slowly with the number of channels than the cost, cutting down efficiency. Thus, a two-cell system requires more channels to transmit the same amount of information and is therefore less efficient than a single cell, even if the fixed cost of a cell in the two-cell model is only half the cost of the single cell. Consequently, in a metabolically efficient nervous system, serial connections should be made only when signal processing is required.
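The penalty of a serial connection can be made concrete with equations 4.2 and 6.1. In this sketch (illustrative parameter values; equal unit cost per open channel in both cells, as in the text), a two-cell chain transmits less information than a single cell using the same total number of channels and paying the same signaling cost:

```python
import math

SX2, XBAR = 0.16**2, 0.5  # input variance and mean, as in Figure 2

def info_single(N):
    # Equation 4.2: a single cell with N channels
    return 0.5 * math.log2(1.0 + N * SX2 / (XBAR * (1.0 - XBAR)))

def info_two_cell(N1, N2):
    # Equation 6.1: two cells in series; only N_eff = N1*N2/(N1+N2) counts
    n_eff = N1 * N2 / (N1 + N2)
    return 0.5 * math.log2(1.0 + n_eff * SX2 / (XBAR * (1.0 - XBAR)))

# 2000 channels split over a chain transmit less than 2000 in one cell,
# while the total open-channel (signaling) cost is the same
chain = info_two_cell(1000, 1000)
single = info_single(2000)
```

The chain of 1000 + 1000 channels behaves, information-wise, like a single cell with only 500 channels, which is the sense in which noise accumulates stage by stage.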
Furthermore, in an energy-efficient system, the cost of one mechanism influences the amount of energy that should be consumed by another. Such considerations are important when one set of signaling events is more costly than another. This can be demonstrated by incorporating the cost of an upstream mechanism into the fixed cost of a downstream mechanism (i.e., the cost of a mechanism includes the cost of providing it with a signal). Here we can define E as

$$ E = e_{ch2}\bar{x}N_2 + b_2^* \quad\text{with}\quad b_2^* = e_{ch1}\bar{x}N_1 + b_1 + b_2. \qquad (6.3) $$
As we have seen with one mechanism, an increase in fixed cost raises the optimal number of channels. Therefore, when the cost of an upstream mechanism is high, one improves energy efficiency by using more units downstream. The precise pattern of investment required for optimal performance will depend on the relative costs (fixed and signaling) of every mechanism (e.g., channels) and the characteristics of the input signal.

7 Limits to the Achievable Efficiency
Information transfer and efficiency depend on the distribution of input signals. In the previous sections, we have considered gaussian inputs. We now calculate the input distribution that maximizes information transfer I given a limited energy budget E and a particular number of channels N. The efficiency I/E reached gives an upper bound on the efficiency the system can achieve for given E and N. Although the nervous system has less influence on the distribution of external signals, it is able to shape its internal signals to optimize information transfer (Laughlin, 1981). The optimal input distribution, p_x^opt(x), and the maximum information transfer (the information capacity C_N) of a system with N stochastic units can be obtained by the Blahut-Arimoto algorithm (Arimoto, 1972; Blahut, 1972), which is described in further detail in section A.4. Given the noise distribution p(k|x), the algorithm yields a numerical solution to the optimization of the input distribution p_x(x), maximizing the information transfer and minimizing the metabolic cost. This algorithm has been applied by Balasubramanian et al. (2001) and de Polavieja (in press) to study the metabolic efficiency of spike coding.

7.1 Optimal Input Distributions. For a given number of channels N, a given fixed cost b, and a given cost function depending on the input, ẽ(x), the energy E = N⟨ẽ(x)⟩ + b used by the system depends exclusively on the input distribution p_x(x). If the available energy budget is sufficiently large, energy constraints do not influence the shape of the optimal input distribution, and the information capacity C_N of the system reaches its maximum (see Figure 5E, point A). The optimum input distribution turns out to be symmetrical, with inputs from the midregion around x = 0.5 drawn less often, whereas inputs at the boundaries 0 and 1 are preferred (see Figure 5A).

Figure 5: (A)–(D) Optimal input distributions for different energy budgets E with b = 0 and N = 100. The distributions were calculated numerically and are discretized to a resolution Δx = 0.01. All distributions show small values for inputs around x = 0.5, where the noise variance σ²_{k|x} is highest. With decreasing energy budget, the distributions become less symmetrical, preferring low inputs. (E) Information capacity C_{N=100} depending on the energy budget (for b = 0). The points mark the location of the input distributions shown in A–D in the energy-capacity space.

This result is very intuitive when we take the dependence of the noise variance, σ²_{k|x}, as defined in equation 2.2, on the input x into account. The noise variance is symmetrical as well, showing a maximum at x = 0.5 and falling off toward x = 0 and x = 1. Thus, an input distribution that is optimal from the point of view of information transfer, leaving metabolic considerations aside for a moment, favors less noisy input values over noisier ones.
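A bare-bones version of the cost-constrained Blahut-Arimoto iteration can reproduce these qualitative features. The sketch below is a textbook-style implementation, not the authors' code: it maximizes I − sE with information measured in nats, the per-input cost taken proportional to x, and the grid resolution, N, iteration count, and trade-off parameter s all arbitrary choices of ours.

```python
import math

def binom_row(N, x):
    # Channel noise p(k|x): binomial count of open channels
    return [math.comb(N, k) * x**k * (1.0 - x) ** (N - k) for k in range(N + 1)]

def blahut_arimoto(channel, cost, s=0.0, iters=300):
    """Input distribution maximizing I - s*E (information in nats).

    channel[i][k] = p(k | x_i); cost[i] = energy of input x_i;
    s = 0 recovers the unconstrained information capacity.
    """
    n = len(channel)
    px = [1.0 / n] * n
    for _ in range(iters):
        # Current output marginal p_k(k)
        pk = [sum(px[i] * channel[i][k] for i in range(n))
              for k in range(len(channel[0]))]
        w = []
        for i in range(n):
            # KL divergence of p(k|x_i) from the output marginal
            d = sum(p * math.log(p / pk[k])
                    for k, p in enumerate(channel[i]) if p > 0.0)
            w.append(px[i] * math.exp(d - s * cost[i]))
        z = sum(w)
        px = [v / z for v in w]
    return px

N = 20
xs = [i / 20 for i in range(1, 20)]          # input grid 0.05 .. 0.95
channel = [binom_row(N, x) for x in xs]
cost = xs                                    # cost per input proportional to x

px_free = blahut_arimoto(channel, cost, s=0.0)   # generous energy budget
px_tight = blahut_arimoto(channel, cost, s=2.0)  # energy at a premium
```

As in Figure 5, the unconstrained solution puts little weight around x = 0.5, where the binomial noise variance peaks, and tightening the budget shifts probability mass toward the cheap, low inputs.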
Figure 6: (A) Capacity C_{N=100} (of Figure 5E) as a function of energy E for a fixed cost b = 20 (solid line). The capacity curve is simply shifted along the energy axis by the value of b. The efficiency C_N/E as a function of E is shown as a dash-dotted line. The maximum efficiency (C_N/E)_max is given at the point of the C_N(E) curve whose tangent (dashed line) intersects with the origin. (B) Achievable efficiency (C_N/E)_max as a function of N for a fixed cost b = 200 (solid line). For comparison, the efficiency I/E for a gaussian input distribution (x̄ = 0.5, σ_x = 0.16) is also shown (dashed line). The inset depicts the optimal number of channels N_opt as a function of b.
Limiting the energy that can be used by a system of N units, however, changes the optimal input distribution, p_x(x), and destroys the symmetry. As we reduce the energy budget, values neighboring x = 0 are increasingly preferred and the costly values approaching x = 1 are avoided (see Figures 5B–5D). This asymmetry reduces the information capacity C_N. Thus, efficient use of a restricted budget requires a cell to keep most of its units deactivated. For our simple model, this is equivalent to maintaining the membrane close to resting potential by keeping most of its sodium channels closed. The fixed cost, b, is a metabolic cost independent of the value of the input x and cannot be avoided. Consequently, it does not influence the shape of the energy-capacity curve of a system. Adding the fixed cost b results merely in a horizontal translation of the energy-capacity curve (see Figure 6A). Here as well, the shape of the input distribution changes with the value of E.

7.2 Efficiency. Having obtained the dependence of the information capacity on the energy used for a given N, we can also derive the energy efficiency C_N/E as a function of the energy E. Of particular interest is the maximum value (C_N/E)_max = max_E {C_N/E}, giving the optimal efficiency for a fixed value of N, which is achieved by a specific input distribution p_x(x). Note that this efficiency gives the upper bound to the achievable efficiency in our system and therefore cannot be surpassed by any other input distribution p_x(x).
At the maximum of C_N/E, the first derivative with respect to E is zero,

$$ \frac{\partial}{\partial E}\left(\frac{C_N}{E}\right) = 0, \qquad (7.1) $$

which can be transformed to give

$$ \frac{\partial C_N}{\partial E}\, E = C_N. \qquad (7.2) $$
Thus, geometrically, the maximum value (C_N/E)_max corresponds to the point on the capacity graph whose tangent intersects the origin (see Figure 6A). The slope of the tangent is given by (C_N/E)_max itself, so that the optimal efficiency (C_N/E)_max decreases with increasing b, as can be inferred from Figure 6A by shifting the C_N(E) curve to the right. Figure 6B shows the optimal efficiency (C_N/E)_max as a function of the number of channels N for a fixed cost b = 200. For comparison, we also show the efficiency I/E obtained for a gaussian input distribution. The optimal input distribution surpasses the gaussian distribution roughly by a factor of two in this case. In conclusion, as with gaussian input distributions, we can derive the most efficient number of channels N_opt as a function of the fixed cost. The use of optimal input distributions reduces N_opt but, as with gaussian inputs, N_opt rises approximately linearly with the basic cost, b, as illustrated in the inset of Figure 6B.

8 Discussion
A growing number of studies of neural coding suggest that the consumption of metabolic energy constrains neural function (Laughlin, 2001). Comparative studies indicate that the mammalian brain is an expensive tissue whose evolution, development, and function have been shaped by the availability of metabolic energy (Aiello & Wheeler, 1995; Martin, 1996). The human brain accounts for about 20% of an adult's resting metabolic rate. In children, this proportion can reach 50%, and in electric fish 60% (Rolfe & Brown, 1997; Nilsson, 1996). Because much of this energy is used to generate and transmit signals (Siesjö, 1978; Ames, 2000), these levels of consumption have the potential to constrain neural computation by placing an upper bound on synaptic drive and spike rate (Attwell & Laughlin, 2001). Current evidence suggests that the advantages of an energy-efficient nervous system are not confined to larger animals with highly evolved brains. In general, the specific metabolic rate of brains is 10 to 30 times the average for the whole animal at rest (Lutz & Nilsson, 1994). In addition, for both vertebrates and insects, the metabolic demands of the brain are more acute in smaller animals because the ratio of brain mass to total body mass decreases as body mass increases (Martin, 1996; Kern, 1985). Moreover, insect species with similar body masses exhibit large differences in the total mass of the brain and in the masses or relative volumes of different brain areas, and these differences correlate with behavior and ecology (Kern, 1985; Gronenberg & Liebig, 1999; Laughlin et al., 2000). Significant changes have been observed among individuals of a single species. In the ant Harpegnathos, the workers are usually visually guided foragers. However, when young workers are inseminated and begin to lay eggs following the death of their queen, their optic lobes are reduced by 20% (Gronenberg & Liebig, 1999). These observations suggest that the reduction of neural energy consumption is also a significant factor in the evolution of small brains. The relationship between energy consumption and the ability of neurons to transmit information suggests that nervous systems have evolved a number of ways of increasing energy efficiency. These methods include redundancy reduction (Laughlin et al., 1998), the mix of analog and digital operations found in cortical neurons (Sarpeshkar, 1998), appropriate distributions of interspike intervals (Baddeley et al., 1997; Balasubramanian et al., 2001; de Polavieja, 2001), and distributed spike codes (Levy & Baxter, 1996). We have extended these previous theoretical investigations to a level that is more elementary than the analysis of signaling with spikes: the representation of information by populations of ion channels. Thus, as independently advocated by Abshire and Andreou (2001), we have analyzed the energy efficiency of information transmission at the level of its implementation by molecular mechanisms. We have estimated the amount of information transmitted by a population of voltage-insensitive channels when their open probability is determined by a defined input. These channels typify the general case of signaling with stochastic events.
Consequently, our analysis is also applicable to many other forms of molecular signaling (e.g., the binding of a ligand to a receptor) and to synaptic transmission (e.g., the release of a synaptic vesicle according to Poisson or binomial statistics). Our theoretical results verify a well-known trend: the amount of information carried by a population of channels increases with the size of the population because random fluctuations are averaged out, as observed in blowfly photoreceptors and their synapses (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Laughlin et al., 1998; Abshire & Andreou, 2000) and demonstrated by models of synaptic transmission to cortical neurons (Zador, 1998). However, increasing the number of channels in the population increases both the level of redundancy and the energy used for transmission, leading to changes in metabolic efficiency (Laughlin et al., 1998). Following Levy and Baxter (1996), we have chosen to discuss efficiency as the ratio between the number of bits of information transmitted and the energy consumed. However, our analysis also provides a mathematical framework to describe energy efficiency from the more general point of view of maximizing information transfer and minimizing the metabolic cost
(maximizing I − sE, where s describes the importance of energy minimization over information maximization), as briefly outlined in section A.3. We distinguish two energy costs: the cost of generating signals and the fixed cost of keeping a signaling system in a state of readiness. The signaling cost is derived from the current flow through ion channels. Under the assumptions of the model, the signaling cost increases with the number of channels. This simple linear relationship can be easily applied to other forms of signaling, such as protein phosphorylation or the turnover of transmitter and second messenger. In the absence of data on the costs of constructing and maintaining a population of channels in a membrane, we again follow Levy and Baxter (1996). The fixed cost is expressed in the same units as the signaling cost and is varied to establish its effect on efficiency. The analysis demonstrates that energy efficiency is maximized by using a specific number of channels. These optima depend on a number of important factors: the fixed cost of the system, the cost of signaling, the reliability of the input, the amount of noise generated by other intrinsic mechanisms, the cost of upstream and downstream signaling mechanisms, and the distribution of the input signals provided by upstream mechanisms. Each of these factors is involved in neural processing. The fixed cost of building and maintaining the cell in a physiological state within which the ion channels operate is a dominant factor. When the fixed cost increases, the optimum system increases the energy invested in signaling by increasing the number of channels (see Figure 2A). Levy and Baxter (1996) discovered the same effect in their analysis of energy-efficient spike trains.
This may well be a general property of energy-efficient systems because when a signaling system is expensive to make and maintain (i.e., the ratio between fixed and signaling costs is high), it pays to increase the return on the fixed investment by transmitting more bits. This takes more channels and more signaling energy. For the example shown, which operates with a broad input distribution and a mean open probability of 50%, the optimum population of channels has a peak energy consumption (all channels open) that is approximately half the fixed cost (see Figure 2A). The relationship between the energy spent on signaling by the channels and the fixed cost varies with the distribution of inputs. For input distributions that make a reasonably broad use of possible open probabilities, the ratio between signaling costs and fixed costs lies approximately in the range between 1:4 and 1:1. It is difficult to judge whether populations of neuronal ion channels follow this pattern because data about the ratio of fixed costs to signaling costs in single cells are not available. However, in more complicated systems, the proportion of energy devoted to signaling is in the predicted range. For the whole mammalian brain, signaling has been linked to approximately 50% of the metabolic rate (Siesjö, 1978), and in cortical gray matter this rises to 75% (Attwell & Laughlin, 2001). We are aware that there are a number of additional factors, not accounted for by our model, that will influence the ratio of signaling costs to fixed costs
in nervous systems. For example, our analysis underestimates the total energy usage by the brain because it is confined to a single operation: the generation of a voltage signal by channels. Within neural systems, signals must be transmitted over considerable distances, and various computations must be performed. These extra operations take extra energy. Along these lines, the transmission of signals along axons, in the form of action potentials, accounts for over 40% of the signaling cost of cortical gray matter (Attwell & Laughlin, 2001). Noise at the input reduces the optimum number of channels (see Figure 3) because it reduces the effect of channel fluctuations on the total noise power. There is some evidence that nervous systems reduce the number of signaling events in response to a less reliable input. In the blowfly retina, the SNR of chromatic signals is lower than that of achromatic signals, and a class of interneurons involved in the chromatic pathway uses fewer synapses than comparable achromatic interneurons (Anderson & Laughlin, 2000). Considering a chain of signaling mechanisms allows us to study networks where the output from one population of channels defines the SNR at the input of the next. As a result, when analog signals are transferred from one mechanism to another, the noise accumulates stage by stage (Sarpeshkar, 1998). The chain describes how this buildup of noise reduces metabolic efficiency. Given this reduction, an energy-efficient system should connect one population of channels (or synapses) to another in a serial manner only when information is actually processed, not when it is merely transmitted. Where signals must be repeatedly amplified to avoid attenuation, pulses should be used to resist the buildup of analog noise (Sarpeshkar, 1998). These design strategies are hypothetical.
The energy savings that are made by restricting the number of serial or convergent analog processes and converting analog signals to spikes (Sarpeshkar, 1998; Laughlin et al., 2000) have yet to be demonstrated in a neural circuit. Our analysis suggests that when transmission and processing involve several types of signaling events (e.g., calcium channels, synaptic vesicles, and ligand-gated channels at a synapse), it is advantageous to use more of the less expensive events and fewer of the more expensive. This distribution is analogous to the pattern of investment in the bones of a mammal's leg. Proximal bones are thicker than distal bones because they are less costly per unit mass (they move less during a stride). The thickening of distal bones is adjusted to optimize the ratio between the probability of fracture and the cost for the limb as a whole (Alexander, 1997). Finally, the probability distribution of the input signal has a large effect on efficiency. On an evolutionary timescale, as well as on the timescale of physiological adaptation processes, the way an external signal with specific statistical properties is transmitted could therefore be optimized by mapping the external signal distribution onto an efficient distribution of probabilities of channels to be open (which we call the input distribution). More
importantly, the internal signals passed on from one mechanism to the next could be shaped such that the signal distributions employed will enhance the efficiency of information transfer. The Blahut-Arimoto algorithm yields input distributions that optimize the amount of information transferred under a cost constraint and has been successfully applied to spike codes (Balasubramanian et al., 2001; de Polavieja, 2001). Our application shows how inputs can be mapped onto the probabilities of activating signaling molecules to maximize the metabolic efficiency of analog signaling. The improvement over gaussian inputs is greater than 50% and is achieved by two means. First, signaling avoids the midregion of activation probabilities where, according to binomial statistics, the noise variance is high. Second, signaling avoids the expensive situation of having a high probability of opening channels in favor of the energetically cheaper low-probability condition, similar to the result that a metabolically efficient spike code avoids high rates (Baddeley et al., 1997; Balasubramanian et al., 2001). Our analysis suggests that an efficient population of sodium and potassium channels usually operates close to resting potential, with most of its sodium channels closed, but infrequently switches to opening most of its sodium channels. In other words, there is a tendency toward using a combination of numerous small signals close to a low resting potential and less frequent voltage peaks. In conclusion, we have analyzed the energy efficiency of a simple biological system that represents analog signals with stochastic signaling events, such as ion channel activation. Optimum configurations depend on the basic physics that connects information to energy (the dependency of noise, redundancy, and cost on the number of signaling events) and basic economics (the role played by fixed costs in determining optimum levels of expenditure on throughput).
Given this fundamental basis, the principles that we have demonstrated are likely to apply to other systems. In particular, we have shown that energy efficiency is a property of both the component mechanism and the system in which it operates. To assess a single population of ion channels, we had to consider the fixed cost, the distribution of signal and noise in the input, and additional noise sources. After connecting two populations of ion channels in series, we had to add the relative costs of the two mechanisms and the noise fed from one population to the next to our list of factors. Energy efficiency is achieved by matching the mechanism to its signal or, for optimum input distributions, the signal to the mechanism. Matching is a characteristic of designs that maximize the quantity of information coded by single cells, regardless of cost. To achieve this form of efficiency, neurons exploit the adaptability of their cellular and molecular mechanisms (Laughlin, 1994). The extent to which the numbers of channels and synapses used by neurons, and their transfer functions, are regulated for metabolic efficiency remains to be seen. The analysis presented here provides a starting point that can guide further experimental and theoretical work.
S. Schreiber, C. K. Machens, A. V. M. Herz, and S. B. Laughlin
Appendix

A.1 Energetic Cost of Inputs. For the channel model, the average energetic cost per unit time e(x, N) is a function of the input x and the number of sodium channels N. The sodium and potassium currents, i_Na and i_K, depend on the reversal potentials E_Na and E_K, the membrane potential V, as well as on the conductances of the sodium and potassium channels, respectively,

$$i_{Na}(x) = N g_{Na0}\, x\, (V(x) - E_{Na}), \tag{A.1}$$

$$i_K(x) = N_K g_{K0}\, p_K\, (V(x) - E_K). \tag{A.2}$$

The electrogenic pump extrudes three sodium ions and takes up two potassium ions for every ATP molecule hydrolyzed. The pump current i_pump equals i_K/2, assuming that the pump maintains the internal potassium concentration by accumulating potassium ions at a rate equal to the outward potassium current, i_K. Equating all currents across the membrane gives

$$i_{Na}(x) + i_K(x) + i_{pump}(x) = i_{Na}(x) + \tfrac{3}{2}\, i_K(x) = 0. \tag{A.3}$$

The energetic cost e(x, N) is proportional to the pump current, i_pump, so that we define

$$e(x, N) = c \cdot i_{pump}(x) \tag{A.4}$$

$$\phantom{e(x, N)} = cN \cdot \frac{g_{K0}\, p_K \left(\frac{N_K}{N}\right) g_{Na0}\, (E_{Na} - E_K)\, x}{3 g_{K0}\, p_K \left(\frac{N_K}{N}\right) + 2 g_{Na0}\, x}, \tag{A.5}$$

where c is the factor of proportionality. Because N_K/N is constant, we can separate e(x, N) into the variables N and x:

$$e(x, N) = N\, \tilde{e}(x). \tag{A.6}$$

The energy function \tilde{e}(x) can be written as

$$\tilde{e}(x) = C\, \frac{A B x}{A x + B}, \tag{A.7}$$

with the constants $A = 2 g_{Na0}$, $B = 3 g_{K0}\, p_K (N_K/N)$, and $C = c\,(E_{Na} - E_K)/6$. Because we define units of energy in this study such that $\tilde{e}(1) = e_{ch} = 1$, the rescaled energy function that is implemented in the Blahut-Arimoto algorithm reads

$$\tilde{e}(x) = \frac{(A + B)\, x}{A x + B}. \tag{A.8}$$

It does not depend on the values of the reversal potentials E_Na and E_K.
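The rescaled cost function of equation A.8 is simple to implement; a minimal sketch in Python (the parameter values g_Na0 = g_K0 = 20 pS, p_K = 0.5, and N_K/N = 0.5 are those quoted later in section A.4):

```python
# Rescaled energy cost per channel, eq. A.8: e(x) = (A + B) x / (A x + B),
# with A = 2 g_Na0 and B = 3 g_K0 p_K (N_K / N).
def energy_per_channel(x, g_na0=20e-12, g_k0=20e-12, p_k=0.5, nk_over_n=0.5):
    a = 2.0 * g_na0
    b = 3.0 * g_k0 * p_k * nk_over_n
    return (a + b) * x / (a * x + b)

# The normalization fixes the cost of a fully open population to one:
print(energy_per_channel(1.0))   # 1.0 by construction
print(energy_per_channel(0.0))   # 0.0: no open sodium channels, no pump current
```

Note that the cost rises concavely with x: half-open channels cost more than half as much as fully open ones, which is one reason efficient input distributions favor low opening probabilities.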
A.2 Numerical Validation. The signaling system with stochastic units operates under additive binomial noise (caused by the random activation and deactivation of the units), whose variance depends on the input x (see equation 2.2). In order to validate the approximation of the information transfer in such a model by equation 4.2, which, strictly speaking, applies only to systems with additive gaussian noise of a fixed, input-independent variance, we performed numerical tests. To this end, the input x was discretized into 1000 equispaced values between zero and one. The information transfer was then calculated according to equation 3.1. The noise distribution p(k|x) was assumed to be binomial as in equation 2.1, and the output distribution p_k(k) was calculated using $p_k(k) = \sum_x p(k|x)\, p_x(x)$. We found that the information transfer in the stochastic unit system is well approximated by equation 4.2, as shown in Figure 7. For very broad input distributions (e.g., σ_x = 0.16), where the approximation error is biggest, the deviations are well below 4% for N > 100. The relative error is significantly smaller for narrower distributions. The largest deviation below N = 100 occurs for N = 1 and is less than 11% for broad input distributions p_x(x). A finer discretization into 10,000 input values gives similar results.

A.3 Efficiency. The most general approach to treating energy as a constraint is the maximization of I − sE, where s is a trade-off parameter between energy and information. The higher s, the tighter the energy budget. This is a Lagrangian multiplier problem, where for a given, fixed energy, the optimal information transfer I and number of channels N have to be determined. However, both the information transfer I and the metabolic cost E depend on only one variable: the number of channels N. So if the energy is fixed, N and I are fixed too.
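The exact information transfer used in the validation of section A.2 can be computed directly from the discretized channel; a minimal sketch of the procedure (with a coarser discretization than the 1000 values used in the paper, to keep it fast — the function name and defaults are illustrative):

```python
import math

def mutual_information(N=50, x_mean=0.5, x_sigma=0.16, n_x=200):
    """I(X; K) in bits for a channel with K ~ Binomial(N, x) and a
    truncated, renormalized gaussian input distribution p_x(x) on (0, 1)."""
    xs = [(i + 0.5) / n_x for i in range(n_x)]
    px = [math.exp(-(x - x_mean) ** 2 / (2 * x_sigma ** 2)) for x in xs]
    z = sum(px)
    px = [p / z for p in px]

    # Conditional distributions p(k | x): binomial activation noise.
    def binom_pmf(k, n, p):
        return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

    pk_given_x = [[binom_pmf(k, N, x) for k in range(N + 1)] for x in xs]
    # Output distribution p_k(k) = sum_x p(k|x) p_x(x).
    pk = [sum(px[i] * pk_given_x[i][k] for i in range(n_x)) for k in range(N + 1)]

    info = 0.0
    for i in range(n_x):
        for k in range(N + 1):
            joint = px[i] * pk_given_x[i][k]
            if joint > 0 and pk[k] > 0:
                info += joint * math.log2(pk_given_x[i][k] / pk[k])
    return info

print(mutual_information())
```

Comparing this exact value against the gaussian-noise approximation for increasing N would reproduce the convergence behavior reported in Figure 7.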
Thus, each point on the curve depicting energy efficiency as a function of the number of channels N (as shown in Figure 2) corresponds to the solution of the general I − sE optimization problem for a particular trade-off parameter s. From these solutions, we have chosen to pay special attention to the ratio I/E. This approach becomes more obvious for the optimization of input distributions with the Blahut-Arimoto algorithm, where I − sE is optimized explicitly before concentrating on the subset of distributions maximizing the ratio I/E.

A.4 Information Capacity and Blahut-Arimoto Algorithm. The Blahut-Arimoto algorithm maximizes the function

$$C[N; s] = \max_{p_x(\cdot)} \big[ I[N; p_x(\cdot)] - s\, E[N; p_x(\cdot)] \big], \tag{A.9}$$
where s is a trade-off parameter that gives the relative importance of energy minimization versus information maximization. The input distribution p_x(x) needs to be discretized. Given an initial input distribution $p_x^0(x)$, the set of conditional probabilities p(k|x), and the expense for each input
Figure 7: Numerically derived information transfer as a function of the number of channels N for the model of stochastic units (filled diamonds) in comparison to the information transfer specified by Shannon's formula (gray stars) in equation 4.2. The largest relative difference ΔI is depicted for each set of inputs. The input distributions p_x(x) are shown in the insets. (A) Information transfer for a broad gaussian input distribution p_x(x) with σ_x = 0.16 and mean 0.5 for very small numbers of channels N. (B) The same for a narrow input distribution p_x(x) with σ_x = 0.01 and mean 0.9. (C) Information transfer for large N and a broad input distribution with parameters as in A. (D) Information transfer for large N and the narrow input distribution used in B. Shannon's formula gives a very good approximation of the information in the channel model. The quality of the approximation increases with rising N and decreasing variance of the input distributions.
symbol, the algorithm iteratively calculates the distribution p_x(x) that maximizes C[N; s]. In every step of the iterative algorithm, upper and lower bounds on the information capacity can be derived (Blahut, 1972), which help to estimate the quality of the current probability set. For the numerical calculations, we discretized the input ($x_j = j/200$ with $j = 0, \ldots, 200$, $x_j \in [0, 1]$). The input distributions maximizing information transfer were calculated for N = 10, 20, ..., 1000 and 26 values of the parameter s, ranging from 0 (no energy restriction) to 20 (high energy constraint). The cost $\tilde{e}(x)$ was assumed to depend on x according to equation A.8, with $g_{Na0} = 20\,\mathrm{pS}$, $g_{K0} = 20\,\mathrm{pS}$, $p_K = 0.5$, and $N_K/N = 0.5$. The resulting 26 points in C-E space for a given N were subject to a cubic spline interpolation (Figure 5E shows four points and the respective interpolation curve for the
whole set of s; N = 100 there). Afterward, the energy was scaled by multiplication with N according to equation A.6. Throughout the study, the fixed cost is always stated in units of the cost of one open channel (i.e., b = 500 corresponds to the cost of 500 open channels). The optimal number of stochastic units, N_opt, for a given fixed cost b was determined by choosing, among all calculated N, the one for which the efficiency C/E for that b was maximal. Since those values of N are multiples of 10, the data were subsequently fitted with a function f of the form $f(x) = a_1 x^{a_2}$ (a_1 and a_2 being parameters) in order to obtain a smooth graphical representation.

Acknowledgments
We thank Gonzalo Garcia de Polavieja for help and advice, as well as John White and Aldo Faisal for comments on the manuscript. This work is supported by the Daimler-Benz Foundation, the DFG (Graduiertenkolleg 120, Graduiertenkolleg 268, and ITB), the BBSRC, and the Rank Prize Fund.

References

Abshire, P., & Andreou, A. G. (2000). Relating information capacity to a biophysical model for blowfly photoreceptors. Neurocomputing, 32, 9–16.
Abshire, P., & Andreou, A. G. (2001). Capacity and energy cost of information in biological and silicon photoreceptors. Proceedings of the IEEE, 89(7), 1052–1064.
Aiello, L. C., & Wheeler, P. (1995). The expensive tissue hypothesis: The brain and the digestive system in human and primate evolution. Curr. Anthropol., 36, 199–221.
Alexander, R. M. (1997). A theory of mixed chains applied to safety factors in biological systems. J. Theor. Biol., 184, 247–252.
Ames, A. (2000). CNS energy metabolism as related to function. Brain Res. Rev., 34, 42–68.
Anderson, J. C., & Laughlin, S. B. (2000). Photoreceptor performance and the coordination of achromatic and chromatic inputs in the fly visual system. Vision Research, 40, 13–31.
Arimoto, S. (1972). An algorithm for computing the capacity of an arbitrary discrete memoryless channel. IEEE Trans. on Info. Theory, IT-18, 14–20.
Attwell, D., & Laughlin, S. B. (2001). An energy budget for signalling in the grey matter of the brain. J. Cereb. Blood Flow Metab., 21, 1133–1145.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783.
Balasubramanian, V., Kimber, D., & Berry, M. J. I. (2001). Metabolically efficient information processing. Neural Comput., 13, 799–816.
Blahut, R. E. (1972). Computation of channel capacity and rate distortion functions. IEEE Trans. on Info. Theory, IT-18, 460–473.
de Polavieja, G. G. (in press). Errors drive the evolution of biological signalling to costly codes. J. Theor. Biol.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Gronenberg, W., & Liebig, J. (1999). Smaller brains and optic lobes in reproductive workers of the ant Harpegnathos. Naturwiss., 86, 343–345.
Kern, M. F. (1985). Metabolic rate and the insect brain in relation to body size and phylogeny. Comp. Biochem. Physiol., 81A, 501–506.
Laughlin, S. B. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36c, 910–912.
Laughlin, S. B. (1989). The reliability of single neurons and circuit design: A case study. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The computing neuron (pp. 322–336). Reading, MA: Addison-Wesley.
Laughlin, S. B. (1994). Matching coding, circuits, cells, and molecules to signals: General principles of retinal design in the fly's eye. Prog. Ret. Eye Res., 13, 165–196.
Laughlin, S. B. (2001). Energy as a constraint on the coding and processing of sensory information. Curr. Opin. Neurobiol., 11, 475–480.
Laughlin, S. B., Anderson, J. C., O'Carroll, D. C., & de Ruyter van Steveninck, R. R. (2000). Coding efficiency and the metabolic cost of sensory and neural information. In R. Baddeley, P. Hancock, & P. Foldiak (Eds.), Information theory and the brain (pp. 41–61). Cambridge: Cambridge University Press.
Laughlin, S. B., de Ruyter van Steveninck, R. R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nat. Neurosci., 1(1), 36–41.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Comput., 8(3), 531–543.
Lutz, P., & Nilsson, G. E. (1994). The brain without oxygen. Austin: R. G. Landes.
Martin, R. D. (1996). Scaling of the mammalian brain—the maternal energy hypothesis. News in Physiol. Sci., 11, 149–156.
Nilsson, G. E. (1996). Brain and body oxygen requirements of Gnathonemus petersii, a fish with an exceptionally large brain. J. Exp. Biol., 199, 603–607.
Rolfe, D. F. S., & Brown, G. C. (1997). Cellular energy utilization and molecular origin of standard metabolic rate in mammals. Physiol. Rev., 77, 731–758.
Sarpeshkar, R. (1998). Analog versus digital: Extrapolating from electronics to neurobiology. Neural Comput., 10, 1601–1638.
Schreiber, S. (2000). Influence of channel noise and metabolic cost on neural information transmission. Unpublished diploma thesis available at the library of Humboldt-University Berlin, Germany.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Siesjö, B. (1978). Brain energy metabolism. New York: Wiley.
White, J. A., Rubinstein, J. T., & Kay, A. R. (2000). Channel noise in neurons. TINS, 23(3), 131–137.
Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophysiol., 79, 1219–1229.

Received July 23, 2001; accepted October 2, 2001.
LETTER
Communicated by Andrew Barto
Multiple Model-Based Reinforcement Learning Kenji Doya [email protected] Human Information Science Laboratories, ATR International, Seika, Soraku, Kyoto 619-0288, Japan; CREST, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan; Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan; and Nara Institute of Science and Technology, Ikoma, Nara 630-0101, Japan Kazuyuki Samejima [email protected] Human Information Science Laboratories, ATR International, Seika, Soraku, Kyoto 619-0288, Japan, and Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan Ken-ichi Katagiri [email protected] ATR Human Information Processing Research Laboratories, Seika, Soraku, Kyoto 619-0288, Japan, and Nara Institute of Science and Technology, Ikoma, Nara 630-0101, Japan Mitsuo Kawato [email protected] Human Information Science Laboratories, ATR International, Seika, Soraku, Kyoto 619-0288, Japan; Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan; and Nara Institute of Science and Technology, Ikoma, Nara 630-0101, Japan We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The “responsibility signal,” which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules, as well as to gate the learning of the prediction models and the reinforcement learning controllers. 
Neural Computation 14, 1347–1369 (2002) © 2002 Massachusetts Institute of Technology

We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL was demonstrated for the discrete case
in a nonstationary hunting task in a grid world and for the continuous case in a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters.
1 Introduction
A big issue in the application of reinforcement learning (RL) to real-world control problems is how to deal with nonlinearity and nonstationarity. For a nonlinear, high-dimensional system, the conventional discretizing approach necessitates a huge number of states, which makes learning very slow. Standard RL algorithms can perform badly when the environment is nonstationary or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh, 1992; Dayan & Hinton, 1993; Littman, Cassandra, & Kaelbling, 1995; Wiering & Schmidhuber, 1998; Parr & Russell, 1998; Sutton, Precup, & Singh, 1999; Morimoto & Doya, 2001). The basic problem in modular or hierarchical RL is how to decompose a complex task into simpler subtasks. This article presents a new RL architecture based on multiple modules, each composed of a state prediction model and an RL controller. With this architecture, a nonlinear or nonstationary control task, or both, is decomposed in space and time based on the local predictability of the environmental dynamics. The mixture of experts architecture (Jacobs, Jordan, Nowlan, & Hinton, 1991) has been applied to nonlinear or nonstationary control tasks (Gomi & Kawato, 1993; Cacciatore & Nowlan, 1994). However, the success of such a modular architecture depends strongly on the capability of the gating network to decide which of the given modules should be recruited at any particular moment. An alternative approach is to provide each of the experts with a prediction model of the environment and to use the prediction errors for selecting the controllers. In Narendra, Balakrishnan, and Ciliz (1995), the model that makes the smallest prediction error among a fixed set of prediction models is selected, and its associated single controller is used for control. However, when the prediction models are to be trained with little prior knowledge, task decomposition is initially far from optimal.
Thus, the use of "hard" competition can lead to suboptimal task decomposition. Based on the Bayesian statistical framework, Pawelzik, Kohlmorgen, and Müller (1996) proposed the use of annealing in a "soft" competition network for time-series prediction and segmentation. Tani and Nolfi (1999) used a similar mechanism for hierarchical sequence prediction. The use of the softmax function for module selection and combination was originally proposed for a tracking control paradigm as the multiple paired forward-inverse models (MPFIM) (Wolpert & Kawato, 1998; Wolpert, Miall, & Kawato, 1998; Haruno, Wolpert, & Kawato, 1999). It was recently
Figure 1: Schematic diagram of the MMRL architecture.
reformulated as modular selection and identification for control (MOSAIC) (Wolpert & Ghahramani, 2000; Haruno, Wolpert, & Kawato, 2001). In this article, we apply the idea of a softmax selection of modules to the paradigm of reinforcement learning. The resulting learning architecture, which we call multiple model-based reinforcement learning (MMRL), learns to decompose a nonlinear or nonstationary task through the competition and cooperation of multiple prediction models and reinforcement learning controllers. In section 2, we formulate the basic MMRL architecture, and in section 3 we describe its implementation in the discrete-time and continuous-time cases, including multiple linear quadratic controllers (MLQC). We first test the performance of the MMRL architecture for the discrete case in a hunting task with multiple preys in a grid world (section 4). We also demonstrate the performance of MMRL for the continuous case in a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters (section 5).

2 Multiple Model-Based Reinforcement Learning
Figure 1 shows the overall organization of the MMRL architecture. It is composed of n modules, each of which consists of a state prediction model and a reinforcement learning controller. The basic idea of this modular architecture is to decompose a nonlinear or nonstationary task into multiple domains in space and time so that within each of the domains, the environmental dynamics is predictable. The
action output of the RL controllers, as well as the learning rates of both the predictors and the controllers, are weighted by the "responsibility signal," which is a gaussian softmax function of the errors in the outputs of the prediction models. The advantage of this module selection mechanism is that the areas of specialization of the modules are determined in a bottom-up fashion based on the nature of the environment. Furthermore, for each area of module specialization, the design of the control strategy is facilitated by the availability of the local model of the environmental dynamics. In the following, we consider a discrete-time, finite-state environment,

$$P(x(t) \mid x(t-1), u(t-1)) = F(x(t), x(t-1), u(t-1)), \quad (t = 1, 2, \ldots), \tag{2.1}$$

where $x \in \{1, \ldots, N\}$ and $u \in \{1, \ldots, M\}$ are discrete states and actions, and a continuous-time, continuous-state environment,

$$\dot{x}(t) = f(x(t), u(t)) + \nu(t), \quad (t \in [0, \infty)), \tag{2.2}$$
where $x \in \mathbb{R}^N$ and $u \in \mathbb{R}^M$ are state and action vectors, and $\nu \in \mathbb{R}^N$ is noise. Actions are given by a policy—either a stochastic one,

$$P(u(t) \mid x(t)) = G(u(t), x(t)), \tag{2.3}$$

or a deterministic one,

$$u(t) = g(x(t)). \tag{2.4}$$
The reward r(t) is given as a function of the state x(t) and the action u(t). The goal of reinforcement learning is to improve the policy so that more rewards are acquired in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the "value function" V(x) for each state and then to improve the policy based on the value function. We define the value function of the state x(t) under the current policy as

$$V(x(t)) = E\left[ \sum_{k=0}^{\infty} \gamma^k\, r(t + k) \right] \tag{2.5}$$

in the discrete case (Sutton & Barto, 1998) and

$$V(x(t)) = E\left[ \int_0^{\infty} e^{-s/\tau}\, r(t + s)\, ds \right] \tag{2.6}$$

in the continuous case (Doya, 2000), where $0 \le \gamma \le 1$ and $0 < \tau$ are the parameters for discounting future reward.
2.1 Responsibility Signal. The purpose of the prediction model in each module is to predict the next state (discrete time) or the temporal derivative of the state (continuous time) based on the observation of the state and the action. The responsibility signal $\lambda_i(t)$ (Wolpert & Kawato, 1998; Haruno et al., 1999, 2001) is given by the relative goodness of predictions of the multiple prediction models. For a unified description, we denote the new state in the discrete case as

$$y(t) = x(t) \tag{2.7}$$

and the temporal derivative of the state in the continuous case as

$$y(t) = \dot{x}(t). \tag{2.8}$$

The basic formula for the responsibility signal is given by Bayes' rule,

$$\lambda_i(t) = P(i \mid y(t)) = \frac{P(i)\, P(y(t) \mid i)}{\sum_{j=1}^{n} P(j)\, P(y(t) \mid j)}, \tag{2.9}$$
where P(i) is the prior probability of selecting module i and P(y(t) | i) is the likelihood of model i given the observation y(t). In the discrete case, the prediction model gives the probability distribution of the new state $\hat{x}(t)$ based on the previous state x(t − 1) and the action u(t − 1) as

$$P(\hat{x}(t) \mid x(t-1), u(t-1)) = F_i(\hat{x}(t), x(t-1), u(t-1)) \quad (i = 1, \ldots, n). \tag{2.10}$$

If there is no prior knowledge of module selection, we take the priors as uniform ($P(i) = 1/n$), and then the responsibility signal is given by

$$\lambda_i(t) = \frac{F_i(x(t), x(t-1), u(t-1))}{\sum_{j=1}^{n} F_j(x(t), x(t-1), u(t-1))}, \tag{2.11}$$

where x(t) is the newly observed state. In the continuous case, the prediction model gives the temporal derivative of the state:

$$\dot{\hat{x}}_i(t) = f_i(x(t), u(t)). \tag{2.12}$$
By assuming that the prediction error is gaussian with variance $\sigma^2$, the responsibility signal is given by the gaussian softmax function,

$$\lambda_i(t) = \frac{e^{-\frac{1}{2\sigma^2}\|\dot{x}(t) - \dot{\hat{x}}_i(t)\|^2}}{\sum_{j=1}^{n} e^{-\frac{1}{2\sigma^2}\|\dot{x}(t) - \dot{\hat{x}}_j(t)\|^2}}, \tag{2.13}$$

where $\dot{x}(t)$ is the observed state change.
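Equation 2.13 is a softmax over negative squared prediction errors; a minimal sketch (the module predictions and σ below are illustrative values, not from the paper's experiments):

```python
import math

def responsibility(x_dot, x_dot_hat, sigma):
    """Gaussian softmax responsibility signal, eq. 2.13.
    x_dot: observed state derivative (vector);
    x_dot_hat: list of each module's predicted derivative."""
    sq_err = [sum((o - p) ** 2 for o, p in zip(x_dot, pred)) for pred in x_dot_hat]
    # Subtract the smallest error before exponentiating for numerical stability.
    e_min = min(sq_err)
    w = [math.exp(-(e - e_min) / (2 * sigma ** 2)) for e in sq_err]
    z = sum(w)
    return [wi / z for wi in w]

# The module whose prediction is closest receives the largest responsibility:
lam = responsibility([1.0, 0.0], [[0.9, 0.1], [2.0, -1.0]], sigma=0.5)
print(lam)  # lam[0] > lam[1], and the weights sum to one
```

The parameter σ acts as an annealing temperature: a large σ makes the competition between modules soft, while a small σ approaches a winner-take-all selection.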
2.2 Module Weighting by Responsibility Signal. In the MMRL architecture, the responsibility signal $\lambda_i(t)$ is used for four purposes: weighting the state prediction outputs, gating the learning of the prediction models, weighting the action outputs, and gating the learning of the reinforcement learning controllers:

State prediction: The outputs of the prediction models are weighted by the responsibility signal $\lambda_i(t)$. In the discrete case, the prediction of the next state is given by

$$P(\hat{x}(t)) = \sum_{i=1}^{n} \lambda_i(t)\, F_i(\hat{x}(t), x(t-1), u(t-1)). \tag{2.14}$$

In the continuous case, the predicted state derivative is given by

$$\dot{\hat{x}}(t) = \sum_{i=1}^{n} \lambda_i(t)\, \dot{\hat{x}}_i(t). \tag{2.15}$$
These predictions are used in model-based RL algorithms and also for the annealing of σ, as described later.

Prediction model learning: The responsibility signal $\lambda_i(t)$ is also used for weighting the parameter updates of the prediction models. In general, this is realized by scaling the error signal of prediction model learning by $\lambda_i(t)$.

Action output: The outputs of the reinforcement learning controllers are linearly weighted by $\lambda_i(t)$ to make the action output. In the discrete case, the probability of taking an action u(t) is given by

$$P(u(t)) = \sum_{i=1}^{n} \lambda_i(t)\, G_i(u(t), x(t)). \tag{2.16}$$

In the continuous case, the output is given by the interpolation of the modular outputs,

$$u(t) = \sum_{i=1}^{n} \lambda_i(t)\, u_i(t) = \sum_{i=1}^{n} \lambda_i(t)\, g_i(x(t)). \tag{2.17}$$
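In the continuous case, the composite command of equation 2.17 is simply a responsibility-weighted average of the module controllers' outputs; a minimal sketch with two hypothetical linear feedback controllers (gains and weights are illustrative):

```python
def mmrl_action(x, controllers, lam):
    """Responsibility-weighted action output, eq. 2.17: u = sum_i lam_i * g_i(x)."""
    return sum(l * g(x) for l, g in zip(lam, controllers))

# Two illustrative controllers specialized for different regimes:
g1 = lambda x: -1.0 * x       # gentle feedback gain
g2 = lambda x: -5.0 * x       # stiff feedback gain
print(mmrl_action(2.0, [g1, g2], lam=[0.75, 0.25]))  # 0.75*(-2) + 0.25*(-10) = -4.0
```

Because the responsibilities sum to one, the composite action interpolates smoothly between the specialists as the environment moves between their domains.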
Reinforcement learning: $\lambda_i(t)$ is also used for weighting the learning of the RL controllers. The actual equation for the parameter update varies with the choice of the RL algorithm; the variants are detailed in the next section. When a temporal difference (TD) algorithm (Barto, Sutton, & Anderson, 1983; Sutton, 1988; Doya, 2000) is used, the TD error,

$$\delta(t) = r(t) + \gamma V(x(t+1)) - V(x(t)) \tag{2.18}$$

in the discrete case and

$$\delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t) \tag{2.19}$$

in the continuous case, is weighted by the responsibility signal,

$$\delta_i(t) = \lambda_i(t)\, \delta(t), \tag{2.20}$$
for learning of the ith RL controller. Using the same weighting factor $\lambda_i(t)$ for training the prediction models and the RL controllers helps each RL controller learn an appropriate policy and its value function for the context under which its paired prediction model makes valid predictions.

2.3 Responsibility Predictors. When there is some prior knowledge or belief about module selection, we incorporate the "responsibility predictors" (Wolpert & Kawato, 1998; Haruno et al., 1999, 2001). By assuming that their outputs $\hat{\lambda}_i(t)$ are proportional to the prior probability of module selection, from equation 2.9, the responsibility signal is given by

$$\lambda_i(t) = \frac{\hat{\lambda}_i(t)\, P(y(t) \mid i)}{\sum_{j=1}^{n} \hat{\lambda}_j(t)\, P(y(t) \mid j)}. \tag{2.21}$$
In the modular decomposition of a task, it is desirable that modules do not switch too frequently. This can be enforced by incorporating responsibility priors based on the assumption of temporal continuity and spatial locality of module activation.

2.3.1 Temporal Continuity. The continuity of module selection is incorporated by taking the previous responsibility signal as the responsibility prediction signal. In the discrete case, we take the responsibility prediction based on the previous responsibility,

$$\hat{\lambda}_i(t) = \lambda_i(t-1)^{\alpha}, \tag{2.22}$$

where $0 < \alpha < 1$ is a parameter that controls the strength of the memory effect. From equations 2.21 and 2.22, the responsibility signal at time t is given by the product of the likelihoods of past module selection,

$$\lambda_i(t) = \frac{1}{Z(t)} \prod_{k=0}^{t} P(x(t-k) \mid i)^{\alpha^k}, \tag{2.23}$$

where Z(t) denotes the normalizing factor, that is, $Z(t) = \sum_{j=1}^{n} \prod_{k=0}^{t} P(x(t-k) \mid j)^{\alpha^k}$.
In the continuous case, we choose the prior

$$\hat{\lambda}_i(t) = \lambda_i(t - \Delta t)^{\alpha^{\Delta t}}, \tag{2.24}$$

where Δt is an arbitrarily small time difference (note that equation 2.24 coincides with equation 2.22 when Δt = 1). Since the likelihood of module i is given by the gaussian $P(\dot{x}(t) \mid i) = e^{-\frac{1}{2\sigma^2}\|\dot{x}(t) - \dot{\hat{x}}_i(t)\|^2}$, from the recursion as in equation 2.23, the responsibility signal at time t is given by

$$\lambda_i(t) = \frac{1}{Z(t)} \prod_{k=0}^{t/\Delta t} P(\dot{x}(t - k\Delta t) \mid i)^{\alpha^{k\Delta t}\, \Delta t} = \frac{1}{Z(t)}\, e^{-\frac{1}{2\sigma^2}\, \Delta t \sum_{k=0}^{t/\Delta t} \|\dot{x}(t - k\Delta t) - \dot{\hat{x}}_i(t - k\Delta t)\|^2\, \alpha^{k\Delta t}}, \tag{2.25}$$
that is, a gaussian softmax function of temporally weighted squared errors. In the limit $\Delta t \to 0$, equation 2.25 can be represented as

$$\lambda_i(t) = \frac{e^{-\frac{1}{2\sigma^2} E_i(t)}}{\sum_{j=1}^{n} e^{-\frac{1}{2\sigma^2} E_j(t)}}, \tag{2.26}$$

where $E_i(t)$ is a low-pass filtered prediction error,

$$\dot{E}_i(t) = \log \alpha \cdot E_i(t) + \|\dot{x}(t) - \dot{\hat{x}}_i(t)\|^2. \tag{2.27}$$
The use of this low-pass filtered prediction error for responsibility prediction is helpful in avoiding chattering of the responsibility signal (Pawelzik et al., 1996).

2.3.2 Spatial Locality. In the continuous case, we consider a gaussian spatial prior,

$$\hat{\lambda}_i(t) = \frac{e^{-\frac{1}{2}(x(t) - c_i)' M_i^{-1} (x(t) - c_i)}}{\sum_{j=1}^{n} e^{-\frac{1}{2}(x(t) - c_j)' M_j^{-1} (x(t) - c_j)}}, \tag{2.28}$$

where $c_i$ is the center of the area of specialization, $M_i$ is a covariance matrix that specifies its shape, and $'$ denotes transpose. These parameters are updated so that they approximate the distribution of the input state x(t) weighted by the responsibility signal,

$$\dot{c}_i = \eta_c\, \lambda_i(t)\, (-c_i + x(t)), \tag{2.29}$$

$$\dot{M}_i = \eta_M\, \lambda_i(t)\, [-M_i + (x(t) - c_i)(x(t) - c_i)'], \tag{2.30}$$

where $\eta_c$ and $\eta_M$ are update rates.
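The low-pass filter of equation 2.27 can be discretized with an Euler step of size Δt; a sketch that updates the filtered errors and the resulting responsibilities of equation 2.26 (all numerical values are illustrative):

```python
import math

def update_responsibility(E, errors, dt, alpha, sigma):
    """One Euler step of eq. 2.27, dE_i/dt = log(alpha) * E_i + ||err_i||^2,
    followed by the gaussian softmax of eq. 2.26."""
    E = [e + dt * (math.log(alpha) * e + err ** 2) for e, err in zip(E, errors)]
    w = [math.exp(-e / (2 * sigma ** 2)) for e in E]
    z = sum(w)
    return E, [wi / z for wi in w]

# A module with persistently small prediction error accumulates a small E_i
# and therefore wins the responsibility competition without chattering:
E = [0.0, 0.0]
for _ in range(100):
    E, lam = update_responsibility(E, errors=[0.1, 1.0], dt=0.01,
                                   alpha=0.5, sigma=0.5)
print(lam)  # lam[0] > lam[1]
```

Because log α < 0, each E_i relaxes toward a running average of recent squared errors, which is what suppresses rapid switching between modules.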
3 Implementation of MMRL Architecture
For the RL controllers of MMRL, it is generally possible to use model-free RL algorithms, such as actor-critic and Q-learning. However, because the prediction models of the environmental dynamics are intrinsic components of the architecture, it is advantageous to use these prediction models not just for module selection but also for designing the RL controllers. In the following, we describe the use of model-based RL algorithms for the discrete-time and continuous-time cases. One special implementation for the continuous-time case is the use of multiple linear quadratic controllers derived from linear dynamic models and quadratic reward models.

3.1 Discrete-Time MMRL. Now we consider the implementation of the MMRL architecture for discrete-time, finite-state, and finite-action problems. The standard way of using a predictive model in RL is to use it for action selection by the one-step search,

$$u(t) = \arg\max_u E[\hat{r}(x(t), u) + \gamma V(\hat{x}(t+1))], \tag{3.1}$$
where $\hat{r}(x(t), u)$ is the predicted immediate reward and $\hat{x}(t+1)$ is the next state predicted from the current state x(t) and a candidate action u. In order to implement this algorithm, we provide each module with a reward model $\hat{r}_i(x, u)$, a value function $V_i(x)$, and a dynamic model $F_i(\hat{x}, x, u)$. Each candidate action u is then evaluated by

$$q(x(t), u) = E[\hat{r}(x(t), u) + \gamma V(\hat{x}(t+1)) \mid u] = \sum_{i=1}^{n} \lambda_i(t) \left[ \hat{r}_i(x(t), u) + \gamma \sum_{\hat{x}=1}^{N} V_i(\hat{x})\, F_i(\hat{x}, x(t), u) \right]. \tag{3.2}$$
For the sake of exploration, we use a stochastic version of the greedy action selection, equation 3.1, where the action u(t) is selected from a Gibbs distribution,

$$P(u \mid x(t)) = \frac{e^{\beta q(x(t), u)}}{\sum_{u'=1}^{M} e^{\beta q(x(t), u')}}, \tag{3.3}$$

where β controls the stochasticity of action selection. The parameters are updated by the error signals weighted by the responsibility signal: $\lambda_i(t)\,(F_i(j, x(t-1), u(t-1)) - c(j, x(t)))$ for the dynamic model ($j = 1, \ldots, N$; $c(j, x) = 1$ if $j = x$ and zero otherwise), $\lambda_i(t)\,(\hat{r}_i(x(t), u(t)) - r(t))$ for the reward model, and $\lambda_i(t)\,\delta(t)$ for the value function model.

3.2 Continuous-Time MMRL. Next we consider a continuous-time MMRL architecture. A model-based RL algorithm for a continuous-time,
1356
Kenji Doya et al.
continuous-state system (see equation 2.2) is derived from the Hamilton-Jacobi-Bellman (HJB) equation,

\frac{1}{\tau} V(x(t)) = \max_u \Big[ r(x(t), u) + \frac{\partial V(x(t))}{\partial x} f(x(t), u) \Big],    (3.4)

where \tau is the time constant of reward discount (Doya, 2000). Under the assumptions that the system is linear with respect to the action and the action cost is convex, a greedy policy is given by

u = g\Big( \frac{\partial f(x, u)}{\partial u}' \, \frac{\partial V(x)}{\partial x}' \Big),    (3.5)

where \frac{\partial V(x)}{\partial x}' is a vector representing the steepest ascent direction of the value function, \frac{\partial f(x, u)}{\partial u}' is a matrix representing the input gain of the dynamics, and g is a sigmoid function whose shape is determined by the control cost (Doya, 2000). To implement the HJB-based algorithm, we provide each module with a dynamic model f_i(x, u) and a value model V_i(x). The outputs of the dynamic models, equation 2.12, are compared with the actually observed state dynamics \dot{x}(t) to calculate the responsibility signal \lambda_i(t) according to equation 2.13. The model outputs are linearly weighted by \lambda_i(t) for state prediction,

\dot{\hat{x}}(t) = \sum_{i=1}^{n} \lambda_i(t) f_i(x(t), u(t)),    (3.6)

and value function estimation,

V(x) = \sum_{i=1}^{n} \lambda_i(t) V_i(x).    (3.7)

The derivatives of the dynamic models, \frac{\partial f_i(x, u)}{\partial u}, and of the value models, \frac{\partial V_i(x)}{\partial x}, are used to calculate the action for each module:

u_i(t) = g\Big( \frac{\partial f_i(x, u)}{\partial u}' \, \frac{\partial V_i(x)}{\partial x}' \Big)\Big|_{x(t)}.    (3.8)

They are then weighted by \lambda_i(t) according to equation 2.17 to make the actual action u(t). Learning is based on the weighted prediction errors \lambda_i(t) (\dot{\hat{x}}_i(t) - \dot{x}(t)) for the dynamic models and \lambda_i(t)\,\delta(t) for the value function models.
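The responsibility-weighted mixing used throughout this section can be sketched as follows. This is an illustrative sketch: the softmax-over-prediction-errors gating stands in for equation 2.13 (not reproduced here), and the function names and the scale parameter sigma are assumptions.

```python
import math

def responsibilities(x_dot, preds, sigma=1.0):
    """Responsibility signals lambda_i: a softmax over the (negative) squared
    state-prediction errors, so the module that best predicts the observed
    x_dot dominates. sigma sets the softness of the competition."""
    w = [math.exp(-sum((p - xd) ** 2 for p, xd in zip(pred, x_dot))
                  / (2 * sigma ** 2)) for pred in preds]
    s = sum(w)
    return [wi / s for wi in w]

def blended(vectors, lam):
    """Responsibility-weighted mixture, used both for the state prediction
    (equation 3.6) and for composing the actual action from the module
    actions (equation 2.17)."""
    dim = len(vectors[0])
    return [sum(lam[i] * vectors[i][d] for i in range(len(vectors)))
            for d in range(dim)]
```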
Multiple Model-Based Reinforcement Learning
1357
3.3 Multiple Linear Quadratic Controllers. In a modular architecture like the MMRL, the use of universal nonlinear function approximators with large numbers of degrees of freedom can be problematic because it can lead to an undesired solution in which a single module tries to handle most of the task domain. The use of linear models for the prediction models and the controllers is a reasonable choice because local linear models have been shown to have good properties of quick learning and good generalization (Schaal & Atkeson, 1996). Furthermore, if the reward function is locally approximated by a quadratic function, then we can use a linear quadratic controller (see, e.g., Bertsekas, 1995) for the RL controller design. We use a local linear dynamic model,
\dot{\hat{x}}_i(t) = A_i (x(t) - x_{di}) + B_i u(t),    (3.9)
and a local quadratic reward model,

\hat{r}_i(x(t), u(t)) = r_{i0} - \frac{1}{2}(x(t) - x_{ri})' Q_i (x(t) - x_{ri}) - \frac{1}{2} u'(t) R_i u(t),    (3.10)
for each module, where x_{di} and x_{ri} are the centers of the local predictions for state and reward, respectively, and r_{i0} is the bias of the quadratic reward model. The value function is given by the quadratic form

V_i(x) = v_{i0} - \frac{1}{2}(x - x_{vi})' P_i (x - x_{vi}).    (3.11)
The matrix P_i is given by solving the Riccati equation,

0 = \frac{1}{\tau} P_i - P_i A_i - A_i' P_i + P_i B_i R_i^{-1} B_i' P_i - Q_i.    (3.12)
The center x_{vi} and the bias v_{i0} of the value function are given by

x_{vi} = (Q_i + P_i A_i)^{-1} (Q_i x_{ri} + P_i A_i x_{di}),    (3.13)

\frac{1}{\tau} v_{i0} = r_{i0} - \frac{1}{2}(x_{vi} - x_{ri})' Q_i (x_{vi} - x_{ri}).    (3.14)
Then the optimal feedback control for each module is given by the linear feedback

u_i(t) = -R_i^{-1} B_i' P_i (x(t) - x_{vi}).    (3.15)
The action output is given by weighting these controller outputs by the responsibility signals \lambda_i(t):

u(t) = \sum_{i=1}^{n} \lambda_i(t) u_i(t).    (3.16)
The parameters of the local linear models, A_i, B_i, and x_{di}, and those of the quadratic reward models, r_{i0}, Q_i, and R_i, are updated by the weighted prediction errors \lambda_i(t)(\dot{\hat{x}}_i(t) - \dot{x}(t)) and \lambda_i(t)(\hat{r}_i(x, u) - r(t)), respectively. When the update of these models is slow enough, the Riccati equation 3.12 needs to be recalculated only intermittently. We call this method multiple linear quadratic controllers (MLQC).
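A single responsibility-weighted update of a local linear dynamic model might look as follows (an illustrative gradient-descent sketch; the learning rate alpha and the function name are assumptions).

```python
import numpy as np

def update_linear_model(Ai, Bi, xdi, x, u, x_dot, lam_i, alpha=0.1):
    """One responsibility-weighted gradient step on the local linear dynamic
    model (3.9), descending lam_i * ||x_dot_hat - x_dot||^2 / 2.
    alpha is an assumed learning rate."""
    x_dot_hat = Ai @ (x - xdi) + Bi @ u
    err = x_dot_hat - x_dot                       # weighted prediction error
    Ai = Ai - alpha * lam_i * np.outer(err, x - xdi)  # gradient w.r.t. A_i
    Bi = Bi - alpha * lam_i * np.outer(err, u)        # gradient w.r.t. B_i
    return Ai, Bi
```

Run on noiseless data from a fixed linear system, this recursion recovers the true A and B, which is the sense in which the local models are "quick to learn."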
4 Simulation: Discrete Case
In order to test the effectiveness of the MMRL architecture, we first applied the discrete MMRL architecture to a nonstationary hunting task in a grid world. The hunter agent tries to catch a prey in a 7×7 torus grid world. There are 47 states representing the position of the prey relative to the hunter. The hunter chooses one of five possible actions: {north (N), east (E), south (S), west (W), stay}. The prey moves in a fixed direction during a trial. At the beginning of each trial, one of four movement directions {NE, NW, SE, SW} is randomly selected, and the prey is placed at a random position in the grid world. When the hunter catches the prey by stepping into the same grid cell as the prey, a reward r(t) = 10 is given. Each step of movement costs r(t) = −1. A trial is terminated when the hunter catches the prey or fails to do so within 100 steps.

In order to compare the performance of MMRL with conventional methods, we applied standard Q-learning and compositional Q-learning (CQ-L) (Singh, 1992) to the same task. A major difference between CQ-L and MMRL is the criterion for modular decomposition: CQ-L uses the consistency of the modular value functions, while MMRL uses the prediction errors of the dynamic models. In CQ-L, the gating network as well as the component Q-learning modules are trained so that the composite Q-value approximates the action value function of the entire problem well. In the original CQ-L (Singh, 1992), the output of the gating network was based on an "augmenting bit" that explicitly signaled the change in context. Since our goal here is to let the agent learn an appropriate decomposition of the task without an explicit cue, we used a modified CQ-L (see the appendix for the details of the algorithm and its parameters).

4.1 Results.

Figure 2 shows the performance of standard Q-learning, CQ-L, and MMRL in the hunting task. The modified CQ-L did not perform significantly better than standard, flat Q-learning.
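The hunting-task dynamics described above can be sketched as follows (an illustrative sketch; representing the state as the prey's position relative to the hunter and the particular action/direction encodings are assumptions consistent with the text).

```python
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0), "stay": (0, 0)}
PREY_DIRS = {"NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1)}

def step(rel, action, prey_dir, size=7):
    """One step of the hunting task on the 7x7 torus. rel is the prey's
    position relative to the hunter; returns (new_rel, reward, done).
    The reward is the +10 catch bonus minus the per-step cost of 1."""
    dx, dy = MOVES[action]          # hunter movement reduces rel
    px, py = PREY_DIRS[prey_dir]    # prey drifts in its fixed direction
    new_rel = ((rel[0] - dx + px) % size, (rel[1] - dy + py) % size)
    if new_rel == (0, 0):           # same cell: the prey is caught
        return new_rel, 10 - 1, True
    return new_rel, -1, False
```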
Investigation of the modular Q functions of CQ-L revealed that in most simulation runs, the modules did not appropriately differentiate among the four kinds of prey. The performance of MMRL, on the other hand, approached the theoretical optimum, because each of the four modules successfully specialized in one of the four kinds of prey movement.
Figure 2: Comparison of the performance of standard Q-learning (gray line), modified CQ-L (dashed line), and MMRL (thick line) in the hunting task. The average number of steps needed for catching the prey during 200-trial epochs in 10 simulation runs is plotted. The dash-dotted line shows the theoretical minimum of the average number of steps required for catching the prey.
Figure 3 shows examples of the value functions and the prediction models learned by MMRL. From the outputs of the prediction models F_i, it can be seen that modules 1, 2, 3, and 4 specialized for the prey moving to the NE, NW, SW, and SE, respectively. The landscapes of the value functions V_i(x) are in accordance with these movement directions of the prey. A possible reason for the difference in the performance of CQ-L and MMRL in this task is the difficulty of module selection. In CQ-L, when the prey is far from the hunter, the differences in the discounted Q values for the different kinds of prey are minor; thus, it is difficult to differentiate the modules based solely on the Q values. In MMRL, on the other hand, module selection based on the state change, in this case the prey movement, is relatively easy even when the prey is far from the hunter.

5 Simulation: Continuous Case
In order to test the effectiveness of the MMRL architecture for control, we applied the MLQC algorithm described in section 3.3 to the task of swinging up a pendulum with limited torque (see Figure 4) (Doya, 2000). The driving torque T is limited to [−T_max, T_max] with T_max < mgl, so the pendulum has to be swung back and forth at the bottom to build up enough momentum for a successful swing up.
Figure 3: Example of value functions and prediction models learned by MMRL after 10,000 trials. Each slot in the grid shows the position of the prey relative to the hunter, which was used as the state x. (a) The state value functions V_i(x). (b) The prediction model outputs F_i(\hat{x}, x, u), where the current state x of the prey was fixed at (2, 1), shown by the circle, and the action u was fixed as "stay."
The state space was two-dimensional: x = (\theta, \dot{\theta})' \in [-\pi, \pi] \times R, where \theta is the joint angle (\theta = 0 means the pendulum hanging down). The action was u = T. The reward was given by the height of the tip and the negative squared torque:

r(x, u) = -\cos\theta - \frac{1}{2} R T^2.    (5.1)
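Together with the pendulum dynamics given later in the Figure 4 caption (ml^2 \ddot{\theta} = -mgl \sin\theta - \mu\dot{\theta} + T), the task can be sketched as follows. The Euler time step dt and the torque-cost coefficient R = 0.01 are assumed values; the text does not specify them.

```python
import math

def pendulum_step(theta, theta_dot, T, dt=0.02,
                  m=1.0, l=1.0, g=9.8, mu=0.1, T_max=5.0):
    """Euler step of  m*l^2*theta_dd = -m*g*l*sin(theta) - mu*theta_dot + T,
    with the torque clipped to [-T_max, T_max] (T_max < m*g*l, so the
    pendulum cannot be lifted directly from the bottom)."""
    T = max(-T_max, min(T_max, T))
    theta_dd = (-m * g * l * math.sin(theta) - mu * theta_dot + T) / (m * l**2)
    theta_dot += dt * theta_dd
    theta += dt * theta_dot
    theta = (theta + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    return theta, theta_dot

def reward(theta, T, R=0.01):
    """Equation 5.1: height of the tip minus a quadratic torque cost
    (R = 0.01 is an assumed value)."""
    return -math.cos(theta) - 0.5 * R * T**2
```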
A trial was started from a random joint angle \theta \in [-\pi/4, \pi/4] with no angular velocity. We devised the following automatic annealing process for the parameter \sigma of the softmax function for the responsibility signal, equation 2.26:

\sigma_{k+1} = \eta a E_k + (1 - \eta) \sigma_k,    (5.2)
where k denotes the trial number and E_k is the average state prediction error during the kth trial. The parameters were \eta = 0.25 and a = 2, with the initial value \sigma_0 = 4.

5.1 Task Decomposition in Space: Nonlinear Control.

We first used two modules, each of which had a linear dynamic model (see equation 3.9) and a quadratic reward model (see equation 3.10). The centers of the local linear dynamic models were initially placed randomly, with the angular component in [-\pi, \pi].
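The annealing schedule of equation 5.2 can be sketched as follows (the function name is mine); note that for a constant prediction error E, the fixed point of the recursion is \sigma = aE.

```python
def anneal_sigma(errors, sigma0=4.0, eta=0.25, a=2.0):
    """Annealing of the softmax scale sigma (equation 5.2):
    sigma_{k+1} = eta*a*E_k + (1 - eta)*sigma_k, where E_k is the average
    state-prediction error of trial k. Returns the whole trajectory."""
    sigma = sigma0
    history = [sigma]
    for E in errors:
        sigma = eta * a * E + (1 - eta) * sigma
        history.append(sigma)
    return history
```

The schedule thus tightens the module competition automatically as the prediction errors shrink, rather than following a fixed cooling rate.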
Each trial was started from a random position of the pendulum and lasted for 30 seconds. Figure 4 shows an example of swing-up performance from the bottom position. Initially, the first prediction model predicts the pendulum motion better than the second one, so the responsibility signal \lambda_1 becomes close to 1. Thus, the output of the first RL controller, u_1, which destabilizes the bottom position, is used for control. As the pendulum is driven away from the bottom, the second prediction model predicts the movement better, so \lambda_2 becomes higher and the second RL controller takes over and stabilizes the upright position.

Figure 5 shows the linear prediction models and quadratic reward models before and after learning. The two linear prediction models approximated the nonlinear gravity term: the first model predicted the negative feedback acceleration around the equilibrium with the pendulum hanging down, and the second model predicted the positive feedback acceleration around the unstable equilibrium with the pendulum raised up. The two reward models likewise approximated the cosine reward function by parabolic curves.

Figure 6 shows the dynamic and reward models when there were eight modules. Two modules specialized for the bottom position, three modules specialized near the top position, and two others were centered somewhere in between. The result shows that proper modularization is possible even when there are redundant modules.

Figure 7 compares the time course of learning by MLQC with two, four, and eight modules and by a nonmodular actor-critic (Doya, 2000). Learning was fastest with two modules. The addition of redundant modules resulted in more variability in the time course of learning: because there were multiple possible modular decompositions, and because of the variability of the sample trajectories, it took longer for the decomposition to stabilize.
Nevertheless, learning by the eight-module MLQC was still much faster than by the nonmodular architecture. An interesting feature of the MLQC strategy is that qualitatively different controllers are derived from the solutions of the Riccati equation 3.12: the controller at the bottom is a positive feedback controller that destabilizes the equilibrium where the reward is minimal, while the controller at the top is a typical linear quadratic regulator that stabilizes the upright state. Another important feature of the MLQC is that the modules were flexibly switched simply on the basis of the prediction errors; successful swing up was achieved without any top-down planning of the complex sequence.
5.2 Task Decomposition in Time: Nonstationary Pendulum.

We then tested the effectiveness of the MMRL architecture for nonlinear and nonstationary control tasks in which the mass m and the length l of the pendulum were changed on every trial.
Figure 4: (a) Example of swing-up performance. The dynamics are given by ml^2 \ddot{\theta} = -mgl \sin\theta - \mu\dot{\theta} + T. Physical parameters are m = l = 1, g = 9.8, \mu = 0.1, and T_max = 5.0. (b) Trajectory from the initial state (0 [rad], 0.1 [rad/s]). o: start; +: goal. Solid line: module 1; dashed line: module 2. (c) Time course of the state (top), the action (middle), and the responsibility signal (bottom).
Figure 5: Development of the state and reward predictions of the models. (a, b) Outputs of the state prediction models (a) before and (b) after learning. (c, d) Outputs of the reward prediction models (c) before and (d) after learning. Solid line: module 1; dashed line: module 2; dotted line: targets (\dot{x} and r). ±: centers of the spatial responsibility predictions c_i.
Figure 6: Outputs of eight modules. (a) State prediction models. (b) Reward models.
Figure 7: Learning curves for the pendulum swing-up task. The cumulative reward \int_0^{20} r(t)\,dt during each trial is shown for five simulation runs. (a) Two modules. (b) Four modules. (c) Eight modules. (d) Nonmodular architecture.
We used four modules, each of which had a linear dynamic model (see equation 3.9) and a quadratic reward model (see equation 3.10). The centers x_i of the local linear prediction models were initially set randomly. Each trial was started from a random position with \theta \in [-\pi/4, \pi/4] and lasted for 40 seconds. We implemented responsibility prediction with t_c = 50, t_M = 200, and t_p = 0.1. The annealing parameters were \eta = 0.1 and a = 2, with the initial value \sigma_0 = 10.

In the first 50 trials, the physical parameters were fixed at {m = 1.0, l = 1.0}. Figure 8a shows the change in the position gain (A_{21} = \partial\ddot{\theta}/\partial\theta) of the four prediction models. The control performance is shown in Figure 8b. Figures 8c, 8d, and 8e show the outputs of the prediction models in the section {\dot{\theta} = 0, T = 0}. The initial position gains were set randomly (see Figure 8c). After 50 trials, modules 1 and 2 had both specialized in the bottom region (\theta \approx 0) and learned similar prediction models. Modules 3 and 4 likewise learned the same prediction model in the top region (\theta \approx \pi) (see Figure 8d). Accordingly, the RL controllers in modules 1 and 2 learned a reward model with a minimum near (0, 0)', and
Figure 8: Time course of learning and changes of the prediction models. (a) Changes of the coefficient A_{21} = \partial\ddot{\theta}/\partial\theta (the gain of angular acceleration with angle) of the four prediction models. (b) Change of the average reward during each trial. Thin lines: results of 10 simulation runs; thick line: average over the 10 runs. Note that the average reward with the new, longer pendulum was lower even after successful learning because of its longer period of swinging. (c, d, e) Linear prediction models in the section {\dot{\theta} = 0, T = 0} (c) before learning, (d) after 50 trials with fixed parameters, and (e) after 150 trials with changing parameters. The slopes of the linear models correspond to the A_{21} shown in (a).
a destabilizing feedback policy was given by equations 2.15 through 2.17. Modules 3 and 4 also learned a reward model, with a peak near (\pi, 0)', and implemented a stabilizing feedback controller.

From trial 50 to trial 200, the parameters of the pendulum were switched between {m = 1.0, l = 1.0} and {m = 0.2, l = 10.0} on each trial. At first, the degenerate modules tried to follow the alternating environment (see Figure 8a), and thus swing up was not successful for the new, longer pendulum. The performance for the shorter pendulum was also disturbed (see Figure 8b). After about 80 trials, the prediction models gradually specialized in either the new or the previously learned dynamics (see Figure 8e), and successful swing up was achieved for both the shorter and the longer pendulum.
We found similar module specialization in 6 of 10 simulation runs. In the 4 other runs, owing to bias in the initial module allocation, three modules aggregated in one domain (top or bottom) and one module covered the other domain during the stationary condition. However, after 150 trials in the nonstationary condition, module specialization as shown in Figure 8e was achieved.

6 Discussion
We proposed an MMRL architecture that decomposes a nonlinear or nonstationary task in space and time based on the local predictability of the system dynamics, and we tested its performance in both nonlinear and nonstationary control tasks. Simulations of the pendulum swing-up task showed that multiple prediction models were successfully trained and that corresponding model-based controllers were derived, with the modules specializing for different domains of the state space. It was also confirmed in a nonstationary pendulum swing-up task that the available modules are flexibly allocated to different domains in space and time according to the task demands.

A modular control architecture using multiple prediction models was proposed by Wolpert and Kawato as a computational model of the cerebellum (Wolpert et al., 1998; Wolpert & Kawato, 1998). Imamizu et al. (1997, 2000) showed in fMRI experiments on novel tool use that a large area of the cerebellum is activated initially and that a smaller area remains active after long training. They proposed that such local activation spots are the neural correlates of internal models of tools (Imamizu et al., 2000) and suggested that internal models of different tools are represented in separate areas of the cerebellum (Imamizu et al., 1997). Our simulation results in a nonstationary environment can provide a computational account of these fMRI data: when a new task is introduced, many modules initially compete to learn it, but after repeated learning, only a subset of modules is specialized and recruited for the new task.

One might question whether MLQC is a reinforcement learning architecture, since it uses LQ controllers that are calculated off-line. However, when the linear dynamic models and quadratic reward models are learned on-line, as in our simulations, the entire system realizes reinforcement learning.
One limitation of the MLQC architecture is that the reward function must have helpful gradients in each modular domain. A method for backpropagating the value function of the successor module as the effective reward for the predecessor module is under development. To construct a hierarchical RL system, it appears necessary to combine top-down and bottom-up approaches to task decomposition. The MMRL architecture provides one solution for the bottom-up approach; combining this bottom-up mechanism with a top-down mechanism is the subject of our ongoing study.
Appendix: Modified Compositional Q-Learning
On each time step, the gating variable g_i(t) is given by the prior probability of module selection, derived in this case from the assumption of temporal continuity, equation 2.22:

g_i(t) = \frac{\lambda_i(t-1)^a}{\sum_{j=1}^{n} \lambda_j(t-1)^a}.    (A.1)
The composite Q-values for state x(t) are then computed by

\hat{Q}(x(t), u) = \sum_{i=1}^{n} g_i(t) Q_i(x(t), u),    (A.2)
and an action u(t) is selected by

P(u \mid x(t)) = \frac{e^{\beta \hat{Q}(x(t), u)}}{\sum_{v \in U} e^{\beta \hat{Q}(x(t), v)}}.    (A.3)
After the reward r(x(t), u(t)) is acquired and the state changes to x(t+1), the TD error for module i is given by

e_i(t) = r(x(t), u(t)) + \gamma \max_u \hat{Q}(x(t+1), u) - Q_i(x(t), u(t)).    (A.4)
From a gaussian assumption on the value prediction error, the likelihood of module i is given by

P(e_i(t) \mid i) = e^{-e_i(t)^2 / 2\sigma_i^2},    (A.5)
and thus the responsibility signal, or the posterior probability of selecting module i, is given by

\lambda_i(t) = \frac{g_i(t)\, e^{-e_i(t)^2 / 2\sigma_i^2}}{\sum_j g_j(t)\, e^{-e_j(t)^2 / 2\sigma_j^2}}.    (A.6)
The Q values of each module are updated with the weighted TD error \lambda_i(t) e_i(t) as the error signal. The discount factor was set to \gamma = 0.9 and the greediness parameter to \beta = 1 for both MMRL and CQ-L. The decay parameter of the temporal responsibility predictor was a = 0.8 for MMRL. We tried different values of a for CQ-L without success; the value used in Figures 2 and 3 was a = 0.99.
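The gating prior (equation A.1) and the posterior responsibility (equations A.5 and A.6) can be sketched as follows (an illustrative sketch; the function names are mine).

```python
import math

def gating_prior(lam_prev, a=0.99):
    """Equation A.1: prior module-selection probabilities from temporal
    continuity, lambda_i(t-1)^a normalized over modules."""
    w = [l ** a for l in lam_prev]
    s = sum(w)
    return [wi / s for wi in w]

def cql_responsibility(gating, td_errors, sigma=1.0):
    """Equations A.5 and A.6: the gating prior g_i(t) reweighted by the
    gaussian likelihood of each module's TD error, then normalized to give
    the posterior responsibility lambda_i(t)."""
    w = [g * math.exp(-e * e / (2 * sigma ** 2))
         for g, e in zip(gating, td_errors)]
    s = sum(w)
    return [wi / s for wi in w]
```

A module whose Q value better predicts the observed reward-plus-bootstrap target has a smaller TD error, and therefore receives more of the posterior responsibility.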
Acknowledgments
We thank Masahiko Haruno, Daniel Wolpert, Chris Atkeson, Jun Tani, Hidenori Kimura, and Raju Bapi for helpful discussions.

References

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.
Cacciatore, T. W., & Nowlan, S. J. (1994). Mixture of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6. San Mateo, CA: Morgan Kaufmann.
Dayan, P., & Hinton, G. E. (1993). Feudal reinforcement learning. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 271–278). San Mateo, CA: Morgan Kaufmann.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12, 215–245.
Gomi, H., & Kawato, M. (1993). Recognition of manipulated objects by motor learning with modular architecture networks. Neural Networks, 6, 485–497.
Haruno, M., Wolpert, D. M., & Kawato, M. (1999). Multiple paired forward-inverse models for human motor learning and control. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 31–37). Cambridge, MA: MIT Press.
Haruno, M., Wolpert, D. M., & Kawato, M. (2001). MOSAIC model for sensorimotor learning and control. Neural Computation, 13, 2201–2220.
Imamizu, H., Miyauchi, S., Sasaki, Y., Takino, R., Pütz, B., & Kawato, M. (1997). Separated modules for visuomotor control and learning in the cerebellum: A functional MRI study. In A. W. Toga, R. S. J. Frackowiak, & J. C. Mazziotta (Eds.), NeuroImage: Third International Conference on Functional Mapping of the Human Brain (Vol. 5). Copenhagen, Denmark: Academic Press.
Imamizu, H., Miyauchi, S., Tamada, T., Sasaki, Y., Takino, R., Pütz, B., Yoshioka, T., & Kawato, M. (2000). Human cerebellar activity reflecting an acquired internal model of a new tool. Nature, 403, 192–195.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Littman, M., Cassandra, A., & Kaelbling, L. (1995). Learning policies for partially observable environments: Scaling up. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the 12th International Conference (pp. 362–370). San Mateo, CA: Morgan Kaufmann.
Morimoto, J., & Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36, 37–51.
Narendra, K. S., Balakrishnan, J., & Ciliz, M. K. (1995, June). Adaptation and learning using multiple models, switching and tuning. IEEE Control Systems Magazine, 37–51.
Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 1043–1049). Cambridge, MA: MIT Press.
Pawelzik, K., Kohlmorgen, J., & Müller, K.-R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8, 340–356.
Schaal, S., & Atkeson, C. G. (1996). From isolation to cooperation: An alternative view of a system of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 605–611). Cambridge, MA: MIT Press.
Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8, 323–340.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.
Tani, J., & Nolfi, S. (1999). Learning to perceive the world as articulated: An approach for hierarchical learning in sensory-motor systems. Neural Networks, 12, 1131–1141.
Wiering, M., & Schmidhuber, J. (1998). HQ-learning. Adaptive Behavior, 6, 219–246.
Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of movement neuroscience. Nature Neuroscience, 3, 1212–1217.
Wolpert, D. M., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11, 1317–1329.
Wolpert, D. M., Miall, R. C., & Kawato, M. (1998). Internal models in the cerebellum. Trends in Cognitive Sciences, 2, 338–347.
Received June 12, 2000; accepted November 8, 2001.
LETTER
Communicated by Ning Qian
A Bayesian Approach to the Stereo Correspondence Problem

Jenny C. A. Read
[email protected]
University Laboratory of Physiology, Oxford, OX1 3PT, U.K.

I present a probabilistic approach to the stereo correspondence problem. Rather than trying to find a single solution in which each point in the left retina is assigned a partner in the right retina, all possible matches are considered simultaneously and assigned a probability of being correct. This approach is particularly suitable for stimuli where it is inappropriate to seek a unique partner for each retinal position—for instance, where objects occlude each other, as in Panum's limiting case. The probability assigned to each match is based on a Bayesian analysis previously developed to explain psychophysical data (Read, 2002). This provides a convenient way to incorporate constraints that enable the ill-posed correspondence problem to be solved. The resulting model behaves plausibly for a variety of different stimuli.

1 Introduction
A fundamental problem facing the visual system is how to extract information about a three-dimensional world from a two-dimensional retinal image. One clue in this task is provided by retinal disparity: the difference in the position of an object's images in the left and right eyes, arising from the horizontal displacement of the eyes in space. Evidently, this calculation depends on correctly matching each feature in the left retinal image with its counterpart in the right retina. This task has become known as the correspondence problem. In a simple visual scene such as that illustrated in Figure 1, this presents little difficulty. However, in a more complicated visual scene, the correspondence problem may be highly complex and is indeed ill posed: in general, there are many possible solutions, each implying a different arrangement of physical objects. Despite this, the visual system is capable of arriving, almost instantaneously, at a judgment of disparity across a scene. This implies that the visual system must be using additional constraints to select a solution. Bayesian probability theory (Knill & Richards, 1996) provides a natural way of framing these constraints. Bayesian models of perception typically envisage an observer attempting to deduce information about the visual scene, S, given an image I. In the context of stereopsis, S represents the location of objects in space, and I represents the pair of retinal images.

Neural Computation 14, 1371–1392 (2002). © 2002 Massachusetts Institute of Technology.
1372
Jenny C. A. Read
Figure 1: How probability relates to perception in physical space. The eyes (circles) are fixating on object F, whose image thus falls at the fovea in both retinas. Object P is in front of the fixation point and thus has positive (crossed) disparity d = x_L - x_R, where x_L and x_R are the horizontal positions of the image of P in the left and right retinas. The distance \delta describes how far the object at P is in front of the fixation point F, while the angle x_c describes how far it is to the right of F. 2I is the interocular distance, and \alpha is the vergence angle. Under the approximation that the fixation point is sufficiently distant, and all objects viewed sufficiently close to it, that the angles \alpha, x_L, x_R, and x_c are all small, it can be shown that x_c \approx (x_L + x_R)/2 and \delta \approx (x_L - x_R) I / 2\alpha^2. Each potential match between a point (x_L, y) in the left retina and a point (x_R, y) in the right retina implies a percept of an object at the corresponding location in space P, with luminance depending on the mean of the light intensities recorded in the two retinas. The strength of the perception is presumed to increase monotonically with the probability P{(x_L, y) ↔ (x_R, y)} assigned to the match (zero probability = no perception).
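The small-angle geometry in the caption can be sketched numerically as follows (the values of I and alpha are placeholders for illustration, not values from the text).

```python
def match_percept(xL, xR, I=0.032, alpha=0.1):
    """Location in space implied by matching (xL, y) with (xR, y), under the
    small-angle approximations of the Figure 1 caption: direction
    x_c = (xL + xR)/2 and depth in front of fixation
    delta = (xL - xR)*I/(2*alpha**2). Here I is half the interocular
    distance and alpha the vergence angle in radians (assumed values)."""
    d = xL - xR                       # retinal disparity
    x_c = 0.5 * (xL + xR)             # angle relative to straight ahead
    delta = d * I / (2 * alpha ** 2)  # distance in front of fixation point
    return x_c, delta
```

A zero-disparity match (xL = xR) thus maps to an object at the fixation depth, while crossed disparity (xL > xR) maps in front of it.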
Due to noise within the system, a given configuration of objects does not necessarily produce the same image on successive presentations. However, we assume that the imaging system and its limitations are well characterized, so that the brain knows the likelihood P(I | S) of obtaining an image I given a particular scene S. Furthermore, we assume that the brain has its own a priori estimate of the probability P(S) that a particular scene occurs. Then Bayes' theorem allows us to deduce the posterior probability of a particular
Stereo Correspondence Problem
1373
scene S, given that we receive the image I: P(S | I) = P(I | S) P(S) / P(I).

For instance, in the double-nail illusion, human observers presented with two nails in the midsagittal plane report a clear perception of two nails with zero disparity in the frontoparallel plane (Krol & van de Grind, 1980; Mallot & Bideau, 1990). This solution is preferred over the physically correct solution, even though other cues, such as size and shading, mean that the latter presumably has the higher likelihood P(I | S). One way of explaining this within a Bayesian framework is to postulate a prior preference for small disparities. The solution in which both nails have zero disparity is then assigned a higher posterior probability than the other solution, in which one nail has crossed disparity and the other uncrossed; the zero-disparity solution is thus the one perceived.

A second advantage of a probabilistic approach is its ability to handle ambiguity. Several existing models of the correspondence problem (Dev, 1975; Nelson, 1975; Marr & Poggio, 1976, 1979; Grimson, 1981; Pollard, Mayhew, & Frisby, 1985; Sanger, 1988) employ a uniqueness constraint. That is, they seek a unique match in the right eye for every point in the left, and vice versa. This could be implemented within a Bayesian scheme by taking the correct match at every point to be that with the highest posterior probability. However, although computationally convenient in avoiding false matches, a uniqueness constraint is clearly not satisfied in practice. Parts of visual scenes are often occluded from one or the other eye; for instance, a stereogram consisting of a disparate target superimposed on a zero-disparity background may contain regions that have no match in the other eye. Conversely, occluding stimuli may require one point in the left image to be matched with two points in the right image.
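The small-disparity prior invoked here can be made concrete with a toy posterior computation (all numbers illustrative): even when the physically correct solution has the higher likelihood, a gaussian prior on disparity can assign the zero-disparity solution the higher posterior.

```python
import math

def posterior(likelihoods, disparities, sigma_prior=0.1):
    """Posterior over candidate solutions, P(S|I) proportional to
    P(I|S) * P(S), with a gaussian prior preferring small disparities.
    sigma_prior sets how strongly large disparities are penalized."""
    post = [lik * math.exp(-d * d / (2 * sigma_prior ** 2))
            for lik, d in zip(likelihoods, disparities)]
    s = sum(post)
    return [p / s for p in post]
```

With a likelihood of 0.6 for a disparate (physically correct) solution against 0.4 for a zero-disparity one, the prior still hands nearly all the posterior mass to the zero-disparity solution, mirroring the double-nail percept.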
Experimental evidence suggests that the human visual system does indeed produce double matches in this situation (McKee, Bravo, Smallman, & Legge, 1995), and an algorithm avoiding the uniqueness constraint is capable of deriving the correct solution (McLoughlin & Grossberg, 1998). Finally, although algorithms incorporating the uniqueness constraint have been able to solve stereograms incorporating transparency (Qian & Sejnowski, 1989; Pollard & Frisby, 1990), the uniqueness constraint certainly appears less than ideal for such stimuli (Westheimer, 1986; Weinshall, 1989). A probabilistic theory is well suited to such situations.

Let P{(x_L, y) ↔ (x_R, y)} denote the probability that the point (x_L, y) in the left retina corresponds to the point (x_R, y) in the right retina, that is, that both points are images of the same object. High probability can be assigned to several matches for a given object, without having to implement a uniqueness constraint by deciding on only one match as "correct."

I adopt the following working hypothesis of how the match probability relates to perception, illustrated in Figure 1. If the point (x_L, y) in the left retina corresponds to the point (x_R, y) in the right, then both must be viewing
Jenny C. A. Read
an object located at an angle xc = (xL + xR)/2 to the straight-ahead direction, and at a distance d = (xL − xR)I/2a² in front of the fixation point (see Figure 1). I propose that a nonzero match probability P{(xL, y) ↔ (xR, y)} implies a perceptual experience of an object at the corresponding point in space, with luminance equal to the mean luminance of the retinal images, [IL(xL, y) + IR(xR, y)]/2. I propose that the strength of the perceptual experience depends on the probability assigned to the match, P{(xL, y) ↔ (xR, y)}, with higher probabilities creating a clearer perception. In previous work (Read, 2002), I developed a computational model based on these Bayesian ideas. This model was designed to explain the results of psychophysical experiments involving two-interval forced-choice discrimination of the sign of stereoscopic disparity (crossed versus uncrossed) (Read & Eagle, 2000). It was tested only with stimuli whose disparity was constant across the entire image, and the model reported the most likely value of this global disparity. This procedure was appropriate for modeling the results of our two-interval forced-choice discrimination experiments, and the simulations captured the key features of the psychophysical data across a variety of spatial frequency and orientation bandwidths. However, the model actually calculated internally a detailed probabilistic disparity map of the stimulus, assigning a probability to every potential match. For the previous article, this information was then compressed into a single estimate of stimulus disparity. Now I wish to probe the probabilistic disparity map in more detail, testing the model on stimuli whose disparity varies across the image, to see whether the model can reconstruct disparity in a way that qualitatively captures the behavior of human subjects.

2 Methods

2.1 Structure of the Model.
The model is described in full mathematical detail in Read (2002), and only an outline is given here. Following earlier work by Qian (1994) and Prince and Eagle (2000a), the model was designed to incorporate the known physiology presumed to underlie binocular vision. Thus, rather than applying a Bayesian analysis to the original retinal images, the retinal images are first processed by model simple cells, which are assumed to be linear with an output nonlinearity of half-wave rectification (Movshon, Thompson, & Tolhurst, 1978; Anzai, Ohzawa, & Freeman, 1999). The receptive fields of these cells are Gabor functions (the product of a gaussian and a cosine), described by the spatial frequency and orientation to which they are tuned, and by the bandwidth of this tuning. For simplicity, all simple cells in my model have a spatial frequency bandwidth of 1.5 octaves and an orientation bandwidth of 30 degrees, defined as the full width at half height of the tuning curve. This is in accordance with psychophysical and physiological evidence for the bandwidths of channels in the visual system (Mansfield & Parker, 1993; de Valois, Albrecht, & Thorell, 1982; de Valois, Yund, & Hepler, 1982; de Valois & de Valois, 1988). These model simple
Stereo Correspondence Problem
cells then feed into disparity-tuned model complex cells, whose response is simulated using the energy model developed by Adelson and Bergen (1985), Fleet, Wagner, and Heeger (1996), and Ohzawa, DeAngelis, and Freeman (1990, 1997), and analyzed by Qian (1994). All simple cells feeding into a single complex cell are assumed to be tuned to the same spatial frequency and orientation; they differ only in the phase of their Gabor receptive field and in their position on the retina. This model uses quadrature pairs of simple cells—simple cells whose receptive fields differ in phase by π/2. The difference between the positions of the receptive fields in the left and right retinas defines the disparity to which the complex cell is tuned. My model employs only tuned-excitatory complex cells, which respond maximally to stimuli at their preferred disparity.
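The simple-cell and complex-cell stages just described can be sketched in a few lines. This is a minimal sketch, not the article's code: the isotropic gaussian envelope and all parameter values are my own simplifying assumptions, and the energy stage squares two binocular subunits in quadrature rather than half-wave rectifying four simple cells.

```python
import numpy as np

def gabor(x, y, wavelength, theta, sigma, phase=0.0, x0=0.0):
    """Gabor receptive field: gaussian envelope times cosine carrier,
    centered at x0, tuned to the given wavelength and orientation theta."""
    xr = (x - x0) * np.cos(theta) + y * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * xr / wavelength + phase)

def simple_cell(image, rf):
    """Linear filtering followed by half-wave rectification."""
    return max(0.0, float(np.sum(image * rf)))

def complex_cell(image_L, image_R, make_rf, xL0, xR0):
    """Energy-model complex cell (sketch): binocular subunits built from
    quadrature pairs (phases differing by pi/2).  The offset xR0 - xL0
    between left- and right-eye RF centers sets the preferred disparity."""
    energy = 0.0
    for phase in (0.0, np.pi / 2):
        # sum the two eyes' linear responses, then square the subunit
        s = (np.sum(image_L * make_rf(phase, xL0))
             + np.sum(image_R * make_rf(phase, xR0)))
        energy += s**2
    return energy
```

A cell whose right-eye receptive field is offset to match the stimulus shift responds more strongly than one tuned to the opposite disparity, which is the property the Bayesian analysis below exploits.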
2.2 The Bayesian Analysis.

A complex cell of this form can be used to test the hypothesis that the region of the stimulus within the complex cell's receptive field has the disparity that the complex cell is tuned to. If this were the case, then the part of the image falling within the complex cell's left retinal receptive field would be expected to be identical to the part of the image viewed by the right retinal receptive field. Thus, the response of the binocular complex cell could actually be predicted from a knowledge of the firing rates of simple cells with receptive fields in one eye only. This observation forms the basis of the Bayesian analysis carried out by the present computational model. The model assumes that the left and right retinal images are independently degraded by noise. Any other sources of noise within the visual system are neglected. The retinal noise means that even if the stimulus does have exactly the disparity that the complex cell is tuned to, there will nevertheless usually be some slight discrepancy between the actual firing rate of the complex cell and that predicted by considering the response of simple cells with receptive fields in just one eye. However, the distribution of this discrepancy can be calculated analytically given the amplitude of the retinal noise. Hence, one can deduce the probability of obtaining the observed binocular complex cell firing rate, given the firing rate of the monocular simple cells from one eye, on the assumption that the disparity the complex cell is tuned to is that actually present in the stimulus. According to Bayes' theorem, this calculation can then be inverted to arrive at the probability that the stimulus really does have the disparity that the complex cell is tuned to, given the observed firing rates of the binocular complex cell itself and the monocular simple cells that feed into it. This calculation applies only to the patch of the stimulus falling within the complex cell's receptive field.
In addition, since the complex cell is tuned to a particular spatial period λ and orientation θ, it applies only to that part of the Fourier spectrum of the stimulus that falls within the complex cell's bandpass region. I thus refer to this probability as the local single-channel
match probability (Read, 2002): Pλθ{(xL, y) ↔ (xR, y)}. This is the probability that the region of the left retinal image centered on (xL, y) corresponds to the region of the right image centered on (xR, y). Note that both RFs are assumed to have the same vertical position y: I include model complex cells tuned to horizontal disparities only. λ and θ are the spatial period and orientation to which the complex cell is tuned. I include several different orientation tunings (0°, 30°, 60°, 90°, 120°, 150°) ranging from horizontal to vertical, and several different spatial frequencies (1, 2, 4, 8, 16 cycles per image) designed to cover the full range of frequencies visible to humans (de Valois & de Valois, 1988). Precisely what is meant by "local" in this context depends on the channel under consideration. The probability analysis within each channel implements a local smoothness constraint—that is, it assumes the stimulus disparity is constant across the complex cell's receptive field. This is a region in each retina whose extent scales with the spatial period λ to which the complex cell is tuned (for the bandwidths employed here, it is approximately 0.277 × 0.506λ). I now average the local single-channel match probability over all spatial frequency and orientation channels to arrive at a local match "probability" to which all channels contribute: P{(xL, y) ↔ (xR, y)} = Σλθ Pλθ{(xL, y) ↔ (xR, y)}. This averaging process is purely heuristic. A full mathematically valid treatment would require the joint probability of obtaining a particular set of firing rates from the entire population of complex cells. The motivation for this approach is discussed in Read (2002). Thus, P{(xL, y) ↔ (xR, y)} is strictly not a probability, although I shall refer to it as such.
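In code, the heuristic cross-channel combination amounts to a reduction over the channel axis. A sketch under assumed array shapes (not the article's code; summing versus averaging changes the result only by a constant factor):

```python
import numpy as np

def combined_match_probability(p_single):
    """Heuristic combination across channels: p_single has shape
    (n_channels, n_positions, n_positions), holding the per-channel match
    probabilities P_lambda,theta{(xL, y) <-> (xR, y)}.  The channel average
    is not a true probability, merely the estimate used to build the
    probabilistic disparity map."""
    return np.mean(p_single, axis=0)
```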
It is regarded as an estimate of the probability that the position (xL, y) in the left retina corresponds to the position (xR, y) in the right retina, that is, that the stimulus disparity in the vicinity of this region is d = xL − xR. The "vicinity" represents an average over the receptive fields of the different channels, whose area and orientation are different for each channel. This local match probability P{(xL, y) ↔ (xR, y)} forms the probabilistic disparity map, describing how likely each potential match (xL, y) ↔ (xR, y) is.

2.3 Details of the Model Parameters.

The Bayesian prior, describing the a priori probability P{d} that a stimulus has the disparity d, is taken to have the form
P{d} = [D² + (d − D/2)²]^(−3/2) + [D² + (d + D/2)²]^(−3/2). This closely resembles a gaussian function but is less sharply peaked at the origin and decays less steeply (Read, 2002). D is a scale parameter, effectively describing which disparities are counted as "small." In the original
article (Read, 2002), D was one of two free parameters that were systematically adjusted in order to produce a good fit to experimental results, the other being the level of retinal noise. I found it necessary to postulate a very low noise level, just 0.075% of the contrast of the binary random dot patterns used here and in the previous article. Then a very tight prior of 2.4 arcmin was necessary in order to reproduce the observed decline in performance as stimulus disparity increased. Although the model was constructed on the presumption that the brain introduces rather little noise and that the major noise affecting the calculation arises at the inputs (for instance, spontaneous photoisomerizations, photon shot noise, and the limitations of the eye's optics), the fitted noise level is extremely low and may not be realistic. The low noise was found to be necessary in order to prevent the model finding incorrect "reversed-phi" (Anstis & Rogers, 1975) matches in dense anticorrelated stereo stimuli, which would disagree with human psychophysics (Julesz, 1971; Cogan, Lomakin, & Rossi, 1993; Cumming, Shapiro, & Parker, 1998). With correlated stimuli, good matches to human psychophysics could be obtained with much higher noise levels. The very low noise level may thus be an artifact of other inadequacies of the model—for instance, its failure to incorporate any non-Fourier mechanisms that might suppress false matches in anticorrelated stimuli. Alternatively, it is possible that such low noise levels are achieved by averaging uncorrelated noise over a population. In this view, a single unit in the model would represent a small local population of identical physiological units, in which noise tends to be averaged away. All simulations presented here use the values of noise and prior scale length D fitted to psychophysical data (Read, 2002).
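The prior defined in section 2.3 is straightforward to evaluate numerically. A sketch (the normalization constant is omitted, since only relative probabilities matter here):

```python
def disparity_prior(d, D):
    """Unnormalized prior over disparity from section 2.3:
    P{d} = [D^2 + (d - D/2)^2]^(-3/2) + [D^2 + (d + D/2)^2]^(-3/2).
    D is the scale parameter setting which disparities count as 'small';
    the function is symmetric in d and falls off for large |d|, more
    slowly than a gaussian of comparable width."""
    return ((D**2 + (d - D / 2) ** 2) ** -1.5
            + (D**2 + (d + D / 2) ** 2) ** -1.5)
```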
The model was originally developed to account for the results of psychophysical experiments (Read & Eagle, 2000) in which the stimuli were 128 × 128 pixels at a distance of 127 centimeters, subtending an angle of 1.7° × 1.7°. Accordingly, the model retina was originally constructed to be 128 × 128 pixels, where each pixel represents an angle of 0.8 arcmin on the retina. In this article, I also use a more detailed model retina of 256 × 256 pixels. This is constructed to represent the same visual angle of 1.7° × 1.7°, meaning that each pixel now represents 0.4 arcmin. The pixel values of the spatial periods of the model's channels and the prior scale length D were accordingly doubled. The computer memory and runtime required for simulations depend on the number of different simple and complex cells included. The simulations presented here use 128 different horizontal positions of receptive field (RF) centers. This dense sampling makes the model highly sensitive to variations in disparity across the stimulus. The simulations use simple cell RFs at just one vertical position in the model retina. Note that this does not mean that the model uses information only from a single horizontal strip across the image, since the RFs themselves extend over a wide region of the image.
2.4 Perception.

My hypothesis is that the local match probability P{(xL, y) ↔ (xR, y)} determines which matches are consciously perceived. Several different matches may be perceived for the same feature. In section 3, I show plots of P{(xL, y) ↔ (xR, y)}, plotted against xL and xR for a particular horizontal section through the retina. These plots imply a distribution of probability in space, for a particular horizontal plane in front of the observer. Lines of constant disparity d = xL − xR run diagonally up across the (xL, xR) plot; these correspond to frontoparallel lines in space. Lines of constant xc = (xL + xR)/2 run diagonally down across the (xL, xR) plot; these correspond to radial lines from the observer, at an angle xc to straight ahead. Incorporating a prior preference for small disparities, as implied by psychophysical data, inevitably means that lower posterior probability will be assigned to matches with nonzero disparities than to those with zero disparity, even if the matches are equally valid and thus have the same likelihood. But in our perceptual experience, disparate regions within Panum's fusional limit are perceived as clearly as regions with zero disparity. This may imply that the relationship between probability and perception saturates, so that all probabilities above a certain threshold cause the same clarity of perception. This article postulates only that clarity of perception increases (not necessarily strictly) monotonically with probability.

2.5 Maximal Complex Cell Response.

For comparison, I also consider an extension of the model of Qian (1994) to multiple spatial frequency and orientation channels. Qian's model extracts, for each line of constant xc = (xL + xR)/2, the disparity d = xL − xR for which the complex cell firing rate is maximal. This yields a disparity map giving the best disparity as a function of position xc across the image:

dλθ^best(xc) = argmax_d Cλθ(xc, d)
One simple way of extending this to multiple spatial frequency and orientation channels would be to sum the complex cell responses across all channels and extract the disparity, for each xc, at which this summed response is maximal:

d^best(xc) = argmax_d Σλθ Cλθ(xc, d)
However, I found that this method gave noisy results; it was poor at extracting the disparity of the target region in a random dot stereogram. Better results were obtained by extracting the maximum within each channel independently and combining these by constructing a "maxima field" M(xc, d), where the value of M(xc, d) is the number of channels that had dλθ^best(xc) = d. This "maxima field" shares several properties with the Bayesian approach already described. It is independent of how we choose to relate firing rates
across different channels (whether by making all simple cells respond with the same firing rate to their optimal sinusoidal grating, or some other method). In the Bayesian approach, this was achieved by converting all firing rates into the common language of probability—here, by allowing each channel to signal only the position where its response is maximum rather than the value of that maximum. Similarly, Qian's approach is capable of matching a single point xL in the left retina with several different points xR in the right retina. The form of uniqueness constraint it imposes is that along each line of sight xc, there should be only one disparity d. In the present version, this constraint is imposed on each channel separately. A conventional disparity map, such as those plotted by Qian (1994), could then be extracted from M(xc, d) by defining the disparity perceived at each xc to be that where M(xc, d) is maximal. To achieve good matches to human psychophysics, some weighting function would have to be imposed that favored small disparities (Cleary & Braddick, 1990; Prince & Eagle, 2000a, 2000b; Read & Eagle, 2000); this complication is neglected here.

3 Results

3.1 Random Dot Patterns.

I begin by investigating the model's response to binary random dot patterns, where the false-matching problem is at its most acute. I have previously (Read, 2002) demonstrated that the model performs close to 100% correct when tested with binary random dot stereograms in a front-back discrimination task. Now, I investigate the probability field in more detail and examine whether the model is capable of extracting the details of how disparity varies across the image. First, I consider a central disparate region superimposed on a zero-disparity background. Figure 2 shows the images used in the simulation (A–C) and the results obtained (D–F), considering just the y = 0 horizontal cross-section through the image.
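The maxima-field construction of section 2.5, used for the comparison results below, can be sketched as follows (the array layout is my own assumption, not the article's code):

```python
import numpy as np

def maxima_field(C):
    """Per-channel maxima combination (sketch): C has shape
    (n_channels, n_xc, n_d), the complex-cell response per channel as a
    function of position xc and preferred disparity d.  Each channel votes
    for its best disparity at each xc; M(xc, d) counts the votes."""
    n_channels, n_xc, n_d = C.shape
    M = np.zeros((n_xc, n_d))
    for ch in range(n_channels):
        best = np.argmax(C[ch], axis=1)   # best disparity index per xc
        M[np.arange(n_xc), best] += 1
    return M
```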
Figure 2C shows the total luminance of the left and right images at y = 0 (horizontal line in Figures 2A and 2B) and different x positions. Figure 2D shows the total response of complex cells, summed over all channels. Figure 2E shows the maxima field M(xc, d), obtained by extending the model of Qian (1994) to multiple spatial frequency and orientation channels. Figure 2F shows the probability field obtained with the Bayesian approach. Both models succeed in extracting the disparate region. The lines mark regions of each monocular image that have no match in the other eye due to occlusion. Naturally enough, the models cannot assign these a clear disparity, although the imposed smoothness leads to a slight tendency to continue the disparity of adjacent binocularly viewed regions into the occluded region.

3.2 The Double-Nail Illusion.

Here, I consider the model's response to the double-nail illusion. The model was presented with the images in Figures 3A and 3B. The images in the left and right eyes are identical (apart
Figure 2: (A, B) The random dot stereogram used in obtaining the model results shown in the lower two plots. The stereogram is 256 × 256 pixels, representing 1.7° × 1.7° of visual angle. It contains a target region of 128 × 128 pixels, with disparity 18 pixels, centered on a background with zero disparity. The dots' luminance relative to background is ±1300 times the standard deviation of the model's retinal noise. (C) The sum of the left and right images, IL(xL, y) + IR(xR, y), at the vertical position y marked with the line in A and B. (D) The response of complex cells, summed over all spatial frequency and orientation channels. The grayscale at (xL, xR) represents the total response of the population of complex cells with left- and right-eye RFs centered on (xL, y) and (xR, y), respectively. (E) The maxima field M(xc, d) obtained by generalizing the model of Qian (1994). Within each channel, for each xc, the disparity where the complex cell response was maximal contributes 1 to M(xc, d). (F) The probabilistic disparity map, in which the grayscale at (xL, xR) represents the probability P{(xL, y) ↔ (xR, y)} that (xL, y) matches (xR, y).
Figure 3: (A, B) Stimuli for the double-nail illusion. Each image contains two dots, 3 pixels (2.4 arcmin) square, with luminance 1300 times the noise. Only the center 40 pixels are shown. The remaining panels concern the horizontal plane containing the two objects (horizontal line across image plots). Details as for Figure 2.
from the noise), each containing two objects positioned at xL = −3 pixels, xR = +3 pixels. Depending on the correspondence made, this can be interpreted as two bars with disparity 0, positioned at xc = ±3 pixels (2.4 arcmin) in the frontoparallel plane, or as two bars with xc = 0 and disparity ±6 pixels (4.8 arcmin). These four matches are apparent in the plots of total luminance (see Figure 3C) and in the complex cell response (see Figure 3D).
Because neither the Bayesian nor the maxima model imposes a uniqueness constraint demanding a one-to-one match between xL and xR, they could potentially find all four matches. However, in fact the zero-disparity match is favored throughout the image, as is apparent from the dark stripe along the line of zero disparity, xL = xR, in Figures 3E and 3F. Since the background is devoid of features, any region of the background matches equally well with any other region. The smoothing implied by the receptive fields and (in the Bayesian case) the prior preference for small disparities ensure that the background is assigned zero disparity. So far there is little reason to prefer the more elaborate Bayesian analysis to the simpler maxima analysis based on the model of Qian (1994) and Qian and Zhu (1997). I therefore turn now to a stimulus where the Bayesian model performs better.

3.3 Occluding Bars/Panum's Limiting Case.

I consider a famous stimulus that provides a challenge to many existing models of stereopsis (Howard & Rogers, 1995; Qian, 1997) because it violates the uniqueness constraint. The visual scene is supposed to consist of two vertical bars, one centered on xc = 4 with disparity +6.4 arcmin (here, 8 pixels) and the other centered on xc = −4 with disparity −6.4 arcmin. Both bars fall at the same position in the left retina, xL = 0, while falling at different positions in the right retina, xR = ±8 pixels. Physically, in the left eye, the nearer bar occludes the farther bar, whereas in the right eye, both are visible. Thus, the correct match requires the position xL = 0 in the left retina to be matched both with position xR = −8 and with position xR = 8 in the right retina. The images and the results of the simulation are shown in Figure 4. The probability field in Figure 4F shows high probability at disparities d = ±8 pixels, as indicated by the dark stripes along the lines drawn at these disparities. Thus, the model successfully finds both correct matches.
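Reading multiple matches out of a probability field requires no uniqueness constraint: any match whose probability exceeds a criterion is reported. A minimal sketch (the threshold and array layout are illustrative assumptions, not part of the article's model):

```python
import numpy as np

def matches_above_threshold(P, thresh):
    """P[iL, iR] is the match probability between left-eye position iL and
    right-eye position iR.  Return every match exceeding thresh; a single
    left-eye position may legitimately appear in several matches, as in
    Panum's limiting case."""
    iL, iR = np.nonzero(P > thresh)
    return list(zip(iL.tolist(), iR.tolist()))
```

Because nothing forces each row or column to contribute at most one match, one left-eye bar can be paired with two right-eye bars, exactly the readout the occluding-bars stimulus demands.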
Because it does not enforce a uniqueness constraint, it is able to match the single bar in the left eye simultaneously with both bars in the right eye. Although the bars themselves are only 3 pixels wide, the probability is nonzero for several pixels along the lines d = ±8, even though the prior preference for small disparities means that blank regions of the image are normally assigned zero disparity. These constant-disparity stripes reflect the smoothness constraint built into the model. The lowest-frequency simple cells have very large receptive fields, so the match implied by the bars can potentially influence the probability assigned to very distant matches. The prior preference for zero disparity is thus opposed by the assumption that adjacent points have the same disparity. Under the hypothesis that the probability plotted here underlies perception, the interpretation is that an isolated disparate object tends to produce the perception of being embedded in a disparate frontoparallel plane. This is consonant with perceptual experience and could potentially help the brain reconstruct smooth surfaces from discrete disparate stimuli (Grimson, 1982).
Figure 4: (A, B) Stimuli for Panum’s limiting case. The right image contains two bars and the left image one. Each bar is 3 pixels (2.4 arcmin) wide, with luminance 13 in arbitrary units. Only the center 40 pixels are shown. Remaining panels as in Figure 2.
In contrast, the maxima field in Figure 4E still shows clear remnants of the cruciform structure of the raw complex cell response. This leads to a predicted perception of a "ghost" object in between the two objects actually present. (In the stimuli used, this is at the fixation point; in general, wherever the bars are with respect to fixation, the ghost will lie exactly between them.) The origin of this ghost is clear if one traces along the line xc = 0, marked
Figure 5: Stimuli as for Figure 4 except that the bars have disparity ±2 pixels. Only the center 20 pixels are shown. In all panels, the diagonal lines indicate disparity ±2 pixels. (A) The sum of the left and right images, IL(xL, y) + IR(xR, y), at y = 0. (B) The response of complex cells, summed over all spatial frequency and orientation channels. (C) The probabilistic disparity map P{(xL, y) ↔ (xR, y)}. (D) Probability weighted by image intensity: P{(xL, y) ↔ (xR, y)} × [IL(xL, y) + IR(xR, y)]. This is intended to approximate the perceptual experience. Thus, the model predicts a perception of two bars with disparity slightly greater than 2 pixels.
with a line in the plot of summed complex cell responses (see Figure 4D). Within each channel, the maximum complex cell response along this line occurs at zero disparity. If the bars' disparity is reduced, the Bayesian model experiences a repulsion illusion. Figure 5 shows the results obtained with bars of disparity ±2 pixels (1.6 arcmin). Now, the model perceives the bars at slightly different positions in the left eye and at slightly larger disparities than veridical. This is especially apparent when we plot the probability field weighted by the combined images, [IL(xL, y) + IR(xR, y)] (Figure 5D), in an attempt to show the spatial location of the visual objects perceived by the model. This illusion occurs when the separation between the images in the left
Figure 6: Why the bars in Panum's limiting case tend to repel at small separations. The five rows show complex cells tuned to five different values of disparity d and xc. (A, B) The positions of the complex cell receptive fields in the left and right retinas. (C) The corresponding position on the disparity map (circle = RF center). The shading in the disparity map indicates the probability assigned to the match. (D) The effect of all this on the probability map assigned to the stimulus. The bars tend to "repel" each other, being shifted away from each other along both d and xc.
and right eyes becomes smaller than the RF size in the majority of complex cells. Figure 6 explains why. Figures 6A and 6B represent the left and right retinas, with the positions of the bar stimuli in each eye indicated. The left retina contains a single bar at horizontal position xL; the right retina, two bars at positions xR1 and xR2. In addition, each plot contains an example complex cell RF. The circle represents the center of the RF in each retina, and the oval indicates the extent of the RF. The five rows of plots differ in the positions of the complex cell RFs. Figure 6C shows where the RF centers are located in the (xL, xR) disparity space familiar from the previous figures. The top row shows a complex cell tuned to a correct match, in which the bar in the left eye is correctly identified with a bar in the right eye (clearly
there is another correct match, given by the pairing with the other bar in the right eye, which is not shown here). However, although this match is in fact correct, the complex cell shown will accord it low probability. This is because the large RF of this cell extends over both bars in the right retina. The RFs in either eye are thus not stimulated equally, meaning that the calculated probability is low. The total match probability summed over all channels will be relatively low, as represented by the pale shading assigned to this correspondence in the disparity map in Figure 6C. The same effect occurs for the cells shown in the second row, which are tuned to the same disparity but a mean retinal position xc slightly to the right, and for those in the third row, which are tuned to the same mean retinal position xc as those in the first row, but a smaller disparity. Again, the complex cell reports low probability. The fourth row shows a cell that has the same xc as the first row but larger disparity; the fifth row shows a cell that has the same disparity but smaller xc. Here, exactly one bar falls in each eye's RF. Since the RFs are symmetric, the complex cell receives equal stimulation in both eyes and so signals high probability. This is represented by the dark shading in the disparity map. Figure 6D summarizes the results from the five rows. As in Figures 2 through 5, the grayscale at the point (xL, xR) represents the probability that the point xL is the correct match for xR. The correct match is not assigned the highest probability; matches with larger disparity or lower xc are considered more likely than the correct match. Thus, the bars are perceived shifted away from each other both in the front-back (d) and the left-right (xc) direction. This effect is due to channels whose RFs are longer horizontally than the separation between the bars' images in the right eye.
The repulsion is therefore due predominantly to channels tuned to low spatial frequencies or to orientations close to horizontal. Similar repulsion illusions have been reported with human subjects (Ruda, 1998; Badcock & Westheimer, 1985) and have been reproduced with a non-Bayesian model also based on energy-model complex cells (Qian, 1997; Mikaelian & Qian, 2000). At still smaller bar separations, human observers display an apparent "pooling" effect: the bars appear to attract rather than repel each other, and the actual disparity perceived is an average of the disparities of each bar. Qian's model reproduces this effect. This model, however, cannot display such pooling. This is because when two bars fall in one eye's RF and only one in the other, the two eyes' RFs are very unequally stimulated; the model naturally assigns a very low probability to such a match. This deficiency might be addressed by including some form of local contrast normalization between the left and right images.
4 Discussion
This article discusses one way of applying a probabilistic approach to the solution of the correspondence problem. Rather than seeking a unique match for every point in each retina, I propose assigning a probability to each potential match between left and right retinal positions. This is interpreted as the probability that an object exists in front of the viewer at the spatial location implied by the match. This probability is assumed to underlie perception: zeros of the probability field imply no perception of an object at that location, whereas nonzero values of probability imply a perception, whose clarity is presumed to increase with increasing probability, of an object with luminance given by the mean of the values in the left and right retinas (see Figure 1). There are many ways in which the brain might attempt to assign a probability to each potential match. The approach adopted here was motivated by a desire to incorporate as much as possible of the known physiology of binocular vision. The available physiological and psychophysical evidence suggests that initial processing takes place within channels tuned to a particular spatial frequency and orientation. Accordingly, the match probability is initially calculated within a single channel, using a Bayesian analysis based on the outputs of disparity-tuned complex cells and the simple cells that feed into them. This analysis introduces the two key constraints employed by the model in order to overcome the ill-posed nature of the correspondence problem. First, the disparity is smoothed over the RF dimensions in each channel. Second, the Bayesian prior is used to enforce a preference for small disparities. Information from different channels is then combined by averaging the probability reported from each individual channel. This means that information from all spatial scales is handled simultaneously; the model employs no coarse-to-fine hierarchy (Mallot, Gillner, & Arndt, 1996).
Bayesian stereo algorithms have a long history in the computer vision literature (Szeliski, 1990; Geiger, Ladendorf, & Yuille, 1992; Chang & Chatterjee, 1992; Scharstein & Szeliski, 1998). This article differs from these in two major respects. First, it attempts to incorporate the known physiology of disparity-tuned cells in primary visual cortex. Thus, it applies Bayesian ideas to the results of previous workers who used linear filters as a matching primitive (Lehky & Sejnowski, 1990; Jones & Malik, 1992; Sanger, 1988; Qian, 1994). The probability field is then derived (albeit with a number of simplifications and approximations) from the statistics of retinal images degraded by gaussian noise and processed by simple and complex cells. In contrast, previous Bayesian models have usually been postulated to take a Gibbs form: exp(−potential/temperature), where the potential is some cost function incorporating punishment for disparity discontinuities, poor luminance matches, and so on. Second, this article postulates that the probability field directly underlies perception. Previous studies incorporating Bayesian ideas have generally
Jenny C. A. Read
used it as a tool to arrive at a single match between points in the left and right eyes, explicitly including a form of uniqueness constraint (often allowing for occlusion—each point in the left image must match at most one point in the right image). The approach here was designed to allow for the possibility of multiple matches. The model is closest to that developed by Qian and coworkers (Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997; Qian & Andersen, 1997) and Prince and Eagle (2000a). It differs from their work in incorporating multiple spatial frequency and orientation channels and in processing the complex cell output to assign a probability to each match.

The model presented here has a number of serious limitations, which means that it can be regarded only as a preliminary model of stereo correspondence. It was initially developed to model psychophysical data obtained in a front-back depth discrimination task in which images were presented for 130 ms. Although it succeeds admirably in this task, the percept implied by the model lacks the clarity and sharpness of human perceptions. For instance, the ragged outline of the target region perceived in the random dot stereogram of Figure 2 does not accord with the sharp, square outline that humans viewing this figure perceive. This is perhaps not surprising for a model based on the known physiology of primary visual cortex, which incorporates no hierarchy of interactions or iterative processes that sharpen the performance of many other stereo models (cf. Marr & Poggio, 1976; Geiger et al., 1992; Scharstein & Szeliski, 1998). Thus, this model may represent an initial step in stereoscopic perception, providing a sketch of the 3D visual scene, which is then processed by still higher visual areas. The model makes use of a prior probability assigned to each disparity but does not address how the brain could arrive at such a judgment.
It does not include any mechanism for updating the prior on the basis of visual experience or information from nonvisual sources. A full Bayesian model would require such a mechanism, based on experimental evidence of how performance develops with training. The model fails to show the correct response with sparse anticorrelated stereograms. Anticorrelated stereograms are those in which the polarity of one eye’s image has been inverted, with black pixels becoming white and vice versa. Dense examples of such stimuli, such as binary random dot patterns, produce little or no impression of depth in most human subjects, although some display a slight tendency to see depth in the opposite direction to that implied by the disparity of the stereogram. The model reproduces this behavior (Read, 2002). However, sparse anticorrelated stimuli give human observers the impression of depth in the veridical direction (von Helmholtz, 1909; Cogan, Kontsevich, Lomakin, Halpern, & Blake, 1995), presumably because boundaries are extracted and matched even though the polarity of the boundaries is opposite. The model presented here assigns low probability to all anticorrelated potential matches and thus has no mechanism for reproducing this behavior.
Stereo Correspondence Problem
Further, the model currently includes no form of contrast normalization between left and right eyes. It seems likely that some such normalization exists, because human observers can fuse stereograms even where considerable contrast differences exist between the monocular images. The model cannot, because any contrast differences between left and right eyes mean that even the correct matches are considered improbable. In addition, the model fails to reproduce the "disparity pooling" observed with disparate bars at very small separations. A suitable scheme of local contrast normalization might be able to correct both of these discrepancies, although considerable work might be required to find a suitable implementation. Despite these limitations, the model is able to produce plausible solutions of the correspondence problem for a range of stimuli. It extracts the appropriate disparity map in dense random dot stereograms containing several target regions with different disparity. In the double-nail illusion, it perceives only two of four possible matches, in agreement with human observers. Models that impose a strict uniqueness constraint cannot handle Panum's limiting case, whereas the model presented here performs correctly with this stimulus. This is an advantage of the Bayesian approach over models with a similar physiological basis but a different method of extracting disparity, such as that of Qian (1994) and Qian and Zhu (1997). Finally, it has already been shown (Read, 2002) that the model accurately reproduces psychophysical functions in a front-back discrimination task, for both correlated and anticorrelated random noise stereograms with a range of spatial frequency and orientation bandwidths. Thus, the model combines physiological plausibility with wide explanatory power.

Acknowledgments
I am supported by a Mathematical Biology Training Fellowship from the Wellcome Trust. The simulations presented in this article were run with support from the Oxford Supercomputing Centre. I thank Bruce Cumming for helpful discussions.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299.
Anstis, S. M., & Rogers, B. J. (1975). Illusory reversal of visual depth and movement during changes of contrast. Vision Research, 15, 957–961.
Anzai, A., Ohzawa, I., & Freeman, R. (1999). Neural mechanisms for processing binocular information I. Simple cells. Journal of Neurophysiology, 82(2), 891–908.
Badcock, D. R., & Westheimer, G. (1985). Spatial location and hyperacuity: Flank position within the centre and surround zones. Spatial Vision, 1(1), 3–11.
1390
Jenny C. A. Read
Chang, C., & Chatterjee, S. (1992). A deterministic approach for stereo disparity calculation. In G. Sandini (Ed.), Computer vision: ECCV'92 (pp. 425–433). New York: Springer-Verlag.
Cleary, R., & Braddick, O. (1990). Direction discrimination for band-pass filtered random dot kinematograms. Vision Research, 30(2), 303–316.
Cogan, A. I., Kontsevich, L. L., Lomakin, A. J., Halpern, D. L., & Blake, R. (1995). Binocular disparity processing with opposite-contrast stimuli. Perception, 24(1), 33–47.
Cogan, A. I., Lomakin, A. J., & Rossi, A. F. (1993). Depth in anticorrelated stereograms: Effects of spatial density and interocular delay. Vision Research, 33(14), 1959–1975.
Cumming, B., Shapiro, S. E., & Parker, A. J. (1998). Disparity detection in anticorrelated stereograms. Perception, 27(11), 1367–1377.
de Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22(5), 545–559.
de Valois, R. L., & de Valois, K. K. (1988). Spatial vision. Oxford: Oxford University Press.
de Valois, R., Yund, E. W., & Hepler, N. (1982). The orientation and direction selectivity of cells in macaque visual cortex. Vision Research, 22(5), 531–544.
Dev, P. (1975). Perception of depth surfaces in random-dot stereograms: A neural model. International Journal of Man-Machine Studies, 7, 511–528.
Fleet, D., Wagner, H., & Heeger, D. (1996). Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Research, 36(12), 1839–1857.
Geiger, D., Ladendorf, B., & Yuille, A. (1992). Occlusions and binocular stereo. In G. Sandini (Ed.), Computer vision: ECCV'92 (pp. 425–433). New York: Springer-Verlag.
Grimson, W. E. (1981). A computer implementation of a theory of human stereo vision. Philosophical Transactions of the Royal Society, London, B: Biological Sciences, 292(1058), 217–253.
Grimson, W. E. (1982). A computational theory of visual surface interpolation.
Philosophical Transactions of the Royal Society, London, B: Biological Sciences, 298(1092), 395–427.
Howard, I. P., & Rogers, B. J. (1995). Binocular vision and stereopsis. Oxford: Oxford University Press.
Jones, D. G., & Malik, J. (1992). A computation framework for determining stereo correspondence from a set of linear spatial filters. In G. Sandini (Ed.), Computer vision: ECCV'92 (pp. 395–410). New York: Springer-Verlag.
Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press.
Knill, D., & Richards, W. (1996). Perception as Bayesian inference. Cambridge: Cambridge University Press.
Krol, J. D., & van de Grind, W. A. (1980). The double-nail illusion: Experiments on binocular vision with nails, needles, and pins. Perception, 9(6), 651–669.
Lehky, S. R., & Sejnowski, T. J. (1990). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10(7), 2281–2299.
Mallot, H., & Bideau, H. (1990). Binocular vergence influences the assignment of stereo correspondences. Vision Research, 30(10), 1521–1523.
Mallot, H. A., Gillner, S., & Arndt, P. A. (1996). Is correspondence search in human stereo vision a coarse-to-fine process? Biological Cybernetics, 74(2), 95–106.
Mansfield, J. S., & Parker, A. (1993). An orientation-tuned component in the contrast masking of stereopsis. Vision Research, 33(11), 1535–1544.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194(4262), 283–287.
Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proceedings of the Royal Society, London, B: Biological Sciences, 204(1156), 301–328.
McKee, S. P., Bravo, M. J., Smallman, H. S., & Legge, G. E. (1995). The "uniqueness constraint" and binocular masking. Perception, 24(1), 49–65.
McLoughlin, N. P., & Grossberg, S. (1998). Cortical computation of stereo disparity. Vision Research, 38(1), 91–99.
Mikaelian, S., & Qian, N. (2000). A physiologically-based explanation of disparity attraction and repulsion. Vision Research, 40(21), 2999–3016.
Movshon, J., Thompson, I., & Tolhurst, D. J. (1978). Spatial summation in the receptive fields of simple cells in the cat's striate cortex. Journal of Physiology, 283, 53–77.
Nelson, J. I. (1975). Globality and stereoscopic fusion in binocular vision. Journal of Theoretical Biology, 49(1), 1–88.
Ohzawa, I., DeAngelis, G., & Freeman, R. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249(4972), 1037–1041.
Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. Journal of Neurophysiology, 77(6), 2879–2909.
Pollard, S. B., & Frisby, J. P. (1990). Transparency and the uniqueness constraint in human and computer stereo vision. Nature, 347(6293), 553–556.
Pollard, S. B., Mayhew, J. E., & Frisby, J. P. (1985).
PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14(4), 449–470.
Prince, S. J. D., & Eagle, R. A. (2000a). Weighted directional energy model of human stereo correspondence. Vision Research, 40(9), 1143–1155.
Prince, S. J. D., & Eagle, R. A. (2000b). Stereo correspondence in one-dimensional Gabor stimuli. Vision Research, 40(9), 913–924.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404.
Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18(3), 359–368.
Qian, N., & Andersen, R. A. (1997). A physiological model for motion-stereo integration and a unified explanation of Pulfrich-like phenomena. Vision Research, 37(12), 1683–1698.
Qian, N., & Sejnowski, T. (1989). Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation. In Proceedings of the 1988 Connectionist models summer school (pp. 435–443). London: Harcourt.
Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Research, 37(13), 1811–1827.
Read, J. C. A. (2002). A Bayesian model of stereo depth/motion direction discrimination. Biological Cybernetics, 86(2), 117–136.
Read, J. C. A., & Eagle, R. A. (2000). Reversed stereo depth and motion direction with anti-correlated stimuli. Vision Research, 40(24), 3345–3358.
Ruda, H. (1998). The warped geometry of visual space near a line assessed using a hyperacuity displacement task. Spatial Vision, 11(4), 401–419.
Sanger, T. (1988). Stereo disparity computation using Gabor filters. Biological Cybernetics, 59, 405–418.
Scharstein, D., & Szeliski, R. (1998). Stereo matching with nonlinear diffusion. International Journal of Computer Vision, 28, 155–174.
Szeliski, R. (1990). Bayesian modelling of uncertainty in low-level vision. International Journal of Computer Vision, 5, 271–301.
von Helmholtz, H. (1909). Handbuch der physiologischen Optik. Hamburg, Germany: Voss.
Weinshall, D. (1989). Perception of multiple transparent planes in stereo vision. Nature, 341(6244), 737–739.
Westheimer, G. (1986). Panum's phenomenon and the confluence of signals from the two eyes in stereoscopy. Philosophical Transactions of the Royal Society, London, B: Biological Sciences, 228(1252), 289–305.
Zhu, Y. D., & Qian, N. (1996). Binocular receptive field models, disparity tuning, and characteristic disparity. Neural Computation, 8(8), 1611–1641.

Received April 3, 2001; accepted October 15, 2001.
LETTER
Communicated by Chris Williams
Learning Curves for Gaussian Process Regression: Approximations and Bounds

Peter Sollich, [email protected]
Anason Halees, [email protected]
Department of Mathematics, King's College London, London WC2R 2LS, U.K.
We consider the problem of calculating learning curves (i.e., average generalization performance) of gaussian processes used for regression. On the basis of a simple expression for the generalization error, in terms of the eigenvalue decomposition of the covariance function, we derive a number of approximation schemes. We identify where these become exact and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth. We also study possible improvements to our approximations. Finally, we use a simple exactly solvable learning scenario to show that there are limits of principle on the quality of approximations and bounds expressible solely in terms of the eigenvalue spectrum of the covariance function.

1 Introduction
Within the neural networks community, there has in the past few years been a good deal of excitement about the use of gaussian processes (GPs) as an alternative to feedforward networks (see, e.g., Williams & Rasmussen, 1996; Williams, 1997; Barber & Williams, 1997; Goldberg, Williams, & Bishop, 1998; Sollich, 1999a; Malzahn & Opper, 2001). The advantages of GPs are that prior assumptions about the problem to be learned are encoded in a very transparent way and that inference—at least in the case of regression that we will consider—is relatively straightforward. GPs are also "nonparametric" in the sense that their effective number of parameters (degrees of freedom) can grow arbitrarily large as more and more training data are collected. Finally, interest in GPs has also been stimulated by the fact that they are at the heart of the large family of kernel machine learning methods (see, e.g., www.kernel-machines.org). One crucial question for applications is then how fast GPs learn—that is, how many training examples are needed to achieve a certain level of generalization performance. The typical (as opposed to worst-case) behavior is

Neural Computation 14, 1393–1428 (2002) © 2002 Massachusetts Institute of Technology
captured in the learning curve, which gives the average generalization error ε as a function of the number of training examples n. Several workers have derived bounds on ε(n) (Michelli & Wahba, 1981; Plaskota, 1990; Opper,
1997; Trecate, Williams, & Opper, 1999; Opper & Vivarelli, 1999; Williams & Vivarelli, 2000) or studied its large n asymptotics (Silverman, 1985; Ritter, 1996). As we will illustrate below, however, the existing bounds are often far from tight, and asymptotic results will not necessarily apply for realistic sample sizes n. Our main aim in this article is therefore to derive approximations to ε(n) that get closer to the true learning curves than existing bounds, and apply for both small and large n. We compare these approximations with existing bounds and the results of numerical simulations; possible improvements to the approximations are also discussed. Finally, we study an analytically solvable example scenario that sheds light on how tight bounds on learning curves can be made in principle. Summaries of the early stages of this work have appeared in conference proceedings (Sollich, 1999a, 1999b).

In its simplest form, the regression problem that we are considering is this: We are trying to learn a function θ* that maps inputs x (real-valued vectors) to (real-valued scalar) outputs θ*(x). We are given a set of training data D, consisting of n input-output pairs (x_l, y_l); the training outputs y_l may differ from the clean target outputs θ*(x_l) due to corruption by noise. Given a test input x, we are then asked to come up with a prediction θ(x) for the corresponding output, expressed either in the simple form of a mean prediction θ̂(x) plus error bars, or more comprehensively in terms of a predictive distribution P(θ(x)|x, D). In a Bayesian setting, we do this by specifying a prior P(θ) over our hypothesis functions and a likelihood P(D|θ) with which each θ could have generated the training data; from this, we deduce the posterior distribution P(θ|D) ∝ P(D|θ)P(θ).

If we wanted to use a feedforward network for this task, we could proceed as follows: Specify candidate networks by a set of weights w, with prior probability P(w).
Each network defines a (stochastic) input-output relation described by the distribution of output y given input x (and weights w), P(y|x, w). Multiplying over the whole data set, we get the probability of the observed data having been produced by the network with weights w: P(D|w) = ∏_{l=1}^{n} P(y_l|x_l, w). Bayes' theorem then gives us the posterior—the probability of network w given the data—as P(w|D) ∝ P(D|w)P(w), up to an overall normalization factor. From this, finally, we get the predictive distribution P(y|x, D) = ∫ dw P(y|x, w) P(w|D). This solves the regression problem in principle, but leaves us with a nasty integral over all possible network weights: the posterior P(w|D) generally has a highly nontrivial structure, with many local peaks (corresponding to local minima in the training error). One therefore has to use sophisticated Monte Carlo integration techniques (Neal, 1993) or local approximations to P(w|D) around its maxima (MacKay, 1992) to tackle this problem. Even once this has been done, one is still left with the question of how to interpret the results. We may, for example, want to select priors on the basis of the
data, by making the prior P(w|h) dependent on a set of hyperparameters h and choosing h such as to maximize the probability P(D|h) of the data. Once we have found the optimal prior, we would then hope that it tells us something about the regression problem at hand (whether certain input components are irrelevant, for example). This would be easy if the prior revealed directly how likely certain input-output functions are; instead, we have to extract this information from the prior over weights, often a complicated process.

By contrast, for a GP it is an almost trivial task to obtain the posterior and the predictive distribution (see below). One reason for this is that the prior P(θ) is defined directly over input-output functions θ. How is this done? Any θ is uniquely determined by its output values θ(x) for all x from the input domain, and for a GP, these are simply assumed to have a joint gaussian distribution (hence, the name). This distribution can be specified by the mean values ⟨θ(x)⟩_θ (which we take to be zero in the following, as is commonly done) and the covariances ⟨θ(x)θ(x′)⟩_θ = C(x, x′); C(x, x′) is called the covariance function of the GP. It encodes in an easily interpretable way prior assumptions about the function to be learned. Smoothness, for example, is controlled by the behavior of C(x, x′) for x′ → x. The Ornstein-Uhlenbeck (OU) covariance function C(x, x′) ∝ exp(−‖x − x′‖/l) produces very rough (nondifferentiable) functions, while functions sampled from the radial basis function (RBF) prior with C(x, x′) ∝ exp[−‖x − x′‖²/(2l²)] are—in the mean-square sense—infinitely differentiable. (Intermediate priors yielding r times mean-square differentiable functions can also be defined by using modified Bessel functions as covariance functions; see Stein, 1989.) Figure 1 illustrates these characteristics with two samples from the OU and RBF priors, respectively, over a two-dimensional input domain.
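As a minimal numerical sketch (my own code, not the article's; Figure 1 itself uses a two-dimensional input domain, whereas this uses one dimension for brevity), one can draw samples from the two priors and measure the roughness difference directly:

```python
import numpy as np

def sample_gp_prior(x, cov, seed=0):
    """Draw one sample from a zero-mean GP prior with covariance function cov."""
    K = cov(x[:, None], x[None, :])
    w, V = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)               # clip tiny negative eigenvalues
    z = np.random.default_rng(seed).standard_normal(len(x))
    return V @ (np.sqrt(w) * z)             # sample covariance is V diag(w) V^T = K

l = 0.1
ou = lambda x, y: np.exp(-np.abs(x - y) / l)          # rough (nondifferentiable) samples
rbf = lambda x, y: np.exp(-(x - y)**2 / (2 * l**2))   # smooth samples

x = np.linspace(0.0, 1.0, 200)
f_ou = sample_gp_prior(x, ou, seed=0)
f_rbf = sample_gp_prior(x, rbf, seed=1)
# the OU sample is much rougher: its mean squared increment is far larger
```

For grid spacing δ, the expected squared increment is 2(1 − exp(−δ/l)) for the OU prior but only about δ²/l² for the RBF prior, which is the roughness contrast visible in Figure 1.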
The length scale parameter l in the covariance functions also has an intuitive meaning. It corresponds directly to the distance in input space over which we expect our function to vary significantly. More complex properties can also be encoded; by replacing l with different length scales for each input component, for example, relevant (small l) and irrelevant (large l) inputs can be distinguished.

How does inference with GPs work? We give only a brief summary here and refer to existing reviews on the subject (see, e.g., Williams, 1998) for details. It is simplest to assume that outputs y are generated from the clean values of a hypothesis function θ(x) by adding gaussian noise of x-independent variance σ². The joint distribution of a set of n training outputs {y_l} and the function values θ(x) are then also gaussian, with covariances given by

⟨y_l y_m⟩ = C(x_l, x_m) + σ² δ_lm = (K)_lm,
⟨y_l θ(x)⟩ = C(x_l, x) = (k(x))_l,

where we have defined an n × n matrix K and an x-dependent n-component vector k(x). The posterior distribution P(θ|D) is then obtained by simply
Figure 1: Samples θ(x) drawn from GP priors over functions on [0, 1]². (Top) OU covariance function, C(x, x′) = exp(−‖x − x′‖/l). (Bottom) RBF covariance function, C(x, x′) = exp[−‖x − x′‖²/(2l²)]. The length scale l = 0.1 determines in both cases over what distance the functions vary significantly. Note the difference in roughness of the two functions; this is related to the behavior of the covariance functions for x → x′.
conditioning on the {y_l}. It is again gaussian and has mean

θ̂(x, D) ≡ ⟨θ(x)⟩_{θ|D} = k(x)ᵀ K⁻¹ y    (1.1)

and variance

ε(x, D) ≡ ⟨(θ(x) − θ̂(x))²⟩_{θ|D} = C(x, x) − k(x)ᵀ K⁻¹ k(x).    (1.2)
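Equations 1.1 and 1.2 are straightforward to evaluate numerically. The following sketch is my own illustration, not code from the article (the variable names and the particular RBF covariance with σ² = 0.05 are assumptions, chosen to match the example scenario used later in the text):

```python
import numpy as np

l, noise = 0.1, 0.05  # length scale and noise variance sigma^2 (assumed values)
cov = lambda x, y: np.exp(-np.subtract.outer(x, y)**2 / (2 * l**2))

def gp_posterior(X, y, xs):
    """Posterior mean (eq. 1.1) and Bayes error (eq. 1.2) at test inputs xs."""
    K = cov(X, X) + noise * np.eye(len(X))  # (K)_lm = C(x_l, x_m) + sigma^2 delta_lm
    k = cov(X, xs)                          # columns are k(x) for each test input
    Kinv_k = np.linalg.solve(K, k)
    mean = Kinv_k.T @ y                     # k(x)^T K^-1 y
    var = 1.0 - np.einsum('ls,ls->s', k, Kinv_k)  # C(x, x) = 1 for this kernel
    return mean, var

rng = np.random.default_rng(1)
X = rng.random(10)                 # training inputs, uniform on [0, 1]
y = rng.standard_normal(10)        # training outputs (arbitrary placeholder values)
mean, var = gp_posterior(X, y, np.linspace(0, 1, 50))
```

A quick sanity check: since posterior variance can only shrink as observations are added, the Bayes error at a training input can never exceed its single-observation value σ²/(1 + σ²).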
Equations 1.1 and 1.2 solve the inference problem for a GP. They provide us directly with the predictive distribution P(θ(x)|x, D). The posterior variance, equation 1.2, in fact also gives us the expected generalization error (or Bayes error) at x. Why? If the true target function is θ*, the squared deviation between our mean prediction and the true output¹ is (θ̂(x) − θ*(x))²; averaging this over the posterior distribution of target functions P(θ*|D) gives equation 1.2. The underlying assumption is that our assumed GP prior is the true one from which the target functions are actually generated (and that we are using the correct noise model). Otherwise, the expected generalization error is larger and given by a more complicated expression (Williams & Vivarelli, 2000). In line with most other work on the subject, we consider only the "correct prior" case in the following. Averaging the generalization error at x over the distribution of inputs gives

ε(D) = ⟨ε(x, D)⟩_x = ⟨C(x, x) − k(x)ᵀ K⁻¹ k(x)⟩_x.    (1.3)

This form of the generalization error, which is well known (Michelli & Wahba, 1981; Opper, 1997; Williams, 1998; Williams & Vivarelli, 2000), still depends on the training inputs; the fact that the training outputs have dropped out already is a signature of the fact that GPs are linear predictors (compare equation 1.1). Averaging over data sets yields the quantity we are after:

ε = ⟨ε(D)⟩_D.    (1.4)
This average expected generalization error (we will drop the "average expected" in the following) depends on only the number of training examples n; the function ε(n) is called the learning curve. Its exact calculation is difficult because of the joint average in equations 1.3 and 1.4 over the training inputs x_l and the test input x. Before proceeding with our calculation of the learning curve ε(n), let us try to gain some intuitive insight into its dependence on n. Consider a simple

¹ One can also measure the generalization error by the squared deviation between the prediction θ̂(x) and the noisy version of the true output; this simply adds a term σ² to equation 1.3.
Figure 2: Generalization error ε(x, D) as a function of input position x ∈ [0, 1], for noise level σ² = 0.05, RBF covariance function C(x, x′) = exp[−|x − x′|²/(2l²)] with l = 0.1, for randomly drawn training sets D of size n = 2 (left) and n = 100 (right). To emphasize the difference in scale, the plot on the left also includes the results for n = 100 shown on the right, just visible below the dashed line at ε(x, D) = σ².
example scenario, where inputs x are one-dimensional and drawn randomly from the unit interval [0, 1], with uniform probability. For the covariance function, we choose an RBF form, C(x, x′) = exp[−|x − x′|²/(2l²)] with l = 0.1. Here, we have taken the prior variance C(x, x) as unity; as seems realistic for most applications, we assume the noise level to be much smaller than this, σ² = 0.05. Figure 2 illustrates the x-dependence of the generalization error ε(x, D) for a small training set (n = 2). Each of the examples has made a "dent" in ε(x, D), with a shape that is similar to that of the covariance function.² Outside the dents, ε(x, D) still has essentially its prior value, ε(x, D) = 1; at the center of each dent, it is reduced to a much smaller value, ε(x, D) ≈ σ²/(1 + σ²) (this approximation holds as long as the different training inputs are sufficiently far away from each other). The generalization error ε(D) is therefore dominated by regions where no training examples have been seen; one has ε(D) ≫ σ², and the precise value of ε(D) depends only very little on σ² (assuming always that σ² ≪ 1). Gradually, as n is increased, the covariance dents will cover the input space, so that ε(x, D) and ε(D) become of order σ²; this situation is shown on the right of Figure 2. From this point on, further training examples essentially have the effect of averaging out noise, eventually making ε(D) ≪ σ² for large enough n. In summary, we expect the learning curve ε(n) to have two regimes. In the initial (small n) regime, where ε(n) ≫ σ², ε(n) is essentially independent of σ² and reflects mainly the geometrical distribution of covariance dents across the input space. In the asymptotic regime (n large enough such that ε(n) ≪ σ²), the noise level σ² is important in controlling the size of ε(n) because learning arises mainly from the averaging out of noise in the training data.

² More precisely, the dents have the shape of the square of the covariance function. If the training inputs x_i are sufficiently far apart, then around each x_i we can neglect the influence of the other data points and apply equation 1.2 with n = 1, giving ε(x, D) ≈ C(x, x) − C²(x, x_i)/[C(x_i, x_i) + σ²].

The remainder of this article is structured as follows. In the next section, we derive several approximations to GP learning curves, starting from an exact representation in terms of the eigenvalues and eigenfunctions of the covariance function. In section 3, we compare these approximations to previous bounds and find, across a range of learning scenarios, that they generally get rather closer to the true learning curves. This section also includes extensions of two existing bounds to a larger domain of applicability. In section 4, we investigate the potential for improving the accuracy of our approximations further. Finally, section 5 deals with the question of whether there are limits of principle on the quality of learning curve approximations and bounds expressible solely in terms of the eigenvalues of the covariance function; we find that such limits do indeed exist. In the final section, we summarize our results and give an overview of the challenges that remain.

2 Approximate Learning Curves
Calculating learning curves for GPs exactly is a difficult problem because of the joint average in equations 1.3 and 1.4 over the training inputs x_l and the test input x. Several workers have therefore derived upper and lower bounds on ε (Michelli & Wahba, 1981; Plaskota, 1990; Opper, 1997; Williams & Vivarelli, 2000) or studied the large n asymptotics of ε(n) (Silverman, 1985; Ritter, 1996). As we will illustrate below, however, the existing bounds are often far from tight; similarly, asymptotic results can only capture the large n regime defined above and will not necessarily apply for sample sizes n occurring in practice. We therefore now attempt to derive approximations to ε(n) that get closer to the true learning curves than existing bounds, and are applicable for both small and large n.

As a starting point for an approximate calculation of ε(n), we first derive a representation of the generalization error in terms of the eigenvalue decomposition of the covariance function. Mercer's theorem (see, e.g., Wong, 1971) tells us that the covariance function can be decomposed into its eigenvalues λ_i and eigenfunctions φ_i(x):

C(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′).    (2.1)
This is simply the analog of the eigenvalue decomposition of a finite symmetric matrix. We assume here that eigenvalues and eigenfunctions are defined relative to the distribution over inputs x, i.e.,

⟨C(x, x′) φ_i(x′)⟩_{x′} = λ_i φ_i(x).    (2.2)

The eigenfunctions are then orthogonal with respect to the same distribution, ⟨φ_i(x) φ_j(x)⟩_x = δ_ij (see, e.g., Williams & Seeger, 2000). Now write the data-dependent generalization error, equation 1.3, as

ε(D) = ⟨C(x, x)⟩_x − tr K⁻¹ ⟨k(x) k(x)ᵀ⟩_x
and perform the x-average:

⟨(k(x) k(x)ᵀ)_lm⟩_x = Σ_ij λ_i λ_j φ_i(x_l) ⟨φ_i(x) φ_j(x)⟩_x φ_j(x_m) = Σ_i λ_i² φ_i(x_l) φ_i(x_m).
This suggests introducing the diagonal matrix (Λ)_ij = λ_i δ_ij and the design matrix (Ψ)_li = φ_i(x_l), so that ⟨k(x) k(x)ᵀ⟩_x = Ψ Λ² Ψᵀ. One then also has

⟨C(x, x)⟩_x = tr Λ,    (2.3)

and the matrix K is expressed as

K = σ² I + Ψ Λ Ψᵀ,

with I being the identity matrix. Collecting these results, we get

ε(D) = tr Λ − tr (σ² I + Ψ Λ Ψᵀ)⁻¹ Ψ Λ² Ψᵀ.
Now one can apply the Woodbury formula (Press, Teukolsky, Vetterling, & Flannery, 1992), which states that (A + UVᵀ)⁻¹ = A⁻¹ − A⁻¹U(I + VᵀA⁻¹U)⁻¹VᵀA⁻¹ for matrices A, U, V of appropriate size. Setting A = Λ⁻¹, U = V = σ⁻¹Ψ and using the cyclic invariance of the trace, one then finds ε(n) = ⟨ε(D)⟩_D with³

ε(D) = tr (Λ⁻¹ + σ⁻² ΨᵀΨ)⁻¹ = tr Λ (I + σ⁻² Λ ΨᵀΨ)⁻¹.    (2.4)

The advantage of this (still exact) representation of the generalization error is that the average over the test input x has already been carried out and that the remaining dependence on the training inputs is contained entirely in the matrix ΨᵀΨ. It also includes as a special case the well-known result for linear regression (see, e.g., Sollich, 1994); Λ⁻¹ and ΨᵀΨ can be interpreted as suitably generalized versions of the weight decay (matrix) and input correlation matrix.

³ If the covariance function has zero eigenvalues, the inverse Λ⁻¹ does not exist, and the second form of ε(D) given in equation 2.4 must be used; similar alternative forms, though not explicitly written, exist for all following results.

Starting from equation 2.4, one can now derive approximate expressions for the learning curve ε(n). The most naive approach is to neglect entirely the fluctuations in ΨᵀΨ over different data sets and replace it by its average, which is simply ⟨(ΨᵀΨ)_ij⟩_D = Σ_{l=1}^{n} ⟨φ_i(x_l) φ_j(x_l)⟩_D = n δ_ij. This leads to

ε_OV(n) = tr (Λ⁻¹ + σ⁻² n I)⁻¹.    (2.5)
While this is not, in general, a good approximation, it was shown by Opper and Vivarelli (1999) to be a lower bound (called the OV bound below) on the learning curve. It becomes tight in the large noise limit $\sigma^2 \to \infty$ at constant $n/\sigma^2$: The fluctuations of the elements of the matrix $\sigma^{-2} \Phi^{\rm T} \Phi$ then become vanishingly small (of order $\sqrt{n}\, \sigma^{-2} = (n/\sigma^2)/\sqrt{n} \to 0$), and so replacing $\Phi^{\rm T} \Phi$ by its average is justified. By a similar argument, one also expects that (for any fixed $\sigma^2 > 0$) the OV bound will become increasingly tight as $n$ increases.

To derive better approximations, it is useful to see how the matrix $G = (\Lambda^{-1} + \sigma^{-2} \Phi^{\rm T} \Phi)^{-1}$ changes when a new example is added to the training set. Using the Woodbury formula again, this change can be expressed as

    G(n+1) - G(n) = [G^{-1}(n) + \sigma^{-2} \psi \psi^{\rm T}]^{-1} - G(n) = - \frac{G(n) \psi \psi^{\rm T} G(n)}{\sigma^2 + \psi^{\rm T} G(n) \psi}   (2.6)

in terms of the vector $\psi$ with elements $(\psi)_i = \phi_i(x_{n+1})$. To get exact learning curves, one would have to average this update formula over both the new training input $x_{n+1}$ and all previous ones. This is difficult, but progress can be made by neglecting correlations of numerator and denominator in equation 2.6, averaging them separately instead. Also treating $n$ as a continuous variable, this yields the approximation

    \frac{\partial \bar{G}(n)}{\partial n} = - \frac{\langle G^2(n) \rangle}{\sigma^2 + \mathrm{tr}\, \bar{G}(n)},   (2.7)
where we have introduced the notation $\bar{G} = \langle G \rangle$. If we also neglect fluctuations in $G$, approximating $\langle G^2 \rangle = \bar{G}^2$, we have $\partial \bar{G}/\partial n = -\bar{G}^2/(\sigma^2 + \mathrm{tr}\, \bar{G})$. Since at $n = 0$, the solution $\bar{G} = \Lambda$ is diagonal, it will remain so for all $n$, and one can rewrite this equation for $\bar{G}(n)$ as $\partial \bar{G}^{-1}/\partial n = (\sigma^2 + \mathrm{tr}\, \bar{G})^{-1} I$. $\bar{G}^{-1}(n)$ thus differs from $\bar{G}^{-1}(0) = \Lambda^{-1}$ only by a multiple of the identity matrix and can be represented as $\bar{G}^{-1}(n) = \Lambda^{-1} + \sigma^{-2} n' I$, where $n'$ has to obey $\sigma^{-2}\, dn'/dn = [\sigma^2 + \mathrm{tr}\, (\Lambda^{-1} + \sigma^{-2} n' I)^{-1}]^{-1}$. Integrating, one finds for $n'$ the equation

    n' + \mathrm{tr}\, \ln(I + \sigma^{-2} n' \Lambda) = n,

and the generalization error $\epsilon = \mathrm{tr}\, \bar{G}$ is given by

    \epsilon_{\rm UC}(n) = \mathrm{tr}\, (\Lambda^{-1} + \sigma^{-2} n' I)^{-1}.   (2.8)
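Equation 2.8 is straightforward to evaluate once the eigenvalues are known. A minimal numerical sketch (ours; the function name and the power-law example spectrum are assumptions): the left-hand side of the equation for $n'$ is increasing in $n'$ and bracketed by $[0, n]$, so bisection suffices.

```python
import numpy as np

def eps_uc(n, lam, s2):
    """UC approximation, equation 2.8 (our own sketch): solve
    n' + sum_i log(1 + n' lam_i / s2) = n for the effective number of
    examples n' by bisection, then evaluate tr(Lambda^{-1} + n'/s2 I)^{-1}."""
    f = lambda m: m + np.sum(np.log1p(m * lam / s2)) - n
    lo, hi = 0.0, float(n)              # f(0) = -n <= 0 and f(n) >= 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if f(mid) > 0 else (mid, hi)
    n_eff = 0.5 * (lo + hi)
    return np.sum(1.0 / (1.0 / lam + n_eff / s2))

lam = 1.0 / np.arange(1, 201) ** 2      # assumed power-law spectrum
s2 = 0.05
print([eps_uc(n, lam, s2) for n in (0, 30, 100, 300)])   # decreasing from tr Lambda
```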
By comparison with equation 2.5, $n'$ can be thought of as an effective number of training examples. The subscript UC in equation 2.8 stands for "upper continuous" (treating $n$ as continuous) approximation. A better approximation with a lower value is obtained by retaining fluctuations in $G$. As in the case of the linear perceptron (Sollich, 1994), this can be achieved by introducing an auxiliary offset parameter $v$ into the definition of $G$, according to

    G^{-1} = vI + \Lambda^{-1} + \sigma^{-2} \Phi^{\rm T} \Phi.   (2.9)
One can then write

    \langle - \mathrm{tr}\, G^2 \rangle = \frac{\partial}{\partial v} \mathrm{tr}\, \langle G \rangle = \frac{\partial \epsilon}{\partial v}   (2.10)

and obtain from equation 2.7 the partial differential equation

    \frac{\partial \epsilon}{\partial n} - \frac{1}{\sigma^2 + \epsilon} \frac{\partial \epsilon}{\partial v} = 0.   (2.11)
This can be solved for $\epsilon(n, v)$ using the method of characteristic curves (see appendix A). Resetting the auxiliary parameter $v$ to zero yields the lower continuous (LC) approximation to the learning curve, which is given by the self-consistency equation

    \epsilon_{\rm LC}(n) = \mathrm{tr}\left( \Lambda^{-1} + \frac{n}{\sigma^2 + \epsilon_{\rm LC}} I \right)^{-1}.   (2.12)
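Equation 2.12 is a scalar self-consistency condition and can be solved, for example, by bisection between $\epsilon = 0$ and $\epsilon = \mathrm{tr}\, \Lambda$, where the defect changes sign. A sketch (ours; names and the example spectrum are assumptions):

```python
import numpy as np

def eps_lc(n, lam, s2):
    """LC approximation, equation 2.12 (our own sketch): bisection on
    h(eps) = eps - sum_i (1/lam_i + n/(s2 + eps))^{-1}, which changes sign
    between eps = 0 and eps = tr Lambda."""
    h = lambda e: e - np.sum(1.0 / (1.0 / lam + n / (s2 + e)))
    lo, hi = 0.0, float(lam.sum())
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if h(mid) > 0 else (mid, hi)
    return 0.5 * (lo + hi)

lam = 1.0 / np.arange(1, 201) ** 2      # assumed power-law spectrum
s2 = 0.05
print([eps_lc(n, lam, s2) for n in (0, 30, 100, 300)])   # decreasing from tr Lambda
```

The same routine covers the OV expression, equation 2.5, as the special case in which the denominator $\sigma^2 + \epsilon$ is replaced by $\sigma^2$; the LC value always lies above that special case.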
It is easy to show that $\epsilon_{\rm LC} \leq \epsilon_{\rm UC}$. One can also check that both approximations converge to the exact result, equation 2.5, in the large noise limit (as defined above). Encouragingly, we see that the LC approximation reflects our intuition about the difference between the initial and asymptotic regimes of the learning curve. For $\epsilon \gg \sigma^2$, we can simplify equation 2.12 to

    \epsilon_{\rm LC}(n) = \mathrm{tr}\left( \Lambda^{-1} + \frac{n}{\epsilon_{\rm LC}} I \right)^{-1},
where, as expected, the noise level $\sigma^2$ has dropped out. In the opposite limit, $\epsilon \ll \sigma^2$, on the other hand, we have

    \epsilon_{\rm LC}(n) = \mathrm{tr}\left( \Lambda^{-1} + \frac{n}{\sigma^2} I \right)^{-1},   (2.13)
which, again as expected, retains the noise level $\sigma^2$ as an important parameter. Equation 2.13 also shows that $\epsilon_{\rm LC}$ approaches the OV lower bound (from above) for sufficiently large $n$.

We conclude this section with a brief qualitative discussion of the expected $n$-dependence of $\epsilon_{\rm LC}$ (which will turn out to be the more accurate of our two approximations). Obviously, this $n$-dependence depends on the spectrum of eigenvalues $\lambda_i$; below, we always assume that these are arranged in decreasing order. Consider, then, first the asymptotic regime $\epsilon \ll \sigma^2$, where $\epsilon_{\rm LC}$ and $\epsilon_{\rm OV}$ become identical. One shows easily that for eigenvalues decaying as a power law for large $i$, $\lambda_i \sim i^{-r}$, the asymptotic learning curve scales as $\epsilon_{\rm LC} \sim (n/\sigma^2)^{-(r-1)/r}$.⁴ This is in agreement with known exact results (Silverman, 1985; Ritter, 1996). In the initial regime $\epsilon \gg \sigma^2$, on the other hand, one can take $\sigma^2 \to 0$ and then finds a faster decay of the generalization error, $\epsilon_{\rm LC} \sim n^{-(r-1)}$.⁵ We are not aware of exact results pertaining to this regime, except for the OU case in $d = 1$, which has $r = 2$ and therefore $\epsilon_{\rm LC} \sim n^{-1}$, in agreement with an exact calculation (Manfred Opper, private communication, March 2001).

3 Comparisons with Bounds and Numerical Simulations
We now compare the LC and UC approximations with existing bounds, and with the "true" learning curves as obtained by numerical simulations. A lower bound on the generalization error was given by Michelli and Wahba (1981) as

    \epsilon(n) \geq \epsilon_{\rm MW}(n) = \sum_{i=n+1}^{\infty} \lambda_i.   (3.1)
This bound is derived for the noiseless case by allowing generalized observations (projections of $h^*(x)$ along the first $n$ eigenfunctions of $C(x, x')$) and so is unlikely to be tight for the case of "real" observations at discrete input points. Also, given that the bound is derived from the $\sigma^2 \to 0$ limit, it can be useful only in the initial (small $n$, $\epsilon \gg \sigma^2$) regime of the learning curve. There, it confirms a conclusion of our intuitive discussion above: The learning curve has a lower limit below which it will not drop even for $\sigma^2 \to 0$.

⁴ Among the scenarios studied in the next section, the OU covariance function provides a concrete example of this kind of behavior. In $d = 1$ dimension, it has, from equation B.2, $\lambda_i \sim i^{-2}$ and thus $r = 2$. The RBF covariance function, on the other hand, has eigenvalues decaying faster than any power law, corresponding to $r \to \infty$ and thus $\epsilon_{\rm LC} \sim \sigma^2/n$ (up to logarithmic corrections) in the asymptotic regime.

⁵ For covariance functions with eigenvalues $\lambda_i$ decaying faster than a power law of $i$, the behavior in the initial regime is nontrivial. For the RBF covariance function in $d = 1$ with uniform inputs, for example, we find (for large $n$) $\epsilon_{\rm LC} \sim n \exp(-c n^2)$ with some constant $c$.

Plaskota (1990) generalized the MW approach to the noisy case and obtained the following improved lower bound,

    \epsilon(n) \geq \epsilon_{\rm Pl}(n) = \min_{\{g_i\}} \sum_{i=1}^{n} \frac{\lambda_i \sigma^2}{g_i + \sigma^2} + \sum_{i=n+1}^{\infty} \lambda_i,   (3.2)
where the minimum is over all nonnegative $g_1, \ldots, g_n$ obeying $\sum_{i=1}^{n} g_i = S \equiv \sum_{l=1}^{n} C(x_l, x_l)$. (For the purposes of numerical evaluation, the minimization can be carried out largely analytically, leaving only a search over an integer variable determining the number of nonzero $g_i$; Plaskota, 1990; Sollich, 2001.) Plaskota derived his bound only for covariance functions for which the prior variance $C(x, x)$ is independent of $x$. We call these uniform covariance functions; due to the general identity $\langle C(x, x) \rangle_x = \mathrm{tr}\, \Lambda$, they obey $C(x, x) = \mathrm{tr}\, \Lambda$ (and hence $S = n\, \mathrm{tr}\, \Lambda$). By following through Plaskota's proof, we have shown that the bound actually extends to general covariance functions in the form stated above (Sollich, 2001). The Plaskota bound is close to the MW bound in the small $n$ regime (where one can effectively take $\sigma^2 \to 0$); for larger $n$, it becomes substantially larger. It therefore has the potential to be useful for both small and large $n$. Note that, in contrast to all other bounds discussed in this article, the MW and Plaskota bounds are in fact single data set (worst-case) bounds. They apply to $\epsilon(D)$ for any data set $D$ of the given size $n$ rather than just to the average $\epsilon(n) = \langle \epsilon(D) \rangle_D$.⁶

Opper (1997) used information-theoretic methods to obtain a different lower bound, but we will not consider this because the more recent OV bound, equation 2.5, is always tighter. Note that the OV bound incorrectly suggests that $\epsilon$ decreases to zero for $\sigma^2 \to 0$ at fixed $n$. It therefore becomes void for small $n$ (where $\epsilon \gg \sigma^2$) and is expected to be of use only in the asymptotic regime of large $n$. There is also an upper bound due to Opper (UO; Opper, 1997),

    \tilde{\epsilon}(n) \leq \epsilon_{\rm UO}(n) = (\sigma^{-2} n)^{-1}\, \mathrm{tr}\, \ln(I + \sigma^{-2} n \Lambda) + \mathrm{tr}\, (\Lambda^{-1} + \sigma^{-2} n I)^{-1}.   (3.3)

Here $\tilde{\epsilon}$ is a modified version of $\epsilon$ that (in the rescaled version that we are using) becomes identical to $\epsilon$ in the limit of small generalization errors

⁶ Our generalized version of the Plaskota bound depends on the specific data set only through the value of $S = \sum_{l=1}^{n} C(x_l, x_l)$. To obtain an average case bound, one would need to average over the distribution of $S$.
($\epsilon \ll \sigma^2$), but never gets larger than $2\sigma^2$; for small $n$ in particular, $\epsilon(n)$ can therefore actually be much larger than $\tilde{\epsilon}(n)$ and its bound, equation 3.3. For this reason, and because in our simulations we never get very far into the asymptotic regime $\epsilon \ll \sigma^2$, we do not display the UO bound in the figures that follow.

The UO bound is complemented by an upper bound due to Williams and Vivarelli (WV; Williams & Vivarelli, 2000), which never decays below values around $\sigma^2$ and is therefore mainly useful in the initial regime $\epsilon \gg \sigma^2$. It applies for one-dimensional inputs $x$ and stationary covariance functions, for which $C(x, x') = C_s(x - x')$ is a function of $x - x'$ alone, and reads

    \epsilon(n) \leq \epsilon_{\rm WV}(n) = C_s(0) - \frac{1}{C_s(0) + \sigma^2} \int_0^1 da\, f_n(a)\, C_s^2(a)   (3.4)

with

    f_n(a) = 2(1 - a)^n H(1 - a) + 2(n - 1)(1 - 2a)^n H(1 - 2a),   (3.5)
and where the Heaviside step functions $H$ (defined as $H(z) = 1$ for $z > 0$ and $= 0$ otherwise) in the two terms in equation 3.5 imply that only values of $a$ up to 1 and 1/2, respectively, contribute to the integral in equation 3.4. The function $f_n(a)$ is a normalized distribution over $a$, which for $n \to \infty$ becomes peaked around $a = 0$, implying that the asymptotic value of the bound is $\epsilon_{\rm WV}(n \to \infty) = C_s(0)\, \sigma^2 / [C_s(0) + \sigma^2] \approx \sigma^2$ for $\sigma^2 \ll C_s(0)$. The derivation of the bound is based on the insight that $\epsilon(x, D)$ always decreases as more examples are added; it can therefore be upper bounded for any given $x$ by the smallest $\epsilon(x, D')$ that would result from training on any data set $D'$ comprising only a single example from the original training set $D$. The idea can be generalized to using the smallest $\epsilon(x, D')$ obtainable from any two of the training examples, but this does not significantly improve the bound (Williams & Vivarelli, 2000).

As stated in equations 3.4 and 3.5, the WV bound applies only to the case of a uniform input distribution over the unit interval [0, 1]. However, it is relatively straightforward to extend the approach to general (one-dimensional) input distributions $P(x)$; only the data set average becomes technically a little more complicated. We omit the details and quote only the result: Equation 3.4 remains valid if the expression 3.5 for $f_n(a)$ is generalized to

    f_n(a) = n \left\langle P(x - a)[1 - Q(x)]^{n-1} + P(x + a)[Q(x)]^{n-1} \right\rangle_x + n(n - 1) \left\langle H(x' - x - 2a)\, [P(x + a) + P(x' - a)]\, [1 + Q(x) - Q(x')]^{n-2} \right\rangle_{x, x'},   (3.6)
where $Q(z) = \int_{-\infty}^{z} dx\, P(x)$ is the cumulative distribution function. In the simpler scenario considered by Williams and Vivarelli, this can be shown
to reduce to equation 3.5, while in the most general case, the numerical evaluation of the bound requires a triple integral (over $x$, $x'$, and $a$).

Finally, there is one more upper bound, due to Trecate, Williams, and Opper (TWO; Trecate et al., 1999). Based on the generalization error achieved by a "suboptimal" gaussian regressor, they showed

    \epsilon(n) \leq \epsilon_{\rm TWO}(n) = \mathrm{tr}\, \Lambda - n \sum_i \frac{\lambda_i^2}{c_i},

where

    c_i = (n - 1) \lambda_i + \sigma^2 + \langle C(x, x)\, \phi_i^2(x) \rangle_x.   (3.7)

For a uniform covariance function, the average in the definition of $c_i$ becomes $\mathrm{tr}\, \Lambda\, \langle \phi_i^2(x) \rangle_x = \mathrm{tr}\, \Lambda$, and the bound simplifies to

    \epsilon_{\rm TWO}(n) = \sum_i \lambda_i \frac{\mathrm{tr}\, \Lambda + \sigma^2 - \lambda_i}{\mathrm{tr}\, \Lambda + \sigma^2 + (n - 1) \lambda_i}.
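All three of these bounds need only the eigenvalue spectrum (for a uniform covariance function). The sketch below is our own: in particular, solving the Plaskota minimization by water-filling on the stationarity condition $g_i = \max(0, \sqrt{\sigma^2 \lambda_i/\mu} - \sigma^2)$ is our numerical shortcut, whereas the text notes that the minimization can be reduced largely analytically. The ordering $\epsilon_{\rm MW} \le \epsilon_{\rm Pl} \le \epsilon_{\rm TWO}$ must hold, since the first two are lower bounds and the last is an upper bound on the same learning curve.

```python
import numpy as np

# Our own numerical sketch of the MW, Plaskota, and TWO bounds (equations
# 3.1, 3.2, and 3.7) for a uniform covariance function, given only the
# eigenvalue spectrum lam (sorted in decreasing order) and noise level s2.
def eps_mw(n, lam):
    """Michelli-Wahba lower bound: the eigenvalues beyond the first n."""
    return lam[n:].sum()

def eps_plaskota(n, lam, s2):
    """Plaskota lower bound; for a uniform covariance function S = n tr L.
    The minimization over g_i >= 0 with sum_i g_i = S is done here by
    water-filling: stationarity gives g_i = max(0, sqrt(s2 lam_i/mu) - s2),
    and the multiplier mu is fixed by bisection on the budget constraint."""
    head, tail, S = lam[:n], lam[n:], n * lam.sum()
    g = lambda mu: np.maximum(0.0, np.sqrt(s2 * head / mu) - s2)
    mu_lo, mu_hi = 1e-20, head.max() / s2      # g vanishes entirely at mu_hi
    for _ in range(200):
        mu = np.sqrt(mu_lo * mu_hi)            # bisect on a log scale
        mu_lo, mu_hi = (mu, mu_hi) if g(mu).sum() > S else (mu_lo, mu)
    return np.sum(head * s2 / (g(mu_hi) + s2)) + tail.sum()

def eps_two(n, lam, s2):
    """TWO upper bound, simplified uniform-covariance form."""
    t = lam.sum()
    return np.sum(lam * (t + s2 - lam) / (t + s2 + (n - 1) * lam))

lam = 1.0 / np.arange(1, 201) ** 2             # assumed power-law spectrum
s2, n = 0.05, 50
print(eps_mw(n, lam) <= eps_plaskota(n, lam, s2) <= eps_two(n, lam, s2))  # → True
```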
We now compare the quality of these bounds and our approximations with numerical simulations of learning curves. All the theoretical expressions require knowledge of the eigenvalue spectrum of the covariance function, so we focus on situations where this is known analytically. We consider three scenarios. For the first two, we assume that inputs $x$ are drawn from the $d$-dimensional unit hypercube $[0, 1]^d$ and that the input density is uniform. As covariance functions, we use the RBF function $C(x, x') = \exp[-||x - x'||^2/(2l^2)]$ and the OU function $C(x, x') = \exp(-||x - x'||/l)$, to have the extreme cases of smooth and rough functions to be learned; both contain a tunable length scale $l$. To be precise, we use slightly modified versions of the RBF and OU covariance functions (using what physicists call periodic boundary conditions), which make the eigenvalue calculations analytically tractable; the details are explained in appendix B. In the third scenario, we explore the effect of a nonuniform input distribution by considering inputs $x$ drawn from a $d$-dimensional (zero mean) isotropic gaussian distribution $P(x) \propto \exp[-||x||^2/(2\sigma_x^2)]$, for an RBF covariance function. Details of the eigenvalue spectrum for this case can also be found in appendix B. Note that in all three cases, the covariance function is uniform, that is, it has a constant variance $C(x, x)$; we have fixed this to unity without loss of generality. This leaves three variable parameters: the input space dimension $d$, the noise level $\sigma^2$, and the length scale $l$. We generically expect the prior variance to be significantly larger than the noise on the training data, so we consider only values of $\sigma^2 < 1$. The length scale $l$ should also obey $l < 1$; otherwise, the covariance function $C(x, x')$ would be almost constant across the input space, corresponding to a trivial GP prior of essentially
$x$-independent functions. We in fact choose the length scale $l$ for each $d$ in such a way as to get a reasonable decay of the learning curve within the range of $n = 0, \ldots, 300$ that can be conveniently simulated numerically. To see why this is necessary, note that each "dent" of the covariance function covers a fraction of order $l^d$ of the input space, so that the number of examples $n$ needed to see a significant reduction in generalization error $\epsilon$ will scale as $(1/l)^d$. This quickly becomes very large as $d$ increases unless $l$ is increased simultaneously. (The effect of larger $l$ leading to a faster decay of the learning curve was also observed by Williams & Vivarelli, 2000.)

In the following figures, we show the lower bounds (Plaskota, OV), the nonasymptotic upper bounds (TWO and, for $d = 1$ with uniform input distribution, WV), and our approximations (LC and UC). The true learning curve as obtained from numerical simulations is also shown. For the numerical simulations, we built up training sets by randomly drawing training inputs from the specified input distribution. For each new training input, the matrix inverse $K^{-1}$ has to be recalculated. By partitioning the matrix into its elements corresponding to the old and new inputs, this inversion can be performed with $O(n^2)$ operations (see Press et al., 1992), as opposed to $O(n^3)$ if the inverse is calculated from scratch every time. With $K^{-1}$ known, the generalization error $\epsilon(D)$ was then calculated from equation 1.3, with the average over $x$ estimated by an average over randomly sampled test inputs. This process was repeated up to our chosen $n_{\max} = 300$; the results for $\epsilon(D)$ were then averaged over a number of training set realizations to obtain the learning curve $\epsilon(n)$. In all the graphs shown, the size of the error bars on the simulated learning curve is of the order of the visible fluctuations in the curve.
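The $O(n^2)$ recalculation of $K^{-1}$ mentioned above rests on the standard bordered (block) inversion formula: if the new input adds a column $b$ and diagonal element $c$ to $K$, the Schur complement $s = c - b^{\rm T} K^{-1} b$ gives the new inverse directly. A self-contained sketch (ours, not the authors' code):

```python
import numpy as np

def grow_inverse(Kinv, b, c):
    """Given Kinv = K^{-1} (n x n), return the inverse of the bordered
    matrix [[K, b], [b^T, c]] in O(n^2) operations (block inversion)."""
    u = Kinv @ b                       # O(n^2)
    s = c - b @ u                      # Schur complement (a scalar)
    m = Kinv.shape[0]
    out = np.empty((m + 1, m + 1))
    out[:m, :m] = Kinv + np.outer(u, u) / s
    out[:m, m] = -u / s
    out[m, :m] = -u / s
    out[m, m] = 1.0 / s
    return out

# Check against a direct O(n^3) inversion for a random SPD matrix.
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
K_full = M @ M.T + 6 * np.eye(6)       # symmetric positive definite
Kinv_new = grow_inverse(np.linalg.inv(K_full[:5, :5]), K_full[:5, 5], K_full[5, 5])
print(np.allclose(Kinv_new, np.linalg.inv(K_full)))   # → True
```

Growing the inverse one example at a time in this way reproduces the cost scaling described in the text, since each update touches only $O(n^2)$ entries.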
In Figure 3, we show the results for an OU covariance function with inputs from $[0, 1]^d$, for $d = 1$ (top) and $d = 2$ (bottom). One observes that the lower bounds (Plaskota and OV) are rather loose in both cases. The TWO upper bound is also far from tight; the WV upper bound is better where it can be defined (for $d = 1$). Our approximations, LC and UC, are closer to the true learning curve than any of the bounds and in fact appear to bracket it.

Similar comments apply to Figure 4, which displays the results for an RBF covariance function with inputs from $[0, 1]$. Because functions from an RBF prior are much smoother than those from an OU prior, they are easier to learn, and the generalization error $\epsilon$ shows a more rapid decrease with $n$. This makes visible, within the range of $n$ shown, the anticipated change in behavior as $\epsilon$ crosses over from the initial ($\epsilon \gg \sigma^2$) to the asymptotic ($\epsilon \ll \sigma^2$) regime. The LC approximation and, to a less quantitative extent, the Plaskota bound both capture this change. By contrast, the OV bound (as expected from its general properties discussed above) shows the right qualitative behavior only in the asymptotic regime.

Figure 5 displays corresponding results in higher dimension ($d = 4$), at two different noise levels $\sigma^2$. One observes in particular that the OV
[Figure 3 appears here: two log-scale panels of $\epsilon$ versus $n$ (0 to 300); curves labeled Sim, UC, LC, WV, OV, TWO, Plas.]
Figure 3: Learning curve for a GP with OU covariance function and inputs uniformly drawn from $x \in [0, 1]^d$, at noise level $\sigma^2 = 0.05$. (Top) $d = 1$, length scale $l = 0.01$. (Bottom) $d = 2$, $l = 0.1$.
lower bound becomes looser as $\sigma^2$ decreases; this is as expected, since for $\sigma^2 \to 0$ the bound actually becomes void ($\epsilon_{\rm OV} \to 0$). The Plaskota bound also appears to get looser for lower $\sigma^2$, though not as dramatically. (Note that the kinks in the Plaskota curve are not an artifact. For larger $d$, the multiplicities of the different eigenvalues can be quite large; the value of $\epsilon_{\rm Pl}$ can become dominated by one such block of degenerate eigenvalues, and kinks occur where the dominant block changes.) The TWO upper bound, finally, is only weakly affected by the value of $\sigma^2$ and quite loose throughout.
[Figure 4 appears here: log-scale plot of $\epsilon$ versus $n$ (0 to 300); curves labeled Sim, UC, LC, WV, OV, TWO, Plas, MW.]
Figure 4: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from $x \in [0, 1]^d$, at noise level $\sigma^2 = 0.05$, dimension $d = 1$, length scale $l = 0.01$. We show the MW bound here to demonstrate how it is first close to the Plaskota bound but then misses the change in behavior where the generalization error crosses over into the asymptotic regime ($\epsilon \ll \sigma^2$).
All results shown so far pertain to uniform input distributions (over $[0, 1]^d$). We now move to the last of our three scenarios: a GP with an RBF covariance function and inputs drawn from a gaussian distribution (see appendix B for details). In Figure 6 we see that in $d = 1$, the (generalized) WV bound is still reasonably tight, while the LC approximation now provides a less good representation of the overall shape of the learning curve than for the case of uniform input distributions. However, as in all previous examples, the LC and UC approximations still bracket the true learning curve (and come closer to it than the bounds). One is thus led to speculate whether the approximations we have derived are actually bounds. Figure 7 shows this not to be the case, however. In $d = 4$, the true learning curve drops visibly below the LC approximation in the small $n$ regime, and so the latter cannot be a lower bound. The low noise case ($\sigma^2 = 0.001$) displayed here illustrates once more that the OV lower bound ceases to be useful for small noise levels.

In summary, of the approximations that we have derived, the LC approximation performs best. Although we know on theoretical grounds that it will be accurate for large noise levels $\sigma^2$, the examples shown above demonstrate that it produces predictions close to the true learning curves even for the more realistic case of noise levels that are low compared to the prior variance. As a general trend, agreement appears to be better for the case of uniform input distributions.
Figure 5: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from $x \in [0, 1]^d$, for dimension $d = 4$ and length scale $l = 0.3$. (Top) Noise level $\sigma^2 = 0.05$. (Bottom) $\sigma^2 = 0.001$. Note that for lower $\sigma^2$, the OV bound becomes looser as expected (it approaches zero for $\sigma^2 \to 0$).
Figure 6: Learning curve for a GP with RBF covariance function (length scale $l = 0.01$) and inputs drawn from a gaussian distribution in $d = 1$ dimension; noise level $\sigma^2 = 0.05$. Note that the LC approximation is not as good a representation of the overall shape of the learning curve here as for the previous examples with uniform input distributions. The curve labeled WV shows our generalized version of the Williams-Vivarelli bound (see equation 3.6).

It is interesting at this stage to make a connection to the recent work of Malzahn and Opper (2001); some details of their approach were kindly provided by Manfred Opper in private communication, March 2001. Malzahn and Opper devised an elegant way of approaching the learning curve problem from the point of view of statistical physics, calculating the relevant partition function (which is an average over data sets) using a so-called gaussian variational approximation. The result they find for the Bayes error is identical to the LC approximation under the condition that $\epsilon(x) = \langle \epsilon(x, D) \rangle_D$, the $x$-dependent generalization error averaged over all data sets, is independent of $x$. Otherwise, they find a result of the same functional form, $\epsilon(n) = \mathrm{tr}\, (\Lambda^{-1} + g I)^{-1}$, but the self-consistency equation for $g$ is more complicated than the simple relation $g = n/(\epsilon + \sigma^2)$ obtained from the LC approximation, equation 2.12. The LC approximation would thus be expected to perform less well for such "nonuniform" scenarios. This agrees qualitatively with our findings above: for the scenario with a gaussian input distribution, the LC approximation is of poorer quality than for the cases with uniform input distributions over $[0, 1]^d$.⁷

4 Improving the Approximations
In the previous section we saw that in our test scenarios, the LC approximation, equation 2.12, generally provides the closest theoretical approximation to the true learning curves. This may appear somewhat surprising, given
⁷ It is easy to see that in these cases, $\epsilon(x)$ is indeed independent of $x$. The absence of effects from the boundaries of the hypercube comes from the periodic boundary conditions that we are using.
Figure 7: Learning curve for a GP with RBF covariance function (length scale $l = 0.3$) and inputs drawn from a gaussian distribution in $d = 4$ dimensions. (Top) Noise level $\sigma^2 = 0.05$. (Bottom) $\sigma^2 = 0.001$. The OV lower bound is extremely loose for this smaller noise level. Note that in contrast to all previous examples, there is a range of $n$ here where the true learning curve lies below the LC approximation.
that we made two rather drastic approximations in deriving equation 2.12. We treated the number of training examples $n$ as a continuous variable, and we decoupled the average of the right-hand side of equation 2.6 into separate averages over numerator and denominator. We now investigate whether the LC prediction for the learning curves can be further improved by removing these approximations.

We begin with the effect of $n$, the number of training examples, taking only discrete (rather than continuous) values. Starting from equation 2.6, averaging numerator and denominator separately as before and introducing the auxiliary variable $v$ as in equations 2.9 and 2.10, we obtain

    \epsilon(n+1) - \epsilon(n) = \frac{1}{\sigma^2 + \epsilon} \frac{\partial \epsilon}{\partial v}   (4.1)

instead of equation 2.11. It is possible to interpolate between these two equations by writing

    \frac{1}{\delta} [\epsilon(n + \delta) - \epsilon(n)] = \frac{1}{\sigma^2 + \epsilon} \frac{\partial \epsilon}{\partial v}.   (4.2)

Then $\delta = 1$ corresponds to equation 4.1, which is the equation we wish to solve (discrete $n$), while in the limit $\delta \to 0$ we retrieve equation 2.11. To proceed, we treat $\delta$ as a perturbation parameter and assume that the solution of equation 4.2 can be expanded as

    \epsilon = \epsilon_0 + \delta \epsilon_1 + O(\delta^2),

where $\epsilon_0 \equiv \epsilon_{\rm LC}$. Expanding both sides of equation 4.2 to first order in $\delta$ yields
    \frac{\partial \epsilon_0}{\partial n} + \delta \frac{\partial \epsilon_1}{\partial n} + \frac{\delta}{2} \frac{\partial^2 \epsilon_0}{\partial n^2} + O(\delta^2) = \left( \frac{1}{\sigma^2 + \epsilon_0} - \delta \frac{\epsilon_1}{(\sigma^2 + \epsilon_0)^2} \right) \left( \frac{\partial \epsilon_0}{\partial v} + \delta \frac{\partial \epsilon_1}{\partial v} \right).

Comparing the coefficients of the $O(\delta^0)$ terms gives us back equation 2.11 for $\epsilon_0$, while from the $O(\delta)$ terms we get

    \frac{\partial \epsilon_1}{\partial n} - \frac{1}{\sigma^2 + \epsilon_0} \frac{\partial \epsilon_1}{\partial v} = - \frac{\epsilon_1}{(\sigma^2 + \epsilon_0)^2} \frac{\partial \epsilon_0}{\partial v} - \frac{1}{2} \frac{\partial^2 \epsilon_0}{\partial n^2}.   (4.3)
This can again be solved using characteristics (see appendix A), with the result

    \epsilon_1 = (\sigma^2 + \epsilon_0)^2 \frac{n (a_2^2 - a_3)}{(1 - n a_2)^2}, \qquad a_k = (\sigma^2 + \epsilon_0)^{-k}\, \mathrm{tr}\left( \Lambda^{-1} + \frac{n}{\sigma^2 + \epsilon_0} I \right)^{-k}.   (4.4)
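Equation 4.4 is again cheap to evaluate from the spectrum. The sketch below (ours, with assumed names and spectrum) obtains $\epsilon_0$ from equation 2.12 by bisection and then the correction $\epsilon_1$; since $(\sum_i g_i^2)^2 \le \sum_i g_i \sum_i g_i^3$ for the diagonal entries $g_i$ of the resolvent, $a_2^2 - a_3 < 0$ and the correction indeed comes out negative.

```python
import numpy as np

def eps_lc1(n, lam, s2):
    """Our own sketch of equation 4.4: eps_0 = eps_LC from equation 2.12
    (by bisection), then the first-order discrete-n correction eps_1."""
    h = lambda e: e - np.sum(1.0 / (1.0 / lam + n / (s2 + e)))
    lo, hi = 0.0, float(lam.sum())
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if h(mid) > 0 else (mid, hi)
    e0 = 0.5 * (lo + hi)
    g = 1.0 / (1.0 / lam + n / (s2 + e0))   # diagonal of (Lambda^{-1} + ...)^{-1}
    a2 = np.sum(g ** 2) / (s2 + e0) ** 2
    a3 = np.sum(g ** 3) / (s2 + e0) ** 3
    e1 = (s2 + e0) ** 2 * n * (a2 ** 2 - a3) / (1.0 - n * a2) ** 2
    return e0, e1

lam = 1.0 / np.arange(1, 201) ** 2          # assumed power-law spectrum
e0, e1 = eps_lc1(100, lam, 0.05)
print(e1 < 0 and abs(e1) < e0)   # → True: the correction is small and negative
```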
[Figure 8 appears here: $\epsilon_0 = \epsilon_{\rm LC}$ and $-100\, \epsilon_1$ plotted against $n$ from 0 to 300.]
Figure 8: Solid line: LC approximation $\epsilon_{\rm LC}$ for the learning curve of a GP with RBF covariance function and gaussian inputs. Parameters are as in Figure 7 (top): dimension $d = 4$, length scale $l = 0.3$, noise level $\sigma^2 = 0.05$. Dashed line: First-order correction $\epsilon_1$ arising from the discrete nature of the number of training examples $n$. Note that $\epsilon_1$ has been multiplied by $-100$ to make it positive and visible on the scale of the values of $\epsilon_{\rm LC}$.
Setting $\delta$ back to 1 to have the case of discrete $n$ in equation 4.2, we then have

    \epsilon_{\rm LC1} = \epsilon_0 + \epsilon_1 \equiv \epsilon_{\rm LC} + \epsilon_1

as the improved LC approximation that takes into account the effects of discrete $n$ (up to linear order in a perturbation expansion in $\delta$). We see that the correction term, equation 4.4, is zero at $n = 0$, as it must be since $\epsilon_{\rm LC}$ gives the exact result $\epsilon = \mathrm{tr}\, \Lambda$ there. It can also be shown that $\epsilon_1 < 0$ for all nonzero $n$. This can be understood as follows: The decrease of $\epsilon(n)$ becomes smaller (in absolute value) as $n$ increases. Comparing equations 2.11 and 4.1, we see that the continuous-$n$ approximation effectively averages the decrease term over the range $n, \ldots, n+1$ rather than evaluating it at $n$ itself; it therefore produces a smaller decrease in $\epsilon(n)$. The true decrease for discrete $n$ is larger, and so one expects the correction $\epsilon_1$ to be negative, in agreement with our calculation.

In Figure 8 we show $\epsilon_{\rm LC}$ and $\epsilon_1$ for one of the scenarios considered earlier; the results are typical also of what we find for other cases. The most striking observation is the smallness of $\epsilon_1$. Its absolute value is of the order of 1% of $\epsilon_{\rm LC}$ or less, and consequently $\epsilon_{\rm LC}$ and $\epsilon_{\rm LC1}$ would be indistinguishable on the scale of the plot. On the one hand, this is encouraging. Given that $\epsilon_1$ is already so small, one would expect higher orders in a perturbation
expansion in $\delta$ to yield even smaller corrections. Thus, $\epsilon_{\rm LC1}$ is likely to be very close to the result that one would find if the discrete nature of $n$ were taken into account exactly. On the other hand, we also conclude that treating $n$ as discrete is not sufficient to turn the LC approximation into a lower bound on the learning curve; in Figure 6, for example, the curve for $\epsilon_{\rm LC1}$ would lie essentially on top of the one for $\epsilon_{\rm LC}$ and so still be significantly above the true learning curve for small $n$.

It is clear at this stage that in order to improve the LC approximation significantly, one would have to address the decoupling of the numerator and denominator averages in equation 2.6. Generally, if $a$ and $b$ are random variables, one can evaluate the average of their ratio perturbatively as

    \left\langle \frac{a}{b} \right\rangle = \left\langle \frac{a}{\langle b \rangle + \Delta b} \right\rangle = \frac{\langle a \rangle}{\langle b \rangle} + \frac{\langle a \rangle}{\langle b \rangle^3} \langle (\Delta b)^2 \rangle - \frac{1}{\langle b \rangle^2} \langle \Delta a\, \Delta b \rangle + \cdots

up to second order in the fluctuations. (This idea was used by Sollich, 1994, to calculate finite-$N$ corrections to the $N \to \infty$ limit of a linear learning problem.) To apply this to the average of the right-hand side of equation 2.6 over the new training input $x_{n+1}$ and all previous ones, one would set

    a = G(n) \psi \psi^{\rm T} G(n), \qquad b = \sigma^2 + \psi^{\rm T} G(n) \psi.

One then sees that averages such as $\langle ab \rangle$, required in $\langle \Delta a\, \Delta b \rangle = \langle ab \rangle - \langle a \rangle \langle b \rangle$, involve fourth-order averages $\langle \phi_i(x) \phi_j(x) \phi_k(x) \phi_l(x) \rangle_x$ of the components of $\psi$. In contrast to the second-order averages $\langle \phi_i(x) \phi_j(x) \rangle = \delta_{ij}$, such fourth-order statistics of the eigenfunctions do not have a simple, covariance function-independent form. Even if these statistics were known, however, one would end up with averages over the entries of the matrix $G$ that cannot be reduced to $\epsilon = \mathrm{tr}\, \langle G \rangle$ (for example, by derivatives with respect to auxiliary parameters). Separate equations for the change of these averages with $n$ would then be required, generating new averages and eventually an infinite hierarchy that cannot be closed. We thus conclude that a perturbative approach is of little use in improving the LC approximation beyond the decoupling of averages. The approach of Malzahn and Opper (2001) thus looks more hopeful as far as the derivation of systematic corrections to the approximation is concerned.

5 How Good Can Bounds and Approximations Be?
In this final section, we ask whether there are limits of principle on the quality of theoretical predictions (either bounds or approximations) for GP learning curves. Of course, this question is meaningless unless we specify what information the theoretical curves are allowed to exploit. Guided by the insight that all predictions discussed above depend (at least for uniform covariance functions) on the eigenvalues of the covariance function
only (and of course the noise level $\sigma^2$), we ask: How tight can bounds and approximations be if they use only this eigenvalue spectrum as input?

To answer this question, it is useful to have a simple scenario with an arbitrary eigenvalue spectrum for which learning curves can be calculated exactly. Consider the case where the input space consists of $N$ discrete points $x_a$; the input distribution is arbitrary, $P(x_a) = p_a$ with $\sum_a p_a = 1$. Now take the covariance function to be degenerate in the sense that there are no correlations between different points: $C(x_a, x_b) = c_a \delta_{ab}$. The eigenvalue equation, 2.2, then becomes simply

    \sum_b C(x_a, x_b) \phi(x_b) p_b = c_a p_a \phi(x_a) = \lambda \phi(x_a),
so that the $N$ different eigenvalues are $\lambda_a = c_a p_a$. The eigenfunctions are $\phi_a(x_b) = p_a^{-1/2} \delta_{ab}$, where the prefactor follows from the normalization condition $\sum_c p_c \phi_a(x_c) \phi_b(x_c) = \delta_{ab}$. Note that by choosing the $c_a$ and $p_a$ appropriately, the $\lambda_a$ can be adjusted to any desired value in this setup. The same still holds even if we require the covariance function to be uniform, that is, $c_a$ to be independent of $a$. For this case, a simple connection can also be made to the covariance functions discussed in previous sections. If one imagines the points $x_a$ arranged in some $d$-dimensional space, then the present covariance function can be viewed as an RBF covariance function $C(x, x') = \mathrm{const} \cdot \exp[-||x - x'||^2/(2l^2)]$ in which the correlation length scale $l$ has been taken to zero, so that different input points have entirely uncorrelated outputs.

A set of $n$ training inputs is, in this scenario, fully characterized by how often it contains each of the possible inputs $x_a$; we call these numbers $n_a$. The generalization error is easy to work out from equation 2.4 using

    (\Phi^{\rm T} \Phi)_{ab} = \sum_c n_c \phi_a(x_c) \phi_b(x_c) = (n_a/p_a) \delta_{ab}.
This shows that $\Psi^T \Psi$ is a diagonal matrix, and thus from equation 2.4,

$$\epsilon(D) = \mathrm{tr}\,(\Lambda^{-1} + \sigma^{-2} \Psi^T \Psi)^{-1} = \sum_a (\lambda_a^{-1} + \sigma^{-2} n_a / p_a)^{-1} = \sum_a \frac{\lambda_a \sigma^2}{\sigma^2 + n_a \lambda_a / p_a}. \qquad (5.1)$$
This has the expected form: the contribution of each eigenvalue is reduced according to the ratio of the noise level $\sigma^2$ and the signal $n_a \lambda_a / p_a = n_a c_a$ received at the corresponding input point. To average this over all training sets of size $n$, one notes that $n_a$ has a binomial distribution, so that

$$\epsilon(n) = \sum_a \sum_{n_a=0}^{n} \binom{n}{n_a} p_a^{n_a} (1 - p_a)^{n - n_a}\, \frac{\lambda_a \sigma^2}{\sigma^2 + n_a \lambda_a / p_a}.$$
Writing $(\sigma^2 + n_a \lambda_a / p_a)^{-1} = \sigma^{-2} \int_0^1 dr\, r^{n_a \lambda_a / (p_a \sigma^2)}$, we can perform the sum over $n_a$ and obtain

$$\epsilon(n) = \int_0^1 dr \sum_a \lambda_a \left(1 - p_a + p_a\, r^{\lambda_a / (p_a \sigma^2)}\right)^n \qquad (5.2)$$

as the final result; the integral over $r$ can easily be performed numerically for a given set of eigenvalues $\lambda_a$ and input probabilities $p_a$. Note that, having done the calculation for a finite number $N$ of discrete input points (and therefore of eigenvalues), we can now also take $N$ to infinity and therefore analyze scenarios (such as the ones studied in section 3) with an infinite number of nonzero eigenvalues.

A simple limiting case will now tell us about the quality of eigenvalue-dependent upper bounds on learning curves. Assume that one of the $p_a$ is close to 1, whereas all the others are close to 0. From equation 5.2, one then sees that only the contribution from the eigenvalue $\lambda_a$ with $p_a \approx 1$ is reduced as $n$ increases,$^8$ while all other ones remain unaffected, so that

$$\epsilon(n) \approx \mathrm{tr}\,\Lambda - \lambda_a + \frac{\lambda_a \sigma^2}{\sigma^2 + n \lambda_a} = \mathrm{tr}\,\Lambda - \frac{n \lambda_a^2}{\sigma^2 + n \lambda_a} \geq \mathrm{tr}\,\Lambda - \lambda_a. \qquad (5.3)$$
If $\lambda_a \ll \mathrm{tr}\,\Lambda$, then we can make the reduction in generalization error arbitrarily small. It thus follows that there is no nontrivial upper bound on learning curves that takes only the eigenvalue spectrum of the covariance function as input. (Accordingly, the two nonasymptotic upper bounds, WV and TWO, discussed in section 3 both contain other information, via the weighted averages of $C_s^2(x) = C^2(0, x)$ in equation 3.4 and the average of $C(x, x)\, \phi_i^2(x)$ in equation 3.7.) In particular, this implies that our UC approximation cannot be an upper bound (even though the results for all scenarios investigated above suggested that it might be). Furthermore, our result shows that lower bounds on the generalization error (e.g., the OV or Plaskota bounds) can be arbitrarily loose. A similar observation holds for upper bounds on the (data set-averaged) training error $\epsilon_t$, defined as $\epsilon_t = \langle \epsilon_t(D) \rangle_D$, where

$$\epsilon_t(D) = \frac{1}{n} \sum_{l=1}^{n} \left(\hat{h}(x_l, D) - y_l\right)^2.$$

Opper and Vivarelli (1999) showed that their $\epsilon^{\mathrm{OV}}$ actually also provides an upper bound for this quantity (so that the two errors sandwich the bound, $\epsilon_t \leq \epsilon^{\mathrm{OV}} \leq \epsilon$). In the present case, it is easy to calculate $\epsilon_t$ explicitly; we omit details and quote just the result:

$$\epsilon_t \approx \frac{\lambda_a \sigma^2}{\sigma^2 + n \lambda_a} \leq \lambda_a.$$

$^8$ This holds if $n$ is not too large, more precisely if $n p_b \ll 1$ for all the "small" $p_b$ ($b \neq a$).
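Equation 5.2 lends itself to direct numerical evaluation. The sketch below is ours, not from the paper, and uses a hypothetical four-eigenvalue spectrum; the $r$ integral is approximated by a simple midpoint rule:

```python
import numpy as np

def degenerate_learning_curve(lam, p, sigma2, n, num_r=4000):
    """Average learning curve of the degenerate scenario, equation 5.2:
    eps(n) = int_0^1 dr sum_a lam_a (1 - p_a + p_a r^(lam_a/(p_a sigma2)))^n,
    evaluated with a midpoint rule on r in [0, 1]."""
    r = (np.arange(num_r) + 0.5) / num_r            # midpoints of [0, 1]
    expo = lam / (p * sigma2)                       # lam_a / (p_a sigma^2)
    base = 1 - p[:, None] + p[:, None] * r[None, :] ** expo[:, None]
    return float(np.sum(lam[:, None] * base**n, axis=0).mean())

# hypothetical spectrum; uniform prior variance means p_a = lam_a / tr(Lam)
lam = np.array([1.0, 0.5, 0.25, 0.125])
p = lam / lam.sum()
eps_at_0 = degenerate_learning_curve(lam, p, sigma2=0.05, n=0)  # equals tr(Lam)
```

At $n = 0$ the integrand reduces to $\mathrm{tr}\,\Lambda$ for every $r$, which provides a quick sanity check on the implementation.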
Figure 9: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from $x \in [0, 1]^d$. Parameters are as in Figure 5 (top): dimension $d = 4$, length scale $l = 0.3$, and noise level $\sigma^2 = 0.05$. The curves Sim (true learning curve, from numerical simulations), UC, LC, and TWO are also as in Figure 5 (top). The curve labeled Deg shows the exact learning curve for the degenerate scenario (outputs for different inputs are uncorrelated) with exactly the same spectrum of eigenvalues $\lambda_i$ of the covariance function (and uniform prior variance $C(x, x)$). The curves Sim and Deg differ significantly, showing that learning curves cannot be predicted reliably based on eigenvalues alone.
By taking $\lambda_a \ll \mathrm{tr}\,\Lambda$, one then sees that the upper bound $\epsilon^{\mathrm{OV}}$ can be made arbitrarily loose for any fixed $n$ (and that the ratio of training and generalization error can be made arbitrarily small). One may object that the above limit of most of the $p_a$ tending to zero is unrealistic because it implies that the corresponding prior variances $c_a = \lambda_a / p_a$ would become very large. Let us therefore now restrict the prior variance to be uniform, $c_a = c$. It then follows that $\lambda_a = c\, p_a$ and hence $p_a = \lambda_a / \mathrm{tr}\,\Lambda$. With this assumption, only the $\lambda_a$ and $\sigma^2$ remain as parameters affecting the learning curve, equation 5.2. The results for an eigenvalue spectrum from one of the situations covered in section 3 are shown in Figure 9. The main conclusion to be drawn is that the learning curves for the present scenario are quite different from the ones we found earlier, even though the eigenvalue spectra and noise levels are, by construction, precisely identical. This demonstrates that theoretical predictions for learning curves that take into account only the eigenvalue spectrum of a covariance function cannot universally match the true learning curves with a high degree of accuracy; the quality of approximation will vary depending on details of the covariance function and input distribution that are not encoded in the spectrum.
Note that Figure 9 also provides a concrete example of the fact that the UC approximation is not in general an upper bound on the true learning curve; in fact, it here underestimates the true $\epsilon(n)$ quite significantly. We can also use the scenario to assess whether, as a bound on the generalization error resulting from a single training set, the Plaskota bound, equation 3.2, could be significantly improved. We focus on the case of uniform covariance functions, where equation 5.1 becomes

$$\epsilon(D) = \sum_a \frac{\lambda_a \sigma^2}{\sigma^2 + n_a\, \mathrm{tr}\,\Lambda}. \qquad (5.4)$$
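The minimization of equation 5.4 over the integer allocations $n_a$ can be sketched as follows (our code, with a hypothetical spectrum). Because each term of the sum is convex and decreasing in $n_a$, a greedy incremental allocation, giving each successive training point to the eigenvalue whose contribution drops the most, finds the minimum:

```python
import numpy as np

def min_single_set_error(lam, sigma2, n):
    """Minimize eq. 5.4 over integer allocations n_a with sum_a n_a = n.
    Greedy allocation is optimal because each term
    lam_a sigma2 / (sigma2 + n_a tr(Lam)) is convex, decreasing in n_a."""
    tr = lam.sum()
    na = np.zeros(len(lam), dtype=int)
    for _ in range(n):
        cur = lam * sigma2 / (sigma2 + na * tr)
        nxt = lam * sigma2 / (sigma2 + (na + 1) * tr)
        na[int(np.argmax(cur - nxt))] += 1   # biggest marginal reduction
    return float(np.sum(lam * sigma2 / (sigma2 + na * tr))), na

err, na = min_single_set_error(np.array([1.0, 0.5, 0.25]), sigma2=1e-8, n=2)
```

In the $\sigma^2 \to 0$ limit this reproduces the allocation $n_a = 1$ for the $n$ largest eigenvalues, as noted in the text below.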
For any assignment of the $n_a$, there is at least one training set of size $n = \sum_a n_a$ for which the generalization error is given by equation 5.4. Minimizing numerically over the $n_a$ for each given $n$, we find the curves shown in Figure 10, where the Plaskota bounds for the same eigenvalue spectra are also shown. The curves are quite close to each other, implying that the Plaskota bound cannot be significantly tightened as a single data set bound (assuming, as throughout, that the improved bound would again be based on only the covariance function's eigenvalue spectrum). In the limit $\sigma^2 \to 0$, the bound, which then reduces to the MW bound (see equation 3.1), cannot be tightened at all, as setting $n_a = 1$ for $a = 1, \ldots, n$ and $n_a = 0$ for $a \geq n + 1$ in equation 5.4 shows.

Within the simple degenerate scenario introduced in this section, we finally comment briefly on a recently proposed universal relation (Malzahn & Opper, 2001; Manfred Opper, private communication, March 2001). Malzahn and Opper suggest considering an empirical estimate of the (Bayes) generalization error, which is obtained by replacing the average over all inputs $x$ by one over the $n$ training inputs $x_i$:

$$\epsilon_{\mathrm{emp}}(D) = \frac{1}{n} \sum_{i=1}^{n} \epsilon(x_i, D).$$

Within the approximations of their calculation, the data set average of this quantity is then universally linked to a modified version of the true generalization error:

$$\langle \epsilon_{\mathrm{emp}}(D) \rangle_D = \left\langle \frac{\sigma^2\, \epsilon(x)}{\sigma^2 + \epsilon(x)} \right\rangle_x. \qquad (5.5)$$

Note that the average over data sets is on the inside of the fraction on the right-hand side, through the definition $\epsilon(x) = \langle \epsilon(x, D) \rangle_D$. Within our degenerate scenario, we can calculate both sides of equation 5.5 explicitly but find no obvious relation between the two sides. However, if we move the data set average on the right-hand side to the outside, we do (after a
[Figure 10 appears here. Legend: unif $\sigma^2 = 0.05$; unif $\sigma^2 = 0.001$; Gauss $\sigma^2 = 0.05$; Gauss $\sigma^2 = 0.001$.]
Figure 10: Comparison of the Plaskota bound (solid lines) and the lowest generalization error achievable for single data sets of size $n$ within the degenerate scenario (dashed lines). The eigenvalue spectra used to construct the curves are those for an RBF covariance function with length scale $l = 0.3$, in $d = 4$ dimensions, and for the input distributions (uniform over $[0, 1]^d$, or gaussian) shown in the legend; the noise level $\sigma^2$ is also given there. Note that for a given $n$, the curves become closer for lower $\sigma^2$; this is as expected since for $\sigma^2 \to 0$, the Plaskota bound can be saturated for a specific data set (see text).
brief calculation, the details of which we omit) find a simple result:

$$\langle \epsilon_{\mathrm{emp}}(D_{n+1}) \rangle_{D_{n+1}} = \left\langle \left\langle \frac{\sigma^2\, \epsilon(x, D_n)}{\sigma^2 + \epsilon(x, D_n)} \right\rangle_x \right\rangle_{D_n}. \qquad (5.6)$$
As indicated by the subscripts, the left-hand side of this relation is to be evaluated for data sets of size $n + 1$ rather than $n$. The result, equation 5.6, is remarkable in that it holds for any eigenvalue spectrum and any input distribution (within the degenerate scenario considered here). We take this as a hopeful sign that some universal link between true and empirical generalization errors, along the lines derived by Malzahn and Opper (2001) within their approximation, may indeed exist.
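In the degenerate scenario the posterior variance at an input point is explicit, $\epsilon(x_a, D) = c_a \sigma^2 / (\sigma^2 + n_a c_a)$, and each count $n_a$ is binomial, so both sides of equation 5.6 can be evaluated exactly. The sketch below is our own check, with a hypothetical three-point scenario:

```python
from math import comb

def both_sides_eq56(c, p, sigma2, n):
    """Exact evaluation of both sides of equation 5.6 in the degenerate
    scenario, where eps(x_a, D) = c_a sigma2 / (sigma2 + n_a c_a)."""
    lhs = rhs = 0.0
    for ca, pa in zip(c, p):
        eps = lambda m: ca * sigma2 / (sigma2 + m * ca)
        # LHS: <eps_emp(D_{n+1})>, with counts n_a ~ Bin(n+1, p_a)
        lhs += sum(comb(n + 1, k) * pa**k * (1 - pa) ** (n + 1 - k) * k * eps(k)
                   for k in range(n + 2)) / (n + 1)
        # RHS: << sigma2 eps(x, D_n) / (sigma2 + eps(x, D_n)) >_x >_{D_n},
        # with counts m_a ~ Bin(n, p_a)
        rhs += pa * sum(comb(n, m) * pa**m * (1 - pa) ** (n - m)
                        * sigma2 * eps(m) / (sigma2 + eps(m))
                        for m in range(n + 1))
    return lhs, rhs

lhs, rhs = both_sides_eq56(c=[1.0, 0.5, 0.3], p=[0.5, 0.3, 0.2],
                           sigma2=0.05, n=5)
```

The two sides agree to machine precision, consistent with equation 5.6 holding exactly here; the key identity is that $\sigma^2 \epsilon(x_a, D) / (\sigma^2 + \epsilon(x_a, D))$ equals the posterior variance after one additional observation at $x_a$.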
6 Conclusion
In summary, we have derived an exact representation of the average generalization error $\epsilon$ of GPs used for regression, in terms of the eigenvalue decomposition of the covariance function. Starting from this, we obtained two different approximations (LC and UC) to the learning curve $\epsilon(n)$. Both become exact in the large noise limit; in practice, one generically expects the opposite case ($\sigma^2 / C(x, x) \ll 1$), but comparison with simulation results shows that even in this regime, the new approximations perform well. The LC approximation in particular represents the overall shape of the learning curves very well, both for "rough" (OU) and "smooth" (RBF) gaussian priors, and for small as well as for large numbers of training examples $n$. It is not perfect, but it does generally get substantially closer to the true learning curves than existing bounds (two of which, due to Plaskota and to Williams and Vivarelli, we generalized to a wider range of scenarios). For situations with nonuniform input distributions, the predictions of the LC approximation tend to be less accurate, and we linked this observation to recent work by Malzahn and Opper (2001) on the effects of nonuniformity across input space. Their result, which reduces to the LC approximation for sufficiently uniform scenarios, may in other cases provide better approximations, but this has to be traded off against the higher computational cost that would be involved in actually evaluating the predictions. We next discussed how the LC approximation could be improved. The effects of discrete $n$ can be incorporated to leading order but were seen to be relatively minor; on the other hand, the second approximation involved in the derivation (decoupling of averages) appears difficult to improve on within our framework. Finally, we investigated a simple "degenerate" GP learning scenario, where the outputs corresponding to different inputs are uncorrelated.
This provided us with a means of assessing whether there are limits on the quality of approximations and bounds that take into account only the eigenvalue spectrum of the covariance function. We found that such limits indeed exist. There can be no nontrivial upper bound on the learning curve of this form, and approximations are necessarily of limited quality because different covariance functions with the same eigenvalue spectrum can produce rather different learning curves. We also found that as a bound on the generalization error for single data sets (rather than its average over data sets), the Plaskota bound is close to being tight. Whether a tight lower bound on the average learning curve exists remains an open question; one plausible candidate worth investigating would be the average generalization error of our degenerate scenario, minimized over all possible input distributions for a fixed eigenvalue spectrum. There are a number of open problems. One is whether a subclass of GP learning scenarios can be defined for which the covariance function's eigenvalue spectrum is sufficient for predicting the learning curves accurately.
Alternatively, one could ask what (minimal) extra information beyond the eigenvalue spectrum needs to be taken into account to arrive at accurate learning curves for all possible GP regression problems. Finally, one may wonder whether the eigenvalue decomposition we have chosen, which explicitly depends on the input distribution, is really the optimal one. On the one hand, recent work (see, e.g., Williams & Seeger, 2000) appears to answer this question in the affirmative. On the other hand, the variability of learning curves among GP covariance functions with the same eigenvalue spectrum suggests that the eigenvalues alone do not provide sufficient information for accurate predictions. One may therefore speculate that eigendecompositions with respect to other input distributions (e.g., maximum entropy ones) might not suffer from this problem. We leave these challenges for future work.

Appendix A: Solving for the LC Approximation
In this appendix, we describe how to solve equations 2.11 and 4.3 for the LC approximation and its first-order correction, using the method of characteristic curves. The method applies to partial differential equations of the form $a\, \partial f / \partial x + b\, \partial f / \partial y = c$, where $f = f(x, y)$ and $a, b, c$ can be arbitrary functions of $x, y, f$. Viewing the solution as a surface in $(x, y, f)$-space, one can show (John, 1978) that if the point $(x_0, y_0, f_0)$ belongs to the solution surface, then so does the entire characteristic curve $(x(t), y(t), f(t))$ defined by

$$\frac{dx}{dt} = a, \qquad \frac{dy}{dt} = b, \qquad \frac{df}{dt} = c, \qquad (x(0), y(0), f(0)) = (x_0, y_0, f_0).$$

The solution surface can then be recovered by combining an appropriate one-dimensional family of characteristic curves.

Denote the generalization error predicted by the LC approximation as $\epsilon_0(n, v)$, with $v$ the auxiliary parameter introduced in equations 2.9 and 2.10. It is the solution of equation 2.11,

$$\frac{\partial \epsilon_0}{\partial n} - \frac{1}{\sigma^2 + \epsilon_0} \frac{\partial \epsilon_0}{\partial v} = 0,$$

subject to the initial condition $\epsilon_0(n = 0, v) = \mathrm{tr}\,(\Lambda^{-1} + vI)^{-1}$. This gives us a family of solution points that the characteristic curves have to pass through: $(n(0) = 0,\ v(0) = v_0,\ \epsilon_0(0) = \mathrm{tr}\,(\Lambda^{-1} + v_0 I)^{-1})$. The equations for the characteristic curves are

$$\frac{dn}{dt} = 1, \qquad \frac{dv}{dt} = -\frac{1}{\sigma^2 + \epsilon_0}, \qquad \frac{d\epsilon_0}{dt} = 0$$

and can be integrated to give

$$n = n(0) + t = t, \qquad v = v(0) - \frac{t}{\sigma^2 + \epsilon_0} = v_0 - \frac{t}{\sigma^2 + \epsilon_0}, \qquad \epsilon_0 = \epsilon_0(0) = \mathrm{tr}\,(\Lambda^{-1} + v_0 I)^{-1}. \qquad (\mathrm{A.1})$$

Eliminating $t$ (the curve parameter) and $v_0$ (which parameterizes the family of initial points) gives the required solution

$$\epsilon_0 = \mathrm{tr}\,\{\Lambda^{-1} + [v + n / (\sigma^2 + \epsilon_0)]\, I\}^{-1}.$$

The LC approximation, equation 2.12, is obtained by setting $v = 0$.

For the first-order correction $\epsilon_1$, we have to solve equation 4.3,

$$\frac{\partial \epsilon_1}{\partial n} - \frac{1}{\sigma^2 + \epsilon_0} \frac{\partial \epsilon_1}{\partial v} + \frac{\epsilon_1}{(\sigma^2 + \epsilon_0)^2} \frac{\partial \epsilon_0}{\partial v} = -\frac{1}{2} \frac{\partial^2 \epsilon_0}{\partial n^2},$$

with the initial condition (explained in the main text) $\epsilon_1(n = 0, v) = 0$. Hence, a suitable family of initial solution points is $(n(0) = 0,\ v(0) = v_0,\ \epsilon_1(0) = 0)$. The characteristic curves must obey

$$\frac{dn}{dt} = 1, \qquad \frac{dv}{dt} = -\frac{1}{\sigma^2 + \epsilon_0}, \qquad \frac{d\epsilon_1}{dt} = -\frac{\epsilon_1}{(\sigma^2 + \epsilon_0)^2} \frac{\partial \epsilon_0}{\partial v} - \frac{1}{2} \frac{\partial^2 \epsilon_0}{\partial n^2}.$$

The solutions for $n(t)$ and $v(t)$ are the same as before, and as a result, $\epsilon_0$ is again constant along the characteristic curves. For the derivatives of $\epsilon_0$ that appear, one finds after some algebra,

$$\frac{\partial \epsilon_0}{\partial v} = -(\sigma^2 + \epsilon_0)^2 \frac{a_2}{1 - n a_2}, \qquad \frac{\partial^2 \epsilon_0}{\partial n^2} = -2 (\sigma^2 + \epsilon_0)^2 \frac{a_2^2 - a_3}{(1 - n a_2)^3}.$$

Here we have used the definitions from equation 4.4. Because both $a_2$ and $a_3$ depend on $n$ and $v$ only through the combination $v + n / (\sigma^2 + \epsilon_0)$, they are also constant along the characteristic curves. Using also that $n = t$ from equation A.1, the equation for $\epsilon_1$ becomes

$$\frac{d\epsilon_1}{dt} - \epsilon_1 \frac{a_2}{1 - t a_2} = (\sigma^2 + \epsilon_0)^2 \frac{a_2^2 - a_3}{(1 - t a_2)^3}.$$

This linear differential equation is easily integrated; using the initial condition $\epsilon_1(0) = 0$, one finds

$$\epsilon_1 = (\sigma^2 + \epsilon_0)^2 \frac{t (a_2^2 - a_3)}{(1 - t a_2)^2}.$$

Eliminating $t$ again via $t = n$ finally gives the solution, equation 4.4.
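The self-consistency equation for $\epsilon_0$ at $v = 0$ is easy to solve numerically. A minimal sketch (ours; the example spectrum is hypothetical) iterates the map starting from the prior variance $\mathrm{tr}\,\Lambda$; because the right-hand side is increasing in $\epsilon_0$ and maps $[0, \mathrm{tr}\,\Lambda]$ into itself, the iterates decrease monotonically to the fixed point:

```python
import numpy as np

def lc_learning_curve(lam, sigma2, n, tol=1e-12, max_iter=10000):
    """LC approximation at v = 0: solve the self-consistency equation
    eps0 = sum_i (1/lam_i + n/(sigma2 + eps0))^{-1}
    by fixed-point iteration from eps0 = tr(Lam)."""
    eps0 = lam.sum()
    for _ in range(max_iter):
        new = np.sum(1.0 / (1.0 / lam + n / (sigma2 + eps0)))
        if abs(new - eps0) < tol:
            return float(new)
        eps0 = new
    return float(eps0)

lam = np.array([1.0, 0.5, 0.25, 0.125])   # hypothetical eigenvalue spectrum
curve = [lc_learning_curve(lam, 0.05, n) for n in (0, 10, 100)]
```

At $n = 0$ the solution is exactly $\mathrm{tr}\,\Lambda$, and the predicted error decreases monotonically with $n$.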
Appendix B: Eigenvalue Spectra for the Example Scenarios
We explain first how the covariance functions with periodic boundary conditions for $x \in [0, 1]^d$ are constructed. Consider first the case $d = 1$. The periodic RBF covariance function is defined as

$$C(x, x') = \sum_{r=-\infty}^{\infty} c(x - x' - r), \qquad (\mathrm{B.1})$$

where $c(x - x') = \exp[-|x - x'|^2 / (2l^2)]$ is the original covariance function. For the periodic OU case, we use instead $c(x - x') = \exp(-|x - x'| / l)$. One sees that for sufficiently small $l$ ($l \ll 1$), only the $r = 0$ term makes a significant contribution, except when $x$ and $x'$ are within $\approx l$ of opposite ends of the input space (so that either $x - x' + 1$ or $x - x' - 1$ is of order $l$). We therefore expect the periodic covariance functions and the conventional nonperiodic ones to yield very similar learning curves, as long as the length scale of the covariance function is smaller than the size of the input domain. The advantage of having a periodic covariance function is that its eigenfunctions are simple Fourier waves and the eigenvalues can be calculated by Fourier transformation. This can be seen as follows. For the assumed uniform input distribution on $[0, 1]$, the defining equation for an eigenfunction $\phi(x)$ with eigenvalue $\lambda$ is, from equation 2.2,

$$\langle C(x, x')\, \phi(x') \rangle_{x'} = \int_0^1 dx'\, C(x, x')\, \phi(x') = \lambda \phi(x).$$

Inserting equation B.1 and assuming that $\phi(x)$ is continued periodically outside the interval $[0, 1]$, this becomes

$$\sum_{r=-\infty}^{\infty} \int_0^1 dx'\, c(x - x' - r)\, \phi(x') = \sum_{r=-\infty}^{\infty} \int_r^{r+1} dx'\, c(x - x')\, \phi(x' - r) = \int_{-\infty}^{\infty} dx'\, c(x - x')\, \phi(x') = \lambda \phi(x).$$

It is well known that the solutions of this eigenfunction equation are Fourier waves $\phi_q(x) = e^{2\pi i q x}$ for integer (positive or negative) $q$, with corresponding eigenvalues

$$\lambda_q = \int_{-\infty}^{\infty} dx\, c(x)\, e^{-2\pi i q x}.$$

The eigenvalues $\lambda_q$ are real since $c(x) = c(-x)$ (this follows from the requirement that the covariance function $C(x, x')$ must be symmetric). The eigenfunctions for $q \neq 0$ are complex in the form given but can be made explicitly
real by linearly transforming the pair $\phi_q(x)$ and $\phi_{-q}(x)$ into $\sqrt{2} \cos(2\pi q x)$ and $\sqrt{2} \sin(2\pi q x)$.

All of the above generalizes immediately to higher input dimension $d$. One defines

$$C(x, x') = \sum_r c(x - x' - r),$$
where $r$ now runs over all $d$-dimensional vectors with integer components; the argument $x - x' - r$ of $c(\cdot)$ is now a $d$-dimensional real vector. The eigenvalues of this periodic covariance function are then given by

$$\lambda_q = \int dx\, c(x)\, e^{-2\pi i\, q \cdot x}.$$

They are indexed by $d$-dimensional integer vectors $q$; the integration is over all real-valued $d$-dimensional vectors $x$, and $q \cdot x$ is the conventional dot product between vectors. Explicitly, one derives that for the periodic RBF covariance function,

$$\lambda_q = (2\pi)^{d/2}\, l^d\, e^{-(2\pi l)^2 \|q\|^2 / 2}.$$

For the periodic OU covariance function, on the other hand, one has

$$\lambda_q = k_d\, l^d\, [1 + (2\pi l)^2 \|q\|^2]^{-(d+1)/2}, \qquad (\mathrm{B.2})$$

with $k_d = 2, 2\pi, 8\pi$ for $d = 1, 2, 3$, respectively; for general $d$, $k_d = \pi^{(d-1)/2}\, 2^d\, \Gamma((d + 1)/2)$, where $\Gamma(z) = (z - 1)!$ is the gamma function.

All the bounds and approximations in principle require traces over the whole eigenvalue spectrum, corresponding to sums over an infinite number of terms. Numerically, we perform the sums over all eigenvalues up to some suitably large maximal value $q_{\max}$ of $\|q\|$. The remaining small-eigenvalue tail of the spectrum is then treated by approximating $\|q\|$ as a continuous variable $\tilde{q}$ and integrating over it from $q_{\max}$ to infinity, with the appropriate weighting for the number of vectors $q$ in a shell $\tilde{q} \leq \|q\| \leq \tilde{q} + d\tilde{q}$. To check that this procedure gave accurate results, we always verified that the numerically calculated $\mathrm{tr}\,\Lambda$ agreed with the known value of $C(x, x)$.

The third scenario we consider is that of a conventional RBF kernel $C(x, x') = \exp[-\|x - x'\|^2 / (2l^2)]$ with a nonuniform input distribution that we assume to be an isotropic zero-mean gaussian, $P(x) \propto \exp[-\|x\|^2 / (2\sigma_x^2)]$. The eigenfunctions and eigenvalues are worked out by Zhu, Williams, Rohwer, and Morciniec (1998) for the case $d = 1$; the eigenvalues are labeled by a nonnegative integer $q$ and given by

$$\lambda_q = (1 - b)\, b^q,$$
where

$$b^{-1} = 1 + r/2 + \sqrt{r^2/4 + r}, \qquad r = l^2 / \sigma_x^2.$$

As expected, only the ratio $r$ of the length scales $l$ and $\sigma_x$ enters, since the overall scale of the inputs is immaterial. (To avoid this trivial invariance, we fixed $\sigma_x^2 = 1/12$; this specific value gives the same variance for each component of $x$ as for the uniform distribution over $[0, 1]^d$ used in the other scenarios.) For $d > 1$, this result generalizes immediately because both the covariance function and the input distribution factorize over the different input components (Opper & Vivarelli, 1999; Williams & Seeger, 2001). The eigenvalues are therefore just products of the eigenvalues for each component and are indexed by a $d$-dimensional vector $q$ of nonnegative integers:

$$\lambda_q = (1 - b)^d\, b^s, \qquad s = \sum_{i=1}^{d} q_i. \qquad (\mathrm{B.3})$$
One sees that the eigenvalues come in blocks: all vectors $q$ with the same $s = \sum_i q_i$ give the same $\lambda_q$. Numerically, we therefore only have to store the distinct eigenvalues and their multiplicities, which can be shown to be $(d + s - 1)! / [s!\, (d - 1)!]$.$^9$ With this trick, so many eigenvalues can be treated by direct summation that a separate treatment of the neglected eigenvalues (via an integral, as above) is unnecessary.
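As an illustration, the sketch below (ours) tabulates the periodic RBF eigenvalues for $d = 1$ and checks the consistency criterion mentioned above, $\mathrm{tr}\,\Lambda = C(x, x) = \sum_r c(-r)$, which holds exactly by Poisson summation:

```python
import numpy as np

def periodic_rbf_eigenvalues(l, qmax):
    """Eigenvalues lam_q = sqrt(2 pi) l exp(-(2 pi l)^2 q^2 / 2) of the
    d = 1 periodic RBF covariance function, for q = -qmax, ..., qmax."""
    q = np.arange(-qmax, qmax + 1)
    return np.sqrt(2 * np.pi) * l * np.exp(-((2 * np.pi * l) ** 2) * q**2 / 2)

l = 0.3
lam = periodic_rbf_eigenvalues(l, qmax=10)
# C(x, x) = sum_r exp(-r^2 / (2 l^2)); truncating at |r| <= 5 is ample here
c_xx = sum(np.exp(-(r**2) / (2 * l**2)) for r in range(-5, 6))
```

With $l = 0.3$, both $\mathrm{tr}\,\Lambda$ and $C(x, x)$ come out slightly above 1, as expected since the $r = \pm 1$ wrap-around terms contribute a small amount.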
Acknowledgments
We thank Chris Williams, Manfred Opper, and Dörte Malzahn for stimulating discussions and the Royal Society for financial support through a Dorothy Hodgkin Research Fellowship.
$^9$ There is a nice combinatorial way of seeing this. Imagine a row of $d + s + 1$ holes. Each hole is either empty or contains one ball, except the holes at the two ends, which have one ball placed in them permanently. Now consider $d - 1$ identical balls distributed across the $d + s - 1$ free holes, and let $q_i$ ($i = 1, \ldots, d$) be the number of empty free holes between balls $i - 1$ and $i$ (where balls 0 and $d$ are the fixed balls in the holes at the left and right ends, respectively). The $q_i$ are nonnegative integers, and their sum equals the number of unoccupied free holes, giving $\sum_i q_i = (d + s - 1) - (d - 1) = s$. The number of different assignments of the $q_i$ is thus identical to the number of different arrangements of $d - 1$ identical balls across $d + s - 1$ holes; hence the result.
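The multiplicity formula can be checked by brute force (our sketch): enumerate all vectors $q$ with small components and count how many share each value of $s$:

```python
from itertools import product
from math import comb

def multiplicity(d, s):
    """Block size: number of d-vectors of nonnegative integers with
    sum_i q_i = s, i.e. (d + s - 1)! / [s! (d - 1)!] = C(d + s - 1, s)."""
    return comb(d + s - 1, s)

# brute-force count for d = 3: enumerate all q with components in 0..6;
# for s <= 6 every qualifying vector has components <= 6, so counts are exact
d = 3
counts = {}
for q in product(range(7), repeat=d):
    s = sum(q)
    counts[s] = counts.get(s, 0) + 1
```

For example, for $d = 3$ and $s = 2$ the six vectors are the three permutations of $(2, 0, 0)$ and the three of $(1, 1, 0)$, matching $\binom{4}{2} = 6$.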
References

Barber, D., & Williams, C. K. I. (1997). Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 340–346). Cambridge, MA: MIT Press.

Goldberg, P. W., Williams, C. K. I., & Bishop, C. M. (1998). Regression with input-dependent noise: A gaussian process treatment. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 493–499). Cambridge, MA: MIT Press.

John, F. (1978). Partial differential equations (3rd ed.). New York: Springer-Verlag.

MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.

Malzahn, D., & Opper, M. (2001). Learning curves for gaussian processes regression: A framework for good approximations. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 273–279). Cambridge, MA: MIT Press.

Micchelli, C. A., & Wahba, G. (1981). Design problems for optimal surface interpolation. In Z. Ziegler (Ed.), Approximation theory and applications (pp. 329–348). Orlando, FL: Academic Press.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: University of Toronto.

Opper, M. (1997). Regression with gaussian processes: Average case performance. In K.-Y. M. Wong, I. King, & D.-Y. Yeung (Eds.), Theoretical aspects of neural computation: A multidisciplinary perspective (pp. 17–23). New York: Springer.

Opper, M., & Vivarelli, F. (1999). General bounds on Bayes errors for regression with gaussian processes. In M. Kearns, S. A. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 302–308). Cambridge, MA: MIT Press.

Plaskota, L. (1990). On the average case complexity of linear problems with noisy information. Journal of Complexity, 6, 199–230.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.

Ritter, K. (1996). Almost optimal differentiation using noisy data. Journal of Approximation Theory, 86(3), 293–309.

Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting. Journal of the Royal Statistical Society Series B, 47(1), 1–52.

Sollich, P. (1994). Finite-size effects in learning and generalization in linear perceptrons. Journal of Physics A, 27, 7771–7784.

Sollich, P. (1999a). Approximate learning curves for gaussian processes. In ICANN99—Ninth International Conference on Artificial Neural Networks (pp. 437–442). London: Institution of Electrical Engineers.

Sollich, P. (1999b). Learning curves for gaussian processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 344–350). Cambridge, MA: MIT Press.

Sollich, P. (2001). Generalization of Plaskota's bound for gaussian process learning curves (Tech. Rep.). London: King's College London. Available on-line: www.mth.kcl.ac.uk/~psollich/publications/.

Stein, M. L. (1989). Comment on the paper by Sacks et al.: Design and analysis of computer experiments. Statistical Science, 4, 432–433.

Trecate, G. F., Williams, C. K. I., & Opper, M. (1999). Finite-dimensional approximation of gaussian processes. In M. Kearns, S. A. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 218–224). Cambridge, MA: MIT Press.

Williams, C. K. I. (1997). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 295–301). Cambridge, MA: MIT Press.

Williams, C. K. I. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning and inference in graphical models (pp. 599–621). Norwell, MA: Kluwer Academic.

Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 514–520). Cambridge, MA: MIT Press.

Williams, C. K. I., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.

Williams, C. K. I., & Vivarelli, F. (2000). Upper and lower bounds on the learning curve for gaussian processes. Machine Learning, 40(1), 77–102.

Wong, E. (1971). Stochastic processes in information and dynamical systems. New York: McGraw-Hill.

Zhu, H., Williams, C. K. I., Rohwer, R. J., & Morciniec, M. (1998). Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop (Ed.), Neural networks and machine learning. New York: Springer-Verlag.

Received May 10, 2001; accepted September 28, 2001.
LETTER
Communicated by Nicol Schraudolph
A Global Optimum Approach for One-Layer Neural Networks

Enrique Castillo
[email protected]
Department of Applied Mathematics and Computational Sciences, University of Cantabria and University of Castilla-La Mancha, 39005 Santander, Spain

Oscar Fontenla-Romero
[email protected]
Bertha Guijarro-Berdiñas
[email protected]
Amparo Alonso-Betanzos
[email protected]
Department of Computer Science, Faculty of Informatics, University of A Coruña, 15071 A Coruña, Spain
The article presents a method for learning the weights in one-layer feedforward neural networks minimizing either the sum of squared errors or the maximum absolute error, measured in the input scale. This leads to the existence of a global optimum that can be easily obtained by solving linear systems of equations or linear programming problems, using much less computational power than standard methods. Another version of the method allows computing a large set of estimates for the weights, providing robust (mean or median) estimates for them and the associated standard errors, which give a good measure of the quality of the fit. Later, the standard one-layer neural network algorithms are improved by learning the neural functions instead of assuming them known. A set of example applications is used to illustrate the methods. Finally, a comparison with other high-performance learning algorithms shows that the proposed methods are at least 10 times faster than the fastest standard algorithm used in the comparison.
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1429–1449 (2002).

1 Introduction

Single-layer networks were widely studied in the 1960s, and their history has been reviewed in several places (Widrow & Lehr, 1990). For a one-layer linear network, the weight values minimizing the sum-of-squares error function can be found in terms of the pseudo-inverse of a matrix (Bishop, 1995). Nevertheless, if a nonlinear activation function, such as a sigmoid or hyperbolic tangent, is used, or if a different error function is considered, a closed-form solution is no longer possible. However, if the activation function is differentiable, as is the case for the mentioned functions, the derivatives of the error function with respect to the weight parameters can be easily evaluated. These derivatives can then be used in a variety of gradient-based optimization algorithms for efficiently finding the minimum of the error function. The performance of these algorithms depends on several design choices, such as the selection of the step size and momentum terms, most of which are currently made by rules of thumb and trial and error. In addition to this problem, it is possible for the one-layer neural network to get stuck in a local minimum, as shown by Brady, Raghavan, and Slawny (1989), Sontag and Sussmann (1989), and Budinich and Milotti (1992). The existence of this problem was mathematically demonstrated by Sontag and Sussmann (1989), who provided a neural network example without hidden layers and with a sigmoid transfer function, for which the sum of squared errors has a local minimum that is not a global minimum. They observed that the existence of local minima is due to the fact that the error function is the superposition of functions whose minima are at different points. In the case of linear neurons, all of these functions are convex, so no difficulties appear because a sum of convex functions is also convex. However, sigmoidal units lead to nonconvex functions, so it is not guaranteed that their sums will have a unique minimum. Moreover, it was shown (Auer, Hebster, & Warmuth, 1996) that the number of such minima can grow exponentially with the input dimension $d$.
In particular, they proved that for the squared error and the logistic neural function, the error function of a single neuron for $n$ training examples may have $\lfloor n/d \rfloor^d$ local minima. In fact, this holds for any error and transfer functions whose composition is continuous and has a bounded range. The absence of local minima under certain separability or linear independence conditions (Budinich & Milotti, 1992; Gori & Tesi, 1992), or under modifications of the least-squares error norm (Sontag & Sussmann, 1991), has also been proven. However, the problem of characterizing these extrema for the standard least-squares error, using only a finite number of arbitrary inputs, has not been previously solved, nor is there a clear understanding of the conditions that cause the appearance of local minima. Coetzee and Stonick (1996) proposed a method for the a posteriori evaluation, in single-layer perceptrons, of whether a weight solution is unique or globally optimal, and for a priori scaling of desired vector values to ensure uniqueness through analysis of the input data. Although these results are potentially useful for evaluating optimality and uniqueness, the minima can be characterized only after training is complete. In this article, a new training algorithm is proposed in order to avoid the problem of local minima in a supervised one-layer neural network with
A Global Optimum Approach for One-Layer Neural Networks
1431
nonlinear activation functions. In section 2, we describe the rationale of the problem. In section 3, the method for learning the weights of the network, leading to a globally optimal solution, is presented. This approach is later enhanced by learning the neural functions. Section 4 deals with an alternative learning method that allows studying the variability of the different weights and indirectly testing the quality of the network. Section 5 gives some examples of applications to illustrate how the proposed methods can be used in practice. In section 6, the proposed methods are compared with other high-performance standard methods in terms of learning speed. Finally, section 7 gives some conclusions and recommendations.

2 Motivation
Consider the neural network in Figure 1a, where it is assumed that the nonlinear neural functions $f_1, f_2, \dots, f_J$ are invertible. The set of equations relating inputs and outputs is given by

$$y_{js} = f_j\!\left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} \right); \qquad j = 1, 2, \dots, J; \; s = 1, 2, \dots, S, \quad (2.1)$$
where $w_{j0}$ and $w_{ji}$, $i = 1, 2, \dots, I$, are the threshold values and the weights associated with neuron j (for $j = 1, 2, \dots, J$), and S is the number of data points. System 2.1 has $J \times S$ equations in $J \times (I+1)$ unknowns. However, since the number of data is large ($S \gg I+1$), in practice this set of equations in $w_{ji}$ is not compatible and consequently has no solution. Thus, the usual approach is to consider some errors $d_{js}$; equation 2.1 is transformed into

$$d_{js} = y_{js} - f_j\!\left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} \right); \qquad j = 1, 2, \dots, J; \; s = 1, 2, \dots, S, \quad (2.2)$$
and to estimate (learn) the weights, the sum of squared errors,

$$Q_1 = \sum_{s=1}^{S} \sum_{j=1}^{J} d_{js}^2 = \sum_{s=1}^{S} \sum_{j=1}^{J} \left( y_{js} - f_j\!\left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} \right) \right)^{\!2}, \quad (2.3)$$

is minimized. It is important to note that, due to the presence of the neural functions $f_j$, the function appearing in equation 2.3 is nonlinear in the $w_{ji}$ weights. Thus, minimizing $Q_1$ is not guaranteed to yield the global optimum. Current methods for learning the weights in neural networks have two shortcomings. First, the function $Q_1$ to be optimized has several (usually many) local optima. This implies that the users cannot know whether they
1432
Enrique Castillo, et al.
Figure 1: (a) Original one-layer feedforward neural network. (b–d) J independent neural networks leading to the J independent systems of linear equations in equation 3.4.
are in the presence of a global optimum and how far from it the local optimum resulting from the learning process is. In fact, several users of the same computer program can get, for the same data and model, different solutions if they use different starting sets of weights in the learning procedure. Second, the learning process requires a huge amount of computational power compared with other models. Since for each j the weights $w_{j0}$ and $w_{ji}$, $i = 1, 2, \dots, I$, are related only to $y_{js}$, and appear in only S of the $J \times S$ equations in equation 2.2, it is evident that the problem of learning the weights can be separated into J independent
problems (one for each j). This means that the neural network in Figure 1a can be split into J independent, simpler neural networks (see Figures 1b, 1c, and 1d). Thus, the problem becomes much simpler. In what follows, we deal with only one of these problems (for a fixed j).

3 Proposed Method for Learning the Weights Based on Linear Systems of Equations
Motivated by the problems indicated in the previous section, an alternative approach for learning the weights is proposed. The system of equations 2.2, which measures the errors in the output scale (the units of the $y_{js}$), can be written as

$$\epsilon_{js} = w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - f_j^{-1}(y_{js}); \qquad s = 1, 2, \dots, S; \; j = 1, 2, \dots, J, \quad (3.1)$$
which measures the errors in terms of the input scale (units of the $x_{is}$). We have two possibilities for learning the weights:

1. The first option is to minimize the sum of squared errors,

$$Q_{2j} = \sum_{s=1}^{S} \epsilon_{js}^2 = \sum_{s=1}^{S} \left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - f_j^{-1}(y_{js}) \right)^{\!2}, \quad (3.2)$$
which leads to the system of linear equations,

$$\frac{\partial Q_{2j}}{\partial w_{jp}} = \begin{cases} \displaystyle 2 \sum_{s=1}^{S} \left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - f_j^{-1}(y_{js}) \right) x_{ps} = 0 & \text{if } p > 0 \\[2ex] \displaystyle 2 \sum_{s=1}^{S} \left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - f_j^{-1}(y_{js}) \right) = 0 & \text{if } p = 0, \end{cases} \quad (3.3)$$
which can be written as

$$\sum_{i=1}^{I} \left( \sum_{s=1}^{S} x_{is} x_{ps} \right) w_{ji} = \sum_{s=1}^{S} \left( f_j^{-1}(y_{js}) - w_{j0} \right) x_{ps}; \qquad p = 1, 2, \dots, I,$$
$$\sum_{i=1}^{I} \left( \sum_{s=1}^{S} x_{is} \right) w_{ji} = \sum_{s=1}^{S} \left( f_j^{-1}(y_{js}) - w_{j0} \right). \quad (3.4)$$
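Since equation 3.4 is simply the normal-equations form of a linear least-squares problem with transformed targets $f_j^{-1}(y_{js})$, the training step for one output unit reduces to a single call to a linear solver. A minimal NumPy sketch (the function name, the synthetic data, and the choice $f_j^{-1} = \arctan$, borrowed from the examples of section 5, are illustrative assumptions, not the authors' code):

```python
import numpy as np

def sle_fit(X, y, f_inv=np.arctan):
    """Learn the weights [w_j0, w_j1, ..., w_jI] of one output unit by
    minimizing equation 3.2, i.e., by solving the linear system 3.4."""
    S = X.shape[0]
    X_aug = np.hstack([np.ones((S, 1)), X])   # prepend the bias column
    # lstsq solves the normal equations (3.4) via a stable factorization
    w, *_ = np.linalg.lstsq(X_aug, f_inv(y), rcond=None)
    return w

# Tiny usage example on noiseless synthetic data (f = tan, f_inv = arctan)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
w_true = np.array([0.5, 0.3, -0.2, 0.1])
y = np.tan(w_true[0] + X @ w_true[1:])
w_hat = sle_fit(X, y)                         # recovers w_true exactly
```

The same call solves the one-output subproblem of Figure 1b; repeating it J times, once per output, handles the full network.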
2. Alternatively, we can minimize the maximum absolute error,

$$\epsilon = \min_{\mathbf w} \max_{s} \left| w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - f_j^{-1}(y_{js}) \right|, \quad (3.5)$$
which can be stated as the following linear programming problem: Minimize $\epsilon$ subject to

$$w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - \epsilon \le f_j^{-1}(y_{js}),$$
$$-w_{j0} - \sum_{i=1}^{I} w_{ji} x_{is} - \epsilon \le -f_j^{-1}(y_{js}); \qquad s = 1, 2, \dots, S; \quad (3.6)$$
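Stated this way, the minimax fit is an ordinary linear program. A sketch using scipy.optimize.linprog (the stacking of the decision vector as $(w_{j0}, \dots, w_{jI}, \epsilon)$ and all names are my own illustrative choices, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

def sle_minimax(X, y, f_inv=np.arctan):
    """Solve the minimax problem of equations 3.5/3.6 for one output
    unit.  Decision vector: [w_0, w_1, ..., w_I, eps]."""
    S, I = X.shape
    X_aug = np.hstack([np.ones((S, 1)), X])      # bias column
    t = f_inv(y)
    # Constraints of eq. 3.6:  X_aug w - eps <= t  and  -X_aug w - eps <= -t
    A_ub = np.block([[ X_aug, -np.ones((S, 1))],
                     [-X_aug, -np.ones((S, 1))]])
    b_ub = np.concatenate([t, -t])
    c = np.zeros(I + 2)
    c[-1] = 1.0                                  # minimize eps only
    bounds = [(None, None)] * (I + 1) + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:-1], res.x[-1]                 # weights, optimal eps

# Usage on noiseless synthetic data: the optimal eps is essentially zero
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
w_true = np.array([0.5, 0.3, -0.2, 0.1])
y = np.tan(w_true[0] + X @ w_true[1:])
w_hat, eps = sle_minimax(X, y)
```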
The global optimum can be easily obtained by well-known linear programming techniques. In addition, it is easy to find the set of all possible solutions leading to the global optimum. This can be achieved by testing which constraints in equation 3.6 are active and solving the corresponding systems of linear equations to find the extreme points of the solution polytope. Consequently, we obtain the global optimum by solving a linear system of equations or a linear programming problem, methods that require much less computational power than that involved in minimizing $Q_1$ in equation 2.3.

3.1 Learning the Neural Functions. An important improvement is obtained if we learn the $f_j^{-1}$ functions instead of assuming that they are known. More precisely, they can be assumed to be linear convex combinations of a set
$\{\varphi_1(x), \varphi_2(x), \dots, \varphi_R(x)\}$ of invertible basic functions,

$$f_j^{-1}(x) = \sum_{r=1}^{R} a_{jr}\,\varphi_r(x); \qquad j = 1, 2, \dots, J, \quad (3.7)$$
where $\{a_{jr};\ r = 1, 2, \dots, R\}$ is the set of coefficients associated with function $f_j^{-1}$, which has to be chosen for the resulting function $f_j^{-1}$ to be invertible. Without loss of generality, it can be assumed to be increasing. Then, as before, we have two options:

1. Minimize, with respect to $w_{ji};\ i = 0, 1, \dots, I$ and $a_{jr};\ r = 1, 2, \dots, R$, the function

$$Q_{2j} = \sum_{s=1}^{S} \epsilon_{js}^2 = \sum_{s=1}^{S} \left( w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - \sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js}) \right)^{\!2}, \quad (3.8)$$

subject to

$$\sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js_1}) \le \sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js_2}); \qquad \forall\, y_{js_1} < y_{js_2},$$
which forces the candidate functions $f_j^{-1}$ to be increasing, at least in the corresponding intervals.

2. Alternatively, for each $j = 1, 2, \dots, J$, we can minimize the maximum absolute error,

$$\epsilon = \min_{\mathbf w, \mathbf a} \max_{s} \left| w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - \sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js}) \right|, \quad (3.9)$$
which can be stated as the following linear programming problem: Minimize $\epsilon$ subject to

$$w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} - \sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js}) - \epsilon \le 0,$$
$$-w_{j0} - \sum_{i=1}^{I} w_{ji} x_{is} + \sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js}) - \epsilon \le 0,$$
$$\sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js_1}) \le \sum_{r=1}^{R} a_{jr}\,\varphi_r(y_{js_2}); \qquad \forall\, y_{js_1} < y_{js_2},$$
$$\sum_{r=1}^{R} a_{jr}\,\varphi_r(y_0) = 1, \quad (3.10)$$
which has an easily obtainable global optimum. In order to avoid transformations with a large derivative, it is convenient to limit the first derivative of the $f_j^{-1}(y_{js})$ transformation. This can be done by adding the extra constraints,

$$\frac{d f_j^{-1}(y)}{dy} = \sum_{r=1}^{R} a_{jr}\,\varphi_r'(y) \ge b; \qquad j = 1, 2, \dots, J, \quad (3.11)$$
where b is a lower bound on the derivative of the inverse function; equivalently, 1/b is an upper bound on the derivative of the original function. Once the $f_j^{-1}$ have been obtained, to use the network it is necessary to work with their inverses, which can be computed using, for example, the bisection method. To avoid problems with the dimensions of the data, it is strongly suggested that the data be transformed to the unit hypercube.

4 Alternative Learning Method: Variability Study
If we write equation 2.1 as

$$w_{j0} + \sum_{i=1}^{I} w_{ji} x_{is} = f_j^{-1}(y_{js}); \qquad s = 1, 2, \dots, S, \quad (4.1)$$
since the number of unknowns is $(I+1)$, we realize that a given set of $(I+1)$ data points is enough (apart from a degenerate linear system) for learning the weights $\{w_{ji};\ i = 0, 1, \dots, I\}$. The proposed alternative consists of learning the weights with a selected (deterministically or randomly) class of subsets, each containing $v \ge (I+1)$ data points, and determining the variability of the different weights based on the resulting values. To proceed, repeat steps 1 and 2 below a large number of times, that is, from $r = 1$ to $r = m$ with m large, and then proceed to steps 3 and 4:

Step 1: Select, at random, a subset $D_r$ of $v \ge (I+1)$ data points from the initial set of S data points.

Step 2: Use the system of equations 3.4 or the linear programming problem 3.6, or solve the problems in equations 3.8 and 3.10, to obtain the corresponding set of weights $\{w_{jir};\ i = 0, 1, \dots, I\}$.

Step 3: Calculate the means $\mu_{ji}$, medians $M_{ji}$, and standard deviations $s_{ji}$ of the sets $\{w_{jir}\}$, $\forall i$.

Step 4: Return $\hat w_{ji} = \mu_{ji}$ or $\hat w_{ji} = M_{ji}$ as the estimate of $w_{ji}$, and $s_{ji}$ as a measure of the precision of the estimate of $w_{ji}$.
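Because each refit is only a linear solve, steps 1 through 4 are cheap to run many times. A sketch of the procedure, reusing the least-squares option of equation 3.4 (the function names, defaults, and subset size are illustrative assumptions of mine):

```python
import numpy as np

def weight_variability(X, y, f_inv=np.arctan, m=100, v=None, seed=0):
    """Steps 1-4 of section 4: refit the weights on m random subsets of
    v >= I+1 data points and summarize the spread of each weight."""
    rng = np.random.default_rng(seed)
    S, I = X.shape
    v = v if v is not None else max(I + 1, S // 3)
    W = np.empty((m, I + 1))
    for r in range(m):                              # steps 1 and 2
        idx = rng.choice(S, size=v, replace=False)
        Xa = np.hstack([np.ones((v, 1)), X[idx]])
        W[r], *_ = np.linalg.lstsq(Xa, f_inv(y[idx]), rcond=None)
    # steps 3 and 4: means, medians, standard deviations across refits
    return W.mean(axis=0), np.median(W, axis=0), W.std(axis=0)

# Usage on synthetic data: low standard deviations indicate stable estimates
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 3))
w_true = np.array([0.5, 0.3, -0.2, 0.1])
y = np.tan(w_true[0] + X @ w_true[1:])
mean_w, med_w, sd_w = weight_variability(X, y)
```

On noiseless data every subset yields the same weights, so the standard deviations are essentially zero; on real data they measure the precision of each estimate, as in Tables 1 and 5.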
5 Example Applications
In this section, we illustrate the proposed methods by applying them to well-known data sets. The first two cases, in sections 5.1 and 5.2,¹ correspond to classification problems and are used to verify the soundness of the learning methods proposed in sections 3 and 4. The last two examples, in sections 5.3 and 5.4,² are regression problems employed to compare the approach proposed in section 3, the improved method presented in section 3.1 that allows the learning of the neural functions, and the Levenberg-Marquardt backpropagation method (Hagan & Menhaj, 1994). In order to estimate the true error of the neural networks, leave-one-out cross-validation was used, except in the case of example 5.3.

5.1 The Fisher Iris Data Example. Our first example uses the Iris data (Fisher, 1936), perhaps one of the best-known databases in the pattern recognition literature. Fisher's article is a classic in the field and continues to be referenced frequently. The data set contains three classes of 50 instances

¹ Data obtained from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
² Data obtained from the Working Group on Data Modeling Benchmarks of the IEEE Neural Networks Council Standards Committee (http://neural.cs.nthu.edu.tw/jang/benchmark).
Table 1: Estimated Weights for Iris Data and Associated Standard Deviations.

Weight   Value        Mean      Median    SD
w10       0.9704812    0.9642    0.9567   0.0614
w11      -0.0273781   -0.0263   -0.0262   0.0149
w12      -0.0498872   -0.0495   -0.0500   0.0170
w13       0.0722752    0.0713    0.0717   0.0128
w14       0.0979362    0.0996    0.0985   0.0199

Note: Equation 3.6 and the method in section 4 are used.
each, where each class refers to a type of iris plant (Setosa, Versicolour, and Virginica). One class is linearly separable from the other two, which are not linearly separable from each other. The attribute information used to determine the type of plant (1 for Setosa, 2 for Versicolour, and 3 for Virginica) consists of sepal length, sepal width, petal length, and petal width, all in centimeters. We have used the neural network described in section 2 and Figure 1 and the learning method described in section 3. The neural function has been assumed to be $f^{-1}(x) = \arctan(x)$; this neural function was also used in all the following examples. The resulting value of the mean squared error was 0.0337, and the obtained weights are shown in the second column of Table 1. The dot plot in Figure 2B shows the value of the output for each of the 150 test examples employed in the leave-one-out cross-validation and the cutoff discriminating values (dashed lines) for the three groups. The resulting classification rule is as follows:

1. If $\tan(w_{10} + \sum_{i=1}^{4} w_{1i} x_{is}) \le 1.4$, the plant s is in class Setosa.
2. If $1.4 < \tan(w_{10} + \sum_{i=1}^{4} w_{1i} x_{is}) \le 2.365$, the plant s is in class Versicolour.
3. If $2.365 < \tan(w_{10} + \sum_{i=1}^{4} w_{1i} x_{is})$, the plant s is in class Virginica.

With these criteria, 98.67% of the plants are classified correctly. The confusion matrix in Table 2 shows the number of cases misclassified. Next, the alternative learning method described in section 4 was used. One hundred subsets of 50 randomly selected data points were used to learn the set of weights $\{w_{1i};\ i = 0, 1, 2, 3, 4\}$. Their means, medians, and standard deviations and their associated empirical cumulative distribution functions are shown in Table 1 and Figure 2A. A comparison of the different weight estimates shows that they are very similar. On the other hand, the low values
Figure 2: (A) Empirical cumulative distribution functions of the weights. (B) Plot for the iris data example.
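With the weights from the "Value" column of Table 1, the classification rule of section 5.1 is a single tan-transformed linear score compared against two cutoffs. A small illustration (the flower measurements below are made-up values roughly typical of each species, not rows of the actual Iris data):

```python
import math

# Weights from the "Value" column of Table 1 (w10 ... w14)
W = [0.9704812, -0.0273781, -0.0498872, 0.0722752, 0.0979362]

def classify_iris(x):
    """Rule of section 5.1: threshold tan(w10 + sum_i w1i * xi).
    x = (sepal length, sepal width, petal length, petal width) in cm."""
    score = math.tan(W[0] + sum(w * xi for w, xi in zip(W[1:], x)))
    if score <= 1.4:
        return "Setosa"
    elif score <= 2.365:
        return "Versicolour"
    return "Virginica"

# Made-up measurements roughly typical of each species
print(classify_iris((5.1, 3.5, 1.4, 0.2)))   # small petals -> Setosa
print(classify_iris((6.0, 2.8, 4.5, 1.4)))   # medium petals -> Versicolour
print(classify_iris((6.5, 3.0, 5.8, 2.2)))   # large petals -> Virginica
```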
Table 2: Confusion Matrix for the Iris Data.

                          True Classification
System Classification     Setosa   Versicolour   Virginica
Setosa                      50          0             0
Versicolour                  0         49             1
Virginica                    0          1            49
Table 3: Attributes Used in Breast Cancer Data.

Attribute                     Range
Clump thickness               1-10
Uniformity of cell size       1-10
Uniformity of cell shape      1-10
Marginal adhesion             1-10
Single epithelial cell size   1-10
Bare nuclei                   1-10
Bland chromatin               1-10
Normal nucleoli               1-10
Mitoses                       1-10
of the standard deviations suggest that the model is adequate for the Iris data set.

5.2 Breast Cancer Data. The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from William H. Wolberg (Bennett & Mangasarian, 1992). It consists of 699 instances—458 (65.5%) benign and 241 (34.5%) malignant—with 16 having some missing data. We have used only the data with complete information: 683 data instances. The meanings of the nine attributes used to determine the class (0 for benign, 1 for malignant) are shown in Table 3. The dot plot in Figure 3B shows the value of the output for each of the 683 test examples employed in the leave-one-out cross-validation and the cutoff discriminating values (dashed lines) for the two classes. The resulting classification criterion is as follows:

1. If $\tan(w_{10} + \sum_{i=1}^{9} w_{1i} x_{is}) \le 0.28$, the cancer s is classified as malignant.
2. If $\tan(w_{10} + \sum_{i=1}^{9} w_{1i} x_{is}) > 0.28$, the cancer s is classified as benign.

The number of cases classified correctly is 96.92%. The confusion matrix in Table 4 shows the number of misclassified data for each of the classes. Using the methods described in sections 3 and 4, the weight estimates with their standard deviations, shown in Table 5, have been obtained for 100 sub-
Figure 3: (A) Empirical cumulative distribution functions of the weights. (B) Plot for the breast cancer data.
Table 4: Confusion Matrix for Breast Cancer Data.

                          True Classification
System Classification     Benign   Malignant
Benign                      432        9
Malignant                    12      230
Table 5: Estimated Weights for Breast Cancer Data and Associated Standard Deviations.

Weight   Value     Mean      Median    SD
w10      1.0530    1.0415    1.0411    0.0128
w11      0.0069    0.0112    0.0109    0.0030
w12      0.0048   -0.0028   -0.0015    0.0074
w13      0.0034    0.0089    0.0074    0.0070
w14      0.0018    0.0036    0.0030    0.0041
w15      0.0022    0.0031    0.0029    0.0047
w16      0.0099    0.0089    0.0089    0.0035
w17      0.0042    0.0025    0.0025    0.0047
w18      0.0041    0.0060    0.0059    0.0043
w19      0.0002    0.0033    0.0035    0.0054

Note: Equation 3.6 and section 4 were used.
sets of 300 randomly selected data points. Figure 3A contains the associated empirical cumulative distributions of the weights. The estimates are very similar. 5.3 Modeling a Three-Input Nonlinear Function. In this example, we consider using the learning algorithm proposed in section 3 (see equation 3.3) and the improved method proposed in section 3.1 to model the nonlinear function
$$y = z^2 + z + \sin(z), \qquad \text{where } z = 3x_1 + 2x_2 - x_3.$$

We carried out 100 simulations employing 800 training data and 800 test data uniformly sampled from the input range $[0,1] \times [0,1] \times [0,1]$. The values of the nonlinear function y were normalized to the interval [0.05, 0.95]. In the method with learnable neural functions, we employ the following polynomial family,

$$\{\varphi_1(x), \varphi_2(x), \dots, \varphi_R(x)\} = \{x, x^2, \dots, x^R\}, \quad (4.2)$$
Table 6: Mean Error and Variability for the Three-Input Nonlinear Function.

Method                                 Mean Error   Variability
NN with known neural functions         1.155e-02    1.729e-04
NN with learnable neural functions     5.438e-03    6.509e-06
NN trained with Levenberg-Marquardt    1.189e-02    1.746e-04
as basic functions in equation 3.7. Several values of R and b (see equation 3.11) were tried. The results presented in this example were achieved with R = 7 and b = 0.4. Table 6 shows the mean error and the variability obtained for the test data over these 100 simulations. The results obtained by the Levenberg-Marquardt method are also shown in this table in order to compare our methods with a standard one. We tested the statistical significance of the differences between the mean errors of the systems in Table 6 using a t-test. As can be seen, the neural network with known neural functions and the one trained using Levenberg-Marquardt obtained similar performance, whereas at a 99% significance level, the neural network with learnable functions outperformed the other two approaches. Figure 4 shows a graph of the real output against the desired output (for one of the simulations) and the first 100 samples of the curve predicted by the networks in Table 6.

5.4 Box-Jenkins Furnace Data. This example deals with the problem of modeling a gas furnace, first presented by Box and Jenkins (1970). The modeled system consists of a gas furnace in which air and methane are combined to form a mixture of gases containing carbon dioxide (CO2). The furnace output, the CO2 concentration, is measured in the exhaust gases at the outlet of the furnace. The data set corresponds to a time series consisting of 296 pairs of observations of the form (u(t), y(t)), where u(t) represents the methane gas feed rate at time step t and y(t) is the concentration of CO2 in the gas outlets. The sampling time interval is 9 seconds. The goal is to predict y(t) based on {y(t-1), y(t-2), y(t-3), y(t-4), u(t-1), u(t-2), u(t-3), u(t-4), u(t-5), u(t-6)}. This reduces the number of effective data points to 290. Also, before training began, the output y(t) was normalized to the interval [0.05, 0.95].
The methods proposed in sections 3 (neural network with known neural functions, using equation 3.1) and 3.1 (neural network with learnable neural functions) were employed to predict the CO2 concentration. In the latter, several values of R and b were tried. The results obtained, for test data, using the polynomial family in equation 4.2 with R = 15 and b = 0.4, are shown in Table 7. Again, results obtained by the Levenberg-Marquardt method are included in this table. To test the significance of the difference between the mean errors of the systems in Table 7, a t-test was used. As in the previous
Figure 4: Comparison of the three-input nonlinear function. Neural network with given neural functions: (A) real against desired output and (B) observed (solid line) and predicted (dashed line) curves. Neural network with learnable neural functions: (C) real against desired output and (D) observed (solid line) and predicted (dashed line) curves. Levenberg-Marquardt method: (E) real against desired output and (F) observed (solid line) and predicted (dashed line) curves.
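The data-generation setup of the three-input example (section 5.3) is easy to reproduce. Below is a minimal sketch that samples the inputs, normalizes y to [0.05, 0.95], and fits the fixed-function network of section 3 by one linear solve; this is my own reimplementation under stated assumptions, so the resulting error need not match Table 6 exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training and test data sampled uniformly from [0, 1]^3
def make_data(n):
    X = rng.uniform(0, 1, size=(n, 3))
    z = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2]
    return X, z**2 + z + np.sin(z)

X_tr, y_tr = make_data(800)
X_te, y_te = make_data(800)

# Normalize the outputs into [0.05, 0.95] using the training range
lo, hi = y_tr.min(), y_tr.max()
scale = lambda y: 0.05 + 0.90 * (y - lo) / (hi - lo)
y_tr_n, y_te_n = scale(y_tr), scale(y_te)

# Fit by one linear solve with f_inv = arctan (section 3)
A = np.hstack([np.ones((800, 1)), X_tr])
w, *_ = np.linalg.lstsq(A, np.arctan(y_tr_n), rcond=None)

# Predict on test data and measure the mean squared error
pred = np.tan(np.hstack([np.ones((800, 1)), X_te]) @ w)
mse = np.mean((pred - y_te_n) ** 2)
```

Repeating this over 100 random draws and averaging mirrors the protocol of the experiment.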
Table 7: Mean Error and Variability, Box-Jenkins Furnace Data.

Method                                 Mean Error   Variability
NN with known neural functions         2.084e-02    2.617e-04
NN with learnable neural functions     1.667e-02    1.794e-04
NN trained with Levenberg-Marquardt    2.090e-02    2.626e-04
example, at a 99% significance level, significant performance differences between the neural network with learnable functions and the other two approaches were found. Therefore, we can affirm that the improved method proposed in section 3.1 outperforms the first one introduced (in section 3), and so the learning of the neural functions enhances the performance of the system. Figure 5 shows a plot of the real output against the desired output and the source data versus the time series predicted by our two neural systems and the Levenberg-Marquardt method.

6 A Learning Speed Comparison
This section contains a comparative study of learning speed between the method proposed in section 3 (see equation 3.4) and several high-performance algorithms. To this end, the examples described in sections 5.1, 5.2, 5.3, and 5.4 are used. The following algorithms were used in the experiments:

Levenberg-Marquardt backpropagation (LM) (Hagan & Menhaj, 1994)
Fletcher-Powell conjugate gradient (CGF) (Fletcher & Reeves, 1964; Hagan, Demuth, & Beale, 1996)
Polak-Ribiere conjugate gradient (CGP) (Fletcher & Reeves, 1964; Hagan et al., 1996)
Conjugate gradient with Powell-Beale restarts (CGB) (Powell, 1977; Beale, 1972)
Scaled conjugate gradient (SCG) (Moller, 1993)
Quasi-Newton with Broyden, Fletcher, Goldfarb, and Shanno updates (BFGS) (Dennis & Schnabel, 1983)
One-step secant backpropagation (OSS) (Battiti, 1992)
Resilient backpropagation (RP) (Riedmiller & Braun, 1993)

For each of the data sets, 100 trials with randomly initialized weights were carried out on a Silicon Graphics Origin 200 computer. The networks were trained using the mean squared error function in all cases. The training was stopped when the error goal (4e-4, 3e-4, 7.7e-3, and 1.7e-2 for examples
Figure 5: Comparison of results for Box-Jenkins data. Neural network with known neural functions: (A) real against desired output and (B) observed (solid line) and predicted (dashed line) time series. Neural network with learnable neural functions: (C) real against desired output and (D) observed (solid line) and predicted (dashed line) time series. Levenberg-Marquardt method: (E) real against desired output, and (F) observed (solid line) and predicted (dashed line) time series.
1446
Enrique Castillo, et al.
Table 8: Training Time for Iris Data.

Algorithm   Mean Time (s)   Ratio     Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0054          1.0000    0.0053             0.0059             0.000074
LM          0.1538          28.480    0.1255             0.1837             0.012203
CGF         0.7929          146.83    0.1133             1.4683             0.269170
CGP         0.8885          164.54    0.1260             1.6335             0.320160
CGB         0.7166          132.70    0.1131             1.2714             0.211600
SCG         0.6719          124.43    0.1190             1.6152             0.200100
BFGS        1.3441          248.91    0.7065             1.7931             0.266610
OSS         7.5704          1401.93   1.4436             22.398             4.333000
RP          7.2496          1342.52   1.1705             19.511             4.351700
Table 9: Training Time for Breast Cancer Data.

Algorithm   Mean Time (s)   Ratio     Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0245          1.0000    0.0238             0.0265             0.00038
LM          0.2351          9.6000    0.1989             0.3119             0.02264
CGF         1.3737          56.070    0.1219             12.592             1.62990
CGP         1.6820          68.650    0.1338             20.211             2.86940
CGB         1.1796          48.150    0.1226             7.3879             0.86438
SCG         0.7870          32.120    0.1296             1.9263             0.19669
BFGS        2.7466          112.11    0.6902             8.3371             0.84858
OSS         39.230          1601.2    2.7709             173.39             56.0295
RP          1.6540          67.510    0.6553             2.1685             0.31739
5.1, 5.2, 5.3, and 5.4, respectively) was attained, the maximum number of epochs (2,000) was reached, or the gradient fell below a minimum threshold (1e-10). Tables 8 through 11 summarize the training speed results of the networks associated with all of the algorithms used. The tables contain the mean, minimum, and maximum times and the standard deviation obtained in 100 simulations. Also, the fastest algorithm is used as the reference for the time-ratio parameter, so that it reflects how many times slower, on average, each algorithm is than the fastest one. The method proposed in this article is identified by the acronym SLE (system of linear equations). As can be seen in the tables, even in the worst case, the SLE method is about 10 times faster than the next fastest algorithm. Finally, it is also important to note that the large speed differences between our proposed method and the others are due to the iterative nature of the latter, whereas the former requires only solving a linear system of equations or a linear programming problem.
Table 10: Training Time for Nonlinear Function Data.

Algorithm   Mean Time (s)   Ratio     Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0057          1.0000    0.0055             0.0066             0.00015
LM          0.1768          31.020    0.1642             0.1901             0.01122
CGF         0.8346          146.42    0.5026             1.3558             0.19879
CGP         0.8814          154.63    0.4800             1.8231             0.31403
CGB         0.7844          137.61    0.4426             1.3517             0.22015
SCG         0.5832          102.32    0.3701             0.8441             0.10821
BFGS        1.1666          204.67    0.7801             1.4787             0.13563
OSS         3.3554          588.67    1.6546             7.6841             1.03380
RP          3.1387          550.65    1.0870             4.5230             0.98790
Table 11: Training Time for Box-Jenkins Data.

Algorithm   Mean Time (s)   Ratio     Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0243          1.0000    0.0224             0.0963             0.0073
LM          0.3178          13.080    0.2306             1.0212             0.0892
CGF         6.7133          276.27    2.5028             10.770             1.4806
CGP         13.658          562.05    4.2101             98.329             23.169
CGB         6.0728          249.91    4.2211             8.4621             0.9492
SCG         6.2298          256.37    3.8638             11.047             1.2715
BFGS        3.5819          147.40    2.4899             4.2659             0.3173
OSS         106.25          4372.4    69.439             113.09             9.6851
RP          31.572          1299.3    31.464             32.032             0.0785
7 Conclusion
In this article a new learning method for single-layer neural networks has been presented. This approach, which measures the errors in terms of the input scale instead of the output scale, allows reaching the global optimum of the error surface while avoiding local minima. Two alternative training procedures were proposed for learning the weights. The first, which employs the sum of squared errors as a cost function, is based on a system of linear equations; the second, which minimizes the maximum absolute error, is based on a linear programming problem. It is worth mentioning that, due to the techniques employed to minimize the objective function, these approaches exhibit very fast convergence. The proposed methods were later extended so that learning the neural functions is also possible. To this end, these functions were assumed to be a linear combination of invertible basic functions. In all the models presented in this article, a polynomial combination has been used, but many others can be considered (e.g., a Fourier expansion). Finally, the examples show that the proposed
methods present good performance on well-known data sets. Moreover, the approach that allows learning the neural functions seems to be superior to the one with fixed neural functions. The comparison with other learning algorithms shows that the proposed method clearly outperforms all of the compared methods in learning speed. Finally, an important issue to address is the correspondence between minima of the standard error (see equation 2.3) and the novel formulation of the error (see equation 3.2). This correspondence is exact when the error is zero, that is, when the estimated model coincides with the real model underlying the data. Under other conditions, this correspondence does not hold and needs further analysis, which will be addressed in future work.

Acknowledgments
We thank the Universities of Cantabria and Castilla-La Mancha, the Dirección General de Investigación Científica y Técnica (DGICYT) (project PB98-0421), Iberdrola, and the Xunta de Galicia (project PGIDT99COM10501 and the predoctoral grants programme 2000/01) for partial support of this research.

References

Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local minima for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 316–322). Cambridge, MA: MIT Press.
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4(2), 141–166.
Beale, E. M. L. (1972). A derivation of conjugate gradients. In F. A. Lootsma (Ed.), Numerical methods for nonlinear optimization. London: Academic Press.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis, forecasting and control. San Francisco: Holden Day.
Brady, M., Raghavan, R., & Slawny, J. (1989). Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
Budinich, M., & Milotti, E. (1992). Geometrical interpretation of the back-propagation algorithm for the perceptron. Physica A, 185, 369–377.
Coetzee, F. M., & Stonick, V. L. (1996). On uniqueness of weights in single-layer perceptrons. IEEE Transactions on Neural Networks, 7(2), 318–325.
Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice Hall.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fletcher, R., & Reeves, C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7, 149–154.
Gori, M., & Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1), 76–85.
Hagan, M. T., Demuth, H. B., & Beale, M. H. (1996). Neural network design. Boston: PWS Publishing.
Hagan, M. T., & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989–993.
Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
Powell, M. J. D. (1977). Restart procedures for the conjugate gradient method. Mathematical Programming, 12, 241–254.
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks. San Francisco.
Sontag, E., & Sussmann, H. J. (1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3, 91–106.
Sontag, E., & Sussmann, H. J. (1991). Back propagation separates where perceptrons do. Neural Networks, 4, 243–249.
Widrow, B., & Lehr, M. A. (1990). Thirty years of adaptive neural networks: Perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415–1442.

Received May 11, 2001; accepted August 22, 2001.
LETTER
Communicated by Jürgen Schmidhuber
MLP in Layer-Wise Form with Applications to Weight Decay
Tommi Kärkkäinen
[email protected]
Department of Mathematical Information Technology, University of Jyväskylä, P.O. Box 35 (Agora), FIN-40351 Jyväskylä, Finland
A simple and general calculus for the sensitivity analysis of a feedforward MLP network in a layer-wise form is presented. Based on the local optimality conditions, some consequences for the least-mean-squares learning problem are stated and further discussed. Numerical experiments with formulation and comparison of different weight decay techniques are included.
1 Introduction
There are many assumptions and intrinsic conditions behind well-known computational techniques that are sometimes lost on a practitioner. In this work, we try to illuminate some of the issues related to multilayered perceptron networks (MLPs). The point of view here is mainly based on the theory and practice of optimization, with special emphasis on sensitivity analysis (computation of the gradient), scale balancing, convexity, and smoothness. First, we propose a simplified way to derive the error backpropagation formulas for MLP training in a layer-wise form. The basic advantage of the layer-wise formalism is that the optimality system is presented in a compact form that can be readily exploited in an efficient computer realization. Moreover, due to the clear description of the optimality conditions, we are able to derive some consequences and interpretations concerning the final structure of the trained network. These results have direct applications to different weight decay techniques, which are presented and tested through numerical experiments. In section 2, we introduce the algebraic formalism and the original learning problem. In section 3, we compute the optimality conditions for network learning and derive and discuss some of their consequences. Finally, in section 4 we present numerical experiments for studying different regularization techniques and make some observations based on the computational results.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1451–1480 (2002)
2 Preliminaries
Action of the multilayered perceptron in a layer-wise form is given by (Rojas, 1996; Hagan & Menhaj, 1994)

$$o^0 = x, \qquad o^l = \mathcal{F}^l(W^l \hat{o}^{(l-1)}) \quad \text{for } l = 1, \dots, L. \tag{2.1}$$
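As an illustration, the layer-wise action of equation 2.1 with a linear transformation in the final layer can be sketched in a few lines of NumPy (a minimal sketch; the function and variable names here are our own, not from the paper):

```python
import numpy as np

def extend(o):
    """Bias extension (the hat operator): prepend a constant 1 to a layer output."""
    return np.concatenate(([1.0], o))

def forward(weights, x, activations):
    """Layer-wise MLP action of eq. 2.1 with a linear final layer.

    weights[l] has shape (n_{l+1}, n_l + 1); activations[l] is applied
    componentwise after each hidden layer."""
    o, outputs, L = x, [x], len(weights)
    for l, W in enumerate(weights):
        v = W @ extend(o)
        o = v if l == L - 1 else activations[l](v)  # no activation on the final layer
        outputs.append(o)
    return outputs

# One hidden layer with tanh activation: n_0 = 2, n_1 = 3, n_2 = 1.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 3))   # 3 x (2 + 1), first column holds the biases
W2 = rng.standard_normal((1, 4))   # 1 x (3 + 1)
outs = forward([W1, W2], np.array([0.5, -0.2]), [np.tanh])
```

Storing the list of layer outputs, rather than only the final one, is what later makes the backward gradient loop cheap.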
We have placed the layer number (starting from zero for the input) as an upper index. By $\hat{\cdot}$ we indicate the addition of the bias term, and $\mathcal{F}^l(\cdot)$ denotes the usual componentwise activation on the $l$th level. The dimensions of the weight matrices are given by $\dim(W^l) = n_l \times (n_{l-1} + 1)$, $l = 1, \dots, L$, where $n_0$ is the length of an input vector $x$, $n_L$ the length of the output vector $o^L$, and $n_l$, $0 < l < L$, determine the sizes (numbers of neurons) of the hidden layers.

We notice that the activation of $m$ neurons can be equivalently represented by using a diagonal function matrix $\mathcal{F} = \mathcal{F}(\cdot) = \mathrm{Diag}\{f_i(\cdot)\}_{i=1}^m$ of the form

$$\mathcal{F} = \begin{bmatrix} f_1(\cdot) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & f_m(\cdot) \end{bmatrix}. \tag{2.2}$$

A function matrix, supplied with the natural way to define the matrix-vector product $y = \mathcal{F}(v)$, $y_i = \sum_{j=1}^m f_{ij}(v_j)$, yields the usual activation approach. However, any enlargement of the function matrix to contain nondiagonal function entries as well offers interesting possibilities for generalizing the basic MLP architecture. This topic is outside the scope of the work presented here, although both the sensitivity analysis and its consequences in what follows remain valid in this case too.

Using given training data $\{x_i, y_i\}_{i=1}^N$, $x_i \in \mathbb{R}^{n_0}$ and $y_i \in \mathbb{R}^{n_L}$, the unknown weight matrices $\{W^l\}_{l=1}^L$ are determined as a solution of the optimization problem

$$\min_{\{W^l\}_{l=1}^L} J(\{W^l\}), \tag{2.3}$$

where

$$J(\{W^l\}) = \frac{1}{2N} \sum_{i=1}^N \|\mathcal{N}(\{W^l\})(x_i) - y_i\|^2 \tag{2.4}$$

is the least-mean-squares (LMS) cost functional for a simplified architecture of MLP containing only a linear transformation in the final layer. That is, $f_i^L(x) = x$ for all $1 \le i \le n_L$, and $o_i^L = \mathcal{N}(\{W^l\})(x_i) = W^L \hat{o}_i^{(L-1)}$. The relation
between this choice and the original form, equation 2.1, will be considered in sections 3.3 and 3.4. In equation 2.4, $\|\cdot\|$ denotes the usual Euclidean norm, which is induced by the $l_2$ inner product $(v, w) = w^T v$.

Preprocessing of the training data $\{x_i, y_i\}$ has been extensively considered, for example, by LeCun, Bottou, Orr, and Müller (1998). From the optimization point of view, an essential property of any problem is to have all unknowns on the same scale, because a gradient with components of different orders of magnitude can yield nonbalanced updates in the training algorithm (Nocedal & Wright, 1999). This usually leads to nonconvergence or large deviation of weights, decreasing the fault tolerance of the trained network (Cavalieri & Mirabella, 1999). Therefore, we require that all activation functions $\{\mathcal{F}^l\}$ have the same range and that all of the training data are prescaled into this range. Then all layers of the network treat vectors with components of the same order of magnitude.

Notice, however, that using tanh activation and the proposed prescaling, the average of the minimum and maximum values of each feature in $\{x_i\}$ is transformed to zero. Then all records containing the zero value become insensitive to the corresponding column in the weight matrix $W^1$. Hence, for features having a symmetric and peak-like distribution, the prescaling may result in the nonuniqueness of $W^1$ (and of the subsequent matrices $W^l$, $1 < l \le L$). Naturally, one can study the existence of such problems through the distribution (histogram) of the individual prescaled features.

3 Sensitivity Analysis
An essential part of any textbook on neural networks is the derivation of the error-backpropagation formulas for network training (Bishop, 1995; Reed & Marks, 1999; Rojas, 1996). What makes such sensitivity analysis messy is the consecutive application of the chain rule in an index jungle. Here, we describe another approach that uses the Lagrangian treatment of equality-constrained optimization problems (Bertsekas, 1982) and circumvents these technicalities, allowing one to derive the necessary optimality conditions in a straightforward way. The idea behind our treatment is to work on the same level of abstraction, the layer-wise form, that is used for the initial description of the MLP.

There are many techniques for solving optimization problems like equation 2.3 without the need of precise formulas for the derivatives. One class of techniques consists of the so-called gradient-free optimization methods as presented, for example, in Osman and Kelly (1996) and Bazaraa, Sherali, and Shetty (1993). The computational efficiency of these approaches, however, cannot compete with methods using gradient information during the search procedure. Another interesting possibility for avoiding explicit sensitivity analysis is to use automatic differentiation (AD), which enables the computation of derivatives as a (user-hidden) by-product of the cost function evaluation (Griewank, 2000). AD allows fast and straightforward development and simulation of different architectures and learning problems for MLPs, but at the expense of increased storage requirements and computing time compared to the fully exploited analytical approach (see the discussion after theorem 2 in section 3.2). Moreover, a clear and precise description of the optimality conditions is necessary to be able to analyze the properties of trained networks (cf. sections 3.3 and 3.4).

3.1 MLP with One Hidden Layer. To simplify the presentation, we start with an MLP with only one hidden layer. Then any local solution $(W^{1*}, W^{2*})$ of the minimization problem, equation 2.3, is characterized by the conditions

$$\nabla_{(W^1, W^2)} J(W^{1*}, W^{2*}) = \begin{bmatrix} \nabla_{W^1} J(W^{1*}, W^{2*}) \\ \nabla_{W^2} J(W^{1*}, W^{2*}) \end{bmatrix} = \begin{bmatrix} O \\ O \end{bmatrix}. \tag{3.1}$$

Here, $\nabla_{W^l} J = \left( \frac{\partial J}{\partial w_{ij}^l} \right)_{i,j}$, $l = 1, 2$, are given in the same matrix form as the unknown weight matrices. Hence, our next task is to derive the derivatives with respect to these data structures (cf. appendix B in Diamantaras & Kung, 1996; Hagan & Menhaj, 1994). For this analysis, we presuppose that all activation functions in the function matrix $\mathcal{F} = \mathcal{F}^1$ are differentiable. We start the derivation by stating some simple lemmas, for which the proofs are given in appendix 4.2. We note that the proposed approach is not restricted to the LMS error function. For other differentiable cost functionals, like softmax and cross-entropy, one can use exactly the same technique for a simple derivation of the necessary optimality conditions in a layer-wise form.

Lemma 1. Let $v \in \mathbb{R}^{m_1}$ and $y \in \mathbb{R}^{m_2}$ be given vectors. The gradient matrix $\nabla_W J(W) \in \mathbb{R}^{m_2 \times m_1}$ for the functional $J(W) = \frac{1}{2}\|Wv - y\|^2$ is of the form

$$\nabla_W J(W) = [Wv - y]\, v^T.$$

Lemma 2. Let $W \in \mathbb{R}^{m_2 \times m_1}$ be a given matrix, $y \in \mathbb{R}^{m_2}$ a given vector, and $\mathcal{F} = \mathrm{Diag}\{f_i(\cdot)\}_{i=1}^{m_1}$ a given diagonal function matrix. The gradient vector $\nabla_u J(u) \in \mathbb{R}^{m_1}$ for the functional $J(u) = \frac{1}{2}\|W \mathcal{F}(u) - y\|^2$ reads as

$$\nabla_u J(u) = \left( W \mathcal{F}'(u) \right)^T [W \mathcal{F}(u) - y] = \mathrm{Diag}\{\mathcal{F}'(u)\}\, W^T [W \mathcal{F}(u) - y].$$

Lemma 3. Let $\bar{W} \in \mathbb{R}^{m_2 \times m_1}$ be a given matrix, $\mathcal{F} = \mathrm{Diag}\{f_i(\cdot)\}_{i=1}^{m_1}$ a given diagonal function matrix, and $v \in \mathbb{R}^{m_0}$, $y \in \mathbb{R}^{m_2}$ given vectors. The gradient matrix $\nabla_W J(W) \in \mathbb{R}^{m_1 \times m_0}$ for the functional $J(W) = \frac{1}{2}\|\bar{W} \mathcal{F}(Wv) - y\|^2$ is of the form

$$\nabla_W J(W) = \mathrm{Diag}\{\mathcal{F}'(Wv)\}\, \bar{W}^T [\bar{W} \mathcal{F}(Wv) - y]\, v^T.$$
Now we are ready to state the actual result for the perceptron with one hidden layer.

Theorem 1. The gradient matrices $\nabla_{W^2} J(W^1, W^2)$ and $\nabla_{W^1} J(W^1, W^2)$ for the cost functional, equation 2.4, are of the form

i. $\nabla_{W^2} J(W^1, W^2) = \frac{1}{N} \sum_{i=1}^N [W^2 \hat{\mathcal{F}}(W^1 \hat{x}_i) - y_i]\, [\hat{\mathcal{F}}(W^1 \hat{x}_i)]^T = \frac{1}{N} \sum_{i=1}^N e_i\, [\hat{\mathcal{F}}(W^1 \hat{x}_i)]^T$,

ii. $\nabla_{W^1} J(W^1, W^2) = \frac{1}{N} \sum_{i=1}^N \mathrm{Diag}\{\mathcal{F}'(W^1 \hat{x}_i)\}\, (W_1^2)^T [W^2 \hat{\mathcal{F}}(W^1 \hat{x}_i) - y_i]\, \hat{x}_i^T = \frac{1}{N} \sum_{i=1}^N \mathrm{Diag}\{\mathcal{F}'(W^1 \hat{x}_i)\}\, (W_1^2)^T e_i\, \hat{x}_i^T$.

In formula ii, $W_1^2$ is the submatrix $(W^2)_{i,j}$, $i = 1, \dots, n_2$, $j = 1, \dots, n_1$, which is obtained from $W^2$ by removing the first column $W_0^2$ containing the bias terms.
Proof. Formula i is a direct consequence of lemma 1. Moreover, due to the definition of the extension operator $\hat{\cdot}$, we have, for all $1 \le i \le N$,

$$W^2 \hat{\mathcal{F}}(W^1 \hat{x}_i) - y_i = [W_0^2 \;\; W_1^2] \begin{bmatrix} 1 \\ \mathcal{F}(W^1 \hat{x}_i) \end{bmatrix} - y_i = W_0^2 + W_1^2\, \mathcal{F}(W^1 \hat{x}_i) - y_i. \tag{3.2}$$

Using equation 3.2 and lemma 3 componentwise for the cost functional, equation 2.4, shows formula ii and ends the proof.

3.2 MLP with Several Hidden Layers. Next, we generalize the previous analysis to the case of several hidden layers.
Lemma 4. Let $\bar{W} \in \mathbb{R}^{m_3 \times m_2}$ and $\tilde{W} \in \mathbb{R}^{m_2 \times m_1}$ be given matrices, $\tilde{\mathcal{F}} = \mathrm{Diag}\{\tilde{f}_i(\cdot)\}_{i=1}^{m_1}$ and $\bar{\mathcal{F}} = \mathrm{Diag}\{\bar{f}_i(\cdot)\}_{i=1}^{m_2}$ given diagonal function matrices, and $v \in \mathbb{R}^{m_0}$, $y \in \mathbb{R}^{m_3}$ given vectors. The gradient matrix $\nabla_W J(W) \in \mathbb{R}^{m_1 \times m_0}$ for the functional $J(W) = \frac{1}{2}\|\bar{W} \bar{\mathcal{F}}(\tilde{W} \tilde{\mathcal{F}}(Wv)) - y\|^2$ is of the form

$$\nabla_W J(W) = \mathrm{Diag}\{\tilde{\mathcal{F}}'(Wv)\}\, \tilde{W}^T\, \mathrm{Diag}\{\bar{\mathcal{F}}'(\tilde{W} \tilde{\mathcal{F}}(Wv))\}\, \bar{W}^T [\bar{W} \bar{\mathcal{F}}(\tilde{W} \tilde{\mathcal{F}}(Wv)) - y]\, v^T.$$
Theorem 2. The gradient matrices $\nabla_{W^l} J(\{W^l\})$, $l = L, \dots, 1$, for the cost functional, equation 2.4, read as

$$\nabla_{W^l} J(\{W^l\}) = \frac{1}{N} \sum_{i=1}^N \delta_i^l\, [\hat{o}_i^{(l-1)}]^T, \tag{3.3}$$

where

$$\delta_i^L = e_i = W^L \hat{o}_i^{(L-1)} - y_i, \qquad \delta_i^l = \mathrm{Diag}\{(\mathcal{F}^l)'(W^l \hat{o}_i^{(l-1)})\}\, (W_1^{(l+1)})^T \delta_i^{(l+1)}. \tag{3.4}$$

Proof. Apply lemma 4 inductively.
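To make the layer-wise recursion concrete, the following sketch implements equations 3.3 and 3.4 for the LMS cost with a linear final layer (the names are our own and this is only an illustration of the theorem, not the author's F77 implementation):

```python
import numpy as np

def lms_gradients(weights, X, Y, act, dact):
    """Layer-wise gradients (eqs. 3.3-3.4) of the LMS cost, linear final layer.

    weights[l] has shape (n_{l+1}, n_l + 1), with biases in column 0;
    act/dact are the hidden activation and its derivative (componentwise)."""
    N, L = X.shape[0], len(weights)
    grads = [np.zeros_like(W) for W in weights]
    for x, y in zip(X, Y):
        # forward loop: store extended layer outputs and preactivations
        o, o_hats, vs = x, [], []
        for l, W in enumerate(weights):
            o_hat = np.concatenate(([1.0], o))
            o_hats.append(o_hat)
            v = W @ o_hat
            vs.append(v)
            o = v if l == L - 1 else act(v)
        # backward loop: delta^L = e_i; eq. 3.4 drops the bias column of W^{l+1}
        delta = o - y
        for l in range(L - 1, -1, -1):
            grads[l] += np.outer(delta, o_hats[l]) / N
            if l > 0:
                delta = dact(vs[l - 1]) * (weights[l][:, 1:].T @ delta)
    return grads
```

The stored layer outputs from the forward loop are exactly what the backward loop consumes, which is the storage argument made in the surrounding text.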
The definition of $\delta_i^l$ contains the backpropagation of the output error in a layer-wise form. The compact presentation of the optimality system can be readily exploited in the implementation. More precisely, the computation of the activation derivatives $(\mathcal{F}^l)'(W^l \hat{o}_i^{(l-1)})$ can, when using the usual logistic sigmoid or hyperbolic tangent functions, be realized using the layer outputs $o_i^l$ without additional storage. Hence, in the forward loop, it is enough to store the outputs of the different layers for the backward gradient loop, where these same vectors can be overwritten by $\delta_i^l$ when the whole operation in equation 3.4 is implemented using a single loop. In modern workstations, such a combination of operations, yielding a minimal number of loops through the memory, can decrease the computing time significantly (Kärkkäinen & Toivanen, 2001).

3.3 Some Corollaries. Next, we derive some corollaries of the layer-wise optimality conditions. All of these results follow from having a linear transformation with biases in the final layer of the MLP. Moreover, whether one applies on-line or batch training makes no difference if the optimization problem 2.3 is solved with high accuracy, because then both methods yield $\nabla_{W^l} J(\{W^{l*}\}) \approx O$ for the local solution $\{W^{l*}\}$.

Corollary 1. For a locally optimal MLP network satisfying the conditions in theorem 2:

i. The average error $\frac{1}{N} \sum_{i=1}^N e_i^*$ is zero.

ii. The correlation between the error vectors and the action of layer $L-1$ is zero.

Proof. The optimality condition $\nabla_{W^L} J(\{W^{l*}\}) = \frac{1}{N} \sum_{i=1}^N e_i^*\, [\hat{o}_i^{(L-1)}]^T = O$ (with the abbreviation $\hat{o}_i^{(L-1)} = \hat{o}_i^{(L-1)*}$) in theorem 2 can be written in the
nonextended form as

$$\frac{1}{N} \sum_{i=1}^N e_i^*\, [1 \;\; (o_i^{(L-1)})^T] = O. \tag{3.5}$$

By taking the transpose of this, we obtain

$$\frac{1}{N} \sum_{i=1}^N \begin{bmatrix} 1 \\ o_i^{(L-1)} \end{bmatrix} (e_i^*)^T = \frac{1}{N} \sum_{i=1}^N \begin{bmatrix} (e_i^*)^T \\ o_i^{(L-1)} (e_i^*)^T \end{bmatrix} = \begin{bmatrix} O \\ O \end{bmatrix}. \tag{3.6}$$
This readily shows both results.

In the next corollary, we recall some conditions concerning the final weight matrix $W^L$. We note that the assumption of nonsingularity below can be relaxed to pseudo-invertibility, which has been widely used to generate new training algorithms for MLP networks (Di Martino, Fanelli, & Protasi, 1996; Wang & Chen, 1996).

Corollary 2. If the autocorrelation matrix $A = \frac{1}{N} \sum_{i=1}^N \hat{v}_i \hat{v}_i^T$ of the (extended) final-layer inputs $v_i = o_i^{(L-1)}$ is nonsingular, $W^L$ can be recovered from the formula

$$W^L = B A^{-1} \quad \text{for} \quad B = \frac{1}{N} \sum_{i=1}^N y_i \hat{v}_i^T. \tag{3.7}$$

Furthermore, if $A^*$ for a minimizer $\{W^{l*}\}_{l=1}^L$ of equation 2.4 is nonsingular, then $W^{L*}$ is unique.
Proof. $W^L$ satisfying the system

$$\frac{1}{N} \sum_{i=1}^N [W^L \hat{o}_i^{(L-1)} - y_i]\, [\hat{o}_i^{(L-1)}]^T = O \tag{3.8}$$

is independent of $i$. Hence, equation 3.8 can be written as

$$W^L\, \frac{1}{N} \sum_{i=1}^N \hat{v}_i \hat{v}_i^T = \frac{1}{N} \sum_{i=1}^N y_i \hat{v}_i^T \quad \Longleftrightarrow \quad W^L A = B. \tag{3.9}$$

Multiplying both sides from the right by $A^{-1}$ gives equation 3.7 and ends the proof.
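Equation 3.7 and corollary 1 are easy to verify numerically. The sketch below (synthetic data and variable names of our own choosing) recovers the optimal final layer from the autocorrelation matrix $A$ and checks that the resulting residuals have zero mean and zero correlation with the extended final-layer inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_prev, n_out = 200, 3, 2

# Extended final-layer inputs v_hat_i = (1, v_i) and noisy targets
V = rng.standard_normal((N, n_prev))
V_hat = np.hstack([np.ones((N, 1)), V])
W_true = rng.standard_normal((n_out, n_prev + 1))
Y = V_hat @ W_true.T + 0.1 * rng.standard_normal((N, n_out))

# Eq. 3.7: the optimal final layer from the autocorrelation matrix A
A = V_hat.T @ V_hat / N
B = Y.T @ V_hat / N
W_L = B @ np.linalg.inv(A)

# Residuals at this optimum (corollary 1: zero mean, uncorrelated with v_hat)
E = V_hat @ W_L.T - Y
```

Because these are the normal equations of a linear least-squares problem, both corollary conditions hold here up to floating-point accuracy.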
Let us add one further observation concerning the above result. Because $\dim(\mathrm{span}(\hat{v}_i)) = 1$, each matrix $\hat{v}_i \hat{v}_i^T$ has only one nonzero eigenvalue $\lambda = \hat{v}_i^T \hat{v}_i$ with the corresponding eigenvector $\hat{v}_i$. This means that $1 \le \dim(\mathrm{span}(A)) \le \min(N, n_{L-1} + 1)$. Hence, if $N < n_{L-1} + 1$, then $A$ must be singular and $W^{L*}$ cannot be unique. On the other hand, if $N > n_{L-1} + 1$, then all the vectors $\hat{v}_i$ cannot be linearly independent. This shows that there exists a relation between the amount of learning data and an appropriate size of the last hidden layer for the LMS learning problem.

3.4 Some Consequences. We conclude this section by giving a list of observations and comments based on the previous results.
3.4.1 Final Layer with or Without Activation. If the MLP also contains the final-layer activation $\mathcal{F}^L(W^L \hat{o}_i^{(L-1)})$, then it follows from lemma 3 (choose $\bar{W} = I$) that equation 3.3, with the definition of $\delta_i^L$ in equation 3.4 replaced by

$$\delta_i^L = \mathrm{Diag}\{(\mathcal{F}^L)'(W^L \hat{o}_i^{(L-1)})\}\, e_i = D_i\, e_i, \tag{3.10}$$

gives the corresponding sensitivity with respect to $W^L$. In this case, we have in corollary 1, instead of equation 3.6,

$$\frac{1}{N} \sum_{i=1}^N \begin{bmatrix} (D_i e_i^*)^T \\ o_i^{(L-1)} (D_i e_i^*)^T \end{bmatrix} = \begin{bmatrix} O \\ O \end{bmatrix}.$$

Hence, the two formulations, with and without activation of the final layer, yield locally the same result for the zero-residual problem $e_i^* = 0$ for all $1 \le i \le N$. This slightly generalizes the corresponding result derived by Moody and Antsaklis (1996), where bijectivity of the final activation $\mathcal{F}^L$ was also assumed.

The use of 0-1 coding for the desired output vectors in classification forces the weight vector of a 1-neuron into the so-called saturation area of a sigmoidal activation, where the derivative is nearly zero (Vitela & Reifman, 1997). For this reason, the error function with final-layer activation has a nearly flat region around such points, whereas the final linear layer has no such problems. This is also evident from equation 3.10, which holds independently of $e_i^*$ and $o_i^{(L-1)}$ for $D_i = O$. Indeed, the tests reported by Gorse, Shepherd, and Taylor (1997) (see also figure 8.7 in Reed & Marks, 1999; de Villiers & Barnard, 1992) suggest that a network with a sigmoid in the final layer has more local minima than one using a linear final layer. Furthermore, Japkowicz, Hanson, and Gluck (2000) pointed out that the reconstruction error surface of a sigmoidal autoassociator consists of multiple local valleys, in contrast to the linear one.
3.4.2 The Basic Interpretation. We recall that the transformation before the last linear layer with biases had no influence whatsoever on the previous formal consequences. This suggests the following interpretation of the MLP action with the proposed architecture: the nonlinear hidden layers produce the universal approximation capability of the MLP, while the final linear layer compensates the hidden action with the desired output in an uncorrelated and error-averaging manner (cf. Japkowicz et al., 2000).

3.4.3 The Basic Consequence. The simplest model of noise in regression is to assume that the given targets are generated by

$$y_i = w(x_i) + \varepsilon_i, \tag{3.11}$$

where $w(x)$ is the unknown stationary function and the $\varepsilon_i$'s are sampled from an underlying noise process. For equation 3.11, it follows from corollary 1i that a locally optimal MLP network $\mathcal{N}(\{W^{l*}\})$ satisfies

$$\frac{1}{N} \sum_{i=1}^N \left[ \mathcal{N}(\{W^{l*}\})(x_i) - w(x_i) \right] = \frac{1}{N} \sum_{i=1}^N \varepsilon_i. \tag{3.12}$$
Hence, this shows that every $\mathcal{N}(\{W^{l*}\})$ treats optimally gaussian noise with zero mean (given enough samples). This result is not valid for other error functions or with final-layer activation (except in the impractical zero-residual case), but it remains valid for networks with input enlargements (e.g., Flake, 1998), output enlargements (Caruana, 1998), hidden-layer modifications (e.g., the linearly augmented feedforward network as proposed by van der Smagt & Hirzinger, 1998), and convex combinations of different locally optimal networks (averaged ensembles; e.g., Liu & Yao, 1999; Horn, Naftaly, & Intrator, 1998) in all cases where the final bias is left untouched in the architecture.

3.4.4 Implicit Prior and Its Modification. Many authors writing on ANNs note that changing the prior frequency of different samples in the training data to favor the rare ones may improve the performance of the obtained network (e.g., LeCun et al., 1998; Yaeger, Webb, & Lyon, 1998). Corollary 1i gives a precise explanation of how such a modification alters the final result. For stochastic on-line learning, the prior frequency can be altered by controlling the feeding of examples, but for batch learning, one needs an explicit change in the error function for this purpose. For example, consider the classification problem with learning data from $K$ different classes $\{C_k\}_{k=1}^K$, so that $N = \sum_{k=1}^K N_k$, where $N_k$ denotes the number of samples from the class $C_k$. To have an equal prior probability $\frac{1}{K}$ for all classes in the classifier (instead of $N_k/N$ for the $k$th class), one needs to adjust the LMS cost functional to

$$J(\{W^l\}_{l=1}^L) = \sum_{k=1}^K \frac{1}{2 K N_k} \sum_{i \in C_k} \|W^L \hat{o}_i^{(L-1)} - y_i\|^2.$$
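The class-balanced weighting above can be sketched as follows (the function name and the use of precomputed squared residuals are our own simplification): each class $C_k$ contributes with the factor $1/(2KN_k)$, so that the cost no longer depends on the class frequencies.

```python
import numpy as np

def weighted_lms(sq_residuals, labels, K):
    """Class-balanced LMS cost: class C_k is weighted by 1/(2 K N_k), giving
    every class the prior 1/K regardless of its sample count N_k."""
    J = 0.0
    for k in range(K):
        r_k = sq_residuals[labels == k]   # ||W^L o_hat_i - y_i||^2 for i in C_k
        if r_k.size:
            J += r_k.sum() / (2.0 * K * r_k.size)
    return J
```

A quick sanity check of the design: if every sample has the same squared residual, the weighted cost is the same for a 90/10 split as for a 50/50 split, which is exactly the intended removal of the implicit prior.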
Hence, using nonconstant weighting of the learning data in the cost functional incorporates prior information into the learning problem. For example, a time-series prediction could be based on larger weighting of the more recent samples, especially if one tries to simulate a slowly varying dynamical process (Park, El-Sharkawi, & Marks, 1991). It is also straightforward to improve the LMS learning problem when more knowledge on the variance of the output data is available (MacKay, 1992; van de Laar & Heskes, 1999). Finally, a locally weighted linear regression also adaptable for MLP learning is introduced by Schaal and Atkeson (1998).
3.4.5 Relation to Early Stopping. A popular way to try to improve the generalization power of an MLP is to use early stopping (cutting, termination) of the learning algorithm (Bishop, 1995; Tetko & Villa, 1997; Prechelt, 1998). In most cases, early stopping (ES) is due to a cross-validation technique, but basically it also happens every time a learning problem is solved inexactly, for example, because of a fixed number of epochs or a bad learning rate. Formulating general and robust conditions for stopping different optimization algorithms prematurely is problematic and usually leads to a very large number of tests for validating different networks (Gupta & Lam, 1998). Moreover, Cataltepe, Abu-Mostafa, and Magdon-Ismail (1999) showed that if models with the same training error are chosen with equal probability, then the lowest generalization error is obtained by choosing the model corresponding to the training error minimum. This result is valid globally for linear models and locally, around the training error minimum, for nonlinear models.

The success of ES is sometimes attributed to producing smoother results when the network is initialized with small random numbers in the almost linear region of a sigmoidal activation function. This is certainly vague, because different optimization methods follow different search paths, and nothing guarantees that even an overly hastily stopped training algorithm produces smooth networks. Hence, we believe that the role of ES is more related to equation 3.12, which is not valid for an inexact solution. Namely, for a nongaussian error distribution, ES may decrease the significance of heavy error tails and, due to this fact, produce a result that generalizes better. Especially in this case, ES and weight decay (WD) are not alternatives to each other, because ES may improve the learning problem and WD can favor simpler models during learning.
Notice, however, that because the usual bias-variance decomposition (Geman, Bienenstock, & Doursat, 1992) of the expected generalization error is also based on the least-squares estimation, the least-squares-based cross-validation technique may not be appropriate for a nongaussian error in the validation set. Changing the underlying gaussian assumption on noise should instead lead to the derivation of new cost functionals for the learning problem (Chen & Jain, 1994; Liano, 1996).
4 Numerical Results
Here, we describe numerical experiments based on the proposed techniques. All experiments are performed on an HP9000/J280 workstation (180 MHz PA8000 CPU), and the implementation is based on F77 (optimization and MLP realization) and Matlab for data pre- and postprocessing. (More comprehensive coverage of the computed examples is presented in Kärkkäinen, 2000.) As the optimization software, we apply the limited-memory quasi-Newton subroutine L-BFGS (Byrd, Lu, & Nocedal, 1995), which uses a sparse approximation of the BFGS-formula-based inverse of the Hessian matrix and is intended for solving large nonlinear optimization problems efficiently. As a stopping criterion for the optimization, we use
$$\frac{J^k - J^{k+1}}{\max\{|J^k|,\, |J^{k+1}|,\, 1\}} \le 10^6 \cdot \text{epsmch},$$
where epsmch is the machine epsilon ($\approx 10^{-16}$ in the present case). This choice reflects our intention to solve the optimization problem with (unnecessarily) high precision for test purposes. The main ingredient in the L-BFGS software is that, due to the limited-memory Hessian update, the computational complexity is only $O(n)$, where $n$ is the total number of unknowns. For ordinary quasi-Newton methods, the $O(n^2)$ consumption due to the full Hessian has been one of the main reasons preventing the application of these methods to learning problems with larger networks.

There exists a large variety of different tests comparing backpropagation (gradient descent with constant learning rate), conjugate gradient, and second-order methods (Gauss-Newton, Levenberg-Marquardt, Hessian approximation, and quasi-Newton) for MLP training (e.g., Bishop, 1995; Hagan & Menhaj, 1994; Magoulas, Vrahatis, & Androulakis, 1999; McKeown, Stella, & Hall, 1997; Wang & Lin, 1998). The difficulty of drawing reliable conclusions about the quality of different methods is that in addition to how to solve it (training method), it is as (or even more) important to consider what to solve (learning problem) to connect our approach to an application represented by a finite set of samples. According to Gorse et al. (1997), a quasi-Newton method can survey a larger number of local minima than the BP and CG methods, but to truly enforce search through different minima when using a local gradient-based method, one must either start the training using multiple initial configurations or apply some global optimization strategy as an outer iteration.

In the numerical experiments, we consider only simple examples, because the emphasis here is on testing the algorithms and different formulations rather than on complex applications of MLPs. Therefore, we also restrict ourselves to the perceptron with only one hidden layer. Concerning the discussion of one versus more hidden layers, we refer to Reed and Marks (1999) and Tamura and Tateishi (1997). Finally, notice that more complex (nonconvex) optimization problems with an increased number of local minima must be solved when training a network with several hidden layers.

In order to generate less regular nonlinear transformations without affecting the scale of weights when the hidden layer is enlarged, we choose k-tanh functions of the form

$$t_k(a) = \frac{2}{1 + \exp(-2ka)} - 1, \qquad k = 1, \dots, n_1, \tag{4.1}$$

to activate the hidden neurons (McLean, Bandar, & O'Shea, 1998). Although we thereby introduce some kind of ordering for the hidden layer by using a different activation function for each neuron, the symmetry problem related to the hidden neurons (e.g., Bishop, 1995) remains, as can be seen by a simple rescaling argument. The use of an even more general mixture of different activation functions in the hidden layer is suggested by Zhang and Morris (1998). Finally, increasing the nonsmoothness of the hidden activation functions, and hence of the whole MLP mapping, decreases the regularity of the learning problem and thus also the convergence rate of first- and second-order optimization methods (Nocedal & Wright, 1999).

4.1 Approximation of Noisy Function.
Example 1. Reconstruction of the function $f(x) = \sin(2\pi x)$, $x \in I = [0, 2\pi]$, which is corrupted with normally distributed (quasi-)random noise. The input data $\{x_i\}$ result from the uniform discretization of the interval $I$ with the step size $h = 0.1$. The points in the output data are taken as $y_i = f(x_i) + \delta e_i$ for $\delta = 0.3$ and $e_i \in N(0, 1)$. Altogether, we have in this example $N = 63$ and $n_2 = n_0 = 1$.
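A minimal sketch of the k-tanh activations of equation 4.1 and of the data of example 1 (the random seed and all names are our own choices; note that $t_1$ coincides with the ordinary hyperbolic tangent):

```python
import numpy as np

def t_k(a, k):
    """k-tanh activation, eq. 4.1; k = 1 gives the ordinary tanh, and larger k
    steepens the transition, shrinking the nearly linear region around zero."""
    return 2.0 / (1.0 + np.exp(-2.0 * k * a)) - 1.0

# Example 1 data: uniform grid on I = [0, 2*pi] with step h = 0.1 (N = 63)
rng = np.random.default_rng(42)
h, delta = 0.1, 0.3
x = np.arange(0.0, 2.0 * np.pi, h)
y = np.sin(2.0 * np.pi * x) + delta * rng.standard_normal(x.size)
```

Using a different $t_k$ per hidden neuron is what introduces the ordering of the hidden layer discussed above.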
In the following experiments, we have solved the optimization problem, equation 2.3, with the learning data prescaled into $[-1, 1]$, starting from 10 random initial guesses from the range $(-1, 1)$ for the weight matrices $(W^1, W^2)$. For an overview and study of different initialization techniques, we refer to Thimm and Fiesler (1997) and LeCun et al. (1998). Let us make some comments based on Table 1:

Local minima: Even for the smallest network ($n_1 = 2$), and especially for the larger ones, there exist a lot of local minima in the optimization problem. Moreover, the local minima are strict in the sense that they correspond to truly different values of the cost functional and not just different representations (symmetries) of the same MLP transformation.

Condition i in Corollary 1: Is valid with a precision related to the stopping criterion.
Table 1: Computational Results in Example 1 Without Regularization.

 n1 | J* (Min / Max / Mean)     | |e*| (Min / Max / Mean) | Its (Min / Max / Mean) | CPU (Min / Max / Mean)
  2 | 0.014 (2) / 0.032 / 0.018 | 2e-8 / 4e-6 / 2e-6      | 41 / 234 / 126         | 0.23 / 0.52 / 0.39
  3 | 0.013 / 0.024 / 0.014     | 4e-7 / 2e-5 / 4e-6      | 120 / 1079 / 427       | 0.42 / 0.80 / 0.60
  4 | 0.012 (2) / 0.014 / 0.013 | 1e-6 / 8e-6 / 4e-6      | 153 / 341 / 250        | 0.47 / 0.60 / 0.54
  5 | 0.010 / 0.013 / 0.012     | 4e-7 / 2e-5 / 6e-6      | 237 / 804 / 430        | 0.55 / 0.76 / 0.64
  6 | 0.010 / 0.013 / 0.011     | 3e-7 / 3e-5 / 8e-6      | 183 / 737 / 466        | 0.51 / 0.75 / 0.66
  7 | 0.0086 / 0.013 / 0.010    | 2e-7 / 2e-5 / 7e-6      | 440 / 1541 / 927       | 0.67 / 0.95 / 0.80

Notes: The minimum, maximum, and mean values correspond to the 10 solutions of the optimization problem for the following quantities: J* is the final value of the cost functional. If the minimum or maximum value (with tolerance e = 10^-6) is scored more than once, the number of instances is included in parentheses. |e*| denotes the absolute value of the average output error over the learning data as stated in corollary 1i. Its = total number of function/gradient evaluations (always computed together). CPU = time in seconds for solving the optimization problem.
Efficiency: The CPU time is negligible in this small example, but the number of iterations in the optimization varies a lot.

Generalization: Visually (Kärkkäinen, 2000), the best result is obtained using the MLP corresponding to the minimal value of $J^*$ for $n_1 = 2$. However, the MLP corresponding to the maximal value of $J^*$ for $n_1 = 7$ also gives a good result. This illustrates the difficulty of naming the optimal network even in a simulated example. Moreover, from the large variation of the number of iterations, we conclude that when a fixed number of iterations is taken in the learning algorithm, one has no knowledge of the error between the obtained weights and the true (local) solution of the optimization problem.

4.1.1 Regularization of MLP Using Weight Decay. Weight decay (WD) is a popular technique for pruning an MLP (Goutte & Hansen, 1997; Gupta & Lam, 1998; Hintz-Madsen, Hansen, Larsen, Pedersen, & Larsen, 1998; Ormoneit, 1999; Reed & Marks, 1999). In WD, the basic LMS error function is augmented with a WD term $R(\{W_{ij}^l\})$, which imposes some restriction on the generality (universality) of the MLP transform to prevent overlearning. Here we start with the simplest possible form, strictly convex with respect to the unknown weights, by considering initially $R(\{W_{ij}^l\}) = \frac{\beta}{2} \sum_{l,i,j} (W_{ij}^l)^2$, where $\beta$ is the WD parameter. The choice of having only a single coefficient makes sense, because the network inputs and the outputs of the hidden layer are enforced to lie in the same range. Moreover, for gaussian noise with a known estimate of the variance, $\beta$ is actually the inverse of the Lagrange multiplier for the corresponding equality or inequality constraint and therefore single-valued (Chambolle & Lions, 1997). In general, the "best" value of $\beta$ is related to both the complexity of the MLP transformation and the (usually unknown) amount of noise contained in the learning data (Scherzer, Engl, & Kunisch, 1993). There exist, however, various techniques for obtaining an effective choice of $\beta$ (e.g., Bishop, 1995; Rojas, 1996; Rögnvaldsson, 1998).
In addition to strict convexity, some particular reasons for choosing the proposed form of WD with the quadratic penalty function $p(w) = |w|^2$ are the following: This form improves the convexity of the cost functional and therefore makes the learning problem easier to solve. The proposed form forces the weights into a neighborhood of zero (similarly to prior distributions with zero mean, as suggested by Neal, 1996), thus further balancing their scale in a gradient-based optimization algorithm (Cavalieri & Mirabella, 1999). The smoothing property is due to the fact that the activation functions $t_k(a)$ are nearly linear around zero, although the size of the linear region decreases as $k$ increases. Furthermore, the first derivatives of the activation functions are most informative around zero, so that the proposed form of WD is
also helpful to prevent saturation of weights by enforcing them in the neighborhood of this transient region (Kwon & Cheng, 1996; Vitela & Reifman, 1997). One drawback of quadratic WD is that it produces weight matrices with groups of small components even if a choice of one large weight instead could be sufcient. This can yield unnecessarily large networks, even if the overlearning can be prevented. On the other hand, this property increases the fault tolerance of the trained network due to the so-called graceful degradation (Reed & Marks, 1999). Let us comment on some of the difculties of other forms of WD for individual weights suggested in the literature (Goutte & Hansen, 1997; Gupta & Lam, 1998; Saito & Nakano, 2000): l1 ¡formulation: Each weight w is regularized using a penalization p ( w ) D |w|. This function is convex but not strictly so. A severe difculty is that because the derivative of p (w ) is multivalued for w D 0, the resulting optimization problem is nonsmooth, that is, only subdifferentiable (Makel ¨ a¨ & Neittaanm a¨ ki, 1992). In general, function |w| p for 1 < p < 2 belongs only to the Holder ¨ space C1,p¡1 (Gilbarg & Trudinger, 1983), so that assumptions for convergence of gradient descent (on batch-mode Lipschitz continuity of gradient, for on-line stochastic iteration C2 continuity), CG (Lipschitz continuity of gradient), and especially quasi-Newton methods (C2 -continuity) are violated (Haykin, 1994; Nocedal & Wright, 1999). As documented, for example, for MLP by Saito and Nakano (2000) and for image restoration by K¨arkk¨ainen, Majava, and M¨akel¨a (2001), this yields nonconvergence of ordinary training algorithms, when the cost functional does not fulll the required smoothness assumptions. 
Furthermore, even if a smoothed counterpart $\sqrt{w^2 + \epsilon}$ for $\epsilon > 0$ is introduced, this formulation is either (for small $\epsilon$) too close to the nonsmooth case, so that convergence again fails, or (for larger $\epsilon$) substantially different from the original one. To conclude, such nonsmooth optimization problems require special algorithms (Kärkkäinen & Majava, 2000a, 2000b; Kärkkäinen et al., 2001). Finally, these same difficulties also concern the so-called robust backpropagation, where the error function is defined as $\sum_i \|\mathcal{N}(\{W^l\})(\mathbf{x}^i) - \mathbf{y}^i\|_{l_1}$ (Kosko, 1992). Mixed WD: Each weight is regularized using the mixed penalization $p(w) = w^2 / (1 + w^2)$. This form is nonconvex, producing even more local minima in the optimization problem, and it favors both small and large weights, thus destroying the balance in the scales of different weights.
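The smoothness properties discussed above can be made concrete in a few lines of code. The following sketch (illustrative, not from the article) implements the quadratic, $l_1$, smoothed $l_1$, and mixed penalties and checks, by finite differences, that the second derivative of the mixed penalty changes sign:

```python
import numpy as np

def p_quadratic(w):
    # Quadratic WD: smooth and strictly convex.
    return w ** 2

def p_l1(w):
    # l1 penalty: convex but nonsmooth at w = 0,
    # where the derivative sign(w) is multivalued.
    return np.abs(w)

def p_l1_smoothed(w, eps=1e-3):
    # Smoothed counterpart sqrt(w^2 + eps): for small eps it is
    # nearly as ill-conditioned as |w|; for larger eps it differs
    # substantially from the original penalty.
    return np.sqrt(w ** 2 + eps)

def p_mixed(w):
    # Mixed WD w^2 / (1 + w^2): bounded and nonconvex; it flattens
    # for large |w|, so both small and large weights are favored.
    return w ** 2 / (1.0 + w ** 2)

# Central-difference estimate of the second derivative.
h = 1e-4
def d2(p, w0):
    return (p(w0 + h) - 2.0 * p(w0) + p(w0 - h)) / h ** 2

# The mixed penalty is convex near the origin but concave farther out,
# which is one source of the extra local minima mentioned above.
print(d2(p_mixed, 0.0) > 0, d2(p_mixed, 2.0) < 0)  # True True
```

The sign change printed above is exactly the loss of convexity of the mixed form, and the multivalued derivative of $|w|$ at zero is what breaks the smoothness assumptions of the gradient-based and quasi-Newton methods cited in the text.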
1466
Tommi Kärkkäinen

The numerical results reported by Saito and Nakano (2000) emphasize the above difficulties. Whereas the combination of the quasi-Newton optimization algorithm and quadratic WD drastically improved both the convergence of the training algorithm and the generalization performance of the trained MLP, the $l_1$-penalized problem was not convergent, and the mixed form of WD was very unstable. However, corollary 1 and the role of the bias terms as a shift from the origin raise the question: Which components of the weight matrices $\{W^l\}_{l=1}^{L}$ should be regularized and which not? Certainly, if condition i in corollary 1 is required for the resulting MLP, one should exclude the bias terms of the final layer $W^L$ from WD. Similarly, for both conditions to hold, one should leave out all components of $W^L$ from WD. Hence, we apply the quadratic WD only to selected components of the weight matrices $(W^1, W^2)$. We distinguish four cases:

I. Regularize all components except the bias terms $W^2_0$ in the weight matrix $W^2$.

II. Exclude all components of $W^2$ from the regularization.

III. Exclude all bias terms of $(W^1, W^2)$ from the regularization (Holmström, Koistinen, Laaksonen, & Oja, 1997).

IV. Exclude all components of $W^2$ and the bias terms of $W^1$ from the regularization.

Remark. Let us state one further observation concerning the nonregularization of $W^2_0$. Using the error-average formula $\frac{1}{N}\sum_{i=1}^{N} \mathbf{e}_i^{*} = \mathbf{O}$ of corollary 1 and the expression for $\mathbf{e}_i^{*}$ according to equation 3.2 yields (cf. Bishop, 1995)
$$W^{2,*}_0 = -\frac{1}{N}\sum_{i=1}^{N}\left[W^{2,*}_1\, F(W^{1,*}\hat{\mathbf{x}}^i) - \mathbf{y}^i\right].$$
Hence, increased coercivity, and thereby uniqueness with regard to $(W^2_1, W^1)$, immediately affect the uniqueness of the bias $W^2_0$ as well.

The results corresponding to the above cases are presented in Tables 2 through 5. For all experiments, we have chosen $\beta = 10^{-3}$ according to some prior tests, which also indicated that the results obtained here were not very sensitive to the choice of $\beta$.

Table 2: Computational Results in Example 1 for Regularization I.

n1   Jb*: Min/Max/Mean            |eN*|: Min/Max/Mean   Its: Min/Max/Mean   CPU: Min/Max/Mean
2    0.022 (5)/0.025 (5)/0.023    3e-8/4e-6/1e-6        48/84/65            0.25/0.35/0.30
3    0.0195/0.0199 (2)/0.0197     3e-8/5e-6/2e-6        76/149/101          0.33/0.43/0.37
4    0.017/0.019/0.018            1e-7/4e-6/2e-6        88/208/146          0.38/0.51/0.44
5    0.017/0.018/0.017            1e-7/2e-5/7e-6        142/245/188         0.46/0.55/0.50
6    0.016/0.017 (2)/0.017        5e-7/1e-5/6e-6        145/382/244         0.47/0.63/0.55
7    0.016/0.017/0.016            2e-7/2e-5/5e-6        227/395/327         0.57/0.65/0.62

Note: Here and in the sequel, Jb* refers to the value of the cost functional containing the regularization term.

Table 3: Computational Results in Example 1 for Regularization II.

n1   Jb*: Min/Max/Mean        |eN*|: Min/Max/Mean   Its: Min/Max/Mean   CPU: Min/Max/Mean
2    0.015/0.016/0.015        2e-6/8e-5/2e-5        67/554/231          0.33/0.67/0.49
3    0.014/0.015/0.015        2e-7/1e-5/3e-6        131/1009/383        0.44/0.78/0.59
4    0.014/0.015/0.015        8e-8/2e-5/8e-6        257/963/414         0.48/0.79/0.61
5    0.014/0.015 (2)/0.014    2e-6/3e-5/1e-5        329/1283/832        0.61/0.87/0.75
6    0.013/0.014/0.014        3e-6/3e-5/1e-5        607/2402/1331       0.72/1.10/0.88
7    0.013/0.014/0.014        1e-7/1e-5/4e-6        1213/2472/1599      0.86/1.13/0.96

MLP with Applications to Weight Decay 1467

Table 4: Computational Results in Example 1 for Regularization III.

n1   Jb*: Min/Max/Mean            |eN*|: Min/Max/Mean   Its: Min/Max/Mean   CPU: Min/Max/Mean
2    0.022 (5)/0.025 (5)/0.023    1e-7/3e-6/1e-6        49/93/71            0.25/0.36/0.30
3    0.019 (4)/0.025/0.019        3e-8/7e-6/2e-6        73/173/109          0.33/0.48/0.39
4    0.017 (2)/0.019/0.018        3e-7/6e-6/2e-6        87/172/121          0.36/0.48/0.41
5    0.017/0.019/0.017            3e-7/1e-5/4e-6        80/205/141          0.36/0.51/0.45
6    0.016/0.017/0.017            3e-7/7e-6/3e-6        104/199/153         0.40/0.51/0.47
7    0.016/0.018/0.016            2e-6/3e-5/9e-6        118/537/269         0.44/0.70/0.56

Table 5: Computational Results in Example 1 for Regularization IV.

n1   Jb*: Min/Max/Mean        |eN*|: Min/Max/Mean   Its: Min/Max/Mean   CPU: Min/Max/Mean
2    0.015/0.016/0.015        4e-7/2e-5/1e-5        73/913/235          0.31/0.75/0.47
3    0.014/0.016/0.015        9e-7/5e-5/1e-5        110/953/352         0.39/0.77/0.56
4    0.014/0.015/0.015        2e-7/3e-5/9e-6        151/744/344         0.47/0.74/0.58
5    0.014/0.015 (2)/0.014    9e-8/1e-5/5e-6        244/2436/1176       0.55/1.09/0.82
6    0.013/0.014/0.014        2e-7/2e-5/7e-6        361/2135/1184       0.62/1.05/0.84
7    0.012/0.014/0.014        1e-8/2e-5/7e-6        321/3119/1368       0.61/1.19/0.89

Let us state some observations based on Tables 1 through 5:

Local minima: For regularization methods I and III, the minimal cost function values are scored more than once when $n_1$ is small. However, there still exist a lot of local minima for larger networks.

Condition i in corollary 1: Valid for all regularization methods, with a precision related to the stopping criterion.

Efficiency: In terms of the number of iterations and the CPU time, regularization methods I and III improved the performance, whereas methods II and IV made it worse compared to the unregularized approach.

Generalization: All regularization approaches improve the generalization compared to the unregularized problem by preventing oscillation in the final mapping generated by the MLP. When $n_1$ is increased, regularization methods I and III seem to be more stable and thus more robust than methods II and IV.

Conclusion: From the four regularization methods tested, I and III are
preferable to II and IV in every respect. One cannot distinguish between I and III, and we note that centering the input data already decreases the significance of the hidden bias.

4.2 Classification.
Example 2. We consider the well-known benchmark of classifying Iris flowers (cf. Gupta & Lam, 1998) according to measurements obtained from the UCI repository (Blake & Merz, 1998). In this example, we have $n_2 = 3$ (number of classes), $n_0 = 4$ (number of features), and initially 50 samples from each of the three classes. Due to the choice of the k-tanh activation functions, prescaling of the output data $\{\mathbf{y}^i\}$ into the range $[-1, 1]$ destroys the linear independence between different classes.
In Tables 6 through 11, we have solved the classification (optimization) problem again 10 times, using 40 random samples from each class as the learning set and the remaining 10 samples from each class as the test set. These two data sets have been formed using five different random permutations of the initial data, realizing a simple cross-validation technique. Hence, we have $N = 120$ and the remaining 30 samples in the test set for each permutation. In Tables 9 through 11, we have added noise to the inputs in the permuted learning sets. First, the means of each of the four features within the three classes have been computed. After that, the class mean vectors have been multiplied componentwise by a noise vector $\delta\, e^i$ for $\delta = 0.3$ and $e^i \in N(0, 1)$, and this has been added to the unscaled learning data.

Table 6: Computational Results in Example 2 Without Regularization for n1 = 9.

P      CL: Min/Max/Mean   C: CL/CT   Its: Min/Max/Mean   CPU: Min/Max/Mean
1      0/0/0              0/2        109/534/239         0.37/0.60/0.47
2      0/0/0              0/3        81/664/290          0.30/0.65/0.49
3      0/1/1              0/1        106/1243/492        0.35/0.93/0.57
4      0/0/0              0/2        129/891/349         0.36/0.75/0.51
5      0/1/1              0/0        92/702/286          0.31/0.67/0.48
Mean   0/0.4/0.4          0/1.6      103/807/331         0.34/0.72/0.50

Notes: The first column, P, contains the permutation index. In the second column, CL, minimum, maximum, and (rounded) mean values for the number of false classifications in the learning set are given. The third column, C, includes the numbers of false classifications in the learning and test sets (denoted $C_L$ and $C_T$) for the optimal perceptron $\mathcal{N}_P^{*}$ satisfying $C_L(\mathcal{N}_P^{*}) + C_T(\mathcal{N}_P^{*}) \le C_L(\mathcal{N}_P) + C_T(\mathcal{N}_P)$ over all networks encountered during the 10 optimization runs for the current permutation P.

Table 7: Computational Results in Example 2 for Regularization I with β = 10^-3 and n1 = 9.

P      CL: Min/Max/Mean   C: CL/CT   Its: Min/Max/Mean   CPU: Min/Max/Mean
1      0/0/0              0/2        517/1280/843        0.60/0.93/0.76
2      0/0/0              0/3        443/814/551         0.57/0.72/0.63
3      0/0/0              0/0        742/1797/1047       0.68/1.04/0.82
4      0/0/0              0/2        426/1451/792        0.56/0.91/0.71
5      0/1/0              0/0        646/1604/884        0.69/1.00/0.76
Mean   0/0.2/0            0/1.4      555/1389/823        0.62/0.92/0.74

Table 8: Computational Results in Example 2 for Regularization III with β = 10^-3 and n1 = 9.

P      CL: Min/Max/Mean   C: CL/CT   Its: Min/Max/Mean   CPU: Min/Max/Mean
1      0/0/0              0/2        369/939/543         0.56/0.77/0.61
2      0/0/0              0/3        484/923/602         0.58/0.80/0.64
3      0/2/1              0/0        418/868/677         0.56/0.78/0.67
4      0/0/0              0/3        382/643/507         0.54/0.69/0.60
5      0/3/1              0/0        413/973/619         0.56/0.82/0.64
Mean   0/1/0.4            0/1.6      413/869/590         0.56/0.77/0.63

Table 9: Computational Results in Example 2 Without Regularization for n1 = 9 and Noisy Learning Set.

P      CL: Min/Max/Mean   C: CL/CT   Its: Min/Max/Mean   CPU: Min/Max/Mean
1      0/1/0              0/2        275/3008/968        0.49/1.19/0.73
2      0/1/1              0/2        240/460/339         0.47/0.57/0.52
3      0/3/1              0/1        175/2680/948        0.41/1.16/0.73
4      0/1/1              0/2        146/1655/589        0.39/0.98/0.61
5      0/2/1              0/0        284/5464/927        0.50/1.42/0.65
Mean   0/1.6/0.8          0/1.4      224/2653/754        0.45/1.06/0.65

Table 10: Computational Results in Example 2 for Regularization I with β = 10^-3, n1 = 9, and Noisy Learning Set.

P      CL: Min/Max/Mean   C: CL/CT   Its: Min/Max/Mean   CPU: Min/Max/Mean
1      0/1/0              0/2        492/1196/724        0.59/0.88/0.68
2      0/0/0              0/3        486/1059/782        0.58/0.81/0.70
3      0/0/0              0/1        579/982/739         0.62/0.78/0.69
4      0/1/0              0/2        616/1120/862        0.64/0.84/0.74
5      0/1/1              0/0        464/1211/756        0.58/0.88/0.69
Mean   0/0.6/0.2          0/1.6      527/1114/773        0.60/0.84/0.70

Table 11: Computational Results in Example 2 for Regularization III with β = 10^-3, n1 = 9, and Noisy Learning Set.

P      CL: Min/Max/Mean   C: CL/CT   Its: Min/Max/Mean   CPU: Min/Max/Mean
1      0/1/0              0/2        434/1024/710        0.56/0.80/0.68
2      0/2/1              0/3        365/941/580         0.54/0.76/0.62
3      0/3/1              0/1        464/800/653         0.58/0.71/0.65
4      0/2/1              0/2        386/1007/725        0.54/0.83/0.69
5      0/1/1              0/0        458/1003/690        0.58/0.79/0.67
Mean   0/1.8/0.8          0/1.6      421/955/672         0.56/0.78/0.66

To this end, we derive some final observations based on the computational results, especially in example 2:

Local minima: In all numerical tests for unregularized and regularized learning problems, the L-BFGS optimization algorithm was convergent. This suggests that by using the proposed combination of linear final layer, prescaling, and mixed activation, we are able to deal with the flat error surfaces during the training. Furthermore, because usually in all 10 test runs of one learning problem a different value of the cost functional is obtained, the different results are revealing "true" local minima instead of just some symmetric representations of the same MLP transformation.

Efficiency: Comparing the number of iterations and the CPU time, the unregularized problem seems to be easier to solve than the regularized one (with the given $\beta$) for the clean learning data. On average, there is no real difference between the unregularized and regularized problems for the noisy learning data, but the variation in the number of iterations is significantly larger for the unregularized approach.

Comparison: There seems to be no significant difference between the quality of the perceptrons resulting from the unregularized and regularized problems, with or without additional noise. This suggests that the Iris learning data are quite stable and contain only a few cases with nongaussian degradations. Moreover, according to the observations in example 1, one reason for the quite similar behavior can be the size of $n_1$, which is not very large compared to $n_0$ and $n_2$. Finally, we cannot favor either of the two regularization methods I and III according to these results.

Generalization: The number of false classifications in the test set is between 0 and 3, that is, between 0% and 10%, in all test runs. A single preferable choice (an ensemble, of course, is another possibility) for an MLP classifier would probably be the one with the median number of false classifications in $C_T$, obtained by using permutation 1 or 4, even if for the fifth permutation an optimal classifier was found according to the given data. Notice that the amount of variation in $C_T$ over the different permutations provides useful information on the quality of the data.

Appendix A: Proofs of Lemmas

Proof of Lemma 1.
Functional $J(W)$ in componentwise form reads as
$$J(W) = \frac{1}{2}\sum_{i=1}^{m_2}\Big(\sum_{j=1}^{m_1} w_{ij}\, v_j - y_i\Big)^2,$$
where $i$ represents the row and $j$ the column index of $W$, respectively. A straightforward calculation shows that
$$\frac{\partial J}{\partial w_{ik}} = \Big(\sum_{j=1}^{m_1} w_{ij}\, v_j - y_i\Big) v_k \;\Rightarrow\; \frac{\partial J}{\partial \mathbf{w}_{i,:}} = [W\mathbf{v} - \mathbf{y}]_i\, \mathbf{v}^T \;\Rightarrow\; \frac{\partial J}{\partial W} = [W\mathbf{v} - \mathbf{y}]\, \mathbf{v}^T. \tag{A.1}$$
Here we have used Matlab-type abbreviations for the $i$th row vector $\mathbf{w}_{i,:}$ of matrix $W$.
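As a sanity check (not part of the original proof), the closed-form gradient of lemma 1 can be compared against central finite differences; this sketch uses small random data:

```python
import numpy as np

rng = np.random.default_rng(0)
m2, m1 = 3, 4                       # row and column dimensions of W
W = rng.standard_normal((m2, m1))
v = rng.standard_normal(m1)
y = rng.standard_normal(m2)

def J(W):
    # J(W) = (1/2) * || W v - y ||^2, as in the componentwise form above.
    return 0.5 * np.sum((W @ v - y) ** 2)

# Closed form from lemma 1 / equation A.1: dJ/dW = [W v - y] v^T.
grad = np.outer(W @ v - y, v)

# Componentwise central finite differences.
h = 1e-6
num = np.zeros_like(W)
for i in range(m2):
    for k in range(m1):
        E = np.zeros_like(W)
        E[i, k] = h
        num[i, k] = (J(W + E) - J(W - E)) / (2.0 * h)

print(np.max(np.abs(grad - num)))   # small: formula and finite differences agree
```

The outer product $[W\mathbf{v} - \mathbf{y}]\,\mathbf{v}^T$ matches the numerical gradient componentwise, as the derivation shows.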
Proof of Lemma 2. As above, consider the componentwise form of the functional
$$J(\mathbf{u}) = \frac{1}{2}\sum_{j=1}^{m_2}\Big(\sum_{i=1}^{m_1} w_{ji}\, f_i(u_i) - y_j\Big)^2 = \sum_{j=1}^{m_2} J_j(\mathbf{u}), \tag{A.2}$$
where $J_j(\mathbf{u}) = \frac{1}{2}\big(\sum_{i=1}^{m_1} w_{ji}\, f_i(u_i) - y_j\big)^2 = \frac{1}{2}[W F(\mathbf{u}) - \mathbf{y}]_j^2$. Then
$$\frac{\partial J_j}{\partial u_k} = \Big[\sum_{i=1}^{m_1} w_{ji}\, f_i(u_i) - y_j\Big] w_{jk}\, f_k'(u_k) = w_{jk}\, f_k'(u_k)\, [W F(\mathbf{u}) - \mathbf{y}]_j. \tag{A.3}$$
Remember here that $\mathbf{u}$ is a vector with $m_1$ components. As in lemma 1, we get the derivative of $J(\mathbf{u})$ with respect to $\mathbf{u}$ by treating $k$ as the row index and $j$ as the column index. Proceeding like this, we obtain from equation A.3
$$\nabla_{\mathbf{u}} J(\mathbf{u}) = \begin{pmatrix} w_{11} f_1'(u_1) & \cdots & w_{m_2 1} f_1'(u_1) \\ \vdots & \ddots & \vdots \\ w_{1 m_1} f_{m_1}'(u_{m_1}) & \cdots & w_{m_2 m_1} f_{m_1}'(u_{m_1}) \end{pmatrix} [W F(\mathbf{u}) - \mathbf{y}] = \big(W F'(\mathbf{u})\big)^T [W F(\mathbf{u}) - \mathbf{y}], \tag{A.4}$$
which is the desired result when $(W F'(\mathbf{u}))^T$ is replaced by $\mathrm{Diag}\{F'(\mathbf{u})\}\, W^T$.

Proof of Lemma 3. Here we introduce the basic Lagrangian technique to simplify the calculations. As the first step, we define the extra variable $\mathbf{u} = W\mathbf{v}$ and, instead of the original problem, consider the equivalent constrained optimization problem,
$$\min_{(\mathbf{u}, W)} \tilde{J}(\mathbf{u}, W) = \frac{1}{2}\|\bar{W} F(\mathbf{u}) - \mathbf{y}\|^2 \quad \text{subject to } \mathbf{u} = W\mathbf{v}, \tag{A.5}$$
where u and W are treated as independent variables linked together by the given constraint. The Lagrange functional associated with equation A.5 reads as
$$L(\mathbf{u}, W, \boldsymbol{\lambda}) = \tilde{J}(\mathbf{u}, W) + \boldsymbol{\lambda}^T(\mathbf{u} - W\mathbf{v}), \tag{A.6}$$
where $\boldsymbol{\lambda} \in \mathbb{R}^{m_1}$ contains the Lagrange multipliers for the constraint. Noticing that the values of the functionals $\tilde{J}(\mathbf{u}, W)$ in equation A.5 and $L(\mathbf{u}, W, \boldsymbol{\lambda})$ in equation A.6 coincide whenever the constraint $\mathbf{u} = W\mathbf{v}$ is satisfied, it follows that a solution of equation A.5 is equivalently characterized by the saddle-point conditions $\nabla L(\mathbf{u}, W, \boldsymbol{\lambda}) = 0$. Therefore, we compute the derivatives $\nabla_{\mathbf{u}} L$, $\nabla_W L$, and $\nabla_{\boldsymbol{\lambda}} L$ (in a suitable form) for the Lagrangian.
The gradient vector $\nabla_{\mathbf{u}} L(\mathbf{u}, W, \boldsymbol{\lambda})$ is, due to lemma 2, of the form
$$\nabla_{\mathbf{u}} L = \nabla_{\mathbf{u}} \tilde{J}(\mathbf{u}) + \boldsymbol{\lambda} = \mathrm{Diag}\{F'(\mathbf{u})\}\, \bar{W}^T [\bar{W} F(\mathbf{u}) - \mathbf{y}] + \boldsymbol{\lambda}. \tag{A.7}$$
The other derivatives of the Lagrangian are given by $\nabla_W L = -\boldsymbol{\lambda}\mathbf{v}^T$ and $\nabla_{\boldsymbol{\lambda}} L = \mathbf{u} - W\mathbf{v}$. Using the formula $-\boldsymbol{\lambda} = \mathrm{Diag}\{F'(\mathbf{u})\}\, \bar{W}^T [\bar{W} F(\mathbf{u}) - \mathbf{y}]$ due to equation A.7 and substituting it into $\nabla_W L$, we obtain
$$\nabla_W L = \mathrm{Diag}\{F'(\mathbf{u})\}\, \bar{W}^T [\bar{W} F(\mathbf{u}) - \mathbf{y}]\, \mathbf{v}^T. \tag{A.8}$$
Due to the equivalence of equations A.5 and A.6 with the original problem, the desired gradient matrix is given by equation A.8 when $W\mathbf{v}$ is substituted for $\mathbf{u}$.

Proof of Lemma 4. To simplify the calculations, we now introduce two extra variables, $\mathbf{u} = W\mathbf{v}$ and $\tilde{\mathbf{u}} = \tilde{W}\tilde{F}(\mathbf{u}) = \tilde{W}\tilde{F}(W\mathbf{v})$. As in the previous
Due to the equivalency of equations A.5 and A.6 with the original problem, the desired gradient matrix is given by equation A.8 when Wv is substituted for u . Proof of Lemma 4. To simplify the calculations, we now introduce two Q FQ ( u ) D W Q FQ ( Wv ) . As in the previous extra variables u D Wv and uQ D W
proof, we rst consider the constraint optimization problem: 1 N N min JQ( u , uQ , W ) D kW F ( uQ ) ¡ y k2 2 ( u, uQ , W )
Q FQ ( u ) . (A.9) s.t. u D W v , uQ D W
The Lagrange functional associated with equation A.9 reads as Q ) D JQ( u , uQ , W ) C ¸T ( u ¡ W v ) C ¸ Q T ( uQ ¡ W Q FQ ( u ) ). L ( u , uQ , W , ¸, ¸
(A.10)
Using similar techniques as in the previous proof, derivates for the saddleQ C Q FQ 0 ( u )]T ¸ point conditions of the Lagrangian are given by ru L D ¡[W 0( ) T N N ( ) T Q N N ¸, ruQ L D [W F uQ ] [W F uQ ¡ y ] C ¸, rW L D ¡¸ v , r¸ L D u ¡ Wv , Q FQ (u ) . From r( u , uQ ) L D O we obtain and r Q L D uQ ¡ W
¸
Q D ¡[W Q FQ 0 (u )]T ¸ Q FQ 0 (u ) ]T [W N FN 0 ( uQ ) ]T [W N FN ( uQ ) ¡ y ], ¸ D [W
(A.11)
which, substituted into rW L , yields
rW L
Q T DiagfFN 0 (uQ )g W N T [W N FN ( uQ ) ¡ y ] vT . D DiagfFQ ( u ) g W 0
(A.12)
Q FQ (u ) , This, together with the original expressions u D W v and uQ D W proves the result. Acknowledgments
I thank Pasi Koikkalainen for many enlightening discussions concerning ANNs. The comments and suggestions of the anonymous referee improved and clarified the presentation significantly, especially through the excellent reference Orr and Müller (1998).
References

Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: Theory and algorithms (2nd ed.). New York: Wiley.
Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. New York: Academic Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Available on-line: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Byrd, R. H., Lu, P., & Nocedal, J. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 5, 1190–1208.
Caruana, R. (1998). A dozen tricks with multitask learning. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Cataltepe, Z., Abu-Mostafa, Y. S., & Magdon-Ismail, M. (1999). No free lunch for early stopping. Neural Computation, 11, 995–1009.
Cavalieri, S., & Mirabella, O. (1999). A novel learning algorithm which improves the partial fault tolerance of multilayer neural networks. Neural Networks, 12, 91–106.
Chambolle, A., & Lions, P. L. (1997). Image recovery via total variation minimization and related problems. Numerische Mathematik, 76, 167–188.
Chen, D. S., & Jain, R. C. (1994). A robust back propagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5, 467–479.
de Villiers, J., & Barnard, E. (1992). Backpropagation neural networks with one and two hidden layers. IEEE Transactions on Neural Networks, 1, 136–141.
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks: Theory and applications. New York: Wiley.
Di Martino, M., Fanelli, S., & Protasi, M. (1996). Exploring and comparing the best "direct methods" for the efficient training of MLP-networks. IEEE Transactions on Neural Networks, 7, 1497–1502.
Flake, G. W. (1998). Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gilbarg, D., & Trudinger, N. S. (1983). Elliptic partial differential equations of second order. Berlin: Springer-Verlag.
Gorse, D., Shepherd, A. J., & Taylor, J. G. (1997). The new ERA in supervised learning. Neural Networks, 10, 343–352.
Goutte, C., & Hansen, L. K. (1997). Regularization with a pruning prior. Neural Networks, 10, 1053–1059.
Griewank, A. (2000). Evaluating derivatives: Principles and techniques of algorithmic differentiation. Philadelphia: SIAM.
Gupta, A., & Lam, S. M. (1998). Weight decay backpropagation for noisy data. Neural Networks, 11, 1527–1137.
Hagan, M. T., & Menhaj, M. B. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5, 989–993.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hintz-Madsen, M., Hansen, L. K., Larsen, J., Pedersen, M. W., & Larsen, M. (1998). Neural classifier construction using regularization, pruning and test error estimation. Neural Networks, 11, 1659–1670.
Holmström, L., Koistinen, P., Laaksonen, J., & Oja, E. (1997). Neural and statistical classifiers—taxonomy and two case studies. IEEE Transactions on Neural Networks, 8, 5–17.
Horn, D., Naftaly, U., & Intrator, N. (1998). Large ensemble averaging. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Japkowicz, N., Hanson, S. J., & Gluck, M. A. (2000). Nonlinear autoassociation is not equivalent to PCA. Neural Computation, 12, 531–545.
Kärkkäinen, T. (2000). MLP-network in a layer-wise form: Derivations, consequences, and applications to weight decay (Tech. Rep. No. C1). Jyväskylä, Finland: Department of Mathematical Information Technology, University of Jyväskylä.
Kärkkäinen, T., & Majava, K. (2000a). Nonmonotone and monotone active-set methods for image restoration, Part 1: Convergence analysis. Journal of Optimization Theory and Applications, 106, 61–80.
Kärkkäinen, T., & Majava, K. (2000b). Nonmonotone and monotone active-set methods for image restoration, Part 2: Numerical results. Journal of Optimization Theory and Applications, 106, 81–105.
Kärkkäinen, T., Majava, K., & Mäkelä, M. M. (2001). Comparison of formulations and solution methods for image restoration problems. Inverse Problems, 17, 1977–1995.
Kärkkäinen, T., & Toivanen, J. (2001). Building blocks for odd-even multigrid with applications to reduced systems. Journal of Computational and Applied Mathematics, 131, 15–33.
Kosko, B. (1992). Neural networks and fuzzy systems: A dynamical systems approach to machine intelligence. Englewood Cliffs, NJ: Prentice-Hall.
Kwon, T. M., & Cheng, H. (1996). Contrast enhancement for backpropagation. IEEE Transactions on Neural Networks, 7, 515–524.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Liano, K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7, 246–250.
Liu, Y., & Yao, X. (1999). Ensemble learning via negative correlation. Neural Networks, 12, 1399–1404.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Magoulas, G. D., Vrahatis, M. N., & Androulakis, G. S. (1999). Improving the convergence of the backpropagation algorithm using learning rate adaptation methods. Neural Computation, 11, 1769–1796.
Mäkelä, M. M., & Neittaanmäki, P. (1992). Nonsmooth optimization: Analysis and algorithms with applications to optimal control. Singapore: World Scientific.
McKeown, J. J., Stella, F., & Hall, G. (1997). Some numerical aspects of the training problem for feed-forward neural nets. Neural Networks, 10, 1455–1463.
McLean, D., Bandar, Z., & O'Shea, J. D. (1998). An empirical comparison of back propagation and the RDSE algorithm on continuously valued real world data. Neural Networks, 11, 1685–1694.
Moody, J. O., & Antsaklis, P. J. (1996). The dependence identification neural network construction algorithm. IEEE Transactions on Neural Networks, 7, 3–15.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Ormoneit, D. (1999). A regularization approach to continuous learning with an application to financial derivatives pricing. Neural Networks, 12, 1405–1412.
Orr, G. B., & Müller, K.-R. (Eds.). (1998). Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Osman, I. H., & Kelly, J. P. (Eds.). (1996). Meta-heuristics: Theory and applications. Norwell, MA: Kluwer.
Park, D. C., El-Sharkawi, M. A., & Marks II, R. J. (1991). An adaptively trained neural network. IEEE Transactions on Neural Networks, 2, 334–345.
Prechelt, L. (1998). Early stopping. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Reed, R. D., & Marks II, R. J. (1999). Neural smithing: Supervised learning in feedforward artificial neural networks. Cambridge, MA: MIT Press.
Rögnvaldsson, T. S. (1998). A simple trick for estimating the weight decay parameter. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Rojas, R. (1996). Neural networks: A systematic introduction. Berlin: Springer-Verlag.
Saito, K., & Nakano, R. (2000). Second-order learning algorithm with squared penalty term. Neural Computation, 12, 709–729.
Schaal, S., & Atkeson, C. G. (1998). Constructive incremental learning from only local information. Neural Computation, 10, 2047–2084.
Scherzer, O., Engl, H. W., & Kunisch, K. (1993). Optimal a posteriori parameter choice for Tikhonov regularization for solving nonlinear ill-posed problems. SIAM Journal on Numerical Analysis, 30, 1796–1838.
Tamura, S., & Tateishi, M. (1997). Capabilities of a four-layered feedforward neural network: Four layers versus three. IEEE Transactions on Neural Networks, 8, 251–255.
Tetko, I. V., & Villa, A. E. P. (1997). Efficient partition of learning data sets for neural network training. Neural Networks, 10, 1361–1374.
Thimm, G., & Fiesler, E. (1997). High-order and multilayer perceptron initialization. IEEE Transactions on Neural Networks, 8, 349–359.
van de Laar, P., & Heskes, T. (1999). Pruning using parameter and neuronal metrics. Neural Computation, 11, 977–993.
van der Smagt, P., & Hirzinger, G. (1998). Solving the ill-conditioning in neural network learning. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Vitela, J. E., & Reifman, J. (1997). Premature saturation in backpropagation networks: Mechanism and necessary conditions. Neural Networks, 10, 721–735.
Wang, G.-J., & Chen, C.-C. (1996). A fast multilayer neural-network training algorithm based on the layer-by-layer optimizing procedures. IEEE Transactions on Neural Networks, 7, 768–775.
Wang, Y.-J., & Lin, C.-T. (1998). A second-order learning algorithm for multilayer networks based on block Hessian matrix. Neural Networks, 11, 1607–1622.
Yaeger, L. S., Webb, B. J., & Lyon, R. F. (1998). Combining neural networks and context-driven search for on-line printed handwriting recognition in the Newton. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Zhang, J., & Morris, A. J. (1998). A sequential learning approach for single hidden layer neural networks. Neural Networks, 11, 65–80.

Received November 20, 2000; accepted October 12, 2001.
LETTER

Communicated by Dana Ron

Local Overfitting Control via Leverages

Gaëtan Monari
[email protected]
Ecole Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, Laboratoire d'Electronique, F 75005 Paris, France, and Usinor, DSI/DISA Sollac, F 13776 Fos-sur-Mer Cedex, France

Gérard Dreyfus
[email protected]
Ecole Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, Laboratoire d'Electronique, F 75005 Paris, France

We present a novel approach to dealing with overfitting in black box models. It is based on the leverages of the samples, that is, on the influence that each observation has on the parameters of the model. Since overfitting is the consequence of the model specializing on specific data points during training, we present a selection method for nonlinear models based on the estimation of leverages and confidence intervals. It allows both the selection among various models of equivalent complexities corresponding to different minima of the cost function (e.g., neural nets with the same number of hidden units) and the selection among models having different complexities (e.g., neural nets with different numbers of hidden units). A complete model selection methodology is derived.

1 Introduction
The traditional view of overfitting refers mostly to the bias-variance trade-off introduced in Geman, Bienenstock, and Doursat (1992). A family of parameterized functions with too few parameters, with respect to the complexity of a problem, is said to have too large a bias, because it cannot fit the deterministic model underlying the data. Conversely, when the model is overparameterized, the dependence of the resulting functions on the particular training set is too large, and so is the variance of the corresponding family of parameterized functions. Therefore, overfitting is usually detected by the fact that the modeling error on a test set is much larger than the modeling error on the training data. In practice, there are two major ways of preventing overfitting:

A priori, by limiting the variance of the considered family of parameterized functions. These regularization methods include weight decay

Neural Computation 14, 1481–1506 (2002) © 2002 Massachusetts Institute of Technology
1482
Gaëtan Monari and Gérard Dreyfus
(see MacKay, 1992, for a Bayesian approach to weight decay) and similar cost-function penalizations, as well as early stopping (Sjöberg & Ljung, 1992). None of them exempts the model designer from an additional estimation of the generalization performance of the selected model.

A posteriori, by estimating the generalization performance on data that have not been used to fit the model. This approach relies on data resampling and has given rise to the cross-validation methods (Stone, 1974), including leave-one-out. Statistical tests can be used after selecting candidate models to test whether differences in the estimated performances are significant (see, e.g., Anders & Korn, 1999).

The limitations of the above methods are well known: weight decay and similar methods require the estimation of meta-parameters, and resampling methods tend to be computationally intensive. In this article, we consider overfitting as a local phenomenon that occurs when the influence of one (or more) particular examples on the model becomes too large because of the excessive flexibility of the model. Therefore, we suggest a new approach to model selection that takes overfitting into account on the basis of the leverages of the available samples, that is, on the influence of each sample on the parameters of the model. In the next section, we briefly recall the mathematical framework of this approach. In sections 3 and 4, we show that this perspective on overfitting suggests a model selection method that takes into account both an estimation of the generalization error and the distribution of the influences of the training examples. Section 3 is devoted to model selection within a given family of parameterized functions, and section 4 focuses on the choice of the appropriate family among several candidate families. Section 5 illustrates the method with several examples.

2 Mathematical Framework
This article discusses static single-output processes with a nonrandom $n$-input vector $\mathbf{x} = [x_1, \ldots, x_n]^T$ and an output $y_p$ that is considered as a measurement of a random variable $Y_p$. We assume that an appropriate model can be written in the form
$$Y_p = r(\mathbf{x}) + W, \tag{2.1}$$
where $W$ is a zero-mean random variable and $r(\mathbf{x})$ is the unknown regression function. A data set of $N$ input-output pairs $\{\mathbf{x}^k, y_p^k\}_{k=1,\ldots,N}$ is assumed to be available for estimating the parameters of the model $f(\mathbf{x}, \boldsymbol{\theta})$. In the following, all vectors are column vectors, denoted by boldface letters (e.g., the $n$-vectors $\mathbf{x}$ and $\{\mathbf{x}^k\}$).
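For concreteness, the data-generating assumption in equation 2.1 can be sketched as follows; the particular regression function `r` below is an illustrative toy choice, not from the article:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 2, 50                        # input dimension and number of examples

def r(x):
    # A toy stand-in for the unknown regression function r(x).
    return np.sin(x[0]) + 0.5 * x[1] ** 2

X = rng.uniform(-1.0, 1.0, size=(N, n))   # the nonrandom inputs x^k (fixed once drawn)
noise = rng.normal(0.0, 0.1, size=N)      # the zero-mean random variable W
yp = np.array([r(x) for x in X]) + noise  # measurements y_p^k of Y_p (equation 2.1)
print(X.shape, yp.shape)
```

Any model $f(\mathbf{x}, \boldsymbol{\theta})$ fitted to `(X, yp)` then estimates $r$ from noisy measurements, which is the setting assumed throughout the section.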
In Monari and Dreyfus (2000), the effect of withdrawing an example from the training set was investigated. Assume that a model with parameters $\boldsymbol{\theta}_{LS}$ has been found by minimizing the least-squares cost function computed on the training set. Under the sole assumption that the removal of an example from the training set has a small effect on $\boldsymbol{\theta}_{LS}$ (in contrast to standard leave-one-out, no stability assumption is required in this approach, as discussed in section 3.2 and in appendix B), a second-order Taylor expansion of the model output with respect to the parameters was derived. It was shown that this generates an approximate model that is locally linear with respect to the parameters and whose variables are the components of the gradient of the model output with respect to the parameters. Introducing the Jacobian matrix
$$Z = [\mathbf{z}^1, \ldots, \mathbf{z}^N]^T, \quad \text{where } \mathbf{z}^i = \left.\frac{\partial f(\mathbf{x}^i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{LS}},$$
the solution subspace can be defined (in analogy with linear models) as the subspace determined by the columns of matrix Z (assuming that the latter has full rank). Then the quantity

h_ii = z^{iT} (Z^T Z)^{-1} z^i   (2.2)
is the ith component of the projection, on the solution subspace, of the unit vector along axis i. The quantities {h_ii}_{i=1,...,N}, termed leverages of the training examples, satisfy the following relations:

0 ≤ h_ii ≤ 1  for all i ∈ [1, . . . , N],   (2.3)

∑_{i=1}^N h_ii = q,   (2.4)
where q is the number of adjustable parameters of the model (e.g., the number of weights of a neural network or the number of monomials of a polynomial approximation). Equations 2.3 and 2.4 stem from the fact that the quantities {h_ii}_{i=1,...,N} are the diagonal terms of the orthogonal projection matrix Z(Z^T Z)^{-1} Z^T. If axis i is orthogonal to the solution subspace, all columns have their ith component equal to zero; hence z^i = 0 and h_ii = 0. Since the output of the model is the orthogonal projection of the process output onto the solution subspace, example i has essentially no influence on the model (it has none in the case of a linear model). If axis i lies within the solution subspace, then h_ii = 1 and example i is learned almost perfectly (it is learned perfectly in the case of a linear model). Thus, h_ii is an indication of the influence of example i on the model: the closer h_ii is to 1, the larger the influence of example i on the model. This will be further shown below. If all examples have the same influence on the model, all leverages {h_ii}_{i=1,...,N} equal q/N. In other words, each example uses up a fraction 1/N
Gaétan Monari and Gérard Dreyfus
of the q available degrees of freedom (adjustable parameters) of the model. If one considers that overfitting results from an excessive influence of one (or more) examples on a model, then model selection aims at obtaining the model that has the best performance and in which the influences of all examples are roughly equal. The influence of an example on the model should be reflected in the effect of its withdrawal from the training set. If an example has a large influence on the model (h_ii ≈ 1), it should be very accurately learned when it is present in the training set, but it should be badly predicted otherwise; conversely, if an example has no influence on the model (h_ii ≈ 0), it should be predicted with equal accuracy regardless of its presence or absence in the training set. Indeed, if we denote by R_i the residual of example i (i.e., the modeling error on example i when it is present in the training set: R_i = y_p^i − f(x^i, θ_LS)), an approximate expression of the prediction error R_i^{(−i)} that would occur if this example had been removed from the training set is given by

R_i^{(−i)} ≈ R_i / (1 − h_ii).   (2.5)
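Relations 2.2 through 2.5 can be illustrated numerically. The sketch below (assuming NumPy) uses a model that is linear in its parameters, so that the Jacobian Z is simply the design matrix; the leverages are obtained through a singular value decomposition of Z, in the spirit of appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 30, 4

# For a model linear in its parameters, the Jacobian Z of relation 2.2 is
# simply the matrix of inputs (one row z^i per example).
Z = rng.normal(size=(N, q))
y = Z @ rng.normal(size=q) + 0.1 * rng.normal(size=N)

# Leverages via SVD: with Z = U S V^T, the diagonal of the orthogonal
# projection matrix Z (Z^T Z)^{-1} Z^T equals the squared row norms of U.
U, _s, _Vt = np.linalg.svd(Z, full_matrices=False)
h = np.sum(U**2, axis=1)

# Basic properties of the leverages, relations 2.3 and 2.4.
assert np.all((h >= 0.0) & (h <= 1.0))
assert np.isclose(h.sum(), q)

# Residuals of the least-squares fit, and the virtual leave-one-out errors
# of relation 2.5 (exact here: for a linear model this is the classical
# PRESS identity).
theta_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)
R = y - Z @ theta_ls
R_loo = R / (1.0 - h)
```

For a nonlinear model, the same computation applies with Z built from the gradients of the model output at θ_LS, and relation 2.5 is then only an approximation.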
Details of the derivation of this relation are given in Monari and Dreyfus (2000). This approximation is valid insofar as the second-order Taylor development of the output is valid (i.e., third-order terms are negligible). Hence, the difference between the predictions of example i, depending on whether it belongs to the training set or not, is a function of its leverage h_ii.1 This property will be taken advantage of in the selection method presented in the next sections.

If the Jacobian matrix Z is not of rank q, that is, if the manifold of R^N defined by the columns of Z is, at θ_LS, not of full rank, the corresponding models must be discarded. For such models, the number of available parameters is locally too large in view of the number of training examples, which leads to an underdetermination of these parameters. This is an extreme case of overfitting that can be detected by computing the rank of Z directly (a difficult numerical task) or indirectly by checking the validity of relations 2.3 and 2.4. The latter solution supposes that we are able to compute the {h_ii}_{i=1,...,N} regardless of whether Z^T Z is invertible or not. To address this problem, a singular value decomposition (SVD; see, e.g., Press, Teukolsky, Flannery, & Vetterling, 1988) of matrix Z can be performed, as shown in appendix A. SVD is very accurate and can always be performed
1 Actually, when h_ii approaches one, the residual R_i vanishes less rapidly than (1 − h_ii). This can be understood as follows: if an example has a strong influence on the model, its withdrawal from the training set causes its residual to be significantly different. Therefore, from relation 2.5, the quantity R_i/(1 − h_ii) − R_i = R_i h_ii/(1 − h_ii) is not vanishingly small; hence, the estimate of the prediction error for this example, R_i^{(−i)} ≈ R_i/(1 − h_ii), does not vanish either.
even if Z is singular. In the latter case, however, the leverages are not computed accurately, hence should not be used (e.g., for computing the quantity E_p defined in the next section), because they do not comply with their basic properties 2.3 and 2.4. Moreover, some estimates of the confidence intervals on the model output (Seber & Wild, 1989) rely on the assumption that Z has full rank. Under this assumption, a confidence interval on the model output for input x can be easily computed as

E(Y_p | x) ∈ f(x, θ_LS) ± t_α^{N−q} s √(z^T (Z^T Z)^{-1} z),   (2.6)

where

z = ∂f(x, θ)/∂θ |_{θ=θ_LS},

t_α^{N−q} is the realization of a random variable with a Student's t-distribution with N − q degrees of freedom and a level of significance 1 − α, and s is an estimate of the residual standard deviation of the model:

s = √( (1/(N − q)) ∑_{i=1}^N R_i² ).

This is an additional motive for rejecting models with rank deficiency.

All the theoretical material presented in sections 3 and 4 will be illustrated graphically on a small problem with one input and one output. The training data were generated using the function y = sin(x)/x and an additive Gaussian noise of variance σ² = 10⁻². Fifty examples were drawn, with a uniform distribution of x within the interval [0, 15]. Throughout this article, this data set will be referred to as the sin(x)/x problem. The application of our method to the selection of more complex, multivariable models will be demonstrated in section 5.

3 Selection of a Model Among Models Having the Same Architecture
For clarity, we consider that once the number of inputs and outputs is chosen, model selection is performed in two steps:

1. An architecture is chosen (i.e., a family of functions having the same complexity) within which the model is sought (e.g., the family NN_3 of neural networks with five inputs, three hidden neurons, and a linear output neuron). If the model is nonlinear with respect to the parameters, several trainings with different parameter initializations are performed, thereby generating various models of the same architecture
(e.g., for neural nets, a set of K models with three hidden neurons {NN_3^k, k = 1 to K}). For this family of functions, the most appropriate model (e.g., model NN_3^opt in family NN_3) is selected, as described below.
2. The previous step is performed for different architectures (i.e., for different families of parameterized functions having different complexities, such as families NN_i of neural networks having the same number of inputs and outputs but different numbers i of hidden neurons). This results in a set of models (e.g., models {NN_i^opt, i = 0 to I}). The most appropriate model among these (e.g., model NN^opt) is selected as described in section 4.

In this section, we first propose a criterion for the selection of a model among models having the same complexity, and we subsequently compare our approach with the traditional use of the leave-one-out technique. The choice among models of different complexities is addressed in section 4.

3.1 A Selection Criterion. A preliminary screening of the models corresponding to the different minima must be performed by checking the rank of the corresponding Jacobian matrices. Models with rank deficiency are overfitted and should therefore be discarded.2 However, the fact that, for a given architecture, the global minimum has rank deficiency does not mean that the considered architecture is too large; local minima may provide very good models,3 provided they are not rank deficient. Therefore, an additional selection criterion must be found. To this end, we use the results presented in the previous section. In the spirit of leave-one-out cross-validation, we define the quantity E_p as
E_p = √( (1/N) ∑_{i=1}^N ( R_i / (1 − h_ii) )² ),   (3.1)
which is identical to the leave-one-out score except for the fact that the summation is performed on the estimated modeling errors given by relation 2.5, instead of being performed on the actual prediction error on each left-out example. This quantity can be compared to the traditional training mean

2 Moreover, a rank deficiency of Z has an adverse effect on the efficiency of some second-order training algorithms (for instance, the Levenberg-Marquardt algorithm). To overcome this, Zhou and Si (1998) suggested a training algorithm that guarantees that the Jacobian matrix is of full rank throughout training.
3 It is well known that a suboptimal model, whose parameters do not minimize the cost function, may have a smaller generalization error than the global minimum. This is the basis of the "early stopping" regularization method, whereby training is stopped before a minimum is reached.
Figure 1: Distribution of the solutions obtained with networks having four hidden neurons. Models lying outside the frame of the graph are models with rank deficiency, for which only TMSE can be computed. Note that such is the case for the model with the smallest value of the training cost function.
squared error (TMSE):

TMSE = √( (1/N) ∑_{i=1}^N R_i² ).   (3.2)
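Given the residuals and leverages of a trained model, the two scores can be computed side by side; a minimal sketch (assuming NumPy) that also checks the inequality between them and the uniform-leverage identity discussed in the text:

```python
import numpy as np

def tmse(R):
    """Square root of the training mean squared error, relation 3.2."""
    return np.sqrt(np.mean(R**2))

def ep(R, h):
    """Virtual leave-one-out score E_p, relation 3.1."""
    return np.sqrt(np.mean((R / (1.0 - h))**2))

N, q = 50, 5
rng = np.random.default_rng(1)
R = rng.normal(size=N)                 # residuals of some fitted model

# Leverages with the correct sum q (relation 2.4), deliberately uneven.
p = np.sin(np.arange(N))
h = q / N + 0.05 * (p - p.mean())
assert np.isclose(h.sum(), q) and np.all((h > 0) & (h < 1))

# E_p penalizes models strongly influenced by some examples: E_p >= TMSE.
assert ep(R, h) >= tmse(R)

# If all examples share the same leverage q/N, E_p = [N/(N - q)] * TMSE.
h_uniform = np.full(N, q / N)
assert np.isclose(ep(R, h_uniform), N / (N - q) * tmse(R))
```

The `h` used here is synthetic, for illustration only; in practice the leverages come from the Jacobian Z of the trained model, as in section 2.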
E_p is larger than TMSE; it penalizes models that are strongly influenced by some examples. Therefore, this quantity is an appropriate choice criterion between models having the same architecture but corresponding to various local minima of the cost function. Note that if all examples have the same leverage q/N, then E_p = [N/(N − q)] TMSE; hence, E_p and TMSE are equal in the large-sample limit for a model without overfitting.

To illustrate this, consider the case of a neural network with four sigmoidal hidden neurons and a linear output neuron, trained on the data set described in section 2. Starting from 500 different weight initializations and using the Levenberg-Marquardt algorithm leads to various minima, represented in Figure 1 as follows: the solutions without rank deficiency are plotted in the (TMSE, E_p) plane, using a logarithmic scale for the ordinates, and the solutions with rank deficiency (rank(Z) < q), for which E_p cannot be computed reliably (see section 2), are plotted outside the frame of the graph. The analysis of this plot leads to several comments:

Even for this simple architecture, a very large number of different minima of the cost function are found.
Figure 2: Outputs of models having four hidden neurons, selected on the basis of TMSE and Ep .
For this example, about 70% of the solutions have rank deficiency, so that they may be discarded without further consideration. This includes the deepest minimum found, shown as a gray dot in Figure 1, which is likely to be a global minimum.

For solutions without rank deficiency, the ratio of E_p to TMSE varies from almost 1 to very high values, hence the choice of a logarithmic y-scale. This shows that some minima correspond to solutions that are overinfluenced by a few training examples. As expected, this overfitting is not apparent on TMSE, but it is on E_p. The solution with the smallest E_p is shown as a gray triangle.

The model outputs corresponding to the minima having, respectively, the smallest TMSE and the smallest E_p are shown in Figure 2. Note that it is not claimed that the model with the smallest E_p is not overfitted. It is claimed that, among the various minima found with the weight initializations used, it is the model that, for the considered architecture, provides the best trade-off between accuracy and uniform influence of the training examples. This point will be addressed further in the next section.

Starting from a linear model and increasing gradually the number of hidden neurons, one obtains the results shown in Figure 3. It appears that E_p is essentially constant between two and four neurons. Therefore, an additional step is required to perform the final model selection, that is, to choose the appropriate number of hidden neurons; this is addressed in section 4. In this particular case, where the noise level is known, it may be inferred that models with more than three hidden neurons are probably overfitted, since their TMSE is smaller than the standard deviation of the noise.
Figure 3: Evolution of the optimal E_p and of the corresponding training mean square error as a function of the number of hidden neurons. σ is the standard deviation of the noise.
The question that arises naturally is whether E_p (or the standard leave-one-out score E_t as defined below) is a good estimate of the true generalization error of the model. Under the hypothesis of error stability introduced in their article, sanity-check bounds for the leave-one-out error have been derived by Kearns and Ron (1997). Obtaining narrower bounds would require some additional stability properties of the training algorithm or cost function.

3.2 Comparison with the Standard Leave-One-Out Procedure. Since our approach relies on the first principles of the leave-one-out procedure, a comparison to the original procedure is of interest. For models that are nonlinear with respect to their parameters, the most popular version of leave-one-out, called generalized leave-one-out (Moody, 1994), consists of first finding a minimum of the cost function, with weights W, through training with the whole training set. The N subsequent trainings with one left-out example start with the set of initial weights W. Then, using the resulting N prediction errors on left-out examples, an estimation of the generalization error of the model is computed as

E_t = √( (1/N) ∑_{i=1}^N ( R_i^{(−i)} )² ).   (3.3)
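The N-retraining structure of the generalized leave-one-out can be sketched as follows (assuming NumPy). For simplicity the model is linear in its parameters, so each "retraining" is done in closed form rather than by a gradient-based algorithm started from W; in this linear case relation 2.5 is exact, so E_t coincides with the leverage-based score E_p.

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 40, 3
Z = rng.normal(size=(N, q))                     # design matrix = Jacobian
y = Z @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# Training on the whole set (closed form here; for a neural model this
# would be a gradient-based training such as BFGS or Levenberg-Marquardt).
theta_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)
R = y - Z @ theta_ls

# Generalized leave-one-out: N retrainings, one example withdrawn each
# time, then the score E_t of relation 3.3 from the N prediction errors.
e = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    theta_i, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
    e[i] = y[i] - Z[i] @ theta_i
E_t = np.sqrt(np.mean(e**2))

# The leverage-based score avoids the N retrainings altogether.
U, _s, _Vt = np.linalg.svd(Z, full_matrices=False)
h = np.sum(U**2, axis=1)
E_p = np.sqrt(np.mean((R / (1.0 - h))**2))
assert np.isclose(E_t, E_p)                     # exact for a linear model
```

For a nonlinear model, E_t and E_p differ by the higher-order Taylor terms, and E_t is only meaningful when the minima are stable in the sense defined next.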
This method assumes that the withdrawal of one example from the training set does not result in a jump from one minimum of the cost function to another one. We formalize this as follows (see appendix B for more details): consider a training procedure using a deterministic minimization algorithm;4 assume that a minimum of the cost function has been found

4 This excludes such algorithms as simulated annealing and probabilistic hill climbing, which may overcome local minima.
Figure 4: Application of relation 2.5 to a neural model of sin(x)/x with two hidden neurons, whose output is shown in Figure 7.
with a vector of parameters θ_LS. Assume that example i is withdrawn from the training set and that a training is performed with θ_LS as initial values of the parameters, leading to a new parameter vector θ_LS^(−i). Further assume that example i is subsequently reintroduced into the training set and that a new training is performed with initial values of the parameters θ_LS^(−i). In the following, the minimum of the cost function corresponding to a parameter vector θ_LS will be said to be stable for the usual leave-one-out procedure if and only if, for each i ∈ [1, . . . , N], starting from θ_LS, the procedure described above retrieves θ_LS. It has been known since Breiman (1996) that this is a major stability problem. In practice, it turns out that if an overparameterized model is used, most minima of the cost function are unstable with respect to the leave-one-out procedure. Therefore, for all such minima, the computation of the leave-one-out score E_t and its comparison to E_p are meaningless.

For minima that are stable with respect to the leave-one-out procedure, the validity of approximating E_t by E_p depends on the validity of the Taylor expansions used to derive relation 2.5. Checking this validity can be performed by estimating the curvature of the cost function (see Antoniadis, Berruyer, & Carmona, 1992); alternatively, one can perform the leave-one-out procedure for a model selected on the basis of E_p and compare E_t and E_p. This is illustrated in Figure 4, for an approximation of sin(x)/x with two hidden neurons, for the minima that are stable and with full rank.

The time required for the computation of the quantities {h_ii} from relation 2.2 and of the leave-one-out score E_p at the end of each training is
negligibly small as compared to the time required for training; therefore, our procedure divides the computation time by N, as compared to the standard leave-one-out, where N is the number of examples.

3.3 Conclusion. We have shown in this section that E_p is an efficient basis for selecting a model within a family of parameterized functions; furthermore, it automatically eliminates the solutions that are rank deficient. Once a model has been selected on this basis for several families of parameterized functions, one has to select the appropriate architecture. This is addressed in the next section.

4 Selection of the Optimal Architecture
Having selected, for each architecture (e.g., for each number of hidden neurons), the minimum with the smallest value of E_p, we have to define a procedure for choosing the best architecture. It is known that among models with approximately the same performance, one should favor the model with the smallest architecture. In practice, however, the trade-off between parsimony and performance may be difficult to make. Referring to Figure 3, the choice between two, three, or four hidden neurons is not easy in the absence of further information. In this section, we show that the leverages may be used as an additional element in the process of choosing a model among candidate models having different architectures.

4.1 Leverage Distribution. In a purely black box modeling context, all data points of the training set are equally important; therefore, we suggest that, among models with approximately the same generalization error estimates (E_p), one should select the model for which the influence of the training examples is most evenly shared,5 that is, for which the distribution of the leverages h_ii is narrowest around q/N. For example, the models with two and four hidden neurons of Figure 3 have significantly different leverage distributions, as shown in Figure 5. As expected from relation 2.4, the leverages {h_ii}_{i=1,...,N} are centered on the corresponding values of q/N. However, the distribution is broader for the model with four hidden neurons. The width of the distribution is due to the fact that, for x ∈ [8, 13], the number of examples is too small given the number of parameters of the model; this results in confidence intervals on the model output that are significantly larger in the corresponding region of input space than elsewhere.

In order to characterize the leverage distribution of a given model, it is appropriate to use the mean value of the quantities {√h_ii}_{i=1,...,N}, since the latter are involved in the computation of the confidence intervals. Indeed,
5 Unless it is explicitly desired, for some reason arising from prior knowledge of the modeled process, that one or more examples be of special importance.
Figure 5: Model outputs (top and middle) and leverage distributions (bottom) for the models with two and four hidden neurons selected on the basis of Ep .
Figure 6: Performance indices of the solutions selected on the basis of E_p, including μ (right scale), and tested on a separate data set.
the application of relation 2.6 to example i of the training set yields

E(Y_p | x^i) ∈ f(x^i, θ_LS) ± t_α^{N−q} s √h_ii.   (4.1)
Therefore, we define the quantity

μ = (1/N) ∑_{i=1}^N √(N h_ii / q).   (4.2)
Using Cauchy's inequality and relations 2.3 and 2.4, one can derive the following properties for μ, regardless of N and q:

μ ≤ 1,  with  μ = 1 ⇔ h_ii = q/N for all i ∈ [1, . . . , N].   (4.3)
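A minimal sketch of relations 4.2 and 4.3 (assuming NumPy): μ reaches 1 only for perfectly even leverages, and any broadening of the distribution at constant sum q pulls it below 1.

```python
import numpy as np

def mu(h, q):
    """Normalized leverage-distribution index, relation 4.2."""
    N = len(h)
    return np.mean(np.sqrt(N * h / q))

N, q = 50, 5

# Perfectly even influence: every leverage equals q/N, and mu = 1.
h_uniform = np.full(N, q / N)
assert np.isclose(mu(h_uniform, q), 1.0)

# A broader leverage distribution with the same sum q gives mu < 1
# (concavity of the square root, i.e., Cauchy's inequality).
p = np.sin(np.arange(N))
h_broad = q / N + 0.08 * (p - p.mean())
assert np.all((h_broad > 0) & (h_broad < 1))
assert np.isclose(h_broad.sum(), q)
assert mu(h_broad, q) < 1.0
```

Here `h_broad` is synthetic; in practice the leverages are those of the candidate model under consideration.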
Therefore, μ is a normalized parameter that characterizes the distribution of the leverages around their mean value. Regardless of N and q, the closer μ is to 1, the more evenly shared the influences are among the training examples. Hence, μ can be used as an additional indication of the overfitting of a model. To illustrate this, Figure 6 shows, in addition to the curves of Figure 3, the values of μ and an estimate of the generalization error, obtained on a separate representative test set (100 examples). To this end, we define the generalization mean square error (GMSE) as

GMSE = √( (1/card(Test set)) ∑_{x∈Test set} ( y_p|x − f(x, θ_LS) )² ).   (4.4)
The analysis of Figure 6 leads to the following comments:

The distinction between the leverage distributions of the two models depicted in Figure 5 can actually be performed by μ (0.98 versus 0.96).

The model with the highest value of μ is actually the model with the smallest value of GMSE. For larger architectures, both μ and GMSE get worse.

Models with more than two hidden neurons do not reach the performance of that with two hidden neurons. However, the increase of the GMSE turns out to be limited, which demonstrates that selecting the minima on the basis of E_p is an effective means of limiting overfitting.

As a conclusion, considering several architectures with approximately the same values of E_p, the most desirable solution is the solution with the highest value of μ in the context of black box modeling, that is, if there is no reason, arising from prior knowledge, to devote a large fraction of the available degrees of freedom to one or to a few specific training samples.

4.2 Model Selection and Extension of the Training Set. In the previous section, we considered the situation where one wishes to select a model built from a data set that is available once and for all. It is often the case, however, that additional samples may be gathered in order to improve the model. Then the following question must be addressed: In what regions of input space would it be useful to gather new data? We show in the following that the approach to model selection described in the previous section can be helpful in this respect. To this end, one has to define a maximal acceptable value for the confidence interval. Using the expression of that interval (relation 2.6), one can choose, for instance, the following threshold:

t_α^{N−q} s √(z^T (Z^T Z)^{-1} z) < t_α^{N−q} s.   (4.5)
This guarantees that the confidence on the model output f(x, θ_LS) is not poorer than that on the measured output Y_p | x. To start, consider the model with two hidden neurons, which has been selected, using μ, as the best possible model given the available training set (see Figure 6). According to threshold 4.5, this model should not be used outside the interval x ∈ [−0.3, 16.5] (see Figure 7). In order to use the model outside this input space area, additional data should be gathered. This is not surprising, since the interval [−0.3, 16.5] is roughly the interval from which the training set was drawn. However, if it is desirable to improve the model within this interval, the confidence interval on the model with two hidden neurons does not provide any information for detecting where additional examples would be required. To that effect, we have to choose slightly more overfitted models,
Figure 7: Application of threshold 4.5 to the confidence interval on the model output (two hidden neurons).

Figure 8: Application of threshold 4.5 to the confidence interval on the model output (four hidden neurons).
such as the model with four hidden neurons shown in Figure 5, and consider their confidence intervals. Then the application of threshold 4.5 shows areas of input space, both within and outside the interval x ∈ [−0.3, 16.5], where additional examples would be helpful (see Figure 8): the large confidence interval between 9 and 11 shows that the corresponding h_ii are large, that is, that a large fraction of the degrees of freedom of the four-hidden-neuron network was used to fit the training examples within this interval. Therefore, gathering new data in this area would indicate whether the wiggle of the output in this area is significant or whether it is an artifact due to the small number of samples. Hence, considering slightly oversize models for selection is a means for detecting areas of input space where gathering additional data would be appropriate. Once additional examples are gathered, the selection procedure is performed again.
Figure 9: Performances of the candidate models with one to five hidden neurons.
4.3 Discussion. For didactic reasons, the selection method discussed in sections 3 and 4 was split into two phases. For a given training set, architectures of increasing sizes were considered:

1. For each architecture, perform several trainings, starting with various random weights.

2. For each model, compute the leverages {h_ii}_{i=1,...,N} of the training examples and check the validity of relations 2.3 and 2.4. Discard the model in case of rank deficiency; otherwise, compute E_p.

3. Keep the model with the smallest value of E_p, and compute its parameter μ.

4. Among the candidate models exhibiting a satisfactory E_p, select the model with the highest value of μ.

Thus, the method relies essentially on two quantities that can be compared for a given problem, regardless of the model size: E_p, which is an estimate of the generalization error, and μ, an index of the degree of overfitting of the model. Hence, one should consider, for various architectures, all candidate models as pairs (E_p, μ), and perform model selection directly on the basis of the position of each model in the (E_p, μ) plane. For instance, on the sin(x)/x problem, the selection should be performed by considering Figure 9, which represents about 1000 candidate models for five different architectures. According to the required trade-off between E_p and μ, the final model should be selected within the indicated area. Models outside this area should not be chosen, for one can always select a better
one (i.e., with a smaller E_p and a higher μ). Therefore, architectures with one and five hidden neurons can definitely be discarded. As explained in section 4.2, the model designer should make the final choice depending on his objectives. If he wants the best possible model given this particular training set, he should favor, within this dotted area, the solutions with the highest possible value of μ. Conversely, if he has the opportunity of gathering additional training examples and wants to select the most relevant ones, he should favor slightly overfitted models; within this dotted area, he should favor E_p. The confidence intervals for the prediction of such a model indicate the areas, in input space, where gathering data would be desirable. With this new data, the selection procedure may be performed again, which may lead to less complex models.

5 Validation on Numerical Examples
In this section, we illustrate the use of our method on two academic examples. We demonstrate on a teacher-student problem that the method we propose gives very good results, in contrast to the usual leave-one-out approach. For comparison purposes, we model a simulated process investigated previously in Anders and Korn (1999), and we outline briefly an industrial application.

5.1 Comparison to the Standard Leave-One-Out Approach on a Teacher-Student Problem. We consider the following problem: a database
of 800 examples is generated by a neural network with five inputs, one hidden layer of five neurons with sigmoidal (tanh) nonlinearity, and a linear output neuron:

E(Y_p | x) = a_0 + ∑_{i=1}^5 a_i tanh( b_i0 + ∑_{j=1}^5 b_ij x_j ).   (5.1)
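Generating such a teacher database can be sketched as follows (assuming NumPy): weights drawn uniformly in [−1, +1], inputs in [−3, +3], and Gaussian output noise at 20% of the unconditional output standard deviation, as in this section. The particular random draws (and hence the exact noise level) are illustrative, not the authors' realization.

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_hidden, n_examples = 5, 5, 800

# Teacher weights, uniform in [-1, +1] (the a_i and b_ij of equation 5.1).
a = rng.uniform(-1, 1, size=n_hidden + 1)          # a_0, ..., a_5
b = rng.uniform(-1, 1, size=(n_hidden, n_in + 1))  # b_i0 and b_ij

# Inputs uniform in [-3, +3], so the sigmoids work in their nonlinear zone.
x = rng.uniform(-3, 3, size=(n_examples, n_in))

# Noise-free teacher output:
# E(Yp | x) = a_0 + sum_i a_i tanh(b_i0 + sum_j b_ij x_j).
hidden = np.tanh(b[:, 0] + x @ b[:, 1:].T)
y_clean = a[0] + hidden @ a[1:]

# Gaussian noise with standard deviation equal to 20% of the unconditional
# standard deviation of the teacher output.
sigma = 0.2 * y_clean.std()
y = y_clean + rng.normal(0.0, sigma, size=n_examples)
```

The resulting pairs (x, y) play the role of the 800-example database; a subset is then used for training and selection, and the rest for testing.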
Thus, we guarantee that the family of parameterized functions in which we search for the model actually contains the regression function. The weights and the inputs of this teacher network are uniformly distributed in [−1, +1] and [−3, +3], respectively, in order to use the sigmoids in their nonlinear zone. Gaussian noise is added to the output, with a standard deviation equal to 20% of the unconditional standard deviation of the model output: σ² = 0.05. Student networks with five inputs are investigated. We show how all the above results can be successfully used to perform model selection. To this end, 300 examples are used to perform both training and model selection as described in the previous sections, and 500 examples are devoted to the tests. All results reported below were obtained by estimating the gradient of the cost functions by backpropagation and minimizing the cost functions by the BFGS algorithm (Press et al., 1988). The values of {h_ii}_{i=1,...,N} are
Figure 10: Performances of the solutions selected on the basis of E_p, including μ (right scale), and tested on a separate data set (teacher-student problem).
computed by singular value decomposition of matrix Z, as presented in appendix A. For increasing architecture sizes (from one to nine hidden neurons), the minimum of the cost function with the smallest value of E_p was selected, and the following values were computed: TMSE and μ on the training set, and GMSE on the separate test set. These results are summarized in Figure 10. As on the sin(x)/x problem, E_p decreases monotonically and reaches approximately the value of the standard deviation of the noise. Based on the values of μ, the optimal solution appears to be the model having five hidden neurons. Indeed, this proves to be the model for which the generalization error GMSE is smallest. Remarkably, the weights of this network are almost identical to those of the teacher network (they should not be expected to be strictly identical, since noise is added to the output of the teacher network during generation of the data). Actually, the weights of this network are identical to those obtained when training is performed with initial weights equal to the teacher's weights: this demonstrates that the selection method finds the best possible model.

By contrast, the generalized leave-one-out approach appears to give very poor results:

If leave-one-out is performed on the minima with the smallest value of TMSE, without checking Z's rank or the stability of the minima for the usual leave-one-out, the error E_t decreases as hidden neurons are added and becomes significantly smaller than the standard deviation of the noise. Indeed, the global minima of the cost functions corresponding to architectures having five (i.e., the teacher network architecture) to nine hidden neurons have a rank deficiency and are therefore overfitted solutions with poor GMSE.
If leave-one-out is performed on the minima with the smallest value of TMSE among the minima without rank deficiency, without checking that the minima are stable for the usual leave-one-out, the error E_t reaches a plateau close to the standard deviation of the noise. However, both E_p and GMSE prove that the performances of these models are worse than expected from E_t. In fact, the corresponding minima appear to be unstable for the usual leave-one-out, which makes E_t invalid.

If the procedure is restricted to the minima with the smallest value of TMSE among the minima without rank deficiency and stable for the usual leave-one-out, almost all minima, from five to nine hidden neurons, are rejected. Furthermore, the procedure is extremely lengthy.

This exemplifies the limitations of the conventional generalized leave-one-out. When the number of training examples is small given the complexity of the family of parameterized functions, the withdrawal of some training examples makes the minima of the cost function unstable for the usual leave-one-out. By contrast, performing leave-one-out on the basis of relation 2.5 prevents these stability problems. Furthermore, the computational overhead of this selection method is negligible, since we only have to compute the {h_ii}_{i=1,...,N} for each minimum of the cost function.

5.2 Comparison to Other Selection Methods on a Benchmark Problem.
Anders and Korn (1999) introduce a two-input process, simulated with the following regression function:

$$E(Y_p \mid x) = -0.5 + 0.2\, x_1^2 - 0.1 \exp(x_2). \qquad (5.2)$$
The inputs x1 and x2 are drawn from a normal distribution. Gaussian noise is added to the output, with a standard deviation equal to 20% of the unconditional standard deviation of the model output: $s^2 = 5 \times 10^{-3}$. To perform statistically significant comparisons, 1000 training sets of 500 examples each were generated with different realizations of the noise (the inputs remaining unchanged). A separate set of 500 samples was used for computing the GMSE. The following performance index was used:

$$r = \frac{\mathrm{GMSE}^2 - s^2}{s^2} \times 100\%. \qquad (5.3)$$
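As an aside, this benchmark is easy to reproduce. The sketch below (Python with NumPy; the function names are ours, not the authors') simulates the process of equation 5.2 with the stated noise level and evaluates the performance index of equation 5.3, treating GMSE as a root-mean-square generalization error so that GMSE² is directly comparable to s²:

```python
import numpy as np

def make_dataset(n, sigma2=5e-3, rng=None):
    """Simulate the Anders & Korn (1999) two-input benchmark:
    E(Y | x) = -0.5 + 0.2*x1^2 - 0.1*exp(x2), plus gaussian noise
    of variance sigma2 (equation 5.2)."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal((n, 2))
    y_clean = -0.5 + 0.2 * x[:, 0] ** 2 - 0.1 * np.exp(x[:, 1])
    y = y_clean + rng.normal(scale=np.sqrt(sigma2), size=n)
    return x, y

def performance_index(gmse, sigma2=5e-3):
    """Performance index of equation 5.3, in percent; gmse is taken
    as a root-mean-square error, so gmse**2 is the mean squared error."""
    return (gmse ** 2 - sigma2) / sigma2 * 100.0
```

A perfect model would reach GMSE² = s², hence r = 0; the 28% reported by Anders and Korn (1999) thus measures excess generalization error relative to the noise floor.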
Using several model selection techniques (hypothesis testing, information criteria, and 10-fold cross-validation, each of them being followed, if necessary, by network pruning), the best performance (averaged over the 1000 training sets), reported in Anders and Korn (1999), was r = 28% (standard deviation not indicated). The authors state that these (relatively) poor performances arise from the complexity of the true regression function, equation 5.2.
1500
Gaétan Monari and Gérard Dreyfus
Table 1: Summary of Numerical Results on a Benchmark Problem.

                                    Anders and      LOCL Method     LOCL Method   Weight
                                    Korn (1999)     (Outliers Not   (Outliers     Decay
                                                    Removed)        Removed)
500 training samples   Average r    28%             3%              3% (a)        2%
                       S.D.         Not indicated   2%              2%            2%
100 training samples   Average r    –               126%            27% (b)       54%
                       S.D.         –               632%            28%           36%

Notes: (a) No outlier detected. (b) 3% outliers detected.
Under the same conditions, we performed model selection according to the present method (on the basis of Ep and μ as described in sections 3 and 4) and the Bayesian approach to weight decay⁶ described in MacKay (1992). In both cases, the performance indices were very satisfactory. The results reported in this section are summarized in Table 1, where the present method is referred to as LOCL (local overfitting control via leverages). Hence, this academic problem can be accurately and consistently solved by our method as well as by weight decay. As Rivals and Personnaz (2000) proposed, a reduction of the size of the training set from 500 to 100 samples is of interest. Unlike them, we simulated the 1000 new training sets with different values of the noise and of the inputs. Because of the decrease of the training set size, this was necessary to obtain statistically significant results.⁷ Our method gave a value of r = 126%, whereas using weight decay,⁸ one obtains r = 54%. This leads to the following comments: In a problem with such a small number of training examples, knowing the true noise level (see note 6) is a tremendous advantage, which explains the superiority of our implementation of weight decay. The standard deviation associated with our r shows the large scattering of the results. However, advantage can be taken of the fact that our method is local, whereas alternative methods rely on global scores. We can show that the

⁶ Due to the difficulties encountered while implementing this approach, we made the assumption that the noise level (meta-parameter β in the cost function) was known and set it to its true value. Without this strong assumption, the performances of the models selected on the basis of the "evidence" would certainly have been worse.
⁷ Otherwise, the results depend very strongly on the location of the training examples in input space.
⁸ Keeping the inputs unchanged and using only Ep followed, if necessary, by network pruning, Rivals and Personnaz (2000) obtained r = 140% (S.D. not given).
Figure 11: The welding process. Two electrodes are pressed against the metal sheets, and a strong electrical current is flown through the assembly.
poor results reported above are due to only a few points of the test set. Just as in section 4.2, one can define a threshold that the confidence interval on the model output should not exceed. Choosing threshold 4.5, one obtains the following results: the confidence intervals of 3% (S.D. 2%) of the test examples were found to be above threshold; these examples were discarded; on the remaining 97% of the examples, an average model performance r = 27% was obtained. This is comparable to the performances obtained by Anders and Korn (1999) using a training set that was five times as large.

5.3 An Industrial Application. This method was used in a large-scale industrial application: the modeling of the spot welding process. The quantity to be predicted was the diameter of the weld as a function of physical parameters measured during the process. Because of the relatively small number of examples at the beginning of the project and the economic and safety issues involved, model selection was a critical point. Spot welding is the most widely used welding process in the car industry; millions of such welds are made every day. Two steel sheets are welded together by passing a very large current (tens of kiloamperes) between two electrodes pressed against the metal surfaces, typically for a hundred milliseconds (see Figure 11). The heating thus produced melts a roughly cylindrical region of the metal sheets. After cooling, the diameter of the melted zone—typically 5 mm—characterizes the effectiveness of the process. A weld spot whose diameter is smaller than 4 mm is considered mechanically unreliable; therefore, the spot diameter is a crucial element in the safety of a vehicle. At present, no fast, nondestructive method exists for measuring the spot diameter, so there is no way of assessing the quality of the weld immediately after welding. Therefore, a typical industrial strategy consists of using an intensity that is much larger than actually necessary, which results in excessive heating, which in turn leads to the ejection of steel droplets from the welded zone (hence the "sparks" that can be observed during each welding by robots on automobile assembly chains), and in making a much larger number of welds than necessary to be sure that a sufficient number of valid spots are produced. Both the excessive current and the excessive number of spots result in a fast degradation of the electrodes, which must be changed or redressed frequently. For all these reasons, the modeling of the process, leading to a reliable on-line prediction of the weld diameter, is an important industrial challenge. Modeling the dynamics of the welding process from first principles is a very difficult task, because the computation time necessary for the integration of the partial differential equations of the knowledge-based model is many orders of magnitude larger than the duration of the process (which precludes real-time prediction of the spot diameter), and because many physical parameters appearing in the equations are not known reliably. These considerations led to considering black-box modeling as an alternative. Since the process is nonlinear and has several input variables, neural networks were natural candidates. The main goal was to predict the spot diameter from measurements performed during the process, immediately after weld formation, for on-line quality control. The main concerns for the modeling task were the choice of the model inputs and the limited number of examples available initially in the database. The quantities that are candidates for input selection are mechanical and electrical signals that can be measured during the welding process. Input selection was performed by forward stepwise selection of polynomial models, whereby the significance of adding a new input or removing a previously selected input is tested by statistical tests. The variables thus selected were subsequently used as inputs to neural networks.
The experts involved in the knowledge-based modeling of the process validated this set. Because no simple nondestructive weld diameter measurement exists, the database is built by performing a number of welds in well-defined conditions and subsequently tearing them off. The melted zone, remaining on one of the two sheets, is measured. This is a lengthy and costly process, so that the initial training set was made of 250 examples. Using the confidence interval estimates 2.6 and the training set extension strategy discussed in section 4.2, further experiments were performed, so that, finally, a much larger database became available, half of which was used for training and half for testing (since the selection method does not require any validation set). Model selection was performed using the full procedure discussed in section 4. Typical results are displayed in Figure 12, which shows the scatter plots for spot diameter prediction on the training set and on the completely independent test set (for a certain type of steel sheet), together with the confidence intervals. The leave-one-out score Ep is equal to 0.27, while the EQMT is 0.23. The full description of this application is beyond the scope of this paper; nonconfidential data can be found in Monari (1999).
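The forward stepwise input selection mentioned above can be sketched as follows (a minimal illustration in Python/NumPy; the paper does not specify its exact statistical tests, so this sketch uses a simple partial F-test on the residual sum of squares, and the function name and threshold are ours):

```python
import numpy as np

def forward_stepwise(X, y, threshold=4.0):
    """Greedy forward selection for a linear-in-the-parameters model.
    At each step, add the candidate column of X that most reduces the
    residual sum of squares, as long as its partial F-statistic exceeds
    `threshold`. Illustrative sketch only."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    rss_cur = float(np.sum((y - y.mean()) ** 2))  # intercept-only model
    while remaining:
        best = None
        for j in remaining:
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if best is None or rss < best[1]:
                best = (j, rss)
        j, rss_new = best
        dof = n - len(selected) - 2           # residual degrees of freedom
        f_stat = (rss_cur - rss_new) / (rss_new / dof)
        if f_stat < threshold:
            break                             # no remaining input is significant
        selected.append(j)
        remaining.remove(j)
        rss_cur = rss_new
    return selected
```

In the application described here, the candidate columns would be monomials of the measured mechanical and electrical signals; selection stops when no remaining candidate passes the test.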
Figure 12: Scatter plots with confidence intervals for diameter prediction. (Left) Training set. (Right) Test set.
6 Conclusion
A model selection method relying on local overfitting control via leverages, termed the LOCL method, has been presented. It is based on computing the leverage of each example on the candidate models. The leverage of a given example can be regarded as an indication of the percentage of the degrees of freedom of the model that is used to fit the example. This allows a precise monitoring of overfitting, since the method relies on local indicators in sample space (the leverages) instead of on global indicators such as a cross-validation score or the value of a penalty function. Although it is similar in spirit to the generalized leave-one-out procedure, it is very economical in computation time, and it avoids the stability problems inherent in leave-one-out. Furthermore, the values of the leverages (or of the confidence intervals computed therefrom) give an indication of areas, in input space, where new examples are needed. This method has been validated in a large-scale industrial application.

Appendix A: Computation of the Leverages
This appendix shows how the values of the all-important quantities $\{h_{ii}\}_{i=1,\dots,N}$ can be computed without matrix inversion by making use of the singular value decomposition (see, e.g., Press et al., 1988) of matrix Z, which can always be performed, even if matrix Z is singular. Z can be written as

$$Z = U W V^T, \qquad (A.1)$$
where U is an (N, q) column-orthogonal matrix (i.e., $U^T U = I$), W is a (q, q) diagonal matrix with positive or zero elements (the singular values of Z), and V is a (q, q) orthogonal matrix (i.e., $V^T V = V V^T = I$). We thus obtain

$$(Z^T Z)^{-1} = (V W U^T U W V^T)^{-1} = (V W^2 V^T)^{-1} = V W^{-2} V^T. \qquad (A.2)$$
Therefore, the elements of this (q, q) matrix can easily be computed as

$$(Z^T Z)^{-1}_{lj} = \sum_{k=1}^{q} \frac{V_{lk} V_{jk}}{W_{kk}^2}. \qquad (A.3)$$
From equation A.3 and $h_{ii} = z_i^T (Z^T Z)^{-1} z_i$, one gets

$$h_{ii} = \sum_{k=1}^{q} \frac{1}{W_{kk}^2} \left( \sum_{j=1}^{q} Z_{ij} V_{jk} \right)^2. \qquad (A.4)$$
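In NumPy, equation A.4 is only a few lines (a sketch; the function name is ours). Since $ZV = UW$, the expression reduces to summing squared entries of U, but the code below follows A.4 literally and guards against numerically zero singular values:

```python
import numpy as np

def leverages(Z):
    """Leverages h_ii = z_i^T (Z^T Z)^{-1} z_i via the SVD Z = U W V^T,
    as in equation A.4, without forming or inverting Z^T Z."""
    U, W, Vt = np.linalg.svd(Z, full_matrices=False)
    # Drop numerically zero singular values (rank-deficient minima are
    # rejected by the selection procedure anyway).
    keep = W > max(Z.shape) * np.finfo(float).eps * W.max()
    ZV = Z @ Vt.T[:, keep]              # (Z V)_{ik}
    return np.sum((ZV / W[keep]) ** 2, axis=1)
```

As a quick sanity check, the leverages sum to the number of effective parameters (the rank of Z), i.e., the trace of the projection matrix.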
Appendix B: Stability of a Model for the Usual Leave-One-Out Procedure
Consider a training procedure using a deterministic minimization algorithm of the cost function;⁹ assume that a minimum of the cost function has been found, with a vector of parameters θ_LS. Assume that example i is withdrawn from the training set and that a training is performed with θ_LS as initial values of the parameters, leading to a new parameter vector θ_LS^(−i). Further assume that example i is subsequently reintroduced into the training set and that a new training is performed with initial values of the parameters θ_LS^(−i). The minimum of the cost function corresponding to a parameter vector θ_LS is said to be stable for the usual leave-one-out procedure if and only if, for each i ∈ [1, . . . , N], starting from θ_LS, the procedure described above retrieves θ_LS. This definition is illustrated in Figure 13. On this figure, in the unstable case, after the removal of example i (and convergence to θ_LS^(−i)), the replacement of the same example into the training set makes the learning procedure converge to another solution with parameters θ*_LS. In such a case, Et is not an estimate of the generalization performance of the model with parameters θ_LS, since the prediction error R_i^(−i) actually corresponds to the model with parameters θ*_LS. This definition is different from the stability considerations introduced by other authors in order to derive bounds on the estimate of the generalization performance (see Vapnik, 1982). For instance, bounding the difference

⁹ This excludes such algorithms as simulated annealing and probabilistic hill climbing, which may overcome local minima.
Figure 13: Minima that are stable (left) and unstable (right) for the usual leave-one-out procedure.
between the true error and the leave-one-out estimate thereof (with an arbitrary accuracy at a given level of significance) requires some stability of the training algorithm regardless of the available data set (Kearns and Ron, 1997). Therefore, this stability does not depend on the considered minimum of the cost function. Our definition of stability is less stringent than the alternative one. It should be remembered that, in contrast to the usual leave-one-out, the validity of our approach does not depend on any stability condition.

References

Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323.
Antoniadis, A., Berruyer, J., & Carmona, R. (1992). Régression non linéaire et applications. Paris: Economica.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross validation. In Tenth Annual Conference on Computational Learning Theory (pp. 152–162). New York: Association for Computing Machinery Press.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Monari, G. (1999). Sélection de modèles non linéaires par leave-one-out: étude théorique et application des réseaux de neurones au procédé de soudage par points. Thèse de l'Université Paris 6. Available on-line: http://www.neurones.espci.fr/Francais.Docs/publications.html.
Monari, G., & Dreyfus, G. (2000). Withdrawing an example from the training set: An analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing Letters, 35, 195–201.
Moody, J. (1994). Prediction risk and neural network architecture selection. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks: Theory and pattern recognition applications. Berlin: Springer-Verlag.
Press, W. H., Teukolsky, S. A., Flannery, B. P., & Vetterling, W. T. (1988). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rivals, I., & Personnaz, L. (2000). A statistical procedure for determining the optimal number of hidden neurons of a neural model. Paper presented at the ICSC Symposium on Neural Computation, Berlin.
Seber, G. A. F., & Wild, C. J. (1989). Nonlinear regression. New York: Wiley.
Sjöberg, J., & Ljung, L. (1992). Overtraining, regularization and searching for minimum in neural networks (Tech. Rep. LiTH-ISY-I-1297). Linköping: Department of Electrical Engineering, Linköping University. Available on-line: http://www.control.isy.liu.se.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 111–147.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Zhou, G., & Si, J. (1998). A systematic and effective supervised learning mechanism based on Jacobian rank deficiency. Neural Computation, 10, 1031–1045.

Received October 16, 2000; accepted October 19, 2001.
ARTICLE
Communicated by Stuart Geman
A Monte Carlo EM Approach for Partially Observable Diffusion Processes: Theory and Applications to Neural Networks Javier R. Movellan [email protected] Machine Perception Laboratory, Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Paul Mineiro [email protected] Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093, U.S.A.
R. J. Williams [email protected] Department of Mathematics and Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A.
We present a Monte Carlo approach for training partially observable diffusion processes. We apply the approach to diffusion networks, a stochastic version of continuous recurrent neural networks. The approach is aimed at learning probability distributions of continuous paths, not just expected values. Interestingly, the relevant activation statistics used by the learning rule presented here are inner products in the Hilbert space of square integrable functions. These inner products can be computed using Hebbian operations and do not require backpropagation of error signals. Moreover, standard kernel methods could potentially be applied to compute such inner products. We propose that the main reason that recurrent neural networks have not worked well in engineering applications (e.g., speech recognition) is that they implicitly rely on a very simplistic likelihood model. The diffusion network approach proposed here is much richer and may open new avenues for applications of recurrent neural networks. We present some analysis and simulations to support this view. Very encouraging results were obtained on a visual speech recognition task in which neural networks outperformed hidden Markov models.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1507–1544 (2002)

1 Introduction

Since Hopfield's seminal work (Hopfield, 1984), continuous deterministic neural networks and discrete stochastic neural networks have been thoroughly studied by the neural network community (Pearlmutter, 1995; Ackley, Hinton, & Sejnowski, 1985). However, the continuous stochastic case has been conspicuously ignored. This is surprising considering the success of continuous stochastic models in other fields (Oksendal, 1998). In this article, we focus on the continuous stochastic case and present a Monte Carlo expectation-maximization (EM) approach for training continuous-time, continuous-state, stochastic recurrent neural network models. The goal is to learn probability distributions of continuous paths, not just equilibrium points. This is important for problems involving sequences, such as speech recognition and object tracking. The approach proposed here potentially opens new avenues for applications of recurrent neural networks, showing results comparable to, if not better than, those obtained with hidden Markov models. In addition, the maximum likelihood learning rules used to train these networks are based on inner products that are computable using local Hebbian statistics. This is an aspect of potential value for neurally plausible learning models and for potential generalizations of kernel methods (Aizerman, Braverman, & Rozoner, 1964; Burges, 1998). Continuous-time, continuous-state recurrent neural networks (hereafter referred to simply as recurrent neural networks) are dynamical systems consisting of n point neurons coupled by synaptic connections. The strength of these connections is represented by an n × n real-valued matrix w, and the network dynamics are governed by the following differential equations,

$$\frac{dx_j(t)}{dt} = \mu_j(x(t), \lambda), \quad \text{for } t \in [0, T],\ j = 1, \dots, n, \qquad (1.1)$$
where

$$\mu_j(x(t), \lambda) = \kappa_j \left( -\rho_j x_j(t) + \xi_j + \sum_{i=1}^{n} \varphi(x_i(t))\, w_{ij} \right), \qquad (1.2)$$
$x_j$ is the soma potential of neuron j, $1/\rho_j > 0$ is the transmembrane resistance,¹ $1/\kappa_j > 0$ is the input capacitance, $\xi_j$ is a bias current, $w_{ij}$ is the conductance (synaptic weight) from unit i to unit j, and $\varphi$ is a nonlinear activation function, typically a scaled version of the logistic function

$$\varphi(v) = \eta_1 + \eta_2 \frac{1}{1 + e^{-\eta_3 v}}, \quad \text{for } v \in \mathbb{R}, \qquad (1.3)$$

¹ We allow $1/\rho_j$ or $1/\kappa_j$ to be $+\infty$, corresponding to situations in which $\rho_j = 0$ or $\kappa_j = 0$, respectively.
A Monte Carlo EM Approach
1509
where $\eta = (\eta_1, \eta_2, \eta_3) \in \mathbb{R}^3$ are fixed scale and gain parameters. Here, $\lambda \in \mathbb{R}^p$ represents the w, ξ, κ, and ρ terms, whose values are typically varied in accordance with some learning rules.

1.1 Hidden Units. A variety of algorithms have been developed to train these networks and are commonly known as recurrent neural network algorithms (Pearlmutter, 1995). An important achievement of these algorithms is that they can train networks with hidden units. Hidden units allow these networks to develop time-delayed representations and feature conjunctions. For this reason, recurrent neural networks were expected to become a standard tool for problems involving continuous sequences (e.g., speech recognition) in the same way that backpropagation networks became a standard tool for problems involving static patterns. Recurrent network learning algorithms have proved useful in some fields. In particular, they have been useful for understanding the role of neurons in the brain. For example, when these networks are trained on simple sequences used in controlled experiments with animal subjects, the hidden units act as "memory" neurons similar to those found in neural recordings (Zipser, Kehoe, Littlewort, & Fuster, 1993). While these results have been useful to help understand the brain, recurrent neural networks have yielded disappointing results when applied to engineering problems such as speech recognition or object tracking. In such domains, probabilistic approaches, such as hidden Markov models and Kalman filters, have proved to be superior.

1.2 Diffusion Networks. We believe that the main reason that recurrent neural networks have provided disappointing results in some engineering applications (e.g., speech recognition) is that they implicitly rely on a very simplistic likelihood model that does not capture the kind of variability found in natural signals. We will elaborate on this point in section 8.1.
To overcome this deficiency, we propose adding noise to the standard recurrent neural network dynamics, as would be done in the stochastic filtering and systems identification literature (Lewis, 1986; Ljung, 1999). Mathematically, this results in a diffusion process, and thus we call these models diffusion neural networks, or diffusion networks for short (Movellan & McClelland, 1993). While recurrent neural networks are defined by ordinary differential equations (ODEs), diffusion networks are described by stochastic differential equations (SDEs). Stochastic differential equations provide a rich language for expressing probabilistic temporal dynamics and have proved useful in formulating continuous-time inference problems, as, for example, in the continuous Kalman-Bucy filter (Kalman & Bucy, 1961; Oksendal, 1998). Diffusion networks can be interpreted as a low-power version of recurrent networks, in which the thermal noise component is nonnegligible. The temporal evolution of a diffusion network with vector parameter λ defines
an n-dimensional stochastic process $X^\lambda$ that satisfies the following SDE:²

$$dX^\lambda(t) = \mu(X^\lambda(t), \lambda)\, dt + \sigma\, dB(t), \quad t \in [0, T], \qquad (1.4)$$
$$X^\lambda(0) \sim \nu. \qquad (1.5)$$
Here $\mu = (\mu_1, \dots, \mu_n)'$ is called the drift vector. We use the same drift as recurrent neural networks, but the learning algorithm presented here is general and can be applied to other drift functions; B is a standard n-dimensional Brownian motion (see section 2), which provides the random driving noise for the dynamics; σ > 0 is a fixed positive constant called the dispersion, which governs the amplitude of the noise; T > 0 is the length of the time interval over which the model is used; and ν is the probability distribution of the initial state $X^\lambda(0)$. We regard T > 0 and ν as fixed henceforth.

1.3 Relationship to Other Models. Figure 1 shows the relationship between diffusion networks and other approaches in the neural network and stochastic filtering literature. Diffusion networks belong to the category of partially observable Markov models (Campillo & Le Gland, 1989), a category that includes standard hidden Markov models and Kalman filters as special cases. Standard hidden Markov models are defined in discrete time, and their internal states are discrete. Diffusion networks can be viewed as hidden Markov models with continuous-valued hidden states and continuous-time dynamics. The continuous-time nature of the networks is convenient for data with dropouts or variable sample rates, since continuous-time models define all of the finite-dimensional distributions. The continuous-state representation (see Figure 2) is well suited for problems involving inference about continuous unobservable quantities, such as object tracking tasks, and modeling of cognitive processes (McClelland, 1993; Movellan & McClelland, 2001).
If the logistic activation function $\varphi$ is replaced by a linear function, the weights between the observable units and from the observable to the hidden units are set to zero, the ρ parameters are set to zero, and the probability distribution of the initial states is constrained to be gaussian, diffusion networks have the same dynamics underlying the continuous-time Kalman-Bucy filter (Kalman & Bucy, 1961). If, on the other hand, the weight matrix w is symmetric, at stochastic equilibrium, diffusion networks behave like continuous-time, continuous-state Boltzmann machines (Ackley et al.,
² We use $X^\lambda$ to make explicit the dependence of the solution process on the parameter λ. The following assumptions are sufficient for existence and uniqueness in distribution of solutions to equations 1.4 and 1.5: ν has bounded support, μ(·, λ) is continuous and satisfies a linear growth condition |μ(u, λ)| ≤ K_λ(1 + |u|), for some K_λ > 0 and all u ∈ ℝⁿ, where |·| denotes the Euclidean norm (see, e.g., proposition 5.3.6 in Karatzas & Shreve, 1991). These assumptions are satisfied by the recurrent neural network drift function.
Figure 1: Relationship between diffusion networks and other approaches in the neural network and stochastic filtering literature.
Figure 2: In stochastic differential equations (left), the states are continuous and the dynamics are probabilistic. Given a state at time t, there is a distribution of possible states at time t + dt. In ordinary differential equations (center), the states are continuous, and the dynamics are deterministic. In hidden Markov models (right), the hidden states are discrete and probabilistic. This is represented in the figure by partitioning a continuous state space into four discrete regions and assigning equal transition probability to states within the same region.
1985). Finally, if the dispersion constant σ is set to zero, the network becomes a standard deterministic recurrent neural network (Pearlmutter, 1995). In the past, algorithms have been proposed to train expected values and equilibrium points of diffusion networks (Movellan, 1994, 1998). Here, we present a powerful and general algorithm to learn distributions of trajectories. The algorithm can be used to train diffusion networks with hidden units, thus maintaining the versatility of the recurrent neural network approach. As we will illustrate, the algorithm is sufficiently fast to train networks with thousands of parameters using current computers and fares well when compared to other probabilistic approaches, such as hidden Markov models.
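To make the model concrete, the SDE of equation 1.4 with the recurrent-network drift of equation 1.2 can be simulated with the simple Euler-Maruyama discretization that section 2 makes precise. The sketch below (Python/NumPy; function names and parameter values are ours and purely illustrative) uses η = (0, 1, 1) in the logistic of equation 1.3:

```python
import numpy as np

def logistic(v, eta=(0.0, 1.0, 1.0)):
    """Scaled logistic activation of equation 1.3 (illustrative gains)."""
    return eta[0] + eta[1] / (1.0 + np.exp(-eta[2] * v))

def simulate_diffusion_network(w, xi, kappa, rho, sigma, x0,
                               T=1.0, dt=1e-3, rng=None):
    """Euler-Maruyama sample path of dX = mu(X, lambda) dt + sigma dB,
    with drift mu_j = kappa_j(-rho_j x_j + xi_j + sum_i phi(x_i) w_ij).
    Returns an array of shape (steps + 1, n)."""
    rng = rng or np.random.default_rng(0)
    n = len(x0)
    steps = int(round(T / dt))
    x = np.empty((steps + 1, n))
    x[0] = x0
    for k in range(steps):
        drift = kappa * (-rho * x[k] + xi + logistic(x[k]) @ w)
        x[k + 1] = (x[k] + drift * dt
                    + sigma * np.sqrt(dt) * rng.standard_normal(n))
    return x
```

With σ = 0 this reduces to the deterministic recurrent network of equation 1.1; with σ > 0 each run produces a different continuous path, which is precisely the kind of object the learning algorithm assigns probabilities to.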
2 Mathematical Preliminaries

2.1 Brownian Motion and Stochastic Differential Equations. Brownian motion is a stochastic process originally designed to model the behavior of pollen grains subject to random collisions with molecules of water. A stochastic process B = {B(t), t ∈ [0, ∞)} is a standard one-dimensional Brownian motion under a probability measure P if (1) it starts at 0 with P-probability one; (2) for any t ∈ [0, ∞), Δt > 0, the increment B(t + Δt) − B(t) is a gaussian random variable under P with zero mean and variance Δt; and (3) for all l ∈ ℕ and 0 ≤ t₀ < t₁ < ··· < t_l < ∞, the increments B(t_k) − B(t_{k−1}), k = 1, ..., l, are independent random variables under P. An n-dimensional Brownian motion consists of n independent one-dimensional Brownian motion processes. One can always choose a realization of Brownian motion that has continuous paths.³ However, these paths are nowhere differentiable with probability one (Karatzas & Shreve, 1991). Brownian motion can be realized as the limit in distribution as Δt → 0 of the processes obtained by the following iterative scheme:

$$B(0) = 0, \qquad (2.1)$$
$$B(t_{k+1}) = B(t_k) + \sqrt{\Delta t}\, Z(t_k), \qquad (2.2)$$
$$B(s) = B(t_k), \quad \text{for } s \in [t_k, t_{k+1}), \qquad (2.3)$$
where $t_k = k\Delta t$, k = 0, 1, ..., and the $Z(t_k)$ terms are independent gaussian random variables with zero mean and unit variance. It is common in the engineering literature to represent stochastic differential equations as ordinary differential equations with an additive white noise component,

$$\frac{dX(t)}{dt} = \mu(X(t)) + \sigma W(t), \qquad (2.4)$$
where W represents white noise, a stochastic process that is required to have the following properties: (1) W is a zero-mean stationary gaussian process; (2) for all t, t′, the covariance between W(t′) and W(t) is a Dirac delta function of t − t′. While white noise does not exist as a proper stochastic process, equation 2.4 can be given mathematical meaning by thinking of it as defining a process of the form

$$X(t) = X(0) + \int_0^t \mu(X(s))\, ds + \sigma \int_0^t W(s)\, ds, \qquad (2.5)$$

³ A Brownian motion for which P(for all t ≥ 0, lim_{s→t} X_s = X_t) = 1.
where $\int_0^t W(s)\, ds$ is now a stochastic process required to have continuous paths and zero-mean, independent stationary increments. It turns out that Brownian motion is the only process with these properties. Thus, in stochastic calculus, equation 2.4 is seen just as a symbolic pointer to the following integral equation,

$$X(t) = X(0) + \int_0^t \mu(X(s))\, ds + \sigma B(t), \qquad (2.6)$$
where B is Brownian motion. Moreover, if it existed, the white noise W(t) would be the temporal derivative of Brownian motion, dB(t)/dt. Since Brownian paths are nowhere differentiable with probability one, in the mathematical literature, the following symbolic form is preferred to equation 2.4, and it is the one adopted in this article:

$$dX(t) = \mu(X(t))\, dt + \sigma\, dB(t). \qquad (2.7)$$
Intuitive understanding of the solution to this equation can be gained by thinking of it as the limit as Δt → 0 of the processes defined by the following iterative scheme,

$$X(t_{k+1}) = X(t_k) + \mu(X(t_k))\, \Delta t + \sigma \sqrt{\Delta t}\, Z(t_k), \qquad (2.8)$$
$$X(s) = X(t_k), \quad \text{for } s \in [t_k, t_{k+1}), \qquad (2.9)$$
where $t_k = k\Delta t$, k = 0, 1, .... Under mild assumptions on μ and the initial distribution of X, the processes obtained by this scheme converge in distribution to the solution of equation 2.7 (Kloeden & Platen, 1992). Under the assumptions in this article, the solution of equation 2.7 is unique only in distribution, and thus it is called a distributional (also known as a weak or statistical) solution.⁴

2.2 Probability Measures and Densities. We think of a random experiment as a single run of a diffusion network in the time interval [0, T]. The outcome of an experiment is a continuous n-dimensional path x: [0, T] → ℝⁿ describing the state of the n units of a diffusion network throughout time. We let Ω denote the set of continuous functions defined on [0, T] taking values in ℝⁿ. This contains all possible outcomes. We are generally interested in measuring probabilities of sets of outcomes. We call these sets events, and we let F represent the set of events.⁵

⁴ For an explanation of the notion of weak solutions of stochastic differential equations, see section 5.3 of Karatzas and Shreve (1991).
⁵ In this article, the set of events F is the smallest sigma algebra containing the open sets of Ω. A set A of Ω is open if for every x ∈ A there exists a δ > 0 such that the set B(x, δ) = {v ∈ Ω: max_{t∈[0,T]} |x(t) − v(t)| < δ} is a subset of A.
1514
Javier R. Movellan, Paul Mineiro, and R. J. Williams
A probability measure Q is a function Q: F → [0, 1] that assigns probabilities to events in accordance with the standard probability axioms (Billingsley, 1995). A random variable Y is a function Y: Ω → ℝ such that for each open set A in ℝ, the inverse image of A under Y is an event. We represent expected values using integral notation. For example, the expected value of the random variable Y with respect to the probability measure Q is represented as follows:

$$E_Q(Y) = \int_\Omega Y(x)\,dQ(x). \tag{2.10}$$
Probability densities of continuous paths are defined as follows. Let P and Q be probability measures on (Ω, F). If it exists, the density of P with respect to Q is a nonnegative random variable L that satisfies the following relationship,

$$E_P(Y) = \int_\Omega Y(x)\,dP(x) = \int_\Omega Y(x)L(x)\,dQ(x) = E_Q(YL), \tag{2.11}$$
for any random variable Y with finite expected value under P. The function L is called the Radon-Nikodym derivative or Radon-Nikodym density of P with respect to Q and is commonly represented as dP/dQ. Conditions for the existence of these derivatives can be found in any measure theory textbook (Billingsley, 1995). Intuitively, (dP/dQ)(x) represents how many times the path x is likely to occur under P relative to the number of times it is likely to occur under Q. In this article, we concentrate on the probability distributions on the space Ω of continuous paths associated with diffusion networks, and thus we need to consider distributional solutions of SDEs. We let the n-dimensional random process X^λ represent a solution of equations 1.4 and 1.5. We regard the first d components of X^λ as observable and denote them by O^λ. The last n − d components of X^λ are denoted by H^λ and are regarded as unobservable or hidden. We define the observable and hidden outcome spaces Ω_o and Ω_h, with associated event spaces F_o and F_h, by replacing n by d and n − d, respectively, in the definitions of Ω and F provided previously. The observable and hidden components O^λ and H^λ of a solution X^λ = (O^λ, H^λ) of equations 1.4 and 1.5 take values in Ω_o and Ω_h, respectively. Note that Ω = Ω_o × Ω_h and F is generated by sets of the form A_o × A_h, where A_o ∈ F_o and A_h ∈ F_h. For each path x ∈ Ω, we write x = (x_o, x_h), where x_o ∈ Ω_o and x_h ∈ Ω_h. The process X^λ induces a unique probability distribution Q^λ on the measurable space (Ω, F). Intuitively, Q^λ(A) represents the probability that a diffusion network with parameter λ produces paths in the set A. Since our goal is to learn a probability distribution of observable paths, we need the
probability measure Q_o^λ associated with the observable components alone. If there are no hidden units, d = n and Q_o^λ = Q^λ. If there are hidden units, d < n, and we must marginalize Q^λ over the unobservable components:

$$Q_o^\lambda(A_o) = Q^\lambda(A_o \times \Omega_h) \quad \text{for all } A_o \in F_o. \tag{2.12}$$
We will also need to work with the marginal probability measure of the hidden components:

$$Q_h^\lambda(A_h) = Q^\lambda(\Omega_o \times A_h) \quad \text{for all } A_h \in F_h. \tag{2.13}$$
Intuitively, Q_o^λ(A_o) is the probability that a diffusion network with parameter λ generates observable paths in the set A_o, and Q_h^λ(A_h) is the probability that the hidden paths are in the set A_h. Finally, we set Q = {Q^λ: λ ∈ ℝ^p} and Q_o = {Q_o^λ: λ ∈ ℝ^p}, referring to the entire family of probability measures parameterized by λ and defined on (Ω, F) and (Ω_o, F_o), respectively.
3 Density of Observable Paths

Our goal is to select a value of λ on the basis of training data, such that Q_o^λ best approximates a desired distribution. To describe a maximum likelihood or a Bayesian estimation approach, we need to define probability densities L_o^λ of continuous observable paths. In discrete-time systems, like hidden Markov models, the Lebesgue measure6 is used as the standard reference with respect to which probability densities are defined. Unfortunately, for continuous-time systems, the Lebesgue measure is no longer valid. Instead, our reference measure R will be the probability measure induced by a diffusion network with dispersion σ and initial distribution ν but with no drift,7 so that R(A) represents the probability that such a network generates paths lying in the set A. An important theorem in stochastic calculus, known as Girsanov's theorem,8 tells us how to compute the relative density of processes with the same diffusion term and different drifts (see Oksendal, 1998). Using Girsanov's

6 The Lebesgue measure of an interval (a, b) of ℝ is b − a, the length of that interval.
7 More formally, for each A ∈ F, R(A) = ∫_{ℝ^n} R_u(A) dν(u), where for each u ∈ ℝ^n the measure R_u on (Ω, F) is such that under it the process B = {B(t, x) = (x(t) − x(0))/σ: t ∈ [0, T], x ∈ Ω} is a standard n-dimensional Brownian motion and R_u(x(0) = u) = 1.
8 The conditions on μ mentioned in the introduction are sufficient for Girsanov's theorem to hold.
theorem, it can be shown that

$$L^\lambda(x) = \frac{dQ^\lambda}{dR}(x) = \exp\left\{ \frac{1}{\sigma^2} \int_0^T \mu(x(t),\lambda) \cdot dx(t) - \frac{1}{2\sigma^2} \int_0^T |\mu(x(t),\lambda)|^2\, dt \right\} \tag{3.1}$$

$$= \exp\left\{ \sum_{j=1}^n \left( \frac{1}{\sigma^2} \int_0^T \mu_j(x(t),\lambda)\, dx_j(t) - \frac{1}{2\sigma^2} \int_0^T (\mu_j(x(t),\lambda))^2\, dt \right) \right\}, \tag{3.2}$$
for x ∈ Ω. The integral ∫₀ᵀ μ_j(x(t), λ) dx_j(t) is an Itô stochastic integral (see Oksendal, 1998). Intuitively, we can think of it as the (mean square) limit as Δt → 0 of the sum Σ_{k=0}^{l−1} μ_j(x(t_k), λ)(x_j(t_{k+1}) − x_j(t_k)), where 0 = t₀ < t₁ < ··· < t_l = T are the sampling times, t_{k+1} = t_k + Δt, and Δt > 0 is the sampling period. The term L^λ is a Radon-Nikodym derivative. Intuitively, it represents the likelihood of a diffusion network with parameter λ generating the path x relative to the likelihood for the reference diffusion network with no drift. For a fixed path x ∈ Ω, the term L^λ(x) can be treated as a likelihood function9 of λ. If there are no hidden units, d = n, Q^λ = Q_o^λ, and thus we can take R_o = R and L_o^λ = L^λ. If there are hidden units, more work is required. For the construction here, we impose the condition that the initial probability measure ν is a product measure ν = ν_o × ν_h, where ν_o, ν_h are probability measures on ℝ^d and ℝ^{n−d}, respectively.10 It then follows from the independence of the components of Brownian motion that

$$R(A_o \times A_h) = R_o(A_o)\, R_h(A_h), \tag{3.3}$$

for all A_o ∈ F_o, A_h ∈ F_h, where

$$R_o(A_o) = R(A_o \times \Omega_h), \tag{3.4}$$

$$R_h(A_h) = R(\Omega_o \times A_h). \tag{3.5}$$
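The Riemann-sum reading of the Itô integral gives a direct recipe for approximating the log of the path density in equations 3.1 and 3.2 from a sampled path. A minimal sketch, under an assumed drift signature and step size (these names are illustrative, not the article's code):

```python
import numpy as np

def log_path_likelihood(x, mu, lam, sigma, dt):
    """Discrete-time approximation of log L^lam(x) from equations 3.1-3.2:
    (1/sigma^2) sum_k mu(x_k) . (x_{k+1} - x_k)
      - (1/(2 sigma^2)) sum_k |mu(x_k)|^2 dt.
    x has shape (n_steps+1, n); mu(x_k, lam) returns the drift vector."""
    drift = np.array([mu(xk, lam) for xk in x[:-1]])   # mu(x(t_k), lam)
    dx = np.diff(x, axis=0)                            # x(t_{k+1}) - x(t_k)
    ito_term = np.sum(drift * dx) / sigma**2
    energy_term = np.sum(drift**2) * dt / (2 * sigma**2)
    return ito_term - energy_term

# A fake sampled path (random walk), just to exercise the function.
x = np.random.default_rng(1).standard_normal((50, 2)).cumsum(axis=0)
ll = log_path_likelihood(x, lambda xk, lam: np.zeros_like(xk), lam=None,
                         sigma=1.0, dt=0.01)
```

With zero drift the network coincides with the reference measure R, so the log density is exactly zero — a useful sanity check.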
9 Unfortunately, since R depends on σ, L^λ cannot be treated as a likelihood function of σ. For this reason, estimation of σ needs to be treated differently from estimation of λ for continuous-time problems. We leave the issue of continuous-time estimation of σ for future work.
10 Here, ℝ^d and ℝ^{n−d} are endowed with their Borel sigma algebras, which are generated by the open sets in these spaces.
To find an appropriate density for Q_o^λ, note that for A_o ∈ F_o,

$$Q_o^\lambda(A_o) = Q^\lambda(A_o \times \Omega_h) \tag{3.6}$$

$$= \int_{A_o} \int_{\Omega_h} L^\lambda(x_o, x_h)\, dR_h(x_h)\, dR_o(x_o), \tag{3.7}$$

and therefore the density of Q_o^λ with respect to R_o is

$$L_o^\lambda(x_o) = \frac{dQ_o^\lambda}{dR_o}(x_o) = \int_{\Omega_h} L^\lambda(x_o, x_h)\, dR_h(x_h), \quad \text{for } x_o \in \Omega_o. \tag{3.8}$$
Similarly, the density of Q_h^λ with respect to R_h is as follows:

$$L_h^\lambda(x_h) = \frac{dQ_h^\lambda}{dR_h}(x_h) = \int_{\Omega_o} L^\lambda(x_o, x_h)\, dR_o(x_o), \quad \text{for } x_h \in \Omega_h. \tag{3.9}$$
4 Log-Likelihood Gradients

Let P_o represent a probability measure on (Ω_o, F_o) that we wish to approximate (e.g., a distribution of paths determined by the environment). Our goal is to find a probability measure from the family Q_o that best matches P_o. The hope is that this will provide an approximate model of P_o, the environment, that could be used for tasks such as sequence recognition, sequence generation, or stochastic filtering. We approach this modeling problem by defining a Kullback-Leibler distance (White, 1996) between P_o and Q_o^λ:

$$E_{P_o}\left( \log \frac{L_o}{L_o^\lambda} \right), \tag{4.1}$$

where E_{P_o} is an expected value with respect to P_o and L_o = dP_o/dR_o is the density of the desired probability measure.11 We seek values of λ that minimize equation 4.1. In practice, we estimate such values by obtaining ñ fair sample paths12 {x_o^i}_{i=1}^{ñ} from P_o and seeking values of λ that maximize the following function:

$$W(\lambda) = \left( \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} \log L_o^\lambda(x_o^i) \right) - \Psi(\lambda). \tag{4.2}$$

11 We are assuming that the expectation in equation 4.1 is well defined and finite and in particular that L_o exists, which is also an implicit assumption in finite-dimensional density estimation approaches.
12 Fair samples are samples obtained in an independent and identically distributed manner.
This function is a combination of the empirical log-likelihood and a regularizer Ψ: ℝ^p → ℝ that encodes prior knowledge about desirable values of λ (Bishop, 1995). To simplify the notation, hereafter we present results for ñ = 1 and Ψ(λ) = 0 for all λ ∈ ℝ^p. Generalizing the analysis to ñ > 1 and interesting regularizers is easy but obscures the presentation. The gradient of the log density of the observable paths with respect to λ is of interest to apply optimization techniques such as gradient ascent and the EM algorithm. If there are no hidden units, equation 3.1 gives the density of the observable measure, and differentiation yields13

$$\frac{\partial}{\partial \lambda_i} \log L^\lambda(x) = \frac{1}{\sigma^2} \sum_{j=1}^n \left( \int_0^T \frac{\partial \mu_j(x(t),\lambda)}{\partial \lambda_i}\, dx_j(t) - \int_0^T \frac{\partial \mu_j(x(t),\lambda)}{\partial \lambda_i}\, \mu_j(x(t),\lambda)\, dt \right) \tag{4.3}$$

$$= \frac{1}{\sigma^2} \sum_{j=1}^n \int_0^T \frac{\partial \mu_j(x(t),\lambda)}{\partial \lambda_i}\, dI_j^\lambda(x, t), \quad \text{for } x \in \Omega, \tag{4.4}$$

where σ⁻¹I^λ is a (joint) innovation process (Poor, 1994):

$$I^\lambda(x, t) = x(t) - x(0) - \int_0^t \mu(x(s),\lambda)\, ds. \tag{4.5}$$

Such a process is a standard n-dimensional Brownian motion under Q^λ.
5 Stochastic Teacher Forcing

If there are no hidden units, the likelihood and log-likelihood gradient of paths can be obtained directly via equations 3.1 and 4.4. If there are hidden units, we need to take expected values over hidden paths:

$$L_o^\lambda(x_o) = \int_{\Omega_h} L^\lambda(x_o, x_h)\, dR_h(x_h) = E_{R_h}\big(L^\lambda(x_o, \cdot)\big). \tag{5.1}$$

13 Conditions that are sufficient to justify the differentiation leading to equation 4.4 are that the first and second partial derivatives of μ(u, λ) with respect to λ exist and, together with μ, are continuous in (u, λ) and satisfy a linear growth condition of the form |μ(u, λ)| ≤ K_λ(1 + |u|), where K_λ can be chosen independent of λ whenever λ is restricted to a compact set in ℝ^p (Levanony, Shwartz, & Zeitouni, 1990; Protter, 1990). These conditions are satisfied by the neural network drift (see equation 1.2).
In this article, we propose estimates of this likelihood obtained by adapting a technique known as Monte Carlo importance sampling (Fishman, 1996). Instead of averaging with respect to R_h, we average with respect to another distribution S_h and multiply the variables being averaged by a correction factor known as the importance function. This correction factor guarantees that the new Monte Carlo estimates will remain unbiased. However, by using an appropriate sampling distribution S_h, the estimates may be more efficient than if we just sample from R_h (they may require fewer samples to obtain a desired level of precision). Following equation 2.11, we have that

$$L_o^\lambda(x_o) = \int_{\Omega_h} L^\lambda(x_o, x_h)\, \frac{dR_h}{dS_h}(x_h)\, dS_h(x_h) = E_{S_h}\left( L^\lambda(x_o, \cdot)\, \frac{dR_h}{dS_h} \right), \tag{5.2}$$

where S_h is a fixed distribution on (Ω_h, F_h) for which the density dR_h/dS_h exists. This density thus acts as the desired importance function. We can obtain unbiased Monte Carlo estimates of the expected value in equation 5.2 by averaging over a set of hidden paths H = {h¹, ..., h^m} sampled from S_h:

$$\hat L_o^\lambda(x_o) = \sum_{l=1}^m p^\lambda(x_o, h^l), \tag{5.3}$$

where

$$p^\lambda(x_o, h) = \begin{cases} \dfrac{1}{m}\, L^\lambda(x_o, h)\, \dfrac{dR_h}{dS_h}(h), & \text{for } h \in H, \\[4pt] 0, & \text{else.} \end{cases} \tag{5.4}$$
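Equations 5.3 and 5.4 are importance sampling applied to path space. The mechanics are easier to see in a scalar toy problem (entirely illustrative, not from the article): estimate an expectation under a reference distribution R by sampling from a different distribution S and weighting each sample by the density ratio dR/dS:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference R = N(0, 1); sampling distribution S = N(1, 1).
def r_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def s_pdf(x):
    return np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2 * np.pi)

m = 200_000
h = rng.normal(1.0, 1.0, size=m)       # samples from S
weights = r_pdf(h) / s_pdf(h)          # importance function dR/dS
estimate = np.mean(weights * h**2)     # unbiased estimate of E_R[X^2] = 1
```

The estimator stays unbiased for any S whose density ratio with R exists, but its variance — and hence the number of samples needed — depends strongly on how well S matches the integrand, which is the motivation for the teacher-forcing sampler below.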
We use the gradient of the log of this density estimate for training the network,

$$\nabla_\lambda \log \hat L_o^\lambda(x_o) = \sum_{l=1}^m p_{h|o}^\lambda(h^l \mid x_o)\, \nabla_\lambda \log L^\lambda(x_o, h^l), \tag{5.5}$$

where

$$p_{h|o}^\lambda(h \mid x_o) = \frac{p^\lambda(x_o, h)}{p_o^\lambda(x_o)}, \quad \text{for } h \in \Omega_h, \tag{5.6}$$

and

$$p_o^\lambda(x_o) = \hat L_o^\lambda(x_o) = \sum_{l=1}^m p^\lambda(x_o, h^l). \tag{5.7}$$
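A sketch of how the normalized weights of equations 5.6 and 5.7 turn per-path gradients into the averaged gradient of equation 5.5 (the numbers are made up for illustration):

```python
import numpy as np

# Unnormalized weights p^lam(x_o, h^l) for m = 4 hidden paths, and the
# corresponding joint log-likelihood gradients grad_lam log L^lam(x_o, h^l)
# for a two-dimensional parameter vector.
p_joint = np.array([0.2, 0.05, 0.5, 0.25])
grads = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, -1.0],
                  [-1.0, 3.0]])

p_o = p_joint.sum()            # equation 5.7: estimate of the path likelihood
p_post = p_joint / p_o         # equation 5.6: probability mass over hidden paths
grad_log_Lo = p_post @ grads   # equation 5.5: weighted average of the gradients
```

The normalized weights form a probability mass function over the sampled hidden paths, so the gradient estimate is simply their weighted average.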
Note that Σ_{h∈Ω_h} p_{h|o}^λ(h | x_o) = 1, and thus we can think of p_{h|o}^λ(· | x_o) as a probability mass function on Ω_h. Moreover, ∇_λ log L̂_o^λ(x_o) is the average of the joint log-likelihood gradient ∇_λ log L^λ(x_o, ·) with respect to that probability mass function. While the results presented here are general and work for any sampling distribution for which dR_h/dS_h exists, it is important to find a distribution S_h from which we know how to sample, for which we know dR_h/dS_h, and for which the Monte Carlo estimates are relatively efficient (i.e., do not require a large number of samples to achieve a desired reliability level). An obvious choice is to sample from R_h itself, in which case dR_h/dS_h = 1. In fact, this is the approach we used in previous versions of this article (Mineiro, Movellan, & Williams, 1998), and it was also recently proposed in Solo (2000). The problem with sampling from R_h is that as learning progresses, the Monte Carlo estimates become less and less reliable, and thus there is a need for better sampling distributions that change as learning progresses. We have obtained good results by sampling in a manner reminiscent of the teacher forcing method from deterministic neural networks (Hertz, Krogh, & Palmer, 1991). The idea is to obtain a sample of hidden paths from a network whose observable units have been forced to exhibit the desired observable path x_o. Consider a diffusion network with parameter vector λ_s ∈ ℝ^p, not necessarily equal to λ. The role of this network will be to provide sample hidden paths. For each time t ∈ [0, T], we fix the output units of such a network to x_o(t) and let the hidden units run according to their natural dynamics,

$$dH(t) = \mu_h(x_o(t), H(t), \lambda_s)\, dt + \sigma\, dB_h(t), \tag{5.8}$$

$$H(0) \sim \nu_h,$$

where μ_h is the hidden component of the drift. We repeat this procedure m times to obtain the sample of hidden paths H = {h^l}_{l=1}^m. One advantage of this sampling scheme is that we can use Girsanov's theorem to obtain the desired importance function,

$$\frac{dR_h}{dS_h}(x_h) = \exp\left\{ -\frac{1}{\sigma^2} \int_0^T \mu_h(x_o(t), x_h(t), \lambda_s) \cdot dx_h(t) + \frac{1}{2\sigma^2} \int_0^T |\mu_h(x_o(t), x_h(t), \lambda_s)|^2\, dt \right\}, \tag{5.9}$$

where S_h, the sampling distribution, is now the distribution of hidden paths induced by the network in equation 5.8 with fixed parameter λ_s. If we let
λ_s = λ, then the p^λ function defined in equation 5.4 simplifies as follows:

$$p^\lambda(x_o, h) = \frac{1}{m}\, L^\lambda(x_o, h)\, \frac{dR_h}{dS_h}(h) = \frac{1}{m} \exp\left\{ \frac{1}{\sigma^2} \int_0^T \mu_o(x_o(t), h(t), \lambda) \cdot dx_o(t) - \frac{1}{2\sigma^2} \int_0^T |\mu_o(x_o(t), h(t), \lambda)|^2\, dt \right\}, \quad \text{for } h \in H. \tag{5.10}$$
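A sketch of the teacher-forced sampler: hidden paths are drawn by Euler–Maruyama integration of equation 5.8 with the observable units clamped, and each path is weighted by a discrete-time analogue of equation 5.10. The drift functions, dimensions, and constants below are illustrative assumptions, not the article's model:

```python
import numpy as np

def sample_hidden_paths(x_obs, mu_h, lam, sigma, dt, m, rng):
    """Clamp the observable units to x_obs and run the hidden units by
    Euler-Maruyama on equation 5.8: dH = mu_h(x_o, H, lam) dt + sigma dB_h."""
    n_steps, n_hid = x_obs.shape[0] - 1, 1   # one hidden unit for illustration
    paths = np.zeros((m, n_steps + 1, n_hid))
    for l in range(m):
        for k in range(n_steps):
            drift = mu_h(x_obs[k], paths[l, k], lam)
            paths[l, k + 1] = (paths[l, k] + drift * dt
                               + sigma * np.sqrt(dt) * rng.standard_normal(n_hid))
    return paths

def path_weights(x_obs, hidden_paths, mu_o, lam, sigma, dt):
    """Unnormalized weights of equation 5.10 (lam_s = lam): for each hidden path h,
    p = (1/m) exp{(1/s^2) int mu_o . dx_o - (1/(2 s^2)) int |mu_o|^2 dt}."""
    m = len(hidden_paths)
    dx_o = np.diff(x_obs, axis=0)
    w = np.empty(m)
    for l, h in enumerate(hidden_paths):
        mu = np.array([mu_o(x_obs[k], h[k], lam) for k in range(len(dx_o))])
        log_w = (np.sum(mu * dx_o) - 0.5 * np.sum(mu**2) * dt) / sigma**2
        w[l] = np.exp(log_w) / m
    return w

# Illustrative drifts: leaky hidden unit; observable drift pulled toward it.
rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 1.0, 21).reshape(-1, 1)   # clamped observable path
hid = sample_hidden_paths(x_obs, lambda xo, h, lam: -h, None, 0.5, 0.05, 5, rng)
w = path_weights(x_obs, hid, lambda xo, h, lam: h - xo, None, 0.5, 0.05)
```

Note that the weight involves only the observable component of the drift evaluated along the clamped observable path, which is what makes the λ_s = λ simplification attractive in practice.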
6 Monte Carlo EM Learning

Given a fixed sampling distribution S_h and a fixed set of hidden paths H = {h¹, ..., h^m} sampled from S_h, our objective is to find values of λ that maximize the estimate of the likelihood L̂_o^λ(x_o). We can search for such values using standard iterative procedures, like gradient ascent or conjugate gradient. Another iterative approach of interest is the EM algorithm (Dempster, Laird, & Rubin, 1977). On each iteration of the EM algorithm, we start with a fixed parameter vector λ̄ and search for values of λ that optimize the following expression:14

$$M(\bar\lambda, \lambda, x_o) = \sum_{l=1}^m p_{h|o}^{\bar\lambda}(h^l \mid x_o) \log p^\lambda(x_o, h^l). \tag{6.1}$$

Note that since Σ_{l=1}^m p_{h|o}^{λ̄}(h^l | x_o) = 1 and p_o^λ(x_o) = p^λ(x_o, h^l)/p_{h|o}^λ(h^l | x_o), then

$$\log \hat L_o^\lambda(x_o) = \log p_o^\lambda(x_o) = \sum_{l=1}^m p_{h|o}^{\bar\lambda}(h^l \mid x_o) \log \frac{p^\lambda(x_o, h^l)}{p_{h|o}^\lambda(h^l \mid x_o)} \tag{6.2}$$

$$= M(\bar\lambda, \lambda, x_o) - \sum_{l=1}^m p_{h|o}^{\bar\lambda}(h^l \mid x_o) \log p_{h|o}^\lambda(h^l \mid x_o), \tag{6.3}$$

and

$$\log \hat L_o^\lambda(x_o) - \log \hat L_o^{\bar\lambda}(x_o) = M(\bar\lambda, \lambda, x_o) - M(\bar\lambda, \bar\lambda, x_o) + KL\big(p_{h|o}^{\bar\lambda}(\cdot \mid x_o),\, p_{h|o}^{\lambda}(\cdot \mid x_o)\big), \tag{6.4}$$

14 If a regularizer is used, redefine p^λ(x_o, h) in equation 5.4 as L^λ(x_o, h) (dR_h/dS_h)(h) exp(−Ψ(λ)).
where KL stands for the Kullback-Leibler distance between the functions p_{h|o}^{λ̄}(· | x_o) and p_{h|o}^λ(· | x_o). Since KL is nonnegative, it follows that if M(λ̄, λ, x_o) > M(λ̄, λ̄, x_o), then L̂_o^λ(x_o) > L̂_o^{λ̄}(x_o). Since this adaptation of the EM algorithm maximizes an estimate of the log likelihood instead of the log likelihood itself, we refer to it as stochastic EM. On each iteration of the stochastic EM procedure, we find a value of λ that maximizes M(λ̄, λ, x_o), and we let λ̄ take that value. The procedure guarantees that the estimate of the likelihood of the observed path, L̂_o^{λ̄}(x_o), will increase or stay the same; when it stays the same, we have converged. Our approach guarantees convergence only when the sampling distribution S_h and the sample of hidden paths H are fixed. In practice, we have found that even with relatively small sample sizes, it is beneficial to change S_h and H as learning progresses. For instance, in the approach proposed at the end of section 7, we change the sampling distribution and the sample paths after each iteration of the EM algorithm. While we have not carefully analyzed the properties of this approach, we believe the reason that it works well is that as learning progresses, it uses sampling distributions S_h that are better suited for the values of λ that EM moves into.

7 The Neural Network Case

Up to now we have presented the learning algorithm in a general manner applicable to generic SDE models. In this section, we show how the algorithm applies to diffusion neural networks. Let w_sr be the component of λ representing the synaptic weight (conductance) from the sending unit s to the receiving unit r. For the neural network drift, we have

$$\frac{\partial \mu_j(x(t),\lambda)}{\partial w_{sr}} = \kappa_j \sum_{i=1}^n \frac{\partial}{\partial w_{sr}} \varphi(x_i(t))\, w_{ij} = \delta_{jr}\, \kappa_r\, \varphi(x_s(t)), \quad \text{for } x \in \Omega, \tag{7.1}$$
where δ is the Kronecker delta function (δ_jr = 1 if j = r, 0 else) and x_i(t) stands for the ith component of x (i.e., the activation of unit i at time t in path x).

7.1 Networks Without Hidden Units. Combining equations 4.4 and 7.1, we get

$$\frac{\partial \log L^\lambda(x)}{\partial w_{sr}} = \frac{1}{\sigma^2} \sum_{j=1}^n \int_0^T \frac{\partial \mu_j(x(t),\lambda)}{\partial w_{sr}}\, dI_j^\lambda(x, t) \tag{7.2}$$

$$= \frac{1}{\sigma^2} \sum_{j=1}^n \int_0^T \delta_{jr}\, \kappa_r\, \varphi(x_s(t))\, dI_j^\lambda(x, t) \tag{7.3}$$

$$= \frac{\kappa_r}{\sigma^2} \int_0^T \varphi(x_s(t))\, dI_r^\lambda(x, t), \tag{7.4}$$
where σ⁻¹I_r^λ is the innovation process of the receiving unit, that is,

$$dI_r^\lambda(x, t) = dx_r(t) - \mu_r(x(t),\lambda)\, dt \tag{7.5}$$

$$= dx_r(t) + \kappa_r \rho_r x_r(t)\, dt - \kappa_r \xi_r\, dt - \kappa_r \sum_{i=1}^n \varphi(x_i(t))\, w_{ir}\, dt.$$
Combining equations 7.4 and 7.5, we get

$$\frac{\partial \log L^\lambda(x)}{\partial w_{sr}} = \frac{\kappa_r^2}{\sigma^2} \left( b_{sr}(x) - \sum_{i=1}^n a_{si}(x)\, w_{ir} \right), \tag{7.6}$$

where

$$b_{sr}(x) = \frac{1}{\kappa_r} \int_0^T \varphi(x_s(t))\, dx_r(t) + \rho_r \int_0^T \varphi(x_s(t))\, x_r(t)\, dt - \xi_r \int_0^T \varphi(x_s(t))\, dt, \tag{7.7}$$

$$a_{si}(x) = \int_0^T \varphi(x_s(t))\, \varphi(x_i(t))\, dt. \tag{7.8}$$
Let ∇_w log L^λ(x) be an n × n matrix with cell i, j containing the derivative of log L^λ(x) with respect to w_ij for i, j = 1, ..., n. It follows that

$$\nabla_w \log L^\lambda(x) = \frac{1}{\sigma^2} \big( b(x) - a(x)\, w \big)\, K^2, \tag{7.9}$$
where K is a diagonal matrix with diagonal elements κ₁, ..., κ_n. Note that all the terms involved in the computation of the gradient of the log likelihood are Hebbian; they involve time integrals of pairwise activation products. Also note that each cell of the matrix a is an inner product in the Hilbert space of square-integrable functions. This may open avenues for generalizing standard kernel methods (Aizerman et al., 1964; Burges, 1998) for learning distributions of sequences. The matrix a(x) is positive semidefinite. To see why, let v ∈ ℝⁿ, set y(t) = Σ_{j=1}^n v_j φ(x_j(t)) for all t ∈ [0, T], and note that

$$\int_0^T y(t)^2\, dt = v' a v \geq 0. \tag{7.10}$$
Thus, if a(x) is invertible, there is a unique maximum for the log-likelihood function. The maximum likelihood estimate of w follows:

$$\hat w = (a(x))^{-1}\, b(x). \tag{7.11}$$
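For a fully observed path, the Hebbian statistics of equations 7.7 and 7.8 and the closed-form estimate of equation 7.11 can be computed directly from a discretized path. A sketch with a single scalar κ, ρ, ξ shared across units for brevity (the article allows per-unit values, and the path here is fake data):

```python
import numpy as np

def phi(u):
    return 1.0 / (1.0 + np.exp(-u))    # logistic activation

def hebbian_ab(x, kappa, rho, xi, dt):
    """Discrete-time versions of equations 7.7 and 7.8 for a sampled path x
    of shape (n_steps+1, n). Returns the n x n matrices a and b."""
    f = phi(x[:-1])                           # phi(x(t_k)), shape (n_steps, n)
    dx = np.diff(x, axis=0)
    a = f.T @ f * dt                          # a_si = int phi_s phi_i dt
    b = (f.T @ dx) / kappa                    # (1/kappa) int phi_s dx_r
    b += rho * (f.T @ x[:-1]) * dt            # + rho int phi_s x_r dt
    b -= xi * f.sum(axis=0)[:, None] * dt     # - xi int phi_s dt
    return a, b

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 3)).cumsum(axis=0) * 0.1   # fake sampled path
a, b = hebbian_ab(x, kappa=1.0, rho=1.0, xi=0.0, dt=0.01)
w_hat = np.linalg.solve(a, b)                # equation 7.11: w = a^{-1} b
```

As the text notes, both matrices are built entirely from time integrals of pairwise activation products, and a is symmetric positive semidefinite by construction.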
A similar procedure can be followed to find the maximum likelihood estimates of the other parameters:

$$\hat\xi_r = \frac{(x_r(T) - x_r(0))\,\kappa_r^{-1} + \int_0^T \left( \rho_r x_r(t) - \sum_{j=1}^n w_{jr}\, \varphi(x_j(t)) \right) dt}{T}, \tag{7.12}$$

$$\hat\rho_r = \frac{\xi_r \int_0^T x_r(t)\, dt + \sum_{j=1}^n \int_0^T \varphi(x_j(t))\, w_{jr}\, x_r(t)\, dt - \kappa_r^{-1} \int_0^T x_r(t)\, dx_r(t)}{\int_0^T x_r(t)^2\, dt}, \tag{7.13}$$

$$\hat\kappa_r = \frac{\int_0^T \bar\mu_r(x(t),\lambda)\, dx_r(t)}{\int_0^T \bar\mu_r(x(t),\lambda)^2\, dt}, \quad \text{for } r = 1, \ldots, n, \tag{7.14}$$

where μ̄_r(x(t), λ) = μ_r(x(t), λ)/κ_r, which is not a function of κ_r. Equations 7.11 through 7.14 maximize a parameter or set of parameters assuming the other parameters are fixed.
7.2 Networks with Hidden Units. If there are hidden units, we use the stochastic EM approach presented in section 6. Given an observable path x_o ∈ Ω_o, we obtain a fair sample of hidden paths H_m = {h¹, ..., h^m} from a sampling distribution S_h. The gradient of the log-likelihood estimate with respect to w is as follows:

$$\nabla_w \log \hat L_o^\lambda(x_o) = \frac{1}{\sigma^2} \big( \tilde b^\lambda(x_o) - \tilde a^\lambda(x_o)\, w \big)\, K^2, \tag{7.15}$$

where

$$\tilde b^\lambda(x_o) = \sum_{l=1}^m p_{h|o}^\lambda(h^l \mid x_o)\, b(x_o, h^l), \tag{7.16}$$

$$\tilde a^\lambda(x_o) = \sum_{l=1}^m p_{h|o}^\lambda(h^l \mid x_o)\, a(x_o, h^l), \tag{7.17}$$
and a, b are given by equations 7.7 and 7.8. Note that in this case the ã and b̃ coefficients depend on w in a nonlinear manner in general, even if the activation function φ is linear. We can find values of w for which the gradient vanishes by using standard iterative procedures like gradient ascent or conjugate gradient. The situation simplifies when the EM algorithm is used. Let λ̄ and λ represent the parameter vectors of two n-dimensional diffusion networks that have the same values for the ξ, κ, and ρ terms. Let w̄ and w represent the connectivity matrices of these two networks. Following the argument presented in section 6, we seek values of w that maximize M(λ̄, λ, x_o). To do so, we first find the gradient with respect to w:

$$\nabla_w M(\bar\lambda, \lambda, x_o) = \frac{1}{\sigma^2} \big( \tilde b^{\bar\lambda}(x_o) - \tilde a^{\bar\lambda}(x_o)\, w \big)\, K^2. \tag{7.18}$$
The matrix ã^{λ̄}(x_o) is an average of positive semidefinite matrices, and thus it is also positive semidefinite. Thus, if ã^{λ̄} is invertible, we can directly maximize M(λ̄, λ, x_o) with respect to w by setting the gradient to zero and solving for w. The solution,

$$\hat w = \big( \tilde a^{\bar\lambda}(x_o) \big)^{-1}\, \tilde b^{\bar\lambda}(x_o), \tag{7.19}$$
bij ( l) D
1
kj
Z
T 0
¡ jjl
Z
Q ( xli ( t) ) dxjl ( t) C rj
Z
T 0
T 0
(7.21)
Q ( xli ( t) ) xjl ( t) dt
Q (xli ( t) ) dt, for i, j D 1, . . . , n.
(7.22)
5. Compute the averaged matrices ã and b̃:

$$\tilde a = \frac{\sum_{l=1}^m p(l)\, a(l)}{\sum_{l=1}^m p(l)}, \tag{7.23}$$

$$\tilde b = \frac{\sum_{l=1}^m p(l)\, b(l)}{\sum_{l=1}^m p(l)}. \tag{7.24}$$

6. Update the connectivity matrix:

$$\bar w = \tilde a^{-1}\, \tilde b. \tag{7.25}$$
7. Go back to step 2 using the new value of w̄.

8 Comparison to Previous Work

We use diffusion networks as a way to parameterize distributions of continuous-time varying signals, and we have proposed methods for finding local maxima of a likelihood estimate. This process is known as learning in the neural network literature, system identification in the engineering literature, and parameter estimation in the statistical literature. The main motivation for the diffusion network approach is to combine the versatility of recurrent neural networks (Pearlmutter, 1995) with the well-known advantages of stochastic models (Oksendal, 1998). Thus, our work is closely related to the literature on continuous-time recurrent neural networks and the literature on stochastic filtering.

8.1 Continuous-Time Recurrent Neural Networks. We use a recurrent neural network drift function and allow full interconnectivity between hidden and observable units, as is standard in neural network applications. In recurrent neural networks, the dynamics are deterministic (σ = 0), while in diffusion networks they are not (σ > 0). Learning algorithms for recurrent neural networks typically find values of the parameter vector λ that minimize a mean squared error of the following form (Pearlmutter, 1995),

$$W(\lambda) = \int_0^T (o(t, \lambda) - r(t))^2\, dt, \tag{8.1}$$
where r is a desired path we want the network to learn and o(·, λ) is the unique observable path produced by a recurrent neural network with parameter vector λ. Since the optimal properties of maximum-likelihood methods are well understood mathematically, it is of interest to find under what conditions the solutions found by minimizing equation 8.1 are maximum-likelihood estimates. This will give us a sense of the likelihood model implicitly used by standard learning algorithms for deterministic neural networks.
For a given deterministic neural network with parameter vector λ, we define a stochastic process O^λ by adding white noise to the unique observable path o(·, λ) produced by that network:

$$O^\lambda(t) = o(t, \lambda) + \sigma W(t). \tag{8.2}$$

To obtain a mathematical interpretation of equation 8.2, we introduce

$$Z^\lambda(t) = \int_0^t O^\lambda(s)\, ds, \tag{8.3}$$

which is thus governed by the following SDE,

$$dZ^\lambda(t) = o(t, \lambda)\, dt + \sigma\, dB(t), \tag{8.4}$$
where B is standard Brownian motion. Note that if Z^λ is known, O^λ is known, and vice versa, so no information is lost by working with Z^λ instead of O^λ. To find the likelihood of a path o given a network with parameter vector λ, we first compute the integral trajectory z:

$$z(t) = \int_0^t o(s)\, ds. \tag{8.5}$$
Using Girsanov's theorem, the likelihood of z, given that it is generated as a realization of the SDE model in equation 8.4, can be shown to be as follows:

$$L^\lambda(z) = \exp\left\{ \frac{1}{\sigma^2} \int_0^T o(t, \lambda)\, dz(t) - \frac{1}{2\sigma^2} \int_0^T o(t, \lambda)^2\, dt \right\}. \tag{8.6}$$

Considering that dz(t) = o(t) dt, it is easy to see that maximizing L^λ(z) with respect to λ is equivalent to minimizing W(λ) of equation 8.1. Thus, standard continuous-time neural network learning algorithms can be seen as performing maximum likelihood estimation with respect to the stochastic process defined in equation 8.2. In other words, the generative model underlying those algorithms consists of adding white noise to the unique observable trajectory generated by a network with deterministic dynamics. Note that this is an extremely poor generative model. In particular, this model cannot handle bifurcations and time warping, two common sources of variability in natural signals. Instead of adding white noise to a deterministic path, in diffusion networks we add white noise to the activation increments; that is, the activations are governed by an SDE of the form

$$dX^\lambda(t) = \mu(X(t), \lambda)\, dt + \sigma\, dB(t). \tag{8.7}$$
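The bifurcation argument illustrated in Figure 3 can be checked numerically. Below, a single unit with an illustrative double-well drift μ(x) = x − x³ (wells at ±1, unstable point at 0; these constants are assumptions for the demo) is simulated from the origin. With noise in the dynamics, as in equation 8.7, the paths split between the two wells and the activation distribution is bimodal; the deterministic network started at the origin never moves, so output noise as in equation 8.2 could only smear that point into a single gaussian bump:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dt, sigma, n_paths = 20.0, 0.01, 0.3, 200
steps = int(T / dt)

# Euler-Maruyama with noise in the dynamics: dX = (x - x^3) dt + sigma dB.
x = np.zeros(n_paths)                  # all paths start at the origin
for _ in range(steps):
    x += (x - x**3) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

# The deterministic dynamics from the origin stay there (drift at 0 is 0),
# whereas the stochastic dynamics commit each path to one of the two wells.
frac_right = np.mean(x > 0.5)
frac_left = np.mean(x < -0.5)
```

By symmetry roughly half of the paths end near each well, which is exactly the bimodality that the output-noise model of equation 8.2 cannot express.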
Figure 3: An illustration of the different effects of adding noise to the output of a deterministic network versus adding it to the network dynamics. See the text for an explanation.
This results in a much richer and more realistic likelihood model. Due to the probabilistic nature of the state transitions, the approach is resistant to time warping, for the same reason hidden Markov models are. In addition, the approach can also handle bifurcations. Figure 3 illustrates this point with a simple example. The figure shows a single-neuron model in which the drift is the negative gradient of an energy function with two wells. If we let the network start at the origin and there is no internal noise, the activation of the neuron will be constant, since the drift at the origin is zero. Adding white noise to this constant activation will result in a unimodal gaussian distribution centered at the origin. However, if we add noise to the activation dynamics, the neuron will bifurcate, sometimes moving toward the left well, sometimes toward the right well. This will result in a bimodal distribution of activations. Diffusion networks were originally formulated in Movellan and McClelland (1993). Movellan (1994) presented an algorithm to train diffusion networks to approximate expected values of sequence distributions. Movellan (1998) presented algorithms for training equilibrium probability densities. Movellan and McClelland (2001) showed the relationship between diffusion networks and classical psychophysical models of information integration. Mineiro et al. (1998) presented an algorithm for sequence density estimation with diffusion networks. In that article, we used a gradient-descent approach instead of the EM approach presented here, and we sampled from
the reference distribution R_h instead of the distribution of hidden states with clamped observations. These two modifications greatly increased learning speed and made possible the use of diffusion networks on realistic problems involving thousands of parameters.

8.2 Stochastic Filtering. Our work is also closely related to the literature on stochastic filtering and systems identification. As mentioned in section 1.3, if the logistic activation function φ is replaced by a linear function, the weights between the observable units and from the observable to the hidden units are set to zero, the ρ terms are set to zero, and the probability distribution of the initial states is constrained to be gaussian, diffusion networks have the same dynamics as the continuous-time Kalman-Bucy filter (Kalman & Bucy, 1961). While the usefulness of the Kalman-Bucy filter is widely recognized, its limitations (due to the linear and gaussian assumptions) have become clear. For this reason, many extensions of the Kalman filter have been proposed (Lewis, 1986; Fahrmeir, 1992; Meinhold & Singpurwalla, 1975; Sage & Melsa, 1971; Kitagawa, 1996). This article contributes to that literature by proposing a new approach for training partially observable SDE models. An important difference between our work and stochastic filtering approaches is that stochastic filtering is generally restricted to models of the following form:

$$dH(t) = \mu_h(H(t))\, dt + \sigma\, dB_h(t), \tag{8.8}$$

$$dO(t) = \mu_o(H(t))\, dt + \sigma\, dB_o(t). \tag{8.9}$$
In our case, this restriction would require setting to zero the coupling between observable units and the feedback coupling from observable units to hidden units. While this restriction simplifies the mathematics of the problem, having a general algorithm that does not require it is important for the following reasons: (1) such a restriction is not commonly used in the neural network literature; (2) when modeling biological neural networks, such a restriction is not realistic; (3) in many physical processes, the observations are coupled by inertia and damping processes (e.g., bones and muscles with large damping properties are involved in the production of observed motor sequences); and (4) in practice, the existence of connections between observable units results in a much richer set of distribution models. Iterative solutions for estimation of drift parameters of partially observable SDE models are well known in the statistics and stochastic filtering literatures. However, most current approaches are computationally very expensive, do not allow training fully coupled systems, and have been used only for problems with very few parameters. Campillo and Le Gland (1989) present an approach that involves numerical solution of stochastic partial differential equations; it has been applied to models with only a handful of parameters. Shumway and Stoffer (1982) present an exact EM approach for discrete-time linear systems; however, the approach does not
1530
Javier R. Movellan, Paul Mineiro, and R. J. Williams
generalize to nonlinear systems. Ljung (1999) presents a general approach for discrete-time systems with nonlinear drifts. However, the approach requires the inversion of very large matrices (of order the length of the training sequence times the number of states). Ghahramani and Roweis (1999) present an ingenious method to learn discrete-time dynamical models with arbitrary drifts. The approach approximates nonlinear drift functions using gaussian radial basis functions. While promising, the approach has been shown to work on only a single-parameter problem. Our work is closely related to Kitagawa (1996), the work by Blake and colleagues (North & Blake, 1998; Blake, North, & Isard, 1999), and Solo (2000). Kitagawa (1996) presents a discrete-time Monte Carlo method for general-purpose nonlinear filtering and smoothing. While Kitagawa did not experiment with maximum likelihood estimation of drift parameters, he mentioned the possibility of doing so. Blake et al. (1999) independently presented a Monte Carlo EM algorithm at about the same time we published our earlier work on diffusion networks (Mineiro et al., 1998). The main differences between their approach and ours are as follows: (1) they work in discrete time while we work in continuous time; (2) they require that there be no coupling between observation units and no feedback coupling from observations to hidden units, whereas we do not have such requirements; and (3) they sample from an approximation to the distribution of hidden units given observable paths, as proposed by Kitagawa (1996). Instead, we sample from the distribution of hidden units with clamped observations and use corrections to obtain unbiased estimates of the likelihood gradients. Solo (2000) presents a simulation method for generating approximate likelihood functions for partially observable SDE models and for finding approximate maximum likelihood estimators.
His approach can be seen as a discrete-time version of the method we independently proposed in Mineiro et al. (1998). The approach we propose here generalizes those of Mineiro et al. (1998) and Solo (2000): we incorporate the use of importance sampling in continuous time and do not need to restrict the observation units to be uncoupled from each other or the hidden units to be uncoupled from the observation units.
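The clamp-and-correct idea — sampling hidden paths with clamped observations and reweighting to obtain unbiased estimates — is an instance of self-normalized importance sampling. A toy sketch (the gaussian target and proposal densities are placeholders standing in for the network's path distributions, not the paper's actual likelihoods):

```python
import math
import random

random.seed(0)

def target_logpdf(h):
    # toy stand-in for log p(h | observed path); an N(1, 1) density up to a constant
    return -0.5 * (h - 1.0) ** 2

def proposal_logpdf(h):
    # toy stand-in for the "clamped observations" sampling distribution, N(0, 1)
    return -0.5 * h ** 2

def estimate_mean(m=20000):
    """Self-normalized importance-sampling estimate of the target mean."""
    hs = [random.gauss(0.0, 1.0) for _ in range(m)]        # draws from proposal
    logw = [target_logpdf(h) - proposal_logpdf(h) for h in hs]
    wmax = max(logw)
    w = [math.exp(lw - wmax) for lw in logw]               # stabilized weights
    return sum(wi * hi for wi, hi in zip(w, hs)) / sum(w)

print(estimate_mean())   # close to 1.0, the target mean
```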
9 Simulations

Sample source code for the simulations presented here can be found at J. R. M.'s Web site (mplab.ucsd.edu). In practice, lacking hardware implementations of diffusion networks, we model them on digital computers using discrete-time approximations. While more sophisticated techniques are available for simulating SDEs on digital computers (Kloeden & Platen, 1992; Karandikar, 1995), we opted to start with simple first-order Euler approximations.
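A minimal sketch of such a first-order Euler (Euler-Maruyama) scheme, together with a discrete log-likelihood sum of the form used in equation 9.2 below; the one-unit drift function and parameter values here are illustrative, not the trained networks of the text:

```python
import math
import random

random.seed(0)

def drift(x):
    # illustrative scalar drift: a saturating pull plus linear decay
    return math.tanh(x) - 0.5 * x

def euler_path(x0=0.0, sigma=0.2, dt=0.01, steps=1000):
    """First-order Euler approximation of dX = drift(X) dt + sigma dB."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x + drift(x) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
        path.append(x)
    return path

def discrete_loglik(path, sigma=0.2, dt=0.01):
    """Discrete-time sum approximating the path log likelihood
    (same form as the sum in equation 9.2 of the text)."""
    s2 = sigma ** 2
    total = 0.0
    for k in range(len(path) - 1):
        m = drift(path[k])
        total += m * (path[k + 1] - path[k]) / s2 - 0.5 * m * m * dt / s2
    return total

path = euler_path()
print(len(path))   # 1001 points: the initial state plus one per Euler step
```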
Consider the term $\hat{L}^{\lambda}_o(x_o)$ in equation 5.3. It requires, among other things, computing integrals of the form

$$\log L^{\lambda}(x) = \frac{1}{\sigma^2}\int_0^T \mu(x(t),\lambda)\cdot dx(t) \;-\; \frac{1}{2\sigma^2}\int_0^T |\mu(x(t),\lambda)|^2\,dt, \qquad \text{for } x \in \Omega. \tag{9.1}$$

In practice, we approximate such integrals using the following sum:

$$\frac{1}{\sigma^2}\sum_{k=0}^{s-1} \mu(x(t_k),\lambda)\cdot\big(x(t_{k+1})-x(t_k)\big) \;-\; \frac{1}{2\sigma^2}\sum_{k=0}^{s-1} |\mu(x(t_k),\lambda)|^2\,\Delta t, \tag{9.2}$$

where $0 = t_0 < t_1 < \cdots < t_s = T$ are the sampling times, $t_{k+1} = t_k + \Delta t$, and $\Delta t > 0$ is the sampling period. The Monte Carlo approach proposed here requires generating sample hidden paths $\{h_l\}_{l=1}^{m}$ from a network with the observable units clamped to the path $x_o$. We obtain these sample paths by simulating a diffusion network in discrete time as follows:

$$H(t_{k+1}) = H(t_k) + \mu_h\big(x_o(t_k), H(t_k), \lambda\big)\,\Delta t + \sigma Z_k \sqrt{\Delta t}, \tag{9.3}$$
$H(0) \sim \nu_h$, where $Z_1, \ldots, Z_s$ are independent $(n-d)$-dimensional gaussian random vectors with zero mean and covariance equal to the identity matrix.

9.1 Example 1: Learning to Bifurcate. This simulation illustrates the capacity of diffusion networks with nonlinear activation functions to learn bifurcations. This property is important for practical applications because it allows these networks to entertain multiple distinct hypotheses about the state of the world. Standard deterministic neural networks and linear stochastic networks like Kalman filters cannot learn bifurcations. A diffusion network with an observable unit and a hidden unit was trained on the distribution of paths shown in Figure 4. The top of the figure shows there were four equally probable desired paths, organized as a double bifurcation. The network was trained for 60 iterations of the EM algorithm, with 10 sample hidden paths per iteration. The bottom row shows the distribution learned by a standard recurrent neural network. As expected, the network learned the average path—a constant zero output. The trajectories displayed in the figure were produced by estimating the variance of the desired outputs from the learned trajectory and adding gaussian noise of that estimated variance to the output learned by the network. The second row of Figure 4 shows 48 sample paths produced by the diffusion network after training. While not perfect, the network learned to
Figure 4: (Top row) The double bifurcation defines a desired path distribution with four equally probable paths. (Second row) Observable unit sequences obtained after training a diffusion network. (Third row) Hidden unit sequences of the trained diffusion network. (Bottom row) Sequences obtained after training a standard recurrent neural network and adding an optimal amount of noise to the output of that network.
bifurcate twice and produced a reasonable approximation to the desired distribution. We were surprised that this problem was learnable with a single hidden unit, until we saw the ingenious solution learned by the network (see Table 1). Basically, the network learned to toss two binary random variables sequentially and compute their sum. The observable unit learned a relatively small time constant and, if disconnected from the hidden unit, bifurcated once, with each branch taking approximately equal probability. The hidden unit had a faster time constant and learned to bifurcate unaffected by the observable unit. Thus, the hidden unit basically behaved as a Bernoulli random bias for the observable unit. The result was a double bifurcation of the observable unit, as desired. Figure 5 shows four example paths for the observable and hidden units.

9.2 Example 2: Learning a Beating Heart. In this section we show the capacity of diffusion networks to learn natural oscillations. The task was
Figure 5: Four sample paths learned in the bifurcation problem. Solid lines represent observable unit paths, and dashed lines represent hidden unit paths.

Table 1: Parameters Learned for the Double Bifurcation Problem.

w_oo = 4.339    w_oh = −0.008    ξ_o = −0.005    κ_o = 0.429
w_ho = 2.303    w_hh = 0.5912    ξ_h =  0.000    κ_h = 1.175
Notes: All the weights and biases were initialized to zero and the κ terms to 1. The fixed parameters were set to the following values: θ₁ = −1, θ₂ = 2, θ₃ = 7, σ = 0.2.
learning the expansion and contraction sequence of a beating heart. Once a distribution of normal sequences is learned, the network can be used for tracking sequences of moving heart images (Jacob, Noble, & Blake, 1998) or to detect the presence of irregularities. Jacob et al. (1998) tracked the contour of a beating heart from a sequence of ultrasound images and applied principal component analysis (PCA) to the resulting sequence of contours. Figure 6 shows the first principal component coefficient as a function of time for a typical sequence of contours. North and Blake (1998) compared two discrete-time models of the heart expansion sequence: (1) a linear model with time delays and no hidden units and (2) a linear model with time delays and hidden units. The first model was defined
by the following stochastic difference equation,

$$O(t+1) = O(t) + \lambda_1 + \lambda_2 O(t) + \lambda_3 O(t-1) + \sigma_1 Z_1(t), \tag{9.4}$$
where $O$ represents the observed heart expansion, $Z_1(1), Z_1(2), \ldots$ is a sequence of independent and identically distributed gaussian random variables with zero mean and unit variance, and $\lambda_1, \lambda_2, \lambda_3 \in \mathbb{R}$, $\sigma_1 > 0$ are adaptive parameters. The second model had the following form,

$$H(t+1) = H(t) + \lambda_1 + \lambda_2 H(t) + \lambda_3 H(t-1) + \sigma_1 Z_1(t), \tag{9.5}$$

$$O(t) = H(t) + \sigma_2 Z_2(t), \tag{9.6}$$
where $Z_2(1), Z_2(2), \ldots$ is a sequence of zero-mean, unit-variance, independent gaussian random variables, and $\sigma_2 > 0$ is also adaptive. The two models were trained on the heart expansion sequence displayed in Figure 6. After training, the two models were run with σ = 0 to see the kind of process they had learned. The two linear models learned to oscillate at the correct frequency, but their motion was too damped (see Figure 6). It should be noted that both models are perfectly capable of learning undamped oscillations; they just could not learn this particular kind of oscillation. Unfortunately, the original heart expansion data are not available. Instead, we recovered the data by digitizing a figure from the original article and automatically sampling it at 872 points. We tried the two linear systems on these data and obtained results identical to those reported in Jacob et al. (1998), so we feel the digitization process worked well. We then trained a diffusion network with one observation unit and one hidden unit. We did not include time delays and instead allowed the network to develop its own delayed representations by fully coupling the observation and hidden units. The network was trained using 60 iterations of the stochastic EM algorithm, with 10 hidden samples per iteration. Table 2 shows the parameters learned by the network. Figure 6 shows a typical sample path produced by the trained network. Note that the network did not exhibit the dampening problem found with the linear models. We also tried a diffusion network with one observation unit and one hidden unit, but with linear activation functions instead of logistic. The result was again a very damped process, like the one depicted in Figure 6. It appears that the saturating nonlinearity of diffusion networks was beneficial for learning this task.

9.3 Example 3: Learning to Read Lips.
In this section, we report on the use of diffusion networks with thousands of parameters for a sequence classification task involving a body of realistic data. We compare a diffusion network approach with the best hidden Markov model approaches published in the literature for this task. The question at hand is whether diffusion networks may be a viable alternative to hidden Markov models for sequence recognition problems. The main difference between diffusion networks and
Figure 6: (Top left) The evolution of the first PCA coefficient for a beating heart. (Top right) The path learned by a second-order linear model with no hidden units. (Bottom left) The path learned by a second-order linear model with hidden units. (Bottom right) Typical path learned by a diffusion network.
Table 2: Parameters Learned for the Heart Beat Problem.

w_oo =  1.280    w_oh = 0.673    ξ_o = −0.143    κ_o = 0.709
w_ho = −1.218    w_hh = 0.088    ξ_h =  0.001    κ_h = 1.087
Notes: All the weights and biases were initialized to zero and the κ terms to 1. The fixed parameters were set to the following values: θ₁ = −1, θ₂ = 2, θ₃ = 7, σ = 0.2.
hidden Markov models is the nature of the hidden states: diffusion networks use continuous-state representations, while hidden Markov models use discrete-state representations.¹⁵ It is possible that continuous-state representations may be beneficial for modeling some natural sequences.

¹⁵ While HMMs can use continuous observations, the hidden states are always discrete.
Our approach to sequence recognition using diffusion networks is similar to the approach used in the hidden Markov model literature. First, several diffusion networks are independently trained with samples of sequences from each of the categories at hand. For example, if we want to discriminate between c categories of image sequences, we would first train c different diffusion networks. The first network would be trained with examples of category 1, the second network with examples of category 2, and the last network with examples of category c. This training process results in c values of the parameter λ, each of which has been optimized to represent a different category. We represent these values as $\lambda_1^*, \ldots, \lambda_c^*$. Once the networks are trained, we can classify a new observed sequence $x_o$ as follows: we compute $\log \hat{L}^{\lambda_i^*}_o(x_o)$ for $i = 1, \ldots, c$. These log likelihoods are combined with the log prior probability of each category, and the most probable category of the sequence is chosen.

9.3.1 Training Database. We used Tulips1 (Movellan, 1995), a database consisting of 96 movies of nine male and three female undergraduate students from the Cognitive Science Department at the University of California, San Diego. For each student, two sample utterances were taken for each of the digits "one" through "four" (see Figure 7). The database is challenging due to variability in illumination, gender, and ethnicity of the subjects, and in the position and orientation of the lips. The database is available at J. R. M.'s Web site.

9.3.2 Visual Processing. We used a 2 × 2 factorial experimental design to explore the performance of two different image processing techniques (contours and contours plus intensity) in combination with two different recognition engines (hidden Markov models and diffusion networks). The image processing was performed by Luettin, Thacker, and Beet (1996a, 1996b).
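The bank-of-models decision scheme described above reduces to an arg max over per-category log likelihoods plus log priors. A schematic sketch (the toy scoring functions stand in for trained diffusion networks and are not part of the paper):

```python
import math

def classify(x, log_likelihood_fns, log_priors):
    """Pick the category index maximizing log L_i(x) + log prior_i."""
    scores = [f(x) + lp for f, lp in zip(log_likelihood_fns, log_priors)]
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy stand-ins for c = 3 trained models: each "log likelihood" prefers
# sequences whose values lie near a category-specific target level.
def make_model(target):
    return lambda x: -sum((xi - target) ** 2 for xi in x)

models = [make_model(t) for t in (0.0, 1.0, 2.0)]
priors = [math.log(1.0 / 3.0)] * 3          # equal priors

print(classify([0.9, 1.1, 1.0], models, priors))   # → 1 (nearest target)
```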
They employ point density models, where each lip contour is represented by a set of points; in this case, both the inner and outer lip contours are represented, corresponding to Luettin's double contour model (see Figure 7). The dimensionality of the representation of the contours was reduced using principal component analysis. For the work presented here, 10 principal components were used to approximate the contour, along with a scale parameter that measured the pixel distance between the mouth corners; associated with each of these 11 components was a corresponding delta component. The value of the delta component for the frame sampled at time t_k equals the value of the original component at that time minus the value of the original component at the previous sampling time t_{k−1} (these delta components are defined to be zero at t_0). In this manner, 22 components were used to represent lip contour information for each still frame. These 22 components were represented using diffusion networks with 22 observation units, one per component.
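The delta-component construction just described can be written down directly. A small NumPy sketch (array shapes and names are illustrative, not from the paper's code):

```python
import numpy as np

def add_delta_components(frames):
    """Append delta components: delta[k] = frames[k] - frames[k-1],
    with the delta at the first frame defined as zero (as in the text).

    frames: (num_frames, num_components) array of per-frame values.
    Returns an array of shape (num_frames, 2 * num_components)."""
    deltas = np.zeros_like(frames)
    deltas[1:] = frames[1:] - frames[:-1]
    return np.concatenate([frames, deltas], axis=1)

frames = np.array([[1.0, 2.0], [1.5, 1.0], [2.0, 0.5]])
out = add_delta_components(frames)
print(out.shape)   # (3, 4): 2 original plus 2 delta components per frame
```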
Figure 7: Example images from the Tulips1 database.
We also tested the performance of a representation that used intensity information in addition to contour shape information. We obtained Luettin et al.'s (1996b) representations, in which gray-level values are sampled along directions perpendicular to the lip contour. These local gray-level values are then concatenated to form a single global intensity vector that is compressed using the first 20 principal components. There were 20 associated delta components, for a total of 40 components of intensity information per still frame. These 40 components were concatenated with the 22 contour components, for a total of 62 components per still frame. These 62 components were represented using diffusion networks with 62 observation units, one per component. Thus, the total number of weight parameters was (62 + (n − d))², where n − d is the number of hidden units. We tested networks with up to 100 hidden units—26,244 parameters.

9.3.3 Training. We independently trained four diffusion networks to approximate the distributions of lip-contour trajectories for each of the four words to be recognized; the first network was trained with examples of the word "one" and the last network with examples of the word "four." Each network had the same number of nodes, and the drift of each network was given by equation 1.2 with κ = 1, ρ = 0, and a hyperbolic tangent activation function—θ₁ = −1, θ₂ = 2, θ₃ = 1 in equation 1.3. The connectivity matrix w
and the bias parameters ξ were adaptive. The initial state of the hidden units was set to (1, …, 1)′ with probability 1, and σ was set to 1 for all networks. The diffusion network dynamics were simulated using the forward Euler technique described in previous sections. In our simulations, we set Δt = 1/30 seconds, the time between video frame samples. Each diffusion network was trained with examples of one of the four digits using the following cost function (see equation 4.2),

$$W(\lambda) = \sum_i \log \hat{L}^{\lambda}_o(x_o^i) - \Psi(\lambda), \tag{9.7}$$
where each $x_o^i$ is a movie from the Tulips1 database—a sample from the desired empirical distribution $P_o$—and $\Psi(\lambda) = \frac{1}{2}\alpha|\lambda|^2$ for $\alpha > 0$. Several values of α were tried, and performance is reported for the optimal values. Here, Ψ(λ) acts as a gaussian prior on the network parameters. Training was done using 20 hidden sample paths per observed path. These paths were sampled using the "teacher forcing" approach described in equations 5.8 through 5.10.

9.3.4 Results. The bank of diffusion networks was evaluated in terms of generalization to new speakers. Since the database is small, generalization performance was estimated using a jackknife procedure (Efron, 1982). The four models (one for each digit) were trained with labeled data from 11 subjects, leaving one subject out for generalization testing. Percentage correct generalization was then tested using the decision rule

$$D(x_o) = \arg\max_{i \in \{1,2,3,4\}} \log \hat{L}^{\lambda_i^*}_o(x_o), \tag{9.8}$$
where $\log \hat{L}^{\lambda_i^*}_o$ is the estimate of the log likelihood of the test sequence evaluated at the optimal parameters $\lambda_i^*$ found by training on examples of digit i. This rule corresponds to assuming equal priors for each of the four categories under consideration. The entire procedure was repeated 12 times, each time leaving a different subject out for testing, for a total of 96 generalization trials (4 digits × 12 subjects × 2 observations per subject). This procedure mimics that used by Luettin et al. (1996a, 1996b; Luettin, 1997) to test hidden Markov model architectures. We tested performance using a variety of architectures, some including no hidden units and some with up to 100 hidden units. The best generalization performance was obtained using four hidden units. These generalization results are shown in Table 3. The hidden Markov model results are those reported in Luettin et al. (1996a, 1996b) and Luettin (1997). The only difference between their approach and ours is the recognition engine, which is a bank of hidden Markov models in their case and a bank of diffusion networks in ours. The image representations mentioned above were optimized
Table 3: Generalization Performance on the Tulips1 Database.

Approach                                                   Correct Generalization
Best HMM, shape information only                           82.3%
Best diffusion network, shape information only             85.4%
Untrained human subjects                                   89.9%
Best HMM, shape and intensity information                  90.6%
Best diffusion network, shape and intensity information    91.7%
Trained human subjects                                     95.5%
Notes: Shown in order are the performance of the best-performing HMM from Luettin et al. (1996a, 1996b) and Luettin (1997), which uses only shape information; the best diffusion network obtained using only shape information; the performance of untrained human subjects (Movellan, 1995); the HMM from Luettin's thesis (Luettin, 1997), which uses both shape and intensity information; the best diffusion network obtained using both shape and intensity information; and the performance of trained human lip readers (Movellan, 1995).
by Luettin et al. to work with hidden Markov models. They also tried a variety of hidden Markov model architectures and reported the best results obtained with them. In all cases, the best diffusion networks outperformed the best hidden Markov models reported in the literature using exactly the same visual preprocessing. The results show that diffusion networks may outperform hidden Markov model approaches in sequence recognition tasks. While these results are very promising, caution should be exercised, since the database is relatively small. More work is needed with larger databases.

10 Conclusion

We think the main reason that recurrent neural networks have not worked well in practical applications is that they rely on a very simplistic likelihood model: white noise added to a deterministic sequence. This model cannot cope well with known issues in natural sequences, such as bifurcations and dynamic time warping. We proposed adding noise to the activation dynamics of recurrent neural networks. The resulting models, which we called diffusion networks, can handle bifurcations and dynamic time warping. We presented a general framework for learning path probability densities using continuous stochastic models and then applied the approach to train diffusion neural networks. The approach allows training networks with hidden units, nonlinear activation functions, and arbitrary connectivity. Interestingly, the gradient of the likelihood function can be computed using Hebbian coactivation statistics and does not require backpropagation of error signals.

Our work was inspired by the rich literature on continuous stochastic filtering and recurrent neural networks. The idea was to combine the versatility of recurrent neural networks and the well-known advantages of stochastic modeling approaches. The continuous-time nature of the networks is convenient for data with dropouts or variable sample rates, since the models we use define all the finite-dimensional distributions. The continuous-state representation is well suited to problems involving continuous unobservable quantities, as in visual tracking tasks. In particular, the diffusion approach may be advantageous for problems in which continuity and sparseness constraints are useful. Diffusion networks naturally enforce a sparse state transition matrix (there is an infinite number of states, and given any state, there is only a very small volume of states to move into with nonnegligible probability). It is well known that enforcing sparseness in state transition matrices is beneficial for many sequence recognition problems (Brand, 1998). Diffusion networks also naturally enforce continuity constraints on the observable paths, and thus they may not have the well-known problems encountered when hidden Markov models are used as generative models of sequences (Rabiner & Juang, 1993).

We presented simulation results on realistic sequence modeling and sequence recognition tasks with very encouraging results. While the results were highly encouraging, the databases used were relatively small, and thus the current results should be considered only exploratory. It should also be noted that there are situations in which hidden Markov models will be preferable to diffusion networks. For example, it is relatively easy to compose hierarchies of simple trainable hidden Markov models, capturing acoustic, syntactic, and semantic constraints. At this point, we have not explored how to compose trainable hierarchies of diffusion networks. Another disadvantage of diffusion networks relative to conventional hidden Markov models is training speed, which is significantly slower for diffusion networks than for hidden Markov models.
However, once a network was trained, the computation of the density functions needed in recognition was fast and could be done in real time. Significant work remains to be done. In this article, we assume that the data are a continuous stochastic process. However, in many applications, the data take the form of discrete-time samples from a continuous-time process. In this article, we approached this issue by using simple discrete-time Euler approximations to stochastic integrals. In the future, we plan to investigate alternative approximations that allow for pathwise convergence, such as that in Karandikar (1995). The theoretical properties of the stochastic EM procedure when combined with discrete-time approximations of stochastic integrals remain to be investigated. In this article, we did not consider dispersions that are a function of the state, and we have not optimized the dispersion and the initial distribution of hidden states. Optimization of the initial distribution of hidden states is easy but unnecessarily obscures the presentation of the main issues in the article. Optimization of the dispersion is easy in discrete-time systems but presents mathematical challenges in continuous-time systems; thus, we have deferred the issue to future work. Theoretical issues concerning the approximating power of diffusion
networks need to be explored. In deterministic neural networks, it is known that with certain choices of activation function and sufficiently many hidden units, neural networks can approximate a large set of functions with arbitrary accuracy (Hornik, Stinchcombe, & White, 1989). An analogous result for diffusion networks stating the class of distributions that can be approximated arbitrarily closely would be useful. It is also important to compare the properties of the algorithm presented here with alternative learning algorithms for partially observable stochastic processes that could also be applied to diffusion networks (Campillo & Le Gland, 1989; Blake et al., 1999). Another aspect of theoretical interest is the potential connection between diffusion networks and recent kernel-based approaches to learning. In particular, notice that the learning rule developed in this article depends on a matrix of inner products in the Hilbert space of square-integrable functions, as defined in equation 7.8. This may allow application of standard kernel methods (Aizerman et al., 1964; Burges, 1998) to the problem of learning distributions of sequences. We are currently exploring applications of diffusion networks to real-time stochastic filtering problems (face tracking) and sequence-generation problems (face animation). Our work shows that diffusion networks may be a feasible alternative to hidden Markov models for problems in which state continuity and sparseness in the state transition distributions are advantageous. The results obtained for the visual speech recognition task are encouraging and reinforce the possibility that diffusion networks may become a versatile general-purpose tool for a wide variety of continuous-signal processing tasks.

Acknowledgments

We thank Anthony Gamst, Ian Fasel, Tim Kalman Marks, and David Groppe for helpful discussions and Juergen Luettin for generously providing lip contour data and technical assistance. The research of R. J. W. was supported in part by NSF grant DMS 9703891. J. R. M. was supported in part by the NSF under grant 0086107.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Aizerman, M. A., Braverman, E. M., & Rozoner, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 15, 821–837.
Billingsley, P. (1995). Probability and measure. New York: Wiley.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, A., North, B., & Isard, M. (1999). Learning multi-class dynamics. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 389–395). Cambridge, MA: MIT Press.
Brand, M. (1998). Pattern discovery via entropy minimization. In D. Heckerman & J. Whittaker (Eds.), Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics. San Mateo, CA: Morgan Kaufmann.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
Campillo, F., & Le Gland, F. (1989). MLE for partially observed diffusions—direct maximization vs. the EM algorithm. Stochastic Processes and Their Applications, 33(2), 245–274.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., 39, 1–38.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: SIAM.
Fahrmeir, L. (1992). Posterior mode estimation by extended Kalman filter for multivariate dynamic generalized linear models. Journal of the American Statistical Association, 87, 501–509.
Fishman, G. S. (1996). Monte Carlo sampling: Concepts, algorithms, and applications. New York: Springer-Verlag.
Ghahramani, Z., & Roweis, S. T. (1999). Learning nonlinear dynamical systems using an EM algorithm. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Jacob, G., Noble, A., & Blake, A. (1998). Robust contour tracking of echocardiographic sequences. In Proc. 6th Int. Conf. on Computer Vision (pp. 408–413). Bombay: IEEE Computer Society Press.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Transactions ASME J. of Basic Eng., 83, 95–108.
Karandikar, R. L. (1995). On pathwise stochastic integration. Stochastic Processes and Their Applications, 57, 11–18.
Karatzas, I., & Shreve, S. E. (1991). Brownian motion and stochastic calculus. New York: Springer-Verlag.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1), 1–25.
Kloeden, P. E., & Platen, E. (1992). Numerical solution of stochastic differential equations. Berlin: Springer-Verlag.
Levanony, D., Shwartz, A., & Zeitouni, O. (1990). Continuous-time recursive estimation. In E. Arikan (Ed.), Communication, control, and signal processing. Amsterdam: Elsevier.
Lewis, F. L. (1986). Optimal estimation—with an introduction to stochastic control theory. New York: Wiley.
Ljung, L. (1999). System identification: Theory for the user. Upper Saddle River, NJ: Prentice-Hall.
Luettin, J. (1997). Visual speech and speaker recognition. Unpublished doctoral dissertation, University of Sheffield.
Luettin, J., Thacker, N., & Beet, S. (1996a). Statistical lip modelling for visual speech recognition. In Proceedings of the VIII European Signal Processing Conference. Trieste, Italy.
Luettin, J., Thacker, N. A., & Beet, S. W. (1996b). Speechreading using shape and intensity information. In Proceedings of the International Conference on Spoken Language Processing. Philadelphia: IEEE.
McClelland, J. L. (1993). Toward a theory of information processing in graded, random, and interactive networks. In D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience (pp. 655–688). Cambridge, MA: MIT Press.
Meinhold, R. J., & Singpurwalla, N. D. (1975). Robustification of Kalman filter models. IEEE Transactions on Automatic Control, 84, 479–486.
Mineiro, P., Movellan, J. R., & Williams, R. J. (1998). Learning path distributions using nonequilibrium diffusion networks. In M. Kearns (Ed.), Advances in neural information processing systems, 10 (pp. 597–599). Cambridge, MA: MIT Press.
Movellan, J. (1994). A local algorithm to learn trajectories with stochastic neural networks. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 83–87). San Mateo, CA: Morgan Kaufmann.
Movellan, J. (1995). Visual speech recognition with stochastic neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Movellan, J. R. (1998). A learning theorem for networks at detailed stochastic equilibrium. Neural Computation, 10(5), 1157–1178.
Movellan, J., & McClelland, J. L. (1993). Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17, 463–496.
Movellan, J. R., & McClelland, J. L. (2001). The Morton-Massaro law of information integration: Implications for models of perception. Psychological Review, 108(1), 113–148.
North, B., & Blake, A. (1998). Learning dynamical models by expectation maximization. In Proc. 6th Int. Conf. on Computer Vision (pp. 911–916). Bombay: IEEE Computer Society Press.
Oksendal, B. (1998). Stochastic differential equations: An introduction with applications (5th ed.). Berlin: Springer-Verlag.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Poor, H. V. (1994). An introduction to signal detection and estimation. Berlin: Springer-Verlag.
Protter, P. (1990). Stochastic integration and differential equations. Berlin: Springer-Verlag.
1544
Javier R. Movellan, Paul Mineiro, and R. J. Williams
Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall. Sage, A. P., & Melsa, J. L. (1971).Estimationtheory with applicationto communication and control. New York: McGraw-Hill. Shumway, R., & Stoffer, D. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3, 253–265. Solo, V. (2000). Unobserved Monte-Carlo method for identication of partially observed nonlinear state space systems, Part II: Counting process observations. In Proc. IEEE Conference on Decision and Control. Sidney, Australia. White, H. (1996). Estimation, inference and specication analysis. Cambridge: Cambridge University Press. Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. Journal of Neuroscience, 13(8), 3406–3420. Received July 2, 1999; accepted November 15, 2001.
NOTE
Communicated by Klaus Obermayer
Are Visual Cortex Maps Optimized for Coverage?
Miguel A. Carreira-Perpiñán [email protected]
Geoffrey J. Goodhill [email protected] Department of Neuroscience, Georgetown University Medical Center, Washington, D.C. 20007, U.S.A.
The elegant regularity of maps of variables such as ocular dominance, orientation, and spatial frequency in primary visual cortex has prompted many people to suggest that their structure could be explained by an optimization principle. Up to now, the standard way to test this hypothesis has been to generate artificial maps by optimizing a hypothesized objective function and then to compare these artificial maps with real maps using a variety of quantitative criteria. If the artificial maps are similar to the real maps, this provides some evidence that the real cortex may be optimizing a similar function to the one hypothesized. Recently, a more direct method has been proposed for testing whether real maps represent local optima of an objective function (Swindale, Shoham, Grinvald, Bonhoeffer, & Hübener, 2000). In this approach, the value of the hypothesized function is calculated for a real map, and then the real map is perturbed in certain ways and the function recalculated. If each of these perturbations leads to a worsening of the function, it is tempting to conclude that the real map is quite likely to represent a local optimum of that function. In this article, we argue that such perturbation results provide only weak evidence in favor of the optimization hypothesis.
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1545–1560 (2002)

1 Introduction

Neurons in visual cortex respond to several kinds of visual stimuli, the best studied of which include position in visual field, eye of origin, and orientation, direction, and spatial frequency of a grating. The pattern of preferred stimulus values over the whole visual cortex for each kind of stimulus is called a (visual) cortical map. Thus, maps of visual field position, ocular dominance, orientation, and so forth coexist on the same neural substrate. Given that these maps show a highly organised spatial structure, the question arises of what underlying principles explain these maps. Two such principles are coverage uniformity, or completeness, and continuity, or similarity (Hubel & Wiesel, 1977). Coverage uniformity means that each combination
of stimuli values (e.g., any orientation in any visual field location of either eye) has equal representation in the cortex; completeness means that any combination of stimuli values is represented somewhere in cortex. Thus, coverage uniformity implies completeness (disregarding the trivial case of a cortex uniformly nonresponsive to stimuli), but not vice versa, since it is possible to have over- and underrepresented stimuli values (in addition, it is not practically possible to represent all values of a continuous higher-dimensional stimulus space with a continuous two-dimensional cortex). A useful middle ground is to consider that the set of stimulus values represented by the cortex be roughly uniformly scattered in stimulus space. A common qualitative definition of continuity is that neurons that are physically close in cortex tend to have similar stimulus preferences; this can be motivated in terms of economy of cortical wiring (Durbin & Mitchison, 1990). Coverage and continuity compete with each other. If, say, retinotopy and preferred orientation vary slowly from neuron to neuron, sizable visual field regions will lack some orientations. If neurons' preferred stimuli values are scattered like a salt and pepper mixture, continuity is lost. The striped structure of several of the maps can be seen as a compromise between these two extremes. An early model based on these principles is the ice cube model of Hubel and Wiesel (1977), where stripes of ocular dominance run orthogonally to stripes of orientation and all combinations of eye and orientation preference are represented within a cortical region smaller than a cortical point image (the collection of neurons whose receptive fields contain a given visual field location). The competition can be explained in a dimension-reduction framework, where a two-dimensional cortical sheet twists in a higher-dimensional stimulus space to cover it as uniformly as possible while minimizing some measure of continuity.
Optimization models based on such principles produce maps with a quantitatively good match to the observed phenomenology of cortical maps, including the striped structure of ocular dominance and orientation columns with appropriate periodicity and interrelations (Erwin, Obermayer, & Schulten, 1995; Swindale, 1996). However, a more direct approach to test the validity of such optimization models would be to calculate the value of the objective function for a real cortical map and then determine by perturbation whether this represents a local optimum. Such an approach has recently been proposed by Swindale, Shoham, Grinvald, Bonhoeffer, & Hübener (2000). Although the results they presented are consistent with the hypothesis that real maps are optimized for a particular function measuring coverage, here we argue that these results offer only weak evidence in favor of the hypothesis.

2 The Coverage Measure

Consider a resolution-dependent representation of a cortical map defined as a two-dimensional array of vector values of the stimulus variables of interest. Each position (i, j) in the array represents an ideal cortical cell;
call C the set of all such cortical positions. There is a vector of stimulus values μ_ij associated with each cortical position (i, j); stimulus variables considered by Swindale et al. (2000) are the retinotopic position (or receptive field center in the visual field) (x, y) in degrees, the preferred orientation θ ∈ [0°, 180°), the ocular dominance n (−1: left eye, +1: right eye), and the spatial frequency m ∈ {−1, 1}. Therefore, μ_ij = (n_ij, m_ij, θ_ij, x_ij, y_ij) for (i, j) ∈ C can be considered a generalized receptive field center; a receptive field would then be defined by a function sitting on the receptive field center and monotonically decreasing away from it (see below). The collection M = {μ_ij}_{(i,j)∈C} of such receptive field centers, together with the two-dimensional ordering of cortical positions in C, defines the cortical map. A mathematically convenient way of representing the trade-off between the goals of attaining uniform coverage and respecting the constraints of cortical wiring is to assume that cortical maps maximize a function

    F(M) = C(M) + λR(M),    (2.1)
where C is a measure of the uniformity of coverage, R is a measure of the continuity, and λ > 0 specifies the relative weight of R with respect to C. We assume that maximizing either C or R separately does not lead to a maximum of F and therefore that maxima of F imply compromise values of C and R. The exact form of the combination of C and R of equation 2.1 (a weighted sum) need not be biologically correct, but for the purposes of embodying the competition between C and R, it is sufficient. Swindale (1991) introduced the following mathematical definition of coverage. Given an arbitrary stimulus v, the total amount of cortical activity that it produces is defined as

    A(v) = Σ_{(i,j)∈C} f(v − μ_ij),    (2.2)
where f is the (generalized) receptive field of cortical location (i, j), assumed translationally invariant (so it depends on only the difference of stimulus v and generalized receptive field center μ_ij); f is taken as a product of functions: gaussian for orientation and retinotopic position (with widths derived from biological estimates of tuning curves)¹ and delta for ocular dominance and spatial frequency. A is calculated for a regular grid in stimulus space, which is assumed to be a representative set of stimulus values. The measure

Footnote 1: Strictly, the receptive field size depends on the location of the stimulus in the visual field and adapts to the surround (e.g., with contrast). However, extending the coverage definition to account for this is difficult. Therefore, in common with Swindale et al. (2000), we will consider fixed receptive field sizes.
of coverage uniformity is finally obtained as

    c₀ = stdev{A} / mean{A},    (2.3)
that is, the magnitude of the normalized dispersion of the total activity A in the stimulus space. Intuitively, c₀ will be large when A takes different values for different stimuli and zero if A has the same value independent of the stimulus. Thus, it is a measure of lack of coverage uniformity, and we could define C = −c₀. Equation 2.2 can be seen as a generalization of the fitness term of the elastic net objective function (Durbin, Szeliski, & Yuille, 1989). R is the combined effect of several factors, none of which is fully understood, and so it is hard to write down a functional form for it confidently.

3 Determining Map Optimality via Perturbations

If suitable functions C and R are defined, the mathematical procedure to determine whether a given map M = {μ_ij}_{(i,j)∈C} is a (local) maximum of F is to check that the gradient of F at M is zero and the Hessian of F at M is negative definite (or negative semidefinite). However, there are two problems with this. First, C is obtained in an approximate way² using a sample of the stimulus distribution, and so the numerical accuracy of the gradient and Hessian will be affected by a discretization error, particularly if the sample is coarse and symmetric. But second and crucially, even if we commit ourselves to a given (approximated) mathematical definition of C such as −c₀, we still do not have a suitable definition of R. Given the difficulties in the definition of R and the mathematical treatment of C, the goal of Swindale et al. (2000) was less ambitious: to check whether the maps are at local optima of C by examining the effect on C of a fairly small set of perturbations of the maps that hopefully would not affect R, however the latter is defined. That is, they argued that although we do not know what R is exactly, we may be able to determine what perturbations of a map should leave R unaffected.
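The coverage computation of equations 2.2 and 2.3 is easy to sketch numerically. The following toy example is our own construction, not the authors' code: a one-dimensional stimulus space, gaussian receptive fields, and two made-up map layouts, comparing c₀ for an evenly spread map against a bunched-up one.

```python
import numpy as np

# Toy sketch of Swindale's coverage measure (eqs. 2.2-2.3), assuming a
# one-dimensional stimulus space and gaussian receptive fields.
# All names and parameter values here are illustrative, not from the paper.

def total_activity(stimuli, centers, sigma=0.1):
    """A(v) = sum_ij f(v - mu_ij) with a gaussian kernel f (eq. 2.2)."""
    d = stimuli[:, None] - centers[None, :]          # pairwise differences
    return np.exp(-0.5 * (d / sigma) ** 2).sum(axis=1)

def coverage_nonuniformity(stimuli, centers, sigma=0.1):
    """c0 = stdev{A} / mean{A} (eq. 2.3); C = -c0 is the coverage term."""
    A = total_activity(stimuli, centers, sigma)
    return A.std() / A.mean()

stimuli = np.linspace(0.0, 1.0, 200)                  # regular sample grid
uniform_map = np.linspace(0.05, 0.95, 20)             # evenly spread centers
clumped_map = np.sort(np.random.default_rng(0).random(20)) * 0.5  # bunched up

# An evenly spread map covers the stimulus space more uniformly,
# so its c0 is smaller (equivalently, its C = -c0 is larger).
print(coverage_nonuniformity(stimuli, uniform_map) <
      coverage_nonuniformity(stimuli, clumped_map))   # expect True
```

The sketch also makes concrete why c₀ is only estimated: A is evaluated on a finite sample grid of stimuli, as discussed in the text.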
Specifically, they suggested rigid-motion perturbations (horizontal translations, 180 degree rotations, and horizontal and vertical flips) applied separately to each individual map (discussed further in section 3.3). If such perturbations unambiguously worsened coverage uniformity for biologically observed maps of developed animals, it would be tempting to conclude that such maps are local maxima of both C and F. To test this idea, Swindale et al. (2000) used empirical maps of ocular dominance, orientation, and spatial frequency obtained simultaneously in area 17 of the cat using standard optical imaging methods for young animals. After some preprocessing (including smoothing, necessary to remove noise), rectangular regions of about 5 × 2.5 mm (approximately 140 × 70 pixels) were obtained in which each pixel has associated values of ocular dominance in {−1, 1}, orientation in degrees in [0, 180), and spatial frequency in {−1, 1}. Since optical imaging provides no information about topography, they chose to make the retinotopic map linear, that is, perfectly topographic,³ which makes coverage uniform by definition along the retinotopic variables x, y. Swindale et al. (2000) then computed the variation of c₀ for a range of rigid-motion perturbations and found that coverage became less uniform for most of the perturbations described, often as an increasing function of the size of the perturbation (notably for horizontal shifts). The lack of negative results in the perturbation simulations led Swindale et al. (2000) to argue that such maps are local maxima of both C and F, or as Das (2000) put it in an associated article, "Real experimentally obtained maps from V1 are indeed optimally arranged with respect to each other such that any departure from the real maps worsens the coverage." However, there are several reasons to question whether this really follows from the results presented by Swindale et al.

Footnote 2: Although we have a mathematical definition of C = −c₀ in terms of M (via the intermediate definition of A(v; M)), it is not possible to obtain C as an explicit function of M for a given distribution of the stimulus v (e.g., uniform) since one cannot analytically determine the distribution of A. Thus, C(M) must be approximated with a sample of the stimulus distribution. Swindale et al. (2000) computed the total cortical activity A for all combinations of n ∈ {−1, 1}, m ∈ {−1, 1}, θ ∈ {0°, 30°, 60°, 90°, 120°, 150°}, x ∈ {1, 5, 9, …, i_max}, and y ∈ {1, 5, 9, …, j_max}, where i_max and j_max are the size in pixels of the rectangular map, amounting to about 14,000 stimuli. This is a coarse and symmetric sample of θ, x, and y. Drawing a finer random sample from a uniform distribution in stimulus space could avoid potential artifactual estimates and could also be used to control whether different samples lead to essentially the same value of A.

3.1 Incompleteness of the Perturbation Set. The set of perturbations used by Swindale et al.
(2000) does not include all possible perturbations that would leave a certain continuity function R unchanged. Assume that all stimulus variables are continuous, and call D the number of (independent) such variables, that is, the number of scalar variables in M = {μ_ij}_{(i,j)∈C}. This is a large number: for a rectangular map of 140 × 70 with 5 stimulus variables, D is 49,000, and it could even be infinite if one considers a nonparametric representation of the map (in which case we would have a variational problem). Then an elementary perturbation of a map M that does not alter R will result in a perturbed map lying at an ε-distance from M on the manifold R(M) = {N: R(N) = R(M)} (see Figure 1). Such a manifold will have dimension D − 1 (or less, if some variables are dependent). Therefore, if D > 2, the number of different directions in which to perturb the map is infinite. In other words, proving that M is a maximum of C for fixed R requires proving
Footnote 3: The effect on the coverage estimates of assuming that the retinotopic map is strictly topographic is likely to be considerable. Besides, this assumption may not be true in reality (e.g., see Das & Gilbert, 1997).
Figure 1: Illustration of an elementary neighborhood of M contained in the manifold R(M) = {N: R(N) = R(M)}. In this example, the map space is three-dimensional, and the manifold is two-dimensional. The thick lines indicate manifolds along which some specific classes of perturbations leave R constant. These are just a subset of all perturbations that leave R constant (the dark-shaded neighborhood). B_ε(M) is a ball of radius ε centered at M.
that every perturbed map in an elementary neighborhood of M contained in the mentioned manifold has a lower value of C. Such a neighborhood can be specifically defined as the intersection of a ball B_ε(M) of small radius ε > 0 and the manifold R(M), as shown in Figure 1, and contains an infinite number of maps. Therefore, a procedure based on trying a finite number of different perturbations can never prove the statement, although it can disprove it by finding one such perturbation that increases C (assuming it is possible to implement an elementary perturbation numerically). No matter how many perturbations we try that decrease C, we can never be sure that the rest of them will as well. Even if a candidate map passes a seemingly convincingly large number of perturbations, there are still many more perturbations to test, and there are many other candidates that would pass those perturbations too. Therefore, a "mathematically complete exploration of a full range of distortions" (Das, 2000) is an unattainable goal. The elementary perturbations we have mentioned include not just (vanishingly small) rigid motions in all possible directions, but also different classes of perturbations, such as "rubber-sheet" distortions of the map. Both Swindale et al. (2000) and Das (2000) admit the possibility that rubber-sheet distortions exist that improve coverage uniformity while leaving R unchanged. However, Swindale et al. argue that since R is not properly defined, a given distortion that leaves unchanged a function R₁ defined in some way would likely change a function R₂ defined in a different way. Besides the fact that the same argument could be equally applied to the definition of coverage uniformity (a distortion decreasing coverage uniformity
as c₀ does might increase coverage uniformity under a different definition), our argument remains, since for any given definition of R, there potentially exist distortions that could increase C; that is, the neighborhood defined earlier depends on the chosen definition of R, but it always exists. These issues are illustrated more concretely in Figure 2. This presents an example of a very simplified version of the mapping problem investigated by Swindale et al. (2000), akin to a one-dimensional ocular dominance problem (Goodhill & Willshaw, 1990). A particular map is hypothesized to be optimal and is then perturbed. We show that rigid shifts of this map decrease C = −c₀ as a monotonically decreasing function of the size of the shift. However, we also show that rubber-sheet perturbations of this map that leave R unchanged can increase C.

3.1.1 Checking for Stationary Points. Could one at least determine whether the maps are at a stationary point of C for fixed R, that is, whether the gradient of C at the map M is zero in the direction tangent to the manifold R(M)? A numerical approximation to the gradient of a function f of D variables can be computed by finite differences by using small perturbations along D linearly independent directions⁴, for example, along the coordinate axes (with unit vectors e_1, …, e_D), by computing

    ∂f/∂x_d ≈ [f(x + ε e_d) − f(x)] / ε.
This would require f to be computed at x + ε e_d for d = 1, …, D, that is, D component-wise small perturbations (for comparison, the shifts of Swindale et al., 2000, would amount to only six perturbation dimensions in a space of D = 49,000). Whether ∇f(x) = 0 could then be determined, at least up to some numerical threshold (which may not be a straightforward matter), and thus whether x is a stationary point of f. However, ∇f(x) = 0 is true for saddle points as well as optima, and high-dimensional multivariate functions that have many optima typically have far more saddle points (see the appendix).⁵ Figure 3 shows an example of a function of two variables with a stationary point at the origin that is neither a maximum nor a minimum, but perturbing a point at the origin along many directions will result in a lower value of the function. Thus, knowing that the gradient is zero and finding that the high-dimensional function decreases along a few directions in no way guarantees that it is at a maximum.
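The finite-difference gradient just described can be sketched in a few lines; the function f, the point x, and the dimension D below are illustrative toys of our own choosing, not the map functions of the paper.

```python
import numpy as np

# Sketch of the finite-difference gradient described in the text:
# df/dx_d ~ (f(x + eps*e_d) - f(x)) / eps, one extra evaluation per
# dimension, so D perturbations in total for a D-variable function.

def fd_gradient(f, x, eps=1e-6):
    fx = f(x)
    g = np.empty_like(x)
    for d in range(x.size):                # D component-wise perturbations
        e = np.zeros_like(x)
        e[d] = eps
        g[d] = (f(x + e) - fx) / eps
    return g

f = lambda x: np.sum(x ** 2)               # toy function; gradient is 2x
x = np.array([1.0, -2.0, 0.5])
print(fd_gradient(x=x, f=f))               # close to [2.0, -4.0, 1.0]
```

For D = 49,000 this loop would need 49,000 evaluations of f, which is what makes the six shift directions of Swindale et al. such a small sample of the available directions.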
Footnote 4: For this and other statements in this section, see a text on optimization, e.g., Nocedal and Wright (1999).

Footnote 5: That the coverage and continuity functions must have many local optima follows from symmetry considerations and from the fact that, assuming the maps are indeed optimal, while the maps of any two normal animals (of the same species) are qualitatively similar to each other, no two animals have the same map.
To determine whether the stationary point is a maximum (say), one would need to compute a numerical approximation of the Hessian (the matrix of second-order derivatives) and check that it is negative definite (or positive definite, for a minimum), that is, that all its eigenvalues are strictly negative. This is now a much harder numerical problem than determining whether ∇f(x) = 0: estimating the Hessian requires O(D²) perturbations, rather than O(D) as for the gradient, and even if it could be estimated, the real problem is then determining whether its eigenvalues are negative (computationally an O(D³) problem). This is very difficult because it is well known that the Hessian of a function of many variables is likely to be ill conditioned: the ratio of the smallest to largest eigenvalue (in absolute value)
Figure 2: Facing page. A nonoptimal map that shows a systematic worsening of coverage uniformity on shifts but a systematic improvement on rubber-sheet perturbations. In this thought experiment, reminiscent of an elastic net (Durbin & Willshaw, 1987), the array of empty circles represents a uniform sample in a two-dimensional stimulus space (e.g., the horizontal axis could be a retinotopic variable and the vertical axis the ocular dominance as in Goodhill & Willshaw, 1990). The string of filled circles (receptive field centers) represents a one-dimensional map that tries to cover the stimuli as much and as uniformly as possible (measured by C = −c₀ as in section 2) while respecting the map continuity as much as possible (here defined as the sum R of the lengths of the individual segments). To compute c₀, a gaussian kernel f in equation 2.2 with a standard deviation equal to twice the radius of the shaded disks was used. Map B is the result of rigidly shifting map A to the right: it has the same length R as map A but a lower value of C. In general, rigidly shifting map A horizontally systematically decreases C and results in an inverted-U curve for C, wrongly suggesting that map A is an optimum. The same happens for vertical shifts, as in map C (although, for this particular example, map A is slightly off the maximum of C). However, map A can be stretched and compressed symmetrically (a rubber-sheet transformation) to reach map D, keeping its length R constant. Thus, map A is not optimal, and there is a continuous path inside the manifold of constant R that monotonically increases C until map D is reached, as shown in map D. Whether map A is a saddle point of F(M) = C(M) + λR(M) or lies on an inclined ridge depends on the actual values of λ and the parameters of C and R. In all graphs in the right column, the vertical scale is the same; the steepness of the curves may be increased or decreased by changing the gaussian kernel width, and the curves were computed with a Matlab program from the actual data and definitions of C and R. Note that if one considers periodic boundary conditions in the horizontal axis, by symmetry c₀ becomes exactly 0 for both maps A and D, even though intuitively map D is better than map A. This is a shortcoming of the definition of c₀; the fitness term of, for instance, the elastic net objective function does differentiate between maps A and D.
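As a numerical companion to this thought experiment, the following sketch (our own toy coordinates, not the figure's actual data) checks the one property the argument relies on: a rigid shift leaves the continuity cost R, defined as the total string length, exactly unchanged, so any change in C under shifting cannot be attributed to a change in R.

```python
import numpy as np

# A 1-D string of receptive field centers in a 2-D stimulus space, its
# continuity cost R (sum of segment lengths), and a rigid shift of the
# whole string, which preserves R exactly. Coordinates are made up.

def string_length(pts):
    """R = sum of segment lengths between consecutive centers."""
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

rng = np.random.default_rng(1)
map_A = np.cumsum(rng.random((10, 2)), axis=0)   # arbitrary 1-D map in 2-D
map_B = map_A + np.array([0.3, 0.0])             # rigidly shifted to the right

print(string_length(map_A), string_length(map_B))  # identical R values
```

A symmetric stretch-and-compress of the string (the rubber-sheet move from map A to map D) can also be built to keep this R constant while redistributing the centers, which is the move the figure uses to increase C.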
Figure 3: A function with a stationary point at the origin (M = 0) that is a saddle point: (left) surface plot; (right) contour plot. The equation of the function in polar coordinates is f(r, θ) = r²(sin^{2m}(Nθ − α) − 1/2) with m = 10, N = 3, and α = π/2. In the contour plot, the shaded areas correspond to f > 0 and the white areas to f < 0. Any straight line in the (r, θ) plane that passes through the origin is associated with either a U curve, along which the function increases away from the origin (if inside the shaded areas); an inverted-U curve, along which the function decreases away from the origin (if inside the white areas); or a horizontal line, along which the function is constant (if on the boundary). For the particular function shown, inverted-U curves are much more abundant than U curves, and so perturbations of the point M = 0 inside a small ball B_ε(M) typically result in a lower value of f.
is often very close to 0 and geometrically corresponds to a direction along which f is nearly flat. This is a perennial problem in multivariate optimization (e.g., backpropagation training of a multilayer perceptron), where it can be difficult to tell whether the optimization algorithm converged to an optimum or got stuck in a saddle point or even in a nonstationary point. Bentler and Tanaka (1983) report a remarkable example from the factor analysis literature (in a space of merely 36 dimensions). But the problem we really have is even harder, because we want to see whether a function C has an optimum along the manifold R(M), not along the whole space. This means that we cannot even obtain D − 1 linearly independent directions along which to perturb⁶ the map (D − 1 being the dimension of R(M)), as shown in section 3.2, and therefore we cannot even compute the gradient tangential to R(M). Consequently, we are not merely unable to determine whether the map is a maximum inside R(M); we cannot even determine whether the map is at a stationary point inside R(M).
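Both difficulties, saddle points that pass direction-sampling tests and the cost of the second-order check, can be illustrated numerically. The sketch below evaluates the Figure 3 function on a small circle around the origin and then runs a finite-difference Hessian test on a textbook two-variable saddle; the step sizes and the quadratic toy function are our own choices.

```python
import numpy as np

# First: the Figure 3 function f(r, theta) = r^2 (sin^{2m}(N*theta - alpha)
# - 1/2), with m = 10, N = 3, alpha = pi/2, decreases along most (but not
# all) directions from the origin, so sampling a few directions would
# wrongly suggest a maximum.
m, N, alpha = 10, 3, np.pi / 2
thetas = np.linspace(0.0, 2 * np.pi, 10000, endpoint=False)
f_polar = 0.1 ** 2 * (np.sin(N * thetas - alpha) ** (2 * m) - 0.5)
print((f_polar < 0).mean())        # most directions decrease f (about 5/6)

# Second: classifying a stationary point needs the Hessian's eigenvalues;
# estimating the Hessian by finite differences takes O(D^2) evaluations.
def fd_hessian(f, x, eps=1e-4):
    D = x.size
    H = np.empty((D, D))
    for i in range(D):
        for j in range(D):
            ei = np.zeros(D); ei[i] = eps
            ej = np.zeros(D); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / eps**2
    return H

g = lambda x: x[0] ** 2 - x[1] ** 2        # textbook saddle at the origin
print(np.linalg.eigvalsh(fd_hessian(g, np.zeros(2))))  # mixed-sign eigenvalues
```

The mixed-sign eigenvalues expose the saddle in two dimensions; at D = 49,000 the O(D²) evaluations and the O(D³) eigenvalue check are what the text argues is impractical.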
Footnote 6: Assuming true small perturbations, not the ones that Swindale et al. (2000) used, as discussed in section 3.3.
3.2 Such Perturbations May Indeed Alter R. Swindale et al. (2000) argue that the perturbations they tried should not affect the continuity measure, however this may actually be defined. The justification is that presumably such a measure would depend on the Euclidean distances between stimulus values, and these distances are preserved by the class of rigid motions. However, Swindale et al. (2000) applied rigid motions to the individual maps separately,⁷ and this does alter the geometric relationships between the individual maps which are observed in biological maps. For example, the stripes of ocular dominance and orientation maps are known to intersect at approximately right angles, and the singularities of the orientation map are known to lie generally at the centers of the ocular dominance stripes (Bartfeld & Grinvald, 1992; Obermayer & Blasdel, 1993; Hübener, Shoham, Grinvald, & Bonhoeffer, 1997); and (although still awaiting experimental replication) the orientation discontinuities seem to be matched with retinotopic discontinuities (Das & Gilbert, 1997). If the individual maps are independently rotated or translated, these relationships are altered (or completely broken, if the perturbations are large). Such alterations of the individual map interrelations are likely to have an effect on the cortical wiring constraints and therefore on the value of R. Consequently, the fact that coverage uniformity generally decreased becomes hard to interpret, since it could be accompanied by an increase or a decrease in R.

3.3 Such Perturbations Are Not Local. So far we have used the term perturbation in its usual sense of a small or elementary perturbation, whose amount is vanishingly small. For example, a small translation of the map M in the direction of a vector N could be defined as M′ = M + εN, or μ′_ij = μ_ij + ε ν_ij for all (i, j) ∈ C, where ε > 0 is very small; similarly, if every μ_ij is perturbed by a different small amount, then we would have a rubber-sheet perturbation.
However, the perturbations that Swindale et al. (2000) used are not small. To see this, note that their perturbations are actually permutations. Consider, for example, perturbing the orientation map while keeping the other maps fixed, and assume for notational convenience that the cortex origin is at the center of the rectangular region that Swindale et al. examined:

Horizontal shift of k pixels: θ_ij takes the value of θ_{i+k,j}.
Horizontal flip: θ_ij is swapped with θ_{−i,j}.
180 degree rotation: θ_ij is swapped with θ_{−i,−j}.
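In array terms, these three operations are index permutations, as the following sketch shows. The small array is a toy stand-in for the preprocessed orientation map, and np.roll wraps around at the border, a detail the cropped maps of Swindale et al. would handle differently.

```python
import numpy as np

# The three "perturbations" written as index permutations of a toy
# orientation array theta (values are placeholders, not real data).

theta = np.arange(12.0).reshape(3, 4)       # toy 3 x 4 orientation map

shifted = np.roll(theta, -1, axis=1)        # horizontal shift by k = 1 pixel
flipped = theta[:, ::-1]                    # horizontal flip
rotated = theta[::-1, ::-1]                 # 180 degree rotation

# Each pixel simply receives the value of some other, possibly distant,
# pixel: no entry is changed by a small amount.
print(shifted[0, 0], flipped[0, 0], rotated[0, 0])
```

Because every pixel takes the value of another pixel rather than a nearby value, the size of the change at any one pixel is bounded only by the range of the map values, not by the shift amount.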
Footnote 7: If the individual maps were transformed jointly by a rigid motion, we would gain no information (assuming that the cortex is homogeneous and isotropic).
Figure 4: Shifts of a one-dimensional bit-mapped cortex with a discontinuity.
Thus, the new value for θ_ij will often be very different from the original one. It could be argued that for shifts of a small amount (e.g., k = 1 pixel or less), the value of θ_{i+k,j} will be very similar to that of θ_ij, but this rests on an assumption of continuity of orientation that does not hold generally (e.g., at singularities or fractures), and the same would happen with other maps. This argument holds no matter how finely one discretizes the cortex, since the discontinuities do not go away as the pixel size goes to zero. The argument would also apply to small-angle individual map rotations, though Swindale et al. (2000) considered only 180 degree rotations. Figure 4 illustrates the idea. It shows a one-dimensional bit-mapped cortex M₀ where each pixel i contains an orientation value θ_i that varies continuously except at a single point; it then shows maps M₁, M₂, … shifted by increasing amounts of 1, 2, … pixels and the respective perturbations M₀ − M₁, M₀ − M₂, … of the original map. It can be seen that the effect of both the continuous region and the discontinuity is accumulative with the shift size, consistent with the U-curves reported by Swindale et al., and if the pixel size is very small, the 1 values in M₀ − M₁ will be very small but the −7 will remain, and similarly for other shifts, so that the perturbation remains large. In summary, the "perturbations" of Swindale et al. (2000) are not small perturbations, but permutations that often result in very large perturbations, therefore being nonlocal. That is, the perturbed map is not in the immediate vicinity of the original one but in a faraway region of map space, and therefore comparison of the coverage values of both maps is hard to interpret. Put another way, it makes sense to add 1 degree to a given θ_ij and see how that affects C, but not to add unknown, potentially arbitrarily large amounts to all θ_ij's.
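The persistence of the discontinuity under grid refinement can be checked directly. In this sketch the map values are made up (a linear ramp with a jump of −7, echoing the figure), and the largest entry of M₀ − M₁ stays near 7 no matter how fine the pixel grid becomes.

```python
import numpy as np

# Sketch of the Figure 4 argument: for a 1-D map with a discontinuity,
# the perturbation M0 - M1 produced by a one-pixel shift does not become
# small as the pixel size shrinks.

def shifted_difference(n_pixels):
    x = np.linspace(0.0, 1.0, n_pixels)
    M0 = np.where(x < 0.5, 10.0 * x, 10.0 * x - 7.0)   # jump of -7 at x = 0.5
    M1 = np.roll(M0, -1)                               # shift by one pixel
    return np.abs(M0 - M1)[:-1].max()                  # ignore wrapped edge

for n in (100, 1000, 10000):
    print(n, shifted_difference(n))    # stays near 7 as resolution grows
```

The entries in the continuous region shrink like 10/(n − 1), but the entry straddling the discontinuity converges to 7, which is why the perturbation remains large however finely the cortex is discretized.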
Besides, if the original (observed) map was indeed at a local optimum of C, given the multiplicity of local optima in map space mentioned earlier, it is to be expected that the perturbed map would then be near a completely different local optimum.

4 Conclusion

The general principle that cortical maps are wired in a way that achieves uniform coverage while also minimizing cortical wiring is important for understanding cortical map structure. The abstract implementation of such a principle in cortical map models based on dimensionality reduction replicates most of the characteristics of such maps. The evidence that Swindale et al. (2000) presented is certainly consistent with the optimization hypothesis, but it does not add significant support for it. Being based on trial and error of a subset of possible perturbations, it might disprove the hypothesis that empirical maps maximize (a certain measure of) coverage uniformity, but it cannot prove it. The lack of an appropriate definition of economy of cortical wiring prevents us from finding perturbations that leave it unchanged and ultimately prevents a quantitative assessment of the general principle stated above. What could be more easily tested is whether empirical maps are stationary points of the coverage uniformity by itself (irrespective of any connectivity constraint), but it would be surprising if this were so. Since most of the perturbations that Swindale et al. (2000) tried worsened coverage uniformity (the more so the larger the perturbation, for the horizontal shifts), it may be argued that this cannot be due to chance. In section 3, we showed that in high dimensions this intuition about chance is misleading and that continuity is also being altered by those manipulations. In addition, the example shown in Figure 2 demonstrates that it is also possible to observe a systematic decrease in coverage uniformity for shifts when the map is not optimal. Visual cortex has an orderly columnar structure. That perturbations such as shifts should disturb that structure, and probably worsen coverage uniformity, should not come as a surprise.
That any perturbation should worsen coverage uniformity is a far stronger statement that would be very difficult to confirm empirically. The work of Swindale et al. (2000) provides useful evidence that coverage is fairly uniform across visual cortex, but does not prove that it is as uniform as possible. Optimality is a very important principle in biology, and there are many examples where biological systems have been proven to achieve the best performance possible given the relevant physical constraints (e.g., Bialek, 1987). The approach in such cases is generally to calculate from first principles what optimal performance would be and then show that biology achieves this performance. This is analogous to the standard methodology in the visual cortical map field of hypothesizing an objective function (though on rather less certain grounds than direct physical constraints), calculating the maps that follow from this function, and comparing them with real maps. An alternative method, practical when the problem is discrete and the number of possible states is relatively small, is to calculate the value of an objective function for all states and show that the biological state represents the optimum of this function (e.g., Cherniak, 1995). However, in the case discussed by Swindale et al. (2000), the number of states is infinite, and we argue that a numerical perturbation approach cannot yield significant insight into optimality.

Figure 5: (Left) Setup for the 3D case of the proof of the appendix. The nearest neighbors of the central point (at the origin) of first order lie at the diagonals of length 1, of second order at the diagonals of length √2, and of third order at the diagonals of length √3; they correspond to the 6 faces, 12 edges, and 8 vertices, respectively, of a cube of side 2 centered on the origin (dotted line). The midpoints of the diagonals are marked with small dots. (Right) Contour plot of a function of two variables with many maxima. The maxima are marked with *, minima with ○, and saddle points with ×. Observe the abundance of saddle points compared to that of maxima or minima. This is because when going from one maximum to another maximum, or from one minimum to another minimum, we cross a saddle point. In higher dimensions, saddle points become even more abundant, possibly exponentially so.

Appendix: Abundance of Saddle Points and Optima in High Dimensions

Consider a linear superposition of localized spherical functions (each function falls away quickly, e.g., at a distance 1/2), each centered at the knots of a D-dimensional cubic array. That is, we place one such function at every position (x_1, x_2, …, x_D) where each x_d is an integer, for d = 1, …, D. Call every such position a whole-knot. By symmetry, we will have a maximum at every whole-knot and, over a large region, the same number of minima. In the midpoint of the segment joining any two maxima (minima), there must be either a saddle point or a minimum (maximum). These midpoints are located at positions (x_1, x_2, …, x_D) where at least one x_d is an integer plus 1/2, and correspond to the midpoints of diagonals of lengths 1, √2, √3, …, √D that link a maximum with its respective nearest-neighboring maxima (see Figure 5, left). Call such midpoints half-knots. Over a hypercubic region of side
N in each axis, we have (2N)^D half-knots and whole-knots (saddle points, minima, and maxima) and N^D whole-knots (maxima). Thus, it contains N^D maxima, N^D minima, and (2N)^D − 2N^D saddles. Therefore, the ratio maxima:minima:saddles is 1:1:(2^D − 2), and there are O(2^D) saddle points per maximum or minimum. The expression is also valid for D = 1 (where there exist no saddle points). Figure 5 (left) shows the proof setup for D = 3. The groups of nearest-neighboring maxima of a maximum knot are at distances 1, √2, and √3 and correspond to the 6 faces, 12 edges, and 8 vertices, respectively, of a cube of side 2 centered on the knot. At the midpoint of every such diagonal, there is either a saddle point or a minimum. In less crystalline arrangements, some maxima, minima, and saddle points will coalesce, but if the function under consideration has many maxima uniformly scattered over its domain, we would expect the ratio to remain approximately correct. Figure 5 (right) shows the landscape of a bivariate function with many maxima. Hence, saddle points are typically much more numerous than maxima and minima in high dimensions.

Acknowledgments

We thank Peter Dayan and Graeme Mitchison for helpful discussions and comments on this article, and particularly Nick Swindale for his willingness to discuss with us the issues it raises.

References

Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular-dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89(24), 11905–11909.
Bentler, P. M., & Tanaka, J. S. (1983). Problems with EM algorithms for ML factor analysis. Psychometrika, 48(2), 247–251.
Bialek, W. (1987). Physical limits to sensation and perception. Annu. Rev. Biophys. Biophys. Chem., 16, 455–478.
Cherniak, C. (1995). Neural component placement. Trends Neurosci., 18(12), 522–527.
Das, A. (2000). Optimizing coverage in the cortex. Nat. Neurosci., 3(8), 750–752.
Das, A., & Gilbert, C. D. (1997). Distortions of visuotopic map match orientation singularities in primary visual cortex. Nature, 387(6633), 594–598.
Durbin, R., & Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343(6259), 644–647.
Durbin, R., Szeliski, R., & Yuille, A. (1989). An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1(3), 348–358.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the traveling salesman problem using an elastic net method. Nature, 326(6114), 689–691.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation, 7(3), 425–468.
Goodhill, G. J., & Willshaw, D. J. (1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network: Computation in Neural Systems, 1(1), 41–59.
Hubel, D. H., & Wiesel, T. N. (1977). Functional architecture of the macaque monkey visual cortex. Proc. R. Soc. Lond. B, 198(1130), 1–59.
Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17(23), 9270–9284.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer-Verlag.
Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13(10), 4114–4129.
Swindale, N. V. (1991). Coverage and the design of striate cortex. Biol. Cybern., 65(6), 415–424.
Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network: Computation in Neural Systems, 7(2), 161–247.
Swindale, N. V., Shoham, D., Grinvald, A., Bonhoeffer, T., & Hübener, M. (2000). Visual cortex maps are optimised for uniform coverage. Nat. Neurosci., 3(8), 822–826.

Received April 24, 2001; accepted October 29, 2001.
NOTE
Communicated by Klaus Obermayer
Kernel-Based Topographic Map Formation by Local Density Modeling

Marc M. Van Hulle
[email protected]
K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Leuven, Belgium

We introduce a new learning algorithm for kernel-based topographic map formation. The algorithm generates a gaussian mixture density model by individually adapting the gaussian kernels' centers and radii to the assumed gaussian local input densities.

1 Introduction
Rather than developing topographic maps with disjoint and uniform activation regions (Voronoi tessellation), such as in the case of the popular self-organizing map (SOM) algorithm (Kohonen, 1995) and its adapted versions, algorithms have been introduced that can accommodate neurons with overlapping activation regions, usually in the form of kernel functions, such as radially symmetric gaussians. For these kernel-based topographic maps, several learning principles and schemes have been proposed (for a review, see Van Hulle, 2000). One of the earliest examples is the elastic net of Durbin and Willshaw (1987), which can be viewed as a gaussian mixture density model, fitted to the data points by a penalized maximum likelihood term. The standard deviations of the gaussians (radii) are all equal and are gradually decreased over time. More recently, Bishop, Svensén, and Williams (1998) introduced the generative topographic map, which is based on constrained gaussian mixture density modeling—constrained since the gaussians cannot move independently from each other (the map is topographic by construction). Furthermore, all gaussians have an equal and fixed radius. Sum, Leung, Chan, and Xu (1997) maximized the correlations between the activities of neighboring lattice neurons. The radii of the gaussians are under external control, and are gradually decreased over time. Graepel, Burger, and Obermayer (1997) introduced the soft topographic vector quantization (STVQ) algorithm and showed that a number of probabilistic, SOM-related topographic map algorithms can be regarded as special cases. The gaussian kernel represents a fuzzy membership (in clusters) function, and its radius, which is equal for all neurons, is again under external control (deterministic annealing).
© 2002 Massachusetts Institute of Technology. Neural Computation, 14, 1561–1573 (2002)

In the kernel-based maximum entropy learning rule (kMER) (Van Hulle, 1998), the outputs of the gaussians are thresholded and the radii individually adapted so as to make the
neurons (suprathreshold) active with equal probabilities (equiprobabilistic maps). In the kernel-based soft topographic mapping (STMK) algorithm (Graepel, Burger, & Obermayer, 1998), a nonlinear transformation is introduced that maps the inputs to a high-dimensional feature space and also admits a kernel function, an idea borrowed from support vector machines (SVMs) (Vapnik, 1995; Schölkopf, Burges, & Smola, 1999). The kernel operates in the original input space, but its parameters are not modified by the algorithm. This connection with SVMs was taken up again by András (2001), but with the purpose of linearizing the class boundaries. The kernel radii are adapted individually so that the map's classification performance is optimized (using supervised learning). Yin and Allinson (2001) proposed an algorithm aimed at minimizing the Kullback-Leibler divergence (also called relative or cross-entropy) between the true density and the input density estimate obtained from the individually adapted gaussian kernels in the topographic map. In this contribution, we propose a still different approach: the gaussian kernels are adapted individually and in an unsupervised manner, but in such a way that their centers and radii correspond to those of the assumed gaussian local input densities.

2 Kernel-Based Topographic Map Formation
Let A be a lattice of N formal neurons and V ⊆ R^d the input space. To each neuron i ∈ A corresponds a weight vector w_i = [w_i1, …, w_id] ∈ V and a lattice coordinate r_i ∈ V_A, with V_A the lattice space (we assume discrete lattices with regular topologies). As an input to lattice space transformation, we take one that admits a kernel function: ⟨Ψ(v), Ψ(w_i)⟩ = K(v, w_i), with v ∈ V, Ψ the transformation, and ⟨., .⟩ the internal product operator in V_A. We prefer to use gaussian kernels:

K(v, w_i, σ_i) = exp(−‖v − w_i‖² / (2σ_i²)).   (1)

When performing topographic map formation, we require that the weight vectors are updated so as to minimize the expected value of the squared Euclidean distance ‖v − w_i‖² and, hence, following our transformation Ψ, we instead wish to minimize ‖Ψ(v) − Ψ(w_i)‖², which we will achieve by performing gradient descent with respect to w_i. We first determine the gradient:

∂/∂w_i ‖Ψ(v) − Ψ(w_i)‖² = ∂/∂w_i K(w_i, w_i, σ_i) + ∂/∂w_i K(v, v, σ_i) − 2 ∂/∂w_i K(v, w_i, σ_i)
                        = −2 ∂/∂w_i exp(−‖v − w_i‖² / (2σ_i²)).   (2)
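Since K(v, v, σ_i) and K(w_i, w_i, σ_i) are constant (equal to 1), only the cross-term survives in equation 2, which can be checked numerically by finite differences (a sketch of our own; the function names are ours):

```python
import numpy as np

def K(v, w, s):
    # gaussian kernel of equation 1
    return np.exp(-np.sum((v - w) ** 2) / (2.0 * s ** 2))

def feature_dist2(v, w, s):
    # ||Psi(v) - Psi(w)||^2 expanded via the kernel trick
    return K(v, v, s) + K(w, w, s) - 2.0 * K(v, w, s)

def grad_K(v, w, s):
    # analytic gradient of K with respect to w: K * (v - w) / s^2
    return K(v, w, s) * (v - w) / s ** 2

v, w, s = np.array([0.3, -0.2]), np.array([0.1, 0.4]), 0.7
eps = 1e-6
num = np.zeros(2)
for k in range(2):
    dw = np.zeros(2)
    dw[k] = eps
    # central finite difference of the feature-space distance
    num[k] = (feature_dist2(v, w + dw, s) - feature_dist2(v, w - dw, s)) / (2 * eps)
print(np.allclose(num, -2.0 * grad_K(v, w, s), atol=1e-5))
```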
Hence, the learning rule for the kernel centers w_i becomes (without neighborhood function—see further)

Δw_i = η_w (v − w_i) K(v, w_i, σ_i) / σ_i²,   (3)

with η_w the learning rate. The equilibrium weight vector can be written as

w_i^eq = ∫_V v K(v, w_i^eq, σ_i) p(v) dv / ∫_V K(v, w_i^eq, σ_i) p(v) dv,   ∀i ∈ A,   (4)

that is, the center of gravity of K(.) p(.), or, when we define the product of the input density and the gaussian kernel as a new, local input density p*(v) = K(v, w_i^eq, σ_i) p(v)—note that p* is normalized by the denominator in equation 4—then we can rewrite the equilibrium weight vector as w_i^eq = ⟨v⟩_{p*}. In order to achieve a topology-preserving mapping, we multiply the right-hand side of equation 3 with a neighborhood function Λ(i, i*), with i* = arg max_{i∈A} K(v, w_i, σ_i), that is, an activity-based definition of winner-takes-all rather than a minimum Euclidean distance-based definition (i* = arg min_i ‖w_i − v‖), which is usually adopted in topographic map formation. The equilibrium weight vector now becomes

w_i^eq = ∫_V v Λ(i, i*) K(v, w_i^eq, σ_i) p(v) dv / ∫_V Λ(i, i*) K(v, w_i^eq, σ_i) p(v) dv,   ∀i ∈ A.   (5)

We now derive the learning rule for the kernel radii σ_i. We consider two steps. First, we perform gradient descent on ‖Ψ(v) − Ψ(w_i)‖², but with respect to σ_i, so that after some algebraic manipulations:

Δσ_i ∝ (‖v − w_i‖² / σ_i³) K(v, w_i, σ_i),   ∀i ∈ A.   (6)

Second, we wish each radius σ_i to correspond to the standard deviation of a d-dimensional gaussian input distribution centered at the (equilibrium) weight vector of neuron i so that, together with the latter, the assumed gaussian local input density is modeled. Since the expected value of the Euclidean distance to the center of a d-dimensional gaussian can be approximated as √d σ for d large (Graham, Knuth, & Patashnik, 1994),¹ the radius update rule becomes (without neighborhood function)

Δσ_i = η_σ (1/σ_i) (‖v − w_i‖² / σ_i² − ρd) K(v, w_i, σ_i),   ∀i ∈ A,   (7)
¹ Note that the Euclidean distance distribution is chi-squared with h = 2 and a = d/2 degrees of freedom.
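Equations 3 and 7 amount to a simple on-line update of the winning kernel's center and radius. A minimal sketch (our own; the full scheme multiplies the right-hand sides by the neighborhood function Λ(i, i*), which is omitted here):

```python
import numpy as np

def kernel(v, w, s):
    # gaussian kernel of equation 1
    return np.exp(-np.sum((v - w) ** 2) / (2.0 * s ** 2))

def lde_step(v, W, S, eta_w=0.01, eta_s=0.001, rho=0.4):
    """One on-line update of the winning kernel's center (equation 3)
    and radius (equation 7). W: (N, d) centers; S: (N,) radii."""
    N, d = W.shape
    acts = np.array([kernel(v, W[i], S[i]) for i in range(N)])
    i = int(np.argmax(acts))             # activity-based winner i*
    Kw = acts[i]
    diff2 = np.sum((v - W[i]) ** 2)
    W[i] += eta_w * (v - W[i]) / S[i] ** 2 * Kw                 # equation 3
    S[i] += eta_s * (diff2 / S[i] ** 2 - rho * d) * Kw / S[i]   # equation 7
    return i
```

Updating every neuron with the right-hand sides weighted by Λ(i, i*) recovers the topology-preserving version; the winner-only form above corresponds to Λ(i, i*) = δ_{i,i*}.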
with η_σ the learning rate. The scale factor ρ (a constant) is designed to relax the local gaussian (and d large) assumption in practice. The equilibrium condition for the kernel radii can be written as

σ_i^{2,eq} = (1/ρd) ∫_V ‖v − w_i‖² K(v, w_i, σ_i^eq) p(v) dv / ∫_V K(v, w_i, σ_i^eq) p(v) dv,   ∀i ∈ A,   (8)

that is, the moment of inertia of p* (without the 1/ρd term), or, for short, σ_i^{2,eq} = (1/ρd) ⟨‖v − w_i‖²⟩_{p*}. Finally, similar to the weight update rule, we multiply the right-hand side of equation 7 with our neighborhood function Λ(.). The corresponding equilibrium radius then becomes

σ_i^{2,eq} = (1/ρd) ∫_V Λ(i, i*) ‖v − w_i‖² K(v, w_i, σ_i^eq) p(v) dv / ∫_V Λ(i, i*) K(v, w_i, σ_i^eq) p(v) dv,   ∀i ∈ A.   (9)
Equations 5 and 9 are in the form in which the so-called iterative contraction mapping (fixed-point iteration), used for solving nonlinear equations, can be directly applied (thus without learning rates).² Such an iterative process is, at least for the weights, reminiscent of the so-called batch map version of the SOM algorithm (Kohonen, 1995; Mulier & Cherkassky, 1995). This observation leads us to the soft topographic vector quantization (STVQ) algorithm (Graepel et al., 1997), a general model for probabilistic, SOM-based topographic map formation, since several algorithms can be considered as special cases, including Kohonen's batch map version. In appendix A, we detail the correspondence of our learning scheme, which we further call the local density estimation (LDE) algorithm, with the STVQ, the batch map, and also the kernel-based soft topographic mapping (STMK) algorithm (Graepel et al., 1998). As an example, consider the standard case of a 10 × 10 lattice, trained on samples drawn from the uniform distribution in the unit square (0, 1]². The weights are initialized by sampling the same distribution; the radii are initialized by randomly sampling the (0, 1] range. We use a gaussian neighborhood function,

Λ(i, i*, σ_Λ) = exp(−‖r_i − r_{i*}‖² / (2σ_Λ²)),

with σ_Λ the neighborhood radius and r_i neuron i's lattice coordinate. We adopt the following neighborhood cooling scheme: σ_Λ(t) = σ_Λ0 exp(−2σ_Λ0 t / t_max), with t the present and t_max the maximum number of time steps, and σ_Λ0 the radius at t = 0. We take t_max = 500,000, σ_Λ0 = 5, η_w = 0.01, η_σ = 0.02η_w, and ρ = 0.4. The results are shown in Figure 1. We observe that prior to the lattice disentangling

² Note that, strictly speaking, since we opt for a winner-takes-all rule for computational reasons, we should integrate over the subspace of V that leads to an update of neuron i's kernel parameters.
Figure 1: Evolution of a 24 × 24 lattice as a function of time. The circles demarcate the areas spanned by the standard deviations (radii) of the corresponding gaussian kernels. The boxes correspond to the range spanned by the uniform input density. The values given below the boxes represent time (0, 1k, 10k, 30k, 100k, and 500k steps).
phase, the kernel radii grow rapidly in size and span a considerable part of the input distribution. Furthermore, for larger ρ, the lattice disentangles more rapidly, and the radii at convergence are larger. Hence, what happens in case the radii are forced to stay small? Will the lattice still disentangle? This is exemplified in Figure 2 for the same initial weight distribution but with the radii kept constant. The lattice disentangles, albeit at a much slower rate (t_max = 2M).

Figure 2: Evolution when the radii are kept constant at a value of 0.1 (snapshots at 0, 1k, 10k, 100k, 200k, and 2M steps). Same conventions as in Figure 1.

3 Density Estimation Performance

We now explore the density estimation performance of our LDE algorithm by using samples drawn from the distribution shown in Figure 3A (quadrimodal product distribution). We will also use this case as a benchmark for comparing the performances of two modified versions of our algorithm and of four other kernel-based topographic map formation algorithms. The analytic equation of the product distribution is, in the first quadrant, (−log v_1)(−log v_2), with (v_1, v_2) ∈ (0, 1]², and so on. Each quadrant is chosen with equal probability. The resulting asymmetric distribution is unbounded and consists of four modes separated by sharp transitions (discontinuities), which makes it difficult to model. The support of the distribution is bounded by the square [−1, 1]². We take a 24 × 24 lattice and train it in the same way as in the previous example, but with t_max = 1,000,000, η_w = 0.01, and η_σ = 0.1η_w. The density estimate is obtained by taking the sum of all (equal-volume) gaussian kernels at convergence, normalized so as to obtain a unit-volume density estimate. The result is shown in Figure 3B.
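Samples from this product distribution can be drawn directly (the sampling construction below is our own; the note gives only the analytic density): the product of two independent uniforms on (0, 1] has density −log x, so multiplying two uniforms per coordinate and attaching an independent random sign reproduces the quadrimodal density.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_quadrimodal(n):
    # The product of two independent U(0,1] variates has density -log(x)
    # on (0,1], so per-coordinate products give the per-quadrant density
    # (-log v1)(-log v2); an independent random sign per coordinate then
    # selects one of the four quadrants with equal probability.
    mag = rng.random((n, 2)) * rng.random((n, 2))
    signs = rng.choice([-1.0, 1.0], size=(n, 2))
    return signs * mag
```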
Figure 3: (A–D) Density estimation with kernel-based topographic maps. Two-dimensional quadrimodal product distribution (A), and the density estimates obtained with our learning algorithm, when adapting the kernel radii (B), and when keeping them fixed (C), and the estimate obtained with the STVQ algorithm (D). The size of the lattices is 24 × 24 neurons. pd = probability density. (E, F) Hill-climbing procedure applied in lattice space on the density estimate shown in B. We first determine the density estimates at the neurons' weight vectors, since hill climbing is performed on them. For each neuron i, we look for the neuron with the highest density estimate among itself and its k nearest lattice neighbors. When this is neuron i itself, it is called a "top" neuron, since it represents a local density maximum; when it is another neuron j, then the procedure is repeated for neuron j, and we say that neuron i "points" to neuron j. All neurons that eventually point to the same top neuron receive the same cluster label. The result is represented as a cluster map (E). In order to motivate the choice of the parameter k in the hill-climbing procedure, the number of clusters found as a function of k is plotted (F). The long plateau indicates that there are four clusters. The cluster map corresponding to k = 100 is shown in E.
Table 1: Density Estimation Performance of Several Kernel-Based Topographic Map Algorithms for the Product Distribution Example in Figure 3A.

Algorithm                Parameters                                    MSE
LDE                      η_w = 0.01, η_σ = 0.1η_w                      6.23 × 10⁻²
LDE (min Eucl)           η_w = 0.01, η_σ = 0.1η_w                      6.77 × 10⁻²
LDE (fixed radii)        η_w = 0.01, σ_opt = 0.195                     9.78 × 10⁻²
kMER                     ρ_r = 1, ρ_s = 2, η = 0.001                   6.71 × 10⁻²
Yin & Allinson (2001)    η = 0.01                                      7.16 × 10⁻²
STVQ                     1/β = 0.01, σ_Λ = 0.5, σ_opt = 0.0905         7.48 × 10⁻²
SSOM                     1/β = 0.01, σ_Λ = 0.5, σ_opt = 0.0925         7.28 × 10⁻²
SOM                      η = 0.015, σ_opt = 0.15                       1.21 × 10⁻¹
The mean squared error (MSE) between the estimated and the theoretical distribution is 6.23 × 10⁻². In order to explore the contribution of our activity-based competition rule, i* = arg max_i y_i, we also train our lattice using the classic Euclidean distance-based rule, i* = arg min_i ‖w_i − v‖, but with the radii adapted as before, using equation 7 ("LDE min Eucl" case in Table 1). The MSE is now 6.77 × 10⁻², which is inferior to our initial result. Furthermore, in order to see the effect of individually adapting the kernel radii, we also consider the case where the radii are equal and kept constant during learning ("LDE fixed radii" case). But here the following problem arises, as it does in other algorithms that do not adapt the kernel radii: how to choose this radius, since it directly determines the smoothness of the density estimate? In order not to have to rely on a separate optimal smoothness procedure, which could bias our results, we determine the (common) kernel radius for which the MSE between the estimated and the true theoretical distributions is minimal. In this way, we at least know that a better MSE result cannot be obtained. The "optimal" MSE found is 9.78 × 10⁻² for σ_opt = 0.195 (optimized in steps of 5 × 10⁻³). The result is shown in Figure 3C. This clearly shows the advantage of adapting the kernel radii individually to the local input density. For the sake of comparison, we also consider four other kernel-based topographic map algorithms, provided that their kernels specify a density estimate in the input space directly. We use the following algorithms: the kernel-based maximum entropy learning rule (kMER) (Van Hulle, 1998), the algorithm of Yin and Allinson (2001), and the STVQ and soft-SOM (SSOM) algorithms (Graepel et al., 1997). All simulations are run on the same input data, using the same cooling scheme (except for the SSOM and STVQ algorithms; see further), and the same number of iterations, as before.
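The density estimates entering Table 1 are normalized sums of gaussian kernels, compared with the theoretical density by a grid MSE. A sketch of both steps (function names and the test grid are ours):

```python
import numpy as np

def density_estimate(points, W, S):
    """Unit-volume estimate: average of normalized gaussian kernels
    centered at the converged weights W (N, d) with radii S (N,)."""
    d = W.shape[1]
    p = np.zeros(len(points))
    for w, s in zip(W, S):
        norm = (2.0 * np.pi * s ** 2) ** (d / 2.0)
        p += np.exp(-np.sum((points - w) ** 2, axis=1) / (2.0 * s ** 2)) / norm
    return p / len(W)

def grid_mse(p_est, p_true):
    # mean squared error between two densities evaluated on the same grid
    return np.mean((p_est - p_true) ** 2)

# sanity check: a single kernel integrates to ~1 over a 2-D grid
xs = np.linspace(-5.0, 5.0, 201)
gx, gy = np.meshgrid(xs, xs)
grid = np.column_stack([gx.ravel(), gy.ravel()])
p = density_estimate(grid, np.array([[0.0, 0.0]]), np.array([1.0]))
h = xs[1] - xs[0]
print(p.sum() * h * h)
```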
For the STVQ algorithm, we take for the neighborhood function radius σ_Λ = 0.5, and for the equal and constant kernel radii ("temperature parameter") 1/β = 0.01, as suggested in Graepel et al. (1997).³ We also adopt these parameter values for the SSOM algorithm, since it is in fact a limiting case of the STVQ algorithm (see appendix A). Furthermore, again for the SSOM and STVQ algorithms, since they do not adapt their kernel radii, we look for the (common) kernel radius that optimizes the MSE between the estimated and the theoretical distributions, as explained above. The result for the STVQ algorithm is shown in Figure 3D. Finally, we also consider the original SOM algorithm, add gaussian kernels at the converged weight vectors, and look for the (common) kernel radius that minimizes the MSE. The results for these algorithms are summarized in Table 1, together with the parameters used.

4 Density-Based Clustering
The density estimation facility described above can be used for visualizing clusters in data distributions by performing density-based clustering in lattice space. For each neuron in the lattice, the density estimate at the neuron's weight vector is determined, and the result displayed graphically, on a relative scale, in the lattice. Clusters then correspond to high-density regions in the lattice, and they can be found and demarcated by a procedure such as hill climbing (see the caption for Figures 3E and 3F). An efficient tree-based hill-climbing algorithm is given in appendix B. This procedure could be regarded as an alternative to the gray-level clustering procedure that has been devised for the SOM and used for data mining purposes (Kohonen, 1995; Lagus & Kaski, 1999). There, for each neuron, the average distance to the weight vectors of the neuron's lattice neighbors is determined, rather than a local estimate of the input density. Finally, density-based clustering itself can be regarded as an alternative to the (usually Euclidean) distance-based similarity criterion, which in most cases is adopted for clustering with competitive learning and topographic maps (Luttrell, 1990; Rose, Gurewitz, & Fox, 1990; Graepel et al., 1997): distance-based clustering assumes that the cluster shape is hyperspherical, at least for the Euclidean case, whereas density-based clustering does not make any assumptions about the cluster shape.

Appendix A: Correspondence with STVQ, SOM, and STMK Algorithms
The soft topographic vector quantization (STVQ) algorithm (Graepel et al., 1997) performs a fuzzy assignment of data points to clusters, whereby each cluster corresponds to a single neuron. It also serves as a general model for probabilistic SOM-based topographic map formation. The weight vectors represent the cluster centers and are determined by iterating the following equilibrium equation (put into our format):

w_i^eq = ∫_V v Σ_j Λ(i, j) P(v ∈ C_j) p(v) dv / ∫_V Σ_j Λ(i, j) P(v ∈ C_j) p(v) dv,   ∀i ∈ A,   (A.1)

with P(v ∈ C_j) the assignment probability of data point v to cluster C_j (i.e., the probability of "activating" neuron j), which is given by

P(v ∈ C_j) = exp(−(β/2) Σ_k Λ(j, k) ‖v − w_k‖²) / Σ_l exp(−(β/2) Σ_k Λ(l, k) ‖v − w_k‖²),   (A.2)

³ Note that our input distribution also spans the square [−1, 1]², as in Graepel et al. (1997).
with β the inverse temperature parameter and Λ(i, j) the transition probability of the noise-induced change of data point v from cluster C_i to C_j. A number of topographic map algorithms can be considered as special cases by putting β → ∞ in equation A.2, and Λ(i, j) = δ_ij in equation A.2 and/or A.1. For example, the soft SOM (SSOM) algorithm (Graepel et al., 1997) is obtained by putting Λ(i, j) = δ_ij in equation A.2, but not in equation A.1. Kohonen's batch map version (Kohonen, 1995; Mulier & Cherkassky, 1995) is obtained for β → ∞ and Λ(i, j) = δ_ij in equation A.2, but not in equation A.1, and for i* = arg min_j ‖v − w_j‖² (i.e., distance-based winner-takes-all rule). Our LDE algorithm is different from the STVQ algorithm in three ways. First, in the STVQ algorithm, the kernel P(.) in equation A.2 represents a fuzzy membership (in clusters) function, that is, the softmax function, normalized with respect to the other neurons in the lattice, with the degree of fuzzification depending on β. In our case, the kernel K(.) operates in the input space, instead of the (discrete) lattice space, and represents a local density estimate.⁴ Our algorithm also differs from Kohonen's batch map in the definition of the kernel, which is in Kohonen's case the neighborhood function Λ, whereas in our case we have both K(.) and Λ(.), and in the definition of the winner i* (distance- versus activity-based). Second, instead of using kernels P(.) with equal radii 1/β, with β externally controlled (deterministic annealing or "cooling"), our radii differ from one another and are individually adapted. Third, the kernels also differ conceptually since, in the STVQ algorithm, the kernel radii are related to the magnitude of the noise-induced change in the cluster assignment (thus, in lattice space), whereas in our case, they are related to the standard deviations of the local input density estimates (thus, in input space).
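Equation A.2 is a softmax over neighborhood-coupled squared distances; a small sketch (our own, with Λ passed as an explicit matrix):

```python
import numpy as np

def stvq_assignments(v, W, Lam, beta):
    """Soft assignment probabilities P(v in C_j) of equation A.2.
    W: (N, d) cluster centers; Lam: (N, N) transition matrix."""
    d2 = np.sum((W - v) ** 2, axis=1)   # ||v - w_k||^2 for every k
    e = Lam @ d2                        # sum_k Lam(j, k) * ||v - w_k||^2
    logits = -0.5 * beta * e
    logits -= logits.max()              # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

With Lam the identity and β → ∞, the assignment hardens into the nearest-center winner-takes-all rule, the limiting case discussed above.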
⁴ Technically, the "kernel" in the STVQ algorithm represents a probability distribution, whereas our kernel represents a probability density.
In the kernel-based soft topographic mapping (STMK) algorithm (Graepel et al., 1998), a nonlinear transformation Ψ, from the original input space V to some feature space F, is introduced that admits a kernel function: ⟨Ψ(x), Ψ(y)⟩ = K(x, y), with K(.), for example, a gaussian, and with ⟨., .⟩ the internal product operator in F-space. The topographic map's parameters ("weights") are expressed in this feature space as linear combinations of the transformed inputs: w_i = Σ_{m=1}^M a_mi Ψ(v_m), given a batch of M input samples {v_m}. The coefficients a_mi are determined by iterating an equilibrium equation. Furthermore, as in the STVQ algorithm, soft assignments of the inputs to clusters are made, P(v_m ∈ C_j). Our algorithm differs in several ways from the STMK algorithm. First, the topographic map's parameters a_mi are developed in the feature space F, rather than in the input space V, as in our LDE algorithm. Second, the STMK algorithm does not update the kernels' parameters. In fact, when initializing the STMK algorithm, the kernel-transformed data {K(v_m, v_ν)} are determined, and they replace the original input data (see Graepel et al., 1998, p. 182). In our LDE algorithm, both the kernel centers and the kernel radii are updated during learning. Third, the (transformed) inputs are assigned in probability to clusters in the STMK algorithm, whereas in our case, the (original) inputs are assigned with an activity-based winner-takes-all rule.

Appendix B: Tree-Based Hill-Climbing Algorithm
Conventions: D = lattice number of a neuron; Top = lattice number of the neuron that has the highest density estimate of all k + 1 neurons in the current hypersphere; Ultimate Top = lattice number of the neuron that represents a local maximum in the input density.

/* Three types of nodes:
     leaves: (0, D, successor, top)
     nodes:  (predecessor, D, successor, top)
     tops:   (predecessor, D, D, D) */

/* Initialize all neurons, by labeling them all as leaves */
for (i ← 1; i ≤ N; i ← i + 1)
    label[i] ← (0, i, 0, 0)

/* Search for all neurons the top of the cluster they belong to, */
/* with respect to the k nearest neighbors (parameter) */
for (i ← 1; i ≤ N; i ← i + 1)
    if (label[i].successor == 0)   /* Neuron i is not processed until now */
        pointer ← i                /* pointer will point to the neuron currently being processed */
        while (label[pointer].D ≠ label[pointer].successor)
            top ← MaxNeighbor(pointer, k)
            label[pointer].successor ← top
            if (top ≠ pointer)
                /* pointer is not the ultimate top-neuron of the cluster */
                label[top].predecessor ← pointer
                pointer ← top
            else
                label[top].top ← top
            if (label[pointer].successor ≠ 0)
                /* Current neuron is processed and leads to a top */
                /* No further processing is needed ⇒ quit while-loop */
                /* First give all parsed neurons the D of the ultimate top */
                top ← label[pointer].top
                while (label[pointer].predecessor ≠ 0)
                    pointer ← label[pointer].predecessor
                    label[pointer].top ← top
                break
Acknowledgments

M.M.V.H. is supported by research grants received from the Fund for Scientific Research (G.0185.96N), the National Lottery (Belgium) (9.0185.96), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99-06; 2000/11), the Flemish Ministry for Science and Technology (VIS/98/012), and the European Commission, Fifth Framework Programme (QLG3-CT2000-30161 and IST-2001-32114).

References

András, P. (2001). Kernel-Kohonen networks. Manuscript submitted for publication.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computat., 10, 215–234.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Rev. E, 56(4), 3876–3890.
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: Generalizations and new optimization techniques. Neurocomputing, 21, 173–190.
Graham, R. L., Knuth, D. E., & Patashnik, O. (1994). Concrete mathematics: A foundation for computer science. Reading, MA: Addison-Wesley.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Lagus, K., & Kaski, S. (1999). Keyword selection method for characterizing text document maps. In Proc. ICANN99, 9th Int. Conf. on Artificial Neural Networks (Vol. 1, pp. 371–376). London: IEE.
Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Trans. Neural Networks, 1, 229–232.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computat., 7, 1165–1177.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett., 65(8), 945–948.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Sum, J., Leung, C.-S., Chan, L.-W., & Xu, L. (1997). Yet another algorithm which can generate topography map. IEEE Trans. Neural Networks, 8(5), 1204–1207.
Van Hulle, M. M. (1998). Kernel-based equiprobabilistic topographic map formation. Neural Computat., 10(7), 1847–1871.
Van Hulle, M. M. (2000). Faithful representations and topographic maps: From distortion- to information-based self-organization. New York: Wiley.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yin, H., & Allinson, N. M. (2001). Self-organizing mixture networks for probability density estimation. IEEE Trans. Neural Networks, 12, 405–411.

Received March 8, 2001; accepted October 24, 2001.
LETTER
Communicated by Emilio Salinas
A Simple Model of Long-Term Spike Train Regularization

Relly Brandman
[email protected]
Department of Computer Science and Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801, U.S.A.

Mark E. Nelson
[email protected]
Department of Molecular and Integrative Physiology and Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801, U.S.A.

A simple model of spike generation is described that gives rise to negative correlations in the interspike interval (ISI) sequence and leads to long-term spike train regularization. This regularization can be seen by examining the variance of the kth-order interval distribution for large k (the times between spike i and spike i + k). The variance is much smaller than would be expected if successive ISIs were uncorrelated. Such regularizing effects have been observed in the spike trains of electrosensory afferent nerve fibers and can lead to dramatic improvement in the detectability of weak signals encoded in the spike train data (Ratnam & Nelson, 2000). Here, we present a simple neural model in which negative ISI correlations and long-term spike train regularization arise from refractory effects associated with a dynamic spike threshold. Our model is derived from a more detailed model of electrosensory afferent dynamics developed recently by other investigators (Chacron, Longtin, St.-Hilaire, & Maler, 2000; Chacron, Longtin, & Maler, 2001). The core of this model is a dynamic spike threshold that is transiently elevated following a spike and subsequently decays until the next spike is generated. Here, we present a simplified version—the linear adaptive threshold model—that contains a single state variable and three free parameters that control the mean and coefficient of variation of the spontaneous ISI distribution and the frequency characteristics of the driven response.
We show that refractory effects associated with the dynamic threshold lead to regularization of the spike train on long timescales. Furthermore, we show that this regularization enhances the detectability of weak signals encoded by the linear adaptive threshold model. Although inspired by properties of electrosensory afferent nerve fibers, such regularizing effects may play an important role in other neural systems where weak signals must be reliably detected in noisy spike trains. When modeling a neuronal system that exhibits this type of ISI correlation structure, the linear adaptive threshold model may provide a more appropriate starting point than conventional renewal process models that lack long-term regularizing effects.

Neural Computation 14, 1575–1597 (2002) © 2002 Massachusetts Institute of Technology
1 Introduction

When a spiking neuron encodes an input signal, subsequent processing of that signal by postsynaptic neurons must be based on changes in the statistical properties of the output spike train. If there is background spike activity, then the variability of the background will influence how reliably other neurons can detect the presence of a weak signal encoded in the spike train data. The variability of a spike train is often characterized by the coefficient of variation (CV) of the first-order interspike interval (ISI) distribution. However, the first-order ISI distribution provides information about variability only on short timescales comparable to the mean ISI (for review, see Gabbiani & Koch, 1998). It is possible for a spike train to be irregular on short timescales but regular on longer timescales, as we have shown experimentally for P-type (probability-coding) electrosensory afferent nerve fibers in a weakly electric fish (Ratnam & Nelson, 2000). This longer-term regularization can be observed by analyzing the kth-order interval distribution (the distribution of time intervals between spike i and spike i + k). If successive ISIs in the spike train are uncorrelated, then the variance of the kth-order distribution will be a factor of k times larger than the variance of the first-order ISI distribution. However, in our experimental study of electrosensory afferents, we found that the variance between, say, every fiftieth spike in the spike train was significantly smaller than would be expected if successive ISIs were uncorrelated. We further demonstrated that this regularization is associated with negative correlations in the ISI sequence and that the detectability of a weak signal can be significantly enhanced when such regularization exists. The negative correlation structure and regularizing effects observed in the data have recently been reproduced in a modeling study based on a stochastic model of firing dynamics (Chacron, Longtin, St.-Hilaire, & Maler, 2000; Chacron, Longtin, & Maler, 2001). Refractory effects are known to have a short-term regularizing influence on spike activity by decreasing the CV of the first-order ISI distribution and increasing the temporal precision of the driven response (Berry & Meister, 1998). Refractory effects are often modeled by introducing a recovery function that reduces the firing probability immediately following a spike (for reviews, see Berry & Meister, 1998; Johnson, 1996). In such models, refractory effects are dependent only on the time of the previous spike and are not sensitive to the duration of previous interspike intervals. If the input is held constant in such models, then successive intervals are independent and identically distributed. In this case, no correlations are introduced into the ISI sequence. For such renewal models, the refractory mechanism has no impact on the long-term regularity of the spike train. In contrast, the
refractory mechanism presented here is implemented as a dynamic state variable that retains a memory of previous activity spanning multiple interspike intervals. This nonrenewal model of spike generation gives rise to negative correlations in the ISI sequence and long-term regularization of the spike train. Here we present a simple model of a spike-generating mechanism that gives rise to regularizing effects similar to those observed in electrosensory afferent spike trains. Our model is inspired by the more detailed model of Chacron et al. (2000, 2001), in which they showed that a stochastic model of spike generation with a dynamic threshold is able accurately to describe the key features of spike trains observed in the electrosensory afferent data (Nelson, Xu, & Payne, 1997; Ratnam & Nelson, 2000). To achieve a good match with the data, their model included about 15 parameters. However, because our model has only 3 parameters and one state variable, the relationships between the model parameters and the spike train properties are more readily apparent. Because of its simplicity, the model is easily adaptable to many neural modeling applications. In particular, it is a better choice than more widely used renewal process models when modeling spike trains that exhibit long-term regularizing effects.

2 The Linear Adaptive Threshold Model
The goal of this simplified model is to obtain a minimal description of the spike-generating mechanism that gives rise to long-term spike train regularization. This simplified model is intended to serve as a generic basis for constructing more detailed system-specific models, as illustrated by the example in section 5. Although the model is highly simplified, it captures the important dynamic features of the process and reflects a level of abstraction similar to that of the well-known integrate-and-fire model (Stein, 1967). An important simplification is that the model presented here uses a linear decay function rather than the exponential threshold decay function used by Chacron et al. (2000, 2001). As we will show, this results in a simpler relationship between the model parameters and the spike train characteristics. Finally, the model presented here is formulated in a discrete-time framework, although it can also be cast in continuous time. A discrete-time formulation has the advantage of avoiding complications associated with the numerical integration of gaussian noise in continuous time and for this reason is more computationally efficient because it requires fewer integration steps per unit time. We are currently using an extended version of this model (see section 5) to simulate the neural activity of the entire population of 15,000 P-type electrosensory afferent nerve fibers of an electric fish, so matters of computational efficiency become of practical importance. The linear adaptive threshold model contains three essential parameters (a, b, and σ) and a single dynamic state variable, the spike threshold θ. For the sake of generality, we also include a fourth parameter, c, the input
gain, which we will subsequently take to be unity. As will be shown in section 4, the parameter c is redundant in terms of functionality; it is included to facilitate conceptualization of the model in a neural framework. If one wishes to think of the input as a current and the threshold as a voltage level, then the gain parameter c takes on units of electrical resistance. Figure 1 illustrates the operating principles of the model. The model is described by four update rules, which are evaluated in the following order at each time step n:

v[n] = c i[n] + w[n]                                          (2.1)

θ[n] = θ[n − 1] − (b/a)                                       (2.2)

s[n] = H(v[n] − θ[n]) = 1 if v[n] ≥ θ[n], 0 otherwise         (2.3)

θ[n] ← θ[n] + b s[n] = θ[n] + b if s[n] = 1, θ[n] otherwise   (2.4)

where H is the Heaviside function, defined as H(x) = 0 for x < 0 and H(x) = 1 for x ≥ 0. The voltage v is the product of the input resistance c and the instantaneous input current i, plus random noise w, where w is zero-mean gaussian noise with variance σ². (In section 5, we show that the model can easily be extended to include the effects of a membrane time constant, but this extension is not necessary for understanding the regularizing effects of the model.) When the voltage v rises above a threshold level θ, a spike is generated (s = 1), and the threshold level is elevated by an amount b. The threshold subsequently decays linearly with a slope of −b/a until the next spike is generated. From equation 2.2 alone, one might get the impression that the threshold θ is unbounded and could decay to arbitrarily large negative values. However, because the threshold level is boosted whenever θ < v, the voltage level v serves as the effective lower bound for the threshold. The output of the model is a binary spike train s, with s[n] = 1 if a spike was generated at time step n and s[n] = 0 otherwise. The model parameters are restricted to a > 1, b > 0, and σ > 0. The parameter a has units of time steps, while b and σ have units of voltage. The update interval can be adjusted to meet the temporal resolution required for a specific modeling application.

3 Statistical Properties of Spontaneous Spike Activity in the Model

3.1 Mean and CV of the First-Order ISI. In the absence of an input signal (i = 0), the linear adaptive threshold model generates spontaneous spike activity. The parameter a controls the mean ISI, and the ratio σ/b controls the CV of the ISI distribution. Representative spontaneous ISI distributions are shown in Figure 2. For a sufficiently long spike train, the empirically
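The four update rules can be simulated directly. A minimal sketch, with an illustrative function name and default parameter values of our own choosing:

```python
import numpy as np

def linear_adaptive_threshold(i, a=5.0, b=1.0, sigma=0.2, c=1.0, seed=0):
    """Simulate equations 2.1-2.4 and return the binary spike train s.

    i     : array of input values, one per time step
    a     : threshold decay parameter (decay slope is -b/a), in time steps
    b     : post-spike threshold boost, in voltage units
    sigma : standard deviation of the zero-mean gaussian noise w
    c     : input gain (functionally redundant; kept for the neural reading)
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(len(i), dtype=int)
    theta = 0.0
    for n in range(len(i)):
        v = c * i[n] + rng.normal(0.0, sigma)   # eq. 2.1: noisy voltage
        theta -= b / a                          # eq. 2.2: linear decay
        if v >= theta:                          # eq. 2.3: Heaviside comparison
            s[n] = 1
            theta += b                          # eq. 2.4: post-spike boost
    return s
```

With a zero input (spontaneous activity) and a = 5, the empirical mean ISI of a long simulated train comes out close to a, consistent with the result derived in section 3.4.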
Figure 1: Representative time history of variables in the linear adaptive threshold model. Model parameters: a = 5 msec, b = 1 mV, σ = 0.2 mV, c = 1 MΩ. The input signal is a sinusoid with a period of 100 time steps: i[n] = sin(2πn/100) nA. The voltage v[n], shown by the heavy solid line, is a noisy version of the input. The spike threshold θ[n] is shown by the sawtooth-shaped solid line. A spike (s[n] = 1) is generated whenever the voltage crosses the threshold level. Immediately following each spike, the threshold is boosted by an amount b and subsequently decays linearly with a slope −b/a until the next spike is generated. Total duration shown in the figure is 100 time steps.
measured mean ISI (in time steps) becomes identical to a. The mathematical basis for this result is presented in section 3.4 (see equation 3.7). The CV of the ISI distribution can range between 0 and 1 and increases monotonically with σ/b. The mean and CV of an experimentally observed spontaneous ISI distribution can be matched by appropriate adjustments of a and σ/b and the size of the time step. In our experimental studies of electrosensory afferents in weakly electric fish, the frequency of the oscillatory electric organ discharge (EOD) signal provides a natural time reference. P-type afferents fire at most one spike per EOD cycle (Scheich, Bullock, & Hamstra, 1973); hence, it is natural to set the step size equal to one EOD cycle. For the brown ghost knifefish, Apteronotus leptorhynchus, the EOD frequency is extremely stable for an individual fish (Moortgat, Keller, Bullock, & Sejnowski, 1998) and ranges from about 600 to 1200 Hz. The corresponding step size in the model would range from 0.8 to 1.7 msec. Figure 3A1 shows the spontaneous ISI distribution for a representative P-type afferent fiber (Ratnam & Nelson, 2000). The ISI distribution has a mean of 2.9 EOD cycles and a CV of 0.46. Figure 3A2 shows the corresponding distribution for the linear adaptive threshold model with a = 2.9 msec, b = 2.0 mV, and σ = 1.0 mV. Although the two distributions are clearly not identical, the mean and CV of the model ISI distribution match that of the data (mean = 2.9, CV = 0.46).
Figure 2: Representative spontaneous ISI distributions obtained from the linear adaptive threshold model. The parameter values for a and σ/b, as well as the empirically measured mean and CV of the ISI distribution, are shown in each panel. The parameter a controls the mean of the ISI distribution, and the ratio σ/b controls the CV. The left three panels (A1–C1) show results for a relatively short mean ISI (a = 3 msec), while the right three panels (A2–C2) show results for a longer mean ISI (a = 30 msec). Simulation duration was 100,000 time steps.

3.2 Negative Correlations in the ISI Sequence. The linear adaptive threshold model gives rise to negative correlations between adjacent intervals in the ISI sequence, meaning that short intervals tend to be followed by long intervals, and vice versa. Similar effects are observed in electrosensory afferent data, as illustrated by the joint interval histograms of adjacent ISIs shown in Figures 3B1 and 3B2. In the experimental data, we observed
Figure 3: Spontaneous spike train properties of the linear adaptive threshold model compared with experimental data. The left side (A1–C1) shows the ISI distribution, joint interval histogram, and variance-to-mean ratio of the kth-order interval distribution for a representative P-type electrosensory afferent nerve fiber from an electric fish (Ratnam & Nelson, 2000). The right side (A2–C2) shows the corresponding plots for the model with a = 2.9 msec, b = 2.0 mV, and σ = 1.0 mV. The model is able to match the mean and variance of the first-order ISI distribution (A1, A2), as well as qualitatively reproduce the short-long correlations between neighboring intervals observed in the joint interval histogram (B1, B2), and the approximate decline as k⁻¹ in the variance-to-mean ratio (C1, C2). The dashed line in C1 and C2 indicates k⁻¹. Simulation duration was 100,000 time steps.
a mean correlation coefficient of −0.52 in a population of 52 P-type afferent spike trains (Ratnam & Nelson, 2000). For the particular unit shown in Figure 3B1, the correlation coefficient was −0.58, and for the model it was −0.40 (see Figure 3B2). The linear adaptive threshold model qualitatively captures the short-long correlation structure of the ISI sequences observed in the data. In the model, the negative correlation structure arises because the decay function tends to leave the threshold at a higher level following a short interval than following a long interval. This short-long correlation structure has been observed experimentally in many neural systems (Kuffler, Fitzhugh, & Barlow, 1957; Calvin & Stevens, 1968; Johnson, Tsuchitani, Linebarger, & Johnson, 1986; Lowen & Teich, 1992), and is one indication of spike train regularization.

3.3 Spike Train Variability on Longer Timescales. There are two simple ways to characterize the variability of a spike train on timescales longer than the mean ISI. The traditional way is to count the number of spikes occurring in nonoverlapping windows of fixed duration T and examine how the variance of the count distribution changes with T. An alternative approach is to measure the time interval between every kth spike in the spike train and examine how the variance of the kth-order interval distribution changes with k. If the spike train arises from a renewal process (Cox, 1962), there are no correlations in the interspike interval sequence, in which case both the mean and the variance of the kth-order interval distribution grow linearly with k. Thus, for a renewal process, the variance-to-mean ratio of the kth-order interval distribution is a constant, independent of k. In the traditional approach, where one counts the number of spikes in windows of duration T, the variance-to-mean ratio of the count distribution is called the Fano factor (Fano, 1947).
For a renewal model, the Fano factor asymptotically approaches a constant value for large T, but it is not constant for small count windows (Cox & Lewis, 1966). Thus, analysis of the kth-order interval distributions offers a more definitive test for deviations from a renewal process in the ISI sequence. In both the data and the model, regularization effects persist over time periods that are much longer than a single interspike interval. As described above, these longer-term effects can be quantified by observing the behavior of the variance-to-mean ratio of the kth-order interval distribution with increasing interval order k. As shown in Figure 3C1, the variance-to-mean ratio for the data falls rapidly for the first 10 to 20 interval orders (approximately as k⁻¹). The behavior of the model is quite similar (see Figure 3C2). In the model, the dynamic threshold provides a long-term memory of previous spike activity, allowing regularizing effects to persist over multiple interspike intervals. Thus, the simple linear adaptive threshold model is able to capture the key features of spike train regularization observed in the experimental data.
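The two diagnostics used in sections 3.2 and 3.3, the serial ISI correlation and the variance-to-mean ratio of the kth-order interval distribution, are easy to compute. A sketch with two synthetic trains of our own construction (a renewal train, and a "jittered clock" train whose kth-order variance is constant, so its ratio falls as 1/k; neither is the adaptive threshold model itself):

```python
import numpy as np

def kth_order_var_to_mean(spike_times, k):
    """Variance-to-mean ratio of the kth-order interval distribution
    (intervals between spike i and spike i + k)."""
    t = np.asarray(spike_times, dtype=float)
    ik = t[k:] - t[:-k]
    return ik.var() / ik.mean()

def isi_serial_corr(spike_times, lag=1):
    """Correlation between ISI j and ISI j + lag; negative values
    indicate the short-long structure of section 3.2."""
    isi = np.diff(np.asarray(spike_times, dtype=float))
    return np.corrcoef(isi[:-lag], isi[lag:])[0, 1]

rng = np.random.default_rng(0)
# Renewal train: exponential ISIs, so the ratio stays near 1 for every k.
renewal = np.cumsum(rng.exponential(1.0, 100_000))
# Jittered clock: t_i = i*T + e_i gives Var(I_k) = 2 Var(e) for all k,
# so the ratio falls as 1/k, and adjacent ISIs correlate at -0.5.
jittered = 3.0 * np.arange(100_000) + rng.normal(0.0, 0.5, 100_000)
```

The jittered-clock train is only a stand-in that reproduces the two signatures discussed in the text, but it illustrates why a constant kth-order variance forces the variance-to-mean ratio to decline as k⁻¹.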
3.4 The Mathematical Basis of Long-Term Regularity in the Model. In this section, we explain how the mathematical structure of the linear adaptive threshold model gives rise to long-term regularity of the output spike train. Specifically, we analyze spontaneous spike activity and show that the variance of the kth-order interval distribution Var(I_k) approaches a constant value for large k. The fact that the variance becomes independent of interval order k means, for example, that the variance in the distribution of time intervals between every thousandth spike in the spike train is essentially the same as the variance between every hundredth spike. This is in striking contrast to a renewal process model, for which the variance would continue to increase linearly with k, giving rise to a variance-to-mean ratio that stays constant for all interval orders k. The key result regarding long-term spike train regularity for the linear adaptive threshold model is that Var(I_k) approaches a constant for large k. Since the mean interval between spikes grows linearly with interval order k, the variance-to-mean ratio will fall as k⁻¹, as illustrated in Figure 3. To understand why Var(I_k) approaches a constant for large k, it is useful to recast the linear adaptive threshold model (equations 2.1–2.4) into a slightly different form. The new formulation gives rise to a set of spike times that are identical to those generated by the original model, but the internal state variables are handled differently. Rather than raising the threshold level by an amount b each time a spike occurs (as in equation 2.4), we will instead lower the mean voltage level by an amount b. Since the decision of whether to generate a spike (see equation 2.3) depends on only the relative difference between the threshold level and the voltage level, these two formulations will give rise to an identical set of spike times.
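The equivalence of the two formulations just described can be checked numerically by driving both with the same noise sequence. A minimal sketch for the spontaneous case (function names are illustrative):

```python
import numpy as np

def spikes_boosted_threshold(w, a=20.0, b=1.0):
    """Original formulation: the threshold decays by b/a per step and
    is boosted by b after each spike (spontaneous case, v[n] = w[n])."""
    s, theta = [], 0.0
    for wn in w:
        theta -= b / a
        if wn >= theta:
            theta += b
            s.append(1)
        else:
            s.append(0)
    return s

def spikes_lowered_baseline(w, a=20.0, b=1.0):
    """Reformulation: the threshold only decays; the mean voltage level
    is lowered by b after each spike instead."""
    s, theta, vbase = [], 0.0, 0.0
    for wn in w:
        theta -= b / a
        if wn + vbase >= theta:
            vbase -= b
            s.append(1)
        else:
            s.append(0)
    return s
```

Because the spike decision depends only on the difference theta − vbase, which obeys exactly the same recursion as the original threshold, the two routines emit identical spike trains step for step.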
Hence, either formulation can be used when analyzing the statistical properties of the output spike train. The two formulations of the linear adaptive threshold model are illustrated in Figure 4. Following the structure of the original model (equations 2.1–2.4), we express the reformulated model as:

v[n] = c i[n] + w[n] + v_base                                             (3.1)

θ[n] = θ[n − 1] − (b/a)                                                   (3.2)

s[n] = H(v[n] − θ[n]) = 1 if v[n] ≥ θ[n], 0 otherwise                     (3.3)

v_base ← v_base − b s[n] = v_base − b if s[n] = 1, v_base otherwise       (3.4)

where v_base is the newly introduced baseline voltage level, and all other variables are as defined previously. Note that only two of the equations have changed from the original model (equations 3.1 and 3.4), but all four have
Figure 4: Reformulation of the linear adaptive threshold model to facilitate the analysis of spike train properties. (A) Representative time history of spontaneous spike activity and internal state variables as originally formulated (see equations 2.1–2.4). (B) Time history of the state variables in the reformulated version of the model (see equations 3.1–3.4). The reformulated model gives rise to an identical set of spike times. Parameter values: a = 20 msec, σ = 0.2 mV, b = 1 mV.
been rewritten above for convenience. In the reformulated model (equations 3.1–3.4), the threshold level θ is never boosted; rather, it falls monotonically with a constant slope (see equation 3.2). For spontaneous spike activity, the input i is zero; thus, the voltage v is simply the baseline level v_base plus random noise (see equation 3.1). In this reformulated version of the model, the threshold falls linearly toward a noisy voltage floor; each time a spike is generated, the mean level of the floor drops by an amount b, as illustrated in Figure 4B. Now consider what happens in the reformulated model between spike i and spike i + k. Since k spikes were generated, the baseline level v_base will have dropped by an amount kb. If we choose k sufficiently large (k ≫ σ/b), then the drop in the baseline level kb will be much larger than the standard deviation σ of the voltage fluctuations around the baseline. Thus, the change in voltage level between the time of spike i and spike i + k is

Δv_{i,i+k} = −kb + O(σ),                                                  (3.5)

where O(σ) is a small, random correction on the order of σ related to the voltage fluctuations around the baseline level. Since the threshold falls linearly at a constant slope (−b/a) and spikes are generated whenever the threshold crosses the voltage level, the time difference between spike i and spike i + k is equal to the voltage difference divided by the threshold slope; thus:

Δt_{i,i+k} = Δv_{i,i+k} / (−b/a) = ak + O(aσ/b).                          (3.6)
Thus, for sufficiently large k, the time interval between spike i and spike i + k is equal to ak, plus a small random correction on the order of aσ/b. As long as the threshold level starts well outside the noise band (kb ≫ σ), the variance of this random correction will be independent of k. Hence Var(I_k) becomes constant for sufficiently large k (k ≫ σ/b). Furthermore, the mean interspike interval ⟨ISI⟩ is given by

⟨ISI⟩ = lim_{k→∞} Δt_{i,i+k} / k = a,                                     (3.7)
as was noted in section 3.1. The two key results obtained above are that Var(I_k) approaches a constant for large k and the mean ISI is equal to a. These two results are independent of the noise structure used in the model. We formulated the model using gaussian noise, but the same results would have been obtained for other forms, such as uniform or pink noise. The noise structure will have an effect on the asymptotic numerical value of Var(I_k). However, the fact that Var(I_k) approaches a constant value, and hence that the variance-to-mean ratio falls as k⁻¹ as shown in Figure 3C2, is a robust result that is independent of assumptions about the detailed noise structure.

4 Driven Response Characteristics of the Model
The driven response characteristics of the linear adaptive threshold model were evaluated using sinusoidal stimuli at frequencies between 0.1 and 100 Hz. In these simulations, the step size was taken to be 1 msec. The input signal was given by i[n] = S sin(2πfn/1000), where S is the stimulus amplitude (arbitrary units) and f is the stimulus frequency (Hz). The total stimulus duration was 100 seconds at each stimulus frequency. The response gain and phase were computed using methods described in Nelson et al. (1997). Briefly, cycle histograms of spike times were constructed and normalized such that the ordinate corresponded to the firing rate in spikes per second. A single-cycle sinusoid was fit to the cycle histogram, r(x) = R sin(2πx + φ) + B, where x is the cycle fraction (0 ≤ x ≤ 1), R is the response amplitude, φ is the response phase, and B is the baseline firing rate. The gain of the response at each frequency is computed as the ratio of the response amplitude R to the stimulus amplitude S and has units of spikes per second per unit input. The phase of the response at each frequency is given by the best-fit value of φ (degrees).
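The cycle-histogram procedure can be sketched as follows. The bin count, the normalization, and the linear least-squares form of the fit are our assumptions; the method described in Nelson et al. (1997) may differ in detail:

```python
import numpy as np

def gain_phase(spike_times, f, S, duration, n_bins=50):
    """Gain (spikes/s per unit input) and phase (degrees) from a cycle
    histogram, fitting r(x) = R sin(2*pi*x + phi) + B by rewriting it
    as A sin(2*pi*x) + C cos(2*pi*x) + B, which is linear in (A, C, B)."""
    x_spk = np.mod(np.asarray(spike_times, dtype=float) * f, 1.0)  # cycle fraction
    counts, edges = np.histogram(x_spk, bins=n_bins, range=(0.0, 1.0))
    rate = counts * n_bins / duration        # normalize ordinate to spikes/s
    x = 0.5 * (edges[:-1] + edges[1:])       # bin centers
    design = np.column_stack(
        [np.sin(2 * np.pi * x), np.cos(2 * np.pi * x), np.ones_like(x)])
    A, C, B = np.linalg.lstsq(design, rate, rcond=None)[0]
    R = np.hypot(A, C)                       # response amplitude
    phi = np.degrees(np.arctan2(C, A))       # best-fit phase
    return R / S, phi
```

On a synthetic spike train whose firing rate is sinusoidally modulated, the routine recovers the modulation depth as the gain and a near-zero phase.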
As illustrated in Figure 5, the linear adaptive threshold model has high-pass filter characteristics. At low frequencies, the gain is proportional to the stimulus frequency, and the phase shift is 90 degrees, implying that the model behaves as a differentiator. At higher frequencies, the gain curve becomes flat, and the phase drops toward zero. The overall gain of the response is determined by the model parameter b, which reflects the amount that the threshold level is elevated following a spike. The larger the threshold boost, the lower the gain. In the low-frequency range, where the model behaves as a differentiator, the gain is equal to 2πf/b, with units of spikes per second per unit input. This functional form can be understood by considering the response of the model to a sinusoidal stimulus of amplitude S and frequency f. The rising phase of the sine wave will have a maximum slope of 2πfS. The rising slope will tend to shorten the mean interval between threshold crossings relative to baseline conditions. Recall that the threshold falls with a constant slope of −b/a (see equation 2.2), and the mean ISI under baseline conditions is equal to a (see equation 3.7). For a weak stimulus, a differential analysis reveals that the ISIs are shortened on average by an amount corresponding to a change in firing rate of 2πfS/b, and hence an overall gain of 2πf/b. If the input is scaled by an input gain c as in equation 2.1, then the overall gain becomes 2πfc/b. The parameter c is redundant in terms of being able to control the input-output gain of the model, since gain changes can be accomplished by changing b. However, as discussed in section 2, the parameter c is convenient if one wishes to interpret the model variables as currents and voltages. Empirically, the phase of the response remains unaffected by changes in gain (see Figure 5A). The corner frequency of the high-pass filter is determined by the model parameters a and σ/b.
As these values increase, the corner frequency decreases. Qualitatively, the location of the corner frequency is related to the timescale that characterizes the interval between successive spikes in the spike train. If the shape of the ISI distribution is such that almost all ISIs are short compared to the period of the stimulus, the model behaves as a differentiator. If either a (which controls the mean ISI) or σ/b (which controls the CV) is large enough so that some of the ISIs in the spike train become comparable to the stimulus period, then the gain of the response begins to roll off, giving rise to the knee in the gain curve. Changes in the corner frequency also result in a corresponding change in the phase of the response (see Figure 5B).

5 Extensions to the Model
We now illustrate how one might extend the model to make it more biophysically plausible. For example, the extensions discussed here allow the model to match the experimentally measured frequency response characteristics of electrosensory afferent data better. The key point that we wish
Spike Train Regularization
Figure 5: Frequency response characteristics of the linear adaptive threshold model. The model has high-pass filter characteristics. (A) Gain and phase for three different values of b, with a = 3 msec and σ/b = 1. The gain has units of spikes per second per unit input. The gain varies inversely with b; the phase curves are overlapping and indistinguishable. (B) Gain and phase for three different values of a, with b = 0.1 mV and σ = 0.1 mV. The parameter a influences the corner frequency of the high-pass filter. Simulation duration was 100,000 time steps for a = 3 msec and a = 10 msec, and 500,000 time steps for a = 30 msec.
to make, however, is not that the extensions improve the fit to empirical data, but rather that the extensions do not alter the long-term regularizing effects exhibited by the simpler model. In the linear adaptive threshold model, there were no dynamics associated with the membrane voltage v. Most neural modeling applications would want to include at least the effects of leaky integration by the cell membrane. This can be modeled as a first-order low-pass filter with time constant τ_m, which is incorporated by replacing equation 2.1 with equations 5.1 and 5.2:

u[n] = exp(-1/τ_m) u[n-1] + [1 - exp(-1/τ_m)] i[n]
(5.1)
v[n] = u[n] + w[n].
(5.2)
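Equations 5.1 and 5.2 amount to a one-pole low-pass filter with additive output noise. A minimal Python sketch (the function name and default noise handling are our choices, not from the paper):

```python
import numpy as np

def leaky_integrate(i, tau_m, noise_sd=0.0, rng=None):
    """Discrete-time first-order low-pass filter (equations 5.1-5.2):
    u[n] = exp(-1/tau_m) u[n-1] + (1 - exp(-1/tau_m)) i[n], v[n] = u[n] + w[n]."""
    rng = rng or np.random.default_rng(0)
    a = np.exp(-1.0 / tau_m)
    u = np.zeros(len(i))
    for n in range(1, len(i)):
        u[n] = a * u[n - 1] + (1.0 - a) * i[n]
    w = noise_sd * rng.standard_normal(len(i))  # intrinsic noise on the output
    return u + w

# A constant input converges to the same constant (unity DC gain).
v = leaky_integrate(np.ones(500), tau_m=5.0)
```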
Note that the noise term w[n] is added to the output of the low-pass filter u rather than to the input. Thus, we consider the noise to reflect stochastic properties that are intrinsic to the neuron rather than properties of the input signal. In terms of the frequency response characteristics, this extension to
the model causes a roll-off in gain and a decrease in phase above the corner frequency (f_c = 1/(2πτ_m)) of the low-pass filter. The second extension is to change the linear threshold decay function to a more biophysically plausible exponential decay toward a baseline level h_0, with a decay time constant τ_h, as originally suggested by Chacron et al. (2000). This is incorporated by replacing equation 2.2 with equation 5.3:

h[n] = exp(-1/τ_h) h[n-1] + [1 - exp(-1/τ_h)] h_0.
(5.3)
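Equation 5.3 can be sketched the same way; h_init, h0, and tau_h below are illustrative values, not fitted parameters:

```python
import numpy as np

def threshold_decay(h_init, h0, tau_h, n_steps):
    """Exponential decay of the spike threshold toward baseline h0
    (equation 5.3): h[n] = exp(-1/tau_h) h[n-1] + (1 - exp(-1/tau_h)) h0."""
    a = np.exp(-1.0 / tau_h)
    h = np.empty(n_steps)
    h[0] = h_init
    for n in range(1, n_steps):
        h[n] = a * h[n - 1] + (1.0 - a) * h0
    return h

# A boosted threshold relaxes back toward the baseline level.
h = threshold_decay(h_init=1.0, h0=-1.0, tau_h=30.0, n_steps=400)
```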
This change in the representation of the threshold decay does not have a significant effect on the general features of the first-order ISI distribution (see Figure 6A) or the long-term regularization properties (see Figure 6B), but it does alter the frequency response characteristics of the model (see Figure 6C). Representative gain and phase plots for the extended model are shown in Figure 6C (solid lines). The change in frequency response characteristics for the extended model can be appreciated by comparing the general shapes of the gain and phase curves in Figure 6C with those for the simpler model shown in Figure 5. The parameters for the extended model were selected to closely match the average properties of P-type electrosensory afferents recorded in our experimental data (Nelson et al., 1997; Ratnam & Nelson, 2000). The extended model (equations 5.1–5.3, 2.3, and 2.4) is able to provide a good description of the response characteristics of P-type electrosensory afferents, including the baseline ISI distribution, interval correlations, and frequency response characteristics. However, the main point of this section is to demonstrate that the linear adaptive threshold model can be extended to match empirical data better, while maintaining the long-term regularizing effects that are of central importance here (see Figure 6B).

6 Weak Signal Detectability
In this section, we demonstrate that under certain circumstances, long-term spike train regularization can dramatically improve the detectability of a weak stimulus. We illustrate this by encoding a weak signal using two different neuron models: one that exhibits long-term spike train regularization and one that does not. The parameters of the two models are adjusted to have matched characteristics, including the mean and CV of the spontaneous ISI distribution and the frequency response characteristics (gain and phase) of the driven response. Such characteristics are commonly used by neural modelers to assess how well a particular model describes experimental data. We show that although the two models are well matched by these criteria, they can have significantly different properties in terms of signal detectability. Our goal here is not to model any specific biological signal or system but rather to present a generic example illustrating the potential functional importance of long-term spike train regularization in
Figure 6: Frequency response characteristics of the exponential adaptive threshold model compared with experimental data. The left panels show the spontaneous spike train properties of the model: (A) ISI distribution and (B) variance-to-mean ratio of the kth-order interval distribution. The right panel (C) shows gain and phase of the driven response. The data points show the population-averaged responses from 99 P-type electrosensory afferent fibers (modified from Nelson et al., 1997). Error bars represent the standard deviation of the population average at each frequency. The continuous solid lines show the gain and phase of the exponential adaptive threshold model with b = 0.11 mV, σ = 0.04 mV, h_0 = -1 mV, τ_f = 2 msec, and τ_h = 30 msec. Simulation duration was 300,000 time steps.
biological systems and highlighting the importance of selecting a modeling framework that adequately accounts for correlations in the ISI sequence.

6.1 Linear Adaptive Threshold Model. For a model that exhibits long-term regularization, we use the simple form of the adaptive threshold model. Alternatively, we could have used the extended model, since it also exhibits long-term regularization, but the simple model embodies the essential features that are relevant for the comparison. For this example, we implement equations 2.1 through 2.4 with the following parameters: a = 20 msec, b = 0.5 mV, and σ = 1 mV, and a time step of 1 msec. This parameter set gives rise to a spontaneous ISI distribution with a mean of 20 msec and a CV of 0.69 (see Figure 7A1). For this example, we intentionally chose a σ/b ratio that produces an irregular spike train on short timescales, as judged by the CV of the first-order ISI distribution. The frequency response characteristics of the model are summarized in Figure 7B1. The model has high-pass
filter characteristics with a corner frequency of about 8 Hz. The effects of long-term spike train regularization are shown in Figure 7C1, where it is seen that the variance-to-mean ratio for the kth-order interval distribution decreases as k^(-1). As discussed in section 3.4, this decrease in long-term variability arises from memory effects associated with the threshold dynamics.
6.2 Integrate-and-Fire Model with Random Threshold. We now wish to compare this model with one lacking any such memory effects. For the memoryless model, we also need to be able to adjust the mean and CV of the spontaneous ISI distribution, as well as the frequency response characteristics. These criteria can be satisfied by using a stochastic integrate-and-fire model with random threshold (Gabbiani & Koch, 1996, 1998), coupled with a linear prefilter to adjust the frequency response characteristics. In this model, the input signal i is passed through a unity-gain high-pass prefilter with time constant τ_f and summed with a constant bias input I_b, which controls the spontaneous firing rate of the model. This input signal is integrated on each time step. When the integrated signal v exceeds a threshold h, a spike is generated (s = 1). Following a spike, v is reset to zero, and h is reset to a new random value drawn from a gamma distribution of order m. Because the reset values contain no information about the previous state of the system, there are no memory effects in the ISI sequence of this model. In a discrete-time representation, this memoryless model including the high-pass prefilter is described by the following update rules:
f[n] = exp(-1/τ_f) f[n-1] + [1 - exp(-1/τ_f)] i[n]
(6.1)
v[n] = v[n-1] + i[n] - f[n] + I_b
(6.2)
s[n] = H(v[n] - h[n])
(6.3)
v[n] = (1 - s[n]) v[n]
(6.4)
h[n] = (1 - s[n]) h[n] + s[n] g_m[n],
(6.5)
where g_m[n] are random values drawn from a gamma distribution of order m with mean x̄ (Gabbiani & Koch, 1998):

g_m(x) = c_m (x/x̄)^(m-1) exp(-mx/x̄)
(6.6)
with

c_m = m^m / [(m-1)! x̄].
(6.7)
The random threshold model as described above has four free parameters: τ_f, I_b, m, and x̄.
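A minimal discrete-time sketch of equations 6.1–6.5 (the function name, the zero-input test, and the use of numpy's gamma sampler are our choices; the Heaviside function H is realized by the `if` test):

```python
import numpy as np

def random_threshold_if(i, tau_f, I_b, m, xbar, seed=0):
    """Stochastic integrate-and-fire model with gamma-distributed
    random threshold (sketch of equations 6.1-6.5)."""
    rng = np.random.default_rng(seed)
    a = np.exp(-1.0 / tau_f)
    f = 0.0
    v = 0.0
    theta = rng.gamma(m, xbar / m)          # order-m gamma, mean xbar
    spikes = np.zeros(len(i), dtype=int)
    for n in range(len(i)):
        f = a * f + (1.0 - a) * i[n]        # high-pass prefilter (6.1)
        v = v + i[n] - f + I_b              # perfect integration plus bias (6.2)
        if v >= theta:                      # threshold crossing (6.3)
            spikes[n] = 1
            v = 0.0                         # reset (6.4)
            theta = rng.gamma(m, xbar / m)  # fresh random threshold (6.5)
    return spikes

# Zero input: the bias alone drives spiking; mean ISI ~ xbar / I_b steps.
spikes = random_threshold_if(np.zeros(2000), tau_f=20.0, I_b=0.51, m=2, xbar=10.0)
```

With the bias and threshold values above, the expected interval is roughly 10/0.51, i.e. about 20 steps, matching the 20 msec mean ISI used in the text.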
Figure 7: Comparison of two neural models with matched spontaneous ISI and driven response characteristics. The left side (A1–C1) shows results for the linear adaptive threshold model (see equations 2.1–2.4), while the right side (A2–C2) shows an integrate-and-fire-based model with a random threshold (see equations 6.1–6.5). The model parameters were adjusted to yield similar first-order ISI distributions (A1, A2) and similar frequency response characteristics (B1, B2). However, the higher-order interval statistics, as characterized by the variance-to-mean ratio of the kth-order interval distribution, are quite different (C1, C2). The linear adaptive threshold model exhibits strong regularizing effects at large interval orders, whereas the random threshold model has a variance-to-mean ratio that is independent of interval order.
6.3 Comparison of Response Characteristics. The response properties of the stochastic integrate-and-fire model are shown in Figure 7 for τ_f = 20, I_b = 0.51, m = 2, and x̄ = 10. The mean and variance of the spontaneous ISI distribution (see Figure 7A2) are almost identical to those of the adaptive threshold model (see Figure 7A1). Also, the frequency response characteristics of the two models are very similar (see Figures 7B1 and 7B2). However, the random threshold model has no memory effects in the ISI sequence. Hence, for spontaneous spike activity, each interspike interval is independent of the previous interval. For such a renewal process model, both the mean and variance of the kth-order interval distribution grow linearly with interval order k, and hence the variance-to-mean ratio is independent of k (see Figure 7C2). Thus, we see that the two models have almost identical response characteristics, except for their long-term regularity as measured by the kth-order interval variance-to-mean ratios.

6.4 Comparison of Signal Detectability. We now provide a weak input signal to both model neurons and evaluate how reliably the signal can be detected in the output spike train. Specifically, we consider a single-cycle sinusoidal input signal with amplitude A and duration D, satisfying the boundary conditions that the stimulus level and slope are zero at the beginning and end of the stimulus cycle. In discrete time, the input signal can be represented as
i[n] = A[1 - cos(2πn/D)].
(6.8)
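The single-cycle stimulus of equation 6.8 takes only a few lines, with D and A set to the values used in the text:

```python
import numpy as np

D, A = 1000, 0.25
n = np.arange(1, D + 1)
i_stim = A * (1.0 - np.cos(2.0 * np.pi * n / D))  # equation 6.8

# The waveform starts and ends at zero and peaks at 2A mid-cycle.
```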
In order to highlight the effects of long-term spike train regularization, we consider the case where the stimulus duration spans multiple interspike intervals. The mean interspike interval for the two matched models is 20 msec, as determined from a 10 s interval of simulated baseline activity with no stimulus present. In the following example, we consider an input signal with duration D = 1000 msec, such that on average, about 50 spikes occur during a stimulus cycle. The stimulus amplitude is chosen to be A = 0.25. The average response to 1000 presentations of this stimulus is shown in Figure 8A1 for the linear adaptive threshold model and in Figure 8A2 for the random threshold model. In both cases, the response is sinusoidal with an amplitude of approximately 3 spikes per second. Note that the phase is shifted by approximately 90 degrees relative to the stimulus. This is because the neurons are operating as differentiators at this stimulus frequency and are thus responding to the slope of the stimulus rather than its absolute magnitude. As can be seen by comparing Figures 8A1 and 8A2, there is no obvious difference in response gain or variability in the poststimulus rate histograms, nor is there any obvious difference in the short-term variability of the individual spike trains shown in the dot raster displays. This similarity in the response properties of the two models is not surprising, given that they were tuned to have matching characteristics. Although the properties
of the two models are similar on average, the detectability of the stimulus on a trial-by-trial basis is dramatically different. The stimulus does not change the mean number of spikes observed during a trial. Rather, there is a slight increase in the spike count during the first half of the trial and a slight decrease during the last half. For this particular stimulus amplitude, there is a mean increase of one spike in the first half of the trial and a mean decrease of one spike in the second half of the trial, relative to the baseline level. To characterize the detectability of this small change in the spike train statistics, we presented each neuron model with a set of randomized trials, half of which contained a stimulus (see equation 6.8) and half of which did not. The detection task requires making a prediction on a trial-by-trial basis of whether the stimulus was present, based on the binary spike train data s_i for that trial. Since the mean number of spikes does not change in the presence of the stimulus, this decision cannot be based on the total spike count. To detect the stimulus optimally, the spike train data are passed through a filter with an impulse response matched to the expected temporal profile of the signal (Kay, 1998). In this case, the matched filter m is well approximated by a single-cycle sinusoid with zero phase shift,

m[n] = sin(2πn/D),
(6.9)
and the output of the matched filter z_i on trial i is

z_i = Σ_{n=1}^{D} m[n] s_i[n].

(6.10)
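A sketch of the matched-filter statistic of equations 6.9 and 6.10 (the spike train `early` below is a hypothetical example of ours, not data from the paper):

```python
import numpy as np

def matched_filter_output(spikes, D):
    """Matched-filter statistic (equations 6.9-6.10):
    z = sum over n = 1..D of sin(2*pi*n/D) * s[n]."""
    n = np.arange(1, D + 1)
    m = np.sin(2.0 * np.pi * n / D)
    return float(np.dot(m, np.asarray(spikes[:D], dtype=float)))

# Extra spikes in the first half-cycle, where sin(2*pi*n/D) > 0,
# push the statistic positive.
early = np.zeros(1000)
early[100:500:20] = 1
z = matched_filter_output(early, D=1000)
```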
Figures 8B1 and 8B2 show distributions of the matched filter output for the two models, in both the presence and absence of the stimulus. For both models, the matched filter output has a mean near zero when no stimulus is present and a mean of approximately 1.5 when there is a stimulus. Although the shift in the mean is approximately the same for both models, the width of the distribution is significantly narrower for the adaptive threshold model (s.d. ≈ 0.6) than for the random threshold model (s.d. ≈ 3.4). This difference in variability has a significant impact on weak signal detectability. The output of the matched filter z_i can be used as a test statistic for binary hypothesis testing, in which the goal is to decide on a trial-by-trial basis whether a stimulus has occurred based on the value of z_i for that trial. In this simple case, the problem can be handled using the classical Neyman-Pearson approach (Kay, 1998). For each trial i, the filter output z_i is compared with a threshold value z_thresh. If the filter output is greater than the threshold value, the detector classifies the trial as a stimulus trial. Depending on the threshold level that is selected, there will be some detection probability P_d of correctly classifying a trial that contained a stimulus as a stimulus trial and some false alarm probability P_fa of misclassifying a trial without
Figure 8: Comparison of signal detectability for the linear adaptive threshold (A1–C1) and random threshold (A2–C2) models. The upper panels (A1, A2) show the stimulus waveform (arbitrary units), dot raster displays of representative spike activity, and poststimulus rate histograms computed by averaging spike activity over 1000 stimulus trials. A solid white line shows a sinusoidal fit to the response. The middle panels (B1, B2) show histograms of the matched filter output for trials with and without a stimulus. The bottom panels (C1, C2) illustrate the dramatic improvement in detectability for signals encoded by the adaptive threshold model relative to the random threshold model, as measured by the ROC curves. The dashed lines indicate chance-level performance.
a stimulus as a stimulus trial. If the threshold value is moved lower to improve detection efficiency, the false alarm probability also increases. This trade-off between detection probability and false alarm probability can be summarized by the receiver operating characteristic (ROC) of the detector, which is a parametric plot of P_d versus P_fa as a function of the threshold z_thresh. The ROC plots for the two neuron models are shown in Figures 8C1 and 8C2. The ability to reliably detect the presence of the stimulus is much better for signals encoded by the adaptive threshold model. For example, if the threshold is set at a level corresponding to a false alarm probability of 10%, the probability of detecting the stimulus is 90% in spike trains arising from the adaptive threshold model but only 19% in spike trains from the random threshold model.

7 Conclusion
Spike trains that appear irregular on short timescales can exhibit longer-term regularity in their firing pattern. This regularity arises from the correlation structure of the ISI sequence and involves memory effects spanning multiple interspike intervals (Ratnam & Nelson, 2000). This form of long-term spike train regularization can arise from the refractory effects associated with a dynamic spike threshold (Chacron et al., 2001). The functional relevance of spike train regularity is supported by our experimental data on prey capture behavior of weakly electric fish. In our analysis of electrosensory afferents (Ratnam & Nelson, 2000), we found that spike train regularity was most pronounced on timescales of about 40 interspike intervals, which corresponds to a time period of about 175 msec. This timescale is well matched to the relevant timescales for prey capture behavior in these animals (Nelson & MacIver, 1999; MacIver, Sharabash, & Nelson, 2001). The timescale approximately matches the duration that the electrosensory image of a small prey would activate a single electrosensory afferent fiber. We speculate that spike train regularization on the timescale of tens to hundreds of milliseconds may play a key role in enhancing the detectability of natural sensory signals, not just in the electrosensory system but in other systems as well. Regularizing effects, although not as pronounced, have been observed on similar timescales in auditory afferents (Lowen & Teich, 1992). Whether such effects exist in other systems is largely unknown because the appropriate analyses of multiscale spike train variability have not been carried out. The effects of spike train regularization can be most readily observed in experimental data by analyzing the variance-to-mean ratio of the kth-order interval distributions I_k. For a renewal process, which lacks correlations in the interval sequence, the variance-to-mean ratio is constant for all interval orders k.
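One simple estimator of the kth-order interval variance-to-mean ratio uses non-overlapping sums of k consecutive ISIs (the original analysis may use a different windowing; the gamma-distributed ISIs below are synthetic, purely for illustration of the renewal case):

```python
import numpy as np

def kth_order_var_to_mean(isis, k):
    """Variance-to-mean ratio of the kth-order interval distribution I_k,
    estimated from non-overlapping sums of k consecutive ISIs."""
    isis = np.asarray(isis, dtype=float)
    n = (len(isis) // k) * k
    Ik = isis[:n].reshape(-1, k).sum(axis=1)
    return float(Ik.var() / Ik.mean())

# Uncorrelated (renewal) ISIs: gamma with mean 20, CV ~ 0.71.
# For a renewal process the ratio is flat in k.
rng = np.random.default_rng(1)
isis = rng.gamma(2.0, 10.0, size=200_000)
r1 = kth_order_var_to_mean(isis, 1)
r10 = kth_order_var_to_mean(isis, 10)
```

For these renewal ISIs both r1 and r10 sit near the same value; a decreasing trend with k, as in Figure 7C1, would signal regularization.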
A decrease in the variance-to-mean ratio with increasing k indicates a regularizing effect, whereas an increase indicates that the spike train is becoming more irregular. Asymptotically, similar relationships hold for the analysis of spike count distributions, where the variance-to-mean ratio is
referred to as the Fano factor (Fano, 1947; Gabbiani & Koch, 1998). However, because the Fano factor decreases initially even for a renewal process, the effects of intermediate-term spike train regularization can be overlooked in a Fano factor analysis. Therefore, we recommend the analysis of kth-order interval distributions as the best approach for characterizing spike train variability on multiple timescales. We have presented a simple model, derived from a more detailed model by Chacron et al. (2000, 2001), that exhibits long-term spike train regularization arising from refractory effects associated with a dynamic spike threshold. Memory effects associated with the threshold dynamics give rise to negative correlations in the ISI sequence; hence, this is a nonrenewal model of spike generation. Many common neural models, including those based on integrate-and-fire dynamics or inhomogeneous Poisson processes, do not produce correlations in the ISI sequence, and hence are classified as renewal models. Recent models of electrosensory afferent dynamics, including our own, fall into the category of renewal process models (Nelson et al., 1997; Kreiman, Krahe, Metzner, Koch, & Gabbiani, 2000). While such renewal models can accurately match the mean and CV of the first-order ISI distribution, as well as the frequency response characteristics of the experimental data, their failure to generate longer-term spike train regularization may make them unsuitable for applications in which it is important to accurately estimate detection thresholds or coding efficiency for weak sensory stimuli. Given that refractory effects are commonplace in neural systems, we suspect that this form of spike train regularization may be more widespread than previously appreciated. Hence, nonrenewal models, such as the one presented here, may have broad applicability when modeling the encoding of weak signals in neuronal spike trains.

Acknowledgments
This research was supported by grants from the National Science Foundation (IBN-0078206) and the National Institute of Mental Health (R01MH49242).

References

Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motoneuron interspike intervals. J. Neurophysiol., 31, 574–587.
Chacron, M. J., Longtin, A., & Maler, L. (2001). Negative interspike interval correlations increase the neuronal capacity for encoding time-dependent stimuli. J. Neurosci., 21, 5328–5343.
Chacron, M. J., Longtin, A., St.-Hilaire, M., & Maler, L. (2000). Suprathreshold stochastic firing dynamics with memory in P-type electroreceptors. Phys. Rev. Lett., 85, 1576–1579.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Fano, U. (1947). Ionization yield of radiations. II. The fluctuations of the number of ions. Phys. Rev., 72, 26–29.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comp., 8, 44–66.
Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (pp. 313–360). Cambridge, MA: MIT Press.
Johnson, D. H. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–299.
Johnson, D. H., Tsuchitani, C., Linebarger, D. A., & Johnson, M. J. (1986). Application of a point process model to responses of cat lateral superior olive units to ipsilateral tones. Hear. Res., 21, 135–159.
Kay, S. M. (1998). Fundamentals of statistical signal processing, Volume II: Detection theory. Englewood Cliffs, NJ: Prentice Hall.
Kreiman, G., Krahe, R., Metzner, W., Koch, C., & Gabbiani, F. (2000). Robustness and variability of neuronal coding by amplitude-sensitive afferents in the weakly electric fish Eigenmannia. J. Neurophysiol., 84, 189–224.
Kuffler, S. W., Fitzhugh, R., & Barlow, H. B. (1957). Maintained activity in the cat's retina in light and darkness. J. Gen. Physiol., 40, 683–702.
Lowen, S. B., & Teich, M. C. (1992). Auditory-nerve action potentials form a nonrenewal process over short as well as long time scales. J. Acoust. Soc. Am., 92, 803–806.
MacIver, M. A., Sharabash, N. M., & Nelson, M. E. (2001). Prey-capture behavior in gymnotid electric fish: Motion analysis and effects of water conductivity. J. Exp. Biol., 204, 534–557.
Moortgat, K. T., Keller, C. H., Bullock, T. H., & Sejnowski, T. J. (1998). Submicrosecond pacemaker precision is behaviorally modulated: The gymnotiform electromotor pathway. Proc. Natl. Acad. Sci., 95, 4684–4689.
Nelson, M. E., & MacIver, M. A. (1999). Prey capture in the weakly electric fish Apteronotus albifrons: Sensory acquisition strategies and electrosensory consequences. J. Exp. Biol., 202, 1195–1203.
Nelson, M. E., Xu, Z., & Payne, J. R. (1997). Characterization and modeling of P-type electrosensory afferent responses to amplitude modulations in a wave-type electric fish. J. Comp. Physiol. A, 181, 532–544.
Ratnam, R., & Nelson, M. E. (2000). Non-renewal statistics of electrosensory afferent spike trains: Implications for the detection of weak sensory signals. J. Neurosci., 20, 6672–6683.
Scheich, H., Bullock, T. H., & Hamstra, R. H., Jr. (1973). Coding properties of two classes of afferent nerve fibers: High-frequency electroreceptors in the electric fish Eigenmannia. J. Neurophysiol., 36, 39–60.
Stein, R. B. (1967). The frequency of nerve action potentials generated by applied currents. Proc. Roy. Soc. Lond., B167, 64–86.

Received May 25, 2001; accepted October 31, 2001.
LETTER
Communicated by Jonathan Victor
Spatiotemporal Spike Encoding of a Continuous External Signal

Naoki Masuda
[email protected]
Department of Mathematical Engineering and Information Physics, Graduate School of Engineering, University of Tokyo, Tokyo, Japan

Kazuyuki Aihara
[email protected]
Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan

Interspike intervals of spikes emitted from an integrator neuron model of sensory neurons can encode input information represented as a continuous signal from a deterministic system. If a real brain uses spike timing as a means of information processing, other neurons receiving spatiotemporal spikes from such sensory neurons must also be capable of treating information included in deterministic interspike intervals. In this article, we examine functions of neurons modeling cortical neurons receiving spatiotemporal spikes from many sensory neurons. We show that such neuron models can encode stimulus information passed from the sensory model neurons in the form of interspike intervals. Each sensory neuron connected to the cortical neuron contributes equally to the information collection by the cortical neuron. Although the incident spike train to the cortical neuron is a superimposition of spike trains from many sensory neurons, it need not be decomposed into spike trains according to the input neurons. These results are also preserved for generalizations of the sensory neurons, such as a small amount of leak, noise, inhomogeneity in firing rates, or biases introduced in the phase distributions.

1 Introduction
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1599–1628 (2002).

Information processing mechanisms in the brain have been disputed for a long time. The firing rate is understood to represent information in the external stimuli (Tuckwell, 1988; Koch, 1999). However, rate coding works too slowly to explain some quick functions that occur in the brain, because the rate calculation involves temporal averaging of spike signals. Consequently, the brain may also make use of precise spike timing in information processing. Temporal spike coding enables the brain to treat information at increased speed with decreased energy. In addition, precisely timed reproducible spiking in response to dynamic stimuli has been reported with a precision of milliseconds in rat neocortex (Mainen & Sejnowski, 1995), the fly visual system (van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997), and the cat lateral geniculate nucleus of the thalamus (LGN) (Warzecha & Egelhaaf, 1999; Reinagel & Reid, 2000). Related mechanisms that enable neural networks to enhance firing time precision, such as those of inhibitorily connected integrate-and-fire (IF) neurons (Tiesinga & Sejnowski, 2001; Tiesinga, Fellous, José, & Sejnowski, 2001), have also been explored. Theoretical computation models that use spike timing in neural networks have also been proposed (Judd & Aihara, 1993, 2000; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996). Brains may be able to select between strategies and choose the one that is site and situation appropriate (Riehle, Grün, Diesmann, & Aertsen, 1997). Firing time and interspike intervals (ISIs) can, at least theoretically, play important roles in information encoding. The usefulness of ISI time series for information processing was shown by Sauer (1994). It is possible to reconstruct the attractor of the external stimuli from the ISI time series of an IF neuron when the neuron receives continuous chaotic signals (Sauer, 1994, 1997). Sauer used a simple IF neuron that possessed an internal state x(t) equal to the perfect integration of the external input S(t) with a positive bias until it fires. Once firing occurs, x(t) is reset to the resting potential to restart the integration. The chaotic input S(t) is generated from the Rössler equations (Rössler, 1976). We assume the firing threshold and the resting potential to be x = 1 and x = 0, respectively. The ith firing time T_i is determined from the (i-1)th firing time T_{i-1} and S(t) by

x(T_i) = ∫_{T_{i-1}}^{T_i} S(t) dt = 1.

(1.1)
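A discrete-time sketch of the perfect integrator of equation 1.1 (the time step dt, threshold, and constant test input are our illustrative choices):

```python
import numpy as np

def integrate_and_fire_isis(S, dt, theta=1.0):
    """ISIs of a perfect integrate-and-fire neuron (equation 1.1):
    x integrates S(t); when x reaches theta the neuron fires and
    x is reset to the resting potential 0."""
    x, t_last, isis = 0.0, 0.0, []
    for n, s in enumerate(S):
        x += s * dt
        if x >= theta:
            t_fire = (n + 1) * dt
            isis.append(t_fire - t_last)
            t_last, x = t_fire, 0.0
    return np.array(isis)

# Constant input S = 2 with threshold 1 gives ISIs of 0.5 time units.
isis = integrate_and_fire_isis(np.full(100, 2.0), dt=0.25)
```

Feeding in a chaotic S(t), as in Sauer's setting, would make the ISI sequence itself carry the attractor structure.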
The strange attractor generating S(t) can be reconstructed from the delay coordinates with the ISI time series {t_i ; t_i = T_i - T_{i-1}} (Sauer, 1994). It is also possible to predict {t_i} with the local prediction models for chaotic time series (Sugihara & May, 1990). Sauer (1997) showed that the generic IF neuron maintains a one-to-one correspondence between the system states and the ISI vectors of a sufficiently large dimension. This guarantees attractor reconstruction from the ISI time series. The dynamical characteristics of the original attractors, such as the correlation dimension, can also be calculated from the ISI time series (Castro & Sauer, 1997b). The firing threshold corresponds to a Poincaré section (Suzuki, Aihara, Murakami, & Shimozawa, 2000). The attractor can be reconstructed from ISIs since an ISI represents 1/S(t) averaged over the ISI (Racicot & Longtin, 1997). An increased firing threshold implies averaging of 1/S(t) over longer ISIs, so the ISI reconstruction degrades (Sauer, 1994; Racicot & Longtin, 1997). ISI reconstruction methods have also been studied for other types of neurons (Racicot & Longtin, 1997; Castro & Sauer, 1999) and in experiments with the CA1 region of the rat hippocampus (Schiff, Jerger, Chang, Sauer, & Aitken, 1994), rat cutaneous
Spatiotemporal Spike Encoding of a Signal
1601
afferents of hairy thigh skin (Richardson, Imhoff, Grigg, & Collins, 1998), and cricket sensory neurons (Suzuki et al., 2000). The continuous signal S(t) in these studies can be interpreted to model dynamically changing external stimuli such as visual scenes and sounds. Consequently, the focus has been ISI encoding by sensory neurons. The next logical step is to study the meaning of ISIs for cortical neurons that receive spatiotemporal spike inputs carrying deterministic information from sensory neurons. Experimental results also suggest that spike timing plays an important role in nonsensory neurons, such as the mushroom body and β-lobe neurons that receive spikes from other neurons (MacLeod, Bäcker, & Laurent, 1998). Experimental results on the synchronization of neurons (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989; Laurent & Davidowitz, 1994; Vaadia et al., 1995; Riehle et al., 1997; Stopfer, Bhagavan, Smith, & Laurent, 1997; Stopfer & Laurent, 1999; Brecht, Singer, & Engel, 1999; Rodriguez et al., 1999) are also relevant to the importance of spike timing. Accordingly, it is worth examining attractor reconstruction with the ISI time series generated by spike-driven neurons to gain insight into the information transmission and processing of cortical neurons rather than sensory neurons. The simplest way to approach this problem is to approximate spike inputs by continuous signals. However, the validity of this generalization is nontrivial for the following reason. Cortical neurons receive spikes from many sensory neurons and feedback spikes from other cortical neurons. These neurons generally must receive many input spikes to fire, depending on nerve membrane leakage, the firing rates of incident spike trains, and many other variables. The input spikes arrive from many other neurons that emit different spike trains. It is not at all clear whether the superimposition of these different spike trains can be replaced approximately by a single continuous signal.
The incident spike train to a cortical neuron includes superimposed spike trains arriving from many sensory neurons. Even if the ISI of each input neuron contains deterministic information about stimuli, the ISI of the superimposed incident spike train does not necessarily preserve the deterministic information, because during the ISI of one input neuron there generally exist spikes from other input neurons (Ichinose & Aihara, 1998; Isaka, Yamada, Takahashi, & Aihara, 2001). Moreover, it is mathematically quite difficult for a cortical neuron to decompose incident spikes according to their source neurons. This spike superimposition problem is specific to spike-driven neurons. Real neurons seem to treat the stimulus information without spike decomposition. In the light of these observations, we explore attractor reconstruction by the ISI time series of a single cortical neuron receiving spatiotemporal spike inputs from many sensory neurons. Sensory neurons are modeled by IF neurons (Sauer, 1994, 1997) for the sake of simplicity, but we also discuss some generalizations. For postsynaptic neurons, we use the IF neurons, the leaky integrate-and-fire (LIF) neurons (Tuckwell, 1988; Racicot & Longtin, 1997; Castro & Sauer, 1999), and the FitzHugh-Nagumo (FHN) neurons
1602
Naoki Masuda and Kazuyuki Aihara
(FitzHugh, 1961; Nagumo, Arimoto, & Yoshizawa, 1962; Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). We show that the original attractor can be reconstructed from the ISI in the cases of the IF and LIF spike-driven neurons, without assuming strong synapses and without decomposing superimposed spike trains. Consequently, we see that the spike-driven neuron collects small contributions from many sensory neurons to gain the stimulus information. The FHN neuron differs intrinsically from the other two models and cannot preserve deterministic information well, as has been shown for the continuous-input models (Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). We consider spike effects on excitable neurons under ordinary conditions; in so doing, we focus exclusively on the excitable neuron models and do not consider the oscillatory neuron models, which have been widely studied. We cover ISI reconstruction of single neurons driven by spatiotemporal spikes from many sensory model neurons in section 2, using the IF, LIF, and FHN neurons. The mechanisms underlying the ISI reconstruction by these neurons are discussed in section 3. We also compare information processing of model neurons receiving spatiotemporal spike inputs with that of continuous-input neurons.

2 Interspike Interval Reconstruction by Spike-Driven Neurons

2.1 Sensory Neurons. Throughout this article, we study two-layered neural networks. For simplicity, we call the neurons in the first layer sensory neurons and those in the second layer cortical neurons. The sensory neurons are assumed to be IF neurons receiving a common continuous input S(t), which represents external stimuli. We generate S(t) from the X variable of the Rössler attractor (Sauer, 1994, 1997; Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). The Rössler equations (Rössler, 1976) are represented by

dX/dt = a(−Y − Z),
dY/dt = a(X + 0.36 Y),    (2.1)
dZ/dt = a(0.4 + Z(X − 4.5)),
and the parameters are chosen so that the Rössler system generates a typical chaotic signal. The velocity of points on orbits is modulated by a, with a = 34.48. In this situation, the autocorrelation of the Rössler system decays to a small enough value in 5 to 10 times the usual ISI duration. Accordingly, we can focus on information processing of neurons over this timescale. If we chose a larger value for a in equation 2.1, we would concern ourselves with shorter timescales. The external stimulus S(t) is defined as follows:

S(t) = 2.7 × 10^{−4} + 2.2 × 10^{−5} X.    (2.2)

We chose the parameters in equation 2.2 so that S(t) > 0 almost always; S(t) may represent external stimuli that do not take negative values. Biologically,
sensory inputs are rarely rigorously chaotic. We use chaotic S(t) because we are testing short-term information coding rather than classical rate coding, which requires a long time to obtain firing rates by averaging fluctuations. The prediction errors are significantly small only for a short prediction period because of the exponential information decay of chaotic signals. We are interested in information processing performed on the timescale over which prediction is possible. As will be shown later, the prediction errors of the ISI time series can be used to measure the performance of the external signal reconstruction. We think that our results can be extended to situations with more general external signals such as filtered gaussian noise. Examination of these situations will involve other measures for performance evaluation, such as correlation functions between external signals and estimated signals. We assume that there are n_1 sensory neurons. We denote by T_{i,j} (1 ≤ i ≤ n_1) the jth firing time of the ith neuron. By setting the threshold and the resting potential to 1 and 0, respectively, the dynamics of the membrane potential x_i(t) of the ith sensory neuron between the (j−1)th firing and the jth firing is described by

x_i(t) = ∫_{T_{i,j−1}}^{t} S(t′) dt′.    (2.3)

As in equation 1.1, we know from equation 2.3 that T_{i,j} (j ≥ 2) is determined from T_{i,j−1} and S(t) by

∫_{T_{i,j−1}}^{T_{i,j}} S(t) dt = 1,    (j ≥ 2).

The jth ISI of the ith neuron is defined by t_{i,j} = T_{i,j} − T_{i,j−1}, where T_{i,0} ≡ 0. We do not impose any further assumptions on the sensory neurons; the initial phases x_i(0) (1 ≤ i ≤ n_1) are chosen randomly:

x_i(0) = θ_i,    (1 ≤ i ≤ n_1),    (2.4)

where θ_i is randomly chosen from the uniform distribution on [0, 1]. With equation 2.4, we observe that the initial firing time T_{i,1} is given by

∫_{0}^{T_{i,1}} S(t) dt = 1 − θ_i.
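The sensory layer of equations 2.1 through 2.4 can be sketched as follows. The Rössler right-hand side matches equation 2.1, but for a fast-running demonstration we use a = 1 and an illustrative rescaling of the signal instead of equation 2.2's small amplitudes, which would require far longer runs; all function names are our own.

```python
import numpy as np

def rossler_deriv(u, a=1.0):
    # Right-hand side of equation 2.1; a rescales time (the text uses a = 34.48).
    X, Y, Z = u
    return a * np.array([-Y - Z, X + 0.36 * Y, 0.4 + Z * (X - 4.5)])

def rk4_step(u, dt, a=1.0):
    k1 = rossler_deriv(u, a)
    k2 = rossler_deriv(u + 0.5 * dt * k1, a)
    k3 = rossler_deriv(u + 0.5 * dt * k2, a)
    k4 = rossler_deriv(u + dt * k3, a)
    return u + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def sensory_firing_times(n1, dt=0.02, n_steps=50_000, seed=0):
    """Firing times of n1 perfect IF sensory neurons sharing one chaotic S(t).

    Initial phases are random, as in equation 2.4.  The signal here is
    S(t) = 0.05 + 0.004 X(t), an illustrative stand-in for equation 2.2.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n1)          # x_i(0) = theta_i
    u = np.array([1.0, 1.0, 1.0])
    spikes = [[] for _ in range(n1)]
    for k in range(n_steps):
        S = max(0.05 + 0.004 * u[0], 0.0)  # kept nonnegative, as S(t) should be
        x += S * dt
        for i in np.nonzero(x >= 1.0)[0]:
            spikes[i].append((k + 1) * dt)
            x[i] = 0.0                     # reset to the resting potential
        u = rk4_step(u, dt)
    return spikes
```

Each neuron thus samples the same signal at instants staggered by its own random phase, which is the property exploited below.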
The randomness in the initial phases will turn out to be important for information processing in terms of the attractor reconstruction, and the result indicates that noise plays a positive role in the brain; noise drives the phase distribution to become approximately uniform so that the sensory layer can
Figure 1: Network architecture when there is only one cortical neuron. S(t) denotes the continuous external stimulus, x_i denotes the membrane potential of the ith sensory neuron, and v denotes the membrane potential of the cortical neuron.
collect versatile information about S(t) as a whole. This noise effect differs from the effects of deterministic chaos and stochastic resonance. We consider information coding by a single cortical neuron in this article. The network architecture is schematically shown in Figure 1. The cortical neuron has a membrane potential denoted by v(t), and it receives spatiotemporal spikes from n_1 sensory neurons. The ISI of the spike train out of each sensory neuron is assumed to be deterministic. However, the ISI of the input spikes to the cortical neuron is no longer deterministic because of spike superimposition. In response to the superimposed spike inputs, the cortical neuron emits a spike train, the determinism of which is discussed below. We examine three models of cortical neurons: the IF, LIF, and FHN neurons.

2.2 The Integrate-and-Fire Neuron. Here, we investigate the dynamics of the IF cortical neuron. The membrane potential v(t) of the cortical neuron changes by the same mechanism observed for x_i(t); only the input differs. The cortical neuron is assumed to receive instantaneous spikes with homogeneous amplitude ε at times T_{i,j}. Consequently, its dynamics is described by

dv/dt = Σ_{i,j} ε δ(t − T_{i,j}),    (2.5)
where δ is the delta function, defined by δ(0) = 1 and δ(t) = 0 for t ≠ 0. We denote by T'_k the kth firing time of the cortical neuron. The dynamics of v(t) (t ∈ [T'_{k−1}, T'_k]) is represented by

v(t) = ∫_{T'_{k−1}}^{t} Σ_{i,j; T'_{k−1} ≤ T_{i,j} ≤ t} ε δ(t′ − T_{i,j}) dt′.

The next firing time T'_k is the instant when v(t) reaches 1, and the ISI of the output spikes from the cortical neuron is defined by

t'_k = T'_k − T'_{k−1},    (k ≥ 1),

where T'_0 ≡ 0. To examine the property of the superimposed spike inputs, we suppose that 1 > θ_1 ≥ θ_2 ≥ ⋯ ≥ θ_{n_1} ≥ 0 without loss of generality. It follows that

0 < T_{1,1} ≤ T_{2,1} ≤ ⋯ ≤ T_{n_1,1} < T_{1,2} ≤ T_{2,2} ≤ ⋯,    (2.6)
since

∫_{T_{i,j}}^{T_{i+1,j}} S(t) dt
  = ∫_{T_{i+1,j−1}}^{T_{i+1,j}} S(t) dt + ∫_{T_{i,j−1}}^{T_{i+1,j−1}} S(t) dt − ∫_{T_{i,j−1}}^{T_{i,j}} S(t) dt
  = 1 + ∫_{T_{i,j−1}}^{T_{i+1,j−1}} S(t) dt − 1
  = ∫_{T_{i,j−1}}^{T_{i+1,j−1}} S(t) dt = ⋯ = ∫_{T_{i,1}}^{T_{i+1,1}} S(t) dt
  = θ_i − θ_{i+1} ≥ 0,    (1 ≤ i ≤ n_1 − 1),    (2.7)

and

∫_{T_{n_1,j}}^{T_{1,j+1}} S(t) dt = ∫_{T_{n_1,1}}^{T_{1,2}} S(t) dt = ∫_{T_{1,1}}^{T_{1,2}} S(t) dt − ∫_{T_{1,1}}^{T_{n_1,1}} S(t) dt = 1 − (θ_1 − θ_{n_1}) > 0.    (2.8)
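Equations 2.7 and 2.8 say that the initial phase offsets survive integration: the integral of S between the jth firings of adjacent neurons equals θ_i − θ_{i+1}. A small grid-based check (our construction, using an arbitrary positive test signal rather than the paper's chaotic one):

```python
import numpy as np

# Three perfect integrators with decreasing initial phases share one positive
# signal S(t).  The integral of S between the jth firings of neurons i and i+1
# should telescope down to theta_i - theta_{i+1} (equation 2.7).
dt = 1e-4
t = np.arange(0.0, 12.0, dt)
S = 1.0 + 0.5 * np.sin(t)
C = np.concatenate([[0.0], np.cumsum(S) * dt])   # C[k] ~ integral of S up to t[k]

theta = np.array([0.9, 0.6, 0.2])                # theta_1 >= theta_2 >= theta_3
x = theta.copy()
spike_steps = [[] for _ in theta]
for k in range(len(t)):
    x += S[k] * dt
    for i in np.nonzero(x >= 1.0)[0]:
        spike_steps[i].append(k + 1)
        x[i] = 0.0

j = 4                                            # compare the fifth firing of each neuron
gaps = [C[spike_steps[i + 1][j]] - C[spike_steps[i][j]] for i in range(2)]
```

On the grid, each gap matches θ_i − θ_{i+1} up to the discretization error.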
We denote by T̄_j and t̄_j = T̄_j − T̄_{j−1} the time of the jth input spike to the cortical
neuron and its ISI, respectively. With equation 2.6, T̄_j and t̄_j are represented by

T̄_{i+n_1(j−1)} = T_{i,j},
t̄_{i+n_1(j−1)} = T_{1,j} − T_{n_1,j−1}    (i = 1),
t̄_{i+n_1(j−1)} = T_{i,j} − T_{i−1,j}    (2 ≤ i ≤ n_1),

where we assume the delay from the sensory neurons to the cortical neuron to be 0 or, equivalently, homogeneous. In spite of the determinism of the ISIs {t_{i,j} : j ∈ ℕ} for 1 ≤ i ≤ n_1, the time series {t̄_j} is not deterministic because of the random choice of θ_i. We treat the information in {T̄_j} and {t̄_j} without decomposing the superimposed spikes into groups corresponding to the spike sources with determinism. We see from equation 2.5 that the cortical neuron can fire only at an instant when it receives a spike. Given T'_{k−1} = T̄_{j_0}, T'_k is determined by

v(T'_k) = ∫_{T'_{k−1}}^{T'_k} Σ_j ε δ(t − T̄_j) dt = 1.

Accordingly, T'_k = T̄_{j_0+N}, where N is the number of input spikes necessary for the cortical neuron to fire. We can assume without loss of generality that ε satisfies

N = 1/ε.
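Under this reading — spike amplitude ε with N = 1/ε, so the cortical IF neuron fires on every Nth merged input spike — the integral of S over one cortical ISI should average N/n_1 (see equations 2.10 and 2.13 below). A sketch with an illustrative positive stand-in stimulus, not the Rössler-driven signal of the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, N = 40, 25                                   # sensory neurons; N = 1/eps inputs per output
dt = 5e-3
t = np.arange(0.0, 400.0, dt)
S = 0.05 * (1.0 + 0.5 * np.sin(0.37 * t))        # a positive stand-in stimulus
C = np.concatenate([[0.0], np.cumsum(S) * dt])   # running integral of S

x = rng.uniform(0.0, 1.0, n1)                    # random initial phases theta_i
merged = []                                      # step indices of all input spikes, in order
for k in range(len(t)):
    x += S[k] * dt
    for i in np.nonzero(x >= 1.0)[0]:
        merged.append(k + 1)
        x[i] = 0.0
cortical = np.array(merged[N - 1::N])            # one output spike per N merged inputs

gaps = np.diff(C[cortical])                      # integral of S over each output ISI
```

The mean of `gaps` comes out near N/n_1 = 0.625, the effective threshold of the equivalent continuous-input IF neuron.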
Next, we define

D_j = ∫_{T̄_j}^{T̄_{j+1}} S(t) dt > 0.    (2.9)

Using equations 2.7 and 2.8, we have

Σ_{j=j_0}^{j_0+n_1−1} D_j = 1,    (∀ j_0 ≥ 1).

T'_k is determined from T'_{k−1} and S(t) by

∫_{T'_{k−1}}^{T'_k} S(t) dt = ∫_{T̄_{j_0}}^{T̄_{j_0+N}} S(t) dt = Σ_{j=j_0}^{j_0+N−1} D_j.    (2.10)
The distribution of Σ_{j=j_0}^{j_0+N−1} D_j can be determined from the uniform distribution of the θ_i. Using equations 2.7 through 2.9, we can assume that

Σ_{j=j_0}^{j_0+N−1} D_j = θ_1 − θ_{N′+1} + (N − N′)/n_1,    (2.11)

where N′ = N mod n_1. Since θ_1, θ_2, …, θ_{n_1} are the reordered series of random variables drawn from the uniform distribution on [0, 1], we have

Prob(θ_1 − θ_{N′+1} ∈ [x, x + dx]) = (n_1 − 1) C(n_1−2, N′−1) x^{N′−1} (1 − x)^{n_1−N′−1} dx,    (1 ≤ N′ < n_1),    (2.12)

where C(n, k) denotes the binomial coefficient. Equations 2.11 and 2.12 result in

E(Σ_{j=j_0}^{j_0+N−1} D_j) = N/n_1    (2.13)

and

V(Σ_{j=j_0}^{j_0+N−1} D_j) = N′(n_1 − N′) / (n_1^2 (n_1 + 1)),    (2.14)
where E(·) and V(·) are the mean and variance of a random variable, respectively. We note that equations 2.13 and 2.14 do not depend on j_0. Equations 2.10 and 2.13 indicate that the cortical neuron works as if it were an IF neuron driven by the continuous input S(t) with the threshold equal to N/n_1. However, we must also note that the ensemble average with respect to realizations is not generally equal to the time average, because ergodicity does not hold once the initial phases are fixed. Strictly speaking, the thresholds are equal to N/n_1 only when the initial phases x_i(0) are distributed equidistantly on [0, 1]. If we introduce inhomogeneity of firing rates so that the firing-rate deviations among neurons are incommensurable with each other, ergodicity will hold. Additive noise in the sensory neurons also realizes ergodicity. Under these conditions, equation 2.13 is valid for every realization, with E(·) interpreted as a time average. Equation 2.14 suggests that this approximation holds with an error proportional to n_1^{−1}. This error is regarded as a temporal deviation in ergodic cases and as a deviation across realizations in nonergodic cases. The cortical neuron is expected to transmit deterministic information passed through spatiotemporal spikes from sensory neurons in most cases.
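Equations 2.13 and 2.14 can be checked by Monte Carlo, under the reading that equation 2.12 describes the phase gap spanned by N′ consecutive input spikes as seen from one input spike (the remaining n_1 − 1 phases uniform); the variable names are ours:

```python
import numpy as np

# When a window opens at one input spike, the other n1 - 1 phases are uniform
# on [0, 1], and the accumulated D_j over the next N' spikes is their N'-th
# order statistic, a Beta(N', n1 - N') variable (equation 2.12).
rng = np.random.default_rng(3)
n1, N = 50, 120
Np = N % n1                                      # N' = N mod n1
trials = 200_000
gap = np.sort(rng.uniform(0.0, 1.0, (trials, n1 - 1)), axis=1)[:, Np - 1]
sums = gap + (N - Np) / n1                       # equation 2.11
mean_pred = N / n1                               # equation 2.13
var_pred = Np * (n1 - Np) / (n1**2 * (n1 + 1))   # equation 2.14
```

The sample mean and variance of `sums` agree with the two predictions to within sampling error.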
The variation represented in equation 2.14 may cause degradation of the attractor reconstruction. The stochastic fluctuation in D_j results in inconstant firing thresholds. However, this effect is small in biologically plausible cases where n_1 ≈ 20,000 and N′ = N ≈ 200 (Koch, 1999). Each sensory neuron roughly encodes information about the inverse signal amplitude 1/S(t) by emitting spikes at the proper times. The cortical neuron collects these pieces of signal information to realize the attractor reconstruction. Superimposition of the incident spike trains received by the cortical neuron is not an obstacle to signal reconstruction. The random initial phase condition is essential, since it enables the sensory neurons to collect various aspects of the signal information by firing spikes at different instants. If all the sensory neurons share an identical initial condition, they always fire simultaneously when noise is absent. Thus, the model reduces to the case of a single sensory neuron with the spike amplitude modified to n_1 ε. In this case, the ISI reconstruction degrades seriously, since the n_1 sensory neurons simultaneously sample the external signal information. Accordingly, the sampling intervals of the sensory neuron layer are large, and the cortical neuron receives only a restricted aspect of the signal information. Biologically speaking, the information contained in θ_i gradually degrades due to other noise sources. However, the phases x_i(t) (1 ≤ i ≤ n_1) of the sensory neurons may remain random in noisy environments. The noise plays a positive role here. To evaluate how deterministic the ISI time series {t'_k} is, we follow the methods usually used in nonlinear analysis of ISI time series (Sauer, 1994, 1997; Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1997b, 1999; Suzuki et al., 2000). We examine the predictability of the time series by the local prediction algorithm (Sugihara & May, 1990).
In this algorithm, we transform {t'_k} with d-dimensional delay coordinates to obtain the points (t'_k, t'_{k−1}, …, t'_{k−d+1}) in a d-dimensional reconstruction space (Sauer, 1994). To perform h-step prediction of t'_{k_0}, we search for the l_0 nearest neighbors (t'_{k_l}, t'_{k_l−1}, …, t'_{k_l−d+1}), 1 ≤ l ≤ l_0, of (t'_{k_0}, t'_{k_0−1}, …, t'_{k_0−d+1}). The prediction t̂'_{k_0+h} is then determined by

t̂'_{k_0+h} = (1/l_0) Σ_{l=1}^{l_0} t'_{k_l+h}.

The effectiveness of this algorithm for {t'_k} is evaluated by the normalized prediction error (NPE), defined by

NPE(h) = ⟨(t̂'_{k_0+h} − t'_{k_0+h})^2⟩^{1/2} / ⟨(m − t'_{k_0+h})^2⟩^{1/2},
where m is the mean of {t'_k}. The denominator of NPE(h) is the expected prediction error when predicting by the average. Accordingly, NPE(h) is close to 1 when the prediction is not accurate, for example, when predicting chaotic
time series with large fluctuations by just the average. However, more accurate prediction of predictable time series, including deterministic series, results in smaller NPE(h) values for sufficiently small values of h. Since stochastic time series with correlations can also yield a small NPE(h), we also calculate the NPE(h) of surrogate data (Theiler, Eubank, Longtin, Galdrikian, & Farmer, 1992; Chang, Schiff, Sauer, Gossard, & Burke, 1994) to distinguish predictability owing to determinism from that of stochastic time series. In the surrogate data method, we compare the NPE(h) of the original time series to that of stochastic time series artificially generated from null hypotheses for the original time series. In this article, we test two kinds of surrogate data: the Fourier-shuffled (FS) surrogate and the amplitude-adjusted Fourier transform (AAFT, also called gaussian-scaled shuffled) surrogate. For the FS surrogate, we first obtain Fourier transform (FT) surrogate data generated from the null hypothesis that the time series is generated by a linear stochastic process. The FS surrogate data are then generated by rearranging the original time series in accordance with the rank order of the FT surrogate data. Consequently, the FS surrogate data preserve the amplitude distribution of the original time series, and at the same time, the autocorrelation of the FS surrogate data is close to that of the original time series. For the AAFT surrogate, the null hypothesis is that the time series is a linear stochastic process observed through a monotonic nonlinear function. The AAFT surrogate data preserve the amplitude distribution, and the autocorrelation is also similar to that of the original time series. The time series is suggested to be deterministic if and only if the NPE(h)s of the FS and AAFT surrogate data are close to 1, whereas the NPE(h) of the original time series is significantly smaller than that of the surrogate data.
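The local prediction scheme and the NPE defined above can be sketched as follows; this is our own minimal implementation of the Sugihara–May predictor, not the authors' code, and the train/test split is an illustrative choice.

```python
import numpy as np

def npe(series, d=4, l0=12, h=1, train_frac=0.9):
    """Normalized prediction error of the Sugihara-May local predictor.

    Each test ISI is predicted h steps ahead as the average of the futures of
    its l0 nearest neighbors in d-dimensional delay coordinates; the RMS error
    is normalized by that of the mean predictor.
    """
    x = np.asarray(series, dtype=float)
    # Delay vectors (x_{k+d-1}, ..., x_k), one per row.
    emb = np.column_stack([x[d - 1 - i: len(x) - i] for i in range(d)])
    n = len(emb)
    split = int(train_frac * n)
    train = emb[: split - h]
    future = x[d - 1 + h: d - 1 + split]          # future[k] is the truth for train[k]
    m = x[: split + d - 1].mean()                 # mean predictor from the training part
    err2, base2 = [], []
    for k in range(split, n - h):
        dist = np.linalg.norm(train - emb[k], axis=1)
        nn = np.argsort(dist)[:l0]
        pred = future[nn].mean()
        truth = x[k + d - 1 + h]
        err2.append((pred - truth) ** 2)
        base2.append((m - truth) ** 2)
    return float(np.sqrt(np.mean(err2) / np.mean(base2)))
```

A deterministic series (e.g., iterates of the logistic map) yields a small NPE, while a random shuffle of the same values yields an NPE near 1.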
To obtain the error bars, we generate 100 surrogate data sets for each kind of surrogate and calculate the standard deviations of the NPE(h)s. The simulation results for the neural network with the IF cortical neuron are shown in Figure 2. We set n_1 = 160 and ε = 0.007, with the result that the cortical neuron receives N = 143 spikes to trigger one spike, a value consistent with values obtained experimentally (Diesmann, Gewaltig, & Aertsen, 1999; Koch, 1999). The timescale is chosen so that the firing rate of the cortical neuron is 20 Hz (Laurent & Davidowitz, 1994; Stopfer et al., 1997; Stopfer & Laurent, 1999). The length of {t'_k} used for the nonlinear prediction and the surrogate data is 4096, and the last 10% of the ISIs is predicted from the first 90% with l_0 = 12. We find that the original ISI time series {t'_k} made from the output spikes of the cortical neuron has an NPE(h) significantly lower than the NPE(h)s of the surrogate data for various values of h (see Figure 2b) and various embedding dimensions (see Figure 2c). Figure 2d shows the attractor reconstructed from {t'_k} using 4096 points; this attractor is similar to the original Rössler attractor. Consequently, nonlinear determinism of {t'_k} is strongly suggested. The IF neuron driven by superimposed spike trains can thus also transmit the deterministic information in the sensory stimuli.
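The FT, FS, and AAFT surrogates described above can be sketched as follows (a minimal implementation in the spirit of Theiler et al., 1992; details such as the Nyquist handling are our own choices):

```python
import numpy as np

def ft_surrogate(x, rng):
    """Phase-randomized (Fourier transform) surrogate of a real time series."""
    X = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, len(X))
    phases[0] = np.angle(X[0])           # keep the mean
    if len(x) % 2 == 0:
        phases[-1] = np.angle(X[-1])     # keep the Nyquist component real
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), n=len(x))

def fs_surrogate(x, rng):
    """Fourier-shuffled surrogate: the original values reordered so that
    their ranks follow the ranks of an FT surrogate."""
    s = ft_surrogate(x, rng)
    out = np.empty_like(x)
    out[np.argsort(s)] = np.sort(x)      # same amplitude distribution as x
    return out

def aaft_surrogate(x, rng):
    """Amplitude-adjusted Fourier transform (gaussian-scaled) surrogate."""
    n = len(x)
    y = np.empty(n)
    y[np.argsort(x)] = np.sort(rng.standard_normal(n))  # rank-wise gaussianization
    s = ft_surrogate(y, rng)
    out = np.empty(n)
    out[np.argsort(s)] = np.sort(x)      # rescale back to the original amplitudes
    return out
```

Both surrogates are permutations of the original values, so they preserve the amplitude distribution exactly while approximately preserving the autocorrelation.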
Figure 2: Simulation results for the neural network with the IF cortical neuron. n_1 = 160 and ε = 0.007. (A) The voltage v(t) of the cortical neuron. (B) NPE(h) for the ISI time series {t'_k} of the output spikes of the cortical neuron (original), the FS surrogate data (FS), and the AAFT surrogate data (AAFT). The error bars for the estimates of NPE(h) show the one-σ ranges based on 100 samples. d = 4, l_0 = 12, and the length of {t'_k} is 4096. (C) NPE(1) with various embedding dimensions for {t'_k} (original), the FS surrogate data (FS), and the AAFT surrogate data (AAFT). (D) The attractor reconstructed from 4096 points of {t'_k} in the delay coordinates with d = 2. Only the first 300 points are connected by lines.
We again note that the ISI of each sensory neuron carries only a small amount of information about S(t). The cortical neuron aggregates the superimposed spikes from many sensory neurons to recover more precise signal information. Figure 3 shows the dependence of the prediction errors on n_1. We have kept ε n_1 constant in generating Figure 3 so that the firing rate of the cortical neuron does not change. Figure 3 suggests that the ISI reconstruction is difficult if the cortical neuron receives incident spikes from n_1 ≤ 40 neurons.

Figure 3: Dependence of the prediction errors NPE(1), NPE(2), and NPE(3) on the number of sensory neurons n_1. d = 4.

To achieve signal reconstruction with a smaller n_1, it is more helpful to apply mechanisms that guarantee a nearly uniform distribution of the θ_i. For example, inhibitory interaction between the sensory neurons would keep the θ_i almost equally separated (Mar, Chow, Gerstner, Adams, & Collins, 1999). In this case, we may reduce the n_1 required to keep the maximum phase separation below d (< 1) from n_1 ∝ d^{−2} to n_1 ∝ d^{−1}. To show that the phase dispersion of the sensory neurons is essential, we examine the effect of uneven initial conditions. Figure 4 shows the dependence of the prediction errors NPE(1), NPE(2), and NPE(3) on the width W of the initial phase distributions. The case with W = 1 corresponds to the totally random initial condition studied above. The prediction performance degrades as W becomes smaller. When W < 1, the quantity Σ_{j=j_0}^{j_0+N−1} D_j appearing in equations 2.13 and 2.14 is subject to systematic deviations. If the D_j in the summation span the interval of width 1 − W where no θ_i can exist, then the summation is nearly equal to 1 − (n_1 − N)W/(n_1 + 1); otherwise, it is nearly equal to NW/(n_1 + 1). Consequently, the cortical neuron effectively fires with frequently changing thresholds. The deviation increases as W decreases, which causes higher prediction errors for smaller W.

2.3 Generalization of Sensory Neurons. Here, we show that the mechanism proposed in section 2.2 also holds when a small amount of leak and inhomogeneity in firing rates is introduced in the sensory neurons.
Figure 4: Dependence of the prediction errors NPE(1), NPE(2), and NPE(3) on the width W of the distribution of the initial phases. d = 4.
First, let us suppose the presence of realistic leaks in the sensory neurons. In this case, the sensory neurons tend to fire synchronously even without feedback coupling because S(t) is common to all the neurons. As a result, the cortical neuron cannot collect the signal information effectively because the phases are no longer dispersed. However, noise applied to the sensory neurons yields successful attractor reconstruction, because additive noise drives the sensory neurons out of synchronization, making the phases more dispersed so that the instants at which they fire differ. If the noise is not so large as to break the determinism of the ISIs, signal reconstruction is also possible for leaky sensory neurons. As an example, let us apply an independent gaussian noise to each sensory neuron with the dynamics represented by equation 2.19. When γ = 0.01 ms^{−1} and the standard deviation of the noise is equal to 0.105 ms^{−1}, we have NPE(1) = 0.40, which indicates determinism of the ISI time series. Next, we consider the effect of inhomogeneity. Real neurons are inhomogeneous even if they belong to the same functional assembly. Let us suppose inhomogeneity in the firing rates of the sensory neurons in which the firing rate of the ith neuron is 1 + Δf_i times that of homogeneously firing neurons. We assume that Δf_1, Δf_2, …, and Δf_{n_1} are incommensurable so that the phase trajectory is ergodic. In this case, we can obtain the distribution of the ISI by taking the ensemble average with respect to θ_1, θ_2, …, and θ_{n_1}. We also suppose that N′ = N and that Δf_i is small enough
to prevent multiple firing of a sensory neuron between two spikes of the cortical neuron. Noting that firing is caused by the ith sensory neuron with probability (1 + Δf_i)/n_1 under ergodicity, we have

Prob(ISI ∈ [t, t + dt])
  = Σ_{i_s=1}^{n_1} ((1 + Δf_{i_s})/n_1) Σ_{i_e ≠ i_s} (1 + Δf_{i_e}) dt
    × Σ_{{s_i; i ≠ i_e, i_s} ∈ {0,1}^{n_1−2}} δ(N − 1 − Σ_{i ≠ i_e, i_s} s_i)
    × Π_{i ≠ i_e, i_s} [(1 + Δf_i) t s_i + (1 − (1 + Δf_i) t)(1 − s_i)]
  = (N(N + 1) dt / (n_1 t^2)) Σ_{{s_i} ∈ {0,1}^{n_1}} δ(N + 1 − Σ_i s_i)
    × Π_{i=1}^{n_1} [(1 + Δf_i) t s_i + (1 − (1 + Δf_i) t)(1 − s_i)],    (2.15)
where s_i equals 1 when the ith neuron fires in [0, t] and equals 0 otherwise, and δ is the delta function. We denote by i_s the index of the neuron whose firing triggered the last firing of the cortical neuron, and by i_e the index of the sensory neuron that marks the Nth spike after it; this spike makes the cortical neuron fire. Equation 2.15 coincides with equation 2.12 when Δf_i = 0 for all i. By expanding to the second order in Δf_i, we have

Prob(ISI ∈ [t, t + dt])
  = (n_1 − 1) C(n_1−2, N−1) t^{N−1} (1 − t)^{n_1−N−1} dt
  + (N(N + 1) dt / (n_1 t)) Σ_i Δf_i {−C(n_1−1, N+1) t^{N+1} (1 − t)^{n_1−N−2} + C(n_1−1, N) t^N (1 − t)^{n_1−N−1}}
  + (2N(N + 1) dt / n_1) Σ_{i<j} Δf_i Δf_j {C(n_1−2, N+1) t^{N+1} (1 − t)^{n_1−N−3} − 2 C(n_1−2, N) t^N (1 − t)^{n_1−N−2} + C(n_1−2, N−1) t^{N−1} (1 − t)^{n_1−N−1}}.    (2.16)
Figure 5: Dependence of the prediction errors NPE(1), NPE(2), and NPE(3) on the standard deviation of the firing-rate inhomogeneity Δf_i. We assume for Δf_i a gaussian distribution with mean 0. d = 4.
Using equation 2.16, we have

E(t) = N/n_1 + O(Δf_i^3)    (2.17)

and

Var(t) = N(n_1 − N) / (n_1^2 (n_1 + 1)) + (2N(N + 1) / ((n_1 − 1) n_1^2 (n_1 + 1))) Σ_{i<j} Δf_i Δf_j + O(Δf_i^3).    (2.18)
Equation 2.17 indicates that the effective threshold for the cortical neuron is equal, up to O(Δf_i^2), to that obtained in the homogeneous case in equation 2.13. Equations 2.14 and 2.18 show that the O(Δf_i) dependence of Var(t) also vanishes. The dependence of the prediction errors NPE(1), NPE(2), and NPE(3) on Δf_i is shown in Figure 5. The deviations Δf_i are taken from gaussian distributions with various standard deviations. Figure 5 shows that the prediction remains accurate even under 25% inhomogeneity in the firing rates.

2.4 The Leaky Integrate-and-Fire Neuron. Next we examine the possibility of attractor reconstruction by IF-type neurons with a leak. The ISI reconstruction by these models was originally studied by Racicot and Longtin
(1997) and Castro and Sauer (1999) for continuous chaotic inputs with the LIF neurons (Tuckwell, 1988; Koch, 1999). The LIF neurons have also been used to investigate the collective behavior of spike-coupled neural networks (van Vreeswijk, 1996; Ernst, Pawelzik, & Geisel, 1998; Bressloff & Coombes, 2000; Masuda & Aihara, 2001). Racicot and Longtin (1997), Castro and Sauer (1999), and Suzuki et al. (2000) studied the dynamics of a single LIF neuron using the following model:

dx/dt = S(t) − γx,    (2.19)

where x(t) is the membrane potential, γ > 0 is the leak rate, and S(t) is a chaotic signal essentially the same as that in equation 2.1. The LIF neuron has the same firing mechanism as the IF neuron, with threshold x(t) = 1 and resting potential x(t) = 0. In contrast to the IF neuron model, ISI attractor reconstruction and prediction of the ISI time series are possible only with a little degradation. By integrating equation 2.19, we have

x(t) = ∫_0^t S(t′) e^{−γ(t−t′)} dt′,    (2.20)

where we assume that x(0) = 0. Equation 2.20 shows that S(t′), weighted with a linear filter e^{−γ(t−t′)}, is integrated to determine the next firing time (Racicot & Longtin, 1997). This filter can cause the degradation or the failure of the ISI reconstruction, since it can seriously affect the integration process. In particular, if S(t) is small while x(t) is close to the threshold, the leak term γx(t) in equation 2.19 is large for a long time. Unlike in the IF neuron model, this situation results in a leak-induced long ISI, which can degrade the prediction. However, the ISI time series is predictable when γ is not very large. Here, we examine the ISI reconstruction when the LIF neuron is used as the model of the cortical neuron. The sensory neurons are the same as those used in section 2.2. Thus, we apply the model without voltage bias (Tuckwell, 1988; Masuda & Aihara, 2001). The dynamics of the LIF neuron with spatiotemporal spike inputs is given by

dv/dt = Σ_j ε δ(t − T̄_j) − γ v(t).    (2.21)
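Between input spikes, equation 2.21 makes v decay exponentially, and each spike adds ε, so the spike-driven LIF neuron can be integrated exactly from spike time to spike time. A minimal sketch (our own function; as noted below for equation 2.22, the neuron can fire only at an input spike time):

```python
import numpy as np

def lif_output_spikes(input_times, eps, gamma, threshold=1.0):
    """Output firing times of the spike-driven LIF neuron of equation 2.21.

    Between input spikes v(t) decays by exp(-gamma * dt); each input spike
    adds eps; the neuron fires and resets to 0 when v reaches the threshold,
    which can happen only at an input spike time.
    """
    v, t_prev, out = 0.0, 0.0, []
    for t in input_times:
        v *= np.exp(-gamma * (t - t_prev))
        v += eps
        if v >= threshold:
            out.append(t)
            v = 0.0
        t_prev = t
    return np.array(out)
```

With γ = 0 this reduces to the IF neuron of equation 2.5, firing on every Nth input spike (N = 1/ε); a positive leak requires more input spikes per output spike.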
Equation 2.21 coincides with equation 2.5 when γ = 0. By integrating equation 2.21, we have

v(T'_k) = ∫_{T'_{k−1}}^{T'_k} Σ_j ε e^{−γ(T'_k − t′)} δ(t′ − T̄_j) dt′ = ε Σ_{j=j_0}^{j_0+N_l} e^{−γ(T'_k − T̄_j)} = 1,    (2.22)

which enables us to determine T'_k from T'_{k−1}. N_l in equation 2.22 can be determined uniquely. Substituting

ε = 1/N = (1/(N D_j)) ∫_{T̄_j}^{T̄_{j+1}} S(t′) dt′

into equation 2.22, we obtain

Σ_{j=j_0}^{j_0+N_l} (1/(N D_j)) ∫_{T̄_j}^{T̄_{j+1}} S(t′) e^{−γ(T'_k − T̄_j)} dt′ = 1.    (2.23)
If D_j is approximated by its mean implied by equation 2.13, namely 1/n_1, then equation 2.23 reduces to

Σ_{j=j_0}^{j_0+N_l} ∫_{T̄_j}^{T̄_{j+1}} S(t′) e^{−γ(T'_k − T̄_j)} dt′ = N/n_1.    (2.24)
Comparing equations 2.20 and 2.24, we expect the LIF neuron receiving spike inputs from the sensory neurons to operate like the LIF neuron receiving the continuous input S(t) with the threshold equal to v(t) = N/n_1. Consequently, an LIF cortical neuron can transmit the deterministic information contained in the external stimuli and passed through spatiotemporal spikes from the sensory neurons. Equation 2.24 suggests two possible factors for the degradation that occurs in the ISI reconstruction but not in the LIF neuron with continuous inputs studied by Racicot and Longtin (1997) and Castro and Sauer (1999). As in the IF neuron model, the variance in D_j given in equation 2.14 can cause the effective thresholds of the cortical
neuron to change. The other factor stems from the filter weight e^{−γ(T'_k − T̄_j)} for t′ ∈ [T̄_j, T̄_{j+1}] instead of e^{−γ(T'_k − t′)}. This effect is essentially due to the discrete-time sampling in response to spike inputs. Figure 6 shows the simulation results for the neural network with the LIF cortical neuron. The parameters for the sensory neurons and the time-series length used in the analysis are the same as for the IF neuron model. We use γ = 0.055 ms^{−1} according to Diesmann et al. (1999) and Koch (1999). We set n_1 = 160 and ε = 0.018 to maintain a firing rate equal to 20 Hz, as in the IF neuron model. The results of the surrogate data analysis in Figures 6b and 6c imply that the ISI time series is deterministic. Furthermore, the reconstructed attractor shown in Figure 6d looks like a twisted version
Spatiotemporal Spike Encoding of a Signal
1617
Figure 6: Simulation results for the neural network with the LIF cortical neuron. $n_1 = 160$, $\bar\epsilon = 0.018$, and $\gamma = 0.054\ \mathrm{ms}^{-1}$. (A) $v(t)$. (B) NPE$(h)$ for $\{t_k'\}$ (original) and the surrogate data (FS, AAFT). $d = 4$, $l_0 = 12$, and the length of $\{t_k'\}$ is 4096. (C) NPE$(1)$ with various embedding dimensions for $\{t_k'\}$ (original) and the surrogate data (FS, AAFT). (D) Attractor reconstructed from 4096 points of $\{t_k'\}$ in delay coordinates with $d = 2$. Only the first 300 points are connected by lines.
of the one shown in Figure 2d. Nevertheless, it preserves a certain degree of determinism, which makes accurate deterministic prediction possible. As a result, the ISI of the LIF neuron with spike inputs can encode sensory information. The prediction errors NPE$(1)$, NPE$(2)$, and NPE$(3)$ for varying leak rate $\gamma$ are shown in Figure 7. We use $d = 4$ for predictions and change $\bar\epsilon$ so that the firing rate of the cortical neuron is kept almost constant. In this way, we eliminate the dependence of the prediction error on the firing rate. We see that the determinism of $S(t)$ is preserved with sufficiently high quality when $\gamma < 0.065\ \mathrm{ms}^{-1}$. This range for $\gamma$ includes biologically plausible values
1618
Naoki Masuda and Kazuyuki Aihara
Figure 7: Dependence of the prediction errors NPE$(1)$, NPE$(2)$, and NPE$(3)$ on $\gamma$. $d = 4$.
(Diesmann et al., 1999; Koch, 1999). For larger $\gamma$, the cortical neuron tends to fire only in the vicinity of a large peak of $S(t)$, where the cortical neuron receives incident spikes at a sufficiently high rate. In contrast, the cortical neuron cannot fire when $S(t)$ is small. The sensory neurons are close to the synchronized or clustered state in this case. This point will be considered again in section 3.

2.5 The FitzHugh-Nagumo Neuron. The FHN neuron model (FitzHugh, 1961; Nagumo et al., 1962) can be used to describe dynamical properties of real neurons, such as firing with thresholds, relative refractoriness, and the cooperation of fast and slow variables. The dynamics of the original FHN neuron is described by the following equations:
$$a\frac{dv}{dt} = -v(v - 0.5)(v - 1) - w + I,$$
$$\frac{dw}{dt} = v - w - 0.15,$$
where $a \ll 1$, $I$ is the external input, $v$ is the fast variable related to the membrane potential, and $w$ is the slow variable. The FHN neuron is oscillatory when $I > I_0$, where $I_0$ is the Hopf bifurcation point. Attractor reconstruction with ISI time series of the FHN neuron has been investigated (Racicot &
Longtin, 1997; Castro & Sauer, 1997a, 1999) when $I$ is replaced by a continuous external signal similar to $S(t)$ defined in equation 2.2. The subthreshold external signal with noise (Castro & Sauer, 1997a) and the suprathreshold external signal (Racicot & Longtin, 1997; Castro & Sauer, 1999) have been studied. In both cases, attractor reconstruction is difficult unless the range of the external signal is carefully chosen or the dynamics of the external signal is adequately slow (Castro & Sauer, 1999). The FHN neuron can operate as an amplitude-to-frequency converter only in limited cases, which are not biologically plausible (Racicot & Longtin, 1997). The external inputs influence the firing rate less than in the IF and LIF cortical neuron models. Here, we use the FHN neuron as the cortical neuron model. Although the FHN neuron, which possesses a property whereby the spike width depends on the driving current ($I$ or $S(t)$), is not a very good cortical neuron model, the following results will hold for other neuron models of a similar type, as we explain in section 3. The general properties associated with the incapability of attractor reconstruction will be discussed in the next section. The sensory neurons are the same as the ones used in sections 2.2 and 2.4. Since we are interested in the effect of input spikes, we use the bias current $I$ represented by

$$I = \hat I + \bar\epsilon \sum_j \delta(t - \bar T_j), \tag{2.25}$$
where $\hat I < I_0$, and $\bar\epsilon$ is the amplitude of each input spike. Consequently, the input spikes must arrive from the sensory neurons at a sufficiently high rate to make the cortical neuron fire. Figure 8 shows the simulation results for a neural network with an FHN cortical neuron. The parameters for the sensory neurons and the time-series length are the same as before. We set the firing threshold for the cortical FHN neuron at $v = 0.7$, $n_1 = 160$, $\bar\epsilon = 0.015$, $a = 0.05$, and $\hat I = 0.145$ so that the firing rate equals 20 Hz, which is comparable to that of the previous simulations. The NPE$(h)$ for $\{t_k'\}$ is close to 1 (see Figures 8b and 8c). The reconstructed attractor shown in Figure 8d is split into clusters (Racicot & Longtin, 1997). A cluster indicates how many subthreshold oscillations are contained in two successive ISIs, each of which corresponds to a coordinate in Figure 8d. Furthermore, the determinism is lost even if we consider the restricted transitions from one specific cluster to another. These features differ qualitatively from the Rössler attractor and the attractors reconstructed from the ISI of the IF neuron (see Figure 2d) and the LIF neuron (see Figure 6d), implying that deterministic prediction is impossible. These results are consistent with those for the continuous-input cases (Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). In the next section, we discuss these phenomena in comparison to the results for the IF neuron, the LIF neuron, and continuous inputs.
Figure 8: Simulation results for the neural network with the FHN cortical neuron. $n_1 = 160$, $\bar\epsilon = 0.015$, $a = 0.05$, $\hat I = 0.145$, and firing threshold $v = 0.7$. (A) $v(t)$. (B) NPE$(h)$ for $\{t_k'\}$ (original) and the surrogate data (FS, AAFT) with $d = 6$. (C) NPE$(1)$ with various embedding dimensions for $\{t_k'\}$ (original) and the surrogate data (FS, AAFT). (D) The attractor reconstructed from 4096 points of $\{t_k'\}$ in delay coordinates with $d = 2$. Only the first 300 points are connected by lines.
3 Mechanisms for Attractor Reconstruction by Spike-Driven Neurons
Here, we explore the mechanisms of attractor reconstruction by single neurons receiving incident spike trains with deterministic ISIs from many sensory neurons. As discussed by Racicot and Longtin (1997), an ISI roughly represents $1/S(t)$ at time $t$ (see section 1). We begin with the continuous-input case. When the input is continuous rather than discontinuous with spikes, three factors may cause degradation of the ISI reconstruction.
First, the time window size for integration may contribute to degradation. If the threshold of the IF or LIF neuron is increased, reconstruction becomes poorer since the longer time integration of $1/S(t)$ should be taken between two adjacent output spikes (Sauer, 1994, 1997; Racicot & Longtin, 1997). The second factor involves how much the ISIs change when $S(t)$ changes. In principle, if the spike frequency is a monotonic function of $S(t)$, the neuron can work as an amplitude-to-frequency converter, resulting in successful ISI reconstruction (Castro & Sauer, 1999). Sensitivity of the frequency to the change in $S(t)$ is also important, although monotonicity in the frequency–amplitude curve is mathematically sufficient for the ISI reconstruction. In regard to amplitude-to-frequency conversion, biological and model neurons can be divided into two classes according to qualitative properties of repetitive firing (Hodgkin, 1948; Koch, 1999; Izhikevich, 2000). When neurons with class I excitability begin repetitive firing, the firing starts with an infinitesimally low frequency. However, when neurons with class II excitability begin repetitive firing, they start firing with a finite frequency. In the model neurons, these classifications result from the different bifurcation structures. Homoclinic bifurcations or saddle-node bifurcations on invariant circles often underlie class I excitability, whereas subcritical Hopf bifurcations typically underlie class II excitability. Strictly speaking, the quality of attractor reconstruction is not directly ascribed to the bifurcation structure itself when the repetitive firing emerges or terminates. However, this classification is also related to the frequency-modulation property of model neurons in response to changes in the external inputs. In class I neurons, the dynamic range of the firing rate is generally large (5–150 Hz) with varying bias (Hodgkin, 1948; Izhikevich, 2000).
When the bias is just slightly above the threshold, the state of the neuron remains for a long time near the saddle or the ruin of the stable fixed point that existed under subthreshold bias. For larger bias, the whole system is free from slow transient behavior, and the dynamics is governed by the limit cycle. In contrast, the frequency changes relatively little (75–150 Hz) in class II neurons (Hodgkin, 1948; Izhikevich, 2000). Both the IF and LIF neurons belong to class I (Koch, 1999; Izhikevich, 2000). With equations 2.3 and 2.20, the instantaneous firing rate $f$ at time $t$ is given as follows:

$$\text{(IF)}\qquad \frac{dx}{dt} = S(t),\qquad f = S(t),$$

$$\text{(LIF)}\qquad f = \frac{\gamma}{\ln\dfrac{S(t)}{S(t)-\gamma}}.$$

The dynamic range of $f$ for the LIF neuron is almost as large as that for the IF neuron unless $S(t) - \gamma$ is only slightly positive or $S(t) \le \gamma$. In the latter case, $f$ is undefined since the LIF neuron cannot fire if $S(t)$ is always less than $\gamma$. However, the FHN neuron belongs to class II (Koch, 1999; Izhikevich,
2000). Although the amplitude of an individual action potential is sensitive to S ( t) , the ring rate is not. The third important feature is dependence of the external signal effect on the phase variable v ( t) of the cortical neuron. Although the change in S (t ) should affect the ring rate to realize good signal reconstruction, the difference in the phase when the neuron receives a spike should not affect the ring rate. This is because the phase-dependent lters deteriorate the equal integration of information signals. The effect of S (t) on the IF neuron does not depend on the value of x (t ). In contrast to the IF neurons, the ex0 N N ponential weight e ¡c ( t¡Tj ) in equation 2.20 or e ¡c (Tk ¡Tj ) in equation 2.24 in the integration by the LIF neuron can degrade the attractor reconstruction. This is mainly because the importance of S (t ) in determining the ring time depends on t. The effect of the inputs to the FHN neuron has much stronger phase dependence. Only the inputs received around the threshold crossing can signicantly inuence the dynamics of the FHN neuron. Furthermore, S (t ) that precedes ring can also delay the ring if it is applied in the refractory period (Hansel, Mato, & Meunier, 1995). Although we cannot dene the phase of the FHN neuron analytically except in a simplied version (Ichinose, Aihara, & Judd, 1998), it is geometrically dened by projecting the two-dimensional dynamics onto a unit circle (Hansel et al., 1995; Bressloff & Coombes, 2000). These three observations for continuous-input situations explain why the attractor reconstruction is possible for the IF and LIF neurons and difcult for the FHN neuron. The properties are naturally inherited in spike-driven neurons. In spike-driven situations, phase-response curves (PRC) are useful for describing the effect of an input spike (Hansel et al., 1995; Ichinose et al., 1998; Bressloff & Coombes, 2000; Masuda & Aihara, 2001). 
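Before turning to PRCs, the class I rate expressions above for the IF and LIF neurons can be checked directly. This is a minimal sketch, not code from the paper; it assumes a unit firing threshold and the leak rate $\gamma = 0.055$ quoted earlier, and the function names are mine.

```python
import numpy as np

GAMMA = 0.055  # leak rate gamma from the text (ms^-1)

def rate_if(S):
    # IF: dx/dt = S(t) with unit threshold, so the rate equals the input
    return np.asarray(S, dtype=float)

def rate_lif(S, gamma=GAMMA):
    # LIF: under constant input S > gamma with unit threshold, the period is
    # (1/gamma) ln(S / (S - gamma)), hence f = gamma / ln(S / (S - gamma));
    # f is undefined (NaN here) for S <= gamma, as in the text
    S = np.asarray(S, dtype=float)
    f = np.full_like(S, np.nan)
    ok = S > gamma
    f[ok] = gamma / np.log(S[ok] / (S[ok] - gamma))
    return f

S = np.linspace(0.06, 1.0, 200)
# rate_lif(S) rises continuously from near zero (the class I signature)
# and approaches rate_if(S) for S >> gamma
```

Evaluating both curves shows the wide dynamic range claimed in the text: the LIF rate is arbitrarily small just above $S = \gamma$ yet close to the IF rate for strong inputs.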
The phase $w \in [0, 1]$ is defined so that the phase velocity is constant when the neuron is firing periodically. The resting potential and the firing threshold are usually identified with $w = 0$ and $w = 1$, respectively. The phase description can be extended to excitable neuron models by interpreting the leak effect as phase regression (Masuda & Aihara, 2001). The PRC is the mapping from $w$ to $\bar t(w) - w$, where $\bar t(w)$ transforms the old phase into the new phase when an input spike arrives. Neurons with relatively flat PRCs can transmit the ISI determinism since the spikes are evaluated evenly during the cycle. For example, the phase and PRC of the IF neuron are trivially defined by

$$w = x \tag{3.1}$$

and

$$\bar t(w) - w = \bar\epsilon, \tag{3.2}$$

respectively. We can immediately see that every spike contributes to the firing dynamics equally. The phases for the LIF neuron are defined (Hansel
et al., 1995; Masuda & Aihara, 2001) by

$$w \equiv g(x) = -\frac{1}{\gamma}\ln\{1 - x(1 - e^{-\gamma})\} \tag{3.3}$$

and

$$\bar t(w) - w = g(g^{-1}(w) + \bar\epsilon) - w = -\frac{1}{\gamma}\ln\{1 - \bar\epsilon\, e^{\gamma w}(1 - e^{-\gamma})\}\ (> 0), \tag{3.4}$$
respectively. In the limit $\gamma \to 0$, the definitions in equations 3.3 and 3.4 approach those in equations 3.1 and 3.2, respectively. Equation 3.4 indicates that the phase changes more when the membrane potential is closer to the threshold. This observation is consistent with what is deduced from equation 2.24. The phase dependence manifests itself more in the spike-input case than in the continuous-input case because a spike instantaneously produces a finite phase change. The effect of spikes on the FHN neuron can also be explained by its numerically obtained PRC. The PRC of the FHN neuron depends strongly on the phase; spikes have significant effects only around the threshold crossing (Hansel et al., 1995). The information contained in most of the input spikes is lost without influencing the dynamics of the FHN neuron. This is the main reason that the reconstruction is difficult. Reconstruction is possible only when the information signal changes slowly relative to the firing rate; in this situation, the phase dependence is averaged out, making attractor reconstruction feasible (Castro & Sauer, 1999). Nevertheless, the information-encoding performance would be low in this case since only slowly changing signals are recognized. The condition required for attractor reconstruction is related to but different from the criteria given by Hansel et al. (1995), where synchronization of neurons coupled by gap junctions is discussed. Neurons tend to desynchronize when each neuron has a PRC with $\bar t(w) - w > 0$ for all $w$ (type I) and to synchronize when the PRC has two regions, one with $\bar t(w) - w > 0$ and the other with $\bar t(w) - w < 0$ (type II). The IF and LIF neurons belong to the type I class, and the FHN neuron belongs to the type II class. Unlike type II neurons, many type I neurons are capable of attractor reconstruction because the PRCs of type I neurons are often relatively flat.
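Equations 3.1 through 3.4 can be evaluated directly to compare the two PRCs. This is a small sketch, assuming the parameter values $\gamma = 0.055$ and $\bar\epsilon = 0.018$ quoted earlier; the function names are mine, not the authors'.

```python
import numpy as np

GAMMA, EPS = 0.055, 0.018  # leak rate and spike amplitude from the text

def prc_if(w, eps=EPS):
    # eqs. 3.1-3.2: the IF PRC is flat -- every spike advances the phase by eps
    return np.full_like(np.asarray(w, dtype=float), eps)

def prc_lif(w, gamma=GAMMA, eps=EPS):
    # eq. 3.4: the LIF phase advance grows as the potential nears threshold
    w = np.asarray(w, dtype=float)
    return -np.log(1.0 - eps * np.exp(gamma * w) * (1.0 - np.exp(-gamma))) / gamma

w = np.linspace(0.0, 1.0, 101)
# prc_lif(w) is everywhere positive (type I), strictly increasing in w,
# and collapses onto the flat prc_if(w) as gamma -> 0
```

The comparison makes the text's point concrete: the IF PRC is perfectly flat, while the LIF PRC is positive everywhere but mildly phase dependent, which is one source of degradation in the spike-input case.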
We must take three more factors into consideration to understand the dynamics of spike-input neurons. One is the time discretization in sampling. This is a consequence of the fact that the neuron receives external information in the form of point processes with instantaneous spikes. As shown in section 2.4, the weighted integration of the signal is replaced by the weighted discrete sum. Another factor is the uncertainty in the spike arrival time caused by the random initial conditions of the sensory neurons in the framework presented here and other noise sources or inhomogeneity among neurons in more general situations. The variance in the spike arrival
time is evaluated in equation 2.14, which shows the asymptotic scaling $\propto n_1^{-1}$ for large $n_1$. The last factor is the distinction between oscillatory dynamics and excitable dynamics. We have concentrated on the excitable cases to explore spike effects. There also exist oscillatory schemes corresponding to each of the three neuron models discussed here. In oscillatory schemes, the constant biases applied to the neuronal dynamics make the neurons fire even in the absence of spike inputs. For example, the FHN neuron is inherently oscillatory when $\hat I$ in equation 2.25 is larger than $I_0$. Generally, when the bias is stronger, the dynamics of the output ISI is governed more by the biases than by the external inputs. In other words, the properties of the ISI time series are determined by the balance between the strength of the external input and that of the inherent neuronal dynamics comprising the systematic biases applied to the neurons. In summary, the ISIs of spike-driven neurons are deterministic when the corresponding neurons with continuous inputs are deterministic. In both cases, two conditions are important for signal reconstruction: the spike effect should be independent of the phase, and a change in the input signal amplitude or the input spike frequency should change the output spike frequency (the class I property).
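The normalized prediction error (NPE) used to judge determinism throughout this section can be sketched as a nearest-neighbor forecaster in delay coordinates, in the spirit of Sugihara and May (1990). This is an illustrative estimator, not the paper's exact one; in particular, the normalization by the standard deviation of the series is an assumption.

```python
import numpy as np

def npe(isi, d=4, h=1):
    """Sketch of a normalized prediction error for an ISI series: forecast
    h steps ahead with the nearest neighbor in d-dimensional delay
    coordinates, and normalize by the spread of the series. Values well
    below 1 indicate deterministic structure."""
    x = np.asarray(isi, dtype=float)
    n = len(x) - d - h + 1
    # delay-coordinate vectors V[i] = (x_i, ..., x_{i+d-1})
    V = np.array([x[i:i + d] for i in range(n)])
    # target to predict for vector i is x_{i+d+h-1}
    targets = x[d + h - 1: d + h - 1 + n]
    errs = []
    for i in range(n):
        dist = np.linalg.norm(V - V[i], axis=1)
        dist[i] = np.inf            # exclude the point itself
        j = int(np.argmin(dist))    # nearest neighbor in delay space
        errs.append(targets[j] - targets[i])
    rmse = np.sqrt(np.mean(np.square(errs)))
    return rmse / np.std(x)
```

Applied to a smooth deterministic series this estimator returns a value near zero, while for an i.i.d. random series it stays above 1, mirroring the original/surrogate comparisons in Figures 2, 6, and 8.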
4 Conclusion
We have investigated the possibility of signal reconstruction in single cortical neurons receiving incident spike trains from many sensory neurons with ISIs that reflect deterministic characteristics of continuous external stimuli. The IF and LIF neurons can transmit dynamical and deterministic information even though they receive superimposed spike trains in which the deterministic ISI structures of the incident trains from the sensory neurons interfere with each other. The noise in the sensory neurons is important for spike-driven cortical neurons to collect signal information effectively. The spike-driven FHN neurons with biologically plausible parameters, however, cannot adequately encode sensory stimuli. For these three types of cortical neuron models, we have discussed the conditions for ISI reconstruction: the spike effect should not depend on the phase $v(t)$ of the cortical neuron, and the output spike frequency should depend monotonically on the input frequency. The difference between the spike-input models and the continuous-input models is ascribed to the difference between discrete and continuous sampling of the information signal. Our results support the notion that the attractor reconstruction mechanism is an information-processing scheme in the brain as well as a useful tool for investigating the effects of external and feedback spike inputs. The next problem we will approach is the two-layered neural network with multiple cortical neurons. Many experiments have shown the existence of synchronously firing neurons, and this synchronization is believed to play
an important role in information coding based on spatiotemporal functional assemblies (Abeles, 1991; Eckhorn et al., 1988; Gray et al., 1989; Laurent & Davidowitz, 1994; Vaadia et al., 1995; Riehle et al., 1997; Stopfer et al., 1997; Stopfer & Laurent, 1999; Brecht et al., 1999; Rodriguez et al., 1999; Steinmetz et al., 2000). The mechanisms of synchronization have also been investigated with various neuron models (Hansel et al., 1995; van Vreeswijk, 1996; Ernst et al., 1998; Diesmann et al., 1999; Bressloff & Coombes, 2000; Masuda & Aihara, 2001). We expect inter-event intervals of synchronous firing events (Tokuda & Aihara, 2000) to preserve the deterministic information of external continuous stimuli, which may be relevant to robust information processing in the biological brain.

Acknowledgments
We thank I. Tokuda at Muroran Institute of Technology, Japan, for helpful discussions and suggestions. This work is supported by the Japan Society for the Promotion of Science and CREST, JST.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Brecht, M., Singer, W., & Engel, A. K. (1999). Patterns of synchronization in the superior colliculus of anesthetized cats. J. of Neurosci., 19(9), 3567–3579.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Computation, 12, 91–129.
Castro, R., & Sauer, T. (1997a). Chaotic stochastic resonance: Noise-enhanced reconstruction of attractors. Phys. Rev. Lett., 79(6), 1030–1033.
Castro, R., & Sauer, T. (1997b). Correlation dimension of attractors through interspike intervals. Phys. Rev. E, 55(1), 287–290.
Castro, R., & Sauer, T. (1999). Reconstructing chaotic dynamics through spike filters. Phys. Rev. E, 59(3), 2911–2917.
Chang, T., Schiff, S. J., Sauer, T., Gossard, J-P., & Burke, R. E. (1994). Stochastic versus deterministic variability in simple neuronal circuits: I. Monosynaptic spinal cord reflexes. Biophysical J., 67(2), 671–683.
Diesmann, M., Gewaltig, M-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Ernst, U., Pawelzik, K., & Geisel, T. (1998). Delay-induced multistable synchronization of biological oscillators. Phys. Rev. E, 57(2), 2150–2162.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophysical J., 1, 445–465.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338(23), 334–337.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Hodgkin, A. L. (1948). The local electric changes associated with repetitive action in a non-medullated axon. Journal of Physiology, 107, 165–181.
Ichinose, N., & Aihara, K. (1998). Extracting chaotic signals from a superimposed pulse train. Proc. of 1998 Int. Symposium on Nonlinear Theory and Its Applications, 2, 695–698.
Ichinose, N., Aihara, K., & Judd, K. (1998). Extending the concept of isochrons from oscillatory to excitable systems for modeling an excitable neuron. Int. J. Bifurcation and Chaos, 8(12), 2375–2385.
Isaka, A., Yamada, T., Takahashi, J., & Aihara, K. (2001). Analysis of superimposed pulse trains with higher-order interspike intervals. Trans. IEICE, J84-A(3), 309–320 (in Japanese).
Izhikevich, E. M. (2000). Neural excitability, spiking and bursting. Int. J. Bifurcation and Chaos, 10(6), 1171–1266.
Judd, K. T., & Aihara, K. (1993). Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks, 6, 203–215. (Errata: 1994, Neural Networks, 7, 1491.)
Judd, K., & Aihara, K. (2000). Generation, recognition and learning of recurrent signals by pulse propagation networks. Int. J. Bifurcation and Chaos, 10(10), 2415–2428.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Laurent, G., & Davidowitz, H. (1994). Encoding of olfactory information with oscillating neural assemblies. Science, 265, 1872–1875.
MacLeod, K., Bäcker, A., & Laurent, G. (1998). Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395, 693–698.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons.
Science, 268, 1503–1506.
Mar, D. J., Chow, C. C., Gerstner, W., Adams, R. W., & Collins, J. J. (1999). Noise shaping in populations of coupled model neurons. Proc. Natl. Acad. Sci. USA, 96, 10450–10455.
Masuda, N., & Aihara, K. (2001). Synchronization of pulse-coupled excitable neurons. Phys. Rev. E, 64(5), 051906.
Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc. of the IRE, 50, 2061–2070.
Racicot, D. M., & Longtin, A. (1997). Interspike interval attractors from chaotically driven neuron models. Physica D, 104, 184–204.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20(14), 5392–5400.
Richardson, K. A., Imhoff, T. T., Grigg, P., & Collins, J. J. (1998). Encoding chaos in neural spike trains. Phys. Rev. Lett., 80(11), 2485–2488.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rodriguez, E., George, N., Lachaux, J-P., Martinerie, J., Renault, R., & Varela, F. J. (1999). Perception's shadow: Long-distance synchronization of human brain activity. Nature, 397, 430–433.
Rössler, O. E. (1976). An equation for continuous chaos. Phys. Lett. A, 57, 397–398.
Sauer, T. (1994). Reconstruction of dynamical systems from interspike intervals. Phys. Rev. Lett., 72(24), 3811–3814.
Sauer, T. (1997). Reconstruction of integrate-and-fire dynamics. Fields Institute Communications, 11, 63–75.
Schiff, S. J., Jerger, K., Chang, T., Sauer, T., & Aitken, P. G. (1994). Stochastic versus deterministic variability in simple neuronal circuits: II. Hippocampal slice. Biophysical J., 67(2), 684–691.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74.
Stopfer, M., & Laurent, G. (1999). Short-term memory in olfactory network dynamics. Nature, 402, 664–668.
Sugihara, G., & May, R. M. (1990). Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature, 344, 734–740.
Suzuki, H., Aihara, K., Murakami, J., & Shimozawa, T. (2000). Analysis of neural spike trains with interspike interval reconstruction. Biol. Cybern., 82, 305–311.
Theiler, J., Eubank, S., Longtin, A., Galdrakian, B., & Farmer, J. D. (1992). Testing for nonlinearity in time series: The method of surrogate data. Physica D, 58, 77–94.
Tiesinga, P. H., Fellous, J.-M., José, J. V., & Sejnowski, T. J. (2001). Optimal information transfer in synchronized neocortical neurons.
Neurocomputing, 38–40, 397–402.
Tiesinga, P. H. E., & Sejnowski, T. J. (2001). Precision of pulse-coupled networks of integrate-and-fire neurons. Network, 12, 215–233.
Tokuda, I., & Aihara, K. (2000). Inter-event interval reconstruction of chaotic dynamics. In Proc. of the Fifth International Symposium on Artificial Life and Robotics (AROB 5th '00) (pp. 177–180).
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 1). Cambridge: Cambridge University Press.
Vaadia, E., Haalman, L., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
van Steveninck, R. R. de R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54(5), 5522–5537.
Warzecha, A-K., & Egelhaaf, M. (1999). Variability in spike trains during constant and dynamic stimulation. Science, 283, 1927–1930.

Received May 8, 2001; accepted November 2, 2001.
LETTER
Communicated by Jack Cowan
Attractor Reliability Reveals Deterministic Structure in Neuronal Spike Trains

P. H. E. Tiesinga [email protected] Sloan-Swartz Center for Theoretical Neurobiology and Computational Neurobiology Lab, Salk Institute, La Jolla, CA 92037, U.S.A.

J.-M. Fellous [email protected] Howard Hughes Medical Institute and Computational Neurobiology Lab, Salk Institute, La Jolla, CA 92037, U.S.A.

Terrence J. Sejnowski [email protected] Sloan-Swartz Center for Theoretical Neurobiology and Computational Neurobiology Lab, Salk Institute; Howard Hughes Medical Institute, Salk Institute; and Department of Biology, University of California–San Diego, La Jolla, CA 92037, U.S.A.

When periodic current is injected into an integrate-and-fire model neuron, the voltage as a function of time converges from different initial conditions to an attractor that produces reproducible sequences of spikes. The attractor reliability is a measure of the stability of spike trains against intrinsic noise and is quantified here as the inverse of the number of distinct spike trains obtained in response to repeated presentations of the same stimulus. High reliability characterizes neurons that can support a spike-time code, unlike neurons whose discharges form a renewal process (such as a Poisson process). These two classes of responses cannot be distinguished using measures based on the spike-time histogram, but they can be identified by the attractor dynamics of spike trains, as shown here using a new method for calculating the attractor reliability. We applied these methods to spike trains obtained from current injection into cortical neurons recorded in vitro. These spike trains did not form a renewal process and had a higher reliability compared to renewal-like processes with the same spike-time histogram.

1 Introduction
Neural Computation 14, 1629–1650 (2002) © 2002 Massachusetts Institute of Technology

Features that are present in the spike patterns elicited in response to a stimulus repeated across multiple trials can form the basis of a neuronal code. Here, we introduce a novel reliability measure in order to study the
reproducibility of sequences of precise spike times produced in vitro by cortical neurons (Mainen & Sejnowski, 1995). We show that spike trains obtained from in vitro cortical neurons and integrate-and-fire model neurons have more deterministic structure than spike trains obtained from renewal processes with the same interspike interval and spike-time probability distributions. This structure cannot be detected from the spike-time histogram. The response of a neuron to a stimulus can be characterized as an attractor, defined as the voltage trajectory (and associated spike train) to which the neuron's membrane potential converges from different initial conditions (Strogatz, 1994; Jensen, 1998). An attractor is stable against noise: weak noise introduces spike-time jitter, but the sequence of spike times and the input-induced correlations between interspike intervals are conserved. Here, the sequence of spike times is considered the output signal. When the neuron stays on the attractor, it transmits a unique signal. The dynamics of a model neuron can undergo a bifurcation when its parameter values are varied. During a bifurcation, a small change in the value of a parameter introduces a large change in the spike times: a different attractor emerges. When a neuron is close to a bifurcation point, noise can induce a transition to another attractor. If the new "attractor" is unstable, the neuron will return to the original attractor after a finite time, but if it is stable, the neuron stays on the new attractor. The attractor reliability is defined as the inverse of the number of distinct spike trains that are obtained after a large number of trials. This article consists of two parts. First, a test is given to determine whether a spike train forms a temporally modulated renewal process. Second, we quantify the extra, non-Poisson structure present in spike trains using the attractor reliability.
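The definition just given, the inverse of the number of distinct spike trains over many trials, can be turned into a toy estimator. This sketch is only an illustrative reading of the definition, not necessarily the authors' algorithm; in particular, the jitter tolerance used to decide when two trains count as the same is an assumption.

```python
import numpy as np

def attractor_reliability(trials, jitter=0.5):
    """Illustrative estimate of attractor reliability: 1 / (number of
    distinct spike trains over trials). Two trials count as the same
    train if they have equal spike counts and every corresponding spike
    time matches within `jitter`."""
    reps = []  # one representative train per distinct class
    for t in trials:
        t = np.asarray(t, dtype=float)
        for r in reps:
            if len(r) == len(t) and (len(t) == 0 or np.max(np.abs(r - t)) < jitter):
                break  # matches an existing class
        else:
            reps.append(t)  # a new distinct spike train
    return 1.0 / len(reps)
```

With perfectly reproducible trials the estimator returns 1; each additional distinct train (for example, after a noise-induced transition to another attractor) lowers it.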
These techniques are illustrated using spike trains obtained from experimental data and numerical model simulations. A systematic study of the attractor and bifurcation structure of in vitro and model neurons driven by periodic, quasiperiodic, and random currents will be presented elsewhere.

2 Methods

2.1 Experimental Methods. The voltage response of cortical neurons as measured in a rat slice preparation was described previously (Fellous et al., 2001). Protocols for these experiments were approved by the Salk Institute Animal Care and Use Committee, and they conform to U.S. Department of Agriculture regulations and National Institutes of Health guidelines for humane care and use of laboratory animals. Briefly, coronal slices of rat prelimbic and infralimbic areas of prefrontal cortex were obtained from 2- to 4-week-old Sprague-Dawley rats. Rats were anesthetized with isoflurane and decapitated. Their brain was removed and cut into 350 μm thick slices on a Vibratome 1000. Slices were then transferred to a submerged chamber containing artificial cerebrospinal fluid (ACSF, in mM: NaCl, 125; NaHCO3, 25;
Attractor Reliability in Neuronal Spike Trains
1631
D-glucose, 10; KCl, 2.5; CaCl2, 2; MgCl2, 1.3; NaH2PO4, 1.25) saturated with 95% O2/5% CO2 at room temperature. Whole-cell patch clamp recordings were achieved using glass electrodes (4–10 MΩ) containing (mM): KMeSO4, 140; Hepes, 10; NaCl, 4; EGTA, 0.1; Mg-ATP, 4; Mg-GTP, 0.3; phosphocreatine, 14. Patch clamp was performed under visual control at 30–32°C. In most experiments, Lucifer yellow (RBI, 0.4%) or biocytin (Sigma, 0.5%) was added to the internal solution for morphological identification. In all experiments, synaptic transmission was blocked by D-2-amino-5-phosphonovaleric acid (D-APV; 50 µM), 6,7-dinitroquinoxaline-2,3-dione (DNQX; 10 µM), and bicuculline methiodide (Bicc; 20 µM). All drugs were obtained from RBI or Sigma, freshly prepared in ACSF, and bath applied. Data were acquired with Labview 5.0 and a PCI-16-E1 data acquisition board (National Instruments) and analyzed with MATLAB (The Mathworks). We used regularly spiking layer 5 pyramidal cells that were identified morphologically. 2.2 Simulation Algorithm. The membrane potential V of an integrate-and-fire neuron driven by a periodic current satisfied
dV/dt = −V + I + A sin(2πt/T) + ξ(t),   (2.1)
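Equation 2.1, with the threshold-and-reset rule described below, can be simulated in a few lines. The following is an illustrative Python sketch only: the paper integrates with fourth-order Runge-Kutta and linearly interpolates the crossing time, whereas this version uses a plain Euler-Maruyama step, and the noise-increment convention √(D dt)·N(0, 1) is our assumption.

```python
import numpy as np

def simulate_if_neuron(I=1.0, A=0.17, T=2.0, D=1e-4, dt=0.01,
                       t_max=100.0, seed=0):
    """Integrate dV/dt = -V + I + A sin(2*pi*t/T) + xi(t) (equation 2.1)
    in dimensionless units, with threshold V = 1 and reset V = 0."""
    rng = np.random.default_rng(seed)
    v, t, spikes = 0.0, 0.0, []
    for _ in range(int(t_max / dt)):
        drift = -v + I + A * np.sin(2 * np.pi * t / T)
        # Euler-Maruyama step; sqrt(D*dt) noise scaling is one common convention
        v += drift * dt + np.sqrt(D * dt) * rng.standard_normal()
        t += dt
        if v >= 1.0:          # threshold crossing: emit a spike ...
            spikes.append(t)
            v = 0.0           # ... and reset instantaneously
    return np.array(spikes)
```

With the Figure 9 parameter values used as defaults, the membrane potential relaxes toward the modulated fixed point near threshold and crosses it once per drive cycle in the entrained regime.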
where I was a time-independent driving current, A was the amplitude of the drive, T = 2 was the period, and ξ was a white noise current, with zero mean and variance D, that represented the effects of intrinsic noise. When the voltage V reached threshold, V(t) = 1, a spike was emitted, and the voltage was instantaneously reset to zero, V(t) = 0. Dimensionless units were used in model simulations. One voltage unit corresponded to the distance between resting membrane potential and action potential threshold, approximately 20 mV; one time unit corresponded to the membrane time constant, approximately 40 ms. Equation 2.1 was integrated directly using the fourth-order Runge-Kutta algorithm (Press, Teukolsky, Vetterling, & Flannery, 1992), with step size dt = 0.01. The calculated voltage differed by less than 10⁻¹¹ from the voltage obtained by analytically integrating equation 2.1 for a sinusoidal current and with D = 0. The spike time t_s, given by the expression V(t_s) = 1, was determined by linear interpolation (Hansel, Mato, Meunier, & Neltner, 1998). 3 Results 3.1 Example of Deterministic Structure in Spike Trains. In most experiments, the same stimulus was presented on different trials, and the resulting spike trains were analyzed. The spike trains of a hypothetical experiment are presented in a rasterplot in Figure 1Aa. The computation of reliability and precision based on the spike-time histogram (STH) followed the procedure in Mainen and Sejnowski (1995). The length of a trial was
1632
P. H. E. Tiesinga, J.-M. Fellous, and Terrence J. Sejnowski
Figure 1: Reliability based on the spike-time histogram is insensitive to deterministic structure in spike trains. (A, B) The same spike times, with the trials ordered differently. On each row: (a) rasterplot, with each measured spike represented as a circle; its x-ordinate is the spike time and its y-ordinate the trial index; (b) the spike-time histogram (STH), the number of spikes in a particular time bin normalized by the number of trials. Events were defined as threshold crossings of the spike-time histogram. The event reliability was the fraction of trials during which a spike occurred during the event; the precision was the inverse of the standard deviation of the spike times in the event. The reliability of the response, R_STH, was the event reliability averaged over all events. (A) In each of the 100 trials, a spike was present during the first event; hence, its reliability was 1. (B) The trials of A were composed of two distinct spike trains, indicated by the filled and open circles in a. In b, the resulting STH for each group is plotted separately as filled and open peaks, respectively. The first peak in A was in fact made up of two separate events.
divided into discrete bins, and the number of spikes that fell in each bin was counted. The spike count in a bin was normalized by the number of trials. The set of bins was convolved with a smoothing function. The STH so obtained usually contained a number of peaks, each consisting of a set of consecutive bins that contain more spike counts than average. Peaks were
detected by setting a threshold and determining when the smoothed STH crossed it (see Figure 1Ab). The peaks constitute events; the reliability of an event was the fraction of trials during which a neuron spiked during the event. The reliability of the spike train, R_STH, was the reliability of each event averaged over all events. (An equivalent procedure for R_STH is given in equation 3.5.) The precision of an event was the inverse of the standard deviation of the spike times that were part of the peak. The reliability based on the STH depended on the size of the bins, the smoothing function, and the value of the peak-detection threshold. The STH reliability so defined was insensitive to deterministic structure in the spike trains. An example of how STH reliability can miss important deterministic structure in spike trains is shown in Figure 1. The data shown in Figure 1Aa were obtained by randomly mixing two types of spike trains: the open and filled circles in Figure 1Ba. Hence, Figures 1Aa and 1Ba are the same, except that the trials are reordered. On each trial, there was a spike present in the first event in Figure 1Ab; hence, its "reliability" was 1. However, that single event in fact consisted of two events (see Figure 1Bb). In other words, there were two distinct spike trains (attractors) present across different trials.
3.2 Testing of Renewal Property. A temporally modulated renewal spike train is fully characterized by its spike-time probability density and its interspike-interval distribution. The defining property of a renewal spike train is that the intervals between spikes are independent. There is no deterministic structure in renewal spike trains apart from the structure induced by a time-varying firing rate (see below). In contrast, for an attractor, the spikes occur at particular times and in a particular sequence (see section 1): there are correlations between the spike times, and there is deterministic structure in the spike train. Hence, it is different from a renewal process. Spike trains that form a renewal process are not reliable. It is therefore important to distinguish renewal processes from nonrenewal processes (Reich, Victor, & Knight, 1998). The Poisson process is an example of a renewal process with rate λ. The probability of finding a spike between times t and t + dt is λ dt and does not depend on previous spike times. Hence, the spike times themselves are also independent. The rate λ is time independent and is estimated from the spike-time histogram. The distribution of interspike intervals is exponential; the probability of obtaining an interspike interval between τ and τ + dτ is λe^(−λτ) dτ. There is no correlation between the lengths of consecutive intervals. The coefficient of variation (CV), the standard deviation divided by the mean of τ, is equal to 1. Another example of a renewal process is a gamma process of order r. The probability of obtaining a spike between times t and t + dt again is λ dt. However, the distribution of interspike intervals is given by a gamma probability distribution and is less disperse, with CV = 1/√r and r > 1. The gamma process of integer order is closely related
to the Poisson process. For instance, a gamma process of order 2 is obtained from a Poisson process of twice the rate, 2λ, by removing every other spike. Shuffling procedures can form the basis for a test of the renewal character of spike trains. First, generate a number of different spike trains (trials) using the same process. Then shuffle the spike times randomly across different trials. The probability of obtaining a spike at a given time is conserved, since exactly the same spike times were used. If the spikes were generated by a Poisson process, then they are independent of each other. Hence, the new set of spike trains cannot be distinguished statistically from the original set of spike trains. The shuffling procedure leads to an exponential distribution of interspike intervals, and CV = 1. For spike trains generated by a gamma process of order r, CV = 1/√r. However, after shuffling, the CV changed to 1, and the new set of spike trains can be distinguished from the original set of spike trains by comparing the CVs. Hence, shuffling of spike times is not appropriate for a gamma process. However, when the interspike intervals are shuffled randomly across trials and then used to calculate the new spike times, the interspike-interval distribution is conserved. Since the intervals in a renewal process are independent, the new set of spike trains should be indistinguishable from the original set. Note that the actual spike times are not conserved. As a result, the spike-time probability (the spike-time histogram) is different at the boundaries, close to the beginning and end of the trial. However, at some distance from the boundaries, it is approximately the same. An example is discussed below. The spike-time histogram obtained from neural spike trains usually is time dependent: λ(t) is a function of time. The spike trains may still form a temporally modulated renewal process; the probability of obtaining a spike between t and t + dt is λ(t) dt.
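The effect of the two shuffling schemes can be checked numerically. The Python sketch below (parameter values and the label-reassignment implementation of "shuffling spike times across trials" are our choices) builds gamma-order-2 spike trains, whose interspike-interval CV is 1/√2 ≈ 0.71, and shows that shuffling spike times across trials pushes the CV back toward 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, t_tot, rate, order = 200, 50.0, 2.0, 2

# Gamma process of order 2 with mean rate `rate`: ISI mean 1/rate, CV 1/sqrt(2).
trials = []
for _ in range(n_trials):
    isis = rng.gamma(order, 1.0 / (order * rate), size=int(4 * rate * t_tot))
    t = np.cumsum(isis)
    trials.append(t[t < t_tot])

def pooled_cv(trains):
    """CV of all interspike intervals pooled across trials."""
    isis = np.concatenate([np.diff(tr) for tr in trains])
    return isis.std() / isis.mean()

cv_gamma = pooled_cv(trials)          # close to 1/sqrt(2) ~ 0.71

# Shuffle spike *times* across trials: reassign every pooled spike time to a
# random trial. The per-trial trains become approximately Poisson.
pooled = np.concatenate(trials)
labels = rng.integers(0, n_trials, size=len(pooled))
shuffled = [np.sort(pooled[labels == i]) for i in range(n_trials)]
cv_shuffled = pooled_cv(shuffled)     # close to 1
```

Shuffling the interspike intervals instead of the spike times would leave `cv_gamma` unchanged, which is why interval shuffling is the appropriate surrogate for a gamma process.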
The time dependence of λ(t) makes testing of the renewal character difficult, since the distribution of interspike intervals depends on time via λ(t). However, temporally modulated renewal processes (referred to as "simply modulated renewal processes" in Reich et al., 1998) can be mapped into a time-independent (homogeneous) renewal process by "transforming" time. In what follows, we describe the test of the renewal property in detail. It consisted of three steps. First, time t was mapped into a new time s. Then the interspike intervals were shuffled randomly across trials. Finally, the test statistic was evaluated, and its significance was determined. The procedure is applied to spike trains obtained from repeated current injection into cortical neurons in vitro. 3.2.1 Experimental Data Set. The stimulus wave forms shown in Figure 2Ac were injected in a cortical neuron in current-clamp mode. The stimulus consisted of a 200 ms constant depolarizing current followed by a 1600 ms sinusoidal current with period 100 ms. The height of the initial depolarizing pulse was varied to bring the neuron to different initial voltages at the start of the sinusoidal wave form. For a large enough amplitude, a spike was obtained during the current pulse (see Figure 2Ab). After a transient
Figure 2: Effect of initial condition on spike timing in cortical neurons. The stimulus wave form consisted of a 200 ms long constant depolarizing current followed by a 1600 ms long sinusoidal current with period 100 ms. The amplitude of the initial depolarizing current was varied over 11 different values. Each of the resulting wave forms was presented 20 times, yielding 220 trials. (A) (a, b) Recorded voltage response during the presentation of the first and eleventh stimulus wave form, respectively; (c) the 11 stimulus wave forms. (B, C) In each row, (a) the rastergram and (b) the spike-time histogram is plotted using (B) the experimental data; (C) Poisson spike trains that were obtained by randomly shuffling spike times across trials. The spike-time histograms were identical, and the rastergrams looked similar. However, the experimental spike trains did not form a renewal process (see Figure 4).
of approximately three cycles, the neuron generated on average about one action potential on every two cycles. The rastergram and spike-time histogram of the spike trains obtained during 20 presentations of 11 different wave forms (220 trials) are shown in Figure 2B. A Poisson process with the same spike-time histogram was generated by randomly shuffling the spike times across different trials (see Figure 2C). 3.2.2 Mapping the Spike Times. The spike-time probability λ(t) was estimated from the spike-time histogram P(t). Note that P(t) was evaluated in discrete bins; however, for simplicity, the notation used here does not reflect this. Time t was transformed into a new time variable s such that the probability of obtaining a spike between s and s + ds was independent of time s (Reich et al., 1998). This procedure transformed an inhomogeneous renewal process into a homogeneous renewal process. The transformation was t → s = g(t) with

g(t) = T_STH [∫_0^t du P(u)] / [∫_0^{T_STH} du P(u)],   (3.1)
where u was an integration variable and T_STH was the length of each trial. The transformation t_n^i → s_n^i was performed without explicitly calculating P(t) (see Figure 3). Let {t_1^i, …, t_{N_i}^i} be the set of N_i spike times during the ith trial, i = 1, …, N_tr, where N_tr was the number of trials (see Figure 3A). The set of all spikes in all trials was collected into one set labeled by a dummy index j, {t_1, …, t_M}, and sorted in increasing value, t_{j(1)} ≤ t_{j(2)} ≤ … ≤ t_{j(k)} ≤ … ≤ t_{j(M)}; here, M = Σ_{i=1}^{N_tr} N_i was the total number of spikes across all trials (see Figure 3B). The transformed time was

s_n^i = k(j) T_STH / M.   (3.2)
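In code, the rank map of equation 3.2 is a short routine. The following Python sketch (function name is ours) assumes all spike times are distinct, in which case the rank of each spike in the pooled sorted set can be obtained with a binary search.

```python
import numpy as np

def rescale_times(trials, t_sth):
    """Map spike times t -> s = k * t_sth / M (equation 3.2), where k is the
    1-based rank of the spike in the pooled, sorted set of all M spikes.
    The pooled rescaled times are then uniform on (0, t_sth]."""
    pooled = np.sort(np.concatenate(trials))
    m = len(pooled)
    out = []
    for tr in trials:
        # side="right" yields the 1-based rank when spike times are distinct
        k = np.searchsorted(pooled, tr, side="right")
        out.append(k * t_sth / m)
    return out
```

For example, two trials with spikes {0.1, 0.5} and {0.2, 0.9} pool into four sorted times, so the rescaled trials are {0.25, 0.75} and {0.5, 1.0}.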
Here, k(j) was the index of the jth spike time t_j = t_n^i in the sorted set, i was the trial index, and n was the spike index (see Figure 3C). For the experimental data set, the rasterplot of transformed spike times s_n^i = g(t_n^i) consisted of dots that were uniformly distributed in the plane, and the STH was constant (see Figures 4Aa and 4Ab). 3.2.3 Random Shufflings of the Interspike Intervals. The interspike intervals in transformed time were defined as ν_n^i = s_n^i − s_{n−1}^i. The 0th spike time was defined as s_0^i = 0, so that there are as many intervals as there are spikes; however, s_0^i was not included in the analysis (and was not shown in the graphs). The intervals ν_n^i were randomly shuffled across trials, and the new spike times were then obtained from the shuffled intervals ν̂_n^i as ŝ_n^i = Σ_{j=1}^n ν̂_j^i (j is a dummy index). This procedure was repeated to obtain
Figure 3: Procedure to obtain a homogeneous renewal process. (A) Spike trains obtained on different trials. t_n^i was the nth spike time in the ith trial. (B) Spike times across all trials were combined into one set and sorted from low to high values (the index is indicated below the ticks). (C) The index k in the ordered set as a function of the spike time. The new spike time was s_n^i = k T_STH / M and took values between 0 and T_STH. (D) The resulting spike-time histogram was time independent.
N_s different independent realizations of the corresponding renewal process. These realizations will be referred to as surrogate spike trains. One realization generated from the experimental data set is shown in Figure 4B. As mentioned before, the STH (see Figure 4Bb) is reduced compared with the STH of the original spike trains (see Figure 4Ab) near the beginning and end of the stimulus presentation. The structure of the transformed interspike intervals, ν_n^i (see Figure 4Ac), is different from the one obtained from the surrogates (see Figure 4Bc). That difference will be quantified next. 3.2.4 Test Statistics for Renewal Processes. For a temporally modulated renewal process, all the time dependence can be removed by making the transformation t → s. However, if there was nonrenewal structure present in the original spike trains, there could still be a time dependence. Here, we focus on the time dependence of interspike intervals. The time axis was divided into N_ν discrete bins of width Δs. The mean ν(m) was calculated of
Figure 4: Spike trains of cortical neurons did not form a renewal process. Time was rescaled according to t → s = g(t), such that the spike-time probability distribution (the spike-time histogram) is independent of time s (see text). In A, rescaled spike trains from Figure 2, and in B, renewal spike trains obtained by randomly shuffling the interspike intervals from A across different trials are shown. In each row were plotted (a) the rastergram of rescaled spike times s_n^i = g(t_n^i) (t_n^i is the nth spike time in the ith trial), (b) the spike-time histogram of rescaled spike times, and (c) the rescaled interspike intervals, ν_n^i = s_n^i − s_{n−1}^i, versus s_{n−1}^i. (C) The starting time of each interval was binned (bin width was 1.5 ms), and the average interval ν(s) was determined for each bin (solid line with filled circles). Confidence intervals, ν̂ ± σ̂ (dashed lines), were determined for a renewal process with the same interval distribution using 20 random shufflings (one example was shown in B). The experimental spike trains were significantly different from renewal spike trains, χ² = 2.8 for the bins between 700 and 1600 ms at significance p < 10⁻¹¹.
all ν_n^i with starting points s_{n−1}^i that fell in the mth bin, (m − 1)Δs ≤ s_{n−1}^i < mΔs. The same procedure was performed on all of the N_s surrogate spike trains, l = 1, …, N_s, yielding ν̂_l(m). The mean and standard deviation were subsequently determined:

ν̂(m) = (1/N_s) Σ_{l=1}^{N_s} ν̂_l(m),

σ̂(m) = √[ (1/(N_s − 1)) Σ_{l=1}^{N_s} (ν̂_l(m) − ν̂(m))² ].
The original spike trains are nonrenewal when ν(m) lies outside the confidence interval given by ν̂(m) − σ̂(m) and ν̂(m) + σ̂(m) of the equivalent renewal process. The test statistic was

χ² = (1/N_ν) Σ_{m=1}^{N_ν} [(ν(m) − ν̂(m)) / σ̂(m)]².   (3.3)
The hypothesis that the spike train was generated by a renewal process was rejected when p = 1 − C(N_ν χ², N_ν) was smaller than a prescribed critical value. Here, C(N_ν χ², N_ν) was the cumulative χ² probability distribution with N_ν degrees of freedom (Abramowitz & Stegun, 1974; Larsen & Marx, 1986; Press et al., 1992). In the following, the continuous notation ν(s) will be used instead of ν(m). The confidence intervals for the experimental recordings are shown together with ν(s) in Figure 4C. For the original process, the interval distribution depended on time, whereas for the renewal process, it did not. The χ² statistic based on N_s = 20 surrogates was χ² = 2.8 for 60 degrees of freedom. Hence, the difference was highly significant, p < 10⁻¹¹, and the discharge produced by the neuron was not a renewal process. 3.3 Attractor Reliability. The attractor reliability was calculated in three steps. First, events were found (see Figure 5). Second, the binary representation for each trial was determined (see Figure 7). Third, the entropy of the distribution of spike trains was calculated (see Figure 8). The procedure was applied to the experimental data from Figure 2 and surrogate spike trains.
3.3.1 Determination of Events. An event was defined as a spike time that occurred across multiple trials. The following algorithm can be used to determine the spike times that are part of a given event. All spike times were combined in one set and ordered from low to high values, yielding {t_1, …, t_k, …, t_M}. Here, k denotes the index of the spike time in the ordered sequence. The spike time versus k curve had a steplike structure
Figure 5: Procedure for determining events in the rastergram. Data from the cortical neuron in Figure 2. (A) All the spike times were combined into one set and were ordered increasing from left to right. (B) The first difference of the sorted spike times was thresholded (threshold was 40 ms, dashed line). Spike times between two consecutive upward threshold crossings constituted an event. (C) The standard deviation of the spike times in an event and (D) the number of spikes in an event plotted as a function of event index.
(see Figure 5A). A step was formed by spike times with similar values and corresponded to an event. Between events and steps, spike times changed quickly. The first difference of the ordered spike times, Δt_k = t_{k+1} − t_k, was small within an event (roughly, the jitter divided by the number of trials) and large between different events, when k was part of one event and k + 1 was part of the other event. Hence, the time series Δt_k consisted of large values separated by many small values (see Figure 5B). The set of k values, k_1, …, k_E, where Δt_{k−1} crossed a threshold from below was determined. E is the number of events that were detected. For e = 1, …, E, event e consisted of {t_{k_{e−1}}, …, t_{k_e − 1}} with the definition k_0 = 1. The number of spikes in an event was N_e = k_e − k_{e−1}. The spike-time jitter (called "precision" in Mainen & Sejnowski, 1995) was the standard deviation of all the spike times in a given event e,

σ_e = √[ (1/N_e) Σ_{j=k_{e−1}}^{k_e − 1} t_j² − t̄_e² ],   (3.4)

t̄_e = (1/N_e) Σ_{j=k_{e−1}}^{k_e − 1} t_j.
The threshold should be chosen such that the number of spikes per event is large (but smaller than the number of trials) and the spike-time jitter is small. For the experimental data, we used a threshold of 40 ms, and 17 events were obtained. The spike-time jitter was approximately 5 ms for all events except the first (see Figure 5C). The number of spikes in an event decreased from a maximum of 220 (all trials) at the second event to approximately 110 (half the number of trials) in the last event (see Figure 5D). The STH-based reliability was

R_STH = (1/(E N_tr)) Σ_{e=1}^{E} N_e.   (3.5)
Here, N_tr was the number of trials and E was the number of events, as before. For the experimental data, R_STH ≈ 0.64. 3.3.2 Determination of the Binary Representation. Each trial i was then characterized by a binary value,

X^i = Σ_{e=1}^{E} n_e^i 2^{E−e}.   (3.6)

Here, n_e^i = 1 when there was a spike during event e on the ith trial and n_e^i = 0 otherwise. Binary numbers were also associated with subsets of the spike trains, X_{bL}^i, for the events e = b, …, b + L − 1 of trial i:

X_{bL}^i = Σ_{j=1}^{L} n_{b+j−1}^i 2^{L−j}.   (3.7)
Here, j was a dummy index, and L was the word length. It follows that X^i = X_{1E}^i. 3.3.3 Surrogate Spike Trains. Surrogate spike trains were constructed to compare the experimental spike trains with the equivalent Poisson process. Spike times could be randomly shuffled across trials, as before. However, in that case, there could be more than one spike time during an event on a given trial. This resulted in an ambiguous binary representation that could be resolved by, for instance, using ternary numbers. However, a different Poisson-like surrogate was constructed instead by randomly permuting spike times across trials for each event separately. Let {t_{n_i}^i} be all the spike times during event e, and denote the absence of a spike during a given trial i by − (n_i is the index of the spike in the ith trial that is part of event e). Then the spike times during an event could be, for instance, {t_{n_1}^1, −, −, t_{n_4}^4} (see Figure 6A). A surrogate obtained by a random permutation would be, for instance, {t_{n_4}^4, t_{n_1}^1, −, −} (see Figure 6B). Consider the case that the original
Figure 6: Procedure to generate Poisson-like surrogate spike trains. (A) Original set of spike trains: on trials 1 and 4, attractor 1 (At1) and on trials 2 and 3, attractor 2 (At2) was reached. (B) The spike times in each event e were randomly shuffled across trials. The resulting spike trains no longer resembled the attractors.
spike trains had deterministic structure and were either one of two attractor spike trains (see Figure 6A). The above procedure then breaks up the attractor spike trains and removes the non-Poisson structure (see Figure 6B). Hence, by comparing, for instance, the binary representation of the original and surrogate spike trains, the amount of non-Poisson structure can be assessed. 3.3.4 Binary Representation of Experimental Spike Trains. A transient was discarded, and binary representations were calculated based on 10 events starting from the eighth spike, yielding X_{8,10}^i (see Figure 7Aa). The trials were sorted according to the binary representation starting from the lowest value. There were two plateaus in the X_{8,10} versus index graph (indicated by * in Figure 7Aa). Each of these X_{8,10} values was obtained on 6% of the trials (* in Figure 7Ab). These spike sequences corresponded to the two 1:2 mode-locking attractors: the neuron produced spikes only on the odd cycles or only on the even cycles. On a larger number of trials, neurons were on these attractors for a shorter duration, which led to a triangle-like structure in the rastergram (see Figure 7Ac). The same analysis was performed on surrogate spike trains, and the results are shown in Figure 7B. There were no plateaus in the X_{8,10} versus index graph (see Figure 7Ba). Each X_{8,10} occurred with approximately equal
Figure 7: Cortical neuron spike trains had a deterministic structure not present in Poisson-like spike trains. Each trial was assigned a binary representation as described in the text. In A, the original data from Figure 2, and in B, spike trains obtained by randomly shuffling the spike times of each event across trials were used. In each row, (a) binary representation X_{8,10}, (b) distribution of X_{8,10} values across 220 trials, and (c) rastergram with trials ordered according to the value of X_{8,10}, increasing from bottom to top. The stars (*) and the arrows (←) are described in the text.
probability (see Figure 7Bb). The length of time that a neuron spent on an attractor on a given trial was reduced compared to the original spike trains. In particular, the triangle-like structure in the ordered rastergram was not present (see Figure 7Bc). The difference between the two sets of spike trains is quantified next using the entropy. 3.3.5 Entropy of Spike Trains. Let P_{bL}(X_{bL}) be the probability of obtaining a trial with binary representation X_{bL}. It was estimated using a finite number of trials by determining all the distinct words X_{bL} and counting how often each of them occurred across trials. The count was then normalized by the number of trials to obtain a probability. An example was shown in
Figure 7Ab. The entropy of this distribution was

S_{bL} = −Σ_{X_{bL}} P(X_{bL}) log₂ P(X_{bL}).   (3.8)

The entropy was then averaged over all allowed b values,

S_L = (1/(E − L + 1)) Σ_{b=1}^{E−L+1} S_{bL}.   (3.9)

The entropy of the distribution of the binary representation for the whole trial length was S = S_{1E},

S = −Σ_X P(X) log₂ P(X).   (3.10)

The attractor reliability was defined as

R_a = 2^{−S}.   (3.11)
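Equations 3.10 and 3.11 reduce to a few lines of Python (a sketch; `binary_reps` stands for the list of per-trial values X^i):

```python
import math
from collections import Counter

def attractor_reliability(binary_reps):
    """Estimate P(X) from the trials, compute the entropy
    S = -sum_X P(X) log2 P(X) (equation 3.10), and return the attractor
    reliability R_a = 2**(-S) (equation 3.11)."""
    counts = Counter(binary_reps)
    n = len(binary_reps)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 ** (-entropy)
```

If every trial produces the same word, S = 0 and R_a = 1; four equally likely words give S = 2 bits and R_a = 1/4, matching the interpretation of R_a as the inverse number of distinct spike trains.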
The attractor reliability can be interpreted as the inverse of the number of different X^i (this would be exact if each X value occurred with equal probability). 3.3.6 Entropy of Surrogate Spike Trains. The entropy of the surrogates was estimated as the mean of S and S_L over independent surrogates. An analytical expression for the entropy of the surrogate spike trains was calculated. Let the probability of obtaining a spike during event e on a given trial be p_e. The probability for a trial X with event occupation numbers {n_e} = {n_1, …, n_E} was

P(X) = Π_{e=1}^{E} p_e^{n_e} (1 − p_e)^{1−n_e},   (3.12)

and the entropy of this distribution was

S = −Σ_{{n_e}} P(X) log₂ P(X)
  = −Σ_{e=1}^{E} [p_e log₂ p_e + (1 − p_e) log₂(1 − p_e)].   (3.13)

Here, Σ_{{n_e}} = Σ_{n_1=0}^{1} ⋯ Σ_{n_E=0}^{1} is the sum over all possible combinations of event occupation numbers. The final result followed since the entropy of a
Figure 8: The binary representations of the experimental spike trains were less variable than those of Poisson-like processes. Binary representations with word length L were determined for each trial, and the entropy was calculated as described in the text, starting from (A) the first spike time and (B) the eighth spike time. The entropy of experimental spike trains (dashed line), entropy of surrogate spike trains averaged over 10 random shufflings (filled circles), their difference (dot-dashed line), and the analytical result for the entropy of surrogate spike trains (solid line) are shown. The standard deviation of the surrogate spike-train entropy was of the order of the circle size.
product of probability density functions is the sum over the entropy of each probability distribution. The analytical entropy of the surrogate spike trains increased linearly with E. However, for N_tr trials, the maximum entropy was log₂ N_tr; hence, the analytical limit may not be reached if too few trials are available for analysis. The entropy of the experimental spike trains was determined as a function of the word length L (see Figure 8). The entropy of the surrogate spike trains was calculated in two ways: first by determining the mean entropy of 10 surrogate spike trains and then analytically by using equation 3.13. The probability p_e of a spike during event e was estimated as the number of spikes during that event in the original data divided by the number of trials. The entropy of the original spike train was always lower than that of the surrogate spike train. This implies that there was additional structure present in the original spike train that was not present in surrogate spike trains. The entropy of surrogate spike trains started to differ from the analytical result at L = 8 (see Figure 8A). This indicated that the number of trials was not large enough to sample the probability distribution of the binary representations adequately. Initially, the difference between the entropy of the original and surrogate spike trains increased with L, but when the sampling effects occurred, it decreased. As a result, the difference had a maximum that was a sampling artifact (dot-dashed line in Figure 8A). Similar results were obtained for spike trains starting from the eighth event (see Figure 8B).
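The agreement between the analytical surrogate entropy of equation 3.13 and a sampled estimate can be checked numerically. The Python sketch below uses made-up event probabilities p_e; with many trials relative to the number of distinct words, the plug-in estimate approaches the analytical sum of Bernoulli entropies (with a small downward undersampling bias).

```python
import math
import random
from collections import Counter

# Hypothetical per-event spike probabilities p_e (illustration only).
p = [0.9, 0.5, 0.7, 0.3]

def h2(q):
    """Binary (Bernoulli) entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

# Equation 3.13: the surrogate entropy is a sum of per-event binary entropies.
s_analytic = sum(h2(q) for q in p)

# Empirical check: sample independent trials and estimate the entropy of the
# distribution of binary words.
rng = random.Random(0)
n_trials = 200000
words = Counter(tuple(int(rng.random() < q) for q in p) for _ in range(n_trials))
s_empirical = -sum((c / n_trials) * math.log2(c / n_trials)
                   for c in words.values())
```

This also illustrates the sampling caveat discussed above: with few trials, `s_empirical` saturates near log₂ N_tr and falls below the analytical value.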
3.4 Simulated Spike Trains. The statistical test for renewal behavior and the procedure to determine the attractor reliability were also applied to simulated spike trains. As an example, the spike trains of a noisy integrate-and-fire neuron that was 1:2 entrained to a sinusoidal drive were examined (see Figure 9A). The entropy of the original spike trains was lower than that of the surrogate spike trains (see Figure 9B). The spike trains were nonrenewal; the probability of obtaining the same ν(s) (see Figure 9C) from a renewal spike train was zero. The attractor reliability was R_a = 2^{−3.74} ≈ 0.075. 4 Discussion
Other statistical tests to determine whether neural spike trains form a renewal process have been proposed previously, such as the power ratio (Reich et al., 1998). The critical value of the power ratio depended on the interval distribution of the (rescaled) spike train. The test introduced here can be applied to all interval distributions, and its significance value was determined using the χ² distribution. Oram, Wiener, Lestienne, and Richmond (1999) proposed a procedure to determine whether certain patterns of spikes in multiple-unit recordings were present above chance levels (see also Abeles & Gat, 2001). The attractor reliability introduced here is related to this procedure since it assesses whether certain patterns (attractors) occur more often than expected for Poisson processes with the same spike-time histogram. The number of distinct spike trains was determined using a simple procedure. This procedure succeeds if (1) the spike times within an event are sufficiently precise and (2) the distance between different events is larger than the spike-time jitter within an event. The latter condition is not always satisfied; for instance, in Poisson spike trains, spikes can occur with arbitrarily small interspike intervals. In real spike trains, there is always a minimum interspike interval equal to the refractory period. When either of the two conditions is not satisfied, a different method is required to separate spike trains into groups. Two alternatives are the k-means method for clustering (Gershenfeld, 1999) or an algorithm based on spike metrics such as in Victor and Purpura (1996). To calculate the entropy of spike trains and compare it to equivalent renewal processes, a sufficient number of trials was needed. We found 40 to 1000 trials to be sufficient for most analyses.
Recent experimental studies show that the amplitude of a postsynaptic conductance in response to a presynaptic action potential depends on the previous presynaptic spike times (Markram & Tsodyks, 1996; Abbott, Varela, Sen, & Nelson, 1997; Markram, Wang, & Tsodyks, 1998). As a result, synapses are sensitive to temporal correlations in input spike trains (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Eguia, Rabinovich, & Abarbanel, 2000; Tiesinga, 2001). For instance, a Poisson spike train and a periodic spike train with the same average rate will yield different postsynaptic amplitudes. When there are more distinct spike trains
Attractor Reliability in Neuronal Spike Trains
Figure 9: Spike trains obtained from a noisy integrate-and-fire neuron that was 1:2 entrained to a sinusoidal drive. The spike trains had deterministic structure and did not form a renewal process. (A) The rastergrams (a) in original order and (b) ordered on binary representation. (B) Spike-train entropy; curves were annotated as in Figure 8A. (C) Average rescaled interspike interval as a function of time with confidence intervals (notation as in Figure 4C). Model parameters were I = 1.0, A = 0.17, D = 10⁻⁴, T = 2. To facilitate comparison to experimental data, time was rescaled by a factor of 40 ms.
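The model behind Figure 9 can be sketched in a few lines. The code below assumes the standard nondimensionalized form dV/dt = −V + I + A sin(2πt/T) plus gaussian white noise of intensity D; the firing threshold of 1, reset to 0, and integration details are assumptions, not taken from the figure caption:

```python
import math
import random

def simulate_lif(i_dc=1.0, a_drive=0.17, d_noise=1e-4, t_drive=2.0,
                 v_th=1.0, v_reset=0.0, dt=1e-3, t_max=200.0, seed=7):
    """Euler-Maruyama integration of the sinusoidally driven noisy
    integrate-and-fire neuron; returns the spike times."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * d_noise * dt)
    v, spikes = 0.0, []
    for k in range(int(t_max / dt)):
        t = k * dt
        v += dt * (-v + i_dc + a_drive * math.sin(2.0 * math.pi * t / t_drive))
        v += noise_scale * rng.gauss(0.0, 1.0)
        if v >= v_th:            # threshold crossing from below
            spikes.append(t)
            v = v_reset          # instantaneous reset (assumed)
    return spikes

spikes = simulate_lif()
isis = [t1 - t0 for t0, t1 in zip(spikes, spikes[1:])]
mean_isi = sum(isis) / len(isis)
# With these parameters the subthreshold voltage oscillates with amplitude
# A / sqrt(1 + (2*pi/T)**2) around I = 1, so the neuron tends to fire once
# per two drive periods: the 1:2 entrainment of Figure 9.
```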
P. H. E. Tiesinga, J.-M. Fellous, and Terrence J. Sejnowski
across trials, the synaptic drive gets more variable. The attractor reliability is a measure for this type of synaptic variability. If the task of a neuron is to transmit information about its input into its output spike train, then this variability would be considered noise since it is not related to the input. The impact on a postsynaptic neuron of the unreliability-induced synaptic variability is not clear. Cortical neurons receive a large number of synaptic inputs from different cells (reviewed in Shadlen & Newsome, 1998), and synapses themselves are unreliable (Bekkers, Richerson, & Stevens, 1990; Allen & Stevens, 1994; Zador, 1998). This issue remains for further study. A more immediate issue is how reliability depends on the characteristics of the driving input and the intrinsic neuronal dynamics. In previous experimental and theoretical studies (Mainen & Sejnowski, 1995; Nowak, Sanchez-Vives, & McCormick, 1997; Tang, Bartels, & Sejnowski, 1997; Hunter, Milton, Thomas, & Cowan, 1998; Warzecha, Kretzberg, & Egelhaaf, 1998, 2000; Cecchi et al., 2000; Kretzberg, Egelhaaf, & Warzecha, 2001; Fellous et al., 2001), it was shown, using a different reliability measure, that neurons fire unreliably in response to constant depolarizing current, but fire reliably when driven by inputs containing high-frequency components. The reliability measure introduced here forms part of a theoretical framework that allows for the systematic study of neuronal reliability. Elsewhere, we will investigate how the attractor reliability depends on the type and number of attractors and their bifurcation structure.
Acknowledgments
We thank Jack Cowan, Jorge José, Bruce Knight, Susanne Schreiber, and Peter Thomas for discussions and suggestions and Greg Horwitz, Arnaud Delorme, and the anonymous referees for comments that helped improve the presentation of the article. Some of the numerical calculations were performed at the High Performance Computer Center at Northeastern University. This work was partially funded by the Sloan-Swartz Center for Theoretical Neurobiology (P.T.) and the Howard Hughes Medical Institute (J.M.F., T.J.S.).
References
Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abeles, M., & Gat, I. (2001). Detecting precise firing sequences in experimental data. J. Neurosci. Methods, 107, 141–154.
Abramowitz, M., & Stegun, I. (1974). Handbook of mathematical functions. New York: Dover.
Allen, C., & Stevens, C. (1994). An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci., 91, 10380–10383.
Bekkers, J., Richerson, G., & Stevens, C. (1990). Origin of variability in quantal size in cultured hippocampal neurons and hippocampal slices. Proc. Natl. Acad. Sci., 87, 5359–5362.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Comput., 12, 1531–1552.
Cecchi, G., Sigman, M., Alonso, J., Martinez, L., Chialvo, D., & Magnasco, M. (2000). Noise in neurons is message dependent. Proc. Natl. Acad. Sci., 97, 5557–5561.
Eguia, M., Rabinovich, M., & Abarbanel, H. (2000). Information transmission and recovery in neural communications channels. Phys. Rev. E, 62, 7111–7122.
Fellous, J.-M., Houweling, A., Modi, R., Rao, R., Tiesinga, P., & Sejnowski, T. (2001). The frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. J. Neurophys., 85, 1782–1787.
Gershenfeld, N. (1999). The nature of mathematical modeling. Cambridge: Cambridge University Press.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10, 467–483.
Hunter, J., Milton, J., Thomas, P., & Cowan, J. (1998). Resonance effect for neural spike time reliability. J. Neurophysiol., 80, 1427–1438.
Jensen, R. (1998). Synchronization of randomly driven nonlinear oscillators. Phys. Rev. E, 58, 6907–6910.
Kretzberg, J., Egelhaaf, M., & Warzecha, A. (2001). Membrane potential fluctuations determine the precision of spike timing and synchronous activity: A model study. J. Comput. Neurosci., 10, 79–97.
Larsen, R., & Marx, M. (1986). An introduction to mathematical statistics and its applications. Englewood Cliffs, NJ: Prentice Hall.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Wang, Y., & Tsodyks, M. (1998).
Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci., 95, 5323–5328.
Nowak, L., Sanchez-Vives, M., & McCormick, D. (1997). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cereb. Cortex, 7, 487–501.
Oram, M., Wiener, M., Lestienne, R., & Richmond, B. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. J. Neurophysiol., 81, 3021–3033.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes. Cambridge: Cambridge University Press.
Reich, D., Victor, J., & Knight, B. (1998). The power ratio and the interval maps: Spiking models and extracellular recordings. J. Neurosci., 18, 10090–10104.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Strogatz, S. (1994). Nonlinear dynamics and chaos. Reading, MA: Addison-Wesley.
Tang, A., Bartels, A., & Sejnowski, T. (1997). Effects of cholinergic modulation on responses of neocortical neurons to fluctuating input. Cereb. Cortex, 7, 502–509.
Tiesinga, P. (2001). Information transmission and recovery in neural communications channels revisited. Phys. Rev. E, 64, 012901: 1–4.
Victor, J., & Purpura, K. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. J. Neurophysiol., 76, 1310–1326.
Warzecha, A., Kretzberg, J., & Egelhaaf, M. (1998). Temporal precision of the encoding of motion information by visual interneuron. Curr. Biol., 8, 359–368.
Warzecha, A., Kretzberg, J., & Egelhaaf, M. (2000). Reliability of a fly motion-sensitive neuron depends on stimulus parameters. J. Neurosci., 20, 8886–8896.
Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophys., 79, 1219–1229.
Received July 25, 2001; accepted January 4, 2002.
LETTER
Communicated by Paul Bressloff
Traveling Waves of Excitation in Neural Field Models: Equivalence of Rate Descriptions and Integrate-and-Fire Dynamics
Daniel Cremers
[email protected]
Department of Mathematics and Computer Science, University of Mannheim, 68131 Mannheim, Germany
Andreas V. M. Herz
[email protected]
Innovationskolleg Theoretische Biologie, Humboldt-Universität zu Berlin, 10115 Berlin, Germany
Field models provide an elegant mathematical framework to analyze large-scale patterns of neural activity. On the microscopic level, these models are usually based on either a firing-rate picture or integrate-and-fire dynamics. This article shows that in spite of the large conceptual differences between the two types of dynamics, both generate closely related plane-wave solutions. Furthermore, for a large group of models, estimates about the network connectivity derived from the speed of these plane waves only marginally depend on the assumed class of microscopic dynamics. We derive quantitative results about this phenomenon and discuss consequences for the interpretation of experimental data.
1 Introduction
Traveling waves of synchronized neural excitation occur in various brain regions and play an important role both during development and for information processing in the adult. Such waves have been observed in various systems, including the retina (Meister, Wong, Denis, & Shatz, 1991), olfactory bulb (Gelperin, 1999), visual cortex (Bringuier, Chavane, Glaeser, & Fregnac, 1999), and motor cortex (Georgopoulos, Kettner, & Schwartz, 1988). In vitro, excitation waves have been successfully induced to probe single-neuron dynamics as well as large-scale connectivity patterns (Traub, Jefferys, & Miles, 1993; Wadman & Gutnick, 1993; Golomb & Amitai, 1997). The omnipresence of traveling waves in neural tissues is reflected by a wealth of models, ranging from detailed biophysical approaches (Traub et al., 1993; Destexhe, Bal, McCormick, & Sejnowski, 1996; Golomb & Amitai, 1997; Rinzel, Terman, Wang, & Ermentrout, 1998) to highly simplified mathematical formulations (Beurle, 1956; von Seelen, 1968; Wilson & Cowan,
Neural Computation 14, 1651–1667 (2002) © 2002 Massachusetts Institute of Technology
1973; Amari, 1977; Ermentrout & Cowan, 1979; Idiart & Abbott, 1993; Ben-Yishai, Hansel, & Sompolinsky, 1997; Horn & Opher, 1997; Ermentrout, 1998; Kistler, Seitz, & van Hemmen, 1998; Bressloff, 1999; Golomb & Ermentrout, 1999; Bressloff, 2000; Kistler, 2000; Golomb & Ermentrout, 2001). Models of the latter type have to sacrifice biological realism to a certain extent, but they often admit a quantitative analysis of the relation between the accessible macroscopic wave phenomena and otherwise hidden aspects of the neural dynamics and connectivity patterns. In some mathematical models, for example, the speed of the emergent traveling wave can be expressed in closed form in terms of the parameters describing single-neuron behavior and network circuitry (Idiart & Abbott, 1993; Ben-Yishai et al., 1997; Horn & Opher, 1997; Ermentrout, 1998; Kistler et al., 1998; Bressloff, 1999, 2000; Golomb & Ermentrout, 1999, 2001; Kistler, 2000). Given one of these models, results from large-scale neurophysiological measurements may be directly interpreted in terms of microscopic neural parameters. Most of the mathematical models describing spatiotemporal activity patterns are formulated as field models in continuous space. Within these models, two broad classes can be distinguished: firing-rate models and models with spiking neurons. The former, more traditional, field models are based on a locally averaged neuronal firing rate. The latter, more recent, class of field models incorporates the existence of action potentials. Here, the local dynamics involve the generation of action potentials, for example, through an integrate-and-fire mechanism. A traveling wave in a firing-rate model typically corresponds to the spread of a region with a high firing rate (Wilson & Cowan, 1973; Amari, 1977; Idiart & Abbott, 1993).
As depicted in Figure 1, this is in sharp contrast to the simplest traveling wave in a system with spiking neurons, where both before and directly behind the wave front, activity is low so that the resulting neural activation pattern is highly localized in space and time. Depending on the refractory period, synaptic timescales, and heterogeneity of intrinsic and coupling properties, the activation pattern in a model with spiking neurons may also exhibit complicated spatiotemporal structures, including multiple reexcitations behind the wave front. Numerical simulations show, however, that these phenomena have almost no effect on the speed of the wave front (Ermentrout, 1998). Independent of the precise structure of traveling waves in firing-rate models and systems with spiking neurons, there is no doubt that the two types of wave correspond to entirely different dynamical scenarios. Are these two scenarios nevertheless related to each other on a mathematical level, even if they describe two highly distinct biological situations? If so, can results obtained within one framework be applied to phenomena observed in the other setting? This article addresses these questions and shows that surprisingly there is a close connection between the two model classes. In particular, it is demonstrated that for a large group of commonly used systems, there exists
Figure 1: Schematic space-time plots of the simplest traveling wave in a one-dimensional firing-rate model (left) and an integrate-and-fire model (right). The spatial activity at two different times is highlighted by thick lines. In the firing-rate model, the wave front connects a region with a low firing rate with a region with a high firing rate. In the integrate-and-fire model, on the other hand, activity is low both before and directly behind the wave front. Reexcitation may lead to more complicated activity profiles but has almost no effect on the speed of the wave front. This justifies the study of simplified integrate-and-fire models with a single narrow activity peak as depicted on the right side.
a one-to-one mapping between the two classes. This mapping involves a nontrivial geometric transformation of the neural connectivity pattern. As a consequence, biological interpretations of experimental measurements in terms of the underlying neural circuitry may depend strongly on the assumed class of model dynamics. We derive quantitative results about this effect, discuss implications for data analysis, and close with some remarks on extensions and limitations of this approach. 2 Field Models Based on Mean-Firing-Rate Descriptions
In traditional field models such as those proposed by Wilson and Cowan (1973) or Amari (1977), neural activity is treated as a phenomenological variable u(x, t) that represents the local short-time averaged membrane potential. In this continuum approximation, the time evolution of neural activity is described by a partial differential equation, for example, the frequently used prototype

$$\tau \frac{\partial u(x,t)}{\partial t} = -u(x,t) + \int_{-\infty}^{t} dt'\, a(t-t') \int dy\, J(x-y)\, g\!\left[u\!\left(y,\, t' - \frac{|x-y|}{c}\right)\right]. \tag{2.1}$$
It describes the dynamics of one homogeneous population of neurons. The parameter τ denotes an effective membrane time constant, and J(z) models the distance dependence of synaptic coupling strengths, often in the
form of a Mexican hat type of interaction with strong short-range excitation surrounded by weak inhibition. A prominent example for the functional form of such distance-dependent coupling strengths is the difference of two gaussians. The function g(u) characterizes the nonlinear dependence of the firing rate on the mean local membrane potential. Such activation functions are usually modeled by sigmoid functions, that is, monotone increasing and s-shaped functions that approach zero for u → −∞ and saturate for u → +∞. Note that within this class of model, firing rates and mean membrane potentials are treated on the same footing, simply related by the static nonlinearity g. The kernel a captures the dynamical effects of signal delays and (post)synaptic integration processes. For example, a(s) = δ(s − t_axon) describes a discrete uniform delay, and

$$a(s) = \tau_{\text{psp}}^{-1}\, H(s)\, e^{-s/\tau_{\text{psp}}} \tag{2.2}$$

or

$$a(s) = s\, \tau_{\text{psp}}^{-2}\, H(s)\, e^{-s/\tau_{\text{psp}}} \tag{2.3}$$
mimic two commonly used time courses of postsynaptic potentials. Distance-dependent axonal propagation delays are taken care of by the term |x − y|/c in the argument of u in equation 2.1, where c denotes the signal propagation velocity. In the mathematical analysis, we will neglect the effects of finite c and set 1/c = 0 for simplicity. From a neurobiological point of view, this is justified if characteristic velocities of the large-scale activity patterns are small compared to the axonal signal velocity. All our results do, however, generalize to finite propagation speed. In what follows, we assume that, on average, excitatory synaptic interactions are stronger than inhibitory interactions. This means that the temporal and spatial kernels can be taken to be normalized to plus one,

$$\int_0^{\infty} ds\, a(s) = 1 \tag{2.4}$$

and

$$\int_{-\infty}^{\infty} dz\, J(z) = 1. \tag{2.5}$$
Temporal integration of equation 2.1 results in the integral equation

$$u(x,t) = \int_{-\infty}^{t} dt'\, \varepsilon(t-t') \int dy\, J(x-y)\, g[u(y,t')] \tag{2.6}$$

with

$$\varepsilon(s) = \tau^{-1} \int_0^{s} ds'\, e^{-\frac{s-s'}{\tau}}\, a(s'). \tag{2.7}$$

As a consequence of equation 2.4, the kernel ε is also normalized,

$$\int_0^{\infty} ds\, \varepsilon(s) = 1. \tag{2.8}$$
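For the discrete-delay kernel a(s) = δ(s − t_axon) mentioned above, equation 2.7 can be evaluated in closed form:

$$\varepsilon(s) = \tau^{-1} \int_0^{s} ds'\, e^{-\frac{s-s'}{\tau}}\, \delta(s' - t_{\text{axon}}) = \tau^{-1}\, e^{-\frac{s - t_{\text{axon}}}{\tau}}\, H(s - t_{\text{axon}}),$$

a delayed exponential that again satisfies the normalization of equation 2.8, since the integral of $\tau^{-1} e^{-s/\tau}$ over the positive half-line equals one.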
As will become apparent in the following sections, the integral formulation, equation 2.6, is most suitable for discussing wave phenomena.
3 Wave Fronts in One-Dimensional Firing-Rate Models
Because of the normalizations 2.5 and 2.8, all homogeneous and stationary solutions u(x, t) = u of equation 2.6 satisfy a simple fixed-point equation,

$$u = g(u). \tag{3.1}$$

Depending on the shape of g, equation 3.1 allows a single or multiple solutions. Of particular interest are sigmoid functions with three solutions u_r < u_u < u_e. Here, u_r corresponds to the stable rest state of a neuron (low firing rate), u_e, to the stable excited state (high firing rate), and u_u, to the intermediate unstable equilibrium. In this case of a bistable network, a traveling wave joining the rest state u_r and the excited state u_e can be triggered for appropriate initial conditions. For example, if lim_{x→−∞} u(x, 0) = u_e and lim_{x→∞} u(x, 0) = u_r, a wave of excitation may propagate in the positive x-direction through the system. If g vanishes for u less than a fixed threshold ϑ > 0,

$$g(u) = 0 \quad \text{for} \quad u < \vartheta, \tag{3.2}$$
the rest state is given by u_r = 0, and finding traveling wave fronts of equation 2.6 simplifies significantly, as shown by Idiart and Abbott (1993). Consider again a wave propagating in the positive x-direction that reaches the point x_0 at time t_0, that is, u(x_0, t_0) = ϑ. Since the wave is approaching from the left, g[u(y, t′)] = 0 for all y > x_0 as long as t′ < t_0. Thus, although in general the activity u(y, t′) itself is nonzero in front of the wave, due to the condition 3.2, this has no effect on the dynamics for u < ϑ. (See Figure 2.) Evaluated at x = x_0 and t = t_0, equation 2.6 therefore reads

$$\vartheta = u(x_0,t_0) = \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, J(x_0-y)\, g[u(y,t')]. \tag{3.3}$$
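As a consistency check on this threshold construction, the bistable front can be simulated directly. The sketch below (all parameter values illustrative, with 1/c = 0, a(s) = δ(s), gaussian couplings, and a step nonlinearity g(u) = H(u − ϑ)) integrates a discretized version of equation 2.1 from an excited left half and tracks the rightmost above-threshold grid point, which advances steadily to the right:

```python
import math

TAU, THETA = 1.0, 0.2        # membrane time constant and threshold
DX, DT, N = 0.1, 0.02, 120   # grid spacing, time step, grid points
SIGMA_J = 0.5                # width of the gaussian coupling J

HALF = int(4 * SIGMA_J / DX)
kernel = [math.exp(-(k * DX) ** 2 / (2 * SIGMA_J ** 2))
          for k in range(-HALF, HALF + 1)]
norm = sum(kernel) * DX      # normalize so that sum(J) * dx = 1 (eq. 2.5)
kernel = [j / norm for j in kernel]

def front_position(u):
    """Rightmost grid point whose activity is above threshold."""
    return max(i for i, ui in enumerate(u) if ui >= THETA)

u = [1.0 if i < N // 4 else 0.0 for i in range(N)]  # excited left half
p_start = front_position(u)
for _ in range(400):
    g = [1.0 if ui >= THETA else 0.0 for ui in u]   # g(u) = H(u - theta)
    u = [ui + DT / TAU * (-ui + DX * sum(
            kernel[k + HALF] * g[i - k]
            for k in range(-HALF, HALF + 1) if 0 <= i - k < N))
         for i, ui in enumerate(u)]
p_end = front_position(u)
# The rest state u = 0 and the excited state u = 1 are both fixed points of
# u = g(u); with theta < 1/2 the excited region invades the rest state.
```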
Figure 2: Schematic plot of a wave front in a firing-rate model. The wave is traveling in the positive x-direction, as indicated by the arrow. The profile of neural excitation u(x, t) is shown at two instances: at t = t_0 by a solid line, and at some earlier time t < t_0 by a dashed line. In addition, the profile of g[u(x, t)] is depicted for t = t_0 by a dotted line. Notice that although u(x, t_0) is nonzero for all x, g[u(x, t_0)] vanishes for x > x_0 because a threshold nonlinearity with g(u) = 0 for u < ϑ was chosen.
If one inserts the traveling-wave ansatz,

$$u(x,t) = \tilde{u}(t - x/v), \tag{3.4}$$

into equation 3.3, influences of the structure of J and the shape of g on the propagation velocity v may readily be analyzed (Idiart & Abbott, 1993).
4 Field Models Based on Integrate-and-Fire Neurons
The models described above are based on a firing-rate description of neural activity and neglect the existence of action potentials. More recently, a different class of field models has been introduced that explicitly incorporates the spiking nature of neural activity (Ermentrout, 1998; Kistler et al., 1998; Bressloff, 1999, 2000; Golomb & Ermentrout, 1999, 2001; Kistler, 2000). On the single-cell level, the most salient features of spiking neurons are captured by integrate-and-fire models. Here, each neuron integrates synaptic inputs and generates a uniform action potential whenever the membrane potential crosses a fixed threshold from below. Simulations with networks of regularly spaced integrate-and-fire neurons reveal a large variety of macroscopic activity patterns such as plane waves, rotating spirals, and expanding rings (see, for example, Fohlmeister, Gerstner, Ritz, & van Hemmen, 1995; Ermentrout, 1998; Horn & Opher, 1997; Kistler et al., 1998).
The length of the refractory period influences the profile of traveling waves in systems with spiking neurons. Depending on the model parameters, shapes range from narrow soliton-like excitations, where each neuron fires only once, to complex activity profiles due to multiple reexcitations behind the wave front. However, extensive numerical investigations provide strong evidence that the speed of a traveling wave front is largely independent of subsequent spike activity (Ermentrout, 1998). This result justifies the study of models without reexcitation, such as one-dimensional systems of the type

$$\tau \frac{\partial V(x,t)}{\partial t} = -V(x,t) + \int_{-\infty}^{\infty} dy\, C(x-y)\, \tilde{a}[t - t^{\star}(y)] \tag{4.1}$$
where V(x, t) describes the local membrane potential, C(z) represents synaptic coupling strengths, and ã captures transmission phenomena as in section 2. The time t*(y) denotes the firing time of a neuron at location y, determined by the conditions V(y, t*) = ϑ and ∂V(y, t)/∂t > 0 for t = t* (Ermentrout, 1998). With the definition

$$\rho(s) = \tau^{-1} \int_0^{s} ds'\, e^{-\frac{s-s'}{\tau}}\, \tilde{a}(s'), \tag{4.2}$$
integration of equation 4.1 results in the integral equation

$$V(x,t) = \int_{-\infty}^{t} dt'\, \rho(t-t') \int_{-\infty}^{\infty} dy\, C(x-y)\, \delta[V(y,t')-\vartheta]\, \frac{\partial V(y,t')}{\partial t'}\, H\!\left[\frac{\partial V(y,t')}{\partial t'}\right], \tag{4.3}$$
where H(·) denotes the Heaviside step function. A dynamical description that is closely related to equation 4.3 was obtained by Kistler (2000), who also provided a mathematical proof that the dynamics of a spatially discretized network approaches the dynamics of a neural field model if the characteristic length scale of neural interactions is much larger than the grid size. In this limit, the membrane potential V at location x and time t becomes

$$V(x,t) = \int_{-\infty}^{t} dt'\, \rho(t-t') \int_{-\infty}^{\infty} dy\, C(x-y)\, S(y,t') + \int_{-\infty}^{t} dt'\, \eta(t-t')\, S(x,t'), \tag{4.4}$$
where spike activity at location y and time t′ is now denoted by S(y, t′), and ρ and C model the temporal and spatial aspects of signal transmission and
integration similar to equation 2.1. Refractoriness following local spike activity results from hyperpolarization, whose time course is described by η(s). Spike activity is triggered whenever the local field crosses a threshold ϑ from below,

$$S(x,t) = \delta[V(x,t)-\vartheta]\, \frac{\partial V(x,t)}{\partial t}\, H\!\left[\frac{\partial V(x,t)}{\partial t}\right]. \tag{4.5}$$

According to the derivation given by Kistler, the variables V(x, t) and S(x, t) can be interpreted as interpolated versions of the corresponding variables V(x_i, t) and S(x_i, t) of the original spatially discretized network, where neuron i, with 1 ≤ i ≤ N, is located at site x_i. From a neurophysiological point of view, this implies that V(x, t) and S(x, t) mimic different components of the local field potential recorded at location x, with V(x, t) representing averaged postsynaptic potentials and S(x, t) describing the effects of spike activity. Equation 4.5 implies that elementary spike activity is described by a δ-function whose size, when integrated over time, is normalized to unity (Ermentrout, 1998; Bressloff, 1999, 2000; Golomb & Ermentrout, 1999; Kistler, 2000). This normalization reflects the unitary character of action potentials. In an alternative approach by Kistler et al. (1998), the factor |∂V(x, t)/∂t| is replaced by |∂V(x, t)/∂x| so that spike activity is normalized when integrated in the spatial domain. However, as pointed out by Bressloff (1999), and acknowledged in Kistler (2000), only the first description, equation 4.5, is biophysically justified. Focusing on the wave front of a traveling wave, we may neglect the possible occurrence of reexcitations and set the second term on the right-hand side of equation 4.4 to zero without loss of generality. Inserting equation 4.5 into 4.4 then yields 4.3 as desired.
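The temporal normalization can be checked directly: along any trajectory V(t) that crosses the threshold from below at times t_k, the substitution rule for the δ-function gives

$$\int dt\, \delta[V(t)-\vartheta]\, \dot{V}(t)\, H[\dot{V}(t)] = \sum_{k} \frac{\dot{V}(t_k)}{|\dot{V}(t_k)|}\, H[\dot{V}(t_k)] = \sum_{k} 1,$$

one unit per upward crossing, independent of how fast the threshold is crossed.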
Although at first glance, equation 4.3 is somewhat reminiscent of 2.6, the two classes of neural field models are quite different from a conceptual point of view. The first assumes that the response properties of a neuron are fully described by its firing rate. As a consequence, the local field u(x, t) in equation 2.1 represents a postsynaptic membrane potential that has been averaged in both space and time. Within the integrate-and-fire field model, in contrast, no temporal average is carried out; the variable S(x, t) in equation 4.5 represents spike activity whose time evolution is given by an integrate-and-fire mechanism. Despite these differences, both types of model show similar macroscopic patterns such as traveling waves of excitation, but the details of these waves differ significantly, as illustrated in Figure 1.
Figure 3: Corresponding synaptic connectivities in firing-rate and integrate-and-fire models. Each of the two panels, A and B, shows one example of a J coupling (firing-rate model, left side) and a T coupling (integrate-and-fire model, right side). In the upper panel, gaussian couplings are assumed for the firing-rate model and lead to a strongly peaked coupling distribution for the integrate-and-fire model. In the lower panel, gaussian couplings are assumed for the integrate-and-fire model and imply a coupling distribution for the firing-rate model that is peaked away from zero.
What is the relation between the two approaches? To investigate this question, we will now map the dynamics of the firing-rate model onto the dynamics of the integrate-and-fire model. We focus on the case of a propagating wave front as described in section 3 and denote its propagation velocity by v. Furthermore, we assume symmetric J couplings, J(z) = J(−z), that are continuous and integrable. We introduce auxiliary couplings T(z) by

$$T(z) := \begin{cases} v^{-1} \displaystyle\int_{-\infty}^{z} dz'\, J(z'), & z < 0, \\[8pt] v^{-1} \displaystyle\int_{z}^{\infty} dz'\, J(z'), & z \ge 0. \end{cases} \tag{5.1}$$
(See Figure 3.) Due to the assumptions for J(z), these new couplings are also symmetric, T(z) = T(−z), and continuous at z = 0 with

$$T(0) = \frac{1}{2v}. \tag{5.2}$$
In terms of the T couplings, the original J couplings are given by

$$J(z) = \begin{cases} +v\, \dfrac{d}{dz} T(z), & z < 0, \\[8pt] -v\, \dfrac{d}{dz} T(z), & z \ge 0. \end{cases} \tag{5.3}$$
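The transformation 5.1 through 5.3 is easy to verify numerically. The sketch below (with illustrative values for the wave speed v and the gaussian width, neither taken from the text) constructs T from a gaussian J by quadrature, checks the continuity condition 5.2, and recovers J from T by numerical differentiation:

```python
import math

V_SPEED, SIGMA = 2.0, 0.7    # assumed wave speed and gaussian width

def j_gauss(z):
    """Normalized gaussian coupling J(z); eq. 2.5 holds."""
    return math.exp(-z * z / (2 * SIGMA ** 2)) / (math.sqrt(2 * math.pi) * SIGMA)

def t_coupling(z, n=8000, z_far=8.0):
    """T(z) from eq. 5.1: integrate J from -z_far to z (z < 0) or from
    z to z_far (z >= 0), by the composite trapezoidal rule."""
    a, b = (-z_far, z) if z < 0 else (z, z_far)
    h = (b - a) / n
    s = 0.5 * (j_gauss(a) + j_gauss(b))
    s += sum(j_gauss(a + k * h) for k in range(1, n))
    return s * h / V_SPEED

# Continuity at the origin (eq. 5.2): T(0) = 1/(2v), by symmetry of J.
t0 = t_coupling(0.0)

# Inverting the transformation (eq. 5.3): J(z) = -v dT/dz for z >= 0,
# here approximated by a central finite difference.
z, h = 0.9, 1e-2
j_back = -V_SPEED * (t_coupling(z + h) - t_coupling(z - h)) / (2 * h)
```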
By definition, the T couplings vanish for large positive and negative argument. Note that apart from the special case of exponential couplings, J(z) ∝ exp(−|z|/σ), the couplings J(z) and T(z) do not have the same shape. Using integration by parts, we obtain for the second integral in equation 3.3,

$$\int_{-\infty}^{x_0} dy\, J(x_0-y)\, g[u(y,t')] = v \int_{-\infty}^{x_0} dy\, \left(\frac{\partial}{\partial y} T(x_0-y)\right) g[u(y,t')]$$
$$= v\, T(0)\, g[u(x_0,t')] - v \int_{-\infty}^{x_0} dy\, T(x_0-y)\, h[u(y,t')]\, \frac{\partial u(y,t')}{\partial y} \tag{5.4}$$
with

$$h(u) := \frac{dg(u)}{du}. \tag{5.5}$$
Because T(0) is finite and g[u(x_0, t′)] vanishes for all t′ < t_0, inserting equation 5.4 into equation 3.3 results in

$$\vartheta = u(x_0,t_0) = -v \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, h[u(y,t')]\, \frac{\partial u(y,t')}{\partial y}$$
$$= \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, h[u(y,t')]\, \frac{\partial u(y,t')}{\partial t'}, \tag{5.6}$$

where we have employed the identity v · |∂u(y, t′)/∂y| = |∂u(y, t′)/∂t′| and the fact that the wave is traveling in the positive x-direction so that ∂u/∂y ≤ 0. Due to the particular definition of the T couplings (see equation 5.1), equation 5.6 also holds for waves traveling in the negative x-direction. In this case, ∂u/∂y ≥ 0 and J(x_0 − y) = −v (d/dy) T(x_0 − y) in the relevant region of space (y > x_0). The above calculations hold for any kind of threshold nonlinearity (see equation 3.2). For the commonly considered case where the nonlinearity g(u) is a step function (see, for example, Amari, 1977),

$$g(u) = H(u - \vartheta), \tag{5.7}$$
the function h becomes a δ-function,

$$h(u) = \delta(u - \vartheta), \tag{5.8}$$

and equation 5.6 reduces to

$$\vartheta = u(x_0,t_0) = \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, \delta[u(y,t')-\vartheta]\, \frac{\partial u(y,t')}{\partial t'}$$
$$= \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, \delta[u(y,t')-\vartheta]\, \frac{\partial u(y,t')}{\partial t'}\, H\!\left[\frac{\partial u(y,t')}{\partial t'}\right]. \tag{5.9}$$
In the last step, we used the fact that ∂u/∂t > 0 at threshold crossing. The comparison of equation 5.9 with equation 4.3, evaluated for x = x_0 and t = t_0, reveals that equation 5.9 is identical with the threshold condition for a wave front in the integrate-and-fire field model if we identify u(x, t) = V(x, t), ε(t) = ρ(t), and C(z) = T(z). This finding implies that the wave front in a one-dimensional field model with step nonlinearity is equivalent to the wave front in a field model with integrate-and-fire neurons. Note, however, that the spatial couplings in both models are not the same but related via the transformation 5.1. More generally, even firing-rate models with smooth sigmoid nonlinearities g(u) can be mapped onto integrate-and-fire models, as shown by equation 5.6. In that case, the shape of an action potential in the associated integrate-and-fire model is no longer given by a δ-function as for step nonlinearities but rather by the expression

$$S(x,t) = \frac{dg[u(x,t)]}{du}\, \frac{\partial u(x,t)}{\partial t}. \tag{5.10}$$

Depending on the shape of g, different action-potential shapes may thus be modeled.
6 From Macroscopic Wave Phenomena to Microscopic Dynamics
Let us investigate the interpretation of an electrophysiological experiment of the type performed by Traub et al. (1993), Wadman and Gutnick (1993), or Golomb and Amitai (1997) in which a wave velocity v has been measured in some neural substrate. What are the differences between the inferred system parameters depending on whether we base our calculations on a rate model, as in Idiart and Abbott (1993), or an integrate-and-fire mechanism
as in Ermentrout (1998)? This question is an extreme case of a situation often encountered in a model-based analysis of neurophysiological data: How do assumptions about the underlying dynamics influence the system parameters derived from experimental measurements? In the present case, we can answer this question analytically and gain further insight into the scope and limitations of deriving network characteristics from the properties of waves of neural excitation. To do so, we assume some functional form for the true synaptic couplings C(z). For concreteness, we take the couplings to be gaussians with standard deviation σ. Within the firing-rate model, we therefore set J(z) = C(z) and denote σ by σ_J. Within the integrate-and-fire model, we have to derive J(z) through equation 5.3 from gaussian couplings T(z) with σ = σ_T. The T couplings thus have the same shape as C(z) but are normalized according to equation 5.2 so that in both scenarios, the J couplings satisfy the normalization 2.5 to guarantee an unbiased comparison. Keeping all other parameters fixed, we then derive analytical expressions for σ_J and σ_T, compare their values, and may thus judge how strongly the inferred network parameters depend on the assumptions about the single-cell dynamics. We present three different examples of this approach. In the first example, we investigate the general model, equation 2.1, with finite signal propagation speed c < ∞. Inserting the traveling wave ansatz 3.4 into equation 3.3 and integrating by parts with respect to t′, we can extend the approach of Idiart and Abbott (1993), who focused on the specific case a(s) = δ(s). We obtain for step nonlinearities, g(u) = H(u − ϑ), the following implicit equation for v,

$$\vartheta = \int_{-\infty}^{0} dz\, J(z) \int_0^{-\gamma z} ds\, \varepsilon(s), \tag{6.1}$$

with

$$\gamma = v^{-1} - c^{-1}, \tag{6.2}$$
where the wave is, as before, traveling in the positive x-direction. Assuming that the shape a(s) of a postsynaptic potential is exponentially decaying (see equation 2.2), we get

    ε(t) = (1/(τ_psp − τ)) (e^{−t/τ_psp} − e^{−t/τ}) H(t),    (6.3)

and equation 6.1 becomes

    ϑ = ∫_{−∞}^{0} dz J(z) [1 − (1/(τ_psp − τ)) (τ_psp e^{−γ|z|/τ_psp} − τ e^{−γ|z|/τ})].    (6.4)
Traveling Waves of Excitation
1663
To simplify the mathematics, we consider couplings whose spatial range is small compared to the distance the wave travels in the time interval τ̃ = min{τ, τ_psp} relevant for the microscopic dynamics. In other words, we focus on the case where

    ∫_{−∞}^{0} dz |z|^n J(z) ≪ (v τ̃)^n  for all n ∈ ℕ.    (6.5)

In this regime, we can neglect higher-order terms in the expansion of the exponential functions in equation 6.4 and obtain

    ϑ ≈ (γ²/(2 τ τ_psp)) ∫_{−∞}^{0} dz J(z) z².    (6.6)
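As a numerical sanity check of this small-coupling expansion (our own sketch, not from the paper), one can compare the exact threshold integral of equation 6.4 for a gaussian J, normalized to total weight 1, with the approximation obtained from equation 6.6 using the gaussian moment ∫_{−∞}^{0} dz J(z) z² = σ²/2, that is, ϑ ≈ γ²σ²/(4ττ_psp). The parameter values below are illustrative placeholders.

```python
# Numerical sanity check (our sketch, not from the paper): for a gaussian
# coupling J(z) with total weight 1 and gamma * sigma << tau, the exact
# threshold integral (equation 6.4) is compared with the small-coupling
# approximation theta ~ gamma^2 * sigma^2 / (4 * tau * tau_psp).
import math

TAU, TAU_PSP = 1.0, 2.0     # placeholder time constants
GAMMA, SIGMA = 0.01, 0.5    # gamma = 1/v - 1/c; coupling width sigma

def J(z):
    """Gaussian coupling, standard deviation SIGMA, total weight 1."""
    return math.exp(-z * z / (2.0 * SIGMA ** 2)) / (math.sqrt(2.0 * math.pi) * SIGMA)

def integrand(z):
    """Integrand of the threshold integral (equation 6.4), z <= 0."""
    inner = (TAU_PSP * math.exp(-GAMMA * abs(z) / TAU_PSP)
             - TAU * math.exp(-GAMMA * abs(z) / TAU)) / (TAU_PSP - TAU)
    return J(z) * (1.0 - inner)

def theta_exact(zmin=-8.0, n=4000):
    """Trapezoid rule for the integral of the integrand over [zmin, 0]."""
    h = -zmin / n
    total = 0.5 * (integrand(zmin) + integrand(0.0))
    for k in range(1, n):
        total += integrand(zmin + k * h)
    return total * h

theta_num = theta_exact()
theta_approx = GAMMA ** 2 * SIGMA ** 2 / (4.0 * TAU * TAU_PSP)
print(theta_num, theta_approx)  # the two values nearly coincide
```

The relative discrepancy is of order γσ/τ, so shrinking γσ makes the agreement arbitrarily good.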
To answer the question posed in the beginning of this section, we now consider on the one hand gaussian J couplings and arrive at ∫_{−∞}^{0} dz J(z) z² = σ_J²/2. If, on the other hand, we use gaussian T couplings with proper normalization (see equation 5.2), we obtain ∫_{−∞}^{0} dz J(z) z² = σ_T². Independent of the values of τ, τ_psp, c, and v, the estimates for σ_J and σ_T are therefore always related by

    σ_J = √2 σ_T.    (6.7)

We now turn to a second example and assume instantaneous interactions without any signal delays, a(s) = δ(s), so that ε(s) = τ⁻¹ e^{−s/τ}. For gaussian J couplings, we can solve equation 6.1 and obtain

    ϑ = 1/2 − (1/2) exp(γ²σ_J²/(2τ²)) [1 − erf(γσ_J/(√2 τ))],    (6.8)

where erf(x) denotes the error function. If, on the other hand, we take T couplings with gaussian shape, we get

    ϑ = √(π/2) (γσ_T/(2τ)) exp(γ²σ_T²/(2τ²)) [1 − erf(γσ_T/(√2 τ))].    (6.9)

Equating the right-hand sides of equations 6.8 and 6.9 for γσ ≪ τ, we obtain

    σ_J = (π/2) σ_T.    (6.10)
Finally, if instead of gaussian couplings we use block-shaped couplings, J(z) = 1/(2a_J) for −a_J ≤ z ≤ a_J and zero elsewhere, a calculation similar to equations 6.8 through 6.10 leads to

    σ_J = 2 σ_T.    (6.11)
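As a numerical cross-check of the second example (our own sketch, not from the paper), one can solve the two threshold conditions for σ by bisection at a fixed threshold and compare the ratio σ_J/σ_T with π/2 in the regime γσ ≪ τ. The closed forms below restate equations 6.8 and 6.9 under the normalizations used in this section; all parameter values are placeholders.

```python
# Numerical cross-check (our sketch, not from the paper) of the
# delta-interaction example: solve each threshold condition for sigma by
# bisection and compare sigma_J / sigma_T with pi/2 when gamma*sigma << tau.
import math

TAU = 1.0     # membrane time constant tau (placeholder)
GAMMA = 0.01  # gamma = 1/v - 1/c, small so that gamma * sigma << tau

def theta_J(sigma):
    """Threshold condition for gaussian J couplings (equation 6.8)."""
    x = GAMMA * sigma / (math.sqrt(2.0) * TAU)
    return 0.5 - 0.5 * math.exp(x * x) * math.erfc(x)

def theta_T(sigma):
    """Threshold condition for gaussian T couplings (equation 6.9)."""
    x = GAMMA * sigma / (math.sqrt(2.0) * TAU)
    pref = math.sqrt(math.pi / 2.0) * GAMMA * sigma / (2.0 * TAU)
    return pref * math.exp(x * x) * math.erfc(x)

def solve_sigma(theta_fn, theta, lo=1e-9, hi=200.0, iters=100):
    """Bisection; both conditions grow monotonically with sigma, and hi is
    kept moderate so that exp(x * x) does not overflow."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if theta_fn(mid) < theta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

THETA = 0.002  # a fixed firing threshold reached by both descriptions
sigma_J = solve_sigma(theta_J, THETA)
sigma_T = solve_sigma(theta_T, THETA)
print(sigma_J / sigma_T)  # close to pi/2 ~ 1.571 (equation 6.10)
```

The ratio approaches π/2 from above as γσ/τ → 0, matching equation 6.10.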
Thus, in the three cases investigated, the width of the same type of synaptic coupling inferred from the measured wave velocity differs by a factor of 1.4 to 2 between the firing-rate and integrate-and-fire pictures. Furthermore, we obtain σ_J > σ_T in all three cases. For gaussian couplings C, Figure 3 offers a heuristic explanation of this result: the right-hand side of equation 6.1 depends on the higher-order moments of J(z). For gaussian T couplings (lower right in the figure) with given standard deviation σ_T, the higher-order moments of the derived J couplings (lower left in the figure) are smaller than those of gaussian J couplings (upper left) with the same standard deviation, σ_J = σ_T. To compensate for this difference, we have to choose σ_J > σ_T, as verified by the analytical calculation.

7 Discussion
As shown in this article, wave fronts of traveling waves in firing-rate models and models with integrate-and-fire dynamics are intimately related on a formal level. Taking into account that these descriptions are extreme and opposite caricatures of the true neural dynamics, the differences in the inferred model parameters are surprisingly small and suggest that more elaborate biophysical models might show intermediate results. We may thus conclude that macroscopic wave phenomena can be used reliably to estimate characteristics of neural dynamics and network architecture that are not directly observable.

However, it should be noted that our calculations are based on simple homogeneous single-layer models. Neural tissues with multiple layers, more complicated feedback structures, or additional slow dynamical components might support traveling waves of a rather different nature, as also suggested by the results of Rinzel et al. (1998) and Golomb and Ermentrout (1999, 2001). The emergent properties of such neural systems might depend sensitively on the biophysical details of the underlying dynamics.

Our studies have been restricted to integrate-and-fire models that do not exhibit reexcitation. This approach is justified by the observation of Ermentrout (1998) that the speed of a traveling wave front is largely independent of subsequent spiking activity. More elaborate modeling frameworks could also be used to describe nonvanishing asynchronous activity ahead of and behind the traveling wave front and to include heterogeneity of neural and synaptic properties. We believe that these extensions offer interesting topics for further research but that they will not change the overall picture of the relation between wave phenomena in firing-rate and integrate-and-fire models.

In addition, traveling wave solutions in both types of model are likely to share the same stability properties, but we have not been able to prove this analytically. However, even if the stability of two corresponding solutions differed under certain circumstances, our analysis would still be helpful in that it allows finding unstable waves in one modeling framework by searching for all stable solutions in the
other framework, which, from a numerical point of view, is a much simpler task.

The equivalence of firing-rate and integrate-and-fire models also holds for plane waves in two- and higher-dimensional systems. For concreteness, let us assume that the wave in a three-dimensional firing-rate model is propagating in the x-direction so that u(x, y, z, t) depends on only x and t. We may then reduce the original system to an effective one-dimensional system by defining effective J_eff couplings through J_eff(x) = ∫_{−∞}^{∞} dy ∫_{−∞}^{∞} dz J(x, y, z) and compute the corresponding T_eff couplings as in equation 5.1. We can therefore describe the original wave within either a one-dimensional firing-rate or a one-dimensional integrate-and-fire picture. In a further step, we may also wish to know which three-dimensional integrate-and-fire models are compatible with the one-dimensional model. To answer this question, we have to search for couplings T(x, y, z) that satisfy ∫_{−∞}^{∞} dy ∫_{−∞}^{∞} dz T(x, y, z) = T_eff(x). This problem is underdetermined, so that we may set additional constraints on T(x, y, z). For example, we can ask which isotropic and homogeneous couplings T̃, T(x, y, z) = T̃(x² + y² + z²), correspond to a given T_eff.

Although successful for plane wave solutions, our approach is not useful for circular or spiral waves. These solutions break the translation symmetry of the underlying network and distinguish one specific location: the origin of the expanding wave. If we nevertheless apply our methods and start from homogeneous couplings in the firing-rate or integrate-and-fire framework, we obtain inhomogeneous couplings in the other framework. Furthermore, the coupling strengths explicitly reflect the location of the origin of the specific wave solution. With these limitations in mind, the current approach may help to unify different concepts of neural field dynamics.

Let us close with a final observation about the relation between firing-rate and integrate-and-fire neural network models.
Firing-rate dynamics are usually derived from models with spiking neurons by taking short-time averages. In field models with spatially continuous neural activity, a conceptually different method becomes possible: spatial integration of the dynamical variables, together with spatial differentiation of the synaptic couplings. For traveling waves, no information is lost by this transformation. This shows that under certain circumstances, microscopic dynamical descriptions as different as firing-rate and integrate-and-fire models may exhibit equivalent patterns of large-scale activity.
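The plane-wave reduction discussed above can be illustrated numerically; the following sketch (ours, not the authors' code) assumes an isotropic 3D gaussian coupling, for which integrating out y and z yields an effective 1D coupling J_eff(x) that is again a gaussian with the same standard deviation.

```python
# Sketch of the dimensional reduction J_eff(x) = double integral over y, z
# of J(x, y, z), for an assumed isotropic 3D gaussian coupling (our example).
import math

SIGMA = 1.0  # placeholder coupling width

def J3(x, y, z):
    """Isotropic 3D gaussian coupling, normalized to total weight 1."""
    norm = (2.0 * math.pi) ** 1.5 * SIGMA ** 3
    return math.exp(-(x * x + y * y + z * z) / (2.0 * SIGMA ** 2)) / norm

def J_eff(x, half=6.0, n=240):
    """Double integral of J3 over y and z (composite trapezoid rule)."""
    h = 2.0 * half / n
    total = 0.0
    for i in range(n + 1):
        y = -half + i * h
        wy = 0.5 if i in (0, n) else 1.0
        for j in range(n + 1):
            z = -half + j * h
            wz = 0.5 if j in (0, n) else 1.0
            total += wy * wz * J3(x, y, z)
    return total * h * h

def J1(x):
    """The expected analytic result: a 1D gaussian with the same SIGMA."""
    return math.exp(-x * x / (2.0 * SIGMA ** 2)) / (math.sqrt(2.0 * math.pi) * SIGMA)

val_eff = J_eff(0.7)
print(val_eff, J1(0.7))  # the two values agree closely
```

The same reduction applies to any coupling that factorizes or decays rapidly in the transverse directions; the gaussian is merely the case where the result has a closed form.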
Acknowledgments
We thank Katrin Suder and Florentin Wörgötter for stimulating discussions and Wulfram Gerstner, Werner Kistler, Martin Stemmler, Laurenz Wiskott, and two referees for most helpful comments on the manuscript. This work has been supported by the Deutsche Forschungsgemeinschaft and the Human Frontier Science Program.
References

Amari, S. I. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in a cortical network module. Journal of Computational Neuroscience, 4, 57–77.
Beurle, R. L. (1956). Properties of a mass of cells capable of regenerating pulses. Philosophical Transactions of the Royal Society B, 240, 55–94.
Bressloff, P. C. (1999). Synaptically generated wave propagation in excitable neural media. Physical Review Letters, 82, 2979–2982.
Bressloff, P. C. (2000). Traveling waves and pulses in a one-dimensional network of excitable integrate-and-fire neurons. Journal of Mathematical Biology, 40, 169–198.
Bringuier, V., Chavane, F., Glaeser, L., & Fregnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. Journal of Neurophysiology, 76, 2049–2070.
Ermentrout, G. B. (1998). The analysis of synaptically generated traveling waves. Journal of Computational Neuroscience, 5, 191–208.
Ermentrout, B., & Cowan, J. (1979). A mathematical theory of visual hallucination patterns. Biological Cybernetics, 34, 137–150.
Fohlmeister, C., Gerstner, W., Ritz, R., & van Hemmen, J. L. (1995). Spontaneous excitations in the visual cortex: Stripes, spirals, rings, and collective bursts. Neural Computation, 7, 1046–1055.
Gelperin, A. (1999). Oscillatory dynamics and information processing in olfactory systems. Journal of Experimental Biology, 202, 1855–1864.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. Journal of Neuroscience, 8, 2928–2937.
Golomb, D., & Amitai, Y. (1997). Propagating neuronal discharges in neocortical slices: Computational and experimental study. Journal of Neurophysiology, 78, 1199–1211.
Golomb, D., & Ermentrout, G. B. (1999). Continuous and lurching traveling pulses in neuronal networks with delay and spatially decaying connectivity. Proceedings of the National Academy of Sciences USA, 96, 13480–13485.
Golomb, D., & Ermentrout, G. B. (2001). Bistability in pulse propagation in networks of excitatory and inhibitory populations. Physical Review Letters, 86, 4179–4182.
Horn, D., & Opher, I. (1997). Solitary waves of integrate-and-fire neural fields. Neural Computation, 9, 1677–1690.
Idiart, M. A. P., & Abbott, L. F. (1993). Propagation of excitation in neural network models. Network, 4, 285–294.
Kistler, W. M. (2000). Stability properties of solitary waves and periodic wave trains in a two-dimensional network of spiking neurons. Physical Review E, 62(6), 8834–8837.
Kistler, W. M., Seitz, R., & van Hemmen, J. L. (1998). Modeling collective excitations in cortical tissue. Physica D, 114, 273–295.
Meister, M., Wong, R. O. L., Denis, A. B., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943.
Rinzel, J., Terman, D., Wang, X.-J., & Ermentrout, B. (1998). Propagating activity patterns in large-scale inhibitory neuronal networks. Science, 279, 1351–1355.
Traub, R. D., Jefferys, J. G. R., & Miles, R. (1993). Analysis of propagation of disinhibition-induced after-discharges along the guinea-pig hippocampal slice in vitro. Journal of Physiology, 472, 267–287.
von Seelen, W. (1968). Informationsverarbeitung in homogenen Netzen von Neuronenmodellen I. Kybernetik, 5, 133–148.
Wadman, W. J., & Gutnick, M. J. (1993). Non-uniform propagation of epileptiform discharge in brain slices of rat neocortex. Neuroscience, 52, 255–262.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.

Received January 18, 2000; accepted January 4, 2002.
LETTER
Communicated by Alexandre Pouget
Attentional Recruitment of Inter-Areal Recurrent Networks for Selective Gain Control

Richard H. R. Hahnloser
[email protected]
Howard Hughes Medical Institute, Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, U.S.A.

Rodney J. Douglas
[email protected]
Institute of Neuroinformatics ETHZ/UNIZ, CH-8057 Zürich, Switzerland

Klaus Hepp
[email protected]
Institute for Theoretical Physics, ETHZ, Hönggerberg, CH-8093 Zürich, Switzerland

There is strong anatomical and physiological evidence that neurons with large receptive fields located in higher visual areas are recurrently connected to neurons with smaller receptive fields in lower areas. We have previously described a minimal neuronal network architecture in which top-down attentional signals to large receptive field neurons can bias and selectively read out the bottom-up sensory information to small receptive field neurons (Hahnloser, Douglas, Mahowald, & Hepp, 1999). Here we study an enhanced model, where the role of attention is to recruit specific inter-areal feedback loops (e.g., drive neurons above firing threshold). We first illustrate the operation of recruitment on a simple example of visual stimulus selection. In the subsequent analysis, we find that attentional recruitment operates by dynamical modulation of signal amplification and response multistability. In particular, we find that attentional stimulus selection necessitates increased recruitment when the stimulus to be selected is of small contrast and of small distance away from distractor stimuli. The selectability of a low-contrast stimulus is dependent on the gain of attentional effects; for example, low-contrast stimuli can be selected only when attention enhances neural responses. However, the dependence of attentional selection on stimulus-distractor distance is not contingent on whether attention enhances or suppresses responses.
The computational implications of attentional recruitment are that cortical circuits can behave as winner-take-all mechanisms of variable strength and can achieve close to optimal signal discrimination in the presence of external noise.
Neural Computation 14, 1669–1689 (2002) © 2002 Massachusetts Institute of Technology
1 Introduction
The primate visual cortex is divided into many distinct areas that are organized hierarchically by recurrent inter-areal connections. The various areas represent visual space nearly topographically, but the sensory receptive fields of their neurons are quite different. It appears that the role of inter-areal ascending projections is to form progressively higher representations of the visual world, for example, from neurons tuned to moving edges in V1 (Hubel & Wiesel, 1962) up to neurons tuned to the view of particular objects in area IT (Logothetis, Pauls, Bülthoff, & Poggio, 1994; Riesenhuber & Poggio, 1999). The size of neuronal receptive fields grows with hierarchical level. For example, in the macaque monkey, classical receptive field diameters grow from about 1 degree in V1 to 5 degrees in MT, 30 degrees in MST, and in IT they can be even larger than the entire contralateral visual field.

There is evidence that the descending inter-areal connections affect attentional selection and memorization of behaviorally relevant information (Motter, 1993; Tomita, Ohbayashi, Nakahara, & Miyashita, 1999). Attention-related neural activity has been observed in many cortical visual areas of the macaque monkey. The strength of attention decreases with level from MST, MT, V4, V2 down to V1 (Motter, 1993, 1994; Moran & Desimone, 1985; Luck, Chelazzi, Hillyard, & Desimone, 1997; Colby, Duhamel, & Goldberg, 1996; Treue & Maunsell, 1996, 1999; Fuster, 1990; Desimone, 1996). There is evidence that the origin of attentional signals is in prefrontal cortex (Tomita et al., 1999). The generally low firing rates in higher areas suggest that the attentional control circuitry recruits or derecruits feedback networks with neurons in lower areas by driving neurons above or below firing threshold. Several experiments provide evidence of the involvement of inter-areal feedback in the attentional selection of low-contrast stimuli.
In V4, responses of neurons to very low-contrast stimuli are not increased when they are attended in the presence of high-contrast distractors: response enhancement is possible only above a critical contrast of about 5% (Reynolds, Pasternak, & Desimone, 2000). Also, in the experiments of De Weerd, Peralta, Desimone, and Ungerleider (1999), it was found that restricted lesions of areas V4 and TEO result in monkeys being unable to report the orientation of low-contrast gratings in the presence of high-contrast distractors. However, the monkeys' perceptual performance for low-contrast stimuli was almost unchanged when the distractors were not present. This suggests that the circuits in TEO–V4 are highly involved in the attentional selection of low-contrast stimuli rather than in reading out stimulus orientation.

Previously, we have noted that if attentional stimulus selection is induced by an excitatory top-down bias, then this bias necessarily also leads to a bias in the readout of the selected stimulus (Hahnloser et al., 1999). Here, we resolve this selection/readout dilemma by considering attentional inputs that are just strong enough to recruit neurons in the higher area and their
feedback loops with neurons in the lower area, but without providing an additional input bias. Stimulus selection is achieved by recruiting more feedback loops for stimuli that are to be attended than for distractor stimuli. We cast our network as a simple model of two recurrently connected areas, such as MST–MT, V5–V2, or TEO–V4. We explore the computational principles that could underlie the recruitment of feedback between large and small receptive field neurons by simulating physiological experiments in which multiple stimuli appear inside the receptive field of a large field neuron. We study the limits within which attentional selection is possible when the attended stimulus is of low contrast and at a small distance from distractor stimuli. By varying the amount of recruitment and the size of the attended stimulus, we explore the accuracy of readout in a noisy environment.

2 Network Equations
The firing rates of E excitatory neurons in the lower area are denoted by M_x (x = 1, ..., E), those of N excitatory neurons in the higher area by P_i (i = 1, ..., N), and those of I inhibitory neurons in the lower area by I_y (y = 1, ..., I) (see Figure 1a). The indices x and y stand for a one-dimensional topography of the lower area, and the index i stands for a not necessarily topographic labeling of neurons in the higher area. The equations describing the evolution of firing rates are given by:

    Ṗ_i = −P_i + [ p_i + a_F Σ_{x=1}^{E} M_x cos(d_x − χ_i)_+ − t ]_+    (2.1)

    Ṁ_x = −M_x + [ m_x + a_B Σ_{i=1}^{N} P_i cos(d_x − χ_i)_+ − b Σ_{y=1}^{I} I_y ]_+    (2.2)

    İ_y = −I_y + [ a_I Σ_{i=1}^{N} P_i cos(ψ_y − χ_i)_+ − b_I Σ_{y′=1}^{I} I_{y′} ]_+    (2.3)

Here [f]_+ = max(0, f) denotes rectification and ensures positivity of firing rates. The receptive field centers d_x of neurons M_x are regularly spaced, d_x = d_max (x − 1)/(E − 1). Similarly, the receptive field centers of neurons I_y are given by ψ_y = ψ_max (y − 1)/(I − 1). The inter-areal connections are purely excitatory. Their strength decays as a function of receptive field separation z, according to cos(z)_+ = [cos(z)]_+. There is uniform inhibition in the lower area (the assumption of uniformity is a simplification that could be relaxed; see Wersing, Beyn, & Ritter, 2001). The parameters a_F, a_B, and a_I determine the strength of excitation, and b and b_I the strength of inhibition.
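To make the structure of the rate dynamics concrete, here is a minimal forward-Euler integration sketch (our own, not the authors' code). The parameter values are placeholders chosen for a quick, stable demonstration; they are not the values used in the paper's figures.

```python
# Illustrative sketch of the rate dynamics in equations 2.1-2.3, integrated
# by forward Euler. All parameter values below are placeholders (ours).
import math

E, N, I = 64, 3, 16                      # neurons per population
d_max = psi_max = math.pi
d = [d_max * x / (E - 1) for x in range(E)]       # centers d_x of M_x
psi = [psi_max * y / (I - 1) for y in range(I)]   # centers psi_y of I_y
chi = [0.0, math.pi / 2.0, math.pi]               # pointer centers chi_i
aF, aB, aI = 0.02, 0.1, 0.1                       # excitatory strengths
b, bI, thr = 0.05, 0.5, 1.0                       # inhibition and threshold t

def pos(f):                  # rectification [f]_+ = max(0, f)
    return f if f > 0.0 else 0.0

def cosp(z):                 # rectified cosine coupling cos(z)_+
    return pos(math.cos(z))

def step(P, M, Iv, m, p, dt=0.05):
    """One forward-Euler step of equations 2.1-2.3."""
    sI = sum(Iv)
    Pn = [Pi + dt * (-Pi + pos(p[i] + aF * sum(M[x] * cosp(d[x] - chi[i])
                                               for x in range(E)) - thr))
          for i, Pi in enumerate(P)]
    Mn = [Mx + dt * (-Mx + pos(m[x] + aB * sum(P[i] * cosp(d[x] - chi[i])
                                               for i in range(N)) - b * sI))
          for x, Mx in enumerate(M)]
    In = [Iy + dt * (-Iy + pos(aI * sum(P[i] * cosp(psi[y] - chi[i])
                                        for i in range(N)) - bI * sI))
          for y, Iy in enumerate(Iv)]
    return Pn, Mn, In

# One localized stimulus near d = 1; attention recruits pointer 0 (p_0 = t).
m = [math.exp(-(d[x] - 1.0) ** 2) for x in range(E)]
p = [thr, 0.0, 0.0]
P, M, Iv = [0.0] * N, [0.0] * E, [0.0] * I
for _ in range(400):
    P, M, Iv = step(P, M, Iv, m, p)
print(P)  # only the recruited pointer overlapping the stimulus responds
```

With these placeholder parameters, only the recruited pointer whose receptive field overlaps the stimulus rises above threshold; the two unrecruited pointers remain silent, which is the qualitative behavior the recruitment mechanism is designed to produce.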
Figure 1: Attending to one of two stimuli moving in antiphase. (a) Schematic of the network architecture of excitatory neurons in the higher area (P) and excitatory and inhibitory neurons in the lower area (M and I). The Greek letters denote the synaptic coupling strengths. (b) Two moving visual stimuli are placed in the receptive field of neuron P_2 in area MST (thick circle, schematic). Synaptic weights of the three MST neurons with MT neurons are shown on top. (c) The attentional selection of the left stimulus (indicated by the rectangle) is modeled by recruiting pointer neurons P_1 and P_2. Their responses are shown by the solid and the dash-dotted lines, correlating with the upward movement of the left stimulus. p_1 = p_2 = 1, p_3 = 0. (d) This time, pointer neurons P_2 and P_3 are recruited. Their responses (solid and dashed lines) correlate with the upward movement of the right stimulus. E = 320, d_max = π, N = 3, I = 32, ψ_max = π, χ_1 = 0, χ_2 = π/2, χ_3 = π, a_F = 0.5, a_B = 2, a_I = 2, b = 60.07, b_I = 60, a = 6, t = 1, h_up = 1, h_down = 0.1.
For the simulations, the visual input m_x contains either one or two localized stimuli of the form m_x = h cos((180°/a)(d_x − r))_+ if |d_x − r| ≤ a/2 and m_x = 0 otherwise. Here, h corresponds to the contrast of a stimulus (or its luminosity), a to the stimulus width (in degrees), and r to its retinal location, 0° < r < d_max. Neurons P_i have nonzero firing thresholds t > 0 that express the difficulty of driving neurons in higher areas by visual stimulation alone. The attentional inputs p_i to neurons in the higher area are set to either zero (no recruitment) or t (recruitment). Appendix A describes a simple method for selecting appropriate values for the five coupling parameters a_F, a_B, a_I, b, and b_I.

3 Example of Recruitment
Typically, neurons in various higher visual areas are able to respond to just the attended stimulus in their receptive field, filtering out nonattended stimuli (Moran & Desimone, 1985; Desimone, 1998; Reynolds, Chelazzi, & Desimone, 1999). For example, Treue and Maunsell recorded from neurons in areas MT and MST of alert monkeys during a visual attention task involving two moving stimuli (Treue & Maunsell, 1996, 1999). The monkeys were instructed to attend to one of them and to respond quickly to a change of speed. Both stimuli fell inside the receptive field of a recorded neuron, and alternately, one stimulus moved in the neuron's preferred direction while the other moved in the antipreferred direction. They found that most of the time, the neuronal response correlated strongly with the direction of motion of the attended stimulus, not the distractor stimulus. And when monkeys attended to a stimulus outside the receptive field, the neuronal response was suppressed, not correlating with the preferred movement direction of either of the two stimuli in the receptive field.

We chose these experiments by Treue and Maunsell to illustrate the operation of recruitment (we do not provide a complete model of the MT–MST interactions). We simulated the response behavior of N = 3 motion-selective neurons P_1, P_2, and P_3 in area MST (the higher area) and E = 320 motion-direction-selective neurons M_x in area MT (the lower area). Receptive field centers in MST are χ_1 = 0°, χ_2 = d_max/2, and χ_3 = d_max = 180°. There are two vertically moving dots. Neuron P_2 sees both dots in its receptive field, whereas neurons P_1 and P_3 each see only one dot. The two dots oscillate in antiparallel directions to each other (see Figure 1b). Because we assume that the one-dimensional map in MT encodes only the horizontal dimension, "vertical movement" was simulated by contrast changes. We assume that all MT and MST neurons have an equal upward direction preference, simply modeled by setting the contrast of the "downward-moving dot" to one-tenth of the contrast of the "upward-moving dot" (in other words, the contrasts of the two dots flipped back and forth between two values). The common threshold t of the neurons in MST is large. In the model, MST neurons are activated by visual stimulation only when they are also
recruited by attentional input, p_i = t. But only those neurons are recruited whose receptive fields overlap with the current focus of attention. For example, when the monkey attends to the left stimulus, neurons P_1 and P_2 are recruited, in which case neuron P_2 responds mainly during the upward movement of the dot on the left (see Figure 1c). Similarly, the response of P_2 is bound to the upward movement of the right dot when the monkey attends to the right and neurons P_2 and P_3 are recruited (see Figure 1d). Neurons P_1 and P_3 are active only when the attended stimulus is the one in their receptive field, in which case they respond to its upward movement. When the monkey attends to the other stimulus, outside their receptive field, they remain silent. In analogy with the Treue and Maunsell experiments, attention causes stimulus competition beyond receptive field boundaries (neurons P_1 and P_3), as well as within receptive field boundaries (neuron P_2).

The explanation of why in Figure 1 the activity of neuron P_2 correlates with the movement direction of the attended stimulus is quite simple: the recruitment of either neuron P_1 or P_3 contributes feedback amplification, enhancing responses in MT to the left or right stimulus. And because responses are enhanced in MT, responses will be enhanced in MST as well. This explains why neuron P_2, which receives the same attentional input p_2 = t in both Figures 1c and 1d, can have a response that is biased according to which fellow MST neuron is recruited.

4 Loss of Attentional Selection for Low-Contrast and Nearby Stimuli
In this section, we analyze the conditions under which attentional recruitment enables persistent selection of a behaviorally relevant stimulus in the presence of distractor stimuli. By "persistent selection," we mean that neural responses are locked to the selected stimulus and persist when distractor stimuli change (in other words, neural responses are as if the distractors were not present). We explore the sensitivity of persistent selection to various parameters such as stimulus contrast (relative to distractor contrast) and stimulus location (relative to distractor location).

We will consider only a one-dimensional network and nearby stimuli that fall between r = 0° and r = 90°. Consequently, we restrict the map in the lower area to d_max = 90° and ψ_max = 90°. Because we are mainly interested in the effects of recruitment, half of the neurons in the higher area have receptive field centers at χ_i^1 = 0° and the other half at χ_i^2 = 90° (i = 1, ..., N/2). By discretizing the receptive field centers to just two values separated by 90 degrees, the population activity in the higher area gets a simple interpretation: the pairs P^i = (P_1^i, P_2^i) of neurons form activity vectors whose direction indicates the center of activity in the lower area (and thus the location of the selected stimulus). This population vector property stems from the geometrical fact that there are sine and cosine synaptic connection profiles between the two areas (which in turn is based
on the equality cos(α − 90°) = sin(α)). As in our previous work, the activity vectors P^i in the higher area shall be referred to as pointers (Hahnloser et al., 1999). Recruiting pointers that share receptive field centers is mathematically equivalent to changing the synaptic weights a_F, a_B, and a_I made by a single pointer.

In Figure 2a, a stimulus and a distractor of equal contrast are presented to the network at three different separations. The steady response in the higher area is read out as the pointer angle,

    c = arctan( Σ_i P_2^i / Σ_i P_1^i ),

and is plotted as a function of the number N_+ of recruited pointers (N_+ is defined by p_1^k = p_2^k = t for k ≤ N_+ and p_1^k = p_2^k = 0 for k > N_+). All recruited pointers are initialized so as to express a preattentive bias to the left, P^i(0) = (2, 0). This initialization tends to induce attentional selection of the stimulus on the left. When the stimulus and distractor are directly adjacent to each other, persistent selection arises only for about N_+ > 20. Persistent selection requires fewer pointers the farther apart the stimuli are.

In a similar way as for stimulus separation, the selection of a low-contrast stimulus is possible only within limits. Figure 2b shows a diagram in which we plot (as a function of N_+) the minimal relative contrast of a stimulus that still permits its selection by attentional recruitment. Relative contrast is defined as the contrast of the stimulus divided by the contrast of the distractor. When the stimulus is close to the distractor and 32 pointers are recruited, persistent selection is possible only if its contrast is at least 20% of the distractor contrast. However, when the stimulus is farther away, many fewer pointers are required for persistent selection at the same relative contrast. In other words, for a fixed number of recruited pointers, the minimal stimulus contrast allowing for persistent selection increases as the stimulus and the distractor move closer together.
We have analyzed the sensitivity of these results to the strength of the synaptic weights. Interestingly, we found that altering the strength of the excitatory feedback (e.g., decreasing a_B by 4%) does not have a noticeable influence on the distance sensitivity in Figure 2a. However, this decrease of excitatory feedback has a dramatic effect on the contrast sensitivity in Figure 2b. The reduced strength of excitatory feedback for the intermediate stimulus-distractor separation results in a highly reduced performance for selecting low-contrast stimuli. In the next section, we show that the reason for the decreased selectability is that a small reduction in feedback can cause the net effect of recruitment on neurons in the lower area to be inhibitory rather than excitatory, without affecting the strength of competition.
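The population-vector readout can be illustrated with a tiny self-contained sketch (ours, with made-up activity values): each pointer pair P^i = (P_1^i, P_2^i) carries cosine and sine components of the encoded location, so the angle of the summed vector recovers that location regardless of the individual pointer gains.

```python
# Minimal illustration (our sketch, hypothetical activity values) of the
# population-vector readout: c = arctan(sum_i P_2^i / sum_i P_1^i).
import math

def pointer_angle(pairs):
    """Pointer angle of summed pair activities, returned in degrees."""
    s1 = sum(p1 for p1, p2 in pairs)
    s2 = sum(p2 for p1, p2 in pairs)
    return math.degrees(math.atan2(s2, s1))

loc = 30.0  # hypothetical encoded stimulus location, in degrees
pairs = [(g * math.cos(math.radians(loc)), g * math.sin(math.radians(loc)))
         for g in (0.5, 1.0, 2.0)]  # recruited pointers with different gains
print(pointer_angle(pairs))  # close to 30.0; the gains cancel in the ratio
```

Because the overall gains cancel in the ratio, recruiting more pointers changes the strength of the feedback without, by itself, biasing the read-out angle.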
Figure 2: Distance and contrast dependence of attentional selection. (a) The effective pointer angle c is plotted as a function of the number of recruited pointers. Two identical stimuli are presented at three different interstimulus separations (insets above: 17, 34, and 51 degrees). The initial conditions of the pointer neurons, P^i(0) = (2, 0), tend to cause selection of the left stimulus, the right stimulus representing a distractor. The closer the two stimuli are, the more pointers are required for persistent selection. For the first case, where the stimuli are directly adjacent to each other (solid line), three snapshots of steady map activity are shown, corresponding to interpolation (left), partial selection (middle), and persistent selection (right). (b) The minimal relative contrast allowing for persistent selection is shown as a function of the number N_+ of recruited pointers. The curves were determined by slowly decreasing the contrast of the selected stimulus until c starts to deviate from the center of the selected stimulus. Again, curves are plotted for three stimulus-distractor separations. The solid, dashed, and dash-dotted curves correspond to a_B = 0.625. The dashed curve labeled II corresponds to a_B = 0.6 (to be compared with the dashed curve labeled I). b = 3.755, b_I = 60, a_F = 0.1, a_I = 10, I = 32, N = 64, E = 320.

5 Winner-Take-All and Attentional Enhancement of Responses
Here we compare inter-areal feedback and winner-take-all (WTA) mechanisms. We show that recruitment has the effect of changing the strength of the WTA mechanism. A uniform input to the lower area can be viewed as a setting in which each neuron has the same chance of being activated at a steady state (this is due to the translational invariance of the feedback, sin²α + cos²α = 1; see also Hahnloser et al., 1999). The winning neurons (the ones that are activated) are determined by the initial conditions of the dynamics. In Figure 3a, for fixed initial conditions, we see that a localized response to uniform stimulation emerges. The recruitment of many pointers leads to a substantial
Recruitment of Inter-Areal Recurrent Networks
1677
narrowing of the activity profile. In Figure 3b, the response width w of the steady response profile is plotted as a function of the number of recruited pointers N_C. The circles are simulation results, and the full line corresponds to an analytical calculation done in appendix B. It can be seen that w is a monotonically decreasing function of N_C. This behavior can be interpreted as a WTA mechanism whose softness or hardness is modulated by the number of recruited pointers: the more of them are recruited, the harder the WTA mechanism. As an interesting limit to this recruitment-induced strengthening of WTA mechanisms, in appendix C we calculate the hard WTA limit, in which only one neuron M_x can be active at a steady state. This limit has similarities to a maximum operation that has been suggested to be of relevance for object recognition (Riesenhuber & Poggio, 1999). We find that for a hard WTA, the number N_C^{hard} of pointers that have to be recruited grows quadratically in E. This scaling law suggests that there are not enough neurons to achieve an exact maximum operation by recruitment, but that at best, an approximation to the maximum operation is possible. As in the previous section, we have analyzed the effect of reducing the strength of excitatory feedback. In Figure 3c, we reduced α_B by 4%. In this case, recruiting pointers does not enhance responses as it did in Figure 3a, but suppresses them. In appendix A, we show that the polarity of signal gain depends on the balance between excitatory and inhibitory feedback gain. Signal enhancement occurs if the gain of excitatory feedback, set by α_F and α_B, is larger than the gain of inhibitory feedback, set by α_F, α_I, β_I, and β. Interestingly, although the signal gains in Figures 3a and 3c are different, the response widths are not. Hence, the hardness of the WTA is insensitive to the exact tuning of synaptic strength, as was the stimulus-separation sensitivity of attentional selection in Figure 2a.
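The quadratic scaling of the hard-WTA pointer count with E (derived in appendix C, equation C.3) can be checked numerically; the parameter values below are the ones quoted for Figure 3, and the exact expression is a sketch of the appendix C result:

```python
import math

# Parameters of Figure 3 (alpha_F = 0.1, alpha_B = 0.625)
alpha_F, alpha_B = 0.1, 0.625

def n_hard(E):
    """Pointer count needed for a hard WTA, per equation C.3 of appendix C."""
    d_hat = math.pi / (2 * (E - 1))
    return 1 / (alpha_F * alpha_B * (1 - math.cos(d_hat)))

ratio = n_hard(640) / n_hard(320)
print(n_hard(320), ratio)   # over a million pointers; doubling E roughly quadruples the count
```

For E = 320, over a million pointers would have to be recruited, which illustrates why the hard WTA limit is a computational limit rather than a biologically plausible regime.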
6 Attentionally Controlled Noise Suppression
Psychophysical studies and electrophysiological recordings show that visual attention can enhance the discriminative performance of macaque monkeys (Spitzer, Desimone, & Moran, 1988; Lu & Dosher, 1998) and the discriminative responses of single neurons (Spitzer et al., 1988; McAdams & Maunsell, 1999). Signal discrimination is in many ways equivalent to signal estimation in the presence of external noise. Population vectors (low-dimensional representations of the activity of many neurons) are possible signal estimators. They can achieve an unbiased readout of sensory input signals if the input noise is uncorrelated between neurons (Seung & Sompolinsky, 1993). In other words, the mean readout of a population vector over many stimulus repetitions is equal to the value of the sensory signal. This is a highly desirable feature of any readout method. However, population vectors do not always have a good performance in averaging out noise. For example, if the neurons supporting the population
1678
Richard H. R. Hahnloser, Rodney J. Douglas, and Klaus Hepp
Figure 3: Transition from soft to hard winner-take-all. (a) Steady map response (full and dashed lines) to uniform input (dash-dotted line). Attentional recruitment (from 1 to 32 pointers) leads to a sharpened response profile of similar total activity, but with enhanced peak response. The gain of excitatory feedback is approximately equal to that of the inhibitory feedback, α_B = 0.625 (see appendix A). (c) The gain of excitatory feedback is smaller than in a, α_B = 0.6. Recruitment leads to a strong down modulation of the population response. (b) The response width decays monotonically with the number of recruited pointers (i.e., the strength of the WTA competition increases). For a given number of recruited pointers, the response width is invariant to small changes in α_B (not shown). The circles represent simulations of equations 2.1 to 2.3, and the full line corresponds to a plot of equation B.2. E = 320, I = 32, N = 64, β = 3.755, β_I = 60, α_F = 0.1, α_I = 10.
vector have very narrow tuning curves, then the mean squared error of the readout tends to become very large. In general, any unbiased estimator should have the smallest possible variance, because the variance determines how well two similar stimuli can be discriminated. Pouget, Zhang, Deneve, and Latham (1998) have examined the advantage of recurrence in the problem of large variability of population vectors.
They found that lateral excitatory connections in cortex can substantially reduce the uncorrelated noise between neurons. In this way, the noisy input to a map of recurrently connected neurons is restored to a steady-state activity that can then be read out by a population vector with near-optimal accuracy. However, the near-optimal readout is achieved only if the stimulus width closely matches the intrinsic tuning width of the synaptic connections. Here we show that a much broader range of near-optimality can be achieved when inter-areal feedback loops are recruited according to some prior knowledge of the stimulus width. In our network, the inter-areal feedback combines desirable features of both of the above readout methods. That is, pointers can read out the activity of a map by an unbiased population vector and provide the necessary feedback to cancel uncorrelated noise. Thus, it is possible to have the best of both worlds. We set the task of the network to extract the location r of a stimulus f_x of variable width a, where f_x(r) = cos((π/a)(d_x − r)) if |d_x − r| ≤ a/2 and f_x = 0 otherwise. We assume that prior knowledge about the stimulus width a is available. The noise is modeled by adding to f_x a random number η drawn from a gaussian distribution with zero mean and fixed variance σ². In this way, the input probability density of m_x is given by

P(m_x \mid r) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(m_x - f_x(r))^2}{2\sigma^2}}.   (6.1)
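As a concrete reading of this noise model, here is a minimal Monte Carlo sketch: sample m_x according to equation 6.1 and read out the stimulus location with a plain population vector. The map geometry (E = 80 neurons with preferred angles spanning 180 degrees) and the population-vector readout are illustrative assumptions, standing in for the pointer readout used in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

E = 80                        # number of map neurons (assumed geometry)
sigma = 0.2                   # noise std; sigma^2 = 0.04 as in Figure 4
r = np.deg2rad(45.0)          # stimulus location
a = np.deg2rad(90.0)          # stimulus width
d = np.arange(E) * np.pi / E  # preferred angles, uniform over 180 degrees (assumed)

# Stimulus profile f_x(r) = cos((pi/a)(d_x - r)) on |d_x - r| <= a/2, else 0
f = np.where(np.abs(d - r) <= a / 2, np.cos((np.pi / a) * (d - r)), 0.0)

# Repeated presentations with i.i.d. gaussian noise; population-vector readout
n_trials = 5000
est = np.empty(n_trials)
for t in range(n_trials):
    m = f + sigma * rng.standard_normal(E)
    v1, v2 = np.sum(m * np.cos(d)), np.sum(m * np.sin(d))
    est[t] = np.arctan2(v2, v1)

mean_deg = float(np.rad2deg(est.mean()))
std_deg = float(np.rad2deg(est.std()))
print(mean_deg, std_deg)   # the mean converges toward r = 45 degrees (unbiased readout)
```

By the symmetry of the grid and of the gaussian noise, the readout is unbiased, in line with the Seung and Sompolinsky (1993) result cited above; the standard deviation of the estimate is what the recruitment mechanism of this section is designed to reduce.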
Figure 4a shows the response of the network to this noisy stimulus under conditions of both few and many recruited pointers. Similar results hold if the noise is Poissonian rather than gaussian. We have computed the mean and standard deviation of the readout c for 5000 presentations of the same stimulus at r = 45°, but with different noise η. The mean of c converged toward r, in agreement with unbiased estimation. In Figure 4b, we have plotted the standard deviation S(c) as a function of the number of recruited pointers. For a broad stimulus (dashed line), the standard deviation is minimal when the number of recruited pointers falls between 3 and 5. On the other hand, when the stimulus is narrow (full line), the readout is optimal when about 7 to 14 pointers are recruited. Thus, we find that the number of pointers that should be recruited depends critically on the stimulus width. Figure 4c shows the dependence of the standard deviation S(c) on stimulus width, for 1, 4, and 32 recruited pointers. The stimulus location was held fixed at r = 45°, and its width a was varied in steps of 4.5 degrees from 4.5 to 81 degrees. The standard deviation was computed using n = 1000 presentations of the stimulus for each width. To get a sense of how large these standard deviations of the pointer estimates are in absolute terms, we have compared them to the minimal
Figure 4: Recruiting feedback for suppressing noise. (a) A noisy stimulus of width a = 54° is indicated by the dotted line. The steady activity in the lower area is shown for the case of 32 recruited pointers (solid line) and for the case of one recruited pointer (dashed line). σ² = 0.04, E = 90. (b) Standard deviations of the pointer readout as a function of the number of recruited pointers. A relatively broad stimulus, a = 45°, results in a minimal standard deviation for about three to five recruited pointers (dashed line). The lower bound as given by equation 6.3 is shown by the fine dashed line. For a slightly narrower stimulus, a = 34°, the standard deviation is minimal for 6 to 15 recruited pointers. The fine full line shows the lower bound for this stimulus width. For both stimuli, the optimal pointer estimates deviate by about 10% from the theoretical minimum. σ² = 0.04. (c) Standard deviations as a function of stimulus width a. Narrow stimuli (in region A) require strong feedback (32 pointers). Broader stimuli (in regions B and C) require weaker feedback (4 and 1 pointer, respectively). The fine dashed line represents the performance of the population vector estimate, equation 6.4. Its performance is similar to the 1-pointer case. σ² = 0.04. (d) As suggested in b, for every number of recruited pointers, there is a different stimulus width a_best for which the standard deviation of the readout is minimal (thick full line: σ² = 0.04; fine full line: σ² = 0.01). The dashed line shows the response width w to uniform input. E = 80, N = 40, I = 20, α_F = 0.4, α_B = 0.1, α_I = 2.5, β_I = 24, β = 0.9656.
standard deviation S(c_opt) achievable by any readout method. For a large network (E ≫ 1), this minimum is given by the Cramér–Rao bound (Cover & Thomas, 1991), defined by the inverse of the square root of the Fisher information:

S(c_{\rm opt}) = \frac{1}{\sqrt{\sum_x \left\langle -\frac{d^2}{dr^2} \ln P(m_x \mid r) \right\rangle}}.   (6.2)
A calculation shows that, for large E,

S(c_{\rm opt}) = \sigma \sqrt{\frac{a}{\pi E}},   (6.3)
which corresponds to the fine line in Figure 4c. Hence, an optimal readout has an error that is proportional to the square root of the stimulus width and inversely proportional to the square root of the number of neurons. It is illustrative to compare the pointer readout to the readout achieved by a population vector, defined by v = (v_1, v_2) = \sum_x m_x (\cos d_x, \sin d_x). We have calculated its standard deviation for large E, under the approximation that fluctuations are small, in which case only the component z = (v_2 - v_1)/\sqrt{2} of the population vector orthogonal to the stimulus direction r matters:

S(c_{\rm pop}) \approx \frac{S(z)}{|\langle v \rangle|} = \sigma\, \frac{\pi^2 - a^2}{4a \cos(a/2)} \sqrt{\frac{\pi - 2}{2\pi E}}.   (6.4)
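The behavior of equations 6.3 and 6.4 can be read off numerically (a sketch; σ and E are the Figure 2 values and are only illustrative, since the ratio of the two bounds is independent of both):

```python
import math

def s_opt(sigma, a, E):
    """Cramer-Rao bound, equation 6.3: S = sigma * sqrt(a / (pi * E))."""
    return sigma * math.sqrt(a / (math.pi * E))

def s_pop(sigma, a, E):
    """Population-vector standard deviation, equation 6.4."""
    return (sigma * (math.pi ** 2 - a ** 2) / (4 * a * math.cos(a / 2))
            * math.sqrt((math.pi - 2) / (2 * math.pi * E)))

sigma, E = 0.2, 320
ratios = {deg: s_pop(sigma, math.radians(deg), E) / s_opt(sigma, math.radians(deg), E)
          for deg in (30, 60, 90)}
print(ratios)   # the ratio approaches 1 as the stimulus width approaches 90 degrees
```

The ratio of the population-vector error to the optimal error is large for narrow stimuli and shrinks to within half a percent of unity at a = 90°, which is the equivalence to the maximum likelihood estimate noted in the text.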
This standard deviation is shown as a fine dashed line in Figure 4c. It tends to diverge for narrow stimuli but approaches the optimal standard deviation as the stimulus width approaches 90 degrees (at this angle, there is a mathematical equivalence between the population vector and the maximum likelihood estimates). We have found that for any stimulus width, there is a number of recruited pointers N_C for which the standard deviation of the readout is surprisingly close to the Cramér–Rao bound (see Figure 4c). For small N_C (dash-dotted line), the readout has a large standard deviation for narrow stimuli, which decreases as stimuli become broader. This behavior is similar to that of the population vector and confirms the intuitive notion that when feedback is weak, pointers are nothing but population vectors. As more pointers are recruited (dashed line), the standard deviation is nonmonotonic and has a minimum at a stimulus width of about a = 45°. Finally, when many pointers are recruited (full line), the minimal standard deviation is achieved for even narrower stimuli, at about a = 25°. In Figure 4d, the full line corresponds to the best stimulus width a_best(N_C) as a function of the number of recruited pointers (a_best is the width at which the standard deviation of the readout is smallest). Again, strong recruitment
is better for narrow stimuli, and weak recruitment is better for broad stimuli. As can be seen, a_best(N_C) is not very sensitive to the variance of the noise. Furthermore, its dependence on N_C is similar to that of the response width w to a uniform input without noise, shown by the dashed line (see also Figure 3a). w is larger than a_best by about 30 degrees but decreases in a similar way. Hence, the strength of feedback (N_C) is optimal when its implicit response width (defined by uniform input) is slightly larger than the stimulus width to be encoded. To summarize, Figure 4 makes the point that attentional recruitment (based on prior information about stimulus width) can yield a substantial improvement in signal estimation in comparison to locally recurrent networks (corresponding to the case where the number of recruited pointers is fixed). Increased recruitment is needed for small stimuli. Increased recruitment is also needed for very noisy environments. However, optimal recruitment is less sensitive to noise variance than to stimulus size. We expect that a similar improvement also holds for two-dimensional (2D) spatial receptive fields (however, unlike in the one-dimensional case, narrow 2D stimuli have an equal discriminability as broad 2D stimuli; there, the Cramér–Rao bound is independent of stimulus width; Zhang & Sejnowski, 1999).

7 Discussion
Recruitment of neurons and their feedback loops has been studied previously in a different context. In a model of the oculomotor integrator (Seung, Lee, Reis, & Tank, 2000), recruitment was postulated to serve the role of maintaining a precise tuning of feedback amplification, compensating for saturation nonlinearity. Here, recruitment is postulated in the context of inter-areal networks to accentuate multistability. By increasing both the excitatory and inhibitory gain of inter-areal feedback, persistent attentional selection of low-contrast and nearby stimuli becomes possible. In agreement with experiments, we have found a distractor-dependent limit for the contrast of a stimulus, below which attentional selection is impossible (see Figure 2b). Our results predict that besides the contrast dependence of attentional selection, there should be an additional dependence on stimulus separation (see Figure 2a), for which there is only little experimental evidence so far (De Weerd et al., 1999). In most electrophysiological experiments, selection of a stimulus enhances visual responsiveness at attended locations (only a few experiments have shown suppressed responses at attended locations; Motter, 1993). The results in Figure 2 suggest that this response enhancement is due to the general dominance of excitatory inter-areal feedback gain over the inhibitory gain. If inter-areal circuits were wired to reduce responses at attended locations, then the selectability of low-contrast stimuli would be impaired in comparison to circuits wired to enhance responses at attended locations.
Thus, in order to be able to attend to low-contrast stimuli, attentional effects should be enhancing rather than suppressing. Support for a dominance of excitatory gain over inhibitory gain comes from anatomical studies (Johnson & Burkhalter, 1997) and a more recent finding that after cooling of area MT, 33% of the neurons in area V2 showed a significant decrease in response to visual stimulation, whereas only 6% showed an increase (Hupé et al., 1998). We have shown that attentional recruitment can give cortical processing the ability to adjust signal processing to achieve near-optimal noise reduction for a broad range of stimulus sizes (see Figure 4). This ability generalizes previous results, where recurrent connections have been shown to be near-optimal for only a limited range of stimulus sizes (Deneve, Latham, & Pouget, 1999). However, we point out that the question remains how cortex would be able to recruit the appropriate amount of feedback, given some noisy environment and stimulus to be expected. We do not provide a solution for this problem, but we imagine that the appropriate computation, involving the use of some prior information to deduce the recruitment level, is done in prefrontal cortex. Our results on noise reduction are comparable to psychophysical studies where attention has been suggested to activate WTA competition between visual filters (Lee, Itti, Koch, & Braun, 1999). Lee et al. have fitted their psychophysical data from orientation discrimination tasks to a model in which there is a divisive normalization between simple cortical visual filters. They found the best agreement between model and data when the effect of attention was to change the exponents of the divisive normalization between filters rather than any other parameter of the filter interactions. Because the exponents determine the strength of competition between cortical filters, their result is consistent with our finding that attentional recruitment has the effect of hardening WTA competition.
Appendix A: Parameter Selection
In the model equations 2.1 to 2.3, there are five coupling constants: α_F, α_B, β, α_I, and β_I. Here we show how to choose these constants as a function of the feedback gain they result in. In the following, excitatory gain means the loop gain of the excitatory feedback to the map, via the neurons P_i; and inhibitory gain means the loop gain of the effective inhibitory feedback to the map, via the neurons I. The total gain of feedback is the sum of these two loop gains. The relative gain of feedback is important for stability, because in our network, the eigenvectors of the positive and negative feedback loops corresponding to the largest and smallest eigenvalues, respectively, are similar to each other (the excitatory connections are broad, comparable to the global inhibitory connections). First, choose the strength of localizing inhibition β_I according to the number n_I of inhibitory neurons to be active at a steady state (it can be shown
that n_I is independent of the visual input to the map). Assume that the different pointers P_i get the same attentional input and are thus parallel to each other, forming a single "effective pointer." Denote the common steady pointer angle by c = arctan(P_{i2}/P_{i1}). In this case, the inhibitory neurons I_x receive a feedforward input profile α_I P cos(c − y_x), where P = Σ_i ‖P_i‖ is the length of the effective pointer. n_I is determined by the steady state of equation 2.3 (i.e., a symmetric profile centered at c). The cut-off relation I_z = 0 determines the border y_z of this profile. In terms of the angular separation between inhibitory neurons, c − y_z = n_I ŷ/2, where ŷ = π/(2(I − 1)), and we get

α_I P cos(n_I ŷ/2) = β_I S,   (A.1)
where S = Σ_x I_x is the total activity of the inhibitory neurons (angles are now measured in radians). In analogy to equation B.2, by integrating the steady state of equation 2.3 over x in the large-N (continuum) limit, we can calculate n_I to third-order approximation, using the cut-off relation, equation A.1:

n_I = 2 \left( \frac{3}{2\beta_I \hat y^2} \right)^{1/3}.   (A.2)
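Equation A.2 can be evaluated directly; with the values used in the text (I = 32, β_I = 60, and ŷ = π/(2(I − 1))), it gives about four active inhibitory neurons (a sketch):

```python
import math

I_neurons = 32          # number of inhibitory neurons, I = 32
beta_I = 60.0           # strength of localizing inhibition
y_hat = math.pi / (2 * (I_neurons - 1))   # angular separation of inhibitory neurons

# Equation A.2: n_I = 2 * (3 / (2 * beta_I * y_hat^2))^(1/3)
n_I = 2 * (3 / (2 * beta_I * y_hat ** 2)) ** (1 / 3)
print(n_I)   # about 4.3, i.e. n_I = 4 active inhibitory neurons at a steady state
```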
Notice that the width n_I depends only on the strength of recurrent inhibition β_I and not on α_I or the length P of the effective pointer. For an inhibitory map of I = 32 neurons, choosing β_I = 60 in equation A.2 yields n_I = 4 active inhibitory neurons at a steady state. If smaller values of β_I are chosen, the inhibitory activity profile tends to be broader. But a broad inhibitory activity profile is not very desirable for our simulations, since it can result in unwanted boundary effects. The values of the other parameters are determined by separating the inhibitory from the excitatory feedback loops. First, choose some values for α_F and α_B such that their product is small. In this case, the excitatory feedback loop mediated by a single pointer is weak (having weak pointers means that the increments in feedback strength that arise by recruiting pointers are small and can be precisely controlled). The activities of excitatory neurons in the lower and the higher area are typically of equal magnitude if α_F is smaller than α_B. By choosing an inversely proportional connection strength from pointer onto inhibitory neurons, α_I = 1/α_F, the amplification from map onto pointers cancels the reduction from pointers onto inhibitory neurons, just as if the map fed onto the inhibitory neurons directly and with unitary weights. Because both excitatory and inhibitory feedback are mediated by pointers, it is convenient to express their gains in terms of the length P of the effective pointer. In this view, the excitatory gain G_E onto the map is

G_E = \alpha_B P.   (A.3)
The inhibitory gain G_I onto the map is G_I = −βS. Using S from equation A.1, we find

G_I = -\frac{\alpha_I \beta}{\beta_I} P \cos(n_I \hat y / 2).   (A.4)
In our simulation, we have chosen β such that the sum of the excitatory and inhibitory gains is equal to zero, G_E + G_I = 0, from which we get

\beta = \frac{\alpha_F \alpha_B \beta_I}{\cos(n_I \hat y / 2)}.   (A.5)
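Putting equations A.2 and A.5 together gives the full parameter recipe. With α_F = 0.1, α_B = 0.625, β_I = 60, and I = 32, the value of β comes out within about half a percent of the β = 3.755 quoted in the figure captions (a sketch; the small residual presumably reflects the continuum approximation behind equation A.2):

```python
import math

alpha_F, alpha_B = 0.1, 0.625      # excitatory couplings as in Figures 2 and 3
beta_I, I_neurons = 60.0, 32       # localizing inhibition and inhibitory map size

y_hat = math.pi / (2 * (I_neurons - 1))
n_I = 2 * (3 / (2 * beta_I * y_hat ** 2)) ** (1 / 3)    # equation A.2

# Equation A.5: beta balances the excitatory against the inhibitory feedback gain
beta = alpha_F * alpha_B * beta_I / math.cos(n_I * y_hat / 2)
print(beta)   # close to the quoted value beta = 3.755
```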
In order to obtain numerical values for β, we substitute n_I from equation A.2. As expected, with this choice of parameters, the gain of cortical amplification is about unity in Figures 2 to 4 and is independent of the number of pointers recruited. Notice that the previous calculation of feedback gain is valid only for large pointer-map networks, because it is based on a continuum approximation. Nevertheless, these parameters yield stable attractor dynamics for all inputs. But the network is very close to the limit beyond which stability breaks down. For example, for the parameter settings of Figure 3, decreasing β in equation A.5 by only 2% results in unstable dynamics with unbounded amplification. Hence, although we want the excitatory gain to be at least as large as the inhibitory gain, we are limited in the amount by which they can differ from each other. This result is comparable to a previous calculation, where inhibition was instantaneous and we were able to construct a Lyapunov function if the excitatory gain was not larger than the inhibitory gain by more than 1/E (Hahnloser et al., 1999). The fact that the recurrent inhibition in equation 2.2 is not instantaneous but mediated by separate inhibitory neurons can sometimes lead to oscillations (Li & Dayan, 1999). However, it does not lead to oscillations if the time constant of equation 2.3 is small (paradoxically, the time constant is small if β_I is large, that is, if only a small number of inhibitory neurons are active).

Appendix B: Soft Winner-Take-All
As an approximation to the limit in which the number of neurons E in the lower area becomes infinite, the steady state of equations 2.2 and 2.3 can be transformed into the second-order differential equation M″_x = −M_x + c (where ′ denotes differentiation with respect to x). For uniform input, the solution is a cosine-shaped profile that can be centered at an (almost) arbitrary angle χ: M_x = H[cos(χ − d_x)]⁺ + c, where χ − w/2 ≤ d_x ≤ χ + w/2. The amplitude H, the width w, and the offset c of the profile are unknowns that can be inferred from M_{χ ± w/2} = 0, M(χ) = H, and ∫_{χ−w/2}^{χ+w/2} M_x dx = 2H sin(w/2) + cw. This leads to

w - \sin w = \frac{\pi}{N_C \alpha_F \alpha_B (E - 1)},   (B.1)
1686
Richard H. R. Hahnloser, Rodney J. Douglas, and Klaus Hepp
where N_C is the number of recruited pointers. To third-order approximation in w, we find

w \approx \left( \frac{6\pi}{N_C \alpha_F \alpha_B (E - 1)} \right)^{1/3}.   (B.2)
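Equation B.2 can be evaluated for the parameters of Figure 3 (E = 320, α_F = 0.1, α_B = 0.625); the response width shrinks monotonically as pointers are recruited (a sketch):

```python
import math

alpha_F, alpha_B, E = 0.1, 0.625, 320   # parameters of Figure 3

def width(n_pointers):
    """Response width to uniform input, equation B.2 (third-order approximation)."""
    return (6 * math.pi / (n_pointers * alpha_F * alpha_B * (E - 1))) ** (1 / 3)

widths = [math.degrees(width(n)) for n in (1, 2, 4, 8, 16, 32)]
print(widths)   # monotonically decreasing, from roughly 56 down to 18 degrees
```

This is the soft-to-hard WTA transition of Figure 3b in miniature: each doubling of the recruited pointers shrinks the width by a factor of 2^(1/3).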
The excellent match of equation B.2 with the simulation results can be seen in Figure 3b. Surprisingly, based on equation B.1, the width w does not depend on the strength of inhibition given by β and β_I, nor does it depend on the number I of inhibitory neurons. In fact, the width depends only on the strengths of excitation α_F and α_B. Increasing the strength of inhibition does not change the width of the profile, only its amplitude H. The sharpening of activity with an increasing number of recruited pointers can be understood in terms of the mathematical principle of forbidden sets (Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). This principle was derived for symmetric linear threshold networks. The principle says that some sets of map neurons cannot be simultaneously active at a steady state, because their connectivity expresses "forbidden" or unstable differential modes (differential modes are eigenvectors with both negative and positive components). Because all unstable modes are differential, at least one map neuron will eventually fall below the rectification nonlinearity, and so the largest eigenvalue of the feedback will decrease. In this way, map neurons are progressively inactivated by the network dynamics. The process halts when the largest eigenvalue becomes smaller than one, where stability is achieved and a stable activity pattern can be formed (the set of active map neurons becomes "permitted"). By recognizing that the number N_C of recruited pointers has a multiplicative influence on the eigenvalues of inter-areal feedback (recruiting twice as many pointers results in a doubling of feedback gain), we arrive at a simple understanding of what causes the WTA mechanism: the more pointers are recruited, the more map neurons have to be inactivated by the network dynamics in order for the active neurons to form a permitted set.

Appendix C: Hard Winner-Take-All
If the number of attentionally recruited neurons in the higher area increases progressively and all of these neurons participate in similar feedback loops with neurons in the lower area, then at a certain point the feedback will be so strong that only a single neuron in the lower area can be active at a steady state. Under these conditions, the network implements a hard winner-take-all mechanism. Here we calculate exactly how many pointers are required to achieve this regime. Assume that the parameters are selected according to appendix A. We study the steady states of equations 2.1 to 2.3 to establish that no neuron other than neuron M_s can be active at a steady state (the choice of s is arbitrary). In other words, denoting steady states by underlining, we require that \underline{M}_{s'} = 0, where s' ≠ s, for all stationary inputs m_x. We proceed by assuming that there are N_C recruited pointers and that the neurons are at a steady state in which only a single neuron M_s is active in the lower area. In this case, the length of the effective pointer is P = N_C α_F \underline{M}_s. Using this expression with the relationship between parameters as given in appendix A, in particular equation A.1, we find that

\underline{M}_s = m_s.   (C.1)
There is an amplification gain of exactly one (this intermediate result is surprisingly consistent with the continuum assumption made in appendix A). In order for M_s to be the only activated neuron, the neurons M_{s±1} adjacent to M_s should not be activated, even in the most extreme case where their input is equally large, m_{s+1} = m_s (notice that if m_{s+1} were larger than m_s, then we might as well assume that M_{s+1} is the single active neuron at steady state, which leads us back to the same argument). Thus, neurons that are not nearest neighbors of M_s need not be considered; they have a smaller probability of being activated than nearest neighbors, because the excitatory feedback loops via pointers decay with distance in the lower area (Hahnloser et al., 1999). A simple calculation of the steady state \underline{M}_{s+1} yields

\underline{M}_{s+1} = m_{s+1} + N_C \alpha_F \alpha_B \underline{M}_s (\cos\hat d - 1),   (C.2)
where \hat d = \frac{\pi}{2(E-1)}. We determine the number of recruited pointers N_C^{hard} beyond which the WTA is hard by using the constraint \underline{M}_{s+1} = 0. We find that for

N_C^{hard} \ge \frac{1}{\alpha_F \alpha_B (1 - \cos\hat d)} \simeq \frac{(E-1)^2}{\alpha_F \alpha_B},   (C.3)
no neuron other than \underline{M}_s is active at a steady state. We see that the number N_C^{hard} of neurons in the higher area that must be recruited increases quadratically with the number of neurons in the lower area, which suggests that this hard WTA limit is an interesting computational limit rather than a possible tool that can be used by inter-areal circuits as studied here.

Acknowledgments
We acknowledge comments on the manuscript by Martin Giese and the support of the Swiss National Science Foundation and the Körber Foundation.
References

Colby, C., Duhamel, J.-R., & Goldberg, M. (1996). Visual, presaccadic, and cognitive activation of single neurons in monkey lateral intraparietal area. J. Neurophysiol., 76(5), 2841–2852.
Cover, T., & Thomas, A. (1991). Information theory. New York: Wiley.
Deneve, S., Latham, P., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745.
Desimone, R. (1996). Neural mechanisms for visual memory and their role in attention. Proc. Natl. Acad. Sci. USA, 93(24), 13494–13499.
Desimone, R. (1998). Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society (London), B Biological Sciences, 353, 1245–1255.
De Weerd, P., Peralta, M., Desimone, R., & Ungerleider, L. (1999). Loss of attentional stimulus selection after extrastriate cortical lesions in macaques. Nature Neuroscience, 2(8), 753–758.
Fuster, J. (1990). Inferotemporal units in selective visual attention and short-term memory. J. Neurophysiol., 64(3), 681–697.
Hahnloser, R. H., Douglas, R. J., Mahowald, M., & Hepp, K. (1999). Feedback interactions between neuronal pointers and maps for attentional processing. Nature Neuroscience, 2(8), 746–752.
Hahnloser, R. H., Sarpeshkar, R., Mahowald, M., Douglas, R. J., & Seung, S. (2000). Digital selection and analog amplification coexist in a silicon circuit inspired by cortex. Nature, 405, 947–951.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Hupé, J., James, A., Payne, B., Lomber, S., Girard, P., & Bullier, J. (1998). Cortical feedback improves discrimination between figure and background by V1, V2 and V3. Nature, 394, 784–787.
Johnson, R., & Burkhalter, A. (1997). A polysynaptic feedback circuit in rat visual cortex. J. Neurosci., 17(18), 7129–7140.
Lee, D., Itti, L., Koch, C., & Braun, J. (1999). Attention activates winner-take-all competition among visual filters. Nature Neuroscience, 2(4), 375–381.
Li, Z., & Dayan, P. (1999). Computational differences between asymmetrical and symmetrical networks. Network, 10, 59–77.
Logothetis, N., Pauls, J., Bulthoff, H., & Poggio, T. (1994). View dependent object recognition by monkeys. Curr. Biol., 4(5), 401–414.
Lu, Z.-L., & Dosher, B. (1998). External noise distinguishes attention mechanisms. Vision Research, 38(9), 1183–1198.
Luck, S. J., Chelazzi, L., Hillyard, S., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophysiol., 77(1), 24–42.
McAdams, C. J., & Maunsell, J. H. (1999). Effects of attention on the reliability of individual neurons in monkey visual cortex. Neuron, 23, 765–773.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. J. Neurophysiol., 70(3), 909–919.
Motter, B. (1994). Neural correlates of feature selective memory and pop-out in extrastriate area V4. J. Neurosci., 14(4), 2190–2199.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Reynolds, J., Pasternak, T., & Desimone, R. (2000). Attention increases sensitivity of V4 neurons. Neuron, 26, 703–714.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19(5), 1736–1753.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. D. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.
Seung, H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Spitzer, H., Desimone, R., & Moran, J. (1988). Increased attention enhances both behavioral and neuronal performance. Science, 240, 338–340.
Tomita, H., Ohbayashi, M., Nakahara, K., & Miyashita, Y. (1999). Top-down signal from prefrontal cortex in executive control of memory retrieval. Nature, 401, 699–703.
Treue, S., & Maunsell, J. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
Treue, S., & Maunsell, J. (1999). Effects of attention on the processing of motion in macaque middle temporal and medial superior temporal visual areas. Journal of Neuroscience, 19(17), 7591–7602.
Wersing, H., Beyn, W.-J., & Ritter, H. (2001). Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13, 1811–1825.
Zhang, K., & Sejnowski, T. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received November 30, 2000; accepted November 20, 2001.
LETTER
Communicated by Jonathan Yedidia
CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies: Convergent Alternatives to Belief Propagation

A. L. Yuille
[email protected]
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.

This article introduces a class of discrete iterative algorithms that are provably convergent alternatives to belief propagation (BP) and generalized belief propagation (GBP). Our work builds on recent results by Yedidia, Freeman, and Weiss (2000), who showed that the fixed points of BP and GBP algorithms correspond to extrema of the Bethe and Kikuchi free energies, respectively. We obtain two algorithms by applying CCCP to the Bethe and Kikuchi free energies, respectively (CCCP is a procedure, introduced here, for obtaining discrete iterative algorithms by decomposing a cost function into a concave and a convex part). We implement our CCCP algorithms on two- and three-dimensional spin glasses and compare their results to BP and GBP. Our simulations show that the CCCP algorithms are stable and converge very quickly (the speed of CCCP is similar to that of BP and GBP). Unlike CCCP, BP will often not converge for these problems (GBP usually, but not always, converges). The results found by CCCP applied to the Bethe or Kikuchi free energies are equivalent to, or slightly better than, those found by BP or GBP, respectively (when BP and GBP converge). Note that for these, and other problems, BP and GBP give very accurate results (see Yedidia et al., 2000), and failure to converge is their major error mode. Finally, we point out that our algorithms have a large range of inference and learning applications.

1 Introduction
Recent work by Yedidia, Freeman, and Weiss (2000) unified two approaches to statistical inference. They related the highly successful belief propagation (BP) algorithms (Pearl, 1988) to variational methods from statistical physics and, in particular, to the Bethe and Kikuchi free energies (Domb & Green, 1972). These BP algorithms typically give highly accurate results (Yedidia et al., 2000). But BP algorithms do not always converge, and, indeed, failure to converge is their major error mode. This article develops new algorithms that are guaranteed to converge to extrema of the Bethe or Kikuchi free energies and hence are alternatives to belief propagation.

Neural Computation 14, 1691–1722 (2002) © 2002 Massachusetts Institute of Technology

Belief propagation (Pearl, 1988) is equivalent to the sum-product algorithm developed by Gallager (1963) to decode low-density parity codes (LDPC). In recent years (see Forney, 2001, for a review), the coding community has shown great interest in sum-product algorithms and LDPCs. It is predicted that this combination will enable the coding community to design practical codes that approach the Shannon performance limit (Cover & Thomas, 1991) while requiring only limited computation. In particular, it has been shown that the highly successful turbo codes (Berrou, Glavieux, & Thitimajshima, 1993) can also be interpreted in terms of BP (McEliece, Mackay, & Cheng, 1998). Although BP has been proven to converge only for tree-like graphical models (Pearl, 1988), it has been amazingly successful when applied to inference problems with closed loops (Freeman & Pasztor, 1999; Frey, 1998; Murphy, Weiss, & Jordan, 1999), including these particularly important coding applications (Forney, 2001). When BP converges, it (empirically) usually seems to converge to a good approximation to the true answer.

Statistical physics has long been a fruitful source of ideas for statistical inference (Hertz, Krogh, & Palmer, 1991). The mean-field approximation, which can be formulated as minimizing a (factorized) mean-field free energy, has been used to motivate algorithms for optimization and learning (see Arbib, 1995). The Bethe and Kikuchi free energies (Domb & Green, 1972) contain higher-order terms than the (factorized) mean-field free energies commonly used. It is therefore hoped that algorithms that minimize the Bethe or Kikuchi free energies will outperform standard mean-field theory and be useful for optimization and learning applications. Overall, there is a hierarchy of variational approximations in statistical physics that starts with mean field, proceeds to Bethe, and finishes with Kikuchi (which subsumes Bethe as a special case).

Yedidia et al.'s result (2000) proved that the fixed points of BP correspond to the extrema of the Bethe free energy.
They also developed a generalized belief propagation (GBP) algorithm whose fixed points correspond to the extrema of the Kikuchi free energy. In practice, when BP and GBP converge, they go to low-energy minima of the Bethe and Kikuchi free energies. In general, we expect that GBP gives more accurate results than BP since Kikuchi is a better approximation than Bethe. Indeed, empirically, GBP outperformed BP on two-dimensional (2D) spin glasses and converged very close to the true solution (Yedidia et al., 2000). But these results say little about failure to converge of BP (or GBP) (which are the dominant error modes of the algorithms; if BP or GBP converges, then they typically converge to a close approximation to the true solution). This motivates the search for other algorithms to minimize the Bethe and Kikuchi free energies.

This article develops new algorithms that are guaranteed to converge to extrema of the Bethe and Kikuchi free energies. (In computer simulations so far, they always converged to minima.) The algorithms have some similarities to BP and GBP algorithms that estimate "beliefs" by propagating "messages." The new algorithms also propagate messages (formally Lagrange multipliers), but unlike BP and GBP, the propagation depends
on current estimates of the beliefs that must be reestimated periodically. This similarity may help explain when BP and GBP converge. But in any case, these new algorithms offer alternatives to BP and GBP that may be of particular use in regimes where BP and GBP do not converge.

The algorithms are developed using a concave convex procedure (CCCP) that starts by decomposing the free energy into concave and convex parts. From this decomposition, it is possible to obtain a discrete update rule that decreases the free energy at each iteration step. This procedure is very general and is developed further by Yuille and Rangarajan (2001) with applications to many other optimization problems. It builds on results developed when studying mean-field theory (Kosowsky & Yuille, 1994; Yuille & Kosowsky, 1994; Rangarajan, Gold, & Mjolsness, 1996; Rangarajan, Yuille, Gold, & Mjolsness, 1996).

The algorithms were implemented and tested on 2D and 3D spin-glass energy functions. These contain many closed loops (so BP is expected to have difficulties; Yedidia et al., 2000). Our results show that our algorithms are stable, converge very rapidly, and give equivalent or better results than BP and GBP (BP often fails to converge for these problems).

We note that there has recently been a range of new algorithms that act as either variants or improvements of BP (this article was originally submitted before we learned of these alternative algorithms). These include Teh and Welling (2001), Wainwright, Jaakkola, and Willsky (2001), and Chiang and Forney (2001). Of these, the algorithm by Teh and Welling seems most similar to the CCCP algorithms. Their algorithm is also guaranteed to converge to an extremum of the Bethe free energy. Comparative simulations of the two algorithms have been performed (Teh and Welling, private communication, May 2001) and indicate that the differences between the two algorithms lie mainly in the convergence rates rather than in the quality of the solutions obtained.
But these results are preliminary. The relative advantages of these algorithms are work for the future.

The structure of this article is as follows. Section 2 describes the Bethe free energy and BP algorithms. In section 3, we describe the design principle, CCCP, of our algorithm. Section 4 applies CCCP to the Bethe free energy and gives a double-loop algorithm that is guaranteed to converge (we also discuss formal similarities to BP). In section 5, we apply CCCP to the Kikuchi free energy and obtain a provably convergent algorithm. Section 6 applies our CCCP algorithm to 2D and 3D spin-glass energy models using either the Bethe or Kikuchi approximation and compares their performance to BP and GBP. Finally, we discuss related issues in section 7.

2 The Bethe Free Energy and the BP Algorithms
The Bethe free energy (Domb & Green, 1972; Yedidia et al., 2000) is a variational technique from statistical physics. The idea is to replace an inference problem we cannot solve by an approximation that is solvable. For example, we may want to estimate the variables $x_1^*, \ldots, x_N^*$, which are the
Figure 1: Grid for a 2D spin glass. Each node a, b, c, d represents a binary-valued spin variable. Our convention is that a is site $\vec i$; b, c, d are sites $\vec i + \Delta\vec h$, $\vec i + \Delta\vec v$, $\vec i + 2\Delta\vec h$.
most probable states of a distribution $P(x_1, \ldots, x_N \mid y)$. This, however, may be computationally expensive (e.g., NP-complete). Instead, we can apply variational methods where we seek an approximate solution by minimizing a free energy function to obtain a mean-field theory solution (see Arbib, 1995). The Bethe free energy is an approximation that uses joint distributions between variables. It can give exact solutions in situations where standard mean-field theory gives only poor solutions (Weiss, 2001). As we will discuss later, the Kikuchi free energy gives higher-order approximations (see the discussion at the end of this section).

Consider a graph with nodes $i = 1, \ldots, N$. The problem specification will determine connections between the nodes. We will list connections $ij$ only for node pairs $i, j$ that are connected. (Variables, such as $\psi_{ij}, b_{ij}$, which depend on two nodes, will be defined only for pairs of nodes that are connected in the graph.) The state of a node is denoted by $x_i$ (each $x_i$ has $M$ possible states). Each unobserved node is connected to an observed node $y_i$. The joint probability function is given by

$$P(x_1, \ldots, x_N \mid y) = \frac{1}{Z} \prod_{i,j:\, i>j} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i, y_i), \tag{2.1}$$
where $\psi_i(x_i, y_i)$ is the local "evidence" for node $i$, $Z$ is a normalization constant, and $\psi_{ij}(x_i, x_j)$ is the compatibility matrix between nodes $i$ and $j$. We use the convention $i > j$ to avoid double counting. To simplify the notation, we write $\psi_i(x_i)$ as shorthand for $\psi_i(x_i, y_i)$. (Recall that if nodes $i$ and $j$ are not connected, then we do not have a term $\psi_{ij}$.) For example, in the 2D spin-glass network (see Figure 1), the variable $i$ labels nodes on a 2D grid (see section 6). It is convenient to represent these nodes by vector labels $\vec i$ with the first and second components corresponding to the positions in the 2D lattice and with $\Delta\vec h, \Delta\vec v$ corresponding to shifts between lattice sites in the horizontal and vertical directions. The 2D spin glass involves nearest-neighbor interactions only. So we have potentials $\psi_{\vec i}(x_{\vec i})$ at each lattice site and potentials $\psi_{\vec i, \vec i + \Delta\vec h}(x_{\vec i}, x_{\vec i + \Delta\vec h})$ and $\psi_{\vec i, \vec i + \Delta\vec v}(x_{\vec i}, x_{\vec i + \Delta\vec v})$ linking neighboring lattice sites.

Figure 2: The quantities a, c, b, d flowing into the central pixel correspond to Lagrange multipliers or messages for CCCP or BP, respectively. They correspond to either $\lambda_{\vec i \pm \Delta\vec v, \vec i}(x_{\vec i})$, $\lambda_{\vec i \pm \Delta\vec h, \vec i}(x_{\vec i})$ or $m_{\vec i \pm \Delta\vec h, \vec i}(x_{\vec i})$, $m_{\vec i \pm \Delta\vec v, \vec i}(x_{\vec i})$.

The goal is to determine an estimate $\{b_i(x_i)\}$ of the marginal distributions $\{P(x_i \mid y)\}$. It is convenient also to make estimates $\{b_{ij}(x_i, x_j)\}$ of the joint distributions $\{P(x_i, x_j \mid y)\}$ of nodes $i, j$, which are connected in the graph. (Again we use the convention $i > j$ to avoid double counting.) The BP algorithm introduces variables $m_{ij}(x_j)$, which correspond to "messages" that node $i$ sends to node $j$ (later we will see how these messages correspond to Lagrange multipliers, which impose consistency constraints on the beliefs). The BP algorithm is given by:

$$m_{ij}(x_j; t+1) = c_{ij} \sum_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \neq j} m_{ki}(x_i; t), \tag{2.2}$$
where $c_{ij}$ is a normalization constant (i.e., it is independent of $x_j$). Again, we have messages only between nodes that are connected. Nodes are not connected to themselves (i.e., $m_{ii}(x_i, t) = 1, \forall i$). For the 2D spin glass (see Figure 1), we can represent the messages (and later the $\lambda$'s from CCCP) by the nodes into which they flow (see Figure 2).

The messages determine additional variables $b_i(x_i), b_{ij}(x_i, x_j)$ corresponding to the approximate marginal probabilities at node $i$ and the approximate joint probabilities at nodes $i, j$ (with convention $i > j$). These are given in terms of the messages by:

$$b_i(x_i; t) = \hat c_i\, \psi_i(x_i) \prod_k m_{ki}(x_i; t), \tag{2.3}$$

$$b_{ij}(x_i, x_j; t) = \bar c_{ij}\, \phi_{ij}(x_i, x_j) \prod_{k \neq j} m_{ki}(x_i; t) \prod_{l \neq i} m_{lj}(x_j; t), \tag{2.4}$$
where $\phi_{ij}(x_i, x_j) = \psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, \psi_j(x_j)$ and $\hat c_i, \bar c_{ij}$ are normalization constants. For a tree, the BP algorithm of equation 2.2 is guaranteed to converge, and the resulting $\{b_i(x_i)\}$ will correspond to the posterior marginals (Pearl, 1988).

The Bethe free energy of this system is written as (Yedidia et al., 2000)

$$F_\beta(\{b_{ij}, b_i\}) = \sum_{i,j:\, i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{\phi_{ij}(x_i, x_j)} - \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)}, \tag{2.5}$$
where $n_i$ is the number of neighbors of node $i$. Because the $\{b_i\}$ and $\{b_{ij}\}$ correspond to marginal and joint probability distributions, they must satisfy linear consistency constraints:

$$\sum_{x_i, x_j} b_{ij}(x_i, x_j) = 1, \;\; \forall i,j: i>j, \qquad \sum_{x_i} b_i(x_i) = 1, \;\; \forall i,$$
$$\sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \;\; \forall j, x_j, \qquad \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i), \;\; \forall i, x_i. \tag{2.6}$$
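The BP updates (equations 2.2–2.4), the consistency constraints (equation 2.6), and the tree-exactness of the Bethe free energy can all be checked numerically. The following is an illustrative sketch, not code from the article; the potentials are arbitrary numbers chosen for the example. On a three-node chain (a tree), the beliefs of equations 2.3 and 2.4 match the exact marginals, and the Bethe free energy of equation 2.5 evaluated at those beliefs equals $-\log Z$.

```python
import math
from itertools import product

# Three binary nodes in a chain 1 - 2 - 3 (a tree, so BP is exact).
psi = {1: [1.0, 2.0], 2: [1.5, 0.5], 3: [1.0, 1.0]}   # psi_i(x_i), made-up numbers
psi12 = [[2.0, 1.0], [1.0, 3.0]]                      # psi_12(x1, x2)
psi23 = [[1.0, 2.0], [2.0, 1.0]]                      # psi_23(x2, x3)
nbrs = {1: [2], 2: [1, 3], 3: [2]}

def edge_psi(i, j, xi, xj):
    """Pairwise compatibility psi_ij(xi, xj), symmetric in the node pair."""
    if (i, j) == (1, 2): return psi12[xi][xj]
    if (i, j) == (2, 1): return psi12[xj][xi]
    if (i, j) == (2, 3): return psi23[xi][xj]
    return psi23[xj][xi]                               # (i, j) == (3, 2)

# Messages m[i][j][xj] from node i to node j, initialized to 1 (eq. 2.2).
m = {i: {j: [1.0, 1.0] for j in nbrs[i]} for i in nbrs}
for _ in range(10):                                    # a tree converges quickly
    new = {}
    for i in nbrs:
        new[i] = {}
        for j in nbrs[i]:
            msg = [sum(edge_psi(i, j, xi, xj) * psi[i][xi] *
                       math.prod(m[k][i][xi] for k in nbrs[i] if k != j)
                       for xi in (0, 1)) for xj in (0, 1)]
            s = sum(msg)                               # c_ij normalizes the message
            new[i][j] = [v / s for v in msg]
    m = new

def belief(i):                                         # eq. 2.3
    b = [psi[i][x] * math.prod(m[k][i][x] for k in nbrs[i]) for x in (0, 1)]
    s = sum(b)
    return [v / s for v in b]

def pair_belief(i, j):                                 # eq. 2.4, convention i > j
    b = [[edge_psi(i, j, xi, xj) * psi[i][xi] * psi[j][xj] *
          math.prod(m[k][i][xi] for k in nbrs[i] if k != j) *
          math.prod(m[l][j][xj] for l in nbrs[j] if l != i)
          for xj in (0, 1)] for xi in (0, 1)]
    s = sum(v for row in b for v in row)
    return [[v / s for v in row] for row in b]

# Exact marginals and partition function by brute force over the 8 states.
Z, marg = 0.0, {i: [0.0, 0.0] for i in nbrs}
for x1, x2, x3 in product((0, 1), repeat=3):
    p = (psi[1][x1] * psi[2][x2] * psi[3][x3] *
         psi12[x1][x2] * psi23[x2][x3])
    Z += p
    marg[1][x1] += p; marg[2][x2] += p; marg[3][x3] += p
marg = {i: [v / Z for v in marg[i]] for i in marg}

# Bethe free energy (eq. 2.5): n_1 = n_3 = 1, n_2 = 2.
F = 0.0
for i, j in [(2, 1), (3, 2)]:
    bij = pair_belief(i, j)
    for xi in (0, 1):
        for xj in (0, 1):
            w = edge_psi(i, j, xi, xj) * psi[i][xi] * psi[j][xj]
            F += bij[xi][xj] * math.log(bij[xi][xj] / w)
for i in nbrs:
    bi = belief(i)
    F -= (len(nbrs[i]) - 1) * sum(bi[x] * math.log(bi[x] / psi[i][x])
                                  for x in (0, 1))
```

At the fixed point, the pairwise beliefs also marginalize to the single-node beliefs, which is exactly the constraint set of equation 2.6.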
These constraints can be imposed by using Lagrange multipliers $\{\gamma_{ij} : i > j\}$ and $\{\lambda_{ij}(x_j) : i \neq j\}$ and adding terms:

$$\sum_{ij:\, i>j} \gamma_{ij} \Big\{ \sum_{x_i, x_j} b_{ij}(x_i, x_j) - 1 \Big\} \tag{2.7}$$

$$+ \sum_{i,j:\, i>j} \sum_{x_j} \lambda_{ij}(x_j) \Big\{ \sum_{x_i} b_{ij}(x_i, x_j) - b_j(x_j) \Big\} + \sum_{i,j:\, i>j} \sum_{x_i} \lambda_{ji}(x_i) \Big\{ \sum_{x_j} b_{ij}(x_i, x_j) - b_i(x_i) \Big\}. \tag{2.8}$$
The Bethe free energy consists of two terms. The first is of the form of a Kullback-Leibler (K-L) divergence between $\{b_{ij}\}$ and $\{\phi_{ij}\}$ (but it is not actually a K-L divergence because $\{\phi_{ij}\}$ is not a normalized distribution). The second is minus the form of the K-L divergence between $\{b_i\}$ and $\{\psi_i\}$ (again, $\{\psi_i\}$ is not a normalized probability distribution). It follows that the first term is convex in $\{b_{ij}\}$, and the second term is concave in $\{b_i\}$. This will be of importance for the derivation of our algorithm in section 3.

Yedidia et al. (2000) proved that the fixed points of the BP algorithm correspond to extrema of the Bethe free energy (with the linear constraints
of equation 2.6). Their results can be obtained by differentiating the Bethe free energy and using the substitution $m_{ji}(x_i) = e^{\lambda_{ji}(x_i) - \frac{1}{n_i - 1}\sum_k \lambda_{ki}(x_i)}$ and the inverse $e^{-\lambda_{ij}(x_j)} = \prod_{k \neq i} m_{kj}(x_j)$.

Observe that the standard (factorized) mean-field free energy (see Arbib, 1995) can be obtained from the Bethe free energy by setting $b_{ij}(x_i, x_j) = b_i(x_i)\, b_j(x_j), \forall i, j$. This gives:

$$F_{\text{mean-field}} = -\sum_i \sum_{x_i} b_i(x_i) \log \psi_i(x_i) - \sum_{i,j} \sum_{x_i, x_j} b_i(x_i)\, b_j(x_j) \log \psi_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i) \log b_i(x_i). \tag{2.9}$$

By comparing equation 2.5 to equation 2.9, we see that the Bethe free energy contains higher-order terms than the mean-field energy. It is therefore plausible that algorithms that minimize the Bethe free energy will perform better for optimization and learning than those that use the mean-field approximation.

In general, (factorized) mean field, Bethe, and Kikuchi give a hierarchy of approximations to the underlying distribution $P(x_1, \ldots, x_N \mid y)$. They can be derived by minimizing the K-L divergence $D(P_A \| P) = \sum_{x_1, \ldots, x_N} P_A(x_1, \ldots, x_N) \log \frac{P_A(x_1, \ldots, x_N)}{P(x_1, \ldots, x_N \mid y)}$ with respect to an approximating distribution $P_A(x_1, \ldots, x_N)$. If the $P_A$ are restricted to being factorized distributions, $P_A(x_1, \ldots, x_N) = \prod_{i=1}^N b_i(x_i)$, then we obtain the mean-field free energy (see equation 2.9). We can obtain the Bethe and Kikuchi free energies by allowing $P_A(.)$ to have different forms (though additional approximations are still required to get Bethe or Kikuchi unless the graph has no loops).

3 The Concave Convex Procedure
Our algorithms to minimize the Bethe and Kikuchi free energies are based on the concave convex procedure (CCCP) described in this section. This approach was developed (in this article) for the specific case of Bethe and Kikuchi but is far more general. Indeed, many existing discrete-time iterative dynamical systems can be interpreted in terms of CCCP (Yuille & Rangarajan, 2001). Our main results are given by theorems 1, 2, and 3 and show that we can obtain discrete iterative algorithms to minimize energy functions that are the sum of a convex and a concave term. These algorithms will typically, but not always (Yuille & Rangarajan, 2001), require an inner and an outer loop. We first consider the case where there are no constraints for the optimization. Then we generalize to the case where linear constraints are present.
Theorem 1. Consider an energy function $E(\vec z)$ (bounded below) of the form $E(\vec z) = E_{vex}(\vec z) + E_{cave}(\vec z)$, where $E_{vex}(\vec z), E_{cave}(\vec z)$ are convex and concave functions of $\vec z$, respectively. Then the discrete iterative algorithm $\vec z^{\,t} \mapsto \vec z^{\,t+1}$ given by

$$\vec\nabla E_{vex}(\vec z^{\,t+1}) = -\vec\nabla E_{cave}(\vec z^{\,t}) \tag{3.1}$$

is guaranteed to decrease the energy $E(\vec z)$ monotonically as a function of time and hence to converge to an extremum of $E(\vec z)$.

Proof. The convexity and concavity of $E_{vex}(.)$ and $E_{cave}(.)$ mean that:

$$E_{vex}(\vec z_2) \geq E_{vex}(\vec z_1) + (\vec z_2 - \vec z_1) \cdot \vec\nabla E_{vex}(\vec z_1),$$
$$E_{cave}(\vec z_4) \leq E_{cave}(\vec z_3) + (\vec z_4 - \vec z_3) \cdot \vec\nabla E_{cave}(\vec z_3), \tag{3.2}$$

for all $\vec z_1, \vec z_2, \vec z_3, \vec z_4$. Now set $\vec z_1 = \vec z^{\,t+1}$, $\vec z_2 = \vec z^{\,t}$, $\vec z_3 = \vec z^{\,t}$, $\vec z_4 = \vec z^{\,t+1}$. Using equation 3.2 and the algorithm definition (i.e., $\vec\nabla E_{vex}(\vec z^{\,t+1}) = -\vec\nabla E_{cave}(\vec z^{\,t})$), we find that:

$$E_{vex}(\vec z^{\,t+1}) + E_{cave}(\vec z^{\,t+1}) \leq E_{vex}(\vec z^{\,t}) + E_{cave}(\vec z^{\,t}), \tag{3.3}$$

which proves the claim.

Observe that the convexity of $E_{vex}(\vec z)$ means that the function $\vec\nabla E_{vex}(\vec z^{\,t+1})$ is invertible. In other words, the tangent to $E_{vex}(\vec z)$ determines $\vec z^{\,t+1}$ uniquely. Theorem 1 generalizes previous results by Marcus, Waugh, and Westervelt (Marcus & Westervelt, 1989; Waugh & Westervelt, 1993) on the convergence of discrete iterated neural networks. (They discussed a special case where one function was convex and the other was implicitly, but not explicitly, concave.)

The algorithm can be illustrated geometrically by the reformulation shown in Figure 3. Think of decomposing the energy function $E(\vec z)$ into $E_1(\vec z) - E_2(\vec z)$ where both $E_1(\vec z)$ and $E_2(\vec z)$ are convex. (This is equivalent to decomposing $E(\vec z)$ into a convex term $E_1(\vec z)$ plus a concave term $-E_2(\vec z)$.) The algorithm proceeds by matching points on the two terms that have the same tangents. For an input $\vec z_0$, we calculate the gradient $\vec\nabla E_2(\vec z_0)$ and find the point $\vec z_1$ such that $\vec\nabla E_1(\vec z_1) = \vec\nabla E_2(\vec z_0)$. We next determine the point $\vec z_2$ such that $\vec\nabla E_1(\vec z_2) = \vec\nabla E_2(\vec z_1)$, and repeat.

We can extend this result to allow for linear constraints on the variables $\vec z$. This can be given a geometrical intuition. First, properties such as convexity and concavity are preserved when linear constraints are imposed. Second, because the constraints are linear, they determine a hyperplane on which the constraints are satisfied. Theorem 1 can then be applied to the variables on this hyperplane.
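As a concrete one-dimensional illustration of theorem 1 (a toy sketch of my own, not an example from the article): take $E(z) = z^4 - 2z^2$, which is bounded below, and split it as $E_{vex}(z) = z^4$ and $E_{cave}(z) = -2z^2$. Equation 3.1 then reads $4z_{t+1}^3 = 4z_t$, so the update is $z_{t+1} = z_t^{1/3}$ for $z_t > 0$:

```python
def E(z):
    # E = E_vex + E_cave, with E_vex(z) = z**4 (convex), E_cave(z) = -2*z**2 (concave)
    return z**4 - 2*z**2

z = 2.0                       # initial point
trace = [E(z)]
for _ in range(60):
    # CCCP step (eq. 3.1): solve E_vex'(z_new) = -E_cave'(z),
    # i.e. 4*z_new**3 = 4*z, by inverting the monotone gradient of E_vex.
    z = z ** (1.0 / 3.0)
    trace.append(E(z))
# z converges to 1, a minimum of E with E(1) = -1, and the energy
# trace decreases monotonically, as theorem 1 guarantees.
```

Note that each step only requires inverting the gradient of the convex part, which is monotone and hence uniquely invertible, exactly as observed above.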
Figure 3: A CCCP algorithm illustrated for convex minus convex. We want to minimize the function in the left panel. We decompose it (right panel) into a convex part (top curve) minus a convex term (bottom curve). The algorithm iterates by matching points on the two curves that have the same tangent vectors. See the text for more details. The algorithm rapidly converges to the solution at $x = 5.0$. (Figure concept courtesy of James M. Coughlan.)
Theorem 2. Consider a function $E(\vec z) = E_{vex}(\vec z) + E_{cave}(\vec z)$ subject to $k$ linear constraints $\vec w_\mu \cdot \vec z = c_\mu$, where $\{c_\mu : \mu = 1, \ldots, k\}$ are constants. Then the algorithm $\vec z^{\,t} \mapsto \vec z^{\,t+1}$ given by

$$\vec\nabla E_{vex}(\vec z^{\,t+1}) = -\vec\nabla E_{cave}(\vec z^{\,t}) - \sum_{\mu=1}^{k} \alpha_\mu \vec w_\mu, \tag{3.4}$$
where the parameters $\{\alpha_\mu\}$ are chosen to ensure that $\vec z^{\,t+1} \cdot \vec w_\mu = c_\mu$ for $\mu = 1, \ldots, k$, is guaranteed to decrease the energy $E(\vec z^{\,t})$ monotonically and hence to converge to an extremum of $E(\vec z)$.

Proof. Intuitively, the update rule has to balance the gradients of $E_{vex}, E_{cave}$ in the unconstrained directions of $\vec z$, and the term $\sum_{\mu=1}^k \alpha_\mu \vec w_\mu$ is required to deal with the differences in the directions of the constraints. More formally, we define orthogonal unit vectors $\{\hat y_\nu : \nu = 1, \ldots, n-k\}$, which span the space orthogonal to the constraints $\{\vec w_\mu : \mu = 1, \ldots, k\}$. Let $\hat y(\vec z) = \sum_{\nu=1}^{n-k} \hat y_\nu (\vec z \cdot \hat y_\nu)$ be the projection of $\vec z$ onto this space. Define functions $\hat E_{cave}(\hat y), \hat E_{vex}(\hat y)$ by:

$$\hat E_{cave}(\hat y(\vec z)) = E_{cave}(\vec z), \qquad \hat E_{vex}(\hat y(\vec z)) = E_{vex}(\vec z). \tag{3.5}$$

Then we can use the algorithm of theorem 1 on the unconstrained variables $\hat y = (y_1, \ldots, y_{n-k})$. By definition of $\hat y(\vec z)$, we have $\partial \vec z / \partial y_\nu = \hat y_\nu$. Therefore, the algorithm reduces to

$$\hat y_\nu \cdot \vec\nabla_{\vec z} E_{vex}(\vec z^{\,t+1}) = -\hat y_\nu \cdot \vec\nabla_{\vec z} E_{cave}(\vec z^{\,t}), \quad \nu = 1, \ldots, n-k. \tag{3.6}$$

This gives the result (recalling that $\vec w_\mu \cdot \hat y_\nu = 0$ for all $\mu, \nu$).

It follows from theorem 2 that we need to impose the constraints only on $E_{vex}(\vec z^{\,t+1})$ and not on $E_{cave}(\vec z^{\,t})$. In other words, we set

$$\bar E_{vex}(\vec z^{\,t+1}) = E_{vex}(\vec z^{\,t+1}) + \sum_\mu \alpha_\mu \{\vec w_\mu \cdot \vec z^{\,t+1} - c_\mu\}, \tag{3.7}$$

and use update equations:

$$\frac{\partial \bar E_{vex}}{\partial \vec z}(\vec z^{\,t+1}) = -\frac{\partial E_{cave}}{\partial \vec z}(\vec z^{\,t}), \tag{3.8}$$

where the coefficients $\{\alpha_\mu\}$ must be chosen to ensure that the constraints $\vec w_\mu \cdot \vec z^{\,t+1} = c_\mu, \forall \mu$ are satisfied.

We need an additional result to determine how to solve for $\vec z^{\,t+1}$. Theorem 3 gives a procedure that will work for the Bethe and Kikuchi energy functions. (In general, solving for $\vec z^{\,t+1}$ is, at worst, a convex minimization problem and can sometimes be done analytically; Yuille & Rangarajan, 2001.) We restrict ourselves to the specific case where $E_{vex}(\vec z) = \sum_i z_i \log \frac{z_i}{f_i}$. This form of $E_{vex}(\vec z)$ will arise when we apply theorem 2 to the Bethe free energy (see section 4). We let $\vec h(\vec z) = \vec\nabla E_{cave}(\vec z)$. Then the update rules of theorem 2 (see equation 3.4) can be expressed as selecting $\vec z^{\,t+1}$ to minimize a convex
cost function $E^{t+1}(\vec z^{\,t+1})$. Moreover, we can write the solution as an analytic function of Lagrange multipliers $\{\alpha_\mu\}$, which are chosen to maximize a concave (dual) energy function. More formally, we have the following result:

Theorem 3. Let $E_{vex}(\vec z) = \sum_i z_i \log \frac{z_i}{f_i}$. Then the update equation of theorem 2 can be expressed as minimizing the convex energy function:

$$E^{t+1}(\vec z^{\,t+1}) = \vec z^{\,t+1} \cdot \vec h + \sum_i z_i^{t+1} \log \frac{z_i^{t+1}}{f_i} + \sum_\mu \alpha_\mu \{\vec w_\mu \cdot \vec z^{\,t+1} - c_\mu\}, \tag{3.9}$$

where $\vec h = \vec\nabla E_{cave}(\vec z^{\,t})$. The solution is of the form

$$z_i^{t+1}(\alpha) = f_i\, e^{-h_i}\, e^{-1}\, e^{-\sum_\mu \alpha_\mu w_i^\mu}, \tag{3.10}$$

where the Lagrange multipliers $\{\alpha_\mu\}$ are constrained to maximize the (concave) dual energy:

$$\hat E^{t+1}(\alpha) = -\sum_i z_i^{t+1}(\alpha) - \sum_\mu \alpha_\mu c_\mu = -\sum_i f_i\, e^{-h_i}\, e^{-1}\, e^{-\sum_\nu w_i^\nu \alpha_\nu} - \sum_\mu \alpha_\mu c_\mu. \tag{3.11}$$

Moreover, maximizing $\hat E^{t+1}(\alpha)$ with respect to a specific $\alpha_\mu$ enables us to satisfy the corresponding constraint exactly.
Moreover, maximizing EO t C 1 (a) with respect to a specic am enables us to satisfy the corresponding constraint exactly. Proof. This is given by straightforward calculations. Differentiating Et C 1
with respect to zti C 1 gives 1 C log
zi fi
D ¡hi ¡
X
m
am w i ,
(3.12)
m
which corresponds to the update equation 3.4 of theorem 2. Substituting Ezt C 1 (fam g) into Et C 1 ( Ezt C 1 ) gives the dual energy function EO t C 1 (a) D Et C 1 ( Ezt C 1 (am ) . Since E t C 1 (Ez ) is convex, duality ensures that the dual EO t C 1 (a) is concave (Strang, 1986), and hence has a unique maximum that corresponds to the constraints being satised. Setting and hence satises the m th constraint.
@EO t C 1 @am
D 0 ensures that Ezt C 1 ¢ wE m D cm ,
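To make theorem 3 concrete, here is a small sketch (my own illustration, with made-up $f_i$ and $A$; not an example from the article): minimize $E(\vec z) = \sum_i z_i \log(z_i/f_i) - \frac{1}{2}\vec z^{\,T} A \vec z$ subject to the single constraint $\sum_i z_i = 1$, so $\vec w = (1, \ldots, 1)$ and $c = 1$. The quadratic term is concave when $A$ is positive semidefinite, $\vec h = -A\vec z^{\,t}$, and equation 3.10 with the single multiplier solved analytically by normalization gives a softmax-style update:

```python
import math

f = [0.5, 1.0, 1.5]                                        # the f_i of E_vex
A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]    # positive definite

def energy(z):
    # E = sum z log(z/f) - 0.5 z^T A z  (convex plus concave)
    quad = sum(z[i] * A[i][j] * z[j] for i in range(3) for j in range(3))
    return sum(z[i] * math.log(z[i] / f[i]) for i in range(3)) - 0.5 * quad

z = [1.0 / 3.0] * 3                                        # start on the simplex
trace = [energy(z)]
for _ in range(200):
    h = [-sum(A[i][j] * z[j] for j in range(3)) for i in range(3)]  # grad E_cave
    unnorm = [f[i] * math.exp(-h[i]) for i in range(3)]             # eq. 3.10
    s = sum(unnorm)      # absorbs e^{-1} e^{-alpha}: alpha is set by sum(z) = 1
    z = [u / s for u in unnorm]
    trace.append(energy(z))
```

The single Lagrange multiplier never has to be computed explicitly: dividing by the sum enforces the constraint exactly, which is the "analytic solution per constraint" property the theorem describes.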
Theorem 3 specifies a double-loop algorithm where the outer loop is given by equation 3.10 and the inner loop determines the $\{\alpha_\mu\}$ by maximizing $\hat E^{t+1}(\alpha)$. For the Bethe free energy, solving for the $\{\alpha_\mu\}$ can be done by a discrete iterative algorithm. The algorithm maximizes $\hat E^{t+1}(\alpha)$ with respect to
each $\alpha_\mu$ in turn. This maximization can be done analytically for each $\alpha_\mu$. This inner loop algorithm generalizes work by Kosowsky and Yuille (1994; Kosowsky, 1995), who used a result similar to theorem 3 to obtain an algorithm for solving the linear assignment problem. This relates to the Sinkhorn algorithm (Sinkhorn, 1964) (which converts positive matrices into doubly stochastic ones). Rangarajan, Gold, and Mjolsness (1996) applied this result to obtain double-loop algorithms for a range of optimization problems subject to linear constraints.

In the next section, we apply theorems 2 and 3 to the Bethe free energy. In particular, we will show that the nature of the linear constraints for the Bethe free energy means that solving for the constraints in theorem 2 can be done efficiently.

4 A CCCP Algorithm for the Bethe Free Energy
In this section, we return to the Bethe free energy and describe how we can implement an algorithm of the form given by theorems 1, 2, and 3. This CCCP double-loop algorithm is designed by splitting the Bethe free energy into convex and concave parts. An inner loop is used to impose the linear constraints. (This design was influenced by the work of Rangarajan et al. (Rangarajan, Gold, & Mjolsness, 1996; Rangarajan, Yuille, et al., 1996).)

First, we split the Bethe free energy into two parts:

$$E_{vex} = \sum_{i,j:\, i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{\phi_{ij}(x_i, x_j)} + \sum_i \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)},$$
$$E_{cave} = -\sum_i n_i \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)}. \tag{4.1}$$
This split enables us to get nonzero derivatives of $E_{vex}$ with respect to both $\{b_{ij}\}$ and $\{b_i\}$. (Other choices of split are possible.) We now need to express the linear constraints (see equation 2.6) in the form used in theorems 2 and 3. To do this, we set $\vec z = (b_{ij}(x_i, x_j), b_i(x_i))$ so that the first components of $\vec z$ correspond to the $\{b_{ij}\}$ and the latter to the $\{b_i\}$. (There are $NM$ components for the $\{b_i\}$, but the number of $\{b_{ij}\}$ depends on the number of connections and is, at most, $\frac{N(N-1)}{2} M^2$.) The dot product of $\vec z$ with a vector $\vec w = (T_{ij}(x_i, x_j), U_i(x_i))$ is given by $\sum_{i,j:\, i>j} \sum_{x_i, x_j} b_{ij}(x_i, x_j)\, T_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i)\, U_i(x_i)$.

There are two types of constraints: (1) the normalization constraints $\sum_{x_p, x_q} b_{pq}(x_p, x_q) = 1, \forall p, q: p > q$, and (2) the consistency constraints $\sum_{x_p} b_{pq}(x_p, x_q) = b_q(x_q), \forall p, q, x_q: p > q$ and $\sum_{x_q} b_{pq}(x_p, x_q) = b_p(x_p), \forall q, p, x_p: p > q$. We index the normalization constraints by $pq$ (with $p > q$). Then we can express the constraint vectors $\{\vec w_{pq}\}$, the constraint coefficients $\{\alpha_{pq}\}$, and the
constraint values $\{c_{pq}\}$ by

$$\vec w_{pq} = (\delta_{pi}\delta_{qj}, 0), \quad \alpha_{pq} = \gamma_{pq}, \quad c_{pq} = 1, \; \forall p, q. \tag{4.2}$$

The consistency constraints are indexed by $pqx_q$ (with $p > q$) and $qpx_p$ (with $p > q$). The constraint vectors, the constraint coefficients, and the constraint values are given by:

$$\vec w_{pqx_q} = (\delta_{ip}\delta_{jq}\delta_{x_j, x_q}, -\delta_{iq}\delta_{x_i, x_q}), \quad \alpha_{pqx_q} = \lambda_{pq}(x_q), \quad c_{pqx_q} = 0, \; \forall p, q, x_q,$$
$$\vec w_{qpx_p} = (\delta_{ip}\delta_{jq}\delta_{x_i, x_p}, -\delta_{ip}\delta_{x_i, x_p}), \quad \alpha_{qpx_p} = \lambda_{qp}(x_p), \quad c_{qpx_p} = 0, \; \forall q, p, x_p. \tag{4.3}$$
We now apply theorem 3 to the Bethe free energy and obtain theorem 4, which specifies the outer loop of our algorithm:

Theorem 4 (CCCP Outer Loop for Bethe). The following update rule is guaranteed to reduce the Bethe free energy provided the constraint coefficients $\{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\}$ can be chosen to ensure that $\{b_{ij}(t+1)\}, \{b_i(t+1)\}$ satisfy the linear constraints of equation 2.6:

$$b_{ij}(x_i, x_j; t+1) = \phi_{ij}(x_i, x_j)\, e^{-1}\, e^{-\lambda_{ij}(x_j)}\, e^{-\lambda_{ji}(x_i)}\, e^{-\gamma_{ij}},$$
$$b_i(x_i; t+1) = \psi_i(x_i)\, e^{-1}\, e^{n_i} \left( \frac{b_i(x_i; t)}{\psi_i(x_i)} \right)^{n_i} e^{\sum_k \lambda_{ki}(x_i)}. \tag{4.4}$$
Proof. These equations correspond to equation 3.10 in theorem 3, where we have substituted equation 4.1 into equation 3.4 and used equations 4.2 and 4.3 for the constraints. The constraint terms $\sum_\mu \alpha_\mu \vec w_\mu$ simplify, owing to the form of the constraints (e.g., $\sum_{p,q:\, p>q} \gamma_{pq}\delta_{ip}\delta_{jq} = \gamma_{ij}$).

We need an inner loop to determine the constraint coefficients $\{\lambda_{ij}, \lambda_{ji}, \gamma_{ij}\}$. This is obtained by using theorem 3. Recall that finding the constraint coefficients is equivalent to maximizing the dual energy $\hat E^{t+1}(\alpha)$ and that performing the maximization with respect to the $\mu$th coefficient $\alpha_\mu$ corresponds to solving the $\mu$th constraint equation. For the Bethe free energy, it is possible to solve the $\mu$th constraint equation to obtain an analytic expression for the $\mu$th coefficient in terms of the remaining coefficients. Therefore, we can maximize the dual energy $\hat E^{t+1}(\alpha)$ with respect to any coefficient $\alpha_\mu$ analytically. Hence, we have an algorithm that is guaranteed to converge to the maximum of $\hat E^{t+1}(\alpha)$: select a constraint $\mu$, solve the equation $\partial \hat E^{t+1} / \partial \alpha_\mu = 0$ for $\alpha_\mu$ analytically, and repeat. As we will see in section 4.1, this inner loop is very similar to BP (provided we equate the messages with the exponentials of the Lagrange multipliers).
More formally, we specify the inner loop by theorem 5:

Theorem 5 (CCCP Inner Loop for Bethe). The constraint coefficients $\{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\}$ of theorem 4 can be solved for by a discrete iterative algorithm, indexed by $\tau$, guaranteed to converge to the unique solution. At each step, we select coefficients $\gamma_{pq}$, $\lambda_{pq}(x_q)$, or $\lambda_{qp}(x_p)$ and update them by

$$e^{\gamma_{pq}(\tau+1)} = \sum_{x_p, x_q} \phi_{pq}(x_p, x_q)\, e^{-1}\, e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\lambda_{qp}(x_p; \tau)},$$

$$e^{2\lambda_{pq}(x_q; \tau+1)} = \frac{\sum_{x_p} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{qp}(x_p; \tau)}\, e^{-\gamma_{pq}(\tau)}}{\psi_q(x_q)\, e^{n_q} \left( \frac{b_q(x_q; t)}{\psi_q(x_q)} \right)^{n_q} e^{\sum_{j \neq p} \lambda_{jq}(x_q; \tau)}},$$

$$e^{2\lambda_{qp}(x_p; \tau+1)} = \frac{\sum_{x_q} \phi_{pq}(x_p, x_q)\, e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\gamma_{pq}(\tau)}}{\psi_p(x_p)\, e^{n_p} \left( \frac{b_p(x_p; t)}{\psi_p(x_p)} \right)^{n_p} e^{\sum_{j \neq q} \lambda_{jp}(x_p; \tau)}}, \tag{4.5}$$
which monotonically maximizes the dual energy:

$$\hat E^{t+1}(\alpha) = -\sum_{i,j:\, i>j} \sum_{x_i, x_j} \phi_{ij}(x_i, x_j)\, e^{-1}\, e^{-\lambda_{ij}(x_j)}\, e^{-\lambda_{ji}(x_i)}\, e^{-\gamma_{ij}} - \sum_i \sum_{x_i} \psi_i(x_i)\, e^{-1}\, e^{n_i} \left( \frac{b_i(x_i; t)}{\psi_i(x_i)} \right)^{n_i} e^{\sum_k \lambda_{ki}(x_i)} - \sum_{i,j:\, i>j} \gamma_{ij}. \tag{4.6}$$
Proof. We use the update rule, equation 3.10, given by theorem 3 and calculate the constraint equations, $\vec z \cdot \vec w_\mu = c_\mu, \forall \mu$. For the Bethe free energy, we obtain equations 4.5, where the upper equation corresponds to the normalization constraints and the lower equations to the consistency constraints. Observe that we can solve each equation analytically for the corresponding constraint coefficients $\gamma_{pq}, \lambda_{pq}(x_q), \lambda_{qp}(x_p)$. By theorem 3, this is equivalent to maximizing the dual energy (see equation 3.11) with respect to each coefficient. Since the dual energy is concave, solving for each coefficient $\gamma_{pq}, \lambda_{pq}(x_q), \lambda_{qp}(x_p)$ (with the others fixed) is guaranteed to increase the dual energy. Hence, we can maximize the dual energy, and ensure the constraints are satisfied, by repeatedly selecting coefficients and solving the equations 4.5. The form of the dual energy is given by substituting equation 4.4 into equation 3.11.

Observe that we can update all the coefficients $\{\gamma_{pq}\}$ simultaneously because their update rule (the right-hand side of the top equation of 4.5) depends only on the $\{\lambda_{pq}\}$. Similarly, we can update many of the $\{\lambda_{pq}\}$ simultaneously because their update rules (the right-hand sides of the middle and bottom equations of 4.5) depend on only a subset of the $\{\lambda_{pq}\}$. (For example,
CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies
1705
when updating λ_pq(x_q), we can simultaneously update any λ_ij(x_j) provided i ≠ p, i ≠ q, j ≠ p, and j ≠ q.)

4.1 Connections Between CCCP and BP. As we will show, CCCP and BP are fairly similar. The main difference is that the CCCP update rules for the Lagrange multipliers {λ, γ} depend explicitly on the current estimates of the beliefs. The estimates of the beliefs must then be reestimated periodically (after each run of the inner loop). By contrast, for BP, the update rules for the messages are independent of the beliefs. We now give an alternative formulation of the CCCP algorithm that simplifies the algorithm (for implementation) and makes the connections to belief propagation more apparent. Define new variables {h_pq, h_p, g_pq, g_p} by:

\[
h_{pq}(x_p,x_q) = \psi_{pq}(x_p,x_q)\, e^{-1}, \qquad
g_{pq}(\lambda,\gamma) = e^{-\gamma_{pq} - \lambda_{pq}(x_q) - \lambda_{qp}(x_p)},
\]
\[
h_q(x_q) = \psi_q(x_q)\, e^{-1}\, e^{n_q} \left( \frac{b_q(x_q)}{\psi_q(x_q)} \right)^{n_q}, \qquad
g_q(\lambda) = e^{\sum_j \lambda_{jq}(x_q)}.
\tag{4.7}
\]
The outer loop can be written as:

\[
b_{ij}(x_i,x_j) = h_{ij}(x_i,x_j)\, g_{ij}(\lambda,\gamma), \qquad
b_i(x_i) = h_i(x_i)\, g_i(\lambda).
\tag{4.8}
\]
The inner loop can be written as:

\[
e^{\gamma_{pq}(\tau+1)} = e^{\gamma_{pq}(\tau)} \sum_{x_p,x_q} h_{pq}(x_p,x_q)\, g_{pq}(\lambda,\gamma),
\]
\[
e^{2\lambda_{pq}(x_q;\tau+1)} = e^{2\lambda_{pq}(x_q;\tau)}\,
\frac{\sum_{x_p} h_{pq}(x_p,x_q)\, g_{pq}(\lambda,\gamma)}{h_q(x_q)\, g_q(\lambda)}.
\tag{4.9}
\]
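To make the {h, g} formulation concrete, here is a minimal numeric sketch of the double loop of equations 4.7 through 4.9 (our own toy example, not code from the article) on a single-edge model. For simplicity, the unary potentials are folded into the pairwise term, so each node has one neighbor; on this loop-free model, the Bethe minimum is the exact joint distribution, which the iteration should approach.

```python
import numpy as np

# Toy two-node model: the unary potentials are folded into the pairwise term,
# so psi[x_p, x_q] plays the role of psi_pq * psi_p * psi_q and each node has
# n = 1 neighbor (the example and all variable names are ours).
psi = np.array([[2.0, 1.0],
                [1.0, 4.0]])

lam_pq = np.zeros(2)    # lambda_{pq}(x_q), consistency multiplier for node q
lam_qp = np.zeros(2)    # lambda_{qp}(x_p), consistency multiplier for node p
gamma = 0.0             # gamma_{pq}, normalization multiplier
b_p = np.full(2, 0.5)   # single-node beliefs b_p(x_p; t)
b_q = np.full(2, 0.5)

def pair_belief():      # h_pq * g_pq, equations 4.7 and 4.8
    return (psi * np.exp(-1.0)) * np.exp(-gamma
                                         - lam_pq[None, :] - lam_qp[:, None])

for outer in range(50):
    # With n = 1 and the unaries folded away, h_i(x_i) reduces to b_i(x_i; t)
    # and g_i(lam) to the exponential of the single incoming multiplier.
    for inner in range(50):                       # inner loop, equation 4.9
        gamma += np.log(pair_belief().sum())      # normalization update
        lam_pq += 0.5 * np.log(pair_belief().sum(axis=0)
                               / (b_q * np.exp(lam_pq)))
        lam_qp += 0.5 * np.log(pair_belief().sum(axis=1)
                               / (b_p * np.exp(lam_qp)))
    # outer loop, equation 4.8: reset the beliefs to b = h * g
    b_q = b_q * np.exp(lam_pq)
    b_p = b_p * np.exp(lam_qp)
b_pq = pair_belief()

print(np.round(b_pq, 3))   # approaches the exact joint psi / psi.sum()
```

Each inner update is solved analytically, as in theorem 5: the dual is concave, so the sequential coordinate updates converge, after which the pairwise belief is normalized and marginalization-consistent.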
This formulation of the algorithm is easier to program. As before, the inner loop is iterated until convergence; then we perform one step of the outer loop and repeat.

We now show a formal similarity between CCCP and belief propagation. To do this, we collapse the outer loop of CCCP by solving for the {b_ij, b_i} as functions of the variables {λ, γ}. This is done by solving the fixed-point equation 4.8 using equation 4.7 to obtain {b*_i, b*_ij} given by:

\[
b^*_i(x_i) = \psi_i(x_i)\, e^{-1}\, \{g_i(\lambda)\}^{1/(n_i-1)}, \qquad
b^*_{ij}(x_i,x_j) = \psi_{ij}(x_i,x_j)\, e^{-1}\, g_{ij}(\lambda,\gamma).
\tag{4.10}
\]

Now substitute {b*_i, b*_ij} into the inner loop update equation, 4.9. This collapsed CCCP algorithm reduces to:

\[
e^{2\lambda_{pq}(x_q;\tau+1)} \left\{ e^{-\lambda_{pq}(x_q;\tau)}\,
e^{-\frac{1}{n_q-1}\sum_j \lambda_{jq}(x_q;\tau)} \right\}
= \sum_{x_p} \psi_{pq}(x_p,x_q)\, \psi_p(x_p)\, e^{-\lambda_{qp}(x_p;\tau)}\, e^{-\gamma_{pq}}.
\tag{4.11}
\]
1706
A. L. Yuille
To relate this to belief propagation, we recall from section 2 that the messages can be related to the Lagrange multipliers by:

\[
m_{ji}(x_i) = e^{\lambda_{ji}(x_i)}\, e^{-\frac{1}{n_i-1}\sum_k \lambda_{ki}(x_i)}, \qquad
e^{-\lambda_{ij}(x_j)} = \prod_{k\neq i} m_{kj}(x_j).
\tag{4.12}
\]
We can reexpress belief propagation as:

\[
e^{\lambda_{ij}(x_j;\tau+1)}\, e^{-\frac{1}{n_j-1}\sum_k \lambda_{kj}(x_j;\tau+1)}
= c \sum_{x_i} \psi_{ij}(x_i,x_j)\, \psi_i(x_i)\, e^{-\lambda_{ji}(x_i;\tau)},
\tag{4.13}
\]
\[
b_i(x_i;\tau) = \hat{c}\, \psi_i(x_i)\, e^{-\frac{1}{n_i-1}\sum_l \lambda_{li}(x_i;\tau)}.
\tag{4.14}
\]
This shows that belief propagation is similar to the collapsed CCCP, the only difference being that the factor \(e^{-\lambda_{pq}(x_q;\tau)}\, e^{-\frac{1}{n_q-1}\sum_j \lambda_{jq}(x_q;\tau)}\) is evaluated at time τ for the collapsed double loop and at time τ+1 for belief propagation. (This derivation has ignored the "normalization dynamics," that is, updating the {γ_pq}, because it is straightforward and can be taken for granted. With this understanding, we have dropped the dependence of the {γ_pq} on τ.)

5 The Kikuchi Approximation
The Kikuchi free energy (Domb & Green, 1972; Yedidia et al., 2000) is a generalization of the Bethe free energy that enables us to include higher-order interactions. In this section, we show that the results we obtained for the Bethe free energy can be extended to the Kikuchi free energy. Recall that Yedidia et al. (2000) derived a "generalized belief propagation" algorithm whose fixed points correspond to extrema of the Kikuchi free energy. Their computer simulations showed that GBP outperforms BP on 2D spin glasses and obtains results close to the optimum (presumably because Kikuchi is a higher-order approximation than Bethe).

We now define the Kikuchi free energy (following the exposition in Yedidia et al., 2000). For a general graph, let R be a set of regions that includes some basic clusters of nodes, their intersections, the intersections of the intersections, and so on. The Bethe approximation corresponds to the special case where the basic clusters consist of all linked pairs of nodes. For any region r, we define the superregions of r to be the set sup(r) of all regions in R that contain r. Similarly, we define the subregions of r to be the set sub(r) of all regions in R that lie completely within r. In addition, we define the direct subregions of r to be the set sub_d(r) of all subregions of r that have no superregions that are also subregions of r. Similarly, we define the direct superregions of r to be the set sup_d(r) of superregions of r that have no subregions that are superregions of r. If s is a direct subregion of r, then we define r\s to be those nodes that are in r but not in s.
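The set operations just defined translate almost directly into code. A small sketch (our own helper functions, not from the article), with regions represented as frozensets of node labels:

```python
# Region bookkeeping: superregions/subregions by strict containment, and the
# "direct" variants keep only the maximal subregions / minimal superregions.
def subregions(r, regions):
    return {s for s in regions if s < r}          # strict containment

def superregions(r, regions):
    return {s for s in regions if r < s}

def direct_subregions(r, regions):
    subs = subregions(r, regions)
    return {s for s in subs if not any(s < t for t in subs)}

def direct_superregions(r, regions):
    sups = superregions(r, regions)
    return {s for s in sups if not any(t < s for t in sups)}

# Bethe special case: the basic clusters are the linked pairs of a chain a-b-c.
regions = [frozenset(x) for x in ({'a', 'b'}, {'b', 'c'}, {'a'}, {'b'}, {'c'})]
print(direct_superregions(frozenset({'b'}), regions))
# both pairs contain node b, so its direct superregions are {a,b} and {b,c}
```

For the Bethe case, the direct subregions of a linked pair are its two nodes, and the direct superregions of a node are the pairs containing it.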
Figure 4: The regions for the Kikuchi implementation of the 2D spin glass. (Left) Top regions. (Center and right) Subregions.
We illustrate these definitions on the 2D spin glass (see Figure 1). The basic regions are shown in Figure 4 (other choices of basic regions are possible). The direct superregions and the direct subregions are shown in Figures 5 and 6. Let x_r be the state of the nodes in region r and b_r(x_r) be the "belief" in x_r. x ∈ r\s denotes the state of the nodes that are in r but not in s. Any region r has an energy E_r(x_r) associated with it (e.g., for a region with pairwise interactions, we have \(E_r(x_r) = -\log \prod_{i,j\in r:\, i>j} \psi_{ij}(x_i,x_j) - \log \prod_{i\in r} \psi_i(x_i)\)).
Figure 5: The direct subregions. The top-level regions (top panel) have four direct subregions (two horizontal and two vertical). Each middle-level region has two direct subregions (bottom two panels).
Figure 6: The direct superregions. The middle-level regions have two direct superregions (top two panels). The bottom-level region has four direct superregions (bottom panel).
Then the Kikuchi free energy is

\[
F_K = \sum_{r\in R} c_r \left\{ \sum_{x_r} b_r(x_r)\, E_r(x_r)
+ \sum_{x_r} b_r(x_r)\, \log b_r(x_r) \right\} + L_K,
\tag{5.1}
\]
where L_K gives the constraints (see equation 5.2 below) and c_r is an "overcounting" number of region r, defined by \(c_r = 1 - \sum_{s\in sup(r)} c_s\), where sup(r) is the set of all regions in R that contain r. For the largest regions, we have c_r = 1. We let c_max = max_{r∈R} c_r. The beliefs b_r(x_r) must obey two types of constraints: (1) they must all sum to one, and (2) they must be consistent with the beliefs in regions that intersect with r. This consistency can be ensured by imposing consistency between all regions and their direct subregions (i.e., between all r ∈ R and their direct subregions s ∈ sub_d(r)). We can impose the constraints by
Lagrange multipliers:

\[
L_K = \sum_{r\in R} \gamma_r \left( \sum_{x_r} b_r(x_r) - 1 \right)
+ \sum_{r\in R} \sum_{s\in sub_d(r)} \sum_{x_s} \lambda_{r,s}(x_s)
\left\{ \sum_{x_{r\setminus s}} b_r(x_r) - b_s(x_s) \right\}.
\tag{5.2}
\]
5.1 A CCCP Algorithm for the Kikuchi Free Energy. Now we obtain a CCCP algorithm guaranteed to converge to an extremum of the Kikuchi free energy. First, we observe that the Kikuchi free energy, like the Bethe free energy, can be split into a concave and a convex part. We choose the following split (many others are possible):

\[
F_K = F_{K,vex} + F_{K,cave},
\tag{5.3}
\]

where

\[
F_{K,vex} = c_{max} \sum_{r\in R} \left\{ \sum_{x_r} b_r(x_r)\, E_r(x_r)
+ \sum_{x_r} b_r(x_r)\, \log b_r(x_r) \right\}
+ \sum_{r\in R} \gamma_r \left( \sum_{x_r} b_r(x_r) - 1 \right)
+ \sum_{r\in R} \sum_{s\in sub_d(r)} \sum_{x_s} \lambda_{r,s}(x_s)
\left\{ \sum_{x_{r\setminus s}} b_r(x_r) - b_s(x_s) \right\},
\tag{5.4}
\]
\[
F_{K,cave} = \sum_{r\in R} (c_r - c_{max}) \left\{ \sum_{x_r} b_r(x_r)\, E_r(x_r)
+ \sum_{x_r} b_r(x_r)\, \log b_r(x_r) \right\},
\tag{5.5}
\]

where we have used the results stated in equations 3.7 and 3.8 to impose the constraints only on the convex term. It can be readily verified that F_{K,vex} is a convex function of {b_r(x_r)} and F_{K,cave} is a concave function. Recall that c_max = max_{r∈R} c_r; this ensures that F_{K,cave} is concave.

Theorem 6 (CCCP for Kikuchi).
The Kikuchi free energy F_K can be minimized by a double-loop algorithm. The outer loop has time parameter t and is given by:

\[
b_r(x_r; t+1) = e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r}\,
e^{-\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}.
\tag{5.6}
\]
The inner loops, to solve for the {γ_r}, {λ_{r,s}}, have time parameter τ and are given by:

\[
e^{\gamma_r(\tau+1)} = \sum_{x_r} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\,
e^{-\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s;\tau)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r;\tau)},
\]
\[
e^{2\lambda_{r,u}(x_u;\tau+1)} =
\frac{\sum_{x_{r\setminus u}} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r(\tau)}\,
e^{-\sum_{s\in sub_d(r):\, s\neq u} \lambda_{r,s}(x_s;\tau)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r;\tau)}}
{e^{-\frac{c_u}{c_{max}}\{E_u(x_u)+1\}}\,
\{b_u(x_u;t)\}^{\frac{c_{max}-c_u}{c_{max}}}\, e^{-\gamma_u(\tau)}\,
e^{-\sum_{s\in sub_d(u)} \lambda_{u,s}(x_s;\tau)}\,
e^{\sum_{v\in sup_d(u):\, v\neq r} \lambda_{v,u}(x_u;\tau)}}.
\tag{5.7}
\]
Moreover, the inner loop is guaranteed to satisfy the constraints by converging to the unique maximum of the dual energy:

\[
\hat{E}_{t+1}(\lambda,\gamma) = -\sum_{r\in R} \sum_{x_r}
e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, e^{-\gamma_r}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\,
e^{-\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}
- \sum_{r\in R} \gamma_r.
\tag{5.8}
\]
Proof. We split the Kikuchi free energy F_K into the convex and concave parts, F_{K,vex}, F_{K,cave}, given by equation 5.5. We can use theorems 1, 2, and 3 to obtain an algorithm to minimize the Kikuchi free energy. In more detail, we calculate the derivatives:

\[
\frac{\partial F_{K,vex}}{\partial b_r(x_r)} =
c_{max} \left\{ E_r(x_r) + 1 + \log b_r(x_r) \right\} + \gamma_r
+ \sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)
- \sum_{v\in sup_d(r)} \lambda_{v,r}(x_r),
\tag{5.9}
\]
\[
\frac{\partial F_{K,cave}}{\partial b_r(x_r)} =
(c_r - c_{max}) \left\{ E_r(x_r) + 1 + \log b_r(x_r) \right\},
\tag{5.10}
\]

where we have not included constraint terms in ∂F_{K,cave}/∂b_r(x_r) because they can be absorbed into the constraints in ∂F_{K,vex}/∂b_r(x_r). (Recall from theorems 1, 2, and 3 that we are setting ∂F_{K,vex}/∂b_r(x_r) (t+1) = −∂F_{K,cave}/∂b_r(x_r) (t).)
From equations 5.9 and 5.10, we can obtain the outer loop of our update algorithm (using theorems 2 and 3):

\[
b_r(x_r; t+1) = e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r}\,
e^{-\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}.
\tag{5.11}
\]
The inner loop, to solve for the {γ_r}, {λ_{r,s}}, is determined by theorem 2. We know that solving the constraint equations one by one is guaranteed to converge to the unique solution for the constraints. The constraint equation for γ_r is \(\sum_{x_r} b_r(x_r) = 1\), which can be solved to give an analytic expression for γ_r:

\[
e^{\gamma_r} = \sum_{x_r} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\,
e^{-\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}.
\tag{5.12}
\]
The constraint equation for λ_{r,u}(x_u) is \(\sum_{x_{r\setminus u}} b_r(x_r) = b_u(x_u)\), which yields:

\[
\sum_{x_{r\setminus u}} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r;t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r}\,
e^{-\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)}\,
e^{\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}
= e^{-\frac{c_u}{c_{max}}\{E_u(x_u)+1\}}\,
\{b_u(x_u;t)\}^{\frac{c_{max}-c_u}{c_{max}}}\, e^{-\gamma_u}\,
e^{-\sum_{s\in sub_d(u)} \lambda_{u,s}(x_s)}\,
e^{\sum_{v\in sup_d(u)} \lambda_{v,u}(x_u)},
\tag{5.13}
\]
which can be rearranged to give an analytic solution for λ_{r,u}(x_u) (see equation 5.7). The dual energy is given by theorem 3. Hence the result.

5.2 CCCP Kikuchi and Generalized Belief Propagation. We now generalize our arguments for the Bethe free energy (see section 4.1) and show that there is a connection between CCCP for Kikuchi and GBP. Once again, CCCP is very similar to message passing, but it requires us to update our estimates of the beliefs periodically. First, we simplify the form of the CCCP Kikuchi algorithm. Then we briefly describe GBP. Finally, we show connections between CCCP and GBP by collapsing the outer loop. We simplify the double-loop Kikuchi algorithm by defining new variables:

\[
h_r(x_r) = e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\,
\{b_r(x_r)\}^{\frac{c_{max}-c_r}{c_{max}}},
\qquad
g_r(x_r) = e^{-\gamma_r - \sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)
+ \sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}.
\tag{5.14}
\]
The outer loop rule becomes

\[
b_r(x_r) = h_r(x_r)\, g_r(x_r).
\tag{5.15}
\]

The inner loop becomes:

\[
e^{\gamma_r(\tau+1)} = e^{\gamma_r(\tau)} \sum_{x_r} h_r(x_r)\, g_r(x_r),
\]
\[
e^{2\lambda_{r,u}(x_u;\tau+1)} = e^{2\lambda_{r,u}(x_u;\tau)}\,
\frac{\sum_{x_{r\setminus u}} h_r(x_r)\, g_r(x_r)}{h_u(x_u)\, g_u(x_u)}.
\tag{5.16}
\]
This form of the update rules is straightforward to program and, as we now show, relates to the generalized belief propagation algorithm of Yedidia et al. (2000).

We now give a formulation of generalized belief propagation. Following Yedidia et al. (2000), we allow messages m_{r,s}(x_s) between regions r and any direct subregions s ∈ sub_d(r). We also define M(r) to be the set of all messages that flow into r, or any of its subregions, from outside r (i.e., m_{r',s'}(x_{s'}) ∈ M(r) provided s' is a subregion of r, or r itself, and r'\s' is outside r). The messages will correspond to (new) Lagrange multipliers {μ} related to the messages by μ_{r,s}(x_s) = log m_{r,s}(x_s). The main idea is to introduce new Lagrange multipliers {μ_{r,s}} so that:

\[
\sum_{r\in R} \sum_{s\in sub_d(r)} \sum_{x_s} \lambda_{r,s}(x_s)
\left\{ \sum_{x_{r\setminus s}} b_r(x_r) - b_s(x_s) \right\}
= \sum_r c_r \sum_{x_r} b_r(x_r) \sum_{\{r',s'\}\in M(r)} \mu_{r',s'}(x_{s'}).
\tag{5.17}
\]

By extremizing the Kikuchi free energy, we can find the solution to be:

\[
b_r(x_r) = \psi_r(x_r) \prod_{\{r',s'\}\in M(r)} m_{r',s'}(x_{s'}),
\tag{5.18}
\]

where we relate the messages to the Lagrange multipliers by μ_{r,s}(x_s) = log m_{r,s}(x_s). Generalized belief propagation can be reformulated as:

\[
m_{r,s}(x_s; t+1) = m_{r,s}(x_s; t)\,
\frac{\sum_{x_{r\setminus s}} \psi_r(x_r) \prod_{\{r',s'\}\in M(r)} m_{r',s'}(x_{s'})}
{\psi_s(x_s) \prod_{\{r',s'\}\in M(s)} m_{r',s'}(x_{s'})}.
\tag{5.19}
\]

It is clear that fixed points of this equation will correspond to situations where the constraints are satisfied (because the numerator will equal \(\sum_{x_{r\setminus s}} b_r(x_r)\) and the denominator is b_s(x_s)). The form of b_r(x_r) means that this is all we need to do.
To relate this to the Kikuchi double loop, we once again collapse the outer loop by solving for the {b_r} in terms of the Lagrange multipliers {λ}. This gives:

\[
b^*_r(x_r) = e^{-E_r(x_r) - 1}\, \{g_r(\lambda,\gamma)\}^{c_{max}/c_r}.
\tag{5.20}
\]

These are precisely of the same form as those given by generalized belief propagation after we solve for the relationship between the Lagrange multipliers to be:

\[
\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s) - \sum_{p\in sup_d(r)} \lambda_{p,r}(x_r)
= c_r \sum_{\{r',s'\}\in M(r)} \mu_{r',s'}(x_{s'}).
\tag{5.21}
\]

The updates are then given by:

\[
e^{\gamma_r(\tau+1)} = e^{\gamma_r(\tau)} \sum_{x_r} b^*_r(x_r),
\]
\[
e^{2\lambda_{r,u}(x_u;\tau+1)} = e^{2\lambda_{r,u}(x_u;\tau)}\,
\frac{\sum_{x_{r\setminus u}} b^*_r(x_r)}{b^*_u(x_u)}.
\tag{5.22}
\]
So these updates are very similar to generalized belief propagation, equations 5.18 and 5.19. The only difference is that we are updating the {λ} instead of the {μ}, which are related by a linear transformation.

6 Implementation
We implemented the Bethe and Kikuchi CCCP algorithms in numeric Python on spin-glass energy functions. We used binary state variables defined on 2D and 3D lattices (with toroidal boundary conditions). This is similar to the setup used in Yedidia et al. (2000) and Teh and Welling (2001). We also implemented BP and GBP for comparison. (These examples were chosen because it is known that BP has difficulties with graphs with so many closed loops.)

The grid was shown in Figure 1 (imagine a third dimension to get the 3D cube). We label the nodes by \(\vec{i}\), where each component of \(\vec{i}\) takes values from 1 to N. We let \(\vec{i} \pm \Delta\vec{h}\) and \(\vec{i} \pm \Delta\vec{v}\) be the horizontal and vertical neighbors of site \(\vec{i}\). Each site has a spin state \(x_{\vec{i}} \in \{0,1\}\). There are (unary) potentials at each lattice site, labeled \(\psi_{\vec{i}}(x_{\vec{i}})\). There are horizontal and vertical connections between neighboring lattice sites, which we label \(\psi_{\vec{i},\vec{i}+\Delta\vec{h}}(x_{\vec{i}}, x_{\vec{i}+\Delta\vec{h}})\) and \(\psi_{\vec{i},\vec{i}+\Delta\vec{v}}(x_{\vec{i}}, x_{\vec{i}+\Delta\vec{v}})\), respectively (i.e., linking node a to b and a to c in Figure 1). In sum, we have N² unary potentials, N² horizontal pairwise potentials, and N² vertical pairwise potentials. The size of N was varied from 10 to 100.
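A possible indexing helper for the toroidal lattice just described (the helper and its conventions are ours):

```python
# Horizontal and vertical neighbors of site (i, j) on an N x N torus;
# the modulus implements the toroidal (wraparound) boundary conditions.
def neighbors(i, j, N):
    return [((i + 1) % N, j), ((i - 1) % N, j),
            (i, (j + 1) % N), (i, (j - 1) % N)]

print(neighbors(0, 0, 10))   # [(1, 0), (9, 0), (0, 1), (0, 9)]
```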
The \(\psi_{\vec{i}}(x_{\vec{i}})\) are of the form \(e^{\pm h_{\vec{i}}}\), where the \(h_{\vec{i}}\) are random samples from a gaussian distribution N(μ, σ). The horizontal potentials are of the form \(e^{\pm h_{\vec{i},\vec{i}+\Delta\vec{h}}}\), where the \(h_{\vec{i},\vec{i}+\Delta\vec{h}}\) are random samples from a gaussian N(μ, σ). The vertical potentials are generated in a similar manner. The parameters of these three gaussians were varied (typically we set μ = 0 and σ ∈ {1, 2, 5}).
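A sketch of this random setup (the text only specifies the e^{±h} form; the particular mapping of ±h onto the two spin states, and all names, are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, mu, sigma = 10, 0.0, 1.0

# Unary potentials psi_i(x_i): e^{+h} for one state, e^{-h} for the other
# (our sign convention), with h ~ N(mu, sigma).
h_unary = rng.normal(mu, sigma, size=(N, N))
psi_unary = np.stack([np.exp(h_unary), np.exp(-h_unary)], axis=-1)

def pair_potential():
    h = rng.normal(mu, sigma, size=(N, N))
    sign = np.array([[1.0, -1.0], [-1.0, 1.0]])   # +h if spins agree, -h if not
    return np.exp(h[..., None, None] * sign)      # psi_ij(x_i, x_j)

psi_h = pair_potential()   # horizontal couplings, one per site (torus wraps)
psi_v = pair_potential()   # vertical couplings
print(psi_h.shape)         # (10, 10, 2, 2)
```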
The CCCP algorithm converged rapidly with typically fewer than six iterations of the outer loop (see Figure 7 for 2D spin glass and Figure 8 for 3D spin glass). Most of the Bethe free energy decrease happened in the rst two iterations. Varying the standard deviation of the gaussians (which generate the potential) made no difference. It also happened when we altered the size of the lattice (our default lattice size was 10 £ 10 but we also explored 50 £ 50, 10 £ 10 £ 10, and 20 £ 20 £ 20). During the rst six iterations, the constraints were not always satised (recall we are using ve iterations of the inner loop; see, for example, the top right panel of Figure 8), where the free energy appears to increase at iteration 5. But this (temporary) noncompliance with the constraints did not seem to matter, and the Bethe free energy (almost) always decreased monotonically. The BP algorithm behaved differently as we varied the situation. It was far less stable in general. For the 2D spin glass, the algorithm would not always converge. The convergence was better the smaller the variance of
Figure 7: Performance of CCCP and BP on 2D spin glasses. Bethe free energy plots (top panels) and belief difference plots (bottom panels). Left to right: σ = 1.0, 2.0, 5.0. CCCP solution (full lines), BP solution (dashed lines). Observe that CCCP is more reliable for σ ≥ 2 (although BP often converged correctly for σ = 2.0). Each iteration of CCCP counts as five iterations of BP (we use five iterations in the inner loop). For BP, the Bethe free energy is not very meaningful until convergence.
the gaussian distributions generating the potentials. So if we set σ_p = σ_h = σ_v = 1.0, then BP would often converge, but it became far more unstable when we increased the standard deviations to σ_p = σ_h = σ_v = 5.0. BP did not converge at all for the 3D spin-glass case. (Note that we did not try to help convergence by adding inertia terms; see section 6.2.)

Overall, CCCP was far more stable than BP, and the convergence rate was roughly as fast (taking into account the inner loops). The Bethe free energy of the CCCP solution was always as small as, or smaller than, that found by BP. The forms of the solutions found by both algorithms (when BP converged) were very similar (see Figure 9), though we observed that BP tended to favor beliefs biased toward 0 or 1, while CCCP often gave beliefs closer to 0.5. (By contrast, CCCP for Kikuchi did not show any tendency toward 0.5.)

We conclude that on these examples, CCCP outperformed BP in stability and quality of the results (particularly for 3D spin glasses). These results are preliminary, and more systematic experiments should be performed, but they do provide proof of concept for our approach.

6.2 Kikuchi Implementations. We also implemented CCCP and GBP for the Kikuchi free energy. This was done for the 2D spin-glass case only.
Figure 8: Performance of CCCP and BP on 3D spin glasses. (Top panels) Bethe free energy plots and (bottom panels) belief difference plots. Left to right: σ = 1.0, 2.0, 5.0. CCCP solution (full lines), BP solution (dashed lines).
Once again, we found that CCCP converged very quickly (again we used five iterations of the inner loop as a default). Following Yedidia et al. (2000), we implemented GBP with inertia (we could not get the algorithm to converge without inertia). Yedidia et al. (2000) used an inertia factor of α = 0.5. We were more conservative and set α = 0.9 (which improved stability). (Inertia means that mess_new = α mess_old + (1 − α) mess_update, where α = 0 recovers standard BP and GBP.)
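The inertia rule is a one-line convex combination of the old and newly computed messages; a minimal sketch (our own code):

```python
import numpy as np

# Damped ("inertia") message update: alpha = 0 gives the undamped update,
# while larger alpha keeps more of the old message and stabilizes (G)BP.
def damp(mess_old, mess_update, alpha):
    return alpha * mess_old + (1.0 - alpha) * mess_update

old = np.array([0.7, 0.3])
upd = np.array([0.1, 0.9])
print(damp(old, upd, 0.9))   # mostly the old message: [0.64 0.36]
```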
Figure 9: Bethe 2D spin glass. (Left) CCCP solution. (Center and right) Two BP solutions with different initial conditions. We plot the beliefs b_i(x_i = 0) for all nodes i on the 2D lattice. Observe the similarity between them.
Figure 10: Performance of CCCP Kikuchi and GBP on 2D spin glasses. (Top panels) Kikuchi free energy plots and (bottom panels) belief difference plots. Left to right: σ = 1.0, 2.0, 5.0. CCCP Kikuchi solution (full lines) and GBP solution (dashed lines).
Both algorithms converged with these settings. But CCCP converged more quickly, while GBP appeared to oscillate a bit about the solution before converging. Both gave similar results for the final Kikuchi free energy (see Figure 10). The solutions found by the two algorithms were very similar (see Figure 11). Moreover, the solutions appeared to be practically independent of the starting conditions (which were randomized). Yedidia et al. (2000) reported that GBP gave results on 2D spin glasses that were almost identical to the true solutions (found by using a maximum flow algorithm). This is consistent with our finding that both GBP and CCCP converged to very similar solutions despite randomized starting conditions.

We now give more details of the implementation. The top-level regions are 2 × 2 squares, labeled \(b_{\vec{i},\,\vec{i}+\Delta\vec{h},\,\vec{i}+\Delta\vec{v},\,\vec{i}+\Delta\vec{h}+\Delta\vec{v}}(x_{\vec{i}}, x_{\vec{i}+\Delta\vec{h}}, x_{\vec{i}+\Delta\vec{v}}, x_{\vec{i}+\Delta\vec{h}+\Delta\vec{v}})\). There are two "medium-level" regions (obtained by taking the intersections of adjacent top-level regions): horizontal \(b_{\vec{i},\vec{i}+\Delta\vec{h}}(x_{\vec{i}}, x_{\vec{i}+\Delta\vec{h}})\) and vertical \(b_{\vec{i},\vec{i}+\Delta\vec{v}}(x_{\vec{i}}, x_{\vec{i}+\Delta\vec{v}})\). There are bottom-level regions corresponding to the pixels, \(b_{\vec{i}}(x_{\vec{i}})\).

To implement CCCP for Kikuchi, we need to specify the direct subregions and the direct superregions. It can be seen that the direct subregions of \(b_{\vec{i},\,\vec{i}+\Delta\vec{h},\,\vec{i}+\Delta\vec{v},\,\vec{i}+\Delta\vec{h}+\Delta\vec{v}}\) are \(b_{\vec{i},\vec{i}+\Delta\vec{h}}\), \(b_{\vec{i}+\Delta\vec{v},\,\vec{i}+\Delta\vec{h}+\Delta\vec{v}}\), \(b_{\vec{i},\vec{i}+\Delta\vec{v}}\), and \(b_{\vec{i}+\Delta\vec{h},\,\vec{i}+\Delta\vec{v}+\Delta\vec{h}}\). The direct subregions of \(b_{\vec{i},\vec{i}+\Delta\vec{h}}\) are \(b_{\vec{i}}\) and \(b_{\vec{i}+\Delta\vec{h}}\); those of \(b_{\vec{i},\vec{i}+\Delta\vec{v}}\) are \(b_{\vec{i}}\) and \(b_{\vec{i}+\Delta\vec{v}}\). Of course, \(b_{\vec{i}}\) has no direct (or indirect) subregions.
Figure 11: Kikuchi 2D spin glass. (Left) CCCP solution. (Center and right) Two GBP solutions with different initial conditions. We plot the beliefs b_i(x_i = 0) for all nodes i on the 2D lattice. Observe the similarity of the results on this 50 × 50 grid.
The direct superregions of \(b_{\vec{i}}\) are \(b_{\vec{i},\vec{i}+\Delta\vec{h}}\), \(b_{\vec{i}-\Delta\vec{h},\vec{i}}\), \(b_{\vec{i},\vec{i}+\Delta\vec{v}}\), and \(b_{\vec{i}-\Delta\vec{v},\vec{i}}\). The direct superregions of \(b_{\vec{i},\vec{i}+\Delta\vec{h}}\) are \(b_{\vec{i},\,\vec{i}+\Delta\vec{h},\,\vec{i}+\Delta\vec{v},\,\vec{i}+\Delta\vec{h}+\Delta\vec{v}}\) and \(b_{\vec{i}-\Delta\vec{v},\,\vec{i}+\Delta\vec{h}-\Delta\vec{v},\,\vec{i},\,\vec{i}+\Delta\vec{h}}\). Similarly, the direct superregions of \(b_{\vec{i},\vec{i}+\Delta\vec{v}}\) are \(b_{\vec{i},\,\vec{i}+\Delta\vec{h},\,\vec{i}+\Delta\vec{v},\,\vec{i}+\Delta\vec{h}+\Delta\vec{v}}\) and \(b_{\vec{i}-\Delta\vec{h},\,\vec{i},\,\vec{i}+\Delta\vec{v}-\Delta\vec{h},\,\vec{i}+\Delta\vec{v}}\).

7 Discussion
It is known that BP and GBP algorithms are guaranteed to converge to the correct solution on graphs with no loops. We now argue that CCCP will also converge to the correct answer on such problems. As shown by Yedidia et al. (2000), the fixed points of BP and GBP correspond to extrema of the Bethe and Kikuchi free energies. Hence, for graphs with no loops, there can be only a single extremum of the Bethe and Kikuchi free energies. These free energies are bounded below, and so the extremum must be a minimum. The free energy therefore has a single unique global minimum, and CCCP is guaranteed to find it.

We also emphasize the generality of the Kikuchi approximation. In this article, we considered distributions P(x_1, ..., x_N | y) with pairwise interactions only (the highest-order terms were of the form ψ_ij(x_i, x_j)). The Kikuchi approximation, however, can be extended to interactions of arbitrarily high order (e.g., terms such as \(\psi_{i_1,\ldots,i_N}(x_{i_1}, \ldots, x_{i_N})\)). Good results are reported for these cases (Yedidia, private communication, August 2001).

Finally, we observe that both the Bethe and Kikuchi free energies can be reformulated in terms of dual variables (Strang, 1986). This duality is standard for quadratic optimization problems. The BP algorithm can be expressed in terms of the dynamics in this dual space. If we take the Bethe free energy and extremize with respect to the variables {b_ij}, {b_i}, it is possible to solve directly for these variables in terms of the Lagrange multipliers {λ_ij}, {γ_ij}. We can then substitute this back into the
Bethe free energy to obtain the dual energy:

\[
G(\{\gamma_{ij}\}, \{\lambda_{ij}\}) =
-\sum_{i,j:\, i>j} \sum_{x_i,x_j} \psi_{ij}(x_i,x_j)\,
e^{-\{\lambda_{ij}(x_j) + \lambda_{ji}(x_i)\}}\, e^{-1}\, e^{-\gamma_{ij}}
+ \sum_i (n_i - 1) \sum_{x_i} \psi_i(x_i)\,
e^{-\frac{1}{n_i-1}\sum_j \lambda_{ji}(x_i)}\, e^{-1}
- \sum_{i,j:\, i>j} \gamma_{ij}.
\tag{7.1}
\]
BP is given by the update rule:

\[
\frac{\partial g}{\partial e^{-\lambda_{ji}(x_i)}}\big(\lambda_{ji}(x_i;\tau+1)\big)
= -\frac{\partial f}{\partial e^{-\lambda_{ji}(x_i)}}\big(\lambda_{ji}(x_i;\tau)\big),
\]
\[
\frac{\partial g}{\partial \gamma_{ij}}\big(\gamma_{ij}(\tau+1)\big)
= -\frac{\partial f}{\partial \gamma_{ij}}\big(\gamma_{ij}(\tau)\big),
\tag{7.2}
\]
where f(·) and g(·) are the first and second terms in G. By standard properties of duality, extrema of the dual energy G correspond to extrema of the Bethe free energy. Unfortunately, minima do not correspond to maxima (which they would if the Bethe free energy were convex). This means that it is hard to obtain global convergence results by analyzing the dual energy.

Similarly, we can calculate the dual of the Kikuchi free energy. This is of the form:

\[
G_K(\{\gamma_r\}, \{\lambda_{r,s}\}) =
-\sum_{r\in R} c_r \sum_{x_r} e^{-E_r(x_r)}\, e^{-1}\, e^{-\gamma_r/c_r}\,
e^{-(1/c_r)\sum_{s\in sub_d(r)} \lambda_{r,s}(x_s)}\,
e^{(1/c_r)\sum_{v\in sup_d(r)} \lambda_{v,r}(x_r)}
- \sum_{r\in R} \gamma_r.
\tag{7.3}
\]
8 Conclusion
This article introduced CCCP double-loop algorithms that are proved to converge to extrema of the Bethe and Kikuchi free energies. They are convergent alternatives to BP and GBP (whose main error mode is failure to converge). We showed that there are similarities between CCCP and BP and GBP. More precisely, the CCCP algorithm updates Lagrange parameter variables {λ, γ}, while BP and GBP update messages {m}; the exponentials of the {λ} correspond to linear combinations of the messages (and the {γ} are implicit in BP and GBP). In BP and GBP, the beliefs {b} are expressed in terms of the messages {m}, but for CCCP, they are expressed in terms of the {λ, γ}. The difference is that for BP and GBP, the update rules for the {m}
do not depend explicitly on the current estimates of the {b}. But the CCCP updates for {λ, γ} do depend on the current {b}, and the estimates of the {b} must be periodically reestimated. We can make BP, GBP, and CCCP very similar by expressing the {b} as a function of the {λ, γ} (this collapses the double loop), which makes the algorithms very similar (but prevents the convergence proof from holding).

Our computer simulations on spin glasses showed that the CCCP algorithms are stable, converge rapidly, and give solutions as good as, or better than, those found by BP and GBP. (Convergence rates of CCCP were similar to those of BP and GBP when five iterations of the inner loop were used.) In particular, we found that BP often did not converge on 2D spin glasses and never converged at all on 3D spin glasses. GBP, however, did converge on 2D spin glasses provided inertia was used.

Finally, we stress that the Bethe and Kikuchi free energies are variational techniques that can be applied to a range of inference and learning applications (Yedidia et al., 2000). They are better approximations than standard mean-field theory and hence give better results on certain applications (Weiss, 2001). Mean-field theory algorithms can be very effective on difficult optimization problems (see Rangarajan, Gold, & Mjolsness, 1996). But it is highly probable that Bethe and Kikuchi approximations (with CCCP or BP and GBP algorithms) will perform well on a large range of applications.

Acknowledgments
I acknowledge helpful conversations with James Coughlan and Huiying Shen. Yair Weiss pointed out a conceptual bug in the original dual formulation. Anand Rangarajan pointed out connections to Legendre transforms. Sabino Ferreira read the manuscript carefully, made many useful comments, and gave excellent feedback. Jonathan Yedidia pointed out a faulty assumption about the {c_r}. Two referees gave helpful feedback. This work was supported by the National Institutes of Health (NEI) with grant number RO1-EY 12691-01. It was also supported by the National Science Foundation with award number IRI-9700446.

References

Arbib, M. (Ed.). (1995). The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo codes (I). In Proc. ICC'93 (pp. 1064–1070). Geneva, Switzerland.
Chiang, M., & Forney, G. D., Jr. (2001). Statistical physics, convex optimization and the sum product algorithm (LIDS Tech. Rep.). Cambridge, MA: Laboratory for Information and Decision Systems, MIT.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Domb, C., & Green, M. S. (Eds.). (1972). Phase transitions and critical phenomena (Vol. 2). London: Academic Press.
Forney, G. D., Jr. (2001). Codes on graphs: News and views. Paper presented at the 2001 Conference on Information Sciences and Systems, Johns Hopkins University, Baltimore, MD.
Freeman, W. T., & Pasztor, E. C. (1999). Learning low-level vision. In Proc. International Conference on Computer Vision (pp. 1182–1189). Los Alamitos, CA: IEEE Computer Society.
Frey, B. (1998). Graphical models for pattern classification, data compression and channel coding. Cambridge, MA: MIT Press.
Gallager, R. G. (1963). Low-density parity-check codes. Cambridge, MA: MIT Press.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Kosowsky, J. J. (1995). Flows suspending iterative algorithms. Unpublished doctoral dissertation, Harvard University, Cambridge, MA.
Kosowsky, J. J., & Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7, 477–490.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501–504.
McEliece, R. J., Mackay, D. J. C., & Cheng, J. F. (1998). Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.
Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in AI. San Mateo, CA: Morgan Kaufmann.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8(5), 1041–1060.
Rangarajan, A., Yuille, A. L., Gold, S., & Mjolsness, E. (1996). A convergence proof for the softassign quadratic assignment problem. In Proceedings of NIPS'96. Snowmass, CO.
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35, 876–879.
Strang, G. (1986). Introduction to applied mathematics. Wellesley, MA: Wellesley-Cambridge Press.
Teh, Y. W., & Welling, M. (2001). Passing and bouncing messages for generalized inference (Tech. Rep. GCTU 2001-001). London: Gatsby Computational Neuroscience Unit, University College London.
Wainwright, M., Jaakkola, T., & Willsky, A. (2001). Tree-based reparameterization framework for approximate estimation of stochastic processes on graphs with cycles (Tech. Rep. LIDS P-2510). Cambridge, MA: Laboratory for Information and Decision Systems, MIT.
Waugh, F. R., & Westervelt, R. M. (1993). Analog neural networks with local competition: I. Dynamics and stability. Physical Review E, 47(6), 4524–4536.
1722
A. L. Yuille
Weiss, Y. (2001). Comparing the mean eld method and belief propagation for approximate inference in MRFs. In M. Oppor & D. Saad (Eds.), Advanced mean eld methods. Cambridge, MA: MIT Press. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000). Bethe free energy, Kikuchi approximations and belief propagation algorithms. In S. A. Solla, T. K. Leen, & K.-R. Muller ¨ (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press. Yuille, A. L., & Kosowsky, J. J. (1994).Statistical physics algorithms that converge. Neural Computation, 6, 341–356. Yuille, A. L., & Rangarajan, A. (2001). The concave convex principle (CCCP). In T. K. Leen, T. G. Diettrich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press. Received March 15, 2001; accepted November 19, 2001.
LETTER
Communicated by Barak Pearlmutter
Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent

Nicol N. Schraudolph
[email protected]
IDSIA, Galleria 2, 6928 Manno, Switzerland, and Institute of Computational Science, ETH Zentrum, 8092 Zürich, Switzerland

We propose a generic method for iteratively approximating various second-order gradient steps—Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient—in linear time per iteration, using special curvature matrix-vector products that can be computed in $O(n)$. Two recent acceleration techniques for on-line learning, matrix momentum and stochastic meta-descent (SMD), implement this approach. Since both were originally derived by very different routes, this offers fresh insight into their operation, resulting in further improvements to SMD.

1 Introduction
Second-order gradient descent methods typically multiply the local gradient by the inverse of a matrix $\bar C$ of local curvature information. Depending on the specific method used, this $n \times n$ matrix (for a system with $n$ parameters) may be the Hessian (Newton's method), an approximation or modification thereof (e.g., Gauss-Newton, Levenberg-Marquardt), or the Fisher information (natural gradient—Amari, 1998). These methods may converge rapidly but are computationally quite expensive: the time complexity of common methods to invert $\bar C$ is $O(n^3)$, and iterative approximations cost at least $O(n^2)$ per iteration if they compute $\bar C^{-1}$ directly, since that is the time required just to access the $n^2$ elements of this matrix.

Note, however, that second-order gradient methods do not require $\bar C^{-1}$ explicitly: all they need is its product with the gradient. This is exploited by Yang and Amari (1998) to compute efficiently the natural gradient for multilayer perceptrons with a single output and one hidden layer: assuming independently and identically distributed (i.i.d.) gaussian input, they explicitly derive the form of the Fisher information matrix and its inverse for their system and find that the latter's product with the gradient can be computed in just $O(n)$ steps. However, the resulting algorithm is rather complicated and does not lend itself to being extended to more complex adaptive systems (such as multilayer perceptrons with more than one output or hidden layer), curvature matrices other than the Fisher information, or inputs that are far from i.i.d. gaussian.

Neural Computation 14, 1723–1738 (2002) © 2002 Massachusetts Institute of Technology
In order to set up a general framework that admits such extensions (and indeed applies to any twice-differentiable adaptive system), we abandon the notion of calculating the exact second-order gradient step in favor of an iterative approximation. The following iteration efficiently approaches $\vec v = \bar C^{-1}\vec u$ for an arbitrary vector $\vec u$ (Press, Teukolsky, Vetterling, & Flannery, 1992, page 57):
$$\vec v_0 = 0; \qquad (\forall t \ge 0) \quad \vec v_{t+1} = \vec v_t + D\,(\vec u - \bar C \vec v_t), \tag{1.1}$$
where $D$ is a conditioning matrix, chosen close to $\bar C^{-1}$ if possible. Note that if we restrict $D$ to be diagonal, all operations in equation 1.1 can be performed in $O(n)$ time, except (one would suppose) for the matrix-vector product $\bar C \vec v_t$.

In fact, there is an $O(n)$ method for calculating the product of an $n \times n$ matrix with an arbitrary vector—if the matrix happens to be the Hessian of a system whose gradient can be calculated in $O(n)$, as is the case for most adaptive architectures encountered in practice. This fast Hessian-vector product (Pearlmutter, 1994; Werbos, 1988; Møller, 1993) can be used in conjunction with equation 1.1 to create an efficient, iterative $O(n)$ implementation of Newton's method.

Unfortunately, Newton's method has severe stability problems when used in nonlinear systems, stemming from the fact that the Hessian may be ill-conditioned and is not guaranteed to be positive definite. Practical second-order methods therefore prefer measures of curvature that are better behaved, such as the outer product (Gauss-Newton) approximation of the Hessian, a model-trust region modification of the same (Levenberg, 1944; Marquardt, 1963), or the Fisher information. Below, we define these matrices in a maximum likelihood framework for regression and classification and describe $O(n)$ algorithms for computing the product of any of them with an arbitrary vector for neural network architectures. These curvature matrix-vector products are, in fact, cheaper still than the fast Hessian-vector product and can be used in conjunction with equation 1.1 to implement rapid, iterative, optionally stochastic $O(n)$ variants of second-order gradient descent methods. The resulting algorithms are very general, practical (i.e., sufficiently robust and efficient), far less expensive than the conventional $O(n^2)$ and $O(n^3)$ approaches, and—with the aid of automatic differentiation software tools—comparatively easy to implement (see section 4).
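As a purely illustrative sketch of iteration 1.1 (not the paper's code), the snippet below runs the update on a small explicit matrix; the matrix $C$, vector $u$, and Jacobi-style diagonal conditioner $D$ are made-up stand-ins, since in the paper's setting the product $C\vec v$ would come from a fast $O(n)$ pass rather than an explicit matrix.

```python
# Illustrative sketch of iteration 1.1: v <- v + D (u - C v) approaches
# v = C^-1 u. C, u, and D below are made-up stand-ins for this demo.

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

C = [[4.0, 1.0, 0.0],     # diagonally dominant, so the iteration converges
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 5.0]]
u = [1.0, 2.0, 3.0]
D = [1.0 / C[i][i] for i in range(3)]   # diagonal conditioner, roughly C^-1

v = [0.0, 0.0, 0.0]
for _ in range(200):
    Cv = mat_vec(C, v)                  # the only matrix-vector product needed
    v = [vi + di * (ui - cvi) for vi, di, ui, cvi in zip(v, D, u, Cv)]
# v now satisfies C v ~= u, i.e. v approximates C^-1 u
```

With a diagonal $D$, each sweep costs $O(n)$ apart from the $C\vec v$ product, which is exactly the operation the fast passes of section 4 supply.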
We then examine two learning algorithms that use this approach: matrix momentum (Orr, 1995; Orr & Leen, 1997) and stochastic meta-descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000). Since both methods were derived by entirely different routes, viewing them as implementations of iteration 1.1 will provide additional insight into their operation and suggest new ways to improve them.
2 Definitions and Notation

Network. A neural network with $m$ inputs, $n$ weights, and $o$ linear outputs is usually regarded as a mapping $\mathbb R^m \to \mathbb R^o$ from an input pattern $\vec x$ to the corresponding output $\vec y$, for a given vector $\vec w$ of weights. Here we formalize such a network instead as a mapping $\mathcal N: \mathbb R^n \to \mathbb R^o$ from weights to outputs (for given inputs), and write $\vec y = \mathcal N(\vec w)$. To extend this formalism to networks with nonlinear outputs, we define the output nonlinearity $\mathcal M: \mathbb R^o \to \mathbb R^o$ and write $\vec z = \mathcal M(\vec y) = \mathcal M(\mathcal N(\vec w))$. For networks with linear outputs, $\mathcal M$ will be the identity map.

Loss function. We consider neural network learning as the minimization of a scalar loss function $L: \mathbb R^o \to \mathbb R$ defined as the log-likelihood $L(\vec z) \equiv -\log \Pr(\vec z)$ of the output $\vec z$ under a suitable statistical model (Bishop, 1995). For supervised learning, $L$ may also implicitly depend on given targets $\vec z^*$ for the outputs. Formally, the loss can now be regarded as a function $L(\mathcal M(\mathcal N(\vec w)))$ of the weights, for a given set of inputs and (if supervised) targets.

Jacobian and gradient. The Jacobian $J_F$ of a function $F: \mathbb R^m \to \mathbb R^n$ is the $n \times m$ matrix of partial derivatives of the outputs of $F$ with respect to its inputs. For a neural network defined as above, the gradient—the vector $\vec g$ of derivatives of the loss with respect to the weights—is given by
$$\vec g \equiv \frac{\partial}{\partial \vec w}\,L(\mathcal M(\mathcal N(\vec w))) = J'_{L \circ \mathcal M \circ \mathcal N} = J'_{\mathcal N}\, J'_{\mathcal M}\, J'_{L}, \tag{2.1}$$

where $\circ$ denotes function composition and $'$ the matrix transpose.
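The chain rule of equation 2.1 can be spot-checked numerically. In the toy model below (all values are assumptions made for illustration), $\mathcal N$ is a single linear output over two weights, $\mathcal M$ the logistic function, and $L$ the cross-entropy loss; the chained Jacobians reproduce a finite-difference gradient, and the matching-loss identity $J'_{L\circ\mathcal M} = \vec z - \vec z^*$ gives the same result.

```python
import math

# Numeric spot-check of equation 2.1 on a made-up toy model: N is a single
# linear output y = w1*x1 + w2*x2, M the logistic function, L the
# cross-entropy loss for target z_star. All values here are assumptions.

x = [0.5, -1.2]
z_star = 1.0

def model_loss(w):
    y = w[0] * x[0] + w[1] * x[1]                  # N
    z = 1.0 / (1.0 + math.exp(-y))                 # M
    return -z_star * math.log(z) - (1.0 - z_star) * math.log(1.0 - z)  # L

w = [0.3, 0.7]
y = w[0] * x[0] + w[1] * x[1]
z = 1.0 / (1.0 + math.exp(-y))

J_L = (z - z_star) / (z * (1.0 - z))               # dL/dz
J_M = z * (1.0 - z)                                # dz/dy
g = [x_i * J_M * J_L for x_i in x]                 # equation 2.1: J_N' J_M' J_L'

# matching-loss identity: J'_{L o M} = z - z_star, so g = x * (z - z_star)
g_short = [x_i * (z - z_star) for x_i in x]

eps = 1e-6                                         # finite-difference check
g_fd = [(model_loss([w[0] + eps, w[1]]) - model_loss([w[0] - eps, w[1]])) / (2 * eps),
        (model_loss([w[0], w[1] + eps]) - model_loss([w[0], w[1] - eps])) / (2 * eps)]
```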
Matching loss functions. We say that the loss function $L$ matches the output nonlinearity $\mathcal M$ iff $J'_{L \circ \mathcal M} = A\vec z + \vec b$ for some $A$ and $\vec b$ not dependent on $\vec w$.¹ The standard loss functions used in neural network regression and classification—sum-squared error for linear outputs and cross-entropy error for softmax or logistic outputs—are all matching loss functions with $A = I$ (the identity matrix) and $\vec b = -\vec z^*$, so that $J'_{L \circ \mathcal M} = \vec z - \vec z^*$ (Bishop, 1995, chapter 6). This will simplify some of the calculations described in section 4.

Hessian. The instantaneous Hessian $H_F$ of a scalar function $F: \mathbb R^n \to \mathbb R$ is the $n \times n$ matrix of second derivatives of $F(\vec w)$ with respect to its inputs $\vec w$:

$$H_F \equiv \frac{\partial J_F}{\partial \vec w'}, \quad \text{i.e.,} \quad (H_F)_{ij} = \frac{\partial^2 F(\vec w)}{\partial w_i\, \partial w_j}. \tag{2.2}$$
¹ For supervised learning, a similar if somewhat more restrictive definition of matching loss functions is given by Helmbold, Kivinen, and Warmuth (1996) and Auer, Herbster, and Warmuth (1996).
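Definition 2.2 can be illustrated with a small throwaway computation; the two-weight function $F$ below is a made-up example, and each second partial is approximated with a central-difference stencil.

```python
# Numeric illustration of definition 2.2 on a made-up toy function:
# the instantaneous Hessian is the matrix of second partials, obtained
# here by central finite differences of F.

def F(w):
    return w[0] ** 2 * w[1] + w[1] ** 3

def hessian_fd(F, w, eps=1e-4):
    n = len(w)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            wpp = list(w); wpp[i] += eps; wpp[j] += eps
            wpm = list(w); wpm[i] += eps; wpm[j] -= eps
            wmp = list(w); wmp[i] -= eps; wmp[j] += eps
            wmm = list(w); wmm[i] -= eps; wmm[j] -= eps
            H[i][j] = (F(wpp) - F(wpm) - F(wmp) + F(wmm)) / (4 * eps ** 2)
    return H

H = hessian_fd(F, [0.5, 2.0])
# analytically: H = [[2*w2, 2*w1], [2*w1, 6*w2]] at w = (0.5, 2.0)
```

Note that this brute-force stencil costs $O(n^2)$ function evaluations, which is precisely what the fast passes of section 4 avoid.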
For a neural network as defined above, we abbreviate $H \equiv H_{L \circ \mathcal M \circ \mathcal N}$. The Hessian proper, which we denote $\bar H$, is obtained by taking the expectation of $H$ over inputs: $\bar H \equiv \langle H\rangle_{\vec x}$. For matching loss functions, $H_{L \circ \mathcal M} = A J_{\mathcal M} = J'_{\mathcal M} A'$.

Fisher information. The instantaneous Fisher information matrix $F_F$ of a scalar log-likelihood function $F: \mathbb R^n \to \mathbb R$ is the $n \times n$ matrix formed by the outer product of its first derivatives:

$$F_F \equiv J'_F\, J_F, \quad \text{i.e.,} \quad (F_F)_{ij} = \frac{\partial F(\vec w)}{\partial w_i}\, \frac{\partial F(\vec w)}{\partial w_j}. \tag{2.3}$$
Note that $F_F$ always has rank one. Again, we abbreviate $F \equiv F_{L \circ \mathcal M \circ \mathcal N} = \vec g\, \vec g'$. The Fisher information matrix proper, $\bar F \equiv \langle F\rangle_{\vec x}$, describes the geometric structure of weight space (Amari, 1985) and is used in the natural gradient descent approach (Amari, 1998).

3 Extended Gauss-Newton Approximation

Problems with the Hessian. The use of the Hessian in second-order gradient descent for neural networks is problematic. For nonlinear systems, $\bar H$ is not necessarily positive definite, so Newton's method may diverge or even take steps in uphill directions. Practical second-order gradient methods should therefore use approximations or modifications of the Hessian that are known to be reasonably well behaved, with positive semidefiniteness as a minimum requirement.

Fisher information. One alternative that has been proposed is the Fisher information matrix $\bar F$ (Amari, 1998), which, being a quadratic form, is positive semidefinite by definition. On the other hand, $\bar F$ ignores all second-order interactions between system parameters, thus throwing away potentially useful curvature information. By contrast, we shall derive an approximation of the Hessian that is provably positive semidefinite even though it does make use of second derivatives to model Hessian curvature better.

Gauss-Newton. An entire class of popular optimization techniques for nonlinear least-squares problems, as implemented by neural networks with linear outputs and sum-squared loss function, is based on the well-known Gauss-Newton (also referred to as linearized, outer product, or squared Jacobian) approximation of the Hessian. Here we extend the Gauss-Newton approach to other standard loss functions—in particular, the cross-entropy loss used in neural network classification—in such a way that even though
some second-order information is retained, positive semidefiniteness can still be proved. Using the product rule, the instantaneous Hessian of our neural network model can be written as

$$H = \frac{\partial}{\partial \vec w'}\,\big(J_{L \circ \mathcal M}\, J_{\mathcal N}\big) = J'_{\mathcal N}\, H_{L \circ \mathcal M}\, J_{\mathcal N} + \sum_{i=1}^{o} (J_{L \circ \mathcal M})_i\, H_{\mathcal N_i}, \tag{3.1}$$

where $i$ ranges over the $o$ outputs of $\mathcal N$, with $\mathcal N_i$ denoting the subnetwork that produces the $i$th output. Ignoring the second term above, we define the extended, instantaneous Gauss-Newton matrix:

$$G \equiv J'_{\mathcal N}\, H_{L \circ \mathcal M}\, J_{\mathcal N}. \tag{3.2}$$
Note that $G$ has rank $\le o$ (the number of outputs) and is positive semidefinite, regardless of the choice of architecture for $\mathcal N$, provided that $H_{L \circ \mathcal M}$ is. $G$ models the second-order interactions among $\mathcal N$'s outputs (via $H_{L \circ \mathcal M}$) while ignoring those arising within $\mathcal N$ itself ($H_{\mathcal N_i}$). This constitutes a compromise between the Hessian (which models all second-order interactions) and the Fisher information (which ignores them all). For systems with a single linear output and sum-squared error, $G$ reduces to $F$. For multiple outputs, it provides a richer ($\operatorname{rank}(G) \le o$ versus $\operatorname{rank}(F) = 1$) model of Hessian curvature.

Standard Loss Functions. For the standard loss functions used in neural network regression and classification, $G$ has additional interesting properties: First, the residual $J'_{L \circ \mathcal M} = \vec z - \vec z^*$ vanishes at the optimum for realizable problems, so that the Gauss-Newton approximation, equation 3.2, of the Hessian, equation 3.1, becomes exact in this case. For unrealizable problems, the residuals at the optimum have zero mean; this will tend to make the last term in equation 3.1 vanish in expectation, so that we can still assume $\bar G \approx \bar H$ near the optimum.

Second, in each case we can show that $H_{L \circ \mathcal M}$ (and hence $G$, and hence $\bar G$) is positive semidefinite. For linear outputs with sum-squared loss—that is, conventional Gauss-Newton—$H_{L \circ \mathcal M} = J_{\mathcal M}$ is just the identity $I$; for independent logistic outputs with cross-entropy loss, it is $\operatorname{diag}[\operatorname{diag}(\vec z)(1 - \vec z)]$, positive semidefinite because $(\forall i)\ 0 < z_i < 1$. For softmax output with cross-entropy loss, we have $H_{L \circ \mathcal M} = \operatorname{diag}(\vec z) - \vec z\, \vec z'$, which is also positive
semidefinite since $(\forall i)\ z_i > 0$ and $\sum_i z_i = 1$, and thus $(\forall \vec v \in \mathbb R^o)$

$$\begin{aligned}
\vec v'\,[\operatorname{diag}(\vec z) - \vec z\, \vec z'\,]\,\vec v
&= \sum_i z_i v_i^2 - \Big(\sum_i z_i v_i\Big)^2\\
&= \sum_i z_i v_i^2 - 2 \sum_i z_i v_i \sum_j z_j v_j + \Big(\sum_j z_j v_j\Big)^2\\
&= \sum_i z_i \Big(v_i - \sum_j z_j v_j\Big)^2 \;\ge\; 0.
\end{aligned} \tag{3.3}$$
Model-Trust Region. As long as $G$ is positive semidefinite—as proved above for standard loss functions—the extended Gauss-Newton algorithm will not take steps in uphill directions. However, it may still take very large (even infinite) steps. These may take us outside the model-trust region, the area in which our quadratic model of the error surface is reasonable. Model-trust region methods restrict the gradient step to a suitable neighborhood around the current point.

One popular way to enforce a model-trust region is the addition of a small diagonal term to the curvature matrix. Levenberg (1944) suggested adding $\lambda I$ to the Gauss-Newton matrix $\bar G$; Marquardt (1963) elaborated the additive term to $\lambda \operatorname{diag}(\bar G)$. The Levenberg-Marquardt algorithm directly inverts the resulting curvature matrix; where affordable (i.e., for relatively small systems), it has become today's workhorse of nonlinear least-squares optimization.

4 Fast Curvature Matrix-Vector Products
We now describe algorithms that compute the product of $F$, $G$, or $H$ with an arbitrary $n$-dimensional vector $\vec v$ in $O(n)$. They can be used in conjunction with equation 1.1 to implement rapid and (if so desired) stochastic versions of various second-order gradient descent methods, including Newton's method, Gauss-Newton, Levenberg-Marquardt, and natural gradient descent.

4.1 The Passes. The fast matrix-vector products are all constructed from the same set of passes in which certain quantities are propagated through all or part of our neural network model (comprising $\mathcal N$, $\mathcal M$, and $L$) in forward or reverse direction. For implementation purposes, it should be noted that automatic differentiation software tools² can automatically produce these passes from a program implementing the basic forward pass $f_0$.
² See http://www-unix.mcs.anl.gov/autodiff/.
$f_0$. This is the ordinary forward pass of a neural network, evaluating the function $F(\vec w)$ it implements by propagating activity (i.e., intermediate results) forward through $F$.

$r_1$. The ordinary backward pass of a neural network, calculating $J'_F \vec u$ by propagating the vector $\vec u$ backward through $F$. This pass uses intermediate results computed in the $f_0$ pass.

$f_1$. Following Pearlmutter (1994), we define the Gateaux derivative

$$\mathcal R_{\vec v}\,(F(\vec w)) \equiv \left. \frac{\partial F(\vec w + r \vec v)}{\partial r} \right|_{r=0} = J_F\, \vec v, \tag{4.1}$$
which describes the effect on a function $F(\vec w)$ of a weight perturbation in the direction of $\vec v$. By pushing $\mathcal R_{\vec v}$, which obeys the usual rules for differential operators, down into the equations of the forward pass $f_0$, one obtains an efficient procedure to calculate $J_F \vec v$ from $\vec v$. (See Pearlmutter, 1994, for details and examples.) This $f_1$ pass uses intermediate results from the $f_0$ pass.
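A minimal sketch of the $f_1$ pass, under the assumption of a made-up two-weight function $F(\vec w) = \tanh(w_1 w_2) + w_1^2$: the $\mathcal R_{\vec v}$ operator is pushed through each intermediate step of $f_0$, and the result agrees with the finite-difference form of the Gateaux derivative in equation 4.1.

```python
import math

# Sketch of the R-operator on a made-up toy function F(w) = tanh(w1*w2) + w1^2.
# The f1 pass propagates the perturbation v through the same steps as f0.

def f0(w):
    a = w[0] * w[1]
    return math.tanh(a) + w[0] ** 2

def f1(w, v):
    # R_v applied to each intermediate quantity of f0
    a = w[0] * w[1]
    Ra = v[0] * w[1] + w[0] * v[1]          # product rule on a = w1*w2
    return (1 - math.tanh(a) ** 2) * Ra + 2 * w[0] * v[0]

w, v = [0.4, -1.1], [1.0, 2.0]
r = 1e-6
fd = (f0([w[0] + r * v[0], w[1] + r * v[1]])
      - f0([w[0] - r * v[0], w[1] - r * v[1]])) / (2 * r)
# f1(w, v) agrees with the finite-difference Gateaux derivative fd
```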
$r_2$. When the $\mathcal R_{\vec v}$ operator is applied to the $r_1$ pass for a scalar function $F$, one obtains an efficient procedure for calculating the Hessian-vector product $H_F \vec v = \mathcal R_{\vec v}(J'_F)$. (See Pearlmutter, 1994, for details and examples.) This $r_2$ pass uses intermediate results from the $f_0$, $f_1$, and $r_1$ passes.

4.2 The Algorithms. The first step in all three matrix-vector products is computing the gradient $\vec g$ of our neural network model by standard backpropagation:
Gradient. $\vec g \equiv J'_{L \circ \mathcal M \circ \mathcal N}$ is computed by an $f_0$ pass through the entire model ($\mathcal N$, $\mathcal M$, and $L$), followed by an $r_1$ pass propagating $\vec u = 1$ back through the entire model ($L$, $\mathcal M$, then $\mathcal N$). For matching loss functions, there is a shortcut: since $J'_{L \circ \mathcal M} = A\vec z + \vec b$, we can limit the forward pass to $\mathcal N$ and $\mathcal M$ (to compute $\vec z$), then $r_1$-propagate $\vec u = A\vec z + \vec b$ back through just $\mathcal N$.

Fisher Information. To compute $F\vec v = \vec g\, \vec g'\, \vec v$, multiply the gradient $\vec g$ by the inner product between $\vec g$ and $\vec v$. If there is no random access to $\vec g$ or $\vec v$—that is, its elements can be accessed only through passes like the above—the scalar $\vec g'\, \vec v$ can instead be calculated by $f_1$-propagating $\vec v$ forward through the model ($\mathcal N$, $\mathcal M$, and $L$). This step is also necessary for the other two matrix-vector products.

Hessian. After $f_1$-propagating $\vec v$ forward through $\mathcal N$, $\mathcal M$, and $L$, $r_2$-propagate $\mathcal R_{\vec v}(1) = 0$ back through the entire model ($L$, $\mathcal M$, then $\mathcal N$) to obtain $H\vec v = \mathcal R_{\vec v}(\vec g)$ (Pearlmutter, 1994). For matching loss functions, the shortcut is
Table 1: Choice of Curvature Matrix $C$ for Various Gradient Descent Methods, Passes Needed to Compute Gradient $\vec g$ and Fast Matrix-Vector Product $\bar C \vec v$, and Associated Cost (for a Multilayer Perceptron) in Flops per Weight and Pattern.

                                 Pass
                              f0     r1       f1      r2     Cost
    result:                   F      J'_F u   J_F v   H_F v  (for g and C̄v)
    cost:                     2      3        4       7
    ---------------------------------------------------------------
    C = I  simple gradient    ✓      ✓                        6
    C = F  natural gradient   ✓      ✓        ✓              10
    C = G  Gauss-Newton       ✓      ✓✓       (✓)            14
    C = H  Newton's method    ✓      ✓        ✓       ✓      18
to $f_1$-propagate $\vec v$ through just $\mathcal N$ and $\mathcal M$ to obtain $\mathcal R_{\vec v}(\vec z)$, then $r_2$-propagate $\mathcal R_{\vec v}(J'_{L \circ \mathcal M}) = A\, \mathcal R_{\vec v}(\vec z)$ back through just $\mathcal N$.

Gauss-Newton. Following the $f_1$ pass, $r_2$-propagate $\mathcal R_{\vec v}(1) = 0$ back through $L$ and $\mathcal M$ to obtain $\mathcal R_{\vec v}(J'_{L \circ \mathcal M}) = H_{L \circ \mathcal M}\, J_{\mathcal N}\, \vec v$, then $r_1$-propagate that back through $\mathcal N$, giving $G\vec v$. For matching loss functions, we do not require an $r_2$ pass. Since

$$G = J'_{\mathcal N}\, H_{L \circ \mathcal M}\, J_{\mathcal N} = J'_{\mathcal N}\, J'_{\mathcal M}\, A'\, J_{\mathcal N}, \tag{4.2}$$

we can limit the $f_1$ pass to $\mathcal N$, multiply the result with $A'$, then $r_1$-propagate it back through $\mathcal M$ and $\mathcal N$. Alternatively, one may compute the equivalent $G\vec v = J'_{\mathcal N}\, A\, J_{\mathcal M}\, J_{\mathcal N}\, \vec v$ by continuing the $f_1$ pass through $\mathcal M$, multiplying with $A$, then $r_1$-propagating back through $\mathcal N$ only.

Batch Average. To calculate the product of a curvature matrix $\bar C \equiv \langle C\rangle_{\vec x}$, where $C$ is one of $F$, $G$, or $H$, with vector $\vec v$, average the instantaneous product $C\vec v$ over all input patterns $\vec x$ (and associated targets $\vec z^*$, if applicable) while holding $\vec v$ constant. For large training sets or nonstationary streams of data, it is often preferable to estimate $\bar C\vec v$ by averaging over "mini-batches" of (typically) just 5 to 50 patterns.

4.3 Computational Cost. Table 1 summarizes, for a number of gradient descent methods, their choice of curvature matrix $C$, the passes needed (for a matching loss function) to calculate both the gradient $\vec g$ and the fast matrix-vector product $\bar C\vec v$, and the associated computational cost in terms of floating-point operations (flops) per weight and pattern in a multilayer perceptron. These figures ignore certain optimizations (e.g., not propagating gradients back to the inputs) and assume that any computation at the network's nodes is dwarfed by that required for the weights. Computing both gradient and curvature matrix-vector product is typically about two to three times as expensive as calculating the gradient alone.
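A hedged sketch of the matching-loss Gauss-Newton product, for the simplest case $A = I$ (linear outputs, sum-squared error), where $G\vec v = J'_{\mathcal N} J_{\mathcal N} \vec v$. The toy "network" below is an assumption: it maps a $2\times 2$ weight matrix $W$ to outputs $\vec y = W\vec x$ for a fixed input $\vec x$; `jvp` plays the role of the $f_1$ pass, `vjp` of the $r_1$ pass, and no $r_2$ pass is needed.

```python
# Sketch of the fast G v product for a matching loss with A = I, so that
# G v = J_N' (J_N v). The toy network N (an assumption) maps a 2x2 weight
# matrix W to outputs y = W x for a fixed input x; v is a weight
# perturbation V laid out like W.

x = [0.5, -2.0]

def jvp(V):
    # f1 pass: J_N v, the effect of the weight perturbation V on the outputs
    return [sum(V_kj * x_j for V_kj, x_j in zip(row, x)) for row in V]

def vjp(u):
    # r1 pass: J_N' u, mapping an output-space vector back to weight space
    return [[u_k * x_j for x_j in x] for u_k in u]

V = [[1.0, 0.0], [2.0, -1.0]]
Gv = vjp(jvp(V))       # G v computed with one f1 pass and one r1 pass
```

Only two $O(n)$ passes are used, never the explicit $n \times n$ matrix $G$; its rank is at most $o = 2$ here, as noted in section 3.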
In combination with iteration 1.1, however, one can use the $O(n)$ matrix-vector product to implement second-order gradient methods whose rapid convergence more than compensates for the additional cost. We describe two such algorithms in the following section.

5 Rapid Second-Order Gradient Descent
We know of two neural network learning algorithms that combine the $O(n)$ curvature matrix-vector product with iteration 1.1 in some form: matrix momentum (Orr, 1995; Orr & Leen, 1997) and our own stochastic meta-descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000). Since both of these were derived by entirely different routes, we gain fresh insight into their operation by examining how they implement equation 1.1.

5.1 Stochastic Meta-Descent. Stochastic meta-descent (SMD—Schraudolph, 1999b, 1999c) is a new on-line algorithm for local learning rate adaptation. It updates the weights $\vec w$ by simple gradient descent:

$$\vec w_{t+1} = \vec w_t - \operatorname{diag}(\vec p_t)\, \vec g. \tag{5.1}$$
The vector $\vec p$ of local learning rates is adapted multiplicatively,

$$\vec p_t = \operatorname{diag}(\vec p_{t-1})\, \max\Big(\tfrac{1}{2},\; 1 + \mu\, \operatorname{diag}(\vec v_t)\, \vec g\Big), \tag{5.2}$$
using a scalar meta-learning rate $\mu$. Finally, the auxiliary vector $\vec v$ used in equation 5.2 is itself updated iteratively via

$$\vec v_{t+1} = \lambda\, \vec v_t + \operatorname{diag}(\vec p_t)\,(\vec g - \lambda\, C\, \vec v_t), \tag{5.3}$$
where $0 \le \lambda \le 1$ is a forgetting factor for nonstationary tasks. Although derived as part of a dual gradient descent procedure (minimizing loss with respect to both $\vec w$ and $\vec p$), equation 5.3 implements an interesting variation of equation 1.1. SMD thus employs rapid second-order techniques indirectly to help adapt local learning rates for the gradient descent in weight space.

Linearization. The learning rate update, equation 5.2, minimizes the system's loss with respect to $\vec p$ by exponentiated gradient descent (Kivinen & Warmuth, 1995), but has been relinearized in order to avoid the computationally expensive exponentiation operation (Schraudolph, 1999a). The particular linearization used, $e^u \approx \max(\varrho, 1 + u)$, is based on a first-order Taylor expansion about $u = 0$, bounded below by $0 < \varrho < 1$ so as to safeguard against unreasonably small (and, worse, negative) multipliers for $\vec p$. The value of $\varrho$ determines the maximum permissible learning rate reduction; we follow many other step-size control methods in setting this to $\varrho = \frac{1}{2}$, the
ratio between optimal and maximum stable step size in a symmetric bowl. Compared to direct exponentiated gradient descent, our linearized version, equation 5.2, thus dampens radical changes (in both directions) to $\vec p$ that may occasionally arise due to the stochastic nature of the data.

Diagonal, Adaptive Conditioner. For $\lambda = 1$, SMD's update of $\vec v$, equation 5.3, implements equation 1.1 with the diagonal conditioner $D = \operatorname{diag}(\vec p)$. Note that the learning rates $\vec p$ are being adapted so as to make the gradient step $\operatorname{diag}(\vec p)\, \vec g$ as effective as possible. A well-adapted $\vec p$ will typically make this step similar to the second-order gradient $\bar C^{-1} \vec g$. In this restricted sense, we can regard $\operatorname{diag}(\vec p)$ as an empirical diagonal approximation of $\bar C^{-1}$, making it a good choice for the conditioner $D$ in iteration 1.1.

Initial Learning Rates. Although SMD is very effective at adapting local learning rates to changing requirements, it is nonetheless sensitive to their initial values. All three of its update rules rely on $\vec p$ for their conditioning, so initial values that are very far from optimal are bound to cause problems: divergence if they are too high, lack of progress if they are too low. A simple architecture-dependent technique such as tempering (Schraudolph & Sejnowski, 1996) should usually suffice to initialize $\vec p$ adequately; the fine tuning can be left to the SMD algorithm.

Model-Trust Region. For $\lambda < 1$, the stochastic fixpoint of equation 5.3 is no longer $\vec v \to C^{-1} \vec g$, but rather

$$\vec v \to [\lambda C + (1 - \lambda)\operatorname{diag}(\vec p)^{-1}]^{-1}\, \vec g. \tag{5.4}$$
This clearly implements a model-trust region approach, in that a diagonal matrix is being added (in small proportion) to $C$ before inverting it. Moreover, the elements along the diagonal are not all identical as in Levenberg's (1944) method, but scale individually as suggested by Marquardt (1963). The scaling factors are determined by $1/\vec p$ rather than $\operatorname{diag}(\bar C)$, as the Levenberg-Marquardt method would have it, but these two vectors are related by our above argument that $\vec p$ is a diagonal approximation of $\bar C^{-1}$. For $\lambda < 1$, SMD's iteration 5.3 can thus be regarded as implementing an efficient stochastic variant of the Levenberg-Marquardt model-trust region approach.

Benchmark Setup. We illustrate the behavior of SMD with empirical data obtained on the "four regions" benchmark (Singhal & Wu, 1989): a fully connected feedforward network $\mathcal N$ with two hidden layers of 10 units each (see Figure 1, right) is to classify two continuous inputs in the range $[-1, 1]$ into four disjoint, nonconvex regions (see Figure 1, left). We use the standard softmax output nonlinearity $\mathcal M$ with matching cross-entropy loss $L$, meta-learning rate $\mu = 0.05$, initial learning rates $\vec p_0 = 0.1$, and a hyperbolic
Figure 1: The four regions task (left), and the network we trained on it (right).
tangent nonlinearity on the hidden units. For each run, the 184 weights (including bias weights for all units) are initialized to uniformly random values in the range $[-0.3, 0.3]$. Training patterns are generated on-line by drawing independent, uniformly random input samples; since each pattern is seen only once, the empirical loss provides an unbiased estimate of generalization ability. Patterns are presented in mini-batches of 10 each so as to reduce the computational overhead associated with SMD's parameter updates 5.1, 5.2, and 5.3.³

Curvature Matrix. Figure 2 shows loss curves for SMD with $\lambda = 1$ on the four regions problem, starting from 25 different random initial states, using the Hessian, Fisher information, and extended Gauss-Newton matrix, respectively, for $C$ in equation 5.3. With the Hessian (left), 80% of the runs diverge—most of them early on, when the risk that $H$ is not positive definite is greatest. When we guarantee positive semidefiniteness by switching to the Fisher information matrix (center), the proportion of diverged runs drops to 20%; those runs that still diverge do so only relatively late. Finally, for our extended Gauss-Newton approximation (right), only a single run diverges, illustrating the benefit of retaining certain second-order terms while preserving positive semidefiniteness. (For comparison, we cannot get matrix momentum to converge at all on anything as difficult as this benchmark.)

Stability. In contrast to matrix momentum, the high stochasticity of $\vec v$ affects the weights in SMD only indirectly, being buffered—and largely

³ In exploratory experiments, comparative results when training fully on-line (i.e., pattern by pattern) were noisier but not substantially different.
Figure 2: Loss curves for 25 runs of SMD with $\lambda = 1$, when using the Hessian (left), the Fisher information (center), or the extended Gauss-Newton matrix (right) for $C$ in equation 5.3. Vertical spikes indicate divergence.
averaged out—by the incremental update 5.2 of learning rates $\vec p$. This makes SMD far more stable, especially when $G$ is used as the curvature matrix. Its residual tendency to misbehave occasionally can be suppressed further by slightly lowering $\lambda$ so as to create a model-trust region. By curtailing the memory of iteration 5.3, however, this approach can compromise the rapid convergence of SMD.

Figure 3 illustrates the resulting stability-performance trade-off on the four regions benchmark: When using the extended Gauss-Newton approximation, a small reduction of $\lambda$ to 0.998 (solid line) is sufficient to prevent divergence, at a moderate cost in performance relative to $\lambda = 1$ (dashed, plotted up to the earliest point of divergence). When the Hessian is used, by contrast, $\lambda$ must be set as low as 0.95 to maintain stability, and convergence is slowed much further (dash-dotted). Even so, this is still significantly faster than the degenerate case of $\lambda = 0$ (dotted), which in effect implements IDD (Harmon & Baird, 1996), to our knowledge the best on-line method for local learning rate adaptation preceding SMD.

From these experiments, it appears that memory (i.e., $\lambda$ close to 1) is key to achieving the rapid convergence characteristic of SMD. We are now investigating more direct ways to keep iteration 5.3 under control, aiming to ensure the stability of SMD while maintaining its excellent performance near $\lambda = 1$.

5.2 Matrix Momentum. The investigation of asymptotically optimal adaptive momentum for first-order stochastic gradient descent (Leen & Orr, 1994) led Orr (1995) to propose the following weight update:

$$\vec w_{t+1} = \vec w_t + \vec v_{t+1}, \qquad \vec v_{t+1} = \vec v_t - \mu\,(\varrho_t\, \vec g + C\, \vec v_t), \tag{5.5}$$
N largest eigenvalue, where m is a scalar constant less than the inverse of C’s and %t a rate parameter that is annealed from one to zero. We recognize
Figure 3: Average loss over 25 runs of SMD for various combinations of curvature matrix $C$ and forgetting factor $\lambda$. Memory ($\lambda \to 1$) accelerates convergence over the conventional memory-less case $\lambda = 0$ (dotted) but can lead to instability. With the Hessian $H$, all 25 runs remain stable up to $\lambda = 0.95$ (dot-dashed line); using the extended Gauss-Newton matrix $G$ pushes this limit up to $\lambda = 0.998$ (solid line). The curve for $\lambda = 1$ (dashed line) is plotted up to the earliest point of divergence.
equation 1.1 with scalar conditioner $D = \mu$ and stochastic fixed point $\vec v \to -\varrho_t\, C^{-1} \vec g$; thus, matrix momentum attempts to approximate partial second-order gradient steps directly via this fast, stochastic iteration.

Rapid Convergence. Orr (1995) found that in the late, annealing phase of learning, matrix momentum converges at optimal (second-order) asymptotic rates; this has been confirmed by subsequent analysis in a statistical mechanics framework (Rattray & Saad, 1999; Scarpetta, Rattray, & Saad, 1999). Moreover, compared to SMD's slow, incremental adaptation of learning rates, matrix momentum's direct second-order update of the weights promises a far shorter initial transient before rapid convergence sets in. Matrix momentum thus looks like the ideal candidate for a fast $O(n)$ stochastic gradient descent method.

Instability. Unfortunately, matrix momentum has a strong tendency to diverge for nonlinear systems when far from an optimum, as is the case during the search phase of learning. Current implementations therefore rely on simple (first-order) stochastic gradient descent initially, turning on matrix momentum only once the vicinity of an optimum has been reached (Orr,
1995; Orr & Leen, 1997). The instability of matrix momentum is not caused by lack of semidefiniteness on behalf of the curvature matrix: Orr (1995) used the Gauss-Newton approximation, and Scarpetta et al. (1999) reached similar conclusions for the Fisher information matrix. Instead, it is thought to be a consequence of the noise inherent in the stochastic approximation of the curvature matrix (Rattray & Saad, 1999; Scarpetta et al., 1999).

Recognizing matrix momentum as implementing the same iteration 1.1 as SMD suggests that its stability might be improved in similar ways—specifically, by incorporating a model-trust region parameter $\lambda$ and an adaptive diagonal conditioner. However, whereas in SMD such a conditioner was trivially available in the vector $\vec p$ of local learning rates, here it is by no means easy to construct, given our restriction to $O(n)$ algorithms, which are affordable for very large systems. We are investigating several routes toward a stable, adaptively conditioned form of matrix momentum.
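To make the two updates of this section concrete, here is a deterministic toy run; all parameter values are assumptions, the quadratic loss stands in for the stochastic network loss, and $C = \operatorname{diag}(c)$ makes the $C\vec v$ products trivial (in the paper's setting they would come from the fast passes of section 4).

```python
# Illustrative toy run (not the paper's experiments) of SMD, equations
# 5.1-5.3, and matrix momentum, equation 5.5, on the deterministic
# quadratic loss L(w) = 0.5 * sum(c_i * w_i^2) with gradient g_i = c_i * w_i.

c = [1.0, 10.0]                         # assumed curvatures, C = diag(c)

def loss(w):
    return 0.5 * sum(ci * wi ** 2 for ci, wi in zip(c, w))

def grad(w):
    return [ci * wi for ci, wi in zip(c, w)]

# --- SMD, equations 5.1-5.3 ---
w, p, v = [1.0, 1.0], [0.05, 0.05], [0.0, 0.0]
mu, lam = 0.05, 1.0                     # meta-learning rate, forgetting factor
for _ in range(100):
    g = grad(w)
    p = [pi * max(0.5, 1 + mu * vi * gi)                  # equation 5.2
         for pi, vi, gi in zip(p, v, g)]
    w = [wi - pi * gi for wi, pi, gi in zip(w, p, g)]     # equation 5.1
    v = [lam * vi + pi * (gi - lam * ci * vi)             # equation 5.3
         for vi, pi, gi, ci in zip(v, p, g, c)]
smd_loss = loss(w)

# --- matrix momentum, equation 5.5 ---
w, v = [1.0, 1.0], [0.0, 0.0]
mu, T = 0.09, 300                       # mu below 1 / largest eigenvalue of C
for t in range(T):
    rho = 1.0 - t / T                   # rate parameter annealed from 1 to 0
    g = grad(w)
    v = [vi - mu * (rho * gi + ci * vi) for vi, gi, ci in zip(v, g, c)]
    w = [wi + vi for wi, vi in zip(w, v)]
mm_loss = loss(w)
```

On this benign toy problem both runs drive the loss far below its initial value of 5.5; none of this reproduces the paper's stochastic four-regions experiments, where stability is the hard part.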
6 Summary
We have extended the notion of Gauss-Newton approximation of the Hessian from nonlinear least-squares problems to arbitrary loss functions, and shown that it is positive semidefinite for the standard loss functions used in neural network regression and classification. We have given algorithms that compute the product of either the Fisher information or our extended Gauss-Newton matrix with an arbitrary vector in O(n), similar to but even cheaper than the fast Hessian-vector product described by Pearlmutter (1994). We have shown how these fast matrix-vector products may be used to construct O(n) iterative approximations to a variety of common second-order gradient algorithms, including the Newton, natural gradient, Gauss-Newton, and Levenberg-Marquardt steps. Applying these insights to our recent SMD algorithm (Schraudolph, 1999b), specifically, replacing the Hessian with our extended Gauss-Newton approximation, resulted in improved stability and performance. We are now investigating whether matrix momentum (Orr, 1995) can similarly be stabilized through the incorporation of an adaptive diagonal conditioner and a model-trust region parameter.
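The algebraic point underlying such matrix-vector products can be illustrated for the classical least-squares Gauss-Newton matrix G = JᵀJ: its product with a vector is just two Jacobian-vector products, so G never needs to be formed. A NumPy sketch with a random Jacobian; the explicit J here is for checking only, whereas the algorithms above compute such products by forward and reverse passes through the network without it:

```python
import numpy as np

# For least squares, the Gauss-Newton matrix is G = J^T J, where J is the
# Jacobian of the residuals.  The product G v can be computed as J^T (J v),
# i.e., two matrix-vector products, without ever forming G explicitly.
rng = np.random.default_rng(1)
J = rng.standard_normal((50, 8))   # Jacobian of 50 residuals w.r.t. 8 weights
v = rng.standard_normal(8)

Gv_fast = J.T @ (J @ v)            # two cheap matrix-vector products
G = J.T @ J                        # explicit Gauss-Newton matrix (quadratic memory)
Gv_explicit = G @ v

print(np.allclose(Gv_fast, Gv_explicit))  # True
```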
Acknowledgments
I thank Jenny Orr, Barak Pearlmutter, and the anonymous reviewers for their helpful suggestions, and the Swiss National Science Foundation for the financial support provided under grant number 2000-052678.97/1.
Curvature Matrix-Vector Products
References

Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local minima for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 316–322). Cambridge, MA: MIT Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Harmon, M. E., & Baird III, L. C. (1996). Multi-player residual advantage learning with general function approximation (Tech. Rep. No. WL-TR-1065). Wright-Patterson Air Force Base, OH: Wright Laboratory, WL/AACF. Available online: www.leemon.com/papers/sim tech/sim tech.pdf.
Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1996). Worst-case loss bounds for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 309–315). Cambridge, MA: MIT Press.
Kivinen, J., & Warmuth, M. K. (1995). Additive versus exponentiated gradient updates for linear prediction. In Proc. 27th Annual ACM Symposium on Theory of Computing (pp. 209–218). New York: Association for Computing Machinery.
Leen, T. K., & Orr, G. B. (1994). Optimal stochastic search and adaptive momentum. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 477–484). San Mateo, CA: Morgan Kaufmann.
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2(2), 164–168.
Marquardt, D. (1963). An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441.
Møller, M. F. (1993). Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time (Tech. Rep. No. DAIMI PB-432). Århus, Denmark: Computer Science Department, Århus University. Available on-line: www.daimi.au.dk/PB/432/PB432.ps.gz.
Orr, G. B. (1995). Dynamics and algorithms for stochastic learning. Doctoral dissertation, Oregon Graduate Institute, Beaverton. Available on-line: ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z, orrPhDch6-9.ps.Z.
Orr, G. B., & Leen, T. K. (1997). Using curvature information for fast stochastic search. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rattray, M., & Saad, D. (1999). Incorporating curvature information into on-line learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 183–207). Cambridge: Cambridge University Press.
Scarpetta, S., Rattray, M., & Saad, D. (1999). Matrix momentum for practical natural gradient learning. Journal of Physics A, 32, 4047–4059.
Schraudolph, N. N. (1999a). A fast, compact approximation of the exponential function. Neural Computation, 11(4), 853–862. Available on-line: www.inf.ethz.ch/~schraudo/pubs/exp.ps.gz.
Schraudolph, N. N. (1999b). Local gain adaptation in stochastic gradient descent. In Proceedings of the 9th International Conference on Artificial Neural Networks (pp. 569–574). Edinburgh, Scotland: IEE. Available on-line: www.inf.ethz.ch/~schraudo/pubs/smd.ps.gz.
Schraudolph, N. N. (1999c). Online learning with adaptive local step sizes. In M. Marinaro & R. Tagliaferri (Eds.), Neural Nets—WIRN Vietri-99: Proceedings of the 11th Italian Workshop on Neural Nets (pp. 151–156). Berlin: Springer-Verlag.
Schraudolph, N. N., & Giannakopoulos, X. (2000). Online independent component analysis with local learning rate adaptation. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 789–795). Cambridge, MA: MIT Press. Available on-line: www.inf.ethz.ch/~schraudo/pubs/smdica.ps.gz.
Schraudolph, N. N., & Sejnowski, T. J. (1996). Tempering backpropagation networks: Not all weights are created equal. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (pp. 563–569). Cambridge, MA: MIT Press. Available on-line: www.inf.ethz.ch/~schraudo/pubs/nips95.ps.gz.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman filter. In D. S. Touretzky (Ed.), Advances in neural information processing systems (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Werbos, P. J. (1988). Backpropagation: Past and future. In Proceedings of the IEEE International Conference on Neural Networks, San Diego, 1988 (Vol. I, pp. 343–353). Long Beach, CA: IEEE Press.
Yang, H. H., & Amari, S. (1998). Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Computation, 10(8), 2137–2157.

Received December 21, 2000; accepted November 12, 2001.
LETTER
Communicated by Tony Plate
Representation and Extrapolation in Multilayer Perceptrons

Antony Browne
[email protected]
School of Computing, Information Systems and Mathematics, London Guildhall University, London, U.K.

To give an adequate explanation of cognition and perform certain practical tasks, connectionist systems must be able to extrapolate. This work explores the relationship between input representation and extrapolation, using simulations of multilayer perceptrons trained to model the identity function. It has been discovered that representation has a marked effect on extrapolation.

1 Introduction
There are many different definitions of extrapolation used in different fields. The dictionary definitions of extrapolation are: "1: (mathematical) 'To estimate (a value of a function or measurement) beyond the values already known, by the extension of a curve', and 2: 'To infer (something not known) by using but not strictly deducing from the known facts'" (Hanks, McLeod, & Urdang, 1986). In connectionist modeling, there are two main uses of the term extrapolation. One is extrapolation in time, where the output of a network is taken as the prediction of a value (or values) a number of time steps ahead of the current time step (for an example of this usage, see Ensley & Nelson, 1992). An alternative definition of extrapolation with relevance to connectionism is "generalisation performance to novel inputs lying outside the dynamic range of the data that the network was trained on" (compare with interpolation, which is "generalisation performance to novel inputs lying within the dynamic range of the data the network was trained on"). Here, a vector X is in the dynamic range of a set of vectors Y1, Y2, . . . , YM if and only if, for every dimension i of the vector space, xi is within the range of values occurring in dimension i of the Y vectors (i.e., if and only if there exist j and k such that yj,i ≤ xi ≤ yk,i). This is the definition of extrapolation used in this article (but see section 4 for extrapolation in reference to convex hulls). Extrapolation is a difficult topic to study: given a set of training data points representing a function to be modeled, data outside the range of the training data may be well behaved, in that their distribution still follows that expected from the underlying function represented by the training data points, or badly behaved, in that something wholly unexpected happens to the distribution outside the range of the training data.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1739–1754 (2002)
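The dynamic-range definition is easy to state in code. A minimal sketch; the function name in_dynamic_range is my own, not from the article:

```python
import numpy as np

def in_dynamic_range(x, Y):
    """True iff, for every dimension i, x[i] lies within the range of values
    occurring in dimension i of the vectors in Y (the article's definition
    of membership in the dynamic range)."""
    Y = np.asarray(Y, dtype=float)
    return bool(np.all((Y.min(axis=0) <= x) & (x <= Y.max(axis=0))))

Y = [[0.0, 1.0], [2.0, 3.0]]
print(in_dynamic_range([1.0, 2.0], Y))  # True: inside both per-dimension ranges
print(in_dynamic_range([3.0, 2.0], Y))  # False: 3.0 exceeds the range of dimension 0
```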
Although extrapolation is a difficult topic, it is an important one to study for the following reasons: Biological organisms have to extrapolate, as they often encounter data that lie beyond the range of their previous experience. To do nothing in these circumstances could conceivably threaten the survival of the organism. In many practical tasks (including the forecasting of economic indicators such as currency exchange rates), input values may fluctuate beyond the ranges available when training the neural network model. In such situations, it may be sufficient to ignore the model (and then retrain it, including the new data). However, in other situations, it may be necessary for the model to make a best guess. Obviously, models with better extrapolation capabilities will be more desirable in these circumstances. Three major questions can be asked about extrapolation and neural networks:

1. How can the extrapolation properties of neural networks, such as multilayer perceptrons (MLPs), be improved?
2. What is the relationship between input representation and extrapolation?
3. How can neural networks be manipulated to give good models of human extrapolation performance?

Extrapolation has been investigated in an experiment designed as a critique of eliminative connectionism, where Marcus (1998) demonstrated that MLPs had poor extrapolation abilities. In this experiment, strings of 10 binary digits were presented to an autoassociative MLP trained to perform the identity function. This network was trained on the 512 binary strings representing even numbers in the range 0, . . . , 1022 (i.e., [0 0 0 0 0 0 0 0 0 0] . . . [1 1 1 1 1 1 1 1 1 0]). The test (extrapolation) set consisted of the 512 odd numbers in the range 1, . . . , 1023 (i.e., [0 0 0 0 0 0 0 0 0 1] . . . [1 1 1 1 1 1 1 1 1 1]).
Marcus found that the network did not extrapolate the identity function to odd numbers (even after trying models using different learning rates, numbers of hidden units, numbers of layers, and different presentation sequences of training examples). Instead, the network would respond incorrectly; for example, it would typically respond to the input [1 1 1 1 1 1 1 1 1 1] with the output [1 1 1 1 1 1 1 1 1 0]. Marcus points out that it is unsurprising that the network described above is unable to generalize outside the training space, as "informally, a unit that never sees a given feature value is akin to a node that is freshly added to a network after training" (p. 263). In the above training patterns, the right-most input digit in the 10-bit string was always set to zero in the training set. Given the way that backpropagation (and many other error-driven training methods) operates, the weights from this input unit were not changed by the training algorithm, as this input takes no part in the performance of the network. When this input is set to 1 during presentation of the extrapolation test set, it is unsurprising that the autoassociator does not produce the correct (odd-numbered) outputs. Such a problem can be described as statistically neutral (Thornton, 1996), as the network is attempting to learn the value of the right-most output when the nine other input units have a probability of 0.5 toward that output unit. Such problems are hard, as searching for dependencies between specific input variables (i.e., the other nine input units) and the right-most output unit is fruitless. Marcus (1998) goes on to state:

Though the training space itself is defined by the nature of the input and output representations, the limitation applies to a wide variety of possible representations. Regardless of the input representation chosen (so long as the units represent a finite number of distinct values), there will always be features that lie outside the training space. Any feature that lies outside the training space will not be generalised properly—regardless of the learning rate, the number of hidden units or the number of hidden layers. (p. 265)

It is this statement that this article takes issue with; it attempts to show that the representation of a problem has an effect on the extrapolation properties of a connectionist system. It seems that a major problem with the model described above is the representational scheme used.
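The structure of these training and test sets is easy to reconstruct and check. A sketch; to_bits is my own helper, not from the article:

```python
import numpy as np

# Recreate the structure of Marcus's data: even numbers 0..1022 as 10-bit
# strings (training set) and odd numbers 1..1023 (extrapolation test set),
# most-significant bit first, as in the patterns shown in the text.
def to_bits(n, width=10):
    return np.array([(n >> (width - 1 - i)) & 1 for i in range(width)])

train = np.array([to_bits(n) for n in range(0, 1023, 2)])  # 512 even numbers
test = np.array([to_bits(n) for n in range(1, 1024, 2)])   # 512 odd numbers

print(train.shape, test.shape)   # (512, 10) (512, 10)
print(train[:, -1].max())        # 0: the right-most input is always 0 in training
print(test[:, -1].min())         # 1: ... and always 1 in the extrapolation set
```

The right-most input column is constant during training, which is exactly the condition under which backpropagation leaves the weights from that unit untouched.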
The input representation that Marcus used can be thought of as being both localist and distributed. It can be thought of as localist in that there is one input node for one concept (in this case, each input node represents a concept such as a power of two) but also as distributed because the whole number is being represented across all of the inputs. Such a representation, although designed to tackle an extrapolation task, makes it impossible for the network to extrapolate. If a particular input concept is absent during training (such as the right-most digit being always set to 0 in the bit strings described above), then its presence during testing, when the network has been subjected to the deficiencies of backpropagation learning described above, will inevitably cause errors. A pertinent question is, If the representation used in this problem is changed, is it possible for the network to extrapolate?
2 Alternative Representations
Different kinds of representational schemes exist, such as those using distributed representations (Smolensky, 1990). Such schemes look attractive with respect to the task described above, in that if a concept is distributed across many or all of the inputs, then its absence during training will be reflected in many of the nodes making up the input representation, and its presence during the testing of extrapolation will also be reflected in many of these nodes. There have been many attempts to define distribution in connectionist representations, such as microfeatures (Hinton, 1990) and coarse coding (Rosenfeld & Touretzky, 1988). Perhaps the most formal notions of distribution have been given by Hinton, McClelland, and Rumelhart (1986) and van Gelder (1991), who described distributed representations with respect to their extendedness and superposition. For a representation to be extended, the things being represented must be represented over many units, or more generally over some relatively extended proportion of the available resources in the system. In the model described above, the representation of the concept even-odd (signified by the right-most bit of the bit string) would be extended if it were distributed over more than one of the input units. A representation can be said to be superpositional if it represents many items using the same resources. For the representation to be superpositional in the model described above, each input unit would somehow have to represent more than one of the concepts represented by the bit-string positions. Note that these qualities of extendedness and superposition seem attractive when attempting to perform the extrapolation task described. They imply that the network cannot omit the representation of a single concept (such as even-oddness) by ignoring a single unit when training, as the representation of that concept may be spread over many input units.
Representations in a standard feedforward network can be both extended and superposed (Sharkey, 1992), as the representation of each input may be distributed over many hidden units, and each hidden unit may be representing several inputs. Such representations, and those developed by other means, have been of great interest to connectionists when trying to answer criticisms posited by symbolic artificial intelligence researchers (see Niklasson & van Gelder, 1994; Browne, 1998a, 1998b; Browne & Sun, 2001). However, the intention in this article is not to model the symbol processing capabilities of symbolic artificial intelligence systems; instead, it is to examine alternative representations with respect to their effects on extrapolation.

2.1 Generating Alternative Representations. One approach would be to train an autoassociative MLP to generate distributed input representations by using the hidden layer of this as input to an extrapolation model. However, in actuality, this approach would not work for the extrapolation task described above, as it would be necessary to train this autoassociator with both the training and extrapolation sets so as not to run into identical problems (caused by one input always being set to zero) to those described for the model that Marcus developed. What are needed are simple techniques for generating distributed representations, which can generate representations for a training set (without the model being exposed to an extrapolation set) and can then (independently) be used at a later date to generate an extrapolation set. Three schemes were examined:

Random matrix transformation (Schönemann, 1987; Plate, 1994). An input vector X with n elements (indexed x0, x1, . . . , xn−1) is transformed by multiplying it by a matrix R of random numbers to give another vector Y. The MLP can be trained on the 512 Y input vectors produced from the training set and then tested on a set of extrapolation vectors generated using the same matrix R.

Circular convolution. This technique has been used by connectionist researchers for representing embedded structure in connectionist representations (Plate, 1995) and analogical processing (Plate, 2000). However, here there are two different uses of convolution. Instead of convolving one vector with another to generate embedding, here a vector is convolved with either a randomly generated vector (the same random vector being convolved with all of the vectors in the training and extrapolation set) or with itself to generate a distributed representation. Convolving a vector with itself appears to be an attractive scheme, as it can produce a distributed representation without the need for the existence of or manipulation of an external parameter (such as the distribution of values in the random matrix or vector in the two other schemes). To illustrate the case where a vector is convolved with itself, consider a vector X with n elements (indexed x0, x1, . . . , xn−1). The circular convolution Y (an n-dimensional vector) of X with itself (Y = X ⊛ X) is given by equation 2.1,

Y = [y0, y1, . . . , yn−1],    (2.1)

where

yj = Σ (k = 0 to n−1) xk x(j−k),    (2.2)

with subscripts taken modulo n. For example, given the three-element vector X (x0, x1, x2) shown in Figure 1, the circular convolution Y of X with itself is given by equation 2.3:

y0 = x0x0 + x2x1 + x1x2
y1 = x0x1 + x1x0 + x2x2    (2.3)
y2 = x0x2 + x1x1 + x2x0.

Now suppose that the value of the element x1 is changed by an amount Δx1 to give a new value for this element (x1′). It can be seen that the new
value of x1 will affect all three elements of the convolution Y, so a change in a single element of an input vector changes many elements of its self-convolution. Such a representation is superposed (from the above figures, it can be seen that many of the x components contribute to each y component) and extended (as each x component contributes to many y components).

Figure 1: Generating a distributed representation Y with components y0, y1, y2 produced by the self-convolution of the vector X. (Adapted from Plate, 2000.)

Using an example taken from the data used by Marcus (1998), consider the even input pattern (from the training set), [0 0 0 0 0 0 0 0 1 0], that when convolved with itself becomes the pattern [0 0 0 0 0 0 1 0 0 0], and also consider the odd input pattern from the extrapolation set, [0 0 0 0 0 0 0 0 1 1], that when convolved with itself becomes [0 0 0 0 0 0 1 2 1 0]. For the original patterns, only one element (the right-most element) has changed value, whereas two elements can be seen to have changed value when comparing the self-convolutions. At the other end of the data set, this effect is more extreme. Consider the even input pattern (from the training
set), [1 1 1 1 1 1 1 1 1 0], and its self-convolution, [8 8 8 8 8 8 8 8 9 8], and also consider the odd input pattern from the extrapolation set, [1 1 1 1 1 1 1 1 1 1], that when convolved with itself becomes [10 10 10 10 10 10 10 10 10 10]. Here, although only one of the elements differs between the two original vectors, all 10 elements are different in their self-convolutions.

3 Simulations
To test the hypothesis that representation has a direct effect on extrapolation, five autoassociative MLPs were constructed, each having 10 inputs and 10 outputs. Each was trained with the scaled conjugate gradients algorithm (Møller, 1993). Five sets of simulations were carried out. One of these used the localist representations favored by Marcus (described above) to train a series of MLPs to perform the identity task. Another used a similar input representation, except that instead of the inputs being 1/0, they were transformed to −1/1. Another set used random matrix transformation (with a new random matrix being used for each run). The remaining two sets of simulations used either convolution of input vectors with a random vector (with a new random vector being used for each run) or vector self-convolutions as input. During the simulations, in different trials the number of hidden units and the initial values and ranges of both the input-to-hidden and hidden-to-output weights of individual networks were varied. Training was continued until a network attained 100% correct (rounded) outputs on the training set of even numbers or the maximum number (50,000) of epochs was reached. After training finished, the extrapolation set of odd numbers described above was fed through the network (after being suitably recoded). For the self-convolution network (which attempts to match integer outputs), a particular extrapolation example was judged correct if, when all the outputs generated by this case were rounded to the nearest integer, they were identical to the expected extrapolated number. For the random matrix transformation and random vector convolution networks (which attempt to match real outputs), a particular extrapolation example was judged correct if it was the closest (by Euclidean distance) output to that of the expected extrapolated number.

3.1 Results.
The networks using untransformed inputs of 1/0 and −1/1 (unsurprisingly) always gave 0% extrapolation to the set of odd numbers (replicating the results of Marcus discussed above). Initially, the networks using random matrix transformation and convolution with a random vector gave erratic results. For example, on successive runs, the networks would veer from 0% correct extrapolation to 83% correct. It was discovered that both of these transformations are sensitive to the random matrix used and the random vector selected for convolution. If values in the matrix or vector were too large, because the dynamic range of the extrapolation set exceeded that of the training set by a large amount, clipping was occurring at the hidden layer. The asymptotes of the sigmoidal transfer function in the hidden layer were clipping hidden-layer activation values, and this truncation was adversely affecting extrapolation performance. When the random matrix range was reduced to 0.1 × M, where M were random numbers drawn from a standard normal distribution (centered on zero), it was found that the performance of networks using these transformations improved drastically. The transformation based on convolution with a random vector proved more problematic. Even when the range and nature of the distribution the random vector was chosen from were changed, performance was still erratic. No such problems were encountered using self-convolution, which gave consistently good performance. With this transformation, the dynamic range of the test set varied slightly outside that of the training set, but not enough to encounter clipping. For networks using the three transformations described above, percentage correct values for different network configurations (averaged over 10 runs for each number of hidden units, using different initialization weights and ranges) are shown in Table 1. It can be seen from Table 1 that networks using alternative representations can show 100% correct generalization performance to the extrapolation set consisting of self-convolutions of the odd numbers.

Table 1: Extrapolation Set Percentage Correct, for the Three Transformations and Networks with Different Numbers of Hidden Units.

Number of Hidden Units    Random Matrix    Random Convolution    Self-Convolution
10                        100              100                   100
 9                        100              100                    37
 8                         44               81                    14
 7                          2               72                     3
 6                          2               44                     0
 5                          0               38                     0

3.2 Analysis. A criticism of these results could be that the (self-convolution) network with 10 hidden units is merely carrying out some form of symbolic copy operation (i.e., copying the value of a particular input unit to a single hidden unit) and then copying the value represented on this hidden unit directly to a single output. For the Marcus representation, Figure 2 shows a Hinton diagram (Hinton et al., 1986) of the input-to-hidden weights of a typical network trained using the original inputs, with 10 hidden units.
Figure 2: Hinton diagram of weights from the input-to-hidden layer of the network with the original input representation. White signifies positive weights, black negative. Hidden-layer bias inputs are in the 0th column.

It can be seen from this diagram that even using this representation, more than one large weight exists from a particular input to a particular hidden unit. Figure 3 shows a Hinton diagram of the hidden-to-output weights of the same network. From inspection of these two diagrams, it appears that even the network with the original representation is not implementing a symbolic copy operation where each input is mapped uniquely to a particular hidden unit and then to the respective output. Compare this to the Hinton diagrams of input-to-hidden weights (see Figure 4) and hidden-to-output weights (see Figure 5) for a network using the self-convolution transformation. It can be seen from Figures 4 and 5 that this network's input-to-hidden and hidden-to-output weights are definitely not as would be expected if some form of symbolic copy operation were being carried out where each input value was being copied to a single hidden unit, and this value was being copied to a single output. There is not one large weight from each input unit to a corresponding single hidden unit, with all other weights set near zero. It appears from these diagrams that each input has weights of reasonable magnitude to many hidden units, and each hidden unit is being influenced by many inputs.

4 Discussion
Figure 3: Hinton diagram of weights from the hidden-to-output layer of the network with the original input representation. White signifies positive weights, black negative. Output-layer bias inputs are in the 0th column.

The work described in this article has attempted to answer questions 1 and 2 from section 1: How can the extrapolation properties of neural networks (such as MLPs) be improved? and What is the relationship between input representation and extrapolation? It has demonstrated that the extrapolation properties of neural networks such as MLPs can be improved (at least for the identity task) when given an appropriate input representation. It seems that previous work has prematurely drawn erroneous conclusions on their inability to extrapolate, based on the presentation of data with an inappropriate input representation. Using an input representation such as that used by Marcus, with the nth unit always set to the same value, is tantamount to telling the network that it exists in an n − 1 dimensional space. It is unsurprising that the network performs badly with an extrapolation data set when the nth dimension appears during testing. In alternative representations, such as those described above, the nth dimension is active in forming the representation of the n − 1 other dimensions. This suggests that for good extrapolation, the transform used must be such that any input value that varies in the transformed test set must also vary in the transformed training set. However, it is not true that the transform used must be such that the input space of the transformed training set spans the input
space of the transformed test set. Given the results for the three representations discussed above, where the test set dynamic ranges exceeded those of the training sets, it can be concluded that good extrapolation performance can be obtained if the dynamic range of the test set is not too far outside that of the training set. Defining what "too far" is when discussing test set range and the dangers of clipping by the hidden-layer transfer function will be the subject of further research. One concept discussed by Marcus (1998) (following work by Touretzky, 1991) is that of training independence, which has two components. Input independence means that when using a learning algorithm such as backpropagation, the amount that a given connection weight from unit i changes is determined by the algorithm to be dependent on the activation level xi (so that whenever xi is zero, there is no change in the weight). Output independence means that output units are also trained independently of one another, because in training algorithms such as backpropagation, the weights feeding a particular output unit are changed independently of the activation levels of other outputs.

Figure 4: Hinton diagram of weights from the input-to-hidden layer of the network with self-convoluted input representation. White signifies positive weights, black negative. Hidden-layer bias inputs are in the 0th column.
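The input-independence point can be checked directly for a single weight layer. A minimal linear sketch with made-up data; the names X, T, W and the sum-squared loss are illustrative, not the article's simulations:

```python
import numpy as np

# Input independence under gradient descent: for a weight w_ij feeding out
# of input unit i, the gradient of a sum-squared error is delta_j * x_i,
# so an input unit that is always zero during training gets zero updates.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
X[:, 3] = 0.0                      # input unit 3 is always zero, like Marcus's bit
T = rng.standard_normal((20, 2))   # arbitrary targets
W = rng.standard_normal((4, 2))

Y = X @ W                          # forward pass (linear units)
grad_W = X.T @ (Y - T)             # gradient of 0.5 * sum((Y - T)**2) w.r.t. W
print(np.abs(grad_W[3]).max())     # 0.0: weights from the zero input never change
```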
Figure 5: Hinton diagram of weights from the hidden-to-output layer of the network with self-convoluted input representation. White signifies positive weights, black negative. Output-layer bias inputs are in the 0th column.

Therefore, one way of viewing MLPs trained with such algorithms is that they consist of a set of independent classifiers (Touretzky, 1991; Marcus, 1998); each output unit in such a network computes its own classification function. This is obviously the case in the network trained to model the identity function using the Marcus representation, as each input is independent of the other inputs, and each output is independent of the other outputs. An input that is always set to zero in the training phase will be ignored and its output classifier always trained to produce a zero, regardless of the values of the other outputs. However, with the autoassociator using alternative representations, the situation is different. Inputs are dependent on one another, as a change in one input also changes some (or all) of the others. Outputs are also interdependent in a similar fashion. When a novel item from the extrapolation set is presented, it affects many inputs and therefore many outputs. One question that can be asked is, Does the new representation merely code the extrapolation problem as an interpolation problem? This is partially true, as (for example) the circular convolution procedure converts the original input and outputs to integer-valued vectors, where
Representation and Extrapolation
1751
there is no explicit representation of the concepts odd or even, and extrapolation between these concepts actually becomes interpolation on a series of these vectors. Given a problem, one can always find a bad way of representing it to an MLP (as Marcus did) or code it in a form where it can be solved by such a network (even if this loses the explicit representation of the original function). In fact, the problem was not just one of interpolation of transformed vectors. Of the 512 convolved extrapolation test patterns, 503 lay within the dynamic range of the convolved training set. However, nine lay outside the dynamic range of the convolved training set and so can be considered true extrapolations. All of these nine are extrapolated correctly. Future experiments should investigate extrapolation by using representational transformations where a higher proportion (or all) of the transformed extrapolation set lies outside the dynamic range of the training set. Although the transformations discussed above are convenient ways of producing distributed representations with the aim of enhancing the extrapolation ability of a connectionist system, there are many others, and the ones given above may not always be optimal. For example, consider the XOR training set and its self-convolutions: [0, 0 → 0, 0], [1, 0 → 1, 0], [0, 1 → 1, 0], [1, 1 → 2, 2]. As both [0, 1] and [1, 0] are being mapped to the same convolved input, this would not be a suitable representation for a training set if the task involved discrimination between these two vectors. However, in the training and extrapolation sets for the 10-bit identity task described above, there were no identical vectors, and self-convolution proved an appropriate way of generating distributed representations. Alternative transforms suitable for some extrapolation tasks may have to be found.
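The self-convolution collision just described can be checked directly. The sketch below (function names are mine, not the article's) implements circular convolution and reproduces the XOR table above, including the collision of [0, 1] and [1, 0]:

```python
# Circular self-convolution sketch (cf. Plate's holographic reduced
# representations). Shows why [0, 1] and [1, 0] both map to [1, 0].

def circular_convolution(a, b):
    # (a * b)_k = sum_i a_i * b_[(k - i) mod n]
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]

def self_convolve(v):
    return circular_convolution(v, v)

for v in ([0, 0], [1, 0], [0, 1], [1, 1]):
    print(v, "->", self_convolve(v))
# [0, 0] and [1, 0] map to themselves; [0, 1] collides with [1, 0];
# [1, 1] maps to [2, 2], matching the table in the text.
```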
Regarding question 3 in section 1 (How can neural networks be manipulated to give good models of human extrapolation performance?), no claims are made in this article that concepts are represented as self-convolutions in the human brain. However, it is also unlikely that concepts such as "odd number" and "even number" are represented by their corresponding binary representations in the brain. The main demonstration of this work has been that the type of representation used affects the extrapolation ability of MLPs. Further work is needed to determine if cognitive systems (including the human brain and artificial neural systems) can use representational transformation to change hard tasks into easier ones. Of course, the question could be asked, "You have demonstrated extrapolation is possible when data with one missing dimension are recoded using a distributed representation. What happens when the data have higher dimensionality or more than one dimension is missing?" This is a pertinent question, and further work will examine extrapolation for tasks with more than one dimension "hidden" from the network during training. Another interesting approach would be to look at the extrapolation properties of sparse distributed representations, such as those produced by context-dependent thinning (Rachkovskij & Kussul, 2001).
Other researchers have compared the performance of human subjects with connectionist systems when required to extrapolate mathematical functions, such as linear, quadratic, and exponential functions (Busemeyer, McDaniel, & Byun, 1997; DeLosh, Busemeyer, & McDaniel, 1997; Erickson & Kruschke, 2001, and in press), and found purely connectionist explanations of these tasks untenable. In future work, it will prove interesting to observe the effect of alternative input representations on modeling these tasks. In this article, the definition of extrapolation used is "generalization performance to novel inputs lying outside the dynamic range of the data that the network was trained on," as this seemed a natural definition to use when tackling the Marcus data set (i.e., a binary data set with rectangular convex hulls). However, a more restrictive definition of extrapolation exists: "generalization to novel inputs lying outside the convex hull of the training data." This more rigorous definition should be applied when using other data types, as a value can be within the dynamic range of the inputs but still be an extrapolation. This is often described as hidden extrapolation and can be illustrated by a simple example in two dimensions, where a point has coordinates less than the maximal coordinates in each dimension (hence, falling within the dynamic range of the data) yet, because both coordinates are high, is in fact an extrapolation. No points are made in this article regarding the adequacy of eliminative connectionism (as discussed by Marcus). Indeed, we have shown how connectionist variable binding (Browne & Sun, 1999) and inference (Browne & Sun, 2001) can be accomplished using the distributed representations formed on the hidden layers of MLPs.
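The two-dimensional hidden-extrapolation example above can be made concrete. In the sketch below, the training points and query point are my own invented illustration: the query lies within the per-dimension dynamic range of the training data yet outside its (triangular) convex hull:

```python
# Hidden extrapolation in 2-D: inside the dynamic range, outside the hull.

def in_dynamic_range(p, data):
    # Each coordinate of p lies between the min and max seen in training.
    return all(min(d[k] for d in data) <= p[k] <= max(d[k] for d in data)
               for k in range(len(p)))

def in_triangle(p, a, b, c):
    # Signed-area (cross-product) test for a triangular convex hull.
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = (0.9, 0.9)  # each coordinate is within [0, 1], but both are high

print(in_dynamic_range(query, train))  # inside the per-dimension range
print(in_triangle(query, *train))      # yet outside the convex hull
```

Under the dynamic-range definition the query is an interpolation; under the convex-hull definition it is an extrapolation, which is exactly the distinction the text draws.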
However, suggestions by other researchers, such as that localist representations are adequate for psychological modeling (Page, 2000), may have to be reviewed in response to the extrapolation results described in this article.

Acknowledgments
I thank the two anonymous reviewers of this article for their insightful comments and Dmitri Rachkovskij for the code for performing CDT and related discussion.

References

Browne, A. (1998a). Detecting systematic structure in distributed representations. Neural Networks, 11(5), 815–824.
Browne, A. (1998b). Performing a symbolic inference step on distributed representations. Neurocomputing, 19, 23–34.
Browne, A., & Sun, R. (1999). Connectionist variable binding. Expert Systems: The International Journal of Knowledge Engineering and Neural Networks, 16(3), 189–207.
Browne, A., & Sun, R. (2001). Connectionist inference models. Neural Networks, 14(10), 1331–1355.
Busemeyer, J., McDaniel, M. A., & Byun, E. (1997). The abstraction of intervening concepts from experience with multiple input-multiple output causal environments. Cognitive Psychology, 32, 1–48.
DeLosh, E. L., Busemeyer, J. B., & McDaniel, M. A. (1997). Extrapolation: The sine qua non for abstraction in function learning. Journal of Experimental Psychology, 23(4), 968–986.
Ensley, D., & Nelson, D. E. (1992). Extrapolation of Mackey-Glass data using cascade correlation. Simulation, 58(5), 333–339.
Erickson, M. A., & Kruschke, J. A. (2001). Multiple representations in inductive category learning: Evidence of stimulus- and time-dependent representation. Manuscript submitted for publication.
Erickson, M. A., & Kruschke, J. K. (in press). Rule-based extrapolation in perceptual categorization. Psychonomic Bulletin and Review.
Hanks, P., McLeod, W., & Urdang, L. (1986). Collins dictionary of the English language. London: Collins.
Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46, 47–75.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Marcus, G. F. (1998). Rethinking eliminative connectionism. Cognitive Psychology, 37, 243–282.
Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.
Niklasson, L., & van Gelder, T. (1994). Can connectionist models exhibit nonclassical structure sensitivity? In Proceedings of the Cognitive Science Society (pp. 664–669). Hillside, NJ: Erlbaum.
Page, M. (2000). Connectionist modeling in psychology: A localist manifesto. Behavioral and Brain Sciences, 23(4), 443–512.
Plate, T. (1994). Distributed representations and nested compositional structure. Unpublished doctoral dissertation, University of Toronto.
Plate, T. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623–641.
Plate, T. (2000). Analogy retrieval and processing with distributed vector representations. Expert Systems: The International Journal of Knowledge Engineering and Neural Networks [Special issue], 17(1), 29–40.
Rachkovskij, D. A., & Kussul, E. M. (2001). Binding and normalization of binary sparse distributed representations by context-dependent thinning. Neural Computation, 13(2), 411–452.
Rosenfeld, R., & Touretzky, D. (1988). Coarse coded symbol memories and their properties. Complex Systems, 2, 463–484.
Schönemann, P. H. (1987). Some algebraic relations between involutions, convolutions, and correlations, with application to holographic memories. Biological Cybernetics, 56, 367–374.
Sharkey, N. (1992). The ghost in the hybrid—a study of uniquely connectionist representations. AISB Quarterly, 79, 10–16.
Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216.
Thornton, C. (1996). Parity, the problem that won't go away. In G. M. Calla (Ed.), Proceedings of AI 96 (pp. 362–374). Toronto: Springer-Verlag.
Touretzky, D. S. (1991). Connectionism and compositional semantics. In J. A. Barnden & J. B. Pollack (Eds.), High-level connectionist models (pp. 17–31). Hillside, NJ: Erlbaum.
van Gelder, T. (1991). What is the "D" in "PDP"? A survey of the concept of distribution. In W. Ramsey, S. Stich, & D. E. Rumelhart (Eds.), Philosophy and connectionist theory (pp. 33–60). Hillside, NJ: Erlbaum.

Received December 11, 2000; accepted November 20, 2001.
LETTER
Communicated by Gary Cottrell
Using Noise to Compute Error Surfaces in Connectionist Networks: A Novel Means of Reducing Catastrophic Forgetting

Robert M. French
[email protected]
Quantitative Psychology and Cognitive Science, Psychology Department, University of Liège, 4000 Liège, Belgium

Nick Chater
[email protected]
Institute for Applied Cognitive Science, Department of Psychology, University of Warwick, Coventry, CV4 7AL, U.K.

In error-driven distributed feedforward networks, new information typically interferes, sometimes severely, with previously learned information. We show how noise can be used to approximate the error surface of previously learned information. By combining this approximated error surface with the error surface associated with the new information to be learned, the network's retention of previously learned items can be improved and catastrophic interference significantly reduced. Further, we show that the noise-generated error surface is produced using only first-derivative information and without recourse to any explicit error information.

1 Introduction
Everyone forgets but, thankfully, it is typically a gradual process. Neural networks, on the other hand, and especially those that develop highly distributed representations over a single set of weights, can suffer from severe and sudden forgetting. Almost all of the early solutions to this problem, called the problem of "catastrophic forgetting," relied on learning algorithms that reduced the overlap of the network's internal representations (see French, 1999, for a review) by making these representations sparser. This had the desired effect of reducing interference, with the obvious trade-off being a decrease in the network's ability to generalize. A significantly different approach, which made use of noise, was developed by Robins (1995). The idea was as follows. When a network that had previously learned a set of patterns had to learn a new set of patterns, a series of random patterns (i.e., noise) was input into the network and the associated output was collected, producing a series of pseudopatterns. These pseudopatterns, which reflected the previously learned patterns, were then

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1755–1769 (2002)
interleaved with the new patterns to be learned. This effectively decreased catastrophic forgetting of the originally learned patterns. The use of pseudopatterns will serve as the starting point for the algorithm developed in this article. Unlike Robins's algorithm, however, we will use pseudopatterns to directly approximate the error surface associated with the original patterns. This approximated error surface will then be combined with the error surface associated with the new patterns, and gradient descent will be performed on the combined error surface. This will be shown to improve the network's performance on catastrophic forgetting significantly.

2 Measures of Forgetting
There are two standard measures of forgetting in connectionist models, both related to standard psychological measures. The first is a simple error measurement. Suppose a first set of patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$ has been learned to criterion by a network. A new set $\{Q_i : I_i \to O_i\}_{i=1}^{M}$ is then learned to criterion. We measure the amount of network error produced by each of the patterns in the first set. The second widely used measure of forgetting is an Ebbinghaus "savings" measure, first applied to neural networks by Hetherington and Seidenberg (1989). After learning $\{P_i\}_{i=1}^{N}$ and then $\{Q_i\}_{i=1}^{M}$, we measure the number of epochs required to retrain the network to criterion on the initial training set $\{P_i\}_{i=1}^{N}$. The faster the relearning, the less forgetting that is judged to have occurred. We will use both of these measures in the discussion that follows.

3 Overview of Hessian Pseudopattern Backpropagation (HPBP)
Any given set of patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$ has an associated error surface, $E(w)$, defined over the network's weights. This means that for each possible combination of values of the network's weights, there will be an overall error associated with the set of patterns (usually the sum of the squared errors produced by each individual pattern, $P_i$). Learning the set of patterns $\{P_i\}_{i=1}^{N}$ is equivalent to the network's finding a minimum, call it $w_0$, of this error surface. When a new pattern, $P_{new}$, is presented to the system, the original error surface $E(w)$ changes to $E_{P_{new}}(w)$. (For simplicity, we will discuss only the case where a single new pattern is presented to the network, but the argument is identical for any number of new patterns.) In general, $w_0$, a minimum of the original error surface, $E(w)$, will no longer be a minimum of $E_{P_{new}}(w)$. In other words, the network "forgets" the original error surface $E(w)$. What we need is some way for the network to approximate $E(w)$ in the absence of the original patterns. We could then create a new overall error surface that would reflect both $E(w)$ and $E_{P_{new}}(w)$. We do this by taking a weighted sum of the approximation of $E(w)$, which we will call $\hat{E}(w)$, and
$E_{P_{new}}(w)$. Our weight change algorithm will then be gradient descent on this combined error surface. In what follows, we will develop the mathematics of HPBP and demonstrate the algorithm by means of two simple simulations on empirical data: one in which we sequentially learn two sets of patterns to criterion, the other in which the network is presented with a series of patterns, each of which is learned to criterion before the presentation of the next pattern.

4 Noise and the Calculation of an Error Surface
Assume, as above, that $E(w)$ is the unique error surface defined by a set of real input-output patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$ learned by the network. The network's having learned these patterns means it has discovered a local minimum $w_0$ in weight space for which $E'(w_0) = 0$, where $E'(w)$ represents the first derivative of the error function. If the function $f$ underlying the original set of patterns is relatively "nice" (continuous, reasonably smooth, and so forth), then a set of pseudopatterns $\{\psi_i : I_i \to O_i\}_{i=1}^{M}$ whose input values are drawn from a flat random distribution will produce a reasonable approximation of $f$. (See French, Ans, & Rousset, 2001, for a discussion of how this approximation could be improved by additionally making use of the values of the output associated with each random input, or Ans & Rousset, 2000, for a technique that produces an "attractor" input pattern from uniform random input. In the present case, however, we simply use flat random input to produce the pseudopatterns.) Just as the original set of patterns $\{P_i\}_{i=1}^{N}$ had a unique error surface associated with it, so does the set of pseudopatterns $\{\psi_i\}_{i=1}^{M}$. For this latter error surface, $E_{\psi}(w)$, it follows from the definition of pseudopatterns that $E_{\psi}(w_0) = 0$ and $E_{\psi}'(w_0) = 0$. The question is how we can produce this approximation of the original error surface in the vicinity of $w_0$ (assuming that the original patterns $\{P_i\}_{i=1}^{N}$ are no longer available). We know that for the original error surface $E(w)$, $E'(w_0) = 0$. While this tells us that $w_0$ is a local minimum of $E$, it does not provide any information about the shape (in particular, the steepness) of $E(w)$ around $w_0$, which is what we want. For this, we need the higher derivatives of $E(w)$, which, unlike the first derivative, do not disappear when evaluated at $w_0$. Using this steepness and the local minimum information, we reconstruct the desired approximation of the original error surface by means of a truncated Taylor series. (For other techniques using the higher-order derivatives to improve backpropagation, see, for example, Bishop, 1991, and Becker & Le Cun, 1988, among others.) Somewhat counterintuitively, approximating the original error surface using noise does not require any explicit error information; noise moving through the system is sufficient for the calculation. Thus, unlike other techniques that make use of pseudopatterns that require the system to learn a
mixture of pseudopatterns and new patterns (Robins, 1995; French, 1997; Ans & Rousset, 1997, 2000), here noise is simply sent through the system, and this alone allows us to approximate the error surface around $w_0$. This approximated error surface, combined with the error surface of the new patterns to be learned, produces an overall error surface on which gradient descent will be performed. The details of this calculation follow.

5 Hessian Pseudopattern Backpropagation (HPBP)
Assume that the network has already stored a number of patterns and has found a point $w_0$ in weight space for which all the previously learned patterns have been learned to criterion. Further assume that we are using the standard quadratic error function,

$$E = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N_{outputs}} (y_i^p - t_i^p)^2, \tag{5.1}$$

where $P$ is the number of patterns, $N_{outputs}$ is the number of output units in the network, $y_i^p$ is the output of the $i$th output node for the $p$th pattern, and $t_i^p$ is the teacher for the $i$th output node for the $p$th pattern. $E$ being a continuous, everywhere differentiable function, it has a Taylor series expansion about $w_0$, which we can write as follows:

$$E(w) = E(w_0) + E'(w_0)(w - w_0) + \frac{1}{2!}(w - w_0)^T H|_{w_0} (w - w_0) + \cdots, \tag{5.2}$$

where $w$ is a point in weight space, $w_0$ is the point in weight space at which the network has arrived after learning the original patterns, $E'(w_0)$ is the gradient of $E$ evaluated at $w_0$, and $H|_{w_0}$ is the Hessian matrix of second partial derivatives of $E$ evaluated at $w_0$. For values of $w$ sufficiently close to $w_0$, we will assume that we can truncate the Taylor series after the second term. Since the network is at $w_0$ after having learned the original patterns, this implies that $w_0$ is a local minimum of the error surface and, consequently, $E'(w_0)$ is 0. We can therefore write the truncated Taylor series approximation of the error surface corresponding to the originally learned set of patterns as

$$\hat{E}(w) \approx E(w_0) + \frac{1}{2!}(w - w_0)^T H|_{w_0} (w - w_0). \tag{5.3}$$
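As a quick numerical illustration of why truncating at the quadratic term is reasonable near a minimum, consider the toy surface below (my own choice, not from the article). Once the gradient term vanishes at $w_0$, the quadratic form of equation 5.3 tracks the true surface closely in a small neighborhood:

```python
# Toy check of the truncated Taylor approximation (eq. 5.3) near a minimum.

def E(w1, w2):
    # Illustrative error surface with its minimum at w0 = (0, 0),
    # where the gradient is zero.
    return w1**2 + 2.0 * w2**2 + 0.1 * w1**4

# Hessian of E at w0, worked out by hand: diag(2, 4).
H = [[2.0, 0.0], [0.0, 4.0]]

def quadratic_approx(w1, w2):
    # E(w0) = 0 and E'(w0) = 0, so only the Hessian term survives.
    return 0.5 * (H[0][0]*w1*w1 + 2*H[0][1]*w1*w2 + H[1][1]*w2*w2)

w = (0.05, -0.03)
print(E(*w), quadratic_approx(*w))  # nearly equal this close to w0
```

Farther from $w_0$ the quartic term grows and the approximation degrades, which is exactly the divergence effect discussed later in section 8.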
Now assume that a new pattern, $P$, is presented to the network. This pattern induces an error surface, $E_P(w)$ (as mentioned above, the argument is the same for a set of new patterns):
Let $E(w) = \alpha \hat{E}(w) + E_P(w)$, where the constant $\alpha$ is a weighting factor. The standard delta rule gives

$$\Delta w = -\eta \frac{\partial E}{\partial w}, \quad \text{where} \quad \frac{\partial E}{\partial w} = \alpha \frac{\partial \hat{E}(w)}{\partial w} + \frac{\partial E_P}{\partial w}. \tag{5.4}$$

But from equation 5.3 we have

$$\frac{\partial \hat{E}(w)}{\partial w} = H|_{w_0}(w - w_0).$$

The weight change rule will therefore be

$$\Delta w = -\eta \left[ \alpha H|_{w_0}(w - w_0) + \frac{\partial E_P}{\partial w} \right], \tag{5.5}$$
where $\eta$ is the learning rate and $\alpha$ is the weighting factor of the prior approximated error surface. We will now show how noise allows us to calculate $H|_{w_0}$.

6 Noise and the Calculation of $H|_{w_0}$
For each pseudopattern, the teacher and the output will, by definition, be the same. In other words,

$$\forall \psi \in \Psi \ \forall n \quad (y_n^{\psi} - t_n^{\psi}) = 0, \tag{6.1}$$

where $\Psi$ is the set of all pseudopatterns, $y_n^{\psi}$ is the output from the $n$th output node of the network, and $t_n^{\psi}$ is the teacher for the $n$th output node of the network. The Hessian matrix evaluated at $w_0$ is defined as follows:

$$H|_{w_0} = \begin{bmatrix} \dfrac{\partial^2 E}{\partial w_1 \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_1 \partial w_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 E}{\partial w_N \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_N \partial w_N} \end{bmatrix}_{w_0},$$

where $w_0$ is a solution for the originally learned set of patterns. Consider the $\langle i, j \rangle$th term of $H$:

$$H_{ij} = \frac{\partial^2 E}{\partial w_i \partial w_j}.$$

We begin with the error function for the pseudopatterns $\psi_1, \psi_2, \psi_3, \ldots, \psi_{N_\Psi}$, where $N_\Psi$ is the number of pseudopatterns that will be used to calculate the error surface:

$$E = \frac{1}{2} \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} (y_n^p - t_n^p)^2,$$
where $N_\Psi$ is the number of pseudopatterns, $N_{outputs}$ is the number of output units of the network, $y_n^p$ is the output of the $n$th output unit for the $p$th pseudopattern, and $t_n^p$ is the teacher for the $n$th output unit for the $p$th pseudopattern. The second partial derivatives of $E$ are calculated as follows:

$$\begin{aligned}
\frac{\partial^2 E}{\partial w_i \partial w_j}
&= \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} \frac{\partial}{\partial w_i} \left( (y_n^p - t_n^p) \frac{\partial y_n^p}{\partial w_j} \right) \\
&= \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} \left( \frac{\partial y_n^p}{\partial w_i} \frac{\partial y_n^p}{\partial w_j} + (y_n^p - t_n^p) \frac{\partial^2 y_n^p}{\partial w_i \partial w_j} \right).
\end{aligned}$$

But we know from equation 6.1 that for pseudopatterns,

$$\forall p \ \forall n \quad (y_n^p - t_n^p) = 0,$$

and, therefore, the second term above is zero, giving

$$\frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} \frac{\partial y_n^p}{\partial w_i} \frac{\partial y_n^p}{\partial w_j}. \tag{6.2}$$
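Equation 6.2 says the pseudopattern Hessian can be accumulated from output gradients alone; combined with the update rule of equation 5.5, the mechanics can be sketched minimally. In the sketch below, a single linear unit $y = w \cdot x$ (for which $\partial y / \partial w_i = x_i$) stands in for the paper's sigmoidal MLP, and all sizes and constants are illustrative, not the paper's:

```python
import random

# Sketch of the HPBP mechanics on one linear unit: accumulate the Hessian
# from pseudopattern output gradients (eq. 6.2), then take a gradient step
# on the combined surface alpha * E_hat + E_P (eq. 5.5).
random.seed(0)

N = 4                       # number of weights (illustrative)
w0 = [0.5, -0.3, 0.8, 0.1]  # minimum reached after learning the original set
w = list(w0)

def output(weights, x):
    return sum(wi * xi for wi, xi in zip(weights, x))

# --- Hessian loop: flat-random pseudopatterns; dy/dw_i = x_i for this unit.
R = 100
H = [[0.0] * N for _ in range(N)]
for _ in range(R):
    x = [random.uniform(-1, 1) for _ in range(N)]
    for i in range(N):
        for j in range(N):
            H[i][j] += x[i] * x[j]   # (dy/dw_i)(dy/dw_j), summed over patterns

# --- Training step on one new pattern, per eq. 5.5 (no momentum here).
eta, alpha = 0.01, 8.0
x_new, t_new = [1.0, 0.0, -1.0, 0.5], 0.2
err = output(w, x_new) - t_new           # dE_P/dw_i = err * x_i (linear unit)
for i in range(N):
    hess_term = sum(H[i][j] * (w[j] - w0[j]) for j in range(N))
    w[i] -= eta * (alpha * hess_term + err * x_new[i])
print(w)
```

On the very first step the Hessian term vanishes (since $w = w_0$), so the update reduces to the plain delta rule; on subsequent steps the $\alpha H|_{w_0}(w - w_0)$ term pulls the weights back toward the old minimum while the new pattern's gradient pulls them toward the new one.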
(The precise terms of the pseudopattern-induced Hessian matrix are given in the appendix.) Interestingly, only first-derivative information is required in this pseudopattern-induced Hessian, which means that the complexity of this calculation is $O(N^2)$, where $N$ is the number of weights in the network. In short, from equations 5.3 and 6.2, we conclude that noise passing through the network is sufficient to approximate the error surface for the original patterns close to $w_0$. The pseudocode for the HPBP algorithm is shown below. We assume that the network has already learned a set of patterns, $P = \{P_i\}_{i=1}^{N}$, and is at a local minimum $w_0$ in weight space. The network must then learn a new data set, $Q = \{Q_i\}_{i=1}^{M}$. To create the Hessian, we use $R$ pseudopatterns:

Initialize the Hessian to 0.
Set network activation values to 0.
Hessian Loop:
  Put a random input vector through the network to produce a pseudopattern;
  Use these activation levels and network weight values to create a matrix of second-derivative values to be added to the Hessian $H|_{w_0}$;
  Exit Hessian Loop after R pseudopatterns;
Training Loop:
  For each pattern in Q, do:
    Feedforward pass;
    Error backpropagation, changing the weights according to equation 5.5, including momentum;
  When all patterns in Q are learned to criterion, exit Training Loop;
Test errors for all patterns in P.

7 Simulations
In order to show that the HPBP algorithm works, we performed two simulations, the first involving catastrophic forgetting and the second involving sequential learning.

7.1 Simulation 1: Catastrophic Forgetting. We created two sets of four patterns, P and Q. The two sets were intentionally designed to interfere maximally with one another (even though a network would have been able to learn all patterns in the combined set P ∪ Q). The network was trained first on P and then on Q. Once it had learned Q to criterion, we tested it to see how well it had remembered P. An 8-32-1 network was used for both BP and HPBP networks with learning rate 0.01, momentum 0.9, Fahlman offset 0.1, and a maximum weight-change step size of 0.075. For the HPBP network, we used 100 pseudopatterns, and because we wanted to give more weight to the approximated error surface associated with past learning, we set its weighting factor to 8. All results were averaged over 100 runs of the program.
7.1.1 Results. After learning the first set of patterns P, then Q, the standard backpropagation network produced an average error over all items in P of 0.80. (Thus, as intended, interference of the items in P by the items in Q was extremely severe.) By contrast, the HPBP network produced an average error for these previously learned items of only 0.38. Further, the HPBP network correctly generalized on 67.5% of the previously learned items, whereas the backpropagation network was able to generalize correctly on only 10.25% of the items in P. (See Figures 1a and 1b.) In addition, we measured the number of epochs required for both networks to relearn P. Not surprisingly, HPBP also relearned P to criterion in 42% fewer epochs than the BP network.
Figure 1: Catastrophic interference is significantly reduced with the HPBP algorithm. (a) Errors on the originally learned patterns in set P for BP and HPBP after learning set Q. (b) Correct generalization for the originally learned items for BP and HPBP after learning Q.
Although much work clearly remains to be done on this type of algorithm, we believe that these early results demonstrate that the HPBP algorithm can be very effective in reducing catastrophic interference.

7.2 Simulation 2: Sequential Learning. In order to test the HPBP algorithm further on a sequential learning task drawn from a real-world database, we selected the 1984 Congressional Voting Records database from the UCI repository (Murphy & Aha, 1992). Twenty members of Congress (10 Republicans, 10 Democrats, each "pattern" being defined as a yes-no voting record on 16 issues together with a party affiliation) were chosen randomly from the database and were learned sequentially by the network (each pattern was learned to criterion before the next pattern was presented to it). The network was then tested with both standard measures of forgetting on each of the 20 patterns. Both BP and HPBP algorithms used a 16-3-1 feedforward backpropagation network with a learning rate of 0.01, momentum of 0.9, a Fahlman offset of 0.1, and a maximum weight step of 0.09. For the HPBP network, the weighting parameter associated with the approximation of the original error surface was set to 3, and each time a new pattern was to be sequentially learned, 25 pseudopatterns were generated to calculate the Hessian and thereby to produce the approximation of the prior error surface. Specifically, the network learned the first pattern, P1, until the difference between target and output for the pattern was below 0.2. Then 25 pseudopatterns were generated, and the associated error surface was produced. The second pattern, P2, was then presented
Figure 2: Original learning of the 20 items. It is somewhat harder for the Hessian pseudopattern network to learn the first few patterns, given the "inertial" effect of $\hat{E}(w)$. Standard error bars show the evolution of the variance for both algorithms.

to the network. The new error surface induced by P2 was combined with the previously approximated error surface, and gradient descent was performed on this combined surface until the network had learned P2. Then 25 new pseudopatterns were generated to produce an approximation of this error surface. P3 was then presented to the network, and so on.

7.2.1 Results. First, we considered the extent to which the addition of the approximated error surface made the initial learning more difficult. Second, once the network had sequentially learned all 20 patterns, we measured the error for each of the previously learned patterns (the error measure of forgetting). Finally, we examined how difficult it was to relearn the original patterns (the savings measure of forgetting described above). All of the data reported were averaged over 100 runs of each algorithm. The order of presentation of the patterns and the initial weights of the networks were randomized at the beginning of each run.

7.2.2 Original Learning. Figure 2 shows that, on average, it is more difficult for HPBP to learn the first few patterns. Presumably, this is because before any learning of the patterns has occurred, $\hat{E}(w)$ defines an error surface that is quite unlike the error surface associated with that of any of the 20 patterns to be learned (because the network weights are initialized randomly). However, the average number of epochs required for learning the items with HPBP soon converges to that of BP.

7.2.3 Error Measure of Forgetting. We computed average and median error scores for both HPBP and BP for each item after each network had sequentially learned all 20 items. The influence of the noise-computed error surface in reducing these errors can be seen in Figures 3 and 4. All results
Figure 3: Average error measures for the 20 sequentially learned patterns. The average overall error for HPBP is 0.05 better than for standard BP. The learning criterion is 0.2.
Figure 4: Median errors for BP and HPBP for all items after sequential learning of the original patterns. For HPBP, all 20 patterns remain at or below the 0.2 learning criterion; only 9 of the 20 items are at or below criterion for BP.
were averaged over 100 runs. Standard error bars are shown for the values produced by each algorithm.

7.2.4 Savings Measure of Forgetting. The most striking improvement with respect to standard BP can be seen in the average relearning time for each of the individual items (see Figures 5 and 6). First, all 20 patterns are learned to criterion one after the other by the network. After the twentieth pattern has been learned, the network's weights, $w_{final}$, are saved. Then each of the first 19 items is tested to see how many epochs the network requires to relearn it. Specifically, the first pattern in the list is given to the network, and it relearns that pattern to criterion, the number of epochs required for relearning being recorded. The network weights are then reset to $w_{final}$. The
Figure 5: Average relearning times for all 20 patterns after the network has learned them sequentially. Standard error bars are shown for the means for each pattern in the sequence.
Figure 6: The proportion of patterns requiring no relearning is significantly higher for HPBP than for BP.
second pattern in the list is then relearned to criterion by the network and the number of epochs required to do so is recorded, and so on. Overall, relearning is about 45% faster for the patterns learned by the HPBP network compared to BP (30.8 epochs for BP versus 16.9 epochs for HPBP). In short, with HPBP, there is still forgetting, but it is shallower forgetting (i.e., relearning of the previously learned patterns is significantly easier for HPBP than for BP).
8 Discussion
It is clear that HPBP shows improved forgetting performance compared to standard backpropagation. However, there are a number of issues concerning the generality and computational complexity of this algorithm that must be addressed. We will briefly discuss the quality of the approximation of $E(w)$ by $\hat{E}(w)$ and whether HPBP would scale to larger networks. The quality of the approximation of $E(w)$ by $\hat{E}(w)$ is largely dependent on three factors: (1) the form of the original error surface, (2) the choice of pseudopatterns, and (3) the divergence from $w_0$. If we assume that the error surface close to $w_0$ is approximately quadratic, then in this neighborhood, we would need only as many points as there are degrees of freedom in the weights to determine the shape of the bowl. Since we assume a quadratic approximation, the number of pseudopatterns required to determine the approximation should scale with the number of weights as the network grows in size. Finally, the further we move from $w_0$, the less accurate the Taylor series expansion of $E$ will be. HPBP requires only first-order derivative information and thus has a complexity of $O(n^2)$, where $n$ is the number of weights. Further, there exists an $O(n)$ approximation of the Hessian (Le Cun, 1987; Becker & Le Cun, 1988) that could also be produced by pseudopatterns. This would ensure that scaling would be linear in the number of weights. This approximation of the Hessian has only diagonal elements, and as a result, weight changes in the HPBP algorithm would require only local information. Explorations of the scaling performance of the network with the diagonal approximation to the Hessian are needed.
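The diagonal approximation mentioned above can be sketched in the same pseudopattern framework: keep only the $H_{ii}$ entries, each of which, by equation 6.2, is a sum of squared output gradients. The sketch below again uses a linear unit ($\partial y / \partial w_i = x_i$) as an illustrative stand-in; sizes and the random seed are my own choices:

```python
import random

# Diagonal (O(n)) pseudopattern Hessian: H_ii = sum over pseudopatterns of
# (dy/dw_i)^2. Storage and per-weight updates then need only local
# information, so cost grows linearly with the number of weights.
random.seed(1)

N, R = 4, 50          # number of weights, number of pseudopatterns
H_diag = [0.0] * N
for _ in range(R):
    x = [random.uniform(-1, 1) for _ in range(N)]
    for i in range(N):
        H_diag[i] += x[i] * x[i]   # (dy/dw_i)^2 for this pseudopattern

# The prior-surface gradient component for weight i is then just
# alpha * H_diag[i] * (w[i] - w0[i]): no cross-terms, no N x N matrix.
print(H_diag)
```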
9 Conclusion
We have presented a connectionist learning algorithm that significantly improves network forgetting performance by turning noise to its advantage. A number of authors have shown that a certain amount of noise can enhance the performance of various systems in a wide range of contexts. For example, Linsker (1988) has shown that elementary perceptual detectors can emerge from noise. Numerous authors (e.g., Collins, Chow, & Imhoff, 1995; Grossberg & Grunewald, 1997; Sougné, 1998) have shown how the addition of noise to neural networks can enhance weak signal detection. In a neurobiological setting, Douglass, Wilkens, Pantazelou, and Moss (1993), Bezrukov and Vodyanoy (1995), and others have shown that optimal noise intensity in biological neurons can enhance signal detection. In this article, we have shown how noise can be harnessed to improve memory performance in feedforward backpropagation networks. We believe that this work, and the work by others on related problems, represents
Using Noise to Compute Error Surfaces
the tip of the iceberg in the exploration of how noise can be turned from a problem into a performance-enhancing advantage.

Appendix: Calculating the Precise Terms of the Pseudopattern-Induced Hessian Matrix
In order to approximate the error surface associated with the originally learned patterns by means of pseudopatterns, it is sufficient to calculate the terms of the Hessian matrix. The Hessian matrix evaluated at w0 is defined as follows:

$$
H|_{w_0} =
\begin{bmatrix}
\frac{\partial^2 E}{\partial w_1 \partial w_1} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_N} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_N \partial w_1} & \cdots & \frac{\partial^2 E}{\partial w_N \partial w_N}
\end{bmatrix}_{w_0}
$$

where w0 is the vector of weights (w1⁰, w2⁰, w3⁰, ..., wN⁰) that was a solution for the originally learned set of patterns. We have shown in equation 6.2 that for any two weights, wi and wj, in a network with Noutputs output nodes and for N_Ψ pseudopatterns, the (i, j)th term of the Hessian matrix is

$$
\frac{\partial^2 E}{\partial w_i \partial w_j}
= \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{\mathrm{outputs}}}
\frac{\partial y_n^p}{\partial w_i}\,\frac{\partial y_n^p}{\partial w_j}.
$$

For each pseudopattern and each output node (with output y) and for all pairs of weights wi and wj, we calculate

$$
\frac{\partial y}{\partial w_i}\,\frac{\partial y}{\partial w_j}.
$$

Each term of the Hessian matrix will be the sum over all output nodes and over all pseudopatterns. The notation conventions are as follows: y_a is the output from node a; y'_a is the first derivative of the squashing function, evaluated at y_a, and for the standard squashing function y = 1/(1 + e^{-x}) we have y'_a = y_a(1 − y_a); Noutputs is the number of output nodes; and w_ab, w_cd are the weights from node b to node a and from node d to node c. There are three cases of pairs of weights to consider:

Case I: When w_ab, w_cd ∈ hidden-output layer:

$$
\frac{\partial^2 E}{\partial w_{ab}\,\partial w_{cd}} =
\begin{cases}
(y'_a y_b)(y'_c y_d) = (y'_a)^2\, y_b\, y_d & \text{if } a = c \\
0 & \text{if } a \neq c
\end{cases}
$$
Case II: When w_ab ∈ input-hidden layer and w_cd ∈ hidden-output layer:

$$
\frac{\partial^2 E}{\partial w_{ab}\,\partial w_{cd}}
= (y'_a y_b)(y'_c y_d)(y'_a w_{ac})
= (y'_a)^2\, y'_c\, y_b\, y_d\, w_{ac}
$$
Case III: When w_ab, w_cd ∈ input-hidden layer:

$$
\frac{\partial^2 E}{\partial w_{ab}\,\partial w_{cd}}
= (y'_a y_b)(y'_c y_d) \sum_{i=1}^{N_{\mathrm{outputs}}} (y'_i w_{ia})(y'_i w_{ic})
= y'_a\, y'_c\, y_b\, y_d \sum_{i=1}^{N_{\mathrm{outputs}}} (y'_i)^2\, w_{ia}\, w_{ic}
$$
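The identity in equation 6.2 holds exactly at w0 because pseudopattern targets are defined as the network's own outputs there, so the residual term of the true Hessian vanishes. The following numpy sketch (our illustration; the tiny architecture and finite-difference checks are our own) verifies the outer-product formula against a numerical Hessian of the error:

```python
import numpy as np

# Sketch: targets of the pseudopatterns are the network's outputs at w0, so
# the residuals vanish there and equation 6.2 holds exactly:
# H_ij = sum over pseudopatterns and output nodes of (dy/dw_i)(dy/dw_j).
rng = np.random.default_rng(1)
w0 = rng.normal(scale=0.5, size=20)     # 3-4-2 sigmoid net: 12 + 8 weights

def unpack(w):
    return w[:12].reshape(4, 3), w[12:].reshape(2, 4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def outputs(w, x):
    W1, W2 = unpack(w)
    return sigmoid(W2 @ sigmoid(W1 @ x))

pseudo_inputs = rng.normal(size=(5, 3))             # 5 random pseudopatterns
targets = np.array([outputs(w0, x) for x in pseudo_inputs])

def error(w):
    return 0.5 * sum(np.sum((outputs(w, x) - t) ** 2)
                     for x, t in zip(pseudo_inputs, targets))

eps = 1e-4
n = w0.size

def jacobian(x):                                    # dy_n/dw_i by central differences
    J = np.zeros((2, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        J[:, i] = (outputs(w0 + e, x) - outputs(w0 - e, x)) / (2 * eps)
    return J

# Equation 6.2: sum of outer products of output gradients.
H_formula = sum(jacobian(x).T @ jacobian(x) for x in pseudo_inputs)

# Direct finite-difference estimate of d2E / dw_i dw_j at w0.
H_numeric = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = eps
        ej = np.zeros(n); ej[j] = eps
        H_numeric[i, j] = (error(w0 + ei + ej) - error(w0 + ei - ej)
                           - error(w0 - ei + ej) + error(w0 - ei - ej)) / (4 * eps**2)

assert np.allclose(H_formula, H_numeric, atol=1e-4)
```

The check is the Gauss-Newton observation: with zero residuals at w0, the sum of outer products of output Jacobians equals the full Hessian of the squared error.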
Acknowledgments
This work has been supported in part by a grant from the European Commission, HPRN-CT-1999-00065. We thank Anthony Robins and Dave Noelle for their discussions of the ideas of this work. Particular thanks go to Gary Cottrell and an anonymous reviewer whose insightful comments contributed significantly to the quality of this article.

References

Ans, B., & Rousset, S. (1997). Avoiding catastrophic forgetting by coupling two reverberating neural networks. Académie des Sciences, Sciences de la vie, 320, 989–997.
Ans, B., & Rousset, S. (2000). Neural networks with a self-refreshing memory: Knowledge transfer in sequential learning tasks without catastrophic forgetting. Connection Science, 12, 1–19.
Becker, S., & Le Cun, Y. (1988). Improving the convergence of back-propagation learning with second order methods. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Bezrukov, S., & Vodyanoy, I. (1995). Noise induced enhancement of signal transduction across voltage-dependent ion channels. Nature, 378, 362–364.
Bishop, C. (1991). A fast procedure for retraining the multilayer perceptron. International Journal of Neural Systems, 2(3), 229–236.
Collins, J., Chow, C., & Imhoff, T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Douglass, J., Wilkens, L., Pantazelou, E., & Moss, F. (1993). Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature, 365, 337–340.
French, R. M. (1997). Pseudo-recurrent connectionist networks: An approach to the "sensitivity-stability" dilemma. Connection Science, 9, 353–379.
French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135.
French, R. M., Ans, B., & Rousset, S. (2001). Pseudopatterns and dual-network memory models: Advantages and shortcomings. In R. French & J. Sougné (Eds.), Connectionist models of learning, development and evolution. London: Springer-Verlag.
Grossberg, S., & Grunewald, A. (1997). Cortical synchronization and perceptual framing. Journal of Cognitive Neuroscience, 9, 117–132.
Hetherington, P., & Seidenberg, M. (1989). Is there "catastrophic interference" in connectionist networks? In Proceedings of the 11th Annual Conference of the Cognitive Science Society (pp. 26–33). Hillsdale, NJ: Erlbaum.
Le Cun, Y. (1987). Modèles connexionnistes de l'apprentissage. Unpublished doctoral dissertation, Université Pierre et Marie Curie, Paris, France.
Linsker, R. (1988, March). Self-organization in a perceptual network. Computer, 105–117.
Murphy, P., & Aha, D. (1992). UCI repository of machine learning databases. Irvine, CA: University of California.
Robins, A. (1995). Catastrophic forgetting, rehearsal, and pseudorehearsal. Connection Science, 7, 123–146.
Sougné, J. (1998). Period doubling as a means of representing multiply-instantiated entities. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 1007–1012). Hillsdale, NJ: Erlbaum.

Received June 4, 2001; accepted January 7, 2002.
ARTICLE
Communicated by Javier Movellan
Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey E. Hinton
[email protected]
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence" whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
1 Introduction

One way of modeling a complicated, high-dimensional data distribution is to use a large number of relatively simple probabilistic models and somehow combine the distributions specified by each model. A well-known example of this approach is a mixture of gaussians in which each simple model is a gaussian, and the combination rule consists of taking a weighted arithmetic mean of the individual distributions. This is equivalent to assuming an overall generative model in which each data vector is generated by first choosing one of the individual generative models and then allowing that individual model to generate the data vector. Combining models by forming a mixture is attractive for several reasons. It is easy to fit mixtures of tractable models to data using expectation-maximization (EM) or gradient ascent, and mixtures are usually considerably more powerful than their individual components. Indeed, if sufficiently many models are included in

Neural Computation 14, 1771–1800 (2002) © 2002 Massachusetts Institute of Technology
the mixture, it is possible to approximate complicated smooth distributions arbitrarily accurately. Unfortunately, mixture models are very inefficient in high-dimensional spaces. Consider, for example, the manifold of face images. It takes about 35 real numbers to specify the shape, pose, expression, and illumination of a face, and under good viewing conditions, our perceptual systems produce a sharp posterior distribution on this 35-dimensional manifold. This cannot be done using a mixture of models, each tuned in the 35-dimensional manifold, because the posterior distribution cannot be sharper than the individual models in the mixture and the individual models must be broadly tuned to allow them to cover the 35-dimensional manifold.

A very different way of combining distributions is to multiply them together and renormalize. High-dimensional distributions, for example, are often approximated as the product of one-dimensional distributions. If the individual distributions are uni- or multivariate gaussians, their product will also be a multivariate gaussian so, unlike mixtures of gaussians, products of gaussians cannot approximate arbitrary smooth distributions. If, however, the individual models are a bit more complicated and each contains one or more latent (i.e., hidden) variables, multiplying their distributions together (and renormalizing) can be very powerful. Individual models of this kind will be called "experts." Products of experts (PoE) can produce much sharper distributions than the individual expert models. For example, each expert model can constrain a different subset of the dimensions in a high-dimensional space, and their product will then constrain all of the dimensions. For modeling handwritten digits, one low-resolution model can generate images that have the approximate overall shape of the digit, and other more local models can ensure that small image patches contain segments of stroke with the correct fine structure.
For modeling sentences, each expert can enforce a nugget of linguistic knowledge. For example, one expert could ensure that the tenses agree, one could ensure that there is number agreement between the subject and verb, and one could ensure that strings in which color adjectives follow size adjectives are more probable than the reverse. Fitting a PoE to data appears difficult because it appears to be necessary to compute the derivatives, with respect to the parameters, of the partition function that is used in the renormalization. As we shall see, however, these derivatives can be finessed by optimizing a less obvious objective function than the log likelihood of the data.
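The sharpening that a product provides can be seen already with two one-dimensional gaussian experts: their renormalized product is a gaussian whose precision is the sum of the factors' precisions. A numpy sketch (our illustration, with arbitrary variances):

```python
import numpy as np

# Sketch: multiplying two broad gaussians and renormalizing yields a
# distribution sharper than either factor.  For gaussians the product is
# again gaussian, with precisions (inverse variances) adding:
# 1/var = 1/var1 + 1/var2.
var1, var2 = 4.0, 25.0                 # two broadly tuned experts
var_prod = 1.0 / (1.0 / var1 + 1.0 / var2)
assert var_prod < min(var1, var2)      # the product is sharper than both

# Numerical check on a grid: normalize f1*f2 and measure its variance.
x = np.linspace(-30, 30, 20001)
dx = x[1] - x[0]
f1 = np.exp(-0.5 * x**2 / var1)
f2 = np.exp(-0.5 * x**2 / var2)
p = f1 * f2
p /= p.sum() * dx                      # renormalize the product
var_grid = (x**2 * p).sum() * dx
assert abs(var_grid - var_prod) < 1e-6
```

Note that the normalization here is done by brute force on a grid; the point of the contrastive divergence machinery developed below is precisely to avoid computing this partition function in high dimensions.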
2 Learning Products of Experts by Maximizing Likelihood

We consider individual expert models for which it is tractable to compute the derivative of the log probability of a data vector with respect to the
parameters of the expert. We combine n individual expert models as follows:

$$
p(\mathbf{d} \mid \theta_1, \ldots, \theta_n)
= \frac{\prod_m f_m(\mathbf{d} \mid \theta_m)}{\sum_{\mathbf{c}} \prod_m f_m(\mathbf{c} \mid \theta_m)},
\tag{2.1}
$$
where d is a data vector in a discrete space, θ_m is all the parameters of individual model m, f_m(d | θ_m) is the probability of d under model m, and c indexes all possible vectors in the data space.¹ For continuous data spaces, the sum is replaced by the appropriate integral. For an individual expert to fit the data well, it must give high probability to the observed data, and it must waste as little probability as possible on the rest of the data space. A PoE, however, can fit the data well even if each expert wastes a lot of its probability on inappropriate regions of the data space, provided different experts waste probability in different regions. The obvious way to fit a PoE to a set of observed independently and identically distributed (i.i.d.) data vectors² is to follow the derivative of the log likelihood of each observed vector, d, under the PoE. This is given by

$$
\frac{\partial \log p(\mathbf{d} \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m}
= \frac{\partial \log f_m(\mathbf{d} \mid \theta_m)}{\partial \theta_m}
- \sum_{\mathbf{c}} p(\mathbf{c} \mid \theta_1, \ldots, \theta_n)\,
\frac{\partial \log f_m(\mathbf{c} \mid \theta_m)}{\partial \theta_m}.
\tag{2.2}
$$
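Equations 2.1 and 2.2 can be exercised end to end on a toy discrete space. The sketch below (our illustration; the exponential-family experts f_m(d) = exp(θ_m · d) are made up for tractability) normalizes the product explicitly and checks the gradient of equation 2.2 against finite differences:

```python
import numpy as np

# Sketch: a tiny PoE over the 8 binary vectors in {0,1}^3.  Each expert m
# assigns f_m(d | theta_m) = exp(theta_m . d), so the PoE of equation 2.1 is
# p(d) = prod_m f_m(d) / sum_c prod_m f_m(c).  We verify the log-likelihood
# gradient of equation 2.2 against finite differences.
rng = np.random.default_rng(2)
space = np.array([[(v >> k) & 1 for k in range(3)] for v in range(8)], float)
thetas = rng.normal(size=(2, 3))        # two experts, three parameters each

def poe_probs(thetas):
    logits = np.array([sum(t @ c for t in thetas) for c in space])
    w = np.exp(logits - logits.max())
    return w / w.sum()

d = space[5]                            # an arbitrary "observed" data vector

# Equation 2.2: d log p / d theta_m = d log f_m(d)/d theta_m
#               - sum_c p(c) d log f_m(c)/d theta_m.
# For f_m = exp(theta_m . d), d log f_m / d theta_m is just the vector d.
p = poe_probs(thetas)
grad_analytic = d - p @ space           # gradient for expert m = 0

def log_p(thetas):
    return np.log(poe_probs(thetas)[5])

eps = 1e-6
grad_numeric = np.zeros(3)
for i in range(3):
    tp = thetas.copy(); tp[0, i] += eps
    tm = thetas.copy(); tm[0, i] -= eps
    grad_numeric[i] = (log_p(tp) - log_p(tm)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```

The explicit sum over all 8 vectors is exactly the partition function that becomes intractable in realistic data spaces, which is why the sampling schemes discussed next are needed.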
The second term on the right-hand side of equation 2.2 is just the expected derivative of the log probability of an expert on fantasy data, c, that is generated from the PoE. So assuming that each of the individual experts has a tractable derivative, the obvious difficulty in estimating the derivative of the log probability of the data under the PoE is generating correctly distributed fantasy data. This can be done in various ways. For discrete data, it is possible to use rejection sampling. Each expert generates a data vector independently, and this process is repeated until all the experts happen to agree. Rejection sampling is a good way of understanding how a PoE specifies an overall probability distribution and how different it is from a causal model, but it is typically very inefficient. Gibbs sampling is typically much more efficient. In Gibbs sampling, each variable draws a sample from its posterior distribution given the current states of the other variables (Geman & Geman, 1984). Given the data, the hidden states of all the experts can always be updated in parallel because they are conditionally independent. This is an important consequence of the product formulation.³ If the

¹ So long as f_m(d | θ_m) is positive, it does not need to be a probability at all, though it will generally be a probability in this article.
² For time-series models, d is a whole sequence.
³ The conditional independence is obvious in the undirected graphical model of a PoE because the only path between the hidden states of two experts is via the observed data.
Figure 1: A visualization of alternating Gibbs sampling. At time 0, the visible variables represent a data vector, and the hidden variables of all the experts are updated in parallel with samples from their posterior distribution given the visible variables. At time 1, the visible variables are all updated to produce a reconstruction of the original data vector from the hidden variables, and then the hidden variables are again updated in parallel. If this process is repeated sufficiently often, it is possible to get arbitrarily close to the equilibrium distribution. The correlations ⟨s_i s_j⟩ shown on the connections between visible and hidden variables are the statistics used for learning in RBMs, which are described in section 7.
individual experts also have the property that the components of the data vector are conditionally independent given the hidden state of the expert, the hidden and visible variables form a bipartite graph, and it is possible to update all of the components of the data vector in parallel given the hidden states of all the experts. So Gibbs sampling can alternate between parallel updates of the hidden and visible variables (see Figure 1). To get an unbiased estimate of the gradient for the PoE, it is necessary for the Markov chain to converge to the equilibrium distribution.

Unfortunately, even if it is computationally feasible to approach the equilibrium distribution before taking samples, there is a second, serious difficulty. Samples from the equilibrium distribution generally have high variance since they come from all over the model's distribution. This high variance swamps the estimate of the derivative. Worse still, the variance in the samples depends on the parameters of the model. This variation in the variance causes the parameters to be repelled from regions of high variance even if the gradient is zero. To understand this subtle effect, consider a horizontal sheet of tin that is resonating in such a way that some parts have strong vertical oscillations and other parts are motionless. Sand scattered on the tin will accumulate in the motionless areas even though the time-averaged gradient is zero everywhere.

3 Learning by Minimizing Contrastive Divergence

Maximizing the log likelihood of the data (averaged over the data distribution) is equivalent to minimizing the Kullback-Leibler divergence between the data distribution, P⁰, and the equilibrium distribution over the visible variables, P_θ^∞, that is produced by prolonged Gibbs sampling from the generative model:⁴

$$
P^0 \,\|\, P_\theta^\infty
= \sum_{\mathbf{d}} P^0(\mathbf{d}) \log P^0(\mathbf{d})
- \sum_{\mathbf{d}} P^0(\mathbf{d}) \log P_\theta^\infty(\mathbf{d})
= -H(P^0) - \left\langle \log P_\theta^\infty \right\rangle_{P^0},
\tag{3.1}
$$
where ‖ denotes a Kullback-Leibler divergence, the angle brackets denote expectations over the distribution specified as a subscript, and H(P⁰) is the entropy of the data distribution. P⁰ does not depend on the parameters of the model, so H(P⁰) can be ignored during the optimization. Note that P_θ^∞(d) is just another way of writing p(d | θ₁, ..., θ_n). Equation 2.2, averaged over the data distribution, can be rewritten as

$$
\left\langle \frac{\partial \log P_\theta^\infty(\mathbf{D})}{\partial \theta_m} \right\rangle_{P^0}
= \left\langle \frac{\partial \log \hat{f}_m}{\partial \theta_m} \right\rangle_{P^0}
- \left\langle \frac{\partial \log \hat{f}_m}{\partial \theta_m} \right\rangle_{P_\theta^\infty},
\tag{3.2}
$$

where log f̂_m is a random variable that could be written as log f_m(D | θ_m), with D itself being a random variable corresponding to the data. There is a simple and effective alternative to maximum likelihood learning that eliminates almost all of the computation required to get samples from the equilibrium distribution and also eliminates much of the variance that masks the gradient signal. This alternative approach involves optimizing a different objective function. Instead of just minimizing P⁰ ‖ P_θ^∞, we minimize the difference between P⁰ ‖ P_θ^∞ and P_θ^1 ‖ P_θ^∞, where P_θ^1 is the distribution over the "one-step" reconstructions of the data vectors generated by one full step of Gibbs sampling (see Figure 1). The intuitive motivation for using this "contrastive divergence" is that we would like the Markov chain that is implemented by Gibbs sampling to leave the initial distribution P⁰ over the visible variables unaltered. Instead of running the chain to equilibrium and comparing the initial and final derivatives, we can simply run the chain for one full step and then update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step. Because P_θ^1 is one step closer to the equilibrium distribution than P⁰, we are guaranteed that P⁰ ‖ P_θ^∞ exceeds P_θ^1 ‖ P_θ^∞ unless P⁰ equals P_θ^1, so the contrastive divergence can never be negative. Also, for Markov chains in which all transitions have nonzero probability, P⁰ = P_θ^1 implies P⁰ = P_θ^∞, because if the distribution does not

⁴ P⁰ is a natural way to denote the data distribution if we imagine starting a Markov chain at the data distribution at time 0.
change at all on the first step, it must already be at equilibrium, so the contrastive divergence can be zero only if the model is perfect.⁵ Another way of understanding contrastive divergence learning is to view it as a method of eliminating all the ways in which the PoE model would like to distort the true data. This is done by ensuring that, on average, the reconstruction is no more probable under the PoE model than the original data vector. The mathematical motivation for the contrastive divergence is that the intractable expectation over P_θ^∞ on the right-hand side of equation 3.2 cancels out:

$$
-\frac{\partial}{\partial \theta_m}
\left( P^0 \,\|\, P_\theta^\infty - P_\theta^1 \,\|\, P_\theta^\infty \right)
= \left\langle \frac{\partial \log \hat{f}_m}{\partial \theta_m} \right\rangle_{P^0}
- \left\langle \frac{\partial \log \hat{f}_m}{\partial \theta_m} \right\rangle_{P_\theta^1}
+ \frac{\partial P_\theta^1}{\partial \theta_m}
\frac{\partial \left( P_\theta^1 \,\|\, P_\theta^\infty \right)}{\partial P_\theta^1}.
\tag{3.3}
$$
If each expert is chosen to be tractable, it is possible to compute the exact values of the derivative of log f_m(d | θ_m) for a data vector, d. It is also straightforward to sample from P⁰ and P_θ^1, so the first two terms on the right-hand side of equation 3.3 are tractable. By definition, the following procedure produces an unbiased sample from P_θ^1:

1. Pick a data vector, d, from the distribution of the data P⁰.
2. Compute, for each expert separately, the posterior probability distribution over its latent (i.e., hidden) variables given the data vector, d.
3. Pick a value for each latent variable from its posterior distribution.
4. Given the chosen values of all the latent variables, compute the conditional distribution over all the visible variables by multiplying together the conditional distributions specified by each expert and renormalizing.
5. Pick a value for each visible variable from the conditional distribution. These values constitute the reconstructed data vector, d̂.

The third term on the right-hand side of equation 3.3 represents the effect on P_θ^1 ‖ P_θ^∞ of the change of the distribution of the one-step reconstructions caused by a change in θ_m. It is problematic to compute, but extensive simulations (see section 10) show that it can safely be ignored because it is small

⁵ It is obviously possible to make the contrastive divergence small by using a Markov chain that mixes very slowly, even if the data distribution is very far from the eventual equilibrium distribution. It is therefore important to ensure mixing by using techniques such as weight decay that ensure that every possible visible vector has a nonzero probability given the states of the hidden variables.
and seldom opposes the resultant of the other two terms. The parameters of the experts can therefore be adjusted in proportion to the approximate derivative of the contrastive divergence:

$$
\Delta \theta_m \propto
\left\langle \frac{\partial \log \hat{f}_m}{\partial \theta_m} \right\rangle_{P^0}
- \left\langle \frac{\partial \log \hat{f}_m}{\partial \theta_m} \right\rangle_{P_\theta^1}.
\tag{3.4}
$$
This works very well in practice even when a single reconstruction of each data vector is used in place of the full probability distribution over reconstructions. The difference in the derivatives of the data vectors and their reconstructions has some variance because the reconstruction procedure is stochastic. But when the PoE is modeling the data moderately well, the one-step reconstructions will be very similar to the data, so the variance will be very small. The close match between a data vector and its reconstruction reduces sampling variance in much the same way as the use of matched pairs for experimental and control conditions in a clinical trial. The low variance makes it feasible to perform on-line learning after each data vector is presented, though the simulations described in this article use mini-batch learning in which the parameter updates are based on the summed gradients measured on a rotating subset of the complete training set.⁶ There is an alternative justification for the learning algorithm in equation 3.4. In high-dimensional data sets, the data nearly always lie on, or close to, a much lower-dimensional, smoothly curved manifold. The PoE needs to find parameters that make a sharp ridge of log probability along the low-dimensional manifold. By starting with a point on the manifold and ensuring that the typical reconstructions from the latent variables of all the experts do not have significantly higher probability, the PoE ensures that the probability distribution has the right local structure. It is possible that the PoE will accidentally assign high probability to other distant and unvisited parts of the data space, but this is unlikely if the log probability surface is smooth and both its height and its local curvature are constrained at the data points.
It is also possible to find and eliminate such points by performing prolonged Gibbs sampling without any data, but this is just a way of improving the learning and not, as in Boltzmann machine learning, an essential part of it.

4 A Simple Example

PoEs should work very well on data distributions that can be factorized into a product of lower-dimensional distributions. This is demonstrated in Figures 2 and 3. There are 15 "unigauss" experts, each of which is a

⁶ Mini-batch learning makes better use of the ability of Matlab to vectorize across training examples.
Figure 2: Each dot is a data point. The data have been fitted with a product of 15 experts. The ellipses show the one standard deviation contours of the gaussians in each expert. The experts are initialized with randomly located, circular gaussians that have about the same variance as the data. The five unneeded experts remain vague, but the mixing proportions, which determine the prior probability with which each of these unigauss experts selects its gaussian rather than its uniform, remain high.
mixture of a uniform distribution and a single axis-aligned gaussian. In the fitted model, each tight data cluster is represented by the intersection of two experts' gaussians, which are elongated along different axes. Using a conservative learning rate, the fitting required 2000 updates of the parameters. For each update of the parameters, the following computation is performed on every observed data vector:

1. Given the data, d, calculate the posterior probability of selecting the gaussian rather than the uniform in each expert and compute the first term on the right-hand side of equation 3.4.
2. For each expert, stochastically select the gaussian or the uniform according to the posterior. Compute the normalized product of the selected gaussians, which is itself a gaussian, and sample from it to get a "reconstructed" vector in the data space. To avoid problems, there is one special expert that is constrained to always pick its gaussian.
3. Compute the negative term in equation 3.4 using the reconstructed data vector.

5 Learning a Population Code

A PoE can also be a very effective model when each expert is quite broadly tuned on every dimension and precision is obtained by the intersection of a
Figure 3: Three hundred data points generated by prolonged Gibbs sampling from the 15 experts fitted in Figure 2. The Gibbs sampling started from a random point in the range of the data and used 25 parallel iterations with annealing. Notice that the fitted model generates data at the grid point that is missing in the real data.
large number of experts. Figure 4 shows what happens when the contrastive divergence learning algorithm is used to fit 40 unigauss experts to 100-dimensional synthetic images that each contain one edge. The edges varied in their orientation, position, and intensities on each side of the edge. The intensity profile across the edge was a sigmoid. Each expert also learned a variance for each pixel, and although these variances varied, individual experts did not specialize in a small subset of the dimensions. Given an image, about half of the experts have a high probability of picking their gaussian rather than their uniform. The products of the chosen gaussians are excellent reconstructions of the image. The experts at the top of Figure 4 look like edge detectors in various orientations, positions, and polarities. Many of the experts farther down have even symmetry and are used to locate one end of an edge. They each work for two different sets of edges that have opposite polarities and different positions.
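The reconstruction machinery shared by sections 4 and 5 (stochastically selecting each expert's gaussian and sampling from the normalized product of the selected gaussians) can be sketched as follows; the 2-D parameters, box size, and class structure are our own invention:

```python
import numpy as np

# Sketch of the reconstruction step for "unigauss" experts: each expert is a
# mixture of a uniform over a box and an axis-aligned gaussian.  Given a data
# vector, each expert stochastically picks gaussian or uniform from its
# posterior; the normalized product of the selected gaussians is itself a
# gaussian (precisions add, means are precision-weighted).
rng = np.random.default_rng(4)

class Unigauss:
    def __init__(self, mean, var, mix=0.5, box=10.0):
        self.mean, self.var, self.mix = np.array(mean), np.array(var), mix
        self.uniform_density = 1.0 / box ** 2     # constant density over the box

    def gauss_density(self, d):
        z = np.prod(np.sqrt(2 * np.pi * self.var))
        return np.exp(-0.5 * np.sum((d - self.mean) ** 2 / self.var)) / z

    def picks_gaussian(self, d, rng):
        pg = self.mix * self.gauss_density(d)
        pu = (1 - self.mix) * self.uniform_density
        return rng.random() < pg / (pg + pu)      # posterior over the two branches

experts = [Unigauss([0, 0], [0.1, 9.0]),          # elongated along y
           Unigauss([0, 0], [9.0, 0.1]),          # elongated along x
           Unigauss([0, 0], [25.0, 25.0])]        # a vague, "unneeded" expert

d = np.array([0.2, -0.1])
# One special expert is constrained to always pick its gaussian, as in the text.
chosen = [experts[0]] + [e for e in experts[1:] if e.picks_gaussian(d, rng)]

# Product of the selected gaussians: precisions add, means precision-weighted.
prec = sum(1.0 / e.var for e in chosen)
mean = sum(e.mean / e.var for e in chosen) / prec
reconstruction = rng.normal(mean, np.sqrt(1.0 / prec))
assert reconstruction.shape == (2,)
```

Because precisions add, the product gaussian is at least as sharp as the sharpest selected expert along every axis, which is how the intersection of broadly tuned experts yields a tight reconstruction.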
Figure 4: (a) Some 10 × 10 images that each contain a single intensity edge. The location, orientation, and contrast of the edge all vary. (b) The means of all the 100-dimensional gaussians in a product of 40 experts, each of which is a mixture of a gaussian and a uniform. The PoE was fitted to 500 images of the type shown on the left. The experts have been ordered by hand so that qualitatively similar experts are adjacent.
7 PoEs and Boltzmann Machines

The Boltzmann machine learning algorithm (Hinton & Sejnowski, 1986) is theoretically elegant and easy to implement in hardware but very slow in networks with interconnected hidden units because of the variance problems described in section 2. Smolensky (1986) introduced a restricted type of Boltzmann machine with one visible layer, one hidden layer, and no intralayer connections. Freund and Haussler (1992) realized that in this restricted Boltzmann machine (RBM), the probability of generating a visible vector is proportional to the product of the probabilities that the visible vector would be generated by each of the hidden units acting alone. An RBM is therefore a PoE with one expert per hidden unit.⁷ When the hidden unit
⁷ Boltzmann machines and PoEs are very different classes of probabilistic generative model, and the intersection of the two classes is RBMs.
of an expert is off, it specifies a separable probability distribution in which each visible unit is equally likely to be on or off. When the hidden unit is on, it specifies a different factorial distribution by using the weight on its connection to each visible unit to specify the log odds that the visible unit is on. Multiplying together the distributions over the visible states specified by different experts is achieved by simply adding the log odds. Exact inference of the hidden states given the visible data is tractable in an RBM because the states of the hidden units are conditionally independent given the data. The learning algorithm given by equation 2.2 is exactly equivalent to the standard Boltzmann learning algorithm for an RBM. Consider the derivative of the log probability of the data with respect to the weight w_ij between a visible unit i and a hidden unit j. The first term on the right-hand side of equation 2.2 is

$$
\frac{\partial \log f_j(\mathbf{d} \mid \mathbf{w}_j)}{\partial w_{ij}}
= \langle s_i s_j \rangle_{\mathbf{d}}
- \langle s_i s_j \rangle_{P_\theta^\infty(j)},
\tag{7.1}
$$
where w_j is the vector of weights connecting hidden unit j to the visible units, ⟨s_i s_j⟩_d is the expected value of s_i s_j when d is clamped on the visible units and s_j is sampled from its posterior distribution given d, and ⟨s_i s_j⟩_{P_θ^∞(j)} is the expected value of s_i s_j when alternating Gibbs sampling of the hidden and visible units is iterated to get samples from the equilibrium distribution in a network whose only hidden unit is j. The second term on the right-hand side of equation 2.2 is

$$
\sum_{\mathbf{c}} p(\mathbf{c} \mid \mathbf{w})\,
\frac{\partial \log f_j(\mathbf{c} \mid \mathbf{w}_j)}{\partial w_{ij}}
= \langle s_i s_j \rangle_{P_\theta^\infty}
- \langle s_i s_j \rangle_{P_\theta^\infty(j)},
\tag{7.2}
$$
where w is all of the weights in the RBM and ⟨s_i s_j⟩_{P_θ^∞} is the expected value of s_i s_j when alternating Gibbs sampling of all the hidden and all the visible units is iterated to get samples from the equilibrium distribution of the RBM. Subtracting equation 7.2 from equation 7.1 and taking expectations over the distribution of the data gives

$$
\left\langle \frac{\partial \log P_\theta^\infty}{\partial w_{ij}} \right\rangle_{P^0}
= -\frac{\partial \left( P^0 \,\|\, P_\theta^\infty \right)}{\partial w_{ij}}
= \langle s_i s_j \rangle_{P^0} - \langle s_i s_j \rangle_{P_\theta^\infty}.
\tag{7.3}
$$
The time required to approach equilibrium and the high sampling variance in ⟨s_i s_j⟩_{P_θ^∞} make learning difficult. It is much more effective to use the approximate gradient of the contrastive divergence. For an RBM, this approximate gradient is particularly easy to compute:

$$
-\frac{\partial}{\partial w_{ij}}
\left( P^0 \,\|\, P_\theta^\infty - P_\theta^1 \,\|\, P_\theta^\infty \right)
\approx \langle s_i s_j \rangle_{P^0} - \langle s_i s_j \rangle_{P_\theta^1},
\tag{7.4}
$$
where $\langle s_i s_j \rangle_{P_\theta^1}$ is the expected value of $s_i s_j$ when one-step reconstructions are clamped on the visible units and $s_j$ is sampled from its posterior distribution given the reconstruction (see Figure 1).

8 Learning the Features of Handwritten Digits

When presented with real high-dimensional data, a restricted Boltzmann machine trained to minimize the contrastive divergence using equation 7.4 should learn a set of probabilistic binary features that model the data well. To test this conjecture, an RBM with 500 hidden units and 256 visible units was trained on 8000 16 × 16 real-valued images of handwritten digits from all 10 classes. The images, from the “br” training set on the USPS Cedar ROM 1, were normalized in width and height, but they were highly variable in style. The pixel intensities were normalized to lie between 0 and 1 so that they could be treated as probabilities, and equation 7.4 was modified to use probabilities in place of stochastic binary values for both the data and the one-step reconstructions:
$$-\frac{\partial}{\partial w_{ij}} \left( P^0 \,\|\, P_\theta^\infty - P_\theta^1 \,\|\, P_\theta^\infty \right) \approx \langle p_i p_j \rangle_{P^0} - \langle p_i p_j \rangle_{P_\theta^1}. \tag{8.1}$$
Stochastically chosen binary states of the hidden units were still used for computing the probabilities of the reconstructed pixels, but instead of picking binary states for the pixels from those probabilities, the probabilities themselves were used as the reconstructed data vector. It takes 10 hours in Matlab 5.3 on a 500 MHz Pentium II workstation to perform 658 epochs of learning. This is much faster than standard Boltzmann machine learning, comparable with the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995), and considerably slower than using EM to fit a mixture model with the same number of parameters. In each epoch, the weights were updated 80 times using the approximate gradient of the contrastive divergence computed on mini-batches of size 100 that contained 10 exemplars of each digit class. The learning rate was set empirically to be about one-quarter of the rate that caused divergent oscillations in the parameters. To improve the learning speed further, a momentum method was used. After the first 10 epochs, the parameter updates specified by equation 8.1 were supplemented by adding 0.9 times the previous update. The PoE learned localized features whose binary states yielded almost perfect reconstructions. For each image, about one-third of the features were turned on. Some of the learned features had on-center off-surround receptive fields or vice versa, some looked like pieces of stroke, and some looked like Gabor filters or wavelets. The weights of 100 of the hidden units, selected at random, are shown in Figure 5.
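The mini-batch CD-1 procedure described above (probabilities in place of binary values for the data and the one-step reconstructions, with momentum) can be sketched in NumPy. This is a minimal sketch, not the original Matlab code; the function names, the learning rate, and the omission of biases are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, data, rng):
    """One CD-1 gradient estimate in the style of equation 8.1: probabilities
    are used in place of stochastic binary values for the data and the
    one-step reconstructions."""
    # Up-pass: hidden probabilities given the (probability-valued) data.
    h_prob = sigmoid(data @ W)
    # Stochastic binary hidden states are still used for reconstruction.
    h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Down-pass: reconstructed pixel probabilities, used as the data vector.
    recon = sigmoid(h_state @ W.T)
    # Hidden probabilities given the reconstruction.
    h_recon = sigmoid(recon @ W)
    # <p_i p_j>_{P^0}  -  <p_i p_j>_{P^1}
    return (data.T @ h_prob - recon.T @ h_recon) / data.shape[0]

def train(data, n_hidden=500, epochs=10, lr=0.01, momentum=0.9, batch=100, seed=0):
    """Mini-batch CD-1 training with a momentum method (biases omitted)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    dW = np.zeros_like(W)
    for _ in range(epochs):
        rng.shuffle(data)                      # shuffle rows in place
        for i in range(0, len(data), batch):
            grad = cd1_update(W, data[i:i + batch], rng)
            dW = momentum * dW + lr * grad     # 0.9 times the previous update
            W += dW
    return W
```

In the experiment reported above, momentum was switched on only after the first 10 epochs; the sketch applies it throughout for brevity.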
Figure 5: The receptive fields of a randomly selected subset of the 500 hidden units in a PoE that was trained on 8000 images of digits with equal numbers from each class. Each block shows the 256 learned weights connecting a hidden unit to the pixels. The scale goes from +2 (white) to −2 (black).
9 Using Learned Models of Handwritten Digits for Discrimination

An attractive aspect of PoEs is that it is easy to compute the numerator in equation 2.1, so it is easy to compute the log probability of a data vector up to an additive constant, log Z, which is the log of the denominator in equation 2.1. Unfortunately, it is very hard to compute this additive constant. This does not matter if we want to compare only the probabilities of two different data vectors under the PoE, but it makes it difficult to evaluate the model learned by a PoE, because the obvious way to measure the success of learning is to sum the log probabilities that the PoE assigns to test data vectors drawn from the same distribution as the training data but not used during training. For a novelty detection task, it would be possible to train a PoE on “normal” data and then to learn a single scalar threshold value for the unnormalized log probabilities in order to optimize discrimination between “normal” data and a few abnormal cases. This makes good use of the ability of a PoE to perform unsupervised learning. It is equivalent to naive Bayesian classification using a very primitive separate model of abnormal cases that simply returns the same learned log probability for all abnormal cases. An alternative way to evaluate the learning procedure is to learn two different PoEs on different data sets such as images of the digit 2 and images of the digit 3. After learning, a test image, t, is presented to PoE2 and PoE3, and they compute $\log p(\mathbf{t} \mid \theta_2) + \log Z_2$ and $\log p(\mathbf{t} \mid \theta_3) + \log Z_3$, respectively. If the difference between $\log Z_2$ and $\log Z_3$ is known, it is easy to pick the most likely class of the test image, and since this difference is only a single number, it is quite easy to estimate it discriminatively using a set of validation images whose labels are known. If discriminative performance is the only goal and all the relevant classes are known in advance, it is probably sensible to train all the parameters of a system discriminatively. In this article, however, discriminative performance on the digits is used simply as a way of demonstrating that unsupervised PoEs learn very good models of the individual digit classes. Figure 6 shows features learned by a PoE that contains a layer of 100 hidden units and is trained on 800 images of the digit 2. Figure 7 shows some previously unseen test images of 2s and their one-step reconstructions from the binary activities of the PoE trained on 2s and from an identical PoE trained on 3s. Figure 8 shows the unnormalized log probability scores of some training and test images under a model trained on 825 images of the digit 4 and a model trained on 825 images of the digit 6.
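Because only the single scalar difference between log Z2 and log Z3 is needed, it can be estimated from labeled validation scores, for example by a one-dimensional search over decision offsets. This is a hypothetical sketch: `score2` and `score3` stand for the unnormalized log probabilities the two PoEs assign, and `best_offset` is an illustrative name, not a routine from the paper:

```python
import numpy as np

def best_offset(score2, score3, labels):
    """Pick the offset b that maximizes validation accuracy of the rule
    'classify as digit 2 when score2 - score3 + b > 0'.  The offset b
    plays the role of the unknown constant log Z3 - log Z2."""
    diffs = score2 - score3          # unnormalized log odds, up to a constant
    # Candidate thresholds: just below every observed difference.
    candidates = np.concatenate(([-np.inf], np.sort(diffs)))
    best_b, best_acc = 0.0, -1.0
    for t in candidates:
        pred = diffs > t             # equivalent to diffs + (-t) > 0
        acc = np.mean(pred == labels)
        if acc > best_acc:
            best_b, best_acc = -t, acc
    return best_b

def classify(score2, score3, b):
    """True means the test image is assigned to the class of PoE2."""
    return score2 - score3 + b > 0
```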
Unfortunately, the official test set for the USPS digits violates the standard assumption that test data should be drawn from the same distribution as the training data, so here the test images were drawn from the unused portion of the official training set. Even for the previously unseen test images, the scores under the two models allow perfect discrimination. To achieve this excellent separation, it was necessary to use models with two hidden layers and to average the scores from two separately trained models of each digit class. For each digit class, one model had 200 units in its first hidden layer and 100 in its second hidden layer. The other model had 100 in the first hidden layer and 50 in the second. The units in the first hidden layer were trained without regard to the second hidden layer. After training the first hidden layer, the second hidden layer was trained using the probabilities of feature activation in the first hidden layer as the data. For testing, the scores provided by the two hidden layers were simply added together. Omitting the score provided by the second hidden layer leads to considerably worse separation of the digit classes on test data. Figure 9 shows the unnormalized log probability scores for previously unseen test images of 7s and 9s, which are the most difficult classes to discriminate. Discrimination is not perfect, but it is encouraging that all
Figure 6: The weights learned by 100 hidden units trained on 16 × 16 images of the digit 2. The scale goes from +3 (white) to −3 (black). Note that the fields are mostly quite local. A local feature like the one in column 1, row 7 looks like an edge detector, but it is best understood as a local deformation of a template. Suppose that all the other active features create an image of a 2 that differs from the data in having a large loop whose top falls on the black part of the receptive field. By turning on this feature, the top of the loop can be removed and replaced by a line segment that is a little lower in the image.
of the errors are close to the decision boundary, so there are no confident misclassifications.

9.1 Dealing with Multiple Classes. If there are 10 different PoEs for the 10 digit classes, it is slightly less obvious how to use the 10 unnormalized scores of a test image for discrimination. One possibility is to use a validation set to train a logistic discrimination network that takes the unnormalized log probabilities given by the PoEs and converts them into a probability distribution across the 10 labels. Figure 10 shows the weights in a logistic
Figure 7: The center row is previously unseen images of 2s. (Top) Pixel probabilities when the image is reconstructed from the binary activities of 100 feature detectors that have been trained on 2s. (Bottom) Pixel probabilities for reconstructions from the binary states of 100 feature detectors trained on 3s.

[Figure 8 panels: (a) log scores of both models on training data; (b) log scores under both models on test data. Each panel plots the score under model 4 (horizontal axis) against the score under model 6 (vertical axis), both ranging from 100 to 900.]
Figure 8: (a) The unnormalized log probability scores of the training images of the digits 4 and 6 under the learned PoEs for 4 and 6. (b) The log probability scores for previously unseen test images of 4s and 6s. Note the good separation of the two classes.
discrimination network that is trained after fitting 10 PoE models to the 10 separate digit classes. In order to see whether the second hidden layers were providing useful discriminative information, each PoE provided two scores. The first score was the unnormalized log probability of the pixel intensities under a PoE model that consisted of the units in the first hidden layer. The second score was the unnormalized log probability of the probabilities of activation of the first layer of hidden units under a PoE model that consisted of the units in the second hidden layer. The weights in Figure 10 show that the second layer of hidden units provides useful additional information.
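The discrimination step amounts to plain multinomial logistic regression on the score vectors. This is a minimal sketch, assuming the scores have already been computed by the PoEs (in the experiment each of the 10 PoEs contributes two scores, giving a 20-dimensional input; the gradient-ascent fitting loop and its settings are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_logistic(scores, labels, n_classes=10, lr=0.1, steps=2000):
    """Multinomial logistic discrimination: inputs are the unnormalized log
    probability scores from the digit-specific PoEs, outputs are the labels.
    A constant input column is appended so the last row of W holds biases."""
    X = np.hstack([scores, np.ones((len(scores), 1))])
    Y = np.eye(n_classes)[labels]            # one-hot targets
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(steps):
        P = softmax(X @ W)
        W += lr * X.T @ (Y - P) / len(X)     # gradient of the log likelihood
    return W

def predict(scores, W):
    X = np.hstack([scores, np.ones((len(scores), 1))])
    return np.argmax(X @ W, axis=1)
```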
Figure 9: The unnormalized log probability scores of the previously unseen test images of 7s and 9s. Although the classes are not linearly separable, all the errors are close to the best separating line, so there are no confident errors.
Presumably this is because it captures the way in which features represented in the first hidden layer are correlated. A second-level model should be able to assign high scores to the vectors of hidden activities that are typical of the first-level 2 model when it is given images of 2s and low scores to the hidden activities of the first-level 2 model when it is given images that contain combinations of features that are not normally present at the same time in a 2. The error rate is 1.1%, which compares very favorably with the 5.1% error rate of a simple nearest-neighbor classifier on the same training and test sets and is about the same as the very best classifier based on elastic models of the digits (Revow, Williams, & Hinton, 1996). If 7% rejects are allowed (by choosing an appropriate threshold for the probability level of the most probable class), there are no errors on the 2750 test images. Several different network architectures were tried for the digit-specific PoEs, and the results reported are for the architecture that did best on the test data. Although this is typical of research on learning algorithms, the fact that test data were used for model selection means that the reported results are a biased estimate of the performance on genuinely unseen test images. Mayraz and Hinton (2001) report good comparative results for the larger MNIST database, and they were careful to do all the model selection using subsets of the training data so that the official test data were used only to measure the final error rate.
Figure 10: The weights learned by doing multinomial logistic discrimination on the training data with the labels as outputs and the unnormalized log probability scores from the trained, digit-specific PoEs as inputs. Each column corresponds to a digit class, starting with digit 1. The top row is the biases for the classes. The next 10 rows are the weights assigned to the scores that represent the log probability of the pixels under the model learned by the first hidden layer of each PoE. The last 10 rows are the weights assigned to the scores that represent the log probabilities of the probabilities on the first hidden layer under the model learned by the second hidden layer. Note that although the weights in the last 10 rows are smaller, they are still quite large, which shows that the scores from the second hidden layers provide useful, additional discriminative information. Note also that the scores produced by the 2 model provide useful evidence in favor of 7s.
10 How Good Is the Approximation?

The fact that the learning procedure in equation 3.4 gives good results in the simulations described in sections 4, 5, and 9 suggests that it is safe to ignore the final term in the right-hand side of equation 3.3 that comes from the change in the distribution $P_\theta^1$. To get an idea of the relative magnitude of the term that is being ignored, extensive simulations were performed using restricted Boltzmann machines with small numbers of visible and hidden units. By performing computations that are exponential in the number of hidden units and the
Figure 11: (a) A histogram of the improvements in the contrastive divergence as a result of using equation 7.4 to perform one update of the weights in each of $10^5$ networks. The expected values on the right-hand side of equation 7.4 were computed exactly. The networks had eight visible and four hidden units. The initial weights were randomly chosen from a gaussian with mean 0 and standard deviation 20. The training data were chosen at random. (b) The improvements in the log likelihood of the data for 1000 networks chosen in exactly the same way as in Figure 11a. Note that the log likelihood decreased in two cases. The changes in the log likelihood are the same as the changes in $P^0 \,\|\, P_\theta^\infty$ but with a sign reversal.
number of visible units, it is possible to compute the exact values of $\langle s_i s_j \rangle_{P^0}$ and $\langle s_i s_j \rangle_{P_\theta^1}$. It is also possible to measure what happens to $P^0 \,\|\, P_\theta^\infty - P_\theta^1 \,\|\, P_\theta^\infty$ when the approximation in equation 7.4 is used to update the weights by an amount that is large compared with the numerical precision of the machine but small compared with the curvature of the contrastive divergence. The RBMs used for these simulations had random training data and random weights. They did not have biases on the visible or hidden units. The main result can be summarized as follows: For an individual weight, the right-hand side of equation 7.4, summed over all training cases, occasionally differs in sign from the left-hand side. But for networks containing more than two units in each layer, it is almost certain that a parallel update of all the weights based on the right-hand side of equation 7.4 will improve the contrastive divergence. In other words, when we average over the training data, the vector of parameter updates given by the right-hand side is almost certain to have a positive cosine with the true gradient defined by the left-hand side. Figure 11a is a histogram of the improvements in the contrastive divergence when equation 7.4 was used to perform one parallel weight update in each of 100,000 networks. The networks contained eight visible and four hidden units, and their weights were chosen from a gaussian distribution with mean zero and standard deviation 20. With smaller weights or larger networks, the approximation in equation 7.4 is even better.
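For networks this small, the exact expectations can be obtained by brute-force enumeration of all joint states, along the lines the text describes. A sketch under the same assumptions (no biases, so the equilibrium probability of a joint state is proportional to $\exp(\mathbf{v}^\top W \mathbf{h})$; the function name is illustrative):

```python
import numpy as np
from itertools import product

def exact_stats(W):
    """Enumerate all joint (visible, hidden) states of a small RBM with no
    biases and return <s_i s_j> under the exact equilibrium distribution."""
    n_vis, n_hid = W.shape
    states_v = np.array(list(product([0, 1], repeat=n_vis)), dtype=float)
    states_h = np.array(list(product([0, 1], repeat=n_hid)), dtype=float)
    # Unnormalized probability exp(v^T W h) for every joint state.
    goodness = states_v @ W @ states_h.T        # shape (2^n_vis, 2^n_hid)
    p = np.exp(goodness)
    p /= p.sum()
    # <s_i s_j> = sum over joint states of p * v_i * h_j.
    return np.einsum('vh,vi,hj->ij', p, states_v, states_h)
```

With zero weights the joint distribution is uniform, so every $\langle s_i s_j \rangle$ is 0.25, which is a convenient sanity check; the cost grows as $2^{n_{\text{vis}} + n_{\text{hid}}}$, which is why only small networks were used.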
Figure 11b shows that the learning procedure does not always improve the log likelihood of the training data, though it has a strong tendency to do so. Note that only 1000 networks were used for this histogram. Figure 12 compares the contributions to the gradient of the contrastive divergence made by the right-hand side of equation 7.4 and by the term that is being ignored. The vector of weight updates given by equation 7.4 makes the contrastive divergence worse if the dots in Figure 12 are above the diagonal line, so it is clear that in these networks, the approximation in equation 7.4 is quite safe. Intuitively, we expect $P_\theta^1$ to lie between $P^0$ and $P_\theta^\infty$, so when the parameters are changed to move $P_\theta^\infty$ closer to $P^0$, the changes should also move $P_\theta^1$ toward $P^0$ and away from the previous position of $P_\theta^\infty$. The ignored changes in $P_\theta^1$ should therefore cause an increase in $P_\theta^1 \,\|\, P_\theta^\infty$ and thus an improvement in the contrastive divergence (which is what Figure 12 shows). It is tempting to interpret the learning rule in equation 3.4 as approximate optimization of the contrastive log likelihood:

$$\langle \log P_\theta^\infty \rangle_{P^0} - \langle \log P_\theta^\infty \rangle_{P_\theta^1}.$$
Unfortunately, the contrastive log likelihood can achieve its maximum value of 0 by simply making all possible vectors in the data space equally probable. The contrastive divergence differs from the contrastive log likelihood by including the entropies of the distributions $P^0$ and $P_\theta^1$ (see equation 3.3), and so the high entropy of $P_\theta^1$ rules out the solution in which all possible data vectors are equiprobable.

11 Other Types of Expert

Binary stochastic pixels are not unreasonable for modeling preprocessed images of handwritten digits in which ink and background are represented as 1 and 0. In real images, however, there is typically very high mutual information between the real-valued intensity of one pixel and the real-valued intensities of its neighbors. This cannot be captured by models that use binary stochastic pixels because a binary pixel can never have more than 1 bit of mutual information with anything. It is possible to use “multinomial” pixels that have n discrete values. This is a clumsy solution for images because it fails to capture the continuity and one-dimensionality of pixel intensity, though it may be useful for other types of data. A better approach is to imagine replicating each visible unit so that a pixel corresponds to a whole set of binary visible units that all have identical weights to the hidden units. The number of active units in the set can then approximate a real-valued intensity. During reconstruction, the number of active units will be binomially distributed, and because all the replicas have the same weights, the single probability that controls this binomial distribution needs to be computed only once. The same trick can be used to allow replicated hidden units to
Figure 12: A scatter plot that shows the relative magnitudes of the modeled and unmodeled effects of a parallel weight update on the contrastive divergence. The 100 networks used for this figure have 10 visible and 10 hidden units, and their weights are drawn from a zero-mean gaussian with a standard deviation of 10. The horizontal axis shows
$$\left( P^0 \,\|\, P_{\theta_{\mathrm{old}}}^\infty - P_{\theta_{\mathrm{old}}}^1 \,\|\, P_{\theta_{\mathrm{old}}}^\infty \right) - \left( P^0 \,\|\, P_{\theta_{\mathrm{new}}}^\infty - P_{\theta_{\mathrm{old}}}^1 \,\|\, P_{\theta_{\mathrm{new}}}^\infty \right),$$
where old and new denote the distributions before and after the weight update. This modeled reduction in the contrastive divergence differs from the true reduction because it ignores the fact that changing the weights changes the distribution of the one-step reconstructions. The increase in the contrastive divergence due to the ignored term, $P_{\theta_{\mathrm{old}}}^1 \,\|\, P_{\theta_{\mathrm{new}}}^\infty - P_{\theta_{\mathrm{new}}}^1 \,\|\, P_{\theta_{\mathrm{new}}}^\infty$, is plotted on the vertical axis. Points above the diagonal line would correspond to cases in which the unmodeled increase outweighed the modeled decrease, so that the net effect was to make the contrastive divergence worse. Note that the unmodeled effects almost always cause an additional improvement in the contrastive divergence rather than being in conflict with the modeled effects.
approximate real values using binomially distributed integer states. A set of replicated units can be viewed as a computationally cheap approximation to units whose weights actually differ, or it can be viewed as a stationary approximation to the behavior of a single unit over time, in which case the number of active replicas is a firing rate. Teh and Hinton (2001) have shown that this type of rate coding can be quite effective for modeling real-valued images of faces, provided the images are normalized. An alternative to replicating hidden units is to use “unifactor” experts that each consist of a mixture of a uniform distribution and a factor analyzer
with just one factor. Each expert has a binary latent variable that specifies whether to use the uniform distribution or the factor analyzer and a real-valued latent variable that specifies the value of the factor (if it is being used). The factor analyzer has three sets of parameters: a vector of factor loadings that specify the direction of the factor in image space, a vector of means in image space, and a vector of variances in image space.8 Experts of this type have been explored in the context of directed acyclic graphs (Hinton, Sallans, & Ghahramani, 1998), but they should work better in a PoE. An alternative to using a large number of relatively simple experts is to make each expert as complicated as possible, while retaining the ability to compute the exact derivative of the log likelihood of the data with respect to the parameters of an expert. In modeling static images, for example, each expert could be a mixture of many axis-aligned gaussians. Some experts might focus on one region of an image by using very high variances for pixels outside that region. But so long as the regions modeled by different experts overlap, it should be possible to avoid block boundary artifacts.

11.1 Products of Hidden Markov Models. Hidden Markov models (HMMs) are of great practical value in modeling sequences of discrete symbols or sequences of real-valued vectors because there is an efficient algorithm for updating the parameters of the HMM to improve the log likelihood of a set of observed sequences. HMMs are, however, quite limited in their generative power because the only way that the portion of a string generated up to time t can constrain the portion of the string generated after time t is by the discrete hidden state of the generator at time t. So if the first part of a string has, on average, n bits of mutual information with the rest of the string, the HMM must have $2^n$ hidden states to convey this mutual information by its choice of hidden state.
This exponential inefficiency can be overcome by using a product of HMMs as a generator. During generation, each HMM gets to pick a hidden state at each time, so the mutual information between the past and the future can be linear in the number of HMMs. It is therefore exponentially more efficient to have many small HMMs than one big one. However, to apply the standard forward-backward algorithm to a product of HMMs, it is necessary to take the cross-product of their state spaces, which throws away the exponential win. For products of HMMs to be of practical significance, it is necessary to find an efficient way to train them. Andrew Brown (Brown & Hinton, 2001) has shown that the learning algorithm in equation 3.4 works well. The forward-backward algorithm is used to get the gradient of the log likelihood of an observed or reconstructed
8 The last two sets of parameters are exactly equivalent to the parameters of a “unigauss” expert introduced in section 4, so a “unigauss” expert can be considered to be a mixture of a uniform with a factor analyzer that has no factors.
Figure 13: A hidden Markov model. The first and third nodes have output distributions that are uniform across all words. If the first node has a high transition probability to itself, most strings of English words are given the same low probability by this expert. Strings that contain the word shut followed directly or indirectly by the word up have higher probability under this expert.
sequence with respect to the parameters of an individual expert. The one-step reconstruction of a sequence is generated as follows:

1. Given an observed sequence, use the forward-backward algorithm in each expert separately to calculate the posterior probability distribution over paths through the hidden states.

2. For each expert, stochastically select a hidden path from the posterior given the observed sequence.

3. At each time step, select an output symbol or output vector from the product of the output distributions specified by the selected hidden state of each HMM.

If more realistic products of HMMs can be trained successfully by minimizing the contrastive divergence, they should be better than single HMMs for many different kinds of sequential data. Consider, for example, the HMM shown in Figure 13. This expert concisely captures a nonlocal regularity. A single HMM that must also model all the other regularities in strings of English words could not capture this regularity efficiently because it could not afford to devote its entire memory capacity to remembering whether the word shut had already occurred in the string.

12 Discussion

There have been previous attempts to learn representations by adjusting parameters to cancel out the effects of brief iteration in a recurrent network (Hinton & McClelland, 1988; O'Reilly, 1996; Seung, 1998), but these were
not formulated using a stochastic generative model and an appropriate objective function. Minimizing contrastive divergence has an unexpected similarity to the learning algorithm proposed by Winston (1975). Winston's program compared arches made of blocks with “near misses” supplied by a teacher, and it used the differences in its representations of the correct and incorrect arches to decide which aspects of its representation were relevant. By using a stochastic generative model, we can dispense with the teacher, but it is still the differences between the real data and the near misses generated by the model that drive the learning of the significant features.

12.1 Logarithmic Opinion Pools. The idea of combining the opinions of multiple different expert models by using a weighted average in the log probability domain is far from new (Genest & Zidek, 1986; Heskes, 1998), but research has focused on how to find the best weights for combining experts that have already been learned or programmed separately (Berger, Della Pietra, & Della Pietra, 1996) rather than training the experts cooperatively. The geometric mean of a set of probability distributions has the attractive property that its Kullback-Leibler divergence from the true distribution, $P$, is smaller than the average of the Kullback-Leibler divergences of the individual distributions, $Q_m$:

$$\mathrm{KL}\!\left( P \,\Big\|\, \frac{\prod_m Q_m^{w_m}}{Z} \right) \le \sum_m w_m \, \mathrm{KL}\left( P \,\|\, Q_m \right), \tag{12.1}$$

where the $w_m$ are nonnegative and sum to 1, and $Z = \sum_{\mathbf{c}} \prod_m Q_m^{w_m}(\mathbf{c})$. When all of the individual models are identical, $Z = 1$. Otherwise, $Z$ is less than one, and the difference between the two sides of equation 12.1 is $\log(1/Z)$. This makes it clear that the benefit of combining very different experts comes from the fact that they make $Z$ small, especially in parts of the data space that do not contain data. It is tempting to augment PoEs by giving each expert, $m$, an additional adaptive parameter, $w_m$, that scales its log probabilities.
However, this makes inference much more difficult (Yee-Whye Teh, personal communication, 2000). Consider, for example, an expert with $w_m = 100$. This is equivalent to having 100 copies of an expert but with their latent states all tied together, and this tying affects the inference. It is easier to fix $w_m = 1$ and allow the PoE learning algorithm to determine the appropriate sharpness of the expert.

12.2 Relationship to Boosting. Within the supervised learning literature, methods such as bagging or boosting attempt to make “experts” different from one another by using different, or differently weighted, training data. After learning, these methods use the weighted average of the outputs of the individual experts. The way in which the outputs are combined
is appropriate for a conditional PoE model in which the output of an individual expert is the mean of a gaussian, its weight is the inverse variance of the gaussian, and the combined output is the mean of the product of all the individual gaussians. Zemel and Pitassi (2001) have recently shown that this view allows a version of boosting to be derived as a greedy method of optimizing a sensible objective function. The fact that boosting works well for supervised learning suggests that it should be tried for unsupervised learning. In boosting, the experts are learned sequentially, and each new expert is fitted to data that have been reweighted so that data vectors that are modeled well by the existing experts receive very little weight and hence make very little contribution to the gradient for the new expert. A PoE fits all the experts at once, but it achieves an effect similar to reweighting by subtracting the gradient of the reconstructed data so that there is very little learning on data that are already modeled well. One big advantage of a PoE over boosting is that a PoE can focus the learning on the parts of a training vector that have high reconstruction errors. This allows PoEs to learn local features. With boosting, an entire training case receives a single scalar weight rather than a vector of reconstruction errors, so when boosting is used for unsupervised learning, it does not so easily discover local features.

12.3 Comparison with Directed Acyclic Graphical Models. Inference in a PoE is trivial because the experts are individually tractable and the product formulation ensures that the hidden states of different experts are conditionally independent given the data.9 This makes them relevant as models of biological perceptual systems, which must be able to do inference very rapidly. Alternative approaches based on directed acyclic graphical models (Neal, 1992) suffer from the “explaining-away” phenomenon.
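The triviality of inference can be seen directly in an RBM: given a visible vector, the posterior over hidden states factorizes, and each factor is a logistic function of the bottom-up input. A minimal sketch (biases omitted; the function name is illustrative):

```python
import numpy as np

def hidden_posterior(W, v):
    """Exact posterior p(h_j = 1 | v) for every hidden unit of an RBM.
    Because the hidden units are conditionally independent given the data,
    the full posterior is just the product of these per-unit factors:
    a single matrix-vector product, with no iteration and no explaining away."""
    return 1.0 / (1.0 + np.exp(-(v @ W)))
```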
When such graphical models are densely connected, exact inference is intractable, so it is necessary to resort to clever but implausibly slow iterative techniques for approximate inference (Saul & Jordan, 2000) or to use crude approximations that ignore explaining away during inference and rely on the learning algorithm to find representations for which the shoddy inference technique is not too damaging (Hinton et al., 1995). Unfortunately, the ease of inference in PoEs is balanced by the difficulty of generating fantasy data from the model. This can be done trivially in one ancestral pass in a directed acyclic graphical model but requires an iterative procedure such as Gibbs sampling in a PoE. If, however, equation 3.4 is used for learning, the difficulty of generating samples from the model is not a major problem.

9 This ceases to be true when there are missing data.

In addition to the ease of inference that results from the conditional independence of the experts given the data, PoEs have a more subtle advantage over generative models that work by first choosing values for the latent variables and then generating a data vector from these latent values. If such a model has a single hidden layer and the latent variables have independent prior distributions, there will be a strong tendency for the posterior values of the latent variables in a well-fitted model to reflect the marginal independence that occurs when the model generates data. 10 For this reason, there has been little success with attempts to learn such generative models one hidden layer at a time in a greedy, bottom-up way. With PoEs, however, even though the experts have independent priors, the latent variables in different experts will be marginally dependent: they can be highly correlated across cases even for fantasy data generated by the PoE itself. So after the first hidden layer has been learned greedily, there may still be lots of statistical structure in the latent variables for the second hidden layer to capture. There is therefore some hope that an entirely unsupervised network could learn to extract a hierarchy of representations from patches of real images, but progress toward this goal requires an effective way of dealing with real-valued data. The types of expert that have been investigated so far have not performed as well as independent component analysis at extracting features like edges from natural images. The most attractive property of a set of orthogonal basis functions is that it is possible to compute the coefficient on each basis function separately without worrying about the coefficients on other basis functions. A PoE retains this attractive property while allowing nonorthogonal experts and a nonlinear generative model.

12.4 Lateral Connections Between Hidden Units. The ease of inferring the states of the hidden units (latent variables) given the data is a major advantage of PoEs.
This advantage appears to be lost if the latent variables of different experts are allowed to interact directly. If, for example, there are direct, symmetrical connections between the hidden units in a restricted Boltzmann machine, it becomes intractable to infer the exact distribution over hidden states when given the visible states. However, preliminary simulations show that the contrastive divergence idea can still be applied successfully using a damped mean-field method to find an approximating distribution over the hidden units for both the data vector and its reconstruction. The learning procedure then has the following steps for each data vector:
10 This is easy to understand from a coding perspective in which the data are communicated by first specifying the states of the latent variables under an independent prior and then specifying the data given the latent states. If the latent states are not marginally independent, this coding scheme is inefficient, so pressure toward coding efficiency creates pressure toward independence.
Training Products of Experts
1797
1. With the data vector clamped on the visible units, compute the total bottom-up input, x_j, to each hidden unit, j, and initialize it to have real-valued state, q_j^0 = \sigma(x_j), where \sigma(\cdot) is the logistic function.

2. Perform 10 damped mean-field iterations in which each hidden unit computes its total lateral input y_j^t at time t,

   y_j^t = \sum_k w_{kj} q_k^{t-1},

   and then adjusts its state, q_j^t,

   q_j^t = 0.5\, q_j^{t-1} + 0.5\, \sigma(x_j + y_j^t).

   Measure s_i q_j^{10} for each pair of a visible unit, i, and a hidden unit, j. Also measure q_j^{10} q_k^{10} for all pairs of hidden units, j, k.

3. Pick a final binary state for each hidden unit, j, according to q_j^{10}, and use these binary states to generate a reconstructed data vector.

4. Perform one more damped mean-field iteration on the hidden units starting from the final state found with the data, but using the bottom-up input from the reconstruction rather than the data. Measure the same statistics as above.

5. Update every weight in proportion to the difference in the measured statistics in steps 2 and 4.

It is necessary for the mean-field iterations to converge for this procedure to work, which may require limiting or decaying the weights on the lateral connections. The usual worry (that the mean-field approximation is hopeless for multimodal distributions) may not be relevant for learning a perceptual system, since it makes sense to avoid learning models of the world in which there are multiple, very different interpretations of the same sensory input. Contrastive divergence learning finesses the problem of finding the highly multimodal equilibrium distribution that results if both visible and hidden units are unclamped. It does this by assuming that the observed data cover all of the different modes with roughly the right frequency for each mode. The fact that the initial simulations work is very promising since networks with multiple hidden layers can always be viewed as networks that have a single laterally connected hidden layer with many of the lateral and feedforward connections missing. However, the time taken for the network to settle means that networks with lateral connections are considerably slower than purely feedforward networks, and more research is required to demonstrate that the loss of speed is worth the extra representational power.
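The five steps above can be written down compactly in NumPy. The following is a minimal illustrative sketch, not code from the article: the matrices W (visible-to-hidden weights) and L (symmetric lateral weights), the learning rate, and the mean-field reconstruction of the visible vector are my own representational choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def damped_mean_field(x, L, n_iters=10, damping=0.5, q0=None):
    """Steps 1-2: damped mean-field settling of the hidden states.

    x : fixed total bottom-up input to each hidden unit
    L : symmetric lateral weight matrix with zero diagonal
    """
    q = sigmoid(x) if q0 is None else q0              # step 1: q_j^0 = sigma(x_j)
    for _ in range(n_iters):                           # step 2: damped updates
        y = L @ q                                      # total lateral input y_j
        q = damping * q + (1.0 - damping) * sigmoid(x + y)
    return q

def cd_step_lateral(v, W, L, rng, lr=0.01):
    """One contrastive-divergence update for an RBM with lateral hidden weights."""
    # positive phase: settle the hidden states with the data clamped
    q_data = damped_mean_field(W.T @ v, L)
    # step 3: binary hidden sample, then a (mean) reconstructed data vector
    h = (rng.random(q_data.shape) < q_data).astype(float)
    v_recon = sigmoid(W @ h)
    # step 4: one more damped iteration, starting from the data-driven state
    # but driven by bottom-up input from the reconstruction
    q_recon = damped_mean_field(W.T @ v_recon, L, n_iters=1, q0=q_data)
    # step 5: update every weight from the difference of pairwise statistics
    W += lr * (np.outer(v, q_data) - np.outer(v_recon, q_recon))
    dL = lr * (np.outer(q_data, q_data) - np.outer(q_recon, q_recon))
    np.fill_diagonal(dL, 0.0)                          # no self-connections
    L += 0.5 * (dL + dL.T)                             # keep lateral weights symmetric
    return W, L
```

Because only the measured statistics enter the update, the same code applies whatever the lateral weights currently are; limiting or decaying L, as the text suggests, is what keeps the settling convergent.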
A much simpler way to use fixed lateral interactions is to divide the hidden units into mutually exclusive pools. Within each pool, exactly one hidden unit is allowed to turn on, and this unit is chosen using the "softmax" distribution:

p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}.   (12.2)
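As a quick numerical sketch (the function names and setup are my own), one can sample the winner of a pool directly from equation 12.2, or equivalently through an exponential race in which unit j, spiking as a Poisson process with rate exp(x_j), happens to fire first:

```python
import numpy as np

def softmax_winner(x, rng):
    """Sample the single active unit in a pool from equation 12.2."""
    p = np.exp(x - x.max())          # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(x), p=p))

def race_winner(x, rng):
    """Equivalent spiking race: each unit is a Poisson process with rate
    exp(x_j), and the first unit to emit a spike wins.  The minimum of
    independent exponential times with rates r_j falls on unit j with
    probability r_j / sum_k r_k, which is exactly the softmax of x."""
    first_spike_times = rng.exponential(1.0 / np.exp(x))
    return int(np.argmin(first_spike_times))
```

Both samplers draw winners from the same distribution, which is the equivalence the text below exploits when it implements the competition with spiking units and an inhibitory shut-off.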
Curiously, this type of probabilistic winner-take-all competition among the hidden units has no effect whatsoever on the contrastive divergence learning rule. The learning does not even need to know which hidden units compete with each other. The competition can be implemented in a very simple way by making each hidden unit, j, be a Poisson process with a rate of \exp(x_j). The first unit to emit a spike is the winner. The other units in the pool could then be shut off by an inhibitory unit that is excited by the winner. If the pool is allowed to produce a fixed number of spikes before being shut off, the learning rule remains unaltered, but quantities like s_j may take on integer values larger than 1. If a hidden unit is allowed to be a member of several different overlapping pools, the analysis gets much more difficult, and more research is needed.

12.5 Relationship to Analysis-by-Synthesis. PoEs provide an efficient instantiation of the old psychological idea of analysis-by-synthesis. This idea never worked properly because the generative models were not selected to make the analysis easy. In a PoE, it is difficult to generate data from the generative model, but given the model, it is easy to compute how any given data vector might have been generated, and, as we have seen, it is relatively easy to learn the parameters of the generative model. Paradoxically, the generative models that are most appropriate for analysis-by-synthesis may be the ones in which synthesis is intractable.

Acknowledgments

This research was funded by the Gatsby Charitable Foundation. Thanks to Zoubin Ghahramani, David MacKay, Peter Dayan, Radford Neal, David Lowe, Yee-Whye Teh, Guy Mayraz, Andy Brown, and other members of the Gatsby unit for helpful discussions and to the referees for many helpful comments that improved the article.

References

Berger, A., Della Pietra, S., & Della Pietra, V. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–71.
Brown, A. D., & Hinton, G. E. (2001). Products of hidden Markov models. In Proceedings of Artificial Intelligence and Statistics 2001 (pp. 3–11). San Mateo, CA: Morgan Kaufmann.
Freund, Y., & Haussler, D. (1992). Unsupervised learning of distributions on binary vectors using two layer networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 912–919). San Mateo, CA: Morgan Kaufmann.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Genest, C., & Zidek, J. V. (1986). Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1, 114–148.
Heskes, T. (1998). Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10, 1425–1433.
Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for self-organizing neural networks. Science, 268, 1158–1161.
Hinton, G. E., & McClelland, J. L. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 358–366). New York: American Institute of Physics.
Hinton, G. E., Sallans, B., & Ghahramani, Z. (1998). Hierarchical communities of experts. In M. I. Jordan (Ed.), Learning in graphical models. Norwood, MA: Kluwer.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
Mayraz, G., & Hinton, G. E. (2001). Recognizing hand-written digits using hierarchical products of experts. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 953–959). Cambridge, MA: MIT Press.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
O'Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8, 895–938.
Revow, M., Williams, C. K. I., & Hinton, G. E. (1996). Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 592–606.
Saul, L. K., & Jordan, M. I. (2000). Attractor dynamics in feedforward neural networks. Neural Computation, 12, 1313–1335.
Seung, H. S. (1998). Learning continuous attractors in a recurrent net. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 654–660). Cambridge, MA: MIT Press.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
Teh, Y. W., & Hinton, G. E. (2001). Rate-coded restricted Boltzmann machines for face recognition. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 908–914). Cambridge, MA: MIT Press.
Winston, P. H. (1975). Learning structural descriptions from examples. In P. H. Winston (Ed.), The psychology of computer vision. New York: McGraw-Hill.
Zemel, R. S., & Pitassi, T. (2001). A gradient-based boosting algorithm for regression problems. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 696–702). Cambridge, MA: MIT Press.

Received July 28, 2000; accepted December 10, 2001.
LETTER
Communicated by Peter Latham
Dynamic Approximation of Spatiotemporal Receptive Fields in Nonlinear Neural Field Models

Thomas Wennekers
[email protected]
Max-Planck-Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany

This article presents an approximation method to reduce the spatiotemporal behavior of localized activation peaks (also called "bumps") in nonlinear neural field equations to a set of coupled ordinary differential equations (ODEs) for only the amplitudes and tuning widths of these peaks. This enables a simplified analysis of steady-state receptive fields and their stability, as well as spatiotemporal point spread functions and dynamic tuning properties. A lowest-order approximation for peak amplitudes alone shows that much of the well-studied behavior of small neural systems (e.g., the Wilson-Cowan oscillator) should carry over to localized solutions in neural fields. Full spatiotemporal response profiles can further be reconstructed from this low-dimensional approximation. The method is applied to two standard neural field models: a one-layer model with difference-of-gaussians connectivity kernel and a two-layer excitatory-inhibitory network. Similar models have been previously employed in numerical studies addressing orientation tuning of cortical simple cells. Explicit formulas for tuning properties, instabilities, and oscillation frequencies are given, and exemplary spatiotemporal response functions, reconstructed from the low-dimensional approximation, are compared with full network simulations.

1 Introduction
Neural field models are powerful tools in studies approaching spatiotemporal activation patterns in extended cortical tissue (Amari, 1977; Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Carandini & Ringach, 1997; Krone, Mallot, Palm, & Schüz, 1986; Mineiro & Zipser, 1998; Omurtag, Kaplan, Knight, & Sirovich, 2000; von Seelen, Mallot, & Giannakopolous, 1987; Taylor, 1999; Wennekers, 2001a, 2001b; Wilson & Cowan, 1973; and many more). They describe a piece of cortex by one or more spatially extended layers of locally interconnected cells. Small inputs that, depending on the domain under consideration, may represent dots or bars, pure tones, or small haptic stimuli, for example, typically evoke localized activation profiles in such networks. Those point-spread functions directly correspond to receptive fields (RFs) in the respective cortical domain.

Neural Computation 14, 1801–1825 (2002) © 2002 Massachusetts Institute of Technology

1802
Thomas Wennekers

Besides spatial tuning, the evoked activation can show temporal structure either because the stimuli are already temporally modulated or due to cortex-intrinsic processes. Features of such spatiotemporal receptive fields should provide information about the mechanisms underlying cortical tuning phenomena.

An analytical treatment of nonlinear neural field equations is not easy and has been basically restricted to the linear case or linearizations around some steady state where Fourier methods are at hand to characterize response properties (Krone et al., 1986; Mineiro & Zipser, 1998; Sabatini, 1997). Studies dealing analytically with aspects of nonlinear models are less frequent (but see Amari, 1977; Ermentrout & Cowan, 1980; Hansel & Sompolinsky, 1998; Taylor, 1999; Wennekers & Palm, 1996). I have recently developed a new approximation scheme for localized steady-state solutions in truly nonlinear neural field models (Wennekers, 2001a). In this work, the method is extended to dynamic phenomena. When a gaussian approximation for spatially localized firing-rate profiles is used, the full set of neural field equations (integro-differential equations from a mathematical point of view) is reduced to a considerably simpler set of ordinary differential equations (ODEs) for the most interesting peak parameters alone: the response amplitudes and tuning widths. This enables a simplified analytical investigation of steady-state solutions and their stability, as well as of the time course of spatiotemporal receptive field properties. I further show that often a mean-field approximation based solely on ODEs for peak amplitudes can reasonably predict the spatiotemporal dynamics of tuned activity in neural field models. Full spatiotemporal activation profiles are reconstructed by representing them as superpositions of amplitude-modulated roughly gaussian components in space.
Each component corresponds with a distinct projection between a pair of connected network layers, and a coupled system of ODEs again describes the component amplitudes in time. In this way, the time course of spatiotemporal response profiles can be understood in terms of the en gros network connectivity (between layers) and the tuning of the individual coupling kernels. The next section sets up the modeling framework. In section 3, the dynamic approximation methods are developed. As an example, section 4 considers two standard neural field models: a one-field model with difference-of-gaussians (DOG) coupling kernel and a tightly related two-field excitatory-inhibitory model with gaussian-shaped mutual kernels. Both models have been used previously to describe orientation tuning of simple cells in area V1 (cf. Ben-Yishai et al., 1995; Carandini & Ringach, 1997; Somers, Nelson, & Sur, 1995; Wennekers, 2001a, 2001b).
2 Neural Field Models
Analysis of Spatiotemporal Receptive Fields
1803

I consider neural field equations of the following general form (cf. Amari, 1977; Ben-Yishai et al., 1995; Krone et al., 1986; von Seelen et al., 1987; Wennekers, 2001a, 2001b; Wilson & Cowan, 1973):

D_i w_i(x, t) = -w_i(x, t) + I_i(x, t) + \sum_{j=1}^{n} k_{ij}(x) * f_j(w_j(x, t)).   (2.1)
In equation 2.1, n is the number of network layers, and w_i(x, t) describes the membrane potential of cells at location x and time t in layer i. Depending on the application, the model layers in equation 2.1 may correspond to anatomical layers in one cortical area, different cell types, or even different cortical areas. The spatial variable x \in R = ]-\infty, \infty[ represents some feature dimension (e.g., space, orientation). The asterisk denotes spatial convolution, that is, u(x) * v(x) := \int_R u(x - y)\, v(y)\, dy, and the D_i in equation 2.1 are linear differential operators in the time variable t that contain only first- or higher-order derivatives in t. For example, D_i = \tau_i\, \partial/\partial t models the response of cells in layer i by a first-order low-pass filter with (empirical) time constant \tau_i. Higher-order operators may be used to represent the form of postsynaptic impulse response functions more properly. For simplicity, vanishing initial conditions at time t = 0 are assumed. Coupling kernels k_{ij} in equation 2.1 are given by gaussians with amplitude K_{ij}/\sqrt{2\pi} and standard deviation \sigma_{ij}:

k_{ij}(x) = \frac{K_{ij}}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma_{ij}}\right)^2\right].   (2.2)

Inputs I_i(x, t) in equation 2.1 are also spatially localized gaussians centered at zero with standard deviation \sigma_{i0} and strength I_{i0} T_i(t):

I_i(x, t) := I_{i0} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma_{i0}}\right)^2\right] \cdot T_i(t).   (2.3)
The width \sigma_{i0} reflects the input tuning into layer i, I_{i0} sets the absolute input intensity, and T_i(t) is its time course. Rate functions f_i in equation 2.1 are kept general in section 3. A typical choice would be semi-power functions f_i(w_i) = \beta_i [w_i - \theta_i]_+^{p_i} := \beta_i \max(0, w_i - \theta_i)^{p_i} =: \beta_i h_i^{p_i}, as in Wennekers (2001a). The example in section 4 employs rectilinear functions, \beta_i = 1, p_i = 1, \theta_i = 0.

3 Approximation Methods

3.1 Dynamic Approximation Scheme. Field models of the form 2.1 can be solved explicitly only in very rare cases. Most interesting in the context of cortical tuning mechanisms are localized solutions, where the firing-rate profiles f_i(w_i(x, t)) are bell shaped and thus confined, or tuned, to a region in x. Steady-state tuning properties of a special class of nonlinear neural field models have been studied in Wennekers (2001a). The approximation method used there generalizes to the temporal domain. It replaces each firing-rate profile f_i(w_i(x, t)) by a gaussian g_i(x, t) with the same amplitude and standard deviation \sigma_{pi} such that both curves have the same second-order derivative in x = 0:

f_i(w_i(x, t)) \approx f_i(w_{0i}(t)) + f_i'(w_{0i}(t))\, w_{2i}(t)\, \frac{x^2}{2} + O(x^4)   (3.1)

\approx f_i(w_{0i}(t)) \exp\left[\frac{f_i'(w_{0i}(t))\, w_{2i}(t)}{f_i(w_{0i}(t))}\, \frac{x^2}{2}\right] =: g_i(x, t),   (3.2)
where w_{0i}(t) := w_i(0, t) and w_{2i}(t) := w_{i,xx}(0, t). Equations 3.1 and 3.2 identify the zeroth- and second-order terms in x of the Taylor expansions for f_i(w_i(x, t)) and the exponential g_i(x, t). Equation 3.2 may also be derived by writing f_i(w_i(x, t)) = \exp(\log(f_i(w_i(x, t)))) and performing a Taylor expansion of \log(f_i(w_i(x, t))). Thus, the equivalent gaussian profile in layer i has amplitude f_i(w_{0i}) and variance \sigma_{pi}^2 := -f_i(w_{0i}) / (f_i'(w_{0i})\, w_{2i}) (i.e., \sigma_{pi}^2 = -[w_{0i} - \theta_i]_+ / w_{2i} for a semilinear f_i). The advantage of this gaussian approximation is that it allows an evaluation of the convolution integrals k_{ij} * f_j in equation 2.1 in closed form, which results in

D_i w_i(x, t) \approx -w_i(x, t) + I_i(x, t) + \sum_{j=1}^{n} c_{ij} f_j(w_{0j}) \exp\left[-\frac{1}{2}\left(\frac{x}{S_{ij}}\right)^2\right],   (3.3)

where c_{ij} := K_{ij} \sigma_{ij} \sigma_{pj} / S_{ij} and S_{ij}^2 := \sigma_{ij}^2 + \sigma_{pj}^2. Note that equation 3.3 still contains w_{0j}, \sigma_{pj}, and S_{ij} on the right-hand side, which all depend on time but are independent of x. To determine dynamic equations for these quantities, I again use a Taylor expansion of the w_i(x, t) near x = 0. For gaussian-shaped input, one obtains the temporal evolution of the zeroth- and second-order coefficients of the series as

D_i w_{0i} = -w_{0i} + I_{i0} T_i(t) + \sum_{j=1}^{n} c_{ij} f_j(w_{0j})   (3.4)

D_i w_{2i} = -w_{2i} - \frac{I_{i0} T_i(t)}{\sigma_{i0}^2} - \sum_{j=1}^{n} \frac{c_{ij} f_j(w_{0j})}{S_{ij}^2}.   (3.5)
Initial conditions to be used with equations 3.4 and 3.5 are zero, since vanishing initial conditions have been assumed for the full system, equation 2.1. Equations 3.4 and 3.5 provide a closed low-dimensional system of ODEs for the amplitudes w_{0i}(t) and curvatures w_{2i}(t) of the potential profiles at x = 0. Tuning widths can be estimated by means of \sigma_{pi}^2 = -f_i(w_{0i}) / (f_i'(w_{0i})\, w_{2i}), and spatiotemporal membrane responses result from equation 3.3. Since equations 3.4 and 3.5 contain the time course of the input stimulus via T_i(t), spatiotemporal responses to different temporal test functions can be predicted, including pulses, steps, and sinusoidal input as most often used in experiments.
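The gaussian replacement in equations 3.1 and 3.2 is easy to check numerically. The following sketch (the function names and the finite-difference step are my own choices) computes the amplitude and width \sigma_{pi}^2 of the equivalent gaussian from a potential profile w(x) and a rate function f; for a rectilinear f and an exactly gaussian profile, the replacement is exact:

```python
import math

def equivalent_gaussian(w, f, df, h=1e-3):
    """Equations 3.1-3.2: replace the rate profile f(w(x)) by a gaussian
    with the same amplitude and the same second x-derivative at x = 0.

    w  : potential profile, a function of x
    f  : rate function; df its derivative
    Returns (amplitude, sigma_p_squared)."""
    w0 = w(0.0)
    # central finite difference for the curvature w_xx(0)
    w2 = (w(h) - 2.0 * w(0.0) + w(-h)) / h**2
    amplitude = f(w0)
    sigma_p_sq = -f(w0) / (df(w0) * w2)   # sigma_p^2 = -f / (f' w_2)
    return amplitude, sigma_p_sq
```

For w(x) = exp(-x^2/8) (a gaussian with variance 4) and f(w) = max(0, w), the routine recovers amplitude 1 and \sigma_p^2 = 4, illustrating why the approximation is exact for gaussian bumps and rectilinear rate functions.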
3.2 Interpretation of Equations 3.3 through 3.5. Equations 3.3 through 3.5 have useful intuitive interpretations with respect to qualitative aspects of spatiotemporal network responses. First, consider equation 3.3, which predicts that the potential profile w_i(x, t) behaves approximately as a superposition of several spatial gaussians: one gaussian for the input, I_i(x, t), and one for each projection from a layer j to layer i. The amplitudes (c_{ij} f_j or I_{i0} T_i), as well as the tuning widths (S_{ij} or \sigma_{i0}), of these components in general depend on time, but nonetheless equation 3.3 nicely relates the structure of the full solution to the architecture of the network on the level of layers and the individual component tunings. The precise time course of the component amplitudes may reflect the prevalent influence of particular projections during different periods of a response. If one of these components dominates at some time, say, excitatory or inhibitory feedback or the external input, this component transiently determines the shape and tuning of the potential profile. In this way, component amplitudes are linked to spatial tuning properties of the membrane responses. An example for this spatiotemporal superposition principle in the context of orientation tuning is given in section 4.6. The principle holds especially in steady states. For instance, Somers et al. (1995) used it intuitively in their explanation for the cortex-intrinsic sharpening of tuning curves by feedback amplification. Figure 4 in their article separates full steady-state activity profiles into roughly gaussian components resulting from the external inputs and synaptic feedback, respectively. Such a superposition principle is a useful and intuitive property of equation 2.1. Now consider equation 3.4 and observe that equation 3.5 decouples from it and could be neglected if the effective couplings c_{ij} were constants.
For constant couplings, equation 3.4 describes exactly the dynamics of n coupled graded-response neurons. This formal correspondence implies that if the tuning widths in a field model stay roughly constant, that is, if the rate peaks are merely amplitude modulated, one can reasonably expect that the amplitudes of localized solutions behave qualitatively similar to the dynamics of the corresponding low-dimensional system given by equation 3.4. In a lowest-order approximation, one may therefore neglect equation 3.5 and investigate just equation 3.4 with fixed c_{ij}, for example, steady-state values, c_{ij} = c_{ij}^* = c_{ij}(\sigma_{pj}^*). I call this approach mean-field approximation of localized solutions of equation 2.1 because equations of the form 3.4 or similar ones have been repeatedly studied previously as mean-field equations for the dynamic behavior of large, homogeneous pools of graded-response or Poissonian spiking neurons (e.g., Amari, 1972; Schuster & Wagner, 1990a, 1990b; Wilson & Cowan, 1972). It is in no way obvious that the dynamics of a small neural network (like the Wilson-Cowan oscillator, for instance) have anything to do with that of a distributed field model of the same en gros connectivity structure between layers. Equation 3.4 establishes such a relation, and in section 4, I show that the mean-field approximation can lead to reasonable quantitative predictions for properties of neural field models.

Now consider equation 3.5 for the second-order derivatives, w_{2i}. What is the intuitive interpretation of this equation? Remember that w_{2i} = -(f_i / f_i') \cdot 1/\sigma_{pi}^2 or, for semilinear rate functions, w_{2i} = -[w_{0i} - \theta_i]_+ \cdot 1/\sigma_{pi}^2. So w_{2i} is formally proportional to 1/\sigma_{pi}^2, the squared inverse tuning width in layer i. Thus, equation 3.5 shows that the temporal behavior of w_{2i}, and therefore the inverse peak variance, 1/\sigma_{pi}^2, appears to be driven by a weighted
superposition of the inverse variances (1/\sigma_{i0}^2 or 1/S_{ij}^2) of the individual spatial gaussian components stemming from the connected source layers. This holds especially for steady states, where the inverse variance of a tuned peak turns out to be a static mixture of the inverse variances of the feedforward and feedback inputs (cf., e.g., equation 4.11). The weighting factors, I_{i0} T_i and c_{ij} f_j, are furthermore exactly the same as the component amplitudes in equation 3.4. Hence, dominating factors force the tuning toward that of the particular gaussian component.

3.3 Reconstruction of Spatiotemporal Solutions. The mean-field approximation consists of replacing c_{ij}(t) in equation 3.4 by steady-state values c_{ij}^* = c_{ij}(\sigma_{pj}^*). At first, it describes only the approximate dynamics of peak amplitudes. However, the assumption is equivalent to assuming steady-state values for the tuning widths. Therefore, on the same level of approximation, one may also set the S_{ij} in equation 3.3 to their steady-state values. It then becomes possible to reconstruct full spatiotemporal activation profiles from the lowest-order mean-field approximation. To this end, define w_{0i}(t) = u_{i0}(t) + \sum_{j=1}^{n} u_{ij}(t), where the u_{ij}, j = 0, \ldots, n, are the amplitudes of the individual gaussian components that contribute to w_i(x, t) in equation 3.3. Inserting this into equation 3.4 and separating the u_{ij} leads to

D_i u_{i0} = -u_{i0} + I_{i0} T_i(t)   (3.6)

D_i u_{ij} = -u_{ij} + c_{ij}^* f_j(w_{0j}) = -u_{ij} + c_{ij}^* f_j(u_{j0} + \cdots + u_{jn})   (3.7)

for j = 1, \ldots, n and vanishing initial conditions. By solving equations 3.6 and 3.7, one then obtains the time course of the individual component amplitudes and reconstructs approximate spatiotemporal amplitude profiles from equation 3.3 as

w_i(x, t) \approx u_{i0}(t) \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma_{i0}}\right)^2\right] + \sum_{j=1}^{n} u_{ij}(t) \exp\left[-\frac{1}{2}\left(\frac{x}{S_{ij}^*}\right)^2\right].   (3.8)
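For a single self-exciting layer (n = 1), equations 3.6 through 3.8 reduce to two amplitude ODEs and a two-gaussian superposition. The following forward-Euler sketch illustrates this; the parameter values, including the stand-in steady-state width S11, are illustrative placeholders, not values from the article:

```python
import math

def reconstruct_response(c11, I10=1.0, s10=23.0, S11=12.0,
                         tau=10.0, T=500.0, dt=0.05):
    """Integrate the component amplitudes u10 (input-driven) and u11
    (feedback-driven) of equations 3.6-3.7 for one layer with constant
    input T(t) = 1, then superpose two fixed-width gaussians (eq. 3.8)."""
    f = lambda u: max(0.0, u)                         # rectilinear rate function
    u10 = u11 = 0.0                                   # vanishing initial conditions
    for _ in range(int(T / dt)):
        u10 += dt * (-u10 + I10) / tau                # equation 3.6
        u11 += dt * (-u11 + c11 * f(u10 + u11)) / tau # equation 3.7 (n = 1)
    # equation 3.8: reconstructed spatial profile at the final time
    profile = lambda x: (u10 * math.exp(-0.5 * (x / s10)**2)
                         + u11 * math.exp(-0.5 * (x / S11)**2))
    return u10, u11, profile
```

With c11 = 0.5, the amplitudes settle at u10 = 1 and u11 = c11/(1 - c11) = 1, so the peak of the reconstructed profile is the sum of an input component of width \sigma_{10} and a narrower feedback component, which is precisely the superposition picture discussed in section 3.2.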
Although constant widths are assumed in equations 3.7 and 3.8, the firing-rate peak widths resulting from equation 3.8 by applying f_i typically depend on time. Estimates of the temporal tuning widths are given by w_{2i}(t) \approx -u_{i0}(t)/\sigma_{i0}^2 - \sum_{j=1}^{n} u_{ij}(t)/S_{ij}^{*2} and \sigma_{pi}^2 \approx -f_i(w_{0i}) / (f_i'(w_{0i})\, w_{2i}).

4 Example
As an instructive example, I now study two standard neural field models: a one-layer model with DOG coupling kernel and a related two-layer excitatory-inhibitory model comparable to the spatial Wilson-Cowan equations (Wilson & Cowan, 1973; Ermentrout & Cowan, 1980). These models are the simplest ones showing nontrivial dynamic properties. Moreover, they are interesting, because similar models addressing the feedforward- versus feedback-dominated origin of orientation tuning in V1 simple cells have attracted considerable interest in the last few years (see Ben-Yishai et al., 1995; Ben-Yishai, Hansel, & Sompolinsky, 1997; Carandini & Ringach, 1997; Ferster & Miller, 2000; Omurtag et al., 2000; Pugh, Ringach, Shapley, & Shelley, 2000; Somers et al., 1995; Wennekers, 2001a, 2001b).

4.1 One- and Two-Layer Model Networks. In the following I compare
the one-layer model,

\tau_1 \dot{w}_1 = -w_1 + I_1 + k_{DOG} * f_1(w_1),   (4.1)

with the DOG kernel and the following two-layer model:

\tau_1 \dot{w}_1 = -w_1 + I_1 + k_{11} * f_1(w_1) - k_{12} * f_2(w_2)   (4.2)

\tau_2 \dot{w}_2 = -w_2 + k_{21} * f_1(w_1).   (4.3)
As usual, x \in R and vanishing initial conditions are assumed. For simplicity, f_i(w_i) := [w_i]_+ := \max(0, w_i). I_1 = I_1(x, t) and the kernels k_{ij}(x) are gaussians, as in equations 2.3 and 2.2. In equation 4.3, I neglect input into as well as feedback between inhibitory cells, I_2 \equiv 0 and k_{22} \equiv 0. The latter two assumptions imply that for k_{DOG} = k_{11} - k_{12} * k_{21}, steady-state profiles for w_1(x) of the two-layer model are identical with those of the one-layer model,

w_1^*(x) = I_1(x) + (k_{11} - k_{12} * k_{21}) * f_1(w_1^*(x)),   (4.4)

and the steady-state inhibitory profile is uniquely defined by w_1^*: w_2^*(x) = k_{21} * f_1(w_1^*(x)). Observe that the one-layer model lumps excitatory and inhibitory cells into a single field but has steady states identical to the more detailed two-layer model only under the special conditions f_2 = [\cdot]_+, I_2 \equiv 0, and k_{22} \equiv 0. If inhibitory-to-inhibitory feedback is included, as in section 4.7, this correspondence is lost.
Reduced equations for the two-layer model easily follow from equations 3.4 and 3.5:

\tau_1 \dot{w}_{01} = -w_{01} + I_{10} T_1 + c_{11} f_1(w_{01}) - c_{12} f_2(w_{02})   (4.5)

\tau_2 \dot{w}_{02} = -w_{02} + c_{21} f_1(w_{01})   (4.6)

\tau_1 \dot{w}_{21} = -w_{21} - \frac{I_{10} T_1}{\sigma_{10}^2} - \frac{c_{11} f_1(w_{01})}{S_{11}^2} + \frac{c_{12} f_2(w_{02})}{S_{12}^2}   (4.7)

\tau_2 \dot{w}_{22} = -w_{22} - \frac{c_{21} f_1(w_{01})}{S_{21}^2}.   (4.8)
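The amplitude part of this reduced system is easy to integrate directly. A forward-Euler sketch of the lowest-order mean-field approximation (equations 4.5 and 4.6 with the effective couplings frozen at constants; the particular coupling values in the test below are arbitrary illustrations, not fits to the article's parameters):

```python
def simulate_mean_field(c11, c12, c21, I10=1.0, tau1=10.0, tau2=10.0,
                        T=2000.0, dt=0.05):
    """Forward-Euler integration of the amplitude equations 4.5-4.6
    with fixed effective couplings c11, c12, c21 and constant input
    T1(t) = 1.  Returns the trajectory of (w01, w02)."""
    f = lambda u: max(0.0, u)          # rectilinear rate function
    w01 = w02 = 0.0                    # vanishing initial conditions
    traj = []
    for _ in range(int(T / dt)):
        dw01 = (-w01 + I10 + c11 * f(w01) - c12 * f(w02)) / tau1   # eq. 4.5
        dw02 = (-w02 + c21 * f(w01)) / tau2                        # eq. 4.6
        w01 += dt * dw01
        w02 += dt * dw02
        traj.append((w01, w02))
    return traj
```

In the active regime (w01 > 0), the trajectory settles at the steady state predicted by equation 4.9, w01 = I10 / (1 - c+ + c-) with c+ = c11 and c- = c12 c21, which is one way to check the reduced equations against the explicit formulas that follow.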
For steady-state couplings c_{ij} = c_{ij}(\sigma_{pj}^*), the amplitude equations 4.5 and 4.6 alone provide the lowest-order mean-field approximation. Using T_1 \equiv 1, h_i := f_i(w_{0i}) = [w_{0i}]_+, and w_{2i} = -f_i / (f_i' \sigma_{pi}^2) = -h_i / \sigma_{pi}^2, equation 4.6 immediately results in h_2 = c_{21} h_1, and equation 4.8 in \sigma_{p2}^2 = S_{21}^2 = \sigma_{21}^2 + \sigma_{p1}^2, for the steady-state firing rate and tuning width in the inhibitory field. Inserting h_2 = c_{21} h_1 into equation 4.5 and defining c_+ := c_{11} and c_- := c_{12} c_{21} further yields for the steady-state firing rate in the excitatory field,

h_1 = \frac{I_{10}}{1 - c_+ + c_-}.   (4.9)

Finally, from equation 4.7, one gets, using h_2 = c_{21} h_1 and w_{21} = -h_1 / \sigma_{p1}^2,

\frac{h_1}{\sigma_{p1}^2} = \frac{I_{10}}{\sigma_{10}^2} + \frac{c_{11} h_1}{S_{11}^2} - \frac{c_{12} c_{21} h_1}{S_{12}^2},   (4.10)

which simplifies by means of equation 4.9 and S_{12}^2 = \sigma_{12}^2 + \sigma_{p2}^2 = \sigma_{12}^2 + \sigma_{21}^2 + \sigma_{p1}^2 to

\frac{1}{\sigma_{p1}^2} = \frac{1 - c_+ + c_-}{\sigma_{10}^2} + \frac{c_+}{S_+^2} - \frac{c_-}{S_-^2},   (4.11)

where S_\pm^2 := \sigma_\pm^2 + \sigma_{p1}^2, \sigma_+ = \sigma_{11}, and \sigma_-^2 = \sigma_{12}^2 + \sigma_{21}^2. Note that equation 4.11 depends on only \sigma_{p1} but not \sigma_{p2}, and that K_+ := K_{11}, K_- := K_{21} K_{12} \sigma_{12} \sigma_{21} / \sigma_-, \sigma_+, and \sigma_- are the effective DOG parameters in equation 4.4. In the following sections, I mostly use the tuning parameters \sigma_+ = 7.5, \sigma_- = 60, and \sigma_{10} = 23. These parameters are taken from Carandini and Ringach's work (1997) and are derived from Somers et al.'s (1995) biologically detailed model. Couplings used there correspond with K_+ = .23 and K_- = .092.
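Equations 4.9 and 4.11 can be solved numerically by damped fixed-point iteration on \sigma_{p1}, with c_\pm = K_\pm \sigma_\pm \sigma_{p1} / S_\pm following from the definitions above. This is a sketch of one way to produce curves like those in Figure 1; the iteration scheme, damping factor, and starting value are my own choices:

```python
import math

def steady_state_tuning(Kp, Km, s_plus=7.5, s_minus=60.0, s10=23.0,
                        I10=1.0, sp_init=5.0, n_iter=500, damping=0.5):
    """Iterate equation 4.11 to a fixed point for sigma_p1, then
    evaluate equation 4.9 for the excitatory steady-state rate h1."""
    sp = sp_init
    for _ in range(n_iter):
        Sp = math.sqrt(s_plus**2 + sp**2)      # S_+
        Sm = math.sqrt(s_minus**2 + sp**2)     # S_-
        cp = Kp * s_plus * sp / Sp             # c_+ = c_11
        cm = Km * s_minus * sp / Sm            # c_- = c_12 c_21
        inv_var = (1.0 - cp + cm) / s10**2 + cp / Sp**2 - cm / Sm**2
        sp = damping * sp + (1.0 - damping) / math.sqrt(inv_var)
    h1 = I10 / (1.0 - cp + cm)                 # equation 4.9
    return sp, h1
```

For the parameters quoted above (K_+ = .23, K_- = .092), the iteration settles on an excitatory tuning width of roughly \sigma_+ to 2\sigma_+ and an amplification factor h_1/I_{10} above one, consistent with the "high recurrent amplification" regime the caption of Figure 1 describes.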
Figure 1: Numerical solution of the reduced fixed-point equations, 4.9 and 4.10, for the (A) tuning and (B) response amplitudes of the excitatory cells. Parameters are I_{10} = 1 and \sigma_+ = 7.5, \sigma_- = 60, \sigma_{10} = 23, taken from Carandini and Ringach (1997). Couplings used in that work correspond with K_- = .092 and K_+ = .23 in the regime of high recurrent signal amplification.
4.2 Steady-State Tuning Properties. Equations 4.9 and 4.11 are easy to solve numerically. Figure 1 displays solutions for the parameters given above and I10 D 1. The absolute input strength is arbitrary, because equation 4.11, and therefore sp1 , is independent of I10. This implies also that sp2 is independent of I10 and that h1 and h2 are proportional to I10 whatever the relative strengths of the input and the recurrent connections are. Figure 1A reveals a point singularity at K¡ D 0, K C ¼ .14, but in largeparameter regions, the excitatory tuning is relatively sharp and comparable to sC D s11 . Figure 1B plots the cortical amplication factor, m D h1 / I10 (because I10 D 1). The region of cortical amplication m > 1 is rather small. In large parts, m is smaller than one, although tuning is typically sharp in these regions. Moreover, m becomes innite along a line singularity. Steady-state values in Figure 1 are well in accord with simulations of the full eld models: equations 4.1 through 4.3. Since this has already been demonstrated (Wennekers, 2001a), the numerical tests are not repeated here. 4.3 Stability of Steady States. Figure 2 presents a bifurcation diagram for the two-eldpmodel and the additional parameters t1 D t2 D 10 ms, s12 D s21 D s¡ / 2, K21 D 1, and K12 D K ¡s¡ / (s12 s21 ) . The exact values of K21 and K12 are arbitrary for stability as long as K¡ D K21 K12s12 s21 / s¡ is kept constant. Changes of K21 only scale w 2 ( x, t ) proportionally (cf. equation 4.3), but as long as K¡ is constant, they have no effect in equation 4.2. Especially, stability properties of steady states depend on K¡ but not its precise factorization into K12 and K21 . The thin labeled lines in Figures 2A and 2B are contour lines of the surfaces in Figure 1. Other lines and symbols are identical in Figures 2A and 2B and indicate two branches of instabilities: a line of saddle-node bifurcations,
1810
Thomas Wennekers
Figure 2: Bifurcation diagram of the excitatory-inhibitory two-field model. Thin lines are contours of the corresponding surfaces in Figures 1A and 1B, labeled by the respective (A) tuning width and (B) cortical amplification factor. SN denotes a saddle-node line and H a Hopf bifurcation. Solid lines result from a linear stability analysis of the reduced set, 4.5 to 4.8; plus signs and x's are from numerical simulations of the full field model and the reduced system, respectively. Asterisks represent the explicit stability boundaries derived in the mean-field approximation (see section 4.4).
SN, which coincides with the line of infinite amplification (cf. Figures 1B and 2B), and a Hopf bifurcation line, H, which separates a region of stable fixed points (indicated as 0) from a region where stable limit cycles appear (indicated as 2). The approximation scheme predicts the stability boundaries of the full system very well. In contrast to the two-field model, the one-field model (see equation 4.1) reveals only a saddle-node line as in Figure 2, but no Hopf bifurcation. Beside that, there is no difference from Figure 2 for that model. The bifurcation scenario of the two-field model depends only on K−, not on its precise factorization into K12 and K21. Somewhat surprisingly, simulations indicate that the bifurcation diagram in Figure 2 does not change notably if σ12² = sσ−² and σ21² = (1 − s)σ−² are varied in the whole range s ∈ [0, 1] and K− is still kept constant. (These variations fix σ− but modify the tuning of the mutual kernels between the excitatory and inhibitory fields in a vast range.) This invariance of the bifurcation diagram is also predicted by the simple mean-field approximation presented in the next section, but is far from obvious in the full or reduced set of equations. The bifurcation diagram is further invariant with respect to changes in τ1 and τ2 as long as their ratio is fixed. This is because time in equations 4.2 and 4.3 can be rescaled (i.e., t ← t/τ2), which has no impact on stability but leaves the ratio τ1/τ2 as the only relevant parameter in the equations. The saddle-node line is independent of the time constants in any case (for the one- as well as the two-field model).
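The time-rescaling argument can be made explicit. Below is a sketch that writes the two field equations in generic leaky-integrator form, with F1 and F2 standing for the input and convolution terms of equations 4.2 and 4.3 (an assumed shorthand, not the paper's notation):

```latex
% Generic leaky-integrator form (F_i collect the input and coupling terms):
%   \tau_1 \, dw_1/dt = -w_1 + F_1[w_1, w_2], \qquad
%   \tau_2 \, dw_2/dt = -w_2 + F_2[w_1].
% Substituting the dimensionless time s = t/\tau_2 gives
\frac{\tau_1}{\tau_2}\,\frac{dw_1}{ds} = -w_1 + F_1[w_1, w_2], \qquad
\frac{dw_2}{ds} = -w_2 + F_2[w_1].
% Only the ratio \tau_1/\tau_2 remains in the equations; fixed points
% (dw_i/ds = 0) and hence the saddle-node condition do not involve the
% time constants at all.
```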
Analysis of Spatiotemporal Receptive Fields
1811
4.4 Mean-Field Approximation. The mean-field approximation consists of equations 4.5 and 4.6 with steady-state couplings cij = cij(σ*pj). Using linear stability theory (e.g., Guckenheimer & Holmes, 1983), it is easy to derive stability conditions for steady states in that approximation. One obtains the characteristic polynomial C_MF(λ) = τ1τ2λ² + (τ2(1 − c*+) + τ1)λ + (1 − c*+ + c*−). Steady states are stable if all roots of the equation C_MF(λ) = 0 have negative real part. The critical condition C_MF(λ = 0) = 0 then leads to the stability condition c*+ = c*− + 1 for the saddle-node line, and C_MF(λ = jω) = 0 to c*+ = 1 + τ1/τ2, c*− ≥ τ1/τ2 for the Hopf bifurcation. Observe that the SN line is independent of the membrane time constants, but the Hopf branch can be shifted vertically in Figure 2 by changing the ratio τ1/τ2: fast excitation and slow inhibition potentially destabilize steady states. Linear stability theory also provides an estimate of the frequency of oscillations appearing along the Hopf branch: ω² = c*−/(τ1τ2) − 1/τ2². Some further approximations lead to explicit formulas for the stability boundaries and oscillation frequencies. Figure 2A reveals that σp1 is roughly equal to σ+ in a large domain of the parameter space. Thus, S+ ≈ √2 σ+, S− ≈ σ− ≫ σ+, and σ10² ≫ σ+². Using this in equation 4.11 and skipping small terms, one gets after some rearrangement,

σ*p1 ≈ √2 σ+ / (K+σ+)^(1/3),   (4.12)
and σ*p2² = S21² = σ21² + σ*p1², as before. These approximations can be inserted as steady-state values in the above stability conditions, which results in

SN: K−^SN ≈ ((K+σ+)^(1/3)/√2) (K+/√2 − 1/σ+),   (4.13)

H: K+^H ≈ (√2/σ+)(1 + τ1/τ2),  K−^H ≥ K−^c := τ1/(σ+τ2),   (4.14)

ω² = (K− − K−^c) σ+/(τ1τ2) = K−σ+/(τ1τ2) − 1/τ2².   (4.15)
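The saddle-node and Hopf conditions follow from the roots of the characteristic polynomial C_MF(λ) quoted above and are easy to verify numerically. A minimal sketch (the coupling values below are illustrative, not fitted parameters from the paper):

```python
import numpy as np

def cmf_roots(cp, cm, tau1=10.0, tau2=10.0):
    """Roots of C_MF(l) = tau1*tau2*l^2 + (tau2*(1-cp) + tau1)*l + (1 - cp + cm)."""
    return np.roots([tau1 * tau2, tau2 * (1.0 - cp) + tau1, 1.0 - cp + cm])

tau1 = tau2 = 10.0  # ms, as in section 4.3

# Saddle-node boundary: the constant term vanishes at cp = cm + 1,
# producing a zero eigenvalue.
r = cmf_roots(cp=1.5, cm=0.5)
assert abs(r.real.max()) < 1e-9

# Hopf boundary: cp = 1 + tau1/tau2 with cm >= tau1/tau2 gives a purely
# imaginary pair whose frequency matches omega^2 = cm/(tau1*tau2) - 1/tau2^2.
r = cmf_roots(cp=1.0 + tau1 / tau2, cm=3.0)
omega = np.sqrt(3.0 / (tau1 * tau2) - 1.0 / tau2**2)
assert np.allclose(r.real, 0.0) and np.isclose(abs(r.imag[0]), omega)
```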
Conditions 4.13 and 4.14 are indicated with asterisks (∗) in Figures 2A and 2B. Although very simple, they agree well with the results derived from the reduced-system equations and the full model. Only if K− tends to zero does the SN curve obtained from equation 4.13 approach the origin in Figure 2 and not the correct asymptotic value of K+ ≈ .14 at the point singularity for σ*p1. The reason is that near that singularity, σ*p1 becomes considerably larger than σ+, such that the approximation 4.12, which assumes the opposite, breaks down. In that regime, one has to solve equations 4.11 and 4.12 numerically to obtain exact values for the steady-state couplings, but still can infer stability from the general conditions derived
Figure 3: Step responses of the full excitatory-inhibitory two-field model (top) and the reduced dynamic equations (bottom) for peak firing rates, h1, and tuning widths, σp1. Different curves (solid, dashed, dotted) are for increasing excitatory coupling strengths K+ (K− = .2; cf. Figure 2). Both models reveal a Hopf bifurcation accompanied by oscillatory rate tuning.
in the mean-field approximation above. (Of course, one can also apply linear stability theory to the full set of reduced equations, 4.5 through 4.8. This, however, is typically not well suited for hand calculations, but numerical solutions are obtained in a straightforward manner and very quickly.)

4.5 Oscillatory Tuning and Autonomous Oscillations. Figure 3 compares simulations of the full excitatory-inhibitory two-field model and the reduced system, equations 4.5 through 4.8, for K− = 0.2; K+ is varied in a range near the Hopf bifurcation (cf. Figure 2); other parameters are as before. Shown are peak firing rates h1(t) and tuning widths σp1(t) in response to a spatial gaussian (σ10 = 23 as usual) switched on at time zero, T1(t) = H(t).
The input amplitude, I10 = 1.0, is again arbitrary. All spatiotemporal solutions of the full model scale proportionally to I10 but are otherwise form invariant (cf. Wennekers, 2001b). The three curves in each plot in Figure 3 represent different choices for K+: solid lines indicate a properly stable situation, dashed lines are slightly below, and dotted lines slightly above instability. The instability occurs at K+ ≈ .360 in the full model and at K+ ≈ .333 in the reduced model (cf. Figure 2 for K− = .2). Thus, K+ = .330, .355, .365 were chosen in Figures 3A and 3B and K+ = .303, .328, .338 in Figures 3C and 3D. Simulations of the full- and reduced-model equations in Figure 3 are in good qualitative agreement, showing damped oscillations below the bifurcation with increasing decay time as the instability is approached (because a pair of complex-conjugate eigenvalues reaches the imaginary axis). Autonomous oscillations appear if the steady state becomes unstable. A perfect quantitative comparison is, of course, not possible, because the instabilities in both models occur at slightly different values of K+, such that the values of K+ chosen for the simulations (by hand) are not exactly comparable. Observe in particular that for the given parameters, Figure 2B reveals that the amplification factor is relatively large and depends quite sensitively on the precise operation point, which is not far from the saddle-node bifurcation line. As a consequence, the response amplitudes differ between Figure 3A and Figure 3C. Steady-state values reached asymptotically in Figures 3A and 3B for K+ = .330 (solid line) are h1 = 1.82 and σp1 = 7.34. In contrast, for K+ = .303 in Figures 3C and 3D, one finds asymptotically h1 = 1.06 and σp1 = 8.2. These differences are consistent with values obtainable from Figure 2. The model predicts for parameters near the Hopf bifurcation that not only the amplitudes oscillate; the tuning widths do as well.
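The qualitative picture in Figure 3 can be reproduced with a minimal two-variable excitatory-inhibitory rate model in the spirit of the reduced amplitude equations. This is an illustrative sketch only: the threshold-linear rate function, coupling values, and step input are assumptions, not the paper's equations 4.5 through 4.8.

```python
import numpy as np

def simulate(cp, cm=2.0, c21=1.0, tau1=10.0, tau2=10.0, dt=0.05, T=400.0, I=1.0):
    """Euler integration of a minimal excitatory-inhibitory amplitude pair:
    tau1*h1' = -h1 + [cp*h1 - (cm/c21)*h2 + I]_+   (excitatory amplitude)
    tau2*h2' = -h2 + [c21*h1]_+                    (inhibitory amplitude)
    """
    n = int(T / dt)
    h1 = np.zeros(n)
    h2 = np.zeros(n)
    for k in range(n - 1):
        f1 = max(cp * h1[k] - (cm / c21) * h2[k] + I, 0.0)
        f2 = max(c21 * h1[k], 0.0)
        h1[k + 1] = h1[k] + dt * (-h1[k] + f1) / tau1
        h2[k + 1] = h2[k] + dt * (-h2[k] + f2) / tau2
    return h1

# Below the Hopf point (cp < 1 + tau1/tau2 = 2): oscillations decay.
h_damped = simulate(cp=1.6)
early = h_damped[2000:4000]   # 100-200 ms window
late = h_damped[-2000:]       # 300-400 ms window
assert late.max() - late.min() < early.max() - early.min()

# Above the Hopf point: the fixed point is unstable and oscillations persist.
h_osc = simulate(cp=2.2)
assert h_osc[-2000:].max() - h_osc[-2000:].min() > 0.01
```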
Tuning widths in Figures 3B and 3D start from σ10 = 23 because the first few milliseconds of the response are purely input dominated. Afterward, tuning quickly sharpens due to the network-intrinsic feedback processes. Note that the relative amplitude modulation is considerably stronger than that of the tuning. For that reason, the mean-field approximation predicts the stability boundaries well. Oscillation frequencies in Figure 3 are slightly higher in the reduced model than in the full one: 17 versus 13 Hz. Figure 4 compares oscillation frequencies along the Hopf bifurcation branch. The figure displays f = ω/(2π) and is computed for τ1 = τ2 = 10 ms. Other parameters are as given earlier. The simulations are the same as those shown in Figure 2 along the H-branch. Figure 4 reveals that all of the different approximations predict the occurring oscillation frequencies reasonably well. Using the cited membrane time constants of 10 ms, these frequencies typically fall into the EEG alpha to beta range. It has been suggested that the collective population dynamics of large cell ensembles might be better described by synaptic time constants than by membrane time constants (cf. Treves, 1993). It might therefore be more appropriate to choose time constants in the range of a few milliseconds. Then the oscillation frequencies would fall into higher EEG bands than alpha, but equation 4.15 as well as Figures 1 and 2 still apply identically.

Figure 4: Oscillation frequencies near the Hopf bifurcation line H in Figure 2. Thick line: mean-field theory; thin line: reduced equations; dotted line: approximation 4.15. Plus signs and x's are simulation results of the full and reduced system, respectively. τ1 = τ2 = 10 ms; other parameters as usual.

4.6 Reconstruction of Impulse Response Functions. According to section 3.3, approximate spatiotemporal activity profiles can be obtained from the simple set of ODEs, 3.6 and 3.7. I now apply this reconstruction method to the transient dynamics of orientation tuning as reported for macaque V1 by Ringach, Hawken, and Shapley (1997). They observed that impulse response functions of V1 simple cells in the orientation domain often consist of an early and sharply tuned excitatory component followed by a late and broadly tuned inhibitory one. These experimental response functions can indeed be written as a sum of roughly gaussian components, as predicted by equation 3.3. One may therefore expect that the time course of the amplitudes and tuning widths of these components yields insight into the underlying cortical microcircuitry. I still consider the two-layer model of excitatory and inhibitory cells, equations 4.2 and 4.3, with τ1 = τ2 = 10 ms, K11 = .38, K12 = .2, K21 = .0333, σ10 = 23, σ11 = 7.5, and σ12 = σ21 = 42.4. Thus, K− = .2 and σ− = 60, as earlier. Steady-state tuning widths are then σ*p1 ≈ 7.5 = σ11 and σ*p2 ≈ 43 ≈ σ21, and as shown in section 4.3, the mean-field approximation is well applicable. Furthermore, all spatiotemporal solutions of the model scale proportionally to I10. So I arbitrarily set I10 = 1 and consider input pulses of 20 ms duration starting from w_i(x, t = 0) ≡ 0. An example simulation of the full model is displayed in Figure 5A, where the excitatory potential distribution is shown over time. Figure 5B redraws
Figure 5: (A) Spatiotemporal impulse response of the full two-field model. (B) Spatial sections of the plot in A at different times, each normalized to an amplitude of 1. The reconstructed response (C) reveals all features of the true one. (D) Component amplitudes of the reconstructed profile in C; u10, u11, u12 are the amplitudes of the input, recurrent excitatory, and recurrent inhibitory components, respectively. h1 is the excitatory peak firing rate.
sections through w1(x, t) for different t, each section normalized individually to a maximum amplitude of 1. Right after stimulus onset, the response is small and broad, mainly determined by the poorly tuned external input. A short time later, it sharpens, amplified by the strong recurrent connections. Simultaneously, inhibition increases such that cells away from the center (x = 64 in the figure) are suppressed below threshold, ϑ = 0, and only a sharply tuned set of cells at the center can fire (the iceberg effect). If the stimulus is switched off, excitation fades away more quickly than inhibition, such that at around time t = 40, one still observes an excitatory component superimposed on a stronger inhibitory one, but the asymptotic response is largely inhibitory (t = 100). (For similar experimentally observed response patterns, see Ringach et al., 1997.) Simulations of impulse response functions as shown in Figure 5 can be found in Omurtag et al. (2000) and Pugh et al. (2000), of course, without the interpretation suggested by the approximation method developed in this article in equations 3.3 and 3.8. According to section 3.3, solutions of the two-field model, equations 4.2 and 4.3, can be represented as a sum of three components for the excitatory cells with amplitudes u10, u11, u12, representing input, recurrent excitation,
and inhibition, respectively, and one component, u21, for the excitatory input into the inhibitory cells. The explicit dynamic equations for the uij follow immediately from equations 3.6 and 3.7 adapted to the two-layer model. They are not repeated here. Figure 5C displays the impulse response w1(x, t) reconstructed by means of equations 3.6 through 3.8, and Figure 5D the respective component amplitudes, u1j. The reconstruction reveals all qualitative features of the true response in Figure 5A; minor differences are visible only in the precise numerical values. In this way, the structural properties of the spatiotemporal response are nicely related to the individual roughly gaussian components contributing to the potential profile. Unfortunately, the experimentally observed impulse response function does not uniquely determine the underlying cortical circuitry. Note that the dominating components u11 and u12 in Figure 5D are very similar to second-order alpha functions with somewhat slower inhibition than excitation. Thus, instead of using a recurrent network, one might also derive them from the input signal by a feedforward series of low-pass filters. By means of equation 3.8, a response very similar to that in Figures 5A and 5C can therefore also result from a pure feedforward model, where V1 cells directly receive relatively sharply tuned input from LGN fibers proportional to u11(t) and, in addition, poorly tuned feedforward inhibition via interneurons proportional to u12(t). Steady-state tuning in such a network, just as in the recurrent network, is contrast independent if semilinear rate functions with vanishing thresholds are employed, in which case any dynamic response is proportional to the absolute cortical input strength (Wennekers, 2001b). Therefore, one cannot strictly distinguish between feedforward and feedback mechanisms from the data reported by Ringach et al. (1997).
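Since u11 and u12 resemble second-order alpha functions, the feedforward surrogate is easy to sketch. The time constants and weights below are illustrative assumptions, chosen only to mimic the early-excitatory, late-inhibitory structure of the observed impulse responses:

```python
import numpy as np

def alpha(t, tau):
    """Impulse response of a second-order low-pass (alpha) filter: (t/tau^2) * exp(-t/tau)."""
    return (t / tau**2) * np.exp(-t / tau)

t = np.arange(0.0, 200.0, 0.1)  # time in ms

# Fast excitatory component minus a slower, weaker inhibitory one
# (illustrative time constants 15 ms and 30 ms, weights 1.0 and 0.8).
r = 1.0 * alpha(t, 15.0) - 0.8 * alpha(t, 30.0)

# Early response is net excitatory, late response net inhibitory,
# as in the biphasic impulse responses discussed in the text.
assert r[100] > 0   # t = 10 ms
assert r[1000] < 0  # t = 100 ms
```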
A comparable ambiguity is well known from linear systems theory, where every feedback system can be transformed into an equivalent feedforward system.

4.7 Influence of Inhibitory-to-Inhibitory Couplings. So far, synaptic connections between inhibitory cells have been neglected in the two-field model. Including them requires an additional term, −k22 ∗ f2(w2), in equation 4.3. In this section, I study its impact on the network dynamics. First, one again determines steady states, as in section 4.1. With c+ = c*11 and c− = c*12c*21 as before, and D = (1 + c*22)(1 − c+) + c−, the steady-state firing rates become h1 = (1 + c*22)D⁻¹I10 and h2 = c*21D⁻¹I10, and the steady-state tunings are now mutually coupled and given by
1/σ²p1 = 1/((1 + c*22)σ²10) + c+/S²11 − c−/((1 + c*22)S²12),   (4.16)

1/σ²p2 = (1 + c*22)/S²21 − c*22/S²22.   (4.17)
Figure 6: Stability diagram of the two-field model if inhibitory-to-inhibitory connections are included. Dashed lines in A are contour lines for σp1 at levels 5, 7, 9, . . . . Dashed lines in B are contour lines for σp2 at levels 36, 37, . . . . In both plots, thin solid lines are the mean-field stability boundaries for c22 = 0, corresponding to those in Figure 2. Thick lines are the same conditions for c22 = 1, and the further thin line along the Hopf branch for c22 = 1 results from linear stability theory applied to the reduced set of equations with mutual inhibition. The x's denote simulation results. Note that mutual inhibition moves the Hopf bifurcation to higher values of c+ = c11.
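The stabilizing effect of inhibitory-to-inhibitory coupling can be checked directly from the roots of the extended characteristic polynomial given in the text: raising c*22 raises the Hopf threshold for c+. A minimal sketch (the coupling values are illustrative):

```python
import numpy as np

def roots_c22(cp, cm, c22, tau1=10.0, tau2=10.0):
    """Roots of the extended mean-field polynomial
    C_MF(l) = tau1*tau2*l^2 + [tau2*(1-cp) + tau1*(1+c22)]*l + (1+c22)*(1-cp) + cm."""
    return np.roots([tau1 * tau2,
                     tau2 * (1.0 - cp) + tau1 * (1.0 + c22),
                     (1.0 + c22) * (1.0 - cp) + cm])

cp, cm = 2.5, 4.0

# Without inhibitory-to-inhibitory coupling, this point lies beyond the
# Hopf threshold cp = 1 + tau1/tau2 = 2 and is therefore unstable ...
assert roots_c22(cp, cm, c22=0.0).real.max() > 0

# ... while c22 = 1 raises the threshold to cp = 1 + 2*tau1/tau2 = 3
# and stabilizes the same operation point.
assert roots_c22(cp, cm, c22=1.0).real.max() < 0
```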
Figure 6 displays contour lines for σp1 (A) and σp2 (B) obtained numerically from equations 4.16 and 4.17. Model parameters are as given in section 4.1, and, in addition, σ22 = σ12 ≈ 43. The mutual inhibition has no large impact on the tuning σp1, because the first and last terms on the right-hand side of equation 4.16 are small compared to the second term in large parts of the parameter space, such that σp1 is determined mainly by the tuning of the excitatory coupling kernel. Mutual inhibition, however, removes the point singularity for σp1 at c+ = 1, c− = 0 (K+ ≈ .14, K− = 0) observable in Figure 2A. It also decreases tuning in the inhibitory field, because potentials in that field can now become subthreshold in the tails of the activation peak, which effectively lowers the inhibitory tuning width (the iceberg effect). Stability analysis is again easily possible in the mean-field approximation. The characteristic polynomial now reads C_MF(λ) = τ1τ2λ² + [τ2(1 − c+) + τ1(1 + c*22)]λ + (1 + c*22)(1 − c+) + c−. As before, one finds a saddle-node bifurcation for C_MF(0) = 0, characterized by c− = (1 + c*22)(c+ − 1), and a Hopf bifurcation resulting from C_MF(jω) = 0 at c+ = 1 + (1 + c*22)τ1/τ2, c− ≥ (1 + c*22)²τ1/τ2, with corresponding oscillation frequencies ω² = c−/(τ1τ2) − (1 + c*22)²/τ2². For c*22 = 0, one recovers the conditions given in section 4.4. Figure 6 displays the mean-field stability conditions for c*22 = 0 and c*22 = 1. Stability lines for c*22 = 0 correspond to those in Figure 2. The mutual inhibition does not change the bifurcation diagram structurally but moves
the Hopf bifurcation to higher values of c+ = c*11. Stability can in this way be obtained even in systems with considerably stronger effective couplings than those derived by Somers et al. (1995). Note, however, that the critical c+ for the Hopf bifurcation is proportional to 1 + c*22, and the slope of the saddle-node line is as well. Thus, the c−-coordinate of the triple point where both branches merge grows quadratically in 1 + c*22. Stabilizing systems with gain c+ considerably larger than about 2 or 3 therefore needs strongly dominating inhibition, c−.

5 Discussion
Section 4 demonstrates that the dynamic approximation method is very useful for the prototypical models considered there. I now briefly comment on the applicability of the method, its restrictions, and possible extensions from a more general point of view. First, note that the approximation method is formulated in this work for one-dimensional fields, x ∈ ]−∞, ∞[. In higher dimensions, it can be applied as well. The derivation steps completely follow those in section 3.1 and basically result in equations of the same form as equations 3.4 and 3.5. The main difference is that the curvatures w2i become matrix valued, since derivatives in all spatial directions have to be taken into account in equation 3.1. The intuitive explanations, mean-field approximation, and receptive-field reconstruction method described in sections 3.2 and 3.3 also generalize to higher dimensions. A clear restriction of the method is that it works only for gaussian coupling kernels. For other kernel profiles, it is usually impossible to calculate the convolution integrals in equation 2.1 in closed form. Nonetheless, under very mild conditions, the respective integrals certainly exist. Therefore, one may derive the reduced set of equations in the same form as equations 3.4 and 3.5 by Taylor expansion up to second order. It is just that the effective couplings, cij(σpj), can no longer be given explicitly. But as long as the new kernels decrease monotonically with distance, even the functional (sigmoidal) form of the cij remains. In that respect, the reduced-system equations and the dynamic behavior they predict are generic. The method can provide approximations only. It is not a perturbation method, like the Poincaré-Lindstedt series, which becomes asymptotically exact as some smallness parameter tends to zero (cf. Ermentrout & Cowan, 1980). The gaussian approximation, equation 3.2, always results in some errors in the tails of the firing-rate profiles.
Nonetheless, the firing rates there are small and contribute only 10% to 20% of the total mass of the peaks. Therefore, the fundamental properties of the peak dynamics should be preserved in the reduced equations. Just like perturbation expansions, the approximation method does not necessarily lead to good results in the whole parameter space. For example, the bifurcation diagram, Figure 2, reveals slight structural differences
near the triple point where the saddle-node and Hopf bifurcation lines merge for the reduced set of equations as compared to the full neural-field model. There is no obvious way to estimate how large such effects are, but clearly, homotopy between the vector fields generating the full and the reduced dynamics is most easily lost near higher-order bifurcation points. It may be somewhat surprising that the membrane potential profiles on the whole line can be predicted quite well from just a second-order Taylor expansion around x = 0. The reason is that only the firing-rate peaks must be localized near x = 0 such that they are reasonably approximated by effective gaussians. Insertion of this approximation into equation 2.1, by continuity of that equation, then necessarily predicts the whole potential profiles. The resulting profiles need not be localized, or even bell shaped. Nonetheless, the properties of the localized firing-rate peaks are essentially determined by the lowest-order derivatives of the potentials around zero. A generalization of the approximation to spatially asymmetric kernels is problematic. It requires at least odd orders in the Taylor expansion and the reduced-system equations. Furthermore, asymmetries can lead to traveling-wave solutions that do not fit obviously into the developed approximation framework (cf. Zhang, 1996; Wennekers & Palm, 1996; Mineiro & Zipser, 1998). It might, however, be possible to determine properties of traveling-wave solutions from a perturbation expansion around an approximate symmetric solution, for instance, along the lines in Zhang (1996). Zhang's work further suggests that the approximation method developed here should work for exponential rate functions, f(·) ∼ exp(·). In Wennekers (2001a), I have shown that it also gives quite good results for semi-power functions, f(·) ∼ [·]+^p, p > 0. Clearly, it cannot be expected to give meaningful results for arbitrary f.
Throughout this article, vanishing initial conditions have been assumed. This assumption can be relaxed, but the initial conditions must at least already be localized, that is, expressible in the form 3.8. Otherwise, the reduced system can hardly govern the transient behavior of the peak dynamics. Models consisting of rate-modulated Poissonian spiking neurons lead, by means of spatial coarse graining, to neural field equations as in equation 2.1. For a homogeneous pool of stochastically indistinguishable neurons, it has been shown rigorously that the average dynamics of the network in the limit of infinite system size can be arbitrarily well described by a single graded-response neuron (Wilson & Cowan, 1972; Amari, 1977; Amari, Joshida, & Kanatani, 1977); n such neurons describe n pools. There is currently no strict proof, but it is plausible to assume that this result transfers to topographically organized neural fields, which are thus best understood as representing locally averaged cell properties on a mesoscopic level. Different sources of noise and quenched disorder are already contained in this
framework: random-valued synapses, synaptic dilution, and probabilistic release rates enter into the absolute strength Kij of the coupling kernels, whereas threshold distributions, individual background noise, and other sources of randomness essentially influence the effective rate functions fi. Effects of residual noise due to finite-size effects can in part be studied by extending equation 2.1 to a Langevin equation, that is, by adding a noise field on the right-hand side of that equation. As long as the noise stays relatively small, this does not change the principal dynamics but leads to fluctuating peak properties and, in some cases, stochastic switches between coexisting limit sets. Determining the effective noise strength may pose a problem in recurrent networks, since in principle the noise is coupled with the network activity itself and has to be determined self-consistently (see Burkitt, 1996; Amari, 1972). There is some theoretical evidence that the population dynamics of certain other classes of spiking neurons, basically describable as renewal processes, is more complicated than equation 2.1 (Gerstner, 2000; Treves, 1993; Abbott & van Vreeswijk, 1993). Research in that direction is ongoing, and because experimental evidence is largely missing, it is far from obvious which of the obtained theoretical predictions transfer to real cell populations. Rate-based models, on the other hand, have been shown repeatedly to describe quite a lot of the temporal response properties of cortical dynamic tuning well (Daugman, 1980; Troyer, Krukowski, Priebe, & Miller, 1998; Heeger, 1993; Suder, Wörgötter, & Wennekers, 2001). Eggert and van Hemmen (2000) have further shown that the rate dynamics of renewal processes can be described in lowest order by equations of the form 2.1, at least if the input is slowly varying, which is typically the case in experiments with continuously moving bars or gratings.
Higher-order corrections in Eggert and van Hemmen's work describe spike-timing effects, fast transients, and, in general, dynamic processes on short temporal scales. Such fine-grained temporal effects cannot be captured in rate-based models, but as I have argued elsewhere, their relevance for cortical processing is less than clear (Wennekers & Palm, 1997, 1999). Finally, Sompolinsky and coworkers have developed an analytical method to deal with certain nonlinear neural field models (Ben-Yishai et al., 1995, 1997; Hansel & Sompolinsky, 1998). Seen from a methodological level, their approach operates with truncated Fourier series for periodic solutions and is therefore complementary to the approximation developed in this work. For some special cases (e.g., semilinear rate functions, sinusoidal kernels, and input functions), where the occurring Fourier series are already finite, their method can yield exact results. However, when generalized to a broader class of networks such as equation 2.1, restriction to the lowest-order coefficients is necessary, which is to some degree comparable to the restriction to lowest-order derivatives in the approximation scheme developed in section 3.1. For that reason, in a broader context, approximations based on truncated Fourier series suffer from the same problems as the method developed in this article.
6 Conclusion
Employing a gaussian approximation for localized firing-rate profiles in quite general n-layer neural field models (see equation 2.1), I have shown that the membrane potential profiles can be approximated as temporally modulated sums of roughly gaussian components (see equation 3.3): one component for the external input and one component for every projection from a source to the target layer (the superposition principle). Taylor expansion leads to a set of 2n coupled ordinary nonlinear differential equations for the amplitudes and tuning widths alone (see equations 3.4 and 3.5). These equations enable a simplified analysis of spatiotemporal receptive-field properties and at the same time provide intuitive interpretations of the ongoing processes. The amplitude equations, 3.4, are similar to low-dimensional mean-field equations for pool models, such that the dynamic behavior of small neural networks can be expected to transfer to neural field models. The curvatures, 3.5, on the other hand, fix the width of the firing-rate peaks. The inverse peak variance in a layer is determined by a weighted sum of the inverse variances of the individual roughly gaussian components contributing to the potential profile. Moreover, the amplitude equations alone can be used as a lowest-order approximation for the network dynamics (the mean-field approximation). From this approximation, full spatiotemporal potential profiles can be reconstructed as superpositions of spatial gaussians with steady-state tuning widths (see equation 3.8). As an application, I compared tuning properties of a standard one-layer model with DOG kernel and an excitatory-inhibitory two-layer neural field model. Steady states of both models are equivalent, as are branches of saddle-node instabilities, which lead from tuned steady-state solutions to exponentially growing ones.
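The superposition picture rests on a basic property of gaussians: convolving a gaussian kernel of width σ11 with a gaussian firing-rate peak of width σp1 yields a gaussian component of width S11 = √(σ11² + σp1²). A quick numerical check (the grid is an illustrative choice; the widths use the standard value σ11 = 7.5):

```python
import numpy as np

sigma_k, sigma_p = 7.5, 7.5          # kernel width and rate-peak width
x = np.arange(-256.0, 256.0, 0.25)   # spatial grid, wide enough for both gaussians

kernel = np.exp(-x**2 / (2 * sigma_k**2))
rate = np.exp(-x**2 / (2 * sigma_p**2))
component = np.convolve(rate, kernel, mode="same")  # one component of the potential profile

# Width of the resulting component from its second central moment.
mean = np.sum(x * component) / np.sum(component)
width = np.sqrt(np.sum((x - mean)**2 * component) / np.sum(component))
assert np.isclose(width, np.sqrt(sigma_k**2 + sigma_p**2), rtol=1e-3)
```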
The two-layer model, in addition, reveals a Hopf bifurcation to limit-cycle oscillations, where both the amplitudes and tuning widths of localized activity peaks are rhythmically modulated. Already the mean-field approximation predicts steady-state tuning and stability properties reasonably well and enables the calculation of explicit formulas for stability boundaries and oscillation frequencies. The network behavior is virtually independent of the relative widths and weights of the mutual kernels k12 and k21 and depends only on the values of K− and σ−. The timescale of the dynamics influences only the duration of transients and the frequency of oscillations, not the fixed points or the stability diagram. In that respect, the contrast-independent tuning properties modeled by Ben-Yishai et al. (1995) and Somers et al. (1995) in a single neural field with DOG kernel are universal. Finally, impulse response functions for the two-layer excitatory-inhibitory field model, reconstructed from the mean-field approximation, reveal all qualitative properties of the true response functions derived from the full model and are in accordance with data by Ringach et al. (1997). Although the excitatory-inhibitory network employed here is not implausible, I have argued that almost indistinguishable impulse response functions may come out of a pure feedforward model.
Somers et al. (1995) did not report oscillations in their biologically detailed network. In Carandini and Ringach's (1997) and Ben-Yishai et al.'s (1995) one-field models with DOG kernel, they cannot occur. Taking the standard parameters K− = .092 and K+ = .23 as derived by Carandini and Ringach (1997) from the model of Somers et al. (1995), one finds an operation point relatively near the SN boundary in Figure 2, away from the Hopf bifurcation. This is not surprising, since these works emphasize the role of strong cortical amplification for sharp tuning. Since Somers et al. (1995) matched their model parameters to biological data, one might expect that the local networks in area V1 also operate in that regime. However, estimates of model parameters from biology are clearly subject to uncertainties, and roughly doubling the coupling strengths would lead to oscillations. Comparable mechanisms to those described here, based on excitatory-inhibitory mass action in neural nets, have been repeatedly suggested to underlie cortical oscillations in the alpha to gamma band (Skarda & Freeman, 1987; Wilson & Cowan, 1973; Wright, 1999). The example in section 4 demonstrates mass oscillations for tuned activity in a plausible excitatory-inhibitory neural field model that can be envisaged as an abstraction of Somers et al.'s (1995) model. The model predicts that not only amplitudes but also tuning widths should be temporally modulated during transient or autonomous oscillations. So far, experimental work has not addressed this possibility during oscillations, but Eckhorn, Krause, and Nelson (1993) and Suder et al. (2001) present data for temporally modulated tuning widths at least during transient responses in V1 cells. Although the one- and two-field models chosen as examples in section 4 are quite common and explain basic properties of visual cortical tuning, they might still be envisaged as too simplified from a biologist's point of view.
A recent overview and critical discussion concerning some of the drawbacks has been supplied by Ferster and Miller (2000). However, the analysis in this article may serve as a starting point for approaching more complex models analytically. Besides the direct usefulness of the reduced equations, the intuitive interpretations provided for the different steps of the reduction method should turn out helpful in that respect. Moreover, it should be clear that the developed approximation method is applicable in far more general areas than just orientation tuning. It should be useful wherever neural field models provide a reasonable level of description of localized activity in spatially distributed neural systems (e.g., Engels & Schöner, 1995; Samsonovic & McNaughton, 1997; Wennekers & Palm, 1996; Zhang, 1996).

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amari, S.-I. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. SMC, 2, 643–657.
Analysis of Spatiotemporal Receptive Fields
1823
Amari, S.-I. (1977). Dynamics of pattern formation in lateral-inhibition type networks. Biol. Cybern., 27, 77–87.
Amari, S.-I., Yoshida, K., & Kanatani, K. I. (1977). A mathematical foundation for statistical neurodynamics. SIAM J. Appl. Math., 33, 95–126.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned input in a cortical module. J. Comput. Neuroscience, 4, 57–77.
Burkitt, A. N. (1996). Retrieval properties of attractor neural networks that obey Dale's law using a self-consistent signal-to-noise analysis. Network: Computation in Neural Systems, 7, 517–531.
Carandini, M., & Ringach, D. L. (1997). Predictions of a recurrent model of orientation selectivity. Vision Research, 37, 3061–3071.
Daugman, J. G. (1980). Two-dimensional spectral analysis of cortical receptive-field profiles. Vision Research, 20, 847–856.
Eckhorn, R., Krause, F., & Nelson, J. I. (1993). The RF-cinematogram—a cross-correlation technique for mapping several visual receptive fields at once. Biological Cybernetics, 69, 37–55.
Eggert, J., & van Hemmen, J. L. (2000). Unifying framework for neuronal assembly dynamics. Physical Review E, 61, 1855–1874.
Engels, C., & Schöner, G. (1995). Dynamic fields endow behavior-based robots with representations. Robotics and Autonomous Systems, 14, 55–77.
Ermentrout, G. B., & Cowan, J. D. (1980). Large scale spatially organized activity in neural nets. SIAM J. Appl. Math., 38, 1–21.
Ferster, D., & Miller, K. D. (2000). Neural mechanisms of orientation selectivity in the visual cortex. Annual Review of Neuroscience, 23, 441–471.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Guckenheimer, J., & Holmes, P. (1983).
Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. New York: Springer-Verlag.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press.
Heeger, D. J. (1993). Modeling simple cell direction selectivity with normalized half-squared linear operators. J. Neurophysiol., 70, 1885–1898.
Krone, G., Mallot, H., Palm, G., & Schüz, A. (1986). Spatiotemporal receptive fields: A dynamical model derived from cortical architectonics. Proc. Roy. Soc. Lond. B, 226, 421–444.
Mineiro, P., & Zipser, D. (1998). Analysis of direction selectivity arising from recurrent cortical interactions. Neural Computation, 10, 353–371.
Omurtag, A., Kaplan, E., Knight, B., & Sirovich, L. (2000). A population approach to cortical dynamics with an application to orientation tuning. Network: Comput. Neural Syst., 11, 247–260.
Pugh, M. C., Ringach, D. L., Shapley, R., & Shelley, M. J. (2000). Computational modeling of orientation tuning dynamics in monkey primary visual cortex. J. Comput. Neurosci., 8, 143–159.
Ringach, D. L., Hawken, M. L., & Shapley, R. (1997). The dynamics of orientation tuning in macaque V1. Nature, 387, 281–284.
Sabatini, S. P. (1997). Recurrent inhibition and clustered connectivity as a basis for Gabor-like receptive fields in the visual cortex. Biol. Cybern., 74, 189–202.
Samsonovic, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. J. Neurosci., 17, 5900–5920.
Schuster, H., & Wagner, P. (1990a). A model for neuronal oscillations in the visual cortex. I. Mean-field theory and the derivation of the phase equations. Biol. Cybern., 64, 77–82.
Schuster, H., & Wagner, P. (1990b). A model for neuronal oscillations in the visual cortex. II. Phase description and feature dependent synchronization. Biol. Cybern., 64, 83–85.
Skarda, C. A., & Freeman, W. (1987). How brains make chaos in order to make sense of the world. Beh. Brain Sci., 10, 161–195.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortex simple cells. J. Neurosci., 15, 5448–5465.
Suder, K., Wörgötter, F., & Wennekers, T. (2001). Neural field model of receptive field restructuring in primary visual cortex. Neural Computation, 13, 139–159.
Taylor, J. G. (1999). Neural "bubble" dynamics in two dimensions: Foundations. Biol. Cybern., 80, 393–409.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Troyer, T. W., Krukowski, A. E., Priebe, N. J., & Miller, K. D. (1998). Contrast-invariant orientation tuning in cat visual cortex: Input tuning and correlation-based intracortical connectivity. Journal of Neuroscience, 18, 5908–5927.
von Seelen, W., Mallot, H. A., & Giannakopoulos, F. (1987). Characteristics of neuronal systems in the visual cortex. Biol. Cybern., 56, 37–49.
Wennekers, T. (2001a). Orientation tuning properties of simple cells in area V1 derived from an approximate analysis of nonlinear neural field models.
Neural Computation, 13, 1721–1747.
Wennekers, T. (2001b). Generic steady and dynamic tuning properties in neural field models with rectifying rate functions. Manuscript submitted for publication.
Wennekers, T., & Palm, G. (1996). Controlling the speed of synfire chains. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, & B. Sendhoff (Eds.), Proceedings of the International Conference on Artificial Neural Networks ICANN96 (pp. 451–456). Berlin: Springer.
Wennekers, T., & Palm, G. (1997). On the relation between neural modelling and experimental neuroscience. Theory in Biosciences, 116, 267–283.
Wennekers, T., & Palm, G. (1999). How imprecise is neuronal synchronization? Neurocomputing, 26–27, 579–585.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Wright, J. J. (1999). Simulation of EEG: Dynamic changes in synaptic efficacy, cerebral rhythms, and dissipative and generative activity in cortex. Biol. Cybern., 81, 131–147.
Zhang, K.-C. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16, 2112–2126.

Received April 3, 2001; accepted January 31, 2002.
LETTER
Communicated by Terrence Sejnowski
Independent Components of Magnetoencephalography: Localization

Akaysha C. Tang
[email protected]
Departments of Psychology, Computer Science, and Neurosciences, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Barak A. Pearlmutter
[email protected]
Departments of Computer Science and Neurosciences, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Natalie A. Malaszenko
[email protected]
Department of Psychology, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Dan B. Phung
[email protected]
Department of Biology, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Bethany C. Reeb
[email protected]
Department of Psychology, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Neural Computation 14, 1827–1858 (2002) © 2002 Massachusetts Institute of Technology

We applied second-order blind identification (SOBI), an independent component analysis method, to MEG data collected during cognitive tasks. We explored SOBI's ability to help isolate underlying neuronal sources with relatively poor signal-to-noise ratios, allowing their identification and localization. We compare localization of the SOBI-separated components to localization from unprocessed sensor signals, using an equivalent current dipole modeling method. For visual and somatosensory modalities, SOBI preprocessing resulted in components that can be localized to physiologically and anatomically meaningful locations. Furthermore, this preprocessing allowed the detection of neuronal source activations that were otherwise undetectable. This increased probability of neuronal source detection and localization can be particularly beneficial for MEG studies of higher-level cognitive functions, which often
have greater signal variability and degraded signal-to-noise ratios than sensory activation tasks.

1 Introduction
Magnetoencephalography (MEG) is a passive functional brain imaging technique that, under ideal conditions, can monitor the activation of a neuronal population with a spatial resolution of a few millimeters and with millisecond temporal resolution (Hämäläinen, Hari, Ilmoniemi, Knuutila, & Lounasmaa, 1993; George et al., 1995). Typical signals associated with neuronal activity are on the order of 100 fT, while the noise signals within a shielded room tend to be much larger (Lewine & Orrison, 1995). Furthermore, the intrinsic sensor noise is comparable in magnitude to small neuronal signals. Therefore, what the sensors record during an experiment is always a mixture of small neuromagnetic and large noise signals. This relatively poor signal-to-noise ratio can affect the localization of neuronal activity.1 Several independent component analysis (ICA) algorithms, such as second-order blind identification (SOBI) (Belouchrani, Meraim, Cardoso, & Moulines, 1993; Cardoso, 1994), Infomax (Bell & Sejnowski, 1995), and fICA (Hyvärinen & Oja, 1997), have been applied to EEG data (Makeig, Bell, Jung, & Sejnowski, 1996; Makeig, Jung, Bell, Ghahremani, & Sejnowski, 1997; Makeig, Westerfield, Jung et al., 1999; Jung, Humphries et al., 2000; Jung, Makeig et al., 2000) and MEG data (Vigário, Jousmäki, Hämäläinen, Hari, & Oja, 1998; Tang, Pearlmutter, Zibulevsky, & Carter, 2000; Vigário, Särelä, Jousmäki, & Oja, 1999; Vigário, Särelä, Jousmäki, Hämäläinen, & Oja, 2000; Wübbeler et al., 2000; Ziehe, Müller, Nolte, Mackert, & Curio, 2000; Cao et al., 2000). In both applications, ICA methods have proven useful for artifact removal and for improving the signal-to-noise ratio (Jung, Humphries et al., 2000; Jung, Makeig et al., 2000; Vigário et al., 1998; Tang, Pearlmutter, Zibulevsky, & Carter, 2000). For general reviews of ICA, see Amari and Cichocki (1998), Cardoso (1998), Hyvärinen (1999), and Vigário et al. (2000).
For MEG, in addition to separating various noise signals from the neuromagnetic signals, SOBI and fICA have been shown to separate one neuronal source from another, both between and within modalities (Tang, Pearlmutter, Zibulevsky, & Carter, 2000; Vigário et al., 1999, 2000). To localize functionally independent neuronal sources or to localize and recover the time course of these neuronal sources simultaneously, a variety of algorithms have been proposed (Mosher, Lewis, & Leahy, 1992; Kinouchi et al., 1996; Sekihara, Poeppel, Marantz, Koizumi, & Miyashita, 1997; Nagano et

1 Unless otherwise indicated, we use signal-to-noise ratio in the sense defined in signal detection theory. Signals refer to the neuromagnetic signal of interest. Noise refers to all other signals, including environmental and sensor noise and other background brain signals.
al., 1998; Uutela, Hämäläinen, & Salmelin, 1998; Mosher & Leahy, 1998, 1999; Schwartz, Badier, Bihoue, & Bouliou, 1999; Sekihara et al., 2000; Huang et al., 2000; Aine, Huang, Stephen, & Christner, 2000; Ermer, Mosher, Huang, & Leahy, 2000; Cao et al., 2000; Schmidt, George, & Wood, 1999). Given their capability to separate noise and neuronal signals, ICA algorithms are expected to benefit all source localization methods by providing them with input signals that are more likely to be associated with functionally independent neuronal sources. It was found, however, that the fICA-separated components yielded localization results qualitatively similar to those arrived at without ICA preprocessing (Vigário et al., 1999). Consequently, no substantial benefits from ICA were reported for neuromagnetic source localization. Because one of the strengths of ICA is its ability to separate noise from the signals of interest, whether ICA can offer any advantage in source localization should depend on the signal-to-noise ratio in the sensor data. The experiment reported by Vigário et al. (1999) was optimally designed to produce strong and focal activation of a small number of neuromagnetic sources and therefore had high signal-to-noise ratios. Under such optimal conditions, ICA could not improve much on the already good localization provided by conventional methods. In this article, we apply ICA to neuromagnetic signals with relatively poor signal-to-noise ratios, collected during cognitive tasks involving large trial-to-trial variability in neuronal source activation and a much larger number of sources. We localized these neuronal sources using the equivalent current dipole (ECD) modeling method (Neuromag) on SOBI-separated components and on unprocessed sensor data. We found that SOBI preprocessing resulted in the localization of neuronal sources that could not be found when the dipole fitting method was applied directly to the sensor data.
In addition, the process of localizing separated components required significantly less subjective judgment regarding which sensors to exclude from the analysis and at what time the dipoles are fitted.2 We suggest that ICA methods can be particularly effective and efficient in the study of higher-level cognitive functions, where the neuronal source activations are often characterized by a greater degree of variability and lower signal-to-noise ratios.

2 It is a common practice to select 20 to 30 sensors over the brain region of interest for dipole fitting.

2 Methods

2.1 Cognitive Tasks. We collected MEG data from four right-handed subjects (two females and two males) during four visual reaction time tasks originally designed to study temporal lobe memory functions (Tang, Pearlmutter, Zibulevsky, Hely, & Weisend, 2000). These tasks are described in detail in appendix A. Here, we offer a brief description. In each task, a pair of colored patterns, one of which was the target, was presented on the left and right halves of the display screen. The subject was instructed to press either the left or right button when the target appeared on the left or right, respectively. In all tasks, the target was not described to the subject prior to the experiment. The subject was to discover the target by trial and error using auditory feedback (low and high tones corresponded to correct and incorrect responses, respectively). All subjects were able to discover the rule within a few trials. The tasks differed in the memory load required to determine which member of the pair was the target. Task 1 served to familiarize the subjects with all visual patterns. The subjects simply viewed the stimuli and were asked to press either the left or right button as they chose, while making sure that approximately equal numbers of left and right button presses were performed. As such, task 1 placed little memory demand on the subject. Task 2 involved remembering a single target pattern that appeared on each trial paired with another pattern. Subjects pressed the right or left button to indicate whether the target pattern was on the left or the right. Task 3 involved remembering multiple targets, each always paired with the same nontarget. Task 4 was the most complex. In this task, targets were context sensitive in a circular fashion, as in the game rock-paper-scissors. The amount of cognitive processing beyond the initial sensory processing increased successively from task 1 to task 4. We used data from these complex cognitive tasks to evaluate the capability of SOBI (see section B.1) because of the relatively poor signal-to-noise ratios involved in comparison to sensory activation tasks.
Specifically, these tasks involved (1) large visual field stimulation without the use of fixation points, (2) incidental somatosensory stimulation as a result of button presses during reaction time tasks, and (3) highly variable button press responses (because the precise form of the thumb movement, how the mouse was held, and where the hands rested were not specified). These sources of variability in visual and somatosensory activation can lead to poor signal-to-noise ratios in the averaged responses, making it particularly difficult to localize the neuronal sources from unprocessed averaged sensor data. The involvement of higher-level cognitive functions, memory demands, and the small number of trials (90 in most cases) collected under each task condition further decreased the signal-to-noise ratios in the averaged sensor data. These tasks therefore offered a set of challenging data sets in which the advantages of ICA methods could be revealed.

2.2 Selection of ICA Methods. In selecting ICA algorithms, one important consideration is the robustness of the algorithm to sensor noise. Instantaneous and summary algorithms are two extremes of ICA algorithms that differ in whether each point in time is considered in isolation. Instantaneous
algorithms, such as Bell-Sejnowski Infomax (1995) and fICA (Hyvärinen & Oja, 1997), make repeated passes through the data set and update the unmixing matrix in response to the data at each time point. They are derived under the assumption that the signals are white, and their results should therefore be invariant to shuffling of the data. As a consequence, they cannot take advantage of the temporal structure of each source as a cue for correct separation. In contrast, summary algorithms first make a pass through the data while summary statistics are accumulated by averaging; they then operate solely on the summary statistics to find the separation matrix. Some summary algorithms collect statistics that allow them to make use of the temporal structure of the sources as a cue for separation. More importantly, summary algorithms in general should be relatively insensitive to sensor noise, because their summary statistics are averages over time. The relatively poor signal-to-noise ratios in MEG data suggested the choice of a summary algorithm rather than an instantaneous algorithm. When it can be assumed that each source has a broad autocorrelation function, as is the case with brain signals, the summary algorithm SOBI (Belouchrani et al., 1993; Cardoso, 1994) can use this temporal structure as a cue and give high-quality separation while imposing rather modest computational requirements. SOBI extracts a large set of statistics from the data set, which it uses for the separation. Each of these statistics is calculated by averaging across the data set, which makes the algorithm robust against noise. The particular statistics calculated are the correlations between pairs of sensors at a fixed delay, ⟨x_i(t) x_j(t + τ)⟩.
This makes good use of abundant but noisy data, and, most importantly, SOBI can be tuned by modifying its set of delays (see appendix B), allowing its users to gently integrate a very weak form of prior knowledge: knowledge of the length constant of the autocorrelation function. Although Bell-Sejnowski Infomax and fICA have been previously applied to MEG and EEG data, and other ICA algorithms, such as contextual ICA (Pearlmutter & Parra, 1996) and sparse decomposition (Zibulevsky & Pearlmutter, 2001), are locally available, we selected SOBI as our ICA method based on the properties we have set out. However, we have not conducted a systematic comparison of ICA methods for our MEG data.
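These summary statistics are straightforward to compute. The sketch below is an illustration of the idea, not the authors' implementation: a hypothetical `lagged_correlations` helper that averages the lagged products over the whole recording (SOBI itself additionally whitens the data and jointly diagonalizes the resulting matrices).

```python
import numpy as np

def lagged_correlations(x, delays):
    """Time-lagged correlation matrices <x_i(t) x_j(t + tau)>, averaged over
    the whole recording.  x: (n_sensors, n_samples) array of sensor signals."""
    x = x - x.mean(axis=1, keepdims=True)  # remove per-sensor DC offset
    n, T = x.shape
    # For each delay tau, average the outer products x(t) x(t + tau)^T.
    return {tau: x[:, :T - tau] @ x[:, tau:].T / (T - tau) for tau in delays}
```

Because each matrix is an average over the full recording, sensor noise that is uncorrelated across time is suppressed, which is exactly the robustness property argued for above.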
2.3 Second-Order Blind Identification. SOBI is considered blind because it makes no assumptions about the form of the mixing process. In other words, SOBI does not attempt to solve the inverse problem or use the physics of the situation in any way. It does not try to estimate currents, and it knows nothing of Maxwell's equations or any of their consequences. The only physical assumption made about the mixing process is that it is instantaneous and linear. Let x(t) be an n-dimensional vector of sensor signals, which we assume to be an instantaneous linear mixture of n unknown independent underlying
sources s_i(t), via the unknown stationary n × n mixing matrix A:

x(t) = A s(t).    (2.1)
The ICA problem is to recover s(t), given the measurements x(t) and nothing else. This is accomplished by finding a matrix W that approximates A⁻¹, up to permutation and scaling of its rows. SOBI assumes that the sources are statistically independent in time and not necessarily orthogonal in space. It finds W by minimizing the correlation3 between one recovered source at time t and another at time t + τ. The particular set of delays τ we used was chosen to cover a reasonably wide interval without extending beyond the support of the autocorrelation function. Measured in units of samples, at our 300 Hz sampling rate, the delays4 were τ ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100}.
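To illustrate the principle (this is not the authors' implementation), the following sketch separates a toy two-source mixture using a single delay. It is the AMUSE algorithm, a simplified relative of SOBI that eigendecomposes one whitened, symmetrized lagged covariance instead of jointly diagonalizing the whole delay set; the mixing matrix and signals are hypothetical.

```python
import numpy as np

def amuse(x, tau):
    """AMUSE-style separation: whiten the sensor data, then eigendecompose a
    single symmetrized time-lagged covariance.  Returns an unmixing matrix W
    such that W @ x recovers the sources up to permutation and scaling."""
    x = x - x.mean(axis=1, keepdims=True)
    n, T = x.shape
    # 1. Whiten: rotate and rescale so the zero-lag covariance is the identity.
    d, E = np.linalg.eigh(x @ x.T / T)
    Q = E @ np.diag(d ** -0.5) @ E.T
    z = Q @ x
    # 2. Eigenvectors of the symmetrized lagged covariance give the rotation
    #    separating sources with distinct autocorrelations at lag tau.
    R = z[:, :T - tau] @ z[:, tau:].T / (T - tau)
    _, U = np.linalg.eigh((R + R.T) / 2)
    return U.T @ Q

# Toy demonstration: two sources with different autocorrelations, mixed linearly.
t = np.arange(20000)
s = np.vstack([np.sin(2 * np.pi * 0.01 * t),             # slow sinusoid
               np.sign(np.sin(2 * np.pi * 0.003 * t))])  # slower square wave
A = np.array([[1.0, 0.6], [0.4, 1.0]])                   # hypothetical mixing matrix
W = amuse(A @ s, tau=25)
s_hat = W @ (A @ s)   # recovered sources, up to permutation and scaling
```

SOBI's joint diagonalization over many delays plays the same role as the single eigendecomposition here, but averages out estimation error across the whole delay set.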
Each recovered s_i(t) also has a sensor space projection that gives the sensor readings of s_i(t) (see section B.2). This sensor projection can be displayed as a field map and can be used as input to source localization algorithms. For example, after calculating its sensor projection, we can repackage a component for localization by standard Neuromag dipole fitting software (xfit). SOBI shares a number of weaknesses with all ICA methods: they all assume that there are as many sensors as sources, they all make some sort of independence assumption, they all assume that the mixing process is linear, and they all assume that the mixing process is stable. (See section 4.3 for further discussion.)

2.4 Localization of Separated Components. SOBI was performed on continuous5 122-channel data collected during the entire period of the experiment, sampled at 300 Hz, and bandpass filtered at 0.03–100 Hz. It generated 122 components,6 each a one-dimensional time series with an associated field map (see section B.2). Each component potentially corresponds to a set of magnetic field generators.

3 For justification of this minimization, see section 4.
4 The choice of delays can affect the results of separation. Depending on the types of sources activated by the behavioral task, the selection of delays can have complex interactions with the latency of evoked responses. This is an important topic and deserves separate study.
5 Note that ICA algorithms can also be applied to cross-trial averages rather than continuous data, as in Makeig, Westerfield, Townsend et al. (1999).
6 ICA algorithms produce the same number of components as there are channels in their input.
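In this linear setting, the sensor space projection of a single component is just the rank-1 contribution of that component mapped back through the estimated mixing matrix. A minimal sketch with hypothetical shapes (not the Neuromag pipeline itself):

```python
import numpy as np

def sensor_projection(W, s_hat, k):
    """Sensor-space projection of recovered component k.
    W: (n, n) unmixing matrix; s_hat = W @ x: (n, T) recovered sources.
    Column k of inv(W) is the component's field map; its outer product with
    the component's time course gives the sensor readings due to k alone."""
    A_hat = np.linalg.inv(W)                 # estimated mixing matrix
    return np.outer(A_hat[:, k], s_hat[k])   # (n, T) rank-1 contribution
```

Summing the projections over all k reconstructs the sensor data exactly, and the field-map column A_hat[:, k] is the quantity handed to the dipole fitting software.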
Event-triggered averages were calculated from the continuous single-trial time series of all 122 separated components, where the triggering events were either sensory stimuli or behavioral responses. For the specific tasks used here, there were typically 10 to 20 components in each experiment that showed responses locked to either stimuli or button presses. Those with stimulus- or motor-locked responses were candidate neuronal generators, since they showed task-related activation. Those with responses locked to other external events, such as eye blinks or heartbeats (detected using EOG and EKG), were considered known noise sources. The rest were treated as non-task-related noise sources. For a task-related component, if its field map and time course were consistent with known neurophysiological and neuroanatomical facts, we considered it a neuronal component reflecting the activity of a neuronal generator. For example, if the field map of a component shows activation over the occipital cortex and the visual-stimulus-triggered average for this component contains an evoked response that peaks between 50 and 100 ms, then it is considered to reflect the activity of a visual source in the occipital lobe. Using this procedure, neuronal and nonneuronal generators were separated and identified (Tang, Pearlmutter, Zibulevsky, & Carter, 2000; Tang, Pearlmutter, Zibulevsky, Hely et al., 2000). A dipole fitting method was then applied to the identified neuronal components. The input to the dipole fitting algorithm (Neuromag xfit, least squares) was the field map, and the output was the location of ECDs projected onto the subject's structural MRI images. This same dipole fitting algorithm was used for localization with and without SOBI preprocessing. Because our goal was to evaluate whether ICA methods can improve source localization, we were not concerned with whether the least-squares dipole fitting was as good as more recent and more sophisticated source modeling methods.
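The event-triggered averaging step described above can be sketched as follows (a hypothetical helper; at the 300 Hz sampling rate used here, a 200 ms window corresponds to 60 samples):

```python
import numpy as np

def event_triggered_average(comp, events, pre, post):
    """Average a continuous component time course around trigger samples.
    comp: 1-D separated-component time series; events: sample indices of the
    triggering events (stimulus onsets or button presses); the averaging
    window spans [-pre, +post) samples around each event."""
    trials = [comp[e - pre:e + post] for e in events
              if e - pre >= 0 and e + post <= len(comp)]  # keep full windows only
    return np.mean(trials, axis=0)
```

Responses that are time-locked to the trigger survive the average, while activity uncorrelated with the events is attenuated, which is how the task-related components were screened.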
Our interest was not in localization accuracy per se but in the comparative performance of a given localization method when used alone, as opposed to being coupled with ICA. For the statistical comparison, to match the common practice in source modeling without SOBI, a subset of channels (20 to 30) over the region of interest was selected for dipole fitting with both methods. To localize each separated component, we chose channels over the region of interest showing stronger responses to the source. For localization without SOBI (the conventional practice), we began with the channels selected for SOBI localization and then modified the selections to obtain a more dipolar field pattern.7 If these modifications improved the results, then we used them; otherwise, we used the original channel selections. This procedure gave the conventional practice an advantage because the event-triggered-average responses
7 A field map is judged dipolar by visual inspection (Hari & Salmelin, 1997). If it contains two sets of concentric contour lines, the field is considered dipolar. If it contains more or fewer than two sets, the field is not dipolar.
were cleaner in the separated components than in the raw data. In fact, the raw data were often so noisy that no channels could have been selected by following the same procedure on the raw data, and therefore no localization could have been performed without the channel selection information enabled by SOBI. To localize a component, we used its field map as input to the dipole fitting program.8 One can select any time during the average time window to fit the dipole, because the dipole solution for a component is invariant to time (see section B.2). This independence of the localization results from the dipole fitting time can significantly simplify the dipole localization process, making it less subjective than dipole localization directly from the sensor data without the use of ICA. Using the conventional method, the time at which a dipole was fitted affects the final estimated dipole location. To localize neuronal sources without SOBI preprocessing, we used event-triggered sensor data (averages) as inputs to the dipole fitting program. We first chose the time with the largest evoked-response amplitude within the time window of interest. Then a subset of channels (20 to 30) over the region of interest was selected. When the contour maps were single-dipolar for the selected channels at the time chosen, a single dipole fit was performed. Otherwise, multiple dipoles were fitted. (For details of the process, see the xfit manual.) In the examples shown in the figures in this article, all channels were used in the dipole fitting to show that SOBI can identify dipolar sources without any channel selection.

3 Results

3.1 SOBI Decomposition: Time Courses and Sensor Projections. Using SOBI, continuous MEG signals from 122 channels were separated into 122 components, each with a time course and an associated sensor projection. The time course can be averaged across multiple trials using either the visual stimulus onset or the button press as a trigger.
It can also be displayed as an MEG image (e.g., see Figure 6, right), a pseudo-colored bitmap in which the responses of a given component during an entire experiment can be parsimoniously displayed (Jung et al., 1999). Typically, each row represents one discrete trial of stimulation, and multiple trials are ordered vertically from top to bottom. (See section B.3 for details on the process of giving sensible units to the components.) As shown in the overlay plots of the visual stimulus and button press triggered averages for all 122 components (see Figures 1c and 1d), only a small fraction of the components

8 Theoretically, one sampling point in time across all sensors contains all information about a source. In practice, the Neuromag software needs a time series of at least several samples. Therefore, we calculated the event-triggered average for the component of interest and made an input .fif file containing the average.
Figure 1: Event-triggered averages for groups of separated components (N = 90 trials). (a) Components showing visual-stimulus-triggered responses, triggered on visual stimulus onset. (b) Components showing button-press-triggered responses, triggered on button presses. (c) All components, triggered on visual stimulus onset. (d) All components, triggered on button presses.
showed task-related responses. For clarity, these task-related components are shown separately in Figures 1a and 1b. The components can be displayed in the sensor domain in field maps (see Figure 6, left) or in a full view graph using the Neuromag software xfit (see Figures 2–5). The sensor projections for two components are shown: one for a visual component (see Figure 2) and the other for a sensorimotor component (see Figure 3). It is clear that the two components project selectively to sensors over the visual and sensorimotor cortices. For comparison, the full view plots of the sensor projection from the raw data (a mixture of all components) are shown in Figures 4 and 5.

Figure 2: Sensor projection of components showing selective sensor activation over the occipito-parietal cortex (N = 90 trials, visual-stimulus-triggered averages). Compare with unseparated data in Figure 4.

3.2 Energy in Separated Components. We divided components into six categories: visual, somatosensory, ocular artifacts, 60 Hz, sensor jumps, and other.9 Visual and somatosensory components were identified by their clearly visible evoked responses in the MEG images and by their activation patterns in the field maps (see the following sections). These neuronal components were further verified by the consistency between the response latency, shown in the MEG images, and the spatial location of sensor activation, shown in the field maps. Ocular artifact sources were identified by their characteristic activation patterns in the field map and their large-amplitude responses in the MEG image (see Figure 6a), which match signals measured by EOG (not shown). The 60 Hz components were identified by the clearly visible 60 Hz cyclic activity in the MEG images in Figure 6b (see also Tang, Pearlmutter, Zibulevsky, & Carter, 2000). Sensor jump components were easily identified by the single-sensor activation in the field maps and sometimes by high-contrast lines or dots in the MEG images (see Figure 6c). For these five types of identified components, we calculated the amount of energy in each (see section B.4), across all subjects and all tasks, using

9 Sensor jumps refer to a peculiar property of the SQUID sensors that causes an enormous and nearly instantaneous DC shift. "Other" includes any components that do not belong to the first five categories.
Localization of ICA Components
1837
Figure 3: Sensor projection of components showing selective activation over the right fronto-parietal cortex (N = 90 trials, button-press-triggered averages). Compare with unseparated data in Figure 5.
a window of 200 ms after either the visual stimulus presentation or the button presses10 (see section B.4). This window was chosen to cover all neuronal responses. The amount of energy in a single component varied widely, between 0.17% and 71% of the total energy across all sensors. This range differed among the five categories of the components (see Table 1). The energies in the visual and somatosensory components (using visual stimuli and button presses as triggers, respectively) were 10.0 ± 1.02% (N = 29) and 4.65 ± 0.74% (N = 10). Using button presses as triggers (because subjects tended to blink after the button press responses), the energy in the ocular artifact components was 24.86 ± 4.67% (N = 16). Since both 60 Hz signals and sensor jumps were not task related, the energy in these two types of components was calculated using both visual stimulation and button

10 Window specification affects the calculated energy.
Figure 4: Visual-stimulus-triggered averages of unseparated data, N = 90 trials. Aberrant sensors are shaded.
presses as triggers and then averaged. The total energy in the 60 Hz sources was 10.19 ± 1.73% (N = 32), and the total energy in the sensor jump sources was 1.72 ± 0.28% (N = 13). The ocular artifact and 60 Hz components had the most energy, while the energy in the neuronal components represented 10% or less of the total.

Table 1: Range of Energy Accounted for (% of Total Energy Across All Channels) by the Five Categories of Components.

Category          Minimum   Maximum
Visual              0.21      24
Somatosensory       0.47      14
Ocular artifact     0.57      72
60 Hz               0.26      44
Sensor jump         0.17      12
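As a rough sketch of this energy measure, suppose SOBI yields a mixing matrix A (sensor projections as columns) and component time courses S. The fraction of total sensor energy carried by each component within the analysis window can then be computed by back-projecting each component onto the sensors. The exact definition lives in section B.4, which is not reproduced here, so the normalization below (squared back-projected signal over squared reconstructed data) is an assumption:

```python
import numpy as np

def component_energy_fractions(A, S, window):
    """Fraction of total sensor energy carried by each separated component.

    A : (n_sensors, n_components) mixing matrix estimated by SOBI.
    S : (n_components, n_times) component time courses.
    window : slice selecting, e.g., 200 ms after the trigger.
    """
    Sw = S[:, window]
    # Total energy of the reconstructed sensor data within the window.
    total = np.sum((A @ Sw) ** 2)
    # Energy of component i is the squared Frobenius norm of its
    # back-projection a_i s_i onto the sensors.
    energies = np.array([np.sum(np.outer(A[:, i], Sw[i]) ** 2)
                         for i in range(A.shape[1])])
    return energies / total
```

Note that with temporally correlated components, the cross terms mean the fractions need not sum exactly to 1; for well-separated components they come close.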
Figure 5: Button-press-triggered averages of unseparated data, N = 90 trials. Aberrant sensors are shaded.
3.3 Localization of Separated Components: Examples. Using the sensor projection of task-related components as input to the Neuromag software (xfit), we localized separated components. In conventional source localization practice, very often only 20 to 30 channels are selected for source localization. To show how well SOBI can isolate one neuronal source from another without relying on channel exclusion, throughout sections 3.3 and 3.4 we generated the field maps, contour plots, and dipole localizations for components using all channels (i.e., without channel exclusion). To make the localization results comparable between using SOBI and not using SOBI, a subset of 20 to 30 channels was used for dipole fitting in section 3.5, which provides statistical comparisons.
3.3.1 Visual Component. As the tasks involved simultaneous bilateral visual stimulation and judgment of its spatial location and identity, we expected SOBI to isolate visual components in the occipital, parietal, and
Figure 6: Field maps and unfiltered MEG images for (a) an ocular artifact component, (b) a 60 Hz component, and (c) a sensor jump component.
temporal lobes. Components with visual evoked responses indeed showed field map activation over occipital, parietal, and temporal lobes (not shown). For the particular stimuli used in these experiments, temporal sources were more variable in their precise location and temporal profiles. In contrast, occipito-parietal lobe activation appeared to have the greatest signal amplitudes and was reliably identified across multiple subjects. Figure 7 (left) shows the dipole location of one such occipito-parietal visual source along with its field map and time course.

3.3.2 Somatosensory Component. As the tasks involved button presses, we expected both somatosensory and motor responses from the sensory and
Figure 7: Examples of separated (left) visual, (middle) somatosensory, and (right) auditory components, shown in (a) event-triggered averages (N = 90 trials, stimulus onset at t = 0), (b) field maps, (c) contour plots, and (d) the fitted dipole superimposed on the subject's structural MRI images. All sensors (channels) were used in generating the contour plots and fitting the dipoles.
motor areas. Figure 7 (middle) shows the dipole location of one component in the left hemisphere. Notice that this dipole is near the region where one finds dipoles from median nerve stimulation (Hari & Forss, 1999; Tesche & Karhu, 1997) and that the median nerve services the thumb. The time course of the response suggests that this response is unlikely to be a response from the motor cortex, because the activations associated with motor preparation are typically estimated to be 385 ± 85 ms before the movement onset (Hoshiyama et al., 1997), much earlier than the latency shown here. The motor-evoked sensory responses, with an estimated onset time of approximately 20 ± 30 ms after the onset of movement (Hoshiyama et al., 1997), matched best our button-press-elicited responses. Therefore, this component corresponds to a somatosensory source rather than a motor source. In contrast to typical fast-rising somatosensory responses recorded using median nerve stimulation (Hari & Forss, 1999; Tesche & Karhu, 1997), the slow-rising somatosensory responses recorded here were elicited by stimulation of the thumb due to button presses. These temporal profiles are expected to differ because a very brief and focal electric shock differs from a much longer and more distributed stimulation of the thumb and its surrounding areas.

3.3.3 Auditory Component. Because the tones were presented as feedback, auditory responses were expected from the auditory cortex. In contrast to the typical auditory responses recorded during a simple auditory oddball
task, the auditory responses from our experiment were most likely to overlap with and perhaps to be affected by visual, motor, and somatosensory processing. Because both auditory responses and somatosensory responses were triggered on the button press, auditory responses needed to be distinguished from the somatosensory sources. The spatial location of their fitted ECDs in the auditory cortex and their longer response latencies were sufficient to allow the disambiguation. Figure 7 (right) shows one unilateral auditory source, with a slow-rising and long response latency, localized to the vicinity of the lateral fissure, as expected for auditory activation (Cansino, Williamson, & Karron, 1994). The relatively longer response latency (≈180 ms) may be due to particular aspects of the task (see appendix A). Specifically, in order to process the auditory feedback, the subjects must first switch their attention from the visual to the auditory modality, and this takes time. Furthermore, the subjects must process and interpret the auditory stimulus in evaluating their behavioral response and registering the correct target stimuli into memory. This additional processing may account for the difference in the temporal profile of the auditory responses. As the tones were bilaterally presented, one would expect auditory components with field maps showing bilateral activation. The auditory components recovered in these experiments, however, were unilateral for two subjects and bilateral for the other two. This variability across subjects could be due to differences in cerebral dominance of auditory processing. A difference in the temporal aspect of the left and right auditory processing could be expected to lead to the identification of two separate left and right components. In comparison to visual and somatosensory components, auditory components were much more difficult to identify, perhaps due to the complexity and associated variability.
Although in most cases auditory components could be identified from visual inspection of the field map and event-triggered averages, the signal-to-noise ratios were too poor to permit consistent dipole fitting across tasks and across subjects. Therefore, the following more detailed and systematic analysis of localization results will focus only on visual and somatosensory sources.

3.4 Cross-Task and Cross-Subject Reproducibility in Localization of Components. To show how reproducible the localization of components
can be across the four cognitive tasks, we examined separated visual components from one subject. Across tasks, two occipito-parietal visual sources were reliably localized within the same subject from two separated components. For both visual sources, the time course of the response is highly repeatable across multiple tasks, as shown in the overlay plot (see Figures 8a and 8b). The earlier visual responses were almost identical in both amplitude and response latency (see Figure 8a), while the later responses varied only in amplitude across tasks (see Figure 8b). Given the number of subjects in this study (four), we do not have the statistical power to draw any
Figure 8: Cross-task consistency in the (a, b) temporal profile and (c, d) dipole location of two visual components. (a, c) Occipital and (b, d) occipito-parietal sources can be identified and localized consistently across multiple tasks (overlay). (a, b) Visual-stimulus-triggered averages from four visual tasks, overlaid (N = 90 trials per task). (c, d) Corresponding single ECDs for the visual sources in a and b. Notice the consistency of the dipole locations across tasks. Notice also that the temporal profile of the earlier visual source (a) did not differ across tasks, but the amplitude of the later visual source (b) was modulated by the task conditions.
conclusions about whether the amplitude increases monotonically with the complexity of the task. These visual components identified from different tasks were localized to similar locations within the occipital and parietal lobes, as shown in Figures 8c and 8d, in which fitted dipoles from multiple experiments are superimposed on the subject's structural MRI images. Notice that in the field map, the right side of the head is shown on the right, whereas in the structural MRI images, following radiological convention, the right side is shown on the left. To show how reproducible the localization of components can be across subjects, we examined separated somatosensory components from three subjects.11 In all three subjects, we reliably identified two components (left
11 The fourth subject did right-hand index-midfinger button presses, which differed from the rest of the subjects.
Figure 9: Somatosensory sources can be identified and localized consistently across multiple subjects (shown for the left source). Similar to Figure 7, except the responses were triggered by the button press.
and right) with button-press-locked responses in the somatosensory areas. Figure 9 shows the time course, field map, contour plot, and fitted dipole for the somatosensory components in the right hemisphere of the three subjects. Notice the cross-subject similarity in the field maps, contour plots, and dipole locations (somatosensory cortex in the anterior parietal lobe, postcentral sulcus).

3.5 Detecting Expected Neuronal Sources With and Without SOBI. To offer a quantitative comparison of the relative performance of source localization with and without SOBI, we attempted to identify and localize the most reliable occipito-parietal visual source and both the left and right somatosensory sources, in all subjects and all tasks, from separated components and from the unprocessed data. Because all four tasks involved bilateral presentation of visual stimuli, we expected that at least one visual source would be active in the occipito-parietal cortex. Similarly, because separate left and right button presses were required by all the tasks, we also expected that at least one left and one right somatosensory source would be active. For these expected sources, we attempted to localize the source with dipole fitting from separated components and from the raw sensor data (without SOBI). The percentage of the expected sources for which dipole solutions could be found is compared for localization with and without the aid of SOBI. For a component to be considered a detectable neuronal source, there must be an evoked response that clearly deviates from the baseline in the
averaged component data. We rejected all components with any ambiguity on this criterion. Second, the component must have a field map showing focal activation of sensors over the relevant brain regions (occipito-parietal cortex and anterior parietal cortex in this study). Third, the contour plot for the component must be dipolar. Finally, the fitted dipole must be in the relevant cortical areas. For a source to be considered detectable using the conventional method of localization, one must first identify a sensor at which the largest evoked response is found. Second, the contour plot must be dipolar at the peak time. Finally, in the few cases when multiple dipole solutions are needed, at least one of the dipoles must be localized to the expected brain region.

3.5.1 Visual Sources. Among all separated components, for each subject and each task, we were able to identify and localize an occipito-parietal visual source with a single dipole (100% detectability). These occipito-parietal components invariably had very focal sensor projections (see the field maps in Figure 7b), and the contour plots were invariably dipolar even without channel selection (for example, see the field map and contour plot in Figure 7). Single dipoles were fitted for these occipito-parietal components. A subset of channels over the occipito-parietal lobe (20 to 30 channels) was used for the purpose of fair comparison with the conventional analysis method without the aid of SOBI. The peak response latencies of these components (N = 16) were 139.0 ± 7.6 ms, and the dipole coordinates (X, Y, Z) were 7.5 ± 2.6, −49.4 ± 3.2, and 68.6 ± 3.4 mm. Using the conventional method of source localization directly from the unseparated sensor data, dipoles were fitted using the same or a similar subset of channels selected over the occipito-parietal cortex. In all subjects and all tasks, the conventional method identified and localized at least one visual source in the occipito-parietal lobe (100% detectability).
Of a total of 16 expected sources (4 tasks by 4 subjects), 10 could be fitted with a single dipole, 4 were fitted with two-dipole solutions, 1 was fitted with a three-dipole solution, and 1 was fitted with a four-dipole solution. When multiple dipole solutions were needed, at least one of them was localized to the occipito-parietal cortex. This variation in dipole solutions may reflect some individual differences in visual processing occurring outside the occipito-parietal cortex. The peak response latencies of these occipito-parietal visual sources (N = 16) were 143.6 ± 5.5 ms, and the dipole coordinates (X, Y, Z) were 4.21 ± 4.8, −55.89 ± 2.68, and 59.42 ± 3.83 mm.

3.5.2 Somatosensory Sources. From the components of all subjects and all tasks, with only two failures, we were able to identify and localize 22 of the 24 expected left and right somatosensory sources with a single dipole (3 subjects by 4 tasks by 2 sides). All 22 somatosensory components had very focal sensor projections (see the field maps in Figure 9b), and their contour plots were all highly dipolar even without channel selection (for example, see the field
Figure 10: SOBI increased the detectability of expected neuronal sources for the more variable somatosensory activation.
map and contour plot in Figure 9). Single dipoles were fitted to these components, with a subset of channels over the somatosensory cortex (20 to 30 channels) selected for the purpose of fair comparison with the conventional analysis method. The peak response latencies were 3.3 ± 4.2 and 0.8 ± 3.4 ms for the left (N = 11) and right (N = 11) somatosensory sources, respectively. The dipole coordinates (X, Y, Z) were −39.4 ± 2.4, 7.8 ± 2.7, and 84.6 ± 1.7 mm for the left and 45.69 ± 2.1, 5.6 ± 2.2, and 84.1 ± 3.1 mm for the right somatosensory sources. Using the conventional method of source localization directly from the unseparated sensor data, dipoles were fitted using the same or a similar subset of channels selected over the somatosensory cortex. Only 9 of the 24 expected left and right somatosensory sources could be identified and localized following the conventional method of identifying a peak response in the averaged sensor data. Of the 24 sources expected (3 subjects by 4 tasks by 2 sides), in 7 cases no visible peak response could be identified in any of the sensors. Of the remaining 17 cases in which peak responses could be found in at least one sensor over the somatosensory cortex, 4 did not have dipolar fields, and 4 resulted in dipole locations outside the head or in the auditory cortex. Single-dipole solutions were found in only 9 cases. The peak response latencies of these somatosensory sources were −5.2 ± 2.5 ms for the left hemisphere (N = 5) and 1.6 ± 1.8 ms for the right hemisphere (N = 4). The dipole coordinates (X, Y, Z) were −43.3 ± 3.9, 12.1 ± 5.6, and 82.8 ± 3.8 mm for the left and 42.3 ± 5.5, 15.9 ± 2.7, and 89.9 ± 1.4 mm for the right sources.

3.6 Statistical Comparisons. There was no significant difference in the detectability of the occipito-parietal source measured with and without SOBI. In contrast, SOBI preprocessing resulted in an increase in the detectability of the expected somatosensory sources (chi-square test, p < .0001) (see Figure 10).
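The detectability comparison for the somatosensory sources (22 of 24 with SOBI versus 9 of 24 without) can be reproduced as a 2 × 2 contingency test. The sketch below uses a Pearson chi-square statistic without continuity correction; whether a correction was applied in the original analysis is not stated, so that choice is an assumption:

```python
import math

# 2x2 contingency table for the 24 expected somatosensory sources
# (counts taken from the text above).
detected = {"SOBI": 22, "raw": 9}
missed = {"SOBI": 24 - 22, "raw": 24 - 9}

observed = [[detected["SOBI"], missed["SOBI"]],
            [detected["raw"], missed["raw"]]]

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

# Pearson chi-square statistic with 1 degree of freedom:
# sum over cells of (observed - expected)^2 / expected.
stat = sum((observed[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))

# For 1 df, the chi-square survival function reduces to erfc.
p = math.erfc(math.sqrt(stat / 2.0))
print(stat, p)  # p falls below .0001, consistent with the text
```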
The peak response latencies for the visual and somatosensory sources did not differ significantly when measured using and without using SOBI. For the visual sources, the precise dipole locations estimated with and without SOBI did not differ in the X and Y dimensions but nearly differed significantly in the Z dimension (p = 0.05). For the somatosensory sources, the precise dipole locations differed significantly in the Y dimension (p < 0.05) for the left source and in the Y and Z dimensions for the right source (p < 0.05). Because the true accuracy of source locations cannot be determined from these experiments without a depth electrode, no quantitative comparisons can be made concerning accuracy.

4 Discussion
We identified and localized visual and somatosensory sources activated in four subjects during four cognitive tasks. Due to the relatively large variability involved in highly cognitive tasks and the small number of trials collected, these tasks were characterized by relatively poor signal-to-noise ratios in the sensor data and therefore were ideal for evaluating differential localization performance. Our results showed that despite the large variability associated with the visual and somatosensory activations during these particular tasks, SOBI was able to separate identifiable visual and somatosensory components that were further localized to the expected cortical regions. The physiological and neuroanatomical interpretability of these components across multiple sensory modalities and their cross-subject and cross-task reproducibility establish SOBI as a viable method for separating and identifying neuronal populations from MEG data during fairly complex cognitive tasks. Most importantly, we showed that SOBI preprocessing offered a special advantage when the evoked responses in the sensor data had poor signal-to-noise ratios. Specifically, for the highly variable somatosensory activation evoked by incidental stimulation during button presses, SOBI preprocessing resulted in a greater percentage of the expected somatosensory sources being identified and localized than the same dipole modeling method applied directly to the raw sensor data.

4.1 SOBI Reduced Subjectivity and Labor in Source Localization. In conventional source localization, there are two major sources of subjectivity: the selection of dipole fitting times and the selection of channels. Both are eliminated by our proposed procedure. First, because each component has a fixed field map, the dipole fitting solutions for components were not sensitive to either the time at which the dipoles were fitted or to the sensor used for determining the time of fit (see section B.2).
Within this map, each sensor reading reflects only activation due to a single source generator, or several temporally coherent generators, as opposed to activation due to a combination of multiple generators, each with a different time course. Therefore, using SOBI, there is no need to subjectively select a time from a sensor for dipole fitting. Second, simple components, which have field
activation over early sensory processing areas, were almost always dipolar even without channel selection or reduction.12 Therefore, channel selection is not necessary. One way to see the difference between dipole localization with and without SOBI processing is to view SOBI as a more automatic and more objective tool that allows the isolation of sensor activation due to an already isolated, functionally independent generator. The reduced subjectivity and time required to find dipole solutions can make data analysis and training of new researchers for MEG more cost-effective.

4.2 SOBI Improved Detectability of Neuronal Sources. The advantages of ICA algorithms in general have been shown in a number of applications to EEG and MEG data. First, these algorithms can separate neuronal activity from various artifacts (Makeig et al., 1996; Vigário et al., 1998; Tang, Pearlmutter, Zibulevsky, & Carter, 2000; Ziehe et al., 2000; Jung, Humphries et al., 2000; Jung, Makeig et al., 2000), such as eye blinks. In contrast to methods that rely on the use of a template, ICA removes these artifacts without any prior assumptions about the nature of the waveforms. Second, ICA isolates physiologically and behaviorally meaningful components that describe previously unavailable aspects of neuronal activity (Makeig et al., 1997; Makeig, Westerfield, Jung et al., 1999; Wübbeler et al., 2000). Finally, ICA-separated neuronal sources are less contaminated by various noise sources, which allows single-trial response detection (Jung et al., 1999; Tang, Pearlmutter, Zibulevsky et al., 2000; Carter et al., 2000). ICA methods have been able to distinguish the absence of rhythmic activity from the absence of phase-locked rhythmic activity (Makeig, Townsend et al., 1999). We have shown that SOBI separation of the data resulted in a greater detectability of somatosensory sources but did not increase the detectability of visual sources.
This modality-specific improvement in source detectability depended on the signal-to-noise ratios in the sensor data. Because visual responses could be clearly identified from the raw sensor data even without the aid of SOBI, it would not have been possible for SOBI to improve the detection rate. In contrast, the relatively poor signal-to-noise ratios in the raw sensor data for the somatosensory responses caused many failures in identifying a sensor at which a peak response occurred and in determining the peak response time. Under this poor signal-to-noise condition, in all but two cases, SOBI preprocessing resulted in separated components with the characteristic field map, the characteristic temporal response profile, and the correct dipole location for a somatosensory source. These findings suggest another advantage that ICA algorithms can offer: improving the ability to detect and localize neuronal sources that are otherwise difficult

12 SOBI also separated out many complex components, which have multiple patches or very broad field activation. These components may reflect synchronized activation in multiple brain regions. Functional connectivity may be inferred among these brain regions.
to detect or are undetectable under relatively poor signal-to-noise conditions. This improvement has significant practical implications. First, brain regions involved in higher-level cognitive processing tend to show greater trial-to-trial variability in their activation and therefore have lower signal-to-noise ratios in the average responses. Second, behavioral tasks that bear greater resemblance to real-world situations tend to involve greater variability in both stimulus presentation and subsequent processing. Finally, studies of clinical patients and children are often limited by the length of the experiment and therefore often provide data from a limited number of trials. Our results suggest that ICA may offer an improved capability in detecting and localizing neuronal source activations in these difficult situations.

4.3 Assumptions of SOBI. Here, we discuss assumptions of particular relevance to SOBI and MEG rather than general issues in ICA. Like all other ICA algorithms, SOBI assumes that the mixing process is stable. In the context of MEG, a stable mixing process corresponds to assuming that the head is motionless relative to the sensors. For this reason, head stabilization can be particularly important in MEG when ICA is used. SOBI also assumes that there are at least as many sensors as sources. For us, this is not a serious problem, as our MEG device has 122 sensors, yet we recover only a few dozen sources that show task-related evoked responses. The observation that only a small number of sources are active during typical cognitive and sensory activation tasks is consistent with the results of studies using both EEG (Makeig, Westerfield, Jung et al., 1999) and MEG (Vigário et al., 2000). The crucial assumption in ICA is that of independence. (For a thorough discussion of the independence assumption as it pertains to MEG, see Vigário et al., 2000.) Here, we will discuss independence only in the context of the particular measure of independence used by SOBI.
One problem that EEG and MEG researchers have with the independence assumption arises from the fact that if one computes correlations between EEG or MEG sensor readings over multiple brain regions during behavioral tasks, one finds that some brain regions have nonzero correlations. A good example of correlated brain activity is the apparently correlated evoked responses from neuronal populations in multiple visual areas along the processing pathway during a visual stimulus presentation. Based on such an observation, one could conclude that as the statistical independence assumed by ICA is clearly violated, the results of ICA must not be trusted. Yet we have shown that SOBI was able to separate visual components that clearly correspond to neuronal responses from early and later visual processing stages that are correlated due to common input (Tang, Pearlmutter, Zibulevsky, Hely et al., 2000). Others (Makeig, Westerfield, Jung et al., 1999; Vigário et al., 2000) have produced behaviorally and neurophysiologically meaningful components under a variety of task conditions.
As different ICA algorithms use the independence assumption differently, we offer the following explanation, which applies specifically to SOBI. One needs to recognize that correlation is not a binary quantity. Consequently, neither is violation of the independence assumption. The important question is not whether the assumption is violated but whether it is sufficiently violated that the neuronal sources estimated by SOBI are no longer meaningful. SOBI uses the independence assumption by minimizing the total correlations computed at a set of time delays, as described in section B.1. As such, each delay-correlation matrix R_τ generally makes only a small contribution to the objective function. For example, the correlation one would observe between V1 and V2 responses could be high only at or around one particular time delay, say, in R_20ms. In optimizing its objective function, SOBI can leave a particularly large nonzero off-diagonal element (say, the one corresponding to the 20 ms delayed correlation between V1 and V2) when minimizing the sum of squared off-diagonal elements across all the components and time delays. Therefore, this particular way of maximizing independence is not necessarily incompatible with a large correlation at a particular time delay between two sources sharing common inputs. Most ICA algorithms, including SOBI, minimize some objective function. It is possible for the optimization process to find a poor local minimum. In general, poor results can arise from many underlying causes: poor experimental design, poorly conducted experiments, poor head stabilization, poor optimization within the ICA algorithm, or violation of assumptions. No amount of attention to any one possible problem can validate ICA-based methods for processing functional brain imaging data.
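The delayed-correlation objective described above can be made concrete. The sketch below computes symmetrized delay-correlation matrices R_τ from presphered signals and evaluates the sum of squared off-diagonal elements of W R_τ Wᵀ that SOBI minimizes over rotations W. The Jacobi-rotation optimization itself is omitted, so this illustrates the cost function, not the full algorithm:

```python
import numpy as np

def delayed_correlations(y, taus):
    """Symmetrized time-delayed correlation matrices R_tau of
    presphered signals y (n_components, n_times)."""
    n, T = y.shape
    Rs = []
    for tau in taus:
        R = y[:, tau:] @ y[:, :T - tau].T / (T - tau)
        Rs.append((R + R.T) / 2.0)  # symmetrize, as in SOBI
    return Rs

def off_diagonal_cost(W, Rs):
    """Sum of squared off-diagonal elements of W R_tau W^T over all
    delays: the quantity SOBI drives toward zero."""
    cost = 0.0
    for R in Rs:
        M = W @ R @ W.T
        cost += np.sum(M ** 2) - np.sum(np.diag(M) ** 2)
    return cost
```

For two genuinely independent signals, the identity rotation yields a near-zero cost, while any rotation that remixes them inflates the off-diagonal terms, which is what drives the separation.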
As with any other statistical procedure, the real issue here should not be whether assumptions are violated at all but whether the algorithms can robustly produce separated components that are behaviorally, neuroanatomically, and physiologically interpretable despite some violation of the assumptions under which the algorithms were derived. For example, t-tests are very robust against violation of the normality assumption and are therefore regularly performed on data that are not guaranteed to be gaussian. Only empirical results can give confidence that a method is correctly separating the MEG data.

4.4 Summary. Establishing that (1) SOBI preprocessing can lead to the identification and localization of physiologically and anatomically meaningful neuronal sources and (2) SOBI preprocessing can increase the success rate in detecting and localizing neuronal source activation under poor signal-to-noise conditions is only the first step in demonstrating the usefulness of ICA algorithms for the analysis and interpretation of MEG data. The next steps include systematically studying the effect of ICA on source localization when ICA methods are combined with more sophisticated source localization algorithms (Ribary et al., 1991; Aine et al., 1998; Mosher & Leahy,
1999; Schmidt et al., 1999) and exploring the possibility of measuring single-trial response onset times in ICA-separated neuronal sources. Currently in submission is a closely related article, "Independent Component of Magnetoencephalography: Single-Trial Response Onset Time Detection."

Appendix A: Experimental Details
Continuous 122-channel data, sampled at 300 Hz and bandpass filtered at 0.03 to 100 Hz, were collected during the entire period of the following four tasks. Each subject performed four visual reaction time tasks. In all tasks, each trial consisted of a pair of colored abstract block compositions, one of which was the target, presented symmetrically and simultaneously on the left and right halves of the screen. Subjects were instructed to respond as quickly and as accurately as possible with a left- or right-hand mouse button press when the target stimulus was presented on the left or right side of the display screen, respectively. The button press elicited auditory feedback indicating whether a correct or incorrect response was made. Stimuli were presented either on a 15-inch VGA computer monitor at a distance of 48 inches, occupying 7.6 degrees of visual angle, or back-projected by an LCD projector positioned so that the stimulus occupied the same visual angle. In all tasks, the interval between the motor response and the next stimulus presentation was 3.0 ± 0.5 seconds. Auditory feedback was composed of 2000 Hz and 500 Hz tones indicating correct and incorrect choices, respectively. The four tasks differed from each other primarily in their definition of the target stimulus, which affected how much processing was required for target determination. The precise duration of each task varied slightly across subjects, depending on the subject's reaction time; typical durations are given below. The first task (stimulus pre-exposure) consisted of 270 trials and took the subjects approximately 30 minutes to perform. The other three tasks (elemental discrimination, trump card, and transverse patterning) each consisted of 90 trials that used subsets of the same stimuli contained in the first task. Each of these three tasks took approximately 10 minutes to complete.
For each subject, all four experiments were performed on the same day but each in a separate session. Instructions for each experiment were given immediately prior to that experiment. Subjects were permitted to move between experiments. Head positions were recalibrated at the beginning of each experiment. Subjects performed the four experiments in order of increasing task demand: stimulus preexposure, trump card task, elemental discrimination task, and transverse patterning task:

1. Stimulus preexposure task. There were no predefined relationships between stimuli and button presses. No feedback was given to the subjects about any choice. The subject was instructed to examine both stimuli and then make a roughly equal number of right and left button presses, without consistent alternation between right and left responses. The sequence of presentation was random. Presentations of each stimulus on the left and right sides of the video screen were counterbalanced.

2. Trump card task. Subjects were instructed to discover by trial and error which of the two stimuli in the stimulus pair was the target (the trump card). Nine stimulus pairs involving 10 stimuli were used, with a single stimulus as the trump card. Subjects did not have any problem in discovering the trump card within a few trials.

3. Elemental discrimination task. Subjects were instructed to discover which one of the stimulus pair was the target stimulus by trial and error. Three stimulus pairs consisting of six stimuli were used. For each pair of stimuli, one of the pair was the target. This task differs from the trump card task in that multiple target stimuli were involved. All subjects found the targets within a few trials.

4. Transverse patterning task. Subjects were instructed to discover which of the two stimuli in a stimulus pair was the target. Three stimulus pairs consisting of three stimulus compositions were used. Each stimulus could be a target or nontarget depending on what it was paired with. The target definition was a "rock-paper-scissors" arrangement: A wins when paired with B, B wins when paired with C, C wins when paired with A. Again, subjects were able to discover the winning relationships after a few trials.

Appendix B: Mathematical Methods

B.1 SOBI Source Separation Algorithm. The SOBI algorithm (Belouchrani et al., 1993) proceeds in two stages. First, the sensor signals are zero-meaned and presphered as follows:

$$y(t) = B\,(x(t) - \langle x(t) \rangle).$$
(B.1)
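A minimal sketch of this presphering step, assuming the signals are held in a NumPy array of shape (sensors × samples); the function name `presphere` is ours, not from the article:

```python
import numpy as np

def presphere(x):
    """Zero-mean and presphere sensor signals (cf. Eq. B.1):
    y(t) = B (x(t) - <x(t)>), with B chosen so that <y y^T> = I."""
    xc = x - x.mean(axis=1, keepdims=True)   # subtract the time average <x(t)>
    corr = xc @ xc.T / x.shape[1]            # correlation matrix of the centered data
    lam, U = np.linalg.eigh(corr)            # eigenvalues/eigenvectors (PCA basis)
    B = np.diag(lam ** -0.5) @ U.T           # B = diag(lambda_i^{-1/2}) U^T
    return B @ xc, B
```

After presphering, the correlation matrix of `y` is the identity, which is what allows the rotation sought in the second stage to be constrained to be rigid.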
The angle brackets $\langle \cdot \rangle$ denote an average over time, so the subtraction guarantees that $y$ will have a mean of zero. The matrix $B$ is chosen so that the correlation matrix of $y$, $\langle y(t)\, y(t)^T \rangle$, becomes the identity matrix. This is accomplished by moving to the PCA basis using $B = \mathrm{diag}(\lambda_i^{-1/2})\, U^T$, where $\lambda_i$ are the eigenvalues of the correlation matrix $\langle (x(t) - \langle x(t) \rangle)(x(t) - \langle x(t) \rangle)^T \rangle$ and $U$ is the matrix whose columns are the corresponding eigenvectors, that is, the "PCA components" of $x$. (This presphering is solely for the purpose of improving the numerics of the situation by constraining the matrix $V$ below to be a rigid rotation.) For the second stage, one constructs a set of matrices that, in the correct separated basis, should be diagonal. We chose a set of time delay values $\tau$
to compute symmetrized correlation matrices between the signal $y(t)$ and a temporally shifted version of itself,

$$R_\tau = \mathrm{sym}(\langle y(t)\, y(t + \tau)^T \rangle)$$
(B.2)
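These matrices are simple to compute; a sketch of our own (with $\mathrm{sym}(M) = (M + M^T)/2$, as the text defines next):

```python
import numpy as np

def delayed_correlations(y, taus):
    """Symmetrized time-delayed correlation matrices R_tau (cf. Eq. B.2)."""
    n = y.shape[1]
    Rs = {}
    for tau in taus:
        R = y[:, :n - tau] @ y[:, tau:].T / (n - tau)  # <y(t) y(t + tau)^T>
        Rs[tau] = (R + R.T) / 2                        # sym(M) = (M + M^T)/2
    return Rs
```

In the correct separated basis each $R_\tau$ is diagonal, which is what the joint diagonalization described below exploits.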
where $\mathrm{sym}(M) = (M + M^T)/2$ is a function that takes an asymmetric matrix and returns a closely related symmetric one. This symmetrization discards some information, but the problem is already highly overconstrained, and the symmetrized matrices provide valid, albeit slightly weaker, constraints on the solution. After calculating the $R_\tau$, we look for a rotation $V$ that jointly diagonalizes all of them by minimizing $\sum_\tau \sum_{i \neq j} (V^T R_\tau V)_{ij}^2$, the sum of the squares
of the off-diagonal entries of the matrix products $V^T R_\tau V$, via an iterative process (Cardoso & Souloumiac, 1996; using MATLAB code available online at http://sig.enst.fr/~cardoso/). The final estimate of the separation matrix is $W = V^T B$, which is used to calculate the separated components $\hat{s}(t) = W x(t)$.

B.2 Separated Components in Sensor Space. Since $W$ is the estimated unmixing matrix, let us use $\hat{s}(t) = W x(t)$ for the consequent estimated sources, and $\hat{A} = W^{-1}$ for the corresponding estimated mixing matrix. Using these, the sensor signals resulting from just one of the components can be computed as $\hat{x}(t) = \hat{A} D \hat{s}(t) = \hat{A} D W x(t)$, where $D$ is a matrix of zeros except for ones on the diagonal entries corresponding to each component that is to be retained. To localize a single component, one computes

$$\hat{x}^{(i)}(t) = \hat{s}_i(t)\, \hat{a}^{(i)},$$
(B.3)
where $\hat{a}^{(i)}$ is the $i$th column of $\hat{A}$, and $\hat{x}^{(i)}(t)$ is the sensor-space image of source $i$. Because $\hat{x}^{(i)}(t)$ is at each point in time equal to the unchanging vector $\hat{a}^{(i)}$, scaled by the time course $\hat{s}_i(t)$, dipole fitting algorithms will localize $\hat{x}^{(i)}(t)$ to the same location no matter what window in time is chosen.

B.3 Scaling. Blind source separation leaves the freedom to choose an arbitrary scale factor for each component. For instance, the source $s_i(t)$ could be scaled up by a factor of 10 and the $i$th column of $A$ scaled down by the same factor of 10, giving rise to the exact same observation $x(t)$. Making a reasonable assumption that all the sensors have intrinsic gaussian noise of the same magnitude, we used the additivity of these independent sensor noises to scale each row of $W$ to give each row a vector length of one. That is, if $\tilde{W}$ is the unscaled unmixing matrix, then we normalized its rows to
yield $W$ using

$$w_{ij} = \frac{\tilde{w}_{ij}}{\sqrt{\sum_j \tilde{w}_{ij}^2}}. \qquad (B.4)$$
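Equation B.4 amounts to dividing each row of the unmixing matrix by its Euclidean norm; a one-line sketch (function name ours):

```python
import numpy as np

def normalize_rows(W_tilde):
    """Scale each row of the unscaled unmixing matrix to unit length (cf. Eq. B.4)."""
    return W_tilde / np.sqrt((W_tilde ** 2).sum(axis=1, keepdims=True))
```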
With this scale factor, the sources can be viewed as being measured by a "virtual sensor" that measures in the same units, with the same scale, and with the same amount of intrinsic noise as the real sensors. This gives rise to "effective fT/cm" units as above. An alternative approach to scaling is to try to calculate the actual strength of the source, for instance, the actual total energy emitted. This can be done by fitting a physical source model (such as an equivalent current dipole) to each component and scaling the rows of $W$ such that the columns of $\hat{A}$ match the attenuations predicted by the estimated physical model. This approach has the disadvantage of being dependent on the localization process, thus giving rise to multiple scalings when there are multiple localization procedures in use, or even when a single procedure produces multiple possible localizations. Another disadvantage of this alternative is a failure to generate a scaling when the localization fails, as it would on noise components.

B.4 Energy/Variance Accounted For. A commonly used statistic is the energy in a source, or the amount of variance it accounts for. The energy of source $i$ is

$$E_i = \sum_t \sum_j \bigl(\hat{x}_j^{(i)}(t) - \bar{\hat{x}}_j^{(i)}\bigr)^2, \qquad (B.5)$$
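The equivalence between this sensor-space energy and the simplified component-space form derived next can be checked numerically. In the toy sketch below (ours, not from the article), we use an orthogonal $W$, for which the columns of $\hat{A} = W^{-1}$ have unit length, so the two expressions agree exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
s_hat = rng.standard_normal((2, 500))             # estimated component time courses
W, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # orthogonal unmixing matrix
A_hat = np.linalg.inv(W)                          # estimated mixing matrix (unit-norm columns)

i = 0
x_i = np.outer(A_hat[:, i], s_hat[i])             # sensor-space image of component i (Eq. B.3)
E_sensor = ((x_i - x_i.mean(axis=1, keepdims=True)) ** 2).sum()  # Eq. B.5
E_source = ((s_hat[i] - s_hat[i].mean()) ** 2).sum()             # Eq. B.6
```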
where the mean is being subtracted to discount DC offsets, an important consideration in MEG. Because the rows of the matrix $W$ are normalized, we can simplify this expression using equation B.3, yielding

$$E_i = \sum_t \bigl(\hat{s}_i(t) - \bar{\hat{s}}_i\bigr)^2,$$
(B.6)
which is computationally more efficient. In this article, we gave the fraction of variance accounted for by the $i$th component as $E_i / \sum_i E_i$.

Acknowledgments
This work was supported by the National Foundation for Functional Brain Imaging and NSF CAREER award 97-02-311, an equipment grant from Intel Corporation, the Albuquerque High Performance Computing Center, a gift from George Cowan, and a gift from the NEC Research Institute. We thank Ole Jensen for tips on packaging the separated components for the Neuromag software, Mike Weisend for granting us access to his data, and Robert
Christner for technical support. We also thank Lloyd Kaufman, Zhonglin Liu, Claudia Tesche, Cheryl Aine, and Roland Lee for their comments and discussion.

References

Aine, C. J., Huang, M., Christner, R., Stephen, J., Meyer, J., Silveri, J., & Weisend, M. (1998). New developments in source localization algorithms: Clinical examples. International Journal of Psychophysiology, 30, 198.

Aine, C., Huang, M., Stephen, J., & Christner, R. (2000). Multistart algorithms for MEG empirical data analysis reliably characterize locations and time courses of multiple sources. Neuroimage, 12(2), 159–172.

Amari, S.-I., & Cichocki, A. (1998). Adaptive blind signal processing—neural network approaches. Proceedings of the IEEE, 9, 2026–2048.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.

Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1993). Second-order blind separation of correlated sources. In Proc. Int. Conf. on Digital Sig. Proc. (pp. 346–351). Cyprus.

Cansino, S., Williamson, S. J., & Karron, D. (1994). Tonotopic organization of human auditory association cortex. Brain Res., 663, 38–50.

Cao, J. T., Murata, N., Amari, S., Cichocki, A., Takeda, T., Endo, H., & Harada, N. (2000). Single-trial magnetoencephalographic data decomposition and localization based on independent component analysis approach. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83A(9), 1757–1766.

Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In European Signal Processing Conference (pp. 776–779). Edinburgh.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE, 9(10), 2009–2025.

Cardoso, J.-F., & Souloumiac, A. (1996). Jacobi angles for simultaneous diagonalization. SIAM Journal of Matrix Analysis and Applications, 17(1), 161–164.

Carter, S. A., Tang, A. C., Pearlmutter, B. A., Anderson, L. K., Aine, C. J., & Christner, R. (2000). Co-activation of visual and auditory pathways induces changes in the timing of evoked responses in populations of neurons: An MEG study. Society for Neuroscience Abstracts, 26(9197), 1503.

Ermer, J. J., Mosher, J. C., Huang, M. X., & Leahy, R. M. (2000). Paired MEG data set source localization using recursively applied and projected (RAP) MUSIC. IEEE Transactions on Biomedical Engineering, 47(9), 1248–1260.

George, J. S., Aine, C. J., Mosher, J. C., Schmidt, D. M., Ranken, D. M., Schlitt, H. A., Wood, C. C., Lewine, J. D., Sanders, J. A., & Belliveau, J. W. (1995). Mapping function in the human brain with magnetoencephalography, anatomical magnetic resonance imaging, and functional magnetic resonance imaging. J. Clin. Neurophysiol., 12, 406–431.
Hämäläinen, M., Hari, R., Ilmoniemi, R. J., Knuutila, J., & Lounasmaa, O. V. (1993). Magnetoencephalography—theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Modern Physics, 65, 413–497.

Hari, R., & Forss, N. (1999). Magnetoencephalography in the study of human somatosensory cortical processing. Phil. Trans. R. Soc. Lond. B, 354, 1145–1154.

Hari, R., & Salmelin, R. (1997). Human cortical oscillations: A neuromagnetic view through the skull. Trends in Neuroscience, 20, 44–49.

Hoshiyama, M., Kakigi, R., Berg, P., Koyama, S., Kitamura, Y., Shimojo, M., Watanabe, S., & Nakamura, A. (1997). Identification of motor and sensory brain activities during unilateral finger movement: Spatiotemporal source analysis of movement-associated magnetic fields. Exp. Brain Res., 115(1), 6–14.

Huang, M. X., Aine, C., Davis, L., Butman, J., Christner, R., Weisend, M., Stephen, J., Meyer, J., Silveri, J., Herman, M., & Lee, R. R. (2000). Sources on the anterior and posterior banks of the central sulcus identified from magnetic somatosensory evoked responses using multi-start spatio-temporal localization. Human Brain Mapping, 11(2), 59–76.

Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.

Jung, T.-P., Humphries, C., Lee, T.-W., McKeown, M. J., Iragui, V., Makeig, S., & Sejnowski, T. J. (2000). Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37, 163–178.

Jung, T.-P., Makeig, S., Westerfield, M., Townsend, J., Courchesne, E., & Sejnowski, T. J. (1999). Analyzing and visualizing single-trial event-related potentials. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 118–124). Cambridge, MA: MIT Press.

Jung, T.-P., Makeig, S., Westerfield, M., Townsend, J., Courchesne, E., & Sejnowski, T. J. (2000). Removal of eye activity artifacts from visual event-related potentials in normal and clinical subjects. Clinical Neurophysiology, 111(10), 1745–1758.

Kinouchi, Y., Ohara, G., Nagashino, H., Soga, T., Shichijo, F., & Matsumoto, K. (1996). Dipole source localization of MEG by BP neural networks. Brain Topogr., 8(3), 317–321.

Lewine, J. D., & Orrison II, W. W. (1995). Magnetoencephalography and magnetic source imaging. In W. W. Orrison II, J. D. Lewine, J. A. Sanders, & M. F. Hartshorne (Eds.), Functional Brain Imaging (pp. 369–417). St. Louis: Mosby.

Makeig, S., Bell, A. J., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.

Makeig, S., Jung, T. P., Bell, A. J., Ghahremani, D., & Sejnowski, T. J. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Natl. Acad. Sci. USA, 94(20), 10979–10984.
Makeig, S., Townsend, J., Jung, T.-P., Enghoff, S., Gibson, C., Sejnowski, T. J., & Courchesne, E. (1999). Early visual evoked response peaks appear to be sums of activity in multiple alpha sources. Soc. Neurosci. Abstr., 25(652.8).

Makeig, S., Westerfield, M., Jung, T.-P., Covington, J., Townsend, J., Sejnowski, T. J., & Courchesne, E. (1999). Functionally independent components of the late positive event-related potential during visual spatial attention. J. Neurosci., 19(7), 2665–2680.

Makeig, S., Westerfield, M., Townsend, J., Jung, T.-P., Courchesne, E., & Sejnowski, T. J. (1999). Functionally independent components of the early event-related potential in a visual spatial attention task. Philosophical Transactions of the Royal Society: Biological Sciences, 354, 1135–1144.

Mosher, J. C., & Leahy, R. M. (1998). Recursive MUSIC: A framework for EEG and MEG source localization. IEEE Transactions on Biomedical Engineering, 45(11), 1342–1354.

Mosher, J. C., & Leahy, R. M. (1999). Source localization using recursively applied and projected (RAP) MUSIC. IEEE Transactions on Signal Processing, 47(2), 332–340.

Mosher, J. C., Lewis, P. S., & Leahy, R. M. (1992). Multiple dipole modeling and localization from spatiotemporal MEG data. IEEE Transactions on Biomedical Engineering, 39(6), 541–557.

Nagano, T., Ohno, Y., Uesugi, N., Ikeda, H., Ishiyama, A., & Kasai, N. (1998). Multi-source localization by genetic algorithm using MEG. IEEE Transactions on Magnetics, 34(5/pt.1), 2976–2979.

Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (pp. 151–157). Hong Kong: Springer-Verlag.

Ribary, U., Ioannides, A. A., Singh, K. D., Hasson, R., Bolton, J. P. R., Lado, F., Mogilner, A., & Llinas, R. (1991). Magnetic-field tomography of coherent thalamocortical 40-Hz oscillations in humans. Proc. Natl. Acad. Sci. USA, 88(24), 11037–11041.

Schmidt, D. M., George, J. S., & Wood, C. C. (1999). Bayesian inference applied to the electromagnetic inverse problem. Human Brain Mapping, 7(3), 195–212.

Schwartz, D. P., Badier, J. M., Bihoue, P., & Bouliou, A. (1999). Evaluation of a new MEG–EEG spatio-temporal localization approach using a realistic source model. Brain Topogr., 11(4), 279–289.

Sekihara, K., Nagarajan, S. S., Poeppel, D., Miyauchi, S., Fujimaki, N., Koizumi, H., & Miyashita, Y. (2000). Estimating neural sources from each time-frequency component of magnetoencephalographic data. IEEE Transactions on Biomedical Engineering, 47(5), 642–653.

Sekihara, K., Poeppel, D., Marantz, A., Koizumi, H., & Miyashita, Y. (1997). Noise covariance incorporated MEG-MUSIC algorithm: A method for multiple-dipole estimation tolerant of the influence of background brain activity. IEEE Transactions on Biomedical Engineering, 44(9), 839–847.

Tang, A. C., Pearlmutter, B. A., Zibulevsky, M., & Carter, S. A. (2000). Blind separation of multichannel neuromagnetic responses. Neurocomputing, 32–33, 1115–1120.

Tang, A. C., Pearlmutter, B. A., Zibulevsky, M., Hely, T. A., & Weisend, M. P. (2000). An MEG study of response latency and variability in the human
visual system during a visual-motor integration task. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 185–191). Cambridge, MA: MIT Press.

Tesche, C. D., & Karhu, J. (1997). Somatosensory evoked magnetic fields arising from sources in the human cerebellum. Brain Research, 744(1), 23–31.

Uutela, K., Hämäläinen, M., & Salmelin, R. (1998). Global optimization in the localization of neuromagnetic sources. IEEE Transactions on Biomedical Engineering, 45(6), 716–723.

Vigário, R., Jousmäki, V., Hämäläinen, M., Hari, R., & Oja, E. (1998). Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.

Vigário, R., Särelä, J., Jousmäki, V., Hämäläinen, M., & Oja, E. (2000). Independent component approach to the analysis of EEG and MEG recordings. IEEE Transactions on Biomedical Engineering, 47(5), 589–593.

Vigário, R., Särelä, J., Jousmäki, V., & Oja, E. (1999). Independent component analysis in decomposition of auditory and somatosensory evoked fields. In Proc. First International Conference on Independent Component Analysis and Blind Source Separation ICA'99 (pp. 167–172). Aussois, France, January 11–15, 1999.

Wübbeler, G., Ziehe, A., Mackert, B.-M., Müller, K.-R., Trahms, L., & Curio, G. (2000). Independent component analysis of non-invasively recorded cortical magnetic DC-fields in humans. IEEE Transactions on Biomedical Engineering, 47(5), 594–599.

Zibulevsky, M., & Pearlmutter, B. A. (2001). Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4), 863–882.

Ziehe, A., Müller, K.-R., Nolte, G., Mackert, B.-M., & Curio, G. (2000). Artifact reduction in magnetoneurography based on time-delayed second order correlations. IEEE Transactions on Biomedical Engineering, 47(1), 75–87.
Received August 1, 2000; accepted January 31, 2002.
LETTER
Communicated by Jean-François Cardoso
Robust Blind Source Separation by Beta Divergence

Minami Mihoko
[email protected]
Shinto Eguchi
[email protected]
Institute of Statistical Mathematics and Graduate University for Advanced Studies, Minato-ku, Tokyo 106-8569, Japan

Blind source separation is aimed at recovering original independent signals when their linear mixtures are observed. Various methods for estimating a recovering matrix have been proposed and applied to data in many fields, such as biological signal processing, communication engineering, and financial market data analysis. One problem these methods have is that they are often too sensitive to outliers, and the existence of a few outliers might change the estimate drastically. In this article, we propose a robust method of blind source separation based on the β divergence. Shift parameters are explicitly included in our model instead of the conventional way, which assumes that original signals have zero mean. The estimator gives smaller weights to possible outliers so that their influence on the estimate is weakened. Simulation results show that the proposed estimator significantly improves the performance over the existing methods when outliers exist; it keeps equal performance otherwise.

1 Introduction
Blind source separation is the problem of recovering the original independent signals when only their linear mixtures are observed. Let $s(t) = (s_1(t), s_2(t), \ldots, s_m(t))^T$, $t = 1, \ldots, n$, be a vector of $m$ source signals whose components are independent of each other. We cannot observe the original signals directly, but observe $x(t)$, linearly transformed by
$$x(t) = A s(t), \qquad t = 1, 2, \ldots, n,$$

where $A$ is a nonsingular unknown matrix of size $m$. Our problem is to estimate a recovering matrix $W$ such that the elements of $y(t) = W x(t)$ are independent of each other (cf. Hyvärinen, 1999b). If $W$ is properly obtained, $y(t)$ is equal to $s(t)$ except for an arbitrary scaling of each signal component and permutation of indices. Distributions of the original signals are unknown, although specific density functions might be used to define loss functions or derive estimating equations.

Neural Computation 14, 1859–1886 (2002) © 2002 Massachusetts Institute of Technology
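A toy numerical illustration of the mixing model above (our own example, not from the article): two independent sources are mixed by a known $A$ and recovered with $W = A^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 1000
s = rng.laplace(size=(m, n))            # independent (heavy-tailed) source signals
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # nonsingular mixing matrix
x = A @ s                               # observed mixtures x(t) = A s(t)

W = np.linalg.inv(A)                    # an ideal recovering matrix
y = W @ x                               # recovered signals
```

In practice $A$ is unknown and $W$ must be estimated from $x$ alone; any $W = P D A^{-1}$, with $P$ a permutation and $D$ a nonsingular diagonal scaling, recovers the sources equally well.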
Many estimators have been proposed from various points of view. These include Jutten and Hérault's (1991) heuristic approach, entropy maximization (Bell & Sejnowski, 1995), minimizing cross cumulants (Cardoso & Souloumiac, 1993), approximation of mutual information by Gram-Charlier expansion and the natural gradient approach (Amari, Cichocki, & Yang, 1996), and fixed-point algorithms (Hyvärinen, 1999a). Most estimators proposed so far have the same type of estimating functions and are shown to be unbiased provided the means of the original signals are zeros (Amari & Cardoso, 1997). However, one problem shared by this type of estimator is that they are not robust to outliers. In this article, we propose a robust blind source separation method based on the β divergence, which is essentially the same as the density power divergence of Basu, Harris, Hjort, and Jones (1998), with shift parameters explicitly included. (See Eguchi & Kano, 2001, for a general discussion of robust maximum likelihood methods.) In section 2, we briefly review a class of estimating functions for blind source separation and discuss robustness of estimators in this class. In section 3, we propose a robust estimator based on the β divergence and discuss its consistency, robustness, and stability. Computational algorithms are discussed in section 4, and simulation results are shown in section 5. Section 6 gives concluding remarks.

2 Estimating Function and Robustness
Most estimating functions for blind source separation methods can be expressed in a common form (Amari & Cardoso, 1997). One way to derive an estimating function in this form is through the framework of the maximum likelihood estimation method. Let us denote the density functions of a recovered signal vector $y$ and an observed signal vector $x$ by $q(y)$ and $f(x)$, respectively. If the marginal density of recovered signal $y_i$ is $q_i$ for $i = 1, \ldots, m$, the Kullback-Leibler divergence between the distribution of $y$ and the product of the marginal distributions of $y_1, \ldots, y_m$ is given by

$$D_{KL}(W) = \int q(y) \log \frac{q(y)}{\prod_{i=1}^m q_i(y_i)}\, dy = \int f(x) \log \frac{f(x)}{|\det(W)| \prod_{i=1}^m q_i(w_i x)}\, dx,$$

where $w_i$ is the $i$th row vector of $W$ for $i = 1, \ldots, m$ and

$$y = (y_1, \ldots, y_m)^T = W x = (w_1 x, \ldots, w_m x)^T.$$

The Kullback-Leibler divergence $D_{KL}(W)$ takes a nonnegative value and is equal to zero if and only if $W$ transforms $x$ to $y = W x$ such that $y_1, \ldots, y_m$ are independent of each other. However, we do not know any form of the distributions of the signals, so we replace $q_i$ by some specific density functions $p_i$ for $i = 1, \ldots, m$. Then the empirical version of the Kullback-Leibler divergence is given by

$$\hat{D}_{KL}(W) = \frac{1}{n} \sum_{t=1}^n \log \frac{f(x(t))}{|\det(W)| \prod_{i=1}^m p_i(y_i(t))}, \qquad (2.1)$$
where $y_i(t) = w_i x(t)$. Define the estimate for $W$ by the minimizer of $\hat{D}_{KL}(W)$ or, equivalently, the maximizer of the following quasi-log-likelihood function,

$$L_{QL}(W) = \frac{1}{n} \sum_{t=1}^n l(x(t), W), \qquad (2.2)$$
where

$$l(x, W) = \log |\det(W)| + \sum_{i=1}^m \log(p_i(y_i)). \qquad (2.3)$$
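A sketch of evaluating equations 2.2 and 2.3 (our own helper, not from the article): here $p_i$ is taken proportional to $1/\cosh$, a super-gaussian choice the article mentions later, with its normalizing constant dropped since that constant does not depend on $W$.

```python
import numpy as np

def quasi_log_likelihood(x, W):
    """L_QL(W) of Eq. 2.2 with l(x, W) of Eq. 2.3, using log p_i(u) = -log cosh(u)
    (the 1/cosh model up to its normalizing constant)."""
    y = W @ x                                    # y_i(t) = w_i x(t)
    log_p = -np.log(np.cosh(y)).sum(axis=0)      # sum_i log p_i(y_i(t)), one value per t
    return np.log(abs(np.linalg.det(W))) + log_p.mean()
```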
We employ the term quasi-log-likelihood for the above function, following Pham and Garat (1997), since we use functions $p_i$ ($i = 1, \ldots, m$) without assuming that they are density functions of recovered signals or are close to them. One might wonder if the minimization of $\hat{D}_{KL}(W)$ provides a reasonable estimate, since a specific density function $p_i$ is used without knowing any form of the underlying distribution. Amari and Cardoso (1997) showed that the corresponding estimating equation was unbiased, and Amari, Chen, and Cichocki (1997) gave the necessary and sufficient conditions for asymptotic stability. These results mean that any matrix that recovers independent signals is a critical point, with some scaling, and is indeed a (local) minimum of the expectation of $\hat{D}_{KL}(W)$ as well as of $D_{KL}(W)$ if the conditions for asymptotic stability are satisfied. The estimating function $F(x, W)$ is the derivative of $l(x, W)$:

$$F(x, W) = \frac{\partial l(x, W)}{\partial W} = (I_m - h(y)\, y^T)\, W^{-T}, \qquad (2.4)$$
where

$$y = W x, \qquad h(y) = (h_1(y_1), \ldots, h_m(y_m))^T, \qquad h_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i} = -\frac{p_i'(y_i)}{p_i(y_i)},$$

and the estimating equation is given by

$$\frac{1}{n} \sum_{t=1}^n F(x(t), W) = 0.$$
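Equation 2.4 can be transcribed directly for a single observation (a sketch of ours; for the $1/\cosh$ model, $h_i(y) = \tanh(y)$):

```python
import numpy as np

def estimating_function(x, W, h=np.tanh):
    """F(x, W) = (I_m - h(y) y^T) W^{-T}  (Eq. 2.4); h = tanh for the 1/cosh model."""
    y = W @ x
    m = len(y)
    return (np.eye(m) - np.outer(h(y), y)) @ np.linalg.inv(W).T
```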
We note here again that although we use specific density functions $p_i$ ($i = 1, \ldots, m$) for the derivation of estimating functions, nothing is assumed regarding the form of the density functions of the recovered signals when characteristics of estimators, such as consistency, are investigated. Our major objective is to propose an estimating function,

$$F_\beta(x, W) = \exp\{\beta\, l(x, W)\}\, F(x, W) - B_\beta(W), \qquad (2.5)$$
where $\beta \geq 0$ and $B_\beta(W)$ is an adjustment matrix not depending on $x$. We will show that our estimating function is associated with a certain divergence rather than the Kullback-Leibler divergence. Moreover, we add shift parameters to the implementation instead of following the conventional way, so that the estimation is performed in a unified way. Our choice of scalar function in equation 2.5 will be shown to provide robustness to the procedure. Most estimating functions for estimators proposed in the literature of blind source separation, including Jutten and Hérault's (1991) heuristic approach, entropy maximization, minimizing cross cumulants, and the natural gradient approach, are of the form

$$J(W)\, (I_m - h(y)\, y^T)\, G(W) \qquad (2.6)$$

except for their diagonal elements, where $J(W)$ and $G(W)$ are functions of $W$. Even for a model in which $p_i(x)$ contains nuisance parameters, such as the generalized gaussian model of Lee and Lewicki (2000), the estimating functions with respect to $W$ are in this form. With the assumption that the means of the original signals are zeros, the expectation of an estimating function in the form 2.6 is the zero vector, and the corresponding estimator is consistent (Amari & Cardoso, 1997).

Now we discuss the robustness of the estimators for blind source separation. The robustness of a maximum likelihood type estimator (M-estimator) is investigated based on its influence function (Hampel, Ronchetti, Rousseeuw, & Stahel, 1986). We can view an M-estimator as a functional of a distribution $G$ defined by

$$T[G] = \operatorname*{argmax}_\theta \left\{ \int V(\theta; x)\, dG(x) \right\},$$
where $\theta$ denotes a parameter vector to be estimated and $V(\theta; x)$ is the objective function, like the quasi-log-likelihood function $l(x, W)$ in our context. The influence function for the estimator at $x$ under distribution $G$ is defined as

$$\mathrm{IF}(x; G) = \lim_{h \to +0} \bigl\{ (T[(1-h)G + h \Delta_x] - T[G]) / h \bigr\},$$

where $\Delta_x$ is the probability measure that puts mass 1 at the point $x$.
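A finite-contamination version of this definition can be evaluated numerically. The toy example below (ours, not from the article) contrasts the sample mean, whose influence grows without bound in $x$, with the sample median, whose influence stays bounded:

```python
import numpy as np

def empirical_influence(stat, data, x):
    """Approximate IF(x; G): add one point of contamination at x to a sample of
    size n (contamination fraction h = 1/(n+1)) and rescale the change by 1/h."""
    n = len(data)
    h = 1.0 / (n + 1)
    contaminated = np.append(data, x)
    return (stat(contaminated) - stat(data)) / h

rng = np.random.default_rng(4)
data = rng.standard_normal(1001)
```

For the mean this gives exactly $x - \bar{x}$, unbounded in $x$; for the median, every contamination point above the sample yields the same value, so the influence is bounded.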
The influence function measures the asymptotic bias caused by contamination at the point $x$, and an estimator is said to be B-robust (the B comes from "bias") if its influence function is a bounded function of $x$. For M-estimators, it holds that

$$\mathrm{IF}(x; G) = M(v, G)^{-1}\, v(x; T[G]),$$

where $v(x; \theta)$ is the estimating function,

$$v(x; \theta) = \frac{\partial V(x; \theta)}{\partial \theta},$$

and

$$M(v, G) = -\int \left[ \frac{\partial v(x; \theta)}{\partial \theta} \right]_{\theta = T[G]} dG(x).$$
Matrix $M(v, G)$ does not depend on $x$; thus, B-robustness is equivalent to the boundedness of the estimating function for the M-estimator. (See Higuchi & Eguchi, 1998, for the influence function in principal component analysis (PCA).) Any estimator with an estimating function in the form 2.6 is not B-robust, no matter what functions are used for $h_1, \ldots, h_m$. An off-diagonal element of $(I_m - h(y)\, y^T)$ is $-h_j(y_j)\, y_k$ with $j \neq k$ and is not bounded as a function of $y_k$ for given $y_j$; thus, it is not bounded as an $m$-variate function. The fixed-point algorithms are also not B-robust, as discussed in Hyvärinen (1997). The corresponding element in our estimating function is $-\bigl(\prod_{i=1}^m p_i(y_i)\bigr)^\beta h_j(y_j)\, y_k$, so that it can be made bounded by the choice of $p_i$, as we will discuss later. Our estimating equation is

$$\frac{1}{n} \sum_{t=1}^n \exp\{\beta\, l(x(t), W)\}\, F(x(t), W) - B_\beta(W) = 0.$$
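The downweighting can be seen directly: the factor $\exp\{\beta\, l(x(t), W)\} = (|\det(W)| \prod_i p_i(y_i(t)))^\beta$ is tiny wherever the model density is tiny, that is, at outlying observations. A toy sketch of ours, with $p_i \propto 1/\cosh$ and normalizing constants dropped:

```python
import numpy as np

def beta_weights(x, W, beta=0.5):
    """Per-observation weights exp{beta * l(x(t), W)} (up to a constant factor),
    using log p_i(u) = -log cosh(u)."""
    y = W @ x
    l = np.log(abs(np.linalg.det(W))) - np.log(np.cosh(y)).sum(axis=0)
    return np.exp(beta * l)
```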
We expect that if the $t$th observation $x(t)$ is a possible outlier, the factor $\exp\{\beta\, l(x(t), W)\}$ becomes small. This intuitive observation will be justified by the argument on the boundedness of the influence function.

3 New Estimator Based on the β Divergence
Robust statistical procedures have been developed for many statistical models (see Huber, 1981, and Hampel et al., 1986). In this section, we propose a robust estimation method based on the β divergence with shift parameters explicitly included. We then show consistency and discuss the estimator's robustness and stability. Assume that for observed signals $x$, there exists a matrix $W$ such that the density $q$ of $y = W x$ is expressed as the product of (unknown) marginal densities as follows: $q(y) = q_1(y_1) \cdots q_m(y_m)$. The density $r$ of $x$ can then be expressed as

$$r(x, W) = |\det(W)|\, q(W x). \qquad (3.1)$$
Now fix $q_0 = \prod_{i=1}^m p_i$, where $p_i$ for $i = 1, \ldots, m$ are specific density functions we choose, and let $\mu = (\mu_1, \ldots, \mu_m)^T$ be a vector of shift parameters. We define $r_0$ as

$$r_0(x, W, \mu) = |\det(W)| \prod_{i=1}^m p_i(w_i x - \mu_i)$$

and use $r_0$ for estimation. A few possible choices for $p_i$, for example, are $p_i(u) = c_1 \exp\{-c_2 u^4\}$ for sub-gaussian (light-tail) signals and $p_i(u) = c_3 / \cosh(u)$ for super-gaussian (heavy-tail) signals (Amari et al., 1997).

In many blind source separation methods, the means of recovered signals are assumed to be zeros. However, the means of observations are usually nonzero and unknown, and so are those of the original signals. Even in such cases, it has been regarded, without further theoretical investigation, that the means of the signals can be assumed to be zeros by subtracting the sample averages from the observations. In other words, estimation of parameters is performed in two separate steps, and the estimate in the first step is treated as if it were a known value in the latter step. We include the parameter $\mu$ explicitly in our model and pay more attention to the fact that $\mu$ is a parameter vector to be estimated. The existence of the shift parameter $\mu$ has an important role for the estimator to have consistency, as we will see. We note that the idea of including shift parameters in the blind source separation model is not new (see Bell & Sejnowski, 1995, and Cardoso & Amari, 1997).

We now proceed to construct a robust estimator for blind source separation. First, let us define the β divergence between two density functions with respect to some carrier measure $\nu$ as

$$D_\beta(g, f) = \begin{cases} \displaystyle \frac{1}{\beta} \int \bigl(g^\beta(x) - f^\beta(x)\bigr)\, g(x)\, d\nu(x) - \frac{1}{\beta + 1} \int \bigl(g^{\beta+1}(x) - f^{\beta+1}(x)\bigr)\, d\nu(x) & \text{for } \beta > 0 \\[2ex] \displaystyle \int \bigl(\log g(x) - \log f(x)\bigr)\, g(x)\, d\nu(x) & \text{for } \beta = 0. \end{cases} \qquad (3.2)$$
Z (Z Db ( g, f ) D
g(x ) f ( x)
(3.2)
) b ¡1
( g ( x) ¡ z ) z
dz
dº (x ) ,
Db ( g, f ) is nonnegative and equal to 0 if and only if g D f almost everywhere with respect to º. When b D 0, the b divergence is reduced to the KullbackLeibler divergence. The b divergence (Eguchi & Kano, 2001) is related to the density power divergence db ( g, f ) proposed by Basu et al. (1998) as Db ( g, f ) D (1 C b ) db ( g, f ) , and the same estimator is induced from these divergence. (See Basu
Robust Blind Source Separation by Beta Divergence
1865
et al., 1998, for properties of the estimator in the framework of parametric statistical models.) The b divergence between the density of a recovered signal vector and the product of marginal densities, if they were known, would attain the minimum value, zero, if and only if recovered signals are independent each other. We consider an estimating procedure based on the b divergence bb ( rQ, r0 (¢, W, ¹ ) ) between the empirical distribution rQ of x and r 0 ( x, W, ¹ ) D instead of unknown r (x, W ) . Its opposite is equal up to the constant to the following quasi-b-likelihood function: 8 n > 11X 1 ¡b b > > r0 (x (t) , W, ¹) ¡ bb ( W ) ¡ > > b > < n b tD 1 for 0 < b < 1 Lb ( W, ¹) D > > n > 1X > > log(r 0 ( x ( t) , W, ¹) ) for b D 0, > :n
(3.3)
t D1
where bb ( W ) D
1
Z b C1 r 0 ( x,
bC1
| det(W) | b W, ¹) dx D bC1
Z b C1
q0
(z ) dz .
R b C1 Note that r0 (x, W, ¹ ) dx does not depend on ¹, so bb is a function only R b C1 of W. For pi ( z) D c1 expf¡c2 z4 g, it is explicitly calculated that pi (z ) dz D R (2 c20.25 C (0.25) ¡1 ) b (1 C b ) ¡0.25. For other density functions, pbi C 1 (z ) dz might not be calculated explicitly. However, as we will see in the discussion on consistency, bb ( W ) has an effect only on scaling of recovered signals, and R b C1 pi ( z) dz can be approximated or substituted by some constant in practice. In our experiments, estimation worked well even in the case that we set bb ( W ) D 0. O and We dene the maximizer of the above function as our estimate W O for W and ¹. As it is the type of estimator discussed in section 2, the ¹ proposed estimator has unbiased estimating functions even though we use some density function pi instead of the unknown underlying density. Moreover, the proposed estimator is B-robust, unlike other existing estimators. We show these properties in subsequent sections. The estimating functions of the proposed estimator are b
F1 ( x, W, ¹) D r0 (x, W, ¹ ) (Im ¡ h ( W x ¡ ¹) ( W x ) T ) W ¡T ¡ b bb (W ) W ¡T , b
F2 ( x, W, ¹) D r0 (x, W, ¹ ) h ( W x ¡ ¹ ) ,
(3.4) (3.5)
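As a quick numerical illustration of the divergence just defined, the two closed-form cases of $D_\beta$ can be evaluated by grid integration. This is a sketch, not part of the paper's procedure; the gaussian densities and the grid are illustrative choices:

```python
import numpy as np

def beta_divergence(g, f, dx, beta):
    """Grid approximation of D_beta(g, f) for densities sampled on a
    uniform grid with spacing dx."""
    if beta == 0.0:
        # Kullback-Leibler divergence (the beta = 0 case)
        return np.sum((np.log(g) - np.log(f)) * g) * dx
    t1 = np.sum((g**beta - f**beta) * g) * dx / beta
    t2 = np.sum(g**(beta + 1) - f**(beta + 1)) * dx / (beta + 1)
    return t1 - t2

# two unit-variance gaussian densities, one shifted by 1
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
g = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
f = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2.0 * np.pi)

d_kl = beta_divergence(g, f, dx, 0.0)
d_small = beta_divergence(g, f, dx, 1e-4)
print(d_kl, d_small)   # both close to 0.5, the KL divergence of N(0,1) from N(1,1)
```

One can also check that $D_\beta(g, g) = 0$ and $D_\beta(g, f) > 0$ for $g \ne f$, matching the nonnegativity noted after equation 3.2, and that the value approaches the Kullback-Leibler divergence as $\beta \to 0$.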
and the estimating equations are

$$ \frac{1}{n}\sum_{t=1}^n r_0^\beta(x(t), W, \mu)\big(I_m - h(Wx(t) - \mu)(Wx(t))^T\big)W^{-T} - \beta\, b_\beta(W)\, W^{-T} = 0, \tag{3.6} $$

$$ \frac{1}{n}\sum_{t=1}^n r_0^\beta(x(t), W, \mu)\, h(Wx(t) - \mu) = 0. \tag{3.7} $$
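The factor $r_0^\beta(x(t), W, \mu)$ in the estimating equations acts as a per-observation weight that shrinks toward zero for observations that are implausible under the model. A minimal one-dimensional sketch (with illustrative values $W = 1$, $\mu = 0$, and $p(u) = 1/(\pi\cosh(u))$; none of these are fitted values from the paper):

```python
import numpy as np

# Per-observation weight r_0(x, W, mu)^beta for m = 1, W = 1, mu = 0,
# p(u) = 1/(pi*cosh(u)); the point 20.0 plays the role of an outlier.
def weight(x, beta):
    return (1.0 / (np.pi * np.cosh(x)))**beta

x = np.array([0.0, 1.0, 3.0, 20.0])
for beta in [0.0, 0.1, 0.5]:
    print(beta, weight(x, beta))
```

For $\beta = 0$ (the Kullback-Leibler case) every observation has weight 1, whereas for $\beta > 0$ the weight of the outlying point is strongly reduced; this is the mechanism behind the B-robustness discussed in section 3.2.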
We now express our objective function in a slightly more mathematical way. Let us define the function $\Psi_\beta(u)$ on $\mathbb{R}$ for $\beta > 0$ as

$$ \Psi_\beta(u) = \frac{\exp(\beta u) - 1}{\beta}. $$

Further, we define $\Psi_\beta(u) = u$ for $\beta = 0$ since $\lim_{\beta\to+0}\Psi_\beta(z) = z$. Let $\psi_\beta(u)$ be the derivative of $\Psi_\beta(u)$, $\psi_\beta(u) = \exp\{\beta u\}$ for $\beta \ge 0$. With $\Psi_\beta$, the objective function $L_\beta(W, \mu)$ is expressed as

$$ L_\beta(W, \mu) = \frac{1}{n}\sum_{t=1}^n \Psi_\beta\big(l(x(t), W, \mu)\big) - \big(b_\beta(W) - 1\big), $$

where $l(x(t), W, \mu) = \log r_0(x(t), W, \mu)$, and the estimating equations are expressed as

$$ \frac{1}{n}\sum_{t=1}^n \psi_\beta\big(l(x(t), W, \mu)\big)\, S(x(t), W, \mu) - \frac{\partial b_\beta(W)}{\partial\theta} = 0, $$

where $\theta$ denotes a parameter vector that contains all elements of $W$ and $\mu$, and $S(x(t), W, \mu) = (\partial/\partial\theta)\, l(x(t), W, \mu)$. This can be regarded as a weighted quasi-likelihood equation associated with the weight function $\psi_\beta$. Whenever $X$ and $Y$ are independent of each other, the function $\psi_\beta$ satisfies

$$ E[\psi_\beta(X + Y)] = E[\psi_\beta(X)] \cdot E[\psi_\beta(Y)], \tag{3.8} $$

where $E$ denotes the expectation. This property will make our computation simple and tractable, as we will see in the subsequent discussion.

3.1 Consistency. An estimating function is called unbiased if its expectation is zero for any parameter value. In the current setting, this is equivalent to saying that any matrix that recovers independent signals is, with some scaling, a critical point of the expectation of the objective function, for M-estimators such as the proposed estimator. We show unbiasedness of the estimating functions. Let $W_0$ be a matrix that transforms $x$ to $y = W_0 x$ such that the components of $y$ are independent of each other. We show that with some scaling matrix
$\Lambda^* = \mathrm{diag}(\lambda_1^*, \lambda_2^*, \dots, \lambda_m^*)$ and shift vector $\mu^* = (\mu_1^*, \dots, \mu_m^*)^T$, the expectations of the estimating functions $F_1(x, \Lambda^* W_0, \mu^*)$ and $F_2(x, \Lambda^* W_0, \mu^*)$ with respect to the underlying distribution 3.1 with $W = W_0$ are zero; thus, the estimator is consistent.

For a matrix $W_0$, let the diagonal matrix $\Lambda^*$ and the vector $\mu^*$ be the minimizer of $D_\beta\big(r(x, W_0),\, r_0(x, \Lambda W_0, \mu)\big)$ regarded as a function of $\Lambda$ and $\mu$. This is equivalent to requiring that the following function,

$$ \begin{cases} \displaystyle \frac{1}{\beta}\int \prod_{i=1}^m \big(\lambda_i\, p_i(\lambda_i w_{0i}x - \mu_i)\big)^\beta\, r(x, W_0)\, dx - \frac{\prod_{i=1}^m \lambda_i^\beta}{\beta+1}\int q_0^{\beta+1}(z)\, dz - \frac{1-\beta}{\beta} & \text{for } \beta > 0 \\[2mm] \displaystyle \sum_{i=1}^m \int \log\big(\lambda_i\, p_i(\lambda_i w_{0i}x - \mu_i)\big)\, r(x, W_0)\, dx & \text{for } \beta = 0, \end{cases} \tag{3.9} $$

where $w_{0i}$ denotes the $i$th row of $W_0$, is maximized over $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$ and $\mu = (\mu_1, \dots, \mu_m)^T$ at $\Lambda^*$ and $\mu^*$. Let $W^* = \Lambda^* W_0$, and let $w_i^*$ be the $i$th row of $W^*$. Taking partial derivatives of equation 3.9 with respect to $\lambda_i$ and $\mu_i$ for $i = 1, \dots, m$, equating them to zero at $\lambda_i = \lambda_i^*$ and $\mu_i = \mu_i^*$ for $i = 1, \dots, m$, and simplifying gives the following equations:

$$ E\big[p_i^\beta(w_i^* x - \mu_i^*)\big(1 - c_\beta - h_i(w_i^* x - \mu_i^*)\, w_i^* x\big)\big] = 0 \quad \text{for } i = 1, \dots, m \tag{3.10} $$

$$ E\big[p_i^\beta(w_i^* x - \mu_i^*)\, h_i(w_i^* x - \mu_i^*)\big] = 0 \quad \text{for } i = 1, \dots, m, \tag{3.11} $$

where

$$ c_\beta = \frac{\beta}{\beta+1} \cdot \frac{\int q_0^{\beta+1}(z)\, dz}{\prod_{j=1}^m E[p_j^\beta(w_j^* x - \mu_j^*)]}. $$

Here and hereafter, $E[\cdot]$ is the expectation with respect to the underlying distribution 3.1 with $W = W_0$. Conditions 3.10 and 3.11 are expressed with $z_i^* = w_i^* x - \mu_i^*$ as

$$ E\big[p_i^\beta(z_i^*)\big(1 - c_\beta - h_i(z_i^*)\, z_i^*\big)\big] = 0 \quad \text{and} \quad E\big[p_i^\beta(z_i^*)\, h_i(z_i^*)\big] = 0 \quad \text{for } i = 1, \dots, m. $$
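Conditions of this form can be evaluated numerically for a concrete one-dimensional model. The sketch below checks, by grid integration, that the shift condition 3.11 holds at the mean of a symmetric underlying density; the underlying density $N(2, 1)$, the model density $p(u) = 1/(\pi\cosh(u))$ with $h(u) = \tanh(u)$, and $\beta = 0.1$ are all illustrative choices:

```python
import numpy as np

# Condition 3.11 for m = 1: E[p^beta(z - mu) h(z - mu)] with
# p(u) = 1/(pi*cosh(u)), h(u) = tanh(u), and underlying z ~ N(2, 1).
u = np.linspace(-18.0, 22.0, 200001)
du = u[1] - u[0]
dens = np.exp(-0.5 * (u - 2.0)**2) / np.sqrt(2.0 * np.pi)

def cond_311(mu, beta=0.1):
    p = 1.0 / (np.pi * np.cosh(u - mu))
    return np.sum(p**beta * np.tanh(u - mu) * dens) * du

print(cond_311(2.0), cond_311(0.0))   # ~0 at the mean, clearly nonzero away from it
```

Because $p$ is symmetric and the underlying density is symmetric about its mean, the integrand is odd around $\mu = 2$, so the condition is satisfied there for every $\beta$.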
These can be considered as alternatives to conventional assumptions or conditions such that the recovered signals have zero means and unit variances. In general, $\mu_i^*$ does depend on $\beta$. However, if the underlying distribution of $w_i^* x$ is symmetric and $p_i$ is symmetric about zero, $\mu_i^*$ does not depend on $\beta$ and coincides with the mean of $w_i^* x$.

We show that

$$ E[F_1(x, W^*, \mu^*)]\, W^{*T} = 0 \quad\text{and}\quad E[F_2(x, W^*, \mu^*)] = 0; $$

thus, $\hat W$ and $\hat\mu$ are consistent estimators for $W^*$ and $\mu^*$. We first note that

$$ E[F_1(x, W^*, \mu^*)]\, W^{*T} = |\det(W^*)|^\beta\, E\!\left[\left(\prod_{i=1}^m p_i^\beta(w_i^*x - \mu_i^*)\right)\big(I_m - h(W^*x - \mu^*)(W^*x)^T\big)\right] - \beta\, b_\beta(W^*)\, I_m $$
$$ = |\det(W^*)|^\beta \left\{ E\!\left[\left(\prod_{i=1}^m p_i^\beta(w_i^*x - \mu_i^*)\right)\big((1 - c_\beta)\, I_m - h(W^*x - \mu^*)(W^*x)^T\big)\right] \right\}. \tag{3.12} $$

The $(j, k)$th element with $j \ne k$ of the matrix in the braces in equation 3.12 is

$$ -E\!\left[\left(\prod_{i=1}^m p_i^\beta(w_i^*x - \mu_i^*)\right) h_j(w_j^*x - \mu_j^*)\, w_k^* x\right] = -\left(\prod_{i\ne j,k} E[p_i^\beta(w_i^*x - \mu_i^*)]\right) E[p_j^\beta(w_j^*x - \mu_j^*)\, h_j(w_j^*x - \mu_j^*)]\, E[p_k^\beta(w_k^*x - \mu_k^*)\, w_k^* x], $$

and this is zero because of equation 3.11. The $(j, j)$th element of the matrix in the braces in equation 3.12 is

$$ E\!\left[\left(\prod_{i=1}^m p_i^\beta(w_i^*x - \mu_i^*)\right)\big(1 - c_\beta - h_j(w_j^*x - \mu_j^*)\, w_j^* x\big)\right] = \left(\prod_{i\ne j} E[p_i^\beta(w_i^*x - \mu_i^*)]\right) E\big[p_j^\beta(w_j^*x - \mu_j^*)\big(1 - c_\beta - h_j(w_j^*x - \mu_j^*)\, w_j^* x\big)\big], $$

and this is also zero because of equation 3.10. Hence, $E[F_1(x, W^*, \mu^*)]\, W^{*T} = 0$.
The $j$th element of $E[F_2(x, W^*, \mu^*)]$ is factorized as

$$ |\det(W^*)|^\beta \left( \prod_{i\ne j} E[p_i^\beta(w_i^* x - \mu_i^*)] \right) E[p_j^\beta(w_j^* x - \mu_j^*)\, h_j(w_j^* x - \mu_j^*)] $$

and is equal to zero because of equation 3.11; thus, $E[F_2(x, W^*, \mu^*)] = 0$. Hence, the estimating functions are unbiased.

3.2 Robustness. Suppose that $W$ is nonsingular and the negative of the Hessian matrix is positive definite at $(W, \mu)$. Then the influence functions are bounded if and only if the estimating functions are bounded. Let us denote $z_i = w_i x - \mu_i$ for $i = 1, \dots, m$. The $(j,k)$th element of $F_1(x, W, \mu)W^T$ is factorized (up to the factor $|\det(W)|^\beta$ and additive terms that do not depend on $x$) as

$$ \big(F_1(x, W, \mu)W^T\big)_{jk} = \begin{cases} \displaystyle -\left( \prod_{i\ne j,k} p_i^\beta(z_i) \right) \big(p_j^\beta(z_j)\, h_j(z_j)\, p_k^\beta(z_k)\, z_k + p_j^\beta(z_j)\, h_j(z_j)\, p_k^\beta(z_k)\, \mu_k\big) & \text{for } j \ne k \\[2mm] \displaystyle \left( \prod_{i\ne j} p_i^\beta(z_i) \right) \big(p_j^\beta(z_j) - p_j^\beta(z_j)\, h_j(z_j)\, z_j - p_j^\beta(z_j)\, h_j(z_j)\, \mu_j\big) & \text{for } j = k, \end{cases} $$

and the $j$th element of $F_2(x, W, \mu)$ is factorized as

$$ \big(F_2(x, W, \mu)\big)_j = \left( \prod_{i\ne j} p_i^\beta(z_i) \right) p_j^\beta(z_j)\, h_j(z_j). $$

Thus, estimating functions 3.4 and 3.5 are bounded if

$$ p_i^\beta(z), \quad p_i^\beta(z)\, h_i(z), \quad p_i^\beta(z)\, h_i(z)\, z, \quad \text{and} \quad p_i^\beta(z)\, z \quad \text{for } i = 1, \dots, m $$

are bounded functions of $z$. The first three functions can be made bounded by the choice of $p_i$ for $\beta \ge 0$. However, $p_i^\beta(z)\, z$ cannot be bounded if $\beta = 0$. For $\beta > 0$, $p_i^\beta(z)\, z$ is bounded if $p_i(z)/\exp(-|z|)$ is bounded. Most density functions commonly used for $p_i$, including $p_i(z) = c_1\exp(-c_2 z^4)$ and $p_i(z) = c_3/\cosh(z)$, satisfy this condition. Figure 1 shows the maximum values of the functions $p_i^\beta(z)z$, $p_i^\beta(z)h_i(z)$, and $p_i^\beta(z)h_i(z)z$ for $p_i(z) = \sqrt{2}\,\Gamma(0.25)^{-1}\exp(-0.25 z^4)$. For this $p_i$, $h_i(z) = z^3$, and all three functions are unbounded for $\beta = 0$. It is observed that the maximum values of these functions decrease rapidly at small $\beta$; thus, the estimator becomes robust already with small $\beta$.

Figure 1: Maximum values of $p_i^\beta(z)z$, $p_i^\beta(z)h_i(z)$, and $p_i^\beta(z)h_i(z)z$ for $p_i(z) = \sqrt{2}\,\Gamma(0.25)^{-1}\exp(-0.25 z^4)$. For this $p_i$, $h_i(z) = z^3$.

3.3 Asymptotic Stability. Our estimator is defined as the maximizer of the quasi-$\beta$-likelihood function 3.3. In section 3.1, we showed unbiasedness
of estimating functions. However, the solution of the estimating equations is a (local) maximum if and only if the negative Hessian matrix of the quasi-$\beta$-likelihood function is positive definite. Also, an on-line learning algorithm is stable only when the expected negative Hessian matrix is positive definite. We now give the necessary and sufficient conditions for the expected negative Hessian matrix of the quasi-$\beta$-likelihood function 3.3 to be positive definite.

The necessary and sufficient condition for the negative of the expected Hessian matrix at $W^*$ and $\mu^*$ to be positive definite is as follows:

$$ 1 + \beta - c_\beta(1-\beta) + m_i - k_i(r_i - \nu_i)^2 > 0 \tag{3.13} $$
$$ k_i > 0 \tag{3.14} $$
$$ \sigma_i^2 \sigma_j^2 k_i k_j > (1-\beta)^2 (1-c_\beta)^2 \tag{3.15} $$

for all $i, j$ ($i \ne j$), where

$$ z_i^* = w_i^* x - \mu_i^* $$
$$ \nu_i = E[p_i^\beta(z_i^*)\, z_i^*] \,/\, E[p_i^\beta(z_i^*)] $$
$$ r_i = E\big[p_i^\beta(z_i^*)\big(h_i'(z_i^*) - \beta h_i^2(z_i^*)\big)\, z_i^*\big] \,/\, E\big[p_i^\beta(z_i^*)\big(h_i'(z_i^*) - \beta h_i^2(z_i^*)\big)\big] $$
$$ \sigma_i^2 = E[p_i^\beta(z_i^*)(z_i^* - \nu_i)^2] \,/\, E[p_i^\beta(z_i^*)] $$
$$ k_i = E\big[p_i^\beta(z_i^*)\big(h_i'(z_i^*) - \beta h_i^2(z_i^*)\big)\big] \,/\, E[p_i^\beta(z_i^*)] $$
$$ m_i = E\big[p_i^\beta(z_i^*)\big(h_i'(z_i^*) - \beta h_i^2(z_i^*)\big)(z_i^* - \nu_i)^2\big] \,/\, E[p_i^\beta(z_i^*)]. $$

Amari et al. (1997) gave the following necessary and sufficient conditions for the stability of the estimators based on the Kullback-Leibler divergence without shift parameters:

$$ 1 + \tilde m_i > 0 \tag{3.16} $$
$$ \tilde k_i > 0 \tag{3.17} $$
$$ \tilde\sigma_i^2 \tilde\sigma_j^2 \tilde k_i \tilde k_j > 1, \tag{3.18} $$

for all $i, j$ ($i \ne j$), where $\tilde\sigma_i^2 = E[z_i^2]$, $\tilde k_i = E[h_i'(z_i)]$, and $\tilde m_i = E[h_i'(z_i)\, z_i^2]$. When $\beta = 0$, conditions 3.14 and 3.15 reduce to conditions 3.17 and 3.18, respectively. If $p_j$ and the underlying distribution of $z_j^*$ are symmetric about zero, then $r_i = \nu_i = 0$, and condition 3.13 for $\beta = 0$ reduces to condition 3.16.

Note that we explicitly included shift parameters in our semiparametric model and derived the necessary and sufficient conditions from the expected Hessian matrix with respect to the recovering matrix $W$ and the shift vector $\mu$, including their cross terms. On the other hand, Amari et al. (1997) derived their necessary and sufficient conditions assuming the means of the recovered signals are zero and not taking into account that estimates of the means, sample averages in many cases, are subtracted from the observations beforehand. The Hessian matrix and the expected Hessian matrix at $W^*$ and $\mu^*$ are given in appendix A, and derivations of equations 3.13 through 3.15 are given in appendix B.
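The boundedness behavior underlying the robustness discussion of section 3.2 (the quantities plotted in Figure 1) can be checked numerically. A sketch for $p_i(z) = \sqrt{2}\,\Gamma(0.25)^{-1}\exp(-0.25 z^4)$, for which $h_i(z) = z^3$; the grid range is an illustrative choice:

```python
import numpy as np
from math import gamma

c = np.sqrt(2.0) / gamma(0.25)
z = np.linspace(0.0, 50.0, 200001)
p = c * np.exp(-0.25 * z**4)          # p_i(z); the score is h_i(z) = z**3

print(0.0, (p**0.0 * z).max())        # equals the grid endpoint 50.0: unbounded at beta = 0
for beta in [0.05, 0.1, 0.3]:
    w = p**beta
    # grid suprema of p^beta, p^beta*z, p^beta*h, and p^beta*h*z
    print(beta, w.max(), (w * z).max(), (w * z**3).max(), (w * z**4).max())
```

For $\beta = 0$ the supremum of $p_i^\beta(z)z$ grows with the grid (the function is unbounded), whereas for $\beta > 0$ the maxima are finite and shrink rapidly as $\beta$ grows, matching the observation that the estimator becomes robust already for small $\beta$.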
4 Computational Algorithms
The estimation of the mixing matrix for blind source separation is performed off-line in some situations and on-line in others. Here we consider both on-line and off-line computational algorithms.

4.1 Off-Line Algorithm. When all observations are available for estimation, it is a simple optimization problem, which finds the maximum of the objective function 3.3. For simulation studies, we used a quasi-Newton method with BFGS update and employed Wolfe conditions for the line search (cf. Nocedal, 1992). Although its convergence rate is super-linear, the Hessian approximation is updated at a small computational cost, and the overall computational time is comparable to that of the Newton-Raphson method.

4.2 On-Line Algorithm. On-line algorithms might be preferable if there is a possibility that the mixing matrix is gradually changing. The natural gradient approach (Amari, 1998) is applicable to our procedure with the following modification:

$$ \hat W_{t+1} = \hat W_t + \eta_t \left\{ |\det(\hat W_t)|^\beta \left( \prod_{i=1}^m p_i^\beta(\hat w_{it} x - \hat\mu_{it}) \right) \big(I_m - h(\hat W_t x - \hat\mu_t)(\hat W_t x)^T\big) - \frac{\beta\, |\det(\hat W_t)|^\beta}{1+\beta} \int q_0^{\beta+1}(z)\, dz \; I_m \right\} \hat W_t $$

$$ \hat\mu_{t+1} = \hat\mu_t + \xi_t\, C_t^{-1}\, h(\hat W_t x - \hat\mu_t), $$

where $\eta_t$, $\xi_t$, and $\delta_t$ are learning rates, $\varepsilon$ is a small positive value, and

$$ C_t = \mathrm{diag}(c_{1t}, c_{2t}, \dots, c_{mt}), \qquad c_{it} = \max(\varepsilon, e_{it}), $$
$$ e_{it} = (1-\delta_t)\, c_{i,t-1} + \delta_t\, |\det(\hat W_t)|^\beta \left( \prod_{i=1}^m p_i^\beta(\hat w_{it}x - \hat\mu_{it}) \right) \big(h_i'(\hat w_{it}x - \hat\mu_{it}) - \beta h_i^2(\hat w_{it}x - \hat\mu_{it})\big) $$

for $i = 1, \dots, m$. Matrix $C_t$ is an estimate of the subblock of the expected negative Hessian matrix corresponding to $\mu$ and is maintained to be positive definite so that $C_t^{-1} h(\hat W_t x - \hat\mu_t)$ is always an ascent direction of the objective function.

5 Simulation Results
To investigate the performance of our procedure in comparison with the procedures based on the Kullback-Leibler divergence, we first separate speech signals and then conduct a simple study focusing on robustness by simulation using randomly generated data.

5.1 Separation of Speech Signals. We mixed two independent speech signals of sample size 1500 with the mixing matrix

$$ A = \begin{pmatrix} 1 & 1 \\ 0.3 & -0.1 \end{pmatrix} $$

and added noise with occurrence rate 0.005. Noise added to the first mixed signal follows $0.1\chi_5^2 + 1$, and noise added to the second mixed signal follows $0.03\chi_5^2 + 0.3$. Figure 2 shows the original speech signals and the mixed signals with noise. We first computed estimates for $W$ and $\mu$ with various values of $\beta$ by a quasi-Newton method with BFGS update. Since speech signals are highly super-gaussian, we used $p_i(x) = c/\cosh(x)$. Figure 3 shows plots of data
Figure 2: (A) Independent speech signals and (B) mixed signals with noise.
Figure 3: Separation of speech signals. Plots of mixed signals and estimates. Solid lines express the estimate by data with noise, and dotted lines express the estimate by data without noise.
Figure 4: Plots of original signals versus recovered signals by the estimates with (top row) $\beta = 0.1$ and (bottom row) $\beta = 0$.
and estimates for mixed speech signals with noise. We express each estimate with two lines such that their directions are the columns of $\hat W^{-1}$ (i.e., an estimate of $A$) and they intersect at $\hat W^{-1}\hat\mu$. In each plot, solid lines express the estimate for mixed signals with noise, and dotted lines express the estimate for mixed signals without noise. Figure 4 shows plots of original signals versus recovered signals. We point out the following:

- When there is no noise, the estimator by the $\beta$ divergence produces similar estimates (see the dotted lines in Figure 3) for a wide range of $\beta$. The estimates recover the original signals well.
- When noise exists, the estimate by the Kullback-Leibler divergence ($\beta = 0$) is badly affected by the noise. As $\beta$ increases, the estimates for data with noise approach those for data without noise. For $\beta \ge 0.1$, the estimates for data with noise are very close to those for data without noise.
- When $\beta \ge 0.1$, the estimate $\hat W$ of a recovering matrix recovers the original signals well even when the mixed signals contain noise.
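The flavor of this experiment can be imitated in a few lines. The sketch below is not the paper's quasi-Newton or on-line procedure: it runs a plain natural-gradient-style ascent on the quasi-$\beta$-likelihood with $b_\beta(W) = 0$ (which the text allows), uses Laplace noise as a stand-in for speech, and its sample size, seed, learning rate, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 2
s = rng.laplace(size=(n, m))                # super-gaussian stand-ins for speech
A = np.array([[1.0, 1.0], [0.3, -0.1]])     # mixing matrix from the text
x = s @ A.T + 2.0                           # mixtures with nonzero means

beta, lr = 0.1, 0.1
W = np.eye(m)
mu = W @ x.mean(axis=0)                     # start the shift at the implied sample mean
for _ in range(1500):
    y = x @ W.T - mu                        # recovered signals
    # observation weights r_0(x, W, mu)^beta with p_i(u) = 1/(pi*cosh(u))
    logr0 = np.log(abs(np.linalg.det(W))) - np.sum(np.log(np.pi * np.cosh(y)), axis=1)
    w = np.exp(beta * logr0)
    h = np.tanh(y)                          # h_i(u) = tanh(u) for this p_i
    # natural-gradient-style updates with b_beta(W) set to 0
    W += lr * ((w.mean() * np.eye(m)
                - (w[:, None, None] * h[:, :, None] * y[:, None, :]).mean(axis=0)) @ W)
    mu += lr * (w[:, None] * h).mean(axis=0)

P = W @ A    # approaches a scaled permutation when separation succeeds
print(np.round(P, 2))
```

When the ascent succeeds, each row of $P = \hat W A$ is dominated by a single entry, so the recovered signals match the sources up to scaling and permutation.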
Figure 5: Sequence of each element of $\hat W A$ by an on-line algorithm (A) with $\beta = 0.1$ and (B) with $\beta = 0$.

We also computed the estimate for data with noise by the on-line algorithm described in section 4.2, with $\beta = 0$ and $\beta = 0.1$. We used learning rates

$$ \eta_t = \xi_t = \delta_t = \frac{1}{100 + t}, $$

and the data are used four times. Figure 5 shows each element of $\hat W A$ with $\beta = 0.1$ and with $\beta = 0$. If the estimation is successful, either the diagonal elements or the off-diagonal elements go to zero as $t \to +\infty$. One can see that the diagonal elements of $\hat W A$ with $\beta = 0.1$ approach zero as $t$ increases, while neither the diagonal elements nor the off-diagonal elements of $\hat W A$ with $\beta = 0$ go to zero as $t$ increases.

5.2 Simulation with Randomly Generated Data. We generated three
artificial data sets:

1. One hundred pairs of independent random numbers from the gamma distribution with parameters 1.5 and 3, and two outliers at (25, 25). The theoretical mean, variance, and kurtosis of the gamma distribution are 4.5, 13.5, and 7.5, respectively.

2. One hundred fifty pairs of independent random numbers from the exponential power distribution and 50 pairs of random numbers from the bivariate normal distribution with mean vector 0, variances 16, and correlation 0.8. The parameter value for the exponential power distribution is 1.25 (kurtosis 2.19) for one signal and 1.45 (kurtosis 2.81) for the other. Data are taken from John Fisher's database (http://www.ai.mit.edu/people/fisher/ica_data/).

3. Two hundred pairs of independent random numbers from the same exponential power distributions as above, but the first 150 pairs are centered at the origin and the last 50 pairs are centered at (5, 5).

Figure 6: Independent gamma data with two outliers. A solid dot represents gamma data, and an asterisk represents outliers. Solid lines express the estimate by all data, and dotted lines express the estimate by data without outliers.

For all data sets, the identity matrix is used as the mixing matrix. That is, the mixed signals are the same as the original signals. We computed estimates for $W$ and $\mu$ with various values of the tuning parameter $\beta$ by a quasi-Newton method with BFGS update.

Figures 6, 7, and 8 show plots of data and estimates for data sets 1, 2, and 3, respectively. In each plot, asterisks are used for outliers (data set 1, Figure 6), bivariate normal data (data set 2, Figure 7), and data centered at (5, 5) (data set 3, Figure 8), and filled dots are used for the other data. Solid lines express the estimate by all data in each data set, such that their directions are the columns of $\hat W^{-1}$ and the lines intersect at $\hat W^{-1}\hat\mu$. Dotted lines express the estimate by data without outliers for Figure 6, without bivariate normal data
Figure 7: 150 independent exponential power data and 50 correlated normal data. A solid dot represents exponential power data, and an asterisk represents bivariate normal data. Solid lines express the estimate by all data, and dotted lines express the estimate by exponential power data only.
for Figure 7, and without data centered at (5, 5) for Figure 8, in the same way as the solid lines. If we were to draw the identity matrix, which is the mixing matrix in this simulation study, the lines would be the x-axis and the y-axis. In each figure, the top left plot is of the estimate by the method based on the Kullback-Leibler divergence ($\beta = 0$) without shift parameters (sample averages are subtracted from the observations for the estimation of $W$). The other plots are of the estimates by the proposed method with various values of the tuning parameter $\beta$.

For data set 1 (gamma data with outliers), we used $p(y) = c/\cosh(y)$ and $h(y) = \tanh(y)$. Figure 6 shows that the outliers affected the estimate significantly when the method based on the Kullback-Leibler divergence was used ($\beta = 0$). The estimate by all data with $\beta = 0.02$ is not much different from that with $\beta = 0$, but the estimate changed suddenly at $\beta = 0.04$. The estimates for data with outliers approach those for data without outliers as $\beta$ increases from $\beta = 0.04$. For $\beta$ values larger than 0.10, the estimates did not change much. The estimates by data without outliers changed little as $\beta$ increased. We also generated uniformly distributed data with outliers
Figure 8: Independent exponential power data centered at the origin and at (5, 5) . A solid dot represents exponential power data centered at the origin, and an asterisk represents exponential power data centered at (5, 5) . Solid lines express the estimate by all data, and dotted lines express the estimate by data centered at the origin only.
and estimated a recovering matrix with $p(y) = c\exp(-0.25 y^4)$. The results (not shown) were quite similar to those for this data set.

For data set 2 (exponential power data and bivariate normal data), we used $p(y) = c\exp(-0.25 y^4)$ and $h(y) = y^3$. Unlike data set 1, what the outliers are is not obvious in data set 2. The estimate by all data changed relatively slowly compared with data set 1, but became closer to the estimate by data without bivariate normal data as $\beta$ increased. The estimates by data without outliers changed little as $\beta$ increased, as for data set 1.

For data set 3 (exponential power data centered at the origin and at (5, 5)), we used $p(y) = c\exp(-0.25 y^4)$ and $h(y) = y^3$. The estimates by all data with $\beta$ less than or equal to 0.12 are not much different, but they changed suddenly at $\beta = 0.15$ and changed little after that. The estimates by all data with $\beta$ greater than or equal to 0.15 are about the same as those by the data centered at the origin only.

In all data sets, when there are no outliers or data with a different nature, such as those in data sets 2 and 3, the estimates change little for different values of $\beta$. When there are outliers or data with different characteristics, the estimates changed as $\beta$ increased, and for appropriately large $\beta$, the estimates are not much different from those by data without outliers or data with different characteristics.

Figure 9: Weights for each data point. From the top row, the last 2, 50, and 50 data points are outliers, correlated normal data, and exponential power data centered at (5, 5), respectively.

Figure 9 shows the weight for each data point of data sets 1, 2, and 3 with various values of $\beta$. Plots in the first row are of data set 1, and in each plot, the first 100 points are gamma data and the last two are outliers. Plots in the second row are of data set 2, and in each plot, the first 150 points are exponential power data and the last 50 are bivariate normal data. Plots in the third row are of data set 3, and in each plot, the first 150 points are exponential power data centered at the origin and the last 50 are exponential power data centered at (5, 5). One can see that the weights of outliers or data with different characteristics become small as $\beta$ increases.

6 Conclusion
In this article we proposed a robust blind source separation method. The method is based on the $\beta$ divergence and includes shift parameters explicitly instead of assuming that the original signals have zero means. We showed that
the proposed estimator has unbiased estimating functions and is B-robust, and we gave the necessary and sufficient conditions for asymptotic stability. Simulation results showed that the proposed method is robust to outliers. The estimates by data with outliers changed as $\beta$ increased, and with appropriately large $\beta$, the estimates by data with outliers were not much different from those by data without outliers. When there were no outliers, the estimates were not much different for a wide range of $\beta$, including $\beta = 0$, which corresponds to the method based on the Kullback-Leibler divergence.

We restricted our attention to the case that fixed functions $p_i$ are used, but one can also include other nuisance parameters in addition to $\mu$ in $p_i$ so that $p_i$ adaptively approximates the underlying distribution of the recovered signals better, as does the generalized gaussian model by Lee and Lewicki (2000). In this case, partial derivatives of equation 3.3 with respect to the extra parameters are included as estimating functions, as are the corresponding estimating equations. The unbiasedness of the estimating functions can be easily shown, and B-robustness would be obtained, but conditions for asymptotic stability need to be investigated model by model.

Selection of $\beta$ would be an important issue in practice. It can be chosen qualitatively by monitoring how the estimates change as $\beta$ increases, but numerical criteria for the selection of $\beta$ might be desirable. The usual Bayesian approach cannot be directly applied, since $\beta$ is not a parameter in a statistical model but rather a parameter that defines a divergence. One possible way is to consider a penalized $\beta$ divergence that includes a penalty term similar to the log of a prior distribution. We leave this problem to future research.

Appendix A: The Hessian Matrix and Its Expectation at $W^*$ and $\mu^*$
We derive the Hessian matrix of the function

$$ V_\beta(x, W, \mu) = \begin{cases} \displaystyle \frac{1}{\beta}\, r_0^\beta(x, W, \mu) - b_\beta(W) - \frac{1-\beta}{\beta} & \text{for } 0 < \beta < 1 \\[2mm] \log\big(r_0(x, W, \mu)\big) & \text{for } \beta = 0, \end{cases} $$

and its expectation at $W^*$ and $\mu^*$. The partial derivatives of this function are the estimating functions 3.4 and 3.5. We take partial derivatives of the estimating functions to obtain the Hessian matrix. Let us define

$$ G \equiv \left( \prod_{i=1}^m p_i^\beta(w_i x - \mu_i) \right) \big(W^{-T} - h(Wx - \mu)\, x^T\big), $$

and denote the $j$th row of $W^{-T}$ by $w^{(j)}$ and the $j$th row of $G$ by $g_j$ ($j = 1, \dots, m$):

$$ g_j = \left( \prod_{i=1}^m p_i^\beta(w_i x - \mu_i) \right) \big(w^{(j)} - h_j(w_j x - \mu_j)\, x^T\big). $$
Then the partial derivatives of $V_\beta(x, W, \mu)$, $|\det(W)|^\beta$, and $w^{(k)T}$ with respect to $w_j$ are expressed as

$$ \frac{\partial V_\beta(x, W, \mu)}{\partial w_j} = |\det(W)|^\beta\, g_j - \beta\, b_\beta(W)\, w^{(j)}, $$
$$ \frac{\partial |\det(W)|^\beta}{\partial w_j} = \beta\, |\det(W)|^\beta\, w^{(j)}, $$
$$ \frac{\partial w^{(k)T}}{\partial w_j} = \frac{\partial w^{(j)}}{\partial w_k^T} = -\, w^{(j)T} w^{(k)}. $$

Subblocks of the Hessian matrix are given as follows:

$$ \frac{\partial^2 V_\beta(x, W, \mu)}{\partial\mu\, \partial\mu^T} = |\det(W)|^\beta \left( \prod_{i=1}^m p_i^\beta(w_i x - \mu_i) \right) \big[\beta\, h(Wx-\mu)\, h(Wx-\mu)^T - \mathrm{diag}\big(h_1'(w_1x - \mu_1), \dots, h_m'(w_mx - \mu_m)\big)\big] $$

$$ \frac{\partial^2 V_\beta(x, W, \mu)}{\partial\mu\, \partial w_j} = |\det(W)|^\beta \left( \prod_{i=1}^m p_i^\beta(w_i x - \mu_i) \right) \big[\beta\, h(Wx-\mu)\, w^{(j)} - \beta\, h_j(w_jx - \mu_j)\, h(Wx-\mu)\, x^T + h_j'(w_jx - \mu_j)\, e_j x^T\big] $$

$$ \frac{\partial^2 V_\beta(x, W, \mu)}{\partial w_k^T\, \partial w_j} = \beta\, |\det(W)|^\beta\, w^{(k)T} g_j + |\det(W)|^\beta\, \frac{\partial g_j}{\partial w_k^T} + \beta(1-\beta)\, b_\beta(W)\, w^{(j)T} w^{(k)}, $$

where

$$ \frac{\partial g_j}{\partial w_k^T} = -\beta \left( \prod_{i=1}^m p_i^\beta(w_ix - \mu_i) \right) h_k(w_kx - \mu_k)\, x\, \big(w^{(j)} - h_j(w_jx - \mu_j)\, x^T\big) - \left( \prod_{i=1}^m p_i^\beta(w_ix - \mu_i) \right) \big(w^{(j)T} w^{(k)} + \delta_{jk}\, h_j'(w_jx - \mu_j)\, xx^T\big). $$
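Matrix-derivative identities of this kind are easy to spot-check by finite differences. A small sketch for $\partial|\det(W)|^\beta/\partial W = \beta\, |\det(W)|^\beta\, W^{-T}$; the test matrix, $\beta$, and step size are illustrative:

```python
import numpy as np

# Finite-difference check of d|det(W)|^beta / dW = beta*|det(W)|^beta*W^{-T}
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
beta, eps = 0.3, 1e-6

f = lambda M: abs(np.linalg.det(M))**beta
num = np.zeros((3, 3))
for j in range(3):
    for k in range(3):
        E = np.zeros((3, 3))
        E[j, k] = eps
        num[j, k] = (f(W + E) - f(W - E)) / (2.0 * eps)

ana = beta * f(W) * np.linalg.inv(W).T
print(np.abs(num - ana).max())   # ~0: numerical and analytic gradients agree
```

The same central-difference recipe can be used to check the Hessian subblocks above entry by entry.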
We now compute the expectation of the Hessian matrix at $W^*$ and $\mu^*$. For simplicity, we denote $p_j(w_j^*x - \mu_j^*)$, $h_j(w_j^*x - \mu_j^*)$, and $h_j'(w_j^*x - \mu_j^*)$ by $p_j$, $h_j$, and $h_j'$, respectively, for $j = 1, \dots, m$; $h(W^*x - \mu^*)$ by $h$; and the $j$th row of $W^{*-T}$ by $w^{*(j)}$. Since the components of the recovered signals are independent and equations 3.10 and 3.11 hold at $W^*$ and $\mu^*$, the following equations hold:

$$ E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h\, h^T\right] = E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) \mathrm{diag}(h_1^2, \dots, h_m^2)\right] $$

$$ E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h_j\, h\, x^T\right] = E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h_j^2\, e_j x^T\right] $$

$$ E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h_j\, x^T\right] = E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h_j\, (W^*x)^T\right] W^{*-T} = (1 - c_\beta)\, E\!\left[\prod_{i=1}^m p_i^\beta\right] w^{*(j)} $$

$$ E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h_k h_j\, xx^T\right] = (1-\delta_{jk})(1-c_\beta)^2\, E\!\left[\prod_{i=1}^m p_i^\beta\right] \big(w^{*(k)T}w^{*(j)} + w^{*(j)T}w^{*(k)}\big) + \delta_{jk}\, E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) h_j^2\, xx^T\right], $$

where $e_j$ denotes a unit vector whose $j$th element is one and whose other elements are zero. Using the above equations, the expectations of the subblocks of the Hessian matrix at $W^*$ and $\mu^*$ are given as follows:

$$ E\!\left[\frac{\partial^2 V_\beta(x, W, \mu)}{\partial\mu\, \partial\mu^T}\right]_{W^*, \mu^*} = |\det(W^*)|^\beta\, E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) \mathrm{diag}\big(\beta h_1^2 - h_1', \dots, \beta h_m^2 - h_m'\big)\right] $$

$$ E\!\left[\frac{\partial^2 V_\beta(x, W, \mu)}{\partial\mu\, \partial w_j}\right]_{W^*, \mu^*} = -\, |\det(W^*)|^\beta\, E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) \big(\beta h_j^2 - h_j'\big)\, e_j x^T\right] $$

$$ E\!\left[\frac{\partial^2 V_\beta(x, W, \mu)}{\partial w_k^T\, \partial w_j}\right]_{W^*, \mu^*} = -\, |\det(W^*)|^\beta\, E\!\left[\left(\prod_{i=1}^m p_i^\beta\right) \big\{(1-\beta)(1-c_\beta)\, w^{*(j)T}w^{*(k)} + 2\delta_{jk}\beta\, w^{*(j)}w^{*(j)T} - \delta_{jk}\big(\beta h_j^2 - h_j'\big)\, xx^T\big\}\right]. $$
Appendix B: Derivation of the Necessary and Sufficient Condition for Stability

Let $H$ denote the expectation of the Hessian matrix at $W^*$ and $\mu^*$ with respect to $\theta$, and let $H_{kj}$ denote its $(k, j)$th block of size $m$, so that

$$ H_{kj} = \begin{cases} \displaystyle E\!\left[\frac{\partial^2 \Psi_\beta(l(x, W, \mu))}{\partial w_k^T\, \partial w_j}\right]_{W^*, \mu^*} & \text{for } j, k = 1, \dots, m \\[2mm] \displaystyle E\!\left[\frac{\partial^2 \Psi_\beta(l(x, W, \mu))}{\partial\mu\, \partial w_j}\right]_{W^*, \mu^*} & \text{for } k = m+1,\; j = 1, \dots, m \\[2mm] \displaystyle E\!\left[\frac{\partial^2 \Psi_\beta(l(x, W, \mu))}{\partial\mu\, \partial\mu^T}\right]_{W^*, \mu^*} & \text{for } k = j = m+1. \end{cases} $$

$-H$ is positive definite if and only if $-v^T H v > 0$ for any nonzero vector $v$ of size $m(m+1)$. We denote the elements of $v$ by

$$ v = \begin{pmatrix} W^{*T} v_1 \\ \vdots \\ W^{*T} v_m \\ s \end{pmatrix}, \qquad v_j = \begin{pmatrix} v_{1j} \\ \vdots \\ v_{mj} \end{pmatrix} \text{ for } j = 1, \dots, m, \qquad s = \begin{pmatrix} s_1 \\ \vdots \\ s_m \end{pmatrix}, $$

and $z^* = (z_1^*, \dots, z_m^*)^T = W^*x - \mu^*$. Let $\tilde s_j = s_j - (\nu + \mu^*)^T v_j$, so that

$$ s_j - (W^*x)^T v_j = \tilde s_j - (z^* - \nu)^T v_j = \tilde s_j - \sum_{k=1}^m (z_k^* - \nu_k)\, v_{kj}, $$

where $\nu = (\nu_1, \dots, \nu_m)^T$. For simplicity, we denote $p_j(w_j^*x - \mu_j^*)$, $h_j(w_j^*x - \mu_j^*)$, and $h_j'(w_j^*x - \mu_j^*)$ by $p_j$, $h_j$, and $h_j'$, respectively, for $j = 1, \dots, m$. We note here that $E[p_j^\beta\, (z_j^* - \nu_j)] = 0$. Then,
m X k, j D1
D | det(W ¤ ) | b
v Tk WHkj WT vj ¡ 2 "Á
m X
E
m Y
! b
pi
iD 1
k, j D 1
C 2| det(W ¤ ) | b
k, j D 1
"Á E
jD 1
sT Hm C 1j W T vj ¡ sT Hm C 1m C 1 s
f(1 ¡ b ) (1 ¡ cb ) vjkv kj C 2djkbv2jj #
¡djk (b hj2 m X
m X
m Y iD 1
0
T
)2
¡ hj ) ( y vj g ! b pi
# (b hj2 ¡ hj0 ) ) sj ( y T vj )
1884
Minami Mihoko and Shinto Eguchi
2Á ¤) b
¡ | det(W | E 4
m Y iD 1
! b pi
3 m X
(b hj2
jD 1
0
¡ hj ) sj2 5
2Á !8 <X m m Y b f(1 ¡ b ) (1 ¡ cb ) vjk vkj C 2djkbv2jj g D | det(W ¤ ) | b E 4 pi : D1 k, j D 1
i
C
m X jD 1
93 = ( hj0 ¡ bhj2 ) (sj ¡ ( W ¤ x) T vj ) 2 5 ;
2Á !8 <X m m Y b f(1 C b ¡ cb (1 ¡ b ) ) vjj2 C ( hj0 ¡ bhj2 ) D | det(W ¤ ) | b E 4 pi : D1 jD 1
i
£ ( sQj ¡ (zj¤ ¡ ºj ) vjj ) 2 g C
X f2(1¡b ) (1¡cb ) k< j
£ vjkv kj C ( hj ¡ bhj2 ) ( z¤k 0
¡ ºk ) 2 v2kj 93 = C ( h0k ¡ bh2k ) ( zj¤ ¡ ºj ) 2 v2jk g 5 ;
8 0 1 0 1 9 <X m = Y X Y b @ E[pbi ]A Aj C @ E[pi ]A B kj , D | det(W ¤ ) | b : ; 6 6 jD 1
k< j
iD j
i D j,k
where " Aj D
v2jj E
( b pj
³ 1 C b ¡ cb (1 ¡ b )
0
C ( hj ¡ bhj2 )
zj¤
¡ ºj ¡
sQj
´2 )#
vjj
b b B kj D E[pj p k f2(1 ¡ b ) (1 ¡ cb ) vjkv kj C ( hj0 ¡ bhj2 ) ( z¤k ¡ ºk ) 2 v2kj
C ( h0k ¡ bh2k ) ( zj¤ ¡ ºj ) 2 v2jk g].
This implies that ¡H is positive denite if and only if Aj is positive for any vjj and sQj for j D 1, . . . , m and B kj is positive for any v kj and vjk for 6 k. Let sPj D sQj / vjj . For vjj D 6 0, j, k D 1, . . . , m, j D b b Aj / v2jj D E[pj ] (1 C b ¡ cb (1 ¡ b ) ) C E[pj ( hj0 ¡ bhj2 ) ( zj¤ ¡ ºj ) 2 ] b b ¡ 2E[pj ( hj0 ¡ bhj2 ) ( zj¤ ¡ ºj ) ]sPj C E[pj (hj0 ¡ bhj2 )]sPj 2 2 D E[pj ]f1 C b ¡ c b (1 ¡ b ) C mQ j ¡ 2kQj ( rj ¡ ºj ) sPj C kQj sPj g b
2 2 D E[pj ]fkQj ( sPj ¡ ( rj ¡ ºj ) ) C 1 C b ¡ cb (1 ¡ b ) C mQ j ¡ kQj ( rj ¡ ºj ) g. b
This is a quadratic function of $\dot s_j$ and is always positive if and only if

$$ k_j > 0 \quad \text{and} \quad 1 + \beta - c_\beta(1-\beta) + m_j - k_j(r_j - \nu_j)^2 > 0. $$

On the other hand,

$$ B_{kj} = E[p_j^\beta]\, E[p_k^\beta]\, \big\{2(1-\beta)(1-c_\beta)\, v_{jk}v_{kj} + k_j\sigma_k^2 v_{kj}^2 + k_k\sigma_j^2 v_{jk}^2\big\} = E[p_j^\beta]\, E[p_k^\beta]\; (v_{kj} \;\; v_{jk}) \begin{pmatrix} k_j\sigma_k^2 & (1-\beta)(1-c_\beta) \\ (1-\beta)(1-c_\beta) & k_k\sigma_j^2 \end{pmatrix} \begin{pmatrix} v_{kj} \\ v_{jk} \end{pmatrix}. $$

Provided $k_j > 0$, $B_{kj}$ is positive for any $v_{jk}$ and $v_{kj}$ if and only if $\sigma_j^2\sigma_k^2 k_j k_k > (1-\beta)^2(1-c_\beta)^2$. Thus, $-H$ is positive definite if and only if equations 3.13 through 3.15 hold.

Acknowledgments
We thank Te-Won Lee for giving us speech signal data and helpful comments and the referees for their valuable comments.
Received April 30, 2001; accepted February 1, 2002.
LETTER
Communicated by Klaus Obermayer
Joint Entropy Maximization in Kernel-Based Topographic Maps

Marc M. Van Hulle
[email protected]
K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, B-3000 Leuven, Belgium

A new learning algorithm for kernel-based topographic map formation is introduced. The kernel parameters are adjusted individually so as to maximize the joint entropy of the kernel outputs. This is done by maximizing the differential entropies of the individual kernel outputs, given that the map's output redundancy, due to the kernel overlap, needs to be minimized. The latter is achieved by minimizing the mutual information between the kernel outputs. As a kernel, the (radial) incomplete gamma distribution is taken since, for a gaussian input density, the differential entropy of the kernel output will be maximal. Since the theoretically optimal joint entropy performance can be derived for the case of nonoverlapping gaussian mixture densities, a new clustering algorithm is suggested that uses this optimum as its "null" distribution. Finally, it is shown that the learning algorithm is similar to one that performs stochastic gradient descent on the Kullback-Leibler divergence for a heteroskedastic gaussian mixture density model.

1 Introduction
In an effort to improve the density estimation properties, the noise tolerance, or even the biological relevance of the self-organizing map (SOM) (Kohonen, 1982, 1995), algorithms have been devised that can accommodate neurons with kernel-based activation functions, such as gaussians, instead of winner-take-all (WTA) functions (Voronoi tessellation). An early example is the elastic net of Durbin and Willshaw (1987), which can be viewed as an equal-variance or homoskedastic gaussian mixture density model, fitted to the data points by a penalized maximum likelihood term. More recent examples of the density modeling approach are the algorithms introduced by Bishop, Svensén, and Williams (1998; generative topographic map, based on constrained, homoskedastic gaussian mixture density modeling with equal mixings), Utsugi (1997, also using equal mixings of homoskedastic gaussians), and Van Hulle (1998, equiprobabilistic maps using heteroskedastic gaussian mixtures, thus with differing variances). We should also mention the fuzzy membership in the clusters approach of Graepel, Burger, and Obermayer (1997) and the maximization of local correlations approach of Xu and coworkers (Sum, Leung, Chan, & Xu, 1997), both of which rely on homoskedastic gaussians.

Neural Computation 14, 1887–1906 (2002) © 2002 Massachusetts Institute of Technology

Graepel, Burger, and Obermayer (1998) proposed a still different approach to kernel-based topographic map formation by introducing a nonlinear transformation that maps the data points to a high-dimensional "feature" space and, in addition, admits a kernel function, such as a gaussian with fixed variance, as in (kernel-based) support vector machines (SVMs) (Vapnik, 1995). This idea was recently taken up again by András (2001), but with the purpose of optimizing the map's classification performance, by individually adjusting the kernel radii using a supervised learning algorithm. Finally, the original SOM algorithm itself has been regarded as an approximate way to perform homoskedastic gaussian mixture density modeling by Utsugi (1997), Yin and Allinson (2001), and Kostiainen and Lampinen (2002), among others.

Linsker (1989) was among the first to develop topographic maps by optimizing an information-theoretic criterion. He applied his principle of maximum information preservation (mutual information maximization between input and output, infomax for short) to a network of WTA neurons. The question now is whether this principle can also be applied to kernel-based topographic maps. The obvious answer is to express the average mutual information integral in terms of the kernel output densities, or probabilities when they are discretized, and adjust the kernel parameters individually so that the integral is maximized. However, such an approach rapidly becomes infeasible in practice. Linsker needed to restrict himself to binary outputs in his WTA network in order to facilitate computing the integral.
A different information-theoretic approach is to minimize the Kullback-Leibler divergence (also called relative- or cross-entropy) between the true and the estimated input density, an idea that has been introduced for kernel-based topographic map formation by Benaim and Tomasini (1991), using homoskedastic gaussians, and extended more recently by Yin and Allinson (2001) to heteroskedastic gaussians. We will introduce in this article a new learning algorithm for kernel-based topographic map formation that adjusts the kernel parameters individually and in such a manner that the joint entropy of the kernel outputs is maximized. We will derive our learning algorithm, and the optimization criterion behind it, in a bottom-up manner by starting with differential entropy maximization: when this is maximized for a given kernel, the kernel's parameters will be optimally adapted in the sense that the mutual information between the kernel output and its input will be maximal. The kernel output function and the learning rule for this case are derived in sections 2 and 3, respectively. As our kernel output function, we take the (radial) incomplete gamma distribution, since the kernel's differential entropy is theoretically maximal when the input density is gaussian. Section 4 starts with the observation that differential entropy maximization alone is not sufficient when there are multiple kernels in the map, since all kernels will eventually coincide. As
a measure for kernel overlap, and thus for the map's output redundancy, the mutual information between the kernel outputs is taken. Maximizing joint entropy is then put forward as our optimization principle since it unifies both requirements: maximizing the differential entropies of the kernel outputs given that the map's mutual information also needs to be minimized. Section 5 then complements the learning rules with a neighborhood function so that topographic maps can be developed. Since we are able to derive the theoretically optimal joint entropy performance for the case of an input distribution consisting of nonoverlapping gaussians, we will suggest a new clustering algorithm that uses this optimum as its "null" distribution in section 6. In section 7, we show that our learning algorithm is similar to one that performs stochastic gradient descent on the Kullback-Leibler divergence when heteroskedastic gaussian mixtures are used. Finally, in section 8, we discuss the correspondence with other learning algorithms for kernel-based topographic map formation.

2 Kernel Definition
Consider a formal neuron $i$, the output of which is, in response to an input $v\in\mathbb R^d$, a function of the Euclidean distance between $v$ and the neuron's weight vector $w_i$. For a unit-variance, radially symmetric gaussian input density, the squared Euclidean distance $x$ obeys a $\chi^2$ distribution with $d$ degrees of freedom,

$$
p_{\chi^2}(x)=\frac{x^{\frac d2-1}\exp\bigl(-\frac x2\bigr)}{2^{\frac d2}\,\Gamma\bigl(\frac d2\bigr)},\qquad(2.1)
$$

for $0\le x<\infty$, and with $\Gamma(.)$ the gamma function. Hence, the distribution of the Euclidean distance to the center becomes $p(r)=2r\,p_{\chi^2}(r^2)$, with $r=\sqrt x$, following the fundamental law of probabilities. After some algebraic manipulations and by generalizing to a gaussian with standard
Figure 1: Distribution functions of the Euclidean distance r from the mean of a unit variance, radially symmetrical gaussian, parameterized with respect to the dimensionality d (continuous lines), and the corresponding complements of the cumulative distribution functions (dashed lines). The functions are plotted for d = 1, ..., 10 (from left to right); the thick lines correspond to the d = 1 case.
deviation $s$, we can write the distribution of the Euclidean distances as follows:

$$
p(r)=\frac{2\bigl(\frac rs\bigr)^{d-1}\exp\bigl(-\frac{(r/s)^2}{2}\bigr)}{2^{\frac d2}\,\Gamma\bigl(\frac d2\bigr)\,s}.\qquad(2.2)
$$

The distribution is plotted in Figure 1 (thick and thin continuous lines). The mean of $r$ equals $\mu_r=\frac{\sqrt2\,s\,\Gamma(\frac{d+1}2)}{\Gamma(\frac d2)}$, which can be approximated as $\sqrt d\,s$, for $d$ large, using the approximation of Graham, Knuth, and Patashnik (1994); the second moment around zero equals $ds^2$. The distribution $p(r)$ quickly approaches a gaussian with mean $\mu_r$ and standard deviation $\frac s{\sqrt2}$ when $d$ increases; for example, the skewness and Fisher kurtosis are $8.07\times10^{-2}$ and $8.71\times10^{-3}$ for $d=10$, respectively. Finally, the kernel is defined in accordance with the cumulative distribution of $p(r)$, which is the (complement of the) incomplete gamma distribution:

$$
y_i=K(v,w_i,s_i)=P\Bigl(\frac d2,\frac{\|w_i-v\|^2}{2s_i^2}\Bigr)=\frac{\Gamma\bigl(\frac d2,\frac{\|w_i-v\|^2}{2s_i^2}\bigr)}{\Gamma\bigl(\frac d2\bigr)},\qquad(2.3)
$$

which is plotted as a function of the Euclidean distance $r=\|w_i-v\|_2$, for $d=1,\ldots,10$, in Figure 1 (thick and thin dashed lines). Note that since $K(.)$
depends only on the Euclidean distance, the obtained kernel in the input space is radially symmetrical.

3 Kernel Adaptation
We will now derive an on-line stochastic learning algorithm that adapts an individual kernel's parameters in such a way that the differential entropy of the kernel output is maximized. Consider first the adaptation of the kernel center. The entropy of the kernel output of neuron $i$ can be written as

$$
H(y_i)=-\int_0^1 p_{y_i}(x)\ln p_{y_i}(x)\,dx,\qquad(3.1)
$$

with $p_{y_i}(.)$ the kernel output density, which can be written as a function of the distribution of the Euclidean distance to the kernel center $r$:

$$
p_{y_i}(y_i)=\frac{p_r(r)}{\bigl|\frac{\partial y_i}{\partial r}\bigr|}.\qquad(3.2)
$$

After substitution of the latter into equation 3.1, we obtain

$$
H(y_i)=-\int_0^\infty p_r(r)\ln p_r(r)\,dr+\int_0^\infty p_r(r)\ln\Bigl|\frac{\partial y_i(r)}{\partial r}\Bigr|\,dr.\qquad(3.3)
$$

By performing gradient ascent on $H(.)$, we obtain the on-line learning rule for the kernel center,

$$
\Delta w_i=\eta_w\,\frac{\partial H}{\partial r}\frac{\partial r}{\partial w_i}=\eta_w\,\frac{v-w_i}{s_i^2},\qquad\forall i,\qquad(3.4)
$$

with $\eta_w$ the learning rate, after some algebraic manipulations (see the appendix). In a similar manner, the learning rule for the kernel radius $s_i$ is obtained,

$$
\Delta s_i=\eta_s\,\frac{\partial H}{\partial s_i}=\eta_s\,\frac1{s_i}\Bigl(\frac{\|v-w_i\|^2}{ds_i^2}-1\Bigr),\qquad\forall i,\qquad(3.5)
$$

with $\eta_s$ the learning rate.

4 Joint Entropy Maximization
Maximizing differential entropy alone is not sufficient when there are multiple kernels in the map. This can be easily shown as follows. Consider a lattice of two neurons with kernel outputs $y_1$ and $y_2$. When equations 3.4 and 3.5 are used (e.g., in the case of a gaussian input density), then the two kernels will eventually coincide, the neural outputs will be maximally statistically dependent, and thus the map will be maximally redundant. We can formulate statistical dependency (or redundancy) in information-theoretic terms as the mutual information between the neuron outputs. Hence, we maximize the differential entropy given that we also need to minimize the mutual information in order to cope with the kernel overlap. This dual goal is captured by maximizing the joint entropy of the neuron outputs,

$$
H(y_1,y_2)=H(y_1)+H(y_2)-I(y_1,y_2),\qquad(4.1)
$$

with $H(y_1,y_2)$ the joint entropy, $H(y_1)$ and $H(y_2)$ the differential entropies, and $I(y_1,y_2)$ the mutual information. We will perform mutual information minimization heuristically by putting kernel adaptation in a competitive learning context. In this way, the winning neuron's kernel will decrease its range, in particular when it is strongly active, and thus decrease its overlap with the surrounding kernels. In addition, we will add a neighborhood function to the learning process, since we want to achieve topology-preserving lattices. The learning rules are derived in the next section.

5 Topographic Map Formation
Consider a lattice $A$ of $N$ neurons and corresponding incomplete gamma distribution kernels $K(v,w_i,s_i)$, $i=1,\ldots,N$. We introduce an activity-based competition between the neurons, with the "winning" neuron defined as $i^*=\arg\max_{i\in A}y_i$, rather than the more common (minimum Euclidean) distance-based competition, $i^*=\arg\min_i\|w_i-v\|$, which is equivalent to our case only when all kernels have equal radii. We supply topological information by means of a neighborhood function $\Lambda$, for which we take a monotonically decreasing function of the lattice distance from the winner. We opt for a gaussian neighborhood function,

$$
\Lambda(i,i^*,\sigma_\Lambda)=\exp\Bigl(-\frac{\|r_i-r_{i^*}\|^2}{2\sigma_\Lambda^2}\Bigr),\qquad(5.1)
$$

with $\sigma_\Lambda$ the neighborhood function range and $r_i$ neuron $i$'s lattice coordinate. The complete set of learning rules then becomes

$$
\Delta w_i=\eta_w\,\Lambda(i,i^*,\sigma_\Lambda)\,\frac{v-w_i}{s_i^2},\qquad(5.2)
$$

$$
\Delta s_i=\eta_s\,\Lambda(i,i^*,\sigma_\Lambda)\,\frac1{s_i}\Bigl(\frac{\|v-w_i\|^2}{ds_i^2}-1\Bigr),\qquad\forall i.\qquad(5.3)
$$
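Equations 5.2 and 5.3, together with the kernel of equation 2.3, the activity-based winner, and the cooling schedule of equation 5.4 below, can be sketched in code. This is a minimal one-dimensional-lattice illustration, not the author's implementation; the lattice size, learning rates, step counts, and the pure-Python series evaluation of the regularized incomplete gamma function are choices made here:

```python
import math
import random

def reg_upper_gamma(a, x, terms=200):
    """Q(a, x) = Gamma(a, x) / Gamma(a): the regularized upper incomplete
    gamma function, via the power series of the lower function."""
    if x <= 0.0:
        return 1.0
    if x > 700.0:            # kernel is effectively zero this far out
        return 0.0
    total, term = 0.0, 1.0 / a
    for n in range(1, terms):
        total += term
        term *= x / (a + n)
    total += term
    lower = math.exp(a * math.log(x) - x - math.lgamma(a)) * total
    return min(1.0, max(0.0, 1.0 - lower))

def kernel(v, w, s, d):
    """Equation 2.3: y = K(v, w, s)."""
    r2 = sum((vi - wi) ** 2 for vi, wi in zip(v, w))
    return reg_upper_gamma(d / 2.0, r2 / (2.0 * s * s))

def train(data, n_neurons, d, t_max=20000, eta_w=0.01, sigma_0=3.0):
    eta_s = 1e-4 * eta_w                       # ratio used in the paper
    W = [list(random.choice(data)) for _ in range(n_neurons)]
    S = [random.uniform(0.5, 1.0) for _ in range(n_neurons)]
    for t in range(t_max):
        v = random.choice(data)
        sigma_L = sigma_0 * math.exp(-2.0 * sigma_0 * t / t_max)  # equation 5.4
        y = [kernel(v, W[i], S[i], d) for i in range(n_neurons)]
        winner = max(range(n_neurons), key=y.__getitem__)         # activity-based
        for i in range(n_neurons):
            # 1-D chain: (i - winner)^2 stands in for ||r_i - r_i*||^2.
            lam = math.exp(-(i - winner) ** 2 / (2.0 * sigma_L ** 2 + 1e-300))
            dist2 = sum((vj - wj) ** 2 for vj, wj in zip(v, W[i]))
            W[i] = [wj + eta_w * lam * (vj - wj) / S[i] ** 2      # equation 5.2
                    for wj, vj in zip(W[i], v)]
            S[i] += eta_s * lam * (dist2 / (d * S[i] ** 2) - 1.0) / S[i]  # eq. 5.3
    return W, S
```

Kernel centers are initialized by sampling the data, as in section 5.1. For $d=2$ the kernel reduces to $\exp(-r^2/2s^2)$, which gives a convenient exactness check of the incomplete gamma evaluation.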
In case the neighborhood function is omitted in equations 5.2 and 5.3, joint entropy maximization will still be aimed for, but the kernels will not become topographically organized, in particular when starting from a random initialization. Hence, one could envisage the purpose of the neighborhood function as a way to constrain the learning process so that it favors the development of topographic maps.

5.1 Lattice-Disentangling Dynamics. The disentangling dynamics is exemplified in Figure 2 for the standard case of a square lattice and a uniform square input density. The weights were initialized by sampling from the same input density and the radii by sampling the uniform distribution [0, 0.1]. The neighborhood range was decreased as follows:

$$
\sigma_\Lambda(t)=\sigma_{\Lambda 0}\exp\Bigl(-2\sigma_{\Lambda 0}\,\frac t{t_{\max}}\Bigr),\qquad(5.4)
$$

with $t$ the present time step, $t_{\max}$ the maximum number of time steps, and $\sigma_{\Lambda 0}$ the range spanned by the neighborhood function at $t=0$. We took $t_{\max}=2{,}000{,}000$ and $\sigma_{\Lambda 0}=12$, so that the neighborhood function vanishes at the end of the learning process ($\sigma_\Lambda(t_{\max})\approx4.5\times10^{-10}$). We further took $\eta_w=0.01$ and $\eta_s=10^{-4}\eta_w$.

6 Theoretically Optimal and Achieved Performance
The theoretically optimal performance of the kernel-based map can be derived for two limiting cases. In the first case, we assume that the input density consists of $N$ gaussians that are spaced infinitely far apart so that their overlap is infinitesimally small ($N$-gaussians case). We quantize the kernel outputs $y_i$, $\forall i\in A$, uniformly into $k$ equally sized and nonoverlapping quantization intervals.¹ The optimal solution is reached when each gaussian is modeled by a different kernel. The expressions for the joint entropy JE and mutual information MI can be derived analytically:

$$
JE=\frac1k\log_2 k+\frac{k-1}{k}\log_2 kN,\qquad(6.1)
$$

$$
MI=\log_2\Bigl(\frac{kN}{(N-1)k+1}\Bigr)^{N-1}+\frac1k\log_2\frac N{(N-1)k+1}.\qquad(6.2)
$$

In Figure 3A, JE and MI are plotted as a function of $N$ and parameterized with respect to $k$. We can easily determine the asymptotes for MI when

¹ Note that as a result of this quantization, the aforementioned assumption is achieved as soon as, for each neuron, the tails of the $N-1$ other gaussians activate only the neuron's lowest quantization interval, the one that codes for the lowest kernel output values.
Figure 2: Evolution of a 24 × 24 lattice as a function of time. (Left column) Evolution of the kernel weights. (Right column) Evolution of the kernel radii. The boxes outline the uniform input probability density. The values given below them represent time.
$k\to\infty$ (the thin dashed line in Figure 3A) and $k,N\to\infty$ (dot-dashed line):

$$
MI=\log_2\Bigl(\frac N{N-1}\Bigr)^{N-1},\qquad\text{for }k\to\infty,\qquad(6.3)
$$

$$
MI=\frac1{\log_e 2}\approx1.4427,\qquad\text{for }k,N\to\infty.\qquad(6.4)
$$

This means that for a given number of neurons $N$, the mutual information will always be finite.² Furthermore, we also see that for any number of neurons and quantization levels, the mutual information will be bounded by the second asymptote. Finally, note that the optimal JE and MI do not depend on the input space dimensionality $d$.

The performance of our learning algorithm can be easily measured against these theoretical results. Consider a one-dimensional lattice (chain) of $N$ neurons, with $N=1,\ldots,10$, developed in the one-dimensional input space. We take $N$ unit-variance gaussians, spaced 15 units apart along the real line. We initialize the weights and radii by sampling the [0, 1] interval uniformly and run our learning algorithm 20 times for several $(N,k)$-combinations. We take $t_{\max}=2{,}000{,}000$, $\sigma_{\Lambda 0}=12$, $\eta_w=0.01$, and $\eta_s=10^{-4}\eta_w$. The JE and MI plots obtained practically coincide with the optimal ones and are not shown in Figure 3A: the differences between the theoretically optimal JE and MI values and the obtained averages are smaller than, respectively, 0.005 and 0.02 everywhere. This performance was maintained for input space dimensionalities $d=1,\ldots,10$ (see Figure 3C).

In the second case, we assume that the input density consists of one gaussian only (1-gaussian case). The expressions for the joint entropy and mutual information can be derived analytically, for the case of $k=2$ quantization levels and a one-dimensional input space:

$$
JE=\log_2 2N,\qquad(6.5)
$$
$$
MI=2\Bigl(\frac1N\log_2 N+\frac{N-1}N\log_2\frac N{N-1}\Bigr)+(N-2)\Bigl(\frac3{2N}\log_2\frac{2N}3+\frac{2N-3}{2N}\log_2\frac{2N}{2N-3}\Bigr)-\log_2 2N,\qquad(6.6)
$$

² This can be intuitively understood by observing that only the lowest quantization interval accounts for the overlap with the other kernels (see note 1) and that as $k$ increases, the activation probability of this interval becomes more and more "unique" with respect to the activation probabilities of the neuron's other quantization intervals (which form an equitable distribution). More specifically, the probability that the lowest quantization interval is active equals $\frac{(N-1)k+1}{kN}$, whereas that of the other intervals equals $\frac1{kN}$, from which it follows that the mutual information will be bounded when $k\to\infty$.
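The closed forms of equations 6.1 and 6.2 (written out below as reconstructed from the derivation) can be cross-checked against the explicit joint output distribution of the N-gaussians case: with probability 1/N an input comes from gaussian j, neuron j then occupies each of its k intervals with probability 1/k, and all other neurons sit in their lowest interval (see note 2). A sketch:

```python
import math

def closed_JE_MI(N, k):
    JE = math.log2(k) / k + (k - 1) / k * math.log2(k * N)        # equation 6.1
    MI = (N - 1) * math.log2(k * N / ((N - 1) * k + 1)) \
         + math.log2(N / ((N - 1) * k + 1)) / k                    # equation 6.2
    return JE, MI

def direct_JE_MI(N, k):
    # Joint configurations: "all neurons in lowest interval" (prob 1/k), plus
    # N*(k-1) configurations "neuron j in interval l > 1" (prob 1/(N*k) each).
    probs = [1 / k] + [1 / (N * k)] * (N * (k - 1))
    JE = -sum(p * math.log2(p) for p in probs)
    # Marginal of one neuron: lowest interval ((N-1)k+1)/(kN), others 1/(kN).
    p0 = ((N - 1) * k + 1) / (k * N)
    H1 = -p0 * math.log2(p0) + (k - 1) / (k * N) * math.log2(k * N)
    return JE, N * H1 - JE        # MI = sum of marginal entropies - JE

for N in range(2, 8):
    for k in (2, 4, 8, 32):
        (JE, MI), (JEd, MId) = closed_JE_MI(N, k), direct_JE_MI(N, k)
        assert abs(JE - JEd) < 1e-9 and abs(MI - MId) < 1e-9

# Asymptote of equation 6.3: MI -> log2 (N/(N-1))^(N-1) as k grows.
N = 4
_, MI_big_k = closed_JE_MI(N, 10 ** 6)
assert abs(MI_big_k - (N - 1) * math.log2(N / (N - 1))) < 1e-3
```

The same bookkeeping reproduces the bound of equation 6.4 when N is also made large.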
Figure 3: (A, B) Optimal joint entropy (JE, thick continuous lines) and mutual information (MI, thick dashed lines) for the case of N lattice neurons and N one-dimensional gaussians spaced infinitely far apart (A) and for the case of only one one-dimensional gaussian (B). Results are plotted as a function of the number of neurons N, parameterized with respect to k = 2, 4, 8, 16, 32 quantization levels, with the k = 2 curves being the lowest and the k = 32 curves the highest ones. The thin dashed line in A denotes the MI plot for the case where k → ∞ and the dot-dashed line the case where k, N → ∞. The thin continuous line in B is the average MI result obtained with the learning algorithm (including error bars). (C) Simulation results obtained for JE, for the N-gaussians case with k = 2, plotted as a function of the dimensionality d and parameterized with respect to the number of lattice neurons N, N = 1, 2, ..., 10. The thick line corresponds to the N = 1 case, the upper line to N = 10. (D) Optimal JE and MI (thick continuous and dashed lines, respectively) for the one-dimensional gaussian case, plotted as a function of d. Note that MI = 0 for N = 1. The thin continuous line is the average MI result (error bars omitted) obtained with our learning algorithm. Other conventions are as in Figure 3C.
with the second term in MI being present for $N>2$ only. Figure 3B shows the JE and MI plots for different values of $k$; the plots for the $k>2$ cases were determined numerically. Contrary to the $N$-gaussians case, MI continues to increase when $N$ increases. Furthermore, the optimal JE and MI plots depend on the input dimensionality $d$ and are also determined numerically. We also determine the performance of our learning algorithm for the 1-gaussian case. The results are summarized in Figure 3B. The average JE results (also for 20 runs) practically coincide with the theoretically optimal ones (difference with optimal result < 0.02; standard deviation < 0.03). The results as a function of the input dimensionality $d$ are shown in Figure 3D for $k=2$. The average JE plots again practically coincide with the numerically determined ones (difference with optimal result < 0.02; standard deviation < 0.025). The standard deviations on the MI results are similar to those shown in Figure 3B and are omitted for clarity. Finally, since we are using a heuristic for mutual information minimization (i.e., competitive learning), we can only verify the performance of the learning algorithm empirically. We cannot formally prove that, for a given input distribution, the algorithm is guaranteed to converge toward a solution that maximizes the joint entropy. However, at least for the test cases considered here, we could verify that the achieved performance was satisfactory, since the theoretical results were available.

6.1 Clustering. The theoretical results suggest a new type of metric for determining the number of clusters in a data set, given that they can be appropriately modeled by gaussians.³ If the data set can be modeled by $N$ gaussians that are spaced sufficiently far apart, then we know the theoretically optimal joint entropy (cf. the $N$-gaussians case). We also know that a mismatch in the number of neurons leads to a higher joint entropy: the JE values of the $N$-gaussians case are always smaller than the corresponding values for the 1-gaussian case (e.g., $\frac12\log_2 2N+\frac12<\log_2 2N$, for $N>1$, $k=2$). Hence, the $N$-gaussians case will be our "null" distribution and the corresponding optimal joint entropy our reference value. We suggest the following clustering procedure:

1. For ($N\leftarrow1$; $N\le$ max number of clusters; $N+1$), develop a topographic map with $N$ kernels and determine JE; determine the difference with JE equation 6.1, assuming $N$ gaussians, one for each kernel; and store the JE difference.

2. Select $N$ with the smallest JE difference.

The advantage of this procedure is twofold. First, we can still consider the one-cluster case, which is lacking in most clustering methods (Gordon,

³ In practice, single clusters can be appropriately modeled by log-concave functions, such as gaussians.
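The selection step can be sketched as follows, with `measured_JE` standing in for the joint entropy obtained from a trained map; the measured values below are fabricated purely for illustration, while `je_reference` is equation 6.1 with k = 2:

```python
import math

def je_reference(N, k=2):
    # Theoretical JE for N well-separated gaussians (equation 6.1).
    return math.log2(k) / k + (k - 1) / k * math.log2(k * N)

def select_n_clusters(measured_JE):
    # measured_JE[N]: JE of a map trained with N kernels (hypothetical input).
    diffs = {N: abs(je - je_reference(N)) for N, je in measured_JE.items()}
    return min(diffs, key=diffs.get)

# Made-up measurements for data that actually contains 3 clusters.
# Reference values: N=1 -> 1.0, N=2 -> 1.5, N=3 -> 1.792, N=4 -> 2.0.
measured = {1: 0.70, 2: 1.30, 3: 1.80, 4: 2.25}
assert select_n_clusters(measured) == 3
```

In practice, `measured_JE` would be estimated from the quantized kernel outputs of maps trained as in section 5, one map per candidate N.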
Figure 4: Clustering based on the difference between expected and obtained joint entropy (thick continuous line) and based on the Gap statistic applied to kernel weights (thin continuous line) and applied to the k-means clustering result (dashed line).
1999). Second, our reference value is a theoretical result, instead of an estimate, as, for example, in the case of the Gap statistic, which looks at the point where the difference between two values of an intracluster distance metric becomes maximal, one for the given sample set and another for a reference set (Tibshirani, Walther, & Hastie, 2001). In order to test our procedure, we reapply the benchmark Tibshirani and coworkers (2001) used for the Gap statistic. We consider two one-dimensional gaussians with equal standard deviations and vary the (Bayesian) overlap rate. For each overlap rate, we train our one-dimensional lattices in batch mode on 100 samples, 50 from each gaussian, and determine the configuration $N$ for which the JE difference is minimal, as explained above. We repeat this experiment 20 times, for different 100-sample sets, determine the probability that the correct number of clusters is found, and plot this probability as a function of the overlap rate. The result is shown in Figure 4 (thick continuous line). We observe a marked drop in performance at about an 18% overlap rate, which is close to the point where the sum of two gaussians becomes unimodal. The Gap statistic, using the Euclidean distance metric, is applied to the kernel centers and also to the case where the weights are determined with the k-means clustering algorithm. The results are shown in Figure 4 (thin continuous and dashed line, respectively). We now observe a gradual decrease in performance for an increasing overlap rate, until it drops below chance level (dot-dashed line), since the choice is in practice between one or two clusters.
7 Connection with Kullback-Leibler Divergence
There is an interesting connection between joint entropy maximization, given our incomplete gamma distribution kernels, and Kullback-Leibler divergence minimization, given heteroskedastic gaussian kernels. The Kullback-Leibler divergence KL (also called relative- or cross-entropy) is a frequently used metric for assessing the quality of a density estimate. It is defined as follows:

$$
KL=-\int\log\frac{\hat p(v)}{p(v)}\,p(v)\,dv,\qquad(7.1)
$$

with $p(v)$ the true input density and $\hat p(v)$ the estimated density. It is always a nonnegative number, and it will equal zero if and only if the density estimate is identical to the true density. Assume that we perform a gaussian mixture density modeling with equal mixings,

$$
\hat p(v)=\frac1N\sum_{i=1}^N\frac{\exp\bigl(-\frac{\|v-w_i\|^2}{2s_i^2}\bigr)}{(2\pi)^{\frac d2}\,s_i^d},\qquad(7.2)
$$

with $w_i$ and $s_i$ the $i$th gaussian's center and radius, respectively. The "optimal" density estimate is determined by minimizing KL. We need the partial derivatives with respect to the centers and radii:

$$
\frac{\partial KL}{\partial w_i}=-\int\Bigl(\frac1{\hat p(v|\Theta)}\frac{\partial\hat p(v|\Theta)}{\partial w_i}\Bigr)p(v)\,dv=0,\qquad(7.3)
$$

$$
\frac{\partial KL}{\partial s_i}=-\int\Bigl(\frac1{\hat p(v|\Theta)}\frac{\partial\hat p(v|\Theta)}{\partial s_i}\Bigr)p(v)\,dv=0,\qquad\forall i,\qquad(7.4)
$$

with $\Theta=\{[w_i],[s_i]\}$ the parameter vector of the density model. Stochastic approximation (Robbins & Monro, 1951) can be invoked to solve these equations, which leads to the following learning rules,

$$
\Delta w_i=\eta_w\,\hat P(v|i)\,\frac{v-w_i}{s_i^2},\qquad(7.5)
$$

$$
\Delta s_i=\eta_s\,\hat P(v|i)\,\frac d{s_i}\Bigl(\frac{\|v-w_i\|^2}{ds_i^2}-1\Bigr),\qquad\forall i,\qquad(7.6)
$$

after some algebraic manipulations (see the appendix). The parameter $\hat P(v|i)$ represents the $i$th neuron's posterior probability. When we take $\hat P(v|i)=\delta_{ii^*}$, with $i^*$ the neuron that wins the competition, and introduce the conventional neighborhood function, since we wish to achieve a topology-preserving mapping, we obtain the following learning rules:

$$
\Delta w_i=\eta_w\,\Lambda(i,i^*,\sigma_\Lambda)\,\frac{v-w_i}{s_i^2},\qquad(7.7)
$$

$$
\Delta s_i=\eta_s\,\Lambda(i,i^*,\sigma_\Lambda)\,\frac d{s_i}\Bigl(\frac{\|v-w_i\|^2}{ds_i^2}-1\Bigr),\qquad\forall i.\qquad(7.8)
$$
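The mixture of equation 7.2 and the winner-based rules 7.7 and 7.8 can be sketched in one dimension as follows. This is an illustration only: the winner is taken as the neuron with the largest gaussian activation (a stand-in for the incomplete gamma output), and all parameter values are arbitrary choices:

```python
import math
import random

def mixture_density(v, W, S, d=1):
    """Equation 7.2: equal-mixing heteroskedastic gaussian mixture (1-D here)."""
    return sum(math.exp(-(v - w) ** 2 / (2 * s * s)) / ((2 * math.pi) ** (d / 2) * s ** d)
               for w, s in zip(W, S)) / len(W)

def update(v, W, S, sigma_L, eta_w=0.01, eta_s=1e-6, d=1):
    acts = [math.exp(-(v - w) ** 2 / (2 * s * s)) for w, s in zip(W, S)]
    winner = max(range(len(W)), key=acts.__getitem__)    # activity-based winner
    for i in range(len(W)):
        lam = math.exp(-(i - winner) ** 2 / (2 * sigma_L ** 2))
        dist2 = (v - W[i]) ** 2                          # before moving the center
        W[i] += eta_w * lam * (v - W[i]) / S[i] ** 2                      # eq. 7.7
        S[i] += eta_s * lam * (d / S[i]) * (dist2 / (d * S[i] ** 2) - 1)  # eq. 7.8

random.seed(1)
W, S = [-1.0, 0.0, 1.0], [0.8, 0.8, 0.8]
for _ in range(2000):
    update(random.gauss(0.0, 1.0), W, S, sigma_L=0.5)
total = sum(mixture_density(x / 100, W, S) for x in range(-1000, 1000)) / 100
# total should stay close to 1: the estimate remains a normalized density
```

Whatever the learned centers and radii, the mixture stays normalized, which is what makes the Kullback-Leibler comparison in the example below meaningful.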
Apparently, these learning rules correspond to those of equations 5.2 and 5.3, except for the constant factor $d$ in equation 7.8, but this can be absorbed by the learning rate $\eta_s$. Hence, at least for our incomplete gamma distribution kernels and our activity-based definition of "winner," joint entropy maximization and Kullback-Leibler divergence minimization seem to be equivalent. This is a potentially interesting observation for density estimation since the log-likelihood of the training samples will equal the Kullback-Leibler divergence when the number of training samples goes to infinity and the sample-generating process is ergodic (for references, see Yin & Allinson, 2001). As an example, we consider the distribution shown in Figure 5A (quadrimodal product distribution). We have used this example before as a benchmark for comparing the density estimation performance of a series of kernel-based and kernel-extended topographic map formation algorithms (Van Hulle, 2000). The analytic equation of the distribution is, in the first quadrant, $(-\log v_1)(-\log v_2)$, with $(v_1,v_2)\in[0,1]^2$, and so on. Each quadrant is chosen with equal probability. The resulting asymmetric distribution is unbounded and consists of four modes separated by sharp transitions (discontinuities), which makes it difficult to model. The support of the distribution is bounded by the square $[-1,1]^2$. We take a 24 × 24 lattice and train it until $t_{\max}=1{,}000{,}000$ in equation 5.4. The density estimate is obtained by using equation 7.2, with $w_i$ and $s_i$ as determined with our learning algorithm. The result is shown in Figure 5B. The Kullback-Leibler divergence is 19.2; the mean squared error (MSE) between the estimated and the theoretical distribution is $5.08\times10^{-2}$.
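The per-quadrant factor $(-\log v_1)(-\log v_2)$ is indeed a normalized density on $[0,1]^2$, since $\int_0^1-\log v\,dv=1$; a quick midpoint-rule check:

```python
import math

# Midpoint-rule check that the one-dimensional factor integrates to one.
n = 20000
integral_1d = sum(-math.log((i + 0.5) / n) for i in range(n)) / n
assert abs(integral_1d - 1.0) < 1e-3
# The 2-D product density (-log v1)(-log v2) then integrates to 1 * 1 = 1
# on the first quadrant [0, 1]^2.
```

The logarithmic singularity at the origin is what makes the distribution unbounded, yet integrable, as the text notes.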
For comparison, we have also considered the classic variable kernel (VK) density estimation method (Silverman, 1992), which puts a gaussian kernel at each input sample and adapts the kernel range to the local sample density.⁴ The Kullback-Leibler divergence is 80.4 (MSE = $7.12\times10^{-2}$) for $M=500$ samples, 38.6 (MSE = $5.92\times10^{-2}$) for $M=2000$ samples, 28.6 (MSE = $5.57\times10^{-2}$) for $M=5000$ samples, and 21.4 (MSE = $5.35\times10^{-2}$) for $M=10{,}000$ samples. Hence, apparently many more kernels are needed to match our learning algorithm's

⁴ Technically, we take for the sensitivity parameter of the VK method $a=\frac12$, as suggested by Breiman, Meisel, and Purcell (1977). The pilot estimate was determined by using the ($k$th) nearest-neighbor method with $k=\sqrt M$ (Silverman, 1992).
performance. Finally, in order to see the effect of our activity-based WTA competition, i¤ D arg max8i2A yi , we have also trained our lattice using the classic distance-based rule, i¤ D arg mini kw i ¡ v k (and with the radii adapted as in equation 5.3). The resulting density estimate is shown in Figure 5C. The Kullback-Leibler divergence is now 84.7 (MSE = 5.13 £ 10¡2 ), which is again inferior to our initial result. 8 Correspondence with Other Kernel-Based Topographic Map Algorithms
The soft topographic vector quantization (STVQ) algorithm (Graepel et al., 1997) performs a fuzzy assignment of input samples to lattice neurons, similar to fuzzy clustering. It also serves as a general model for probabilistic, SOM-based topographic map formation since several algorithms can be considered as special cases, including Kohonen’s batch map version (Kohonen, 1995). Our learning algorithm is different in at least three ways. First, the STVQ kernel represents a fuzzy membership (in clusters) function, that is, the softmax function, normalized with respect to the other lattice neurons. In our case, the kernel represents a cumulative distribution function, which operates in the input space, and determines the winning neuron. Second, instead of using kernels with equal radii in the STVQ algorithm, our radii are individually adapted. Third, the kernels also differ conceptually since in the STVQ algorithm, the kernel radii are related to the magnitude of the noise-induced change in the cluster assignment (thus, in lattice space), whereas in our case, they are related to the radii of the incomplete gamma distribution kernels and, by consequence, to the standard deviations of the assumed gaussian local input densities (thus, in input space). In the kernel-based soft topographic mapping (STMK) algorithm (Graepel et al., 1998), a nonlinear transformation, from the original input space to some “feature” space, is introduced that admits a kernel function, (e.g., a gaussian). The topographic map’s parameters (“weights”) are expressed as linear combinations of the transformed inputs, so that the map is in effect developed in the feature space rather than in the input space directly, as in our approach. Other points of distinction are that the kernels’ parameters are not updated in the STMK algorithm and that the inputs are assigned in probability to neurons. 
In András's approach (2001), the gaussians serve as a nonlinear transformation, mapping the input vectors to a high-dimensional space, so that the boundaries between the input classes are linearized. The kernel radii are adapted individually so that the map's classification performance is optimized, an operation that requires supervised learning. This basically sets András's approach apart from ours. In the kernel-based maximum entropy learning rule (kMER) (Van Hulle, 1998), the kernel outputs are thresholded (0/1 activations), and, depending on these binary activations, the kernel centers and radii are adapted.
Marc M. Van Hulle
[Figure 5 panels: three surface plots of the probability density pd over (v1, v2).]
Figure 5: Two-dimensional quadrimodal product distribution (A) and density estimate obtained with our kernel-based topographic map learning algorithm (B), and when in our learning algorithm a distance-based competition is used instead of an activity-based competition (C).
In the current approach, both the activation states and the learning rules, equations 5.2 and 5.3, depend on the continuously graded kernel outputs. In the approach of Benaim and Tomasini (1991), the weights are adapted in such a manner that the Kullback-Leibler divergence is minimized, given a homoskedastic gaussian mixture density model. Furthermore, in order to achieve a topology-preserving mapping, a smoothness term of the form $\Lambda(i, i^*)(w_i - w_{i^*})$ is added to the weight update rule, with the winning neuron $i^*$ defined by a distance-based WTA rule. Recently, Yin and Allinson (2001) reconsidered Benaim and Tomasini's idea and extended it by also adapting the radii of the gaussian kernels. Yin and Allinson's neighborhood function basically corresponds to the $\hat{p}(i|v)$ term in equations 7.5 and 7.6, but it operates in input space coordinates, instead of in lattice space coordinates, as in our case (cf. $\Lambda(\cdot)$ in equations 7.7 and 7.8). Furthermore, the winning neuron $i^*$ is the one for which the gaussian kernel output is maximal. In general, this leads to quite different results.

9 Conclusion
We have developed a new learning algorithm for kernel-based topographic map formation, aimed at maximizing the joint entropy of the map's output. We have formulated the theoretically optimal joint entropy performance for two example input distributions, which we used for assessing the algorithm's performance and also for suggesting a new type of clustering algorithm. Finally, we have shown the correspondence with stochastic gradient descent on the Kullback-Leibler divergence in the case of a heteroskedastic gaussian mixture density model.

Appendix

A.1 Derivation of Learning Rules: Equations 3.4 and 3.5. We first consider the update rule for the kernel centers. The first term on the right-hand side of equation 3.3 does not depend on the kernel center. Hence, we need to concentrate further on only the second term, which in fact corresponds to the expected value of its ln component. Entropy maximization can be achieved by considering the training set of r's to approximate the density $p_r(r)$, which leads to the on-line stochastic gradient ascent learning rule:

$$\Delta w_i = \eta_w \frac{\partial H}{\partial r}\frac{\partial r}{\partial w_i} = \eta_w \left(\frac{\partial y_i}{\partial r}\right)^{-1} \frac{\partial}{\partial w_i}\left(\frac{\partial y_i}{\partial r}\right), \quad \forall i. \tag{A.1}$$
After some algebraic manipulations, we obtain

$$\frac{\partial y_i}{\partial r} = \frac{2\, r^{d-1}}{\Gamma\!\left(\frac{d}{2}\right)\,(\sqrt{2}\,\sigma_i)^{d}} \exp\!\left(-\frac{1}{2}\left(\frac{r}{\sigma_i}\right)^{2}\right), \tag{A.2}$$
of which also the derivative with respect to $w_i$ is needed. For the one-dimensional case ($d = 1$), this leads to

$$\Delta w_i = \eta_w \frac{v - w_i}{\sigma_i^2}, \quad \forall i, \tag{A.3}$$
when using equation 2.3. For $d > 1$, we obtain the same term on the right-hand side and a second one: $(d-1)\frac{v - w_i}{\|v - w_i\|^2}$. This more complex rule also converges toward the centroid of the inputs that activate neuron i. It is reasonable to omit this second term in order to keep the learning rule simple. Indeed, we know that $E[\|v - w_i\|^2] = d\sigma_i^2$ for a gaussian input distribution, when $w_i$ converges to the gaussian's mean and $\sigma_i$ to its standard deviation. Hence, the second term is expected to be smaller than the first term. We can also motivate the omission in the following manner. Since a d-dimensional radially symmetric gaussian distribution can be built up by taking d samples, one for each input dimension, of a one-dimensional gaussian with the same radius, when the updates $\Delta w_{ij}$ are small and when we update along each input dimension separately (e.g., in random order), then we can approximate the learning rule by the simpler one, equation 3.4. The update rule for the kernel radii, equation 3.5, can be derived directly, without any approximations, by performing gradient ascent on the differential entropy H, after some algebraic manipulations.

A.2 Derivation of Learning Rules: Equations 7.5 and 7.6. We first consider the derivation of the update rule for the kernel centers, equation 7.5. Since the true input density $p(v)$ is not known, Robbins-Munro stochastic approximation can be invoked in order to solve equation 7.3. This leads to
$$\Delta w_i = \eta_w \frac{1}{\hat{p}(v|\Theta)} \frac{\partial \hat{p}(v|\Theta)}{\partial w_i}, \quad \forall i. \tag{A.4}$$

Working out the derivative $\partial \hat{p}(v|\Theta)/\partial w_i$ yields

$$\frac{\partial \hat{p}(v|\Theta)}{\partial w_i} = p_i(v|\theta_i)\, \frac{v - w_i}{\sigma_i^2}, \tag{A.5}$$

with $p_i(v|\theta_i) = \frac{1}{(2\pi)^{d/2}\sigma_i^d} \exp\left(-\frac{\|v - w_i\|^2}{2\sigma_i^2}\right)$ and $\theta_i = \{w_i, \sigma_i\}$. Bayes' rule tells us that

$$\hat{p}(i|v) = \frac{\hat{P}_i\, p_i(v|\theta_i)}{\hat{p}(v|\Theta)}, \tag{A.6}$$

with $\hat{P}_i$ the prior probability, that is, the ith mixing parameter in our gaussian mixture. Since we have assumed equal mixings in equation 7.2, $\hat{P}_i$ is
constant. Substituting equations A.5 and A.6 in equation A.4 leads to the end result given in equation 7.5. In a similar way, the update rule for the kernel radii, equation 7.6, is obtained.

Acknowledgments
I am supported by research grants from the Fund for Scientific Research (G.0185.96N), the National Lottery (Belgium, 9.0185.96), the Flemish Regional Ministry of Education (Belgium, GOA 95/99-06; 2000/11), the Flemish Ministry for Science and Technology (VIS/98/012), and the European Commission (QLG3-CT-2000-30161 and IST-2001-32114).

References

András, P. (2001). Kernel-Kohonen networks. Int. J. Neural Systems. Manuscript submitted for publication.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computat., 7, 1129–1159.
Benaim, M., & Tomasini, L. (1991). Competitive and self-organizing algorithms based on the minimization of an information criterion. In Proc. ICANN'91 (pp. 391–396).
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computat., 10, 215–234.
Breiman, L., Meisel, W., & Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics, 19, 135–144.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Gordon, A. (1999). Classification (2nd ed.). London: Chapman and Hall/CRC Press.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Rev. E, 56(4), 3876–3890.
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: Generalizations and new optimization techniques. Neurocomputing, 21, 173–190.
Graham, R. L., Knuth, D. E., & Patashnik, O. (1994). Concrete mathematics: A foundation for computer science. Reading, MA: Addison-Wesley.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Kostiainen, T., & Lampinen, J. (2002). Generative probability density model in the self-organizing map. In U. Seiffert & L. Jain (Eds.), Self-organizing neural networks: Recent advances and applications (pp. 75–94). Heidelberg: Physica Verlag.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computat., 1, 402–411.
Robbins, H., & Munro, S. (1951). A stochastic approximation method. Ann. Math. Statist., 22, 400–407.
Silverman, B. W. (1992). Density estimation for statistics and data analysis. London: Chapman and Hall.
Sum, J., Leung, C.-S., Chan, L.-W., & Xu, L. (1997). Yet another algorithm which can generate topography map. IEEE Trans. Neural Networks, 8(5), 1204–1207.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a dataset via the Gap statistic. J. Royal Statist. Soc. B, 63, 411–423.
Utsugi, A. (1997). Hyperparameter selection for self-organizing maps. Neural Computat., 9, 623–635.
Van Hulle, M. M. (1998). Kernel-based equiprobabilistic topographic map formation. Neural Computat., 10(7), 1847–1871.
Van Hulle, M. M. (2000). Faithful representations and topographic maps: From distortion- to information-based self-organization. New York: Wiley.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Weisstein, E. W. (1999). CRC concise encyclopedia of mathematics. London: Chapman and Hall.
Yin, H., & Allinson, N. M. (2001). Self-organizing mixture networks for probability density estimation. IEEE Trans. Neural Networks, 12, 405–411.

Received August 7, 2001; accepted January 31, 2002.
LETTER
Communicated by Andrew Back
Sufficient Conditions for Error Backflow Convergence in Dynamical Recurrent Neural Networks

Alex Aussem
[email protected]
LIMOS (FRE CNRS 2239), University Blaise Pascal, Clermont-Ferrand II, 63173 Aubiere Cedex, France

This article extends previous analysis of the gradient decay to a class of discrete-time fully recurrent networks, called dynamical recurrent neural networks, obtained by modeling synapses as finite impulse response (FIR) filters instead of multiplicative scalars. Using elementary matrix manipulations, we provide an upper bound on the norm of the weight matrix, ensuring that the gradient vector, when propagated in a reverse manner in time through the error-propagation network, decays exponentially to zero. This bound applies to all recurrent FIR architecture proposals, as well as fixed-point recurrent networks, regardless of delay and connectivity. In addition, we show that the computational overhead of the learning algorithm can be reduced drastically by taking advantage of the exponential decay of the gradient.

1 Introduction
This article extends the analysis of the gradient error backflow and the forgetting behavior (Bengio, Simard, & Frasconi, 1994; Aussem, Murtagh, & Sarazin, 1995; Aussem, 1999, 2000; Hochreiter & Schmidhuber, 1997; Lin, Horne, Tino, & Giles, 1996) to a fairly general class of fully recurrent architectures obtained by modeling the recurrent synapses as finite impulse response (FIR) filters (Wan, 1993; Aussem et al., 1995). The resulting FIR-recurrent structure encompasses a large variety of synaptic delay-based forward, locally, and globally recurrent neural architecture proposals that have emerged in the literature (see, for instance, Piche, 1994; Tsoi & Back, 1994; Baldi, 1995; Campolucci, Uncini, Piazza, & Rao, 1999; Duro & Santos Reyes, 1999) as well as fixed-point recurrent networks (Pineda, 1987). By virtue of the recurrency, this versatile model is formally defined as a dynamical system described by a system of nonlinear difference equations, whose internal variables are the hidden unit activations. Therefore, these state-space models are also referred to as dynamical recurrent neural networks (DRNN) (Aussem et al., 1995). The architecture being defined in section 2, the temporal recurrent backprop algorithm is recalled in section 3. The error backflow is defined as the

Neural Computation 14, 1907–1927 (2002) © 2002 Massachusetts Institute of Technology
error gradient, with respect to past neuron activations, that flows backward in time through the unfolded error-propagation network in order to adjust the weights. Careful algebraic manipulations are then conducted in section 4 with a view toward formally identifying explicit and interpretation-friendly sufficient conditions for the convergence of the error backflow for all neural architectures that are special cases of the DRNN framework, regardless of whether they necessitate a relaxation phase to settle down to a fixed equilibrium point (the existence of which is out of the scope of this article). An upper bound for the backpropagation through time (BPTT) (Rumelhart, Hinton, & Williams, 1986) truncation length that guarantees a given precision on the weight updates is also derived. Extensive simulations are then conducted to validate these results. A practical measure, called the truncation length error rate, is introduced to gauge the tightness of the upper bound for the BPTT truncation length. The latter is easily interpreted: a truncation length error rate of 2 means that the error backflow was propagated twice as far as necessary into the unfolded error network for the weights to be adjusted with some given precision. It appears experimentally that the truncation length is acceptable as long as the error gradient decay is small.

2 The DRNN Model
DRNN are an extension of conventional recurrent neural networks (RNN) operating in discrete time, allowing the use of arbitrary synaptic time delays. A DRNN with synaptic delays of 0 reduces to the conventional RNN; the information flows through the synapse instantly, while in DRNN the information coming from neuron i will arrive at neuron j after a certain delay associated with that particular synaptic link. In other words, the multiplicative interactions of scalars are substituted by temporal convolution operations, necessitating a time-dependent weight matrix (Wan, 1993; Aussem et al., 1995). Assuming the existence of stable fixed points at each iteration k, a general formulation is a coupled set of discrete equations in vector form,

$$v_k = g(u_k) + i_k, \qquad u_k = \sum_{d=1}^{D} W_{dk}^T v_{k-d} + W_{0k}^T v_k, \tag{2.1}$$
where N is the number of units, and $v_k$ and $u_k$ are the N-dimensional vector activity and internal inputs of the units at iteration k. $i_k$ denotes the external input vector of dimension N. The $N \times N$ matrix $W_{dk}$ contains all adaptive weights of delay $d \le D$ at iteration k (see Table 1 for notations). For notational convenience, we do not distinguish among input, hidden, output, and bias units in the following formulas. Consistent with our notations, we note $g(u_k) = [g(u_{1k}), \ldots, g(u_{Nk})]^T$
Table 1: DRNN Notations.

  $v$: Activation vector
  $u$: Internal input vector
  $i$: Input vector
  $w_{dij}$: Weight connecting neuron i to j with delay d
  $w_d(j) = [w_{d1j}, \ldots, w_{dNj}]^T$: Weight vector of delay d heading to unit j
  $W_d = [w_d(1), w_d(2), \ldots, w_d(N)]$: Weight matrix of delay $d \le D$
  $g(\cdot)$: Nonlinear activation function
  $G$: $N \times N$ diagonal matrix constructed from $g(u(j))$
  $G'$: $N \times N$ diagonal matrix constructed from $g'(u(j))$

Note: For simplicity, iteration indexes k are omitted.
under the mild assumption that the nonlinear activation function, $g(\cdot)$, is $C^1$ with bounded derivatives.
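Equation 2.1 can be sketched in code. This is an illustrative implementation only: a tanh activation is assumed, the nondelayed term $W_0^T v_k$ is resolved by simple fixed-point relaxation (whose convergence is assumed; cf. section 2.1), and all names and the relaxation count are illustrative choices:

```python
import numpy as np

def drnn_step(history, W, W0, i_k, g=np.tanh, n_relax=20):
    """One DRNN iteration (equation 2.1):
        u_k = sum_{d=1..D} W[d]^T v_{k-d} + W0^T v_k,   v_k = g(u_k) + i_k.
    `history` holds [v_{k-1}, ..., v_{k-D}] (W is indexed 0..D-1 for
    delays 1..D). Because v_k appears on both sides through the
    nondelayed matrix W0, we relax to a fixed point by simple iteration."""
    D = len(W)
    drive = sum(W[d].T @ history[d] for d in range(D))   # delayed contributions
    v = np.zeros_like(i_k)
    for _ in range(n_relax):                             # relaxation on v_k
        v = g(drive + W0.T @ v) + i_k
    return v

def drnn_run(inputs, W, W0):
    """Run a DRNN over an input sequence; returns the activation sequence."""
    D, N = len(W), W0.shape[0]
    history = [np.zeros(N) for _ in range(D)]
    out = []
    for i_k in inputs:
        v = drnn_step(history, W, W0, i_k)
        history = [v] + history[:-1]                     # shift the delay line
        out.append(v)
    return out
```

Setting `W0` to the zero matrix removes the relaxation loop's effect entirely, which is the trajectory-network case discussed below.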
2.1 Existence of Fixed Points. In the formulation above, nondelayed recurrent links were allowed on purpose so that fixed-point networks fall within the DRNN class. Indeed, we show later in this article that the conditions for the error backflow convergence in the DRNN are also necessary for the asymptotic stability of the fixed points in so-called fixed-point networks. This was essentially the only reason for adding relaxation. However, there are a priori no advantages, in terms of processing capabilities, of adding relaxation. Relaxation is time-consuming, and the network is not even guaranteed to settle down to a stable equilibrium solution. To get rid of the relaxation, just assume that $W_{0k}$ is triangular. Due to the nondelayed recurrent links, it is not guaranteed that equation 2.1 always settles down to a stable equilibrium solution. For the system to settle down to a steady state ultimately for any choice of initial conditions, it should be globally asymptotically stable (Khalil, 1996). Networks for which a Lyapunov function can be exhibited, such as the Hopfield model or pure feedforward networks, are guaranteed to be globally asymptotically stable. Unfortunately, little is known about the stability of equation 2.1 for arbitrary connectivity. The only statement that can be made in this regard is that the fixed points, if any, are asymptotically stable provided that all eigenvalues of the linearized system satisfy, for all i,
gradient propagation in this rather general architecture in order to calculate upper bounds on the weight matrix that apply to trajectory recurrent networks (by simply letting $W_{0k}$ be the zero matrix) as well as fixed-point recurrent networks, regardless of delay and connectivity.

2.2 Related Neural Models. Many discrete-time neural models discussed in the literature are special cases of DRNN. For instance, if $W_{0k} = 0$ and $W_{dk}^T$ are upper triangular for all $d \neq 0$, the model collapses to Wan's FIR feedforward network (Wan, 1993). Several locally recurrent globally feedforward networks fall into the DRNN class (e.g., the generalized Frasconi-Gori-Soda architecture of Frasconi, Gori, & Soda, 1992; the Poddar-Unnikrishnan architecture of Tsoi & Back, 1994; and the architectures studied by Duro & Santos Reyes, 1999, and Cohen, Saad, & Maromet, 1997). Fixed-point (asymmetric) recurrent networks studied by Pineda (1987) are also encompassed in this framework by setting $W_{dk} = 0$ for $d > 0$. Continuing with examples, if $W_{dk} = 0$ for all $d \neq 1$, the model reduces to a standard globally recurrent network (Williams & Zipser, 1989; Lin et al., 1996). Nonlinear autoregressive (moving average) with eXogeneous inputs (NARX and NARMAX) models (Lin et al., 1996) are also included in this representation. We stress that even if infinite impulse response (IIR) synapse architectures (e.g., Tsoi and Back IIR; Tsoi & Back, 1994) do not fall within the DRNN class, IIR networks may be seen as equivalent to expanded DRNN with extra linear activation units and extra local FIR feedback. Therefore, the training scheme and the upper bounds developed later will also hold provided the IIR network is transformed into a DRNN network. All of the architectures mentioned above suffer from exponential forgetting behavior, as shown in the next section.
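The standard-RNN reduction above can be checked structurally. A minimal sketch, assuming the internal-input form of equation 2.1; the list convention (`W[0]` unused, delays indexed 1..D) and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 3

def drnn_drive(W, W0, v_hist, v_now):
    """Internal input of equation 2.1: u = sum_{d=1..D} W[d]^T v_{k-d} + W0^T v_k.
    W is a list indexed d = 1..D (W[0] is unused, kept for readability);
    v_hist[d-1] holds v_{k-d}."""
    return sum(W[d].T @ v_hist[d - 1] for d in range(1, len(W))) + W0.T @ v_now

# Generic DRNN weights, then the standard-RNN special case:
# W0 = 0 and W_d = 0 for all d != 1 (cf. Williams & Zipser, 1989).
W = [None] + [rng.standard_normal((N, N)) for _ in range(D)]
W_rnn = [None, W[1]] + [np.zeros((N, N)) for _ in range(D - 1)]
v_hist = [rng.standard_normal(N) for _ in range(D)]
v_now = rng.standard_normal(N)

u_special = drnn_drive(W_rnn, np.zeros((N, N)), v_hist, v_now)
u_standard = W[1].T @ v_hist[0]          # conventional RNN internal input
```

With the nondelayed and higher-delay matrices zeroed, the DRNN drive coincides with the conventional single-delay recurrent drive, as the classification in the text states.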
3 Error-Derivative Calculation

The introduction of time delays leads to a somewhat more complicated training procedure because the current state of a unit may depend on several events having occurred in the neighboring neurons at different times. (See Table 2 for the notations used in the calculations.) Classically, the goal is to find a procedure to adjust the weight components $w_{dij}$ so that a given fixed initial condition $v_0$ and a set of input vectors $\{i_k\}$ taken from the training set result in a set of fixed points (the existence of which is assumed), whose components along the output units have a desired set of values $\{d_k\}$. This will be accomplished by our minimizing a quadratic function $E = \frac{1}{2}\sum_{k=0}^{k_f} e_k^T e_k$, which measures the errors $e_k = d_k - v_k$ between the desired and the actual fixed-point values along an epoch (each pass through the training set is called an epoch) starting at iteration 0 up to $k_f$. In an epochwise adaptation, a new weight is
Table 2: Notations for Error Derivative Calculations.

  $d_k$: Target vector at iteration k
  $e_k = d_k - v_k$: Error vector at iteration k
  $E_k = e_k^T e_k / 2$: (Scalar) cost at iteration k
  $y_k = \left(\frac{\partial E}{\partial v_k}\right)^T \left[I - G'_k W_{0k}\right]^{-1}$: So-called error backflow vector at iteration k
  $\delta_k = G'_k y_k$: Delta vector at iteration k
  $y_k^{k+n} = \left(\frac{\partial E_{k+n}}{\partial v_k}\right)^T \left[I - G'_k W_{0k}\right]^{-1}$: Error backflow vector at iteration k due to error $e_{k+n}$ only
  $\delta_k^{k+n} = G'_k y_k^{k+n}$: Delta vector at iteration k due to error $e_{k+n}$ only
  $\mu$: Defined as $\max_h (g'(h))$
generated at the end of the epoch by $\Delta w = -\eta\, \partial E/\partial w$, where $\eta$ is the learning rate. Without loss of generality, we adopt the ordinary gradient-descent method and assume $\eta$ is constant over an epoch. The primary goal is to find an expression for the term $\partial E/\partial w$. In order to clarify our notation, let $G_k$ and $G'_k$ denote the $N \times N$ diagonal matrices constructed from $g(u_k(j))$ and $g'(u_k(j))$, respectively, for $j = 1, \ldots, N$. Note that for sigmoid functions, $G'_k = \beta G_k (I - G_k)$ (recall that $g'(h) = \beta g(h)(1 - g(h))$). The following generalized δ rule for DRNN networks provides a backward-difference equation in order to compute recursively the error gradient with respect to the weights.

Theorem 1. If the DRNN model is convergent over the epoch $0 \le k \le k_f$, then the gradient of the cost $E = \frac{1}{2}\sum_{k=0}^{k_f} e_k^T e_k$ can be computed at time step $k_f$ by

$$\frac{\partial E}{\partial W_d} = \sum_{k=d}^{k_f} \delta_k v_{k-d}^T, \quad \forall d \le D, \tag{3.1}$$
where the sequence of vectors $\delta_k$, for $k = d, \ldots, k_f$, is given by $\delta_k = G'_k y_k$, and the sequence of vectors $y_k$ is the solution of the following backward difference equation:

$$y_k = [I - W_{0k} G'_k]^{-1} \left[ e_k + \sum_{j=1}^{D} W_{j,k+j}\, G'_{k+j}\, y_{k+j} \right] \tag{3.2}$$

with the boundary conditions $y_{k_f} = [I - G'_{k_f} W_{0k_f}]^{-1} e_{k_f}$ and $y_j = 0$ for $j > k_f$.
with the boundary conditions: y kf D [I ¡ G 0kf W 0kf ] ¡1 e kf and yj D 0 for j > kf ¡ D. The demonstration of this proposition has been given elsewhere (Aussem et al., 1995). This generalized @ rule puts recurrent backprop (Pineda, 1987),
Figure 1: A simple DRNN model (left) and the corresponding unfolded representation (right).
temporal backprop (Wan, 1993), and backprop-through-time (BPTT) (Werbos, 1974; Rumelhart et al., 1986) into a single framework. Like BPTT in its general form, the procedure suggested here makes necessary the artifact of unfolding the error-propagation network (see Figure 1) in a reverse way in time (Rumelhart et al., 1986). Typically, all the individual weight adjustments resulting from the duplication of the connections in the unfolded error-propagation network must be recombined to compute the overall change. This is achieved by summing the individual gradient components over the overall unfolding interval.

4 Conditions for Backprop Convergence
Classically, the BPTT-like procedure proposed in theorem 1 requires the duplication of the connections in the unfolded error-propagation network and the propagation of the vectors $y_k$ backward in time over $k_f, k_f - 1, \ldots, 1$ to compute equation 3.1. However, because the storage and memory requirements grow proportionally with $k_f$, the procedure quickly becomes prohibitive from an implementation standpoint when long epochs are to be considered. This is also the major flaw of BPTT. Fortunately, we will show here that the cost due to the duplication of the hardware can be maintained within acceptable limits while computing the error gradient as precisely as desired.
Define $y_k^{k+n}$ as the solution of equation 3.2 at time k when all $e_{k+j}$ are set to 0 except for $j = n$. Let $\delta_k^{k+n} = G'_k y_k^{k+n}$. We can now expand the delta rule, equation 3.1, by incorporating the terms $y_k^{k+n}$ and $\delta_k^{k+n}$:

$$\Delta W_d = \eta \sum_{k=d}^{k_f} \delta_k v_{k-d}^T = \eta \sum_{n=d}^{k_f} \sum_{k=d}^{n} \delta_k^n v_{k-d}^T. \tag{4.1}$$
In batch mode, the question we are faced with is whether the sum is diverging or remains bounded as $k_f \to \infty$. Answering this question in the general framework is a formidable task since the weight evolution is strongly dependent on the data set itself. In practice, the weights may blow up if no regularizer is used, only to be limited by their saturation limits. In fact, it seems that no general answers can be formulated. In on-line mode or pattern mode, the weights are adjusted after each pattern presentation. At iteration n, the weights are adjusted by summing the individual components $\delta_k^n v_{k-d}^T$, where the index k runs over all the prior time lags in the unfolded network, that is, $\sum_{k=d}^{n} \delta_k^n v_{k-d}^T$. Therefore, the question we are faced with is whether this sum is diverging or remains bounded as $n \to \infty$. Clearly, for the sum to converge, not only shall $\delta_{n-j}^n$ tend to zero with respect to j, but the decay rate must be large enough. More explicitly:

Definition 1. The gradient-based training procedure is said to be convergent at each intermediate time step on the epoch if $\lim_{n\to\infty} \sum_{k=d}^{n} \delta_k^n v_{k-d}^T$ exists.
We therefore restrict our attention to the so-called error backflow, which is formally defined below:

Definition 2. The error backflow, due to error $E_n$, is the sequence of vectors $\delta_k^n$, for $k = n, n-1, \ldots, 1$, used in equation 4.1 to adjust the weights.
Given an error $E_n$ at time n, the error backflow for that error can intuitively be viewed as the sequence of vectors that carry all the information necessary to adjust the past weights due to that error. With the aim of establishing sufficient conditions for the sum to converge, we now state the fundamental result of this article:

Theorem 2. Let the instantaneous scalar cost be $E_n = \frac{1}{2} e_n^T e_n$ and consider the error vector $y_{n-k}^{n\,T} = \frac{\partial E_n}{\partial v_{n-k}} [I - G'_{n-k} W_{0,n-k}]^{-1}$. Let $k_f > 0$ denote the epoch length
of the system and $\mu = \max(g'(h))$. Let

$$a_d = \max_k \left( \frac{\mu \|W_{d,k+d}\|}{1 - \mu \|W_{0k}\|} \right), \quad 1 \le d \le D. \tag{4.2}$$

If $\rho = \sum_{d=1}^{D} a_d < 1$, then $\|y_{k_f-(kD+j)}^{k_f}\| < \rho^{k+1} \|y_{k_f}^{k_f}\|$, $\forall k \ge 0$, $\forall j = 1, \ldots, D$.
Proof. This technical result is deferred to the appendix.
The theorem provides a sufficient condition for the convergence of the training procedure. It is worth remarking that the exponential decay of $y_{k_f-n}^{k_f}$ is equivalent to the exponential decay of the error backflow $\delta_{k_f-n}^{k_f}$ (recall that $\delta_k^{k+n} = G'_k y_k^{k+n}$ and $G_k'^{-1}$ always exists) and is also equivalent to the exponential decay of the gradient with respect to the past unit activation, $\partial E_{k_f} / \partial v_{k_f-n}$. The theorem says, in essence, that the terms $y_{k_f-n}^{k_f}$ (as well as $\delta_{k_f-n}^{k_f}$ and $\partial E_{k_f} / \partial v_{k_f-n}$) are bounded above by a series decaying at exponential rate $\rho$ with respect to n, provided $\rho < 1$; hence the definition:

Definition 3. $\rho = \sum_{d=1}^{D} \max_k \left( \frac{\mu \|W_{d,k+d}\|}{1 - \mu \|W_{0k}\|} \right)$ will be called the error backflow exponential decay rate or, more simply, the gradient decay rate.

The analytical error backflow exponential decay rate, $\rho$, bounds above the true error backflow decay rate. In batch mode (i.e., when the weights are adjusted after each pass through the training set), $\rho < 1$ translates into a very appealing condition:

Corollary 1. In batch mode, $\sum_{d=0}^{D} \|W_d\| < 1/\mu$ is a sufficient condition for the convergence of the training procedure.
The above inequality sets a penalty on large weights, regardless of delay and connectivity, to ensure the error backflow decay. Typically, when the network admits fixed points at each time step, the residual backpropagated errors drop off exponentially through the past time slices, rapidly becoming negligible; hence the forgetting behavior of (standard) recurrent neural networks. The gradient decay rate, $\rho$, can then be written as

$$\rho = \sum_{d=1}^{D} \frac{\mu \|W_d\|}{1 - \mu \|W_0\|}. \tag{4.3}$$
More specifically, with compatible norms $\|A\|_1 = \max_j \sum_i |w_{ij}|$, $\|A\|_\infty = \max_i \sum_j |w_{ij}|$, and $\|A\|_F = \left[\sum_{i,j} |w_{ij}|^2\right]^{1/2}$, we obtain three easily interpreted sufficient conditions to ensure training convergence:

$$\begin{cases} \mu \max_j \left( \sum_{d=0}^{D} \sum_i |w_{ij}^d| \right) < 1 \\ \mu \max_i \left( \sum_{d=0}^{D} \sum_j |w_{ij}^d| \right) < 1 \\ \mu \left[ \sum_{d=0}^{D} \sum_{i,j} |w_{ij}^d|^2 \right]^{1/2} < 1. \end{cases} \tag{4.4}$$

These conditions are satisfied, for instance, if all weights are in the range $\pm \frac{1}{\mu N (D+1)}$. Note that neither condition implies the others.
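The decay rate of equation 4.3 and the condition of corollary 1 are straightforward to evaluate for given weight matrices. A minimal sketch, assuming time-invariant weights, the logistic-sigmoid bound $\mu = 1/4$, and the spectral norm (any compatible matrix norm would do); function names are illustrative:

```python
import numpy as np

def gradient_decay_rate(W_list, W0, mu=0.25):
    """Gradient decay rate of equation 4.3 (batch mode, time-invariant
    weights): rho = sum_d mu*||W_d|| / (1 - mu*||W_0||).
    mu = 1/4 bounds g'(h) for the logistic sigmoid with beta = 1."""
    n0 = np.linalg.norm(W0, 2)                       # spectral norm of W_0
    assert mu * n0 < 1.0, "need mu*||W_0|| < 1"
    return sum(mu * np.linalg.norm(Wd, 2) for Wd in W_list) / (1.0 - mu * n0)

def corollary1_holds(W_list, W0, mu=0.25):
    """Sufficient condition of corollary 1: sum_{d=0..D} ||W_d|| < 1/mu."""
    total = np.linalg.norm(W0, 2) + sum(np.linalg.norm(Wd, 2) for Wd in W_list)
    return total < 1.0 / mu
```

For example, with $W_0 = 0$ and a single delayed matrix of norm 0.5, the decay rate is $0.25 \cdot 0.5 = 0.125$, comfortably below 1.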
4.1 Related Results. The upper bound is in nice agreement with the results of Hochreiter and Schmidhuber (1997) for standard recurrent networks (Williams & Zipser, 1989). For example, if N is the matrix size, they find that setting $|w_{ij}| < w_{max} < 4/N$ for all i, j, and $\beta = 1$, will result in the exponential gradient decay with a decay rate $\tau = N w_{max}/4 < 1$. Clearly, setting $W_{0k} = 0$ and $D = 1$ transforms the DRNN into a standard recurrent network, with a single delay fixed to 1. Notice that for sigmoid transfer functions, $G'_k$ is by definition an $N \times N$ diagonal matrix constructed from $g'(h) = \beta g(h)(1 - g(h))$, so that $g'(h) < \beta/4$. Hence, $\mu = 1/4$. Consequently, $\rho < 1$ reduces to a somewhat weaker condition, $\|W_{1k}\| < 4$ for all k ($\beta = 1$), since one or several weights may exceed $w_{max}$. Conversely, it is readily seen that $\|W_{1k}\| < 4$ is satisfied if $|w_{ij}| < 4/N$, since $\|W\| < N w_{max}$. Similarly, Frasconi et al.'s (1992) forgetting behavior condition for local feedback networks, $|w_{jj}| < 1/d = 4$ where $d = \max_h(g'(h))$, is also a particular case of theorem 2. Interestingly, the existence of $[I - G'_k W_{0k}]^{-1}$, and thereby the stability of the fixed points, is also guaranteed. It suffices to set $D = 0$; theorem 2 yields $\|W_{0k}\| \cdot \|G'_k\| < 1$, which clearly is a sufficient condition for the existence of $[I - G'_k W_{0k}]^{-1}$.
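The agreement with the Hochreiter-Schmidhuber bound can be checked numerically. A sketch for a standard sigmoid RNN ($W_0 = 0$, $D = 1$) with $|w_{ij}| < 4/N$: the norm of the accumulated backpropagated Jacobian should fall below $(\mu\|W\|)^k$. Network size, sequence length, and the sampling of W are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

N, steps = 5, 30
w_max = 4.0 / N
W = rng.uniform(-0.9 * w_max, 0.9 * w_max, size=(N, N))   # |w_ij| < 4/N

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

v = rng.uniform(0.0, 1.0, N)
J = np.eye(N)                                 # accumulated backflow Jacobian
norms = []
for _ in range(steps):
    u = W.T @ v
    v = sigmoid(u)
    G_prime = np.diag(v * (1.0 - v))          # g'(h) = g(h)(1 - g(h)), beta = 1
    J = G_prime @ W.T @ J                     # one step of error backflow
    norms.append(np.linalg.norm(J, 2))

rho = 0.25 * np.linalg.norm(W, 2)             # mu*||W||, with mu = 1/4
```

Since $\|G'_k\|\le 1/4$ and the norm is submultiplicative, each step multiplies the Jacobian norm by at most $\mu\|W\|$, which is the decay rate the text derives.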
4.2 Link to Regularization. Interestingly, the term $\sum_{d=0}^{D} \|W_d\|$ in corollary 1 reminds us of a specific regularizing term usually added to the cost function to penalize nonsmooth functions. The links between regularization and Bayesian inference are well established. It can easily be shown that in the gaussian case, minimizing the regularized cost is equivalent to maximizing
the posterior solution, assuming the following prior on the weights,

$$P(w \mid \alpha) \propto \prod_{d=0}^{D} \exp(-\alpha \|W_d\|), \tag{4.5}$$

where the hyperparameter $\alpha$ parameterizes the distribution. Therefore, the training convergence and the statistical analysis of recurrent neural networks concur, by very different considerations, in the observation that small weights should be privileged. Formal regularization stabilizes the training procedure by indirectly enforcing the error backflow decay, whatever prior is used (e.g., gaussian prior, Laplace prior; Goutte, 1997).

5 Tailoring the Number of Backpropagations
In this section, we derive an upper bound for the BPTT truncation length that guarantees a given precision on the weight updates, in order to reduce the cost due to the duplication of the hardware. As $D$ is the maximum synaptic delay, we assume that the error propagation network, at iteration $n$, is unfolded $sD$ times, where $s$ is some positive integer. Therefore, $\Delta \widehat{W}_n^d(s) = \eta \sum_{k=n-sD}^{n} \delta_k^n v_{k-d}^T$ is the estimate of the true weight matrix update $\Delta W_n^d = \eta \sum_{k=d}^{n} \delta_k^n v_{k-d}^T$ at time $n > sD + d$. We are seeking a condition on the network parameters such that $\|\Delta \widehat{W}_n^d(s) - \Delta W_n^d\| < \epsilon$, for all $\epsilon > 0$. The following theorem expresses $s$ as a function of $\epsilon$, the required precision on the weight updates:

Theorem 3. In batch mode, it suffices, at iteration $n$, to allocate memory to accommodate a maximum of $sD$ backpropagations, where $s$ is the smallest integer satisfying
$$s > \frac{1}{\ln \rho}\, \ln\!\left[ \frac{\epsilon\,(1-\rho)\,(1-\mu\|W^0\|)}{\eta\,\mu\,N D\,\|e_n\|} \right] \qquad (5.1)$$
to guarantee that the weights are updated with precision $\epsilon$.
Proof. Let $n$ be the current iteration. Let $\Delta \widehat{W}_n^d(s) = \eta \sum_{k=n-sD}^{n} \delta_k^n v_{k-d}^T$, for $n > sD + d$, denote the estimate of the weight matrix update at time $n$ obtained after $sD$ duplications of the network, so that $\Delta \widehat{W}_{sD+d}^d(s) = \Delta W_{sD+d}^d$. According to equation 4.1, we have, $\forall d \le D$, $\forall s > 0$,
$$\|\Delta \widehat{W}_n^d(s) - \Delta W_n^d\| = \eta \left\| \sum_{k=d}^{n-sD-1} \delta_k^n v_{k-d}^T \right\|.$$
For $n > sD + d$, we may write
$$\left\| \sum_{k=d}^{n-sD-1} \delta_k^n v_{k-d}^T \right\| \le \sum_{k=d}^{n-sD-1} \|G_k'\|\,\|y_k^n\|\,\|v_{k-d}\| \le \mu \sum_{k=d}^{n-sD-1} \|y_k^n\|\,\|v_{k-d}\| \le \mu N \sum_{k=d}^{n-sD-1} \|y_k^n\| \le \mu N D\,\|y_n^n\|\, \frac{\rho^{s+1}}{1-\rho}, \qquad (5.2)$$
where the inequalities stem from theorem 1 and from $\|v_{k-d}\|_p < N$ for $p = 1, 2, \infty$. Theorem 3 is readily obtained from the above inequality and from the inequality
$$\|y_n^n\| \le \left\| [I - W_n^0 G_n']^{-1} \right\| \|e_n\| \le \frac{\|e_n\|}{1 - \mu \|W^0\|}, \qquad (5.3)$$
completing the proof.

Note furthermore that $\|W^0\| < N w_{\max}$; the dependence on $n$ (i.e., the instant at which the gradient is backpropagated) can be removed by considering that
$$\|e_n\| \le N_{out} \cdot \max_j \left( e_n(j) \right) \le N_{out}, \qquad (5.4)$$
where $N_{out}$ is the number of output units.

Remark. Formula 5.1 provides an implementation-friendly stopping criterion. As far as memory allocation is concerned, the upper bound for $s$ ensures that the bookkeeping always fits the time interval spanned by the error backflow. It is worth remarking that $s$ varies as a function of $\ln(1-\rho)/\ln\rho$ (see Figure 2) and thus diverges swiftly for values of $\rho > 0.9$: $s \to \infty$ as $\rho \to 1$, and $s \to 1$ as $\rho \to 0$. Note also that the number of backpropagations needed to guarantee a given precision on the weight updates varies as a function of $\ln ND$, provided $\rho$ and $\|W^0\|$ are constant.
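The divergence of this scaling can be made concrete by tabulating $\ln(1-\rho)/\ln\rho$ directly (a small illustrative computation added here, not from the paper):

```python
import math

# s scales as ln(1 - rho) / ln(rho): mild for small rho, divergent as rho -> 1.
def truncation_scale(rho):
    return math.log(1.0 - rho) / math.log(rho)

for rho in (0.1, 0.5, 0.9, 0.99):
    print(f"rho = {rho}: ln(1 - rho)/ln(rho) = {truncation_scale(rho):.2f}")
```

The scale stays near 1 up to $\rho = 0.5$ but grows past 20 at $\rho = 0.9$ and past 450 at $\rho = 0.99$, which is the swift divergence noted above.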
5.1 Numerical Applications. For purposes of illustration, numerical applications are given for standard recurrent neural networks that are special cases of DRNN.
Figure 2: Typical increase of the number of backpropagations in time (i.e., $sD$), given by corollary 2, to guarantee a given precision error on the weight updates, as a function of $\rho$. See equation 5.1.
5.1.1 Williams and Zipser Architecture. Consider first the recurrent network studied by Williams and Zipser (1989), obtained by setting $W^0 = 0$ and $D = 1$ (all synapses are delayed by one unit of time). Usual values such as $N = 10$, $D = 1$, $\mu = 1/4$, $\eta = 10^{-2}$, $\epsilon = 10^{-6}$, and $\rho = 10^{-1}$, with a single output, yield a sensible value for $s$ equal to 5 by formula 5.1. In other words, the absolute error on the weight update at each time step during the trajectory is guaranteed to remain below $10^{-6}$ (i.e., $\|\Delta \widehat{W}_n^d(s) - \Delta W_n^d\| < 10^{-6}$) after only five replications of the network backward in time. If we increase $\rho$'s value to 0.5, we obtain $s = 17$, and for $\rho = 0.9$, we obtain $s = 128$.

5.1.2 NARMAX Networks. These feedback networks consist of a feedforward (static) network with a single output $z(t)$ and additional state inputs that are delayed values of the output, $z(t-1), \ldots, z(t-D)$. NARMAX networks can be seen as particular DRNN where delayed synapses ($d = 1, \ldots, D$) connect the output unit to the hidden units. Values such as $N = 10$, $D = 5$, $\mu = 1/4$, $\eta = 10^{-2}$, $\epsilon = 10^{-6}$, $\|W^0\| = 1$, and $\rho = 10^{-1}$ yield a value for $s$ equal to 6. For $\rho = 0.5$ and $\rho = 0.9$, we obtain $s = 19$ and $s = 144$, respectively.
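These numbers can be approximately reproduced by evaluating formula 5.1 directly. The sketch below is an illustrative reading of the formula (added here, not the author's code), taking $\|e_n\| \le N_{out}$ as in equation 5.4; the resulting integers match the $\rho = 0.1$ case exactly and may differ by a few units elsewhere, depending on rounding conventions and the bound chosen for $\|e_n\|$.

```python
import math

def truncation_length(eps, rho, mu, eta, N, D, W0_norm, e_norm):
    """Smallest integer s satisfying inequality 5.1 (as read here)."""
    arg = eps * (1.0 - rho) * (1.0 - mu * W0_norm) / (eta * mu * N * D * e_norm)
    # smallest integer strictly greater than ln(arg)/ln(rho)
    return math.floor(math.log(arg) / math.log(rho)) + 1

# Williams-Zipser example: N = 10, D = 1, mu = 1/4, eta = 1e-2, eps = 1e-6,
# a single output (||e_n|| <= N_out = 1), and W^0 = 0.
for rho in (0.1, 0.5, 0.9):
    print("rho =", rho, "-> s =",
          truncation_length(1e-6, rho, 0.25, 1e-2, 10, 1, 0.0, 1.0))
```

Whatever the rounding convention, the qualitative behavior is the same: $s$ grows dramatically as $\rho$ approaches 1.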
5.1.3 DRNN. Consider now a fully recurrent DRNN of size $N = 100$, $D = 1$, with 10 outputs ($N_{out} = 10$), with a (higher) precision of $\epsilon = 10^{-8}$ and $\mu = 1/4$, $\eta = 10^{-2}$, $\|W^0\| = 1$. For $\rho = 10^{-1}$, we obtain $s = 10$. For $\rho = 0.5$ and $\rho = 0.9$, we obtain $s = 32$ and $s = 233$, respectively. As expected, an increase of $\rho$ entails a dramatic increase of $s$ in all cases, making corollary 2 useless in some cases. Therefore, the validity and the usefulness of the error backflow exponential decay rate, $\rho$, will be experimentally confirmed in section 6.

5.2 Complexity. In the light of the discussion above, one may naturally take advantage of the exponential decay of the gradient to stop the gradient backpropagation after $sD$ relaxations of the error propagation network. This leads to an efficient parallel implementation requiring reasonable computational resources and obeying predefined precision requirements on the weights. Provided $\rho$ is constant, $s$ is of order $\ln(DN)$. Storage of order $N \ln(DN)$ is needed to keep track of the network's previous states and $DN^2$ to store the filter parameters. The time required at each backward pass to compute the elementary weight update scales as $O(DN^2)$, so that $O(DN^2 \ln(DN))$ operations are needed at each time step to adjust the network parameters. In contrast, a forward propagation algorithm entails storage capacity of $O(DN^3)$, and the computational requirements scale¹ as $O(D^2 N^4)$. Finally, the latter requires a matrix inversion to compute the gradient, with the result that locality is lost. It is essential to note, however, that this procedure essentially gives up on the long time lag problem by optimally cutting off the gradient, thus reducing the computational effort. Therefore, this training procedure is sure to fail if the time gap between an event and the occurrence of the corresponding error signal spans a length of time significantly greater than the maximum delay order present in the DRNN structure.
In this case, one has to resort to more sophisticated architectures or data processing techniques, as discussed in section 7.

6 Experimental Simulation
The analytical findings we arrived at provide valuable theoretical insight into the error backflow behavior and the forgetting behavior of recurrent networks with delays. As expected, extensive simulations carried out in Aussem (1999) with a family of DRNN models clearly confirm that the
1. Note that BPTT and RTRL can be merged to reduce the average time complexity per time step of the forward propagation algorithm to $O(D^2 N^3)$ instead of $O(D^2 N^4)$ while still calculating the same gradient (Schmidhuber, 1992).
Figure 3: Truncation length error rate definition. Let $Y(t-k) = \|y_{t-k}^{t}\|$. $k_1 = s_1 D$ is the smallest number of backpropagations in time such that $Y(t-k_1) < \epsilon$, where $\epsilon$ is some small value. $k_2 = s_2 D$, where $s_2$ is the smallest integer satisfying inequality 5.1. A truncation length error rate of 2 means that the error backflow was propagated twice as far as necessary into the unfolded error network.
terms $\|y_{n-j}^{n}\|$, plotted in log scale, drop off linearly from the first time lag onward, mostly oscillating slightly along a line through $-1$. Still, experimental simulations are necessary to confirm the validity and, more important, the tightness of the bound, $\rho$, on the observed error backflow decay rate. Therefore, we evaluate in this section the accuracy of the bound as the size of the network (i.e., $N$ and $D$) is varied. For this purpose, we introduce a practical measure of the tightness of the bound, $\rho$, called the truncation length error rate (see Figure 3). Let $\epsilon$ be some limiting value (e.g., $10^{-8}$) for the residual gradient $\|y_{n-j}^{n}\|$. Call $k_1$ the (true) smallest number of backpropagations in time such that $\|y_{n-j}^{n}\| < \epsilon$. Call $k_2 = s_2 D$, where $s_2$ is the smallest integer satisfying inequality 5.1. The truncation length error rate is defined as $(k_2 - k_1)/k_1$—in other words, a truncation length error rate of 2 means that the error backflow was propagated twice as far as necessary into the unfolded error network.

6.1 Experimental Set-up. For simplicity, a simple $1 \times N \times 1$ DRNN model was used, where $N$, the number of hidden units, was varied. The single
input is connected to all hidden units with nondelayed links ($d = 0$). The $N$ hidden units are fully connected with connections of delay $d = 1, \ldots, D$, where $D$, the filter order, was varied. All hidden units were connected to the single output with nondelayed links. As $W^0$ is triangular, a fixed point exists at each iteration. Such an architecture forces the DRNN model to encode the input sequence internally, as no spatial memory (i.e., time-lagged window) is provided. Initial weight values of the networks were chosen randomly between $-\xi$ and $+\xi$, with $\xi = 10^{-2}$, such that $\sum_{d=0}^{D} \|W^d\| < 1/\mu$ to ensure convergence of the training (see corollary 1). We have chosen the Frobenius matrix norm, which is compatible with the Euclidean vector norm. One hundred random input-target pairs were generated. Each time, the DRNN output was calculated for that particular input. Given the target, the error gradient was backpropagated into the unfolded network until it was found negligible ($< \epsilon$), and the observed decay rate was estimated. In practice, the larger the weights, the larger the decay rate. Also, in order to cover all possible decay rates, the procedure was repeated with, each time, the value of $\xi$ incremented by $10^{-2}$. Therefore, the decay rate averaged over 100 trials was observed to be larger each time. The procedure was stopped once the observed exponential rate became almost equal to 1.

6.2 Results. Several experiments were conducted with $N = 1$, 5, and 10 and $D = 1$, 5, and 10 (see Figure 4). The error rates are shown to be very similar to the theoretical truncation length plotted in Figure 2. It appears that the analytical convergence rate, $\rho$, is an accurate approximation of the true observed rate as long as $\rho$ is small. The situation, however, degrades as the size of the network increases. As expected, the error rate swiftly increases in all cases as $\rho$ approaches 1.
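The truncation length error rate can be illustrated with a small synthetic computation (a simplified sketch added here, not the paper's experiment: the observed residual gradient is modeled as an exact geometric decay at a rate $r_{obs}$, while the truncation is prescribed by a looser bound rate $\rho \ge r_{obs}$):

```python
# Synthetic illustration of the truncation length error rate (assumed setup,
# not the paper's experimental procedure).
eps = 1e-6
r_obs, rho = 0.3, 0.5       # observed decay rate vs. looser analytical bound

def first_below(rate, eps):
    """Smallest k such that rate**k < eps (number of backpropagations)."""
    k, y = 0, 1.0
    while y >= eps:
        y *= rate
        k += 1
    return k

k1 = first_below(r_obs, eps)        # true number of backpropagations needed
k2 = first_below(rho, eps)          # number prescribed by the bound rate
error_rate = (k2 - k1) / k1         # truncation length error rate
print(f"k1 = {k1}, k2 = {k2}, error rate = {error_rate:.2f}")
```

The looser the bound relative to the observed rate, the further the error backflow is propagated beyond what is strictly necessary, which is what Figures 4 and 5 quantify.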
Still, acceptable errors are obtained for values of $\rho$ below 0.5: the error rate on the truncation length is less than 5. For completeness, we have plotted in Figure 5 the truncation length error rate for (1) a typical Williams and Zipser architecture with $N = 10$ fully connected units, a single input, and a single output, and (2) a typical (feedforward) NARMAX network with $N = 10$ hidden units, a single exogenous input, a single output $z(t)$, and $D = 5$ additional inputs that are delayed values of the output, $z(t-1), \ldots, z(t-5)$. Here again, the error rate swiftly increases as $\rho$ approaches 1. It seems that as the product $ND$ is increased, the error rate tends to increase more rapidly.

7 Discussion
So far, we have focused on the analysis of the vanishing gradient. We now briefly point out a few solutions proposed in the literature to alleviate this problem.
Figure 4: Truncation length error rate versus the observed decay rate, plotted for different DRNN architectures. $N$ is the number of hidden neurons, and $D$ is the maximum synaptic delay. The hidden layer is fully connected to itself with connections of delay $d = 1, \ldots, D$.
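The observed decay rate on the abscissa of Figures 4 and 5 can be estimated from the residual gradient norms; the estimator sketched below (a least-squares line in log scale) is an assumption of this presentation, since the text does not spell out its estimator, but it is the natural choice given that the norms drop off linearly in log scale.

```python
import math

# Assumed estimator: if ||y^n_{n-k}|| ~ C * r**k, the slope of
# log ||y|| against the lag k is log r; fit it by least squares.
def observed_decay_rate(norms):
    logs = [math.log(v) for v in norms]
    k_mean = (len(norms) - 1) / 2.0
    l_mean = sum(logs) / len(logs)
    num = sum((k - k_mean) * (l - l_mean) for k, l in enumerate(logs))
    den = sum((k - k_mean) ** 2 for k in range(len(logs)))
    return math.exp(num / den)

# Synthetic gradient norms decaying at rate 0.8:
norms = [2.0 * 0.8 ** k for k in range(25)]
print(observed_decay_rate(norms))
```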
Figure 5: Truncation length error rate versus the observed decay rate. (Top) A fully connected Williams and Zipser architecture with 10 units. (Bottom) A NARMAX (feedforward) architecture with 10 hidden units and 5 additional inputs that are delayed values of the output.
DRNN-like networks fail to learn anything within reasonable time in the presence of long time lags between relevant inputs and target events because the gradient vanishes at an exponential rate. This is not new; various experiments that validate claims regarding the limitations in learning long-term dependencies can be found—for instance, in Aussem et al. (1995) and Hochreiter and Schmidhuber (1997)—and a further understanding of this problem is developed in Bengio et al. (1994) and Lin et al. (1996). While large time delays also lead to a vanishing gradient in the case of long-term dependencies, the effect is weakened by a factor of $1/D$. This property is exploited in networks with a single adaptable time delay associated with each link, as discussed in Duro and Santos Reyes (1999),
which extends the traditional backprop algorithm to this purpose. Constructive heuristics for carefully adding new delayed connections to a given architecture are proposed in Boné, Cruciani, Verley, and Asselin de Beauville (2000). Another possibility is to decompose the input sequence into varying scales of temporal resolution via a wavelet transform, each of which holds information on a specific timescale, so that faint temporal structures may be detectable at different resolution levels (see, e.g., Aussem & Murtagh, 1997). A specific architecture that operates on multiple scales was also proposed by El Hihi and Bengio (1996). Nevertheless, these approaches generally fail if precise real values are to be stored over periods potentially spanning hundreds of time steps. To remedy this difficulty, Hochreiter and Schmidhuber (1997) and Gers, Schmidhuber, and Cummins (2000) have proposed a promising architecture, called long short-term memory (LSTM), based on high-order gating units capable of bridging arbitrarily long time intervals. This desirable feature arises from the inclusion of specific so-called memory cells, aimed at maintaining a constant error flow within the cell. In contrast to DRNN-like networks, error signals trapped within a memory cell do not change until they are released by the input gate. This is achieved by using input and output multiplicative gate units to open and close access to the error flow through the internal states of the memory cells. Besides, a local self-recurrent connection with a nontrainable weight of 1.0 ensures that the error flow remains constant within the cell. In other words, this can be seen as a way to enforce $\rho = 1$ and stay at the cutting edge between exponential explosion and exponential decay. As a result, LSTM can extract and store information over potentially unbounded periods.
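The constant error carousel can be illustrated with a minimal sketch (a toy reduction assumed for this discussion, not the full LSTM architecture): with the self-recurrent weight fixed at 1.0, the derivative of the cell state with respect to an earlier cell state is exactly 1, however long the interval, whereas a decay rate $\rho < 1$ drives the same quantity to zero.

```python
# Toy constant error carousel (illustrative reduction, not full LSTM):
# with c_t = 1.0 * c_{t-1} + (gated input), the chain rule gives
# dc_T/dc_0 = 1.0**T, so the error flow neither vanishes nor explodes.
T = 500
self_weight = 1.0          # nontrainable self-recurrent connection
grad = 1.0
for _ in range(T):
    grad *= self_weight    # chain rule through the cell state
print(grad)                # stays exactly 1.0 after 500 steps

# Compare with a DRNN-like decay rate rho < 1:
rho = 0.9
decayed = rho ** T         # vanishes after the same 500 steps
print(decayed)
```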
On the other hand, the appealing storage capability may come at the expense of a loss of sensitivity to short-term information, since the additional multiplicative gate units—and thus the additional connections—exclusively devoted to maintaining constant gradient flow become unnecessary. Yet this is also true for DRNN-like networks, when the synaptic delay orders overshoot the typical time length spanned by the useful information. In either case, the use of sophisticated units and/or extra delayed connections has a side effect arising from having to use relatively large nets, which can undermine the success of the model. There is still much room for work to overcome the problem of the vanishing gradient. While several promising approaches have been proposed to overcome this limitation, whether by enforcing constant error flow within sophisticated units, using adaptable delays, or decomposing the input sequence into varying temporal scales, no method is a priori superior to the others in all situations. As many authors have pointed out, it is essential to experiment with a variety of techniques to determine which is best suited to the specific task at hand.
8 Conclusion
In this article, we have derived an upper bound on the norm of the weight matrix to ensure the convergence of the error backflow in a fully recurrent architecture that encompasses a large variety of synaptic delay–based discrete-time neural models proposed in the literature. An upper bound for the BPTT truncation length that guarantees a given precision on the weight updates was also derived. We have experimentally shown that the truncation length is acceptable as long as the error gradient decay rate is small.

Appendix: Asymptotic Behavior of the Backpropagated Error Terms
In this appendix, theorem 1 is demonstrated.

The behavior of $y_k^{k_f}$ can be inferred from the backward recursive equations 3.2. It suffices to set $e_j = 0$ for $j = 0, 1, \ldots, k_f - 1$. This yields the relation
$$y_k^{k_f} = [I - W_k^0 G_k']^{-1} \sum_{d=1}^{D} W_{k+d}^d G_{k+d}'\, y_{k+d}^{k_f} \qquad (A.1)$$
for $k < k_f$, with the following boundary condition: $y_{k_f}^{k_f} = [I - W_{k_f}^0 G_{k_f}']^{-1} e_{k_f}$. The vectorial equation, A.1, governs the dynamics of the error terms. Let $\mu = \max_h(g'(h))$, so that $\|G_k'\| < \mu$, $\forall k$. Let us assume that $\|W_k^0\| < 1/\mu$ for $k = 0, \ldots, k_f$, so that the convergence of $I + W_k^0 G_k' + (W_k^0 G_k')^2 + \cdots$ is assured. (Notice at this point that $\|\cdot\|$ refers to any matrix and vector norms compatible with each other, i.e., $\|A\|_p = \sup_{x \ne 0} \|Ax\|_p / \|x\|_p$, $p = 1, 2, \infty$.) Hence, for any iteration $k = 0, \ldots, k_f - 1$, we have
$$\|y_k^{k_f}\| \le \left\| [I - W_k^0 G_k']^{-1} \right\| \sum_{d=1}^{D} \|W_{k+d}^d\|\, \|G_{k+d}'\|\, \|y_{k+d}^{k_f}\| \le \frac{\mu}{1 - \mu \|W_k^0\|} \sum_{d=1}^{D} \|W_{k+d}^d\|\, \|y_{k+d}^{k_f}\|. \qquad (A.2)$$
Now define $a_d = \max_k \left( \dfrac{\mu \|W_{k+d}^d\|}{1 - \mu \|W_k^0\|} \right)$, $1 \le d \le D$. From equations A.1 and A.2, we deduce the following inequality:
$$\|y_k^{k_f}\| \le a_1 \|y_{k+1}^{k_f}\| + a_2 \|y_{k+2}^{k_f}\| + \cdots + a_D \|y_{k+D}^{k_f}\|. \qquad (A.3)$$
The time dependency of the coefficients has been removed. This allows the analysis of a univariate series, $\{u_n\}$, defined by $u_n = \sum_{j=1}^{D} a_j u_{n-j}$, to be carried out. $\{u_n\}$ is of interest since $u_n \ge \|y_{k_f-n}^{k_f}\|$ for $n = 0, \ldots, k_f$, with the boundary conditions $u_n = \|y_{k_f-n}^{k_f}\|$ for $n = 0, \ldots, D-1$. Since $u_n < \max_{j=1,\ldots,D}(u_{n-j}) \cdot \sum_{j=1}^{D} a_j$, and considering that $u_j < \left( \sum_{j=1}^{D} a_j \right) \cdot u_0$ for $j > 0$ and $u_0 = \|y_{k_f}^{k_f}\|$, it is readily proven by induction that
$$\|y_{k_f-(kD+j+1)}^{k_f}\| \le u_{kD+j+1} < \left( \sum_{j=1}^{D} a_j \right)^{k+1} \|y_{k_f}^{k_f}\|, \qquad \forall k \ge 0,\ \forall j = 0, \ldots, D-1. \qquad (A.4)$$
Therefore, $\sum_{j=1}^{D} a_j < 1$ ensures the convergence of the backpropagation dynamics, completing the proof.

References

Aussem, A. (1999). Dynamical recurrent neural networks towards prediction and modelling of dynamical systems. Neurocomputing, 28, 207–232.
Aussem, A. (2000). Sufficient conditions for error backflow convergence in dynamical recurrent neural networks. In Proceedings of the International Joint Conference on Neural Networks (pp. 577–582).
Aussem, A., & Murtagh, F. (1997). Combining neural network forecasts on wavelet-transformed time series. Connection Science, 9, 113–121.
Aussem, A., Murtagh, F., & Sarazin, M. (1995). Dynamical recurrent neural networks—towards environmental time series prediction. International Journal of Neural Systems, 6, 145–170.
Baldi, P. (1995). Gradient descent learning algorithm overview: A general dynamical systems perspective. IEEE Transactions on Neural Networks, 6, 182–195.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.
Boné, R., Cruciani, M., Verley, G., & Asselin de Beauville, J. P. (2000). A bounded exploration approach to constructive algorithms for recurrent neural networks. In Proceedings of the International Joint Conference on Neural Networks (pp. 27–32).
Campolucci, P., Uncini, A., Piazza, F., & Rao, B. D. (1999). On-line learning algorithms for locally recurrent neural networks. IEEE Transactions on Neural Networks, 10, 253–271.
Cohen, B., Saad, D., & Marom, E. (1997). Efficient training of recurrent neural networks with delays. Neural Networks, 10, 51–59.
Duro, R. J., & Santos Reyes, J. (1999). Discrete-time backpropagation for training synaptic delay-based artificial neural networks. IEEE Transactions on Neural Networks, 10, 779–789.
Frasconi, P., Gori, M., & Soda, G. (1992). Local feedback multilayered networks. Neural Computation, 4, 120–130.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12, 2451–2471.
Goutte, C. (1997). Statistical learning and regularisation for regression. Unpublished doctoral dissertation, University of Paris, France.
El Hihi, S., & Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.
Khalil, H. K. (1996). Nonlinear systems. Upper Saddle River, NJ: Prentice-Hall.
Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7, 1329.
Piche, S. W. (1994). Steepest descent algorithms for neural network controllers and filters. IEEE Transactions on Neural Networks, 5, 198–212.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59, 2229–2232.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 318–362). Cambridge, MA: MIT Press.
Schmidhuber, J. (1992). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4, 243–248.
Tsoi, A. T., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5, 229–239.
Wan, E. A. (1993). Finite impulse response neural networks with applications in time series prediction. Unpublished doctoral dissertation, Stanford University, CA.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in behavioral sciences. Unpublished doctoral dissertation, Harvard University, Cambridge, MA.
Williams, R. J., & Zipser, D. (1989). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1, 87–111.

Received March 30, 2001; accepted November 20, 2001.
LETTER
Communicated by Zoubin Ghahramani
Bayesian A* Tree Search with Expected O(N) Node Expansions: Applications to Road Tracking

James M. Coughlan
[email protected]
A. L. Yuille
[email protected]
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.

Many perception, reasoning, and learning problems can be expressed as Bayesian inference. We point out that formulating a problem as Bayesian inference implies specifying a probability distribution on the ensemble of problem instances. This ensemble can be used for analyzing the expected complexity of algorithms and also the algorithm-independent limits of inference. We illustrate this by analyzing the complexity of tree search. In particular, we study the problem of road detection, as formulated by Geman and Jedynak (1996). We prove that the expected convergence is linear in the size of the road (the depth of the tree) even though the worst-case performance is exponential. We also put a bound on the constant of the convergence and place a bound on the error rates.

1 Introduction
Many problems in vision, speech, reasoning, and other sensory and control modalities can be formulated as Bayesian inference (Knill & Richards, 1996). It is important to understand the complexities of algorithms that can perform these inferences. We point out that formulating a problem as Bayesian inference implies specifying a probability distribution on the ensemble of problem instances. More formally, in Bayesian inference, the goal is to estimate a quantity $x$ from data $y$ by using the posterior distribution $P(x \mid y)$. Constructing this posterior requires specifying a likelihood function $P(y \mid x)$ and a prior distribution $P(x)$. From these distributions, we can construct a distribution $P(x, y)$ on the ensemble of problem instances $(x, y)$. (See Figure 4 for samples from a particular ensemble for road tracking.) There are two main advantages to analyzing the performance of an algorithm over the ensemble of problem instances. First, it allows us to determine the behavior of the algorithm for typical problem instances (those that occur with nonnegligible probability) and means that we may not have to deal with worst-case situations (because they may have arbitrarily small probabilities). Second, having a distribution over the ensemble of problem

Neural Computation 14, 1929–1958 (2002). © 2002 Massachusetts Institute of Technology
instances also enables us to quantify the accuracy of the estimates found by the algorithm.

Many problems in artificial intelligence can be formulated as tree search (Winston, 1984; Pearl, 1984; Russell & Norvig, 1995). We now study a specific example of tree search: a computer vision algorithm proposed by Geman and Jedynak (1996). (Our analysis can be extended to other tree-searching problems.) Geman and Jedynak addressed the problem of detecting roads in aerial images. They formulated the problem as Bayesian maximum a posteriori (MAP) estimation (which enables us to determine a distribution on the ensemble of problem instances). In this article, we analyze the complexity of a variant of the Geman and Jedynak algorithm (proposed in Yuille & Coughlan, 2000b). The complexity and performance of the algorithm depend on the probability distributions that characterize the ensemble of problem instances (the likelihood function and the prior). For a large class of ensembles, we are able to prove that the expected complexity is linear in the length $N$ of the road (by contrast, the worst-case complexity is exponential in $N$). We are also able to put bounds on the expected errors made by the algorithm. We emphasize that we are concerned with the ability of the algorithm to detect the road path, which may not necessarily correspond to the MAP estimate. For any problem instance, there are three important paths: (1) the true road path, (2) the MAP estimate of the true road path, and (3) the path found by the algorithm. In this article, we are concerned with the difference between (3) and (1) only. These results complement our previous work (Yuille & Coughlan, 2000a; Yuille, Coughlan, Wu, & Zhu, 2001), which used this ensemble concept to analyze the algorithm-independent performance of MAP estimation on problems of this type (we evaluated errors between the MAP estimate and the true road path). In particular, we derived an order parameter $K_B$, which is a function of the ensemble.
We proved that if $K_B < 0$, then it is impossible to detect the true road by any algorithm. (Our results in this article apply to cases where $K_B > 0$ only. We can express $K_B = \{\psi_1 - \log Q\} + \psi_2$, where $\psi_1, \psi_2$ are positive quantities defined in theorem 3. We show linear complexity provided $\psi_1 > \log Q$. If $\psi_1 < \log Q$ and $\psi_1 + \psi_2 > \log Q$, then the target path can be found, but we can say nothing about the algorithm complexity.) Another advantage of the concept of an ensemble of problem instances is illustrated by our choice of a heuristic A* algorithm (Yuille & Coughlan, 1999). A* algorithms (Pearl, 1984; Winston, 1984; Russell & Norvig, 1995) search trees and graphs using a heuristic to estimate future rewards. If we are using a Bayesian ensemble, then the probability distributions can be used to generate heuristics. Technically, our proofs make use of large deviation theory (Grimmett & Stirzaker, 1992). In particular, we use Sanov's theorem (see appendix A) to put bounds on the probability of rare events. We note that many results in statistical learning theory (Vapnik, 1998) are derived using similar techniques from large deviation theory.
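The flavor of these large deviation bounds can be conveyed by a concrete sketch (an assumed coin-flip example added here, not the road-tracking ensemble itself): the probability that an empirical mean deviates from its expectation is bounded by $e^{-n\,KL}$, where $KL$ is a Kullback-Leibler divergence.

```python
import math
import random

# Illustrative Sanov/Chernoff-type bound (assumed coin-flip example):
# for n flips of a coin with bias p, P(empirical mean >= a) <= exp(-n * KL(a||p)).
def kl_bernoulli(a, p):
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

p, a, n = 0.5, 0.7, 50
bound = math.exp(-n * kl_bernoulli(a, p))
threshold = round(a * n)                 # 35 heads out of 50

random.seed(1)
trials = 50_000
hits = sum(
    1 for _ in range(trials)
    if sum(random.random() < p for _ in range(n)) >= threshold
)
estimate = hits / trials
print(f"Monte Carlo estimate {estimate:.4f} <= bound {bound:.4f}")
```

The bound decays exponentially in $n$, which is precisely what makes rare "clutter looks like road" events controllable in the analysis below.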
We now mention two related classes of theoretical results that are highly relevant. First, Karp and Pearl (1983) (see also Pearl, 1984) provided a theoretical analysis of convergence rates of A* search by considering an ensemble of problem instances. They studied a binary tree where the rewards for each arc were 0 or 1 and were specified by a probability $p$. They then obtained the complexity of algorithms for finding the minimum-cost path. This work has some similarities to ours, but their formulation is not Bayesian, their heuristics for A* algorithms are different, and large deviation theory is not used. Their work was an inspiration for us, and we provided an analysis of a block pruning algorithm motivated by them in Yuille and Coughlan (1999). Second, we would like to mention related work on optimization that uses the concept of ensembles. Some recent studies show that order parameters exist for NP-complete problems and that these problems can be easy to solve for certain values of the order parameters (Cheeseman, Kanefsky, & Taylor, 1991; Selman & Kirkpatrick, 1996). This work involves analyzing ensembles of problem instances, but the distribution of instances in the ensemble is typically assumed to be uniform and is not derived from Bayesian methods.

The structure of this article is as follows. In section 2, we formulate the road tracking problem as tree search and introduce A*. Section 3 gives complexity results for special choices of A* heuristic (making use of Sanov's theorem, which is given in appendix A). In section 4, we extend the complexity results to other heuristics (using additional results proven in appendix B). Section 5 proves that sorting the queue for A* takes constant expected time per sort operation. (An early version of this work was presented in Coughlan & Yuille, 1999.)

2 Problem Formulation

2.1 Tree Search: The Geman and Jedynak Model.
Many problems in artificial intelligence can be formulated as tree search (Winston, 1984; Pearl, 1984; Russell & Norvig, 1995). We now study a specific example of this class of problem. Geman and Jedynak (1996) formulate road detection as tree search through a Q-nary tree (see Figure 1). The starting point and initial direction are specified, and there are $Q^N$ possible distinct paths down the tree. The goal is to find the road path (whose statistical properties differ from those of the nonroad paths). The worst-case complexity for this problem is exponential, but as we will prove, in many circumstances it is possible to detect a good approximation to the road path in linear expected time. Our analysis involves considering an ensemble of problem instances. More formally, a road hypothesis, or path, consists of a set of connected straight-line segments. We can represent a path by a sequence of moves $\{t_i\}$ on the tree. Each move $t_i$ belongs to an alphabet $\{b_\nu\}$ of size $Q$. For example, the simplest case studied by Geman and Jedynak sets $Q = 3$ with
Figure 1: (Left) Geman and Jedynak's tree structure with a branching factor of $Q = 3$. (Right) The worst-case complexity is exponential because for some problem instances, the best path is determined only by the reward of the final arc segment. In this panel, $Q = 2$, and all the $2^N$ paths have to be examined.
Figure 2: Different priors for the geometry. (Left) The probabilities of turning left, right, or straight are one-third. (Right) The probability of going straight is two-thirds, and the probabilities of turning right or left are one-sixth each, biasing toward straighter paths.
an alphabet $b_1, b_2, b_3$ corresponding to the decisions (1) $b_1$—go straight (0 degrees), (2) $b_2$—go left ($-5$ degrees), or (3) $b_3$—go right ($+5$ degrees). Each tree will contain a target path that corresponds to the road to be detected. This path is sampled from a prior probability distribution $P(\{t_i\}) = \prod_{i=1}^{N} P_{\Delta G}(t_i)$, where $P_{\Delta G}(\cdot)$ is the geometric transition probability. For our $Q = 3$ example, we may choose to go straight, left, or right with equal probability ($P_{\Delta G}(b_1) = P_{\Delta G}(b_2) = P_{\Delta G}(b_3) = 1/3$; see Figure 2). A sequence of moves $\{t_i : i = 1, \ldots, N\}$ determines a path of segments $X = (x_1, \ldots, x_N)$ (these segments form a connected path from the top of the tree to the bottom). Conversely, a consistent path $X$ determines a sequence of moves. (This also applies to subpaths.) Let Â denote all the $Q^N$ segments of the tree, so a path $X$ is a subset of Â. The set of segments not on the path is the complement $Â \setminus X$.
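Sampling a target path from this prior is straightforward. The sketch below follows the $Q = 3$ setup with the biased probabilities of Figure 2 (straight with probability two-thirds, each turn with probability one-sixth); the function and variable names are illustrative, not from the paper.

```python
import random

# Sample a road path from the prior P({t_i}) = prod_i P_dG(t_i), Q = 3:
# moves are straight (0 deg), left (-5 deg), right (+5 deg).
random.seed(0)
MOVES = {"straight": 0.0, "left": -5.0, "right": +5.0}

def sample_path(n, p_straight=2/3, p_turn=1/6):
    """Sample n moves from the biased prior of Figure 2 and track the heading."""
    names = ["straight", "left", "right"]
    probs = [p_straight, p_turn, p_turn]
    moves = random.choices(names, weights=probs, k=n)
    angles, heading = [], 0.0
    for m in moves:
        heading += MOVES[m]           # accumulate heading along the path
        angles.append(heading)
    return moves, angles

moves, angles = sample_path(40)
print(moves[:5], "final heading:", angles[-1])
```

With the biased prior, sampled paths tend to stay roughly straight, which is what makes them distinguishable from random clutter paths in the first place.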
Bayesian A* Tree Search with Expected O(N) Expansions
1933
There is an observation y_x for each segment x ∈ 𝒜 of the tree. The set of all observations is Y = {y_x : x ∈ 𝒜}. The values of the observations belong to an alphabet {a_m} of size J. An observation y_x is drawn from a distribution P_on(·) if the segment is on the target path (if x ∈ X). If not, it is drawn from P_off(·). For any path {t_i} through the tree with segments {x_i}, we have a corresponding set of observations {y_{x_i}}. This determines the likelihood function P(Y | X),

P(Y | X) = ∏_{x∈X} P_on(y_x) ∏_{x∈𝒜∖X} P_off(y_x),   (2.1)
which we can reexpress as

P(Y | X) = ∏_{i=1,…,N} [P_on(y_{x_i}) / P_off(y_{x_i})] ∏_{x∈𝒜} P_off(y_x) = ∏_{i=1,…,N} [P_on(y_{x_i}) / P_off(y_{x_i})] F(Y),   (2.2)
where F(Y) = ∏_{x∈𝒜} P_off(y_x) is independent of the target path X. We formulate the problem as MAP estimation to find the mode of the posterior distribution P(X | Y) = P(Y | X)P(X)/P(Y), where P(X) = ∏_{i=1}^N P_ΔG(t_i) ({t_i} being the sequence of moves that generates the path X). MAP estimation for X is then equivalent to maximizing

P(Y | X)P(X) = ∏_{i=1}^N P_ΔG(t_i) ∏_{i=1,…,N} [P_on(y_{x_i}) / P_off(y_{x_i})] F(Y).   (2.3)
This is equivalent to finding the path {t_i} with observations {y_i} that maximizes the following reward function,

r({t_i}, {y_i}) = Σ_{i=1}^N log [P_on(y_i) / P_off(y_i)] + Σ_{i=1}^N log [P_ΔG(t_i) / U(t_i)],   (2.4)

where y_i is shorthand for y_{x_i}, U(·) is the uniform distribution (U(b_ν) = 1/Q for all ν), and so Σ_{i=1}^N log U(t_i) = −N log Q, which is a constant. The introduction of U(·) helps simplify the analysis in the following sections. Observe that the reward of a particular path depends only on the variables {t_i}, {y_i} that define the path (the moves and the observations). This is because the factor F(Y) in P(Y | X) is independent of X and can be ignored (it does not affect which path is most probable). For any path (or subpath) in the tree of length n, we will call r({t_i}, {y_i}) the reward function. It can be reexpressed as

r({t_i}, {y_i}) = n ω⃗·α⃗ + n ψ⃗·β⃗,   (2.5)
Figure 3: The quantized distributions (left) P_on(y) and (right) P_off(y), where y = |∇I(x)|, learned from image data. Observe that, not surprisingly, |∇I(x)| is likely to take larger values on an edge rather than off an edge. (From Konishi, Yuille, Coughlan, & Zhu, 1999.)
where α⃗ and β⃗ have components

α_m = log [P_on(a_m) / P_off(a_m)], m = 1, …, J;  β_ν = log [P_ΔG(b_ν) / U(b_ν)], ν = 1, …, Q,   (2.6)

and ω⃗ and ψ⃗ are normalized histograms, or types (see appendix A), with components (with δ_{i,j} denoting the Kronecker delta function)

ω_m = (1/n) Σ_{i=1}^n δ_{y_i, a_m}, m = 1, …, J;  ψ_ν = (1/n) Σ_{i=1}^n δ_{t_i, b_ν}, ν = 1, …, Q.   (2.7)
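The equivalence between the segment-by-segment sum of equation 2.4 and the type-based form of equation 2.5 is easy to check numerically. A minimal sketch (the function names and the base-2 logarithm convention are our choices, not the paper's):

```python
import math
from collections import Counter

def reward_direct(moves, observations, p_on, p_off, p_dg, q):
    """Reward of equation 2.4: summed log-likelihood ratios of the
    observations plus summed log prior-to-uniform ratios of the moves."""
    r = sum(math.log2(p_on[y] / p_off[y]) for y in observations)
    r += sum(math.log2(p_dg[t] * q) for t in moves)  # log(P_DG(t) / (1/Q))
    return r

def reward_via_types(moves, observations, p_on, p_off, p_dg, q):
    """Same reward written as n(w . alpha) + n(psi . beta) (equation 2.5);
    Counter gives raw counts, i.e. n times the normalized type."""
    counts_y = Counter(observations)
    counts_t = Counter(moves)
    alpha = {y: math.log2(p_on[y] / p_off[y]) for y in p_on}
    beta = {t: math.log2(p_dg[t] * q) for t in p_dg}
    return (sum(counts_y[y] * alpha[y] for y in counts_y)
            + sum(counts_t[t] * beta[t] for t in counts_t))
```

Both routines return the same value for any path, since the type merely groups identical terms of the sum.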
The Geman and Jedynak model is an idealization of tracking a road on a lattice. The tree structure represents the set of all possible paths from the starting point. A sequence of moves {t_i} on the tree determines a path of segments in the image. Each segment is approximately 7 pixels long, and its direction depends on the angle of the move t_i relative to the direction of the previous segment. The observations y are the responses of an oriented nonlinear filter designed to detect straight road segments (by estimating a quantity related to the image gradient). The filter is trained on examples of on-road and off-road segments to determine empirical distributions P_on(y) and P_off(y) for the filter responses (for more details, see Geman & Jedynak, 1996). See Figure 3 for examples of the distributions P_on, P_off. We now illustrate the Bayesian ensemble by Figure 4, which consists of three samples from the ensemble. In these cases, the target path is easily detectable, but noise fluctuations mean that some subpaths may distract the algorithm from the target. In other ensembles, the target path may be far harder to detect.
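One way to realize such an ensemble instance in code is as a lazy observation oracle: segments on the target path emit filter responses from P_on, all others from P_off, without ever materializing the exponentially large tree. A sketch under our own naming assumptions:

```python
import random

def sample_observation(dist, rng):
    """Draw one quantized filter response from a {value: probability} dict."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values])[0]

def make_instance(target_moves, p_on, p_off, seed=0):
    """Lazy observation oracle for one Bayesian-ensemble instance.

    Returns a function mapping a path prefix (tuple of moves) to an
    observation: drawn from P_on if the prefix lies on the target path,
    from P_off otherwise.  Observations are cached so repeated queries
    are consistent, and the full Q-nary tree is never built.
    """
    rng = random.Random(seed)
    cache = {}

    def observe(prefix):
        if prefix not in cache:
            n = len(prefix)
            on_target = tuple(target_moves[:n]) == tuple(prefix)
            cache[prefix] = sample_observation(p_on if on_target else p_off, rng)
        return cache[prefix]

    return observe
```

A search algorithm can then query observe(prefix) for exactly the segments it expands, which is what makes the linear expected-time claim meaningful.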
Figure 4: Samples from the Bayesian ensemble. Simulated road tracking problem where dark lines indicate strong edge responses and dashed lines specify weak responses. The data were generated by stochastic sampling using a simplified version of the models analyzed in this article. In each example, there is only one strong candidate for the best path (the continuous dark line), but chance fluctuations have created subpaths in the noise with strong edge responses.
2.2 Can the Task Be Solved? Distractor Paths. The goal is to detect the target path by selecting the path with the highest reward. But the MAP estimate may not necessarily correspond to the target path. In this section, we specify conditions that ensure that the MAP estimate is expected to be significantly similar to the target path. Unless these conditions are satisfied, it will be impossible to find the target path by any algorithm. The tree contains one target path and Q^N − 1 distractor paths. We categorize the distractor paths by the stage at which they diverge from the target path (see Figure 5). For example, at the first branch point, the target path lies on only one of the Q branches, and there are Q − 1 false branches that generate the first set of false paths F_1. Now consider all the Q − 1 false branches at the second target branch; these generate the set F_2. As we follow along the true path, we generate sets F_i of size (Q − 1)Q^{N−i}. The set of all paths is therefore the target path plus the union of the F_i (i = 1, …, N). To determine whether the target path can be found by MAP estimation, we must consider the probability that one of the distractor paths has a higher reward than the target path. For example, consider the probability distribution P̂_{1,max}(r_max/N) of the maximum reward (normalized by N) of all the paths in F_1. We can compare this to the probability distribution P̂_T(r_T/N) of the (normalized) reward of the target path. In related work (Yuille et al., 2001), we use techniques similar to Sanov's theorem (see appendix A) to estimate these quantities and show that there is a phase transition depending on a parameter K given by
K = D(P_on ‖ P_off) + D(P_ΔG ‖ U) − log Q,   (2.8)
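The order parameter K of equation 2.8 is straightforward to evaluate for quantized distributions. A hedged sketch (base-2 logarithms and dictionary inputs are our conventions, not the paper's):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits, for dicts of probabilities."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

def order_parameter_k(p_on, p_off, p_dg, q):
    """K = D(P_on || P_off) + D(P_DG || U) - log Q (equation 2.8)."""
    uniform = {t: 1.0 / q for t in p_dg}
    return (kl_divergence(p_on, p_off)
            + kl_divergence(p_dg, uniform)
            - math.log2(q))
```

With weak edge cues (P_on = {0: 0.2, 1: 0.8} against P_off = {0: 0.8, 1: 0.2}), a uniform prior, and Q = 3, this gives K = 1.2 − log2 3 ≈ −0.38 < 0, so no algorithm can detect the target; sharpening the cues to 0.1/0.9 raises K to ≈ 0.95 > 0.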
Figure 5: (Left) Given a specified target path (the straight line shown in bold in this case), we can divide the set of paths into N subsets F_1, …, F_N, as shown here. Paths in F_1 have no overlap with the target path. Paths in F_2 overlap with one segment, and so on. Intuitively, we can think of this as an onion, where we peel off paths stage by stage. (Right) When paths leave the target path, they make errors, which we characterize by the number of false segments. For example, a path in F_1 has error N, and a path in F_i has error N + 1 − i.
where D(P_on ‖ P_off) = Σ_y P_on(y) log [P_on(y)/P_off(y)] is the Kullback-Leibler divergence between P_on and P_off. If K > 0, then the probability distribution for the target path reward P̂_T(r_T/N) lies to the right of the distribution P̂_{1,max}(r_max/N) of the maximum reward of paths in F_1 (see Figure 6, left), and it is straightforward to detect the target path. At K ≈ 0, the two distributions overlap, and it becomes hard to detect the target path (see Figure 6, center). But if K < 0, then P̂_{1,max}(r_max/N) is to the right of P̂_T(r_T/N) (see Figure 6), and it is impossible to detect the target path. To understand K better, consider its three terms. The first term, D(P_on ‖ P_off), is a measure of how effective the local filter cues are for detecting the target. If P_on = P_off, then D(P_on ‖ P_off) = 0, and the local cues are useless. The second term, D(P_ΔG ‖ U), is a measure of how much prior knowledge we have about the probable shape of the target (setting P_ΔG = U means we have no prior information). Finally, log Q is a measure of how many distractor paths there are. Therefore, K becomes large (and so target detection becomes easy) the better our filter detectors, the more prior knowledge we have, and the fewer the number of distractor paths. In related work (Yuille & Coughlan, 2000a), we used Sanov's theorem (see appendix A) to show that the expected number of paths in F_1 with rewards greater than the target path behaves as 2^{−N K_B}, where the order parameter K_B is defined by
(2.9)
where B (P, Q ) D ¡ log iD 1 ( pi ) 1/ 2 (qi ) 1 / 2 is the Bhattacharyya bound between the distributions P D fpi g and Q D fqi g. Once again, there is a change
Figure 6: Schematic illustration of the phase transition. (Top left) The reward of the target path is higher than the largest reward of the distractor paths (so detection is straightforward). (Top right) The task becomes difficult because the target reward and the best distractor reward are very similar. (Bottom) Detecting the target path becomes impossible because its reward is lower than that of the best distractor path. The horizontal axis labels the normalized reward.
in behavior as K_B changes sign. For K_B > 0, we expect there to be no paths in F_1 with rewards greater than the target path. Our analysis of the A* algorithm will proceed in the regime where K > 0 and K_B > 0. There is little purpose in estimating how fast one can compute the MAP estimator unless one is sure that the estimator is detecting a good approximation to the correct target. Our results (see section 3) will require an additional condition to hold (see theorems 6 and 7), which will ensure that K_B > 0. There will, however, be situations where the target is detectable (K_B > 0) but we cannot prove expected linear convergence. More specifically, we can express K_B = {Y_1 − log Q} + Y_2, where Y_1, Y_2 are positive quantities defined in theorem 3. Our complexity proofs apply provided Y_1 > log Q. If Y_1 < log Q but Y_1 + Y_2 > log Q, then the target path is detectable, but we can say nothing about the complexity of the algorithms.

2.3 A*. The A* graph search algorithm (see Pearl, 1984; Winston, 1984; Russell & Norvig, 1995) is used to find a path of maximum reward between
Figure 7: The A* algorithm tries to find the path from A to B with the highest reward. For a partial path AC, the algorithm stores g(C), the best reward to go from A to C, and an overestimate h(C) of the reward to go from C to B.
a start node A and a goal node B in a graph (see Figure 7). The reward of a particular path is the sum of the rewards of each edge traversed. The A* procedure maintains a tree of partial paths already explored and computes a measure f of the "promise" of each partial path (a leaf in the search tree). New paths are considered by extending the most promising node one step. (See Figure 8 for an illustration of the sequence of steps in A*.) The measure f for any node C is defined as f(C) = g(C) + h(C), where g(C) is the best cumulative reward found so far from A to C and h(C) is an overestimate of the remaining reward from C to B. The closer this overestimate is to the true reward, the faster the algorithm will run. We will refer to the value of f as the A* reward, in contrast with the reward function of equation 2.5. It is straightforward to prove that A* is guaranteed to converge to the correct result provided the heuristic h(·) is an upper bound for the true reward from all nodes C to the goal node B. A heuristic satisfying this condition is called admissible. Conversely, a heuristic that does not satisfy it is called inadmissible. The word inadmissible is a technical term only and does not imply that inadmissible heuristics are inferior to admissible ones. In fact, as we show in this article, algorithms using inadmissible heuristics can converge rapidly to good approximations of the correct result. Conversely, algorithms with admissible heuristics may be slow to converge. In this letter, we consider inadmissible heuristics. We set the heuristic reward to be H_L + H_P for each unexplored segment (where the subscripts L and P label the likelihood and the geometric prior, respectively). Thus, a subpath starting at the origin of length M will have a heuristic reward of (N − M)(H_L + H_P). We will drop the N(H_L + H_P) term, which is the same for all paths, and simply use −M(H_L + H_P) as the heuristic.
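The best-first loop with the per-segment heuristic −M(H_L + H_P) can be sketched in a few lines. This is our own minimal illustration on an abstract Q-nary tree, not the authors' implementation; reward_fn and the tuple encoding of paths are assumptions:

```python
import heapq

def astar_qary_tree(q, depth, reward_fn, h_per_segment):
    """Best-first (A*) search on a Q-nary tree, maximizing summed rewards.

    A node is a tuple of moves; reward_fn(node) is the reward of the
    node's last segment.  The A* reward is f = g - M * (H_L + H_P),
    where M is the node's depth (the constant N(H_L + H_P) is dropped).
    Returns the first full-depth path popped and the expansion count.
    """
    # heapq is a min-heap, so we store -f; ties break on the path tuple.
    queue = [(0.0, ())]
    expansions = 0
    while queue:
        neg_f, path = heapq.heappop(queue)
        g = -neg_f + len(path) * h_per_segment  # recover cumulative reward
        if len(path) == depth:
            return path, expansions
        expansions += 1
        for move in range(q):
            child = path + (move,)
            g_child = g + reward_fn(child)
            f_child = g_child - len(child) * h_per_segment
            heapq.heappush(queue, (-f_child, child))
    return None, expansions
```

Raising h_per_segment pushes the search toward breadth-first behavior, since fewer subpaths keep a positive A* reward per segment.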
We consider the A* rewards of two partial paths: one of length m segments that overlaps completely with the target path and the other of length n that does not overlap at all with the target path. The A* rewards of these paths
Figure 8: An A* search sequence. (a) The target path, shown in bold. (b) The two segments added to the queue (marked by asterisks) at the start of the search. (c) Explores the right segment, removes it from the queue (e), and adds its two children to the queue. The search continues in panels d and e, where explored segments are eliminated from the queue (and labeled e) and their children are added to the queue (labeled with asterisks). (f) A* converges to a solution (A*) that is one segment away from the target path (T).
are denoted by S_on(m) and S_off(n) and are given by

S_on(m) = m{ω⃗_on·α⃗ − H_L} + m{ψ⃗_on·β⃗ − H_P},
S_off(n) = n{ω⃗_off·α⃗ − H_L} + n{ψ⃗_off·β⃗ − H_P},   (2.10)

where ω_m^on = (1/m) Σ_i δ_{y_i, a_m} and ψ_ν^on = (1/m) Σ_i δ_{t_i, b_ν} for the on-path segments (and similarly for the off-path segments). The effect of this heuristic is to encourage us to explore subpaths provided their reward per segment is greater than H_L + H_P. Note that the larger the value of H_L + H_P, the more the algorithm will favor a breadth-first strategy (Winston, 1984) of exploring the tree (because few long subpaths will have rewards per segment exceeding H_L + H_P). Breadth-first search is a conservative strategy that will find the best solution but may take a long time to do so. In general, the smaller the heuristic, the faster the search time but the greater the possibility of error. In particular, if H_L + H_P > max_{m∈{1,…,J}} α_m + max_{ν∈{1,…,Q}} β_ν, then the heuristic is admissible and the search is guaranteed to converge to the path with the highest reward (Pearl, 1984; Winston, 1984; Russell & Norvig, 1995). (But an algorithm that uses this
heuristic is likely to be slow.) We will be considering inadmissible heuristics (H_L + H_P < max_{m∈{1,…,J}} α_m + max_{ν∈{1,…,Q}} β_ν), which we expect to be quicker than admissible heuristics but to make more errors (our theorems quantify these statements). There is a close connection between heuristics for A* search and the general issue of pruning search algorithms. In previous work (Yuille & Coughlan, 1999), we analyzed the results of a search algorithm that pruned out paths whose rewards fell below a critical value (corresponding to a heuristic). In related work, we explored the use of pruning heuristics for speeding up dynamic programming algorithms for detecting hand outlines in real images (Coughlan, Snow, English, & Yuille, 1998, 2000). We will first prove convergence results for a specific choice of H_L, H_P and later generalize to a larger set of values.

3 Convergence Proof with Bhattacharyya Heuristics
This section will prove convergence results for a specific choice of heuristic that we call the Bhattacharyya heuristic. The next section will generalize the results to a larger class (for which the proofs are more complicated). First, we need to say something about the choice of heuristics. We need to choose a heuristic that is smaller than the reward (per unit length) we would get from the target path and larger than the reward for a distractor path. This means (see equation 2.10) that the A* rewards will tend to be positive for the target and negative for the distractor paths (so the algorithm will prefer to explore the target). Let us consider the reward H_L only (the analysis is similar for H_P). The expected reward per segment for the target path is D(P_on ‖ P_off) = Σ_y P_on(y) log [P_on(y)/P_off(y)], and for a distractor path it is −D(P_off ‖ P_on) = Σ_y P_off(y) log [P_on(y)/P_off(y)]. Therefore, we want to select H_L so that −D(P_off ‖ P_on) < H_L < D(P_on ‖ P_off). Our complexity results will be obtained using Sanov's theorem (see appendix A) to estimate the probability that the algorithm wastes time searching distractor paths. In order to use Sanov's theorem, it is convenient to think of the heuristic as the expected reward of data distributed according to the geometric mixture of P_on, P_off given by P_λ(y) = P_on^{1−λ}(y) P_off^λ(y) / Z[λ] (where Z[λ] is a normalization constant). In this section, we will consider the special case where λ = 1/2. This gives the Bhattacharyya heuristic, H_L* = Σ_y P_{λ=1/2}(y) log [P_on(y)/P_off(y)]. (We give it this name because the distribution P_{λ=1/2} is associated with the Bhattacharyya bound in statistics; Ripley, 1996.) By setting λ = 1/2, we are essentially choosing a heuristic midway between the target and distractors (generated by P_on and P_off, respectively). In the next section, we will extend the results to deal with other values of λ. The Bhattacharyya heuristic is special in two ways.
First, it simplifies the analysis. Second, and more important, we can prove stronger results
about convergence if the Bhattacharyya heuristic is used (although this may reflect limitations in our proofs). As we will discuss in the next section, if the algorithm converges using one of the alternative heuristics, then it will also converge with the Bhattacharyya heuristic, but the reverse is not necessarily true. The Bhattacharyya heuristics H_L*, H_P* are the expected rewards per segment:

H_L* = ω⃗_Bh·α⃗,  H_P* = ψ⃗_Bh·β⃗,   (3.1)

with respect to the distributions ω_Bh, ψ_Bh:

ω_Bh(y) = {P_on(y)}^{1/2} {P_off(y)}^{1/2} / Z_ω,  ψ_Bh(t) = {P_ΔG(t)}^{1/2} {U(t)}^{1/2} / Z_ψ,   (3.2)
where Z_ω, Z_ψ are normalization constants. We first put an upper bound on the probability that any completely false segment is searched. Let A_{n,i} be the set of subpaths of length n that belong to F_i. Then we have the following result:

Theorem 1. The probability that A* searches the last segment of a particular subpath in A_{n,i} is less than or equal to Pr{∃m: S_off(n) ≥ S_on(m)}.
Proof. By definition of A*, a necessary condition for the segment to be searched is that its A* reward (including the heuristic) is better than the A* reward of at least one segment on the target path. This is because the A* algorithm always maintains a queue of nodes to explore and searches the node with the highest reward. The algorithm is initialized at the start of the target path, and so an element of the target path will always lie in the queue of nodes that A* considers searching. (This condition is not sufficient to ensure that the segment is searched, so we are obtaining only an upper bound.)
We now bound Pr{∃m: S_off(n) ≥ S_on(m)} by something we can evaluate.

Theorem 2. Pr{∃m: S_off(n) ≥ S_on(m)} ≤ Σ_{m=0}^∞ Pr{S_off(n) ≥ S_on(m)}.

Proof. Boole's inequality.
We now proceed to find a bound on Pr{S_off(n) ≥ S_on(m)}. This is done using Sanov's theorem (see appendix A). It will show that this probability falls off exponentially with n, m (provided certain parameters are positive).
Theorem 3. Pr{S_off(n) ≥ S_on(m)} ≤ {(n+1)(m+1)}^{J^2 Q^2} 2^{−(nY_1 + mY_2)}, where Y_1 = D(ω⃗_Bh ‖ P_off) + D(ψ⃗_Bh ‖ U) and Y_2 = D(ω⃗_Bh ‖ P_on) + D(ψ⃗_Bh ‖ P_ΔG).

Proof. The proof is an application of Sanov's theorem (see appendix A) applied to the product space of types of P_on, P_off, P_ΔG, U. Define

E = {(ω⃗_off, ψ⃗_off, ω⃗_on, ψ⃗_on) : n{ω⃗_off·α⃗ − H_L* + ψ⃗_off·β⃗ − H_P*} ≥ m{ω⃗_on·α⃗ − H_L* + ψ⃗_on·β⃗ − H_P*}}   (3.3)
(E is the set of all histograms corresponding to partial off-paths with higher A* reward than the partial on-path). Sanov's theorem gives a bound in terms of the ω⃗_off, ψ⃗_off, ω⃗_on, ψ⃗_on that minimize

f(ω⃗_off, ψ⃗_off, ω⃗_on, ψ⃗_on) = nD(ω⃗_off ‖ P_off) + nD(ψ⃗_off ‖ U) + mD(ω⃗_on ‖ P_on) + mD(ψ⃗_on ‖ P_ΔG)
  + τ_1 {Σ_y ω_off(y) − 1} + τ_2 {Σ_t ψ_off(t) − 1} + τ_3 {Σ_y ω_on(y) − 1} + τ_4 {Σ_t ψ_on(t) − 1}
  + γ {m{ω⃗_on·α⃗ − H_L* + ψ⃗_on·β⃗ − H_P*} − n{ω⃗_off·α⃗ − H_L* + ψ⃗_off·β⃗ − H_P*}},   (3.4)

where the τ's and γ are Lagrange multipliers. This function f(·, ·, ·, ·) is known to be convex, so there is a unique minimum. Observe that f(·, ·, ·, ·) consists of four terms of the form nD(ω⃗_off ‖ P_off) + τ_1 {Σ_y ω_off(y) − 1} − nγ ω⃗_off·α⃗, which are coupled only by shared constants. These terms can be minimized separately to give

ω⃗_off* = P_on^γ P_off^{1−γ} / Z[1−γ],  ω⃗_on* = P_on^{1−γ} P_off^γ / Z[γ],
ψ⃗_off* = P_ΔG^γ U^{1−γ} / Z_2[1−γ],  ψ⃗_on* = P_ΔG^{1−γ} U^γ / Z_2[γ],   (3.5)

subject to the constraint given by equation 3.3. By inspection, the unique solution occurs when γ = 1/2. In this case:

ω⃗_off*·α⃗ = H_L* = ω⃗_on*·α⃗,  ψ⃗_off*·β⃗ = H_P* = ψ⃗_on*·β⃗.   (3.6)
Figure 9: This figure illustrates theorem 5. F_1 has at most Q^n segments at depth n.
The solution occurs at ω⃗_on* = ω⃗_off* = ω⃗_Bh and at ψ⃗_on* = ψ⃗_off* = ψ⃗_Bh. Substituting into the Sanov bound gives the result.

From theorem 3, a direct summation together with theorem 2 yields:

Theorem 4. Pr{∃m: S_off(n) ≥ S_on(m)} ≤ Σ_{m=0}^∞ Pr{S_off(n) ≥ S_on(m)} ≤ (n+1)^{J^2 Q^2} C_2(Y_2) 2^{−nY_1}, where Y_1, Y_2 are specified in theorem 3, and

C_2(Y_2) = Σ_{m=0}^∞ (m+1)^{J^2 Q^2} 2^{−mY_2}.   (3.7)
Theorem 4 shows that the probability of exploring a particular distractor path to depth n falls off exponentially with n. We now compute the expected number of false segments that are searched at depth n in the tree. There are Q^n − 1 segments at this depth, which we can bound above by Q^n, as illustrated in Figure 9.

Theorem 5. Provided Y_1 > log Q, the expected number of segments searched in F_1 is less than or equal to C_2(Y_2) C_1(Y_1 − log Q), where

C_1(Y_1 − log Q) = Σ_{n=1}^∞ n (n+1)^{J^2 Q^2} 2^{−n(Y_1 − log Q)}.   (3.8)
Proof. There are at most Q^n segments at depth n in F_1. Using theorem 4, the expected number of segments explored is less than or equal to Σ_{n=0}^∞ Q^n (n+1)^{J^2 Q^2} C_2(Y_2) 2^{−nY_1}. Sum the series; it will not converge unless Y_1 > log Q.

Finally, by the recursive structure of the tree, we see that the number of segments explored in the sets F_2, F_3, … must be less than, or equal to, the
Figure 10: Illustrates theorem 7. A* marks the path found by the A* algorithm. M marks the MAP estimate of the target path position. T marks the target path. Observe that A* has an error of one segment, and the MAP estimate has an error of two.
number explored in F1 . (We simply eliminate the rst few segments, which are in common with the target path, and perform the same argument as above.) This yields our main result: The expected number of segments explored by an A* algorithm using the Bhattacharyya heuristic is bounded above by NC2 (Y2 ) C1 (Y1 ¡log Q ) , provided Y1 > log Q. Theorem 6.
The algorithm is expected to explore O(N) segments, and the coefficients C_2(Y_2) C_1(Y_1 − log Q) rapidly decrease as Y_2 and Y_1 − log Q increase. In addition, we can estimate the expected error of how much our final estimate differs from the target path. Note that this is not the same as the error with respect to the MAP estimate (illustrated in Figure 10). We count the error as the expected number of incorrect segments on the path.

Theorem 7. The expected error is bounded above by C_2(Y_2) C_1(Y_1 − log Q), provided Y_1 > log Q.
Proof. We measure the error in terms of the expected number of off-road segments. The expected error can then be bounded above by Σ_{n=0}^∞ Pr(n) n, where Pr(n) is the probability that A* will explore a path in F_{N+1−n} to the end. There are Q^n such paths. For each path, the probability that we explore it to the end is bounded above by Σ_{m=0}^∞ Pr{S_off(n) ≥ S_on(m)} ≤ (n+1)^{J^2 Q^2} C_2(Y_2) 2^{−nY_1}, using theorem 4. Therefore the expected error is
bounded above by Σ_{n=0}^∞ n Q^n (n+1)^{J^2 Q^2} C_2(Y_2) 2^{−nY_1}, which was summed in theorem 5. Hence the result.

The condition Y_1 > log Q is related to the order parameter K_B given by equation 2.9. It is straightforward algebra to check that K_B = Y_1 + Y_2 − log Q. Therefore, the condition Y_1 > log Q implies that K_B > 0 (Y_2 > 0 by definition), which ensures that the expected number of paths in F_1 with rewards higher than the target path falls to zero as N → ∞.

4 Alternative Heuristics
We now generalize our results to other heuristics H_L, H_P that lie in the ranges −D(P_off ‖ P_on) < H_L < D(P_on ‖ P_off) and −D(U ‖ P_ΔG) < H_P < D(P_ΔG ‖ U). The basic results and proof strategy are similar to those of the previous section. One stage, however, is technically more complicated and requires some additional results, which are proved in appendix B. The convergence results we obtain for these alternative heuristics are weaker than those for the Bhattacharyya heuristic. Each alternative heuristic is associated with a quantity Ŷ_1, and we can prove convergence only provided Ŷ_1 > log Q. But for the Bhattacharyya heuristic, we can prove convergence provided Y_1 > log Q and, as we will show, Y_1 ≥ Ŷ_1. Hence, there will be situations where the A* algorithm will converge if the Bhattacharyya heuristic is used but will not necessarily converge for other heuristics. (This may, of course, reflect the limitations of our proofs and not of the heuristics.) As in the previous section, in order to use Sanov's theorem, it is convenient to associate the heuristics H_L, H_P with probability distributions ω_{H_L}(y), ψ_{H_P}(t), which are of the form

ω_{H_L}(y) = P_on^{1−λ(H_L)}(y) P_off^{λ(H_L)}(y) / Z_1[λ(H_L)],
ψ_{H_P}(t) = P_ΔG^{1−μ(H_P)}(t) U^{μ(H_P)}(t) / Z_2[μ(H_P)],   (4.1)

where λ(H_L), μ(H_P) are determined so that

Σ_y ω_{H_L}(y) log [P_on(y) / P_off(y)] = H_L,  Σ_t ψ_{H_P}(t) log [P_ΔG(t) / U(t)] = H_P.   (4.2)

To remove the ambiguity in H_L, H_P (because the A* heuristic depends only on their sum H_L + H_P), we require that λ(H_L) = μ(H_P). It is also convenient to define additional variables (Ĥ_L, Ĥ_P) related to (H_L, H_P) by the conditions that λ(H_L) + λ(Ĥ_L) = 1 and μ(H_P) + μ(Ĥ_P) = 1. (For the Bhattacharyya heuristics, H_L = Ĥ_L and H_P = Ĥ_P.)
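Because the expected log-likelihood-ratio reward under the geometric mixture decreases monotonically in λ (it is an exponential-family expectation), λ(H_L) in equation 4.2 can be found by bisection. A sketch with our own function names; the same routine with P_ΔG and U in place of P_on and P_off gives μ(H_P):

```python
import math

def expected_llr(p_on, p_off, lam):
    """Expected log2[P_on/P_off] under the geometric mixture
    P_on^(1-lam) P_off^lam / Z (equation 4.1)."""
    w = {y: p_on[y] ** (1.0 - lam) * p_off[y] ** lam for y in p_on}
    z = sum(w.values())
    return sum((w[y] / z) * math.log2(p_on[y] / p_off[y]) for y in p_on)

def solve_lambda(p_on, p_off, h_l, iters=80):
    """Bisection for lambda(H_L) in equation 4.2.  Valid for H_L in the
    range (-D(P_off||P_on), D(P_on||P_off)), over which the expectation
    falls monotonically from D(P_on||P_off) to -D(P_off||P_on)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_llr(p_on, p_off, mid) > h_l:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For symmetric distributions, H_L = 0 is the Bhattacharyya choice, and the solver returns λ ≈ 1/2, consistent with H_L = Ĥ_L there.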
Next, we define two functions, Ŷ_1(H_L, H_P) and Ŷ_2(H_L, H_P), by the equations

Ŷ_1 = D(ω⃗_{Ĥ_L} ‖ P_off) + D(ψ⃗_{Ĥ_P} ‖ U), Ŷ_2 = D(ω⃗_{H_L} ‖ P_on) + D(ψ⃗_{H_P} ‖ P_ΔG) if H_L + H_P ≥ Ĥ_L + Ĥ_P,
Ŷ_1 = D(ω⃗_{H_L} ‖ P_off) + D(ψ⃗_{H_P} ‖ U), Ŷ_2 = D(ω⃗_{Ĥ_L} ‖ P_on) + D(ψ⃗_{Ĥ_P} ‖ P_ΔG) if H_L + H_P < Ĥ_L + Ĥ_P.

Our results are summarized by the following theorem, which subsumes theorems 6 and 7 by replacing Y_1, Y_2 with Ŷ_1(H_L, H_P), Ŷ_2(H_L, H_P). (It can be verified that Ŷ_1 = Y_1 and Ŷ_2 = Y_2 if we use the Bhattacharyya heuristic.)

Theorem 8. The expected number of segments explored by an A* algorithm using the heuristics H_L, H_P is bounded above by N C_2(Ŷ_2(H_L, H_P)) C_1(Ŷ_1(H_L, H_P) − log Q), provided Ŷ_1(H_L, H_P) − log Q > 0. In addition, the expected error is bounded above by C_2(Ŷ_2(H_L, H_P)) C_1(Ŷ_1(H_L, H_P) − log Q).
Proof. The proof follows the basic strategy of theorems 1 through 7. The difference is that the bounds for Pr{S_off(n) ≥ S_on(m)}, given by theorem 3 for the Bhattacharyya heuristic, now depend in a complicated way on n and m. It requires additional work (see appendix B) to bound these terms by expressions that decay exponentially with n and m. This involves replacing Y_1, Y_2 by Ŷ_1(H_L, H_P), Ŷ_2(H_L, H_P). After that, the remainder of the results follows directly.
Finally, we show that Ŷ_1 ≤ Y_1 for all alternative heuristics (with equality only if we use the Bhattacharyya heuristic). This follows directly from equation 4.2. The condition that H_L + H_P ≥ Ĥ_L + Ĥ_P implies that λ(H_L) ≤ 1/2 and μ(H_P) ≤ 1/2. Hence Ŷ_1 ≤ Y_1. (A similar argument applies if H_L + H_P ≤ Ĥ_L + Ĥ_P.)

5 Sorting the Queue in Linear Expected Time
We have shown that the expected number of nodes searched is linear in N. But the convergence rate of the algorithm will also depend on how much time is required to sort the queue of nodes that we want to expand. In this section, we prove that the expected time to sort the queue nodes is constant. We use a simple linked-list data structure where we order the queue nodes according to their rewards (instead of a more sophisticated data structure, like a heap; see, for example, Geiger & Liu, 1997; Coughlan et al., 1998). A* proceeds by expanding the top node (the one with the highest A* reward) and
must then adjust the queue to accommodate its children. We now show that the expected sort time, which is required to place the children in their correct positions in the queue, is a constant. To do this, we note that the children nodes have A* rewards that are smaller than that of the top node by at most L, where L = H_L + H_P − min_y log [P_on(y)/P_off(y)] − min_t log [P_ΔG(t)/U(t)] (note that L > 0). We therefore only have to compare the rewards of the children with nodes whose rewards are within L of the top node. As we will show, the expected number of these nodes is constant. This gives the following theorem.

Theorem 9. The expected sorting rate is constant (independent of the size N of the problem).

Proof. The expected sorting rate is equal to Q times the expected number of nodes in the sort queue that have rewards within L of the top node. The reward of the top node is guaranteed to be greater than, or equal to, the reward r_T of the longest target subpath in the queue. Let this longest target subpath have length n. To prove that the expected sorting time is constant, it suffices to show that the expected number of paths in the queue with rewards greater than r_T − L is constant. This requires computing the probabilities that subpaths in F_1, …, F_n have rewards higher than r_T − L and bounding the expected number of such subpaths. (We do not need to consider paths in F_i, i > n because, by definition of n, they involve children of nodes in the queue and so cannot be in the queue.) We can bound these probabilities using Sanov's theorem and then bound the expected number of nodes by summing exponential series. The details are given in the proof of theorem 12.
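The queue bookkeeping behind theorem 9 can be sketched as follows. We use a plain Python list in place of a linked list, but keep the key property that the insertion scan starts from the top and only ever passes nodes within L of it (names are ours, not the paper's):

```python
class RewardSortedQueue:
    """Queue kept in descending A* reward order.

    Because a child's reward is at most L below the reward of the
    expanded top node, the insertion scan terminates after passing the
    (expectedly constant) number of nodes within L of the top.
    """

    def __init__(self):
        self.nodes = []  # list of (reward, path), highest reward first

    def push(self, reward, path):
        i = 0  # scan from the top; stops once rewards drop below `reward`
        while i < len(self.nodes) and self.nodes[i][0] > reward:
            i += 1
        self.nodes.insert(i, (reward, path))

    def pop_top(self):
        return self.nodes.pop(0)
```

Expanding the top node is then one pop_top followed by Q pushes, each costing an expected constant-length scan, which is the content of theorem 9.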
6 Conclusion
The goal of this article is to point out that the Bayesian formulation of inference problems leads to a probability distribution on the ensemble of problem instances. Analysis of this ensemble can give complexity results and, in other work (Yuille & Coughlan, 2000a; Yuille et al., 2001), algorithm-independent results. As a specific example, we analyzed the Geman and Jedynak (1996) theory for road tracking. We were able to demonstrate linear (in the road size) expected convergence for a class of ensembles even though the worst-case performance is exponential. This agrees with previous work (Yuille & Coughlan, 1999) where we analyzed a block-pruning search strategy motivated by Pearl (1984). Our techniques can be applied to a range of other search problems. This can correspond either to modifications of the Geman and Jedynak algorithms (for example, allowing for multiple roads) or, more generally, to
tree search problems. (Of course, the theoretical analysis may not result in closed-form analytic expressions.) Alternatively, one can use these techniques to determine efficient pruning algorithms. For example, the speed of dynamic programming algorithms can be improved by pruning out search hypotheses (Coughlan et al., 1998, 2000).

Appendix A: Sanov's Theorem
This appendix introduces results from the theory of types (Cover & Thomas, 1991), which we will use to prove our results. We will be particularly concerned with Sanov's theorem. We will apply this material to the problem of determining whether a given set of measurements is more likely to come from a road or nonroad, but without making any geometrical assumptions about the likely shape of the road. The theorem assumes that we have an underlying distribution P_s, which generates a set of N independent and identically distributed (i.i.d.) samples. From each sample set, we can determine an empirical histogram, or type (see Figures 11 and 12). The law of large numbers states that these empirical histograms (when normalized) must become similar to the distribution P_s as N → ∞. Sanov's theorem puts bounds on how fast the empirical histograms converge (in probability) to the underlying distribution and thereby puts bounds on the probability of rare events. Let y_1, y_2, …, y_N be i.i.d. from a distribution P_s(y) with alphabet size J, and let E be any closed set of probability distributions. Let Pr(w ∈ E) be the probability that the type of a sample sequence lies in the set E. Then Sanov's theorem states:
\[
\frac{2^{-N D(w^* \| P_s)}}{(N+1)^J} \;\le\; \Pr(w \in E) \;\le\; (N+1)^J \, 2^{-N D(w^* \| P_s)},
\]  (A.1)

where w^* = arg min_{w ∈ E} D(w ‖ P_s) is the distribution in E that is closest to P_s in terms of Kullback-Leibler divergence, given by D(w ‖ P_s) = Σ_{y=1}^J w(y) log(w(y)/P_s(y)). This is illustrated by Figure 13. Intuitively, it shows that when considering the chance of a set of rare events happening, we essentially have to worry only about the "most likely" of the rare events (in the sense of Kullback-Leibler divergence). Most important, it tells us that the probability of rare events falls off exponentially with the Kullback-Leibler divergence between the rare event (its type) and the true distribution. This exponential fall-off is critical for proving the results in this letter. Note that Sanov's theorem involves an alphabet factor (N+1)^J. This alphabet factor becomes irrelevant
Bayesian A* Tree Search with Expected O(N) Expansions
1949
Figure 11: Samples from an underlying distribution. (Left to right and top to bottom) The original distribution, followed by histograms, or types, from 10, 100, and 1000 samples from the original. Observe that for small numbers of samples, the types tend to differ greatly from the true distribution. But for large N, the law of large numbers says that they must converge (with high probability).
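The behavior of types illustrated in Figure 11, and the Sanov bound applied to the coin-tossing example discussed in this appendix, can be sketched numerically. This is an illustrative sketch only; the helper names `kl` and `empirical_type` are ours, not from the letter:

```python
import math
import random

def kl(w, p):
    """Kullback-Leibler divergence D(w || p) in bits; zero entries of w contribute 0."""
    return sum(wy * math.log2(wy / py) for wy, py in zip(w, p) if wy > 0)

def empirical_type(samples, alphabet_size):
    """Normalized histogram (the "type") of integer samples over {0, ..., alphabet_size-1}."""
    counts = [0] * alphabet_size
    for s in samples:
        counts[s] += 1
    return [c / len(samples) for c in counts]

random.seed(0)
# Types drawn from a fair coin concentrate around (0.5, 0.5) as N grows,
# so D(type || P_s) shrinks (in probability): the law of large numbers.
for n in (10, 100, 1000):
    t = empirical_type([random.randrange(2) for _ in range(n)], 2)
    print(n, [round(x, 3) for x in t], round(kl(t, [0.5, 0.5]), 4))

# Sanov upper bound for Pr(at least 700 heads in 1000 fair tosses):
p_s, p_star, N, J = [0.5, 0.5], [0.7, 0.3], 1000, 2
d = kl(p_star, p_s)                    # about 0.119 bits
upper = (N + 1) ** J * 2 ** (-N * d)   # (N+1)^J 2^(-N D(w* || P_s))
```

The computed bound is astronomically small, matching the rough 2^(−99) estimate given in the text.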
Figure 12: Sequences of edge values rendered using four gray levels ranging from light gray to black. In each pair, one sequence is drawn i.i.d. from P_on = (0.1, 0.1, 0.3, 0.5) and the other from P_off = (0.5, 0.3, 0.1, 0.1). Although individual edge values are unreliable, taken as a whole it is clear that the top sequences from each panel are from P_off and the bottom sequences from P_on.
at large N (compared to the exponential term). It does, however, require that the distribution P_s is defined on a finite space or can be well approximated by a quantized distribution on a finite space. Sanov's theorem can be illustrated by a simple coin-tossing example (see Figure 13). Suppose we have a fair coin and want to estimate the probability of observing more than 700 heads in 1000 tosses. Then the set E is the set of probability distributions for which P(head) ≥ 0.7 (with P(head) + P(tails) = 1). The distribution generating the samples is P_s(head) = P_s(tails) = 1/2 because the
Figure 13: (Left) Sanov's theorem. The triangle represents the set of probability distributions. P_s is the distribution that generates the samples. Sanov's theorem states that the probability that a type, or empirical distribution, lies within the subset E is chiefly determined by the distribution P^* in E, which is closest to P_s. (Right) Sanov's theorem for the coin-tossing experiment. The set of probabilities is one-dimensional and is labeled by the probability P_s(head) of tossing a head. The unbiased distribution P_s is at the center, with P_s(head) = 1/2, and the closest element of the set E is P^* such that P^*(head) = 0.7.
coin is fair. The distribution in E closest to P_s is P^*(head) = 0.7, P^*(tails) = 0.3. We calculate D(P^* ‖ P_s) = 0.119. Substituting into Sanov's theorem, setting the alphabet size J = 2, we calculate that the probability of more than 700 heads in 1000 tosses is less than 2^{−119} × (1001)² ≤ 2^{−99}. In this letter, we are concerned only with sets E that involve the rewards of types. This is because the reward function depends on the data only through the type. These sets will therefore be defined by linear constraints on the types, in particular constraints such as w · α ≥ T, where α(y) = log(P_on(y)/P_off(y)), y = 1, …, J. (We define w · α = Σ_{y=1}^J w(y) α(y).) This will enable us to derive results that will not be true for arbitrary sets E. We will often, however, be concerned with the probabilities that the rewards of samples from one distribution are greater than those from a second. It is straightforward to generalize Sanov's theorem to deal with such cases.

Appendix B: Proofs for Alternative Heuristics
This appendix proves the analogous result to theorem 3 in the case of general heuristics. This is more complicated because direct application of Sanov's theorem gives bounds for Pr{S_off(n) ≥ S_on(m)}, which are complicated functions of n and m. To obtain bounds that fall off exponentially with n, m, we must first prove a convexity result concerning how the fall-off factors, such
Figure 14: (Top) D(w_T ‖ P_on) is a convex function of T with its minimum at T = D(P_on ‖ P_off). (Bottom) Similarly, D(w_T ‖ P_off) is convex with minimum at T = −D(P_off ‖ P_on).
as D(w_T ‖ P_on), vary with the threshold T. This result, combined with Sanov's theorem, will give us the bounds we need. First, we prove a convexity result, theorem 10, which will be used to obtain our main results. Theorem 11 gives the details for section 4. Theorem 12 gives the details for section 5.

Theorem 10. Let
\[
w_T(y) = \frac{P_{on}^{1-\lambda(T)}(y)\, P_{off}^{\lambda(T)}(y)}{Z(T)},
\]
where λ(T) is determined by Σ_y w_T(y) log(P_on(y)/P_off(y)) = T. Then D(w_T ‖ P_on) and D(w_T ‖ P_off) are convex functions of T that attain minima of zero at T = D(P_on ‖ P_off) and T = −D(P_off ‖ P_on), respectively (see Figure 14).
Proof.
First, observe that
\[
D(w_T \| P_{on}) = D(w_T \| P_{off}) - T,
\]  (B.1)
because of the identity
\[
\sum_y w_T(y) \log\Big\{ \frac{w_T(y)}{P_{on}(y)} \Big\} = \sum_y w_T(y) \log\Big\{ \frac{w_T(y)}{P_{off}(y)} \, \frac{P_{off}(y)}{P_{on}(y)} \Big\}.
\]
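The tilted family w_T and the identity above can be checked numerically. A minimal sketch, assuming the example edge distributions of Figure 12; the helper names and the bisection solver for λ(T) are our illustrative choices, not from the letter:

```python
import math

P_ON  = [0.1, 0.1, 0.3, 0.5]   # example edge distributions from Figure 12
P_OFF = [0.5, 0.3, 0.1, 0.1]

def kl(w, p):
    """Kullback-Leibler divergence D(w || p) in bits."""
    return sum(wy * math.log2(wy / py) for wy, py in zip(w, p) if wy > 0)

def tilted(lam):
    """w_T(y) = P_on(y)^(1-lam) P_off(y)^lam / Z: the geometric interpolation."""
    w = [pon ** (1 - lam) * poff ** lam for pon, poff in zip(P_ON, P_OFF)]
    z = sum(w)
    return [x / z for x in w]

def mean_reward(w):
    """T = sum_y w(y) log(P_on(y)/P_off(y)), the mean log-likelihood ratio under w."""
    return sum(wy * math.log2(pon / poff) for wy, pon, poff in zip(w, P_ON, P_OFF))

def lam_of_T(T, tol=1e-12):
    """Solve mean_reward(tilted(lam)) = T by bisection (the mean decreases in lam)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_reward(tilted(mid)) > T:
            lo = mid     # mean still too high: move toward P_off (larger lam)
        else:
            hi = mid
    return 0.5 * (lo + hi)

# lam = 0 gives w_T = P_on, where T attains its maximum D(P_on || P_off).
assert abs(mean_reward(tilted(0.0)) - kl(P_ON, P_OFF)) < 1e-9

# Check the identity B.1: D(w_T || P_on) = D(w_T || P_off) - T.
T = 0.5
w = tilted(lam_of_T(T))
assert abs(kl(w, P_ON) - (kl(w, P_OFF) - T)) < 1e-6
```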
We calculate
\[
\frac{d}{dT} D(w_T \| P_{on}) = \sum_y \frac{d w_T(y)}{dT} + \sum_y \frac{d w_T(y)}{dT} \log \frac{w_T(y)}{P_{on}(y)} = \sum_y \frac{d w_T(y)}{dT} \log \frac{w_T(y)}{P_{on}(y)},
\]  (B.2)
using Σ_y dw_T(y)/dT = (d/dT) Σ_y w_T(y) = 0. Substituting w_T(y) = P_on^{1−λ(T)}(y) P_off^{λ(T)}(y)/Z(T), we reexpress this as
\[
\frac{d}{dT} D(w_T \| P_{on}) = \sum_y \frac{d w_T(y)}{dT} \log \frac{P_{off}^{\lambda(T)}(y)}{P_{on}^{\lambda(T)}(y)\, Z(T)} = -\lambda(T) \sum_y \frac{d w_T(y)}{dT} \log \frac{P_{on}(y)}{P_{off}(y)}.
\]  (B.3)

We recall that Σ_y w_T(y) log(P_on(y)/P_off(y)) = T, and hence Σ_y (dw_T(y)/dT) log(P_on(y)/P_off(y)) = 1, which implies:
\[
\frac{d}{dT} D(w_T \| P_{on}) = -\lambda(T), \qquad \frac{d}{dT} D(w_T \| P_{off}) = 1 - \lambda(T).
\]  (B.4)
The second derivatives are given by
\[
\frac{d^2}{dT^2} D(w_T \| P_{on}) = -\frac{d\lambda(T)}{dT}, \qquad \frac{d^2}{dT^2} D(w_T \| P_{off}) = -\frac{d\lambda(T)}{dT}.
\]  (B.5)
Now dλ(T)/dT < 0: as the threshold T decreases, w_T becomes closer to P_off and hence λ(T) increases. (It can be checked that dλ(T)/dT is minus the inverse of the variance of log(P_on/P_off) with respect to w_T.) Hence, by equation B.5, we see that both D(w_T ‖ P_on) and D(w_T ‖ P_off) are convex functions of T. The minima of the functions occur at λ(T) = 0 and λ(T) = 1, corresponding to w = P_on and w = P_off, respectively. Hence, the minima occur at T = D(P_on ‖ P_off) and T = −D(P_off ‖ P_on), respectively.

We can use this result in combination with Sanov's theorem to put bounds on Pr{S_off(n) ≥ S_on(m)}, which fall off exponentially with n, m.

Theorem 11.
\[
\Pr\{S_{off}(n) \ge S_{on}(m)\} \;\le\; \{(n+1)(m+1)\}^{J^2 Q^2} \, 2^{-g(m;n)},
\]  (B.6)
where:
\[
g(m;n) \ge n\{D(w_{\hat H_L} \| P_{off}) + D(\psi_{\hat H_P} \| U)\} + m\{D(w_{H_L} \| P_{on}) + D(\psi_{H_P} \| P_{\Delta G})\}, \quad \text{if } H_L > \hat H_L,
\]
\[
g(m;n) \ge n\{D(w_{H_L} \| P_{off}) + D(\psi_{H_P} \| U)\} + m\{D(w_{\hat H_L} \| P_{on}) + D(\psi_{\hat H_P} \| P_{\Delta G})\}, \quad \text{if } H_L < \hat H_L.
\]  (B.7)
Proof. We start by following the proof of theorem 3 but with the definition of the set E changed to allow for different heuristics H_L, H_P. We minimize f(·, ·, ·, ·) and obtain expressions for w_{T_1}, ψ_{W_1}, w_{T_2}, ψ_{W_2}, except that the minimization no longer occurs at c = 1/2. The fall-off rate is determined by
\[
g(m;n) = n\{D(w_{T_1} \| P_{off}) + D(\psi_{W_1} \| U)\} + m\{D(w_{T_2} \| P_{on}) + D(\psi_{W_2} \| P_{\Delta G})\},
\]  (B.8)
where
\[
m\{T_2 + W_2 - H_L - H_P\} = n\{T_1 + W_1 - H_L - H_P\},
\]  (B.9)
and
\[
\lambda(T_1) + \lambda(T_2) = 1 = \mu(W_1) + \mu(W_2), \qquad \lambda(T_1) = \mu(W_1), \qquad \lambda(T_2) = \mu(W_2).
\]  (B.10)
Recall that to remove the ambiguity in H_L and H_P, we imposed the constraint that λ(H_L) = μ(H_P). We also defined Ĥ_L, Ĥ_P by the conditions λ(Ĥ_L) + λ(H_L) = 1 and μ(Ĥ_P) + μ(H_P) = 1, which implies that λ(Ĥ_L) = μ(Ĥ_P). There are two situations to consider: (1) H_L ≥ Ĥ_L, which implies H_P ≥ Ĥ_P (this follows from the equations at the end of the previous paragraph plus the fact that λ(·) and μ(·) are monotonically decreasing functions), and (2) H_L ≤ Ĥ_L, which implies H_P ≤ Ĥ_P. We claim, in case 1, that T_1, T_2 ∈ [Ĥ_L, H_L] and W_1, W_2 ∈ [Ĥ_P, H_P] (see Figure 15). Moreover,
\[
g(m;n) \ge n\{D(w_{\hat H_L} \| P_{off}) + D(\psi_{\hat H_P} \| U)\} + m\{D(w_{H_L} \| P_{on}) + D(\psi_{H_P} \| P_{\Delta G})\}.
\]  (B.11)
Figure 15: This figure illustrates the three cases of the argument in theorem 7 (one in each column). The variables T_1, T_2, W_1, W_2 are functions of m, n, and their values determine g(m;n). If we use the Bhattacharyya heuristic, then T_1 = T_2 = H*_L and W_1 = W_2 = H*_P for all values of m, n. They can therefore be thought of as "effective heuristics," and it is necessary to understand their "dynamics" as m, n vary. In theorem 11, we show that they are restricted to lie in the ranges illustrated by the left-most column (the configurations in the center and right-most columns are inconsistent with the three constraints, equations B.9 and B.10, which enables us to put bounds on g(m;n)).
Moreover, in situation 2, we claim that T_1, T_2 ∈ [H_L, Ĥ_L] and W_1, W_2 ∈ [H_P, Ĥ_P], and
\[
g(m;n) \ge n\{D(w_{H_L} \| P_{off}) + D(\psi_{H_P} \| U)\} + m\{D(w_{\hat H_L} \| P_{on}) + D(\psi_{\hat H_P} \| P_{\Delta G})\}.
\]  (B.12)
We prove the results only for situation 1 because the proofs for situation 2 are exactly analogous. The condition λ(T_1) + λ(T_2) = 1 implies that there are only three possible cases: either both T_1, T_2 ∈ [Ĥ_L, H_L] or, using the monotonicity of λ(T), that T_1 > H_L and T_2 < Ĥ_L, or T_1 < Ĥ_L and T_2 > H_L. The first case will ensure that W_1, W_2 ∈ [Ĥ_P, H_P], which solves the problem. The second requires that W_1 > H_P and W_2 < Ĥ_P, but this is inconsistent with the requirement that m{T_2 + W_2 − H_L − H_P} = n{T_1 + W_1 − H_L − H_P} (because the left-hand side is negative and the right-hand side is positive). Similarly, the third case implies that W_1 < Ĥ_P and W_2 > H_P, which again contradicts the equality. Thus, the only possible situation is the first case. Moreover, as n → ∞, we have T_1 → H_L, W_1 → H_P, T_2 → Ĥ_L, W_2 → Ĥ_P. (This is because T_1 ≤ H_L and W_1 ≤ H_P, so as n → ∞, we have T_1 + W_1 − H_L − H_P → 0.)
Theorem 12. Let this longest target subpath in a sort queue have length n and reward r_T. Then the expected number of paths in the queue with rewards greater than r_T − Δ is constant.
Proof. Let the heuristics be H_L, H_P, and consider the case when H_L > Ĥ_L and H_P > Ĥ_P (the alternative case can be solved by adapting the following argument). Suppose the longest true partial path in the queue is of length n and has reward r_T. We must consider the probabilities that paths in F_1, …, F_n have rewards higher than r_T − Δ. Applying the onion argument, for each m ≤ n, we must bound the probability that any off-path of any length has reward higher than the true reward for n segments minus Δ. Following the standard application of Sanov's theorem, we define the set
\[
E = \{(w_{off}, \psi_{off}, w_{on}, \psi_{on}) : m(w_{off} \cdot \alpha + \psi_{off} \cdot \beta - H_L - H_P) + \Delta \ge n(w_{on} \cdot \alpha + \psi_{on} \cdot \beta - H_L - H_P)\}.
\]  (B.13)
Following the proof of theorem 11, this gives thresholds T_1, T_2, W_1, W_2 (as before, these thresholds are functions of n and m), where
\[
m(T_1 + W_1 - H_L - H_P) + \Delta = n(T_2 + W_2 - H_L - H_P),
\]  (B.14)
\[
\lambda(T_1) + \lambda(T_2) = 1, \quad \mu(W_1) + \mu(W_2) = 1, \quad \lambda(T_1) = \mu(W_1), \quad \lambda(T_2) = \mu(W_2).
\]  (B.15)
The fall-off depends on
\[
g(n:m) = m\{D(w_{T_1} \| P_{off}) + D(\psi_{W_1} \| U)\} + n\{D(w_{T_2} \| P_{on}) + D(\psi_{W_2} \| P_{\Delta G})\}.
\]  (B.16)
Now, again following theorem 11, we would like to put lower bounds on g(n:m). For theorem 11, we were able to prove that T_1, T_2 ∈ [Ĥ_L, H_L] and W_1, W_2 ∈ [Ĥ_P, H_P] (for the situation where H_L > Ĥ_L; analogous results hold for the situation with H_L < Ĥ_L). The Δ term prevents these results from being true. However, for large enough n or m, the Δ term becomes negligible, and we will prove that T_1, T_2 ∈ [Ĥ_L − ε, H_L + ε] and W_1, W_2 ∈ [Ĥ_P − ε, H_P + ε]. These contain the most important terms and, as we will show, make only a constant contribution to the expected sorting cost. The contributions for small m and n are, of course, also constant. We first show that for any fixed n, the thresholds T_1, W_1 increase monotonically with m and tend to H_L, H_P as m → ∞ and, similarly, T_2, W_2 decrease monotonically with m and tend to Ĥ_L, Ĥ_P. From λ(T_1) = μ(W_1)
(see equation B.15), and the monotonicity of the functions λ(·) and μ(·), we see that the coupling between T_1 and W_1 means that they have to increase, or decrease, together. Similarly, T_2 and W_2 must either decrease, or increase, together. By equation B.14, we see that at m = 0, we have T_2(0) + W_2(0) = H_L + H_P + Δ/n, which implies, by equation B.15, that T_1(0) + W_1(0) < Ĥ_L + Ĥ_P. Equation B.14 enforces that T_1 → H_L, W_1 → H_P as m → ∞, which implies that T_2 → Ĥ_L and W_2 → Ĥ_P. Therefore, we see that T_1 + W_1 increases overall from m = 0 as m → ∞ and, conversely, T_2 + W_2 decreases. But are these changes monotonic? From equation B.14, we see that provided T_1 + W_1 < H_L + H_P, then it is inconsistent for T_1 and W_1 to decrease and T_2 and W_2 to increase. However, it is impossible for T_1 + W_1 > H_L + H_P because, by equation B.14, this would imply that T_2 + W_2 > H_L + H_P (recall that Δ > 0), which is inconsistent with equations B.15. So we conclude that the only possibility is for T_1 and W_1 to increase monotonically and T_2 and W_2 to decrease monotonically. Now select a number N_0, chosen so that N_0(ε) ≥ Δ/(2ε), and let n ≥ N_0. Then for m = 0, we see that T_2 < H_L + ε and W_2 < H_P + ε (this follows from equations B.14 and B.15). Moreover, T_1 > Ĥ_L − ε̂ and W_1 > Ĥ_P − ε̂ (where ε̂ is defined by equation B.15). As m increases, T_1, W_1 increase monotonically to H_L, H_P, and T_2, W_2 decrease monotonically to Ĥ_L, Ĥ_P. Therefore we have
\[
g(m:n) \ge m\{D(w_{\hat H_L - \hat\varepsilon} \| P_{off}) + D(\psi_{\hat H_P - \hat\varepsilon} \| U)\} + n\{D(w_{H_L + \varepsilon} \| P_{on}) + D(\psi_{H_P + \varepsilon} \| P_{\Delta G})\} \quad \forall\, n > N_0,
\]  (B.17)
which ensures that the fall-off factors are bounded below for large n. We now deal with the case of small n (n < N_0(ε)) and large m. We claim that there is a specific value M_0 such that for m > M_0, we have T_1, T_2 ∈ [Ĥ_L, H_L] and W_1, W_2 ∈ [Ĥ_P, H_P], in which case we can use the same bounds for g(m;n) as above (see equation B.17). This claim is proved by setting M_0 = Δ/(H_L + H_P − Ĥ_L − Ĥ_P) and substituting into equation B.14 to obtain Δ(T_1 + W_1 − Ĥ_L − Ĥ_P) = n(T_2 + W_2 − H_L − H_P)(H_L + H_P − Ĥ_L − Ĥ_P). The consistency conditions, imposed by equation B.15, mean that this equation's only solution is T_1 = Ĥ_L, W_1 = Ĥ_P, T_2 = H_L, and W_2 = H_P (all other possibilities can be shown to be inconsistent using equation B.15). The monotonic increase of T_1, W_1, and decrease of T_2, W_2, ensure that for m > M_0, we have T_1, T_2 ∈ [Ĥ_L, H_L] and W_1, W_2 ∈ [Ĥ_P, H_P].¹

¹ Observe that M_0 becomes infinite if we use the Bhattacharyya heuristic (when H_L = Ĥ_L and H_P = Ĥ_P). This is because the regions [Ĥ_L, H_L] and [Ĥ_P, H_P] shrink to points H*_L and H*_P and the T's and W's only reach them asymptotically. This requires a modification of the proof to obtain bounds on M for which max{|T_1 − H*_L|, |W_1 − H*_P|, |T_2 − H*_L|, |W_2 − H*_P|} < ε.
The final situation is when n < N_0(ε) and m < M_0. This is a finite case, so we do not need to obtain bounds. We can simply exhaustively count the number of segments. We now put all these results together. Let n̂ be the length of the true partial path segment in the queue (by the nature of A*, there can be only one such true path segment in the queue at any time). The expected number of queue members with rewards higher than the true segment minus Δ is obtained by summing over the possible segments in F_1, F_2, …, F_n̂. We can deal with the cases m < M_0 and n < N_0(ε) by exhaustive counting, which yields a finite number. For each n ≤ n̂, we can use the bounds given by equation B.17 and apply the arguments from theorem 4 to sum over m for fixed n, obtaining a term that decays exponentially with n. Finally, we can apply the arguments from theorem 5 to sum over n. The exponential decay factor means that this sum will converge for any value of n̂ (even as n̂ → ∞). Hence, we get constant expected sorting costs.

Acknowledgments
We acknowledge funding from NSF with award number IRI-9700446, from the Center for Imaging Sciences funded by ARO DAAH049510494, and from the Smith-Kettlewell core grant and the AFOSR grant F49620-98-1-0197 to A.L.Y. Lei Xu drew our attention to Pearl's book on heuristics, and we thank Abracadabra books for obtaining a second-hand copy for us. We thank Dan Snow and Scott Konishi for helpful discussions as the work was progressing and Davi Geiger for providing useful stimulation. David Forsyth, Jitendra Malik, Preeti Verghese, Dan Kersten, Suzanne McKee, and Song Chun Zhu gave very useful feedback and encouragement. Finally, we wish to thank Tom Ngo for drawing our attention to the work of Cheeseman and Selman.

References

Cheeseman, P., Kanefsky, B., & Taylor, W. (1991). Where the really hard problems are. In Proc. 12th International Joint Conference on A.I. (Vol. 1, pp. 331–337). San Mateo, CA: Morgan-Kaufmann.
Coughlan, J. M., Snow, D., English, C., & Yuille, A. L. (1998). Efficient optimization of a deformable template using dynamic programming. In Proceedings Computer Vision and Pattern Recognition, CVPR'98. Santa Barbara, CA.
Coughlan, J. M., Snow, D., English, C., & Yuille, A. L. (2000). Efficient deformable template detection and localization without user initialization. Computer Vision and Image Understanding, 78, 303–319.
Coughlan, J. M., & Yuille, A. L. (1999). Bayesian A* tree search with expected O(N) convergence rates for road tracking. In Proceedings EMMCVPR'99 (pp. 189–204). York, England.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Geiger, D., & Liu, T.-L. (1997). Top-down recognition and bottom-up integration for recognizing articulated objects. In M. Pellilo & E. Hancock (Eds.), Proceedings of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition. Berlin: Springer-Verlag.
Geman, D., & Jedynak, B. (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Patt. Anal. and Machine Intel., 18, 1–14.
Grimmett, G. R., & Stirzaker, D. R. (1992). Probability and random processes. Oxford: Clarendon Press.
Karp, R. M., & Pearl, J. (1983). Searching for an optimal path in a tree with random costs. Artificial Intelligence, 21, 99–116.
Knill, D. C., & Richards, W. (Eds.). (1996). Perception as Bayesian inference. Cambridge: Cambridge University Press.
Konishi, S., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (1999). Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In Proc. Int'l Conf. on Computer Vision and Pattern Recognition (pp. I-573–I-580). Fort Collins, CO.
Pearl, J. (1984). Heuristics. Reading, MA: Addison-Wesley.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice Hall.
Selman, B., & Kirkpatrick, S. (1996). Critical behaviour in the computational cost of satisfiability testing. Artificial Intelligence, 81, 273–295.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Winston, P. H. (1984). Artificial intelligence. Reading, MA: Addison-Wesley.
Yuille, A. L., & Coughlan, J. M. (1999). Convergence rates of algorithms for visual search: Detecting visual contours. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 641–647). Cambridge, MA: MIT Press.
Yuille, A. L., & Coughlan, J. M. (2000a). Fundamental limits of Bayesian inference: Order parameters and phase transitions for road tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–14.
Yuille, A. L., & Coughlan, J. M. (2000b). An A* perspective on deterministic optimization for deformable templates. Pattern Recognition, 33, 603–616.
Yuille, A. L., Coughlan, J. M., Wu, Y.-N., & Zhu, S. C. (2001). Order parameters for minimax entropy distributions: When does high level knowledge help? International Journal of Computer Vision, 41, 9–33.
Received February 23, 2001; accepted February 1, 2002.
LETTER
Communicated by John Platt
Training ν-Support Vector Regression: Theory and Algorithms

Chih-Chung Chang
[email protected]
Chih-Jen Lin
[email protected]
Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

We discuss the relation between ε-support vector regression (ε-SVR) and
ν-support vector regression (ν-SVR). In particular, we focus on properties that are different from those of C-support vector classification (C-SVC) and ν-support vector classification (ν-SVC). We then discuss some issues that do not occur in the case of classification: the possible range of ε and the scaling of target values. A practical decomposition method for ν-SVR is implemented, and computational experiments are conducted. We show some interesting numerical observations specific to regression.

1 Introduction
The ν-support vector machine (Schölkopf, Smola, Williamson, & Bartlett, 2000; Schölkopf, Smola, & Williamson, 1999) is a new class of support vector machines (SVM). It can handle both classification and regression. Properties of training ν-support vector classifiers (ν-SVC) have been discussed in Chang and Lin (2001b). In this letter, we focus on ν-support vector regression (ν-SVR). Given a set of data points, {(x_1, y_1), …, (x_l, y_l)}, such that x_i ∈ Rⁿ is an input and y_i ∈ R¹ is a target output, the primal problem of ν-SVR is as follows:

(P_ν)
\[
\min \;\; \frac{1}{2} w^T w + C\Big( \nu\varepsilon + \frac{1}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^*) \Big)
\]  (1.1)
\[
\text{subject to} \quad (w^T \phi(x_i) + b) - y_i \le \varepsilon + \xi_i, \quad y_i - (w^T \phi(x_i) + b) \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \; i = 1, \ldots, l, \quad \varepsilon \ge 0.
\]

Here, 0 ≤ ν ≤ 1, C is the regularization parameter, and training vectors x_i are mapped into a higher (maybe infinite) dimensional space by the function φ. The ε-insensitive loss function means that if w^T φ(x) is in the range of y ± ε,

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1959–1977 (2002)
1960
Chih-Chung Chang and Chih-Jen Lin
no loss is considered. This formulation is different from the original ε-SVR (Vapnik, 1998):
\[
(P_\varepsilon) \qquad \min \;\; \frac{1}{2} w^T w + \frac{C}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^*)
\]  (1.2)
\[
\text{subject to} \quad (w^T \phi(x_i) + b) - y_i \le \varepsilon + \xi_i, \quad y_i - (w^T \phi(x_i) + b) \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \; i = 1, \ldots, l.
\]
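The ε-insensitive loss underlying both primal problems can be sketched in a few lines (the function name is ours, for illustration only):

```python
def eps_insensitive_loss(y, fx, eps):
    """epsilon-insensitive loss: zero inside the tube y +/- eps, linear outside it."""
    return max(0.0, abs(y - fx) - eps)

assert eps_insensitive_loss(1.0, 1.05, 0.1) == 0.0             # inside the tube: no loss
assert abs(eps_insensitive_loss(1.0, 1.3, 0.1) - 0.2) < 1e-12  # 0.3 away, minus eps
```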
As it is difficult to select an appropriate ε, Schölkopf et al. (1999) introduced a new parameter ν that lets one control the number of support vectors and training errors. To be more precise, they proved that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. In addition, with probability 1, asymptotically, ν equals both fractions. Then there are two different dual formulations for P_ν and P_ε:
\[
\min \;\; \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + y^T (\alpha - \alpha^*)
\]
\[
\text{subject to} \quad e^T (\alpha - \alpha^*) = 0, \quad e^T (\alpha + \alpha^*) \le C\nu, \quad 0 \le \alpha_i, \alpha_i^* \le C/l, \; i = 1, \ldots, l.
\]  (1.3)
\[
\min \;\; \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + y^T (\alpha - \alpha^*) + \varepsilon \, e^T (\alpha + \alpha^*)
\]
\[
\text{subject to} \quad e^T (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C/l, \; i = 1, \ldots, l,
\]  (1.4)
where Q_ij ≡ φ(x_i)^T φ(x_j) is the kernel and e is the vector of all ones. Then the approximating function is
\[
f(x) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) \, \phi(x_i)^T \phi(x) + b.
\]
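The approximating function can be evaluated directly from the dual variables. A minimal sketch with a linear kernel; the function names and the toy coefficients are ours, not from the letter:

```python
def svr_predict(alpha, alpha_star, xs, b, kernel, x):
    """Evaluate f(x) = sum_i (alpha_i* - alpha_i) K(x_i, x) + b."""
    return sum((a_s - a) * kernel(xi, x)
               for a, a_s, xi in zip(alpha, alpha_star, xs)) + b

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Toy dual coefficients for two 1-d training points x_1 = 1.0, x_2 = 2.0:
f = svr_predict([0.0, 0.25], [0.25, 0.0], [[1.0], [2.0]], 0.5, linear_kernel, [4.0])
# (0.25 - 0)*(1*4) + (0 - 0.25)*(2*4) + 0.5 = 1.0 - 2.0 + 0.5
assert abs(f + 0.5) < 1e-12
```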
For regression, the parameter ν replaces ε, while in the case of classification, ν replaces C. In Chang and Lin (2001b), we discussed the relation between ν-SVC and C-SVC as well as how to solve ν-SVC in detail. Here, we are interested in different properties for regression. For example, the relation between ν-SVR and ε-SVR is not the same as that between ν-SVC and C-SVC. In addition, similar to the situation of C-SVC, we make sure that the inequality e^T(α + α^*) ≤ Cν can be replaced by an equality so algorithms for ν-SVC can be applied to ν-SVR. These will be the main topics of sections 2 and 3. In section 4, we discuss the possible range of ε and show that it might be easier to use ν-SVM. We also demonstrate some situations where the
Training ν-Support Vector Regression
1961
scaling of the target values y is needed. Note that these issues do not occur for classification. Finally, section 5 presents computational experiments. We discuss some interesting numerical observations that are specific to support vector regression.

2 The Relation Between ν-SVR and ε-SVR

In this section, we will derive a relationship between the solution sets of ε-SVR and ν-SVR that allows us to conclude that the inequality constraint, equation 1.3, can be replaced by an equality. In the dual formulations mentioned earlier, e^T(α + α^*) is related to Cν. Similar to Chang and Lin (2001b), we scale them to the following formulations so e^T(α + α^*) is related to ν:

(D_ν)
\[
\min \;\; \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + (y/C)^T (\alpha - \alpha^*)
\]
\[
\text{subject to} \quad e^T (\alpha - \alpha^*) = 0, \quad e^T (\alpha + \alpha^*) \le \nu, \quad 0 \le \alpha_i, \alpha_i^* \le 1/l, \; i = 1, \ldots, l.
\]  (2.1)
\[
(D_\varepsilon) \qquad \min \;\; \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + (y/C)^T (\alpha - \alpha^*) + (\varepsilon / C) \, e^T (\alpha + \alpha^*)
\]
\[
\text{subject to} \quad e^T (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le 1/l, \; i = 1, \ldots, l.
\]
For convenience, following Schölkopf et al. (2000), we represent the pair (α, α^*) as α^(*). Remember that for ν-SVC, not all 0 ≤ ν ≤ 1 lead to meaningful problems of (D_ν). Here, the situation is similar, so in the following, we define a ν^*, which will be the upper bound of the interesting interval of ν.

Definition 1. Define ν^* ≡ min_{α^(*)} e^T(α + α^*), where α^(*) is any optimal solution of (D_ε), ε = 0.
Note that 0 ≤ α_i, α_i^* ≤ 1/l implies that the optimal solution set of D_ε or D_ν is bounded. Because their objective and constraint functions are all continuous, any limit point of a sequence in the optimal solution set is in it as well. Hence, the optimal solution set of D_ε or D_ν is closed and bounded (i.e., compact). Using this property, if ε = 0, there is at least one optimal solution that satisfies e^T(α + α^*) = ν^*. The following lemma shows that for D_ε, ε > 0, at an optimal solution, one of α_i and α_i^* must be zero:

Lemma 1. If ε > 0, all optimal solutions of D_ε satisfy α_i α_i^* = 0.
1962
Chih-Chung Chang and Chih-Jen Lin
Proof. If the result is wrong, then we can reduce the values of two nonzero α_i and α_i^* such that α − α^* stays the same, but the term e^T(α + α^*) of the objective function is decreased. Hence, α^(*) is not an optimal solution, so there is a contradiction.
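The reduction used in this proof can be sketched concretely: subtracting min(α_i, α_i^*) from both entries of each pair leaves α − α^* unchanged while strictly decreasing e^T(α + α^*) (the helper name is ours):

```python
def shrink(alpha, alpha_star):
    """Subtract the common part of each pair so that at most one entry is nonzero."""
    m = [min(a, s) for a, s in zip(alpha, alpha_star)]
    return ([a - mi for a, mi in zip(alpha, m)],
            [s - mi for s, mi in zip(alpha_star, m)])

a, s = shrink([0.75, 0.0, 0.25], [0.25, 0.5, 0.25])
assert a == [0.5, 0.0, 0.0] and s == [0.0, 0.5, 0.0]
assert all(x * y == 0 for x, y in zip(a, s))   # now alpha_i * alpha_i^* = 0
# alpha - alpha^* is unchanged (0.5, -0.5, 0.0), but e^T(alpha + alpha^*) fell from 2.0 to 1.0.
assert sum(a) + sum(s) == 1.0
```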
The following lemma is similar to Chang and Lin (2001b, lemma 4):

Lemma 2. If α_1^(*) is any optimal solution of D_{ε_1}, α_2^(*) is any optimal solution of D_{ε_2}, and 0 ≤ ε_1 < ε_2, then
\[
e^T(\alpha_1 + \alpha_1^*) \ge e^T(\alpha_2 + \alpha_2^*).
\]  (2.2)
Therefore, for any optimal solution α_2^(*) of D_ε, ε > 0, we have e^T(α_2 + α_2^*) ≤ ν^*.
Unlike the case of classification, where e^T α_C is a well-defined function of C if α_C is an optimal solution of the C-SVC problem, here for ε-SVR, for the same D_ε, there may be different e^T(α_ε + α_ε^*). The main reason is that e^T α is the only linear term of the objective function of C-SVC, but for D_ε, the linear term becomes (y/C)^T(α − α^*) + (ε/C) e^T(α + α^*). We will elaborate more on this in lemma 4, where we prove that e^T(α_ε + α_ε^*) can be a function of ε if Q is a positive definite matrix. The following lemma shows the relation between D_ν and D_ε. In particular, we show that for 0 ≤ ν < ν^*, any optimal solution of D_ν satisfies e^T(α + α^*) = ν.

Lemma 3.
For any D_ν, 0 ≤ ν < ν^*, one of the following two situations must happen:
1. D_ν's optimal solution set is part of the solution set of a D_ε, ε > 0.
2. D_ν's optimal solution set is the same as that of D_ε, where ε > 0 is any one element in a unique open interval. In addition, any optimal solution of D_ν satisfies e^T(α + α^*) = ν and α_i α_i^* = 0.

Proof. The Karush-Kuhn-Tucker (KKT) condition of D_ν shows that there exist ρ ≥ 0 and b such that
\[
\begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix} \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix} + \begin{bmatrix} y/C \\ -y/C \end{bmatrix} = \begin{bmatrix} \lambda - \xi \\ \lambda^* - \xi^* \end{bmatrix} - \rho \begin{bmatrix} e \\ e \end{bmatrix} + b \begin{bmatrix} e \\ -e \end{bmatrix}.
\]  (2.3)

If ρ = 0, α^(*) is an optimal solution of D_ε, ε = 0. Then e^T(α + α^*) ≥ ν^* > ν causes a contradiction.
Training -Support Vector Regression
1963
Therefore, ρ > 0, so the KKT condition implies that e^T(α + α^*) = ν. Then there are two possible situations:

Case 1: ρ is unique. By assigning ε = ρ, all optimal solutions of D_ν are KKT points of D_ε. Hence, D_ν's optimal solution set is part of that of a D_ε. This D_ε is unique, as otherwise we could find another ρ that satisfies equation 2.3.

Case 2: ρ is not unique. That is, there are two values ρ_1 < ρ_2. Suppose ρ_1 and ρ_2 are the smallest and largest values satisfying the KKT condition. Again, the existence of ρ_1 and ρ_2 is based on the compactness of the optimal solution set. Then for any ρ_1 < ρ < ρ_2, we consider the problem D_ε, ε = ρ. Define ε_1 ≡ ρ_1 and ε_2 ≡ ρ_2. From lemma 2, since ε_1 < ε < ε_2,
\[
\nu = e^T(\alpha_1 + \alpha_1^*) \ge e^T(\alpha + \alpha^*) \ge e^T(\alpha_2 + \alpha_2^*) = \nu,
\]
where α^(*) is any optimal solution of D_ε, and α_1^(*) and α_2^(*) are optimal solutions of D_{ε_1} and D_{ε_2}, respectively. Hence, all optimal solutions of D_ε satisfy e^T(α + α^*) = ν and all KKT conditions of D_ν, so D_ε's optimal solution set is in that of D_ν. Hence, D_ν and D_ε share at least one optimal solution α^(*). For any other optimal solution ᾱ^(*) of D_ν, it is feasible for (D_ε). Since
\[
\frac{1}{2} (\bar\alpha - \bar\alpha^*)^T Q (\bar\alpha - \bar\alpha^*) + (y/C)^T (\bar\alpha - \bar\alpha^*) = \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + (y/C)^T (\alpha - \alpha^*),
\]
and e^T(ᾱ + ᾱ^*) ≤ ν, we have
\[
\frac{1}{2} (\bar\alpha - \bar\alpha^*)^T Q (\bar\alpha - \bar\alpha^*) + (y/C)^T (\bar\alpha - \bar\alpha^*) + (\varepsilon / C)\, e^T (\bar\alpha + \bar\alpha^*)
\le \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + (y/C)^T (\alpha - \alpha^*) + (\varepsilon / C)\, e^T (\alpha + \alpha^*).
\]
Therefore, all optimal solutions of D_ν are also optimal for D_ε. Hence, D_ν's optimal solution set is the same as that of D_ε, where ε > 0 is any one element in the unique open interval (ρ_1, ρ_2). Finally, as D_ν's optimal solution set is the same as or part of that of a D_ε, from lemma 1, we have α_i α_i^* = 0.

Using the above results, we now summarize a main theorem:

Theorem 1.
We have:

1. ν^* ≤ 1.

2. For any ν ∈ [ν^*, 1], D_ν has the same optimal objective value as D_ε, ε = 0.
1964
Chih-Chung Chang and Chih-Jen Lin
3. For any $\nu \in [0, \nu^*)$, lemma 3 holds. That is, one of the following two situations must happen: (a) $D_\nu$'s optimal solution set is part of the solution set of a $D_\varepsilon$, $\varepsilon > 0$, or (b) $D_\nu$'s optimal solution set is the same as that of $D_\varepsilon$, where $\varepsilon > 0$ is any element of a unique open interval.

4. For all $D_\nu$, $0 \leq \nu \leq 1$, there are always optimal solutions that occur at the equality $e^T(\alpha + \alpha^*) = \nu$.
Proof. From the explanation after definition 1, there exists an optimal solution of $D_\varepsilon$, $\varepsilon = 0$, that satisfies $e^T(\alpha + \alpha^*) = \nu^*$. This $\alpha^{(*)}$ must satisfy $\alpha_i\alpha_i^* = 0$, $i = 1, \ldots, l$, so $\nu^* \leq 1$. In addition, this $\alpha^{(*)}$ is also feasible for $D_\nu$, $\nu \geq \nu^*$. Since $D_\nu$ has the same objective function as $D_\varepsilon$, $\varepsilon = 0$, but one more constraint, this solution of $D_\varepsilon$, $\varepsilon = 0$, is also optimal for $D_\nu$. Hence, $D_\nu$ and $D_{\nu^*}$ have the same optimal objective value.

For $0 \leq \nu < \nu^*$, we already know from lemma 3 that the optimal solution occurs only at the equality. For $1 \geq \nu \geq \nu^*$, we first know that $D_\varepsilon$, $\varepsilon = 0$, has an optimal solution $\alpha^{(*)}$ that satisfies $e^T(\alpha + \alpha^*) = \nu^*$. Then we can increase some elements of $\alpha$ and $\alpha^*$ so that the new vectors $\hat\alpha^{(*)}$ satisfy $e^T(\hat\alpha + \hat\alpha^*) = \nu$ but $\hat\alpha - \hat\alpha^* = \alpha - \alpha^*$. Hence, $\hat\alpha^{(*)}$ is an optimal solution of $D_\nu$ that satisfies the equality constraint.
Therefore, the above results ensure that it is safe to solve the following problem instead of $D_\nu$:
$$(\bar D_\nu)\qquad \min\ \frac{1}{2}(\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + (y/C)^T(\alpha - \alpha^*)$$
$$e^T(\alpha - \alpha^*) = 0,\qquad e^T(\alpha + \alpha^*) = \nu,$$
$$0 \leq \alpha_i, \alpha_i^* \leq 1/l,\qquad i = 1, \ldots, l.$$
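Since $\bar D_\nu$ is a small convex quadratic program, its properties can be checked numerically with any generic QP solver. The sketch below is illustrative only (toy data and SciPy's SLSQP standing in for the paper's LIBSVM solver); it solves $\bar D_\nu$ and verifies that both equality constraints hold at the optimum.

```python
# Minimal numerical check of problem (D-bar-nu) on toy data.
# Illustrative only: SciPy's SLSQP stands in for a real SVM solver.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
l, C, nu = 6, 1.0, 0.4
x = rng.normal(size=(l, 1))
y = np.sin(x).ravel()
Q = np.exp(-(x - x.T) ** 2)                      # RBF kernel matrix

def objective(z):                                # z = [alpha, alpha_star]
    d = z[:l] - z[l:]
    return 0.5 * d @ Q @ d + (y / C) @ d

constraints = [
    {"type": "eq", "fun": lambda z: np.sum(z[:l] - z[l:])},   # e^T(a - a*) = 0
    {"type": "eq", "fun": lambda z: np.sum(z) - nu},          # e^T(a + a*) = nu
]
z0 = np.full(2 * l, nu / (2 * l))                # feasible starting point
res = minimize(objective, z0, bounds=[(0.0, 1.0 / l)] * (2 * l),
               constraints=constraints)
alpha, alpha_star = res.x[:l], res.x[l:]
print(res.success, np.sum(alpha + alpha_star))   # constraint e^T(a + a*) = nu
```

This is only a sanity check of the formulation; a decomposition method such as the one in section 5 is what makes large problems tractable.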
This result is important because existing SVM algorithms can handle these equalities easily. Note that for ν-SVC, there is also a $\nu^*$ such that for $\nu \in (\nu^*, 1]$, $D_\nu$ is infeasible; there, $\nu^* = 2\min(\#\text{positive data}, \#\text{negative data})/l$ can be easily calculated (Crisp & Burges, 2000). For ν-SVR, it is difficult to know $\nu^*$ a priori, but we do not have to worry about this: if a $\bar D_\nu$ with $\nu > \nu^*$ is solved, a solution whose objective value equals that of $D_\varepsilon$, $\varepsilon = 0$, is obtained. Then some $\alpha_i$ and $\alpha_i^*$ may both be nonzero.

Since there are always optimal solutions of the dual problem that satisfy $e^T(\alpha + \alpha^*) = \nu$, this also implies that the $\varepsilon \geq 0$ constraint of $P_\nu$ is not necessary. In the following theorem, we derive the same result directly from the primal side:

Theorem 2. Consider a problem that is the same as $P_\nu$ but without the inequality constraint $\varepsilon \geq 0$. For any $0 < \nu < 1$, any optimal solution of this problem must satisfy $\varepsilon \geq 0$.
Proof. Assume $(w, b, \xi, \xi^*, \varepsilon)$ is an optimal solution with $\varepsilon < 0$. Then for each $i$,
$$-\varepsilon - \xi_i^* \leq w^T\phi(x_i) + b - y_i \leq \varepsilon + \xi_i \tag{2.4}$$
implies
$$\xi_i + \xi_i^* + 2\varepsilon \geq 0. \tag{2.5}$$
With equation 2.4,
$$-0 - \max(0, \xi_i^* + \varepsilon) \leq -\varepsilon - \xi_i^* \leq w^T\phi(x_i) + b - y_i \leq \varepsilon + \xi_i \leq 0 + \max(0, \xi_i + \varepsilon).$$
Hence, $(w, b, \max(0, \xi + \varepsilon e), \max(0, \xi^* + \varepsilon e), 0)$ is a feasible solution of $P_\nu$. From equation 2.5,
$$\max(0, \xi_i + \varepsilon) + \max(0, \xi_i^* + \varepsilon) \leq \xi_i + \xi_i^* + \varepsilon.$$
Therefore, with $\varepsilon < 0$ and $0 < \nu < 1$,
$$\frac{1}{2}w^Tw + C\Big(\nu\varepsilon + \frac{1}{l}\sum_{i=1}^l(\xi_i + \xi_i^*)\Big) > \frac{1}{2}w^Tw + C\Big(\varepsilon + \frac{1}{l}\sum_{i=1}^l(\xi_i + \xi_i^*)\Big) \geq \frac{1}{2}w^Tw + \frac{C}{l}\sum_{i=1}^l\big(\max(0, \xi_i + \varepsilon) + \max(0, \xi_i^* + \varepsilon)\big)$$
implies that $(w, b, \xi, \xi^*, \varepsilon)$ is not an optimal solution. Therefore, any optimal solution of $P_\nu$ must satisfy $\varepsilon \geq 0$.

Next we demonstrate an example where one $D_\varepsilon$ corresponds to many $D_\nu$. Given two training points $x_1 = 0$, $x_2 = 0$ and target values $y_1 = -\Delta < 0$ and $y_2 = \Delta > 0$, when $\varepsilon = \Delta$, if the linear kernel is used and $C = 1$, $D_\varepsilon$ becomes
$$\min\ 2\Delta(\alpha_1^* + \alpha_2)$$
$$0 \leq \alpha_1, \alpha_1^*, \alpha_2, \alpha_2^* \leq 1/l,$$
$$\alpha_1 - \alpha_1^* + \alpha_2 - \alpha_2^* = 0.$$
Thus, $\alpha_1^* = \alpha_2 = 0$, so any $0 \leq \alpha_1 = \alpha_2^* \leq 1/l$ is an optimal solution. Therefore, for this $\varepsilon$, the possible $e^T(\alpha + \alpha^*)$ ranges from 0 to 1. The relation between ν and ε is illustrated in Figure 1.
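This degeneracy is easy to verify by brute force: with $x_1 = x_2 = 0$ and a linear kernel, $Q = 0$, so the objective reduces to $2\Delta(\alpha_1^* + \alpha_2)$, and every point with $\alpha_1^* = \alpha_2 = 0$ and $\alpha_1 = \alpha_2^* \in [0, 1/l]$ is optimal while $e^T(\alpha + \alpha^*)$ varies. A short check (my own toy script, not from the paper):

```python
# Brute-force check of the degenerate two-point example: all feasible points
# a1 = a2* = t, a1* = a2 = 0 share the optimal objective value 0, yet
# e^T(alpha + alpha*) = 2t sweeps the whole interval [0, 1].
import numpy as np

l, Delta = 2, 1.0

def objective(a1, a1s, a2, a2s):
    return 2 * Delta * (a1s + a2)        # D_eps objective when Q = 0

ts = np.linspace(0.0, 1.0 / l, 5)
objs = [objective(t, 0.0, 0.0, t) for t in ts]   # a1 = a2* = t
sums = [2 * t for t in ts]                       # e^T(alpha + alpha*)
print(objs)   # all equal (every t is optimal)
print(sums)   # spans 0 through 1
```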
Figure 1: An example where one $D_\varepsilon$ corresponds to different $D_\nu$. (The plot shows ν as a function of ε; here $\nu^* = 1$.)

3 When the Kernel Matrix Q Is Positive Definite
In the previous section, we showed that for ε-SVR, $e^T(\alpha_\varepsilon + \alpha_\varepsilon^*)$ may not be a well-defined function of ε, where $\alpha_\varepsilon^{(*)}$ is any optimal solution of $D_\varepsilon$. Because of this difficulty, we cannot exactly apply results on the relation between C-SVC and ν-SVC to ε-SVR and ν-SVR. In this section, we show that if $Q$ is positive definite, then $e^T(\alpha_\varepsilon + \alpha_\varepsilon^*)$ is a function of ε, and all results discussed in Chang and Lin (2001b) hold.

Assumption 1. $Q$ is positive definite.

Lemma 4. If $\varepsilon > 0$, then $D_\varepsilon$ has a unique optimal solution. Therefore, we can define a function $e^T(\alpha_\varepsilon + \alpha_\varepsilon^*)$ of ε, where $\alpha_\varepsilon^{(*)}$ is the optimal solution of $D_\varepsilon$.
Proof. Since $D_\varepsilon$ is a convex problem, if $\alpha_1^{(*)}$ and $\alpha_2^{(*)}$ are both optimal solutions, then for all $0 \leq \lambda \leq 1$,
$$\begin{aligned}
&\frac{1}{2}\big(\lambda(\alpha_1 - \alpha_1^*) + (1 - \lambda)(\alpha_2 - \alpha_2^*)\big)^T Q \big(\lambda(\alpha_1 - \alpha_1^*) + (1 - \lambda)(\alpha_2 - \alpha_2^*)\big) \\
&\qquad + (y/C)^T\big(\lambda(\alpha_1 - \alpha_1^*) + (1 - \lambda)(\alpha_2 - \alpha_2^*)\big) + (\varepsilon/C)\,e^T\big(\lambda(\alpha_1 + \alpha_1^*) + (1 - \lambda)(\alpha_2 + \alpha_2^*)\big) \\
&= \lambda\Big(\frac{1}{2}(\alpha_1 - \alpha_1^*)^T Q (\alpha_1 - \alpha_1^*) + (y/C)^T(\alpha_1 - \alpha_1^*) + (\varepsilon/C)\,e^T(\alpha_1 + \alpha_1^*)\Big) \\
&\qquad + (1 - \lambda)\Big(\frac{1}{2}(\alpha_2 - \alpha_2^*)^T Q (\alpha_2 - \alpha_2^*) + (y/C)^T(\alpha_2 - \alpha_2^*) + (\varepsilon/C)\,e^T(\alpha_2 + \alpha_2^*)\Big).
\end{aligned}$$
This implies
$$(\alpha_1 - \alpha_1^*)^T Q (\alpha_2 - \alpha_2^*) = \frac{1}{2}(\alpha_1 - \alpha_1^*)^T Q (\alpha_1 - \alpha_1^*) + \frac{1}{2}(\alpha_2 - \alpha_2^*)^T Q (\alpha_2 - \alpha_2^*). \tag{3.1}$$
Since $Q$ is positive semidefinite, $Q = L^T L$, so equation 3.1 implies $\|L(\alpha_1 - \alpha_1^*) - L(\alpha_2 - \alpha_2^*)\| = 0$. Hence, $L(\alpha_1 - \alpha_1^*) = L(\alpha_2 - \alpha_2^*)$. Since $Q$ is positive definite, $L$ is invertible, so $\alpha_1 - \alpha_1^* = \alpha_2 - \alpha_2^*$. Since $\varepsilon > 0$, from lemma 1, $(\alpha_1)_i(\alpha_1^*)_i = 0$ and $(\alpha_2)_i(\alpha_2^*)_i = 0$. Thus, we have $\alpha_1 = \alpha_2$ and $\alpha_1^* = \alpha_2^*$.

For convex optimization problems, if the Hessian is positive definite, there is a unique optimal solution. Unfortunately, here the Hessian is
$$\begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix},$$
which is only positive semidefinite. Hence, special efforts are needed to prove the uniqueness of the optimal solution.

Note that in the above proof, $L(\alpha_1 - \alpha_1^*) = L(\alpha_2 - \alpha_2^*)$ implies $(\alpha_1 - \alpha_1^*)^T Q (\alpha_1 - \alpha_1^*) = (\alpha_2 - \alpha_2^*)^T Q (\alpha_2 - \alpha_2^*)$. Since $\alpha_1^{(*)}$ and $\alpha_2^{(*)}$ are both optimal solutions, they have the same objective value, so $(y/C)^T(\alpha_1 - \alpha_1^*) + (\varepsilon/C)\,e^T(\alpha_1 + \alpha_1^*) = (y/C)^T(\alpha_2 - \alpha_2^*) + (\varepsilon/C)\,e^T(\alpha_2 + \alpha_2^*)$. This is not enough to prove that $e^T(\alpha_1 + \alpha_1^*) = e^T(\alpha_2 + \alpha_2^*)$. On the contrary, for ν-SVC, the objective function is $\frac{1}{2}\alpha^T Q \alpha - e^T\alpha$, so without the positive definiteness assumption, $\alpha_1^T Q \alpha_1 = \alpha_2^T Q \alpha_2$ already implies $e^T\alpha_1 = e^T\alpha_2$. Thus, $e^T\alpha_C$ is a function of $C$.

We then state some parallel results from Chang and Lin (2001b) without proofs:

Theorem 3. If $Q$ is positive definite, then the relation between $D_\nu$ and $D_\varepsilon$ is summarized as follows:
1. (a) For any $1 \geq \nu \geq \nu^*$, $D_\nu$ has the same optimal objective value as $D_\varepsilon$, $\varepsilon = 0$. (b) For any $\nu \in [0, \nu^*)$, $D_\nu$ has a solution that is the same as that of either one $D_\varepsilon$, $\varepsilon > 0$, or some $D_\varepsilon$, where ε is any number in an interval.

2. If $\alpha_\varepsilon^{(*)}$ is the optimal solution of $D_\varepsilon$, $\varepsilon > 0$, the relation between ν and ε is as follows: there are $0 < \varepsilon_1 < \cdots < \varepsilon_s$ and $A_i, B_i$, $i = 1, \ldots, s$, such that
$$e^T(\alpha_\varepsilon + \alpha_\varepsilon^*) = \begin{cases} \nu^* & 0 < \varepsilon \leq \varepsilon_1, \\ A_i + B_i\varepsilon & \varepsilon_i \leq \varepsilon \leq \varepsilon_{i+1},\ i = 1, \ldots, s - 1, \\ 0 & \varepsilon_s \leq \varepsilon, \end{cases}$$
where $\alpha_\varepsilon^{(*)}$ is the optimal solution of $D_\varepsilon$. We also have
$$A_i + B_i\varepsilon_{i+1} = A_{i+1} + B_{i+1}\varepsilon_{i+1},\qquad i = 1, \ldots, s - 2, \tag{3.2}$$
and
$$A_{s-1} + B_{s-1}\varepsilon_s = 0.$$
In addition, $B_i \leq 0$, $i = 1, \ldots, s - 1$.

The second result of theorem 3 shows that ν is a piecewise linear function of ε. In addition, it is always decreasing.

4 Some Issues Specific to Regression
The motivation of ν-SVR is that it may not be easy to decide the parameter ε. Hence, here we are interested in the possible range of ε. As expected, the results show that ε is related to the target values $y$.
Theorem 4. The zero vector is an optimal solution of $D_\varepsilon$ if and only if
$$\varepsilon \geq \frac{\max_{i=1,\ldots,l} y_i - \min_{i=1,\ldots,l} y_i}{2}. \tag{4.1}$$
Proof. If the zero vector is an optimal solution of $D_\varepsilon$, the KKT condition implies that there is a $b$ such that
$$\begin{bmatrix} y \\ -y \end{bmatrix} + \varepsilon\begin{bmatrix} e \\ e \end{bmatrix} - b\begin{bmatrix} e \\ -e \end{bmatrix} \geq 0.$$
Hence, $\varepsilon - b \geq -y_i$ and $\varepsilon + b \geq y_i$ for all $i$. Therefore,
$$\varepsilon - b \geq -\min_{i=1,\ldots,l} y_i \quad\text{and}\quad \varepsilon + b \geq \max_{i=1,\ldots,l} y_i,$$
so
$$\varepsilon \geq \frac{\max_{i=1,\ldots,l} y_i - \min_{i=1,\ldots,l} y_i}{2}.$$
On the other hand, if equation 4.1 is true, we can easily check that $\alpha = \alpha^* = 0$ satisfies the KKT condition, so the zero vector is an optimal solution of $D_\varepsilon$.

Therefore, when using ε-SVR, the largest value of ε to try is $(\max_{i=1,\ldots,l} y_i - \min_{i=1,\ldots,l} y_i)/2$. On the other hand, ε should not be too small; if $\varepsilon \to 0$, most data are support vectors, and overfitting tends to happen. Unfortunately, we have not been able to find an effective lower bound on ε. Intuitively, however, we would expect that it is also related to the target values $y$.
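Equation 4.1 gives a concrete upper end for an ε grid search, and scaling the targets to $[-1, +1]$ (the remedy discussed below) normalizes that range to $[0, 1]$. A small sketch of both steps; the helper names are mine, not from the paper's code:

```python
import numpy as np

def eps_upper_bound(y):
    """Largest epsilon worth trying in eps-SVR (equation 4.1): for any larger
    epsilon, the zero vector is already an optimal solution of D_eps."""
    y = np.asarray(y, dtype=float)
    return (y.max() - y.min()) / 2.0

def scale_targets(y):
    """Affinely map target values to [-1, +1]; the useful epsilon range
    then becomes [0, 1], matching the range of nu."""
    y = np.asarray(y, dtype=float)
    lo, hi = y.min(), y.max()
    return 2.0 * (y - lo) / (hi - lo) - 1.0

y = np.array([3.0, 5.0, 9.0])
print(eps_upper_bound(y))                  # 3.0
print(eps_upper_bound(scale_targets(y)))   # 1.0
```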
As the effective range of ε is affected by the target values $y$, one way to deal with this difficulty for ε-SVM is to scale the target values before training. For example, if all target values are scaled to $[-1, +1]$, then the effective range of ε will be $[0, 1]$, the same as that of ν. It may then be easier to choose ε. There are other reasons to scale the target values. For example, we encountered situations where, if the target values $y$ are not properly scaled, it is difficult to adjust the value of $C$. In particular, if $y_i$, $i = 1, \ldots, l$, are large numbers and $C$ is chosen to be a small number, the approximating function is nearly a constant.

5 Algorithms
The algorithm considered here for ν-SVR is similar to the decomposition method in Chang and Lin (2001b) for ν-SVC. The implementation is part of the software LIBSVM (Chang & Lin, 2001a). Another SVM software package that has also implemented ν-SVR is mySVM (Rüping, 2000).

The basic idea of the decomposition method is that in each iteration, the indices $\{1, \ldots, l\}$ of the training set are separated into two sets $B$ and $N$, where $B$ is the working set and $N = \{1, \ldots, l\} \setminus B$. The vector $\alpha_N$ is fixed, and then a subproblem with the variable $\alpha_B$ is solved.

The decomposition method was first proposed for SVM classification (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998). Extensions to ε-SVR are in, for example, Keerthi, Shevade, Bhattacharyya, and Murthy (2000) and Laskov (2002). The main difference among these methods is their working set selection, which may significantly affect the number of iterations. Due to the additional equality 1.3 in the ν-SVM, more care in the working set selection is needed. (Discussions on classification are in Keerthi & Gilbert, 2002, and Chang & Lin, 2001b.)

For consistency with other SVM formulations in LIBSVM, we consider $D_\nu$ in the following scaled form:
$$\min_{\bar\alpha}\ \frac{1}{2}\bar\alpha^T \bar Q \bar\alpha + \bar p^T \bar\alpha$$
$$\bar y^T \bar\alpha = \Delta_1,\qquad \bar e^T \bar\alpha = \Delta_2, \tag{5.1}$$
$$0 \leq \bar\alpha_t \leq C,\qquad t = 1, \ldots, 2l,$$
where
$$\bar Q = \begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix},\quad \bar\alpha = \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix},\quad \bar p = \begin{bmatrix} y \\ -y \end{bmatrix},\quad \bar y = \begin{bmatrix} e \\ -e \end{bmatrix},\quad \bar e = \begin{bmatrix} e \\ e \end{bmatrix},$$
and $\Delta_1 = 0$, $\Delta_2 = Cl\nu$.
That is, we replace $C/l$ by $C$. Note that because of the result in theorem 1, we are safe to use an equality constraint here in equation 5.1. Then the subproblem is as follows:
$$\min_{\bar\alpha_B}\ \frac{1}{2}\bar\alpha_B^T \bar Q_{BB} \bar\alpha_B + (\bar p_B + \bar Q_{BN}\bar\alpha_N^k)^T \bar\alpha_B$$
$$\bar y_B^T \bar\alpha_B = \Delta_1 - \bar y_N^T \bar\alpha_N, \tag{5.2}$$
$$\bar e_B^T \bar\alpha_B = \Delta_2 - \bar e_N^T \bar\alpha_N,$$
$$0 \leq (\bar\alpha_B)_t \leq C,\qquad t = 1, \ldots, q,$$
where $q$ is the size of the working set.

Following the idea of sequential minimal optimization (SMO) by Platt (1998), we use only two elements as the working set in each iteration. The main advantage is that an analytic solution of equation 5.2 can be obtained, so there is no need to use optimization software. Our working set selection follows Chang and Lin (2001b), which is a modification of the selection in the software SVMlight (Joachims, 1998). Since they dealt with more general selections whose size is not restricted to two, here we have a simpler derivation directly using the KKT condition. It is similar to that in Keerthi and Gilbert (2002, section 5).

Now if two elements $i$ and $j$ are selected but $\bar y_i \neq \bar y_j$, then $\bar y_B^T\bar\alpha_B = \Delta_1 - \bar y_N^T\bar\alpha_N$ and $\bar e_B^T\bar\alpha_B = \Delta_2 - \bar e_N^T\bar\alpha_N$ are two independent equations in two variables, so in general equation 5.2 has only one feasible point. Therefore, from $\bar\alpha^k$, the solution of the $k$th iteration, nothing can be moved. On the other hand, if $\bar y_i = \bar y_j$, the two constraints become the same equality, so there are multiple feasible solutions. Therefore, we have to keep $\bar y_i = \bar y_j$ while selecting the working set.

The KKT condition of equation 5.1 shows that there are $\rho$ and $b$ such that
$$\nabla f(\bar\alpha)_i - \rho + b\bar y_i \ \begin{cases} = 0 & \text{if } 0 < \bar\alpha_i < C, \\ \geq 0 & \text{if } \bar\alpha_i = 0, \\ \leq 0 & \text{if } \bar\alpha_i = C. \end{cases}$$
Define $\rho_1 \equiv \rho - b$ and $\rho_2 \equiv \rho + b$. If $\bar y_i = 1$, the KKT condition becomes
$$\nabla f(\bar\alpha)_i - \rho_1 \ \begin{cases} \geq 0 & \text{if } \bar\alpha_i < C, \\ \leq 0 & \text{if } \bar\alpha_i > 0. \end{cases} \tag{5.3}$$
On the other hand, if $\bar y_i = -1$, it is
$$\nabla f(\bar\alpha)_i - \rho_2 \ \begin{cases} \geq 0 & \text{if } \bar\alpha_i < C, \\ \leq 0 & \text{if } \bar\alpha_i > 0. \end{cases} \tag{5.4}$$
Hence, indices $i$ and $j$ are selected from either
$$i = \arg\min_t\{\nabla f(\bar\alpha)_t \mid \bar y_t = 1,\ \bar\alpha_t < C\},\qquad j = \arg\max_t\{\nabla f(\bar\alpha)_t \mid \bar y_t = 1,\ \bar\alpha_t > 0\}, \tag{5.5}$$
or
$$i = \arg\min_t\{\nabla f(\bar\alpha)_t \mid \bar y_t = -1,\ \bar\alpha_t < C\},\qquad j = \arg\max_t\{\nabla f(\bar\alpha)_t \mid \bar y_t = -1,\ \bar\alpha_t > 0\}, \tag{5.6}$$
depending on which one gives a larger $\nabla f(\bar\alpha)_j - \nabla f(\bar\alpha)_i$ (i.e., a larger KKT violation). If the selected $\nabla f(\bar\alpha)_j - \nabla f(\bar\alpha)_i$ is smaller than a given tolerance ($10^{-3}$ in our experiments), the algorithm stops.

Similar to the case of ν-SVC, here the zero vector cannot be the initial solution. This is due to the additional equality constraint $\bar e^T\bar\alpha = \Delta_2$ of equation 5.1. We assign both the initial $\alpha$ and $\alpha^*$ the same values: the first $\lceil \nu l/2 \rceil$ elements are $[C, \ldots, C, C(\nu l/2 - \lfloor \nu l/2 \rfloor)]^T$, while the others are zero.

It has been proved that if the decomposition method of LIBSVM is used for solving $D_\varepsilon$, $\varepsilon > 0$, then $\alpha_i\alpha_i^* = 0$ always holds during iterations (Lin, 2001, theorem 4.1). For ν-SVR we do not have this property, as $\alpha_i$ and $\alpha_i^*$ may both be nonzero during iterations.

Next, we discuss how to find $\nu^*$. We claim that if $Q$ is positive definite and $(\alpha, \alpha^*)$ is any optimal solution of $D_\varepsilon$, $\varepsilon = 0$, then
$$\nu^* = \sum_{i=1}^l |\alpha_i - \alpha_i^*|.$$
Note that by defining $\beta \equiv \alpha - \alpha^*$, $D_\varepsilon$, $\varepsilon = 0$, is equivalent to
$$\min\ \frac{1}{2}\beta^T Q \beta + (y/C)^T\beta$$
$$e^T\beta = 0,$$
$$-1/l \leq \beta_i \leq 1/l,\qquad i = 1, \ldots, l.$$
When $Q$ is positive definite, this is a strictly convex programming problem, so there is a unique optimal solution $\beta$. That is, we have a unique $\alpha - \alpha^*$ but may have multiple optimal $(\alpha, \alpha^*)$. With the conditions $0 \leq \alpha_i, \alpha_i^* \leq 1/l$, we can decrease $\alpha_i$ and $\alpha_i^*$ simultaneously until one of them becomes zero; then $|\alpha_i - \alpha_i^*|$ is the smallest possible $\alpha_i + \alpha_i^*$ for a fixed $\alpha_i - \alpha_i^*$. In the next section, we will use the RBF kernel, so if no data points are the same, $Q$ is positive definite.
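The β reformulation suggests a direct way to estimate $\nu^*$ numerically: solve the strictly convex problem in β and sum $|\beta_i|$. The sketch below does this with SciPy on synthetic data (illustrative data and a generic solver; the paper computes $\nu^*$ inside LIBSVM instead).

```python
# Estimate nu* by solving the beta problem (D_eps, eps = 0) directly.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
l, C = 8, 1.0
x = rng.normal(size=(l, 1))
y = np.cos(x).ravel()
Q = np.exp(-(x - x.T) ** 2) + 1e-10 * np.eye(l)   # PD when points are distinct

res = minimize(lambda b: 0.5 * b @ Q @ b + (y / C) @ b,
               np.zeros(l),                        # feasible starting point
               bounds=[(-1.0 / l, 1.0 / l)] * l,
               constraints=[{"type": "eq", "fun": np.sum}])  # e^T beta = 0
nu_star = np.abs(res.x).sum()                      # nu* = sum_i |beta_i|
print(nu_star)   # at most 1, as theorem 1 requires
```

Since each $|\beta_i| \leq 1/l$, the sum never exceeds 1, matching $\nu^* \leq 1$ from theorem 1.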
6 Experiments
In this section, we present numerical comparisons between ν-SVR and ε-SVR. We test the RBF kernel with $Q_{ij} = e^{-\|x_i - x_j\|^2/n}$, where $n$ is the number of attributes of the training data. The computational experiments for this section were done on a Pentium III-500 with 256 MB RAM using the gcc compiler. Our implementation is part of the software LIBSVM, which includes both ν-SVR and ε-SVR using the decomposition method. We used 100 MB as the cache size of LIBSVM for storing recently used $Q_{ij}$. The shrinking heuristic in LIBSVM is turned off for an easier comparison.

We test problems from various collections. Problems housing, abalone, mpg, pyrimidines, and triazines are from the Statlog collection (Michie, Spiegelhalter, & Taylor, 1994). From StatLib (http://lib.stat.cmu.edu/datasets), we select bodyfat, space ga, and cadata. Problem cpusmall is from the Delve archive, which collects data for evaluating learning in valid experiments (http://www.cs.toronto.edu/~delve). Problem mg is a Mackey-Glass time series, where we use the same settings as the experiments in Flake and Lawrence (2002); thus, we predict 85 time steps into the future with six inputs. For these problems, some data entries have missing attributes, so we remove them before conducting experiments. Both the target and attribute values of these problems are scaled to $[-1, +1]$. Hence, the effective range of ε is $[0, 1]$.

For each problem, we first solve its $D_\nu$ form using $\nu = 0.2$, 0.4, 0.6, and 0.8. Then we solve $D_\varepsilon$ with $\varepsilon = r$ for comparison. Tables 1 and 2 present the number of training data ($l$), the number of iterations, and the training time using $C = 1$ and $C = 100$, respectively. In the last column, we also list the number $\nu^*$ of each problem. From both tables, we have the following observations:

1. Following the theoretical results, we see that as ν increases, its corresponding ε decreases.

2. If $\nu \leq \nu^*$, then as ν increases, the number of iterations of ν-SVR and its corresponding ε-SVR increases. Note that the case $\nu \leq \nu^*$ covers all results in Table 1 and most of Table 2. Our explanation is as follows: when ν is larger, there are more support vectors, so during iterations the number of nonzero variables is also larger. Hsu and Lin (2002) pointed out that if during iterations there are more nonzero variables than at the optimum, the decomposition method will take many iterations to reach the final face. Here, a face means the subspace of free variables. An example is in Figure 2, where we plot the number of free variables against the number of iterations. To be more precise, the y-axis is the number of indices with $0 < \alpha_i < C$ or $0 < \alpha_i^* < C$, where $(\alpha, \alpha^*)$ is the solution at one iteration. We can see that for solving either ε-SVR or ν-SVR, it takes many iterations to identify the optimal face. From the aspect of ε-SVR, we can consider $\varepsilon e^T(\alpha + \alpha^*)$ as a penalty term in the objective function of $D_\varepsilon$.
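The first observation (larger ν corresponds to smaller ε, and ν is nonincreasing in ε) can be reproduced with any LIBSVM-based ε-SVR. Since optimal solutions satisfy $\alpha_i\alpha_i^* = 0$, we have $\alpha_i + \alpha_i^* = |\alpha_i - \alpha_i^*|$, and scikit-learn's `SVR` (a wrapper around LIBSVM) exposes $\alpha - \alpha^*$ as `dual_coef_`. The following sketch uses synthetic data and parameter choices of my own, not the paper's benchmark setup:

```python
# Sweep epsilon and recover the corresponding scaled nu = e^T(a + a*) / (C l).
import numpy as np
from sklearn.svm import SVR   # wraps the LIBSVM solver described here

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=150)

C = 1.0
nus = []
for eps in (0.05, 0.1, 0.2, 0.4, 0.8):
    model = SVR(C=C, epsilon=eps, gamma=1.0).fit(X, y)
    # alpha_i * alpha_i^* = 0 at the optimum, so |alpha_i - alpha_i^*|
    # equals alpha_i + alpha_i^*; dual_coef_ holds alpha - alpha^*.
    nus.append(np.abs(model.dual_coef_).sum() / (C * len(X)))

print(nus)   # nonincreasing in eps, each value in [0, 1]
```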
Table 1: Solving ν-SVR and ε-SVR: C = 1 (time in seconds).

Problem       l       ν     ε          ν Iterations   ε Iterations   ν Time    ε Time    ν*
pyrimidines   74      0.2   0.135131   181            145            0.03      0.02      0.817868
                      0.4   0.064666   175            156            0.03      0.04
                      0.6   0.028517   365            331            0.04      0.03
                      0.8   0.002164   695            460            0.05      0.05
mpg           392     0.2   0.152014   988            862            0.19      0.16      0.961858
                      0.4   0.090124   1753           1444           0.32      0.27
                      0.6   0.048543   2115           1847           0.40      0.34
                      0.8   0.020783   3046           2595           0.56      0.51
bodyfat       252     0.2   0.012700   1112           1047           0.14      0.13      0.899957
                      0.4   0.006332   2318           2117           0.25      0.23
                      0.6   0.002898   3553           2857           0.37      0.31
                      0.8   0.001088   4966           3819           0.48      0.42
housing       506     0.2   0.161529   799            1231           0.30      0.34      0.946593
                      0.4   0.089703   1693           1650           0.53      0.45
                      0.6   0.046269   1759           2002           0.63      0.60
                      0.8   0.018860   2700           2082           0.85      0.65
triazines     186     0.2   0.380308   175            116            0.13      0.10      0.900243
                      0.4   0.194967   483            325            0.18      0.15
                      0.6   0.096720   422            427            0.20      0.18
                      0.8   0.033753   532            513            0.23      0.23
mg            1385    0.2   0.366606   1928           1542           1.58      1.18      0.992017
                      0.4   0.216329   3268           3294           2.75      2.35
                      0.6   0.124792   3400           3300           3.36      2.76
                      0.8   0.059115   4516           4296           4.24      3.65
abalone       4177    0.2   0.168812   4189           3713           15.68     11.69     0.994775
                      0.4   0.094959   8257           7113           30.38     22.88
                      0.6   0.055966   12,483         12,984         42.74     37.41
                      0.8   0.026165   18,302         18,277         65.98     54.04
space ga      3107    0.2   0.087070   5020           4403           10.47     7.56      0.990468
                      0.4   0.053287   8969           7731           18.70     14.44
                      0.6   0.032080   12,261         10,704         26.27     20.72
                      0.8   0.014410   16,311         13,852         32.71     27.19
cpusmall      8192    0.2   0.086285   8028           7422           82.66     59.14     0.990877
                      0.4   0.054095   16,585         15,240         203.20    120.48
                      0.6   0.031285   22,376         19,126         283.71    163.96
                      0.8   0.013842   28,262         24,840         355.59    213.25
cadata        20,640  0.2   0.294803   12,153         10,961         575.11    294.53    0.997099
                      0.4   0.168370   24,614         20,968         1096.87   574.77
                      0.6   0.097434   35,161         30,477         1530.01   851.91
                      0.8   0.044636   42,709         40,652         1883.35   1142.27
Hence, when ε is larger, fewer $\alpha_i, \alpha_i^*$ are nonzero; that is, there are fewer support vectors.

3. There are a few problems (e.g., pyrimidines, bodyfat, and triazines) where $\nu \geq \nu^*$ is encountered. When this happens, their ε should be zero, but due to numerical inaccuracy, the output ε values are only small positive numbers. Then, for different $\nu \geq \nu^*$, when solving their corresponding $D_\varepsilon$, the number of
Table 2: Solving ν-SVR and ε-SVR: C = 100 (time in seconds).

Problem       l       ν      ε          ν Iterations   ε Iterations   ν Time      ε Time      ν*
pyrimidines   74      0.2*   0.000554   29,758         11,978         0.63        0.27        0.191361
                      0.4*   0.000317   30,772         11,724         0.65        0.27
                      0.6*   0.000240   27,270         11,802         0.58        0.27
                      0.8*   0.000146   20,251         12,014         0.44        0.28
mpg           392     0.2    0.121366   85,120         74,878         9.53        8.26        0.876646
                      0.4    0.069775   210,710        167,719        24.50       19.32
                      0.6    0.032716   347,777        292,426        42.08       34.82
                      0.8    0.007953   383,164        332,725        47.61       40.86
bodyfat       252     0.2    0.001848   238,927        164,218        16.80       11.58       0.368736
                      0.4*   0.000486   711,157        323,016        50.77       23.24
                      0.6*   0.000291   644,602        339,569        46.23       24.33
                      0.8*   0.000131   517,370        356,316        37.28       25.55
housing       506     0.2    0.092998   154,565        108,220        24.21       16.87       0.815085
                      0.4    0.051726   186,136        182,889        30.49       29.51
                      0.6    0.026340   285,354        271,278        48.62       45.64
                      0.8    0.002161   397,115        284,253        69.16       49.12
triazines     186     0.2    0.193718   16,607         22,651         0.94        1.20        0.582147
                      0.4    0.074474   34,034         47,205         1.89        2.52
                      0.6*   0.000381   106,621        51,175         5.69        2.84
                      0.8*   0.000139   68,553         50,786         3.73        2.81
mg            1385    0.2    0.325659   190,065        195,519        87.99       89.20       0.966793
                      0.4    0.189377   291,315        299,541        139.10      141.73
                      0.6    0.107324   397,449        407,159        194.81      196.14
                      0.8    0.043439   486,656        543,520        241.20      265.27
abalone       4177    0.2    0.162593   465,922        343,594        797.48      588.92      0.988298
                      0.4    0.091815   901,275        829,951        1577.83     1449.37
                      0.6    0.053244   1,212,669      1,356,556      2193.97     2506.52
                      0.8    0.024670   1,680,704      1,632,597      2970.98     2987.30
space ga      3107    0.2    0.078294   510,035        444,455        595.42      508.41      0.984568
                      0.4    0.048643   846,873        738,805        1011.82     867.32
                      0.6    0.028933   1,097,732      1,054,464      1362.67     1268.40
                      0.8    0.013855   1,374,987      1,393,044      1778.38     1751.39
cpusmall      8192    0.2    0.070568   977,374        863,579        4304.42     3606.35     0.978351
                      0.4    0.041640   1,783,725      1,652,396      8291.12     7014.32
                      0.6    0.022280   2,553,150      2,363,251      11,673.62   10,691.95
                      0.8    0.009616   3,085,005      2,912,838      14,784.05   12,737.35
cadata        20,640  0.2    0.263428   1,085,719      1,081,038      16,003.55   15,475.36   0.995602
                      0.4    0.151341   2,135,097      2,167,643      31,936.05   31,474.21
                      0.6    0.087921   2,813,070      2,614,179      42,983.89   38,580.61
                      0.8    0.039595   3,599,953      3,379,580      54,917.10   49,754.27

Note: * Experiments where $\nu \geq \nu^*$.
iterations is about the same, as we essentially solve the same problem: $D_\varepsilon$ with $\varepsilon = 0$. On the other hand, surprisingly, we see that in this case, as ν increases, it becomes easier to solve $D_\nu$, with fewer iterations. Its solution is optimal for $D_\varepsilon$, $\varepsilon = 0$, but the larger ν is, the more pairs $(\alpha_i, \alpha_i^*)$ are both nonzero. Therefore,
Figure 2: Iterations and number of free variables ($\nu \leq \nu^*$).
contrary to the general case $\nu \leq \nu^*$, where it is difficult to identify free variables during iterations and move them back to their bounds at the optimum, there is no strong need to do so. To be more precise, at the beginning of the decomposition method, many variables become nonzero as we modify them to minimize the objective function. If, in the end, most of these variables are still nonzero, we do not need the effort of putting them back to their bounds. In Figure 3, we plot the number of free variables against the number of iterations using problem triazines with $\nu = 0.6$, $0.8$ and $\varepsilon = 0.000139 \approx 0$.
Figure 3: Iterations and number of free variables ($\nu \geq \nu^*$).
It can be clearly seen that for large ν, the decomposition method identifies the optimal face more quickly, so the total number of iterations is smaller.

4. When $\nu \leq \nu^*$, we observe only minor differences in the number of iterations for ε-SVR and ν-SVR. In Table 1, for nearly all problems, ν-SVR takes a few more iterations than ε-SVR. However, in Table 2, for problems triazines and mg, ν-SVR is slightly faster. Note that there are several dissimilarities between the algorithms for ν-SVR and ε-SVR. For example, ε-SVR generally starts from the zero vector, but ν-SVR has to use a nonzero initial solution. For the working set selection, the two indices selected for ε-SVR can be any $\alpha_i$ or $\alpha_i^*$, but the two equality constraints lead to the selections 5.5 and 5.6 for ν-SVR, where the set comes from either $\{\alpha_1, \ldots, \alpha_l\}$ or $\{\alpha_1^*, \ldots, \alpha_l^*\}$. Furthermore, as the stopping tolerance $10^{-3}$ might be too loose in some cases, the ε obtained after solving $D_\nu$ may differ a little from the theoretical value. Hence, we actually solve two problems with slightly different optimal solution sets. All of these factors may contribute to the differences in iterations.

5. We see that it is much harder to solve problems using $C = 100$ than using $C = 1$. The difference is even more dramatic than in the case of classification. We do not have a good explanation for this observation.

7 Conclusion
We have shown that the inequality in the ν-SVR formulation can be treated as an equality. Hence, algorithms similar to those for ν-SVC can be applied to ν-SVR. In addition, in section 6, we showed similarities and dissimilarities in the numerical properties of ε-SVR and ν-SVR. We think that in the future, the relation between $C$ and ν (or $C$ and ε) should be investigated in more detail. Model selection for these parameters is also an important issue.

References

Chang, C.-C., & Lin, C.-J. (2001a). LIBSVM: A library for support vector machines [Computer software]. Available on-line: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chang, C.-C., & Lin, C.-J. (2001b). Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9), 2119–2147.

Crisp, D. J., & Burges, C. J. C. (2000). A geometric interpretation of ν-SVM classifiers. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.

Flake, G. W., & Lawrence, S. (2002). Efficient SVM regression training with SMO. Machine Learning, 46, 271–290.

Hsu, C.-W., & Lin, C.-J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46, 291–314.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Keerthi, S. S., & Gilbert, E. G. (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46, 351–360.

Keerthi, S. S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2000). Improvements to SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11(5), 1188–1193.

Laskov, P. (2002). An improved decomposition algorithm for regression support vector machines. Machine Learning, 46, 315–350.

Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12, 1288–1298.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Englewood Cliffs, NJ: Prentice Hall. Available on-line at anonymous ftp: ftp.ncc.up.pt/pub/statlog/.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97. New York: IEEE.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Rüping, S. (2000). mySVM—another one of those support vector machines [Computer software]. Available on-line: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.

Schölkopf, B., Smola, A. J., & Williamson, R. (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Received April 6, 2001; accepted January 7, 2002.
LETTER
Communicated by Noburo Murata
On the Problem in Model Selection of Neural Network Regression in Overrealizable Scenario

Katsuyuki Hagiwara
[email protected]
Faculty of Physics Engineering, Mie University, Tsu, 514-8507, Japan

In considering a statistical model selection of neural networks and radial basis functions under an overrealizable case, the problem of unidentifiability emerges. Because the model selection criterion is an unbiased estimator of the generalization error based on the training error, this article analyzes the expected training error and the expected generalization error of neural networks and radial basis functions in overrealizable cases and clarifies the difference from regular models, for which identifiability holds. As a special case of an overrealizable scenario, we assumed a gaussian noise sequence as training data. In the least-squares estimation under this assumption, we first formulated the problem, in which the calculation of the expected errors of unidentifiable networks is reduced to the calculation of the expectation of the supremum of the $\chi^2$ process. Under this formulation, we gave an upper bound of the expected training error and a lower bound of the expected generalization error, where the generalization is measured at a set of training inputs. Furthermore, we gave stochastic bounds on the training error and the generalization error. The obtained upper bound of the expected training error is smaller than in regular models, and the lower bound of the expected generalization error is larger than in regular models. The result tells us that the degree of overfitting in neural networks and radial basis functions is higher than in regular models. Correspondingly, it also tells us that the generalization capability is worse than in the case of regular models. The article may be enough to show a difference between neural networks and regular models in the context of least-squares estimation in a simple situation.
This is a first step in constructing a model selection criterion for an overrealizable case. Further important problems in this direction are also included in this article.

1 Introduction
In wide-ranging applications of layered neural networks and radial basis functions, we often encounter the problem of determining an appropriate network size based only on a set of available data of finite size. One of the techniques to solve this problem is statistical model selection, in which the

Neural Computation 14, 1979–2002 (2002) © 2002 Massachusetts Institute of Technology
1980
Katsuyuki Hagiwara
optimal size is determined according to a model selection criterion. In the field of neural networks, several model selection criteria, such as the generalized prediction error (GPE; Moody, 1992) and the network information criterion (NIC; Murata, Yoshizawa, & Amari, 1994), have been proposed. These are generalized criteria of the final prediction error (FPE; Akaike, 1969) or the Akaike information criterion (AIC; Akaike, 1973). For example, under maximum likelihood estimation, NIC reduces to AIC when the true probability distribution is in the family of statistical models with networks (Murata et al., 1994). Under least-squares estimation, NIC has a form similar to the predicted squared error (PSE; Barron, 1984) or, equivalently, FPE if a target true function is in the family of networks. If a target is in the family of networks, then it is said to be realizable, or a realizable case. It is said to be an overrealizable case if the target is realizable in terms of the assumed family of networks and the family is not minimal; in other words, a network that represents the true function is redundant. Although the realizability-overrealizability assumption looks too tight in real-world problems, this is not a problem in the spirit of AIC, as demonstrated by many successful applications of AIC. The reason is as follows. If a family of networks is relatively large, then a network in the family partially fits noise components in the data without tracing the detailed shape of the true function, even if the family is not large enough to represent the actual function exactly. This is because the detailed shape of the actual function cannot be identified from the available data of finite size. In such a situation, the network reduces the empirical error throughout the estimation procedure. This is the reason that model selection is needed. Therefore, when model selection is needed, it is plausible to assume realizability or overrealizability.
Furthermore, a criterion under the realizable-overrealizable scenario reduces to a simple form, as seen in AIC. This enables us to understand the meaning of the criterion and to reduce computational cost, which increases the usefulness of the criterion in practical applications. Hence, a criterion under a realizable or overrealizable scenario is important in real-world problems. Unfortunately, it has been pointed out that AIC, and thus NIC in the overrealizable scenario, cannot be derived for layered neural networks and radial basis functions by means of asymptotic expansion (Hagiwara, Toda, & Usui, 1993). This is because the true connection weights, with which the network represents the true target function, are not identifiable (Hagiwara et al., 1993). The unidentifiability leads to degeneration of the Fisher information matrix of a network and a failure of the asymptotic normality of the estimators. For this reason, the asymptotic expansion cannot be applied to neural networks in deriving a criterion. Note that the difficulty also emerges in the minimum description length (MDL; Rissanen, 1986) and the Bayesian information criterion (BIC; Akaike, 1977; Schwarz, 1978) because these criteria are based on the asymptotic normality of the maximum likelihood estimators. Because identifiability plays a central role in statistics, it is hard to handle neural networks in a statistical context in general, and we need a new framework. Note that the unidentifiability problem is not original
Neural Network Regression in Overrealizable Scenario
for neural networks. The problem also arises in handling mixture models or autoregressive moving average (ARMA) processes. In statistics, there are several works on statistical tests in mixture models and ARMA processes (Hartigan, 1985; Dacunha-Castelle & Gassiat, 1997, 1999). A number of authors have approached the problem in the field of artificial neural networks (Hayasaka, Toda, Usui, & Hagiwara, 1996; Fukumizu, 1997; Fukumizu & Amari, 2000; Watanabe, 2001; Amari & Ozeki, 2001). In the Bayes estimation framework, Watanabe (2001) has found an algebraic method to handle statistical models without identifiability. He has shown that the Bayesian generalization error of neural networks without identifiability is smaller than that of regular statistical models, for which identifiability holds. Here, the Bayesian generalization error is measured as the Kullback-Leibler distance between the predictive distribution and the true distribution. On the other hand, in the maximum likelihood-least-squares estimation framework, Hayasaka et al. (1996) and Fukumizu (1997) have independently analyzed special models that reflect the properties of the hierarchical structure of neural networks. In their results, the generalization errors of their models are larger than those of regular models, where the generalization error is defined by the Kullback-Leibler distance between the estimated statistical model and the true distribution under maximum likelihood estimation, and by the squared norm between the estimated network and the true function under least-squares estimation. Also, Amari and Ozeki (2001) have considered a simple cone model under maximum likelihood estimation and have shown that the generalization errors of their models are smaller than those of regular models. These works tell us that the statistical properties of neural networks are different from those of regular models.
It should be noted that unidentifiability also affects the training process of neural networks (Saad & Solla, 1995). Saad and Solla (1995) have found that a plateau caused by linearly dependent outputs of hidden nodes exists in training. The output weights under the linear dependency are also unidentifiable (Hagiwara et al., 1993). For this problem, Amari and Ozeki (2001) have shown that natural gradient learning (Amari, 1998) can escape from the plateau. Moreover, Fukumizu and Amari (2000) have considered the existence of local minima and plateaus in general. Research on the unidentifiable case is progressing rapidly. In general, an AIC-type model selection criterion is an unbiased estimator of the generalization error, formed as the training error plus a bias-correction term. The latter term is often called a penalty term because it penalizes model complexity; it is obtained as the difference between the expected generalization error and the expected training error. Therefore, to construct a model selection criterion, it is important to analyze the expected training error and the expected generalization error. Recently, in the maximum likelihood-least-squares estimation setting, an upper bound on the expected training error of radial basis functions and a type of three-layer perceptron has been given for a gaussian noise sequence (Hagiwara, Hayasaka, Toda, Usui, & Kuno, 2001). Unlike ordinary asymptotic normality arguments, the key tool
used in the work is extreme value theory (e.g., Leadbetter, Lindgren, & Rootzén, 1983; Resnick, 1987), the theory of the maxima of random processes and random sequences. However, Hagiwara et al. (2001) did not include a result on the expected generalization error. In this article, we first give a formulation of the problem in which the calculation of the expected training and generalization errors reduces completely, in a special case, to the calculation of the expectation of the maxima of a random process. Although the formulation is implicitly included in Hagiwara et al. (2001), we give it in a more explicit way here. This formulation enables us to understand the problem more clearly and to calculate the generalization error in a special case. Furthermore, we show results on the probabilistic behavior of the training error and the generalization error. Finally, a discussion of this problem is provided together with some numerical simulations.

2 Preliminary

2.1 Nonlinear Regression with Neural Networks. Let us denote a set of $T$ input-output data by $D_T = \{(x_t, y_t) : x_t \in \mathbb{R}^m, y_t \in \mathbb{R}, 1 \le t \le T\}$, in which each output $y_t$ for a corresponding input $x_t$ is generated by the rule $y_t = h(x_t) + \xi_t$, $t = 1, \dots, T$, where $\xi_1, \dots, \xi_T$ are independently and identically distributed (i.i.d.) samples from a probability distribution with mean 0 and variance $\sigma^2$. Here, $x_1, \dots, x_T$ are i.i.d. samples according to $Q(x)$. Thus, $D_T$ can be understood as a set of $T$ independent samples from a common probability distribution $Q(x, y) = Q(y \mid x) Q(x)$ with density $q(x, y) = q(y \mid x) q(x)$. In the rule, $h$ is called a true function. We define a network as a mapping represented by a linear combination of functions $g_{b_i} : \mathbb{R}^m \to \mathbb{R}$, $i = 1, \dots, n$. The network output for $x \in \mathbb{R}^m$ is thus given by
\[ f_w(x) = f_{c,b}(x) = \sum_{i=1}^{n} c_i g_{b_i}(x), \tag{2.1} \]
where $w = (c, b) \in \Omega = \mathbb{R}^n \times \mathcal{B}^n$ is the parameter vector whose components are $c = (c_1, \dots, c_n)$, $c_i \in \mathbb{R}$, and $b = (b_1, \dots, b_n)$, $b_i \in \mathcal{B}$. Note that three-layer perceptrons and radial basis functions are written in this form. For example, if we set $b_i = (a_i, t_i^2) \in \mathcal{B} = \mathbb{R}^m \times \mathbb{R}_+$ and
\[ g_{b_i}(x) = \exp\left\{ -\frac{1}{2 t_i^2} \| x - a_i \|^2 \right\}, \tag{2.2} \]
then equation 2.1 is the output of a gaussian radial basis function. Here, $\mathbb{R}_+ = \{x : 0 < x < \infty\}$ and $\|\cdot\|$ is the Euclidean norm. If we set $b_i = (a_i, t_i) \in \mathcal{B} = \mathbb{R}^{m+1}$ and
\[ g_{b_i}(x) = \frac{1}{1 + \exp\{-(a_i \cdot x + t_i)\}}, \tag{2.3} \]
where $\cdot$ denotes the inner product, then equation 2.1 is the output of a three-layer perceptron with a sigmoidal activation function in the hidden layer. Here, we consider nonlinear regression using a neural network of the form of equation 2.1. In the least-squares estimation, for training data $D_T$ and a network $f_w$, the empirical squared error is defined by
\[ r_{\mathrm{emp}}(w) = r_{\mathrm{emp}}(c, b) = \frac{1}{T} \sum_{t=1}^{T} \{ y_t - f_w(x_t) \}^2. \tag{2.4} \]
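To make the setup concrete, here is a minimal numpy sketch (all names ours) of a gaussian radial basis network of the form of equations 2.1 and 2.2 and its empirical squared error of equation 2.4:

```python
import numpy as np

def gaussian_rbf(x, a, t):
    """Basis function g_{b_i} of equation 2.2 with b_i = (a_i, t_i^2)."""
    return np.exp(-np.sum((x - a) ** 2, axis=-1) / (2.0 * t ** 2))

def network_output(x, c, a, t):
    """f_w(x) of equation 2.1: a linear combination of n basis functions."""
    return sum(c[i] * gaussian_rbf(x, a[i], t[i]) for i in range(len(c)))

def empirical_error(y, x, c, a, t):
    """r_emp(w) of equation 2.4: mean squared residual over the T samples."""
    return np.mean((y - network_output(x, c, a, t)) ** 2)

rng = np.random.default_rng(0)
T, n, m = 100, 3, 1
x = rng.uniform(-5, 5, size=(T, m))
y = rng.normal(0.0, 1.0, size=T)      # pure gaussian noise, as in assumption 1 below
c = rng.normal(size=n)
a = rng.uniform(-5, 5, size=(n, m))
t = np.ones(n)
print(empirical_error(y, x, c, a, t))
```

The least-squares estimation discussed next minimizes this quantity over the parameters $(c, b)$.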
The minimization of the empirical squared error is defined in the following manner. First, we define $r_{\mathrm{emp}}(\hat{c}(b), b) = \min_c r_{\mathrm{emp}}(c, b)$ for a fixed $b \in \mathcal{B}^n$. Then the minimum of the empirical squared error is defined by $r_{\mathrm{emp}}(\hat{w}) = r_{\mathrm{emp}}(\hat{c}, \hat{b}) = \inf_b r_{\mathrm{emp}}(\hat{c}(b), b)$. We call $r_{\mathrm{emp}}(\hat{w})$ a training error. The meaning of this minimization will be shown later.

2.2 Unidentifiability Problem. Let us consider a target function of the form $h(x) = \sum_{i=1}^{n^*} c_i^* g_{b_i^*}(x)$, $x \in \mathbb{R}^m$, where $c_i^* \neq 0$ for all $i$ and $b_1^*, \dots, b_{n^*}^*$ are mutually distinct. We assume a network of the form of equation 2.1 to represent such an $h$. If $n > n^*$, then $h$ is overrealizable, and there are $n - n^*$ redundant units for representing $h$. In this case, if we set $c_i = 0$ for a redundant unit, then the corresponding $b_i$ is not uniquely determined in representing $h$. Or if we set $b_{i_1} = b_{i_2}$ for the $i_1$th and $i_2$th units, then $c_{i_1}$ and $c_{i_2}$ are likewise not uniquely determined. These facts imply that the true connection weights are not identifiable under the overrealizable scenario. The effect of the unidentifiability on model selection is discussed in Hagiwara et al. (2001).

2.3 Definition of the Expected Errors. The expected training error is the expectation of $r_{\mathrm{emp}}(\hat{w})$ with respect to the distribution of $D_T$ and is denoted by
\[ R_{\mathrm{emp}}(n, T) = E_{D_T}\{ r_{\mathrm{emp}}(\hat{w}) \} = \int \cdots \int r_{\mathrm{emp}}(\hat{w}) \prod_{t=1}^{T} q(x_t, y_t)\, dx_t\, dy_t. \tag{2.5} \]
The generalization error is defined by
\[ r(w) = E\{ (z - f_w(u))^2 \} = \int\!\!\int (z - f_w(u))^2 q(u, z)\, du\, dz. \tag{2.6} \]
This measure of the generalization error is equivalent, up to a constant, to the $L_2$ distance between the true function $h$ and $f_w$, which makes it suitable for considering a model selection criterion. The expected generalization error is then defined as the expectation of the generalization error at $\hat{w}$ with respect to the distribution of $D_T$:
\[ R(n, T) = E_{D_T}\{ r(\hat{w}) \}. \tag{2.7} \]
Note that the penalty term in an AIC-type model selection criterion is determined by the difference between the expected training error and the expected generalization error, that is, $R(n, T) - R_{\mathrm{emp}}(n, T)$. To analyze the expected training error and the expected generalization error in an overrealizable scenario and to clarify the difference from the regular situation quantitatively, hereafter we use the following assumption on the output data:

Assumption 1. $h(x) = 0$ for any $x$ and $\xi_t \sim N(0, \sigma^2)$, that is, a gaussian distribution with mean 0 and variance $\sigma^2$. Thus, $y_t \sim N(0, \sigma^2)$; the assumed data are a gaussian noise sequence.
Because $h(x) = 0$, we have $h \in \{ f_w : w \in \Omega \}$ for any $n\ (\ge 1)$. Hence, the true connection weights are unidentifiable for any $n\ (\ge 1)$. Considering the case under this assumption is a first step toward the problem in the overrealizable scenario, because the penalty term of a model selection criterion in the overrealizable scenario is determined by the overfitting of networks to a noise component, as observed in NIC and AIC. Thus, the analysis under the assumption is also important in capturing the overfitting property of networks. Next, we assume the following locality condition on a family of networks:

Assumption 2. For any given $x_t$, $t = 1, \dots, T$, and any $\beta \in \mathbb{R}^m$, there exists $\bar{b}_i \in \bar{\mathcal{B}}$ such that
\[ \lim_{b_i \to \bar{b}_i} g_{b_i}(x_t) = \chi_\beta(x_t) \tag{2.8} \]
holds, where $\chi_\beta$ is an index function of the form
\[ \chi_\beta(x) = \begin{cases} 1 & \beta = x \\ 0 & \text{otherwise.} \end{cases} \tag{2.9} \]
Here, $\bar{\mathcal{B}}$ is the closure of $\mathcal{B}$. Throughout this article, we measure the generalization error at the training inputs; that is, the empirical distribution $q(u) = \frac{1}{T} \sum_{t=1}^{T} \delta(u - x_t)$ is assumed in equation 2.6:

Assumption 3. The generalization error at $\hat{w}$ is defined by the conditional expectation
\[ r^0(\hat{w}) = E_{z_1, \dots, z_T}\left\{ \frac{1}{T} \sum_{t=1}^{T} \left( z_t - f_{\hat{w}}(x_t) \right)^2 \,\middle|\, x_1, \dots, x_T \right\}, \tag{2.10} \]
where $z_t$ is distributed according to $Q(y \mid x_t)$ and $y_1, \dots, y_T, z_1, \dots, z_T$ are independent. Here, we also define $R^0(n, T) = E_{D_T}\{ r^0(\hat{w}) \}$. In the following, we use the above definition of the generalization error.

3 Main Results

3.1 Technical Formulation of the Problem. Here, we use matrix forms such as $Y = (y_1, \dots, y_T)'$ and $C = (c_1, \dots, c_n)'$, where $'$ denotes the transpose. Let $G_b$ be a $T \times n$ matrix whose $(t, i)$ component is $g_{b_i}(x_t)$. Then, because the empirical squared error is a quadratic form in the $c_i$'s, the estimator $\hat{c}(b)$, written $\hat{C}(b) = (\hat{c}_1(b), \dots, \hat{c}_n(b))'$, at fixed $b$ is given by
\[ \hat{C}(b) = (G_b' G_b)^{-1} G_b' Y. \tag{3.1} \]
By using this, it is easily shown that
\[ r_{\mathrm{emp}}(\hat{c}(b), b) = \frac{1}{T} Y'Y - \frac{1}{T} \hat{F}_b' \hat{F}_b \tag{3.2} \]
holds at a fixed $b$, where
\[ \hat{F}_b = G_b \hat{C}(b). \tag{3.3} \]
Because the first term of the right-hand side of equation 3.2 is independent of $b$, we have the following decomposition of the training error:
\[ r_{\mathrm{emp}}(\hat{w}) = \frac{1}{T} Y'Y - \frac{1}{T} \sup_b \hat{F}_b' \hat{F}_b. \tag{3.4} \]
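The decomposition in equations 3.1 to 3.4 is ordinary least squares at a fixed $b$; a minimal numpy sketch (ours, with an arbitrary fixed gaussian design matrix) verifies the identity of equation 3.2:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 50, 4
# Design matrix G_b: (t, i) entry is g_{b_i}(x_t), here for an arbitrary fixed b.
x = rng.uniform(-5, 5, size=T)
a = rng.uniform(-5, 5, size=n)
G = np.exp(-0.5 * (x[:, None] - a[None, :]) ** 2)
y = rng.normal(0.0, 1.0, size=T)             # gaussian noise output (assumption 1)

# Equation 3.1: C_hat(b) = (G' G)^{-1} G' Y, i.e. ordinary least squares.
C_hat = np.linalg.solve(G.T @ G, G.T @ y)
F_hat = G @ C_hat                            # equation 3.3

r_emp_direct = np.mean((y - F_hat) ** 2)
r_emp_decomp = y @ y / T - F_hat @ F_hat / T # equation 3.2
print(r_emp_direct, r_emp_decomp)
```

The identity holds because $\hat{F}_b$ is the orthogonal projection of $Y$ onto the column space of $G_b$, so $Y'\hat{F}_b = \hat{F}_b'\hat{F}_b$.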
Then the expected training error under assumption 1 is given by
\[ R_{\mathrm{emp}}(n, T) = \sigma^2 - \frac{1}{T} B(n, T), \tag{3.5} \]
where
\[ B(n, T) = E_{D_T}\left\{ \sup_b \hat{F}_b' \hat{F}_b \right\}. \tag{3.6} \]
On the other hand, under assumptions 1 and 3, the generalization error is written as
\[ r^0(\hat{w}) = \sigma^2 + \sup_b \frac{1}{T} \hat{F}_b' \hat{F}_b. \tag{3.7} \]
Therefore, the expected generalization error is given by
\[ R^0(n, T) = \sigma^2 + \frac{1}{T} B(n, T). \tag{3.8} \]
Therefore, the calculation of the expected training error and the expected generalization error reduces to the calculation of $B(n, T)$. Moreover, under assumption 1, $\hat{F}_b' \hat{F}_b$ at fixed $b$ is easily shown to have, after normalization by $\sigma^2$, a $\chi^2$ distribution with $n$ degrees of freedom, referred to as the $\chi_n^2$ distribution. As a result, the problem reduces to the calculation of the supremum of a $\chi_n^2$ process or random field. Although this formulation cannot be derived in the general situation, Fukumizu (2001) has recently shown that a locally conic parameterization yields the same formulation as in equation 3.4, which was originally given by Dacunha-Castelle and Gassiat (1997). In the following, we derive a lower bound on $B(n, T)$ by means of extreme value theory.

3.2 A Trick to Obtain Bounds. Here, we fix $T$ for a while. Let us rearrange the output data $y_1, \dots, y_T$ in order of magnitude so that $y_{t_1}^2 \ge y_{t_2}^2 \ge \cdots \ge y_{t_T}^2$. Now, under assumption 2, there exists $\bar{b}_i$ such that
\[ \lim_{b_i \to \bar{b}_i} g_{b_i}(x_t) = \chi_{x_{t_i}}(x_t), \quad t = 1, \dots, T, \tag{3.9} \]
by putting $\beta = x_{t_i}$ in equation 2.8. Set $\bar{b} = (\bar{b}_1, \dots, \bar{b}_n)$. Then, by equations 3.1 and 3.3, the $t$th element of $\lim_{b \to \bar{b}} \hat{F}_b$ is equal to $y_t$ if $t \in \{t_1, \dots, t_n\}$ and is equal to 0 if $t \notin \{t_1, \dots, t_n\}$, which implies that
\[ \lim_{b \to \bar{b}} \hat{F}_b' \hat{F}_b = \sum_{i=1}^{n} y_{t_i}^2. \tag{3.10} \]
Therefore, under assumption 1, we have
\[ E_{D_T}\left\{ \sup_b \hat{F}_b' \hat{F}_b \right\} \ge \lim_{b \to \bar{b}} E_{D_T}\{ \hat{F}_b' \hat{F}_b \} = E_{D_T}\left\{ \lim_{b \to \bar{b}} \hat{F}_b' \hat{F}_b \right\} = E_{D_T}\left\{ \sum_{i=1}^{n} y_{t_i}^2 \right\}, \tag{3.11} \]
where we use the Lebesgue dominated convergence theorem, based on the fact that $\hat{F}_b' \hat{F}_b \le \sum_{t=1}^{T} y_t^2$ and $E\{ \sum_{t=1}^{T} y_t^2 \} = T\sigma^2 < \infty$ for any fixed $T$. Because $y_1, \dots, y_T$ are i.i.d. random variables according to $N(0, \sigma^2)$ by assumption 1, $y_1^2/\sigma^2, \dots, y_T^2/\sigma^2$ are i.i.d. random variables according to $\chi_1^2$. Therefore, the problem of calculating a lower bound on $B(n, T)$ is formulated as the calculation of the expectation of the sum of the $n$ largest values among $T$ i.i.d. $\chi_1^2$ random variables.
Remark. Here, we consider the family of linear combinations of index functions $I_{\alpha, \beta}(x) = \sum_{i=1}^{n} \alpha_i \chi_{\beta_i}(x)$, where $\alpha = (\alpha_1, \dots, \alpha_n)$, $\alpha_i \in \mathbb{R}$, and $\beta = (\beta_1, \dots, \beta_n)$, $\beta_i \in \mathbb{R}^m$. Such a function picks up $n$ points only; we call it an $n$-points fitting function. The least-squares estimate of $(\alpha, \beta)$ is then shown to be given by $\hat{\alpha} = (y_{t_1}, \dots, y_{t_n})$ and $\hat{\beta} = (x_{t_1}, \dots, x_{t_n})$, and the minimum of the empirical squared error is given by $r_{\mathrm{emp}}(\hat{\alpha}, \hat{\beta}) = \frac{1}{T} \sum_{t=1}^{T} y_t^2 - \frac{1}{T} \sum_{i=1}^{n} y_{t_i}^2$. Therefore, $\lim_{b \to \bar{b}} \hat{F}_b$, which gives the lower bound, is equivalent to $I_{\hat{\alpha}, \hat{\beta}}$.
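The remark can be checked numerically: the least-squares $n$-points fit keeps the $n$ outputs largest in magnitude, and its training error is the mean squared output minus the mean of the top-$n$ squares. A minimal sketch (ours):

```python
import numpy as np

def n_points_fit(x, y, n):
    """Least-squares fit of the n-points fitting function I_{alpha,beta}:
    fit the n outputs largest in magnitude exactly; predict 0 elsewhere."""
    idx = np.argsort(y ** 2)[::-1][:n]       # indices t_1, ..., t_n
    beta_hat, alpha_hat = x[idx], y[idx]
    r_emp = np.mean(y ** 2) - np.sum(y[idx] ** 2) / len(y)
    return alpha_hat, beta_hat, r_emp

rng = np.random.default_rng(2)
T, n = 100, 5
x = rng.uniform(-5, 5, size=T)
y = rng.normal(0.0, 1.0, size=T)             # gaussian noise, as in assumption 1
alpha_hat, beta_hat, r_emp = n_points_fit(x, y, n)
print(r_emp)
```

The training error is strictly below the mean squared output, and the gap is exactly the quantity whose expectation is bounded in the next section.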
3.3 Bounds of the Expected Training and Generalization Error. The asymptotic behavior of the maxima of a random sequence is investigated in detail in extreme value theory (Leadbetter et al., 1983; Resnick, 1987). In particular, for i.i.d. sequences, it is known that the limiting distribution of the maximum falls into one of three distributions when the maximum is appropriately normalized depending on the size of the sequence (see Leadbetter et al., 1983; Resnick, 1987). Extreme value theory was first introduced into a regression context by Sakai (1990) in considering harmonic analysis. Then, Hayasaka, Hagiwara, Toda, and Usui (1999) refined the result. Here, we follow the results of Hagiwara et al. (2001). Let $X_1, \dots, X_T$ be independent random variables with common $\chi_1^2$ distribution. The $i$th largest value in $X_1, \dots, X_T$ is denoted by $X_{t_i}$; in particular, $X_{t_1} = \max_t X_t$. First, by using techniques in Resnick (1987), we have
\[ P\{ a_T^{-1} (X_{t_1} - b_T) \le x \} \to \exp(-e^{-x}) = H_1(x), \tag{3.12} \]
as $T \to \infty$, where
\[ a_T = 2, \tag{3.13} \]
\[ b_T = 2 \log T - \log \log T - 2 \log \sqrt{\pi}. \tag{3.14} \]
The limiting distribution $H_1$ is often called the double exponential distribution. Because the moment convergence also holds, as in proposition 2.1 of Resnick (1987), and $\int u\, dH_1(u) = \gamma$, we have
\[ \lim_{T \to \infty} E\{X_{t_1}\} = 2 \log T - \log \log T - 2 \log \sqrt{\pi} + 2\gamma, \tag{3.15} \]
where $\gamma$ is Euler's constant. For the $i$th largest value, theorem 2.2.2 of Leadbetter et al. (1983) tells us that the limiting distribution of $X_{t_i}$ is given by
\[ P\{ a_T^{-1} (X_{t_i} - b_T) \le x \} \to H_1(x) \sum_{r=0}^{i-1} \frac{(-\log H_1(x))^r}{r!} \equiv H_i(x), \quad i \ge 2, \tag{3.16} \]
where $H_1$ is the double exponential distribution and the normalizing constants $a_T$ and $b_T$ are those of equations 3.13 and 3.14, respectively. Of course, $i \ll T$ must hold in the above. We can then calculate the expectation of $H_i$ by repeated integration by parts, which gives $\int u\, dH_i(u) = \gamma - \sum_{r=1}^{i-1} \frac{1}{r}$. Because $X_{t_i} \le X_{t_1}$, the moment convergence of the $i$th largest value is obvious from the proof of proposition 2.1 in Resnick (1987). We then have
\[ \lim_{T \to \infty} E\{X_{t_i}\} = \lim_{T \to \infty} E\{X_{t_1}\} - 2 \sum_{r=1}^{i-1} \frac{1}{r}, \tag{3.17} \]
where $i \ge 2$. By equations 3.15 and 3.17, we have
\[ \lim_{T \to \infty} E_{D_T}\left\{ \sum_{i=1}^{n} y_{t_i}^2 \right\} = 2n\sigma^2 \log T + o(\log T). \tag{3.18} \]
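A Monte Carlo sketch (ours, with $\sigma^2 = 1$) of equation 3.18: the expected sum of the $n$ largest among $T$ i.i.d. $\chi_1^2$ variables, compared with the leading term $2n \log T$ and the refined asymptotic value assembled from equations 3.15 and 3.17:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, trials = 10000, 3, 500
# T i.i.d. chi^2_1 variables per trial: squared standard gaussians (sigma^2 = 1).
samples = rng.normal(size=(trials, T)) ** 2
top_n_sum = np.sort(samples, axis=1)[:, -n:].sum(axis=1)
estimate = top_n_sum.mean()

leading = 2 * n * np.log(T)                       # leading term of equation 3.18
harmonic = [sum(1.0 / r for r in range(1, i)) for i in range(1, n + 1)]
refined = (n * (2 * np.log(T) - np.log(np.log(T)) - np.log(np.pi)
                + 2 * np.euler_gamma) - 2 * sum(harmonic))  # eqs. 3.15 and 3.17
print(estimate, leading, refined)
```

At moderate $T$, the simulated mean sits well below $2n \log T$ but close to the refined value, reflecting the slowly decaying lower-order terms.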
By using equation 3.11, therefore, we have
\[ B(n, T) \ge 2n\sigma^2 \log T + o(\log T) \tag{3.19} \]
for sufficiently large $T$. The above result, together with equations 3.5 and 3.8, yields the following fact:

Theorem 1. Under assumptions 1 and 2, we have
\[ R_{\mathrm{emp}}(n, T) \le \sigma^2 - \sigma^2 \frac{2n \log T}{T} + o\left( \frac{\log T}{T} \right), \tag{3.20} \]
and under assumptions 1, 2, and 3, we also have
\[ R^0(n, T) \ge \sigma^2 + \sigma^2 \frac{2n \log T}{T} + o\left( \frac{\log T}{T} \right) \tag{3.21} \]
for sufficiently large $T\ (\gg n)$.

3.4 Probabilistic Behavior of Training Error and Generalization Error.
Under assumption 2, we have
\[ \sup_b \hat{F}_b' \hat{F}_b \ge \sum_{i=1}^{n} y_{t_i}^2 \ge y_{t_1}^2, \tag{3.22} \]
as in equation 3.10. Under assumption 1, the right-hand side of the above inequality is the maximum of i.i.d. random variables according to $\chi_1^2$ if we
normalize each by $\sigma^2$. Let $X_1, \dots, X_T$ be a sequence of independent random variables from $N(0, 1)$. Then it has been shown that
\[ P\left\{ \lim_{T \to \infty} \frac{\max_{1 \le t \le T} |X_t|}{(2 \log T)^{1/2}} = 1 \right\} = 1 \tag{3.23} \]
(Deo, 1972). Hence, we have
\[ P\left\{ \lim_{T \to \infty} \frac{\max_{1 \le t \le T} X_t^2}{2 \log T} = 1 \right\} = 1. \tag{3.24} \]
Therefore, we have
\[ P\left\{ \lim_{T \to \infty} \frac{\sup_b \hat{F}_b' \hat{F}_b}{2\sigma^2 \log T} \ge 1 \right\} = 1, \tag{3.25} \]
by equation 3.22. On the other hand, because the $y_t$'s are i.i.d. according to $N(0, \sigma^2)$ by assumption 1, we have
\[ P\left\{ \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} y_t^2 = \sigma^2 \right\} = 1, \tag{3.26} \]
by the strong law of large numbers. Therefore, we have the following theorem by equations 3.4 and 3.7, together with equations 3.25 and 3.26.

Theorem 2. Under assumptions 1 and 2, we have
\[ P\left\{ \lim_{T \to \infty} \frac{\sigma^2 - r_{\mathrm{emp}}(\hat{w})}{\sigma^2 \frac{2 \log T}{T}} \ge 1 \right\} = 1, \tag{3.27} \]
and under assumptions 1, 2, and 3, we have
\[ P\left\{ \lim_{T \to \infty} \frac{r^0(\hat{w}) - \sigma^2}{\sigma^2 \frac{2 \log T}{T}} \ge 1 \right\} = 1. \tag{3.28} \]
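The driving fact behind theorem 2 is equation 3.24, the almost-sure growth of the maximum of $T$ squared standard gaussians at rate $2 \log T$. A quick Monte Carlo sanity check (ours; the ratio approaches 1 slowly and from below at finite $T$):

```python
import numpy as np

rng = np.random.default_rng(7)
trials, T = 20, 100000
# Maximum of T squared standard gaussians (i.e., chi^2_1 variables) per trial.
maxima = (rng.normal(size=(trials, T)) ** 2).max(axis=1)
ratio = (maxima / (2 * np.log(T))).mean()     # equation 3.24 predicts -> 1
print(ratio)
```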
3.5 Examples of Networks. A gaussian radial basis function of the form of equation 2.2 satisfies assumption 2 by setting $\bar{b}_i = (\beta, 0)$. Therefore, theorems 1 and 2 hold for gaussian radial basis functions. They hold even if we set $t_i = t$ for all $i$. More generally, the basis functions employed in radial basis function networks typically have the kind of locality required by assumption 2 (e.g., Cauchy-type basis functions). Therefore, the theorems are generally valid for radial basis functions.
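The locality in assumption 2 is easy to visualize for the gaussian basis of equation 2.2: as the width shrinks, the unit's output on the training inputs approaches the index function of equation 2.9. A minimal sketch (ours, on a fixed input grid):

```python
import numpy as np

x = np.linspace(-5, 5, 21)                   # distinct training inputs
beta = x[10]                                  # the point the unit should pick out
for t in [1.0, 0.1, 0.01]:
    # Gaussian basis of equation 2.2 centered at beta with width t.
    g = np.exp(-(x - beta) ** 2 / (2 * t ** 2))
    print(t, g[10], np.delete(g, 10).max())   # value at beta vs. elsewhere
```

At the center the output stays exactly 1, while off-center values vanish as $t \to 0$, realizing the limit in equation 2.8 on the training inputs.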
Moreover, it has been shown that assumption 2 holds for three-layer perceptrons (Hagiwara et al., 2001). The outline is as follows. First, we are given $T$ input points $x_1, \dots, x_T$ in $\mathbb{R}^m$. Let us define $b = (a, h)$, where $a \in \mathbb{R}^m$ and $h \in \mathbb{R}$. Here, we also define $H(b) = \{ x : x \in \mathbb{R}^m, a \cdot x + h = 0 \}$ as an $(m-1)$-dimensional hyperplane in $\mathbb{R}^m$; $b$ is a parameter vector that defines the hyperplane. Define a function $Q_b$ whose output for an input $x$ is given by
\[ Q_b(x) = \begin{cases} 1 & x \in H(b) \\ 0 & \text{otherwise.} \end{cases} \tag{3.29} \]
In other words, $Q_b$ takes 1 on a hyperplane $H(b)$ and 0 otherwise. Because $T$ is finite, we can always have $\{ H(b^{(1)}), \dots, H(b^{(T)}) \}$ such that $x_t \in H(b^{(t)})$ and $x_l \notin H(b^{(t)})$ for $l \neq t$; that is, a set of $T$ hyperplanes, each of which includes only one input point. Then, on the given input points, the output of the function $Q_{b^{(t)}}$ is consistent with the output of the index function $\chi_{x_t}$ defined in equation 2.9. Here, we define $\mathcal{B} = \{ b^{(1)}, \dots, b^{(T)} \}$. Let $f_{c,b}$ be a function whose output for an input $x \in \mathbb{R}^m$ is given by
\[ f_{c,b}(x) = \sum_{i=1}^{n} c_i Q_{b_i}(x), \tag{3.30} \]
where $c = (c_1, \dots, c_n)$, $c_i \in \mathbb{R}$, and $b = (b_1, \dots, b_n)$, $b_i \in \mathcal{B}$. Then, on the given input points, the output of $f_{c,b}$ is consistent with the output of an $n$-points fitting function. Hence, the results on $n$-points fitting functions hold for functions of the form of equation 3.30. Note that a difference of two sigmoidal functions of the form of equation 2.3 can approximate equation 3.30 with any precision on the given input points if we choose the $(a_i, t_i)$'s in equation 2.3 appropriately. Hence, assumption 2 holds for a pair of sigmoidal functions. Therefore, theorems 1 and 2 hold for three-layer perceptrons with a sigmoidal activation function in the hidden layer, in which we should take $n/2$ instead of $n$ at each $n = 2, 4, \dots \ll T$.

4 Discussion

4.1 Comparison with Regular Models. Under assumption 1, the expected training error and the expected generalization error of regular models are given by
\[ \sigma^2 - \sigma^2 N / T \tag{4.1} \]
\[ \sigma^2 + \sigma^2 N / T, \tag{4.2} \]
respectively, where $N$ is the total number of network parameters (Murata et al., 1994). The second terms of equations 4.1 and 4.2 are $O(1/T)$, whereas the second terms of the corresponding bounds for networks are $O(\log T / T)$ by theorem 1.
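The gap between the two rates is easy to tabulate; a small sketch (ours, with $\sigma^2 = 1$ and the $N = 3n$ parameter count used in section 4.2) compares the regular-model penalty $2\sigma^2 N / T$ with the network penalty $4\sigma^2 n \log T / T$ suggested by theorem 1:

```python
import numpy as np

sigma2, n = 1.0, 3
N = 3 * n                                     # parameters per hidden unit (1-D input)
ratios = []
for T in [100, 1000, 10000]:
    regular_penalty = 2 * sigma2 * N / T      # difference of equations 4.1 and 4.2
    network_penalty = 4 * sigma2 * n * np.log(T) / T  # from equations 3.20 and 3.21
    ratios.append(network_penalty / regular_penalty)
    print(T, regular_penalty, network_penalty)
```

The ratio grows like $\log T$, so the shortfall of the regular-model penalty worsens with the sample size.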
Therefore, the expected training error of networks in the overrealizable scenario is smaller than the evaluation for regular models, and the expected generalization error of networks is larger than the evaluation for regular models. This implies that the degree of overfitting of networks is higher than that of regular models. Moreover, theorem 2 tells us that the training error is smaller than equation 4.1 and the generalization error is larger than equation 4.2 almost surely as $T \to \infty$. Indeed, the corresponding penalty term in NIC under assumption 1 is $2\sigma^2 N / T$ for regular models, given by the difference between equations 4.1 and 4.2, whereas the penalty in our results is larger than $4\sigma^2 n \log T / T$, which comes from equations 3.20 and 3.21. This suggests that a larger penalty may be needed for networks than for regular models. In the context of maximum likelihood estimation, Hartigan (1985) has shown that the likelihood ratio for mixture models is unbounded as the data size goes to infinity; note that it has a $\chi^2$ distribution for regular models. Our results can be viewed as refining his work in the context of least-squares estimation of a neural network regression with a gaussian noise model. Recently, Fukumizu (2001) obtained a similar result for neural networks in the context of maximum likelihood estimation. That work is based on a formulation of a locally conic model (Dacunha-Castelle & Gassiat, 1997, 1999) for neural networks and the analysis in Hartigan (1985). The probabilistic upper bound on the training error obtained there is consistent with our result in the case of a gaussian noise model. The work partially includes our result, but the result is weaker than ours if we consider a gaussian noise model.

4.2 Behavior of NIC in an Unrealizable Situation. Overrealizability of true functions may not be satisfied exactly in real-world problems. In many cases, however, it is nearly satisfied.
If not, it may be wrong to try to identify the underlying rule by using neural networks. In this section, we show a simulation result on model selection via NIC. Here, we consider one-dimensional curve fitting as a task for three-layer perceptrons with a sigmoidal activation function in the hidden layer. In the simulation, the number of data $T$ is set to 100. The inputs are equally spaced on $[-5, 5]$ and fixed across samplings. The output noise is distributed according to $N(0, 1)$; that is, $\sigma^2 = 1$. The true output is given by
\[ h(x) = 4e^{-x^2/2} + 4e^{-(x-5)^2/2}. \tag{4.3} \]
The true output and an example of training data are shown in Figure 1. Note that the true function is unrealizable for finite $n$, but it is nearly realizable at $n = 3$, as seen in the figure. NIC in the overrealizable case under squared loss is given by
\[ \mathrm{NIC}(n) = r_{\mathrm{emp}}(\tilde{w}, n) + \frac{2N}{T} \tilde{\sigma}^2, \quad N = 3n, \tag{4.4} \]
Figure 1: The true output defined by equation 4.3 and an example of training data, which consist of the true output plus gaussian noise $N(0, 1)$ at each input.
where $\tilde{w}$ is the weight vector obtained by training and $r_{\mathrm{emp}}(\tilde{w}, n)$ is the training error at $n$ hidden units. Here, we set $\tilde{\sigma}^2 = r_{\mathrm{emp}}(\tilde{w}, n_{\max})$ with $n_{\max} = 6$. Such a choice of $\tilde{\sigma}^2$ is often seen, as in applying Mallows's $C_p$ (Mallows, 1973). In Figure 2a, we show the averaged training error, the averaged generalization error, and the average of NIC obtained for 100 sets of $T$ training data, where the generalization error is estimated on 100 sets of new samples. In Figure 2b, we show the frequency of selection via NIC in 100 trials. From Figure 2a, the averaged generalization error is minimized at $n = 3$. From Figure 2b, however, $n = 3$ is not frequently selected, and larger sizes tend to be selected. In Figure 2a, the increase of the average of NIC after $n = 3$ is smaller than the increase of the averaged generalization error, which implies that the correction for overfitting by the penalty of NIC is too small. As a result, NIC may tend to select a larger size even when the true function is unrealizable but nearly realizable at a relatively small number of units. The penalty term, which is determined by the overfitting property of neural networks, differs from the evaluation in NIC. The simulation result tells us that this gap is directly reflected in model selection even when overrealizability holds only approximately.

Figure 2: (a) Averaged training error (open squares), averaged generalization error (filled squares), and average of NIC (filled circles), plotted against the number of hidden units. (b) Frequency of selection [%] of the number of hidden units via NIC.

4.3 Effectiveness of Bounds in a Practical Situation. To examine the obtained bounds in a practical situation, we have done simulations of curve fitting by three-layer perceptrons with a sigmoidal activation function in the hidden layer. The results are shown in Figure 3. In the simulations, the data size $T$ is set to 100. The inputs are generated according to the uniform distribution on $[-5, 5]$. The output data are according to $N(0, 1)$; that is, $\sigma^2 = 1$. The expected training error is estimated by the average of the training error for 100 sets of $T$ training samples (open circles). The generalization error for each estimated network is estimated by the average of the empirical squared error over 100 sets of $T$ new output samples, in which the input data are taken to be the training inputs, as in assumption 3. Then the expected generalization error is estimated by averaging the estimated generalization error over the trials on the 100 sets of $T$ training samples (filled circles). In this case, the obtained bounds are valid only at $n = 2, 4, 6$ (open boxes). In the figure, we plot the expected training error and the expected generalization error of regular models, which are given by equations 4.1 and 4.2
Figure 3: Averaged training error (filled circles), averaged generalization error (open circles), the expectations for regular models (solid lines), and the theoretical bounds (open squares). Here, the generalization error is measured at the training inputs; therefore, the averaged generalization error is an estimate of $R^0(n, T)$.
with $N = 3n$ (solid lines). In the experiments, we use the kick-out method (Ochiai, Toda, & Usui, 1994) for training, while our theory assumes that the linear least-squares estimator is obtained at each step. This is not a serious problem, for the following reason. Because the coefficients are linear parameters, there exists only one global minimum at each step; therefore, the error may be nearly minimized at each step in a later stage of training. In the figure, we can observe that the bounds are valid. Note that if we could obtain the estimator that globally minimizes the empirical squared error through training, then the estimate of the expected training error would be smaller than in the figure, and thus the estimate of the expected generalization error would be larger by equation 3.7; the bounds would still be valid. In the figure, on the other hand, the expected training error for regular models is larger than our bound and thus larger than the estimate, and the expected generalization error for regular models is smaller than our bound and thus smaller than the estimate. Therefore, NIC may underestimate the generalization error. As mentioned in section 4.2, this can be a problem even in the unrealizable case.

4.4 In the Case of the Usual Measure of the Generalization Error. Our theoretical result may be important in showing the difference between neural networks and regular models and may be significant in practical situations, as seen in the previous two sections. However, our definition of the generalization error in assumption 3 may not be general. According to definition 2.6, the usual generalization error is given by
\[ r(\hat{w}) = \sigma^2 + E_u\{ f_{\hat{w}}^2(u) \} \tag{4.5} \]
under assumption 1. Considering the usual measure of generalization defined in equation 2.6, however, involves the following difficulty. Indeed, in showing the bounds, we consider an extreme locality in which the network is assumed to approximate any $n$-points fitting function, which picks up only $n$ points of the output data. Suppose we fit by $n$-points fitting functions and use the usual generalization error defined in equation 4.5. Then the probability measure of the input domain on which the output of the function takes a nonzero value is 0, provided that $Q(x)$ is continuous. This implies that the generalization error is equal to the noise variance at any sampling; that is, the second term in equation 4.5 is 0. Note that in this case, equation 3.20 for the expected training error is valid with equality (see Hayasaka et al., 1996), which implies that the symmetry between the generalization error and the training error is broken. The symmetry holds in the case of regular models (see Akaike, 1973; Murata et al., 1994). Therefore, the belief that the generalization performance will be poor if the degree of overfitting is strong is no longer guaranteed. Of course, fitting by $n$-points fitting functions is not realistic in a practical situation. However, for noisy data, it is empirically known that sharp outputs are obtained by training networks of relatively large size. This may be caused by the overfitting to noise components by excess nodes. Therefore, there is a possibility that an $n$-points fitting function is nearly estimated under noisy data, and we should take account of that fact. In the following, we investigate this problem through simple simulations. First, in the simulation above, we measure the generalization in the usual manner defined by equation 4.5, in which the input samples are newly generated according to the uniform distribution on $[-5, 5]$ in calculating the averaged generalization error. The result is shown in Figure 4.
By comparison with Figure 3, the symmetry is broken, and the error is not monotonically increasing; it even decreases at some n. The important point is that the sharp output can be constructed at n = 2, 4, 6, as shown in section 3.5. These properties cannot be seen in regular models. To investigate this fact precisely, we performed the following simulations. We consider a curve fitting using a simple gaussian unit whose output is given by

g_{c,b_0,b_1}(x) = c \exp\left\{ -\left( \frac{x - b_1}{b_0} \right)^2 \right\}, \qquad (4.6)
where c ∈ R, b_0 ∈ B_0, and b_1 ∈ {x_1, . . . , x_T}, in which x_1, . . . , x_T are the training inputs. The output data are a gaussian noise sequence with mean 0 and variance 1. The input data are drawn from the uniform distribution on [−5, 5]. Here, B_0 is set to be a set of 1000 points spaced on a log scale over the range [10^{−5}, 10^{0}]. The least-squares estimator is then found in the parameter space B_0 × {x_1, . . . , x_T}, whose number of members is finite. The averaged training error is calculated for 500 sets of T samples, and the
1996
Katsuyuki Hagiwara
Figure 4: Averaged generalization error under the usual generalization measure, where the measure of the generalization error is faithful to the definition of equation 2.6. Therefore, this is an estimate of R(n, T).
generalization error is estimated for 500 new sets of T samples. In the first simulation, the generalization error is measured at the training inputs, which is the measure of generalization defined in assumption 3. In the second simulation, the input samples are newly generated according to the same uniform distribution when measuring the generalization error; the averaged generalization error is therefore faithful to definition 4.5. The averaged training error and the averaged generalization error at each number of samples are presented in Figures 5 and 6: Figure 5 is for the first simulation, and Figure 6 is for the second. In each figure, we overlay the theoretical bound derived in this article. In Figure 5, we can see that the theoretical bound is valid. In Figure 6, however, the averaged generalization error is smaller than our bound, while the upper bound on the expected training error is still valid. The lower bound on the expected generalization error is not guaranteed in this case and is shown only for comparison. Figure 7 shows the histogram of the estimated width parameter b_0 at T = 100, 200, 400, 800. Clearly, the estimated parameter frequently falls into the neighborhood of 0, and the concentration increases with T. This result asserts that the symmetry may be broken and that an n points fitting function may be nearly obtained in training. Indeed, the second term of equation 4.5, which consists of the squared norm of the estimated function, will be small by the latter observation, which implies that the generalization error will be small.

4.5 Relation to Regularization Method. In practical situations, the regularization method (Bishop, 1995; Breiman, 1998) is known to be effective in avoiding the sharp output of a network of relatively large size.
Figure 5: Averaged training error (filled circles), averaged generalization error (open circles), and theoretical bounds (solid line). Here, the generalization is measured at the training inputs. Therefore, the averaged generalization error is an estimate of R_0(n, T).
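A reduced version of this Monte Carlo experiment can be sketched as follows (a hedged reconstruction, not the article's code: a coarser width grid and fewer than 500 trials, and the closed-form least-squares coefficient c at fixed (b_0, b_1) replaces an explicit search over c):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
TRIALS = 50                          # the article averages over 500 trials
B0 = np.logspace(-5, 0, 100)         # width grid (article: 1000 log-spaced points)

def one_trial():
    x = rng.uniform(-5, 5, T)
    y = rng.normal(0, 1, T)          # output data: pure gaussian noise
    best_err, best = np.inf, None
    for b0 in B0:
        # column j holds g(x_i) for the candidate centre b1 = x_j
        G = np.exp(-(((x[:, None] - x[None, :]) / b0) ** 2))
        c = (y @ G) / (G * G).sum(axis=0)        # closed-form least-squares c
        err = np.mean((y[:, None] - c * G) ** 2, axis=0)
        j = int(err.argmin())
        if err[j] < best_err:
            best_err, best = err[j], (c[j], b0, x[j])
    c, b0, b1 = best
    x_new = rng.uniform(-5, 5, T)                # fresh inputs and outputs
    y_new = rng.normal(0, 1, T)
    gen_err = np.mean((y_new - c * np.exp(-(((x_new - b1) / b0) ** 2))) ** 2)
    return best_err, gen_err, b0

trials = [one_trial() for _ in range(TRIALS)]
avg_train = float(np.mean([t[0] for t in trials]))
avg_gen = float(np.mean([t[1] for t in trials]))
```

Selecting the best of the T × |B_0| candidate units pulls the averaged training error below the noise level, while the error at fresh inputs stays near the noise variance, in line with the broken symmetry discussed above.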
In the regularization method, the optimal choice of the regularization parameter is the most important issue. Murata (1998) obtained one analytic solution to this problem. In that work, however, the regularity of the Fisher information matrix is assumed and an asymptotic expansion is applied, which implies that the optimal regularization parameter is effective for regular models. On the other hand, the situation in which a relatively large network is used is almost overrealizable. According to section 4.2, therefore, the regularization method should naturally be considered under the overrealizable case. Furthermore, according to our derivation of the bounds, an n points fitting function can reduce the training error on the noise component compared with regular models. Indeed, the simulation results on a gaussian unit in section 4.4 suggest that a peaky output is frequently obtained in training. Therefore, suppressing the sharp output by the regularization method should be considered in relation to the unidentifiability problem. Further progress on the unidentifiable case is also important in considering the effect of the regularization method.

5 Conclusion
In this article, we analyzed the expected training error and a special case of the expected generalization error in an overrealizable scenario, in which
Figure 6: Averaged training error (filled circles), averaged generalization error (open circles), and theoretical bounds (solid line). Here, the measure of the generalization error is faithful to the definition of equation 2.6. Therefore, the averaged generalization error is an estimate of R(n, T). Note that the lower bound of the expected generalization error is not guaranteed under this generalization measure, and it is shown only for comparison.
output data are assumed to be a gaussian noise sequence. First, we gave a formulation of the problem, in which the problem is reduced to calculating the maxima of a random process. A similar formulation has been given by Dacunha-Castelle and Gassiat (1997), Hartigan (1985), and Fukumizu (2001) in considering mixture models. Based on this formulation, we gave an upper bound on the expected training error of a network, which is independent of the input probability distribution, and a lower bound on the expected generalization error, in which the generalization is measured at the training inputs. In the derivation, the extreme value theory for an i.i.d. χ² random sequence and a simple trick using the locality of a function are used. As a result, the expected training error of a network is smaller than in regular models, which says that the degree of overfitting of a network to noise should be estimated to be larger than the evaluation for regular models. The expected generalization error, which is directly associated with the model selection criterion, is shown to be larger than the evaluation for regular models. Moreover, it was shown that the training error is smaller and the generalization error is larger than the evaluation for regular models almost surely as the number of data goes to ∞. These results suggest that the
Figure 7: Histogram of the estimate of the width parameter in the gaussian unit at each T. Frequencies are given as percentages of 500 trials.
penalty term of a model selection criterion in the overrealizable scenario may be larger than in NIC. Additionally, the numerical experiments show that our results are meaningful even in practical situations. Although this article deals with a simple situation, it may be enough to show a difference between neural networks and regular models. Because this is a first step in constructing a model selection criterion in the
overrealizable case, there are many unclear points that future work should address. The opposite bounds on the expected errors should be derived. Our results concern a gaussian noise sequence, and the case of a nonzero true function should be considered: if one part of the network fits the true function and another part fits the noise component, then our theory is directly applicable; however, this does not necessarily hold in training. In this situation, both identifiable and unidentifiable parameters exist, and our result can be regarded as a first step in addressing the problem. Although we consider a gaussian noise model, extensions to other noise models should also be considered. One possible way is to take account of the asymptotic normality of the estimated coefficients ĉ(b) at fixed b. Although this has already been done by Fukumizu (2001), we will consider the expected errors in the extension to other noise models, which is not included in his work. In this article, we measured the generalization at the training inputs, whereas definition 2.6 is the usual measure. Under the usual measure, as suggested here, there is a possibility that the symmetry between the expected training error and the generalization error does not hold, while it holds for regular models. Moreover, this point is clearly related to the regularization method; hence, answering the question of the behavior of the usual generalization error is very important. As we argued in section 4.5, the regularization method should be reconsidered in the extension to other noise models. As noted in the introduction, the plateau phenomenon is known to be caused by the unidentifiability problem (Saad & Solla, 1995), and the natural gradient algorithm has been shown to escape from the plateau (Amari & Ozeki, 2001). A more detailed theoretical analysis should be done in this direction.

Acknowledgments
The author thanks Dr. Kenji Fukumizu of the Institute of Statistical Mathematics, Dr. Sumio Watanabe of the Tokyo Institute of Technology, Dr. Shun-ichi Amari of the RIKEN Brain Science Institute, and Dr. Shiro Usui of the Toyohashi University of Technology for their helpful discussions and suggestions. The author also thanks the anonymous reviewers for helpful comments that improved this article.
References

Akaike, H. (1969). Fitting autoregressive models for prediction. Ann. Inst. Statist. Math., 21, 243–247.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (pp. 267–281). Budapest, Hungary.
Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Application of statistics (pp. 27–41). Amsterdam: North-Holland.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Ozeki, T. (2001). Differential and algebraic geometry of multilayer perceptrons. IEICE Trans. Fundamentals, E84-A, 31–38.
Barron, A. R. (1984). Predicted squared error: A criterion for automatic model selection. In S. J. Farlow (Ed.), Self-organizing methods in modeling (pp. 87–103). New York: Dekker.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Breiman, L. (1998). Bias-variance, regularization, instability and stabilization. In C. M. Bishop (Ed.), Neural networks and machine learning (pp. 27–55). Berlin: Springer-Verlag.
Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. ESAIM Probability and Statistics, 1, 285–317.
Dacunha-Castelle, D., & Gassiat, E. (1999). Testing the order of a model using locally conic parameterization: Population mixtures and stationary ARMA processes. Annals of Statistics, 27, 1178–1209.
Deo, C. M. (1972). Some limit theorems for maxima of absolute values of gaussian sequences. Sankhyā Series A, 34, 289–292.
Fukumizu, K. (1997). Special statistical properties of neural network learning. In Proceedings of 1997 International Symposium on Nonlinear Theory and Its Applications (pp. 747–750). Honolulu, HI.
Fukumizu, K. (2001). Likelihood ratio of unidentifiable models and multilayer neural networks (Research Memorandum 780). Tokyo: Institute of Statistical Mathematics.
Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13, 317–327.
Hagiwara, K., Hayasaka, T., Toda, N., Usui, S., & Kuno, K. (2001). Upper bound of the expected training error of neural network regression for a gaussian noise sequence. Neural Networks, 14, 1419–1429.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. In Proceedings of International Joint Conference on Neural Networks '93, III (pp. 2263–2266). Nagoya, Japan.
Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. In Proceedings of Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (Vol. 2, pp. 807–810). Berkeley, CA.
Hayasaka, T., Hagiwara, K., Toda, N., & Usui, S. (1999). Statistical model selection for the nonlinear regression models with orthogonal basis functions. In Proceedings of 1999 Workshop on Information-Based Induction Sciences (pp. 151–156). Izu, Japan.
Hayasaka, T., Toda, N., Usui, S., & Hagiwara, K. (1996). On the least square error and prediction square error of function representation with discrete variable basis. In Proceedings of Neural Networks for Signal Processing VI (pp. 72–81). Kyoto, Japan.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and related properties of random sequences and processes. Berlin: Springer-Verlag.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Murata, N. (1998). Bias of estimators and regularization terms. In Proceedings of 1998 Workshop on Information-Based Induction Sciences (pp. 87–94). Hakone, Japan.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. on Neural Networks, 5, 865–872.
Ochiai, K., Toda, N., & Usui, S. (1994). Kick-out learning algorithm to reduce the oscillation of weights. Neural Networks, 7, 797–807.
Resnick, S. I. (1987). Extreme values, regular variation, and point processes. Berlin: Springer-Verlag.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080–1100.
Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52, 4225–4243.
Sakai, H. (1990). An application of a BIC-type method to harmonic analysis and a new criterion for order determination of an AR process. IEEE Trans.
Acoust., Speech, Signal Processing, 38, 999–1004.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Watanabe, S. (2001). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13, 899–933.

Received May 30, 2001; accepted January 7, 2002.
ARTICLE
Communicated by Gert Cauwenberghs
Scalable Hybrid Computation with Spikes

Rahul Sarpeshkar [email protected]
Micah O’Halloran [email protected] Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
We outline a hybrid analog-digital scheme for computing with three important features that enable it to scale to systems of large complexity: First, like digital computation, which uses several one-bit precise logical units to collectively compute a precise answer to a computation, the hybrid scheme uses several moderate-precision analog units to collectively compute a precise answer to a computation. Second, frequent discrete signal restoration of the analog information prevents analog noise and offset from degrading the computation. And, third, a state machine enables complex computations to be created using a sequence of elementary computations. A natural choice for implementing this hybrid scheme is one based on spikes, because spike-count codes are digital while spike-time codes are analog. We illustrate how spikes afford easy ways to implement all three components of scalable hybrid computation. First, as an important example of distributed analog computation, we show how spikes can create a distributed modular representation of an analog number by implementing digital carry interactions between spiking analog neurons. Second, we show how signal restoration may be performed by recursive spike-count quantization of spike-time codes. And, third, we use spikes from an analog dynamical system to trigger state transitions in a digital dynamical system, which reconfigures the analog dynamical system using a binary control vector; such feedback interactions between analog and digital dynamical systems create a hybrid state machine (HSM). The HSM extends and expands the concept of a digital finite-state machine to the hybrid domain. We present experimental data from a two-neuron HSM on a chip that implements error-correcting analog-to-digital conversion with the concurrent use of spike-time and spike-count codes. We also present experimental data from silicon circuits that implement HSM-based pattern recognition using spike-time synchrony.
We outline how HSMs may be used to perform learning, vector quantization, and spike pattern recognition and generation, and how they may be reconfigured.
Neural Computation 14, 2003–2038 (2002) © 2002 Massachusetts Institute of Technology
1 Introduction

Dendritic signals in neurons are analog in nature and are processed by analog spatiotemporal circuits. Spikes are all-or-none digital events generated predominantly at the soma; dendritic spines with active channels have been hypothesized to perform digital logic operations on these spikes. Interspike intervals are analog, while spike number is digital. We constantly make discrete digital decisions that alter our continuous analog behaviors. For example, we may decide to change our mode of transportation from one analog form, walking, where we can walk with graded speeds, to another analog form, running, where we can also run with graded speeds. Speech recognition involves a constant interplay between parsing analog auditory signals into discrete digital symbols, which in turn can change the interpretation of past, current, or future analog signals. Given the hybrid analog-digital nature of neuronal signal processing in the cortex and the hybrid nature of the sensory and sensorimotor tasks that we perform, it is natural to imagine that our brains compute in a hybrid fashion. However, there is a much deeper reason that any sufficiently complex machine that operates on sensory data is likely to be hybrid. Since sensory data are analog, it is necessary to have at least an analog front end. However, the rest of the signal processing cannot all be analog; the noise and offset accumulation in a cascade of analog processing stages will eventually degrade the computation. Analog systems have no discrete attractor state to which the signal may be restored. Consequently, purely analog systems have no general paradigm for performing signal restoration, and no complex working system in the world today is purely analog. In contrast, because the notion of constantly restoring the signal to a discrete attractor state is built into the very paradigm of digital computation, relatively noiseless processing has made very complex digital systems ubiquitous.
Furthermore, digital systems handle complexity through the power of symbolic language and abstraction and may easily be reprogrammed or reconfigured by changing bit patterns. However, if one wants to compute fast, with little energy, and in a small amount of space, then it is wasteful to spend one's resources performing simple logical operations on reams of digital data that have been generated by preliminary quantization of analog data. It is much better to exploit the powerful computational primitives of an analog technology that are available for free. For example, 8-bit addition at moderate speeds may be performed by using Kirchhoff's current law on a single wire in analog technology but would require approximately 200 transistors in digital complementary metal oxide semiconductor (CMOS) technology. The power of analog technology becomes even more apparent if feedback is used to improve the precision of an analog channel to that set by a reference, and learning is used to adapt the system constantly to changing data in the environment. Adaptation and learning are ubiquitous in all nervous systems over many different temporal and spatial scales. Neurobiological systems
are extremely adept at exploiting biophysical primitives for computation at various levels of abstraction. (For a review, see Koch, 1999.) In a recent paper, we discussed the pros and cons of analog and digital computation from a signal-processing viewpoint (Sarpeshkar, 1998). We showed that analog computation is more efficient than digital computation at low precision, and vice versa. As a simple example, 8-bit addition at moderate speeds is very easily implemented on one wire through the use of Kirchhoff's current law in analog systems; 16-bit addition would be almost 2^8 times harder to implement in terms of power or area. In contrast, in digital systems, 16-bit addition is only twice as hard as 8-bit addition because of the divide-and-conquer multiwire approach of digital computation. Hybrid computation has the potential to combine the advantages of analog efficiency with the digital advantages of divide-and-conquer processing, signal restoration, and programmability. In Sarpeshkar (1998), we suggested that an underappreciated and important reason for the efficiency of the human brain could be the hybrid analog-digital nature of its architecture. Some of our recent work with collaborators has shown that computation in the brain can be simultaneously analog and digital due to the nature of its distributed positive and negative feedback loops (Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). In this article, we take a practical engineering approach to implementing hybrid computers in silicon and find that spikes afford easy ways to build such computers. We outline how spikes may naturally be used to compute in a hybrid fashion because interspike times are analog while spike counts are digital. Although our approach to spike computation is inspired by biology, our emphasis is on its importance to hybrid computing.
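The scaling contrast can be made concrete with a toy cost model (the exponential-versus-linear trend follows the argument above; the constants are arbitrary and purely illustrative):

```python
# Toy cost model, illustrative constants only: analog resources (power/area)
# grow exponentially with bits of precision, digital resources roughly linearly.
def analog_cost(bits):
    return 2 ** bits                 # arbitrary units

def digital_cost(bits, per_bit=25):  # ~200 "transistors" for an 8-bit adder
    return per_bit * bits

analog_ratio = analog_cost(16) / analog_cost(8)    # 2**8 = 256 times harder
digital_ratio = digital_cost(16) / digital_cost(8) # only 2 times harder
```

The crossover implied by this model is why the hybrid scheme assigns low-precision work to analog units and overall precision to their digital combination.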
Our synthetic engineering approach complements other analytic approaches to spike computation, such as those described in Hopfield and Brody (2001), Rieke, Warland, van Steveninck, and Bialek (1996), and Abeles (1991). In silicon, spikes have been used in multiplexing circuits to create an address-event representation (Mahowald, 1994; Lazzaro, Wawrzynek, Mahowald, Sivilotti, & Gillespie, 1993; Boahen, 2000) that is very useful in interchip communication. Spikes have also been used in silicon auditory systems to perform autocorrelation and cross-correlation computations (Lazzaro & Wawrzynek, 1998). The use of spikes for robust interchip communication of analog information and for performing addition and multiplication of spike trains has also been explored (Murray, Del Corso, & Tarassenko, 1991). The organization of this article is as follows. Section 2 presents three basic elements essential for hybrid computation to scale to systems of large complexity: the use of distributed analog computation, the use of a quantization mechanism for signal restoration, and the use of a state machine for implementing complex computations using a sequence of elementary computations. The following three sections discuss how these elements of hybrid computation are naturally implemented in spike-based computers. Section 3 discusses spike-based distributed analog computation. Section 4
discusses signal restoration by spike-time to spike-count quantization. Section 5 defines, discusses, and presents an experimental silicon implementation of a hybrid state machine (HSM), an important control architecture for sequencing operations in hybrid computing; the experimental implementation presents data from a successive subranging analog-to-digital converter that works by sequencing two spiking neurons through various charging states. Section 6 presents experimental data from a silicon HSM that performs spike-time-based pattern recognition; applications of HSMs beyond the experimentally demonstrated analog-to-digital converter and pattern recognizer are also discussed in this section. Section 7 concludes by summarizing the key ideas presented in this article.

2 Scalable Hybrid Computation

There are three key ideas in hybrid computation that enable it to scale to systems of large complexity. The first two of these principles were discussed in Sarpeshkar (1998), but we shall briefly review and extend them in section 2.1. In section 2.2, we discuss the importance of feedback and learning for signal restoration. Although we do not explore feedback in much experimental depth in this article, section 2.2 illustrates why analog signal restoration should not be accomplished with feedback alone.

2.1 Three Principles for Scalability. The following principles constitute the core of scalable hybrid computation:

1. We compute with a multiwire representation of information that divides the information processing into independent low-precision parts, which many low-precision analog stages can collectively handle efficiently. Unlike digital computation, the precision is not necessarily 1 bit per wire (it could be 1, 2, 3, 4, 5, . . . bits per wire, depending on the operating speed and noise inherent in the technology), and the computing primitives do not necessarily arise from logic (they are analog primitives).
Thus, we refer to this idea as the idea of distributed analog computation. Note that there are schemes called multilevel logic schemes, which carry more than 1 bit of information per wire. In these schemes, however, the primitives are still logical primitives, which define rules for switching among the various levels.

2. We achieve noise immunity by frequent discrete signal restoration of the analog signals on the low-precision channels. The signal restoration is done to the precision required of the channel. For example, a 2-bit precise analog channel would be restored to whichever of four discrete levels is closest to the analog value. We refer to this idea as performing signal restoration via an analog-to-digital-to-analog (A/D/A) converter. The A/D/A behaves simply like an analog-to-digital converter followed by a digital-to-analog converter, although it is not always implemented as such. Effectively, it functions to restore the analog signal to one of a few discrete attractor levels. The number of attractor levels scales exponentially with the precision of
the analog channels. In order for this signal restoration scheme to work, the distance between attractor levels must be much greater than the accumulated noise and offset between restoration stages (a few standard deviations, depending on how much bit-error rate is tolerable). This signal restoration scheme is a generalization of the 0-or-1 restoration scheme of digital systems, which likewise requires that the typical noise between stages be much less than the distance in voltage space between a 0 and a 1. The optimal amount of analog processing that may be performed before a signal restoration is implemented is analyzed in Sarpeshkar (1998). The A/D/A has been proposed as a building block for storage and processing applications (Cauwenberghs, 1995), but not in the context of a scheme for systematic signal restoration in hybrid systems, such as we propose. Forms of signal restoration more sophisticated than an A/D/A may be imagined. For example, analog systems with redundant channels that represent information analogously to the error-correction bits of a digital system may be built, or analog systems that implement an A/D/A with implicit rather than explicit digitization may be devised. Signal restoration may be performed at a higher level of the system if there are only a few discrete possibilities and we have to choose among them. Analog dynamical systems with compressive nonlinearities may implicitly possess signal restoration capabilities. All signal restoration schemes that we know of have one common characteristic: the restoration is achieved by making the distance between discrete possibilities sufficiently great to resist corruption by continuous noise. The A/D/A is the simplest example of such a scheme. In this article, we confine ourselves to the A/D/A because it is the simplest, most general, and most scalable form of signal restoration that we know of at this time.

3.
Although enormously complex digital systems may be built by simply cascading logic stages, it is very inefficient in hardware to build systems in this way. Rather, it is more efficient to construct a dynamical system, termed a finite state machine (FSM), which, conditioned on its present state or on its inputs, can evolve through a sequence of states and, while it does so, perform a complex computation. The complexity is possible because in each state, the machine can do something different with the same hardware, since its inputs have different values. In effect, we construct a complex computation out of the sequential reuse of a single computational block. For example, microprocessors, which are essentially complex FSMs operating on memory, implement a filtering computation out of a sequence of adds and multiplies; they have one arithmetic logic unit (ALU) that is constantly reused in order to implement this computation. Parallel computation with multiple hardware units may enable us to use less sequentiality and more parallelism, but it does not alter the fact that any parallel machine has only a finite number of parallel channels. Any complex computation will still require a sequence of different operations that need to be coordinated and combined with each other.
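The A/D/A restoration of principle 2 can be sketched behaviorally as a nearest-level quantizer. The cascade below (a minimal software sketch, not a circuit model) shows that noise fails to accumulate as long as the per-stage noise stays well below half the spacing between attractor levels:

```python
import random

def adda_restore(v, bits, vmin=0.0, vmax=1.0):
    """Snap an analog value to the nearest of 2**bits discrete attractor levels."""
    levels = 2 ** bits
    step = (vmax - vmin) / (levels - 1)
    k = min(max(round((v - vmin) / step), 0), levels - 1)
    return vmin + k * step

random.seed(0)
signal = adda_restore(0.4, bits=2)          # 2-bit channel: levels 0, 1/3, 2/3, 1
for _ in range(1000):                       # long cascade of noisy analog stages
    noisy = signal + random.gauss(0, 0.02)  # per-stage noise << half-spacing (1/6)
    signal = adda_restore(noisy, bits=2)    # restore after every stage
# signal stays pinned to the 1/3 attractor level; the noise never accumulates
```

With noise comparable to the half-spacing, restoration would instead make occasional discrete errors, which is why the attractor-level distance must exceed the accumulated noise by a few standard deviations.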
Even small-order nonlinear analog dynamical systems are capable of very complex computations. However, we do not know of any systematic way to control these systems and to exploit them for our use, except for certain special-purpose computations. One may wonder whether it is possible to construct a simple generalization of digital FSMs to hybrid systems that is easy to control and reconfigure and that exploits the analog computational properties of a technology rather than just its Boolean or switching aspects. In section 5, we discuss how to construct such a machine, termed a hybrid state machine (HSM). Just as clocks are natural in a discussion of FSMs, we find that asynchronous spiking events are natural in HSMs.
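Principle 3's sequential reuse of a single computational block can be illustrated in software: the sketch below (illustrative, not from the article) computes an FIR filter the way a microprocessor would, by stepping one multiply-accumulate "ALU" through a sequence of states:

```python
def mac(acc, a, b):
    """The single 'ALU': one multiply-accumulate, reused at every step."""
    return acc + a * b

def fir_filter(x, taps):
    """Filtering as a sequence of states, each state reusing the same MAC unit."""
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):      # the 'state machine' sequences the MAC
            if n - k >= 0:
                acc = mac(acc, h, x[n - k])
        out.append(acc)
    return out

y = fir_filter([1.0, 2.0, 3.0, 4.0], taps=[0.5, 0.5])  # 2-tap moving average
```

The entire filter is built from one reused block plus sequencing logic, which is the FSM pattern that the HSM generalizes to analog state.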
2.2 Feedback, Learning, and Signal Restoration in Analog Systems. The three principles are suitable for building efficient feedforward architectures. However, no feedforward architecture, no matter how efficient, can be error free unless it computes with perfect knowledge, perfectly precise devices, and in an unchanging environment. Such errors may be corrected in a feedback architecture that obtains information from its environment to alter its feedforward parameters to minimize error, that is, an architecture with learning. It is tempting to think that learning can correct for all errors and therefore that discrete signal restoration is not important in complex analog systems. However, it is worth noting that if the burden of error correction is placed solely on the feedback portion of the architecture, then the error of the feedback portion must itself be small. This is because the feedback portion cannot correct the error that it itself introduces (Franklin, Powell, & Emami-Naeini, 1994). In a complex system, either or both of the feedforward and feedback portions will be complex. If the feedback portion is complex, then it will require signal restoration to keep its error small. If the feedback portion is simple, then signal restoration may not be necessary; in that case, however, the loop gain of the feedback learning system will need to be large to attenuate the large output errors of the complex feedforward system. A large loop gain causes instability or poor convergence (or both) in the learning system unless the dynamics of the learning are impractically slow. Thus, it is not wise to use learning to correct all error. Rather, it is better to encode as much as is known about the problem into the structure of the architecture and to reduce the error of the feedforward portion of the architecture as much as possible. Then learning may more easily correct the residual errors that remain in the system.
Hence, our use of discrete signal restoration to prevent noise accumulation in complex analog systems is an important scaling principle even in systems that learn. For a discussion of circuits and systems that employ learning in hardware, see Cauwenberghs and Bayoumi (1999). In this article, we shall not focus much on learning, but it is clearly an important component of any scalable hybrid computer.
Scalable Hybrid Computation with Spikes
2009
3 Spike-Based Distributed Analog Computation

An important reason for the efficiency of digital systems at high precision lies in their ability to represent a number in a distributed fashion, specifically in base 2 modulo arithmetic. It is then natural to wonder whether we may construct a distributed analog number. The ability to construct such a number can serve as a foundation for other distributed analog operations; it also represents an important special case of the more general idea of distributed analog computation. In this section, we discuss why spikes are natural for representing distributed analog numbers. After this discussion, we provide a brief example of how spike-based representations may be useful in other distributed analog systems where the information is in the values of many limited-precision analog variables rather than in a number. A general discussion of the use of spikes in all distributed analog systems is beyond the scope of this article. As Figure 1a shows, we begin by representing a positive analog number by a state variable Q_state, the charge on a capacitor. We will discuss only the case of positive numbers since the generalization to negative numbers is straightforward. Each capacitor's charge is changed by charging the capacitor with the sum of all its input currents Σ_i I_i. The time period of integration, t_in, may be gated by a pulse input or ungated. For simplicity, we have assumed a common finite integration time for all currents, but our ideas are not altered by this assumption. Unless otherwise mentioned, we will also assume that all I_i are positive. Since we are interested not merely in representing numbers but also in computing with them, it is worth noting for the future that the use of Kirchhoff's current law gives us primitives for an analog add (Σ_i I_i), and the use of integration gives us primitives for an analog multiply (Q = I × t).
At some point, due to continuous charging, the analog charge variable is bound to reach the upper limit of its dynamic range: the power supply voltage or a fraction of it. To prevent this from happening, we set a threshold level of charge Q_T, such that if Q_state ≤ Q_T, we allow charging, but if Q_state > Q_T, we reset the charge variable to 0, signal to a neighboring channel that we have done so by firing a spike, and then resume charging. A channel refers to a similar charge-and-reset unit on another nearby capacitor. The spike causes the neighboring channel to increment its charge by a discrete amount that is a fraction of Q_T. In Figure 1b, we introduce a pictorial phase representation for an integrate-and-fire unit that naturally represents both its analog and digital characteristics and is thus very useful in depicting hybrid algorithms. The representation consists of a circle and an arrow that rotates about the circle's axis, much like the hand of a clock. The angular position of the arrow with respect to 12 o'clock represents the value stored in the integrator as a fraction of the maximum allowable level of the integrator. For example, if the integrator is holding a value that is one-quarter of its maximum level, the arrow would
Figure 1: Distributed representation of an analog number. (a) State equations defining the representation of a positive analog number as charge on a capacitor. (b) A phase representation of the above state equations. A carry between two neighboring integrate-and-fire units is shown, effectively creating a form of analog number representation. (c) High-level circuit implementation of the above scheme.
point directly to the right, 90 degrees from vertical. Other variables also have significance in this representation. The rate of integration is analogous to the angular speed of the arrow, and the integrator's maximum allowable level is equivalent to 2π radians, a fact that leads to a natural representation of the fire-and-reset event. When the arrow reaches 2π radians, the unit fires a spike and naturally resets its angular variable back to a value of 0 radians (representing zero charge). The advantage of this modular representation is that it depicts both the analog state of the integrate-and-fire unit and also its digital spiking events. We see that Figure 1b also illustrates a form of a number representation with a carry when more than one integrate-and-fire unit is present. Real numbers are represented by their angular positions on a circle. When the right-most unit's arrow reaches 2π, we wrap around to 0 (symbolizing a reset) and fire a spike, signaling to the neighboring channel on its left that we have performed one full revolution. The adjacent channel keeps track of the number of full revolutions performed by partially incrementing (less than Q_T) its charge for each full revolution of its neighbor. This idea can be repeated ad infinitum so that complete revolutions of the neighbor are kept track of by partial increments of yet another channel. We have thus solved the limited-dynamic-range problem of an analog variable by a divide-and-conquer strategy that represents the information about the variable across many channels. So far, all of these operations have been defined on the set of real numbers. However, in accordance with the hybrid philosophy of representing only a limited amount of information per channel and restoring and preserving that information, we will at some point need to quantize the channel angle by rounding it up or down.
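The charge-and-reset channels with spike carries described above can be sketched in software as follows (class and parameter names are our own; a carry fraction of 1/4 corresponds to the 2-bit, radix-4 case):

```python
class Channel:
    """One charge-and-reset unit; a spike carries to the left neighbor."""

    def __init__(self, q_threshold=1.0, neighbor=None, carry_fraction=0.25):
        self.q = 0.0                          # analog state: capacitor charge
        self.q_threshold = q_threshold        # the reset threshold Q_T
        self.neighbor = neighbor              # channel that receives our spikes
        self.carry_fraction = carry_fraction  # carry size as a fraction of Q_T
        self.spikes_fired = 0

    def charge(self, current, dt):
        """Integrate current for time dt (Q = I * t), then fire-and-reset."""
        self.q += current * dt
        self._settle()

    def receive_spike(self):
        """Partial charge increment caused by a neighbor's spike."""
        self.q += self.carry_fraction * self.q_threshold
        self._settle()

    def _settle(self):
        # Fire a spike and reset whenever charge exceeds the threshold Q_T.
        while self.q > self.q_threshold:
            self.q -= self.q_threshold
            self.spikes_fired += 1
            if self.neighbor is not None:
                self.neighbor.receive_spike()
```

For example, charging the right-most channel through 4.5 thresholds' worth of charge fires four spikes, leaving the left neighbor holding four quarter-threshold increments.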
For example, if we represent 2 bits of information per channel, at the time of the signal restoration, we will only care about what quadrant of angular space the angular state is in. Thus, we will round the state variable to whichever angle in the set (0, π/2, π, 3π/2, 2π) it is closest to. If it is closest to 2π, we fire a spike and reset the variable to zero. A natural way of preserving the 2-bit representation across channels is to increment the neighboring channel by π/2 whenever the current channel has finished a full revolution of 2π. Such a scheme is illustrated in Figure 1b, where we charge the neighboring channel by 1 unit when the current channel has completed a full revolution of 4 units. If we repeat this idea such that full revolutions of the neighbor result in a one-fourth revolution of yet another neighbor to the left of it and so on, we create a method for approximating an analog number in a number representation based on radix 4. In general, each of the channels of our distributed number representation may also have its own analog input currents that arise from another source besides the immediate neighbor to their right. It is important to note that this is not a multilevel logic scheme in radix 4 because we are computing on the set of reals with real-numbered primitives, which are resolution independent, and rounding off to the set of
integers; in contrast, multilevel logic schemes compute on the set of integers with integer primitives that are resolution dependent (the number of levels and the radix change the logical rules completely). While our primitives of computation are resolution independent (the law of adding two real numbers is the same independent of precision), the precision to which we may round a continuous number to its nearest discrete approximator is resolution dependent. By changing the resolution to which we round numbers, by having the round-off resolution vary in different parts of a large system, or by having the round-off resolution adapt with the speed of the computation, distributed analog representations of a number can implement sophisticated adaptive precision schemes that obviate the need to maintain the highest level of precision in all parts of the system or at all times. Figure 1c shows how such a scheme would be implemented as a circuit. Input currents charge a capacitor. When the capacitor has reached a certain threshold voltage, it fires a spike via a spiking neuron circuit. The spike is communicated to an adjacent channel that increments a spike counter and, via a simple D/A, causes a partial increment of charge in a neighboring spiking neuron. In actual implementations, an explicit spike counter and D/A are not really necessary. The arriving pulse can gate a standard charging current to the neighbor. The charging current may be chosen such that the charge is incremented by one unit. Such a scheme can be implemented with as few as two transistors in series (one for gating and one for the current source).¹ Figure 2 contrasts how digital, analog, and hybrid paradigms of computation represent a number. In this example, each computational scheme counts the number of pulses in a particular input signal and represents the answer as a number.
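The quadrant-rounding restoration step described earlier can be sketched as follows (our own illustration; a channel whose angle is closest to 2π fires a spike, resets, and carries a quarter revolution to the channel on its left):

```python
import math

def restore(angles):
    """Round each channel's angle to the nearest multiple of pi/2.
    angles: list of angular states, right-most channel last.
    An angle closest to 2*pi fires a spike (reset to 0) and carries a
    quarter revolution to the left neighbor (lower index)."""
    restored = list(angles)
    for i in range(len(restored) - 1, -1, -1):  # right-most channel first
        q = round(restored[i] / (math.pi / 2))  # nearest multiple of pi/2
        if q >= 4:                              # closest to 2*pi: spike + reset
            restored[i] = 0.0
            if i > 0:
                restored[i - 1] += math.pi / 2  # carry: quarter revolution
        else:
            restored[i] = q * (math.pi / 2)
    return restored
```

For example, a right-most channel near a full revolution rounds to 2π, resets, and bumps its left neighbor up by π/2 before that neighbor is itself rounded.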
In the purely digital case, shown in Figure 2a, each pulse is an event that causes a binary counter to toggle; a cascade of 12 1-bit counters is capable of counting up to 2^12 − 1 input pulses and signaling an overflow in the last counter when the 2^12th input pulse arrives; the number is encoded by 12 1-bit precise channels. In the purely analog case, shown in Figure 2b, each pulse causes an increment in the subthreshold charge of an integrate-and-fire spiking neuron. When more than 2^12 − 1 input pulses have arrived, an overflow on the arrival of the 2^12th pulse is signaled by a suprathreshold spike fired by the neuron; the number is encoded by a single 12-bit precise channel. In the hybrid case, shown in Figure 2c, each pulse causes an increment in the subthreshold charge of the right-most integrate-and-fire spiking neuron. When more than 16 spikes have arrived, the right-most neuron exceeds threshold and fires a suprathreshold spike
¹ Any accumulated offsets due to mismatches in the current sources are irrelevant so long as we ensure that the sum of all these offsets is less than half the separation between discrete attractor levels prior to the round-off operation.
Figure 2: Number representations in (a) digital, (b) analog, and (c) hybrid paradigms.
that increments the subthreshold charge of the middle neuron by a sixteenth of its threshold amount. This overranging-and-reset scheme is repeated via a similar pulse-to-charge synaptic "carry" between the middle neuron and the left-most neuron in Figure 2c. A distributed representation of a number in radix 16 is created among the three neurons; three 4-bit precise units serve to encode a 12-bit number; the arrival of the 2^12th input pulse is signaled by an overflow spike from the left-most neuron of Figure 2c. It is important to note that the tick marks in Figure 2c that represent discrete levels are not really present during normal analog computation but are created when A/D/A computations perform readout and signal restoration on each channel. In fact, if we are not interested in monitoring the exact number of input spikes, but only in whether the number of input spikes is at least 2^12, we need only monitor the left-most spike of Figure 2c and never perform signal restoration on each of the channels. Thus, Figure 2c shows how we may create a distributed representation of a real number and quantize this real number
into an integer if we want to. Note that the external signals in Figure 2c may cause all the digits of this number to be real even if the carries from neighbors are integer carries. It is obvious that if we had five discrete restoration angles per circle, represented as (0, 1, 2, 3, 4), our scheme would operate in radix 5. In general, operation in any radix is possible and is determined entirely by how much precision per channel we want to represent. For didactic purposes, in all future discussions, unless otherwise mentioned, we will assume 2 bits of precision per channel; our schemes are, however, perfectly general, and the extension to higher or lower bits of precision per channel is straightforward. Generally, the optimal precision per channel falls with increasing bandwidth or speed of operation of the channel. Information about the auditory or visual sensory worlds does not come to us in the form of numbers but rather as a distributed array of low-precision analog variables. In the case of audition, the outputs of an auditory front end such as a cochlea or a bank of bandpass filters comprise a distributed representation of a sound signal as a frequency vector with many components. In processing such a vector, spikes phase-locked to the fine time variations of each frequency component are useful in signaling various features in the vector (e.g., the pitch of a signal or binaural delays between ears). In such computations, we exploit the ease of performing autocorrelation or cross-correlation with spiking signals and delays. Computations such as the extraction of common periodicities, detection of recurring temporal patterns, and formation and separation of invariant spike trains that subserve auditory objects have been proposed in neural timing nets (Cariani, 2001). Spike-based distributed processing is useful in signaling important discrete events in the underlying analog signal.
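The three-neuron, radix-16 hybrid counting scheme of Figure 2c can be sketched in simplified integer form (our own illustration; charge is measured in units of one-sixteenth of threshold):

```python
def hybrid_count(num_pulses, radix=16, channels=3):
    """Count input pulses across integrate-and-fire channels in `radix`.
    Returns the per-channel charge (charge[0] is the right-most neuron)
    and the number of overflow spikes from the left-most neuron."""
    charge = [0] * channels
    overflow = 0
    for _ in range(num_pulses):
        charge[0] += 1                  # each input pulse: one unit of charge
        for i in range(channels):
            if charge[i] >= radix:      # threshold reached: fire and reset
                charge[i] -= radix
                if i + 1 < channels:
                    charge[i + 1] += 1  # spike carries to the left neighbor
                else:
                    overflow += 1       # left-most spike: 16^3 pulses counted
    return charge, overflow
```

Three 4-bit channels thus count up to 2^12 − 1 = 4095 pulses, and the 4096th pulse produces the overflow spike from the left-most neuron, matching the description of Figure 2c.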
In the case of a distributed analog number, the events signify overranging-and-reset behavior in a given channel, with the spike being a carry to the neighboring channel. In the case of a frequency vector, spiking events allow one to easily extract the presence of certain features in the signal, which are of importance in, say, speech recognition or spatial audition.

4 Spike-Based Signal Restoration (Quantization)

We have not yet described the details of how we accomplish the rounding operations that are so important for signal restoration. That is the subject of analog-to-digital conversion. In the process of analog-to-digital conversion, a continuous variable is converted to its nearest discrete approximator, which is generally represented as a multibit binary number. A simple D/A converter then converts the binary number back to a continuous number to implement an A/D/A; the whole A/D/A process performs a rounding operation. We shall not describe our implementation of digital-to-analog conversion since it is based on straightforward current-mode architectures. Such architectures are described in Razavi (1995).
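The A/D/A rounding operation just described can be sketched with an idealized uniform quantizer (our own illustration; the real converters are current-mode circuits):

```python
def ada_round(x, bits, full_scale=1.0):
    """A/D/A as a rounding operation: quantize x to the nearest n-bit code
    (the A/D step), then map the code back to a continuous level (the D/A
    step). The composition rounds x to the nearest quantizer level."""
    levels = 2 ** bits
    lsb = full_scale / levels   # one least-significant-bit step
    code = round(x / lsb)       # A/D: nearest discrete code
    return code * lsb           # D/A: back to a continuous level
```

For example, with 2 bits over a unit full scale, 0.30 restores to the attractor level 0.25; with 3 bits, 0.70 restores to 0.75.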
In the mechanisms for quantization described below, the value of the analog subthreshold charge on a neuron is first encoded into the time between spikes, then partially quantized into a digital spike-count code in another neuron, and finally transformed back to analog subthreshold charge for future higher-precision conversion. It is worth noting, therefore, that spike-time codes and spike-count codes are used concurrently. In this application, digital spike-count information serves to quantize analog spike-time information. Figure 3a shows the angular state variables of two spiking neurons, called the timing neuron (TN) and counting neuron (CN), respectively. Initially, the counting neuron has an angular state variable that is zero, while the timing neuron has its angular state variable equal to −θ or, equivalently, to φ = 2π − θ. If we now charge CN with a current equal to I_0 while we charge TN with a current equal to I_0/4, then the charging rate of CN is four times the charging rate of TN. If we stop charging both neurons as soon as TN spikes, that is, as soon as Sp_TN becomes active, we will cause CN to increment its state variable by 4θ in the time that TN increments its state variable by θ. Thus, if

θ = n(π/2) + ε,    (4.1)

where

n ∈ {0, 1, 2, 3},    (4.2)

0 ≤ ε < π/2,    (4.3)

then

4θ = n(2π) + 4ε.    (4.4)
Hence, CN will make n full revolutions and fire n spikes in the time that it takes TN to fire one spike. By counting the number of spikes that CN fires, we may evaluate θ rounded down to the nearest whole multiple of π/2. The arrow in Figure 3 is labeled Sp_TN to indicate that a state transition occurs at the onset of this spike. After this state transition, the charging currents to the neurons are altered. In Figures 3a and 3b, after the state transition, the charging currents to both neurons are 0, but in general, the charging currents could be altered to values dictated by subsequent processing. The charging currents for each neuron in a given state are labeled alongside the diagrammatic representation of their state variables at the beginning of the state. Figure 3b shows the same two neurons, but with the angular state variable of TN initially at +θ rather than −θ. This case is equivalent to the case
Figure 3: Quantization of subthreshold neuronal charge via (a) upcounting and (b) downcounting operations.
discussed in Figure 3a except that the multiply-by-4 operation is on 2π − θ rather than on θ. Thus,

4(2π − θ) = 4(2π − (n(π/2) + ε))
          = (4 − n)2π − 4ε
          = (3 − n)2π + (2π − 4ε),    (4.5)
and the number of spikes fired by CN in the time that TN takes to fire one spike is 3 − n. Thus, if we start at 3 and downcount the number of spikes
fired by CN, we may evaluate θ rounded down to the nearest whole multiple of π/2. The technique of Figure 3a is referred to as upcounting, and the technique of Figure 3b is referred to as downcounting. Equation 4.5 shows that when we do a downcounting evaluation of θ, we are left with a residue 2π − 4ε. This residue is in a form suitable for further evaluation by upcounting. Similarly, Equation 4.4 shows that when we do an upcounting evaluation of θ, we are left with a residue 4ε. This residue is in a form suitable for further evaluation by downcounting. The techniques of upcounting and downcounting enable us to do a coarse evaluation of θ to the nearest whole multiple of π/2, leaving a residue for finer evaluation in a subsequent stage of processing. The [0, π/2] range of ε is amplified to [0, 2π] for the fine-stage evaluation. By successively alternating upcounting and downcounting evaluations, we may get finer and finer estimates of θ. At the last stage of conversion, we perform error correction to get the least significant bit of the A/D conversion correct. To perform error correction, we are interested in knowing the value of θ to a resolution given by π/4 so as to round the final evaluation accurately to a resolution of π/2; for instance, if the value of θ was less than π/4, we would like to round it down to 0, and if its value was greater than π/4, we would like to round it up to π/2. To make such decisions based on π/4 divisions, we perform multiply-by-8 operations that map each π/4 sector to a 2π sector; the presence (or absence) of spikes caused by complete 2π revolutions can then serve as an indicator of θ's resolution with respect to a π/4 angular grid. Effectively, error correction is performed by replacing multiply-by-4 operations with multiply-by-8 operations and by making decisions based on increments of two spikes rather than on increments of one spike. Error correction may be performed in an upcounting or downcounting mode.
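The upcounting and downcounting evaluations of Figures 3a and 3b can be sketched with a simple discrete-time simulation (our own illustration; the time step dt stands in for continuous charging, so the recovered residues carry a small discretization error):

```python
import math

def upcount(theta, dt=1e-4):
    """Figure 3a: TN starts at phase 2*pi - theta and CN charges four times
    faster, so when TN spikes, CN has rotated through 4*theta and fired
    n = floor(theta / (pi/2)) spikes. Returns (n, residue ~ 4*eps)."""
    two_pi = 2 * math.pi
    tn, cn, n = two_pi - theta, 0.0, 0
    while tn < two_pi:        # stop charging as soon as TN spikes
        tn += dt              # TN charged at I0/4 (unit rate here)
        cn += 4 * dt          # CN charged at I0
        if cn >= two_pi:      # CN fires a spike and resets
            cn -= two_pi
            n += 1
    return n, cn

def downcount(theta, dt=1e-4):
    """Figure 3b: TN starts at +theta, so CN fires 3 - n spikes before TN
    does; downcounting from 3 recovers n, and the residue left in CN is
    2*pi - 4*eps, ready for a subsequent upcounting stage."""
    two_pi = 2 * math.pi
    tn, cn, spikes = theta, 0.0, 0
    while tn < two_pi:
        tn += dt
        cn += 4 * dt
        if cn >= two_pi:
            cn -= two_pi
            spikes += 1
    return 3 - spikes, cn     # downcount from 3; residue is 2*pi - 4*eps
```

For θ = n(π/2) + ε with n = 1 and ε = 0.1π, upcounting yields n = 1 with residue 4ε = 0.4π, and downcounting yields the same n with the complementary residue 2π − 4ε = 1.6π, as in equations 4.4 and 4.5.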
In practical circuits, I_0/4 is never exactly I_0/4; such circuits also have other errors and offsets. Hence, the operations represented in Figures 3a and 3b may result in four spikes being fired, although the mathematics would predict that this situation could arise only if θ were exactly zero. In the upcounting case, the fourth spike is interpreted to mean that within the resolution of our conversion, θ = 2π and ε = 0; in the downcounting case, the fourth spike is interpreted to mean that within the resolution of our conversion, θ = 0 and ε = 0. The firing of four spikes creates a zero residue and terminates any finer stages of evaluation. The precision of the conversion is thus necessarily limited by the precision to which we can evaluate small angles in the first stage of conversion.

5 Hybrid State Machines

To coordinate and synchronize the various operations needed during quantization, we need to create a machine that is capable of making the transition between discrete states in each of which there is a different configuration for the neuronal charging currents. The transitions from state to state are governed by spikes from the various neurons. Such machines are special cases of a more general class of machines termed hybrid state machines (HSMs). We first discuss the theory and operation of HSMs and then provide data from an experimental silicon HSM that implements a particular analog-to-digital conversion scheme. HSMs extend and expand the concept of an FSM in digital computation to the hybrid domain. Thus, we shall begin with a review of FSMs. To simplify notation, all mathematical variables in this section will be understood to be vectors.

5.1 Finite State Machines. FSMs, shown in Figure 4a, are the workhorse of digital computation. All computers may be viewed as nothing more than complex FSMs operating on memory. An FSM is described by a flow diagram of discrete states. Transitions between discrete states always occur on discrete clock cycles of a periodic clock, marked Clock in Figure 4. The inputs and outputs are given by the binary vectors DI and DO, respectively. The state of the machine is specified by the binary vector D. The transition to the next state depends on the current state and on the current value of the inputs. To describe an FSM completely, we need only specify:

1. Its initial state, given by the binary vector D^0.

2. Its next-state function: the function that describes what the next state of the machine will be, given its current state and its inputs. If the next-state function is N_D(·), then

D^(n+1) = N_D(D^n, DI).    (5.1)
3. Its output function: the function that describes how the output of the machine is a function of its current state and its inputs. If the output function is O_D(·), then

DO = O_D(D^n, DI).    (5.2)
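As a concrete instance of this specification (a hypothetical example, not from the article), here is a two-state FSM that outputs 1 exactly when the current and previous input bits are both 1:

```python
def next_state(state, bit):
    """N_D: next state given current state and input bit; the state simply
    remembers whether the previous bit was a 1."""
    return 1 if bit == 1 else 0

def output(state, bit):
    """O_D: output 1 exactly when the previous bit and this bit are both 1."""
    return 1 if (state == 1 and bit == 1) else 0

def run_fsm(bits, state=0):
    """Run the FSM from initial state D^0 = 0 over a serial bit stream."""
    outs = []
    for b in bits:
        outs.append(output(state, b))  # DO = O_D(D^n, DI)
        state = next_state(state, b)   # D^(n+1) = N_D(D^n, DI)
    return outs
```

On the input stream 1, 1, 0, 1, 1, 1 the machine emits 0, 1, 0, 0, 1, 1: the output goes high on every bit that completes a run of two or more 1s.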
Any FSM may be built with a synchronous clocked register that encodes its current state and a combinational logic block, as shown in Figure 4a. For proper operation of the machine, the period of the clock must be sufficiently long so that all changes caused by the previous transition of the clock have settled before we embark on the succeeding clock transition. FSMs are described in great detail in any traditional textbook on digital computation (e.g., see Rabaey, 1996).

5.2 Intuitive Description of a Hybrid State Machine. An HSM possesses both discrete digital states and continuous analog states, next-state functions for its discrete and continuous states, analog and digital outputs that are functions of its current state and its analog and digital inputs, and
Figure 4: (a) Finite state machine. (b) Hybrid state machine.
interaction functions that define the interactions between its analog and digital parts. There is no clock, and all transitions are triggered by spiking events. FSMs are a special case of HSMs with only digital inputs, discrete states, and one special input, called the clock, that is capable of triggering transitions between states. Figure 4b illustrates the architecture of an HSM. The machine is composed of two parts: (1) an analog part with analog state, encoded, for example, on a set of capacitor charge variables, and with analog inputs and outputs, and (2) a digital part with discrete states encoded on a set of register variables and with digital inputs and outputs. In each discrete state, the digital part may alter the values of a set of binary outputs; these outputs control switches in the analog part of the system and consequently affect the analog system's operation. In turn, the analog part generates spiking events that trigger transitions between discrete states in the digital part. The digital control of the analog part may occur in two forms: parameter control and topology control. In the parameter control form, D/A converters (DACs) explicitly convert digital control bits to physical analog parameters that affect the operation of the analog system. For example, in a particular state, we may create a current equal to 4I_0 by switching certain control bits that are fed to a DAC to 100; this current will change the state of a capacitor in the analog system for which the current is a charging input. Analog parameters such as thresholds, pulse widths, charging currents, and capacitance values may be altered in this way. Further, digital control bits may set certain analog parameters to zero while enabling other analog parameters to express their value. For example, an analog input from the outside world, I_in, may be allowed to assume its value during a sensing operation, or it may be effectively set to zero, depending on the state of a certain control bit. In the topology control form, digital control bits may reconfigure an analog topology by turning certain switches on or off. For example, an operational amplifier that is present in the analog system may be disconnected from the rest of the system when a particular digital control bit is 1. The analog control of the digital part occurs by having a defined spike or a defined logical combination of spikes trigger state transitions in the digital part.
The definitions of which spikes trigger state transitions may vary from state to state, unlike a regular FSM, where the only state-triggering input is the clock. The transitions between the discrete states may be conditioned on spikes as well as on the states of other digital inputs. Section 6.4 describes the digital portion of an HSM, which is referred to as a spike-triggered finite state machine (SFSM). Certain changes in discrete states may result in no changes in the analog part. For example, in the A/D scheme of section 4, counting spikes in a particular phase of A/D conversion alters the discrete states of a counter but does not affect any analog parameters. The analog parameters change only after we have made the transition to another phase of operation; the transition is triggered by spikes that are not related to the spikes counted in the prior phase of operation.

5.3 Formal Definition of a Hybrid State Machine. To specify an HSM, shown in Figure 4b, we need to specify its initial states, its next-state functions, its output functions, and its interaction functions. We shall assume that the digital inputs and outputs are given by the binary vectors DI and
DO, respectively; the analog inputs and outputs are given by the real vectors AI and AO, respectively; the digital state is represented by the binary vector DS; the analog state is represented by the vector AS; the spiking state variables are represented by the binary vector Sp; and the binary parameter vector C represents all the digital bits that control the analog part of the machine. The specification of the machine is then given as follows:

1. The initial state. The initial state is given by specifying the initial state of (DS, AS). Usually, we initialize DS to an initial state that we label 0 and initialize AS to have all its components be 0.

2. The next-state functions. The digital next-state function N_D(·) governs the evolution of the current digital state DS^n to the next digital state DS^(n+1). Thus,

DS^(n+1) = N_D(DS^n, DI, Sp),    (5.3)

where N_D(·) is a logic function on all of its inputs, that is, a logic function on DS^n, the components of DI, and the components of Sp. Whenever a logic function is defined with spiking inputs, it is necessary to have an implicit or explicit synchronicity window within which a spike arrival symbolizes a 1 for that input and the absence of a spike arrival symbolizes a 0. This window can be explicitly defined, for example, by an interspike interval between spikes in Sp, or implicitly defined, for example, by the leakage rate of a component of AS. All state transitions from DS^n to DS^(n+1) may be asynchronously triggered by transitions in DI or Sp. The analog next-state function is described in general by a differential equation,

dAS/dt = f(AS, AI, C),    (5.4)
where f(·) is, in general, some linear or nonlinear function of its inputs.

3. The output functions. The set of digital output functions represented by O_D(·) is given by

DO = O_D(DS, DI, Sp),    (5.5)

where O_D(·) is a set of logic functions on each of its inputs. The analog output functions are defined by

AO = g(AS, AI, C),    (5.6)
where g(·) is, in general, some linear or nonlinear function of its inputs.

4. The interaction functions. The binary vector C that represents the digital control of the analog system is given by

C = C_f(DS, DI),    (5.7)
where C_f(·) is a logic function on all of its inputs. The binary vector Sp that represents the spiking output of the analog system is given by

Sp = Sp_f(AS, AI, C),    (5.8)
where Sp_f(·) is a function that defines when spikes are fired by the analog system. In this article, all of our implementations of the analog part in HSMs involve simple integrate-and-fire neurons. In general, however, any spike-producing analog dynamical system may be used in these machines. The analog dynamical system should be chosen to reflect the nature of the computation that the state machine is designed to perform. Integrate-and-fire neurons lend themselves naturally to computations that involve various forms of pattern recognition (analog-to-digital conversion is an example of a scalar pattern classification problem), arithmetic, and filtering. For performing vision processing, the analog part may involve resistive grids that do image smoothing, while for performing auditory processing, the analog part may involve a filter bank. It is hard to build robust, long-term analog memory. Thus, we shall generally assume that the analog state is volatile and destructible over a long timescale. Nonvolatile digital state is easily built by the use of positive feedback, as in static random access memories. Thus, if it is important to store analog state variables over a timescale that is much longer than the typical leak time of the capacitors, we will need to perform analog-to-digital conversion to the precision needed and store the analog state digitally. Such conversion may easily be done using the schemes discussed in section 4. In fact, it is easy to have an HSM that performs A/D conversion embedded within a larger HSM. We make the transition from states of the large machine to states of the conversion machine when we want to perform analog-to-digital conversion and then return to the large machine when we are done with the conversion. Thus, we have effectively created an analog-to-digital-conversion subroutine that we call from a larger program.
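A toy HSM in the style of this formal definition (our own sketch, not the authors' silicon implementation): a single integrate-and-fire analog state whose spike triggers the one digital state transition, which in turn sets the charging current to zero via the interaction function C_f:

```python
import math

class ToyHSM:
    """Two discrete states: in state 0 the neuron charges; its spike moves
    the machine to state 1, where the charging current is switched off."""

    def __init__(self):
        self.DS = 0           # digital state
        self.AS = 0.0         # analog state: phase of one neuron
        self.spike_count = 0

    def control(self):
        """Interaction function C = C_f(DS): charging current per state."""
        return 1.0 if self.DS == 0 else 0.0

    def step(self, dt):
        self.AS += self.control() * dt  # dAS/dt = f(AS, C)
        if self.AS >= 2 * math.pi:      # Sp = Sp_f(AS): fire and reset
            self.AS -= 2 * math.pi
            self.spike_count += 1
            self.DS = 1                 # DS^(n+1) = N_D(DS^n, Sp)

def run(hsm, t_total, dt=1e-3):
    for _ in range(int(t_total / dt)):
        hsm.step(dt)
    return hsm.DS, hsm.spike_count
```

Run long enough, the machine fires exactly one spike, makes its single state transition, and then holds: the spike is the asynchronous event that triggers the digital transition, and the digital state reconfigures the analog charging current.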
If we need to re-create some analog state on a capacitor and this state has been stored digitally, we have a digital-to-analog converter charge the capacitor for a fixed amount of time with a current that represents the result of the digital-to-analog conversion. The storage and re-creation process of an analog state variable is a form of signal restoration implemented using the A/D/A technique described in section 2. In keeping with the philosophy of hybrid computing, we will not usually attempt to preserve analog information on any one capacitor to a high degree of precision. The HSM has some similarities to sequential networks that have been proposed for use in sequence recognition (Elman, 1990; Kohonen, 1989; Cleeremans, Servan-Schreiber, & McClelland, 1989; Jordan, 1985; Stornetta, Hogg, & Huberman, 1988; Mozer, 1989). These systems are composed of networks of neurons that are partially recurrent, with a synchronous clock
Scalable Hybrid Computation with Spikes
2023
regulating updates of the neuronal outputs. Such networks are clocked analog dynamical systems, with analog state information being preserved in certain special subsets of neurons. The HSM differs from these networks in that it may have any spike-producing analog dynamical system, not necessarily neural, for its analog part; in that its state transitions are asynchronously triggered by spiking events; in the explicit nature of the feedback connections between its analog and digital portions; and in the state-by-state reconfigurability of its analog dynamical system. The HSM has discrete signals and continuous signals, but it operates in continuous time. The continuous time flow in an HSM is asynchronously punctuated by discrete events that cause its state transitions. The state machines in the literature, in contrast, operate in clocked discrete time. FSMs form special cases of both sequential networks and HSMs. We have thus far not used HSMs to perform sequence recognition. In the following sections, we present experimental data from silicon HSMs for doing successive subranging analog-to-digital conversion and spike-time pattern recognition.
5.4 An Experimental Analog-to-Digital Converter HSM. We present results from an experimental proof-of-concept HSM configured to perform 4-bit A/D conversion. The converter was built on a very-large-scale integration (VLSI) chip in a 0.5 µm HP MOSIS process with a 3.3 V power supply voltage. The VLSI layout for the HSM, together with labels that identify its various parts, is shown in Figure 5. The converter used spiking-neuron circuits similar but not identical to those described in Sarpeshkar, Watts, and Mead (1992) and circuits similar to those described in section 6.4 to implement the SFSM. In this article, we provide a concise description of the converter's operation. A more detailed description, suitable for circuit designers, is provided in Sarpeshkar, Herrera, and Yang (2000) and Sarpeshkar (1999), which discuss architectures more suitable for higher precision and higher speeds than the converter described here; the key principles behind their operation, however, are similar. The converter obtains its two most significant bits by an overranging step and its two least significant bits by a 2-bit downcounting-and-error-correcting step. The flow diagram for the evolution of the converter is shown in Figure 6 and represents an example of what the flow diagram of an HSM looks like. There are four discrete states, corresponding to the rows of the figure, with the first discrete state corresponding to the top row. There are two neurons, corresponding to the columns of the figure, labeled the MSBs neuron and the LSBs neuron, respectively. The charging currents for each neuron in a given discrete state are shown alongside the neurons. The spikes that cause transitions between discrete states are labeled on the transition arrows between states.
2024
Rahul Sarpeshkar and Micah O’Halloran
Figure 5: VLSI layout of the analog-to-digital converter HSM.
The first state, SD, may be viewed as the default reset-and-wait state. Both angular state variables are reset to zero, and we wait for the control input, Phi, to go low. When Phi does go low, a large current, Ilrg, is used to cause the MSBs neuron to spike quickly. The activation of SpM causes a transition to the second state, SM. In the second state, SM, the two most significant bits are evaluated. The MSBs neuron is charged with a current given by I0/4, and the LSBs neuron is charged with Iin. We make the transition out of SM into the third state, SL, as soon as SpM is active. Since 0 < Iin < I0, the LSBs neuron may fire up to three SpL spikes during SM. These spikes are upcounted in a 2-bit MSBs counter initialized to 00. We refer to this step as an overranging step because it yields the integer part of the ratio Iin/(I0/4); it tells us by what factor Iin exceeds the reference current I0/4. If more than three spikes are fired, we have an overflow error that is interpreted as such; alternatively,
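The timing logic of the overranging step can be sketched with an idealized integrate-and-fire model (an illustrative sketch, not the authors' silicon circuit): a neuron charged by a constant current I reaches threshold after a time proportional to 1/I, so the number of LSBs-neuron spikes that fit inside the MSBs neuron's firing window is the integer part of Iin/(I0/4).

```python
# Illustrative timing model of the overranging step (not the authors' silicon
# circuit): an ideal integrate-and-fire neuron charged by constant current I
# reaches threshold after t = Q/I, where Q is the threshold charge.

def spikes_during(i_counting, i_reference, q_threshold=1.0):
    """Spikes fired by a neuron charged with i_counting while a reference
    neuron charged with i_reference integrates up to its first spike."""
    t_window = q_threshold / i_reference    # firing time of the MSBs neuron
    t_spike = q_threshold / i_counting      # inter-spike interval, LSBs neuron
    return int(t_window // t_spike)         # = floor(i_counting / i_reference)

I0 = 1.0                                    # full-scale reference current
for i_in in (0.1, 0.3, 0.6, 0.9):
    # the two most significant bits: integer part of Iin / (I0/4)
    print(i_in, spikes_during(i_in, I0 / 4))
```

With I0 = 1, the four sample inputs yield 0, 1, 2, and 3 spikes, covering all four 2-bit MSB codes without overflow.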
NOTE
Communicated by Yoshua Bengio
Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM J. Schmidhuber [email protected]
F. Gers [email protected]
D. Eck [email protected] Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, 6928 Manno-Lugano, Switzerland
In response to Rodriguez’s recent article (2001), we compare the performance of simple recurrent nets and long short-term memory recurrent nets on context-free and context-sensitive languages.
Rodriguez (2001) examined the learning ability of simple recurrent nets (SRNs) (Elman, 1990) on simple context-free and context-sensitive languages (CFLs and CSLs, respectively). He trained his SRN on short training sequences of length not much greater than 10 and found that the SRN does not generalize well on significantly larger test sets and frequently is unable to store the training set. Similar results were recently reported by Bodén and Wiles (2000), who studied the simple context-sensitive language (CSL) a^n b^n c^n. They trained SRNs on sequences defined by n = 1, 2, ..., 10 and found that SRNs fail to generalize reliably to n > 13. Sequential cascaded networks (SCNs) did only moderately better. We applied long short-term memory (LSTM) recurrent nets (Hochreiter & Schmidhuber, 1997) to similar problems (Gers & Schmidhuber, 2001a, 2001b). Many of our training set sizes were comparable to those of Rodriguez (2001) and Bodén and Wiles (2000). We found that LSTM almost always stores the training set and generalizes well on much larger test sets. For instance, we trained LSTM on strings from the CSL a^n b^n c^n with n ≤ 40. From this training data, LSTM readily generalized up to string size 1500; that is, it learned to accept legal test strings such as a^500 b^500 c^500 while rejecting slightly different illegal strings such as a^500 b^499 c^500. In other experiments, LSTM was able to generalize well having seen only very limited training data. For example, from the training strings with n = 50, 51, LSTM generalized to 43 ≤ n ≤ 57. Similarly, LSTM trained on the CFL a^n b^n for 1 ≤ n ≤ 30 generalized in the best case up to n = 1000 and on average
up to n = 408. We refer readers to Gers and Schmidhuber (2001b) for an analysis of this and similar tasks, as well as for LSTM equations. True, even LSTM did not learn these languages in the strict sense; we do not have proof that it will generalize to arbitrary n. But it is obvious that LSTM exhibits excellent generalization performance, while SRNs do not. Because LSTM was designed to solve hard problems and not specifically to address human cognition, it remains to be seen how well it will model behavioral data. LSTM also successfully learned CSLs that require embedded counters, for example, the deterministic palindrome language a^m b^n B^n A^m that Rodriguez (2001) discussed; traditional recurrent neural networks (RNNs) did not even learn the training set. We refer readers to Gers and Schmidhuber (2001b) for details. Testing more complex CFLs is also a topic of future research. How does LSTM solve problems like a^n b^n c^n? Counters of potentially unlimited size are automatically and naturally implemented by linear units, the constant error carousels (CECs) of standard LSTM, originally designed to overcome the error decay problems plaguing previous RNNs (Hochreiter, 1991; Hochreiter, Bengio, Frasconi, & Schmidhuber, 2001). Each linear CEC is surrounded by a cloud of nonlinear units responsible for controlling the flow of information in and out of the CEC. Whereas standard RNNs have a hard time implementing precise linear counters with their squashing functions (they need finely tuned weights to countermand the nonlinearities), LSTM can simply use the CECs for counting and so is free to focus its robust and precise weight adaptation on the nonlinear aspects of sequence processing. In the case of the CFL a^n b^n, LSTM uses one CEC to count up for a's and down for b's while using nonlinear gates to protect the CEC from irrelevant signals and to generate the appropriate accept or reject response.
In the case of the CSL a^n b^n c^n, LSTM uses one CEC to count up for a's and down for b's and a second CEC to count up for b's and down for c's. SRN and SCN counting mechanisms are analyzed in Rodriguez (2001) and Bodén and Wiles (2000). The CEC-based counters not only deal much better with the well-known long time lag problem (Hochreiter, 1991; Hochreiter et al., 2001) but also with the problem of installing easily accessible linear counters whose increments and decrements implement natural push and pop operations. This explains in a nutshell why LSTM outperforms other recurrent nets not only on regular languages (Hochreiter & Schmidhuber, 1997) but also on context-free and context-sensitive languages (Gers & Schmidhuber, 2001b).
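The two-counter strategy described above can be written out by hand. The following is not a trained LSTM, just the arithmetic the article says the linear CECs end up implementing for a^n b^n c^n: one counter goes up on a's and down on b's, a second goes up on b's and down on c's, and a string is accepted only if both counters return to zero.

```python
# Hand-coded sketch of the CEC counting strategy for a^n b^n c^n (not a
# trained network): c1 counts up on 'a' and down on 'b'; c2 counts up on
# 'b' and down on 'c'; the phase variable enforces the order a* b* c*.

def accepts_anbncn(s):
    c1 = c2 = 0
    phase = 0                       # 0: reading a's, 1: b's, 2: c's
    for ch in s:
        if ch == 'a' and phase == 0:
            c1 += 1
        elif ch == 'b' and phase <= 1:
            phase, c1, c2 = 1, c1 - 1, c2 + 1
        elif ch == 'c' and phase >= 1:
            phase, c2 = 2, c2 - 1
        else:
            return False            # out-of-order symbol
        if c1 < 0 or c2 < 0:
            return False            # too many b's or c's
    return phase == 2 and c1 == 0 and c2 == 0

print(accepts_anbncn('a' * 500 + 'b' * 500 + 'c' * 500))   # legal: True
print(accepts_anbncn('a' * 500 + 'b' * 499 + 'c' * 500))   # illegal: False
```

The point of the article is that LSTM discovers an equivalent solution by gradient descent, with the gates playing the role of the phase logic.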
References
Bodén, M., & Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science, 12(3/4), 197–210.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Gers, F. A., & Schmidhuber, J. (2001a). Long Short-Term Memory learns context free and context sensitive languages. In V. Kurkova et al. (Eds.), Proceedings of the ICANNGA 2001 Conference (Vol. 1, pp. 134–137). New York: Springer-Verlag.
Gers, F. A., & Schmidhuber, J. (2001b). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333–1340.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München. Available on-line: ni.cs.tu-berlin.de/~hochreit.
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In S. C. Kremer & J. F. Kolen (Eds.), A field guide to dynamical recurrent neural networks. Parsippany, NJ: IEEE Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
Rodriguez, P. (2001). Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation, 13(9), 2093–2118.
Received September 12, 2001; accepted March 20, 2002.
NOTE
Communicated by Richard Belew
Center-Crossing Recurrent Neural Networks for the Evolution of Rhythmic Behavior
Boonyanit Mathayomchan [email protected]
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
Randall D. Beer [email protected]
Departments of Electrical Engineering and Computer Science and of Biology, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
A center-crossing recurrent neural network is one in which the null-(hyper)surfaces of each neuron intersect at their exact centers of symmetry, ensuring that each neuron's activation function is centered over the range of net inputs that it receives. We demonstrate that relative to a random initial population, seeding the initial population of an evolutionary search with center-crossing networks significantly improves both the frequency and the speed with which high-fitness oscillatory circuits evolve on a simple walking task. The improvement is especially striking at low mutation variances. Our results suggest that seeding with center-crossing networks may often be beneficial, since a wider range of dynamics is more likely to be easily accessible from a population of center-crossing networks than from a population of random networks.
1 Introduction
Dynamical recurrent neural networks are an ideal neural substrate for the generation and recognition of temporal patterns, but they are considerably more difficult and computationally expensive to train than feedforward networks. Two broad classes of training algorithms are typically used: recurrent extensions of traditional feedforward learning algorithms (Pearlmutter, 1990) and evolutionary algorithms (Beer & Gallagher, 1992; Cliff, Harvey, & Husbands, 1993; Nakahara & Doya, 1998; Di Paolo, 1998; Ijspeert, 2001). Due to the computational cost of these algorithms, general strategies for minimizing the computation required to achieve a given level of performance are of great practical interest. Here, we describe a general technique for improving the performance of evolutionary searches on continuous-time recurrent neural networks (CTRNNs).
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2043–2051 (2002).
In a previous analysis of the dynamical behavior of small CTRNNs, several strategies for improving their evolvability were suggested (Beer, 1995). These suggestions follow from the general dynamical structure of CTRNNs and thus are not specific to any particular application. In this article, we test the advantages of one such proposal: seeding evolutionary searches with center-crossing CTRNNs. A center-crossing CTRNN is one in which the null-(hyper)surfaces of individual neurons cross one another at their exact centers, ensuring that the range of inputs that each neuron receives is centered over the most sensitive region of its activation function (Beer, 1995). Here, we demonstrate that relative to a random initial population, seeding an evolutionary algorithm with center-crossing networks significantly improves both the reliability and speed with which high-fitness oscillatory circuits evolve on a simple walking task.
2 Center-Crossing Networks
Continuous-time recurrent neural networks are networks of model neurons of the following general form,

$$\tau_i \dot{y}_i = -y_i + \sum_{j=1}^{N} w_{ji}\,\sigma(y_j + h_j), \qquad i = 1, \ldots, N, \tag{2.1}$$
where y_i is the state of the ith neuron, ẏ_i denotes the time rate of change of this state, τ_i is the neuron's membrane time constant, w_ji is the strength of the connection from the jth to the ith neuron, h_i is a bias term, and σ(x) = 1/(1 + e^{−x}) is the standard logistic output function. Although variations of this basic model neuron were studied much earlier, a restricted form of this model was popularized by Hopfield (1984) in his work on associative memories. Unlike Hopfield, we make no restrictions on the weight matrix, so that CTRNNs are capable of general dynamical behavior (Funahashi & Nakamura, 1993). The oscillatory behavior of dynamical recurrent neural networks has been studied by a number of authors, including Ermentrout and Cowan (1979), Atiya and Baldi (1989), and Blum and Wang (1992). A center-crossing CTRNN is one in which the null-(hyper)surfaces of individual neurons intersect at their exact centers of symmetry (see Figure 1), producing an equilibrium point at which the outputs of all neurons are one-half (Beer, 1995). Since σ^{−1}(1/2) = 0 = y_i + h_i, this occurs when each y_i = −h_i. By substituting this constraint into equation 2.1, setting that equation equal to 0 to specify the equilibrium points, and solving for h_i, we find that the center-crossing condition is given by
$$h_i^{*} = -\frac{1}{2}\sum_{j=1}^{N} w_{ji}. \tag{2.2}$$
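Equations 2.1 and 2.2 can be sketched directly in code. The following is a minimal illustration (forward-Euler integration and random parameters are our choices, not prescribed here): biases computed from equation 2.2 place the network at an equilibrium where every output is one-half when y_i = −h_i.

```python
import numpy as np

# Minimal sketch of equations 2.1 and 2.2: a CTRNN integrated with forward
# Euler, with biases chosen by the center-crossing condition.

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))            # logistic output function

def center_crossing_biases(w):
    # equation 2.2: h*_i = -(sum_j w_ji) / 2, with w[j, i] the weight
    # from neuron j to neuron i
    return -0.5 * w.sum(axis=0)

def euler_step(y, w, h, tau, dt=0.1):
    # equation 2.1: tau_i dy_i/dt = -y_i + sum_j w_ji sigma(y_j + h_j)
    return y + dt * (-y + w.T @ sigma(y + h)) / tau

rng = np.random.default_rng(0)
N = 5
w = rng.uniform(-16, 16, size=(N, N))
tau = rng.uniform(0.5, 10, size=N)
h = center_crossing_biases(w)

# At y_i = -h_i every output sigma(y_i + h_i) equals one-half, so the state
# is an equilibrium point and an Euler step leaves it numerically unchanged.
y = -h
print(np.allclose(y, euler_step(y, w, h, tau)))   # True
```

Whether this equilibrium is stable or surrounded by oscillations depends on the weights; the point of the condition is only that the operating range of every neuron is centered on the sensitive part of σ.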
Figure 1: A simple illustration of the center-crossing condition in a two-neuron CTRNN. (A) The example circuit, with biases set as h_1 = −(2 + 2)/2 = −2 and h_2 = −(6 − 1)/2 = −2.5. (B) The steady-state input-output curve of each neuron superimposed over the range of synaptic inputs it receives from the other neuron. (C) The nullclines of the full two-neuron circuit, with the center-crossing point indicated by a black dot.
Center-crossing networks were originally studied because of their analytical tractability (Beer, 1995). However, it was conjectured that seeding the initial population of an evolutionary algorithm with center-crossing networks rather than random networks might improve the performance of evolutionary searches on CTRNNs (Beer, 1995). A center-crossing network has the property that the operating range of each neuron is centered about the most sensitive region of its activation function (see Figure 1). Due to the form of σ(x), unless a neuron's bias is properly tuned to the range of inputs it receives, that neuron will simply saturate on or off and drop out of the dynamics. Thus, the richest dynamics should be found in the neighborhood of the center-crossing networks in parameter space, and one would expect an evolutionary algorithm to benefit from focusing its search there.
3 Methods
We test this conjecture by studying the evolution of CTRNN central pattern generators (CPGs) for walking in a simple legged body (Beer & Gallagher, 1992). The body consists of a single leg with a joint actuated by two opposing "muscles" and a foot. Details of the model and its analysis can be found in Beer, Chiel, and Gallagher (1999). We focus here on fully interconnected five-neuron CPGs having a total of 35 parameters (5 time constants, 5 biases, and 25 connection weights). Three of these neurons are motor neurons that control the two opposing muscles and the foot, while the remaining two neurons are interneurons with no preassigned function. The model was integrated for 220 time units using the forward Euler method with a step size of 0.1. Fitness was evaluated as the average velocity (total forward distance covered in 220 time units divided by 220). The highest average velocity achievable is known to be 0.627 from a previous analysis of the optimal controller for this task and body model (Beer et al., 1999). A real-valued genetic algorithm was used to evolve the CTRNN parameters. A population of 100 individuals was maintained, with each individual encoded as a vector of 35 real numbers. The best individual in the old population was simply copied to the new one (elitist selection), and the remaining children were generated by mutation of selected parents. Individuals were selected for mutation using a linear rank-based method, with the best individual producing an average of 1.1 offspring. A selected parent was mutated by adding to it a random displacement vector whose direction was uniformly distributed on the 35-dimensional hypersphere and whose magnitude was a gaussian random variable with mean 0 and variance σ² (Bäck, 1996). Searches were run for 250 generations. Connection weights and biases were constrained to lie in the range ±16, and time constants were constrained to the range [0.5, 10].
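The mutation operator described above can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' code: the parameter layout (25 weights, then 5 biases, then 5 time constants) and the clipping-to-bounds step are our assumptions; the displacement itself (uniform direction on the 35-dimensional hypersphere, gaussian magnitude with variance σ²) follows the text.

```python
import numpy as np

# Sketch of the mutation operator: a normalized gaussian vector gives a
# direction uniform on the hypersphere; a scalar gaussian gives the magnitude.

def mutate(parent, s2, rng):
    """Mutate a 35-vector: displacement with direction uniform on the
    35-D hypersphere and magnitude ~ N(0, s2)."""
    direction = rng.normal(size=parent.size)
    direction /= np.linalg.norm(direction)          # uniform on the sphere
    child = parent + rng.normal(0.0, np.sqrt(s2)) * direction
    # Assumed layout: indices 0-29 are weights and biases, 30-34 time constants.
    child[:30] = np.clip(child[:30], -16.0, 16.0)
    child[30:] = np.clip(child[30:], 0.5, 10.0)
    return child

rng = np.random.default_rng(0)
parent = np.concatenate([rng.uniform(-16, 16, 30), rng.uniform(0.5, 10, 5)])
child = mutate(parent, 0.05, rng)
print(np.linalg.norm(child - parent))   # small displacement at low variance
```

Normalizing a gaussian vector is the standard way to draw a direction uniformly on a hypersphere, since the multivariate gaussian is rotationally symmetric.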
In the experiments described below, we compared the performance of evolutionary searches seeded with random center-crossing networks (seeded searches) to the performance of searches with completely random initial
populations (random searches). In the random searches, the initial population was generated by randomly assigning parameter values drawn from a uniform distribution over the allowed ranges. This is the typical way in which evolutionary algorithms are initialized. In the seeded searches, the initial population was generated by randomly assigning time constants and weights but computing biases from the weights according to the center-crossing condition, equation 2.2. Aside from this difference in the initial population, the evolutionary search proceeded in exactly the same way for both random and seeded searches.
4 Results
4.1 Seeded Searches More Reliably Evolve High-Fitness Circuits. The benefits of seeding an evolutionary search with center-crossing networks are clearly shown in Figure 2A, where we plot the mean performance of seeded (black) and random (gray) searches as the mutation variance σ² is varied from 0.05 to 3.0. Each point represents the mean of the average velocity achieved by the best CPG evolved in each of 100 searches. Over the entire range of σ² shown, the mean best performance of seeded searches exceeds that of random searches. Since the very best circuits evolved by both random and seeded searches have comparable fitness, the observed differences in mean best performance primarily reflect differences in the number of successful searches. This difference becomes especially pronounced at small mutation variance, where the mean performance of the CPGs evolved by seeded searches is nearly twice that of random searches. The qualitative forms of these two curves are quite different. While the performance of random searches begins low and rises to a plateau with increasing σ², the performance of seeded searches begins high and falls to a plateau. The form of the random curve suggests that randomly generated parameter sets generally perform quite poorly as CPGs.
When mutation variance is low, it is difficult for the search to progress very far from these low-performance initial circuits. At higher mutation variances, however, more searches are able to recover from poor initial conditions and still find their way to good walking controllers. In contrast, the form of the seeded curve suggests that random center-crossing networks are generally much better CPGs, so that low mutation variances can often fine-tune them into very fit walking controllers. Larger mutation variances can easily destroy this initial advantage, causing the performance of the seeded searches to degenerate to nearly that of the random searches. Thus, the conditions necessary for oscillation must be somewhat delicate, and they must be closely related to the center-crossing constraint. At large mutation variance, both curves flatten into a slow decrease of mean best performance with σ². The flattening is probably due to our use of elitist selection, which ensures that any progress made by mutation is retained. Nevertheless, as the mutation variance increases, children
Figure 2: Results. (A) Mean best performance as a function of mutation variance for seeded (black) and random (gray) initial populations. Each point represents the mean of 100 evolutionary searches at mutation variance intervals of 0.05, for a total of 12,000 experiments. (B) Mean best performance achieved as a function of generation number for seeded (black) and random (gray) initial populations. Results for two different mutation variances are shown: σ² = 0.05 (solid) and σ² = 3.0 (dashed). For comparison, the best fitness obtained from 25,000 completely random CTRNNs was 0.24, and the best fitness obtained from 25,000 random center-crossing CTRNNs was 0.40.
become decreasingly correlated with their parents, and eventually evolutionary search degrades to random sampling.
4.2 High-Fitness Circuits Evolve Faster in Seeded Searches. Not only do seeded searches evolve better CPGs on average, but they also evolve them faster, as shown in Figure 2B. Here, we plot the mean best performance achieved by seeded (black) and random (gray) searches as a function of generation number. Results for two different values of σ² are shown (solid lines are σ² = 0.05 and dashed lines are σ² = 3.0). It is clear from these plots that seeded searches achieve a given level of performance considerably faster than random searches. Of course, as is to be expected from Figure 2A, this advantage declines with increasing mutation variance. Nevertheless, even at σ² = 3.0, the mean performance of seeded searches begins at higher values and rises faster than for random searches. Why do populations seeded with center-crossing circuits evolve faster? Oscillations are essential to high-performance locomotion. Since the maximum forward distance that the body model can move in one step is 27.5 (Beer et al., 1999), the best performance that a nonoscillatory CTRNN can obtain is 27.5/220 = 0.125. Random searches often plateau near this level for long periods of time. In order to exceed this limit, CTRNNs must oscillate so as to take multiple steps during the performance evaluation period. Thus, one way to interpret the results shown in Figure 2B is that seeded searches discover oscillatory CTRNNs faster than random searches. Indeed, since the mean best performance for seeded searches is greater than 0.125 even at generation 0, oscillatory CTRNNs must often exist in the initial center-crossing population. To test this hypothesis, we generated 10,000 random center-crossing CPGs and 10,000 completely random CPGs. We found that 26.6% of the center-crossing circuits produced oscillations, while only 1.2% of the random circuits did so.
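The article does not spell out how oscillation was detected, so the following is one simple assumed test, not the authors' procedure: integrate each five-neuron circuit past a transient and ask whether a neuron's output is still varying appreciably (if it were at a stable equilibrium, the output would have settled).

```python
import numpy as np

# Assumed oscillation test (the authors' detection criterion is not given):
# integrate the CTRNN of equation 2.1, discard a transient, and report
# whether the first neuron's output still varies by more than a tolerance.

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def oscillates(w, h, tau, dt=0.1, transient=1000, window=500, tol=0.05):
    y = np.zeros(len(h))
    trace = []
    for step in range(transient + window):
        y = y + dt * (-y + w.T @ sigma(y + h)) / tau   # forward Euler
        if step >= transient:
            trace.append(sigma(y + h)[0])
    trace = np.asarray(trace)
    return trace.max() - trace.min() > tol

def fraction_oscillating(make_biases, n_trials=100, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        w = rng.uniform(-16, 16, size=(5, 5))
        tau = rng.uniform(0.5, 10, size=5)
        hits += oscillates(w, make_biases(w, rng), tau)
    return hits / n_trials

center_h = lambda w, rng: -0.5 * w.sum(axis=0)        # equation 2.2
random_h = lambda w, rng: rng.uniform(-16, 16, size=5)

fc = fraction_oscillating(center_h)
fr = fraction_oscillating(random_h)
print(fc, fr)   # center-crossing circuits oscillate far more often
```

The exact fractions depend on the detection tolerance and integration time, but the qualitative gap mirrors the 26.6% versus 1.2% figures reported above.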
Once oscillations are discovered, the evolutionary algorithm can fine-tune them into highly fit CPGs by matching the amplitude, period, and phase of the oscillation to the characteristics of the model body.
5 Conclusion
We have demonstrated that seeding an evolutionary algorithm with center-crossing networks can significantly improve both the frequency and the speed with which high-fitness central pattern generators evolve. While the benefits of center-crossing networks decrease with increasing mutation variance, searches from random initial populations never achieve either the best performance or the rate of improvement exhibited by seeded searches. Since nothing about the center-crossing condition is specific to walking, it is likely that a similar improvement would be found on any oscillatory task. In fact, all other things being equal, our results suggest that seeding evolutionary searches with center-crossing networks may always be beneficial, since a
wider range of dynamics is more likely to be easily accessible from a population of center-crossing networks than from a random population. Thus, a considerable practical benefit can be obtained from general considerations of the dynamics of recurrent neural networks. By the same reasoning, it is likely that the optimization of integrate-and-fire or conductance-based models would also benefit from initialization with networks whose synaptic weights and thresholds ensured that the neurons were centered about their most sensitive operating regions. Indeed, there is growing evidence that nerve cells actively regulate their ionic conductances so as to maintain a given level of excitability (Turrigiano, Abbott, & Marder, 1994).
Acknowledgments
We thank Hillel Chiel for his feedback on a draft of this article. This research was supported in part by grant N0014-99-1-0378 from the Office of Naval Research and grant DMI97-20309 from the National Science Foundation.
References
Atiya, A., & Baldi, P. (1989). Oscillations and synchronization in neural networks: An exploration of the labeling hypothesis. Int. J. Neural Systems, 1(2), 103–124.
Bäck, T. (1996). Evolutionary algorithms in theory and practice. New York: Oxford University Press.
Beer, R. D. (1995). On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior, 3, 469–509.
Beer, R. D., Chiel, H. J., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking II. General principles and individual variability. J. Computational Neuroscience, 7(2), 119–147.
Beer, R. D., & Gallagher, J. C. (1992). Evolving dynamical neural networks for adaptive behavior. Adaptive Behavior, 1, 91–122.
Blum, E. K., & Wang, X. (1992). Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, 5, 577–587.
Cliff, D., Harvey, I., & Husbands, P. (1993). Explorations in evolutionary robotics. Adaptive Behavior, 2(1), 73–110.
Di Paolo, E. A. (1998). An investigation into the evolution of communication. Adaptive Behavior, 6(2), 285–324.
Ermentrout, G. B., & Cowan, J. D. (1979). Temporal oscillations in neuronal nets. J. Mathematical Biology, 7, 265–280.
Funahashi, K., & Nakamura, Y. (1993). Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6, 801–806.
Hopfield, J. J. (1984). Neurons with graded response properties have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ijspeert, A. J. (2001). A connectionist central pattern generator for the aquatic and terrestrial gaits of a simulated salamander. Biological Cybernetics, 84, 331–348.
Nakahara, H., & Doya, K. (1998). Near-saddle-node bifurcation behavior as dynamics in working memory for goal-directed behavior. Neural Computation, 10, 113–132.
Pearlmutter, B. A. (1990). Dynamic recurrent neural networks (Tech. Rep. No. CMU-CS-90-196). Pittsburgh, PA: Carnegie Mellon University School of Computer Science.
Turrigiano, G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science, 264, 974–976.
Received August 7, 2001; accepted March 21, 2002.
NOTE
Communicated by Terrence Sejnowski
Reply to Carreira-Perpiñán and Goodhill
Nicholas V. Swindale [email protected]
Department of Ophthalmology, University of British Columbia, Vancouver, B.C., V5Z 3N9
Doron Shoham
Weizmann Institute of Science, Department of Neurobiology, Rehovot 76100, Israel
Amiram Grinvald [email protected]
Weizmann Institute of Science, Department of Neurobiology, Rehovot 76100, Israel
Tobias Bonhoeffer [email protected]
Max-Planck-Institut für Neurobiologie, D-82152 Martinsried, Germany
Mark Hübener [email protected]
Max-Planck-Institut für Neurobiologie, D-82152 Martinsried, Germany
Although mathematical arguments certainly have a place in biology, we think it is inappropriate to apply, as do Carreira-Perpiñán and Goodhill (hereafter CP&G, 2002), the standards of proof required in mathematics to the acceptance or rejection of scientific hypotheses. To give some examples: showing that data are well described by a linear model does not rule out an infinity of other possible models that might give better descriptions of the data, and proving in a mathematical sense that the linear model was correct would require ruling out all other possible models, a hopeless task. Similarly, to demonstrate that two DNA samples come from the same individual, it is sufficient to show a match between only a few regions of the genome, even though there remains a very large number of additional comparisons that could be done, any one of which might potentially disprove the match. This is unacceptable in mathematics, but in the real world, it is a perfectly reasonable basis for belief. Hypothesis testing in science, unlike mathematics, proceeds by taking a hypothesis for which there is usually some good a priori basis. Tests are then applied with the aim of getting results that are either consistent or inconsistent with the null hypothesis. In the present case, it is important to
stress that there are good a priori reasons for believing that coverage might be optimized in visual cortex maps. First, the different maps have structural relationships, such as orthogonal intersection of domain boundaries and placement of orientation singularities away from domain boundaries (Bartfeld & Grinvald, 1992; Obermayer & Blasdel, 1993; Hübener, Shoham, Grinvald, & Bonhoeffer, 1997; Crair, Ruthazer, Gillespie, & Stryker, 1997), which are known from simulation studies to improve coverage (Swindale, 1991). Second, cortical maps adapt postnatally to environmental stimulation, so that the area of cortex devoted to a particular stimulus varies with the frequency of occurrence of the stimulus in the environment (Hubel, Wiesel, & LeVay, 1977; Sengpiel, Stawinski, & Bonhoeffer, 1999). Third, it seems intuitively likely that it would be functionally advantageous for low-level visual features such as edge orientation to have a uniform or nearly uniform representation across the permutations of visual space, spatial frequency, and the two eyes. Finally, biological structures are often optimized for particular functions. Plausible expectations such as these are often contradicted in biology. In order to provide a direct test of the idea that cortical maps are optimized for coverage, we performed calculations designed to disprove the null hypothesis that the structural relations between maps are unrelated to coverage uniformity (Swindale, Shoham, Grinvald, Bonhoeffer, & Hübener, 2000). Two types of tests were done: flips and rotations of individual maps, and sliding tests in which one map was moved sideways by a variable amount relative to the other two. Both types of test yielded results that were inconsistent with the null hypothesis. In nearly every case, flipping or rotating one or two of the maps in ways that changed the spatial relations between the maps led to a worsening of coverage.
This does not prove that coverage is at a local optimum, but it does show that coverage is better than it would be were the maps overlaid at random or in ways unrelated to coverage. The sliding tests showed clear evidence of local coverage optima, with the best coverage always obtained at or very close to the original map configuration. CP&G's comments notwithstanding, we do not see a better explanation of these results than the likelihood that development has reached an end stage where coverage is at a local optimum, subject to the maintenance of other structural constraints (e.g., column periodicity, local smoothness, singularity density). The possibility of doing additional tests is somewhat academic because, as pointed out in the original article, as well as by CP&G, we do not know how to do them. While our belief in our conclusions remains unperturbed, it is appropriate to ask whether the local optimum is a good one. Indeed, it may be more relevant to discuss whether coverage values are good in an absolute sense and avoid altogether the issue of whether the values are locally optimal (Graeme Mitchison, personal communication, 2001). One way to answer this question is to compare coverage values in biological maps with maps produced by simulations, such as those based on the elastic net or Kohonen
Reply to Carreira-Perpiñán and Goodhill
algorithms, which are contenders as models of map development and are likely to produce coverage that is both optimized and good in an absolute sense. In order for such comparisons to be valid, the maps produced by the algorithms have to be structurally similar to the real ones; that is, the hypothesized continuity term in the energy function (< in CP&G) has to be correct. Since the correct function remains unknown, the best we can do is to try to generate model maps that are similar to the real ones in terms of the overall periodicity of the different map properties and visual resemblance. When we produce model maps that are matched in this way, we find that coverage values are better than in the real maps, but only by about 25 to 30% (Swindale, unpublished results). Given that many factors are likely to worsen coverage artifactually in the real maps (e.g., measurement errors are much more likely to worsen coverage, which is a noise measure, than to improve it, so real values are almost certainly better than the measured ones), this suggests that real maps are, in absolute terms, achieving reasonably good values of coverage. This line of argument is preliminary, but is given here as an example of how the problem might be addressed constructively in the future. One of the issues raised by CP&G (in note 2) concerned the density of sampling from stimulus space in calculating coverage. Our sampling density was chosen on the basis that neighboring sample points in retinotopic and orientation space yielded correlated cortical response patterns, and there seemed little to be gained by sampling more densely. We confirmed this by repeating coverage calculations on the five data sets, doubling the sampling density along the retinotopic and orientation axes (giving eight times the number of stimuli, or about 112,000 stimuli per map), and found that coverage values (c′) changed by 1.1% or less.
This is small compared with the effect sizes of our manipulations, which were in the range of 10 to 30%. Many of the other issues raised by CP&G were mentioned in the original article and the accompanying commentary (Das, 2000) and do not seem to warrant further discussion. However, we will make one final comment about the suggestion that the continuity measure < may depend on the spatial relationship between different maps, as well as on within-map structure. Such a dependence is not present in the continuity term of the energy function for the elastic net algorithm (Durbin & Willshaw, 1987), but it is possible in principle that it might be present. If our perturbations were changing <, this could be an explanation of why we might not have seen optima in our coverage measure, even when such optima were present. The fact that we do see them suggests that, as in the elastic net model, < does not depend on the spatial relations between different maps. It does not invalidate our observations (as CP&G appear to wish to suggest). In fact, it is plausible that if < is related to wire length and thus connectivity rules, these might be determined primarily by the layouts of individual maps and not the relationships between them. This assumption might be useful in narrowing the search for candidate measures of <.
References

Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89, 11905–11909.

Carreira-Perpiñán, M. Á., & Goodhill, G. J. (2002). Are visual cortex maps optimized for coverage? Neural Computation, 14(7), 1545–1560.

Crair, M. C., Ruthazer, E. S., Gillespie, D. C., & Stryker, M. P. (1997). Relationship between the ocular dominance and orientation maps in visual cortex of monocularly deprived cats. Neuron, 19, 307–318.

Das, A. (2000). Optimizing coverage in the cortex. Nature Neuroscience, 3, 750–752.

Durbin, R., & Willshaw, D. J. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.

Hubel, D. H., Wiesel, T. N., & LeVay, S. (1977). Plasticity of ocular dominance columns in monkey striate cortex. Phil. Trans. R. Soc. Lond. B, 278, 131–163.

Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17, 9270–9284.

Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129.

Sengpiel, F., Stawinski, P., & Bonhoeffer, T. (1999). Influence of experience on orientation maps in cat visual cortex. Nature Neuroscience, 2, 727–732.

Swindale, N. V. (1991). Coverage and the design of striate cortex. Biol. Cybern., 65, 415–424.

Swindale, N. V., Shoham, D., Grinvald, A., Bonhoeffer, T., & Hübener, M. (2000). Visual cortex maps are optimized for uniform coverage. Nature Neuroscience, 3, 822–826.

Received November 20, 2001; accepted February 25, 2002.
LETTER
Communicated by Klaus Pawelzik
Dynamics of the Firing Probability of Noisy Integrate-and-Fire Neurons

Nicolas Fourcaud
[email protected]
Nicolas Brunel
[email protected]
Laboratoire de Physique Statistique, École Normale Supérieure, 75231 Paris Cedex 05, France, and Neurophysique et Physiologie du Système Moteur, UFR biomédicale, Université Paris 5 René Descartes 45, 75006 Paris, France

Cortical neurons in vivo undergo a continuous bombardment due to synaptic activity, which acts as a major source of noise. Here, we investigate the effects of the noise filtering by synapses with various levels of realism on integrate-and-fire neuron dynamics. The noise input is modeled by white (for instantaneous synapses) or colored (for synapses with a finite relaxation time) noise. Analytical results for the modulation of firing probability in response to an oscillatory input current are obtained by expanding a Fokker-Planck equation for small parameters of the problem: when both the amplitude of the modulation is small compared to the background firing rate and the synaptic time constant is small compared to the membrane time constant. We report here the detailed calculations showing that if a synaptic decay time constant is included in the synaptic current model, the firing-rate modulation of the neuron due to an oscillatory input remains finite in the high-frequency limit with no phase lag. In addition, we characterize the low-frequency behavior and the behavior of the high-frequency limit for intermediate decay times. We also characterize the effects of introducing a rise time to the synaptic currents and the presence of several synaptic receptors with different kinetics. In both cases, we determine, using numerical simulations, an effective decay time constant that describes the neuronal response completely.

1 Introduction
Noise has an important impact on the dynamics of the discharge of neurons in vivo. Neurons in the cerebral cortex recorded in vivo show very irregular firing in a large range of firing rates (Softky & Koch, 1993), and there is a wide variability in the spike trains of neurons from trial to trial. There is evidence that a large part of the noise experienced by a cortical neuron is due to the intensive and random bombardment of synaptic sites. Neurons in cortical networks in vivo in awake animals (Burns & Webb, 1976) and anesthetized
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2057–2110 (2002)
animals (Destexhe & Paré, 1999) are spontaneously active and emit spikes at rates of about 5–10 Hz in an approximately Poissonian way. Due to the large number of synaptic contacts (around 10,000), a cortical neuron thus receives a huge bombardment of spikes in an interval corresponding to its integration time constant (about 20 ms; see McCormick, Connors, Lighthall, & Prince, 1985). The impact of noise on neuronal dynamics can be studied in detail in a simple spiking neuron model, the integrate-and-fire (IF) neuron (Lapicque, 1907; Knight, 1972; Tuckwell, 1988). Knight (1972) pioneered the study of the effect of noise on the dynamics of such a spiking neuron. The noise model he studied was a simplified model in which the threshold is drawn randomly after each spike. Gerstner (2000) extended these results and studied both slow noise models, in which either the threshold or the reset is drawn randomly after each spike, and fast escape rate noise models. Phase noise has also been used to study synchrony in networks of coupled oscillators (Abbott & van Vreeswijk, 1993). However, none of these noise models can represent in a realistic way synaptic noise as experienced by cortical neurons. Synaptic currents provoked by a single spike recorded in slice experiments have a fast rise time (often less than 1 ms) and a slower decay time (in the range of 2 to 100 ms, depending on the type of receptor; Destexhe, Mainen, & Sejnowski, 1998). Common synaptic currents are excitatory glutamatergic currents, which can be mediated by either fast AMPA receptors (decay of order 2 ms) or slower NMDA receptors (decay of order 50–100 ms), and inhibitory GABAergic currents, whose fast component is mediated by the GABA_A receptor (decay of order 5–10 ms). Such synaptic currents can be described by a simple system of ordinary differential equations (Destexhe et al., 1998; Wang, 1999).
In this article, we study the impact of noise originating from realistic synaptic models on the dynamics of the firing probability of a spiking neuron. We explore three levels of complexity for the synaptic currents: (1) instantaneous (delta function) synaptic current, (2) synaptic current with an instantaneous jump followed by an exponential decay with time constant τ_s, and (3) synaptic current described by a difference of exponentials, with a rise time τ_r and decay time τ_s. The dynamics of the firing probability of a neuron receiving a bombardment of spikes through such synaptic currents is studied in the framework of the diffusion approximation (see, in the neuronal context, Tuckwell, 1988; Amit & Tsodyks, 1991). This approximation is justified when a large number of spikes arrive through synapses that are weak compared to the magnitude of the firing threshold, which is the relevant situation in cortex. In the diffusion approximation, the random component in the synaptic currents can be treated as white noise in the case of instantaneous synapses. On the other hand, when synapses have a finite temporal width, as in the more realistic models, synaptic noise has a finite correlation time and thus becomes "colored" noise. Thanks to the diffusion approximation, the dynamics of
firing probability can be studied in the framework of the Fokker-Planck theory (Risken, 1984). The dynamics at the network level has been studied analytically recently in the white noise case by Brunel and Hakim (1999) and Brunel (2000). Similar population density approaches have been investigated by Knight, Omurtag, and Sirovich (2000), Nykamp and Tranchina (2000, 2001), and Haskell, Nykamp, and Tranchina (2001). Using the tools of dynamical linear response theory (Knight, 1972; Gerstner, 2000; Brunel, Chance, Fourcaud, & Abbott, 2001), we study the dynamics of the firing discharge in response to a synaptic current composed of the sum of a stationary stochastic background current, which by itself provokes a background firing rate ν_0, and of a small oscillatory component at frequency ω. The knowledge of the response at all frequencies allows us to determine the response of the neuron to arbitrary stimuli, in the linear approximation. Knight (1972) had shown that integrate-and-fire (IF) neurons have a finite response even in the high-frequency limit for the simplified noise model investigated. Surprisingly, recent studies have shown that in the case of white noise, the amplitude of the firing-rate modulation at high frequencies decays as 1/√ω, with a phase lag that approaches 45 degrees in that limit (Brunel & Hakim, 1999; Brunel et al., 2001); when a more realistic synaptic noise is taken into account, the amplitude of the firing-rate modulation is finite even in the high-frequency limit, and the phase lag is eliminated (Brunel et al., 2001). In this article, we study in more detail how the characteristics of the synaptic currents affect the neuronal response. We start with a detailed report of the calculations leading to the results in the colored noise case. This is done with two types of IF neurons: with or without leak.
Using analytical calculations and numerical simulations, we then proceed to study how the high-frequency limit depends on the synaptic decay time, the effect of the synaptic rise time, and the effects of synaptic currents mediated by different receptors with different kinetics, as is the case in neurons in the cerebral cortex. Section 2 describes the models we use. In the following sections, we present the analytical calculations leading to the linear dynamical response of the two types of IF neurons. In section 3, analytical methods are presented; in section 4, the stationary firing probability is calculated. Then, in section 5, we calculate analytically the linear dynamical responses in the low- and high-frequency limits. Numerical simulations are used to obtain the response in the full frequency range and in the more complicated synaptic filtering cases. Finally, we discuss our results in section 6.

2 Models

2.1 The Integrate-and-Fire Model Neurons. The IF model was introduced long ago by Lapicque (1907). Due to its simplicity, it has become one of the canonical spiking renewal models, since it represents one of the
few neuronal models for which analytical calculations can be performed. It describes basic subthreshold electrical properties of the neuron. It is completely characterized by its membrane potential below threshold. Details of the generation of an action potential above the threshold are ignored. Synaptic and external inputs are summed until the membrane potential reaches a threshold, where a spike is emitted. The general form of the dynamics of the membrane potential V(t) in IF models can be written as

C dV/dt = I_l(V) + I_s(V, t) + I_e(t),    (2.1)
where I_s is the synaptic current, I_e is an external current directly injected in the neuron, I_l is a leak current, and C is the membrane capacitance. In the following, I_s will represent the noisy inputs due to background synaptic activity, and I_e will represent an oscillatory input. In vivo, oscillatory inputs arrive through synapses and should therefore be incorporated into I_s. However, for the purpose of the calculations detailed here, it is convenient to extract the oscillatory component from the noise and inject it directly into the cell. We will then come back in the discussion to the neuronal response to oscillatory inputs filtered through synapses. When the membrane potential reaches a fixed threshold θ, the neuron emits a spike and is reset to V_r. We could also introduce an absolute refractory period (ARP) after the emission of a spike during which the neuron is silent and does not sum the input signal. For simplicity, we set the ARP to zero in the following. In the most interesting region of firing rates, far from saturation, the inclusion of an ARP will affect the results only mildly. Several variants of this model have been considered in the literature. The simplest integrate-and-fire neuron (SIF) has no leak current: I_l(V) = 0. Thus, in the absence of inputs, the neuron stays indefinitely at its initial value. The leaky, or forgetful, integrate-and-fire neuron (LIF) has a leak current I_l(V) = −g(V − V_L), where g is the leak conductance and V_L the resting potential. We note C/g = τ_m, where τ_m is the membrane time constant. In all calculations, we set V_L = 0. In simulations, we use V_L = −70 mV. Dividing both sides by g, we can reexpress equation 2.1 as

τ_m dV/dt = f(V) + I_s(V, t) + I_e(t),    (2.2)
where f(V) = 0 for the SIF, f(V) = −V for the LIF, and g has been absorbed in the currents I_s and I_e, which are now expressed in units of voltage. Note that in the SIF, τ_m is an arbitrary parameter, used only to unify the description with the LIF. Thus, results for the SIF will be completely independent of that parameter.

2.2 Models of Synaptic Filtering. Many mathematical descriptions of synaptic currents have been proposed, from overly simple to very realistic
(for a review, see Destexhe et al., 1998). Realistic models of the synaptic current I_s(V, t) depend on the membrane potential V through I_s(V, t) = g(V − V_rev) s(t), where g is the synaptic conductance, V_rev is the reversal potential of the synapse, and s(t) describes the temporal dynamics of the synapse. In the following, we ignore the driving force V − V_rev for simplicity and consider current rather than conductance changes at synaptic sites. Thus, synaptic currents become independent of voltage. The synaptic models we consider follow.

2.2.1 Instantaneous (Delta Function) Synapses. If we neglect the synaptic time constants compared to the neuronal time constants, postsynaptic currents (PSCs) can be described by delta functions, and the synaptic input I_s(t) can be described by

I_s(t) = Σ_{i=1}^{N_s} J_i Σ_k δ(t − t_i^k) τ_m.    (2.3)
The sum over i corresponds to a sum over all synapses, N_s is the number of synaptic connections (typically 10,000), J_i is the efficacy of synapse i in mV (amplitude of the postsynaptic potential), the sum over k corresponds to a sum over presynaptic spikes of each synapse, and t_i^k is the time at which the kth presynaptic spike arrives at synapse i.

2.2.2 Synapses with Instantaneous Jump and Exponential Decay. In reality, postsynaptic currents have a finite width that can be of the same order of magnitude as, or even larger than, the membrane time constant. A more accurate representation of synaptic inputs consists of an instantaneous jump followed by an exponential decay with a time constant τ_s. This can be described by the following equation:
τ_s dI_s/dt = −I_s(t) + Σ_{i=1}^{N_s} J_i Σ_k δ(t − t_i^k) τ_m.    (2.4)
In such a description, the rise time of the synaptic current is neglected. This is justified by the observation that the rise time of a synapse is typically very short compared to the relaxation time. Moreover, such a description implicitly assumes τ_s to be the same for all synapses.

2.2.3 Synaptic Currents with Different Decay Time Constants. Synaptic inputs to a cortical neuron come from different types of receptors with different temporal characteristics. Common types of receptors are AMPA, NMDA, and GABA receptors. AMPA receptors have synaptic time constants of the order of 2 ms (Hestrin, Sah, & Nicoll, 1990; Sah, Hestrin, &
Nicoll, 1990; Spruston, Jonas, & Sakmann, 1995; Angulo, Rossier, & Audinat, 1999). GABA_A receptors have longer time constants (typically 5–10 ms; Salin & Prince, 1996; Xiang, Huguenard, & Prince, 1998; Gupta, Wang, & Markram, 2000). Finally, NMDA currents are the slowest, with decay time constants of about 100 ms (Hestrin et al., 1990; Sah et al., 1990). Thus, it is of interest to consider synaptic inputs that are combinations of inputs coming through different receptors. Such synaptic inputs can be described by a system of equations,

I_s(t) = I_{s1}(t) + I_{s2}(t)    (2.5)
τ_{s1} dI_{s1}/dt = −I_{s1}(t) + Σ_{i=1}^{N_{s1}} J_{1i} Σ_k δ(t − t_{1i}^k) τ_m    (2.6)
τ_{s2} dI_{s2}/dt = −I_{s2}(t) + Σ_{i=1}^{N_{s2}} J_{2i} Σ_k δ(t − t_{2i}^k) τ_m,    (2.7)
where s_i, i = 1, 2, . . ., represent different types of synaptic receptors. In particular, we will consider the situation in which two types of receptors are present: AMPA and GABA.

2.2.4 Synapses with Rise and Decay Times. Finally, the effects of a finite rise time can be included by a third variable in the system of equations describing neuronal dynamics. A simple way to introduce a rise time is to consider this system of equations:

τ_s dI_s/dt = −I_s(t) + I_r(t)    (2.8)
τ_r dI_r/dt = −I_r(t) + Σ_{i=1}^{N_s} J_i Σ_k δ(t − t_i^k) τ_m.    (2.9)
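As an illustrative sketch (not code from the paper; all parameter values below are assumptions chosen for display), equations 2.8 and 2.9 can be integrated for a single presynaptic spike at t = 0 to obtain one PSC. A useful sanity check is that the total charge transferred, ∫I_s(t) dt = J τ_m, is independent of τ_r and τ_s.

```python
# Forward-Euler integration of equations 2.8-2.9 for one presynaptic spike.
# Illustrative sketch with assumed parameters, not the paper's code.
def single_psc(tau_s=2.0, tau_r=0.5, J=0.1, tau_m=20.0, dt=0.001, T=50.0):
    """Return (times, I_s values) for one PSC; times in ms, currents in mV."""
    I_s = 0.0
    I_r = J * tau_m / tau_r          # the delta input deposits charge J*tau_m in I_r
    t_vals, I_vals = [], []
    for i in range(int(T / dt)):
        t_vals.append(i * dt)
        I_vals.append(I_s)
        I_s += dt * (-I_s + I_r) / tau_s
        I_r += dt * (-I_r) / tau_r
    return t_vals, I_vals

t, I = single_psc()
peak = max(I)
charge = sum(I) * 0.001              # approximates the integral of I_s(t) dt
# The transferred charge equals J*tau_m (2.0 mV*ms here), whatever tau_r and tau_s.
print(round(peak, 2), round(charge, 2))
```

For τ_r > 0 the resulting waveform is the difference of exponentials mentioned in the introduction; letting τ_r → 0 recovers the jump-and-decay synapse of equation 2.4.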
Figure 1 shows different postsynaptic currents (PSCs) with different rise times. Note that our description of the current with a rise time is similar to other kinetic descriptions (see Destexhe et al., 1998, or Wang, 1999), except that we do not take into account the saturation of I_s. Neglecting saturation is justified when presynaptic firing rates are low, which is the prevalent situation during spontaneous activity in the cortex.

3 Analytical Methods

3.1 Diffusion Approximation. Neurons usually have synaptic connections from tens of thousands of other neurons. Thus, even when neurons fire at low rates, a neuron receives a large number of spikes in an interval corresponding to its integration time constant. If we assume these inputs
[Figure 1: Single PSCs for synapses with τ_s = 2 ms and different rise time constants: τ_r = 0 ms (solid line), τ_r = 0.5 ms (dotted line), τ_r = 1 ms (dashed line), τ_r = 2 ms (long-dashed line). Axes: I_s(t) in mV versus t in ms.]
are Poissonian and uncorrelated and the amplitude of the depolarization due to each input is small (⟨J_i⟩ ≪ θ − V_r, where ⟨·⟩ denotes the average over the input synapses), we can use the diffusion approximation. It consists in approximating fluctuations around the mean of the total input spike train by a gaussian white noise,

Σ_{i=1}^{N_s} J_i Σ_k δ(t − t_i^k) ≈ μ̃ + σ̃ η(t),    (3.1)
where μ̃ is proportional to the mean of the synaptic input and η is a gaussian random variable satisfying ⟨η(t)⟩ = 0 and ⟨η(t)η(t′)⟩ = δ(t − t′). σ̃ characterizes the amplitude of the noise. If we assume the mean activation rate for each synapse is ν, we have

μ̃ = ⟨J_i⟩ N_s ν,    σ̃² = ⟨J_i²⟩ N_s ν.    (3.2)
Note that μ̃ is in units of mV·ms⁻¹ and σ̃ in mV·ms^(−1/2). Consequently, when synapses are instantaneous (the synaptic relaxation time equals zero), the noise in the current is equivalent to white noise, but when the synaptic current has a temporal width, the noise is equivalent to colored
noise. Using the diffusion approximation, the neuronal dynamics is fully described by the following systems of stochastic equations:

Instantaneous synapses:

τ_m dV/dt = f(V) + μ + σ √τ_m η(t) + I_e(t),    (3.3)

in which μ and σ are

μ = μ̃ τ_m = ⟨J_i⟩ N_s ν τ_m    and    σ² = σ̃² τ_m = ⟨J_i²⟩ N_s ν τ_m    (3.4)

and are both in mV.

Exponentially decaying synapses:

τ_m dV/dt = f(V) + I_s(t) + I_e(t)    (3.5)
τ_s dI_s/dt = −I_s(t) + μ + σ √τ_m η(t).    (3.6)
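Both systems above can be integrated with a simple Euler-Maruyama scheme. The sketch below (illustrative parameter values, not the paper's) estimates the stationary firing rate of an LIF neuron under white noise (equation 3.3) and under exponentially filtered, colored noise (equations 3.5 and 3.6); the colored-noise rate comes out lower for the same μ and σ.

```python
# Euler-Maruyama simulation of equations 3.3 and 3.5-3.6 for an LIF neuron.
# Hedged sketch with assumed parameters; not the paper's code or values.
import math, random

def lif_rate(mu=15.0, sigma=5.0, tau_m=20.0, tau_s=0.0,
             theta=20.0, V_r=10.0, dt=0.02, T=20000.0, seed=1):
    """Mean firing rate in spikes/ms; tau_s = 0 selects the white-noise model."""
    rng = random.Random(seed)
    amp = sigma * math.sqrt(tau_m / dt)   # discretized white-noise amplitude
    V, I_s, spikes = V_r, mu, 0
    for _ in range(int(T / dt)):
        noise = amp * rng.gauss(0.0, 1.0)
        if tau_s > 0.0:                   # colored noise: filter through the synapse
            I_s += dt * (-I_s + mu + noise) / tau_s
            drive = I_s
        else:                             # white noise: inject directly
            drive = mu + noise
        V += dt * (-V + drive) / tau_m    # f(V) = -V for the leaky neuron
        if V >= theta:                    # threshold crossing: spike and reset
            spikes += 1
            V = V_r
    return spikes / T

r_white = lif_rate(tau_s=0.0)
r_colored = lif_rate(tau_s=2.0)
print(round(1000 * r_white, 1), round(1000 * r_colored, 1))  # rates in Hz
```

The drop in rate with τ_s > 0 anticipates the boundary-layer corrections derived in section 4.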
3.2 Neuronal Frequency Response. The main goal of this article is to determine the neuronal response to an oscillatory input in the presence of a realistic synaptic noise. Figure 2 shows an example of numerical simulations of 2000 neurons receiving a common oscillatory current input but with noise inputs that are uncorrelated from neuron to neuron (upper panel). We see on the raster (middle panel) that the neurons do not emit spikes regularly at the input frequency but discharge quite randomly. Rather, it is the instantaneous firing rate (the firing probability) of each neuron that oscillates in time (bottom panel). In the following, we compute the amplitude of the oscillatory response and its phase shift. The instantaneous firing rate ν(t) of the neuron is computed in two steps. We start with a stationary input I_e(t) = 0. The resulting firing probability is ν(t) = ν_0 (see section 4). Then we introduce a small oscillatory input I_e(t) = ε μ e^{iωt}, ε ≪ 1, and we compute the linear (first-order) correction to the instantaneous firing rate, ν(t) = ν_0[1 + ε n̂(ω) e^{iωt}] (see section 5). In both cases, we first make calculations for instantaneous currents (τ_s = 0). Then we compute the first-order correction due to the introduction of the synaptic decay time (τ_s > 0). In each section, we will present in the main text only the main steps and the results. Details of the calculations can be found in the appendixes.

3.3 Fokker-Planck Equation and Its Boundary Conditions. The analytical methods we use consist of solving the dynamical Fokker-Planck equation associated with the stochastic equations given in section 3.1. We look
[Figure 2: Firing probability response to an oscillatory current with noise. (A) Oscillatory current I_e (dashed line) and noisy oscillatory current I_e + I_s (solid line), in mV. (B) Raster of 2000 neurons subject to the oscillatory current with uncorrelated noise. (C) Firing probability of the neuron, calculated in bins of 6 μs; time axis in ms.]
for the probability density P(V, I_s, t) of voltage V and synaptic current I_s at time t. We note

P(V, I_s, t) = P(V, I_s) + ε P̂(V, I_s, ω) e^{iωt} + O(ε²),

where P(V, I_s) is the stationary part and P̂ describes the part of the density that oscillates at the same frequency as the current input. Since we consider the limit of small perturbations around the stationary probability density function, ε ≪ 1, we do not take into account higher-order terms in the expansion, which would contain higher harmonics. The dynamical Fokker-Planck equation can be written

∂P/∂t + ∂J_V/∂V + ∂J_{I_s}/∂I_s = 0,    (3.7)
where J_V is the probability flux of P through V at fixed I_s and J_{I_s} is the probability flux of P through I_s at fixed V. The instantaneous firing rate is given by the sum of the probability flux through the firing threshold V = θ over I_s:

ν(t) = ∫ J_V(θ, I_s, t) dI_s.    (3.8)
In the case of instantaneous synapses (τ_s = 0), P depends on V only, and the instantaneous firing rate simplifies to

ν(t) = J_V(θ, t).    (3.9)
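The instantaneous firing rate just defined can also be estimated empirically, as in Figure 2: many neurons share the oscillatory input but receive independent noise, and the modulation of the population firing rate is read off by projecting the pooled spike train onto e^{−iωt}. The sketch below does this for the white-noise LIF case; the population size, drive frequency, and all parameter values are illustrative assumptions (and far fewer neurons are used than in the figure).

```python
# Population estimate of the firing-rate modulation under an oscillatory drive,
# in the spirit of Figure 2. Illustrative sketch with assumed parameters.
import cmath, math, random

M, T, dt = 100, 1000.0, 0.05                 # neurons, duration (ms), step (ms)
tau_m, theta, V_r = 20.0, 20.0, 10.0
mu, sigma, eps, f = 15.0, 5.0, 0.2, 0.02     # f in kHz, i.e., a 20 Hz drive
rng = random.Random(3)
amp = sigma * math.sqrt(tau_m / dt)

spike_times = []
for _ in range(M):
    V = V_r
    for i in range(int(T / dt)):
        t = i * dt
        drive = mu * (1.0 + eps * math.cos(2.0 * math.pi * f * t))
        V += dt * (-V + drive + amp * rng.gauss(0.0, 1.0)) / tau_m
        if V >= theta:
            spike_times.append(t)
            V = V_r

nu0 = len(spike_times) / (M * T)             # mean rate in spikes/ms
z = sum(cmath.exp(-2j * math.pi * f * t) for t in spike_times)
rel_mod = 2.0 * abs(z) / len(spike_times)    # relative modulation amplitude
phase = -cmath.phase(z)                      # phase of the rate modulation
print(round(1000 * nu0, 1), round(rel_mod, 2))
```

The Fourier projection assumes an integer number of drive cycles in T; the same estimator, swept over ω, yields the amplitude and phase curves studied analytically below.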
When we introduce the synaptic relaxation time (τ_s > 0), we face the more complex problem of solving a two-dimensional Fokker-Planck equation with a boundary limit on a semi-infinite line, as we will see in sections 4.1.3 through 4.2.3. The equation can be solved recursively using an expansion in √(τ_s/τ_m), using analytical methods developed by Hagan, Doering, and Levermore (1989).

4 Stationary Firing Probability
We describe in this section the calculation of the firing rate and the probability density function (p.d.f.) of the neuronal membrane potential when it receives a stationary input: I_e(t) = 0. Note that the membrane potential of the leaky IF neuron with white noise is described by an Ornstein-Uhlenbeck process whose mean first passage time to threshold is well known (Tuckwell, 1988; Amit & Tsodyks, 1991). The solution for the simplest IF neuron with reflecting boundary condition at reset has been calculated by Fusi and Mattia (1999).
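The mean-first-passage-time result just mentioned gives the classical expression for the LIF stationary rate under white noise (it reappears below as equation 4.11). A minimal numerical evaluation, with assumed parameter values:

```python
# Trapezoidal quadrature of the classical mean-first-passage-time (Siegert)
# formula for the LIF stationary rate. Illustrative parameters, not the paper's.
import math

def lif_stationary_rate(mu, sigma, tau_m, theta, V_r, n=20000):
    """nu_0 in spikes/ms: 1/nu_0 = tau_m*sqrt(pi)*integral of exp(s^2)(1+erf(s))."""
    a = (V_r - mu) / sigma
    b = (theta - mu) / sigma
    h = (b - a) / n
    def g(s):
        return math.exp(s * s) * (1.0 + math.erf(s))
    total = 0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n))
    return 1.0 / (tau_m * math.sqrt(math.pi) * h * total)

nu0 = lif_stationary_rate(mu=15.0, sigma=5.0, tau_m=20.0, theta=20.0, V_r=10.0)
print(round(1000 * nu0, 2))  # stationary rate in Hz
```

The integrand grows like e^{s²}, so the quadrature step must be kept small when θ sits far above μ in units of σ.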
The calculation of the LIF stationary firing rate and of the p.d.f. when noise is white is given in Brunel and Hakim (1999). The calculation of the stationary firing rate when noise is colored has been performed in Brunel and Sergi (1998), assuming that the probability density of synaptic currents after the end of an absolute refractory period is gaussian. Here, we calculate the stationary firing rate with colored noise, relaxing the assumption that Brunel and Sergi (1998) made. This gives a slight difference in the final formula for the firing probability. In the case of the SIF neuron, we can show that the mean firing rate does not depend on the synaptic time constants (τ_r and τ_s) and depends only on the mean synaptic input. This result is given in appendix A. Here, we calculate the p.d.f. without and with relaxation time of the synapses. Indeed, we will see that to determine the dynamic properties of the neuron, we will need to know the first-order correction in τ_s of the stationary p.d.f.

4.1 White Noise. In this section, we calculate the p.d.f. of the IF neuron membrane potential without relaxation time of the synapses. The stochastic equation, 3.3, is equivalent to the Fokker-Planck (FP) equation for the p.d.f. of the membrane potential, P_0(V),

τ_m ∂P_0/∂t = (σ²/2) ∂²P_0/∂V² − ∂/∂V [(μ + f(V)) P_0],    (4.1)
where f(V) = 0 for the SIF neuron and f(V) = −V for the LIF neuron. The probability flux through V at time t is

J_V = [(μ + f(V))/τ_m] P_0 − (σ²/2τ_m) dP_0/dV.

Since a spike is emitted each time V reaches θ and the neuron is then immediately reset to V_r, the boundary conditions for J_V are

J_V(θ) = ν_0    (4.2)
J_V(V_r⁺) − J_V(V_r⁻) = ν_0,    (4.3)

where ν_0 is the stationary instantaneous firing rate. Equation 4.2, together with P_0(V > θ) = 0, imply

P_0(θ) = 0    (4.4)
∂P_0/∂V (θ) = −2ν_0 τ_m/σ²    (4.5)
P_0(V_r⁺) − P_0(V_r⁻) = 0    (4.6)
∂P_0/∂V (V_r⁺) − ∂P_0/∂V (V_r⁻) = −2ν_0 τ_m/σ².    (4.7)
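The claim above that the SIF mean rate depends only on the mean synaptic input (it equals μ̃/(θ − V_r), as equation 4.9 below makes explicit) can be checked with a short simulation for two very different noise amplitudes. Parameter values are illustrative assumptions.

```python
# Simulation check that the non-leaky SIF neuron's mean rate is set by the mean
# drive alone: nu_0 = (mu/tau_m)/(theta - V_r) = 0.075 spikes/ms here, for small
# and large noise alike. Hedged sketch with assumed parameters.
import math, random

def sif_rate(sigma, mu=15.0, tau_m=20.0, theta=20.0, V_r=10.0,
             dt=0.02, T=4000.0, seed=7):
    rng = random.Random(seed)
    amp = sigma * math.sqrt(tau_m / dt)
    V, spikes = V_r, 0
    for _ in range(int(T / dt)):
        V += dt * (mu + amp * rng.gauss(0.0, 1.0)) / tau_m   # f(V) = 0: no leak
        if V >= theta:
            spikes += 1
            V = V_r
    return spikes / T

r_small_noise = sif_rate(2.0)
r_large_noise = sif_rate(8.0)
print(round(r_small_noise, 3), round(r_large_noise, 3))
```

The noise amplitude still shapes the interspike-interval distribution and the p.d.f. of equation 4.8; only the mean rate is insensitive to it.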
For the SIF neuron, the results are (see Abbott & van Vreeswijk, 1993, and appendix A.2 for details)

P_0(V) = (ν_0/μ̃) [1 − e^{2μ̃(V−θ)/σ̃²} − H(V_r − V)(1 − e^{2μ̃(V−V_r)/σ̃²})],    (4.8)

where H is the Heaviside function, H(x) = 1 for x > 0 and 0 otherwise, and

ν_0 = μ̃/(θ − V_r).    (4.9)

The firing rate of the SIF neuron is independent of the magnitude of the noise σ̃. For the LIF neuron, we have (Brunel & Hakim, 1999)

P_0(V) = (2ν_0 τ_m/σ) exp(−(V − μ)²/σ²) × { ∫_{(V_r−μ)/σ}^{(θ−μ)/σ} e^{s²} ds    if V < V_r
                                            { ∫_{(V−μ)/σ}^{(θ−μ)/σ} e^{s²} ds     if V > V_r,    (4.10)

in which the stationary firing rate ν_0 is obtained through the normalization condition,

1/ν_0 = τ_m √π ∫_{(V_r−μ)/σ}^{(θ−μ)/σ} e^{s²} (1 + erf(s)) ds,    (4.11)

where erf is the error function (Abramowitz & Stegun, 1970).

4.2 Synaptic Noise.
4.2.1 Fokker-Planck Equation and Boundary Conditions. We consider now the system of equations 3.5 and 3.6. The Fokker-Planck equation becomes

∂P/∂t = (1/τ_s) [ (σ²τ_m/2τ_s) ∂²P/∂I_s² − ∂/∂I_s ((μ − I_s) P) ] − (1/τ_m) ∂/∂V [(I_s + f(V)) P].    (4.12)
The probability uxes are Is C f ( V ) P tm ³ ´ s 2 tm @P Is ¡ m . JI s D ¡ PC ts 2ts2 @Is
JV D
(4.13) (4.14)
The boundary conditions are now

J_V(θ, I_s) = ν(I_s)    (4.15)
J_V(V_r⁺, I_s) − J_V(V_r⁻, I_s) = ν(I_s),    (4.16)

where ν(I_s) is the instantaneous firing rate of the neuron at current I_s. The total firing rate of the neuron is

ν = ∫_{−∞}^{+∞} ν(I_s) dI_s.    (4.17)
Equation 4.13 and the fact that the probability flux must be nonnegative at V = θ for every I_s (no neuron comes back from above threshold) imply that P(θ, I_s) = 0 for any I_s + f(θ) < 0, which in turn implies

ν(I_s) = 0    for all    I_s + f(θ) < 0.    (4.18)
On the other hand, P(θ, I_s) must be positive for I_s + f(θ) > 0 for firing to occur. The boundary conditions on P can be written

(I_s + f(θ)) P(θ, I_s) = ν(I_s) τ_m    (4.19)
(I_s + f(θ)) [P(V_r⁺, I_s) − P(V_r⁻, I_s)] = ν(I_s) τ_m.    (4.20)
We also need to specify boundary conditions on the I_s variable. Since the neuron can escape the phase plane only through V = θ, the flux through I_s goes to 0 for large |I_s|. Thus, the boundary conditions on I_s are

\lim_{I_s\to\pm\infty} J_{I_s}(V, I_s) = 0,   (4.21)

which implies

\lim_{I_s\to\pm\infty} \frac{\partial P}{\partial I_s} = 0,   (4.22)
\lim_{I_s\to\pm\infty} I_s P = 0.   (4.23)
4.2.2 The Small τs Limit: Strategy. The two-dimensional Fokker-Planck equation, 4.12, can be solved using singular perturbation techniques in the small τs limit (Hagan et al., 1989). In that limit, the standard deviation of I_s becomes proportional to σ√(τm/τs). Thus, we define

k = \sqrt{\frac{\tau_s}{\tau_m}},   (4.24)
Figure 3: Sketch of the ( V, z ) plane with the boundary conditions and the areas where different solutions are computed: the outer solution far from the boundaries and the boundary layer solutions in areas of width k around the boundaries that satisfy the boundary conditions and match the outer solution outside the gray areas.
which will be used as a small parameter in which we will expand the FP equation, 4.12. We also define a new variable z:

I_s = \frac{\sigma z}{k} - f(\theta).   (4.25)
The advantage of this new variable is that it keeps a finite variance in the small τs limit. The problem is complicated by the boundary conditions on the semi-infinite lines (z < 0) at V = θ and V = Vr (see equation 4.18). The method consists of two steps. First, we look for a solution of the equation far from the boundaries: the outer solution. The outer solution is sufficiently far from the boundaries not to be affected by them. Second, we concentrate on narrow regions of width k around the boundaries (the boundary layers). We determine boundary layer solutions that satisfy the boundary conditions and reach the outer solution exponentially far from the boundaries. Figure 3 shows in the plane (V, z) the regions in which the different solutions will be calculated.
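The scaling that motivates k can be checked directly. The sketch below (a minimal Euler-Maruyama simulation; all parameter values are illustrative, not taken from the article) integrates the Ornstein-Uhlenbeck synaptic current of equation 3.6 and compares its empirical stationary standard deviation with σ√(τm/(2τs)) = σ/(k√2), which diverges as τs → 0 — hence the rescaled variable z of equation 4.25:

```python
import math
import random

def ou_std(tau_s, tau_m=20.0, sigma=1.0, mu=0.0, dt=0.02, n=400_000, seed=7):
    """Euler-Maruyama integration of tau_s dIs/dt = -(Is - mu) + sigma*sqrt(tau_m)*eta(t).
    Returns the empirical stationary standard deviation of Is (times in ms)."""
    rng = random.Random(seed)
    I = mu
    acc = acc2 = 0.0
    count = 0
    burn = n // 10
    for i in range(n):
        I += (-(I - mu) * dt + sigma * math.sqrt(tau_m * dt) * rng.gauss(0.0, 1.0)) / tau_s
        if i >= burn:                      # discard the initial transient
            acc += I
            acc2 += I * I
            count += 1
    mean = acc / count
    return math.sqrt(acc2 / count - mean * mean)

# Stationary theory for this process: std(Is) = sigma * sqrt(tau_m / (2 tau_s)).
for tau_s in (2.0, 8.0):
    theory = math.sqrt(20.0 / (2.0 * tau_s))
    print(f"tau_s={tau_s}: simulated {ou_std(tau_s):.3f}, theory {theory:.3f}")
```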
4.2.3 The p.d.f. of the SIF Neuron. The p.d.f. up to first order in k is (see section A.3 for details)

P(V, z) = \left(Q_0(V) + Q_1(V, z)\sqrt{\frac{\tau_s}{\tau_e}}\right)\frac{e^{-z^2}}{\sqrt{\pi}},   (4.26)

with

Q_0(V) = \frac{\nu_0}{\tilde\mu}\left[1 - e^{2\tilde\mu(V-\theta)/\tilde\sigma^2} - H(V_r - V)\left(1 - e^{2\tilde\mu(V-V_r)/\tilde\sigma^2}\right)\right],   (4.27)

and

\tau_e = \frac{\tilde\sigma^2}{\tilde\mu^2}   (4.28)

represents the effective time constant of the SIF neuron. Far from V_r and θ,

Q_1(V, z) = \frac{\nu_0}{\tilde\mu}\left[\alpha e^{2\tilde\mu(V-\theta)/\tilde\sigma^2} + 2z - H(V_r - V)\left(\alpha e^{2\tilde\mu(V-V_r)/\tilde\sigma^2} + 2z\right)\right].   (4.29)
When the membrane potential is close to the reset (V = V_r + O(k)) or to the threshold (V = θ + O(k)), we have, using the reduced variables

v = \frac{V_r - V}{\tilde\sigma\sqrt{\tau_s}}, \qquad x = \frac{\theta - V}{\tilde\sigma\sqrt{\tau_s}},

Q_1^{R-}(v, z) = \frac{\nu_0}{\tilde\mu}\left(\left(e^{2\tilde\mu(V_r-\theta)/\tilde\sigma^2} - 1\right)(\alpha + 2v) - \sum_{n=1}^{+\infty} b_n v_n(\sqrt{2}z)\,e^{-\sqrt{2n}\,v}\right),   (4.30)

Q_1^{R+}(v, z) = \frac{\nu_0}{\tilde\mu}\left(e^{2\tilde\mu(V_r-\theta)/\tilde\sigma^2}(\alpha + 2v) + 2z\right),   (4.31)

Q_1^{T}(x, z) = \frac{\nu_0}{\tilde\mu}\left(\alpha + 2x + 2z + \sum_{n=1}^{+\infty} b_n v_n(\sqrt{2}z)\,e^{-\sqrt{2n}\,x}\right),   (4.32)
where Q_1^{R-} (Q_1^{R+}) describes the p.d.f. in the boundary layer below (above) reset, Q_1^{T}(x, z) describes the p.d.f. in the boundary layer below threshold, α and the b_n are parameters given by equations A.45 and A.46, and the v_n are sets of functions given in terms of Hermite polynomials in equation A.47. Note that there are two components in the boundary solutions: the terms linear in v, x, and z match the outer solution close to reset (threshold), and the sum of exponentially decaying terms in x or v is needed to satisfy the boundary condition; these terms vanish far from the boundaries. We show in Figure 4 an example of such a p.d.f. Note that contrary to the white noise regime, the probability density at threshold P(θ, z), integrated over z, is nonzero. This is an important feature that will qualitatively change the dynamical properties of this model, as we will see in section 5.
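As a sanity check, the outer solution Q_0 of equation 4.27 should integrate to one at zeroth order (the z integral of e^{-z^2}/√π already gives one). A small numerical sketch, using the input parameters of section 5.3 (the integration bounds and step count are illustrative):

```python
import math

theta, Vr = -50.0, -56.0           # firing threshold and reset (mV)
mu_t, sig_t = 0.28, 1.25           # drift mu~ (mV/ms) and noise sigma~ (mV/ms^(1/2))
nu0 = mu_t / (theta - Vr)          # equation 4.9: ~0.0467 per ms, i.e. ~46.7 Hz

def Q0(V):
    """Outer (zeroth-order) stationary density of the SIF neuron, equation 4.27."""
    val = 1.0 - math.exp(2.0 * mu_t * (V - theta) / sig_t**2)
    if V < Vr:
        val -= 1.0 - math.exp(2.0 * mu_t * (V - Vr) / sig_t**2)
    return nu0 / mu_t * val

# trapezoidal rule on (-120, theta); the tail below Vr decays exponentially,
# so the lower cutoff introduces only a negligible error
a, n = -120.0, 50_000
h = (theta - a) / n
integral = h * (0.5 * (Q0(a) + Q0(theta)) + sum(Q0(a + i * h) for i in range(1, n)))
print(integral)   # close to 1: the zeroth-order p.d.f. is normalized
```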
Figure 4: Stationary p.d.f. of the SIF neuron integrated over z for τs/τe = 0.05, μ̃ = 1.25 mV·ms^{-1}, and σ̃ = 1.25 mV·ms^{-1/2}. Solid line: numerical simulation. Dashed line: outer solution, equations 4.26–4.29. Dotted line: p.d.f. for white noise. Black circles: value of P just below threshold, computed from equation 4.32 at x = 0, and just below reset, computed from equation 4.30 at v = 0. Note the good agreement between the analytical calculation (black dots) and the numerical simulations at threshold and reset potential. For colored noise, a finite fraction of the neuron population lies near threshold.
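The behavior near threshold can also be probed by direct simulation. The sketch below is a minimal Euler scheme (step sizes and durations are illustrative; the parameters follow section 5.3 rather than the figure): it integrates the SIF neuron driven by a colored synaptic input and checks that the firing rate stays at ν0 = μ̃/(θ − Vr), equation 4.9, while a clearly nonzero fraction of time is spent within a millivolt of threshold:

```python
import math
import random

def simulate_sif(tau_s=1.8, mu_t=0.28, sig_t=1.25, theta=-50.0, Vr=-56.0,
                 dt=0.05, T=40_000.0, seed=3):
    """Euler integration of dV/dt = J, tau_s dJ/dt = -(J - mu_t) + sig_t*eta(t),
    with reset V -> Vr on crossing theta. Times in ms.
    Returns (firing rate in Hz, fraction of time with V within 1 mV of theta)."""
    rng = random.Random(seed)
    V, J = Vr, mu_t
    spikes = near = steps = 0
    for _ in range(int(T / dt)):
        V += J * dt
        J += (-(J - mu_t) * dt + sig_t * math.sqrt(dt) * rng.gauss(0.0, 1.0)) / tau_s
        if V >= theta:
            V = Vr
            spikes += 1
        if V > theta - 1.0:
            near += 1
        steps += 1
    return 1000.0 * spikes / T, near / steps

rate, frac = simulate_sif()
print(rate, frac)   # rate near mu_t/(theta - Vr) = 46.7 Hz; frac is clearly nonzero
```

The firing rate of the SIF neuron is fixed by flux balance (each spike resets the voltage by θ − Vr), so the measured rate serves as a consistency check on the integration scheme.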
4.2.4 The p.d.f. and Firing Rate of the LIF Neuron. The stationary firing rate is, up to first order in k,

\nu = \nu_0 - \sqrt{\frac{\tau_s}{\tau_m}}\,\frac{\alpha}{2}\,\frac{\Psi\!\left(\frac{\theta-\mu}{\sigma}\right) - \Psi\!\left(\frac{V_r-\mu}{\sigma}\right)}{\sqrt{\pi}\,\tau_m\left(\int_{(V_r-\mu)/\sigma}^{(\theta-\mu)/\sigma}\Psi(s)\,ds\right)^2},   (4.33)

where \Psi(s) = e^{s^2}(1 + \mathrm{erf}(s)). Up to first order in k, the result for the firing rate is compatible with the following simple expression:

\frac{1}{\nu} = \tau_m\sqrt{\pi}\int_{\frac{V_r-\mu}{\sigma} + \frac{\alpha}{2}\sqrt{\frac{\tau_s}{\tau_m}}}^{\frac{\theta-\mu}{\sigma} + \frac{\alpha}{2}\sqrt{\frac{\tau_s}{\tau_m}}}\Psi(s)\,ds.
Thus, the synaptic decay time effectively introduces an effective threshold \theta + \frac{\alpha\sigma}{2}\sqrt{\frac{\tau_s}{\tau_m}} and an effective reset potential V_r + \frac{\alpha\sigma}{2}\sqrt{\frac{\tau_s}{\tau_m}}. Note that in Brunel and Sergi (1998), only the effective threshold appeared, due to the assumption made there. We can also express the outer reduced probability density of the membrane potentials (the p.d.f. integrated over the synaptic current variable) as

P(V) = \frac{2\nu\tau_m}{\sigma}\exp\left(-\frac{(V-\mu)^2}{\sigma^2}\right) \times \begin{cases}\displaystyle\int_{\frac{V_r-\mu}{\sigma}+\frac{\alpha}{2}\sqrt{\frac{\tau_s}{\tau_m}}}^{\frac{\theta-\mu}{\sigma}+\frac{\alpha}{2}\sqrt{\frac{\tau_s}{\tau_m}}} e^{s^2}\,ds & \text{if } V < V_r \\[2ex] \displaystyle\int_{\frac{V-\mu}{\sigma}}^{\frac{\theta-\mu}{\sigma}+\frac{\alpha}{2}\sqrt{\frac{\tau_s}{\tau_m}}} e^{s^2}\,ds & \text{if } V > V_r.\end{cases}

The boundary layer p.d.f.s turn out to be quite similar to those of the SIF neuron. In particular, the threshold layer p.d.f., using the variable x = (θ − V)/(σk), is given by the same expression as for the SIF, equation 4.32. Note that again, the main difference with the white noise case is that colored noise introduces a finite probability density at the firing threshold. This has important consequences for the dynamics, as we will see in the following.

5 Dynamical Response
In this section, we consider an oscillatory input given by

I_e(t) = \varepsilon\mu\cos(\omega t).   (5.1)
The firing rate of the neuron is now driven by the oscillatory input. In the linear approximation, the firing rate is

\nu(t) = \nu\left[1 + \varepsilon\,r(\omega)\cos(\omega t + \Phi(\omega))\right],   (5.2)
where r(ω) is the relative amplitude of the firing-rate modulation and Φ(ω) is the phase lag. Our technique to solve this problem is again to solve the Fokker-Planck equation associated with the system of stochastic equations, using an expansion of the p.d.f. and the instantaneous firing rate at first order in ε. The dynamics of the leaky IF neuron in the presence of white noise has already been discussed by Brunel and Hakim (1999) and Lindner and Schimansky-Geier (2001) (see also Melnikov, 1993, for the dynamics of a related system). We include the results here for completeness. Some of the results for synaptic colored noise appear in Brunel et al. (2001).

5.1 White Noise. The problem can be solved exactly when τs = 0. The stochastic equation, 3.3, is equivalent to the Fokker-Planck equation for the
probability density P^0(V, t) (using a complex input to simplify the calculations),

\tau_m\frac{\partial P^0}{\partial t} = \frac{\sigma^2}{2}\frac{\partial^2 P^0}{\partial V^2} - \frac{\partial}{\partial V}\left[\left(f(V) + \mu(1 + \varepsilon e^{i\omega t})\right)P^0\right],   (5.3)
and the boundary conditions are now

P^0(\theta, t) = 0,   (5.4)
\frac{\partial P^0}{\partial V}(\theta, t) = -\frac{2\nu(t)\tau_m}{\sigma^2},   (5.5)
P^0(V_r^+, t) - P^0(V_r^-, t) = 0,   (5.6)
\frac{\partial P^0}{\partial V}(V_r^+, t) - \frac{\partial P^0}{\partial V}(V_r^-, t) = -\frac{2\nu(t)\tau_m}{\sigma^2}.   (5.7)
Our strategy is to solve equation 5.3 with its associated boundary conditions using an expansion in the small parameter ε. Thus, we write

\nu(t) = \nu_0\left[1 + \varepsilon\,\hat n_0(\omega)e^{i\omega t} + O(\varepsilon^2)\right],   (5.8)
P^0(V, t) = P_0(V) + \varepsilon\,e^{i\omega t}\hat P_0(V, \omega) + O(\varepsilon^2),   (5.9)
where \hat n_0(\omega) = r_0(\omega)\exp(i\Phi(\omega)) and \hat P_0(V, \omega) are complex quantities describing the oscillatory component at the input frequency ω of the instantaneous firing rate and the voltage probability density. The zeroth order in ε gives the stationary response obtained in section 4. At first order, we have

\frac{\sigma^2}{2}\frac{\partial^2 \hat P_0}{\partial V^2} - \frac{\partial}{\partial V}\left[(f(V) + \mu)\hat P_0\right] - i\omega\tau_m\hat P_0 = \mu\frac{dP_0}{dV},   (5.10)
with the boundary conditions

\hat P_0(\theta, \omega) = 0, \qquad \frac{\partial \hat P_0}{\partial V}(\theta, \omega) = -\frac{2\hat n_0(\omega)\nu_0\tau_m}{\sigma^2},
\hat P_0(V_r^+, \omega) - \hat P_0(V_r^-, \omega) = 0, \qquad \frac{\partial \hat P_0}{\partial V}(V_r^+, \omega) - \frac{\partial \hat P_0}{\partial V}(V_r^-, \omega) = -\frac{2\hat n_0(\omega)\nu_0\tau_m}{\sigma^2}.   (5.11)
The response of the SIF neuron is given by (see details in Abbott & van Vreeswijk, 1993, and section B.1)

\hat P_0(V, \omega) = \frac{\nu_0}{i\omega}\left[e^{2\tilde\mu(V-\theta)r_+(\omega)/\tilde\sigma^2} - e^{2\tilde\mu(V-\theta)/\tilde\sigma^2} - H(V_r - V)\left(e^{2\tilde\mu(V-V_r)r_+(\omega)/\tilde\sigma^2} - e^{2\tilde\mu(V-V_r)/\tilde\sigma^2}\right)\right],   (5.12)
with

r_+(\omega) = \frac{1 + \sqrt{1 + 2i\tau_e\omega}}{2},   (5.13)

where τe is given by equation 4.28, and

\hat n_0(\omega) = \frac{\sqrt{1 + 2i\tau_e\omega} - 1}{i\tau_e\omega}.   (5.14)
The neuron reproduces inputs at frequencies lower than 1/τe with little attenuation and a small phase lag. Inputs at high frequencies are attenuated by a factor \sqrt{2/(\tau_e\omega)}, with a phase lag that tends to −π/4. The amplitude and phase of \hat n_0(\omega) are shown in Figure 6A. The response of the LIF neuron is (Brunel & Hakim, 1999; Brunel et al., 2001)

\hat n_0(\omega) = \frac{\mu}{\sigma}\,\frac{\frac{\partial U}{\partial y}(y_\theta, \omega) - \frac{\partial U}{\partial y}(y_r, \omega)}{(1 + i\omega\tau_m)\left(U(y_\theta, \omega) - U(y_r, \omega)\right)},   (5.15)
where y_\theta = (\theta-\mu)/\sigma, y_r = (V_r-\mu)/\sigma, and U is given in terms of combinations of hypergeometric functions (Abramowitz & Stegun, 1970):

U(y, \omega) = \frac{e^{y^2}}{\Gamma\!\left(\frac{1 + i\omega\tau_m}{2}\right)}\,M\!\left(\frac{1 - i\omega\tau_m}{2}, \frac{1}{2}, -y^2\right) + \frac{2y\,e^{y^2}}{\Gamma\!\left(\frac{i\omega\tau_m}{2}\right)}\,M\!\left(1 - \frac{i\omega\tau_m}{2}, \frac{3}{2}, -y^2\right).   (5.16)
In the high-frequency limit,

\hat n_0(\omega) \sim \sqrt{\frac{2}{i\omega\tau_e}}.   (5.17)

This is the same asymptotic behavior as observed for the SIF neuron.

5.2 Synaptic Noise: Analytical Calculations. We now introduce the relaxation time τs of the synaptic current. We proceed along the lines of section 4.2.2. P(V, I_s, t), the probability density of finding the neuron at potential V with synaptic current I_s at time t, and ν(t), the instantaneous firing rate, are expanded in k (see equation 4.24). There are now two small parameters, k and ε. The goal of this section is to obtain the first order in both parameters, that is, in kε.
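Before turning to colored noise, the white-noise SIF response, equation 5.14, is easy to evaluate numerically and exhibits the limits quoted above. A minimal sketch (the value of τe is illustrative):

```python
import cmath
import math

tau_e = 0.02   # effective time constant in seconds (illustrative)

def n0_hat(w):
    """White-noise response of the SIF neuron, equation 5.14 (w in rad/s)."""
    x = 1j * tau_e * w
    return (cmath.sqrt(1.0 + 2.0 * x) - 1.0) / x

low, high = n0_hat(1e-3), n0_hat(1e6)
print(abs(low))                                    # ~1: slow inputs pass unattenuated
print(abs(high) * math.sqrt(tau_e * 1e6 / 2.0))    # ~1: gain falls as sqrt(2/(tau_e w))
print(cmath.phase(high))                           # ~ -pi/4 phase lag at high frequency
```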
The stochastic system we have to solve is given by equations 3.5 and 3.6. As in the computation of the stationary p.d.f., we prefer to use a synaptic variable that keeps a finite variance in the small synaptic time limit. Thus, we use here

z = \frac{k}{\sigma}\left(I_s(t) + f(\theta) + \varepsilon\mu e^{i\omega t}\right),   (5.18)
which describes the input to the neuron (synaptic and external). Moreover, since z represents the complete input of the neuron, it simplifies the description of the boundary conditions. The FP equation associated with equations 3.5 and 3.6 is

\frac{\partial P}{\partial t} = -\frac{\partial J_V}{\partial V} - \frac{\partial J_z}{\partial z},   (5.19)

J_V = \frac{1}{\tau_m}\left(f(V) - f(\theta) + \frac{\sigma z}{k}\right)P,   (5.20)

J_z = -\frac{1}{k^2\tau_m}\left[\frac{1}{2}\frac{\partial P}{\partial z} + zP\right] + \frac{1}{k\tau_m}\left(\frac{f(\theta)+\mu}{\sigma} + \varepsilon\frac{\mu}{\sigma}\left(1 + k^2 i\omega\tau_m\right)e^{i\omega t}\right)P.   (5.21)
As in equation 4.17, the instantaneous firing rate is given by the probability flux through V = θ, integrated over z,

\nu(t) = \int_{-\infty}^{+\infty} J_V(\theta, z, t)\,dz,   (5.22)
where J_V(\theta, z, t) is given by

J_V(\theta, z, t) = \frac{\sigma z}{k\tau_m}\,P(\theta, z, t).   (5.23)
Note that due to the factor 1/k in the expression for J_V(\theta, z, t), we have to expand P to second order in k to obtain the first-order correction to the instantaneous firing rate. The boundary conditions at firing threshold and reset are again (compare with equation 4.18)

J_V(\theta, z, t) = 0 \quad \text{for } z < 0,   (5.24)
J_V(V_r^+, z, t) = J_V(V_r^-, z, t) \quad \text{for } z < 0.   (5.25)
The other boundary conditions, at z = ±∞ or V = −∞, are the same as in the case without synaptic relaxation and hold for both P and \hat P.
At first order in ε, equation 5.19 gives

\frac{1}{2}\frac{\partial^2 \hat P}{\partial z^2} + \frac{\partial(z\hat P)}{\partial z} - k\,\frac{f(\theta)+\mu}{\sigma}\frac{\partial \hat P}{\partial z} - k^2 i\omega\tau_m\hat P = -k^2\frac{\partial}{\partial V}\left[\left(f(V) - f(\theta) + \frac{\sigma z}{k}\right)\hat P\right] - k\frac{\mu}{\sigma}\left(1 + k^2 i\omega\tau_m\right)\frac{dP}{dz}.   (5.26)
We again solve the equation by expanding the probability density in k. We write

P = P_0 + kP_1 + k^2 P_2 + \cdots   (5.27)
\hat P = \hat P_0 + k\hat P_1 + k^2\hat P_2 + \cdots   (5.28)
\nu = \nu_0 + k\nu_1 + \cdots   (5.29)
\hat n = \hat n_0 + k\hat n_1 + k^2\hat n_2 + \cdots   (5.30)
Analytical solutions can be found in two opposite limits: at low frequencies, ω ~ 1/τe, and at high frequencies, ω ≫ 1/τs. To connect these two limits, we resort to numerical solutions.

Low-frequency regime. The solution of the problem is very similar to the stationary case (see section 4.2.2). The details of the computation can be found in section B.1. We find that in this regime, the correction at first order in k of the firing-rate oscillation amplitude vanishes for both models of neurons. Thus, the synaptic decay time has little effect on the dynamics of the firing probability at low frequencies.

High-frequency regime. When ω ≫ 1/τs, equation 5.26 simplifies to

\hat P = -k\frac{\mu}{\sigma}\frac{dP}{dz} + o\!\left(\frac{1}{\sqrt{\omega\tau_s}}\right).   (5.31)
Thus, the high-frequency modulation of the probability density is related in a simple way to the derivative with respect to the current variable of the stationary probability density. The firing probability is

\nu(t) = \int_{-\infty}^{+\infty} J_V(\theta, z, t)\,dz = \frac{\sigma}{k\tau_m}\int_{-\infty}^{+\infty} z\left(P(\theta, z) + \varepsilon\hat P(\theta, z)e^{i\omega t}\right)dz.   (5.32)
We already know the contribution of the term at order zero in ε (see section 4.2). According to equation 5.31, to obtain the limit at large ω for the
first order in k of \hat n, we need to compute

\int_{-\infty}^{+\infty} z\,\frac{dP_1(\theta, z)}{dz}\,dz = -\int_{-\infty}^{+\infty} P_1(\theta, z)\,dz.   (5.33)
Using equations 5.32 and 4.32, we obtain for the high-input-frequency modulation

\hat n = k\frac{\mu}{\sigma}A + O(k^2) = Ak' + O(k^2),   (5.34)

where we have defined

k' = k\frac{\mu}{\sigma} = \sqrt{\frac{\tau_s\tilde\mu^2}{\tilde\sigma^2}} = \sqrt{\frac{\tau_s}{\tau_e}},
where τe is given in equation 4.28, and

A = \sqrt{2}\left(\sum_{n=1}^{+\infty}\frac{N(n)}{n!}\,n^{\,n-\frac{1}{2}}\,\frac{e^{-n}}{2^n} - \zeta\!\left(-\frac{1}{2}\right)\right) \approx 1.3238.   (5.35)

ζ is Riemann's zeta function, and N(n) is given by equation A.48. We see that introducing the synaptic relaxation time τs drastically changes the behavior of the neuron with respect to high-frequency components in the inputs (frequencies larger than 1/τs). In this regime, the amplitude of the modulation of the output firing rate is proportional to √(τs/τe), and there is no phase shift. Qualitatively, this behavior can be understood from equation 5.33. It shows that the firing-rate modulation in this regime is proportional to the cumulative probability density at the threshold. We have seen in section 4.2.3 that there is a finite fraction of the neuron population close to the threshold as soon as the synaptic decay time is finite. This finite fraction is responsible for the instantaneous response to fast changes in the inputs.

5.2.1 The Large Synaptic Time, High-Frequency Limit. The large synaptic time limit corresponds to the limit in which noise vanishes. The noiseless dynamical response was computed by Knight (1972).

SIF neuron. In the absence of noise, the SIF neuron reproduces exactly the synaptic input:

\lim_{\omega\to+\infty,\ k\to+\infty}\hat n(\omega) = 1.   (5.36)
LIF neuron. Taking the high-frequency limit, we find after some algebra a simple expression for the high-frequency limit of the response:

\lim_{\omega\to+\infty,\ k\to+\infty}\hat n(\omega) = \frac{\mu}{\mu - \theta}.   (5.37)
Note that the noiseless high-frequency response is now in general different from the low-frequency one, contrary to the SIF neuron.
5.3 Synaptic Noise: Numerical Simulations. We performed numerical simulations to obtain the neuronal response function over the whole input frequency range. The stochastic equations were simulated with an Euler method for different values of the parameters. We used a neuron with a reset potential V_r = −56 mV and a firing threshold θ = −50 mV. The integration time step was chosen to be small compared to the time constants of the problem (τs, 1/ν0, and 1/ω).
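The procedure can be illustrated with a minimal sketch of the SIF simulation (all step sizes, durations, and the seeding scheme are illustrative choices, not the article's). The modulation amplitude is extracted by projecting the spike train onto e^{-iωt}; only a low input frequency is probed here, where r(ω) ≈ 1 is expected for the SIF neuron:

```python
import cmath
import math
import random

def sif_modulation(w_hz=1.0, eps=0.5, tau_s=1.8, mu_t=0.28, sig_t=1.25,
                   theta=-50.0, Vr=-56.0, dt=0.05, T=40_000.0, seed=11):
    """SIF neuron driven by eps*mu_t*cos(w t); returns the relative modulation
    amplitude r(w) estimated from the spike train. Times in ms, w_hz in Hz."""
    w = 2.0 * math.pi * w_hz / 1000.0          # rad/ms
    rng = random.Random(seed)
    V, J, t = Vr, mu_t, 0.0
    proj, spikes = 0.0 + 0.0j, 0
    for _ in range(int(T / dt)):
        drive = eps * mu_t * math.cos(w * t)
        V += (J + drive) * dt
        J += (-(J - mu_t) * dt + sig_t * math.sqrt(dt) * rng.gauss(0.0, 1.0)) / tau_s
        if V >= theta:
            V = Vr
            spikes += 1
            proj += cmath.exp(-1j * w * t)     # Fourier projection of the spike train
        t += dt
    nu0 = spikes / T                            # spikes per ms
    return abs(proj) / (nu0 * eps * T / 2.0)    # relative modulation amplitude

print(sif_modulation())   # at 1 Hz (w*tau_e << 1) the rate follows the input, r ~ 1
```

For ν(t) = ν0(1 + εr cos(ωt + φ)) and an integer number of drive periods, the sum of e^{-iωt} over spike times approaches ν0 ε r T/2 in modulus, which motivates the normalization in the last line.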
5.3.1 SIF Neuron. The parameters characterizing the input, σ̃ and μ̃, were held constant, respectively equal to σ̃ = 1.25 mV·ms^{-1/2} and μ̃ = 0.28 mV·ms^{-1}, which gives τe = 20 ms and ν0 = 46.6 Hz. In this case, we found that the dependence of the gain function of the neuron on ε is linear even for large input oscillation amplitudes. Thus, we usually chose ε = 0.5 in the simulations.

5.3.2 LIF Neuron. Since the stationary firing rate of the LIF neuron depends on τs, all curves were plotted keeping the mean firing rate constant. To keep the firing rate constant, the mean input μ was varied.

5.3.3 High-Frequency Limit versus k. The high-frequency limit of the modulation amplitude was determined by choosing ω = 10^5 Hz. This value was chosen because it is much larger than the inverse of the synaptic decay times for any τs for which simulations were made. The neuronal response function is shown in Figure 5. The graphs show a linear dependence for small values of k. For the SIF neuron (see Figure 5A), a fit of the curve (computed at ω = 10^5 Hz) in the small-k' region (from k' = 0.1 to k' = 1) by a quadratic function gives

r(\omega = 10^5\ \mathrm{Hz}, k) \simeq (1.32 \pm 0.02)\,k' - (0.46 \pm 0.05)\,k'^2,   (5.38)
which is in good agreement with the analytical result, equation 5.35. Moreover, we find that the full curve (from k' = 0 to k' = 7) is well described by the following function,

r_{HF}(\tau_s) = \lim_{\omega\to\infty} r(\omega, \tau_s) \simeq \frac{1 - e^{-2Ak'}}{1 + e^{-ak'}},   (5.39)
where A ≈ 1.32 is given by equation 5.35 and a one-parameter fit gives a = 2.2 ± 0.1. The function on the right-hand side of equation 5.39 is the simplest function we could find that has the behavior predicted by the theory for small k (i.e., ~ Ak'), has the correct behavior at large k (see equation 5.36), and provides a good fit to all the simulation data. For the LIF neuron (see Figure 5B), we find that the behavior of the high-frequency modulation amplitude versus k is again well described by the
k’ Figure 5: High-frequency limit of the gain (estimated at v D 105 Hz) versus k0 . (A) SIF neuron with te D 20 ms. Note the linear relation for small values of k0 . Solid line: t given by equation 5.39, dashed line: t for small k0 given by equation 5.38. (B) LIF neuron with º D 50 Hz, tm D 20 ms, 2 m D 2 mV kept constant, and two values of s: s D 1 mV (±), 5 mV ( ). The solid line is the prediction of the theory, equation 5.35, with a slope equal to 1.3238. We also show the ts given by equation 5.40 (dashed lines) and the asymptotic limit at low noise and high frequencies given by equation 5.37 (dotted-dashed line).
Table 1: Values Obtained by Fitting the Response Modulus with Equation 5.41.

τs (ms)    r_HF(τs)     b           c           p
1.8        0.349 ± 2    0.10 ± 1    0.14 ± 15   3.6 ± 10
3.2        0.459 ± 1    0.099 ± 2   0.004 ± 4   5.9 ± 9
7.2        0.624 ± 2    0.111 ± 9   0.019 ± 31  5.5 ± 16
12.8       0.747 ± 1    0.110 ± 5   0.07 ± 3    5.2 ± 6
16.2       0.793 ± 2    0.102 ± 8   0.06 ± 6    5.2 ± 14

Note: The error bars are on the last digits.
following function,

r_{HF}(\tau_s) = B\,\frac{1 - e^{-\frac{2A}{B}k'}}{1 + e^{-c(\nu,\sigma)k'}},   (5.40)
with A given by equation 5.35, B by equation 5.37, and c determined by a one-parameter fit. Contrary to the SIF, this parameter depends on ν and σ. For example, c(ν = 50 Hz, σ = 1 mV) = 0.34 ± 0.01 and c(ν = 50 Hz, σ = 5 mV) = 0.47 ± 0.01. These fits are sketched in Figure 5B.

5.3.4 Response versus Frequency. In Figure 6, we show the amplitude and phase of the firing-rate modulation as a function of the frequency of the modulation in the input for different values of τs. Simulations confirm the high-frequency behavior obtained in the analysis, as we showed in Figure 5. The high-frequency limit of the response is finite, and the phase lag tends to 0 for any nonzero τs. Simulations also confirm that the behavior at low frequencies is affected very little by the synaptic filtering. In the SIF neuron case in the high-frequency regime, simulation results indicate that the modulation amplitude is well described by a function of the form r_{HF}(\tau_s) + b/(\omega\tau_s), where r_{HF} is the high-frequency limit of the response. The low-frequency regime is given by the white noise gain function \hat n(\omega, 0). It is then natural to interpolate between these two regimes by a sigmoid function. We find that the complete amplitude response function can be well described by

r(\omega, \tau_s) = r(\omega, 0) + \frac{(\omega\tau_s)^p}{c + (\omega\tau_s)^p}\left(r_{HF}(\tau_s) + \frac{b}{\omega\tau_s} - r(\omega, 0)\right).   (5.41)
The values of r_{HF}(\tau_s) obtained by fitting the curves over the whole frequency range are the same as the values obtained with a fit only in the high-frequency range. We give the results of fits over the whole parameter range for different values of τs in Table 1 (note that the error bars are on the last digits). b, c, and p are essentially independent of τs, and we show two examples of these fits in the left panel of Figure 7, with b = 0.1, c = 0.06, and p = 5.
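The empirical descriptions above are simple to use in practice. A short sketch, with the fitted constants quoted above, checks the designed limits of equation 5.39 (slope A at small k', saturation at 1, matching equation 5.36) and the low-frequency limit of equation 5.41:

```python
import math

A, a = 1.3238, 2.2                  # equation 5.35 and the one-parameter fit of eq. 5.39

def r_hf(kp):
    """High-frequency gain limit of the SIF neuron versus k' (equation 5.39)."""
    return (1.0 - math.exp(-2.0 * A * kp)) / (1.0 + math.exp(-a * kp))

def r_amp(w_ts, r_white, rhf, b=0.1, c=0.06, p=5.0):
    """Empirical amplitude response r(w, tau_s) of equation 5.41, where
    r_white = r(w, 0) and rhf = r_HF(tau_s)."""
    s = w_ts**p / (c + w_ts**p)
    return r_white + s * (rhf + b / w_ts - r_white)

print(r_hf(1e-6) / 1e-6)      # small-k' slope, ~A = 1.3238
print(r_hf(50.0))             # saturates at 1, matching equation 5.36
print(r_amp(1e-3, 0.9, 0.5))  # w*tau_s -> 0 recovers the white-noise gain 0.9
```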
Figure 6: Numerical simulations of the amplitude and phase of the response function for different values of τs. (A) SIF neuron: τs = 0 ms (solid line, amplitude of equation 5.14), 1.8 ms, 7.2 ms, 12.8 ms, 16.2 ms (symbols). (B) Low noise (σ = 1 mV). (C) High noise (σ = 5 mV). LIF neuron, response normalized to 1 at ω = 0.1 Hz, with ν0 = 50 Hz and τm = 20 ms in all cases; τs = 0 ms, white noise (solid line), τs = 2 ms (dashed line), 5 ms (dotted line), 10 ms (dot-dashed line). The size of the error bars is of the order of the size of the symbols.
Similarly, the phase lag Φ can be well described by

\Phi(\omega, \tau_s) = \Phi(\omega, 0)\left(1 - \frac{(\omega\tau_s)^q}{a + (\omega\tau_s)^q}\right).   (5.42)
We give the results of fits over the whole parameter range for different values of τs in Table 2.
Figure 7: Fits of the gain function of the SIF neuron for different values of τs: τs = 1.8 ms and τs = 16.2 ms. (A) Amplitude. (B) Phase. The solid lines are equation 5.41 with b = 0.1, c = 0.06, and p = 5 (amplitude), and equation 5.42 with q = 0.8 and the value of a obtained by the fit given in Table 2 (phase). The dashed line is the white noise neuronal response.

Table 2: Values Obtained by Fitting the Response Phase Lag with Equation 5.42.

τs (ms)    a            q
1.8        0.82 ± 2     0.87 ± 3
3.2        0.698 ± 8    0.87 ± 1
7.2        0.56 ± 2     0.81 ± 2
12.8       0.408 ± 8    0.70 ± 1
16.2       0.34 ± 1     0.59 ± 2

Note: The error bars are on the last digits.
The parameter q is also essentially independent of τs over a large interval. For τs of order τe or larger, the phase lag becomes close to 0 for all input frequencies. Examples are shown in the right panel of Figure 7. When τs = 16 ms, the phase lag is less than π/30 for any input frequency. Simulations of the LIF response are plotted in Figure 6 as a function of frequency. In a small-noise case (σ = 1 mV, Figure 6B), there are pronounced resonant peaks at frequencies equal to the background firing rate or its multiples. In a large-noise case (σ = 5 mV, Figure 6C), the resonances have almost disappeared. As found in the analytical calculation, the low-frequency regime is not affected by the synaptic filtering. In particular, the resonant peaks in the small-noise regime are still present. Moreover, we note that in the small-noise case, the high-frequency response of the neuron is greater than its low-frequency response for realistic
synaptic time constants. The response in the large-noise case is very similar to the response curve of the SIF neuron. An important difference is that the resonant peaks appear again when τs increases. This is to be expected in light of the results of Knight (1972), who showed that without noise (infinite τs limit), the LIF neuron response has infinite resonant peaks.

5.4 Synaptic Currents with Rise Time. When a rise time is introduced, the FP equation becomes three-dimensional. Tackling such a problem analytically is beyond the scope of this article; thus, we resorted to numerical simulations. We simulated equations 2.1, 2.8, and 2.9 using the diffusion approximation, 3.1. We used the same method as described in section 5.3 and kept the same values of μ and σ to compare with the case without rise time, τr = 0. The simulations show that the important parameter controlling the response is the sum of the rise and decay times, τr + τs: the system behaves essentially like a neuron with an instantaneous rise and a decay time constant τs' = τr + τs. Figure 8 shows results of simulations with different values of τr and τs, keeping their sum τs' fixed. This result can be understood by the following simple argument. One of the factors controlling the effects of noise on neuronal dynamics is the standard deviation of the synaptic input I_s. When the neuron has no rise time, this standard deviation is proportional to \sqrt{\tau_m/\tau_s} = 1/k (see equation 4.25). If we take into account the rise time, we find

\sqrt{\langle \Delta I_s^2\rangle} \sim \sqrt{\frac{\tau_m}{\tau_r + \tau_s}}.   (5.43)
Thus, it is natural to define an effective relaxation time τs' = τr + τs. As shown in Figure 8, this relationship describes the numerical results particularly well for all input frequencies.

5.5 Two or More Distinct Synaptic Currents. Synaptic currents in the nervous system have different temporal dynamics due to the variety of receptors giving rise to such currents. We consider here the case of two types of synaptic currents with decay times τs1 and τs2. The amplitudes of the input noise due to these synaptic currents are denoted σ1 and σ2. The neuron is described by the system of equations

\tau_m\frac{dV}{dt} = f(V) + I_{s1}(t) + I_{s2}(t) + \varepsilon\mu e^{i\omega t},   (5.44)
\tau_{s1}\frac{dI_{s1}}{dt} = -I_{s1}(t) + \mu_1 + \sigma_1\sqrt{\tau_m}\,\eta_1(t),   (5.45)
\tau_{s2}\frac{dI_{s2}}{dt} = -I_{s2}(t) + \mu_2 + \sigma_2\sqrt{\tau_m}\,\eta_2(t),   (5.46)
Figure 8: Amplitude and phase of the gain function of the neuron for different values of τr and τs. Here, we keep τr + τs constant; the curves are almost superimposed. (A) SIF neuron: τs = 1.8 ms, τr = 0 ms; τs = 1.6 ms, τr = 0.2 ms; τs = τr = 0.9 ms. The solid line corresponds to the white noise regime. (B) LIF neuron, with ν0 = 50 Hz and σ = 5 mV, response normalized to 1 at ω = 0.1 Hz: τs = 2 ms and τr = 0 ms; τs = 1.8 ms and τr = 0.2 ms; τs = τr = 1 ms. Note that the size of the error bars is of the order of the size of the symbols.
where ⟨η_i(t)⟩ = 0 and ⟨η_i(t)η_i(t')⟩ = δ(t − t'). We also write σ̃_i√τ_m = σ_i and μ̃_i τ_m = μ_i. Numerical simulations were performed as in section 5.3. We used μ_1 + μ_2 = μ and σ_1² + σ_2² = σ², with μ and σ as in the previous simulations, in order to compare the results in the small τs1 and/or τs2 limits. We also chose τs1 > τs2. Results are shown in Figure 9. In most cases, we can completely describe the response with an effective relaxation time given by
\tau_s' = \frac{\sigma_1^2 + \sigma_2^2}{\frac{\sigma_1^2}{\tau_{s1}} + \frac{\sigma_2^2}{\tau_{s2}}} \qquad \text{or} \qquad \frac{\sigma_1^2 + \sigma_2^2}{\tau_s'} = \frac{\sigma_1^2}{\tau_{s1}} + \frac{\sigma_2^2}{\tau_{s2}}.   (5.47)
An intuitive understanding of this result can again be obtained with the same argument as in section 5.4. The standard deviation of Is is indeed
Figure 9: Amplitude and phase of the gain function of the neuron for different values of τs1, τs2, σ1, and σ2. We keep constant the effective relaxation time τs' given by equation 5.47. (A) SIF neuron: single synaptic current, τs = 1.8 ms; τs1 = 20 ms, τs2 = 0.39 ms, σ̃1 = 1.12 mV·ms^{-1/2}, σ̃2 = 0.56 mV·ms^{-1/2}; τs1 = 20 ms, τs2 = 1.72 ms, σ̃1 = 0.28 mV·ms^{-1/2}, σ̃2 = 1.22 mV·ms^{-1/2}. (B) SIF neuron, crossover with τs1 ≫ τs2 and a very unbalanced noise case: τs1 = 16.2 ms, τs2 = 0.041 ms, σ̃1 = 1.248 mV·ms^{-1/2}, σ̃2 = 0.07 mV·ms^{-1/2}. We show here the curves with only one synaptic current, with τs = 16.2 ms and τs = 7.2 ms, to highlight the crossover around ω ~ 1/τs2. (C) LIF neuron with ν0 = 50 Hz and τs' = 2 ms, response normalized to 1 at ω = 0.1 Hz; normal case: τs1 = 5 ms, σ1 = 3.5 mV and τs2 = 1.25 ms, σ2 = 3.5 mV; extreme case: τs1 = 5 ms, σ1 = 4.9 mV and τs2 = 0.13 ms, σ2 = 1 mV. We also show the curves with a single synaptic current, σ = 5 mV, and τs = 5 ms or τs = 2 ms. Error bars, when not visible, are of the order of the size of the symbols.
proportional to

\sqrt{\langle \Delta I_s^2\rangle} \sim \sqrt{\frac{\tau_m}{\tau_s'}},   (5.48)
with τs' given by equation 5.47. Numerical simulations indicate that this value of τs' gives a very good estimate of the high-frequency limit. If the relaxation times differ by at least one order of magnitude and the variance of the noise is mostly concentrated in the slower synapses, we start to see differences from the simple picture governed by the single effective time. For input frequencies around 1/τs1, the neuron feels as if all the noise were filtered with the slower relaxation time τs1. Thus, the response is as if only this current were present. Then, for input frequencies around 1/τs2, the neuron starts to feel the faster noise due to synaptic filtering with time constant τs2. There is a crossover of the response function to a response governed by the effective synaptic decay time τs'. Thus, at high frequencies, we recover the response of a neuron with only one type of synaptic input and τs = τs'. Examples of this extreme case are shown in Figures 9B and 9C. These arguments can be generalized to an arbitrary number of synaptic currents. In general, the response will be simply described by an effective decay time equal to

\tau_s' = \left(\sum_i \sigma_i^2\right)\left(\sum_i \frac{\sigma_i^2}{\tau_{si}}\right)^{-1}.   (5.49)
When there is a large gap between two sets of time constants, the response function of the neuron will experience a crossover at frequencies of the order of this gap. For frequencies lower than the crossover frequencies, the response of the neuron will be governed by an effective decay time given by equation 5.49, where the sums take into account only the slowest currents. For higher frequencies, the response of the neuron will be governed by the effective time constant taking into account all the currents. This scenario generalizes readily to several large gaps between sets of time constants, where several crossovers show up.

6 Discussion
We have studied the influence of the filtering of noise by realistic synaptic dynamics on the dynamics of the instantaneous firing rate of IF neurons. Brunel et al. (2001) showed that filtering of noise with a finite synaptic time constant leads to a finite high-frequency limit of the response, and hence to an instantaneous component of the response of the instantaneous firing rate to any input. In this article, we first presented a detailed description of the analytical methods that lead to the results given in Brunel et al. (2001) on the high-frequency limit of the neuronal response.
In particular, we presented the joint p.d.f. of voltage and synaptic current up to first order in k for both neuronal models, and the link between the properties of the p.d.f. at threshold and the high-frequency response. Then we proceeded to investigate extensions of this result, mostly through numerical simulations:

- Characterization of the low- and intermediate-input-frequency regimes
- Characterization of the high-frequency response at all synaptic times τs
- Extension of these results to more realistic biological conditions, including either a rise time of the postsynaptic currents or the presence of different types of synaptic receptors

Our calculations show that the probability density of the membrane potential is finite and of order √(τs/τe) at the firing threshold, for both types of IF neuron models. When τs > 0, the linear dynamical response of the IF neuron remains finite in the high-frequency limit, with no phase lag. This high-frequency behavior is intimately related to the finite probability density at the firing threshold. An intuitive interpretation of this result is that when the probability of being at threshold is finite, there is an instantaneous component in the response of the firing probability to arbitrarily fast variations in the input currents. In turn, the finite probability at threshold is related to the fact that with a finite synaptic time constant, there is no diffusive term in the probability current going through threshold, contrary to the white noise case, for which we get a 1/√(iωτe) attenuation at high frequencies (Brunel & Hakim, 1999; Brunel et al., 2001). Numerical simulations allowed us to characterize the response function of the neuron at intermediate frequencies. For the simplest IF neuron, we found empirically compact descriptions of the response (for both amplitude and phase), which show that the crossover between the low- and high-frequency regimes occurs at input frequencies of the order of 1/τs.
More realistic synaptic current dynamics have also been investigated. The introduction of a rise time τr does not change the neuronal response qualitatively. We showed how an effective relaxation time τs′ = τs + τr can describe the neuron's response well at all frequencies, for both the simplest and the leaky IF neuron models. In the case of several synaptic currents with different relaxation times, we again defined an effective relaxation time that describes the high-input-frequency regime well. Overall, we obtained a fairly complete picture of how the dynamical response of IF neurons depends on synaptic timescales, including fairly realistic descriptions used in detailed biophysical models. These results were obtained under certain conditions. First, we assumed no correlation between synaptic inputs. Correlations in the inputs to an IF neuron affect the firing rate (Salinas & Sejnowski, 2000), since they lead to an effective increase or decrease of the variance of the noise, depending on the type of correlation. However, we do not expect qualitative changes in
the neuron dynamical response. Second, we considered synaptic current rather than conductance changes. Qualitative differences between current and conductance inputs were recently pointed out by Tiesinga, Jose, and Sejnowski (2000). As far as the dynamical response is concerned, we do not expect any qualitative difference if conductance-based inputs are considered. Quantitatively, the effective membrane time constant will be reduced by conductance inputs, leading to an increase of the magnitude of the instantaneous component in the response. The two IF neuronal models we investigated show similar qualitative behaviors at high frequencies. The main difference between the two models is the presence, in the LIF neuron dynamical response n̂(ω), of resonant peaks at input frequencies that are multiples of its stationary firing rate. These resonant peaks occur when the background firing rate is high and the noise is low, when firing is close to being periodic. They tend to disappear as noise increases. Within the range of firing rates we considered in our simulations (10–50 Hz), the resonant peaks disappear completely at a noise strength σ of order 2 to 5 mV. This corresponds roughly to the orders of magnitude of the noise variance found in measurements in vivo (Destexhe & Paré, 1999). Therefore, the simplest IF neuron could give a good approximation to the dynamics of the leaky IF neuron in conditions similar to in vivo conditions. In the low-noise regime, these resonant peaks give rise to an oscillatory behavior of the instantaneous firing rate in response to a step current increase (Marsalek, Koch, & Maunsell, 1997; Burkitt & Clark, 1999; van Rossum, 2001). Another attempt to include synaptic dynamics in a population density approach has recently been made by Haskell et al. (2001). To get a computationally efficient description, they reduced the multidimensional partial differential equation describing the dynamics of the population to a one-dimensional one.
However, their resulting reduced dynamical equation for the population density does not capture the high-frequency response of the neuron. Indeed, in their model, the mean and the variance of the input are filtered by the synaptic dynamics, but the input variance is still associated with white noise. Consequently, they find that the probability density near threshold is zero for any τs. Thus, their approximation can capture the dynamics in response to slowly varying inputs, but not the high-frequency behavior. Our work adds elements to the old debate concerning the dichotomy between spiking and firing-rate neuron models. An important question is whether spiking neuronal models can be reduced to descriptions in terms of firing rates, and in which conditions this reduction is possible. In firing-rate neuronal models, the input I_in(t) of the neuron usually acts on the firing rate of the neuron ν(t) through a low-pass filter,

\tau \frac{d\nu}{dt} = -\nu(t) + \phi[I_{\rm in}(t)],
where φ is the neuronal transfer function, but the value of the time constant τ to be used in such a description has been the subject of debate (Treves, 1993;
Abbott & van Vreeswijk, 1993; Gerstner, 1995; Pinto, Brumberg, Simons, & Ermentrout, 1996; Ermentrout, 1998). We showed that in the presence of realistic noise, when the synaptic timescale is of the same order as the membrane time constant, the firing rate is able to follow instantaneously current inputs directly injected into the soma. In these conditions, we would essentially have

\nu(t) = \phi[I_{\rm in}(t)].   (6.1)
In real neurons, inputs come through synapses, and therefore the response to oscillatory inputs is low-pass filtered, with an additional factor of 1/(1 + iωτs) in the response in the simplest case of a single synaptic decay time. Thus, instead of the instantaneous response given by equation 6.1, the dynamics will be purely governed by the synaptic decay time,

\tau_s \frac{dI_s}{dt} = -I_s(t) + I_{\rm in}(t), \qquad \nu(t) = \phi[I_s(t)].
Such a relationship is valid for small modulations around a finite background firing rate. When the current is such that the firing rate becomes essentially zero, deviations from this behavior are expected (Chance, 2000). We emphasize that the other conditions for which we expect such a reduction to be valid (presence of a finite background firing rate and synaptic time constants of the same order as the membrane time constant) seem to be the typical conditions of a cortical neuron in vivo. For conductance-based synaptic inputs, the bombardment due to spontaneous activity acts to reduce the membrane time constant. This will also contribute to a faster response to external input. Furthermore, when the synaptic timescales are shorter than the membrane timescale, the dynamics of the firing probability can still be reasonably approximated by

\nu(t) \approx r_{\rm HF}\, I_{\rm in}(t) + (1 - r_{\rm HF}) \frac{1}{\tau} \int_{-\infty}^{t} du\, e^{-(t-u)/\tau}\, I_{\rm in}(u),   (6.2)

where r_HF is the high-frequency limit of the dynamical response (see equation 5.39) and τ is a time constant typically between τs and τe for small τs. We show examples of this kind of approximation in Figure 10 for the SIF neuron. The noise model we have studied provides an interpolation between two previously studied noise models: at large τs, the rapid changes in response to sharp input transients are similar to what was found in slow noise models (Knight, 1972; Gerstner, 2000), and, at small τs, the response is fast only at low noise strengths, as in the fast-noise, escape-noise model (Gerstner, 2000). Finally, the methods presented in this article are a prerequisite for an analysis of the dynamics of recurrent neuronal networks in the presence of realistic noise.
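The approximation in equation 6.2 is easy to evaluate numerically. Below is a minimal sketch with illustrative values for r_HF and τ (in the article these depend on the neuron and noise parameters through equation 5.39); it reproduces the instantaneous jump at a step discontinuity followed by an exponential relaxation, as in Figure 10.

```python
import numpy as np

def rate_approximation(i_in, dt, r_hf, tau):
    """Equation 6.2: the firing rate is the sum of an instantaneous
    component r_hf*Iin(t) and a component low-pass filtered with
    time constant tau, weighted by (1 - r_hf)."""
    filtered = np.zeros_like(i_in)
    for k in range(1, len(i_in)):
        # exponential (low-pass) filter of the input
        filtered[k] = filtered[k - 1] + dt / tau * (i_in[k - 1] - filtered[k - 1])
    return r_hf * i_in + (1 - r_hf) * filtered

# Illustrative values only (r_hf and tau are not taken from the article).
dt, tau, r_hf = 0.1, 15.0, 0.3                        # ms
t = np.arange(0.0, 200.0, dt)
i_in = np.where((t >= 50.0) & (t < 150.0), 1.0, 0.0)  # step input

nu = rate_approximation(i_in, dt, r_hf, tau)
# nu jumps instantaneously by r_hf at the step onset and then relaxes
# to the new steady state with time constant tau.
```

The instantaneous term is what distinguishes this reduction from a purely low-pass firing-rate model: at the step onset the rate jumps by r_HF before relaxing.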
Figure 10: Instantaneous firing rate n(t) (Hz) of the SIF neuron versus t (ms), in response to a periodic step input. We show the simulation results for three synaptic times: τs = 0 ms (white noise), τs = 5 ms, and τs = 20 ms. In all cases, τe = 20 ms. Bin size: 1 ms. Note that for finite τs, there is an instantaneous jump of the firing rate in response to a discontinuity in the input current. We also plot (dashed lines) an approximation of the response given by equation 6.2, with τ = 14.9 ms for τs = 5 ms and τ = 16.4 ms for τs = 20 ms.

Appendix A: SIF Neuron: Stationary Input

A.1 Firing Probability. The membrane potential obeys
\frac{dV}{dt} = I_s(t).   (A.1)

Each time V reaches the threshold θ, it is immediately reset to V_r. Below the threshold, the membrane potential of the neuron is

V(t) = V(0) + \int_0^t I_s(u)\, du.

From this equation, it is straightforward to determine the time at which the nth spike is fired. It occurs at the first time t that satisfies

\int_0^t I_s(u)\, du = n(\theta - V_r) + V_r - V(0).
Thus, the number of spikes the neuron has emitted at time t is given by

n(t) = E\!\left( \sup_{u \le t} \frac{V(0) - V_r + \int_0^u I_s(v)\,dv}{\theta - V_r} \right),   (A.2)

where E(x) is the integer part of x. Writing I_s as the sum of its mean μ and of its fluctuations ΔI_s ≡ I_s − μ, we obtain

n(t) = E\!\left( \frac{V(0) - V_r}{\theta - V_r} + \sup_{u \le t} \frac{\mu u + \int_0^u \Delta I_s(v)\,dv}{\theta - V_r} \right).   (A.3)

The firing probability is computed as

\lim_{t\to\infty} \frac{n(t)}{t} = \lim_{t\to\infty} \frac{1}{t}\, E\!\left( \sup_{u \le t} \frac{\mu u + \int_0^u \Delta I_s(v)\,dv}{\theta - V_r} \right).   (A.4)

We show in the next paragraph that the mean of the random variable

X = \lim_{t\to\infty} \frac{1}{t}\, E\!\left( \sup_{u \le t}\left( \mu u + \int_0^u \Delta I_s(v)\,dv \right)\right)   (A.5)

is μ. Thus, the mean firing rate is

\nu = \lim_{t\to\infty} \frac{n(t)}{t} = \frac{\mu}{\theta - V_r}.   (A.6)
The firing rate of the SIF neuron is independent of the synaptic time τs and the magnitude of the noise σ. We now compute the mean of the random variable (see equation A.5):

X = \lim_{t\to\infty} \frac{1}{t}\, E\!\left( \sup_{u \le t}\left( \mu u + \int_0^u \Delta I_s(v)\,dv \right)\right).

The quantity inside the parentheses can be bounded above by

\sup_{u \le t}\left( \mu u + \int_0^u \Delta I_s(v)\,dv \right) \le \sup_{u \le t}(\mu u) + \sup_{u \le t}\left( \int_0^u \Delta I_s(v)\,dv \right)   (A.7)

and below by

\sup_{u \le t}\left( \mu u + \int_0^u \Delta I_s(v)\,dv \right) \ge \sup_{u \le t}(\mu u) + \inf_{u \le t}\left( \int_0^u \Delta I_s(v)\,dv \right),   (A.8)
with sup_{u≤t}(μu) = μt. Since ΔI_s(v) is a random process symmetric around 0, we need to look only at the upper bound. First, consider the case where ΔI_s(v) is a gaussian white noise (instantaneous synaptic current). We can define

W_t = \int_0^t \Delta I_s(v)\,dv = \int_0^t \eta(v)\,dv,

which is a Brownian motion with the well-known probability density

p(W_t = x, t) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, e^{-x^2/(2\sigma^2 t)}.

The running maximum of W_t,

S_t = \sup_{u \le t} W_u,

obeys (Karatzas & Shreve, 1988)

P(S_t = x) = 2 P(W_t = x) = 2 p(x, t).

The mean of S_t is

\langle S_t \rangle = \sigma \sqrt{\frac{2t}{\pi}}.   (A.9)
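This property of the running maximum is easy to verify by simulation. The following sketch (illustrative parameters; the discrete-time maximum slightly underestimates the continuous one) compares the Monte Carlo mean of S_t with σ√(2t/π):

```python
import numpy as np

# Monte Carlo check of <S_t> = sigma * sqrt(2 t / pi), the mean of the
# running maximum of a Brownian motion with variance sigma^2 t.
rng = np.random.default_rng(0)
sigma, t_max, n_steps, n_paths = 1.0, 4.0, 2000, 10000
dt = t_max / n_steps

w = np.zeros(n_paths)   # current position W_u of each path
s = np.zeros(n_paths)   # running maximum S_u of each path
for _ in range(n_steps):
    w += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    np.maximum(s, w, out=s)

mean_s = s.mean()
theory = sigma * np.sqrt(2 * t_max / np.pi)
# mean_s comes out close to (and slightly below) theory, since discrete
# sampling misses excursions between time steps.
```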
Hence, it directly follows that ⟨X⟩ = μ. If we now consider the case where we have synaptic currents with relaxation time τs,

\Delta I_s(v) = \frac{1}{\tau_s} \int_0^v e^{-(v-w)/\tau_s}\, \eta(w)\, dw,

we define

W'_t = \int_0^t \Delta I_s(v)\, dv

and

S'_t = \sup_{u \le t} W'_u.

ΔI_s is an Ornstein-Uhlenbeck process, and we can compute the probability density of W'_t,

p(W'_t = x) = \frac{1}{\sqrt{2\pi\sigma^2 V(t)}}\, e^{-x^2/(2\sigma^2 V(t))},   (A.10)

with

V(t) = t - 2\tau_s \tanh\!\left(\frac{t}{2\tau_s}\right) + \frac{\tau_s}{2}\left(1 - e^{-2t/\tau_s}\right) \tanh^2\!\left(\frac{t}{2\tau_s}\right).

Since V(t) < t, it follows that the probability density of W'_t is more peaked than when ΔI_s is white noise. Thus ⟨S'_t⟩ < ⟨S_t⟩, and therefore we again find

\langle X \rangle = \mu.   (A.11)
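The independence of the firing rate from τs and σ (equation A.6) can also be checked directly by simulating the SIF neuron with an Ornstein-Uhlenbeck synaptic current. A minimal Monte Carlo sketch, in arbitrary units with illustrative parameter values:

```python
import numpy as np

def sif_rate(mu, sigma, tau_s, theta=1.0, v_r=0.0,
             t_max=2000.0, dt=0.01, seed=0):
    """Estimate the SIF firing rate for a synaptic current with mean mu,
    stationary standard deviation sigma, and correlation time tau_s."""
    rng = np.random.default_rng(seed)
    v, i_s, spikes = v_r, mu, 0
    for _ in range(int(t_max / dt)):
        # Ornstein-Uhlenbeck update of the synaptic current
        i_s += dt / tau_s * (mu - i_s) \
            + sigma * np.sqrt(2 * dt / tau_s) * rng.standard_normal()
        v += i_s * dt                  # perfect integration
        if v >= theta:                 # fire and reset
            v = v_r
            spikes += 1
    return spikes / t_max

# Equation A.6 predicts nu = mu / (theta - v_r) = 0.05 for mu = 0.05,
# whatever tau_s and sigma are.
```

Running it with different σ and τs gives rates that agree with μ/(θ − V_r) up to the Monte Carlo sampling error.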
A.2 P.d.f.—White Noise. For simplicity, it is useful to define the adimensional variable and constants

u = \frac{2\tilde\mu V}{\tilde\sigma^2}, \qquad u_\theta = \frac{2\tilde\mu \theta}{\tilde\sigma^2}, \qquad u_r = \frac{2\tilde\mu V_r}{\tilde\sigma^2}.   (A.12)

The Fokker-Planck equation (see equation 3.10) written with this variable is

\frac{\tau_e}{2}\frac{\partial P_0}{\partial t} = \frac{\partial^2 P_0}{\partial u^2} - \frac{\partial P_0}{\partial u},   (A.13)

where

\tau_e = \frac{\tilde\sigma^2}{\tilde\mu^2}   (A.14)

is a time constant that depends on the signal-to-noise ratio of synaptic inputs. It is the only time constant when synaptic timescales are absent, and it plays the role of the membrane time constant of the leaky IF neuron, as we will see. Note that if we come back to the original variables, we have, in the case J_i = J for all i, τe = 1/(Cν) (see equation 3.2). In this section, we consider the stationary problem ∂P_0/∂t = 0. Thus, we have to solve

\frac{\partial^2 P_0}{\partial u^2} = \frac{\partial P_0}{\partial u},   (A.15)

and the new boundary conditions are

P_0(u_\theta) = 0, \qquad \frac{\partial P_0}{\partial u}(u_\theta) = -\frac{\nu_0 \tau_e}{2},

P_0(u_r^+) - P_0(u_r^-) = 0, \qquad \frac{\partial P_0}{\partial u}(u_r^+) - \frac{\partial P_0}{\partial u}(u_r^-) = -\frac{\nu_0 \tau_e}{2}.   (A.16)

The solution of equations A.15 and A.16 is

P_0(u) = \frac{\nu_0 \tau_e}{2}\left[ 1 - e^{u - u_\theta} - H(u_r - u)\left(1 - e^{u - u_r}\right) \right],   (A.17)

where H is the Heaviside function. The normalization of the probability density P_0 to 1 gives the stationary firing rate given by equation 4.9.
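As a consistency check, normalizing the p.d.f. of equation A.17 numerically recovers the rate μ̃/(θ − V_r) of equation A.6. A sketch with illustrative values standing in for μ̃ and σ̃:

```python
import numpy as np

mu, sigma, theta, v_r = 0.2, 0.5, 1.0, 0.0    # illustrative values
u_theta = 2 * mu * theta / sigma**2           # threshold in units of u
u_r = 2 * mu * v_r / sigma**2                 # reset in units of u
tau_e = sigma**2 / mu**2                      # equation A.14

# Shape of P0 (equation A.17) without the nu0*tau_e/2 prefactor,
# on (-infinity, u_theta], truncated far below the reset.
u = np.linspace(u_theta - 40.0, u_theta, 400_001)
du = u[1] - u[0]
shape = 1 - np.exp(u - u_theta) - np.where(u < u_r, 1 - np.exp(u - u_r), 0.0)

norm = shape.sum() * du       # integral of P0 / (nu0 * tau_e / 2)
nu0 = 2 / (tau_e * norm)      # normalizing P0 to 1 fixes nu0
# The integral equals u_theta - u_r, so nu0 = mu / (theta - v_r).
```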
A.3 P.d.f.—Synaptic Noise. In this section, we derive the first-order correction in k to the SIF neuron p.d.f. A summary of the results is given in section 4.2.3. Note that for these calculations, we use k = √(τs/τe), with τe defined in equation A.14.
A.3.1 FP Equation and Boundary Conditions. We can rewrite the FP equation (see equation 4.12) using the rescaled variables u and z defined in equations A.12 and 4.25 as

\frac{1}{2}\frac{\partial^2 P}{\partial z^2} + \frac{\partial (zP)}{\partial z} - k \frac{\partial P}{\partial z} - 2kz \frac{\partial P}{\partial u} = 0.   (A.18)

The boundary conditions with these new variables are

z\, P(u_\theta, z) = \frac{k \tau_e \nu(z)}{2},   (A.19)

z\left[ P(u_r^+, z) - P(u_r^-, z) \right] = \frac{k \tau_e \nu(z)}{2},   (A.20)

\lim_{z \to \pm\infty} z\, P(u, z) = 0,   (A.21)

\lim_{z \to \pm\infty} \frac{\partial P}{\partial z} = 0,   (A.22)

with the particular condition

\nu(z) = P(u_\theta, z) = 0 \quad \text{for all } z < 0.   (A.23)
We then define

P(u, z) = \frac{Q(u, z)}{\sqrt{\pi}}\, e^{-z^2}.   (A.24)

Inserting equation A.24 in A.18, we obtain

\mathcal{L}Q + 2kz\left( Q - \frac{\partial Q}{\partial u} \right) - k \frac{\partial Q}{\partial z} = 0,   (A.25)

where \mathcal{L} is a differential operator defined by

\mathcal{L} = \frac{1}{2}\frac{\partial^2}{\partial z^2} - z \frac{\partial}{\partial z}.   (A.26)
To solve equation A.25 with the boundary conditions given by equations A.19 through A.23, we use analytical methods described in Hagan et al. (1989) and Klosek and Hagan (1998), and in the neuronal context in Brunel and Sergi (1998). Q and the firing probability ν are expanded in powers of k,

Q = \sum_n k^n Q_n, \qquad \nu = \sum_n k^n \nu_n.

Q and ν can then be obtained by recurrence:

\mathcal{L}Q_0 = 0,   (A.27)

\mathcal{L}Q_n = \frac{\partial Q_{n-1}}{\partial z} - 2z\left( Q_{n-1} - \frac{\partial Q_{n-1}}{\partial u} \right).   (A.28)
A.3.2 Solution far from the Boundaries.

Order 0. The solution of equation A.27 is

Q_0 = f(u) \int e^{z^2} dz + g(u),   (A.29)

where f and g are functions of u only. Since zP must vanish in both limits z → ±∞, we necessarily have f(u) = 0. Then, at order 0 in k, the problem is equivalent to the white noise case, and we have

Q_0(u) = P_0(u),   (A.30)

where P_0 is given by equation A.17.

Order 1. We obtain from equation A.28

\mathcal{L}Q_1 = 2z\left( \frac{\partial Q_0}{\partial u} - Q_0 \right).   (A.31)

Q_1 can be written as the sum of the solution of the homogeneous equation \mathcal{L}Q = 0 and of a particular solution of equation A.31, Q_1^p. The general solution of \mathcal{L}Q = 0 is again

Q_1 = f(u) \int e^{z^2} dz + g(u).   (A.32)

Since the right-hand side of equation A.31 is polynomial in z and

\mathcal{L}z^n = -n z^n + \frac{n(n-1)}{2}\, z^{n-2},

we look for a particular solution Q_1^p that is polynomial in z:

Q_1^p = Q_1^1(u)\, z + Q_1^0(u).   (A.33)

Inserting equation A.17 in the right-hand side of equation A.31, we find

Q_1^1(u) = \nu_0 \tau_e\, H(u - u_r).   (A.34)

Since zP must vanish in both limits z → ±∞, we have again f(u) = 0 in equation A.32. Thus we have, absorbing g(u) into Q_1^0(u),

Q_1(u, z) = Q_1^1(u)\, z + Q_1^0(u),   (A.35)
where Q_1^0 is still to be determined.

Order 2. Using the same method, we find

Q_2(u, z) = Q_2^0(u) + 2\left(1 - \frac{\partial}{\partial u}\right) Q_1^0(u)\, z + H(u - u_r)\, \nu_0 \tau_e\, z^2.   (A.36)
Order 3. At this order, we find that to satisfy the boundary condition A.21, we need Q_1^0 to obey the condition

\frac{\partial^2 Q_1^0}{\partial u^2} = \frac{\partial Q_1^0}{\partial u}.   (A.37)

To get Q_1^0(u), we still need to specify the boundary conditions it satisfies. These will be provided by the boundary-layer solutions.

A.3.3 Threshold Layer Solution. The second step is to determine the solution of equation A.31 near the threshold u_θ. Since the area close to the threshold where the boundary solution departs from the outer solution is expected to be of width k (Hagan et al., 1989), we define

x = \frac{u_\theta - u}{k},   (A.38)
and we look for a solution Q^T(x, z) that matches the boundary condition at x = 0 and the outer solution for large x. We also expand this function in k, and the equations for the first orders in k, Q_0^T and Q_1^T, are

\mathcal{L}Q_0^T + 2z \frac{\partial Q_0^T}{\partial x} = 0,   (A.39)

\mathcal{L}Q_1^T + 2z \frac{\partial Q_1^T}{\partial x} = \frac{\partial Q_0^T}{\partial z} - 2z\, Q_0^T.   (A.40)

According to equation 4.18, we know that Q^T(x = 0, z) = 0 for any z < 0. The x → ∞ limits of Q^T(x, z) are given by the boundary conditions of the outer solution Q at u_θ. Using equations 4.8 and A.35, we find that

\lim_{x \to +\infty} Q_0^T(x) = Q_0(u_\theta) = 0,   (A.41)

\lim_{x \to +\infty} Q_1^T(x, z) = Q_1(u_\theta, z) - x \frac{\partial Q_0}{\partial u}(u_\theta) = Q_1^0(u_\theta) + \frac{\nu_0 \tau_e}{2}\,(x + 2z).   (A.42)
The solution of equations A.39 through A.42 is (Hagan et al., 1989)

Q_0^T(x, z) = 0,   (A.43)

Q_1^T(x, z) = \frac{\nu_0 \tau_e}{2}\left( a + x + 2z + \sum_{n=1}^{+\infty} b_n\, v_n^+(\sqrt{2}\,z)\, e^{-\sqrt{2n}\,x} \right),   (A.44)

with

a = \sqrt{2}\,\zeta\!\left(\tfrac{1}{2}\right),   (A.45)

b_n = -\frac{\sqrt{2}\, N(\sqrt{n})}{\sqrt{2\, n!\, n}},   (A.46)

v_n^+(z) = e^{z^2/4}\, e^{-(z + \sqrt{2n})^2/4}\, \mathrm{He}_n(z + \sqrt{2n}),   (A.47)

where ζ is the Riemann zeta function, He_n is the nth Hermite polynomial (Abramowitz & Stegun, 1970), and

N(s) = \prod_{k=1}^{\infty} \left( 1 + \frac{s}{\sqrt{k}} \right) e^{-2s\left(\sqrt{k} - \sqrt{k-1}\right)} \left( \frac{k+1}{k} \right)^{s^2/2}.   (A.48)
Note that a and the b_n differ from those used in Hagan et al. (1989) by a factor √2.

A.3.4 Reset-Layer Solution. To complete the description of the first-order correction of the p.d.f., we have to look at the boundary solution near u = u_r. As in equation A.38, we define

v = \frac{u_r - u}{k},   (A.49)

and we look for solutions Q_1^{R+}(v, z) and Q_1^{R-}(v, z) matching the boundary condition at v = 0 and the outer solution for large |v|, respectively for v < 0 and v > 0. The equation for Q_1^R is

\mathcal{L}Q_1^R - 2z \frac{\partial Q_1^R}{\partial v} = \frac{\partial Q_0^R}{\partial z} - 2z\, Q_0^R.   (A.50)

Using equations A.17 and A.35, we find

\lim_{v \to \pm\infty} Q_0^{R\pm}(v) = Q_0^{\pm}(u_r) = 0,   (A.51)

\lim_{v \to -\infty} Q_1^{R+}(v, z) = Q_1^{+}(u_r, z) - v \frac{\partial Q_0^{+}}{\partial u}(u_r) = Q_1^{0+}(u_r) + \frac{\nu_0 \tau_e}{2}\left( 2z + e^{u_r - u_\theta}\, v \right),   (A.52)

\lim_{v \to +\infty} Q_1^{R-}(v, z) = Q_1^{-}(u_r, z) - v \frac{\partial Q_0^{-}}{\partial u}(u_r) = Q_1^{0-}(u_r) + \frac{\nu_0 \tau_e}{2}\left( e^{u_r - u_\theta} - 1 \right) v.   (A.53)

According to equations 4.19 and 4.20, we know that Q_1^{R+}(0, z) - Q_1^{R-}(0, z) = Q_1^T(0, z). We find

Q_0^{R\pm}(v, z) = 0,   (A.54)

Q_1^{R+}(v, z) = Q_1^{0+}(u_r) + \frac{\nu_0 \tau_e}{2}\left( e^{u_r - u_\theta}\, v + 2z \right),   (A.55)

Q_1^{R-}(v, z) = Q_1^{0-}(u_r) + \frac{\nu_0 \tau_e}{2}\left( \left( e^{u_r - u_\theta} - 1 \right) v - \sum_{n=1}^{+\infty} b_n\, v_n^+(\sqrt{2}\,z)\, e^{-\sqrt{2n}\,v} \right).   (A.56)

According to equation A.37, we have

Q_1^0(u) = A_1^0\, e^u + B_1.   (A.57)

Then, knowing the boundary conditions (given by equation A.42) and that the integral of Q_1^0 over u should be 0 since it is a correction to a probability density, we find

Q_1^{0+}(u_r) = \frac{\nu_0 \tau_e}{2}\, a\, e^{u_r - u_\theta},   (A.58)

Q_1^{0-}(u_r) = \frac{\nu_0 \tau_e}{2}\, a\left( e^{u_r - u_\theta} - 1 \right),   (A.59)

where a has been defined in equation A.45.
Appendix B: SIF Neuron: Oscillatory Input

B.1 White Noise. We can rewrite the FP equation using the variable u (see equation A.12) as

\frac{\tau_e}{2}\frac{\partial P}{\partial t} = \frac{\partial^2 P}{\partial u^2} - \left(1 + \varepsilon e^{i\omega t}\right) \frac{\partial P}{\partial u},   (B.1)

and the boundary conditions are now

P(u_\theta, t) = 0,   (B.2)

\frac{\partial P}{\partial u}(u_\theta, t) = -\frac{\nu(t)\,\tau_e}{2},   (B.3)

P(u_r^+, t) - P(u_r^-, t) = 0,   (B.4)

\frac{\partial P}{\partial u}(u_r^+, t) - \frac{\partial P}{\partial u}(u_r^-, t) = -\frac{\nu(t)\,\tau_e}{2}.   (B.5)

Our strategy is to solve equation B.1 with its associated boundary conditions using an expansion in the small parameter ε. Thus, we write

\nu(t) = \nu_0\left[ 1 + \varepsilon e^{i\omega t}\, \hat n(\omega) + O(\varepsilon^2) \right],   (B.6)

P(u, t) = P(u) + \varepsilon e^{i\omega t}\, \hat P(u, \omega) + O(\varepsilon^2),   (B.7)

where n̂(ω) and P̂(u, ω) are complex quantities describing the oscillatory component at the input frequency ω of the instantaneous firing rate and of the voltage probability density. The zeroth order gives the stationary response ν_0 and P(u) given by equations A.6 and A.17. At the first order in ε, we have

\frac{\partial^2 \hat P}{\partial u^2} - \frac{\partial \hat P}{\partial u} - \frac{i\tau_e\omega}{2}\, \hat P(u, \omega) = \frac{dP}{du},   (B.8)

with the boundary conditions

\hat P(u_\theta, \omega) = 0, \qquad \frac{\partial \hat P}{\partial u}(u_\theta, \omega) = -\frac{\hat n(\omega)\, \nu_0 \tau_e}{2},

\hat P(u_r^+, \omega) - \hat P(u_r^-, \omega) = 0, \qquad \frac{\partial \hat P}{\partial u}(u_r^+, \omega) - \frac{\partial \hat P}{\partial u}(u_r^-, \omega) = -\frac{\hat n(\omega)\, \nu_0 \tau_e}{2}.   (B.9)
Solutions of equation B.8 are linear combinations of the two exponentials exp(r_±(ω)u), where

r_\pm(\omega) = \frac{1 \pm \sqrt{1 + 2i\tau_e\omega}}{2},   (B.10)

plus the particular solution

\hat P_p = -\frac{2}{i\tau_e\omega}\, \frac{dP}{du}.

The coefficients of the exponentials are obtained using the boundary conditions, equation B.9. We obtain

\hat P(u, \omega) = \frac{\nu_0}{i\omega}\left[ e^{u - u_\theta} - e^{r_+(\omega)(u - u_\theta)} - H(u_r - u)\left( e^{u - u_r} - e^{r_+(\omega)(u - u_r)} \right) \right]   (B.11)

and

\hat n(\omega) = \frac{\sqrt{1 + 2i\tau_e\omega} - 1}{i\tau_e\omega}.   (B.12)
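Equation B.12 can be evaluated directly. The sketch below (illustrative τe) shows the low-frequency limit n̂(ω) → 1 and the high-frequency attenuation |n̂| ∝ 1/√ω with a 45-degree phase lag, characteristic of the white-noise case, to be contrasted with the finite, zero-lag high-frequency response obtained with synaptic filtering:

```python
import numpy as np

def n_hat(omega, tau_e):
    """Linear response of the SIF neuron with white noise (equation B.12)."""
    x = 1j * tau_e * omega
    return (np.sqrt(1 + 2 * x) - 1) / x

tau_e = 0.02                # s (illustrative)
f = np.logspace(0, 4, 9)    # 1 Hz .. 10 kHz
r = n_hat(2 * np.pi * f, tau_e)
# |n_hat| decreases monotonically from ~1 toward 0, and the phase
# tends to -45 degrees at high frequency.
```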
B.2 Synaptic Noise. Note that for these calculations, we use k = √(τs/τe), with τe defined in equation A.14.
B.2.1 FP Equation and Boundary Conditions. The FP equation associated with equations 3.5 and 3.6, using again the variables u (equation A.12) and z (equation 5.18), is

k^2 \tau_e \frac{\partial P}{\partial t} = \frac{1}{2}\frac{\partial^2 P}{\partial z^2} + \frac{\partial (zP)}{\partial z} - k \frac{\partial P}{\partial z} - \varepsilon e^{i\omega t}\, k \left(1 + k^2 i\omega\tau_e\right) \frac{\partial P}{\partial z} - 2kz \frac{\partial P}{\partial u}.   (B.13)

As in equation 4.17, the instantaneous firing rate is given by the probability flux in the u direction through u_θ, integrated over z,

\nu(t) = \int_{-\infty}^{+\infty} J_u(u_\theta, z, t)\, dz,   (B.14)

with J_u defined by

\frac{\partial P}{\partial t} + \frac{\partial J_u}{\partial u} + \frac{\partial J_z}{\partial z} = 0,   (B.15)

and so

J_u = \frac{2z}{k\tau_e}\, P.   (B.16)

Note that due to the factor 1/k in the expression of J_u, we have to expand P to the second order in k to obtain the first-order correction to the instantaneous firing rate. The particular boundary condition is again (compare with equation 4.18)

J_u(u_\theta, z, t) = 0 \quad \text{for all } z < 0.   (B.17)

The other boundary conditions at equal z or u are the same as in the case without synaptic relaxation and hold at any time t. We look at the linear response to the oscillation and have, at the first order in ε,

k^2 i\omega\tau_e\, \hat P = \frac{1}{2}\frac{\partial^2 \hat P}{\partial z^2} + \frac{\partial (z\hat P)}{\partial z} - k \frac{\partial \hat P}{\partial z} - k\left(1 + k^2 i\omega\tau_e\right) \frac{\partial P}{\partial z} - 2kz \frac{\partial \hat P}{\partial u}.   (B.18)

We again solve the equation by expanding the probability density in k. We note

P = P_0 + k P_1 + k^2 P_2 + \cdots,   (B.19)

\hat P = \hat P_0 + k \hat P_1 + k^2 \hat P_2 + \cdots.   (B.20)

Analytical solutions can be found in two opposite limits: at low frequencies ω ∼ 1/τe and at high frequencies ω ≫ 1/τs. To connect these two limits, we resort to numerical solutions.

B.2.2 Outer Solution.
We first define Q by

P(u, z, t) = Q(u, z, t)\, \frac{e^{-z^2}}{\sqrt{\pi}}.   (B.21)

Then we expand Q in ε and k,

Q = Q + \varepsilon e^{i\omega t}\, \hat Q + O(\varepsilon^2) = \sum_n k^n \left( Q_n + \varepsilon e^{i\omega t}\, \hat Q_n \right) + O(\varepsilon^2).   (B.22)

By inserting equation B.21 into equation 5.26, we find the equation for Q̂:

\mathcal{L}\hat Q = k\left[ \frac{\partial \hat Q}{\partial z} + 2z\left( \frac{\partial \hat Q}{\partial u} - \hat Q \right) \right] + k^2 i\omega\tau_e\, \hat Q + k\left(1 + k^2 i\omega\tau_e\right)\left( \frac{\partial Q}{\partial z} - 2zQ \right).   (B.23)

The operator \mathcal{L} has been defined in equation A.26, and the first orders of Q are given in section 4.2.3.

Order 0. At the zeroth order in k, we have

\mathcal{L}\hat Q_0 = 0.   (B.24)

Since Q̂ must vanish for z → ±∞, we necessarily have

\hat Q_0(u, z, \omega) = \hat P(u, \omega),   (B.25)

where P̂(u, ω) is given by equation B.11.

Order 1. Q̂_1 is given by

\mathcal{L}\hat Q_1 = 2z\left( \frac{\partial \hat Q_0}{\partial u} - \hat Q_0 \right) - 2z\, Q_0.   (B.26)

By noting

\hat Q_1(u, z, \omega) = \hat Q_1^0(u, \omega) + z\, \hat Q_1^1(u, \omega)   (B.27)

and inserting it in equation B.26, we find

\hat Q_1^1 = 2\left( \hat Q_0 - \frac{\partial \hat Q_0}{\partial u} \right) + 2 Q_0.   (B.28)

The expansion of Q̂ until the third order leads to an equation for Q̂_1^0,

\frac{\partial^2 \hat Q_1^0}{\partial u^2} - \frac{\partial \hat Q_1^0}{\partial u} = \frac{i\omega\tau_e}{2}\, \hat Q_1^0 + \frac{\partial Q_1^0}{\partial u}.   (B.29)

We will solve it later, after we have specified the boundary conditions for Q̂_1^0.

Order 2. Then Q̂_2 is given by

\mathcal{L}\hat Q_2 = \frac{\partial \hat Q_1}{\partial z} + 2z\left( \frac{\partial \hat Q_1}{\partial u} - \hat Q_1 \right) + \frac{\partial Q_1}{\partial z} - 2z\, Q_1 + i\omega\tau_e\, \hat Q_0.   (B.30)

Again, we look for a solution polynomial in z, and we note

\hat Q_2(u, z, \omega) = \hat Q_2^0(u, \omega) + z\, \hat Q_2^1(u, \omega) + \frac{z^2}{2}\, \hat Q_2^2(u, \omega).   (B.31)

By inserting it into equation B.30, we find for the terms proportional to z² and z,

\hat Q_2^2(u, \omega) = 2\hat Q_1^1 - 2\frac{\partial \hat Q_1^1}{\partial u} + 2 Q_1^1,   (B.32)

\hat Q_2^1(u, \omega) = 2\hat Q_1^0 - 2\frac{\partial \hat Q_1^0}{\partial u} + 2 Q_1^0.   (B.33)

Since \mathcal{L}1 = 0, the remaining terms should vanish, and we have the equation

\frac{\partial^2 \hat Q_0}{\partial u^2} - \frac{\partial \hat Q_0}{\partial u} = \frac{i\omega\tau_e}{2}\, \hat Q_0 + \frac{\partial Q_0}{\partial u}.   (B.34)
Note that a comparison of the above with equation B.8 confirms the result, equation B.25. We could obtain a similar equation for Q̂_2^0 by looking at the fourth order in k, but we do not need it, since Q̂_2^0 will not intervene in the calculation of the firing rate.

B.2.3 Threshold Solution. Following the method developed in section 4.2.2, we now look at the threshold solution, noted Q̂^T. We define

x = \frac{u_\theta - u}{k}.   (B.35)

Then equation B.23 becomes

\mathcal{L}\hat Q^T + 2z \frac{\partial \hat Q^T}{\partial x} = k \frac{\partial \hat Q^T}{\partial z} - 2kz\, \hat Q^T + k^2 i\omega\tau_e\, \hat Q^T + k\left(1 + k^2 i\omega\tau_e\right)\left( \frac{\partial Q^T}{\partial z} - 2z\, Q^T \right).   (B.36)

We first solve this equation at the orders 0 and 1 in k. The equations for Q̂_0^T and Q̂_1^T are

\mathcal{L}\hat Q_0^T + 2z \frac{\partial \hat Q_0^T}{\partial x} = 0,   (B.37)

\mathcal{L}\hat Q_1^T + 2z \frac{\partial \hat Q_1^T}{\partial x} = \frac{\partial \hat Q_0^T}{\partial z} - 2z\, \hat Q_0^T - 2z\, Q_0^T.   (B.38)

Moreover, according to equations B.25, 5.12, and B.27, we have

\lim_{x \to \infty} \hat Q_0^T = \hat Q_0(u_\theta) = 0,   (B.39)

\lim_{x \to \infty} \hat Q_1^T = \hat Q_1(u_\theta) - x \frac{\partial \hat Q_0}{\partial u}(u_\theta) = \hat Q_1^0(u_\theta) + \frac{\nu_0 \tau_e\, \hat n(\omega)}{2}\,(x + 2z).   (B.40)

Thanks to equation 5.24, we know that Q̂^T(x = 0, z, ω) = 0 for all z < 0. According to Hagan et al. (1989), the solutions are

\hat Q_0^T = 0,   (B.41)

\hat Q_1^T = \frac{\nu_0 \tau_e\, \hat n(\omega)}{2}\left( a + x + 2z + \sum_n b_n\, v_n^+(\sqrt{2}\,z)\, e^{-\sqrt{2n}\,x} \right),   (B.42)

where a, b_n, and v_n^+ have been defined in section A.3.3. We now look at the second order in k. We have

\mathcal{L}\hat Q_2^T + 2z \frac{\partial \hat Q_2^T}{\partial x} = \frac{\partial \hat Q_1^T}{\partial z} - 2z\, \hat Q_1^T + \frac{\partial Q_1^T}{\partial z} - 2z\, Q_1^T.   (B.43)

Since {1, √2 z, v_n^±(√2 z)} form a complete set of eigenfunctions of \mathcal{L} (Hagan et al., 1989), we seek a solution of the form

\hat Q_2^T(x, z) = a\!\left(\frac{x}{\sqrt{2}}\right) + \tilde a\!\left(\frac{x}{\sqrt{2}}\right)\sqrt{2}\,z + \sum_{n,\epsilon=\pm} a_n^\epsilon\!\left(\frac{x}{\sqrt{2}}\right) v_n^\epsilon(\sqrt{2}\,z).   (B.44)

By inserting it into equation B.43 and projecting it on the various orthogonal functions, we find

\frac{da}{dx}(x) - \tilde a(x) = \frac{\nu_0 \tau_e}{\sqrt{2}}\left(1 + \hat n(\omega)\right)\left( \sum_n b_n \left\langle z, v_n^+(z) \right\rangle e^{-\sqrt{2n}\,x} - (a + x) \right),

\frac{d\tilde a}{dx}(x) = 0,   (B.45)

\frac{da_n^\epsilon}{dx}(x) + \sqrt{2n}\, a_n^\epsilon(x) = \frac{\nu_0 \tau_e}{\sqrt{2}}\left(1 + \hat n(\omega)\right)\left( \left\langle \tfrac{1}{2} z^2,\, v_n^\epsilon(z) \right\rangle + \sum_{n'} b_{n'} \left\langle \frac{\partial v_{n'}^+}{\partial z}(z) - z\, v_{n'}^+(z),\; v_n^\epsilon(z) \right\rangle e^{-\sqrt{2n'}\,x} \right),

where ⟨·, ·⟩ is an inner product defined by

\langle u(z), v(z) \rangle = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{+\infty} e^{-z^2}\, u(z)\, v(z)\, dz.   (B.46)

To achieve the derivation of the first order of the firing-rate expansion in this regime, we need to know only the coefficient ã(x). According to equation B.45, we know it is a constant. It can be determined by looking at the limit x → ∞, where the threshold solution needs to match the outer solution,

\lim_{x \to \infty} \tilde a(x) = \frac{\hat Q_2^1(u_\theta)}{\sqrt{2}}.   (B.47)

According to equation B.33, we need to compute Q̂_1^0, which can be done by solving equation B.29. The boundary conditions for Q̂_1^0 are given by equation B.42 and by using equations 4.19 through 4.21. We also need to use the fact that the integral of Q̂_1^0 over u is zero. Finally, we find

\hat Q_1^0(u, \omega) = \frac{\sqrt{2}\, \nu_0\, a}{i\omega}\left[ r_+(\omega)\, e^{r_+(\omega)(u - u_\theta)} - e^{u - u_\theta} - H(u_r - u)\left( r_+(\omega)\, e^{r_+(\omega)(u - u_r)} - e^{u - u_r} \right) \right],   (B.48)

where r_+(ω) has been defined in equation B.10. Introducing this last equation in equations B.33 and B.47, we find

\tilde a(x) = 0.   (B.49)

We can now compute the first order of the correction,

\hat\nu_1(t) = \int_{-\infty}^{+\infty} \frac{2z}{\tau_e}\, \hat P_2(u_\theta, z, t)\, dz = \frac{2}{\tau_e} \left\langle z,\, \hat Q_2^T(0, z) \right\rangle e^{i\omega t} = 0.   (B.50)
The last result arises because we have ⟨z, v_n^ε(z)⟩ = 0 for all n and ε.

Appendix C: LIF Neuron

C.1 Stationary Input—Synaptic Noise. We follow the same method as for the SIF case (see section A.3).

C.1.1 Outer Solution. We note

y = \frac{V - \mu}{\sigma}.

The probability density P(y, z) is

P(y, z) = \frac{e^{-z^2}}{\sqrt{\pi}}\left[ Q_0(y) + \sqrt{\frac{\tau_s}{\tau_m}}\left( Q_1^0(y) + z\, Q_1^1(y) \right) \right],

where Q_0 is given by equation 4.10,

Q_1^0(y) = a\, \nu_0 \tau_m \sqrt{2}\left[ -\left( Y(y_\theta) - Y(y_r) \right) Q_0(y) + e^{-y^2}\left( e^{y_\theta^2} - e^{y_r^2}\, H(y_r - y) \right) \right],   (C.1)
and Q_1^1 by

Q_1^1(y) = -2 y_\theta\, Q_0(y) - \frac{\partial Q_0}{\partial y}.
C.1.2 Boundary-Layer Solutions. The threshold and reset solutions are identical to those of the SIF neuron, equations 4.32.

C.2 Oscillatory Input—White Noise. The stochastic equation we consider is now

\tau_m \frac{dV}{dt} = -V(t) + \mu\left(1 + \varepsilon e^{i\omega t}\right) + \sigma \sqrt{\tau_m}\, \eta(t),   (C.2)

where η(t) is a gaussian white noise defined as in equation 3.1. The equivalent Fokker-Planck equation is

\tau_m \frac{\partial P_0}{\partial t} = \frac{\sigma^2 \tau_m}{2} \frac{\partial^2 P_0}{\partial V^2} + \frac{\partial}{\partial V}\left[ V - \mu\left(1 + \varepsilon e^{i\omega t}\right) \right] P_0.   (C.3)

The boundary conditions are again identical to the SIF case. We expand both the p.d.f. and the firing rate in the modulation amplitude ε. The amplitude of the modulation in the probability, P̂_0(V, ω), is given by the first order in that expansion,

\frac{\sigma^2 \tau_m}{2} \frac{\partial^2 \hat P_0}{\partial V^2} + \frac{\partial}{\partial V}\left[ V - \mu \right] \hat P_0 - i\omega\tau_m\, \hat P_0 - \mu \frac{\partial P_0}{\partial V} = 0.   (C.4)
The main difficulty compared to the SIF case is that the general solution to equation C.4 that gives P̂ is no longer a linear combination of exponentials (see equations B.8 and B.10) but rather a linear combination of confluent hypergeometric functions (Brunel & Hakim, 1999). Note that these solutions can also be expressed in terms of parabolic cylinder functions (Melnikov, 1993; Lindner & Schimansky-Geier, 2001). However, these functions share the asymptotic properties of the exponentials found in the SIF case; in the high-frequency limit, they behave as exp(√(2iωτm) (V − μ)/σ). The dynamical linear response of the neuron is

\hat n(\omega) = \frac{\mu}{\sigma\left(1 + i\omega\tau_m\right)}\; \frac{\frac{\partial U}{\partial y}(y_\theta, \omega) - \frac{\partial U}{\partial y}(y_r, \omega)}{U(y_\theta, \omega) - U(y_r, \omega)},   (C.5)

where y_θ = (θ − μ)/σ, y_r = (V_r − μ)/σ, and U is given in terms of combinations of hypergeometric functions (Abramowitz & Stegun, 1970),

U(y, \omega) = \frac{e^{y^2}}{\Gamma\!\left(\frac{1 + i\omega\tau_m}{2}\right)}\, M\!\left( \frac{1 - i\omega\tau_m}{2}, \frac{1}{2}, -y^2 \right) + \frac{2y\, e^{y^2}}{\Gamma\!\left(\frac{i\omega\tau_m}{2}\right)}\, M\!\left( 1 - \frac{i\omega\tau_m}{2}, \frac{3}{2}, -y^2 \right).   (C.6)

In the high-frequency limit,

\hat n(\omega) \sim \sqrt{\frac{2}{i\omega\tau_e}}.   (C.7)
Acknowledgments
We thank Larry Abbott, Frances Chance, Vincent Hakim, David Hansel, and Carl van Vreeswijk for useful discussions related to the topics discussed in this article and for their comments on the manuscript.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates I: Substrate—spikes, rates and neuronal gain. Network, 2, 259–274.
Angulo, M. C., Rossier, J., & Audinat, E. (1999). Postsynaptic glutamate receptors and integrative properties of fast-spiking interneurons in the rat neocortex. J. Neurophysiol., 82, 1295–1302.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Brunel, N., Chance, F., Fourcaud, N., & Abbott, L. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Brunel, N., & Sergi, S. (1998). Firing frequency of integrate-and-fire neurons with finite synaptic time constants. J. Theor. Biol., 195, 87–95.
Burkitt, A. N., & Clark, G. M. (1999). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output. Neural Computation, 11, 871–901.
Burns, B. D., & Webb, A. C. (1976). The spontaneous activity of neurons in the cat's cerebral cortex. Proc. R. Soc. Lond. B, 194, 211–223.
Chance, F. (2000). Modeling cortical dynamics and the responses of neurons in the primary visual cortex. Unpublished doctoral dissertation, Brandeis University.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I.
Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 1–25). Cambridge, MA: MIT Press.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Ermentrout, G. B. (1998). Neural networks as spatio-temporal pattern-forming systems. Rep. Prog. Phys., 61, 353–430.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate and fire neurons. Neural Computation, 11, 633–652.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Hagan, P. S., Doering, C. R., & Levermore, C. D. (1989). Mean exit times for particles driven by weakly colored noise. SIAM J. Appl. Math., 49, 1480–1513.
Haskell, E., Nykamp, D., & Tranchina, D. (2001). Population density methods for large-scale modelling of neuronal networks with realistic synaptic kinetics: Cutting the dimension down to size. Network: Comput. Neural Syst., 12, 141–174.
Hestrin, S., Sah, P., & Nicoll, R. (1990). Mechanisms generating the time course of dual component excitatory synaptic currents recorded in hippocampal slices. Neuron, 5, 247–253.
Karatzas, I., & Shreve, S. (1988). Brownian motion and stochastic calculus. Berlin: Springer-Verlag.
Klosek, M. M., & Hagan, P. S. (1998). Colored noise and a characteristic level crossing problem. J. Math. Phys., 39, 931–953.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Knight, B. W., Omurtag, A., & Sirovich, L. (2000). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Computation, 12, 1045–1055.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. (Paris), 9, 620–635.
Lindner, B., & Schimansky-Geier, L. (2001). Transmission of noise coded versus additive signals through a neuronal ensemble. Phys. Rev. Lett., 86, 2934–2937.
Marsalek, P. R., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci., 94, 735–740.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons in the neocortex. J. Neurophysiol., 54, 782–806.
Melnikov, V. I. (1993). Schmitt trigger: A solvable model of stochastic resonance. Phys. Rev. E, 48, 2481–2489.
Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–30.
Nicolas Fourcaud and Nicolas Brunel
Nykamp, D. Q., & Tranchina, D. (2001). A population density approach that facilitates large-scale modeling of neural networks: Extension to slow inhibitory synapses. Neural Computation, 13, 511–546.
Pinto, D. J., Brumberg, J. C., Simons, D. J., & Ermentrout, G. B. (1996). A quantitative population model of whisker barrels: Re-examining the Wilson-Cowan equations. J. Comput. Neurosci., 3, 247–264.
Risken, H. (1984). The Fokker-Planck equation: Methods of solution and applications. Berlin: Springer-Verlag.
Sah, P., Hestrin, S., & Nicoll, R. A. (1990). Properties of excitatory postsynaptic currents recorded in vitro from rat hippocampal interneurones. J. Physiol., 430, 605–616.
Salin, P. A., & Prince, D. A. (1996). Spontaneous GABAA receptor mediated inhibitory currents in adult rat somatosensory cortex. J. Neurophysiol., 75, 1573–1588.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20, 6193–6209.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Spruston, N., Jonas, P., & Sakmann, B. (1995). Dendritic glutamate receptor channels in rat hippocampal CA3 and CA1 pyramidal neurons. J. Physiol., 482, 325–352.
Tiesinga, P. H. E., Jose, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
van Rossum, M. C. W. (2001). The transient precision of integrate-and-fire neurons: Effect of background activity and noise. J. Comput. Neurosci., 10, 303–311.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.
Xiang, Z., Huguenard, J. R., & Prince, D. A. (1998). GABAA receptor mediated currents in interneurons and pyramidal cells of rat visual cortex. J. Physiol., 506, 715–730.

Received August 23, 2001; accepted March 22, 2002.
LETTER
Communicated by Daniel Tranchina
Integrate-and-Fire Neurons Driven by Correlated Stochastic Input

Emilio Salinas
[email protected]
Department of Neurobiology and Anatomy, Wake Forest University School of Medicine, Winston-Salem, NC 27157-1010, U.S.A.

Terrence J. Sejnowski
[email protected]
Computational Neurobiology Laboratory, Howard Hughes Medical Institute, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92093, U.S.A.

Neurons are sensitive to correlations among synaptic inputs. However, models that explicitly include correlations are hard to solve analytically, so their influence on a neuron's response has been difficult to ascertain. To gain some intuition on this problem, we studied the firing times of two simple integrate-and-fire model neurons driven by a correlated binary variable that represents the total input current. Analytic expressions were obtained for the average firing rate and coefficient of variation (a measure of spike-train variability) as functions of the mean, variance, and correlation time of the stochastic input. The results of computer simulations were in excellent agreement with these expressions. In these models, an increase in correlation time in general produces an increase in both the average firing rate and the variability of the output spike trains. However, the magnitude of the changes depends differentially on the relative values of the input mean and variance: the increase in firing rate is higher when the variance is large relative to the mean, whereas the increase in variability is higher when the variance is relatively small. In addition, the firing rate always tends to a finite limit value as the correlation time increases toward infinity, whereas the coefficient of variation typically diverges. These results suggest that temporal correlations may play a major role in determining the variability as well as the intensity of neuronal spike trains.

1 Introduction
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2111–2155 (2002).

Cortical neurons are driven by thousands of synaptic inputs (Braitenberg & Schüz, 1997). Therefore, one way to understand the integration properties of a typical cortical neuron is to consider its total input as a stochastic
variable with given statistical properties and calculate the statistics of the response (Gerstein & Mandelbrot, 1964; Ricciardi, 1977, 1995; Tuckwell, 1988, 1989; Smith, 1992; Shinomoto, Sakai, & Funahashi, 1999; Tiesinga, José, & Sejnowski, 2000). A common practice is to assume that the total input has a gaussian distribution with given mean and variance and that samples taken at different times are uncorrelated. Input correlations make the statistics of the output spike trains quite difficult to calculate, even for simple model neurons. Ignoring the correlations may actually be a reasonable approximation if their timescale is much smaller than the average time interval between output spikes. However, in cortical neurons, factors like synaptic and membrane time constants, as well as synchrony in the input spike trains (see, for example, Nelson, Salin, Munk, Arzi, & Bullier, 1992; Steinmetz et al., 2000), give rise to a certain input correlation time τ_corr (Svirskis & Rinzel, 2000), and this may be of the same order of magnitude as the mean interspike interval of the evoked response. Therefore, the impact of temporal correlations needs to be characterized, at least to a first-order approximation. Some advances in this direction have been made from somewhat different points of view (Feng & Brown, 2000; Salinas & Sejnowski, 2000; Svirskis & Rinzel, 2000). Here we first consider a simple integrate-and-fire model without a leak current in which the voltage is driven by a stochastic input until a threshold is exceeded, triggering a spike. The origins of this model go back to the work of Gerstein and Mandelbrot (1964), who studied a similar problem (without input correlations) and solved it, in the sense that they determined the probability density function for the interspike intervals (the times between consecutive action potentials). Our approach is similar in spirit, but the models differ in two important aspects.
First, in their theoretical model, Gerstein and Mandelbrot did not impose a lower bound on the model neuron's voltage, so in principle it could become infinitely negative (note, however, that they did use a lower bound in their simulations). Second, they used an uncorrelated gaussian input with a given mean and variance, where the mean had to be positive; otherwise, the expected time between spikes became infinite. In contrast, we use a correlated binary input and impose a hard lower bound on the voltage of the model neuron, which roughly corresponds to an inhibitory reversal potential; we call it the barrier. When the barrier is set at a finite distance from the threshold, solving for the full probability density distribution of the firing time becomes difficult, but such a barrier has two advantages: the moments of the firing time may be calculated explicitly with correlated input, and the mean of the input may be arbitrary. We also investigate the responses of the more common integrate-and-fire model that includes a leak current (Tuckwell, 1988; Troyer & Miller, 1997; Salinas & Sejnowski, 2000; Dayan & Abbott, 2001), driven by the same correlated stochastic input. In this case, however, the solutions for the moments of the firing time are obtained in terms of series expansions.
We calculate the response of these model neurons to a stochastic input with a certain correlation time τ_corr. The underlying problem that we seek to capture and understand is this. The many inputs that drive a neuron may be correlated with one another; for example, they may exhibit some synchrony. The synchrony patterns may change dynamically (Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Salinas & Sejnowski, 2001), so the question is how these changes affect the response of a postsynaptic neuron. This problem is of considerable interest in the field (Maršálek, Koch, & Maunsell, 1997; Burkitt & Clark, 1999; Diesmann, Gewaltig, & Aertsen, 1999; Kisley & Gerstein, 1999; Feng & Brown, 2000; Salinas & Sejnowski, 2000, 2001). We look at this problem by collapsing all separate input streams into a single stochastic input with a given correlation time, which depends on the synchrony or correlation structure among those streams. Thus, the stochastic input that we consider is meant to reflect, in a simplified way, the total summed effect of the original input streams, taking into account their correlation structure. The precise mapping between these streams and the parameters of the stochastic input (mean, variance, and correlation time) is difficult to calculate analytically, but qualitatively, one may expect higher synchrony to produce higher input variance (Salinas & Sejnowski, 2000, 2001; Feng & Brown, 2000) and longer correlation times between streams to produce larger values of τ_corr. The methods applied here are related to random walk and diffusion models in physics (Feynman, Leighton, & Sands, 1963; Van Kampen, 1981; Gardiner, 1985; Berg, 1993) and are similar to those used earlier (Gerstein & Mandelbrot, 1964; Ricciardi, 1977, 1995; Shinomoto et al., 1999) to study the firing statistics of simple neuronal models in response to a temporally uncorrelated stochastic input (see Tuckwell, 1988, 1989; Smith, 1992, for reviews).
Similar approaches have also been used recently to study neural populations (Nykamp & Tranchina, 2000; Haskell, Nykamp, & Tranchina, 2001). Here we extend such methods to the case of a correlated binary input. What we find is that an increase in τ_corr typically increases both the mean firing rate and the variability of the output spike trains, but the strength of these effects depends differentially on the relative values of the input mean and variance. In addition, the behaviors obtained as τ_corr becomes large are also different: the rate saturates, whereas the variability may diverge. Thus, neural responses may be strongly modulated through changes in the temporal correlations of their driving input.

2 The Nonleaky Integrate-and-Fire Model
This model neuron has a voltage V that changes according to the input that impinges on it, such that

\[ \tau \frac{dV}{dt} = \mu_0 + X(t), \qquad (2.1) \]
where μ_0 and τ are constants. The first term on the right, μ_0, corresponds to a mean or steady component of the input, the drift, whereas the second term corresponds to the variable or stochastic component, which has zero mean. We refer to X as the fluctuating current, or simply the current. In this model, an action potential, or spike, is produced when V exceeds a threshold V_h. After this, V is reset to an initial value V_reset, and the integration process continues evolving according to equation 2.1. This model is related to the leaky integrate-and-fire model (Tuckwell, 1988; Troyer & Miller, 1997; Salinas & Sejnowski, 2000; Dayan & Abbott, 2001) and to the Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930; Ricciardi, 1977, 1995; Shinomoto et al., 1999), but is different because it lacks a term proportional to −V on the right-hand side of equation 2.1 (see below). An additional and crucial constraint is that V cannot fall below a preset value, which acts as a barrier. This barrier is set at V = 0. This is a matter of convenience; the actual value makes no difference in the results. Thus, only positive values of V are allowed. Except for the barrier, this model neuron acts as a perfect integrator: it accumulates input samples in a way that is mostly independent of the value of V itself; τ is the time constant of integration. We examine the time T that it takes for V to go from reset to threshold; T is known as the first passage time and is the quantity whose statistics we wish to determine.

3 Stochastic Input
To complete the description of the model, we need to specify the statistics of the input X. We will consider two cases: one in which X is uncorrelated gaussian noise, so X(t) and X(t+Δt) are independent, and another where X is correlated noise, so X(t) and X(t+Δt) tend to be similar. The former is the traditional approach; the latter is novel and is probably a more accurate model of real input currents. In both cases, we will make use of a binary random variable Z(t), such that

\[ Z = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2. \end{cases} \qquad (3.1) \]

The relationship between Z and X depends on whether X represents uncorrelated or correlated noise; this will be specified below. At this point, however, it is useful to describe the properties of Z. The overall probabilities of observing a positive or a negative Z are the same and equal to 1/2, which means that the random variable has zero mean and unit variance:

\[ \overline{Z} = 0 \qquad (3.2) \]

\[ \overline{Z^2} = 1. \qquad (3.3) \]
Here, the bar indicates an average over samples, or expectation. This, together with the correlation function, defines the statistics of Z. When the input current X is uncorrelated gaussian noise, the Z variable is also uncorrelated, so

\[ \overline{Z(t_1)\,Z(t_1+t)} = \delta(t), \qquad (3.4) \]
where δ is the Dirac delta function. Expression 3.4 means that any two samples observed at different times are independent. When the input current X represents correlated noise, the Z variable is temporally correlated, in which case

\[ \overline{Z(t_1)\,Z(t_1+t)} = \exp\left(-\frac{|t|}{\tau_{\rm corr}}\right). \qquad (3.5) \]

Equation 3.5 means that the correlation between subsequent Z samples falls off exponentially with a time constant τ_corr. In this case, although the probabilities for positive and negative Z are equal, the conditional probabilities given prior observations are not. That is exactly what the correlation function expresses: samples taken close together in time tend to be more similar to each other than expected by chance, so the probability that the sample Z(t+Δt) is positive is greater if the sample Z(t) was positive than if it was negative. In equation 3.5, a single quantity parameterizes the degree of correlation, τ_corr. This correlation time can also be expressed as a function of a single probability P_c, which is the probability that the sign of the next sample, Z(t+Δt), is different from the sign of the previous one, Z(t). The probability that the next sample has the same sign as the previous one is termed P_s and is equal to 1 − P_c. For small Δt, the relationship between τ_corr and P_c is

\[ P_c = \frac{\Delta t}{2\tau_{\rm corr}}. \qquad (3.6) \]
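The relationship between P_c and τ_corr is easy to check numerically. The sketch below (NumPy; the parameter values are illustrative, not taken from the paper) generates a correlated binary sequence by flipping the sign of Z with probability P_c at each step and verifies that the lag-1 correlation is close to 1 − 2P_c and that longer lags decay as in equation 3.5:

```python
import numpy as np

def correlated_binary(n_steps, dt, tau_corr, rng):
    """Binary sequence Z in {+1, -1} whose sign flips with
    probability P_c = dt / (2 * tau_corr) at each time step."""
    p_c = dt / (2.0 * tau_corr)
    z = np.empty(n_steps)
    z[0] = rng.choice([-1.0, 1.0])
    flips = rng.random(n_steps - 1) < p_c
    # a cumulative product of -1's at the flip times yields the sign sequence
    z[1:] = z[0] * np.cumprod(np.where(flips, -1.0, 1.0))
    return z

rng = np.random.default_rng(0)
dt, tau_corr = 0.1, 5.0                     # ms; illustrative values
z = correlated_binary(1_000_000, dt, tau_corr, rng)

# lag-1 correlation: should be close to 1 - 2*P_c = 1 - dt/tau_corr
lag1 = np.mean(z[:-1] * z[1:])
# lag-50 correlation: should be close to exp(-50*dt/tau_corr), equation 3.5
lag50 = np.mean(z[:-50] * z[50:])
print(lag1, lag50)
```

The cumulative-product trick avoids a Python-level loop; each −1 in the product flips all subsequent signs, which is exactly what a sign change at that step does.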
Equation 3.6 can be seen as follows. First, calculate the correlation between consecutive Z samples directly in terms of P_c:

\[ \overline{Z(t_1)\,Z(t_1+\Delta t)} = \frac{1}{2}\left[(1)(-1)P_c + (1)(1)P_s + (-1)(1)P_c + (-1)(-1)P_s\right] = 1 - 2P_c. \qquad (3.7) \]

The first term in the sum, for instance, corresponds to Z(t) = 1 and Z(t+Δt) = −1; since the probability that Z(t) = 1 is 1/2 and the probability that the next sample has a different sign is P_c, the product (1)(−1) has a probability (1/2)P_c. Similar calculations apply to the other three terms. Now calculate the same average by expanding equation 3.5 in a Taylor
series, using t = Δt. The resulting expression must be analogous to the equation above, and this leads to equation 3.6. Finally, notice that as the correlation time becomes shorter and approaches Δt, P_s → 1/2, and Z values separated by a fixed time interval tend to become independent. On the other hand, P_s → 1 corresponds to a long correlation time, in which case the sequence of binary input samples consists of long stretches in which all samples are equal to +1 or −1. Using the binary variable Z makes it possible to include correlations in the integrator model, and these can be varied by adjusting the value of P_c or, equivalently, τ_corr.

3.1 Discretizing Noise. The noisy current X(t) is a continuous variable, but in order to proceed, it needs to be discretized. This is a subtle technical point, because the discretization depends on whether the noise is correlated. Gaussian white noise can be approximated by setting X at time step j equal to

\[ X_j = Z_j\,\frac{\sigma_0}{\sqrt{\Delta t}} = \pm\frac{\sigma_0}{\sqrt{\Delta t}}, \qquad (3.8) \]
with equal probabilities for positive and negative values and with consecutive Z samples being independent. Here, the constant σ_0 measures the strength of the fluctuating current. In contrast, correlated noise can be approximated by setting X at time step j equal to

\[ X_j = Z_j\,\sigma_1 = \pm\sigma_1, \qquad (3.9) \]
where, in this case, consecutive Z values are not independent: the probability of the next Z changing sign is equal to P_c. The latter expression is relatively straightforward, but the former includes a \(\sqrt{\Delta t}\) factor that may seem odd. The rationale for this is as follows. The variance of the integral of X from 0 to t should be proportional to t; that is an essential property of noise. For gaussian white noise, when time is a continuous variable, this property follows immediately from the fact that the correlation between X samples is a delta function (as in equation 3.4). With discretized time, we should be able to approximate the integral with a sum over consecutive samples, \(\sum_j^N X_j\,\Delta t\), where \(t = N\Delta t\). For gaussian white noise, the variance of this sum is

\[ \Delta t^2 \sum_{j,k}^{N} \overline{X_j X_k} = \Delta t^2 \sum_{j,k}^{N} \delta_{jk}\, c^2 = c^2\, t\, \Delta t. \qquad (3.10) \]
Here, c is a constant and δ_{jk} is equal to 1 whenever j = k and to 0 otherwise. The δ_{jk} appears because X samples are independent. To make this variance proportional to t and independent of the time step, the proportionality constant c should be divided by \(\sqrt{\Delta t}\), as in equation 3.8. Note also that as
Δt → 0, the sum approaches a gaussian random variable even if individual terms are binary, as a result of the central limit theorem. On the other hand, when the noise is correlated, the analogous calculation involves a correlation function with a finite amplitude. As a consequence, the variance of the sum of X samples ends up being independent of Δt without the need of an additional factor. With an exponential correlation function, the variance of \(\sum_j^N X_j\,\Delta t\) is equal to

\[ 2\sigma_1^2\,\tau_{\rm corr}\left[\, t - \tau_{\rm corr}\left(1 - \exp\left(-\frac{t}{\tau_{\rm corr}}\right)\right)\right], \qquad (3.11) \]

which is proportional to τ_corr. Details of the calculation are omitted. The key point is that it uses equation 3.9 with an exponential correlation function, equation 3.5, and does not require an extra \(\sqrt{\Delta t}\). Appendixes A and B describe how random numbers with these properties can be generated in practice.

4 The Moments of T of the Nonleaky Neuron Driven by Uncorrelated Noise
First, we briefly review the formalism used to characterize the statistics of the first passage time when the input current is described by uncorrelated gaussian noise. Then we generalize these derivations to the case of correlated input, which is presented further below. The quantity of interest is the random variable T, the time that it takes for the voltage V to go from reset to threshold. Its characterization depends on the probability density function f(T, V), where Δt f(T, V) is the probability that, starting from a voltage V, threshold is reached between T and T + Δt time units later. The fluctuating input current in this case is described by gaussian white noise. To compute the density function f, equation 2.1 needs to be discretized first, so that time runs in steps of size Δt. As shown above, when time is discretized, gaussian white noise can be approximated by setting X in each time step equal to \(Z\sigma_0/\sqrt{\Delta t}\). Thus, for an uncorrelated gaussian current, the discretized version of equation 2.1 is

\[ V(t+\Delta t) = V(t) + \left(\mu + \frac{\sigma}{\sqrt{\Delta t}}\,Z\right)\Delta t, \qquad (4.1) \]
where Z = ±1 and consecutive samples are uncorrelated. Also, we have defined

\[ \mu \equiv \frac{\mu_0}{\tau}, \qquad \sigma \equiv \frac{\sigma_0}{\tau}. \qquad (4.2) \]
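The discretized model of equations 4.1 and 4.2 can be simulated directly. A minimal sketch (NumPy), using the parameters quoted in the Figure 1 caption for the zero-drift case (μ = 0, σ = 0.2, V_h = 1, V_reset = 1/3, Δt = 0.1 ms) and enforcing the barrier by clamping V at zero:

```python
import numpy as np

def simulate_isis(mu, sigma, n_spikes, dt=0.1, v_th=1.0, v_reset=1.0/3.0, seed=0):
    """Interspike intervals (ms) of the nonleaky integrate-and-fire
    neuron of equation 4.1, driven by uncorrelated binary noise."""
    rng = np.random.default_rng(seed)
    amp = sigma / np.sqrt(dt)        # instantaneous amplitude of the noise term
    v, steps, isis = v_reset, 0, []
    while len(isis) < n_spikes:
        z = 1.0 if rng.random() < 0.5 else -1.0
        v += (mu + amp * z) * dt
        v = max(v, 0.0)              # barrier at V = 0
        steps += 1
        if v >= v_th:                # spike: record interval, then reset
            isis.append(steps * dt)
            v, steps = v_reset, 0
    return np.array(isis)

isis = simulate_isis(mu=0.0, sigma=0.2, n_spikes=1000)
mean_T = isis.mean()
cv = isis.std() / mean_T
print(mean_T, cv)   # should land near the 22 ms and 0.91 quoted in Figure 1b
```

The finite step size introduces a small overshoot bias at the threshold, so the simulated mean interval agrees with the analytic value only to within a few percent at this Δt.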
Figure 1 shows examples of spike trains produced by the model when the input is binary and uncorrelated. When there is no drift (μ = 0), as in Figures 1a through 1c, firing is irregular; the probability distribution of the times between spikes is close to an exponential (see Figure 1b), as for a Poisson process. Something different happens when there is a strong drift, as in Figures 1d through 1f: the evoked spike train is fairly regular, and the times between spikes cluster around a mean value (see Figure 1e). Thus, as observed before (Troyer & Miller, 1997; Salinas & Sejnowski, 2000), there
are two regimes: one in which the drift provides the main drive and firing is regular, and another in which the variance provides the main drive and firing is irregular. The quantitative dependence of the mean firing rate and variability on the input parameters μ, σ, and P_c is derived below.

4.1 Equations for the Moments of T. Now we proceed with the derivation of f(T, V). The fundamental relationship that this function obeys is

\[ f(T, V(t)) = \overline{f(T-\Delta t,\, V(t+\Delta t))}. \qquad (4.3) \]
Here, as above, the bar indicates averaging over the input ensemble, or expectation. This equation describes how the probability of reaching threshold changes after a single time step: if at time t the voltage is V and T time units remain until threshold, then at time t + Δt, the voltage will have shifted to a new value V(t+Δt) and, on average, threshold will be reached one Δt sooner. To proceed, substitute equation 4.1 into the above expression and average by including the probabilities for positive and negative input samples to obtain

\[ f(T+\Delta t, V) = \frac{1}{2} f\!\left(T,\, V + \mu\Delta t + \sigma\sqrt{\Delta t}\right) + \frac{1}{2} f\!\left(T,\, V + \mu\Delta t - \sigma\sqrt{\Delta t}\right). \qquad (4.4) \]
Here V(t) is just V, and to simplify the next step we also made the shift T → T + Δt. Next, expand each of the three f terms in a Taylor series around f(T, V). This gives rise to a partial derivative with respect to T on the left-hand side and to partial derivatives with respect to V on the right. In the latter case, second-order derivatives are needed to avoid cancelling all the terms with σ. After simplifying the resulting expression, take the limit Δt → 0 to obtain

\[ \frac{\sigma^2}{2}\frac{\partial^2 f}{\partial V^2} + \mu\frac{\partial f}{\partial V} = \frac{\partial f}{\partial T}. \qquad (4.5) \]

Figure 1: Facing page. Responses of the nonleaky integrate-and-fire model driven by stochastic, binary noise. Input samples are uncorrelated, so at each time step they are equally likely to be either positive or negative, regardless of previous values. (a) Sample voltage and input time courses. Traces represent 25 ms of simulation time, with Δt = 0.1 ms. The top trace shows the model neuron's voltage V along with the spikes generated when V exceeds the threshold V_h = 1. After each spike, the voltage excursion restarts at V_reset = 1/3. The bottom traces show the total input \(\mu + Z\sigma/\sqrt{\Delta t}\) at each time step, where Z may be either +1 or −1. The input has zero mean, and firing is irregular. Parameters in equation 4.1 were μ = 0, σ = 0.2. (b) Frequency histogram of interspike intervals from a sequence of approximately 4,000 spikes; bin size is 2 ms. Similar numbers were used in histograms of other figures. Here ⟨T⟩ = 22 ms, CV_ISI = 0.91. (c) Spike raster showing 6 seconds of continuous simulation time; each line represents 1 second. The same parameters were used in a through c. Panels d through f have the same formats and scales as the respective panels above, except that μ = 0.03 and σ = 0.05. The input has a strong positive drift, and firing is much more regular. In this case, ⟨T⟩ = 22 ms, CV_ISI = 0.35.
This has the general form of the Fokker-Planck equation that arises in the study of Brownian motion and diffusion processes (Van Kampen, 1981; Gardiner, 1985; Risken, 1996). In this case, it should satisfy the following boundary conditions. First,

\[ \frac{\partial f(T, 0)}{\partial V} = 0, \qquad (4.6) \]

which corresponds to the presence of the barrier at V = 0. Second,

\[ f(T, V_h) = \delta(T), \qquad (4.7) \]
which means that once threshold is reached, it is certain that a spike will occur with no further delay. From equation 4.5 and these boundary conditions, expressions for the mean, variance, and higher moments of T may be obtained. To do this, multiply equation 4.5 by T^q and integrate each term over T. The result is an equation for the moment of T of order q:

\[ \frac{\sigma^2}{2}\frac{d^2\langle T^q\rangle}{dV^2} + \mu\frac{d\langle T^q\rangle}{dV} = -q\,\langle T^{q-1}\rangle, \qquad (4.8) \]
where the angle brackets denote averaging over the probability distribution of T, that is,

\[ \langle T^q\rangle = \langle T^q\rangle(V) = \int f(T, V)\, T^q\, dT. \qquad (4.9) \]

The first equality above indicates that the moments of T are, in general, functions of V, although in some instances the explicit dependence will be omitted for brevity. The equation for the mean first passage time ⟨T⟩ is obtained when q = 1:

\[ \frac{\sigma^2}{2}\frac{d^2\langle T\rangle}{dV^2} + \mu\frac{d\langle T\rangle}{dV} = -1. \qquad (4.10) \]
Similarly, the differential equation for the second moment follows by setting q = 2 in equation 4.8:

\[ \frac{\sigma^2}{2}\frac{d^2\langle T^2\rangle}{dV^2} + \mu\frac{d\langle T^2\rangle}{dV} = -2\,\langle T\rangle. \qquad (4.11) \]
Notice that to solve this equation, ⟨T⟩ needs to be computed first. Knowing the second moment of T is useful in quantifying the variability of the spike train produced by the model neuron. One measure that is commonly used to evaluate spike-train variability is the coefficient of variation CV_ISI, which is equal to the standard deviation of the interspike intervals (the times between successive spikes) divided by their mean. Since the interspike interval is simply T,

\[ {\rm CV}_{\rm ISI} = \frac{\sqrt{\langle T^2\rangle - \langle T\rangle^2}}{\langle T\rangle}. \qquad (4.12) \]
For comparison, it is helpful to keep in mind that the CV_ISI of a Poisson process equals 1. Boundary conditions for the moments of T are also necessary, but these can be obtained by the same procedure just described: multiply equations 4.6 and 4.7 by T^q and integrate over T. The result is

\[ \frac{d\langle T^q\rangle}{dV}(V=0) = 0, \qquad \langle T^q\rangle(V=V_h) = 0. \qquad (4.13) \]
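The two-point boundary-value problems defined by equations 4.8 and 4.13 can also be solved numerically, which provides an independent check on the closed-form solutions. A sketch using a standard second-order finite-difference scheme (NumPy; the grid size is arbitrary, and the parameters are the zero-drift ones of Figure 1a):

```python
import numpy as np

def moment_fd(rhs, mu, sigma, v_th, n=400):
    """Solve (sigma^2/2) m'' + mu m' = rhs(V) on [0, v_th] with
    m'(0) = 0 and m(v_th) = 0 (equations 4.8 and 4.13)."""
    h = v_th / n
    x = np.linspace(0.0, v_th, n + 1)
    A = np.zeros((n, n))               # unknowns m_0 .. m_{n-1}; m_n = 0
    b = rhs(x[:n]).astype(float).copy()
    d2, d1 = sigma**2 / (2 * h**2), mu / (2 * h)
    for i in range(1, n):
        A[i, i - 1] = d2 - d1
        A[i, i] = -2 * d2
        if i + 1 < n:
            A[i, i + 1] = d2 + d1      # the m_n = 0 term drops out
    A[0, 0], A[0, 1] = -2 * d2, 2 * d2  # ghost node enforces m'(0) = 0
    m = np.linalg.solve(A, b)
    return x[:n], m

mu, sigma, v_th, v_reset = 0.0, 0.2, 1.0, 1.0 / 3.0
x, T1 = moment_fd(lambda v: -np.ones_like(v), mu, sigma, v_th)          # eq. 4.10
_, T2 = moment_fd(lambda v: -2 * np.interp(v, x, T1), mu, sigma, v_th)  # eq. 4.11
mean_T = np.interp(v_reset, x, T1)
cv = np.sqrt(np.interp(v_reset, x, T2) - mean_T**2) / mean_T
print(mean_T, cv)   # close to the 22 ms and 0.91 quoted for Figure 1a-c
```

The ghost-node row implements the reflecting condition of equation 4.13 at V = 0; the absorbing condition at V = V_h simply removes the last unknown from the linear system.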
4.2 General Solutions (μ ≠ 0). The solution to equation 4.10 may be found by standard methods. Taking into account the boundary conditions above with q = 1, we obtain

\[ \langle T\rangle(V) = \psi_1(V_h) - \psi_1(V), \qquad \psi_1(x) = \frac{x}{\mu} + \frac{\sigma^2}{2\mu^2}\exp\left(-\frac{2\mu x}{\sigma^2}\right). \qquad (4.14) \]
To obtain the mean time from reset to threshold, evaluate equations 4.14 at V = V_reset. After inserting these expressions into the right-hand side of equation 4.11 and imposing the corresponding boundary conditions, the solution for ⟨T²⟩ becomes

\[ \langle T^2\rangle(V) = \psi_2(V_h) - \psi_2(V), \]
\[ \psi_2(x) = \left(\frac{2\psi_1(V_h)}{\mu} + \frac{\sigma^2}{\mu^3}\right)x - \frac{x^2}{\mu^2} + \left(\frac{\sigma^2\,\psi_1(V_h)}{\mu^2} + \frac{\sigma^4}{\mu^4} + \frac{\sigma^2\,x}{\mu^3}\right)\exp\left(-\frac{2\mu x}{\sigma^2}\right), \qquad (4.15) \]

where ψ_1 is the function defined in equations 4.14.
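As a quick consistency check, evaluating equations 4.14 and 4.15 at V = V_reset with the Figure 1d parameters (μ = 0.03, σ = 0.05, V_h = 1, V_reset = 1/3) reproduces the quoted ⟨T⟩ ≈ 22 ms and CV_ISI ≈ 0.35. A sketch:

```python
import numpy as np

def psi1(x, mu, sigma):
    # equation 4.14
    return x / mu + sigma**2 / (2 * mu**2) * np.exp(-2 * mu * x / sigma**2)

def psi2(x, mu, sigma, v_th):
    # equation 4.15, with A = psi1(V_h)
    A = psi1(v_th, mu, sigma)
    return ((2 * A / mu + sigma**2 / mu**3) * x - x**2 / mu**2
            + (sigma**2 * A / mu**2 + sigma**4 / mu**4 + sigma**2 * x / mu**3)
            * np.exp(-2 * mu * x / sigma**2))

mu, sigma, v_th, v_reset = 0.03, 0.05, 1.0, 1.0 / 3.0
mean_T = psi1(v_th, mu, sigma) - psi1(v_reset, mu, sigma)
second = psi2(v_th, mu, sigma, v_th) - psi2(v_reset, mu, sigma, v_th)
cv = np.sqrt(second - mean_T**2) / mean_T
print(round(mean_T, 1), round(cv, 2))   # → 22.2 0.35
```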
4.3 Solutions for Zero Drift (μ = 0). The above expressions for ⟨T⟩ and ⟨T²⟩ are valid when μ is different from zero. When μ is exactly zero, the corresponding differential equations 4.10 and 4.11 become simpler, but the functional forms of the solutions change. In this case, the solution for ⟨T⟩ is

\[ \langle T\rangle(V) = y_1(V_h) - y_1(V), \qquad y_1(x) = \frac{x^2}{\sigma^2}. \qquad (4.16) \]
A very similar expression has been discussed before (Salinas & Sejnowski, 2000; see also Berg, 1993). In turn, the solution for ⟨T²⟩ becomes

\[ \langle T^2\rangle(V) = y_2(V_h) - y_2(V), \qquad y_2(x) = \frac{2\,y_1(V_h)\,x^2}{\sigma^2} - \frac{x^4}{3\sigma^4}. \qquad (4.17) \]
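With the Figure 1a parameters (μ = 0, σ = 0.2, V_h = 1, V_reset = 1/3), equations 4.16 and 4.17 reproduce the quoted ⟨T⟩ ≈ 22 ms and CV_ISI ≈ 0.91. A sketch:

```python
import numpy as np

sigma, v_th, v_reset = 0.2, 1.0, 1.0 / 3.0

def y1(x):
    return x**2 / sigma**2                                   # equation 4.16

def y2(x):
    return 2 * y1(v_th) * x**2 / sigma**2 - x**4 / (3 * sigma**4)  # equation 4.17

mean_T = y1(v_th) - y1(v_reset)
second = y2(v_th) - y2(v_reset)
cv = np.sqrt(second - mean_T**2) / mean_T
print(round(mean_T, 1), round(cv, 2))   # → 22.2 0.91
```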
These special solutions can also be obtained from the general expressions 4.14 and 4.15 by using a series representation for the exponentials and then taking the limit μ → 0. Figures 2 and 3 compare these analytical solutions with computer simulations for a variety of values of σ and μ (see appendix C for details). The continuous lines in these figures correspond to analytic solutions. Equations 4.16 and 4.17 were used in Figure 2b, and equations 4.14 and 4.15 were used in the rest of the panels (as mentioned above, the expressions were evaluated at V = V_reset). The bottom rows in these figures plot the CV_ISI as defined in equation 4.12, and the top rows plot the mean firing rate r in spikes per second, where r = 1/⟨T⟩. The dots in the graphs are the results of simulations. Each dot was obtained by first setting the values of μ and σ and then running the simulation until 1,000 spikes had been emitted; then ⟨T⟩ and ⟨T²⟩ were computed from the corresponding 1,000 values of T. The agreement between simulation and analytic results is very good for all combinations of μ and σ. The results in Figures 2 and 3 confirm that firing in this model can be driven either by a continuous drift, in which case firing is regular, like a clock, or by the purely stochastic component of the input, in which case firing is irregular, like radioactive decay. For example, in Figure 3a, the mean rate rises steadily as a function of μ almost exactly along the line given by r = μ/(V_h − V_reset), which is the solution for r when σ = 0 (see equation 4.14). The corresponding CV_ISI values stay very low, except when μ gets close to zero, in which case σ becomes relatively large. It is easy to see from equations 4.14 and 4.15 that when σ = 0, the CV_ISI becomes zero as well. In contrast, Figure 2b shows that when there is no drift and firing is driven by a purely stochastic input, the resulting spike train is much more irregular, with a CV_ISI slightly below one. Figure 2c shows an interesting
[Figure 2 near here: panels a (μ = 0.05), b (μ = 0), and c (μ = 0.02); top row, mean firing rate (spikes/s, 0–100); bottom row, CV_ISI (0–1); x-axis, σ from 0 to 0.3.]

Figure 2: Mean firing rate and coefficient of variation for the nonleaky integrate-and-fire neuron driven by uncorrelated noise. Continuous lines are analytic expressions, and dots are results from computer simulations using gaussian white noise. Results are shown as functions of σ for the three values of μ indicated in the top-left corner of each panel. For each simulation data point, ⟨T⟩ and ⟨T²⟩ were calculated from spike trains containing 1,000 spikes (same in following figure). (a, c) The continuous lines were obtained from equations 4.14 and 4.15. (b) The continuous lines were obtained from equations 4.16 and 4.17.
intermediate case. As σ increases above zero, the firing rate changes little at first, but then, after a certain point around σ = 0.1, it increases steadily. In contrast, the CV_ISI increases most sharply precisely within the range of values at which the rate stays approximately constant, starting to saturate just as the firing rate starts to rise.

5 The Moments of T of the Nonleaky Neuron Driven by Correlated Noise
[Figure 3 near here: panels a (σ = 0.05), b (σ = 0.12), and c (σ = 0.2); top row, mean firing rate (spikes/s, 0–100); bottom row, CV_ISI (0–1); x-axis, μ from 0 to 0.06.]

Figure 3: As in Figure 2, but the mean firing rate and coefficient of variation are plotted as functions of μ for the three values of σ indicated in the top-left corner of each panel. All continuous lines were obtained from equations 4.14 and 4.15.

Correlated noise can also be approximated using a binary description. When there are correlations, changes in the sign of X(t) occur randomly but with a characteristic timescale τ_corr (recall that X has zero mean). In this case, the key to approximating X with a binary variable is to capture the statistics of the transitions between positive and negative values. Intuitively, this may be thought of as follows. A sample X is observed, and it is positive. On average, it will remain positive for approximately another τ_corr time units. Therefore, in a time interval Δt longer than τ_corr, one would expect to see Δt/(2τ_corr) sign changes; for instance, in a period of length 2τ_corr, one should see, on average, a single sign change. Thus, the probability that X(t) and X(t+Δt) have different signs, for small Δt, should be equal to Δt/(2τ_corr). This is the quantity we called P_c above. A random process that changes states in this way gives rise to a correlation function described by an exponential, as in equation 3.5 (see appendix A), and has a finite variance. For a correlated random current, the discretized version of equation 2.1 is

\[ V(t+\Delta t) = V(t) + (\mu + \sigma Z)\,\Delta t, \qquad (5.1) \]
where Z = ±1. Now, however, Z has an exponential correlation function, and although +1 and −1 are equally probable overall, the transitions between +1 and −1 do not have identical probabilities, as described earlier. As before, we have rescaled the mean and variance of the input:

$$\mu \equiv \frac{\mu_0}{\tau}, \qquad \sigma \equiv \frac{\sigma_1}{\tau}. \tag{5.2}$$
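The switching statistics assumed here are easy to verify numerically. The sketch below (illustrative code and parameter values, not from the article) generates a binary signal that flips with probability P_c = Δt/(2τ_corr) per time step and checks that sign changes occur every 2τ_corr on average:

```python
import random

def telegraph(n_steps, dt=0.1, tau_corr=1.0, seed=2):
    """Binary signal Z = +/-1 that switches sign with probability
    P_c = dt / (2 * tau_corr) at each time step of length dt."""
    rng = random.Random(seed)
    p_c = dt / (2.0 * tau_corr)
    z = rng.choice([-1.0, 1.0])
    samples = []
    for _ in range(n_steps):
        if rng.random() < p_c:
            z = -z                    # transition between +1 and -1
        samples.append(z)
    return samples

z = telegraph(200_000, dt=0.1, tau_corr=1.0)
flips = sum(1 for a, b in zip(z, z[1:]) if a != b)
mean_switch_time = len(z) * 0.1 / flips   # should be close to 2*tau_corr = 2 ms
```

Over a long run, the empirical time between sign changes comes out close to 2τ_corr, and the autocorrelation of Z decays approximately exponentially with timescale τ_corr, as in equation 3.5.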
However, we use σ₁ instead of σ₀ to note that the two quantities correspond to different noise models, correlated and uncorrelated, respectively. There is a subtle difference between them: the √Δt factor is needed to discretize gaussian white noise but not correlated noise, so σ₁ and σ₀ end up having different units (see above). Figures 4 and 5 show examples of spike trains produced by the model when the input is binary and temporally correlated. In Figure 4, the model has a negative drift, so firing is driven by the random fluctuations only. In Figures 4a and 4c, the correlation time is τ_corr = 1 ms, and, as expected, on average the input changes sign every 2 ms (see Figure 4a, lower trace). When τ_corr is increased to 5 ms, the changes in sign occur approximately every 10 ms (see Figure 4d, lower trace). This gives rise to a large increase in mean firing rate and a smaller increase in CV_ISI, as can be seen by comparing the spike trains in Figures 4c and 4f. Note that μ and σ are the same for all panels. Figure 5 shows the effect of an identical change in correlation time on a model neuron that has a large positive drift. When τ_corr = 1 ms, firing is regular (see Figure 5c). When τ_corr increases to 5 ms, there is no change in the mean firing rate, but the interspike intervals become much more variable (see Figure 5f). In both figures, long correlation times give rise to a sharp peak in the distribution of interspike intervals (see Figures 4e and 5e), which corresponds to a short interspike interval that appears very frequently. This short interval results when the input stays positive for a relatively long time, as illustrated by the two spikes in the voltage trace in Figure 4d. This interval is equal to (V_θ − V_reset)/(μ + σ), which is the minimum separation between spikes in the model given μ and σ. As the correlation time increases, a larger proportion of spikes separated by this interval is observed.
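This behavior can be reproduced with a few lines of simulation. The sketch below (illustrative code, not the authors'; it uses the Figure 4 input parameters μ = −0.01, σ = 0.1, τ_corr = 5 ms, while the threshold and reset values V_θ = 1 and V_reset = 0 are assumptions for this example) drives the nonleaky model of equation 5.1 with binary correlated noise and confirms that no interspike interval is shorter than (V_θ − V_reset)/(μ + σ):

```python
import random

def nonleaky_correlated_isis(mu, sigma, tau_corr, v_theta=1.0, v_reset=0.0,
                             dt=0.1, n_spikes=1000, seed=3):
    """Interspike intervals of the nonleaky model V += (mu + sigma*Z)*dt,
    with binary correlated Z, a barrier at V = 0, and reset after spikes."""
    rng = random.Random(seed)
    p_c = dt / (2.0 * tau_corr)
    z, v, t, last = 1.0, v_reset, 0.0, 0.0
    isis = []
    while len(isis) < n_spikes:
        if rng.random() < p_c:
            z = -z                                # correlated binary input
        v = max(v + (mu + sigma * z) * dt, 0.0)   # barrier at V = 0
        t += dt
        if v >= v_theta:                          # spike and reset
            isis.append(t - last)
            last, v = t, v_reset
    return isis

isis = nonleaky_correlated_isis(mu=-0.01, sigma=0.1, tau_corr=5.0)
t_min = 1.0 / (-0.01 + 0.1)   # (v_theta - v_reset)/(mu + sigma)
```

With these parameters the shortest intervals cluster just above t_min, and at long correlation times a large fraction of intervals falls near it, consistent with the sharp histogram peak described above.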
For instance, for the parameters in Figures 4f and 5f, the minimum interval accounts approximately for 48% and 28%, respectively, of all interspike intervals. Analytic solutions for the mean firing rate and the CV_ISI of the nonleaky model driven by correlated noise are derived in the following sections.

5.1 Equations for the Moments of T. In the case of correlated input, the probability that threshold is reached in T time units depends not only on the value of V(t) but also on the value of Z(t). If Z(t) is, say, positive, then it is more likely that Z(t + Δt) will also be positive, so we should expect threshold to be reached sooner when Z(t) is positive than when it is negative. This means that now, instead of the single probability distribution f used before, two functions should be considered: f₊(T, V), which stands for the probability distribution of T given that the last measurement of the input Z was positive, and f₋(T, V), which stands for the probability distribution of T given that the last measurement of Z was negative. Therefore, Δt f₊(T, V) is the probability that, starting from a voltage V(t) and given that Z(t) was positive, threshold is reached between T and T + Δt time units later. In turn, using the proper discretization for this case (see equation 5.1), the same
reasoning that previously led to equation 4.4 now leads to a set of two coupled equations:

$$f_+(T + \Delta t, V) = f_+(T, V + (\mu+\sigma)\Delta t)\,(1 - P_c) + f_-(T, V + (\mu-\sigma)\Delta t)\,P_c \tag{5.3}$$

$$f_-(T + \Delta t, V) = f_+(T, V + (\mu+\sigma)\Delta t)\,P_c + f_-(T, V + (\mu-\sigma)\Delta t)\,(1 - P_c). \tag{5.4}$$
The first equation follows when Z(t) is positive, so Z(t + Δt) can either stay positive, with probability 1 − P_c, or change to negative, with probability P_c; similarly, the second equation means that when the current Z at time t is negative, at time t + Δt it can either become positive, with probability P_c, or stay negative, with probability 1 − P_c. At this point, one may proceed as in the case of uncorrelated noise: expand each term in a Taylor series, which in this case only needs to be of first order, substitute Δt/(2τ_corr) for P_c, simplify terms, and take the limit Δt → 0. The result is a pair of coupled differential equations for f₊ and f₋ analogous to equation 4.5, which applies when input samples are independent:

$$\begin{aligned}
\frac{\partial f_+}{\partial T} &= -\frac{1}{2\tau_{\rm corr}}\,(f_+ - f_-) + (\mu+\sigma)\,\frac{\partial f_+}{\partial V} \\
\frac{\partial f_-}{\partial T} &= \frac{1}{2\tau_{\rm corr}}\,(f_+ - f_-) + (\mu-\sigma)\,\frac{\partial f_-}{\partial V}.
\end{aligned} \tag{5.5}$$
Figure 4: Facing page. Responses of the nonleaky integrate-and-fire model driven by correlated, binary noise. The input switches signs randomly, but on average the same sign is maintained for 2τ_corr time units. (a) Sample voltage and input time courses. Traces represent 50 ms of simulation time, with Δt = 0.1 ms. The top trace shows the model neuron's voltage V and the spikes produced. The bottom trace shows the total input μ + Zσ at each time step, where Z may be either +1 or −1. The input has a negative drift, and firing is irregular. Parameters in equation 5.1 were μ = −0.01, σ = 0.1; the correlation time was τ_corr = 1 ms. (b) Interspike interval histogram with ⟨T⟩ = 96 ms and CV_ISI = 1. (c) Spike raster showing 6 seconds of continuous simulation time; each line represents 1 second. The same parameters were used in a through c. Panels d through f have the same formats as the corresponding panels above, except that τ_corr = 5 ms; thus, changes in the sign of the input in d occur about five times less frequently than in a. Note different x-axes for the interspike interval histograms. In e, the y-axis is truncated; 48% of the interspike intervals fall in the bin centered at 7 ms. In this case, ⟨T⟩ = 27 ms, CV_ISI = 1.18. The increase in correlation time causes a large increase in firing rate and a smaller but still considerable increase in variability.
These equations may be multiplied by T^q and integrated, as was done in the previous section. The two equations that result describe how the moments of T change with voltage. These equations are:

$$\begin{aligned}
-q\langle T_+^{q-1}\rangle &= -\frac{1}{2\tau_{\rm corr}}\left(\langle T_+^{q}\rangle - \langle T_-^{q}\rangle\right) + (\mu+\sigma)\,\frac{d\langle T_+^{q}\rangle}{dV} \\
-q\langle T_-^{q-1}\rangle &= \frac{1}{2\tau_{\rm corr}}\left(\langle T_+^{q}\rangle - \langle T_-^{q}\rangle\right) + (\mu-\sigma)\,\frac{d\langle T_-^{q}\rangle}{dV}.
\end{aligned} \tag{5.6}$$
Notice that there are two sets of moments, not just one, because there are two conditional distributions, f₊ and f₋. We should point out that for the most part, ⟨T₊⟩ is the crucial quantity. It corresponds to the average time required to reach threshold starting at voltage V and given that the last value of the current was positive; similarly, ⟨T₋⟩ is the expected time required to reach threshold starting at voltage V but given that the last value of Z was negative. However, notice that in a spike train a new excursion from reset to threshold is initiated immediately after each spike, and all spikes must be preceded by an increase in voltage. With one exception, this always corresponds to Z > 0, in which case the quantities of interest are ⟨T₊⟩, ⟨T₊²⟩, and so on. The exception is when μ is positive and larger than σ; in this case, V never decreases because the total input is always positive, and therefore threshold can be reached even after a negative value of Z. For this same reason, the cases μ < σ and μ > σ will require different boundary conditions. When σ is larger than μ, the sign of the corresponding change in voltage is equal to the sign of Z (this is not true if μ is negative and |μ| > σ, but then the voltage cannot increase and no spikes are ever produced, so we disregard this case). The first boundary condition is then

$$\frac{\partial f_-(T, 0)}{\partial V} = 0, \tag{5.7}$$

which corresponds to the presence of the barrier at V = 0. It involves the function f₋ because the barrier can be reached only after a negative value of Z. Therefore, there is no corresponding condition for f₊. Second,

$$f_+(T, V_\theta) = \delta(T). \tag{5.8}$$

This is the condition on the threshold, which can be reached only after a positive value of Z; hence, the use of f₊. In terms of the moments of T, these conditions become

$$\frac{d\langle T_-^{q}\rangle}{dV}(V = 0) = 0, \qquad \langle T_+^{q}\rangle(V = V_\theta) = 0. \tag{5.9}$$
When μ is positive and larger than σ, the change in voltage in one time step is always positive. In this case, the barrier is never reached, so there is no boundary condition at that point. Threshold, however, can be reached with either positive or negative Z values, so the two conditions are

$$f_+(T, V_\theta) = \delta(T), \qquad f_-(T, V_\theta) = \delta(T), \tag{5.10}$$
Figure 5: As in Figure 4, except that μ = 0.02, σ = 0.03. The input has a strong positive drift, which gives rise to much more regular responses. (a–c) τ_corr = 1 ms, ⟨T⟩ = 33 ms, and CV_ISI = 0.37. (d–f) τ_corr = 5 ms, ⟨T⟩ = 33 ms, and CV_ISI = 0.81. (e) Twenty-eight percent of the interspike intervals fall in the bin centered at 13 ms, and the y-axis is truncated. In this case, the increase in correlation time does not affect the firing rate but produces a large increase in variability.
and for the moments of T, this gives

$$\langle T_+^{q}\rangle(V = V_\theta) = 0, \qquad \langle T_-^{q}\rangle(V = V_\theta) = 0. \tag{5.11}$$
5.2 Solutions for σ ≥ μ. Ordinary differential equations for the mean first passage time are obtained by setting q = 1 in equations 5.6. The solutions to these equations are

$$\begin{aligned}
\langle T_+\rangle(V) &= w_1^+(V_\theta) - w_1^+(V) \\
w_1^+(x) &= \frac{x}{\mu} + \tau_{\rm corr}\,(c-1)^2 \exp(-a x) \\
\langle T_-\rangle(V) &= w_1^+(V_\theta) + 2\tau_{\rm corr}\,c - \frac{V}{\mu} - \tau_{\rm corr}\,(c^2-1)\exp(-a V),
\end{aligned} \tag{5.12}$$

where we have defined

$$c \equiv \frac{\sigma}{\mu} = \frac{\sigma_1}{\mu_0}, \qquad a \equiv \frac{1}{\mu\,\tau_{\rm corr}\,(c^2-1)}. \tag{5.13}$$
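As a numerical sanity check of these expressions (a sketch with arbitrary illustrative parameters, V_θ = 1, V_reset = 0, μ = 0.02, σ = 0.05, τ_corr = 3 ms, chosen so that σ > μ; this is not the authors' code), the closed form for ⟨T₊⟩ can be compared with a direct simulation of equation 5.1:

```python
import math
import random

# Illustrative parameters with sigma > mu, so equations 5.12-5.13 apply.
mu, sigma, tau_corr, v_theta, v_reset = 0.02, 0.05, 3.0, 1.0, 0.0
c = sigma / mu
a = 1.0 / (mu * tau_corr * (c**2 - 1.0))

def w1(x):
    """Auxiliary function w1+ of equation 5.12."""
    return x / mu + tau_corr * (c - 1.0)**2 * math.exp(-a * x)

t_plus_analytic = w1(v_theta) - w1(v_reset)   # mean first passage time from reset

# Direct simulation of equation 5.1 with binary correlated input.
rng = random.Random(5)
dt = 0.05
p_c = dt / (2.0 * tau_corr)
z, v, t, last = 1.0, v_reset, 0.0, 0.0
isis = []
while len(isis) < 1000:
    if rng.random() < p_c:
        z = -z
    v = max(v + (mu + sigma * z) * dt, 0.0)
    t += dt
    if v >= v_theta:
        isis.append(t - last)
        last, v = t, v_reset

t_plus_simulated = sum(isis) / len(isis)
```

Since every spike is triggered while Z = +1, the observed mean interspike interval should match ⟨T₊⟩(V_reset); with these parameters both values should come out near 43.5 ms.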
These expressions satisfy the boundary conditions 5.9 for the threshold and the barrier. Ordinary differential equations for the second moments of the first passage time are obtained by setting q = 2 in equations 5.6 and inserting the above expressions for ⟨T₊⟩ and ⟨T₋⟩. The solutions to the second moment equations are

$$\begin{aligned}
\langle T_+^2\rangle(V) &= w_2^+(V_\theta) - w_2^+(V) \\
w_2^+(x) &= x\,\frac{2 w_1^+(V_\theta) + 2\tau_{\rm corr}\,c^2}{\mu} - \frac{x^2}{\mu^2} + 2\tau_{\rm corr}(c-1)^2\Big(w_1^+(V_\theta) + \tau_{\rm corr}\big(2c^2 + 4c + 1\big)\Big)\exp(-ax) \\
&\quad + \frac{2\tau_{\rm corr}(c-1)(c^2+1)}{\mu\,(c+1)}\,x\exp(-ax) \\
\langle T_-^2\rangle(V) &= w_2^+(V_\theta) + 4c\,\tau_{\rm corr}\,w_1^+(V_\theta) + 4\tau_{\rm corr}^2 c^2(c+1) - V\,\frac{2 w_1^+(V_\theta) + 2\tau_{\rm corr}\,c(c+2)}{\mu} + \frac{V^2}{\mu^2} \\
&\quad - 2\tau_{\rm corr}(c^2-1)\Big(w_1^+(V_\theta) + \tau_{\rm corr}\big(2c^2 + 2c + 1\big)\Big)\exp(-aV) - \frac{2\tau_{\rm corr}(c^2+1)}{\mu}\,V\exp(-aV).
\end{aligned} \tag{5.14}$$
These also satisfy boundary conditions 5.9, with q = 2. As explained above, the relevant quantities in this case are ⟨T₊⟩ and ⟨T₊²⟩, because threshold is always reached following a positive value of Z. Thus, ⟨T₊⟩ should be equal to the mean first passage time calculated from simulated spike trains. What is the behavior of these solutions as τ_corr increases? Consider the firing rate r of the model neuron under these conditions. As correlations become longer, the stretches of time during which input samples have the same sign also become longer (compare the lower traces in Figures 4a and 4d), although on average it is still true that positive and negative samples are equally likely. During a stretch of consecutive negative samples, the firing rate is zero, whereas during a long stretch of consecutive positive samples, it must be equal to

$$r_+ = \frac{\mu + \sigma}{V_\theta - V_{\rm reset}}, \tag{5.15}$$

which is just the rate we would obtain if the total input were constant and equal to μ + σ (see equation 4.14). Therefore, the mean rate for large τ_corr should be approximately equal to one-half of r₊. Indeed, in the above expression for ⟨T₊⟩, one may expand the exponential in a Taylor series and take the limit τ_corr → ∞; the result is that, in the limit, the firing rate is equal to r₊/2. On the other hand, longer correlation times give rise to longer stretches in which no spikes appear, that is, longer interspike intervals. Thus, in contrast to the mean, the variance in the interspike interval distribution should keep increasing. This can be seen from the above expression for ⟨T₊²⟩: again, expand the exponentials in Taylor series, but note that this time, the coefficient of the final term in V includes τ_corr, so ⟨T₊²⟩ will keep increasing with the correlation time. Because the CV_ISI is a function of ⟨T₊²⟩/⟨T₊⟩², the variability of the spike trains generated by the model will diverge as the correlation time increases. Another interesting limit case is obtained when σ = μ, or c = 1. Here, the total input is zero every time that Z equals −1; this means that half the time, V does not change, whereas the rest of the time, V changes by 2μΔt in each time step. On average, this gives a time to threshold ⟨T₊⟩ equal to (V_θ − V_reset)/μ, which is precisely the result from equation 5.12. Notice that this quantity does not depend on the correlation time. In contrast, the second moment ⟨T₊²⟩ does depend on τ_corr. For this case, c = 1, the expression for the CV_ISI is particularly simple:

$$\mathrm{CV}_{\rm ISI} = \sqrt{\frac{2\mu\,\tau_{\rm corr}}{V_\theta - V_{\rm reset}}}. \tag{5.16}$$

Thus, again the variability diverges as τ_corr increases, but the mean rate does not.
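The c = 1 case is easy to check by simulation (an illustrative sketch, not the authors' code; the threshold and reset values V_θ = 1 and V_reset = 0 are assumptions):

```python
import random

def mean_and_cv(mu, sigma, tau_corr, v_theta=1.0, v_reset=0.0,
                dt=0.05, n_spikes=1000, seed=4):
    """Mean interspike interval and CV_ISI of the nonleaky model
    driven by binary correlated input (equation 5.1)."""
    rng = random.Random(seed)
    p_c = dt / (2.0 * tau_corr)
    z, v, t, last = 1.0, v_reset, 0.0, 0.0
    isis = []
    while len(isis) < n_spikes:
        if rng.random() < p_c:
            z = -z
        v = max(v + (mu + sigma * z) * dt, 0.0)
        t += dt
        if v >= v_theta:
            isis.append(t - last)
            last, v = t, v_reset
    m = sum(isis) / len(isis)
    var = sum((x - m) ** 2 for x in isis) / len(isis)
    return m, var ** 0.5 / m

# With sigma = mu (c = 1), the mean interval should stay near
# (v_theta - v_reset)/mu = 20 ms for any correlation time, while the
# CV_ISI should grow as sqrt(2*mu*tau_corr/(v_theta - v_reset)).
m1, cv1 = mean_and_cv(0.05, 0.05, tau_corr=1.0)
m5, cv5 = mean_and_cv(0.05, 0.05, tau_corr=5.0)
```

Increasing τ_corr from 1 to 5 ms leaves the mean interval essentially unchanged but raises the variability, as equation 5.16 predicts.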
5.3 Solutions for Zero Drift (μ = 0). When the drift is exactly zero, the model neuron is driven by the zero-mean input only. Equations for the moments of the first passage time are obtained as above, using the proper values of q in equations 5.6, but μ = 0 must also be set in these expressions. With these considerations, the solutions for the mean first passage time in the case of correlated noise with zero drift are

$$\begin{aligned}
\langle T_+\rangle(V) &= y_1^+(V_\theta) - y_1^+(V) \\
y_1^+(x) &= \frac{2x}{\sigma} + \frac{x^2}{2\tau_{\rm corr}\sigma^2} \\
\langle T_-\rangle(V) &= y_1^+(V_\theta) + 2\tau_{\rm corr} - \frac{V^2}{2\tau_{\rm corr}\sigma^2}.
\end{aligned} \tag{5.17}$$
Notice that these expressions can also be obtained from the solutions to the case σ > μ by expanding the exponentials in equations 5.12 in Taylor series, cancelling terms, and then taking the limit μ → 0. For the second moment,

$$\begin{aligned}
\langle T_+^2\rangle(V) &= y_2^+(V_\theta) - y_2^+(V) \\
y_2^+(x) &= \frac{4x}{\sigma}\left(\tau_{\rm corr} + y_1^+(V_\theta)\right) + \frac{x^2}{\tau_{\rm corr}\sigma^2}\left(y_1^+(V_\theta) - \tau_{\rm corr}\right) - \frac{2x^3}{3\tau_{\rm corr}\sigma^3} - \frac{x^4}{12\tau_{\rm corr}^2\sigma^4} \\
\langle T_-^2\rangle(V) &= y_2^+(V_\theta) + 4\tau_{\rm corr}\left(2\tau_{\rm corr} + y_1^+(V_\theta)\right) - \left(\tau_{\rm corr} + y_1^+(V_\theta)\right)\frac{V^2}{\tau_{\rm corr}\sigma^2} + \frac{V^4}{12\tau_{\rm corr}^2\sigma^4}.
\end{aligned} \tag{5.18}$$
The boundary conditions satisfied by these solutions are those given by equations 5.9, as in the previous case. Also, it is ⟨T₊⟩ and ⟨T₊²⟩ that are important, because spikes are always triggered by positive values of the fluctuating input. The asymptotic behavior of these solutions for increasing τ_corr is similar to that in the previous case. In the limit, the mean rate is again given by r₊/2 (with μ = 0), and it is clear from the expression for y₂₊ that ⟨T₊²⟩ will diverge because of the leading term 4xτ_corr/σ.

5.4 Solutions for σ < μ. When σ is smaller than μ, the same differential equations discussed above (equations 5.6 with the appropriate values of q) need to be solved, but the boundary conditions are different. In this case, equations 5.11 should be satisfied. The solutions for the first moment that
do so are

$$\begin{aligned}
\langle T_+\rangle(V) &= \frac{V_\theta - V}{\mu} + \tau_{\rm corr}\,c\,(c-1)\left(1 - \exp\big(a(V_\theta - V)\big)\right) \\
\langle T_-\rangle(V) &= \frac{V_\theta - V}{\mu} + \tau_{\rm corr}\,c\,(c+1)\left(1 - \exp\big(a(V_\theta - V)\big)\right).
\end{aligned} \tag{5.19}$$
In contrast to the previous two situations, now threshold can be reached after a positive or a negative value of Z. Overall, these are equally probable, but the relative numbers of spikes triggered by Z = +1 and Z = −1 need not be, so determining the observed mean first passage time becomes more difficult. The following approximation works quite well. Consider the case where τ_corr is large. The interspike interval during a long stretch of positive samples is still equal to (V_θ − V_reset)/(μ + σ), but spikes are also produced during a long stretch of negative samples, with an interspike interval equal to (V_θ − V_reset)/(μ − σ). This suggests that the number of excursions to threshold that start after Z = +1 during a given period of time should be proportional to μ + σ, whereas those that start after Z = −1 in the same period should be proportional to μ − σ. Using these quantities as weighting coefficients for ⟨T₊⟩ and ⟨T₋⟩ gives

$$\langle T_A\rangle(V) = \frac{(\mu+\sigma)\langle T_+\rangle + (\mu-\sigma)\langle T_-\rangle}{2\mu} = \frac{V_\theta - V}{\mu}, \tag{5.20}$$
where the second equality is obtained by using equations 5.19. This average should approximate the mean time between spikes produced by the model. According to this derivation, when σ < μ, the mean first passage time should be constant as a function of the correlation time (as was found when c = 1). Obtaining an expression for the second moment of the first passage time requires a similar averaging procedure. First, using equations 5.19 on the left-hand side, solutions to equations 5.6 for q = 2 are found; these are

$$\langle T_+^2\rangle(V) = Q_2^+(V_\theta) - Q_2^+(V), \qquad \langle T_-^2\rangle(V) = Q_2^-(V_\theta) - Q_2^-(V), \tag{5.21}$$

where

$$\begin{aligned}
Q_2^\pm(x) &= \frac{2x}{\mu}\left(\frac{V_\theta}{\mu} + 2\tau_{\rm corr}\,c^2 \mp \tau_{\rm corr}\,c\right) - \frac{x^2}{\mu^2} + 2\tau_{\rm corr}^2 c^2\left(3c^2 \mp 2c - 1\right)\exp\big(a(V_\theta - x)\big) \\
&\quad - \frac{2\tau_{\rm corr}\,c\,(c^2+1)}{\mu\,(c \pm 1)}\,(V_\theta - x)\exp\big(a(V_\theta - x)\big).
\end{aligned} \tag{5.22}$$
These solutions satisfy boundary conditions 5.11. Second, again considering that the relative frequencies of spikes triggered by Z = +1 and Z = −1 should be proportional to μ + σ and μ − σ, respectively, one obtains

$$\langle T_A^2\rangle(V) = \frac{(\mu+\sigma)\langle T_+^2\rangle + (\mu-\sigma)\langle T_-^2\rangle}{2\mu} = \left(\frac{V_\theta - V}{\mu}\right)^2 + 2\tau_{\rm corr}\,c^2\,\frac{V_\theta - V}{\mu} + 2\tau_{\rm corr}^2 c^2\left(c^2 - 1\right)\left(1 - \exp\big(a(V_\theta - V)\big)\right). \tag{5.23}$$
Recall that the quantities a and c are defined in equation 5.13 and apply to all solutions. In this case, the asymptotic behavior of the solutions as functions of τ_corr is different from the previous cases. As noted above, the mean first passage time is independent of τ_corr so, as in the case σ ≥ μ, it remains finite as τ_corr increases. The variability, however, does not diverge in this case, as it does when σ ≥ μ. In the limit when τ_corr → ∞, the second moment tends to a finite value; this is not obvious from equation 5.23, but expanding the exponentials, one finds that all terms with τ_corr in the numerator cancel out. Intuitively, this makes sense: as the correlation time increases, the value of most interspike intervals tends to be either (V_θ − V_reset)/(μ − σ) or (V_θ − V_reset)/(μ + σ), so the variance in the interspike interval distribution tends to a constant. The CV_ISI in this limit case is also given by a simple expression, which is

$$\frac{c}{\sqrt{1 - c^2}}. \tag{5.24}$$

Plots of the solutions just presented are shown in Figures 6 and 7. The format of these figures is the same as in Figures 2 and 3: dots and continuous lines correspond to computer simulations and analytic solutions, respectively. Here, however, three sets of curves and dots appear in each panel; these correspond to τ_corr equal to 1, 3, and 10 ms, where higher correlation times give rise to higher values of the mean rate and CV_ISI. As in the case of independent input samples, their agreement is excellent. It is evident from these figures that the correlation time increases both the firing rate and the variability of the responses for any fixed combination of μ and σ. Two important points should be noted. First, the relative effect on the mean firing rate is much stronger when σ is large compared to μ. This is most evident in Figure 7b (top row), where the effect of τ_corr is enormous when μ is zero or negative and becomes negligible as μ becomes comparable to σ.
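A quick numerical illustration of the σ < μ regime (a sketch with assumed values V_θ = 1, V_reset = 0; not the authors' code): according to equation 5.20, the mean interval should not move when the correlation time changes.

```python
import random

def mean_isi(mu, sigma, tau_corr, v_theta=1.0, v_reset=0.0,
             dt=0.05, n_spikes=1500, seed=7):
    """Mean interspike interval of the nonleaky model (equation 5.1)
    with binary correlated input."""
    rng = random.Random(seed)
    p_c = dt / (2.0 * tau_corr)
    z, v, t, last = 1.0, v_reset, 0.0, 0.0
    isis = []
    while len(isis) < n_spikes:
        if rng.random() < p_c:
            z = -z
        v = max(v + (mu + sigma * z) * dt, 0.0)
        t += dt
        if v >= v_theta:
            isis.append(t - last)
            last, v = t, v_reset
    return sum(isis) / len(isis)

# sigma < mu (c = 0.7): equation 5.20 predicts <T> = (v_theta - v_reset)/mu
# = 20 ms regardless of tau_corr; only the variability changes.
m_short = mean_isi(0.05, 0.035, tau_corr=1.0)
m_long = mean_isi(0.05, 0.035, tau_corr=10.0)
```

Both runs should return a mean interval near 20 ms, in line with the flat rate curves for σ < μ.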
This is also clear in Figure 6c where, in agreement with equation 5.20, the firing rate remains constant for σ < μ and becomes more sensitive to τ_corr as σ increases. The higher sensitivity when σ is large is not at all surprising because correlations do not alter the drift. Hence, when the neuron
Figure 6: Mean firing rate and coefficient of variation for the nonleaky integrate-and-fire neuron driven by correlated noise. Continuous lines are analytic expressions, and dots are results from computer simulations using a binary input. For each simulation data point, ⟨T⟩ and ⟨T²⟩ were calculated from spike trains containing 2,000 spikes (same in following figures). Results are shown as functions of σ for the three values of μ indicated in the top-left corner of each panel. The three curves in each graph correspond to τ_corr = 1 (lower curves), τ_corr = 3 (middle curves), and τ_corr = 10 ms (upper curves). (a, c) The continuous lines were obtained from equations 5.12 through 5.14, which apply to μ < σ, except in the initial part of c, where μ > σ and equations 5.20 and 5.23 were used. (b) The continuous lines were obtained from equations 5.17 and 5.18. Both firing rate and variability increase with correlation time.
is being primarily driven by drift, the change in rate caused by correlations is proportionally small. The second key point is that as the correlation time is increased, the effect on the firing rate saturates, but the effect on variability may not. For instance, in Figures 6a through 6c (top row), the difference between the lower two and the upper two rate curves is about the same, although the latter corresponds to a much larger increase in correlation time. In contrast, the CV_ISI plots in the same figure do show larger differences between CV_ISI traces for larger changes in τ_corr. Figure 8 illustrates these points more clearly. Here, the mean rate and the CV_ISI are plotted as functions of the correlation time τ_corr for various combinations of μ and σ. Only analytic results are shown here, but these
Figure 7: As in Figure 6, but the mean firing rate and coefficient of variation are plotted as functions of μ for the three values of σ indicated. The three curves in each graph again correspond to τ_corr = 1 (lower curves), τ_corr = 3 (middle curves), and τ_corr = 10 ms (upper curves). All continuous lines were obtained from equations 5.12 through 5.14 (μ < σ), except for the last part of a, where equations 5.20 and 5.23 were used (μ > σ).
were confirmed by additional simulations. First, notice the behavior of the firing rate: it increases as a function of τ_corr up to a certain point, where it saturates. The net increase in rate is larger for higher values of σ and smaller for higher values of μ. On the other hand, the variability shows different asymptotic behaviors in the cases σ > μ and σ < μ. In the former, the CV_ISI always keeps rising with longer correlation times; in the latter, it saturates (see the thickest trace in Figure 8b, bottom). This saturation value is in accordance with equation 5.24, which, for the parameter values used in the figure, predicts a limit CV_ISI of 0.98. In the following sections, similar analyses are presented for the integrate-and-fire model with leak. Although the quantitative details differ somewhat, the main conclusions just discussed remain true for this model too.

6 The Leaky Integrate-and-Fire Neuron
The integrate-and-fire model with leak (Tuckwell, 1988; Troyer & Miller, 1997; Salinas & Sejnowski, 2000; Dayan & Abbott, 2001) provides a more accurate description of the voltage dynamics of a neuron while still being
Figure 8: Responses of the nonleaky integrate-and-fire neuron as functions of the correlation time τ_corr. Only analytic results are shown. Plots on the top and bottom rows show the mean firing rate and CV_ISI, respectively. (a) The four curves shown in each plot correspond to four values of σ: 0.02, 0.045, 0.07, and 0.1, with thicker lines corresponding to higher values. For all these curves, μ = 0, as indicated. (b) The four curves shown in each plot correspond to four values of μ: −0.02, 0, 0.025, and 0.05, with thicker lines corresponding to higher values. For all these curves, σ = 0.035, as indicated. As the correlation time increases, the firing rate tends to an asymptotic value. In contrast, the CV_ISI diverges always, except when μ > σ; this case corresponds to the thickest line in b.
quite simple. When driven by a stochastic input, the evolution equation for the voltage may be written as follows,

$$\tau\,\frac{dV}{dt} = -V + \mu_0 + X(t), \tag{6.1}$$
where the input terms are the same as before. The key difference is the −V term. Again, a spike is produced when V exceeds the threshold V_θ, after which V is reset to V_reset. Note, however, that in this case, there is no barrier. This equation can be further simplified by defining

$$v \equiv V - \mu_0, \tag{6.2}$$
in which case only the fluctuating component of the input (with zero mean) appears in the equation

$$\tau\,\frac{dv}{dt} = -v + X(t). \tag{6.3}$$

The threshold and reset values need to be transformed accordingly:

$$v_\theta \equiv V_\theta - \mu_0, \qquad v_{\rm reset} \equiv V_{\rm reset} - \mu_0. \tag{6.4}$$
It is easier to solve equations using v, but when the effects of μ₀, σ₁, and τ_corr are discussed, V is more convenient. These equations will be used to shift between V and v representations. It is in the form of equation 6.3 that the leaky integrate-and-fire model becomes identical to the Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930; Ricciardi, 1977, 1995; Shinomoto et al., 1999). Notice that in this model, v is driven toward X; if X were to remain constant at X = X₀, v would tend exponentially toward this value with a time constant τ. If X₀ were above threshold, the interspike interval would be constant and equal to

$$T = \tau \log\left(\frac{X_0 - v_{\rm reset}}{X_0 - v_\theta}\right), \tag{6.5}$$
which results from integrating equation 6.3. This is a well-known result. Note that spikes may be produced even if X₀ is zero or negative, as long as it is above threshold. With binary inputs, the drive X switches randomly between two values. Although this model remains an extreme simplification of a real neuron, its analytic treatment is more complicated than for the nonleaky model. Considerable work has been devoted to solving the first passage time problem when the model is driven by uncorrelated gaussian noise (Thomas, 1975; Ricciardi, 1977, 1995; Ricciardi & Sacerdote, 1979; Ricciardi & Sato, 1988). Although closed-form expressions for the moments of T in this case are not known, solutions in terms of series expansions have been found; these are well described by Shinomoto et al. (1999) and will not be considered any further here. In the remainder of the article, we investigate the responses of the leaky integrate-and-fire model when X represents correlated binary noise, as was done for the nonleaky model. As before, we are interested in the time T that it takes for v to go from reset to threshold—the first passage time, or interspike interval.
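Equation 6.5 can be checked directly with the exact exponential update for constant input (a small sketch with arbitrary parameter values, not from the article):

```python
import math

def first_passage_constant(x0, v_reset, v_theta, tau=10.0, dt=0.001):
    """Time for v to go from v_reset to v_theta under tau*dv/dt = -v + X0,
    using the exact per-step update v -> X0 + (v - X0)*exp(-dt/tau).
    X0 must exceed v_theta, otherwise threshold is never reached."""
    decay = math.exp(-dt / tau)
    v, t = v_reset, 0.0
    while v < v_theta:
        v = x0 + (v - x0) * decay
        t += dt
    return t

t_sim = first_passage_constant(x0=2.0, v_reset=0.0, v_theta=1.0, tau=10.0)
t_eq = 10.0 * math.log((2.0 - 0.0) / (2.0 - 1.0))  # equation 6.5, about 6.93
```

The simulated passage time agrees with the closed form to within one time step.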
6.1 Equations for the Moments of T of the Leaky Neuron Driven by Correlated Noise. For a correlated random current, the discretized version of equation 6.3 can be written as

$$v(t + \Delta t) = v(t) + \left(\frac{-v(t) + \sigma_1 Z}{\tau}\right)\Delta t, \tag{6.6}$$
where Z = ±1 and Z has an exponential correlation function, as discussed before. Again, the statistics of T are described by the two probability density functions f₊(T, v) and f₋(T, v), where f₊(T, v)Δt and f₋(T, v)Δt stand for the probability that starting from a voltage v, threshold is reached between T and T + Δt time units later; f₊ is conditional on the last measurement of Z being positive, and f₋ is conditional on the last measurement of Z being negative. The derivation of these functions proceeds in exactly the same way as for the nonleaky model; the key is that equation 6.6 has the same form as equation 5.1, except that the −v/τ term occupies the place of μ. Because it is f₊ and f₋ that are expanded as functions of their arguments, all the corresponding expressions can be found by exchanging −v/τ for μ. Starting with equation 5.4, this leads to a set of two coupled differential equations for the moments of T analogous to equations 5.6:

$$\begin{aligned}
-q\langle T_+^{q-1}\rangle &= -\frac{1}{2\tau_{\rm corr}}\left(\langle T_+^{q}\rangle - \langle T_-^{q}\rangle\right) + \frac{\sigma_1 - v}{\tau}\,\frac{d\langle T_+^{q}\rangle}{dv} \\
-q\langle T_-^{q-1}\rangle &= \frac{1}{2\tau_{\rm corr}}\left(\langle T_+^{q}\rangle - \langle T_-^{q}\rangle\right) - \frac{\sigma_1 + v}{\tau}\,\frac{d\langle T_-^{q}\rangle}{dv}.
\end{aligned} \tag{6.7}$$
Notice that here we use the original variable σ₁ rather than the scaled one σ = σ₁/τ (see equation 5.2). In addition to these equations, boundary conditions are required to determine the solution. The threshold mechanism is still the same, so applying equation 5.8, we obtain

$$\langle T_+^{q}\rangle(v = v_\theta) = 0, \tag{6.8}$$
which means that a spike must be emitted with no delay when v reaches threshold after a positive fluctuation. Note also that the model can produce spikes only when σ₁ is larger than the threshold value, because the maximum value that v may attain is precisely σ₁. For the second boundary condition, there are two possibilities. First, consider the case v_θ < −σ₁. Since the input is binary, the voltage is driven toward either σ₁ or −σ₁, depending on the value of Z. If the threshold is lower than −σ₁, then both Z = +1 and Z = −1 may trigger a spike. Therefore,

$$\langle T_-^{q}\rangle(v = v_\theta) = 0, \tag{6.9}$$
which means that a spike must also be emitted without delay when v reaches threshold after a negative fluctuation. In the second case, σ₁ > v_θ > −σ₁; when Z = −1, the voltage tends to go below threshold, so a spike can only be triggered by Z = +1. What is the second boundary condition then? This case is much more subtle. To obtain the new condition, first rewrite the bottom expression in equations 6.7 as follows:

$$\frac{d\langle T_-^{q}\rangle}{dv} = \frac{\tau}{\sigma_1 + v}\left(q\langle T_-^{q-1}\rangle + \frac{1}{2\tau_{\rm corr}}\left(\langle T_+^{q}\rangle - \langle T_-^{q}\rangle\right)\right). \tag{6.10}$$
Now notice that the derivative has a singularity at v = −σ₁. However, the excursion toward threshold may start at any value below it, including −σ₁, so for any q, the derivative on the left side should be a continuous function of v. In other words, the derivative should not be allowed to diverge. Continuity requires that the limits obtained by approaching from the left and right be the same, that is,

$$\lim_{v\to -\sigma_1^{+}} \frac{d\langle T_-^{q}\rangle}{dv} = \lim_{v\to -\sigma_1^{-}} \frac{d\langle T_-^{q}\rangle}{dv}. \tag{6.11}$$

Expanding ⟨T₊^q⟩ and ⟨T₋^q⟩ in first-order Taylor series around −σ₁ and using the above expression leads to the sought boundary condition,

$$\langle T_-^{q}\rangle(v = -\sigma_1) = \langle T_+^{q}\rangle(v = -\sigma_1) + 2\tau_{\rm corr}\,q\,\langle T_-^{q-1}\rangle(v = -\sigma_1). \tag{6.12}$$
Setting q = 1 in equations 6.7, we obtain the differential equations for the mean first passage time,

$$\begin{aligned}
-1 &= -\frac{1}{2\tau_{\rm corr}}\left(\langle T_+\rangle - \langle T_-\rangle\right) + \frac{\sigma_1 - v}{\tau}\,\frac{d\langle T_+\rangle}{dv} \\
-1 &= \frac{1}{2\tau_{\rm corr}}\left(\langle T_+\rangle - \langle T_-\rangle\right) - \frac{\sigma_1 + v}{\tau}\,\frac{d\langle T_-\rangle}{dv},
\end{aligned} \tag{6.13}$$

with boundary conditions

$$\langle T_+\rangle(v = v_\theta) = 0, \qquad \langle T_-\rangle(v = -\sigma_1) = \langle T_+\rangle(v = -\sigma_1) + 2\tau_{\rm corr}. \tag{6.14}$$
In this case, the second boundary condition has an intuitive explanation. Suppose a spike has just been produced, which necessarily requires $Z = +1$,
and the voltage is reset to $v_{\rm reset} = -\sigma_1$. Now suppose that in the next time step, $Z = -1$, so the expected time until the next spike is $\langle T_-\rangle$. The voltage will not change until $Z$ switches back to $+1$, because $Z = -1$ drives it precisely toward $-\sigma_1$. The moment $Z$ becomes positive, the expected time to reach threshold becomes $\langle T_+\rangle$ by definition. Therefore, the expected time to reach threshold starting from $Z = -1$ is equal to the expected time starting from $Z = +1$ plus the time one must wait for $Z$ to switch back to $+1$, which is $2\tau_{\rm corr}$; that is what the boundary condition says. This happens only at $v = -\sigma_1$ because it is only at that point that $v$ stays constant during the wait.

Setting $q = 2$, we obtain the differential equations for the second moment of $T$,

$$-2\langle T_+\rangle = -\frac{1}{2\tau_{\rm corr}}\bigl(\langle T_+^{\,2}\rangle - \langle T_-^{\,2}\rangle\bigr) + \frac{\sigma_1 - v}{\tau}\,\frac{d\langle T_+^{\,2}\rangle}{dv}$$
$$-2\langle T_-\rangle = \frac{1}{2\tau_{\rm corr}}\bigl(\langle T_+^{\,2}\rangle - \langle T_-^{\,2}\rangle\bigr) - \frac{\sigma_1 + v}{\tau}\,\frac{d\langle T_-^{\,2}\rangle}{dv}, \qquad (6.15)$$

with boundary conditions

$$\langle T_+^{\,2}\rangle(v = v_\theta) = 0$$
$$\langle T_-^{\,2}\rangle(v = -\sigma_1) = \langle T_+^{\,2}\rangle(v = -\sigma_1) + 4\tau_{\rm corr}\,\langle T_+\rangle(v = -\sigma_1) + 8\tau_{\rm corr}^2. \qquad (6.16)$$
Figure 9 shows voltage traces and spikes produced by the model with leak driven by a correlated binary input. The responses of the leaky integrator here are similar to those of the nonleaky model shown in Figure 4, but note that the voltage traces are now composed of piecewise exponential curves. In Figure 9, the neuron can spike only when $Z$ is positive. In Figures 9a through 9c, the correlation time is $\tau_{\rm corr} = 1$ ms, and on average the input changes sign every 2 ms (see Figure 9a, lower trace). When $\tau_{\rm corr}$ is increased to 5 ms, the changes in sign occur approximately every 10 ms (see Figure 9d, lower trace), producing a large increase in mean firing rate and a smaller increase in $CV_{\rm ISI}$. As with the nonleaky model, long correlation times generate many interspike intervals of minimum length, which in this case is 8.5 ms. For the leaky integrator, this minimum separation between spikes is equal to $\tau \log\bigl((\sigma_1 - v_{\rm reset})/(\sigma_1 - v_\theta)\bigr)$, which is just equation 6.5 with $\sigma_1$ instead of $X_0$. Analytic solutions for the mean firing rate and the $CV_{\rm ISI}$ of the leaky model driven by correlated noise can be obtained under some circumstances. These are derived below.
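The minimum-interval formula quoted above is easy to check numerically. The sketch below (a few lines of Python; the variable names are ours) uses the simulation parameters $\mu_0 = 0.5$, $\sigma_1 = 1$, $\tau = 10$ ms, $V_\theta = 1$, and $V_{\rm reset} = 1/3$, with the shifted variable $v = V - \mu_0$:

```python
import math

# Minimum interspike interval of the leaky integrator driven by binary noise:
# tau * log((sigma1 - v_reset) / (sigma1 - v_theta)), in the shifted variable
# v = V - mu0, for the parameter set used in Figure 9.
tau, sigma1, mu0 = 10.0, 1.0, 0.5
v_theta = 1.0 - mu0          # shifted threshold
v_reset = 1.0 / 3.0 - mu0    # shifted reset value
t_min = tau * math.log((sigma1 - v_reset) / (sigma1 - v_theta))
print(round(t_min, 2))       # about 8.47 ms, consistent with the 8.5 ms quoted
```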
6.2 Solutions for $v_\theta > -\sigma_1$. First, consider the mean of $T$. Solutions in a closed form are not evident, but series expansions can be used. It may be verified by direct substitution that the following expressions satisfy differential equations 6.13 and their corresponding boundary conditions 6.14,

$$\langle T_+\rangle = \sum_{j=1}^{\infty} a_j \Bigl( (v_\theta + \sigma_1)^j - (v + \sigma_1)^j \Bigr)$$
$$\langle T_-\rangle = 2\tau_{\rm corr} + \sum_{j=1}^{\infty} a_j (v_\theta + \sigma_1)^j - \sum_{j=1}^{\infty} b_j (v + \sigma_1)^j, \qquad (6.17)$$
where the coefficients $a$ and $b$ are given by the following recurrence relations,

$$a_1 = \frac{\tau}{\sigma_1}$$
$$a_{j+1} = \frac{a_j}{\sigma_1}\,\frac{j}{j+1}\,\frac{\tau + j\tau_{\rm corr}}{\tau + 2j\tau_{\rm corr}}$$
$$b_j = a_j\,\frac{\tau}{\tau + 2j\tau_{\rm corr}}. \qquad (6.18)$$
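The series 6.17 with the recurrences 6.18 can be summed numerically. Since a spike requires $Z = +1$, the mean interspike interval is $\langle T_+\rangle$ evaluated at the reset value. The sketch below (Python; the helper name, stopping rule, and tolerance are ours) does this for the parameters of Figures 9a through 9c:

```python
def mean_isi_series(tau=10.0, tau_corr=1.0, sigma1=1.0, mu0=0.5,
                    V_theta=1.0, V_reset=1.0 / 3.0, tol=1e-10, max_terms=5000):
    """Sum the series 6.17 for <T+> at v = v_reset using recurrences 6.18."""
    v_th = V_theta - mu0        # shifted variables, v = V - mu0
    v_re = V_reset - mu0
    a = tau / sigma1            # a_1
    p_th, p_re = 1.0, 1.0       # running powers (v_theta+sigma1)^j, (v_reset+sigma1)^j
    total = 0.0
    for j in range(1, max_terms):
        p_th *= v_th + sigma1
        p_re *= v_re + sigma1
        term = a * (p_th - p_re)
        total += term
        if abs(term) < tol:
            break
        # recurrence for a_{j+1}, equation 6.18
        a *= (1.0 / sigma1) * (j / (j + 1)) * (tau + j * tau_corr) / (tau + 2 * j * tau_corr)
    return total

# The result lands close to the mean interval <T> = 103 ms reported in Figure 9b.
print(round(mean_isi_series(), 1))
```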
One of the limitations of series solutions is that they are valid only within a restricted range, known as the radius of convergence $r$. This radius can be found, for instance, by d'Alembert's ratio test,

$$\lim_{k\to\infty} \frac{S_{k+1}}{S_k} = r, \qquad (6.19)$$
where $S_k$ represents term $k$ of the series (Jeffrey, 1995). The series converges for $r < 1$ and diverges for $r > 1$. Applying this test to the series

$$\sum_{j=1}^{\infty} a_j (v + \sigma_1)^j, \qquad (6.20)$$

we find two conditions for convergence:

$$v < \sigma_1, \quad \text{for } v > -\sigma_1$$
$$|v| < 3\sigma_1, \quad \text{for } v < -\sigma_1. \qquad (6.21)$$
Note from equations 6.17 that these constraints must be satisfied for both $v = v_{\rm reset}$ and $v = v_\theta$, but in practice the limitation is on $v_{\rm reset}$.

Figure 9: Facing page. Responses of the leaky integrate-and-fire model driven by correlated, binary noise. Same format as in Figure 4. (a) Sample voltage and input time courses. Note that the original variable $V$ (and not $v$) is shown. Traces represent 50 ms of simulation time. The bottom trace shows the total input $\mu_0 + Z\sigma_1$ at each time step, where $Z$ may be either $+1$ or $-1$. Parameters in equation 6.6 were $\mu_0 = 0.5$, $\sigma_1 = 1$, $\tau = 10$ ms; the correlation time was $\tau_{\rm corr} = 1$ ms. (b) Interspike interval histogram, with $\langle T\rangle = 103$ ms and $CV_{\rm ISI} = 0.94$. (c) Spike raster showing 6 seconds of continuous simulation time; each line represents 1 second. The same parameters were used in a through c. Panels d through f have the same formats as the corresponding panels above, except that $\tau_{\rm corr} = 5$ ms. In e, the y-axis is truncated; 43% of the interspike intervals fall in the bin centered at 8 ms. In this case, $\langle T\rangle = 31$ ms, $CV_{\rm ISI} = 1.15$. As in the model without leak, the increase in correlation time causes a large increase in firing rate and a smaller but still considerable increase in variability.

The first condition is always satisfied because the maximum value that the voltage can reach is $\sigma_1$; the second one means that the reset value cannot be set too negative (below $-3\sigma_1$). As will be discussed shortly, equations 6.17 and 6.18 provide an analytic solution that is valid within a fairly large range of parameters.

Now consider the second moment of $T$. The following series satisfy equations 6.15 with boundary conditions 6.16,

$$\langle T_+^{\,2}\rangle = \sum_{j=1}^{\infty} c_j \Bigl( (v_\theta + \sigma_1)^j - (v + \sigma_1)^j \Bigr)$$
$$\langle T_-^{\,2}\rangle = 4\tau_{\rm corr}\,\langle T_+\rangle(v = -\sigma_1) + 8\tau_{\rm corr}^2 + \sum_{j=1}^{\infty} c_j (v_\theta + \sigma_1)^j - \sum_{j=1}^{\infty} d_j (v + \sigma_1)^j, \qquad (6.22)$$
where the coefficients $c$ and $d$ are given by

$$c_1 = \frac{2\tau}{\sigma_1}\Bigl(\tau_{\rm corr} + \langle T_+\rangle(v = -\sigma_1)\Bigr)$$
$$c_{j+1} = \frac{1}{\sigma_1}\,\frac{j}{j+1}\,\frac{\tau + j\tau_{\rm corr}}{\tau + 2j\tau_{\rm corr}}\, c_j - \frac{a_j\,\tau}{\sigma_1\,(j+1)}\left[ 1 + \left(\frac{\tau}{\tau + 2j\tau_{\rm corr}}\right)^{2} \right]$$
$$d_j = \bigl(c_j + 4\tau_{\rm corr}\, b_j\bigr)\,\frac{\tau}{\tau + 2j\tau_{\rm corr}}. \qquad (6.23)$$
These series solutions are obviously less transparent than the closed-form solutions found for the model without leak. However, a couple of interesting observations can be made. First, notice the behavior of $\langle T_+\rangle$ as $\tau_{\rm corr}$ tends to infinity: all the coefficients $a$ become independent of $\tau_{\rm corr}$ and remain finite. This means that $\langle T_+\rangle$ saturates as a function of $\tau_{\rm corr}$. On the other hand, as expected, $\langle T_-\rangle$ tends to infinity, because of the leading term $2\tau_{\rm corr}$ in equation 6.17. In contrast to the firing rate, the variability of the evoked spike trains does not saturate. Again, the reason is that as $\tau_{\rm corr}$ increases, longer stretches of time are found during which no spikes are produced; these are the times during which $Z = -1$. The divergence of the $CV_{\rm ISI}$ is also apparent from the equations above: note that $c_1$ contains a term proportional to $\tau_{\rm corr}$. This and the fact that the rate saturates make the $CV_{\rm ISI}$ grow without bound.

6.3 Solutions for $v_\theta < -\sigma_1$. When $v_\theta$ is below $-\sigma_1$, the neuron can fire even if $Z = -1$, so boundary conditions 6.8 and 6.9 should be used. Solutions in terms of series expansions can be obtained in this case too. For the first moment of $T$, these have the form

$$\langle T_+\rangle = \sum_{j=1}^{\infty} a_j (v_\theta - v)^j$$
$$\langle T_-\rangle = \sum_{j=1}^{\infty} b_j (v_\theta - v)^j. \qquad (6.24)$$
Unfortunately, these solutions turn out to be of little use because they converge for a very limited range of parameters. When they do converge, the firing rate is typically very high, at which point the difference between $\langle T_+\rangle$ and $\langle T_-\rangle$ is minimal. The solutions for the second moment of $T$ have the same problem. Therefore, these expressions will be omitted, along with the corresponding recurrence relations for this case. From the simulations, we have observed that when $v_\theta < -\sigma_1$, the input fluctuations have a minimal impact on the mean firing rate. In this case, the mean interspike interval $\langle T\rangle$ is well approximated by the interval expected just from the average drive, which is zero (or $\mu_0$ in the $V$ representation); that is,

$$\langle T\rangle \approx \tau \log\left(\frac{v_{\rm reset}}{v_\theta}\right), \qquad (6.25)$$
which is equation 6.5 with $X_0 = 0$.

Figures 10 and 11 plot the responses of the model with leak for various combinations of $\mu_0$ and $\sigma_1$, using the same formats of Figures 6 and 7. Again the three sets of curves correspond to $\tau_{\rm corr}$ equal to 1, 3, and 10 ms, with longer correlation times producing higher values of the mean rate and $CV_{\rm ISI}$. In general, these figures show the same trends found for the model without leak: correlations increase both the firing rate and the variability of the responses for almost any fixed combination of $\mu$ and $\sigma$. The only exception occurs near the point $v_\theta = -\sigma_1$, which marks the transition between the two types of solution. Right below this point, the rate drops slightly as the correlation time increases. This can be seen in Figures 11a and 11b, where the transitions occur at $\mu_0 = 1.2$ and $\mu_0 = 1.5$, respectively. Other aspects of the responses are also slightly different for the two models, but the same kinds of regimes can be observed. For instance, compare Figures 6c and 10c; the rate in the leaky model does not stay constant for small values of $\sigma$, but the rise in $CV_{\rm ISI}$ is still quite steep. Also, in Figure 11, the firing rate is zero when $\mu_0 + \sigma_1$ is less than 1 (which is the threshold for $V$), so when the rates are low, the curves are clearly different. But notice how in Figure 11a the $CV_{\rm ISI}$ decreases for large values of $\mu_0$. In this regime, the variability saturates, as for the model without leak, because regardless of $\tau_{\rm corr}$ there is a minimum firing rate.

The responses of the leaky integrate-and-fire neuron as functions of $\tau_{\rm corr}$ are plotted in Figure 12. Series solutions were used in all cases except for the combination $\sigma_1 = 0.3$, $\mu_0 = 1.31$, for which simulation results are shown (dots). This is the case in which spikes can be produced after positive or negative $Z$ and where the $CV_{\rm ISI}$ does not diverge. Note the overall similarity between the curves in Figures 12 and 8.
Figure 10: Mean firing rate and coefficient of variation for the leaky integrate-and-fire neuron driven by correlated noise. Continuous lines are analytic expressions, and dots are results from computer simulations using a binary input. Results are shown as functions of $\sigma_1$ for the three values of $\mu_0$ indicated in the top-left corner of each panel ($\mu_0 = 0$, $0.5$, and $1.005$ in a, b, and c, respectively). As in Figures 6 and 7, the three curves in each graph correspond to $\tau_{\rm corr} = 1$ (lower curves), $\tau_{\rm corr} = 3$ (middle curves), and $\tau_{\rm corr} = 10$ ms (upper curves). The continuous lines were obtained from equations 6.17 and 6.22, which apply to $|v_{\rm reset}| < 3\sigma_1$. This condition is not satisfied in c for $\sigma_1$ below 0.23, so the lines start above that value. Both firing rate and variability increase with correlation time.
7 Discussion
We have analyzed two simple integrate-and-fire model neurons whose responses are driven by a temporally correlated, stochastic input. Assuming that the input samples are binary and given their mean, variance, and correlation time, we were able to find analytic solutions for the first two moments of $T$, which is the time that it takes for the voltage to go from the starting value to threshold, at which point a spike is produced. We found that when other parameters were kept fixed, the correlation time tended to increase both the mean firing rate of the model neurons (the inverse of $T$) and the variability of the output spike trains, but the effects on these quantities differed in some respects. As the correlation time increased from zero to infinity, the firing rates always approached a finite limit value. In contrast, the $CV_{\rm ISI}$ approached a finite limit only when the input mean (or drift) was
Figure 11: As in Figure 7, but the mean firing rate and coefficient of variation are plotted as functions of $\mu_0$ for the three values of $\sigma_1$ indicated ($\sigma_1 = 0.2$, $0.5$, and $1$ in a, b, and c, respectively). The three curves in each graph again correspond to $\tau_{\rm corr} = 1$ (lower curves), $\tau_{\rm corr} = 3$ (middle curves), and $\tau_{\rm corr} = 10$ ms (upper curves). Continuous lines are drawn in b and c only and correspond to equations 6.17 and 6.22.
large enough; otherwise, it increased without bound. In addition, the increase in firing rate as a function of correlation time depended strongly on the relative values of $\mu$ and $\sigma$, with higher $\mu$ making the difference smaller.

Key to obtaining the analytic results was the use of a binary input. It is thus reasonable to ask whether the trends discussed above depend critically on the discrete nature of the binary variable. Gaussian white noise can be properly approximated with a binary variable because of the central limit theorem. As a consequence, the results for the model without leak were identical when the input was uncorrelated and either gaussian or binary. With correlated noise, the input current fluctuates randomly between a positive and a negative state, and the dwell time in each state is distributed exponentially. If instead of a binary variable a gaussian variable is used, with identical statistics for its sign, the results differ somewhat but are qualitatively the same for the two models: for example, the rate still increases as a function of correlation time, and the same asymptotic behaviors are seen (data not shown).

Here, we considered the response of the model neuron as a function of the total integrated input, but a real neuron actually responds to a set of incoming spike trains. Ideally, one would like to determine how the statistics
Figure 12: Responses of the leaky integrate-and-fire neuron as functions of the correlation time $\tau_{\rm corr}$. Analytic results are shown as continuous lines; dots are from simulations. Plots on the top and bottom rows show the mean firing rate and $CV_{\rm ISI}$, respectively. (a) The four curves shown in each plot correspond to four values of $\sigma_1$: 0.55, 0.8, 1.1, and 1.4, with thicker lines corresponding to higher values. For all these curves, $\mu_0 = 0.5$, as indicated. (b) The four traces in each plot correspond to four values of $\mu_0$: 0.75, 0.9, 1.15, and 1.31 (dots), with thicker lines corresponding to higher values. For these curves, $\sigma_1 = 0.3$, as indicated. As the correlation time increases, the firing rate tends to an asymptotic value. In contrast, the $CV_{\rm ISI}$ always diverges, except when the threshold $V_\theta$ (equal to 1) is below $\mu_0 - \sigma_1$; this case corresponds to the dots in b.
of the hundreds or thousands of incoming spike trains relate to the mean, variance, and correlation time of the integrated input current, but in general this is difficult. Methods based on a population density analysis have been applied to similar problems (Nykamp & Tranchina, 2000; Haskell et al., 2001). These methods are also computationally intensive, but eventually may be optimized to compute efficiently the spiking statistics of more accurate models. In a previous study (Salinas & Sejnowski, 2000), we investigated how the firing rates and correlations of multiple input spike trains affected the mean and standard deviation of the total input (and ultimately the statistics of the response), but this required certain approximations. In particular, we implicitly assumed that the correlation time of the resulting
integrated input was negligible. This is a good approximation when the correlations between input spike trains have a timescale shorter than the synaptic time constants, so that the latter determine the correlation time of the integrated input (Svirskis & Rinzel, 2000). But this may not always be true, as discussed below.

In the cortex, correlations between input spike trains are to be expected simply because of the well-known convergent and divergent patterns of connectivity (Braitenberg & Schüz, 1997): neurons that are interconnected or receive common input should be correlated. This has been verified through neurophysiological experiments in which pairs of neurons are recorded simultaneously and their cross-correlation histograms are calculated. The widths of these measured cross-correlograms may go from a few to several hundred milliseconds, for both neighboring neurons in a single cortical area (Gochin, Miller, Gross, & Gerstein, 1991; Brosch & Schreiner, 1999; Steinmetz et al., 2000) and neurons in separate areas (Nelson et al., 1992). Thus, such correlation times may be much longer than the timescales of common AMPA and GABA-A synapses (Destexhe, Mainen, & Sejnowski, 1998). Neurons typically fire action potentials separated by time intervals of the same orders of magnitude (a few to hundreds of milliseconds), so under normal circumstances, there may be a large overlap between the timescales of neuronal firing and input correlation. This is precisely the situation in which correlations should not be ignored; if in this case the inputs to a neuron are assumed to be independent, a crucial component of its dynamics may be overlooked. Although not explicitly included here, concerted oscillations between inputs may have a dramatic impact on a postsynaptic response too (Salinas & Sejnowski, 2000). This alternate form of correlated activity is also ubiquitous among cortical circuits (Singer & Gray, 1995; Fries et al., 2001).
The results support the idea that input correlations represent a major factor in determining the variability of a neuron (Gur, Beylin, & Snodderly, 1997; Stevens & Zador, 1998; Destexhe & Paré, 2000; Salinas & Sejnowski, 2000), especially in the case of a $CV_{\rm ISI}$ higher than 1 (Svirskis & Rinzel, 2000). However, integrative properties (Softky & Koch, 1993; Shadlen & Newsome, 1994) and other intrinsic cellular mechanisms (Wilbur & Rinzel, 1983; Troyer & Miller, 1997; Gutkin & Ermentrout, 1998) play important roles too. In many respects, the model neurons used here are extremely simple: they do not include an extended morphology or additional intrinsic, voltage-dependent currents that may greatly expand their dynamical repertoire. However, the results may provide some intuition on the effects produced by the correlation timescale of the input. For instance, a given firing rate can be obtained with either high or low output variability depending on the relative values of $\mu$, $\sigma$, and $\tau_{\rm corr}$. This distinction between regular and irregular firing modes may be useful even under more realistic conditions (Bell, Mainen, Tsodyks, & Sejnowski, 1995; Troyer & Miller, 1997; Hô & Destexhe, 2000; Salinas & Sejnowski, 2000).
A challenge lying ahead is extending the techniques used here to network interactions (Nykamp & Tranchina, 2000; Haskell et al., 2001). This requires solving coupled equations for the firing probability densities of multiple, interconnected neurons. This problem is not trivial mathematically or computationally, but it may lead to a better understanding of network dynamics based on a solid statistical foundation.

Appendix A: Generating Correlated Binary Samples
Here we describe how the correlated binary input with zero mean, unit variance, and exponential correlation function was implemented in the simulations. The sequence of correlated binary samples was generated through the following recipe:

$$Z_{i+1} = Z_i\,\mathrm{sign}(P_s - u_{i+1}), \qquad ({\rm A.1})$$
where $Z_i$ is the sample at the $i$th time step, $P_s = 1 - P_c = 1 - \Delta t/(2\tau_{\rm corr})$, and $u$ is a random number, independent of $Z$, uniformly distributed between 0 and 1 (Press, Flannery, Teukolsky, & Vetterling, 1992). The initial value $Z_0$ may be either $+1$ or $-1$. Note that the probability that $\mathrm{sign}(P_s - u)$ is positive equals $P_s$; therefore,

$$P(Z_{i+1} = +1) = P(Z_i = +1)\,P_s + P(Z_i = -1)\,P_c, \qquad ({\rm A.2})$$
because $u$ is chosen independent of the $Z$ samples. Assuming that $P(Z_{i+1} = +1) = P(Z_i = +1)$, it follows from the above relation that $P(Z_i = +1) = 1/2$, and therefore that the mean of the sequence is zero. Showing that the variance of $Z$ equals one is trivial: simply square both sides of equation A.1 to find that $Z^2$ is always equal to one. Regarding the correlation function, the main result is that

$$\langle Z_i Z_{i+j}\rangle = (P_s - P_c)^j, \qquad j \ge 0. \qquad ({\rm A.3})$$
This can be seen by induction. First compute the correlation between two consecutive samples by multiplying equation A.1 by $Z_i$, so that

$$\langle Z_i Z_{i+1}\rangle = \langle Z_i^2\,\mathrm{sign}(P_s - u_{i+1})\rangle = P_s - P_c. \qquad ({\rm A.4})$$
The last step follows because $\mathrm{sign}(P_s - u)$ is positive a fraction $P_s$ of the time. This expression shows that equation A.3 is valid for $j = 1$; now assume that it is valid for indices $i$ and $j$, and compute the correlation between samples $i$ and $i+j+1$:

$$\langle Z_i Z_{i+j+1}\rangle = \langle Z_i Z_{i+j}\,\mathrm{sign}(P_s - u_{i+j+1})\rangle = (P_s - P_c)^{j+1}. \qquad ({\rm A.5})$$
Therefore, equation A.3 is true for all $j \ge 0$. Analogous results can be obtained for $j < 0$ as well. Finally, note that equation A.3 means that the correlation between $Z$ samples is an exponential. Furthermore,

$$(P_s - P_c)^j = \left(1 - \frac{\Delta t}{\tau_{\rm corr}}\right)^{t/\Delta t}, \qquad ({\rm A.6})$$

because $P_s = 1 - P_c$ and because the time lag $t$ is equal to $j\Delta t$. In the limit when $\Delta t \to 0$, the above expression becomes an exponential function equal to equation 3.5.

Appendix B: Generating Correlated Gaussian Samples
Here we describe how a correlated gaussian input with zero mean, unit variance, and exponential correlation function can be implemented in computer simulations. The sequence of correlated gaussian samples was generated through the following recipe:

$$W_{i+1} = \rho\, W_i + \sqrt{1 - \rho^2}\; g_{i+1}, \qquad ({\rm B.1})$$

where

$$\rho = \exp\left(-\frac{\Delta t}{\tau_{\rm corr}}\right), \qquad ({\rm B.2})$$
and $g$ is a random number independent of $W$, drawn from a gaussian distribution with zero mean and unit variance (Press et al., 1992). The $g$ samples are uncorrelated. Applying equation B.1 iteratively and setting the initial value $W_0$ equal to $g_0$, we find

$$W_N = \rho^{N} g_0 + \sqrt{1 - \rho^2}\,\sum_{i=0}^{N-1} \rho^{\,i}\, g_{N-i}. \qquad ({\rm B.3})$$
Averaging over the distribution of $g$ on both sides implies that the mean of $W$ equals zero, because all $g$ samples have an expected value of zero. By squaring the above equation and then averaging, we obtain

$$\langle W_N^2\rangle = \rho^{2N}\,\langle g_0^2\rangle + (1 - \rho^2)\sum_{i,j=0}^{N-1} \rho^{\,i+j}\,\langle g_{N-i}\, g_{N-j}\rangle = 1. \qquad ({\rm B.4})$$

In the last step, we used the fact that the $g$ samples are uncorrelated, so that $\langle g_{N-i}\, g_{N-j}\rangle = \delta_{ij}$, where $\delta_{ij}$ is equal to 1 whenever $i = j$ and is 0 otherwise. Also, we used the standard result for the sum of $N$ terms of a geometric series (Jeffrey, 1995).
The correlation function can be determined by induction, as in the case of binary samples. To prove that

$$\langle W_i W_{i+j}\rangle = \rho^{\,j}, \qquad j \ge 0, \qquad ({\rm B.5})$$
first compute the correlation between two consecutive samples by multiplying equation B.1 by $W_i$, so that

$$\langle W_i W_{i+1}\rangle = \rho\,\langle W_i^2\rangle + \sqrt{1 - \rho^2}\,\langle W_i\, g_{i+1}\rangle = \rho. \qquad ({\rm B.6})$$
The last step follows again because $g$ is independent of the $W$ samples. This shows that equation B.5 is valid for $j = 1$. Now assume that it is valid for samples $i$ and $i+j$, and compute the correlation for samples $i$ and $i+j+1$:

$$\langle W_i W_{i+j+1}\rangle = \rho\,\langle W_i W_{i+j}\rangle + \sqrt{1 - \rho^2}\,\langle W_i\, g_{i+j+1}\rangle = \rho^{\,j+1}. \qquad ({\rm B.7})$$
Therefore, equation B.5 is true for all $j \ge 0$. The analogous results for $j < 0$ can be obtained similarly. Recalling the definition of $\rho$ and that $t = j\Delta t$, the right-hand side of equation B.5 becomes

$$\exp\left(-\frac{|t|}{\tau_{\rm corr}}\right). \qquad ({\rm B.8})$$
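Equations B.1 and B.2 define a standard first-order autoregressive process. A minimal sketch (Python; the function name, sample size, and tolerances are ours) that generates such samples and checks the unit variance and the lag-one correlation $\rho = e^{-\Delta t/\tau_{\rm corr}}$:

```python
import math
import random

def gaussian_noise(n, dt=0.1, tau_corr=1.0, seed=0):
    """Correlated gaussian samples via recurrence B.1 with rho from B.2."""
    rng = random.Random(seed)
    rho = math.exp(-dt / tau_corr)
    amp = math.sqrt(1.0 - rho * rho)
    w = [rng.gauss(0.0, 1.0)]            # W_0 = g_0, so the variance is 1 from the start
    for _ in range(n - 1):
        w.append(rho * w[-1] + amp * rng.gauss(0.0, 1.0))
    return w

w = gaussian_noise(200_000)
var = sum(x * x for x in w) / len(w)
lag1 = sum(a * b for a, b in zip(w, w[1:])) / (len(w) - 1)
# var is close to 1 and lag1 is close to exp(-0.1) = rho; setting rho = 0
# recovers uncorrelated samples
```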
Finally, note that uncorrelated samples may be produced by setting $\rho = 0$ in equation B.1, which is useful for switching between correlated and uncorrelated noise models in the simulations.

Appendix C: Simulation Methods
Performing computer simulations of the models described in the main text is straightforward. The key is that the discretized equations, 4.1, 5.1, and 6.6, can be included literally in the simulation software. To run a simulation, first, values for $\mu_0$, $\sigma_0$ (or $\sigma_1$), $\tau$, and $\tau_{\rm corr}$ are chosen, as well as a time step $\Delta t$. At the start of each spike cycle, $T$ is set to zero and $V$ is set equal to $V_{\rm reset}$. Thereafter, on each time step, $T$ is increased by $\Delta t$, a new value of $Z$ is generated according to equation A.1, and $V$ is updated according to the discretized equation for the model being simulated, for instance, equation 5.1 for the nonleaky integrate-and-fire neuron driven by correlated binary noise. For models without leak, the barrier needs to be checked on every step: if after the update $V$ becomes negative, then it is set to zero. The updating continues until threshold is exceeded, in which case a spike is recorded, the value of $T$ is saved, and a new spike cycle is started. Simulations in which the model neuron was driven by a gaussian input proceeded exactly in the same way as with a binary input, except that equation B.1 was used to generate correlated gaussian numbers $W$, and $Z$ in the discretized equations was replaced by $W$.
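The procedure above can be sketched in a few lines for the leaky model. This is our own minimal reimplementation, not the authors' Matlab code: the function name is ours, and the update rule is an assumed forward-Euler discretization of equation 6.6 (no lower barrier is needed for the model with leak):

```python
import random

def simulate_isis(n_spikes=500, dt=0.1, tau=10.0, tau_corr=1.0, mu0=0.5,
                  sigma1=1.0, V_theta=1.0, V_reset=1.0 / 3.0, seed=1):
    """Monte Carlo interspike intervals for the leaky integrate-and-fire model
    driven by correlated binary noise (equation A.1)."""
    rng = random.Random(seed)
    p_s = 1.0 - dt / (2.0 * tau_corr)   # probability that Z keeps its sign
    z, V, T, isis = 1, V_reset, 0.0, []
    while len(isis) < n_spikes:
        T += dt
        if rng.random() >= p_s:          # new Z via equation A.1
            z = -z
        V += (dt / tau) * (mu0 + sigma1 * z - V)   # assumed discretization of eq. 6.6
        if V >= V_theta:                 # spike: save T, start a new cycle
            isis.append(T)
            V, T = V_reset, 0.0
    return isis

isis = simulate_isis()
mean_T = sum(isis) / len(isis)
# mean_T should land near the <T> = 103 ms of Figure 9b for these parameters
```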
In all simulations presented here, we used $\tau = 10$ ms, $V_\theta = 1$, and $V_{\rm reset} = 1/3$ (in arbitrary units). The integration time step $\Delta t$ varied between 0.1 and 0.001 ms. Average values of $T$ and $T^2$ for each condition tested were computed after 1,000 spike cycles for uncorrelated noise and 2,000 cycles for correlated noise. All interspike interval histograms shown include about 4,500 spikes. Matlab scripts that implement the models presented here, along with functions that evaluate the analytic solutions, may be found on-line at http://www.cnl.salk.edu/~emilio/code.html.

Acknowledgments
This research was supported by the Howard Hughes Medical Institute. We thank Bill Bialek for comments on earlier results, motivating this work. We also thank Dan Tranchina for many corrections and invaluable guidance in the analysis and Larry Abbott and Paul Tiesinga for helpful comments.

References

Bell, A. J., Mainen, Z. F., Tsodyks, M., & Sejnowski, T. J. (1995). "Balancing" of conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego, CA: Institute for Neural Computation, University of California.
Berg, H. C. (1993). Random walks in biology. Princeton, NJ: Princeton University Press.
Braitenberg, V., & Schüz, A. (1997). Cortex: Statistics and geometry of neuronal connectivity. Berlin: Springer.
Brosch, M., & Schreiner, C. E. (1999). Correlations between neural discharges are related to receptive field properties in cat primary auditory cortex. Eur. J. Neurosci., 11, 3517–3530.
Burkitt, A. N., & Clark, G. M. (1999). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output. Neural Comput., 11, 871–901.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 1–25). Cambridge, MA: MIT Press.
Destexhe, A., & Paré, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32–33, 113–119.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Comput., 12, 671–692.
Feynman, R. P., Leighton, R. B., & Sands, M. (1963). The Feynman lectures on physics (Vol. 1). Reading, MA: Addison-Wesley.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Gardiner, C. W. (1985). Handbook of stochastic methods for physics, chemistry and the natural sciences. Berlin: Springer.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gochin, P. M., Miller, E. K., Gross, C. G., & Gerstein, G. L. (1991). Functional interactions among neurons in inferior temporal cortex of the awake macaque. Exp. Brain Res., 84, 505–516.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17, 2914–2920.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comput., 10, 1047–1065.
Haskell, E., Nykamp, D. Q., & Tranchina, D. (2001). Population density methods for large-scale modelling of neuronal networks with realistic synaptic kinetics: Cutting the dimension down to size. Network, 12, 141–174.
Hô, N., & Destexhe, A. (2000). Synaptic background activity enhances the responsiveness of neocortical pyramidal neurons. J. Neurophysiol., 84, 1488–1496.
Jeffrey, A. (1995). Handbook of mathematical formulas and integrals. London: Academic Press.
Kisley, M. A., & Gerstein, G. L. (1999). The continuum of operating modes for a passive model neuron. Neural Comput., 11, 1139–1154.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Nelson, J. I., Salin, P. A., Munk, M. H.-J., Arzi, M., & Bullier, J. (1992). Spatial and temporal coherence in cortico-cortical connections: A cross-correlation study in areas 17 and 18 in the cat. Vis. Neurosci., 9, 21–37.
Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comput. Neurosci., 8, 19–50.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Ricciardi, L. M. (1995). Diffusion models of neuron activity. In M. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 299–304). Cambridge, MA: MIT Press.
Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process as a model for neuronal activity. Biol. Cybern., 35, 1–9.
Ricciardi, L. M., & Sato, S. (1988). First-passage-time density and moments of the Ornstein-Uhlenbeck process. J. Appl. Prob., 25, 43–57.
Risken, H. (1996). The Fokker-Planck equation: Methods of solution and applications (2nd ed.). Berlin: Springer-Verlag.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20, 6193–6209.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Comput., 11, 935–951.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Smith, C. E. (1992). A heuristic approach to stochastic models of single neurons. In T. McKenna, J. L. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 561–588). Cambridge, MA: Academic Press.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neurosci., 1, 210–217.
Svirskis, G., & Rinzel, J. (2000). Influence of temporal correlation of synaptic input on the rate and variability of firing in neurons. Biophys. J., 79, 629–637.
Thomas, M. U. (1975). Some mean first-passage time approximations for the Ornstein-Uhlenbeck process. J. Appl. Prob., 12, 600–604.
Tiesinga, P. H. E., José, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neuron with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Troyer, T. W., & Miller, K. D.
(1997).Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Comput., 9, 971–983. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press. Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics. Uhlenbeck, G. E., & Ornstein, L. S. (1930). On the theory of Brownian motion. Phys. Rev., 36, 823–841. Van Kampen, N. G. (1981). Stochastic processes in physics and chemistry. Amsterdam: North-Holland. Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefcient of variation and bimodality in neuronal interspike interval distribution. J. Theo. Biol., 105, 345–368. Received May 8, 2001; accepted April 1, 2002.
LETTER
Communicated by Peter Földiák
Preintegration Lateral Inhibition Enhances Unsupervised Learning

M. W. Spratling
[email protected]
M. H. Johnson
[email protected]
Centre for Brain and Cognitive Development, Birkbeck College, London WC1E 7JL, U.K.

A large and influential class of neural network architectures uses postintegration lateral inhibition as a mechanism for competition. We argue that these algorithms are computationally deficient in that they fail to generate, or learn, appropriate perceptual representations under certain circumstances. An alternative neural network architecture is presented here in which nodes compete for the right to receive inputs rather than for the right to generate outputs. This form of competition, implemented through preintegration lateral inhibition, does provide appropriate coding properties and can be used to learn such representations efficiently. Furthermore, this architecture is consistent with both neuroanatomical and neurophysiological data. We thus argue that preintegration lateral inhibition has computational advantages over conventional neural network architectures while remaining equally biologically plausible.

1 Introduction
The nodes in a neural network generate activity in response to the input they receive. This response can be considered to be a representation of the input stimulus. The patterns of neural activity that constitute efficient and complete representations will vary between tasks. Therefore, for a neural network algorithm to be widely applicable, it must be capable of generating a variety of response patterns. Such a neural network needs to be able to respond to individual stimuli as well as multiple stimuli presented simultaneously, distinguish overlapping stimuli, and deal with incomplete stimuli. Many highly influential and widely used neural network algorithms meet some, but not all, of these criteria. We present a simple neural network that does meet all of these criteria. It succeeds in doing so by using a novel form of lateral inhibition. Lateral inhibition is an essential feature of many artificial, self-organizing neural networks. It provides a mechanism through which neurons compete for the right to generate a response to the current pattern of input activity.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2157–2179 (2002)
This not only makes responses more selective (in the short term), but since learning is commonly activity dependent, it also makes the receptive fields of individual nodes more distinct (in the long term). Competition thus enables nodes to represent distinct patterns within the input space. Lateral inhibition provides competition by enabling nodes to inhibit each other from generating a response. This form of competition is used in numerous neural network algorithms. It is implemented either explicitly through lateral connections between nodes (Földiák, 1989, 1990; Marshall, 1995; Oja, 1989; O'Reilly, 1998; Sanger, 1989; Sirosh & Miikkulainen, 1994; Swindale, 1996; von der Malsburg, 1973) or implicitly through a selection process that chooses the "winning" nodes (Földiák, 1991; Grossberg, 1987; Hertz, Krogh, & Palmer, 1991; Kohonen, 1997; Ritter, Martinetz, & Schulten, 1992; Rumelhart & Zipser, 1985; Wallis, 1996). Despite the large diversity of implementations, these algorithms share a common mechanism of competition. A node's success in this competition is dependent on the total strength of the stimulation it receives; nodes that compete unsuccessfully have their output activity suppressed. This class of architectures can thus be described as implementing postintegration lateral inhibition. Single-layer neural networks that make use of this competitive mechanism fail to meet all of the criteria set out above. These representational deficiencies can be overcome by using a neural network architecture in which there is inhibition of inputs rather than of outputs (see Figure 1b). In our algorithm, each node attempts to block its preferred inputs from activating other nodes. Nodes thus compete for the right to receive inputs rather than for the right to generate outputs. We describe this form of competition as preintegration lateral inhibition. Such a network can be trained using simple activity-dependent learning rules.
Since learning is a function of activation, improved response properties result in more correct learning episodes (Marshall, 1995), and hence, the advantageous coding properties that arise from preintegration lateral inhibition result in efficient, unsupervised, learning. We demonstrate the abilities of a network using preintegration lateral inhibition with the aid of two simple tasks. These tasks provide an easily understood illustration of the properties required in order to learn useful representations and have been used to demonstrate the pattern recognition abilities required by models of the human perceptual system (Földiák, 1990; Hinton, Dayan, Frey, & Neal, 1995; Marshall, 1995; Marshall & Gupta, 1998; Nigrin, 1993).

2 Method
Simple, two-node, neural networks competing via postintegration lateral inhibition and preintegration lateral inhibition are shown in Figure 1. Preintegration lateral inhibition enables each node to inhibit other nodes from responding to the same inputs. Hence, if a node is active and it has a strong synaptic weight to a certain input, then it should inhibit other nodes from re-
Figure 1: Models of lateral inhibition. Nodes are shown as large circles, excitatory synapses as small, open circles, and inhibitory synapses as small, filled circles. (a) The standard model of lateral inhibition provides competition between outputs. (b) The preintegration lateral inhibition model provides competition for inputs. (c) The preintegration lateral inhibition model with only one lateral weight shown. This lateral weight has an identical value (w12) to the afferent weight shown as a solid line.
sponding to that input. A simple implementation of this idea for a two-node network would be

$$y_1 = \sum_{i=1}^{m} \left( w_{i1} x_i - \alpha w_{i2} y_2 \right)^{+}$$

$$y_2 = \sum_{i=1}^{m} \left( w_{i2} x_i - \alpha w_{i1} y_1 \right)^{+},$$
where $y_j$ is the activation of node $j$, $w_{ij}$ is the synaptic weight from input $i$ to node $j$, $x_i$ is the activation of input $i$, $\alpha$ is a scale factor controlling the strength of lateral inhibition, and $(v)^{+}$ is the positive half-rectified value of $v$. These simultaneous equations can be solved iteratively. To help ensure that a steady-state solution is reached, it has been found useful to increase the value of $\alpha$ at each iteration gradually from an initial value of zero. Physiologically, this regime might correspond to a delay in the buildup of lateral inhibition due to additional propagation delays via inhibitory interneurons. An equivalent implementation of postintegration lateral inhibition would be

$$y_1 = \sum_{i=1}^{m} (w_{i1} x_i) - \alpha q_{21} y_2$$

$$y_2 = \sum_{i=1}^{m} (w_{i2} x_i) - \alpha q_{12} y_1,$$
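The iterative scheme for the two-node preintegration equations can be sketched as follows. The function name and the default $\alpha$ schedule (zero to four in steps of 0.25, the values reported later in the text) are our own assumptions, not the authors' implementation:

```python
import numpy as np

def preintegration_two_nodes(x, W, alpha_max=4.0, alpha_step=0.25):
    """Iteratively solve the two-node preintegration equations,
    ramping alpha up from zero to encourage a steady-state solution.

    x: input vector of length m; W: (m, 2) afferent weight matrix.
    """
    y = np.zeros(2)
    for alpha in np.arange(0.0, alpha_max + alpha_step, alpha_step):
        # each input term is rectified *before* summation
        y1 = np.sum(np.maximum(W[:, 0] * x - alpha * W[:, 1] * y[1], 0.0))
        y2 = np.sum(np.maximum(W[:, 1] * x - alpha * W[:, 0] * y[0], 0.0))
        y = np.array([y1, y2])
    return y
```

Note that the rectification is applied inside the sum, which, as discussed below, is what distinguishes this from postintegration inhibition.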
where $q_{kj}$ is the lateral weight from node $k$ to node $j$. Note that for preintegration lateral inhibition, rectification is applied to the inhibited inputs prior to summation. This is essential; otherwise, preintegration lateral inhibition would be mathematically equivalent to postintegration lateral inhibition. For example, without rectification, the equation given above for preintegration lateral inhibition to node 1 would become $y_1 = \sum_{i=1}^{m} (w_{i1} x_i - \alpha w_{i2} y_2)$, which can be rewritten as $y_1 = \sum_{i=1}^{m} w_{i1} x_i - \alpha y_2 \sum_{i=1}^{m} w_{i2}$, which is equivalent to the equation for postintegration lateral inhibition when $q_{21} = \sum_{i=1}^{m} w_{i2}$. In section 4, we suggest that preintegration lateral inhibition is achieved in the cortex via inhibitory contacts on the dendrites of excitatory cells. Due to the nonlinear response properties of dendrites (Koch, Poggio, & Torre, 1983; Mel, 1994; Shepherd & Brayton, 1987), such dendritic inhibition will have effects confined to the local region of the dendritic tree and little or no impact on excitatory inputs to other branches of the dendrite. This nonlinearity is achieved in our model by using rectification. For preintegration lateral inhibition, the magnitudes of the lateral weights are identical to the corresponding afferent weights. For example, the activation of node 2 inhibits input $i$ of node 1 with a strength weighted by $w_{i2}$. This lateral weight is identical to the afferent weight node 2 receives for input $i$ (see Figure 1c). The values of the afferent weights provide appropriate lateral weights since the objective of preintegration lateral inhibition is to allow an active node with a strong synaptic weight to a certain input to strongly inhibit other nodes from responding to that same input, while not inhibiting those inputs from which it receives weak weights. Since the lateral weights have identical values to afferent weights, it is not necessary to determine their strengths separately.
In a more biologically plausible version of the model, separate lateral weights could be learned independently. This would be possible since weights that have identical values also have identical (but reversed) pre- and postsynaptic activation values. For example, in Figure 1c, the afferent weight $w_{12}$ connects input 1 to node 2, and an identical lateral weight connects node 2 to input 1 (at node 1). Hence, by using similar activity-dependent learning rules and identical initial values, appropriate lateral weights could be learned. While the equations presented above provide a simple illustration of preintegration lateral inhibition, a more complex formulation is required for application to neural networks containing arbitrary numbers of nodes ($n$) and receiving arbitrary numbers of inputs ($m$). In a network containing more than two nodes, each input could be inhibited by multiple lateral connections. What strength of inhibition should result from multiple nodes? One possibility would be to sum the effects from all the inhibitory connections. However, simply changing the number of nodes in such a network could have a significant effect on the total inhibition. To ensure that inhibition is not a function of $n$, we inhibit each node by the maximum value of the inhibition generated by all the nodes. Hence, the activation of each
node is calculated as

$$y_j = \sum_{i=1}^{m} \left( w_{ij} x_i - \alpha \max_{\substack{k=1 \\ (k \neq j)}}^{n} \{ w_{ik} y_k \} \right)^{+}.$$
Note that nodes do not inhibit their own inputs. Early in training, nodes tend to have small, undifferentiated weights and hence small activation values. This results in weak lateral inhibition that permits all nodes to respond to each input pattern. To ensure that lateral inhibition is equally effective at the start of training as at the end, the strength of inhibition is normalized as follows: the lateral weight is divided by the maximum lateral weight originating from the inhibiting node, and the inhibiting node's activity is divided by the maximum activity of all the nodes in the network:

$$y_j = \sum_{i=1}^{m} \left( w_{ij} x_i - \alpha \max_{\substack{k=1 \\ (k \neq j)}}^{n} \left\{ \frac{w_{ik}}{\max_{l=1}^{m} \{ w_{lk} \}} \, \frac{y_k}{\max_{l=1}^{n} \{ y_l \}} \right\} \right)^{+}.$$
Normalizing the strength of lateral inhibition ensures that there is strong competition, resulting in a differentiation in activity values and more rapid activity-dependent learning. Normalization also results in the strength of inhibition being constrained to be in the range zero to one. However, this form of inhibition would have a greater effect on a stimulus presented at low contrast (using smaller values of $x_i$) than on the same stimulus presented at high contrast. Tolerance to changes in stimulus contrast is provided by modifying the equation as follows:

$$y_j = \sum_{i=1}^{m} w_{ij} x_i \left( 1 - \alpha \max_{\substack{k=1 \\ (k \neq j)}}^{n} \left\{ \frac{w_{ik}}{\max_{l=1}^{m} \{ w_{lk} \}} \, \frac{y_k}{\max_{l=1}^{n} \{ y_l \}} \right\} \right)^{+}. \tag{2.1}$$
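A direct, unoptimized sketch of iterating equation 2.1 follows. The function name, the small epsilon guard against division by zero, and the fixed $\alpha$ schedule are our own assumptions; the text also terminates the iteration early once activations reach a steady state, which this sketch omits:

```python
import numpy as np

def activate(x, W, alpha_max=4.0, alpha_step=0.25):
    """Iterate the contrast-tolerant activation rule of equation 2.1.

    x: inputs (length m); W: (m, n) afferent weights. Each input's
    inhibition comes from the maximally inhibiting rival node, with
    the lateral weight and rival activity both normalized.
    """
    m, n = W.shape
    y = np.zeros(n)
    eps = 1e-12  # guard against division by zero (our own addition)
    for alpha in np.arange(0.0, alpha_max + alpha_step, alpha_step):
        w_col_max = W.max(axis=0) + eps  # max_l w_lk for each node k
        y_max = y.max() + eps            # max_l y_l
        y_new = np.zeros(n)
        for j in range(n):
            for i in range(m):
                if n > 1:
                    inhib = max((W[i, k] / w_col_max[k]) * (y[k] / y_max)
                                for k in range(n) if k != j)
                else:
                    inhib = 0.0
                # rectified, per-input inhibited contribution
                y_new[j] += max(W[i, j] * x[i] * (1.0 - alpha * inhib), 0.0)
        y = y_new
    return y
```

As a usage example, the weights suggested in section 3.1 (0.5 and 0.5 for a node preferring ab, 0.4 each for a node preferring abc) let the ab node win the competition when pattern ab is presented.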
This formulation was used to produce all the results presented in this article. The value of $\alpha$ was increased from zero to four in steps of 0.25. Activation values generally reached a steady state at lower values of $\alpha$, in which case competition was terminated before $\alpha$ reached its maximum value. The change in the value of $\alpha$ between iterations was found to be immaterial to the final activation values, provided it was less than 0.5. Networks were trained using unsupervised learning. Weights were initially set all equal to $\frac{1}{m}$ (we describe such nodes as uncommitted). (The algorithm works equally well with randomly initialized weights.) In order to cause differentiation of the receptive fields, noise was added to the node activations at each iteration during the competition. Noise values were random numbers uniformly distributed in the range 0 to 0.001. When several
nodes with identical weights are equally activated by the current input, noise should cause a single node from this identical group to win the competition and hence respond to that input. This bifurcation of activity values requires one node to have a significantly higher activation (after adding noise) than all other nodes in the identical group, so that it can successfully inhibit the other nodes. As the number of nodes increases, it becomes less likely that a single node will receive a significantly higher noise value than the others. Hence, this noisy selection process was found to be more robust when noise was added to a subset of nodes: noise was added to each node with probability $\frac{4}{n}$. (Zero-mean gaussian-distributed noise added to the activity values of all nodes has also been used successfully.) Since the magnitude of the noise is small, it has no effect on competition once nodes have begun to learn preferences for distinct input patterns. Synaptic weights were modified at completion of the competition phase (i.e., using the final values for $y_j$ found after iteration of equation 2.1). The following learning rule was employed:
$$\Delta w_{ij} = \beta \, \frac{(x_i - \bar{x}) \left( y_j - \bar{y} \right)^{+}}{\sum_{k=1}^{m} x_k \; \sum_{k=1}^{n} y_k}, \tag{2.2}$$
where $\bar{x}$ is the mean of the input activations ($\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$), $\bar{y}$ is the mean of the output activations ($\bar{y} = \frac{1}{n} \sum_{j=1}^{n} y_j$), and $\beta$ is a parameter controlling the learning rate. Following learning, synaptic weights were clipped at zero such that $w_{ij} = (w_{ij})^{+}$ and were normalized such that $\sum_{i=1}^{m} (w_{ij})^{+} = 1$. This learning rule encourages each node to learn weights selective for a set of coactive inputs. This is achieved since when a node is more active than average, it increases its synaptic weights to active inputs and decreases its weights to inactive inputs. Hence, only sets of inputs that are consistently coactive will generate strong afferent weights. In addition, the learning rule is designed to ensure that nodes can represent stimuli that share input features in common (i.e., allow the network to represent overlapping patterns). This is achieved by rectifying the postsynaptic term of the rule so that no weight changes occur when the node is less active than average. If learning were not restricted in this way, then whenever a pattern was presented, all nodes that represented patterns with overlapping features would reduce their weights to these features. Our learning rule is similar to the instar rule (Grossberg, 1976, 1978), except that only nodes that are more active than average have their weights modified. Hence, postsynaptic activation exceeding a threshold is required before synaptic modification is enabled. The learning rule is also similar to the covariance rule (Grzywacz & Burgi, 1998; Sejnowski, 1977), except the postsynaptic term is allowed to be only greater than or equal to zero. Hence, subthreshold postsynaptic activity does not result in any changes in synaptic strength. Normalization by the total of the input and output activities helps to ensure that each stimulus contributes equally to learning and that the total weight change (across
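One application of rule 2.2, together with the zero-clipping and presynaptic normalization that follow it, can be sketched as below. The function name and the guard for all-zero weight columns are our own assumptions:

```python
import numpy as np

def learn_step(x, y, W, beta=1.0):
    """Apply learning rule 2.2, then clip weights at zero and
    renormalize each node's afferent weights to sum to one."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    if x.max() <= 0.1 or y.sum() <= 0.0:  # skip degenerate cases
        return W
    post = np.maximum(y - y.mean(), 0.0)  # rectified postsynaptic term
    pre = x - x.mean()                    # presynaptic term (x_i - x_bar)
    dW = beta * np.outer(pre, post) / (x.sum() * y.sum())
    W = np.maximum(W + dW, 0.0)           # clip weights at zero
    col = W.sum(axis=0, keepdims=True)
    col[col == 0.0] = 1.0                 # guard (our addition)
    return W / col                        # presynaptic normalization
```

Because the postsynaptic term is rectified, a node less active than average leaves its weights untouched, which is what allows overlapping patterns to keep their shared features.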
all nodes) at each iteration is similar. Division-by-zero errors were avoided by learning only when the maximum input activity exceeded an arbitrary threshold of 0.1. Nodes learn to respond to specific features within the input space. In order to represent stimuli that contain multiple features, the activation function allows multiple nodes to be active simultaneously. With overlapping patterns, this could allow the same stimulus to be represented in multiple ways. For example, consider a simple network receiving input from three sources (labeled a, b, and c), which has the task of representing three patterns: a, ab, and bc. Individual nodes in a network may learn to become selective to each of these patterns. Each individual pattern would thus be represented by the activity of a single node. However, it would also be possible for a network in which nodes had learned patterns a and bc to respond to pattern ab by fully activating the node selective to a and partially activating the node selective to bc. Pattern ab could be the result of coactivation of pattern a and a partially occluded version of pattern bc. However, if ab occurs with similar frequency to the other two patterns, then it would seem more likely that ab is an independent input pattern that should be represented in a similar way to the other patterns (by the activation of a specific node tuned to that pattern). To ensure that in such situations the network learns all input patterns, it is necessary to introduce a second learning rule:
$$\Delta w_{ij}^{-} = -\beta^{-} (x_i - X_{ij}) (y_j - \bar{y}), \tag{2.3}$$
where $\beta^{-}$ is a parameter controlling the learning rate and $X_{ij}$ is the input activation from source $i$ to node $j$ after inhibition:

$$X_{ij} = x_i \left( 1 - \alpha \max_{\substack{k=1 \\ (k \neq j)}}^{n} \left\{ \frac{w_{ik}}{\max_{l=1}^{m} \{ w_{lk} \}} \, \frac{y_k}{\max_{l=1}^{n} \{ y_l \}} \right\} \right)^{+}.$$
The values of $X_{ij}$ are determined from equation 2.1. Negative weights were constrained such that $0 \geq \sum_{i=1}^{m} (w_{ij}^{-}) \geq -1$. This learning rule is applied only to synapses that have a weight of zero (or less than zero) caused by application of the learning rule given in equation 2.2 (or prior application of this rule). Negative weights are generated when a node is active and inputs that are not part of the node's preferred pattern are inhibited. This can occur only when multiple nodes are coactive. If the pattern to which this set of coactive nodes is responding reoccurs, then the negative weights will grow. When the negative weights are sufficiently large, the response of these nodes to this particular pattern will be inhibited, enabling an uncommitted node to compete successfully to represent this pattern. On the other hand, if the pattern to which this set of coactive nodes is responding is due just to the coactivation of independent input patterns, then the weights will return toward zero on subsequent presentations of these patterns in isolation.
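The negative-weight rule can be sketched as follows. The function name, the clipping of weights at zero, and the column rescaling used to enforce the constraint $0 \geq \sum_{i} w_{ij}^{-} \geq -1$ are our own reading of the text, not the authors' implementation:

```python
import numpy as np

def learn_negative(x, X, y, W_neg, beta_neg=1.0 / 64):
    """Apply learning rule 2.3 to the negative weights.

    x: raw inputs (length m); X: (m, n) matrix of inhibited inputs
    X_ij from equation 2.1; y: node activations; W_neg: (m, n)
    non-positive weight matrix.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    # rule 2.3: dW = -beta^- (x_i - X_ij)(y_j - y_bar)
    dW = -beta_neg * (x[:, None] - X) * (y - y.mean())[None, :]
    W_neg = np.minimum(W_neg + dW, 0.0)  # weights stay at or below zero
    # rescale any column whose total falls below -1 (our interpretation)
    for j in range(W_neg.shape[1]):
        total = W_neg[:, j].sum()
        if total < -1.0:
            W_neg[:, j] *= -1.0 / total
    return W_neg
```

When a node is active while one of its inputs is strongly inhibited ($X_{ij} < x_i$), the corresponding negative weight grows more negative, as the text describes.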
For programming convenience, we allow a single synapse to take either excitatory or inhibitory weight values. In a more biologically plausible implementation, two separate sets of afferent connections could be used: the excitatory ones being trained using equation 2.2 and the inhibitory ones being trained using a rule similar to equation 2.3. Note that the simplification we have used, which employs one set of weights to depict both excitatory and inhibitory synapses, does not result in any modification to the activation function (see equation 2.1). Inhibitory afferents can themselves be inhibited, and the use of identical lateral weights is unlikely to cause lateral facilitation due to the max operation that is applied.

3 Results

3.1 Overlap. Consider a simple problem of representing two overlapping patterns: ab and abc. A network consisting of two nodes receiving input from three sources (labeled a, b, and c) should be sufficient. Because these input patterns overlap, when the pattern ab is presented, the node representing abc will be partially activated, and when the pattern abc is presented, the node representing ab will be fully activated. Hence, when the synaptic weights have certain values, both nodes will respond with equal strength to the same pattern. For example, when the weights are all equal, both nodes will respond to pattern ab with equal strength (Marshall, 1995). Similarly, when the total synaptic weight from each input is normalized (postsynaptic normalization), both nodes will respond equally to pattern ab (Marshall, 1995). When the total synaptic weight to each node is normalized (presynaptic normalization), both nodes will respond to pattern abc with equal output (Marshall, 1995). Under all these conditions, the response fails to distinguish between distinct input patterns, and competition via postintegration lateral inhibition can do no better than choosing a node at random. Several solutions to this problem have been suggested.
Some require adjusting the activations using a function of the total synaptic weight received by the node (using the Weber law—Marshall, 1995—or a masking field—Cohen & Grossberg, 1987; Marshall, 1995). These solutions scale badly with the number of overlapping inputs and do not work when (as is common practice) the total synaptic weight to each node is normalized. Other suggestions have involved tailoring the lateral weights to ensure the correct node wins the competition (Marshall, 1995; Marshall & Gupta, 1998). The most obvious but most overlooked solution would be to remove constraints placed on allowable values for synaptic weights (e.g., normalization), which serve to prevent the input patterns from being distinguished in weight space. It is simple to invent sets of weights that unambiguously classify the two overlapping patterns (e.g., if both weights to the node representing ab are 0.5 and each weight to the node representing abc is 0.4, then each node responds most strongly to its preferred pattern and could then successfully inhibit the activation of the other node).
Figure 2: Representing overlapping patterns. A network consisting of six nodes and six inputs (a, b, c, d, e, and f) was trained using input patterns a, ab, abc, cd, de, and def. (a) The strength of synaptic weights between each node and each input is shown using squares that have sides of length proportional to the synaptic weight. (b) The average response (over 500 presentations) of each node to each of the training patterns is shown using squares that have sides of length proportional to the node activations. Each node responds exclusively to its preferred pattern. (c) The response of the network to multiple and partial patterns. Pattern abcd causes the nodes representing ab and cd to be active simultaneously, despite the fact that this pattern overlaps strongly with pattern abc. Input abcde is parsed as abc together with de, and input abcdef is parsed as abc + def. Input abcdf is parsed as abc + two-thirds of def; hence, the addition of f to the pattern abcd radically changes the representation that is generated. Input bcde is parsed as two-thirds of abc plus pattern de. Input acef is parsed as a + one-half of cd + two-thirds of pattern def.
Using preintegration lateral inhibition, overlapping patterns can be successfully distinguished and learned (despite the use of presynaptic normalization). To demonstrate this, we applied a six-node preintegration lateral inhibition network to the more challenging problem of representing six overlapping patterns: a, ab, abc, cd, de, and def (de A. Barreto & Araújo, 1998; Harpur & Prager, 1994; Marshall, 1995; Marshall & Gupta, 1998). The network was trained 25 times using different randomly generated sequences of the six input patterns. For each of these trials, the network successfully learned appropriate weights such that each pattern was represented by an individual node. A typical example of the weights learned is shown in Figure 2a, and the response of the network to each of the training patterns is shown in Figure 2b. It can be seen that distinct, individual nodes respond to each of the training patterns. For the majority of the 25 trials, a solution was found within 55 training cycles (equivalent to approximately nine presentations of each of the six patterns). On the fastest trial, the solution was found in 35 cycles, and on the slowest trial the solution was found by 80 cycles. These results compare very favorably with a network (the EXIN network) that uses postintegration lateral inhibition and was specifically developed to solve these types of problem (de A. Barreto & Araújo, 1998; Marshall, 1995; Marshall & Gupta, 1998). Although Marshall (1995) gives no indication of how reliable EXIN is at finding a correct solution, he does state that this network requires 3000 pattern presentations to find such a solution. Our results were generated using parameter values $\beta = 1$ and $\beta^{-} = 1$. The network learned successfully with other parameter values; however (as would be expected), reducing either learning rate resulted in slower learning (e.g., with $\beta = 1$ and $\beta^{-} = \frac{1}{64}$, the majority of networks found a solution within 435 training cycles; the fastest trial took 245 cycles and the slowest 640 cycles). The preintegration lateral inhibition network also responds in an appropriate manner to the presentation of multiple overlapping patterns (even when only partial patterns are presented). Six examples are given in Figure 2c. These parsings of novel input patterns, in terms of known patterns, are very similar to those obtained with the EXIN network (see Figure 8 in Marshall, 1995).

3.2 Multiplicity. While it is sufficient in certain circumstances for a single node to represent the input (local coding), it is desirable in many other situations to have multiple nodes providing a factorial or distributed representation. If individual nodes act as detectors for independent features of the input, then any arbitrary pattern can be represented by having zero, one, or multiple nodes active.
A benchmark task of this kind is the bars problem (Charles & Fyfe, 1998; Dayan & Zemel, 1995; Földiák, 1990; Frey, Dayan, & Hinton, 1997; Fyfe, 1997; Harpur & Prager, 1996; Hinton et al., 1995; Hinton & Ghahramani, 1997; Hochreiter & Schmidhuber, 1999; O'Reilly, 2001; Saund, 1995). The standard mechanism of postintegration lateral inhibition can be modified to enable multiple nodes to be coactive (Földiák, 1990; Marshall, 1995) by weakening the strength of the competition between those pairs of nodes that need to be coactive. Lateral weights need to be sufficiently strong to prevent multiple nodes from representing the same patterns, but sufficiently weak to allow more than one node to be active if multiple stimuli are presented simultaneously. Obtaining the correct strength for the lateral weights requires either a priori knowledge of which nodes will be coactive or the ability to learn appropriate weights. However, the activity of the nodes in response to the presentation of each input pattern is uninformative as to whether the weights have found the correct compromise: when two nodes are simultaneously active, it can indicate either that lateral weights are too weak and have allowed both units to respond to the same input or that each
node has responded correctly to the simultaneous presentation of distinct inputs; when a single node is active, it can indicate either that the lateral weights are too strong to allow other nodes to respond to other inputs or that only one input pattern is present. To overcome this problem, it has been necessary to design learning rules specific to the particular task. For example, Földiák (1990) uses the following rule to modify lateral inhibitory weights: $\Delta q_{kj} = \beta (y_k y_j - p^2)$ (where $p$ is the probability of each input pattern occurring, assumed a priori). Such learning rules are severely restricted in the class of problems that they can successfully represent (e.g., to tasks in which all input patterns occur with equal probability and where pairs of nodes are coactive with equal frequency (Földiák, 1990; Marshall, 1995)). Hence, postintegration lateral inhibition fails to provide factorial coding except in the exceptional circumstances where external knowledge is available to set appropriate lateral weights or to define appropriate learning rules for the specific task. Networks in which postintegration competition is performed using a selection mechanism can also be modified to allow multiple nodes to be simultaneously active (e.g., k-winners-take-all). However, these networks also restrict the types of task that can be successfully represented to those in which a predefined number of nodes need to be active in response to every pattern of stimuli. In contrast, preintegration lateral inhibition does not make use of a priori knowledge to learn factorial codes, nor does it place restrictions on the number of active nodes or on the frequency with which nodes, or sets of nodes, are active. It can thus respond appropriately to any combination of input patterns. To demonstrate this, we applied a network using preintegration lateral inhibition to the bars problem, as defined by Földiák (1990).
Input data consisted of an 8 × 8 pixel image in which each of the 16 possible (1-pixel-wide) horizontal and vertical bars was active with probability $\frac{1}{8}$. Typical examples of input patterns are shown in Figure 3a. The network was trained 25 times using different randomly generated sequences of input patterns. For each of these trials, the network successfully learned appropriate weights such that each bar was represented by an individual node. A typical example of the weights learned, at various stages during training, is shown in Figure 3b. The response of this network to randomly generated test patterns is shown in Figure 3c. It can be seen that a single node responds exclusively to each bar. Bars that overlap (perpendicular bars) can inhibit each other; hence, in cases where perpendicular lines are present, not all nodes are fully activated by the presence of their preferred input. However, nodes that represent active bars are still significantly more active than average. After being trained for 250 cycles, the network shown in Figure 3 was tested with a further 100,000 randomly generated patterns (learning rates were set to zero to prevent further training). The network was considered to represent a test pattern successfully if all nodes representing bars in the
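Stimuli for the bars problem are straightforward to generate; the following sketch reproduces the statistics just described (the generator name and RNG handling are our own choices):

```python
import numpy as np

def bars_pattern(size=8, p=1.0 / 8, rng=None):
    """One bars-problem input: each of the 2*size one-pixel-wide
    horizontal and vertical bars is independently active with
    probability p; active bars are drawn as 1s on a size x size grid."""
    rng = np.random.default_rng(rng)
    img = np.zeros((size, size))
    for r in range(size):       # horizontal bars
        if rng.random() < p:
            img[r, :] = 1.0
    for c in range(size):       # vertical bars
        if rng.random() < p:
            img[:, c] = 1.0
    return img
```

With $p = \frac{1}{8}$, two of the sixteen bars are active on average, and crossing bars overlap at single pixels, which is what makes the task a test of factorial coding.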
Figure 3: Representing multiple patterns: the bars problem. (a) Examples of typical input patterns used in the bars problem. Bars on an 8 × 8 grid are active with probability $\frac{1}{8}$. Dark pixels indicate active inputs. (b) The synaptic weights for the 16 nodes in a network after training with 50, 100, 150, 200, and 250 input patterns. The darkness of each pixel corresponds to the strength of that weight. (c) The response of the network after training on 250 input patterns. The leftmost column shows the input pattern, with the remainder of each row showing the response of the network to this pattern. The size of the response of each node, in the same order as the weights are shown in b, is represented by the size of each square.
Preintegration Lateral Inhibition Enhances Unsupervised Learning
2169
Figure 4: Failure cases for the bars problem. The network shown in Figure 3 was tested with 100,000 randomly generated patterns. All 13 patterns to which it failed to respond correctly are shown in the leftmost column. The remainder of each row shows the response of the network to this pattern. The size of the response of each node, in the same order as the weights are shown in Figure 3b, is represented by the size of each square.
pattern, and only those nodes, had an activation greater than the average activation. The network failed to represent 13 patterns out of 100,000. These 13 patterns are shown in Figure 4, along with the network's response to them. In all cases, the patterns contain seven or eight bars, with the majority of these bars at the same orientation. The network successfully represents the majority of bars, but nodes responding to the one or two perpendicular bars are less active than average. In the majority of the 25 trials performed on this task, the network found a solution within 210 cycles. Performance ranged from 140 cycles for the fastest trial to 370 cycles for the slowest trial. Although Földiák (1990) developed a network that used postintegration lateral inhibition to tackle this problem, he did not report how reliable his network was at finding correct solutions or how much training it required to do so. However, our results compare favorably with all neural network algorithms that have been applied to this problem, not just those that implement postintegration lateral inhibition. For example, Harpur and Prager (1996) report that training required 2000 pattern presentations,1 Hochreiter and Schmidhuber (1999) report that training required 5000 passes through a training set of 500 patterns,2 and Dayan and Zemel (1995) report a success rate of only 69%.3 The results presented above were obtained with learning rates set to β = 1 and β⁻ = 1/64. The network was also tested over 25 trials using β⁻ = 1. Although this case also resulted in a correct solution being found (within 1000 cycles) in 100% of trials, it did significantly delay convergence to a solution in a small number of these trials. In the overlapping patterns task, discussed in the previous section, training data consisted of isolated patterns. Hence, whenever multiple nodes were coactive, it indicated that a pattern was incorrectly represented and negative weights should be increased. In contrast, for this task, training data consist of multiple patterns, and hence coactive nodes may also be due to multiple input patterns being present. A lower value of β⁻ thus prevents useful weight changes caused by incorrect representations being masked by spurious fluctuations due to multiple input patterns. Note, however, that it is not necessary to know a priori whether the training data contain multiple patterns: the bars problem can be learned with a 100% success rate using β⁻ = 1, and the overlapping inputs problem can be learned with a 100% success rate using β⁻ = 1/64. Figure 5a illustrates the effect of changing the learning rate. This figure shows the change, during learning, of the number of bars correctly represented by the network. For a bar to be considered correctly represented, exactly one node in the network must have strong afferent weights from the pixels activated by that bar. Specifically, the sum of these weights must be at least twice the sum of the weights connected to pixels activated by any other bar.
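The "correctly represented" criterion stated above can be checked mechanically. The sketch below is our own illustration of that criterion (names and data layout are assumptions, not the authors' code); a zero-weight node is excluded so that an untrained node cannot trivially satisfy the test:

```python
def bars_represented(weights, bars):
    """Count how many bars are correctly represented: exactly one node's
    summed weight onto the bar's pixels must be at least twice its summed
    weight onto the pixels of every other bar (the criterion in the text).

    weights: one weight vector per node (one entry per pixel)
    bars: one tuple of pixel indices per bar
    """
    count = 0
    for bar in bars:
        winners = 0
        for w in weights:
            s = sum(w[i] for i in bar)
            if s > 0 and all(s >= 2 * sum(w[i] for i in other)
                             for other in bars if other is not bar):
                winners += 1
        if winners == 1:        # exactly one node must claim the bar
            count += 1
    return count

bars = [(0, 1), (2, 3)]                  # toy example: two 2-pixel "bars"
w = [[1, 1, 0, 0], [0, 0, 1, 1]]
print(bars_represented(w, bars))  # 2
```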
The data shown are averaged over 25 trials, and the error bars indicate the best and worst performance. All the previous analysis has been performed on a network with 16 nodes. In general, it would not be known a priori how many nodes were required to form a representation of a particular task. Hence, the algorithm must also work when the number of nodes does not match the number of input features. A network containing 32 nodes was tested 25 times using different randomly generated sequences of input patterns. In all these trials, the network found a solution to the bars problem. The slowest trial took 440 cycles to find a solution (slightly longer than the worst trial with the 16-node network).

1 Harpur and Prager (1996) used a slightly modified version of the original problem such that bars were active with probability 1/4.
2 Hochreiter and Schmidhuber (1999) used a version in which a 5 × 5 pixel image was used, and bars appeared with probability 1/5.
3 Dayan and Zemel (1995) used the same task as Hochreiter and Schmidhuber (1999) except that horizontal and vertical bars could not co-occur. Our network tested with data in which horizontal and vertical bars did not co-occur learned a correct solution in 100% of 25 trials. In the majority of trials, a solution was found within 195 cycles.

[Figure 5 graphs: number of bars represented versus training cycles (0–800); panel (a) legend: β = 1.0, β⁻ = 0.0156; β = 1.0, β⁻ = 1.0; β = 0.25, β⁻ = 0.0156; panel (b) legend: n = 16, n = 32, n ≤ 20.]

Figure 5: Effect of parameter changes in solving the bars problem. The change during training in the number of bars correctly represented by exactly one node in the network. Lines show the mean for 25 trials with different randomly generated sequences of input patterns. Error bars show the best and worst performance over all trials. (a) The effects of varying the learning rates. (b) The effects of varying the number of nodes in the network.

Figure 6: Typical receptive fields learned by a 32-node network applied to the bars problem. Each square shows the synaptic weights for a node after training for 500 cycles. The darkness of each pixel corresponds to the strength of that weight.

Excess nodes generally remained uncommitted to representing any pattern (all weights remained equal to their initial values). Occasionally, an excess node did form a slight preference for a complex pattern containing multiple bars. A typical example of the weights learned is shown in Figure 6. The representations formed in networks with excess nodes remained stable. We also tested a version of the algorithm in which the number of nodes in the network was not fixed, but nodes (up to a maximum number, in this case 20) were added as required during training. This network began with two nodes. Whenever none of the nodes remained uncommitted, a new uncommitted node was added to the network. In the majority of 25 random trials performed with this network, it converged to a solution within 130 training cycles. In the slowest trial, a solution was found in 290 training cycles. This constructive version of the algorithm thus learns more quickly than a network with a fixed number of nodes. Figure 5b summarizes the results obtained for networks with different numbers of nodes. All the above experiments used noise-free stimuli. Such perfect input data are unrealistic for real-world problems. The performance of the preintegration lateral inhibition network was thus tested on the bars problem as described above, but with zero-mean gaussian noise added independently to each pixel value (pixel values were clipped at zero and one). Two factors result in significantly slower learning when noisy data are used. First, more examples of input patterns are needed to allow the algorithm to distinguish signal from noise. Second, a lower learning rate is needed in order to smooth out weight fluctuations caused by the noise. It was also found necessary to use a network in which there were excess nodes; these nodes learned random-looking weights. A typical example of the weights learned is shown in Figure 7b.
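The noise corruption described above (zero-mean gaussian noise added independently to each pixel, with the result clipped to [0, 1]) is straightforward to reproduce. A minimal sketch, with function and parameter names of our own choosing; note that the text specifies noise variance, while `random.gauss` takes a standard deviation:

```python
import random

def corrupt(image, var, rng=random):
    """Add zero-mean gaussian noise of the given variance independently
    to each pixel, clipping the result to [0, 1] as described in the text."""
    sigma = var ** 0.5  # random.gauss expects a standard deviation
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in image]

clean = [[0.0, 1.0], [0.5, 0.25]]
noisy = corrupt(clean, 0.3, random.Random(0))
```

With variance 0.3 the noise standard deviation is about 0.55, comparable to the signal itself, which is why distinguishing bars from noise requires many more training examples.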
A 20-node network, with β = 0.25 and β⁻ = 1/64, was trained 25 times with different randomly generated sequences of 4000 input patterns. Figure 8 summarizes the results obtained using data corrupted with varying degrees of noise. As the level of noise rises, the network is less reliable at finding a solution and takes longer to do so. Although for higher levels of noise the network did not find a solution within 4000 training cycles on every trial, the performance is impressive considering the difficulty of perceiving the bars under this level of noise (see Figure 7a). For comparison, Charles and Fyfe (1998) report on an algorithm with a 100% success rate when
Figure 7: The noisy bars problem. (a) Examples of input patterns used in the noisy bars problem. The bars data are corrupted with zero-mean gaussian noise, with variance 0.3, added independently to each pixel value (pixel values were clipped at zero and one). The darkness of each pixel corresponds to the strength of activation of that input. (b) Typical receptive fields learned by a 20-node network applied to the noisy bars problem (using noise with variance 0.3). Each square shows the synaptic weights for a node after training for 4000 cycles. The darkness of each pixel corresponds to the strength of that weight.
the variance of the noise was 0.25 and a 70% success rate when the variance was 1.0. However, their network was trained for 100,000 cycles. Hinton and Ghahramani (1997) report that training required 10 passes through a set of 1000 images, but do not report the variance of the noise used or the reliability with which a solution was found.4

4 Discussion
The examples have shown that preintegration lateral inhibition is capable of generating appropriate representations based on the knowledge stored in the synaptic weights of a neural network. Specifically, it can represent overlapping patterns and can respond to partial patterns such that the response is proportional to how well that input matches the stored pattern. It is capable of generating a local encoding of individual input patterns as well as responding simultaneously to multiple patterns, when they are present, in order to generate a factorial or distributed encoding. Importantly, such a network is capable of efficiently learning such representations using a simple, unsupervised learning algorithm. In contrast, a network using postintegration lateral inhibition can learn to represent mutually exclusive

4 Hinton and Ghahramani (1997) used a 6 × 6 image, bars appeared with probability 0.3, and horizontal and vertical bars could not co-occur.
[Figure 8 graph (a): number of bars represented versus training cycles (0–4000), for var(noise) = 0.1, 0.2, 0.3, and 0.4.]

(b)
Variance of noise:  0.1    0.2    0.3    0.4
Reliability (%):    100    92     76     60
Speed (cycles):     1125   1900   2700   3550

Figure 8: Effect of changing levels of noise in solving the bars problem. The change during training in the number of bars correctly represented by exactly one node in the network. Lines show the mean for 25 trials with different randomly generated sequences of input patterns. Error bars show the best and worst performance over all trials. (a) The effects of varying the variance of the noise corrupting the input data. (b) Table showing reliability (the percentage of trials for which a solution was found by the end of 4000 training cycles) and speed (the number of training cycles required for the majority of trials to have reached a solution). In trials that failed to find a solution, the network represented most of the bars (usually 15); hence, the average number of bars represented, as shown in a, is high even when the reliability is low.
input patterns using winner-take-all-type competition, but it has not been shown how it can learn factorial codes except in cases where the input data meet very restrictive criteria. With preintegration lateral inhibition, learning is reliable and fast. Calculating node activations is tractable, requiring a polynomial-time algorithm, but is slightly slower than for postintegration lateral inhibition (computational complexity increases from O(nm + n²) for postintegration lateral inhibition to O(mn²) for preintegration lateral inhibition). We believe that the algorithm we have presented has sufficient computational qualities to make it interesting as an artificial self-organizing neural network architecture. However, we also suggest that it has potential application as a model of cortical information processing. Inhibitory synapses contacting cortical pyramidal cells are concentrated on the soma and axon initial segment (Mountcastle, 1998; Somogyi & Martin, 1985), where they can be equally effective at inhibiting responses to excitatory inputs stimulating any part of the dendritic tree (Spruston, Stuart, & Häusser, 1999). Such synapses could thus generate postintegration inhibition. However, inhibitory synapses also contact the dendrites of cortical pyramidal cells (Kim, Beierlein, & Connors, 1995; Rockland, 1998), and certain classes of interneuron (e.g., double bouquet cells) specifically target dendritic spines and shafts (Mountcastle, 1998; Tamas, Buhl, & Somogyi, 1997). Such contacts would have relatively little impact on excitatory inputs more proximal to the cell body or on the action of synapses on other branches of the dendritic tree. Thus, these synapses do not appear to contribute to postintegration inhibition.
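The O(mn²) cost quoted above can be made concrete: each of the n nodes must evaluate, for each of its m inputs, the strongest competing claim on that input by the other n − 1 nodes. The sketch below is our own illustration of this cost structure; the specific inhibition rule (subtracting the rivals' maximum weight, scaled by a hypothetical factor alpha) is an assumption for exposition, not the authors' update equation:

```python
def inhibited_inputs(x, W, alpha=1.0):
    """Sketch of preintegration (input-level) inhibition: each node j sees
    input i reduced by the strongest competing weight onto i from any other
    node.  The triple loop makes the O(m * n**2) cost explicit.

    x: input vector of length m; W: W[j][i] = weight of node j from input i.
    """
    n, m = len(W), len(x)
    seen = []
    for j in range(n):
        row = []
        for i in range(m):
            # strongest rival claim on input i by any other node
            rival = max((W[k][i] for k in range(n) if k != j), default=0.0)
            row.append(max(0.0, x[i] * (W[j][i] - alpha * rival)))
        seen.append(row)
    return seen

print(inhibited_inputs([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))
# [[1.0, 0.0], [0.0, 1.0]]
```

With two nodes tuned to disjoint inputs, each node's view of its own input is untouched while its view of the rival's input is suppressed, which is the qualitative behavior the text describes.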
However, such synapses are likely to have strong inhibitory effects on inputs within the same dendritic branch that are more distal to the site of inhibition (Borg-Graham, Monier, & Fregnac, 1998; Koch et al., 1983; Koch & Segev, 2000; Rall, 1964; Segev, 1995; Spruston et al., 1999). Hence, they could potentially selectively inhibit specific groups of excitatory inputs. Related synapses cluster together within the dendritic tree so that local operations are performed by multiple functionally distinct dendritic subunits before integration at the soma (Häusser, 2001; Häusser, Spruston, & Stuart, 2000; Koch & Segev, 2000; Mel, 1994, 1999; Segev, 1995; Segev & Rall, 1998). Dendritic inhibition could thus act to block the output from individual functional compartments. If, for modeling convenience, all the synapses contributing to a dendritic compartment are grouped together as a single input, then our algorithm can be viewed as a model of dendritic inhibition. The idea embodied in our model is that pyramidal cells inhibit the activity of dendritic compartments in other pyramidal cells within the same cortical area. This claim is anatomically plausible since it has been shown that cortical pyramidal cells innervate inhibitory cell types, which in turn form synapses on the dendrites of pyramidal cells (Buhl et al., 1997; Tamas et al., 1997). Our model is also supported by recent physiological data. Wang, Fujita, and Murayama (2000) report that blockade of GABAergic (inhibitory) synapses alters the selectivity of cells in monkey inferotemporal cortex (area TE). Application of bicuculline methiodide (a GABAA receptor antagonist) results in increased responses to certain image features but not others. "These results suggest that a substantial fraction of excitatory inputs to TE neurons normally are not expressed because of GABAergic inhibition. . . . The effects of bicuculline on these neurons resulted from specific disinhibition of GABAergic synapses on the excitatory inputs of particular stimulus features, rather than removal of nonspecific inhibition" (Wang et al., 2000, p. 808). These data are consistent with our model of dendritic inhibition but are difficult to account for with a model that uses only postintegration inhibition. As with models of postintegration lateral inhibition, we have simplified reality by assuming that the role of inhibitory cells can be approximated by direct inhibitory weights from excitatory cells. A more biologically accurate model might include a separate inhibitory cell population. One potential implementation would allocate one inhibitory neuron to each input or dendritic compartment (i). This inhibitory cell would receive weighted connections from each excitatory node. It would perform a max operation over these inputs and apply the resulting inhibition equally to input i of every excitatory node. Using this scheme, each input (or dendritic compartment) would be inhibited by a single synapse. In subcortical structures, where axo-axonal, terminal-to-terminal synapses are common (Brown, 1991; Kandel, Schwartz, & Jessell, 1995), our model might have an even more direct biological implementation, since an excitatory neuron can directly inhibit the input to another neuron via an excitatory synapse on the axon terminal of the input (presynaptic inhibition) (Bowsher, 1970; Kandel, Schwartz, & Jessell, 1995; Shepherd, 1990).

Acknowledgments
This work was funded by MRC Research Fellowship number G81/512.

References

Borg-Graham, L. T., Monier, C., & Fregnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393(6683), 369–373.
Bowsher, D. (1970). Introduction to the anatomy and physiology of the nervous system. Oxford: Blackwell.
Brown, A. G. (1991). Nerve cells and nervous systems: An introduction to neuroscience. London: Springer-Verlag.
Buhl, E. H., Tamas, G., Szilagyi, T., Stricker, C., Paulsen, O., & Somogyi, P. (1997). Effect, number and location of synapses made by single pyramidal cells onto aspiny interneurones of cat visual cortex. Journal of Physiology, 500(3), 689–713.
Charles, D., & Fyfe, C. (1998). Modelling multiple cause structure using rectification constraints. Network: Computation in Neural Systems, 9(2), 167–182.
Cohen, M. A., & Grossberg, S. (1987). Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of patterned data. Applied Optics, 26, 1866–1891.
Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
de A. Barreto, G., & Araújo, A. F. R. (1998). The role of excitatory and inhibitory learning in EXIN networks. In Proceedings of the IEEE World Congress on Computational Intelligence (pp. 2378–2383). New York: IEEE Press.
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (Vol. 1, pp. 401–405). New York: IEEE Press.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Frey, B. J., Dayan, P., & Hinton, G. E. (1997). A simple algorithm that discovers efficient perceptual codes. In M. Jenkin & L. R. Harris (Eds.), Computational and psychophysical mechanisms of visual coding. Cambridge: Cambridge University Press.
Fyfe, C. (1997). A neural net for PCA and beyond. Neural Processing Letters, 6(1/2), 33–41.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S. (1978). A theory of human memory: Self-organisation and performance of sensory-motor codes, maps, and plans. In R. Rosen & F. Snell (Eds.), Progress in theoretical biology (Vol. 5, pp. 233–374). London: Academic Press.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23–63.
Grzywacz, N. M., & Burgi, P.-Y. (1998). Towards a biophysically plausible bidirectional Hebbian rule. Neural Computation, 10, 499–520.
Harpur, G. F., & Prager, R. W. (1994). A fast method for activating competitive self-organising neural networks. In Proceedings of the International Symposium on Artificial Neural Networks (pp. 412–418).
Harpur, G., & Prager, R. (1996). Development of low entropy coding in a recurrent network. Network: Computation in Neural Systems, 7(2), 277–284.
Häusser, M. (2001). Synaptic function: Dendritic democracy. Current Biology, 11, R10–R12.
Häusser, M., Spruston, N., & Stuart, G. J. (2000). Diversity and dynamics of dendritic signalling. Science, 290(5492), 739–744.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London. Series B, 352, 1177–1190.
Hochreiter, S., & Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Computation, 11, 679–714.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (Eds.). (1995). Essentials of neural science and behavior. London: Appleton & Lange.
Kim, H. G., Beierlein, M., & Connors, B. W. (1995). Inhibitory control of excitable dendrites in neocortex. Journal of Neurophysiology, 74(4), 1810–1814.
Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interactions in a dendritic tree: Localization, timing, and role in information processing. Proceedings of the National Academy of Sciences USA, 80(9), 2799–2802.
Koch, C., & Segev, I. (2000). The role of single neurons in information processing. Nature Neuroscience, 3(Supp.), 1171–1177.
Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag.
Marshall, J. A. (1995). Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty, multiplicity, and scale. Neural Networks, 8(3), 335–362.
Marshall, J. A., & Gupta, V. S. (1998). Generalization and exclusive allocation of credit in unsupervised category learning. Network: Computation in Neural Systems, 9(2), 279–302.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mel, B. W. (1999). Why have dendrites? A computational perspective. In G. Stuart, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 271–289). Oxford: Oxford University Press.
Mountcastle, V. B. (1998). Perceptual neuroscience: The cerebral cortex. Cambridge, MA: Harvard University Press.
Nigrin, A. (1993). Neural networks for pattern recognition. Cambridge, MA: MIT Press.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68.
O'Reilly, R. C. (1998). Six principles for biologically based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11), 455–462.
O'Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13(6), 1199–1242.
Rall, W. (1964). Theoretical significance of dendritic trees for neuronal input-output relations. In R. F. Reiss (Ed.), Neural theory and modeling (pp. 73–97). Stanford, CA: Stanford University Press.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
Rockland, K. S. (1998). Complex microstructures of sensory cortical connections. Current Opinion in Neurobiology, 8, 545–551.
Rumelhart, D. E., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9, 75–112.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 51–71.
Segev, I. (1995). Dendritic processing. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 282–289). Cambridge, MA: MIT Press.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines: Earlier theoretical insights elucidate recent direct observations. Trends in Neurosciences, 21(11), 453–460.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303–321.
Shepherd, G. M. (Ed.). (1990). The synaptic organization of the brain. Oxford: Oxford University Press.
Shepherd, G. M., & Brayton, R. K. (1987). Logic operations are properties of computer-simulated interactions between excitable dendritic spines. Neuroscience, 21, 151–166.
Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 66–78.
Somogyi, P., & Martin, K. A. C. (1985). Cortical circuitry underlying inhibitory processes in cat area 17. In D. Rose & V. G. Dobson (Eds.), Models of the visual cortex. New York: Wiley.
Spruston, N., Stuart, G., & Häusser, M. (1999). Dendritic integration. In G. Stuart, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 231–271). Oxford: Oxford University Press.
Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network: Computation in Neural Systems, 7(2), 161–247.
Tamas, G., Buhl, E. H., & Somogyi, P. (1997). Fast IPSPs elicited via multiple synaptic release sites by different types of GABAergic neurone in the cat visual cortex. Journal of Physiology, 500(3), 715–738.
von der Malsburg, C. (1973). Self-organisation of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Wallis, G. (1996). Using spatio-temporal correlations to learn invariant object recognition. Neural Networks, 9(9), 1513–1519.
Wang, Y., Fujita, I., & Murayama, Y. (2000). Neuronal mechanisms of selectivity for object features revealed by blocking inhibition in inferotemporal cortex. Nature Neuroscience, 3(8), 807–813.

Received July 5, 2001; accepted February 21, 2002.
LETTER
Communicated by Maneesh Sahani
On Optimality in Auditory Information Processing

Mattias F. Karlsson
[email protected]
John W. C. Robinson
[email protected]
Swedish Defense Research Agency, SE 172 90 Stockholm, Sweden

We study limits for the detection and estimation of weak sinusoidal signals in the primary part of the mammalian auditory system using a stochastic FitzHugh-Nagumo model and an action-recovery model for synaptic depression. Our overall model covers the chain from a hair cell to a point just after the synaptic connection with a cell in the cochlear nucleus. The information processing performance of the system is evaluated using so-called φ-divergences from statistics that quantify "dissimilarity" between probability measures and are intimately related to a number of fundamental limits in statistics and information theory (IT). We show that there exists a set of parameters that can optimize several important φ-divergences simultaneously and that this set corresponds to a constant quiescent firing rate (QFR) of the spiral ganglion neuron. The optimal value of the QFR is frequency dependent but is essentially independent of the amplitude of the signal (for small amplitudes). Consequently, optimal processing according to several standard IT criteria can be accomplished for this model if and only if the parameters are "tuned" to values that correspond to one and the same QFR. This offers a new explanation for the QFR and can provide new insight into the role played by several other parameters of the peripheral auditory system.

1 Introduction
When a sensory cell in a mammal is presented with a stimulus, the information about it must in general be communicated through several layers of intermediating nerve cells before it reaches the parts of the brain where the final processing takes place. A logical question, therefore, is how much of the information is lost in the first parts of this processing chain and how these parts of the chain have (possibly) been optimized by evolution to combat information loss, for different types of stimuli. One of the simplest settings of this problem is the auditory system. The frequency filtering process in the inner ear makes it sufficient in general, at least for weak signals, to restrict attention to a single type of stimulus, a pure tone, when studying the response of the auditory nerve cells and their connections in the

Neural Computation 14, 2181–2200 (2002) © 2002 Massachusetts Institute of Technology
2182
Mattias F. Karlsson and John W. C. Robinson
cochlear nucleus. From an information-theoretic perspective, it is thus of interest to determine how well the peripheral parts of the auditory processing chain preserve information about the presence of a tone, its amplitude, and its phase. Here, we will primarily focus on the ways in which this part of the auditory processing chain imposes limits on the achievable detection performance for a weak tone. In the context of statistical decision and information theory (IT), this detection problem is intimately connected to the estimation problem of determining the amplitude. Despite the extensive literature on information processing in neurons, a relatively small number of works treat the fundamental statistical limits for neural detection and estimation that bound the performance of sensory systems. One notable exception, however, is Stemmler's work (1996) on the detection and estimation capabilities of the Hodgkin-Huxley, McCulloch-Pitts, and leaky integrate-and-fire model neurons in terms of the Fisher information. Stemmler shows that there exists a universal small-signal scaling law that relates the optimal detection, estimation, and communication performance of these model neurons and that this scaling law also applies to the (narrow-band) signal-to-noise ratio (SNR) at the output of a neuron that is excited by a sinusoidal signal. In the majority of other information-theoretic analyses of neural information processing, the focus is on the spike train at the output of a neuron, and a long-standing objective has been to try to break the neural code of the spike train. However, there is a fundamental component missing in modeling that rests solely on considering information in the spike train: the influence of the synaptic connections.
The importance of this aspect of neural computation has recently been recognized, and it has even been suggested that the synaptic connections in fact represent the primary bottleneck that limits information transmission in neural circuitry (Zador, 1998). Consequently, when studying information processing in neurons, in particular the detection and estimation capabilities of the auditory system, it seems imperative to consider models and methods that describe not only the individual neurons and their spike trains but also the synaptic connections between the neurons. In this study, we investigate, theoretically, the fundamental limits for detection and estimation of weak signals in the mammalian auditory system. We model the neurons in the auditory nerve and their synaptic connections using ideas from Tuckwell (1988) and Kistler and Van Hemmen (1999) that take into account the notion of synaptic depression. Incorporating the synaptic efficacy's dependence on the preceding sequence of action potentials arriving at the synapse makes it possible to obtain a more realistic assessment of the information available to the next step in the auditory processing chain, the processing in the cochlear nucleus. Another feature of our study is the use of more general measures of signal-noise separation. To quantify signal-noise separation, we use the so-called φ-divergences from statistics and IT (Liese & Vajda, 1987). The φ-divergences are applicable to virtually any kind of signal and system (in a stochastic setting), in particular the highly nonlinear dynamic systems represented by neurons, and are intimately related to a number of fundamental limits in statistics and IT. Our main objective is to determine whether the primary auditory system, when described using standard (albeit simplified) models for neuron and synaptic dynamics, has a structure whereby optimizations of φ-divergences with respect to parameters can occur. Given the significance of the φ-divergences as performance measures, an affirmative answer to this question would yield a new view of the role played by various parameters in the neurons of the auditory system, such as the quiescent firing rate (QFR), and would inspire new experiments relating to the function of the auditory processing chain. We show that such optimizations indeed are possible, where some of the underlying mechanisms are explained in terms of the model structure, and we numerically determine the optimal values. The article is organized as follows. In section 2, we describe our model of the auditory system, in which the central component is the FitzHugh-Nagumo system of equations. This section also includes an introduction to φ-divergences and a review of their properties. The divergences are computed in section 3, and the results are discussed in section 4.

2 Methods

2.1 Physiological Modeling. We consider the peripheral part of the mammalian auditory nervous system (Geisler, 1998), beginning with the acoustic (fluid) pressure at a point in the inner ear and ending at the soma of a cell in the cochlear nucleus. As a model of the chain from the inner ear, via an inner hair cell and a spiral ganglion cell, to a point a small distance down the ganglion axon, we employ a stochastic FitzHugh-Nagumo (FHN) model (FitzHugh, 1961; Scott, 1975). This model, which we henceforth (with a slight abuse of language) will call the FHN neuron, represents an attractive choice in our study.
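To make the FHN neuron concrete, a minimal simulation sketch follows. It uses Euler-Maruyama integration of the classic noise-driven FitzHugh-Nagumo equations; this form and all parameter values are generic assumptions for illustration, not necessarily the parameterization (after Longtin, 1993) used in the article:

```python
import math
import random

def fhn_trajectory(T=100.0, dt=0.01, eps=0.08, a=0.7, b=0.8,
                   I=0.0, D=0.0, rng=random):
    """Euler-Maruyama integration of a noise-driven FitzHugh-Nagumo
    neuron in its classic form (an assumption for illustration):
        dv = (v - v**3/3 - w + I) dt + sqrt(2*D) dW
        dw = eps * (v + a - b*w) dt
    Returns the sampled values of the fast (membrane) variable v.
    """
    v, w = 0.0, 0.0
    vs = []
    for _ in range(int(T / dt)):
        dv = (v - v**3 / 3.0 - w + I) * dt \
             + math.sqrt(2.0 * D * dt) * rng.gauss(0.0, 1.0)
        dw = eps * (v + a - b * w) * dt
        v, w = v + dv, w + dw
        vs.append(v)
    return vs

vs = fhn_trajectory()        # deterministic run (D = 0)
print(round(vs[-1], 2))      # settles near the rest point, about -1.2
```

Driving the model with a weak sinusoid (nonzero I varying in time) and nonzero noise intensity D produces the irregular spike trains whose interspike interval statistics are discussed below.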
It is analytically and numerically tractable and has the ability to produce a response that is statistically similar to that observed in real neurons (Hochmair-Desoyer, Hochmair, Motz, & Rattay, 1984). In particular, it is well known that even simple (white-noise-driven) stochastic FHN models are able to reproduce accurately the interspike interval histograms (ISIHs) in various kinds of nerve fibers, such as the auditory nerve fibers of squirrel monkeys (Massanes & Vicente, 1999). For our study, the most important aspect of the neuron model is its ability to reproduce the ISIH, since other quantities, such as small voltage variations between spikes, will not influence the statistical quantities we focus on. Furthermore, the effects on the ISIHs of varying the parameters in the FHN model are well understood, and the subset of parameter space in which we get realistic spike trains is easily extracted. Therefore, we do not require the higher level of realism that can be obtained using more elaborate models, such as the Hodgkin-Huxley model, even though such models can contribute a more detailed understanding of the biological mechanisms involved.

For the terminal boutonic connections of the auditory nerve with the dendrites (or soma) of the cells in the cochlear nucleus, together with the parts of the dendrites from the boutonic connections to the somas, we employ an action-recovery model combined with a time-varying, α-function-like transformation with additive noise (Tuckwell, 1988; Kistler & Van Hemmen, 1999). The conjunction of these two model features makes it possible to capture both the synaptic depression and the variability observed in real neurons. Furthermore, incorporating depression in the model turns out to be of crucial importance for our results, since it removes "false optima" that would otherwise be present.

2.1.1 Stochastic FitzHugh-Nagumo Model. The stochastic FHN model is given by the following system of stochastic differential equations (Longtin, 1993),¹
\[
\begin{aligned}
\epsilon\, dV_t &= V_t (V_t - a)(1 - V_t)\, dt - W_t\, dt + d\nu_t,\\
dW_t &= \bigl(V_t - d\, W_t - (b + s_t)\bigr)\, dt,
\end{aligned}
\qquad t \in [0, T],
\tag{2.1}
\]
where ε, a, b, d > 0 are (nonrandom) parameters, V is the fast ("voltage-like") variable, W is the slow ("recovery-like") variable, ν_t is the noise, and s_t is the signal process representing the stimuli, here the acoustic pressure in the inner ear. The parameter a effectively controls the barrier height between the two potential wells in the potential term (the first term on the right-hand side of the first equation), and the variable b is a bias parameter moderating the effect of the signal input. These two parameters affect the stability properties of the FHN neuron, and so does the relaxation parameter d multiplying the slow variable. The parameter ε sets the timescale for the motion in the potential described by the first equation. Normally, the variable V is thought to represent membrane voltage in the neuron, but since the FHN model can be viewed as obtained by "descent" from the higher-dimensional Hodgkin-Huxley model (or other more elaborate models), it is not reasonable to attach too strict a physical meaning to it. To us, it will merely act as a convenient way of modeling the timing information in the action potentials generated by the neuron, when the latter are defined by a simple threshold operation on the fast variable V. The signal s_t is here chosen to enter on the slow variable W, which controls the refractory periods of V, in order to facilitate a comparison with existing qualitative results for the corresponding deterministic dynamics (Alexander, Doedel, & Othmer, 1990). However, it is easy to transform the system into an equivalent one (of the same form) where the signal enters on the fast variable (Alexander et al., 1990).

¹ To guarantee global solutions to equation 2.1, we must assume that the model for very large |V| is modified so that the potential term on the right-hand side of the first equation grows at most linearly as the state variable V tends to ±∞.
Figure 1: A typical example of the output from the model in equations 2.1 and 2.2, with parameter values for the input signal and the FHN neuron as in Table 1, except A = 0.1. Time is measured in ms, and one unit on the voltage axis corresponds to 100 mV.
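A trace like the one in Figure 1 can be reproduced in outline by integrating equations 2.1 and 2.2 with the Euler-Maruyama scheme the authors use (see section 2.3). The following is a minimal sketch using the Table 1 values with A = 0.1 as in the caption; the step size, initial conditions, and random seed are our own choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Table 1 values (time in ms, voltage in units of 100 mV)
eps, a, b, d = 0.005, 0.55, 0.12, 1.0
sigma, lam = 100.0 * np.sqrt(2.0) * 1e-5, 100.0
A, w0, theta, c = 0.1, 8.0, 0.0, 0.5

T, dt = 40.0, 1e-4
V, W, nu = 0.0, 0.0, 0.0
spikes, above = [], False
for k in range(int(T / dt)):
    t = k * dt
    s = A * np.sin(w0 * t + theta)            # sinusoidal stimulus, eq. 2.6
    # Ornstein-Uhlenbeck noise increment, eq. 2.2
    dnu = -lam * nu * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    # Euler-Maruyama step for eq. 2.1 (noise enters the fast variable)
    dV = (V * (V - a) * (1.0 - V) * dt - W * dt + dnu) / eps
    dW = (V - d * W - (b + s)) * dt
    V, W, nu = V + dV, W + dW, nu + dnu
    # spike = up-crossing of the threshold c by the fast variable V
    if V > c and not above:
        spikes.append(t)
        above = True
    elif V < c:
        above = False

print(f"{len(spikes)} spikes in {T:.0f} ms")
```

Whether and how often this sketch fires depends on the noise realization and on a, b; per section 3, the nominal values put the neuron in a sparsely firing regime.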
The stochastic process ν_t is a noise process accounting for the variability in firing pattern observed in real neurons, which we, in order to have control over the correlation time (Longtin, 1993), take to be an Ornstein-Uhlenbeck (OU) process,

\[
d\nu_t = -\lambda\, \nu_t\, dt + \sigma\, d\xi_t, \qquad t \in [0, T],
\tag{2.2}
\]
where λ > 0 determines the effective correlation time and ξ is a standard Wiener process (integrated gaussian white noise) scaled by the intensity parameter σ > 0. We assume that all the input and intrinsic noise sources can be collectively described by the process ν_t. This noise model is also often used with λ = 0, so that ν_t becomes a Wiener process, which has proved sufficient to reproduce real data (Massanes & Vicente, 1999). In Figure 1 we show an example of an output of the FHN neuron (see equations 2.1 and 2.2) with a sinusoidal signal and parameter values typical for the simulations. An important underlying assumption in our model and, indeed, in most rate-based treatments of neural dynamics, is that the intervals between action potentials in a given neuron, not their particular form, carry all the information relevant to the subsequent neural processing by other connected neurons. Accordingly, we will refer to the spike train as the set of time points where the fast variable V crosses a threshold c in an up-moving direction.

2.1.2 Synaptic Connections. The model for synaptic response is made up of two parts: a nominal (or average) response and a variability from the nominal due to synaptic depression (Koch, 1999).
For a synapse in a nominal state at an electrotonic distance x₀ from the soma on a dendrite of some length L ≥ x₀, the impulse response r (Green's function) for the transformation from an action potential applied on the presynaptic side of the synapse to the voltage at the soma can be modeled by an expansion of the form (Tuckwell, 1988, sec. 6.5)

\[
r(t) = b \sum_{n=0}^{\infty} A_n(x_0) \left( \frac{t\, e^{-at}}{1 + \lambda_n^2 - a} - \frac{e^{-at} - e^{-t(1+\lambda_n^2)}}{\bigl(1 + \lambda_n^2 - a\bigr)^2} \right), \qquad t \ge 0,
\tag{2.3}
\]

(with uniform convergence in t), where r(t) = 0 for t < 0. Expressions for the constants A_n, λ_n in terms of L, and graphs showing the appearance of equation 2.3 for typical values of these constants and a, b, are given in Tuckwell (1988). In equation 2.3, it is assumed that the impulse response from action potential to postsynaptic current at the soma is given by a so-called α-function of the form h(t) = b t e^{-at} for t ≥ 0 and h(t) = 0 for t < 0 (Jack & Redman, 1971). From the definition of r, it is clear that expression 2.3 actually describes both the synapse and the connected dendrite, but since the response at a point down the dendrite is mainly determined by the response of the synapse, we shall, for simplicity, refer to r in equation 2.3 as the nominal synaptic response.

The synaptic connections in the cochlear nucleus are often made by synapses having a fair, or even a large, number of release sites, such as the endbulb of Held, which is connected to spherical bushy cells in the anteroventral cochlear nucleus (Webster, Popper, & Fay, 1992). As a consequence, the synaptic transmission will be reliable in the sense that an incoming action potential will almost always yield an excitatory postsynaptic potential (EPSP). However, the EPSPs will vary in strength depending (primarily) on the preceding sequence of action potentials that have arrived at the synapse.
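As an illustration, equation 2.3 can be evaluated numerically once the constants are fixed. The sketch below assumes, purely for illustration, sealed-end cable eigenfunctions (λ_n = nπ/L with A_n(x₀) = φ_n(0)φ_n(x₀)); Tuckwell (1988) gives the exact expressions. The synapse values a = 10, b = 100, x₀ = 0.25, L = 1.5 follow Table 1:

```python
import numpy as np

def nominal_response(t, x0=0.25, L=1.5, a=10.0, b=100.0, n_terms=50):
    """Truncated sum of eq. 2.3: an alpha-function input b*t*exp(-a*t)
    filtered through the dendritic cable modes. Sealed-end
    eigenfunctions (an assumption made here) supply A_n and lambda_n."""
    t = np.asarray(t, dtype=float)
    r = np.zeros_like(t)
    for n in range(n_terms):
        lam_sq = (n * np.pi / L) ** 2                # lambda_n^2
        A_n = 1.0 / L if n == 0 else (2.0 / L) * np.cos(n * np.pi * x0 / L)
        mu = 1.0 + lam_sq                            # decay rate of mode n
        r = r + A_n * (t * np.exp(-a * t) / (mu - a)
                       - (np.exp(-a * t) - np.exp(-mu * t)) / (mu - a) ** 2)
    return b * np.where(t >= 0.0, r, 0.0)

ts = np.linspace(0.0, 2.0, 401)
resp = nominal_response(ts)
print(f"peak response {resp.max():.4f} at t = {ts[resp.argmax()]:.3f} ms")
```

The peak location reflects both the α-function rise time (of order 1/a) and the cable filtering; it is not a prediction of the paper's exact EPSP shape.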
This phenomenon, synaptic depression, has a crucial effect on the overall dynamical behavior of the nerve and needs to be taken into account in conjunction with the nominal response in equation 2.3. We model the depression using a simple action-recovery scheme developed by Kistler and Van Hemmen (1999) that combines the three-state plasticity model of Tsodyks and Markram (1997) and the spike response model of Gerstner and Van Hemmen (1992). The action-recovery scheme employs a variable Z and its complement 1 − Z that correspond to active and inactive resources, respectively, where the term resources can be interpreted as factors affecting both the pre- and the postsynaptic side, such as the availability of neurotransmitter substance or postsynaptic receptors. Quantitatively, the amount of available resources is determined by the recursion (Kistler & Van Hemmen, 1999),

\[
Z_{t_{k+1}} = 1 - \bigl[1 - (1 - R)\, Z_{t_k}\bigr] \exp\bigl[-(t_{k+1} - t_k)/\tau\bigr],
\tag{2.4}
\]

where 0 < R ≤ 1 is a constant corresponding to the fraction of resources that is inactivated by a spike and τ > 0 is a decay time parameter. The variable Z_{t_k} should be interpreted as the amount of resources available just before time t_k, and it is therefore proportional to the strength of an eventual EPSP caused by an action potential arriving at the synapse at time t_k. An approximation to the initial condition Z_{t_0} can be obtained by averaging the available resources over a number of spike trains, generated by the unforced FHN model for the studied system, for a large T. Thus, by using the depression model above, we can calculate the pristine (or noise-free) postsynaptic response R at the soma as

\[
R(t) = \sum_{k=0}^{\infty} Z_{t_k}\, r(t - t_k), \qquad t > 0,
\tag{2.5}
\]
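Putting the two pieces together: a minimal sketch of the recursion 2.4 and the sum 2.5, using the Table 1 values R = 0.2 and τ = 50 with a hypothetical spike train; to keep the sketch short, a bare α-function stands in for the nominal response r of equation 2.3:

```python
import numpy as np

# Hypothetical spike train (ms) and Table 1 depression constants
spike_times = np.array([1.0, 3.0, 4.0, 8.0, 9.0, 15.0])
R_frac, tau = 0.2, 50.0   # fraction inactivated per spike, decay time

# Eq. 2.4: resources available just before each spike
Z = np.empty_like(spike_times)
Z[0] = 1.0                # assume fully recovered resources at the first spike
for k in range(len(spike_times) - 1):
    Z[k + 1] = 1.0 - (1.0 - (1.0 - R_frac) * Z[k]) * np.exp(
        -(spike_times[k + 1] - spike_times[k]) / tau)

def alpha(t, a=10.0, b=100.0):
    """Stand-in for the nominal response r of eq. 2.3 (bare alpha-function)."""
    return np.where(t >= 0.0, b * t * np.exp(-a * t), 0.0)

# Eq. 2.5: depressed postsynaptic response at the soma
ts = np.linspace(0.0, 20.0, 2001)
R_t = sum(Zk * alpha(ts - tk) for Zk, tk in zip(Z, spike_times))

print("Z =", np.round(Z, 3))
```

Closely spaced spikes deplete Z faster than the exponential recovery can restore it, so each successive EPSP in a burst is weaker, which is the depression effect the text describes.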
where r is the nominal response given in equation 2.3. This model is capable of producing results in close agreement with real data (Tsodyks & Markram, 1997), provided appropriate choices of constants are made. In reality, there is always also a certain amount of noise present due to, for example, the inherent unreliability of the ionic channels involved in the transmission of signals in and between neurons (Koch, 1999). To take this effect into account, we have added zero-mean white gaussian noise with intensity s² to the EPSPs given by our model, which thus represents our total synaptic response.

2.2 Information Processing. We study information processing performance in terms of general statistical signal-noise separation measures applied to the output of our model, the soma of a cell in the cochlear nucleus. The output signal-noise separation setting was chosen since it can be applied with only minimal assumptions about the input signal. Due to the frequency selectivity of the primary parts of the auditory system, it is sufficient, at least as a good first approximation for weak signals (Eguíluz, Ospeck, Choe, Hudspeth, & Magnasco, 2000; Camalet, Duke, Jülicher, & Prost, 2000), to restrict attention to sinusoidal signals (possibly with slowly varying amplitude and phase). We therefore restrict attention to signals s_t in the FHN model of the form
\[
s_t = A \sin(\omega_0 t + \Theta),
\tag{2.6}
\]

where A, ω₀ ≥ 0 are constant in time and Θ ∈ [0, 2π) is a phase that is also constant in time.

2.2.1 φ-Divergences and Generalized SNR. A number of fundamental limits in statistical inference and IT can be expressed as monotonic functions of so-called φ-divergences, which can be thought of as directed distances between probability measures. For example, the minimal probability of error in (Bayesian) detection, Wald's inequalities (sequential detection), the bound in Stein's lemma (cutoff rates in Neyman-Pearson detection), and the Fisher information for small parameter deviations (the Cramér-Rao bound) can all be written as simple functions of a φ-divergence. In the most basic setting, where p₀, p₁ are two probability density functions (PDFs) on the real line ℝ, the φ-divergence d_φ(p₀, p₁) between p₀, p₁ is defined as (Liese & Vajda, 1987)

\[
d_{\varphi}(p_0, p_1) = \int_{\mathbb{R}} \varphi\!\left( \frac{p_1(x)}{p_0(x)} \right) p_0(x)\, dx,
\tag{2.7}
\]
where φ is any continuous convex function on [0, ∞) (we assume that p₀(x) = 0 implies p₁(x) = 0). By the conditions on φ, any divergence d_φ(p₀, p₁) takes its minimum value if and only if p₀ = p₁ almost everywhere, and the φ-divergences therefore express the "separation" between p₀, p₁ in a relative-entropy-like way. Indeed, one prominent member of the family of φ-divergences is the Kullback-Leibler divergence or relative entropy, also known as the information divergence d_I (Cover & Thomas, 1991), obtained for φ(x) = −ln(x). Other important members of this family are the Kolmogorov or error divergence d_E^{(q)}, obtained for φ(x) = |(1 − q)x − q|, where q ∈ (0, 1) is a parameter, and the χ²-divergence d_{χ²}, obtained for φ(x) = (1 − x)². The χ²-divergence is twice the first term in a formal expansion of the information divergence around 0 (p₀ = p₁) and is a (tight) upper bound for a family of generalized SNR measures known as deflection ratios² that depend only on the means and variances of the observables. If h is some function of the data, the deflection ratio (DR) D(h) is defined as (Basseville, 1989)

\[
D(h) = \frac{|E_1(h) - E_0(h)|^2}{\mathrm{Var}_0(h)},
\]

where E₁(h) and E₀(h) are the expectations of h computed using p₁ and p₀, respectively, and Var₀(h) is the variance of h computed using p₀. The DR is upper-bounded as

\[
D(h) \le d_{\chi^2}(p_0, p_1),
\tag{2.8}
\]

with equality if and only if C₁(h − E₀(h)) = C₂(p₁/p₀ − 1) with p₀-probability one, for two constants C₁, C₂ (not both zero). In particular, we have equality in equation 2.8 if h equals p₁/p₀, the likelihood ratio. It follows that a larger χ²-divergence allows for a larger SNR when expressed in terms of DRs.

² Indeed, it can be shown that the (narrow-band) SNR measures used in stochastic resonance can be expressed as limits of deflection ratios (Rung & Robinson, 2000; Robinson, Rung, Bulsara, & Inchiosa, 2001) when h represents a Fourier transform operation.

The χ² and information divergences locally determine the Cramér-Rao bound (CRB) for parameter estimation (Salicrú, 1993; Cover & Thomas, 1991). For example, if θ is a parameter with values in some open interval I and p_θ, θ ∈ I, is a family of PDFs on ℝ indexed by θ, then, under some regularity conditions,

\[
\lim_{\theta \to \theta_0} \frac{d_{\chi^2}(p_{\theta_0}, p_{\theta})}{2(\theta - \theta_0)^2}
= \lim_{\theta \to \theta_0} \frac{d_I(p_{\theta_0}, p_{\theta})}{(\theta - \theta_0)^2}
= \frac{1}{2}\, I(\theta_0),
\]
for θ₀ ∈ I, where I(θ₀) is the Fisher information at θ₀. Thus, for estimation of θ when θ is near θ₀, the CRB (which is the inverse of the Fisher information), and thereby the achievable accuracy of unbiased estimation of θ, is locally determined by the growth of the χ² and information divergences as functions of θ near θ₀.

The Kolmogorov divergence is directly related to the minimal achievable probability of error in Bayesian hypothesis testing. If p₀ and p₁ are two possible PDFs for the observed data and q is taken as the a priori probability that p₀ is correct, so that p₁ has probability 1 − q, then the minimal achievable probability of error³ P̃_e^{(q)}(p₀, p₁) for deciding between p₀ and p₁ (which one is the correct density) based on a single sample x is given by (Ali & Silvey, 1966)

\[
\tilde{P}_e^{(q)}(p_0, p_1) = \tfrac{1}{2}\bigl(1 - d_E^{(q)}(p_0, p_1)\bigr).
\]

A larger Kolmogorov divergence thus gives a smaller minimal probability of error. These properties manifest the versatility of φ-divergences as indicators of information processing performance. For later reference, we also point out that all the definitions and properties above have counterparts on much more general probability spaces (Liese & Vajda, 1987; Robinson et al., 2001; Rung & Robinson, 2000), for instance, in the infinite-dimensional context of probability measures on the space of continuous functions on [0, T].

2.2.2 Auditory Processing Performance. In order to apply φ-divergences to assess performance in our model of the auditory processing chain, we need to specify the setting in somewhat greater detail, as well as elaborate on some features of the model. We have chosen to make the stimulus parameters A and ω₀ constant and treat the phase Θ as a (variable) parameter. At first, this might seem to be an oversimplification, but since every nerve cell in the auditory nerve is tuned to a given frequency, it is natural to consider the stimulus frequency ω₀ as constant.
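To make the measures of section 2.2.1 concrete, the sketch below estimates d_I, d_{χ²}, and d_E^{(q)} from histograms (the same estimation approach as in section 2.3) for two hypothetical gaussian stand-ins for the signal-absent and signal-present output distributions, and checks the deflection-ratio bound 2.8 with h(x) = x:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical samples standing in for the signal-absent (p0) and
# signal-present (p1) soma-voltage distributions.
x0 = rng.normal(0.0, 1.0, 200_000)
x1 = rng.normal(0.4, 1.0, 200_000)

edges = np.linspace(-6.0, 6.0, 121)
p0, _ = np.histogram(x0, bins=edges, density=True)
p1, _ = np.histogram(x1, bins=edges, density=True)
dx = np.diff(edges)

mask = p0 > 0                      # convention: p0(x) = 0 implies p1(x) = 0
ratio = p1[mask] / p0[mask]
w = p0[mask] * dx[mask]            # p0(x) dx on the kept bins

# phi-divergences of eq. 2.7 for three choices of phi
d_I = np.sum(w * (-np.log(np.clip(ratio, 1e-12, None))))  # phi(x) = -ln x
d_chi2 = np.sum(w * (1.0 - ratio) ** 2)                   # phi(x) = (1-x)^2
q = 0.5
d_E = np.sum(w * np.abs((1.0 - q) * ratio - q))           # Kolmogorov, q = 1/2

# Deflection ratio with h(x) = x, to compare against the bound of eq. 2.8
D = (x1.mean() - x0.mean()) ** 2 / x0.var()
print(d_I, d_chi2, d_E, D)
```

For these two gaussians the bound 2.8 is strict, since h(x) = x is not the likelihood ratio (which is exponential in x here).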
Furthermore, we begin by studying a nerve cell that has only one connection to a single neuron in the higher layers. Obviously, a more realistic setting would be to model an axon that exhibits spatial divergence near the end, where it splits up into different branches. The different branches then connect with the dendritic tree or soma of the following neurons. Since the dendrites (from the connecting synapse to the soma) have different lengths, the time delays in them will be different. For sinusoidal input signals, this is equivalent to a phase shift of the signal, at least as a good first approximation. Thus, for a given frequency, the primary auditory processing could be viewed as taking place over a bank of parallel channels, all similar in characteristics but each giving a different phase shift to the signal. However, we will show that our basic problem, with only one connection to the next layer of neurons, is the key to understanding the detection performance of the more complicated settings with many connections.

We assess the auditory processing performance by computing the φ-divergences of the output of our model (the voltage at the soma of a cell in the cochlear nucleus) at a time point T, where T is the end point of a long time interval [0, T]. Although the measurement is made at only a single point in time, the output voltage, which is R(t) in equation 2.5 plus additive noise, will depend on the entire preceding sequence of action potentials. However, T is large enough to make the output only marginally affected by action potentials at the beginning of the interval. The two PDFs p₀, p₁ in definition 2.7 are in the present setting given by the PDF of the output when no signal is present in the FHN model, equations 2.1 and 2.2 (s_t ≡ 0), and when a signal s_t as in equation 2.6 is present, respectively. In general, an applied input signal will change the neuron firing rate, which will result in a separation between p₀ and p₁. On the other hand, under certain circumstances, the firing rate is about the same regardless of whether the input signal is present, but the PDFs for the two cases are nevertheless well separated and the φ-divergence is high.

³ As is well known, P̃_e^{(q)}(p₀, p₁) is achieved with a simple posterior-ratio test.
This will be the case, for example, if phase locking takes place when the input is applied. Since the PDFs here are densities on the real line, they are easy to compute using numerical simulation, but they depend on the phase Θ, and so do the resulting φ-divergences.

2.3 Simulations. In order to produce realistic data in the simulations, we start with the set of parameters in Table 1, in which the FHN parameters have been chosen on the basis of previous studies (Massanes & Vicente, 1999; Alexander et al., 1990) and the other parameters have been tuned to recreate data similar to real experiments. The stochastic differential equations were solved using the Euler-Maruyama scheme (Kloeden & Platen, 1992), and the PDFs of the output of the model were estimated using a histogram approach based on counting the number of samples falling in a grid of intervals on the real line. For calculation of the Kolmogorov divergence, the raw histograms so obtained were sufficient, but they proved insufficient for the χ² and information divergences (which are sensitive to inaccuracies in the representation of the PDFs). Therefore, smoothing with a kernel of the type e^{−c|x|} was applied to the estimated PDFs before the latter two divergences were calculated. In order to reduce the dependence on the smoothing parameter c, its values were kept in a region where the results for the Kolmogorov divergence did not vary appreciably depending on whether smoothing was applied. Moreover, in this region, the values of the so-computed χ² and information divergences were qualitatively independent of the value of c. All our simulations were done using Matlab on UNIX (Digital)/Linux (i386), with code that can be accessed over the Internet (Karlsson & Robinson, 2001).

Table 1: Starting Values for the Model Parameters.

Parameter   Part of the Model Affected   Value
A           Input signal                 0.2
ω₀          Input signal                 8
Θ           Input signal                 0
a           FHN neuron                   0.55
b           FHN neuron                   0.12
d           FHN neuron                   1
ε           FHN neuron                   0.005
σ           FHN neuron                   100√2 · 10⁻⁵
λ           FHN neuron                   100
c           FHN neuron                   0.5
a           Synapse/dendrite             10
b           Synapse/dendrite             100
x₀          Synapse/dendrite             0.25
L           Synapse/dendrite             1.5
R           Synapse/dendrite             0.2
τ           Synapse/dendrite             50
s²          Synapse/dendrite             10⁻⁴

3 Results
Our main object of study is the variability of performance, quantified via φ-divergences (see section 2.2.2), as a function of parameters. We focus primarily on the Kolmogorov divergence, since it is easiest to compute numerically, but we also consider performance in terms of the information and χ² divergences and deflection ratios. As mentioned in section 2.3, we first choose a nominal set of parameter values for the simulations (see Table 1), which results in output signals resembling real neuron data, and then we let the parameters vary around this point. At all times, the parameters are kept inside the region where the output is spike-train-like (all the resulting FHN outputs are similar to the one shown in Figure 1). The synaptic constants used for the simulation are chosen to give realistic EPSPs for the studied systems, and the distance x₀ is set rather small (x₀ = 0.25 on a dendrite of length L = 1.5), since many synapses in the auditory system (e.g., the endbulb of Held) form connections close to the soma. It appears that the exact interval in which these constants are chosen is not crucial, provided it does not contain unrealistic values, because the synaptic constants affect the optimal settings of the other parameters in the model only marginally. However, modeling the depression is important, since without taking it into account, new values of the parameters become optimal that, when related to real experiments, are clearly outside the physically relevant region.

Figure 2: The Kolmogorov divergence for different values of the potential parameter a and the phase Θ, with the other parameters set as in Table 1.

3.1 Performance with Respect to Variation of a and Θ. To illustrate the importance for our investigation of the simple system, the one with only one connection to the following layers of neurons, we start by fixing all parameter values except a and Θ. The resulting Kolmogorov divergence as a function of the model parameter a is shown in Figure 2 for five different values of Θ. As can be seen, the optimal value of a, which corresponds to the largest value of the divergence, does not change with the phase. This phase independence is, moreover, shared by the optimal values of the model parameters for all divergences studied here. Hence, for a more complicated system with several connecting dendrites of different lengths, the values of the model parameters for the simplest system will most likely also be optimal. This does not imply that a divergence on an N-dimensional output space, obtained by simultaneously considering the outputs of N channels for dendrites of varying length, would be optimized. However, based on the fact that all the divergences studied are invariant with respect to the phase, our conjecture is that the N-dimensional case is qualitatively similar, and we have therefore restricted attention in this study to the simplest case.
Figure 3: (a) The Kolmogorov divergence for different values of the potential parameter a and the bias parameter b, with q = 0.5 and the other parameters as in Table 1. (b) The χ²-divergence for different values of a and b when the other parameter values are set as in Table 1. Due to the unreliability of high values of the divergence, no value above 40 has been plotted. (Eventually, the χ²-divergence decreases to zero when a becomes sufficiently large, since almost no spikes are then generated.) (c) The information divergence for different values of a and b when the other parameter values are the same as in Table 1. Due to the unreliability of high values of the divergence, no value above 2 has been plotted. (d) The deflection ratio for different values of a and b when the other parameter values are the same as in Table 1.
3.2 Performance with Respect to Variation of a and b. A basic example of performance expressed as a function of the FHN parameters is shown in Figure 3a, where the Kolmogorov divergence is displayed as a function of a and b, with the other parameters set as in Table 1. Both a and b affect how much excitation is needed to produce spikes in the FHN output. If a is made smaller, the potential barrier height decreases, which gives a larger spike rate. Increasing the value of b has the same effect, since an increase in b can be interpreted as a bias added to the input signal. This is illustrated in Figure 4, where the FHN neuron's spontaneous activity is displayed for different values of a and b.
Figure 4: Spontaneous activity (no input signal applied) for the FHN model, for different values of the parameters a and b and with the other parameters set as in Table 1.
A marked ridge is present in the divergence surface in Figure 3a, indicating that there is a family of values of the potential parameter a and the bias parameter b that optimize the ability of the modeled system to detect a (weak) sinusoidal signal. The FHN neurons corresponding to these parameter values have the common property that they fire only sparsely without the signal input but fire with a significant intensity when the signal is present. For parameter values outside the region under the ridge, the Kolmogorov divergence, and the associated performance, is uniformly lower. The plateau to the left of the ridge is located above parameter values for which the FHN neurons are very easily excited. Given that the spike intensities of the FHN neurons corresponding to these parameter values are roughly independent of the presence or absence of an input signal, the presence of the plateau may seem counterintuitive. However, the firing that takes place when an input signal is applied is much more regular (since it is phase-locked to the signal) than that taking place when the excitation is just noise. Thus, the divergences corresponding to the systems whose FHN parts are easily excited are rather large but still clearly smaller than those corresponding to the ridge. In this region of parameter values, it is also possible that an applied input signal decreases the firing rate, since the noise-induced firing rate can be larger than the rate given by a phase-locked spike train. Consequently, although the region of spontaneous firing yields rather large divergences, they are clearly smaller than the divergences on the ridge. The region of low divergences to the right of the ridge is generated by parameter values corresponding to FHN neurons that are very difficult to excite and hardly ever fire, even in the presence of an input signal.
Performing the same type of analysis on the system but using the χ² or information divergence instead yields qualitatively similar results, as seen in Figures 3b and 3c. Due to numerical effects, it is hard to calculate the exact height of the ridges, and we therefore limit the heights of the surfaces in the figures by truncating values above a certain threshold to the value of the threshold. Although this prevents a precise estimation of the optimal combinations of parameter values, it allows the main objective to be fulfilled: to show the existence of regions with (considerably) better performance in terms of divergences than others. For deflection ratios, on the other hand, the numerical problems are minor, since they can be calculated without explicitly calculating p₀ and p₁, which makes DRs more robust. However, the DRs are not directly related to the probability of error if h is not chosen as, for example, p₁/p₀ (as explained in section 2.2.1), and they therefore work only as an indication of the separation of the PDFs. An example of this can be seen in Figure 3d, where the DRs for the output of the model, with h(x) = x, are displayed. For the DRs too, a ridge can be seen, and the resulting set of optimal values is similar to that for the divergences (though small changes in the position of the ridge can be seen). This qualitative behavior, seen in all examples so far, with a (largely) common region of optimal values, is recurrent in all our simulations described in the following section.

3.3 Performance for a Lower Intensity Level. In the previous section, we described a simulation aimed at investigating optimization of performance as a function of the potential parameter a and the bias parameter b in an otherwise fixed environment. If we change the environment, new values of the parameters will emerge as optimal. For instance, if we lower the intensity level of the noise, the location of the ridge appearing in Figure 3a will change, as seen in Figure 5a.
Together, these two gures illustrate that care must be exercised when interpreting results of the stochastic resonance type (Gammaitoni, H¨anggi, Jung, & Marchesoni, 1998) for neural processing systems. For a xed pair of parameters values a, b, such as a D 0.6 and b D 0.12, the divergence can be higher for a larger noise level, indicating an SR effect, since the detection performance can increase with the noise intensity. However, the maximally achievable divergence, if we can freely choose the values on a and b, will always be lower for higher noise levels. Hence, for a system where adaptation to environmental changes is possible, by, for example, changing the values of the parameters, a lower noise intensity is always better in our setting. 3.4 Performance with Respect to Variation of a and ± . If instead of varying the potential parameters a, b, we vary the relaxation parameter d, we get the result illustrated in Figure 5b. Also, this divergence surface displays a marked ridge, similar to the ones in Figures 3 and 5a, indicating possible combinations of parameter values for best performance. The observed ridges in the divergence and deection surfaces indeed allow for optimiza-
2196
Mattias F. Karlsson and John W. C. Robinson a
1
b
0.8
0.8
d(0.5) e
d(0.5) e
0.6
0.6
0.4
0.4
0.2
0.2
0 2.5
0 0.25
2 0.2
b
1.5
d
0.15
1 0.5
0.1 0.05
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0
a
c
0.5
a
d(0.5) e
d(0.5) e
0.8
0.3
0.6 0.4
0.2 0.1
0.2
0 0.25
0 0.25 0.2
0.2
b
0.7 0.8 0.4 0.5 0.6 0.1 0.2 0.3
d
1
0.4
0
0.15
0.15
b
0.1 0.05
0
0.2
0.4
0.6
a
0.8
1
0.1 0.05
0
0.2
0.4
a
0.6
0.8
1
Figure 5: (a) The Kolmogorov divergence for different values of the potential parameter a and the bias parameter b, when the other parameter values are pthe same as in Table 1 except for the noise intensity, which is lower (s D 100 2 ¢ 10¡6 ). (b) The Kolmogorov divergence for different values of the potential parameter a and the relaxation parameter d with the other parameter values as in Table 1. (c) The Kolmogorov divergence for different values of the potential parameter a and the bias parameter b when the other parameter values are the same as in Table 1 except for the signal amplitude, which is lower (A D 0.1). (d) The Kolmogorov divergence for different values of the potential parameter a and the bias parameter b when the other parameter values are the same as in Table 1 except for the frequency, which is lower (v0 D 2).
tion of performance by taking parameter values in the interior of the domain of values that have physical significance. Since the model is based on fairly standard and well-accepted components (e.g., the FHN model), which we feel capture the essential mechanisms involved in the information processing considered here, we believe that the results in fact can be interpreted as a quantitative indication of how some of the parameters in the auditory system presumably must be set. In particular, this applies to the quiescent firing rate (QFR), which in real systems, under this assumption, must take values near those that correspond to the maxima of the performance measures considered here. Verifying this is a topic for future research. The conclusion about the QFR is based on the qualitative observation that all the ridges appearing in the divergence and deflection surfaces cor-
On Optimality in Auditory Information Processing
2197
respond to parameter values that lie in a certain “thin” or “manifold-like” set in parameter space. A closer examination of this set shows that the combinations of parameter values that correspond to, for example, the ridge in Figure 3a describe systems that have virtually the same firing intensity in the absence of an external signal—that is, virtually the same QFR. Since this specific QFR also is common for all optimal values of parameter combinations corresponding to the ridges in Figures 3, 5a, and 5b and in all other simulations that we have tried with the same input signal, this strongly suggests a connection between the QFR and the information processing performance of the system.

3.5 Performance for Other Input Signal Parameters. The ridges in the divergence surfaces discussed so far are relevant only for the given input signal; if we change the input by, for example, altering the amplitude or the frequency of the signal, we get a different result. Examples of this are shown in Figure 5c, where the amplitude A is set to 0.1, and in Figure 5d, where the angular frequency ω₀ is set to 2. Although we still can see ridges in both cases, they are different in shape from the first one in Figure 3a. Obviously, the divergence decreases with decreased signal amplitude, and the height of the ridge becomes lower in Figure 5c, but the location of the ridge changes only slightly, and it appears as if only a slight change of optimal parameter values occurs. When the frequency is varied, however, the ridge clearly moves to an entirely new position, and new parameter values render optimal performance. This reflects well the frequency division of sound performed in the inner ear, as discussed in section 2.2. Since the optimal values of the parameters are very little affected by a change in our (weak) input signal amplitude, so is the optimal QFR. When the frequency of the input is varied, however, the optimal QFR changes considerably.
A more detailed investigation of this connection shows the optimal QFR to be as low as about 1 spike per second for low frequencies, but it increases with the frequency and can be over 100 spikes per second for input signals with a high frequency.

4 Discussion
We have described a method for analyzing the information processing capability of the primary part of the mammalian auditory nervous system using fundamental statistical and information-theoretical performance criteria, quantitatively expressed by ϕ-divergences. Our premise has been that since these criteria are presumably relevant for the processing taking place in this system, the nonexistence of well-defined global maxima of these criteria in the interior of regions of feasible system parameters would suggest incompleteness or incorrectness of the overall model. However, the observed ridges in Figures 3 and 5 clearly show a family of settings of parameter values of physical relevance, which maximizes the performance.
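As a concrete companion to these criteria, taking the Kolmogorov divergence to be the total-variation-type distance d = ½ Σ|p₀ − p₁| between the system's output distributions with and without a signal, it can be estimated from histogrammed simulation output. The sketch below is our own illustration, not the authors' published package, and the two Gaussian samples are placeholders for simulated firing statistics:

```python
import numpy as np

def kolmogorov_divergence(x0, x1, bins=50):
    """Estimate 0.5 * sum |p0 - p1| from two samples, using a
    common set of histogram bins for both."""
    edges = np.histogram_bin_edges(np.concatenate([x0, x1]), bins=bins)
    p0, _ = np.histogram(x0, bins=edges)
    p1, _ = np.histogram(x1, bins=edges)
    p0 = p0 / p0.sum()
    p1 = p1 / p1.sum()
    return 0.5 * np.abs(p0 - p1).sum()

# Placeholder "outputs": overlapping distributions give a divergence
# near 0; well-separated ones give a divergence near 1.
rng = np.random.default_rng(1)
quiet = rng.normal(0.0, 1.0, 100_000)
driven = rng.normal(6.0, 1.0, 100_000)
```

The divergence surfaces in Figures 3 and 5 are obtained by evaluating such a distance over a grid of system parameters.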
As shown, all these settings seem to correspond to the same QFR as long as the frequency of the input signal is kept constant. This inspires new experiments to answer questions like the following: Are different neurons that process information about the same frequency, but in a more or less noisy environment, designed so that they still have the same QFR? Further, it would be interesting to clarify a possible connection between the neurons' QFR and the input signal frequency. Although existing data are inconclusive on this point, Kiang's classical data (1965) can be interpreted to support the hypothesis that such a frequency dependence exists. However, experiments are needed to resolve this issue. One possibility would be to compare the QFR for neurons that perform analysis of input signals with different frequencies but have comparable synaptic connections to the following cell in the cochlear nucleus. Finally, we point out that although the location of the ridge in, for example, Figure 3 is largely the same, it does vary slightly depending on which divergence or deflection is considered, which is to be expected since these performance measures are not identical. In particular, the χ²-divergence in Figure 3b can, as explained in section 2.2.1, be considered to be a first-order approximation of the information divergence in Figure 3c. All constants in our model have been chosen in order to produce as realistic data as possible. The choices are not critical, though, since in most of the simulations where the values of the constants are varied (in a reasonably large interval), the results are qualitatively invariant. Our approach therefore offers a new qualitative, and possibly also a quantitative, explanation of the different levels of QFRs observed in the auditory nerve.

Acknowledgments
We acknowledge fruitful discussions with A. Longtin of Ottawa University, which improved the results in several respects, and thank A. Bulsara of SPAWAR SSC, San Diego, California, for many insightful suggestions that clarified the presentation of the material. This work was supported by FOA project E6022, Nonlinear Dynamics.

References

Alexander, J. C., Doedel, E. J., & Othmer, H. G. (1990). On the resonance structure in a forced excitable system. SIAM J. Appl. Math., 50, 1373–1418.
Ali, S., & Silvey, D. (1966). A general class of coefficients of divergence of one distribution from another. J. Roy. Stat. Soc., B28, 131–142.
Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing, 18, 349–369.
Camalet, S., Duke, T., Jülicher, F., & Prost, J. (2000). Auditory sensitivity provided by self-tuned critical oscillation of hair cells. Proc. Nat. Acad. Sci., 97, 3183–3188.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Eguíluz, V. M., Ospeck, M., Choe, Y., Hudspeth, A. J., & Magnasco, M. O. (2000). Essential nonlinearities in hearing. Phys. Rev. Lett., 84, 5232–5235.
FitzHugh, R. A. (1961). Impulses and physiological states in theoretical models of nerve membrane. BioPhys. J., 1, 445–466.
Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev. Mod. Phys., 70, 223–287.
Geisler, C. D. (1998). From sound to synapse: Physiology of the mammalian ear. New York: Oxford University Press.
Gerstner, W., & Van Hemmen, J. L. (1992). Associative memory in a network of “spiking” neurons. Network, 3, 139–164.
Hochmair-Desoyer, I. J., Hochmair, E. S., Motz, H., & Rattay, F. (1984). A model for the electrostimulation of the nervus acusticus. Neuroscience, 13, 553–562.
Jack, J. J. B., & Redman, S. J. (1971). The propagation of transient potentials in some linear cable structures. J. Physiol., 215, 283–320.
Karlsson, M. F., & Robinson, J. W. C. (2001). Auditory neural divergences package. Available on-line: http://goto.glocalnet.net/neuro diverg/.
Kiang, N. (1965). Discharge patterns of single fibers in the cat's auditory nerve. Cambridge, MA: MIT Press.
Kistler, W. M., & Van Hemmen, J. L. (1999). Short-term plasticity and network behavior. Neural Comp., 11, 1579–1594.
Kloeden, P. E., & Platen, E. (1992). Numerical solution of stochastic differential equations. Berlin: Springer-Verlag.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Liese, F., & Vajda, I. (1987). Convex statistical distances. Leipzig: Teubner.
Longtin, A. (1993). Stochastic resonance in neuron models. J. Stat. Phys., 70, 309–327.
Massanes, S. R., & Vicente, C. J. P. (1999). Nonadiabatic resonances in a noisy Fitzhugh-Nagumo neuron model. Phys. Rev. E, 59, 4490–4497.
Robinson, J. W. C., Rung, J., Bulsara, A. R., & Inchiosa, M. E. (2001). General measures for signal-noise separation in nonlinear dynamical systems. Phys. Rev. E, 63, article no. 011107.
Rung, J., & Robinson, J. W. C. (2000). A statistical framework for the description of stochastic resonance phenomena. In D. S. Broomhead, E. A. Luchinskaya, P. V. E. McClintock, & T. Mullin (Eds.), STOCHAOS: Stochastic and chaotic dynamics in the lakes. Melville, NY: American Institute of Physics.
Salicrú, M. (1993). Connections of generalized divergence measures with Fisher information matrix. Information Sciences, 72, 251–269.
Scott, A. C. (1975). The electrophysics of a nerve fiber. Rev. Mod. Phys., 47, 487–533.
Stemmler, M. (1996). A single spike suffices: The simplest form of stochastic resonance in model neurons. Network: Computation in Neural Systems, 7, 687–716.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability.
Proc. Natl. Acad. Sci., 94, 719–723. (See also correction available on-line: http://www.pnas.org.)
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Webster, D. B., Popper, A. N., & Fay, R. R. (Eds.) (1992). The mammalian auditory pathway: Neuroanatomy. New York: Springer-Verlag.
Zador, A. (1998). Impact of the unreliability on the information transmitted by spiking neurons. J. Neurophysiol., 79, 1219–1229.

Received October 16, 2000; accepted March 1, 2002.
LETTER
Communicated by Zhaoping Li
Computational Capacity of an Odorant Discriminator: The Linear Separability of Curves

N. Caticha
[email protected]
J. E. Palo Tejada
[email protected]
Instituto de Física, Universidade de São Paulo, CEP 05315-970, São Paulo, SP, Brazil

D. Lancet
[email protected]
Department of Membrane Research and Biophysics, Weizmann Institute of Science, Rehovot 76100, Israel

E. Domany
[email protected]
Department of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel

We introduce and study an artificial neural network inspired by the probabilistic receptor affinity distribution model of olfaction. Our system consists of N sensory neurons whose outputs converge on a single processing linear threshold element. The system's aim is to model discrimination of a single target odorant from a large number p of background odorants within a range of odorant concentrations. We show that this is possible provided p does not exceed a critical value p_c and calculate the critical capacity α_c = p_c / N. The critical capacity depends on the range of concentrations in which the discrimination is to be accomplished. If the olfactory bulb may be thought of as a collection of such processing elements, each responsible for the discrimination of a single odorant, our study provides a quantitative analysis of the potential computational properties of the olfactory bulb. The mathematical formulation of the problem we consider is one of determining the capacity for linear separability of continuous curves embedded in a large-dimensional space. This is accomplished here by a numerical study, using a method that signals whether the discrimination task is realizable, together with a finite-size scaling analysis.

1 Introduction
The basic machinery for olfaction, our ability to smell, is an array of a few hundred different types of sensory neurons. Each of these expresses molecular receptors that belong to a single type. When this small neuronal as-

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2201–2220 (2002)
2202
N. Caticha, J. E. Palo Tejada, D. Lancet, and E. Domany
sembly is exposed to external stimuli, its cooperative response is capable of detecting and recognizing a wide variety of odorants and measuring their concentrations. We use odorant to describe any chemically homogeneous substance (ligand) that elicits a response from the olfactory system. The response of the array of neurons to any particular odorant is determined by the responses of the individual constituent neurons. This response is, in turn, governed by the extent to which the receptors expressed by the particular neuron bind the odorant, that is, by the affinity K of the neuron's receptors for the odorant. According to a model proposed by Lancet, Sadovsky, and Seidman (1993), these affinities can be viewed as independent random variables, drawn from a single receptor affinity distribution (RAD), denoted by ψ(K). Once a set of affinities (for all odorants and all sensory neurons) has been generated, the response of the entire sensory assembly to any odorant is determined. This information is transferred from the sensory neurons to the olfactory bulb, onto which the axons of the sensory neurons project. They form synapses within the olfactory glomeruli with secondary neurons (mitral and tufted cells). Each glomerulus appears to receive input from cells that bear the same type of receptor protein. The olfactory bulb also contains a complex network of local lateral connections, which provide for initial cross-talk between the receptor-specific units, mediated by periglomerular and granule interneurons. This integration of the sensory input that takes place in the olfactory bulb forms the first step of the information processing in the olfactory pathway. Additional processing and integration is believed to take place in the prepyriform cortex, entorhinal cortex, and other higher brain areas.
In this article, we evaluate, on the basis of a very simple model, some of the potential computational characteristics of the olfactory pathway as it performs its initial information processing. Although we do not claim to model any specific central nervous system regions involved in olfactory processing, our analysis may relate to some global characteristics of this pathway. Our simple model for the sensory array and a single processing unit is depicted in Figure 1. Some of the important characteristics of a real olfactory system that are not taken into account by such a simple model are discussed in section 4. The model we introduce is also interesting from a mathematical point of view. The problem of linear separability (LS) of points in N-dimensional space has received considerable attention since Schläfli in 1852 (Kinzel, 1998). In the mathematics literature, Cover (1965) studied the problem of LS of independent dichotomies using combinatorial methods. In computer science, the perceptron, introduced by Rosenblatt (1962) and analyzed in detail by Minsky and Papert (1969), gave a major boost to the field of neural networks. More recently, by introducing statistical mechanics techniques, Gardner (1988) and Gardner and Derrida (1988) extended Cover's results to cases where there are correlations between the points that have to be linearly separated.
Figure 1: Schematic representation of our model. N sensory neurons, represented by boxes, are exposed to odorant μ, present in concentration H^μ. Each sensory neuron i is characterized by a set of affinities K_i^μ; the sensory input elicits from neuron i a response S_i^μ, as given by equations 2.4 and 2.5. A weighted sum of the N sensory responses serves as the input of the secondary neuron (circle), whose output s^μ, generated in response to odorant μ, is given by equation 2.6.
We generalize the problem of separating (zero-dimensional) points to the separability of (one-dimensional) strings or curves, embedded in N-dimensional space. In the context of our problem, the curves that need to be separated are parameterized continuously by the odorant concentration. In principle, one can address the separability of curves by placing a discrete set of points on each curve, thereby mapping the problem onto the previously solved one of separating points. One should note, however, that points that lie on the same curve are not independent; in fact, they are correlated in ways that render the previously developed analytical methods inapplicable. Therefore, we present an extensive numerical analysis of the capacity of this special neural network—of N sensory neurons that provide input to a single processing unit. The capacity we calculate is interpreted as follows. The sensory system is exposed to p + 1 odorants, one at a time. One of these is the target; the aim is to distinguish the target from all the other p = αN odorants that form a noisy olfactory background. The model, based on a single-layer perceptron, is introduced and discussed in detail in section 2.1. Then we describe the method we have developed in order to determine the capacity numerically. To do this, we adapted and used several different techniques. One of these, a learning algorithm
introduced by Nabutovsky and Domany (1991), is described in section 2.2. This algorithm, like all other perceptron learning rules, finds the separating plane (if the problem is LS); however, unlike other learning algorithms, it provides a rigorous signal that a sample of examples is not LS. Another technique we adapted to our purposes is finite-size scaling (FSS) analysis of the data. The main results are presented in section 3 as curves of capacity as a function of odorant concentration in the thermodynamic (N → ∞) limit, obtained by extrapolation, using FSS, from data obtained at a sequence of N values. This large-N limit is quite natural from both practical and theoretical points of view. In practice, for N of the order of a few hundred, the results can barely be distinguished numerically from those at the N → ∞ limit. As to the theoretical side, the situation in this limit is much cleaner and easier to analyze. The final section contains a critical discussion of the results from a biological point of view. Our central finding is summarized in Figure 6. If we fix the range of concentrations H_min < H < H_max in which the system operates and increase the number of background odorants, we reach a critical number p_c beyond which the system fails to discriminate the target. This critical number is proportional to the number of sensory neurons N, that is, p_c = α_c N, and it decreases when the concentration range increases.

2 Computational Model

2.1 Odorant Identification as Linear Separation of Curves in N-Dimensional Space. The simple neural assembly that is considered here consists
of a single secondary neuron that receives inputs from an array of N units that model the sensory neurons. The single secondary neuron represents a grandmother cell, whose task is to detect one particular target odorant, labeled 0. The sensory scenario we consider allows exposure of the neuronal assembly to a single odorant, which may be either the target odorant or one of p = αN background odorants. The odorant provides simultaneous stimuli to the N sensory neurons. The aim of the single secondary neuron is to determine whether the odorant that generated the incoming signal from the sensory array is the target odorant 0. We assume that all odorants, background and target, are presented to the sensory array in concentrations H that lie within a range

$$H_{\min} < H < H_{\max}. \qquad (2.1)$$
We pose the following well-defined quantitative question: What is the maximal number p_c of different background odorants that our neuron can distinguish from the target, for any concentration within the prescribed range? To sharpen the question, we put it in a more precise mathematical form. Consider μ = 1, 2, …, p background odorants with respective concentrations H^μ in the range 2.1. Odorant μ is characterized by the i = 1, …, N
affinities K_i^μ of the N receptors. According to the RAD model, these affinities are selected independently from a distribution ψ(K) (Lancet et al., 1993). All our numerical results were obtained using for ψ(K) the form (note: K ≥ 0)

$$\tilde{\psi}(K) = \frac{K}{\sigma^2}\,\exp\!\left(-\frac{K^2}{2\sigma^2}\right). \qquad (2.2)$$

The average and the spread of this distribution are given by

$$\langle K \rangle = \sqrt{\frac{\pi}{2}}\,\sigma, \qquad \sqrt{\operatorname{var}(\tilde{\psi}(K))} = 0.65\,\sigma. \qquad (2.3)$$
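The moments in equation 2.3 are easy to check numerically (this sketch is ours, not from the paper): the density 2.2 is the Rayleigh distribution with scale σ, which is exactly the distribution NumPy's Rayleigh sampler draws from.

```python
import numpy as np

# The RAD density (2.2) is the Rayleigh density with scale sigma:
# (K / sigma^2) * exp(-K^2 / (2 sigma^2)), supported on K >= 0.
rng = np.random.default_rng(0)
sigma = 1.0
K = rng.rayleigh(scale=sigma, size=1_000_000)

mean_K = K.mean()  # theory: sqrt(pi/2) * sigma ~ 1.2533
std_K = K.std()    # theory: sqrt(2 - pi/2) * sigma ~ 0.6551
```

The empirical standard deviation reproduces the 0.65σ figure quoted in equation 2.3.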
The distributions suggested by Lancet et al. (1993) were Poisson and binomial. With regard to the computational limitations of our model, the important idea behind the RAD model lies not in the exact form of the distribution but in the fact that the affinities can be thought of as independent random variables. The main computational features of our model will not be altered as long as the distribution has the following features: it is zero for negative affinities and has finite first and second moments. We have used ψ̃(K) since it satisfies these constraints and is easier to deal with in analytical calculations. When receptor i is exposed to odorant μ at concentration H^μ, its response is given by

$$S_i^{\mu} = f(K_i^{\mu} H^{\mu}), \qquad (2.4)$$

where f(x) is a sigmoid-shaped function. We use

$$f(x) = \frac{x}{1 + x} \qquad (2.5)$$

throughout this article. Although there is some evidence for linear scaling of the signals from the glomerular layer with odorant concentrations (Ma & Shepherd, 2000; Wachowiak, Cohen, Zochowski, & Falk, 2000), a general analysis should take into account the inevitable saturation at high concentrations (see equation 2.5), as dictated by the law of mass action (Schild, 1988). The value taken by the affinity K_i^μ sets the particular concentration scale at which odorant μ affects the ith sensory neuron. From this point on, we set σ = 1 in equation 2.2. This means that the concentrations are measured in inverse units of the parameter σ.
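Equations 2.4 and 2.5 can be tabulated directly. The snippet below is our own illustration (with σ = 1 as in the text, and a deliberately tiny array of N = 8 receptors): it shows the near-linear response at a low concentration and the saturation toward 1 at a high one.

```python
import numpy as np

def response(K, H):
    """Sensory responses S_i = f(K_i * H), with f(x) = x / (1 + x)
    (equations 2.4 and 2.5)."""
    x = K * H
    return x / (1.0 + x)

rng = np.random.default_rng(0)
K = rng.rayleigh(scale=1.0, size=8)  # affinities of N = 8 receptors, sigma = 1

S_low = response(K, 0.5)    # near-linear regime at low concentration
S_high = response(K, 50.0)  # responses saturate toward 1
```

Sweeping H between H_min and H_max traces out the curve S(H) in N-dimensional response space whose linear separability is the subject of this section.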
The set of values {S_i^μ} = {S_1^μ, S_2^μ, …, S_N^μ} constitutes a vector of signals S^μ, generated by the entire sensory array when it is exposed to odorant μ. The {S_i^μ} serve as inputs to our secondary neuron, which we model as a linear threshold element or perceptron. Its output signal is given by

$$s^{\mu} = \operatorname{sign}\!\left(\sum_{i}^{N} w_i S_i^{\mu}\right) = \operatorname{sign}(w \cdot S^{\mu}). \qquad (2.6)$$
The simple neural network described is schematically presented in Figure 1. The sensory neurons are represented by boxes and the secondary neuron by a circle. We require the output of this neuron to differentiate the target odorant from the background, yielding

$$s^{\mu} = \begin{cases} -1 & \text{for } \mu = 1, \ldots, p \text{ (background)} \\ +1 & \text{for } \mu = 0 \text{ (target)} \end{cases} \qquad (2.7)$$
for any odorant concentration in the allowed range 2.1. To understand the geometrical meaning of this requirement, note that when the concentration of odorant μ is varied in the allowed range 2.1, the corresponding vector S^μ traces a curve (or string) in the N-dimensional space of sensory responses. The requirement 2.7 means that there exists a hyperplane such that the entire curve that corresponds to the target odorant lies on one side of it, while the curves that correspond to all p background odorants lie on the other side. This explains our statement that the problem we solve deals with the linear separability of curves. We show that a solution to this classification problem can be found provided p < p_max = α_c N. We estimate the critical capacity α_c numerically. This is done by extrapolating results obtained for various values of N, using finite-size scaling techniques, to the limit N → ∞. The value of α_c is evaluated as a function of the limiting odorant concentrations. In order to obtain these results using existing methodology, the most natural and straightforward thing to do is to place a discrete set of φ = 1, 2, …, M points S^{μφ} on each curve, corresponding to different concentrations, and to require that the M points that lie on the curve of the target odorant are linearly separable from the pM points that represent the background. That is, equations 2.7 become

$$s^{\mu\phi} = \begin{cases} -1 & \text{for } \mu = 1, \ldots, p;\ \phi = 1, \ldots, M \text{ (background)} \\ +1 & \text{for } \mu = 0;\ \phi = 1, \ldots, M \text{ (target)}. \end{cases} \qquad (2.8)$$
This raises the technical question of how many (discrete) representatives of the same odorant should be included in the learning set. We show that while the critical number of odorants p_max scales linearly with N, the number of representatives of a single odorant, M, has to grow at least as fast as N².
This ensures that increasing M further does not change the results of the calculation (e.g., the value of p_max), and hence the M discrete points indeed correctly represent the continuous curves on which they lie. Our problem has been turned into one of learning M(p + 1) patterns that constitute our training set L. For technical reasons, it is convenient to introduce and work with normalized patterns,

$$\xi^{\mu\phi} = s^{\mu\phi}\,\frac{S^{\mu\phi}}{\sqrt{(S^{\mu\phi})^2}}, \qquad (2.9)$$

with φ = 1, …, M running over the M discrete concentrations and μ = 0, 1, …, p over all odorants. Note that we also multiplied each pattern S^{μφ} by its desired output s^{μφ}; after this change of representation, the condition of linear separability, equation 2.8, becomes

$$\operatorname{sign}(w^{*} \cdot \xi^{\mu\phi}) > 0 \quad \text{for } \mu = 0, 1, 2, \ldots, p;\ \phi = 1, 2, \ldots, M. \qquad (2.10)$$
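The change of representation in equations 2.9 and 2.10 amounts to a few lines of array code. The toy example below is ours, with made-up numbers, not data from the paper:

```python
import numpy as np

def make_patterns(S, s):
    """Normalize each response vector and fold in its desired output
    (equation 2.9): xi = s * S / |S|.  Linear separability then reads
    w . xi > 0 for every pattern (equation 2.10)."""
    S = np.asarray(S, dtype=float)
    return s[:, None] * S / np.linalg.norm(S, axis=1, keepdims=True)

# Two hypothetical response vectors with desired outputs +1 (target)
# and -1 (background).
S = np.array([[0.2, 0.8, 0.5],
              [0.6, 0.1, 0.3]])
s = np.array([+1.0, -1.0])
xi = make_patterns(S, s)
```

After this step, both classes are described by the single one-sided condition w · ξ > 0, which is the form the learning algorithm of section 2.2 operates on.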
2.2 The Learning Algorithm. The question of whether the target odorant can or cannot be distinguished from the background has been reduced to the question, Is there a set of weights w_i, i = 1, …, N, for which all M(p + 1) inequalities 2.10 are satisfied? This problem is of the type studied by Rosenblatt (1962) and is an example of classification by a single-layer perceptron. A solution exists if one can find a weight vector w* (that parameterizes the perceptron) such that for all the patterns ξ^{μφ} in the training set L, the “field,”

$$h^{\mu\phi} = \xi^{\mu\phi} \cdot w^{*} > 0, \qquad (2.11)$$
that is, the projection of the weight vector w* onto all patterns ξ^{μφ}, is positive. We wish to determine the size of the training set L, that is, the number of background odorants p, for which a solution w* can be found. This is done by executing a search for a solution w* by means of a learning algorithm. There are several learning algorithms (Rosenblatt, 1962; Abbott & Kepler, 1989) in the literature. All are guaranteed to find such a weight vector in a finite number of steps provided a solution exists. If, however, the problem is not LS and a solution does not exist, most learning algorithms will run ad infinitum. An exception is the algorithm of Nabutovsky and Domany (ND; 1991), which detects, in finite time, that a problem is nonlearnable. This is a batch perceptron learning algorithm, presenting sequentially the entire training set L in one sweep and repeating the process until either a solution is found or nonlearnability is established. We found this algorithm efficient and convenient to use (see Abbott & Kepler, 1989, for other algorithms that detect non-LS problems).¹

¹ The fact that we use a learning procedure to establish the boundaries of linear separability does not imply and is unrelated to any possible plasticity of the olfactory bulb. The algorithm is used only to find a solution or show that one does not exist.
Figure 2: Despair d as a function of the number of sweeps for N = 10, H_max = 50, H_min = 0.5, and p = 25. (a) Learnable cases. (b) Cases that were not learnable. Note the huge difference in the vertical and horizontal scales.
ND introduced a parameter d, which they called despair, which is calculated in the course of the learning process. d is bounded if the training set L is LS. Since the ND algorithm can be shown either to find a solution w* or to transgress the bound for d in a finite number of learning iterations, d effectively signals that the learning set L fails to be linearly separable. The theorem they proved can be easily extended to the distribution of examples in our problem.² We introduced a halting criterion, which is probably more stringent than necessary, since no attempt has been made to determine an optimal lower bound. In Figures 2a and 2b, typical evolutions of the despair are shown for an LS case and a non-LS case, respectively. The behavior of d is strikingly different in the two cases, showing that d is a good indicator of learnability. In the learnable cases, d grows linearly with the number of learning sweeps until a solution is found (and the curves terminate). In the non-LS cases, d grows exponentially with the number of sweeps and would continue to grow; the process is halted when its value exceeds a known bound that must be satisfied if the problem is LS. We now describe the ND algorithm used in the simulations. The patterns of the learning set L are presented one at a time (one cycle constitutes a sweep). ND have shown that for binary-valued patterns (ξ_i = ±1), that is, patterns on the vertices of a unit hypercube, an upper bound d_c exists iff the training set is LS. On the other hand, the dynamics is shown to take d beyond that bound in a finite (linear in N) number of iterations unless a solution exists and the algorithm halts. Initialize the process with d = 1, w = ξ¹.

² The original ND algorithm was designed for binary vectors, pointing at the corners of an N-dimensional hypercube. It can be shown that the theorem can be extended to vectors on the unit sphere.
Go to the next example. If it is correctly classified, do nothing to the current weight vector, and go to the next example. Once a misclassified example ξ^{μφ} is found, update the weight vector as well as the parameter d, according to

$$w_{\text{new}} = \frac{w + \eta\,\xi^{\mu\phi}}{|\,w + \eta\,\xi^{\mu\phi}\,|}, \qquad (2.12)$$

$$d_{\text{new}} = \frac{d + \eta}{\sqrt{1 + 2\eta h^{\mu\phi} + \eta^{2}}}. \qquad (2.13)$$

Here η is not just a learning-rate parameter but an effective modulation function, chosen in order to maximize the increase of the despair:

$$\eta = \frac{1/d - h^{\mu\phi}}{1 - h^{\mu\phi}/d}. \qquad (2.14)$$

The learning dynamics halts if all patterns are correctly classified or if the value of d exceeds an upper bound, given by

$$d > d_c = \frac{N^{(N+1)/2}}{2^{N-1}}. \qquad (2.15)$$

This is guaranteed to happen in at most N d_c² steps.

3 Numerical Experiments
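Collecting equations 2.12 to 2.15, the whole procedure fits in a short routine. The following is our own sketch of the ND rule, not the authors' code; the bound d_c is taken from equation 2.15, the sweep cap is a safety limit we add, and the small epsilon in the denominator guards against the degenerate case of a pattern exactly antipodal to w (h = −1).

```python
import numpy as np

def nd_perceptron(patterns, max_sweeps=10_000):
    """Sketch of the Nabutovsky-Domany rule (equations 2.12-2.15).

    patterns: (P, N) array of normalized patterns xi with the desired
    outputs already folded in (equation 2.9), so a solution w must give
    xi . w > 0 for every row.  Returns (w, True) if a separating vector
    is found, (w, False) if the despair bound is transgressed (the set
    is not LS) or the sweep cap is reached.
    """
    _, N = patterns.shape
    w = patterns[0].copy()                    # initialize w = xi^1, d = 1
    d = 1.0
    d_c = N ** ((N + 1) / 2) / 2 ** (N - 1)   # despair bound, equation 2.15
    for _ in range(max_sweeps):
        clean_sweep = True
        for xi in patterns:
            h = xi @ w                        # the "field", equation 2.11
            if h <= 0.0:                      # misclassified example
                eta = (1.0 / d - h) / (1.0 - h / d)                # eq. 2.14
                denom = np.sqrt(max(1.0 + 2.0 * eta * h + eta**2, 1e-12))
                d = (d + eta) / denom                              # eq. 2.13
                w = w + eta * xi                                   # eq. 2.12
                w /= np.linalg.norm(w)
                clean_sweep = False
                if d > d_c:                   # despair transgressed: not LS
                    return w, False
        if clean_sweep:                       # a full sweep with no errors
            return w, True
    return w, False

# Demo on a clearly separable toy set: patterns clustered around a
# hypothetical teacher direction, so the margin is large.
rng = np.random.default_rng(0)
N = 10
w_star = rng.normal(size=N)
w_star /= np.linalg.norm(w_star)
raw = w_star + 0.2 * rng.normal(size=(30, N))
xi_set = raw / np.linalg.norm(raw, axis=1, keepdims=True)
w_found, learnable = nd_perceptron(xi_set)

# A set whose convex hull contains the origin is not linearly separable.
bad = np.array([[1.0, 0.0], [-0.8, 0.6], [-0.2, -0.98]])
_, learnable_bad = nd_perceptron(bad)
```

On the separable toy set the routine halts with a clean sweep; on the contradictory two-dimensional set the despair blows up and the routine reports non-learnability, mirroring the two regimes of Figure 2.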
Since there is a large number of parameters to be varied, we first present a detailed description of the manner in which we deal with them. There are two random elements in our studies. The first is the selection of M concentrations for each odorant, within the range 2.1. The second is the choice of K_i^μ, the affinity of receptor i to odorant μ, selected at random from the distribution ψ̃(K) of equation 2.2. For every choice of the remaining variables, we generate an ensemble of experiments and average the object we are measuring over these two random elements. We select L_a times the set of affinities and for each of these perform L_c times the random selection of concentrations. The object we wish to estimate numerically is the probability P that the p curves described in section 1 are LS. To this end, we place M points on each curve and measure the corresponding probability P(H_min, H_max, p, N; M). As we will see, for large enough values, M > M_c, this probability becomes independent of M; beyond M_c, the set of M discrete points represents the corresponding curves faithfully, and hence the limiting value P(H_min, H_max, p, N; M > M_c) is our estimate for P(H_min, H_max, p, N).
Finally, we are interested in this function in the large N limit, when N ! 1 and p ! 1, while a D p / N is xed. This limit is obtained by extrapolating our nite N results using nite-size scaling methods. Our rst task is to determine how Mc scales with N, that is, how dense a set of concentrations is to be used so that M discrete points represent m accurately the continuous curves Si (Hm ) of equation 2.4. 3.1 Scaling of Mc . We choose values for N (number of receptor cells), p (number of odorants), and Hmin , Hmax (limiting concentrations). We also set some value for M, the number of concentrations by which every odorant is represented (M will be varied). We proceeded according to the following steps: m
1. Draw from the distribution ψ(K) a set of affinities K_i^μ for all N receptors and p odorants.

2. Generate for each odorant M concentration values from a uniform distribution in the allowed range H_min < H < H_max, and construct the set {ξ^{μf}} of normalized patterns.

3. Run the ND learning algorithm until it stops. Register whether the set {ξ^{μf}} was LS.

Steps 2 and 3 are repeated L_c times for each set of affinities; the three-step process is repeated for L_a different sets of affinities. We used L_a = 100; increasing it further made no difference. With such a value of L_a, the results did not depend on L_c; having tried 1 ≤ L_c ≤ 10, we used L_c = 1 in our simulations. At this point, we have L_c · L_a experiments, out of which a fraction P(H_min, H_max, p, N, M) of the cases were linearly separable. Keeping H_min, H_max, p, N fixed, we increase M and repeat the entire process, obtaining the probability functions P(H_min, H_max, p, N, M), which are plotted in Figure 3 versus M/N². Clearly, the curves saturate when M > M_c ≈ N². From this point on, we have fixed the value of M at M = N². This numerical result can be estimated by using the analysis of Gardner and Derrida (1988) for the capacity of biased patterns, using for the "magnetization" m the value m ∝ 1/N. This gives, in addition to the leading behavior M_c ≈ N², logarithmic corrections as well. We cannot rule out the possibility of such logarithmic corrections to the scaling we found here.

Computational Capacity of an Odorant Discriminator

Figure 3: The estimated probabilities saturate when M, the number of representatives of a given odorant, exceeds M_c ≈ N². For brevity, we denote P(H_min, H_max, p, N) by P(h, p/N). In all cases, p = 25 odorants were used, with H_min = 0.5.

3.2 Measuring the Probabilities P(H_max, p, N). In all our experiments, we fixed the value of H_min = 0.5, and hence the dependence of the probability on this variable has been suppressed. This choice is justified by the fact that at odorant concentrations near and above half-maximal saturation (H = 1), nonlinearity becomes significant. This makes the discrimination task more challenging for a given dynamic range (H_max/H_min). For various values of N, p, and H_max, we calculate P(H_max, p, N) in the manner described above. Keeping N and H_max fixed, we increase p. For p ≪ N, we have P(H_max, p, N) ≈ 1, and the probability of LS decreases as p increases. We stop increasing p when P(H_max, p, N) becomes smaller than some ε. The variation of P(H_max, p, N) versus p/N is presented, for three values of H_max and four values of N, in Figure 4. The results presented in these figures are discussed in the next section. We should mention here that for large N, we used a heuristic modification of the ND halting criterion to label a problem as non-LS. Typical evolutions of the despair parameter are shown in Figure 2. Each curve represents the history for a single learning set. Notice the huge difference in scales for the learnable and the unlearnable cases. The wide separation in final values of d suggests that a more practical (i.e., smaller) upper bound be used. For N = 30 (the largest value treated here), we used a different halting criterion in order to escape the need to reach an exponentially high upper bound. After a small number of successful trial runs, which produced linear separability, we identified the highest value of the despair d_m that was reached for a learnable set. This value was used to define our new heuristic halting criterion, d_bound = N d_m.
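The three-step experimental loop described above can be sketched in code. In the sketch below, the ND algorithm's exponential despair bound is replaced by a plain epoch cap, and both the response function f and the affinity distribution ψ(K) are stand-ins (a saturating form and an exponential distribution, respectively, since equations 2.2 and 2.6 are not reproduced in this section); treat this as an illustrative harness under those assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def responses(H, K):
    """Nonlinear sensory responses S = f(HK); a saturating form is assumed
    here as a stand-in for the f of equation 2.6."""
    HK = H[:, None] * K[None, :]          # shape (M, N)
    return HK / (1.0 + HK)

def is_linearly_separable(patterns, labels, max_epochs=2000, lr=1.0):
    """Perceptron test for a separating plane through the origin.  The ND
    algorithm's exponential despair bound is replaced by an epoch cap
    (a heuristic, in the spirit of section 3.2)."""
    w = np.zeros(patterns.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(patterns, labels):
            if y * (w @ x) <= 0:          # misclassified: perceptron update
                w += lr * y * x
                errors += 1
        if errors == 0:
            return True                   # separating w found: the set is LS
    return False                          # epoch cap reached: label as non-LS

def one_experiment(N=10, p=5, M=100, Hmin=0.5, Hmax=50.0, sigma=1.0):
    # Step 1: affinities K_i^mu (exponential stand-in for psi(K))
    K = rng.exponential(sigma, size=(p, N))
    X, y = [], []
    for mu in range(p):
        # Step 2: M concentrations per odorant; build normalized patterns
        H = rng.uniform(Hmin, Hmax, size=M)
        S = responses(H, K[mu])
        S /= np.linalg.norm(S, axis=1, keepdims=True)
        X.append(S)
        y.append(np.full(M, 1.0 if mu == 0 else -1.0))  # odorant 0 = target
    # Step 3: run the learning algorithm and register LS or not
    return is_linearly_separable(np.vstack(X), np.concatenate(y))

print(one_experiment())
```

Averaging the boolean outcome of `one_experiment` over repeated draws of affinities and concentrations gives the estimate of P(H_min, H_max, p, N; M) described in the text.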
2212
N. Caticha, J. E. Palo Tejada, D. Lancet, and E. Domany
Figure 4: (Top left) Probability that a learning set is LS as a function of α = p/N, for maximal concentration H_max = 50 and H_min = 0.5. (Top right) H_max = 70 and H_min = 0.5. (Bottom) H_max = 100 and H_min = 0.5.

3.3 Finite-Size Scaling Analysis. As expected, for small α = p/N, the probability for linear separability is close to 1, and it decreases as α increases. The curves obtained for fixed H_max become sharper as N increases. Note that curves obtained for different N values cross at approximately the same value of α. Similar behavior of the corresponding probability functions has been observed for random uncorrelated patterns (Cover, 1965). Notice, however, that the crossing point is at some probability P < 1/2. Similar curves obtained for other architectures, such as the parity and committee machines (Nadler & Fink, 1997), crossed at P > 1/2. If there is a sharp transition in the thermodynamic limit (N → ∞), these curves should approach a step function, with
\[
P(H_{\min}, H_{\max}, \alpha) =
\begin{cases}
1 & \alpha < \alpha_c(H_{\min}, H_{\max}) \\
0 & \text{otherwise.}
\end{cases}
\tag{3.1}
\]
Figure 5: (Top left) Probability that a learning set is LS as a function of y = (α − α_c)N^{1/ν}, for maximum concentration H_max = 50 and H_min = 0.5. (Top right) H_max = 70 and H_min = 0.5. (Bottom) H_max = 100 and H_min = 0.5. The results of a least-squares fit are (top left) H_max = 50, ν = 4.8, α_c = 2.8; (top right) H_max = 70, ν = 2.85, α_c = 2.25; (bottom) H_max = 100, ν = 2.75, α_c = 2.1.
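The rescaling behind this data collapse is simple enough to state as code. A minimal sketch (the function and variable names are ours), using the fitted values quoted in the Figure 5 caption for H_max = 100:

```python
import numpy as np

def collapse(alpha, N, alpha_c, nu):
    """Scaled variable y = (alpha - alpha_c) * N**(1/nu) of the data collapse."""
    return (np.asarray(alpha, dtype=float) - alpha_c) * N ** (1.0 / nu)

# Fitted values for Hmax = 100, Hmin = 0.5, from the Figure 5 caption
alpha_c, nu = 2.1, 2.75
for N in (10, 20, 30):
    # y = 0 at alpha = alpha_c for every N: the curves all cross there
    print(N, np.round(collapse([1.0, 2.1, 3.0], N, alpha_c, nu), 3))
```

Plotting P(H_max, α, N) against y rather than α is what makes the curves for different N fall onto a single function.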
That is, for α below a certain α_c(H_min, H_max), a learning set will be LS with probability one, and conversely, it will be LS with probability zero for α > α_c(H_min, H_max). The manner in which such a step function is approached as N → ∞ can be described by a finite-size scaling analysis (Privman, 1990). For each value of H_max (keeping H_min fixed), we tried a simple rescaling of the α variable, with two adjustable parameters, α_c and ν:

\[
y = (\alpha - \alpha_c) N^{1/\nu}.
\]

For the proper choice of α_c and ν, we expect data collapse; that is, curves obtained for different values of N are expected to fall onto a single function, provided P(H_max, α, N) is plotted versus the scaled variable y. As can be seen in Figure 5, this expectation is borne out; the evidently good data collapse indeed substantiates the idea of a sharp transition at α_c. As N increases, the function P(H_max, α, N) becomes increasingly sharp; its width near α_c decreases at a rate governed by the exponent ν. Finally, we present in Figure 6
Figure 6: Phase diagram in the (α, H_max) plane for H_min = 0.5. The curve separates the plane into the regions where the network can distinguish the target odorant (below) and where it cannot (above). Below (above) the curve, a learning set is (not) LS with probability one in the thermodynamic limit. The horizontal dotted line is α = 2.
the behavior of α_c as a function of H_max (for fixed H_min). As H_max increases, separation of the curves becomes an increasingly difficult task, and hence α_c(H_max) decreases. We find that it saturates at a low value close to α_min = 2, which is exactly the Cover result. This interesting point is explained in the appendix. Although we deal here with linear separability of curves, which one would expect to be a more difficult task than separating points, we found that our α_c exceeds the value derived for points, α_c = 2. The reason is that this is the critical capacity for separating random, independent points; the curves we are trying to separate are not independent of each other. In fact, by construction, we have S_i^μ > 0 for all the background odorants; hence, all these curves lie on one side of an entire family of planes. The target odorant, which also satisfies S_i > 0, should lie on the other side of the separating plane. The curve α_c(H_max) is, in effect, a phase boundary; on one side, we have a "phase" in which the problem is LS, while in the other (high-α) region, it is not. We now present a brief description of the manner in which linear separability breaks down as we cross this phase boundary by increasing H_max at fixed α.
3.4 Breakdown of LS Near the Phase Boundary. The manner in which LS breaks down as H_max increases beyond the phase boundary is nicely illustrated by Figures 7 and 8. Consider the p curves, in an N-dimensional space, which represent the odorants in a linearly separable case. We present in Figure 7 (left) a projection of these curves onto a randomly chosen plane. One of these (indicated by an arrow) is the target odorant; it seems to be entangled with the other curves. The point at which all curves seem to converge corresponds to the maximal concentration. The purpose of the learning dynamics is to find a particular direction w along which one is able to separate the target curve from the others. Denote by V the hyperplane that passes through the origin and is perpendicular to w; this is the linear manifold that separates the target from all the background curves. Select now any plane Q that contains w, and project all curves onto Q; this produces Figure 7 (right). The horizontal dotted line shown here is the intersection of the hyperplane V with the plane Q. The projected background odorant curves lie on one side of this line and the target on the other. The situation depicted here is LS. Consider now what happens when we turn the problem into a non-LS one by increasing H_max beyond the phase boundary. As we increase the maximal concentration, the target odorant's curve penetrates the "wrong" side of the hyperplane V. A picture of this situation is shown in Figure 8 (left). This is a non-LS problem, which means that no matter how long we run our learning algorithm, we will never find a hyperplane V that separates the target from
Figure 7: The curves representing the p background odorants and the target odorant (marked by an arrow) in an LS case, α < α_c, for H_max = 50, H_min = 0.5, N = 20. (Left) The curves are projected onto a randomly selected plane and (right) onto a plane Q (see text) that contains the weight vector w*, determined by the learning algorithm. The horizontal dotted line in the right part of the figure demonstrates linear separability of the target from the background.
Figure 8: The same situation as in Figure 7, but with H_max increased beyond the limit of learnability. (Left) Projecting the curves onto the same plane as for the LS case, we see that the target penetrates to the "wrong" side of the broken line. (Right) Further attempts to learn a new w will fail.
all the background. If we nevertheless keep running our learning algorithm, the direction of our candidate for w will keep changing as we "learn," but since the critical capacity curve of Figure 6 has been crossed, no amount of further learning will produce a separating plane. The density of points near the high-concentration limit is much larger than for low concentrations. Hence, further learning will perhaps be able to separate the target from the background at high concentrations, but then separability breaks down at low concentrations (see Figure 8, right).

4 Summary and Discussion
In the olfactory bulb of most vertebrates, each secondary neuron (mitral or tufted cell) receives input from only one glomerulus, which is innervated, in all likelihood, by axons stemming from olfactory epithelial sensory cells that express the same olfactory receptor protein. Thus, the grandmother cell modeled here may not simply represent a mitral or tufted cell. Moreover, the periglomerular and granule cells constitute a network of recurrent local connections, which are not taken into account in the present (purely feedforward) model.3 Our analysis may be relevant, in an abstract fashion, to the information processing that takes place at higher olfactory system centers, where broader convergence patterns may exist (Haberly, 2001). We do not

3 Lack of recurrent connections imposes simple dynamics, excluding observed oscillatory behavior (Dorries & Kauer, 2000; Lam, Cohen, Wachowiak, & Zochowski, 2000).
claim that any neural elements in this pathway may actually be represented by grandmother cells with linear feedforward inputs. However, we have introduced a basic "minimal" model that allows a quantitative estimation of olfactory discrimination capacity. Several studies analyze neuronal networks for the olfactory system (Hopfield, 1991; Li & Hopfield, 1989; Wilson & Bower, 1992; Li, 1990, 1995; Li & Hertz, 2000). However, none of these was based on a quantitative model for the affinity relationships within the entire olfactory receptor repertoire. Here, we use the receptor affinity distribution (RAD) model, which was developed, based on general biochemical considerations, for receptor repertoires, including that of olfactory receptors. The power of this approach is in using global knowledge about the repertoire to analyze the fidelity of discrimination among odorants. It has been pointed out in the past that the RAD model may be used to analyze the signal-to-noise ratio in systems in which specific binding to a receptor has to be distinguished from the background of numerous other receptors that constitute nonspecific binding (Lancet, Sadovsky, & Seidman, 1993; Lancet, Horovitz, & Katchalski-Katzir, 1994). Here, we apply a similar concept to an analysis of signal-to-noise discrimination in the case of a neuronal network whose input stems from a receptor repertoire. The results presented here suggest that for a fixed number of background odorants, there is a maximal odorant concentration beyond which odorant discrimination becomes impossible. This is not surprising, since olfactory receptors are saturable, and at very high concentrations weak-affinity receptors as well as high-affinity ones will generate comparable signals.
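The saturation effect can be illustrated numerically. The sketch below assumes a saturating response f(HK) = HK/(1 + HK) and an exponential affinity distribution, both stand-ins for the paper's equations 2.2 and 2.6 (which are not reproduced in this chunk); it shows the responses crowding toward S = 1 as H_max grows, while never reaching it:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_S(Hmax, Hmin=0.5, sigma=1.0, n=100_000):
    """Responses S = f(HK) with H uniform in (Hmin, Hmax) and K drawn from
    an exponential affinity distribution (both choices are assumptions)."""
    H = rng.uniform(Hmin, Hmax, n)
    K = rng.exponential(sigma, n)
    return H * K / (1.0 + H * K)

for Hmax in (10, 100, 1000):
    S = sample_S(Hmax)
    # the bulk of the responses approaches 1, but S < 1 always: P(1) = 0
    print(Hmax, round(float(np.median(S)), 4), float(S.max()) < 1.0)
```

This is the same mechanism analyzed in the appendix: the peak of P(S) moves arbitrarily close to S = 1 at high concentration, so weak- and high-affinity receptors produce nearly indistinguishable signals.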
However, it is noteworthy that despite the fact that the information capacity for odorant discrimination rapidly declines as odorant concentration goes up, the network we analyzed is still capable of discrimination even at concentrations for which H⟨K⟩ is of the order of a few hundred (where ⟨K⟩ is the average affinity). The model network consists of N sensory neurons, each characterized by a set of affinities to a number of odorants. When any particular odorant, μ, is present, sensory neuron i produces a (nonlinear) response, S_i^μ. These responses constitute the inputs to a single processing unit (secondary neuron), which performs a weighted summation of all the N inputs. The secondary neuron's output is the sign of this weighted sum. The aim of this single processing unit is to identify one single odorant and separate it from all the others that may be sensed by the system. This secondary neuron plays the role of a grandmother cell for a particular target odorant. An assembly of P′ such secondary neurons may constitute, together with the sensory neurons, a system that is able to identify the presence of P′ target odorants from a background of P odorants. We posed a well-defined quantitative question: Given that each odorant may appear with a concentration H^μ that lies in a certain range, H_min < H^μ < H_max, what is the maximal number of background odorants P_c = α_c N
from which a single target can be separated with probability 1? The answer is summarized in Figure 5, where α_c, the critical capacity, is plotted versus H_max. The result is obtained in the limit of large N (i.e., many sensory neurons; in fact, for N = 100, this result should already give excellent precision). For a dynamic range H_max/H_min of about 100, we find α_c ≈ 2.5. That is, for, say, N = 300 sensory neurons, we can distinguish the target from about 750 background odorants. Hence, if we assemble 750 odorants and appoint a grandmother cell for each, we will be able to identify them one by one. In order to get this quantitative answer, we had to generalize an old problem of linear separability of P points on an (N − 1)-dimensional hypersphere to the new problem of linearly separating P curves that lie on the same hypersphere. We have shown that in order to represent a curve by discrete points that lie on it, we have to place M ∝ N² points on each curve. The results were obtained with a perceptron learning algorithm that signals when a problem is unlearnable (i.e., not linearly separable). Results obtained at various values of N were shown to collapse when plotted as functions of properly defined scaled variables, which allowed easy extrapolation to large values of N.

Appendix
The behavior of the phase boundary for large concentrations, α_c → 2 (see Figure 6), is quite surprising, since the network might be expected to enter a totally confused state due to the saturation of the nonlinear sensory neurons, which could be expected to lead instead to α_c → 0. That the Cover result (α_c = 2) is recovered in the high-concentration regime can in fact be understood by the following argument. We first calculate the probability P(S) that a sensory unit gives a response S to the presentation of an odorant in the range of equation 2.1:

\[
P(S) = \langle \delta(S - f(HK)) \rangle,
\tag{A.1}
\]
where the average is taken over possible concentrations H, uniformly distributed in the range of equation 2.1, and, according to the RAD model, over the affinities ψ(K) of equation 2.2; f is given by equation 2.6. The integrals lead to

\[
P(S) = \frac{\sqrt{\pi}}{2\sigma (H_{\max} - H_{\min})(1-S)^2}
\left( \mathrm{Erfc}\!\left( \frac{S}{\sigma H_{\max}(1-S)} \right)
     - \mathrm{Erfc}\!\left( \frac{S}{\sigma H_{\min}(1-S)} \right) \right),
\tag{A.2}
\]

where \(\mathrm{Erfc}(x) = \int_0^x \exp(-u^2)\,du / \sqrt{\pi}\) is the complementary error function. This probability has one peak that sharpens and moves to higher values of S as H_max grows. However, at the very ends of the interval, S = 1 or 0,
the probability is zero. That P(1) = 0 for every H_max is the source of the surprise. The peak that concentrates all the probability gets arbitrarily close to S = 1 as the concentration increases, but never reaches the end of the interval. In fact, S_peak ≈ 1 − c/H_max. Therefore, the components of the vectors S^{μf} will with overwhelming probability lie at the peak position, which can be written as S_i^μ = 1 − ε_i^μ, with all the ε_i^μ(H_max, K_i^μ) small but strictly positive. Neglecting second-order terms in ε, the normalized patterns will then be

\[
\tilde S_i^{\mu} \equiv \frac{S_i^{\mu}}{|S^{\mu}|}
= \frac{1 - \varepsilon_i^{\mu}}{\sqrt{N}\,(1 - 2\bar\varepsilon^{\mu})^{1/2}}
= (1 - \varepsilon_i^{\mu} + \bar\varepsilon^{\mu})/\sqrt{N}.
\]

Therefore, the \(\tilde S^{\mu f}\) vectors are unbiasedly distributed around (1, 1, …, 1)/√N. We are taken back to the original Cover-Gardner problem of separating p unbiased patterns with a hyperplane, and the result α_c = 2 is no longer a surprise. This argument does not deal with the asymptotic behavior of the capacity in the presence of any kind of noise. In that case, the naive expectation that α_c → 0 for H_max → ∞ is probably borne out.

Acknowledgments
We thank Ido Kanter for most useful discussions. The research of E.D. was supported by the Germany-Israel Science Foundation, the Minerva Foundation, and the U.S.-Israel Binational Science Foundation. The work reported here was initiated during visits of N.C. to the Weizmann Institute, which were supported by grants from the São Paulo Society of Friends of the Weizmann Institute and the Gorodesky Foundation. J.E.P.T.'s research was supported by a graduate fellowship of the Fundação de Amparo à Pesquisa do Estado de São Paulo. N.C. received partial support from the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

References

Abbott, L. F., & Kepler, T. B. (1989). Optimal learning in neural network memories. Journal of Physics A, Math. and Gen., 22(14), L711–717.
Cover, T. (1965). Geometric and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elect. Comput., 14, 326.
Dorries, K. M., & Kauer, J. S. (2000). Relationships between odor-elicited oscillations in the salamander olfactory epithelium and olfactory bulb. J. Neurophysiology, 83(2), 754–765.
Gardner, E. (1988). The space of interactions in neural network models. Journal of Physics A, Math. and Gen., 21, 257.
Gardner, E., & Derrida, B. (1988). Optimal storage properties of neural network models. Journal of Physics A, Math. and Gen., 21, 271.
Haberly, L. B. (2001). Parallel-distributed processing in olfactory cortex: New insights from morphological and physiological analysis of neuronal circuitry. Chem. Senses, 26(5), 551–576.
Hopfield, J. J. (1991). Olfactory computation and object perception. Proceedings of the National Academy of Sciences USA, 88, 6462.
Kinzel, W. (1998). Phase transitions of neural networks. Phil. Mag. B, 77, 1455.
Lam, Y. W., Cohen, L. B., Wachowiak, M., & Zochowski, M. R. (2000). Odors elicit three different oscillations in the turtle olfactory bulb. J. Neuroscience, 20(2), 749–762.
Lancet, D., Horovitz, A., & Katchalski-Katzir, E. (1994). Models for analysis of protein-ligand interactions. In J. P. Behr (Ed.), Molecular recognition in biology (pp. 25–71). New York: Wiley.
Lancet, D., Sadovsky, E., & Seidman, E. (1993). Probability model for molecular recognition in biological receptor repertoires—significance to the olfactory system. Proceedings of the National Academy of Sciences USA, 90, 3715.
Li, Z. (1990). A model of olfactory adaptation and sensitivity enhancement in the olfactory bulb. Biol. Cybern., 62, 349.
Li, Z. (1995). Modeling the sensory computations of the olfactory bulb. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks. New York: Springer-Verlag.
Li, Z., & Hertz, J. (2000). Odour recognition and segmentation by a model olfactory bulb and cortex. Network: Computation in Neural Systems, 11, 83.
Li, Z., & Hopfield, J. J. (1989). Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern., 61, 379.
Ma, M., & Shepherd, G. M. (2000). Functional mosaic organization of mouse olfactory receptor neurons. Proceedings of the National Academy of Sciences USA, 97(23), 12869–12874.
Minsky, M., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Nabutovsky, D., & Domany, E. (1991). Learning the unlearnable. Neural Computation, 3, 604.
Nadler, W., & Fink, W. (1997). Finite size scaling in neural networks. Physical Review Letters, 78, 555.
Privman, V. (Ed.). (1990). Finite-size scaling and numerical simulations of statistical systems. Singapore: World Scientific.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan Books.
Schild, D. (1988). Principles of odor coding and a neural network for odor discrimination. Biophysical Journal, 54(6), 1011.
Wachowiak, M., Cohen, L. B., Zochowski, M., & Falk, C. X. (2000). The spatial representation of odors by olfactory receptor neuron input to the olfactory bulb is concentration invariant. Biological Bull., 199(2), 162–163.
Wilson, M. A., & Bower, J. M. (1992). Cortical oscillations and temporal interactions in a computer-simulation of piriform cortex. J. Neurophysiol., 67, 981.

Received November 8, 2000; accepted April 3, 2002.
LETTER
Communicated by Steven Nowlan
Mixture of Experts Classification Using a Hierarchical Mixture Model

Michalis K. Titsias
[email protected]
Aristidis Likas
[email protected]
Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece

A three-level hierarchical mixture model for classification is presented that models the following data generation process: (1) the data are generated by a finite number of sources (clusters), and (2) the generation mechanism of each source assumes the existence of individual internal class-labeled sources (subclusters of the external cluster). The model estimates the posterior probability of class membership, similar to a mixture of experts classifier. In order to learn the parameters of the model, we have developed a general training approach based on maximum likelihood that results in two efficient training algorithms. Compared to other classification mixture models, the proposed hierarchical model exhibits several advantages and provides improved classification performance, as indicated by the experimental results.

1 Introduction
A widely applied method for implementing the Bayes classifier is based on obtaining the posterior probabilities of class membership through the estimation of the class prior probabilities and the class conditional densities (Duda & Hart, 1973). This is a generative approach to classification, since a model of the joint distribution of the input data and the class labels is provided. The computationally intensive part of the design of such classifiers concerns the estimation of the class conditional densities. The widely used way to obtain these estimates is to independently apply density estimation methods to each class-labeled data set. However, such an approach does not benefit from the existence of any common characteristics among data of different classes. For example, the data may arise from differently labeled clusters that are located in overlapping regions in the data space. A very general assumption about data generation in a classification problem which can benefit from the existence of common characteristics among differently labeled data is the following: the data are drawn from a finite number of sources (clusters), and within each cluster, the data are generated by labeled sources that form subclusters of the parent cluster.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2221–2244 (2002)

These generation assumptions can be efficiently modeled by a three-level hierarchical mixture model (Bishop & Tipping, 1998). The first generation assumption is represented at the second level of the hierarchical mixture, where typically the number of components is unknown and must be inferred from the data. However, at the third level of the hierarchical mixture, where the second assumption is represented, each submixture (associated with a specific parent component) has precisely as many components as the number of classes. We refer to the above model as the hierarchical mixture classification model. In order to learn the parameters of the model, we derive a general training approach based on the maximum likelihood framework that results in two fast training algorithms. The proposed model can be considered as a mixture of experts classifier. Mixtures of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994) are general models for estimating conditional distributions. Typically, these models comprise a gating network that divides the problem into smaller problems and expert networks that solve each subproblem. In our case, both the gating network units and the specialized experts are suitably defined from the hierarchical mixture. An additional feature of the hierarchical mixture classifier is that it provides class conditional density estimates as flat mixtures.1 Consequently, it is possible to compare the method directly with two well-known class conditional density estimation techniques based on mixture models. The first is the well-known approach that employs a separate mixture (having its own components) for representing each class conditional density (McLachlan & Peel, 2000). This is the most widely used method and has been studied by Hastie and Tibshirani (1996). The second approach is to assume that the class conditional densities are modeled by mixtures having common mixture components (Ghahramani & Jordan, 1994; Miller & Uyar, 1996; Titsias & Likas, 2001).
The last is actually similar to using a radial basis function (RBF) or an RBF-like neural network for solving classification problems. This is further investigated in Miller and Uyar (1998), where the Bayes decision function of a classifier that estimates the class conditional densities by mixtures with common components is shown to be equivalent to the decision function of an RBF classifier. In the following, we will refer to the first approach as the separate mixtures model and to the second as the common components model. The hierarchical mixture classifier can be thought of as an extended and more flexible version of the common components model. In addition, the proposed model can also be considered as a constrained case of a separate mixtures model that employs a certain number of components. The proposed model exhibits several advantages over these methods, as illustrated in sections 3.2 and 3.3.
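The two model families differ only in the index sets T_k that tie components to classes. A minimal one-dimensional sketch (all parameter values below are hypothetical) evaluating the Bayes-classifier formulation developed in section 2, with one component shared between the two classes:

```python
import numpy as np

def gauss(x, mean, var):
    """1-D gaussian density used as component p(x | j, theta_j)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical problem with M = 3 components; component 1 is shared by
# both classes: T_0 = {0, 1}, T_1 = {1, 2}.  (Separate mixtures would use
# disjoint T_k; common components would use T_k = {0, 1, 2} for both.)
means, variances = [-2.0, 0.0, 2.0], [1.0, 1.0, 1.0]
T = {0: [0, 1], 1: [1, 2]}
pi = {0: {0: 0.7, 1: 0.3}, 1: {1: 0.4, 2: 0.6}}   # pi[k][j] = P(j | C_k)
P = {0: 0.5, 1: 0.5}                              # class priors P(C_k)

def class_conditional(x, k):
    # Equation 2.2: p(x | C_k) = sum over j in T_k of pi_jk * p(x | j)
    return sum(pi[k][j] * gauss(x, means[j], variances[j]) for j in T[k])

def posterior(x, k):
    # Equation 2.1: Bayes rule over the K classes
    num = class_conditional(x, k) * P[k]
    den = sum(class_conditional(x, l) * P[l] for l in P)
    return num / den

print(posterior(-2.0, 0), posterior(2.0, 1))
```

Switching the contents of `T` (and `pi`) converts the same code between the separate mixtures and common components variants, which is the sense in which equation 2.2 unifies them.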
1 We use the term flat mixture to refer to the usual mixture density model of the form \(p(\mathbf{x}) = \sum_{j=1}^{M} \pi_j\, p(\mathbf{x} \mid j, \theta_j)\), which does not exhibit any hierarchical structure.
Section 2 provides a unifying description of classification techniques based on mixture models. In section 3, the proposed hierarchical mixture classification model is described, along with a training approach based on maximum likelihood. In addition, we provide illustrative comparisons of the proposed method with the common components and the separate mixtures models. Conclusions drawn from these comparisons are also supported experimentally in section 4, where comparative performance results are presented for several well-known data sets. Finally, section 5 provides conclusions and future research directions.

2 Bayes Classification Based on Mixtures
Consider a classification problem with K classes C_k, k = 1, …, K. The Bayes classifier decides about the class of a data point x by selecting the class label C_k with the highest posterior probability P(C_k | x). Using the Bayes rule, the posterior probability P(C_k | x) is written as

\[
P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, P(C_k)}{\sum_{\ell=1}^{K} p(\mathbf{x} \mid C_\ell)\, P(C_\ell)},
\tag{2.1}
\]
where P(C_k) is the class prior probability and p(x | C_k) the corresponding class conditional density. Each class conditional density p(x | C_k) is estimated by applying density estimation methods using the available data. In the following, we provide a brief unifying description of some existing methods for estimating the class conditional densities using mixtures. We assume that the data have been generated by M sources (or clusters) and that these clusters can be modeled by the densities p(x | j, θ_j), j = 1, …, M, with θ_j denoting the corresponding parameter vector. We further suppose that only some of the clusters can generate data of the class C_k; thus, only a subset T_k of the density models is responsible for generating the data of class C_k. Consequently, the C_k-class conditional density can be modeled as the mixture

\[
p(\mathbf{x} \mid C_k, \Theta_k) = \sum_{j \in T_k} \pi_{jk}\, p(\mathbf{x} \mid j, \theta_j),
\tag{2.2}
\]

where the parameter π_{jk} represents the probability P(j | C_k) and Θ_k denotes all the parameters corresponding to class C_k. We assume that any two different subsets T_k and T_ℓ (corresponding to classes C_k and C_ℓ) may contain common elements, that is, in general, T_k ∩ T_ℓ ≠ ∅. The latter implies that the data of different classes may have been generated from some common data sources. According to equation 2.2, it is clear that once we know the component j from which a data point x has been drawn, then x is independent of class C_k, that is, p(x | j) = p(x | j, C_k). The above choice of the class conditional densities provides as special cases two well-known approaches. The first is the separate mixtures model,
and its basic property is that the data of each class are a priori assumed to be generated by clusters that are not common with clusters corresponding to differently labeled data. This model results from equation 2.2 if the sets 6 `. The separate mixtures T k, k D 1, . . . , K are such that T k \ T` D ; for all k D model constitutes a widely used method for designing a Bayes classier, and it has been theoretically studied in Hastie and Tibshirani (1996) in the case of gaussian mixture components. An alternative approach, the common components model, assumes that all data may arise from any of the M clusters and results from equation 2.2 by assuming that T k D f1, . . . , Mg for each k (Ghahramani & Jordan, 1994; Miller & Uyar, 1996; Titsias & Likas, 2001). Clearly, the common components model exhibits generality over the separate mixtures and also over all possible models described by equation 2.2. To classify a new data point x based on the Bayes formula 2.1, the class prior probabilities P ( Ck ) are also needed, which are represented by introducing the parameters P k. Training can be performed based on maximum likelihood. Assume that we have a set ( X, Y ) of labeled data where X is the set of data points and Y the corresponding class labels. The original data set X can be partitioned according to the class labels into K disjoint subsets Xk, k D 1, . . . , K. Learning the whole parameter vector H can be performed PK P by maximizing the following log likelihood L (H) D kD 1 x2Xk log P k p (x | C k, H k ) :
L(\Theta) = \sum_{k=1}^{K} |X_k| \log P_k + \sum_{k=1}^{K} \sum_{x \in X_k} \log \sum_{j \in T_k} \pi_{jk}\, p(x \mid j, \theta_j)
          = \sum_{k=1}^{K} |X_k| \log P_k + \sum_{k=1}^{K} L_k(\Theta_k),    (2.3)
where L_k is the class log likelihood corresponding to the subset X_k. Maximization of the first term in equation 2.3 gives P_k = |X_k| / |X|, while maximization of the second term would provide estimates of the class conditional densities. Note that the latter maximization in the case of the separate mixtures approach splits into K independent problems, each one involving a class log likelihood L_k. Clearly, the same does not hold for the common components approach, since the parameters of all components appear in each L_k. Let F_j be the subset of all classes C_k for which the data can arise from the component j (j ∈ T_k). To find out which is the generation process for a pair (x, C_k), we need to express the joint distribution of x and C_k. It holds that p(x, C_k | Θ) = P_k Σ_{j=1}^M π_jk p(x | j, θ_j) (where π_jk = 0 for all j ∉ T_k), and since P_k π_jk = P(j | Θ) P(C_k | j, Θ) (where P(j | Θ) = Σ_{k∈F_j} π_jk P_k and
P(C_k | j, Θ) = P_k π_jk / P(j | Θ)), we obtain

p(x, C_k \mid \Theta) = \sum_{j=1}^{M} P(j \mid \Theta)\, P(C_k \mid j, \Theta)\, p(x \mid j, \theta_j).    (2.4)
Based on this expression, we may assume that the labeled data are generated as follows: (1) select a component j from the set {1, ..., M} with probability P(j | Θ); (2) select a class label C_k, where k ∈ F_j, with probability P(C_k | j, Θ); and (3) draw x from the density p(x | j, θ_j).
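The three sampling steps above can be sketched numerically. All parameter values below (component weights, the class table P(C_k | j), and the one-dimensional gaussian means) are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: M = 3 components, K = 2 classes.
P_j = np.array([0.5, 0.3, 0.2])            # P(j | Theta)
P_k_given_j = np.array([[1.0, 0.0],        # component 1 feeds class 1 only
                        [0.4, 0.6],        # shared component (j in both T_k)
                        [0.0, 1.0]])       # component 3 feeds class 2 only
means = np.array([-2.0, 0.0, 2.0])         # 1-d gaussian p(x | j, theta_j)

def sample_pair():
    """One draw from the generative process of equation 2.4."""
    j = rng.choice(3, p=P_j)               # select a component j
    k = rng.choice(2, p=P_k_given_j[j])    # select a class label given j
    x = rng.normal(means[j], 1.0)          # draw x from p(x | j, theta_j)
    return x, k

samples = [sample_pair() for _ in range(2000)]
frac_class0 = np.mean([k == 0 for _, k in samples])
# Marginally, P(C_1) = sum_j P(j) P(C_1 | j) = 0.5 + 0.3 * 0.4 = 0.62,
# so frac_class0 should be close to 0.62.
```

Note that x here is drawn from p(x | j) alone, reflecting the conditional independence of x and C_k given j that characterizes the models of this section.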
The generative model for the separate mixtures and the common components model is obtained as a special case. More specifically, in the separate mixtures case, the selection of a component j automatically specifies the class of x, since in this case the set F_j contains only one element. On the contrary, in the common components case, each F_j contains all classes, and the class label is selected among all possible values. According to the second step above, once the component j has been selected, the label C_k and the data point x are independently specified. In other words, x and C_k are conditionally independent given the component variable j. Finally, if we are interested in the unconditional density of x, this is given by p(x | Θ) = Σ_{j=1}^M P(j | Θ) p(x | j, θ_j), which clearly is a flat mixture. In the next section, we present a classification model that estimates the unconditional density of x by a hierarchical mixture.

3 The Hierarchical Mixture Classification Model
We wish to define a generative model realizing the following two assumptions: (1) the data are generated by M clusters, and (2) within each cluster, the data are generated by class-labeled sources that form subclusters of the larger cluster. If a subcluster corresponding to class C_k can be modeled by the density p(x | C_k, j, θ_kj) (where θ_kj are the corresponding parameters), then the unconditional density of x can be given by the following three-level hierarchical mixture model (Bishop & Tipping, 1998) illustrated in Figure 1,

p(x \mid \Theta) = \sum_{j=1}^{M} \pi_j \sum_{k=1}^{K} P_{kj}\, p(x \mid C_k, j, \theta_{kj}),    (3.1)
where the parameter π_j represents the probability P(j), P_kj the probability P(C_k | j), and Θ denotes the whole set of model parameters. Clearly, the second level of the hierarchical mixture (see Figure 1) provides information on how the data are generated by the M components, ignoring the class labels. In this level, each component density is obtained
Figure 1: Estimation of the unconditional density of x by the hierarchical mixture classification model.
by marginalizing out the class labels: p(x | j, Θ) = Σ_{k=1}^K P_kj p(x | C_k, j, θ_kj). At the third level of the hierarchy, information is provided about the data along with their class labels. Note that since we have K classes, K subcomponents correspond to each component j of the second level. We are particularly interested in exploiting the use of this model for solving classification problems. Therefore, the posterior probabilities of class membership P(C_k | x) must be computed:

P(C_k \mid x, \Theta) = \frac{\sum_{j=1}^{M} \pi_j P_{kj}\, p(x \mid C_k, j, \theta_{kj})}{p(x \mid \Theta)}.    (3.2)
Although the above expression results directly from the model, an equivalent and more useful expression is

P(C_k \mid x, \Theta) = \sum_{j=1}^{M} P(j \mid x, \Theta)\, P(C_k \mid x, j, \Theta),    (3.3)
where

P(j \mid x, \Theta) = \frac{\pi_j\, p(x \mid j, \Theta)}{p(x \mid \Theta)}    (3.4)
and

P(C_k \mid x, j, \Theta) = \frac{P_{kj}\, p(x \mid C_k, j, \theta_{kj})}{p(x \mid j, \Theta)}.    (3.5)
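The equivalence of the direct posterior 3.2 and the gating/expert decomposition 3.3 through 3.5 can be checked numerically. The toy parameters below (M = 2 components, K = 2 classes, one-dimensional gaussian subcluster models with unit variance) are made up for illustration and are not from the paper:

```python
import numpy as np

pi = np.array([0.6, 0.4])                          # pi_j = P(j)
P_kj = np.array([[0.7, 0.2],                       # P(C_k | j); columns sum to 1
                 [0.3, 0.8]])
mu = np.array([[-1.0, 2.0],                        # means of p(x | C_k, j)
               [ 1.0, 3.0]])

def gauss(x, m, s=1.0):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def posterior_direct(x):
    """Equation 3.2: numerator summed over j, then normalized by p(x)."""
    num = np.array([sum(pi[j] * P_kj[k, j] * gauss(x, mu[k, j]) for j in range(2))
                    for k in range(2)])
    return num / num.sum()

def posterior_moe(x):
    """Equations 3.3-3.5: gate P(j | x) combined with local experts."""
    p_xj = np.array([sum(P_kj[k, j] * gauss(x, mu[k, j]) for k in range(2))
                     for j in range(2)])           # p(x | j), class labels marginalized
    gate = pi * p_xj / np.dot(pi, p_xj)            # equation 3.4
    experts = np.array([[P_kj[k, j] * gauss(x, mu[k, j]) / p_xj[j]
                         for j in range(2)] for k in range(2)])  # equation 3.5
    return experts @ gate                          # equation 3.3
```

Both functions return the same posterior vector for any input, since substituting 3.4 and 3.5 into 3.3 recovers 3.2 term by term.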
Expression 3.3 explicitly shows that the model estimates the posterior P(C_k | x) as a mixture of experts model. The mixture of experts network was originally introduced in Jacobs et al. (1991) and extended to a hierarchical structure in Jordan and Jacobs (1994). A mixture of experts network consists of several expert models that estimate the input-dependent distribution of the output in different regions of the input space. The output of the model is computed using an input-dependent gating network that probabilistically combines the estimates of the experts. In our case, the gating network units correspond to P(j | x, Θ) provided by equation 3.4, while the estimates of the experts correspond to the locally computed posterior probabilities of class membership P(C_k | x, j, Θ) provided by equation 3.5. Several useful quantities, such as the class prior probability and the class conditional density, can be easily expressed as

P(C_k \mid \Theta) = \sum_{j=1}^{M} P_{kj}\, \pi_j    (3.6)
and

p(x \mid C_k, \Theta) = \sum_{j=1}^{M} P(j \mid C_k, \Theta)\, p(x \mid C_k, j, \theta_{kj}),    (3.7)
respectively, where

P(j \mid C_k, \Theta) = \frac{P_{kj}\, \pi_j}{P(C_k \mid \Theta)}.    (3.8)
According to the hierarchical mixture classification model, the generation of a data pair (x, C_k) proceeds as follows: (1) select a component j from the set {1, ..., M} with probability π_j; (2) select a class label C_k, where k ∈ {1, ..., K}, with probability P_kj; and (3) draw x according to the probability density p(x | C_k, j, θ_kj). Note that according to equation 3.7, each class conditional density exhibits a flat mixture form. This suggests that we can contrast the proposed model against the mixture model classifiers described in section 2. In the hierarchical mixture classifier case, the class label C_k and the data x are not conditionally independent given the component j, which makes the model in principle different from the other classification mixture models.
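This generative process can be sketched with invented parameters; the key difference from the models of section 2 is that x is now drawn from a subcluster density that depends on the class label, so x and C_k are not conditionally independent given j:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative hierarchical model: M = 2 components, K = 2 classes.
pi = np.array([0.5, 0.5])                  # pi_j = P(j)
P_kj = np.array([[0.6, 0.3],               # P(C_k | j); columns sum to 1
                 [0.4, 0.7]])
mu = np.array([[-2.0, 1.5],                # subcluster means mu_kj
               [-1.0, 2.5]])

def sample_pair():
    """One draw (x, j, k) from the hierarchical generative process."""
    j = rng.choice(2, p=pi)                # select component j with prob pi_j
    k = rng.choice(2, p=P_kj[:, j])        # select class C_k with prob P_kj
    x = rng.normal(mu[k, j], 0.3)          # draw x from p(x | C_k, j, theta_kj)
    return x, j, k

samples = [sample_pair() for _ in range(3000)]
# Within component j = 0, class 1 points cluster around mu_00 = -2.0.
x00 = [x for x, j, k in samples if j == 0 and k == 0]
```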
In addition, the hierarchical mixture classification model with M components in its second level can be considered as an extension of the common components model that employs M components in total. In particular, the common components model assumes that all data points generated by the component j, possibly corresponding to different classes, are explained by the same density model p(x | j, θ_j). In contrast, the hierarchical mixture classification model assumes that the data generated by the component j are explained in a way that depends on their class labels (for each class C_k, a different probability model p(x | C_k, j, θ_kj) is provided). In section 3.2 we explain how this additional flexibility resolves a serious data representation drawback of the common components model and improves classification performance significantly. Compared to the separate mixtures model, the proposed model can be derived by imposing special constraints on a separate mixtures model with KM components in total, in which each class conditional density is modeled by a mixture with M components. The imposed constraints are that M groups of K components (each belonging to a different class conditional density) must be formed, and the components of each group must explain a common input subspace, as discussed in this section.

3.1 Training the Hierarchical Mixture Classification Model. In the following, we assume that all the probability models p(x | C_k, j, θ_kj) follow the same parametric form taken from the exponential family. The log likelihood of the labeled data set (X, Y) is
L(\Theta) = \sum_{k=1}^{K} \sum_{x \in X_k} \log \sum_{j=1}^{M} \pi_j P_{kj}\, p(x \mid C_k, j, \theta_{kj}).    (3.9)
It is possible to maximize the above quantity using the expectation-maximization (EM) algorithm. However, such a maximization would cause the whole model to collapse to one equivalent to a separate mixtures model (with M components employed by each class conditional density model), which means that the hierarchy is lost. Therefore, in order to maintain the hierarchical nature of the model, we cannot rely on direct optimization of the above log likelihood. According to the assumptions of the hierarchical mixture classification model, the missing information is related to the way that the data points are generated by the components of the second level. On the other hand, there is no missing information in the third level of the hierarchy (where class labels are taken into account), and the probability model that generated a data point is explicitly indicated by its class label. In order to express the second-level missing information, we introduce for each x an M-dimensional binary vector z(x) indicating the component that generated x. The resulting complete data log likelihood is

L_C(\Theta) = \sum_{k=1}^{K} \sum_{x \in X_k} \sum_{j=1}^{M} z_j(x) \log \pi_j P_{kj}\, p(x \mid C_k, j, \theta_{kj}).    (3.10)
However, since each variable z(x) is unknown, we can employ only an approximation of z_j(x) provided by its expected value. In our case, two methods exist to obtain the expected value of z_j(x). In the first method, class labels are ignored, and the expected value of z_j(x) is equal to the probability P(j | x). The second type of expectation takes into account the class label C_k of x and corresponds to the probability P(j | x, C_k).² If h_j(x) denotes either P(j | x) or P(j | x, C_k), then Σ_{j=1}^M h_j(x) = 1, and the expected value of the complete data log likelihood L_C is

Q(\Theta) = \sum_{k=1}^{K} \sum_{x \in X_k} \sum_{j=1}^{M} h_j(x) \log \pi_j P_{kj}\, p(x \mid C_k, j, \theta_{kj}).    (3.11)
In analogy to the case of unsupervised hierarchical mixture training (Bishop & Tipping, 1998), we consider that the h_j(x) have been computed in a previous stage and remain constant. In this case, the maximization of Q with respect to the parameters Θ yields

\hat{\pi}_j = \frac{1}{|X|} \sum_{k=1}^{K} \sum_{x \in X_k} h_j(x)    (3.12)

\hat{P}_{kj} = \frac{\sum_{x \in X_k} h_j(x)}{\sum_{\ell=1}^{K} \sum_{x \in X_\ell} h_j(x)}    (3.13)

\hat{\theta}_{kj} = \arg\max_{\theta_{kj}} \sum_{x \in X_k} h_j(x) \log p(x \mid C_k, j, \theta_{kj}).    (3.14)
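For fixed responsibilities h_j(x), the mixing updates 3.12 and 3.13 amount to weighted counts. A minimal sketch (the array layout and names are illustrative, not from the paper's code):

```python
import numpy as np

def update_mixing(h, y, K):
    """Equations 3.12-3.13 given fixed responsibilities.

    h: (M, N) array with h[j, i] = h_j(x_i); columns sum to 1.
    y: (N,) integer class labels in {0, ..., K-1}.
    Returns (pi_hat, P_hat) with P_hat of shape (K, M).
    """
    M, N = h.shape
    pi_hat = h.sum(axis=1) / N                        # equation 3.12
    # Numerator of 3.13: responsibility mass of component j inside class k.
    P_hat = np.stack([h[:, y == k].sum(axis=1) for k in range(K)])
    # Denominator of 3.13: total responsibility mass of component j.
    P_hat = P_hat / P_hat.sum(axis=0, keepdims=True)  # columns sum to 1 over k
    return pi_hat, P_hat
```

The estimate θ̂_kj of equation 3.14 is then a separate weighted maximum-likelihood fit per (k, j) pair, solved in closed form for exponential-family models (see equations 3.17 and 3.18 for the gaussian case).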
Since p(x | C_k, j, θ_kj) is chosen from the exponential family, θ̂_kj can be analytically obtained by solving the equation

\sum_{x \in X_k} h_j(x)\, \nabla_{\theta_{kj}} \log p(x \mid C_k, j, \theta_{kj}) = 0    (3.15)
with respect to θ_kj.

² In the first case, the expected value is E[z_j(x) | X] = P(z_j(x) = 1 | x) = P(j | x), while in the second case, it holds that E[z_j(x) | X, Y] = P(z_j(x) = 1 | x, C_k) = P(j | x, C_k).
In the gaussian case, the solution of equation 3.15 can be obtained analytically. Assume that each probability model p(x | C_k, j, θ_kj) is a gaussian of the general form

p(x \mid C_k, j, \theta_{kj}) = \frac{1}{(2\pi)^{d/2} |\Sigma_{kj}|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_{kj})^T \Sigma_{kj}^{-1} (x - \mu_{kj}) \right\}.    (3.16)
Then the solution for each parameter vector θ_kj = {μ_kj, Σ_kj} takes the form

\hat{\mu}_{kj} = \frac{\sum_{x \in X_k} h_j(x)\, x}{\sum_{x \in X_k} h_j(x)}    (3.17)

\hat{\Sigma}_{kj} = \frac{\sum_{x \in X_k} h_j(x)\, (x - \hat{\mu}_{kj})(x - \hat{\mu}_{kj})^T}{\sum_{x \in X_k} h_j(x)}.    (3.18)
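Equations 3.17 and 3.18 are simply responsibility-weighted sample moments. A small helper sketching this for one (k, j) pair (an illustrative implementation, with uniform weights reducing to the ordinary biased estimates):

```python
import numpy as np

def weighted_gaussian_fit(Xk, h):
    """Equations 3.17-3.18 for one (k, j) pair.

    Xk: (N, d) points of class C_k; h: (N,) values h_j(x) for those points.
    """
    w = h / h.sum()                                  # normalized weights
    mu_hat = w @ Xk                                  # equation 3.17
    D = Xk - mu_hat
    Sigma_hat = (w[:, None] * D).T @ D               # equation 3.18
    return mu_hat, Sigma_hat
```

With h_j(x) constant, mu_hat and Sigma_hat coincide with the sample mean and the (biased) sample covariance of X_k, as expected.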
Note that these two estimates are provided only if P̂_kj > 0, since otherwise the component j does not represent data of the class C_k. Obviously, in order to obtain the parameter solution described by equations 3.12 through 3.14, we must first specify the values of h_j(x), that is, estimate the probabilities P(j | x) or P(j | x, C_k). An approximation of P(j | x) can be obtained by running a mixture model with M components using the data set X and ignoring the class labels. Similarly, an approximation of P(j | x, C_k) can be obtained by applying the common components model to the labeled data set (X, Y). Therefore, two different approaches can be applied for obtaining an estimate of h_j(x):

Algorithm 1: Unsupervised case (h_j(x) = P(j | x)). We introduce the mixture model p(x | Ω) = Σ_{j=1}^M π_j p(x | j, ω_j), where p(x | j, ω_j) typically has the same parametric form as p(x | C_k, j, θ_kj). We maximize the log likelihood considering the unlabeled data X using the EM algorithm (see section A.1) and obtain the parameter solution Ω̂. Then we replace h_j(x) by

P(j \mid x, \hat{\Omega}) = \frac{\hat{\pi}_j\, p(x \mid j, \hat{\omega}_j)}{\sum_{i=1}^{M} \hat{\pi}_i\, p(x \mid i, \hat{\omega}_i)}.    (3.19)
Algorithm 2: Supervised case (h_j(x) = P(j | x, C_k)). We introduce the common components model p(x | C_k, Ω_k) = Σ_{j=1}^M π_jk p(x | j, ω_j) and obtain a parameter solution Ω̂_k for each k by maximizing the log likelihood 2.3 using the EM algorithm (see section A.2). Then we replace h_j(x) by

P(j \mid x, C_k, \hat{\Omega}_k) = \frac{\hat{\pi}_{jk}\, p(x \mid j, \hat{\omega}_j)}{\sum_{i=1}^{M} \hat{\pi}_{ik}\, p(x \mid i, \hat{\omega}_i)}.    (3.20)
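The intermediate stage of algorithm 1 is a standard unsupervised EM run. The minimal one-dimensional sketch below (deterministic initialization and synthetic data are illustrative choices, not the paper's setup) returns the responsibilities P(j | x) of equation 3.19:

```python
import numpy as np

def em_gaussian_mixture(x, M=2, iters=100):
    """EM for a 1-d, M-component gaussian mixture; returns (h, pi, mu)."""
    pi = np.full(M, 1.0 / M)
    mu = np.linspace(x.min(), x.max(), M)          # deterministic initialization
    var = np.full(M, x.var())
    for _ in range(iters):
        # E-step: responsibilities h[j, i] = P(j | x_i), as in equation 3.19
        dens = np.exp(-0.5 * (x - mu[:, None]) ** 2 / var[:, None]) \
               / np.sqrt(2.0 * np.pi * var[:, None])
        h = pi[:, None] * dens
        h = h / h.sum(axis=0, keepdims=True)
        # M-step: standard mixture updates
        Nj = h.sum(axis=1)
        pi = Nj / x.size
        mu = (h @ x) / Nj
        var = (h * (x - mu[:, None]) ** 2).sum(axis=1) / Nj
    return h, pi, mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
h, pi, mu = em_gaussian_mixture(x)
```

On this well-separated bimodal sample, the fitted means approach −3 and 3 and the mixing weights approach 0.5 each; the columns of h are the P(j | x) values that would be frozen before applying the updates 3.12 through 3.14.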
Figure 2: The two-dimensional data points of a three-class problem and the parameter solutions for the models p(x | C_k, j, θ_kj), which were assumed to be gaussians (solid lines). The dashed lines represent the parameter solution obtained by the intermediate training stage (approximation of h_j(x)). Note that for the cluster on the right, there exist only two subclusters (solid lines). This is because this data region contains data from two classes only (C's are missing); thus, the density model of the third class is automatically pruned.
Once we have obtained the parameter solution for the hierarchical mixture classification model, several useful quantities can be estimated. The class prior probability given by equation 3.6 would essentially be P(C_k | Θ̂) = |X_k| / |X|, where equations 3.12 and 3.13 are used. The class conditional density can be estimated using equation 3.7, where

P(j \mid C_k, \hat{\Theta}) = \frac{1}{|X_k|} \sum_{x \in X_k} h_j(x).    (3.21)
In Figure 2, a three-class data set is illustrated along with the parameter solution of the models p(x | C_k, j, θ_kj) (solid lines), which were chosen to be gaussians. The model employs two components at the second level of the hierarchy. In this example, the same solution is obtained at the intermediate training stage (represented with dashed lines) using either a mixture model or the common components model. Although the two algorithms provided the same parameter solutions in this example, this is not expected to hold in general. This can be explained by the fact that the application of a mixture
Figure 3: A two-class problem where the true decision boundary is linear (dotted line). (a) The solution found by the common components model with five components. (b) The solution of the hierarchical mixture classifier found based on the common components solution (a different line style is used for the components of each class).
model constitutes an unsupervised learning task, while the application of the common components model is a supervised task.

3.2 Comparison with the Common Components Model. The hierarchical mixture classification model is a more flexible model compared to the common components model. In this section, we investigate this issue and explain how the increased flexibility yields an improvement in data representation in regions close to the decision boundaries. Figure 3 displays the data of an artificial classification problem with two classes. The problem is constructed so that the true decision boundary is linear, corresponding to the vertical dotted line in the figure. Figure 3a displays the parameter solution found by the common components model with M = 5. The common components model places three components on the decision boundary, thus leading to a decrease in classification performance. We also applied the hierarchical mixture classifier with M = 5 using algorithm 2; the solution is displayed in Figure 3b. More specifically, the hierarchical mixture classifier refines the solution found by the common components model by taking the probabilities P(j | x, C_k, Ω̂_k) and approximating the
local class-labeled clusters within the previous five clusters. As shown in Figure 3b, the final parameter solution of the model approximates the true decision boundary quite efficiently. If we compare the two solutions, we can see that they differ only in regions close to the decision boundary (i.e., in the subspace relevant to classification), while in subspaces irrelevant to classification, the representation is exactly the same.³ Thus, the hierarchical mixture classifier (trained using algorithm 2) can be considered as taking the common components solution as input and improving it in subspaces that are critical for classification efficiency. In the following, we discuss this property of the solutions found by the proposed model more formally. Once a hierarchical mixture classification model has been constructed using algorithm 2, it is natural to compare the two classification methods in terms of the values of the corresponding solution parameters. Assume that Θ̂ is the parameter solution for the hierarchical mixture classification model provided by equations 3.12 through 3.14, where h_j(x) is computed as P(j | x, C_k, Ω̂_k) obtained by the common components model with parameters Ω̂_k, k = 1, ..., K. We wish to compare the classifier provided by the hierarchical mixture classification model with parameters Θ̂ with the corresponding common components model with parameters Ω̂_k, k = 1, ..., K. To achieve this, it is sufficient to compare the class conditional density estimate p(x | C_k, Θ̂) = Σ_{j=1}^M P(j | C_k, Θ̂) p(x | C_k, j, θ̂_kj) with the corresponding p(x | C_k, Ω̂_k) = Σ_{j=1}^M π̂_jk p(x | j, ω̂_j).
It can be shown that the solution p(x | C_k, Θ̂) is better than p(x | C_k, Ω̂_k) in terms of the corresponding log-likelihood values. More specifically, the following proposition holds:

Proposition 1. Let Θ̂ be the parameter solution for the hierarchical mixture classification model provided by equations 3.12 through 3.14, where h_j(x) is computed as P(j | x, C_k, Ω̂_k), the solution provided by the common components model. Also assume that for each j, the density models p(x | j, ω_j) and p(x | C_k, j, θ_kj), k = 1, ..., K, have the same parametric form, which is such that the maximum of equation 3.14 occurs for a unique value of the parameters. If for a class C_k it holds that ∇_{Ω_k} L_k(Ω̂_k) ≠ 0, where L_k is the C_k-class log likelihood defined in equation 2.3, then the estimate p(x | C_k, Θ̂) provides a higher class log-likelihood value than the estimate p(x | C_k, Ω̂_k), that is,
\sum_{x \in X_k} \log \sum_{j=1}^{M} P(j \mid C_k, \hat{\Theta})\, p(x \mid C_k, j, \hat{\theta}_{kj}) > \sum_{x \in X_k} \log \sum_{j=1}^{M} \hat{\pi}_{jk}\, p(x \mid j, \hat{\omega}_j).    (3.22)
³ Subspaces relevant to classification are considered to be all the data clusters containing data from more than one class.
If for a class C_k it holds that ∇_{Ω_k} L_k(Ω̂_k) = 0, then the estimates p(x | C_k, Θ̂) and p(x | C_k, Ω̂_k) are identical.⁴ The proof is given in appendix B. Proposition 1 states that for each C_k, the estimate p(x | C_k, Θ̂) is such that either its class log-likelihood value is higher than the log likelihood computed using p(x | C_k, Ω̂_k), or it is identical to p(x | C_k, Ω̂_k). The second case occurs when the parameter values Ω̂_k locally maximize the class log likelihood L_k, which means that p(x | C_k, Ω̂_k) is already a locally optimum estimate for the conditional density of class C_k. On the other hand, the assumption in the first case implies that Ω̂_k does not constitute a local optimum of the class log likelihood L_k. The first case occurs frequently in practice. To explain this, consider that since each Ω̂_k is obtained from the maximization of the log likelihood equation 2.3 (corresponding to the common components model case) using the EM algorithm (Dempster, Laird, & Rubin, 1977), we may assume that it constitutes a stationary point of the log likelihood. This implies that each π̂_jk and ω̂_j satisfy
1 X O k) , P ( j | x, C k, W |Xk | x2X
K X X kD 1 x2Xk
(3.23)
k
O k ) rwj log p ( x | j, wOj ) D 0 P ( j | x, Ck, W
(3.24)
or K X kD 1
O k ) D 0. rwj L k ( W
(3.25)
Although π̂_jk will always correspond to a stationary point of the class log likelihood L_k, equation 3.25 explicitly points out that ω̂_j may not correspond to a stationary point of L_k for all k unless the component represents data of only one class. In order for ω̂_j to be a stationary point of L_k, it must hold that

\nabla_{\omega_j} L_k(\hat{\Omega}_k) = \sum_{x \in X_k} P(j \mid x, C_k, \hat{\Omega}_k)\, \nabla_{\omega_j} \log p(x \mid j, \hat{\omega}_j) = 0.    (3.26)

⁴ We mean that P(j | C_k, Θ̂) = π̂_jk and p(x | C_k, j, θ̂_kj) = p(x | j, ω̂_j) for each j = 1, ..., M and x.

The situation where ω̂_j satisfies equation 3.24 without satisfying equation 3.26 for every k occurs when the component j represents data of different classes that do not overlap significantly.⁵ In real-world classification problems (with class overlapping), it is almost certain that there will be some ω_j for which condition 3.26 does not hold for all k.⁶ For such an ω̂_j, we expect the corresponding data subspace (represented by the component j of the common components model) to be close to the true decision boundaries. This is because the component j essentially represents data of more than one class that overlap or lie very close to each other. All these ω̂_j will result in some ∇_{Ω_k} L_k(Ω̂_k) ≠ 0, and the first case of proposition 1 will be applicable. Subsequently, the specific estimates p(x | C_k, Θ̂) will improve on the corresponding p(x | C_k, Ω̂_k), and this improvement will be observed in subspaces relevant to classification. The latter can be considered very beneficial from the classification point of view. An illustrative example is shown in Figure 3.

3.3 Comparison with the Separate Mixtures Model. As we previously noted, the hierarchical mixture classifier can be considered a constrained version of a separate mixtures model with a total of KM components (M components for each class conditional density). An important advantage over separate mixtures is the ability to adjust the number of active components automatically. More specifically, the hierarchical mixture model avoids overfitting because it is able to prune⁷ class density models at the third level of the hierarchy based on the distribution of the data. For this reason, the number of active (not pruned) density models can be any number in the range [M, KM] and is learned from the data. For example, in the problem of Figure 2, six density models are assumed initially; however, after training, only five remain active (precisely as many as the problem requires). Similarly, in the data of Figure 3, only eight density models remain active, while originally 10 such models were assumed.
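The pruning rule (a density model is dropped when its weight P_kj vanishes) reduces to counting the surviving entries of the P_kj matrix. The matrix below is an invented solution mimicking the Figure 2 situation (M = 2 components, K = 3 classes, one empty subcluster):

```python
import numpy as np

def count_active(P_kj, eps=1e-6):
    """Number of third-level density models p(x | C_k, j) kept after pruning."""
    return int((P_kj > eps).sum())

# Illustrative trained weights: rows are classes, columns are components.
P_kj = np.array([[0.5, 0.3],
                 [0.5, 0.2],
                 [0.0, 0.5]])   # the third class is absent from component 1
```

With K·M = 6 models assumed initially, the zero entry leaves five active models, matching the count reported for the Figure 2 example.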
These results indicate that the hierarchical mixture classifier is able to avoid overfitting by pruning density models during learning. This also makes the method robust with respect to the choice of M; even if M is overestimated, it is still possible to find an efficient solution. In order to verify this experimentally, we applied the hierarchical mixture classifier to the problem of Figure 2 for the
⁵ An illustrative example is displayed in Figure 2, where each component of the common components model (dashed lines) simultaneously represents data clusters of different classes that clearly do not have the same means and variances. For this reason, the class log likelihoods are not maximal. Note that the class log likelihoods would all be simultaneously maximized if the class subclusters had precisely the same means and variances.
⁶ Note that if for a specific ω̂_j there exists a class C_k such that ∇_{ω_j} L_k(Ω̂_k) ≠ 0, then in order for equation 3.24 to be satisfied, there must also be at least one different class C_ℓ for which ∇_{ω_j} L_ℓ(Ω̂_ℓ) ≠ 0.
⁷ If a parameter P_kj becomes zero (or very close to zero), the corresponding density model p(x | C_k, j, θ_kj) is pruned.
Figure 4: The same data set as in Figure 2. (a–d) The solution of the hierarchical mixture classifier for M = 2, 3, 4, 5, respectively. (e, f) The solution obtained using the separate mixtures model with two and three components per class conditional density, respectively. A different line style is used for the density model of each class.
values of M = 2, 3, 4, 5. The obtained solutions are displayed in Figures 4a through 4d, where it is shown that the hierarchical mixture classification model finds exactly the same solution for M = 2, 3, 4 (with five active components), while for M = 5, six active components are used, but the solution is very similar to the previous cases. The higher the value of M becomes,
the greater the number of pruned components; thus, the model is not prone to overfitting. We also measured the log-likelihood value achieved on a test data set; it was −757.8 for the first three cases and −758.12 for the last case. On the other hand, the separate mixtures model cannot avoid overfitting, because when we increase the number of components in a mixture model, the likelihood value always increases. In Figures 4e and 4f, we display the solutions of the separate mixtures model when each class conditional density is modeled using a mixture with two and three components, respectively. It can be observed that all components remain active, resulting in overfitting of the data. The log-likelihood value on test data was now −759.46 and −764.1, respectively. It must also be noted that analogous results are obtained if four or five components are used. Another advantage of the hierarchical mixture classifier over separate mixtures is that it is more efficient in classification problems with small data sets. In order for the separate mixtures model to be efficient, we have to infer the number of mixture components corresponding to each class. For example, in the problem of Figure 2, in order to obtain the optimal solution, we have to assume two components for two of the class conditional densities and one component for the third one. We can apply model selection techniques, such as cross validation, in order to determine the number of components; however, any model selection method is unreliable when few data are available. Moreover, in the case of problems with many classes (K > 10) and few available data for each class (compared to the data dimensionality), the separate mixtures model is not applicable. On the contrary, this is not a problem for our approach, since the specification of the M components at the second level of the hierarchy is performed based on all data from all classes, and, in addition, we are allowed to overestimate M.
4 Experiments
We conducted a series of experiments using gaussian components and compared the proposed model with the common components model and the separate mixtures model. We considered five well-known data sets: Clouds, Satimage, and Phoneme from the ELENA database and Pima Indians and Ionosphere from the UCI repository (Blake & Merz, 1998). Details of these data sets are provided in Table 1. We have performed experiments for several values of M, where M denotes the number of components at the second level of the hierarchical mixture classification model and also the total number of components of the common components model. In the case of separate mixtures, M denotes the number of components used by each class-conditional mixture density. To obtain average and standard deviation error values, we applied the five-fold cross-validation method. The results for all algorithms and all data sets are presented in Tables 2 and 3.
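The evaluation protocol is plain five-fold cross validation. A generic sketch, where fit_and_error is a hypothetical stand-in for training and testing any of the compared models:

```python
import numpy as np

def five_fold_cv(X, y, fit_and_error, seed=0):
    """Return (mean, std) of the test error over five folds.

    fit_and_error(X_train, y_train, X_test, y_test) must return a scalar
    error rate; it is a placeholder for any of the compared classifiers.
    """
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, 5)
    errors = []
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[m] for m in range(5) if m != i])
        errors.append(fit_and_error(X[train], y[train], X[test], y[test]))
    return float(np.mean(errors)), float(np.std(errors))
```

Each fold holds out one fifth of the data for testing and trains on the remaining four fifths; the reported numbers in Tables 2 and 3 are the mean and standard deviation over the five held-out errors.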
Table 1: Data Sets Used in the Experiments.

Data Set        Features   Classes   Number of Data
Satimage            5         6           6,435
Phoneme             5         2           5,404
Clouds              2         2           5,000
Pima Indians        8         2             768
Ionosphere         35         2             351
Table 2: Generalization Error and Standard Deviation Values for All Tested Algorithms Using the Satimage, Phoneme, and Clouds Data Sets.

Satimage                      M=6          M=12         M=18         M=24
Algorithm                  Error  S.D.  Error  S.D.  Error  S.D.  Error  S.D.
h_j(x) = P(j | x)           12.0   0.5  10.7*  0.4  10.7*  0.8  10.4*  0.9
h_j(x) = P(j | x, C_k)      11.9   1.1  11.5   1.0  10.9   0.9  10.6   0.9
Common components model     17.1   0.4  12.9   0.2  12.2   0.3  11.4   0.5
Separate mixtures           11.2*  0.5  11.2   1.0  11.7   0.9  12.5   0.5

Phoneme                       M=8          M=10         M=12         M=14
Algorithm                  Error  S.D.  Error  S.D.  Error  S.D.  Error  S.D.
h_j(x) = P(j | x)           15.5*  1.1  15.2   1.2  15.4   0.8  14.9   1.2
h_j(x) = P(j | x, C_k)      15.8   1.1  14.7*  1.0  14.0*  0.9  14.5*  1.0
Common components model     22.0   1.1  20.6   1.9  19.9   1.1  21.3   1.2
Separate mixtures           16.3   1.5  15.9   1.0  15.9   1.4  15.5   1.1

Clouds                        M=4          M=6          M=8          M=10
Algorithm                  Error  S.D.  Error  S.D.  Error  S.D.  Error  S.D.
h_j(x) = P(j | x)           16.7   2.5  12.9   0.9  12.6   1.0  12.5   0.9
h_j(x) = P(j | x, C_k)      13.1   0.9  11.3*  1.0  10.9*  0.9  10.8*  0.9
Common components model     13.1   0.9  11.4   0.9  10.9   0.9  10.8   0.8
Separate mixtures           12.2*  0.6  11.9   0.8  11.1   1.3  10.8   0.8

Note: Values marked with an asterisk indicate best performance among the tested algorithms (boldface in the original).
The hierarchical mixture classification model was trained using the two algorithms described in section 3.1. Experimental results are displayed in Tables 2 and 3, where asterisks mark the best performance among the tested algorithms. The two algorithms are denoted as h_j(x) = P(j | x) and h_j(x) = P(j | x, C_k), respectively.

Table 3: Generalization Error and Standard Deviation Values for All Tested Algorithms Using the Pima Indians and Ionosphere Data Sets.

Pima Indians                  M=6          M=8          M=10         M=12
Algorithm                  Error  S.D.  Error  S.D.  Error  S.D.  Error  S.D.
h_j(x) = P(j | x)           26.0   1.1  24.7*  2.5  24.8   2.7  25.0   2.5
h_j(x) = P(j | x, C_k)      24.3*  1.8  24.8   1.7  24.6*  2.5  24.8*  2.8
Common components model     28.6   3.6  29.5   2.9  28.1   3.7  26.9   2.6
Separate mixtures           27.1   2.8  27.1   3.0  26.3   3.6  28.9   2.3

Ionosphere                    M=6          M=8          M=10         M=12
Algorithm                  Error  S.D.  Error  S.D.  Error  S.D.  Error  S.D.
h_j(x) = P(j | x)           13.7   3.0  10.0*  3.1   9.4   2.6   7.4*  3.6
h_j(x) = P(j | x, C_k)      12.6   4.0  12.0   3.6   7.4*  3.2   7.4*  1.3
Common components model     17.7   4.0  16.3   3.4  12.0   3.4   9.5   3.3
Separate mixtures           12.1*  2.6  12.3   3.3   8.9   3.4   8.8   2.7

Note: Values marked with an asterisk indicate best performance among the tested algorithms (boldface in the original).

Moreover, since the training of the hierarchical mixture classification model for the case h_j(x) = P(j | x, C_k) requires the construction of the common components model, we also obtain a solution for the common components model at no additional effort. The experimental results indicate the following:

1. Both algorithms for training the hierarchical mixture classification model provide better generalization results than the separate mixtures and the common components model in most of the runs.

2. The algorithm that uses h_j(x) = P(j | x, C_k) provides a classifier that significantly improves the corresponding common components classifier obtained at the intermediate training stage. For example, in the case of the Phoneme data set, the improvement is impressive. This constitutes an experimental justification of the discussion in section 3.2. In the case of the Clouds data set, the two classifiers provide approximately the same class conditional density estimates (the second case of proposition 1 is applicable), and thus the two methods exhibit almost equal performance.

3. According to all results, the hierarchical mixture classifier is quite robust regarding the number of components M, and its classification performance typically improves as we increase M. On the other hand, the performance of the separate mixtures model is very sensitive to the choice of M. For example, on the Satimage data set, this method clearly cannot avoid overfitting as we increase M.
2240
Michalis K. Titsias and Aristidis Likas
5 Discussion
A hierarchical mixture classification model has been presented that exhibits a three-level structure. This structure provides at the higher level an unsupervised representation of the data and then at a lower level information about the classes having generated the data. The proposed model can be considered as a mixture of experts classifier, since the components at the second level of the hierarchy partition the data space into subspaces, while the probability models at the third level form the experts that solve the classification problem in each subspace. The proposed hierarchical mixture classifier exhibits several attractive features compared to conventional mixture models for classification. Specifically, compared to the common components model, it improves data representation in subspaces relevant to classification. Also, the model exhibits robustness with respect to the number M of components at the second level, and this constitutes a great advantage over the separate mixtures model.

For future research, several interesting directions may be followed. Any advanced method for mixture density estimation (Ormoneit & Tresp, 1996; Ueda, Nakano, Ghahramani, & Hinton, 2000; Vlassis & Likas, 2002) can be incorporated at the first stage (computation of h_j(x)) of the proposed training algorithm of the hierarchical mixture classification model. Though such methods can be directly applied where h_j(x) = P(j | x), slightly modified versions are needed for the case where h_j(x) = P(j | x, C_k). Also, in the proposed approach, the probability models p(x | C_k, j, θ_kj) are assumed to be unimodal densities taken from the exponential family; however, other models may be used, such as factor analyzers (Everitt, 1984), or each p(x | C_k, j, θ_kj) may itself be a mixture model.
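As an illustration of the first training stage (computing h_j(x) = P(j | x) by fitting a gaussian mixture with EM, in the spirit of equations A.1 through A.4), the following is a minimal NumPy sketch. The function name, the deterministic initialization, and the small ridge term added to the covariances are assumptions of this sketch, not details from the article:

```python
import numpy as np

def em_gaussian_mixture(X, M, n_iter=50):
    """Fit an M-component gaussian mixture to data X (n x d) with EM and
    return the posteriors P(j | x) together with the fitted parameters."""
    n, d = X.shape
    # Initialization heuristic (not specified in the text): spread-out points.
    mu = X[np.linspace(0, n - 1, M).astype(int)].copy()
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    mix = np.full(M, 1.0 / M)                      # mixing weights pi_j
    for _ in range(n_iter):
        # E-step: responsibilities P(j | x, W^(t)) via Bayes' rule.
        R = np.empty((n, M))
        for j in range(M):
            diff = X - mu[j]
            inv = np.linalg.inv(Sigma[j])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[j]))
            R[:, j] = mix[j] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update means, covariances, and mixing weights.
        Nj = R.sum(axis=0)
        mu = (R.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mu[j]
            Sigma[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        mix = Nj / n
    return R, mu, Sigma, mix
```

The returned matrix R holds h_j(x) for every training point; in the hierarchical model these posteriors are then fixed and reused while training the class-specific experts.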
Appendix A: Specification of h_j(x) for Gaussian Components

A.1 Approximation of h_j(x) by P(j | x). We assume that the mixture model employed for determining the probabilities P(j | x) has gaussian components of the form 3.16. We can obtain an estimate of P(j | x) by iteratively applying the following update equations until convergence:
$$P(j \mid x, W^{(t)}) = \frac{\pi_j^{(t)}\, p(x \mid j, \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{i=1}^{M} \pi_i^{(t)}\, p(x \mid i, \mu_i^{(t)}, \Sigma_i^{(t)})} \tag{A.1}$$

$$\mu_j^{(t+1)} = \frac{\sum_{x \in X} P(j \mid x, W^{(t)})\, x}{\sum_{x \in X} P(j \mid x, W^{(t)})} \tag{A.2}$$
Mixture of Experts Classification Using a Hierarchical Mixture Model
$$\Sigma_j^{(t+1)} = \frac{\sum_{x \in X} P(j \mid x, W^{(t)})\, (x - \mu_j^{(t+1)})(x - \mu_j^{(t+1)})^{T}}{\sum_{x \in X} P(j \mid x, W^{(t)})} \tag{A.3}$$

$$\pi_j^{(t+1)} = \frac{1}{|X|} \sum_{x \in X} P(j \mid x, W^{(t)}), \tag{A.4}$$
where equation A.1 holds for each x ∈ X and j, and equations A.2 through A.4 for each j.

A.2 Approximation of h_j(x) by P(j | x, C_k). We assume that the common components model employed for determining the probability P(j | x, C_k) employs gaussian components. The EM algorithm for maximizing the log likelihood 2.3 gives the following update equations (Titsias & Likas, 2001):

$$P(j \mid x, C_k, W_k^{(t)}) = \frac{\pi_{jk}^{(t)}\, p(x \mid j, \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{i=1}^{M} \pi_{ik}^{(t)}\, p(x \mid i, \mu_i^{(t)}, \Sigma_i^{(t)})} \tag{A.5}$$

$$\mu_j^{(t+1)} = \frac{\sum_{k=1}^{K} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(t)})\, x}{\sum_{k=1}^{K} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(t)})} \tag{A.6}$$

$$\Sigma_j^{(t+1)} = \frac{\sum_{k=1}^{K} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(t)})\, (x - \mu_j^{(t+1)})(x - \mu_j^{(t+1)})^{T}}{\sum_{k=1}^{K} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(t)})} \tag{A.7}$$

$$\pi_{jk}^{(t+1)} = \frac{1}{|X_k|} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(t)}), \tag{A.8}$$
where all equations hold for each j, while equation A.8 holds additionally for each k and equation A.5 for each k and x ∈ X.

Appendix B: Proof of Proposition 1
For the parameter solution Ĥ, the conditional density estimate of the class C_k is

$$p(x \mid C_k, \hat{H}) = \sum_{j=1}^{M} P(j \mid C_k, \hat{H})\, p(x \mid C_k, j, \hat{\theta}_{kj}), \tag{B.1}$$

where, according to equations 3.21 and 3.14,

$$P(j \mid C_k, \hat{H}) = \frac{1}{|X_k|} \sum_{x \in X_k} P(j \mid x, C_k, \hat{W}_k) \tag{B.2}$$

and

$$\hat{\theta}_{kj} = \arg\max_{\theta_{kj}} \sum_{x \in X_k} P(j \mid x, C_k, \hat{W}_k) \log p(x \mid C_k, j, \theta_{kj}), \tag{B.3}$$
respectively. Also, the corresponding class conditional estimate provided by the common components model is given by

$$p(x \mid C_k, \hat{W}_k) = \sum_{j=1}^{M} \hat{\pi}_{jk}\, p(x \mid j, \hat{w}_j). \tag{B.4}$$

Consider the C_k-class log likelihood corresponding to the data set X_k:

$$L_k(W_k) = \sum_{x \in X_k} \log \sum_{j=1}^{M} \pi_{jk}\, p(x \mid j, w_j). \tag{B.5}$$
If we apply one EM iteration to maximize the above log likelihood starting from $W_k^{(0)} = \{\hat{w}_1, \ldots, \hat{w}_M, \hat{\pi}_{1k}, \ldots, \hat{\pi}_{Mk}\}$, the parameter value $W_k^{(1)}$ is obtained by maximizing the function

$$Q(W_k \mid W_k^{(0)}) = \sum_{x \in X_k} \sum_{j=1}^{M} P(j \mid x, C_k, W_k^{(0)}) \log \pi_{jk}\, p(x \mid j, w_j), \tag{B.6}$$

which yields

$$\pi_{jk}^{(1)} = \frac{1}{|X_k|} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(0)}) \tag{B.7}$$

and

$$w_j^{(1)} = \arg\max_{w_j} \sum_{x \in X_k} P(j \mid x, C_k, W_k^{(0)}) \log p(x \mid j, w_j). \tag{B.8}$$

Now clearly from equations B.2 and B.7, it holds that P(j | C_k, Ĥ) = π_jk^(1). Also, since p(x | j, w_j^(1)) has the same parametric form as p(x | C_k, j, θ_kj), w_j^(1) and θ̂_kj are obtained by maximizing the same quantity (see equations B.3 and B.8). Thus, it holds that θ̂_kj = w_j^(1), and the class conditional estimates p(x | C_k, W_k^(1)) and p(x | C_k, Ĥ) are identical. Now, one of the following two cases holds:

1. ∇_{W_k} L_k(Ŵ_k) ≠ 0: The convergence property of the EM algorithm implies that if for the log likelihood L(H) of interest it holds that ∇_H L(H^(t)) ≠ 0, then at the next EM iteration it will hold that L(H^(t+1)) > L(H^(t)) (Wu, 1983; McLachlan & Krishnan, 1997). Thus, in our case, we find that

$$\sum_{x \in X_k} \log \sum_{j=1}^{M} P(j \mid C_k, W_k^{(1)})\, p(x \mid j, w_j^{(1)}) > \sum_{x \in X_k} \log \sum_{j=1}^{M} \pi_{jk}^{(0)}\, p(x \mid j, w_j^{(0)}), \tag{B.9}$$

which proves inequality 3.22.
2. ∇_{W_k} L_k(Ŵ_k) = 0: Since the EM algorithm converges to a stationary point (Wu, 1983), it holds that W_k^(1) = W_k^(0). Consequently, since p(x | C_k, Ĥ) is identical to p(x | C_k, W_k^(1)), it will also be identical to p(x | C_k, Ŵ_k).
References

Bishop, C. M., & Tipping, M. E. (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281–293.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Department of Computer and Information Sciences.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.
Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In D. J. Cowan, G. Tesauro, & J. Alspector (Eds.), Neural information processing systems, 6 (pp. 120–127). Cambridge, MA: MIT Press.
Hastie, T. J., & Tibshirani, R. J. (1996). Discriminant analysis by gaussian mixtures. Journal of the Royal Statistical Society B, 58, 155–176.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Miller, D. J., & Uyar, H. S. (1996). A mixture of experts classifier with learning based on both labeled and unlabeled data. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9. Cambridge, MA: MIT Press.
Miller, D. J., & Uyar, H. S. (1998). Combined learning and use for a mixture model equivalent to the RBF classifier. Neural Computation, 10, 281–293.
Ormoneit, D., & Tresp, V. (1996). Improved gaussian mixture density estimates using Bayesian penalty terms and network averaging. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 542–548). Cambridge, MA: MIT Press.
Titsias, M. K., & Likas, A. (2001). Shared kernel models for class conditional density estimation. IEEE Transactions on Neural Networks, 12(5), 987–997.
Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12, 2109–2128.
Vlassis, N., & Likas, A. (2002). A greedy EM algorithm for gaussian mixture learning. Neural Processing Letters, 15, 77–87.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11, 95–103.

Received October 16, 2001; accepted April 1, 2002.
LETTER
Communicated by Gary Cottrell
On the Emergence of Rules in Neural Networks

Stephen José Hanson
[email protected]
Michiro Negishi
[email protected]
Psychology Department, Rutgers University, Newark, N.J. 07102, U.S.A.

A simple associationist neural network learns to factor abstract rules (i.e., grammars) from sequences of arbitrary input symbols by inventing abstract representations that accommodate unseen symbol sets as well as unseen but similar grammars. The neural network is shown to have the ability to transfer grammatical knowledge to both new symbol vocabularies and new grammars. Analysis of the state-space shows that the network learns generalized abstract structures of the input and is not simply memorizing the input strings. These representations are context sensitive, hierarchical, and based on the state variable of the finite-state machines that the neural network has learned. Generalization to new symbol sets or grammars arises from the spatial nature of the internal representations used by the network, allowing new symbol sets to be encoded close to symbol sets that have already been learned in the hidden unit space of the network. The results are counter to the arguments that learning algorithms based on weight adaptation after each exemplar presentation (such as the long-term potentiation found in the mammalian nervous system) cannot in principle extract symbolic knowledge from positive examples as prescribed by prevailing human linguistic theory and evolutionary psychology.
1 Introduction
A basic puzzle in the cognitive neurosciences is how simple associationist learning at the synaptic level of the brain can be used to construct known properties of cognition that appear to require abstract reference, variable binding, and symbols. The ability of humans to parse sentences and to abstract knowledge from specific examples appears to be inconsistent with local associationist algorithms for knowledge representation (Pinker, 1994; Fodor & Pylyshyn, 1988; but see Hanson & Burr, 1990). Part of the puzzle is how neuron-like elements could, from simple signal processing properties, emulate symbol-like behavior. Symbols and symbol systems have

Neural Computation 14, 2245–2268 (2002) © 2002 Massachusetts Institute of Technology
2246
Stephen José Hanson and Michiro Negishi
previously been defined by many different authors. Harnad's version (1990) is a useful summary of many such properties: (1) a set of arbitrary physical tokens (scratches on paper, holes on a tape, events in a digital computer, etc.) that are (2) manipulated on the basis of explicit rules that are (3) likewise physical tokens and strings of tokens. The rule-governed symbol-token manipulation is based (4) purely on the shape of the symbol tokens (not their "meaning"), i.e., it is purely syntactic, and consists of (5) rulefully combining and recombining symbol tokens. There are (6) primitive atomic symbol tokens and (7) composite symbol-token strings. The entire system and all its parts (the atomic tokens, the composite tokens, the syntactic manipulations, both actual and possible, and the rules) are all (8) semantically interpretable: the syntax can be systematically assigned a meaning (e.g., as standing for objects, as describing states of affairs).

As this definition implies, a key element in the acquisition of symbolic structure involves a type of independence between the task the symbols are found in and the vocabulary they represent. Fundamental to this type of independence is the ability of the learning system to factor the generic nature (or rules) of the task from the symbols, which are arbitrarily bound to the external referents of the task. Consider the simple problem of learning a grammar from valid, "positive set only" sentences consisting of strings of symbols drawn randomly from an infinite population of such valid strings. This sort of learning might very well underlie the acquisition of language in children from exposure to grammatically correct sentences during normal discourse with their language community.1 As an example, we will examine the learning of strings generated by finite-state machines (FSMs; for an example, see Figure 1), which are known to correspond to regular grammars.
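As a concrete illustration of this kind of learning material, the sketch below generates "positive only" strings by a random walk over a small FSM and checks grammaticality by following transitions. The particular states, symbols, and transition table are invented for illustration; they are not the machines of Figure 1:

```python
import random

# A hypothetical three-state FSM: each state maps a legal input symbol to
# the next state. State 1 is the initial state; state 3 is accepting.
TRANSITIONS = {
    1: {"A": 2, "B": 3},
    2: {"C": 2, "D": 3},
    3: {"E": 1, "F": 3},
}
INITIAL, ACCEPTING = 1, {3}

def generate_sentence(rng, max_len=20):
    """Random walk over the FSM, emitting symbols; the walk may stop
    whenever an accepting state is reached, yielding a grammatical string."""
    state, out = INITIAL, []
    while len(out) < max_len:
        symbol, state = rng.choice(sorted(TRANSITIONS[state].items()))
        out.append(symbol)
        if state in ACCEPTING and rng.random() < 0.5:
            break
    return "".join(out)

def accepts(sentence):
    """Grammaticality check: follow transitions and end in an accepting state."""
    state = INITIAL
    for symbol in sentence:
        if symbol not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][symbol]
    return state in ACCEPTING
```

Under this toy grammar, "B" and "AD" are grammatical (they end in state 3), while "AC" is not (it ends in state 2).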
Humans are known to gain a memorial advantage from exposure to strings drawn from an FSM over random strings (Miller & Stein, 1963; Reber, 1967), as though they are extracting abstract knowledge of the grammar. A more stringent test of knowledge of a grammar would be to expose the subjects to an FSM with one external symbol set and see if the subjects transfer knowledge to a novel external symbol set assigned to the same FSM. In principle in this type of task, it is impossible for the subjects to use the
1 Although controversial, language acquisition must surely involve the exposure of children to valid sentences in their language. Chomsky (1957) and other linguists have stressed the importance of the a priori embodiment of the possible grammars in some form more generic than the exact target grammar. Although not the main point of this article, it must surely be true of the distribution of possible grammars that some learning bias must exist that helps guide the acquisition and selection of one grammar over another in the presence of data. What the nature of this learning bias is might be a more profitable avenue of research in language acquisition than the recent polarizations inherent in the nativist-empiricist dichotomy (Pinker, 1997; Elman et al., 1996).
On the Emergence of Rules in Neural Networks
2247
Figure 1: A vocabulary transfer task. The figure shows two FSM representations, each of which has three states (1, 2, 3), with transitions indicated by arrows and legal transition symbols (A, B, . . . , F) for each state. The network's first task is to learn the FSM on the left-hand side to criterion and then transfer to the second FSM on the right-hand side of this figure. This task involves no possible generalization from the transition symbols. Rather, all that is available are the state configuration geometries. This type of transfer task explicitly forces the network to process the symbol set independently from the transition rule. States with tailless incoming arrows indicate initial states (state 1 in the FSMs). States within circles indicate accepting states (state 3 in the FSMs).
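The vocabulary transfer manipulation can be made concrete with a small sketch: the state geometry is held fixed while every symbol is relabeled with a disjoint vocabulary. The FSM and the relabeling map below are hypothetical stand-ins, not the actual machines of Figure 1:

```python
# Hypothetical source FSM: state -> {symbol: next_state}.
SOURCE_FSM = {
    1: {"A": 2, "B": 3},
    2: {"C": 2, "D": 3},
    3: {"E": 1, "F": 3},
}
# Disjoint replacement vocabulary for the transfer phase.
RELABEL = {"A": "G", "B": "H", "C": "I", "D": "J", "E": "K", "F": "L"}

def transfer_vocabulary(fsm, relabel):
    """Return an FSM with identical state geometry but a new symbol set."""
    return {state: {relabel[sym]: nxt for sym, nxt in arcs.items()}
            for state, arcs in fsm.items()}

TARGET_FSM = transfer_vocabulary(SOURCE_FSM, RELABEL)
```

The target grammar accepts "H" exactly where the source accepts "B", and so on: nothing about the symbols themselves carries over, only the transition structure.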
symbol set as a basis for generalization without noting the patterns that are commensurate with the properties of the FSM.2 An example of this type of transfer (vocabulary transfer) is shown in Figure 1. In this task, the syntactic structure of the generated sentences is kept constant while the vocabularies are switched. This will be the first kind of simulation in this article. In the second simulation, we examine a type of syntax transfer effect where the vocabulary is kept the same but the syntax is altered (see Figure 2). The purpose of the second simulation is to examine the effect of syntactic similarities on the degree of knowledge transfer.

2 Related Work
Early neural network research in language learning (Hanson & Kegl, 1987) involved training networks to encode English sentences drawn randomly from a large text corpus (the Brown corpus). The network was able to learn a large number of English sentences and provided evidence that abstract knowledge could emerge from the statistical properties of a representative population of sentences and a simple neural network learning rule. Other early research (Elman, 1990, 1991) showed that networks could also be used to abstract general knowledge from structured input sentences. In Elman’s work, a neural network model with feedback connections, called the simple recurrent network (SRN), was trained with sentences generated from
2 Reber (1969) showed that humans would significantly transfer in such a task; however, his symbol sets allowed subjects to use similarity as a basis for transfer, as they were composed of contiguous letters from the alphabet. Recent reviews of the literature indicate that this type of transfer is common even across modalities (Redington & Chater, 1996).
Figure 2: A syntax transfer task. In this task, vocabulary sets are the same in both the left and the right FSMs, but the structure of the FSMs, such as starting states and accepting states, allowed inputs at each state, and resultant state transitions differ for each FSM. As in the vocabulary transfer task, the network's first task is to learn the FSM on the left-hand side to criterion and then transfer to the second FSM on the right-hand side of this figure. In a syntax transfer task, the network may make use of prior knowledge associated with input symbols where the same input causes the same transition in both of the FSMs.
a context-free grammar. After the SRN was shown symbols in a sentence sequentially (one symbol at a time), it could predict a possible set of symbols that could appear next in the sentence. For instance, after reading a singular noun, the network predicted that the set of next possible symbols included a wh- word (who in Elman's grammar) and singular form verbs, which in this case exhausted the possibilities in the grammar used for the simulation. In the 1990 study, Elman conducted a cluster analysis (see also Hanson & Burr, 1990) on the hidden-layer node activities to show that syntactic categories of input symbols as well as semantic categories were formed during learning. Elman (1991) studied more complex syntactic forms (embedded sentences) also constructed from artificial grammars. In these studies, he conducted a principal component analysis (PCA) of the hidden-layer activities and showed that trajectories of these activities in the multiple embedded clauses were represented by self-similar movements in the state-space and that locations of these self-similar loops were slightly displaced from each other. These analyses indicated that the network appeared to learn an approximation to a phrase structure grammar. Elman's work and other past language learning research lay the foundation for our learning research, in which we examine the productivity of these types of learned structures in supporting language acquisition and transfer. This article focuses on analyzing the effect of changing the vocabulary while keeping the syntax constant (vocabulary transfer) and the effect of changing the syntax while keeping the vocabulary constant (syntax transfer). We introduce a method of analysis that optimally searches for structure in the hidden space that is most consistent with how the representation of a grammar might be processed.

Dienes, Altmann, and Gao (1999) showed that an SRN augmented with an additional hidden layer acquires abstract knowledge that is independent of input symbols. During the training phase in their simulations, the network was given sentences of inputs in one vocabulary generated by an FSM and was trained on the symbol prediction paradigm as in Elman's work. During the test period, they presented the network with sentences that had the same structure but used a different vocabulary. During this period, some of the weights in the network were not modified. This strategy is often called weight freezing. They showed that the network could learn sentences composed of symbols from the second vocabulary during the test phase significantly better if the network had been trained with the first vocabulary using the same syntactic rules, compared with the case where the network was trained with sentences that were generated by different syntactic rules. Thus, the network was shown to be able to extract abstract knowledge in one vocabulary and could transfer such knowledge to a new vocabulary (i.e., an independent group of input symbols whose mapping to the old ones was not known a priori). They showed that the network had many previously reported characteristics of human learners: it was able to transfer grammatical knowledge of a finite-state grammar to a new vocabulary, and it showed sensitivity to both the grammaticality of the test sentences and the similarity between the test sentences and the training sentences, greater learning if it was trained on whole sentences than when it was trained on bigrams (if equated for the number of symbols), and the ability to transfer even if the sentences did not include repetition of a symbol within a sentence that creates noticeable patterns.3 Although Dienes et al. (1999) compared the behavior of their network with human data, they did not analyze how the abstract knowledge was represented in the network after training.
We believe that one of the important goals of connectionist modeling (see Hanson & Burr, 1990) should be to examine the representational and mechanistic aspects of network processing postlearning. This should be concomitant with showing that the network has similar behavioral patterns to human performance in the same task. If we can find correlates of grammatical knowledge in terms of grammatical features, parsing states, and so on as the result of mathematical analysis of the network weights or activity, we can say that grammatical knowledge emerged in the network and think of mathematical entities (such as the patterns of weights in the networks and attractors4 associated with states in an FSM in the hidden unit activity space) as physical tokens that emerged through learning.

3 For instance, if repetitions of symbols are allowed, BDCCCB may be recognized as analogous to MXVVVM (Dienes et al., 1999).

4 Although attractors are often used to describe continuous time systems, the same notion can also be applied to discrete time systems. In the current network, attractors are defined as follows: For each point in the hidden state space at time t, there is a new position in the same space at t + 1 for each input symbol. Hence, there is a vector field, or a phase-space, for each input symbol (for instance, there is a phase-space for symbol A, another phase-space for symbol B, and so on). Attractors are the stable points in the phase-space that movements from surrounding areas are drawn toward. Note that phase-spaces can also be defined for sequences of symbols. For instance, if an input sequence ABC causes the hidden-layer activity to return to the original position (before the input ABC) in the state-space, such a cyclic movement through three different phase-spaces (corresponding to A, B, and C) reduces to a point (possibly an attractor) in the phase-space defined for the input sequence ABC.

Another possible disparity between network and human language processing is that Dienes et al. (1999) required freezing of some of the weights so that the network's existing vocabulary knowledge would not be interfered with by learning a new vocabulary. It is unclear whether such "weight freezing" is at work in humans at some cognitive level, nor is it clear that there is some physiological sense in which weight freezing might be implemented in a biological system. Given that adults show the vocabulary transfer effects, weight freezing cannot be attributed to a developmental factor. One could hypothesize a mechanism that controls the rate of learning in humans, but if a model can account for the transfer effect without such a mechanism, such a model would be preferred because of the simplicity of the explanatory theory that is embodied by it. It should also be noted that our network has one less hidden layer than Dienes et al.'s (1999), lacking what they call an encoding layer. Such a model would further be attractive because it could also be used to discover new mechanisms that might have a sensible biological plausibility. In this article, we describe a series of experiments with a recurrent neural network that creates abstract structure that is context sensitive, hierarchical, and extensible, without any special conditions such as freezing weights or otherwise creating any reference to past knowledge acquisition. Also, unlike Dienes et al.'s (1999) work, the focus of this article is on the analysis of the representation of both the resultant abstract knowledge and symbol-dependent knowledge.

It is well known that neural networks can induce the structure of an FSM from exposure to strings drawn from that FSM (Cleeremans, 1993; Giles, Horne, & Lin, 1995). In fact, it has been shown that if a neural network (under simple perturbations due to nothing more than starting point variation, for example, due to noise in data set order or gaussian noise injected into the weights; Hanson, 1990) can be reliably trained to recognize strings generated from an FSM, then the underlying state-space partitions of the neural network have no choice but to be in a one-to-one correspondence with the states of the minimal FSM. This result is consistent with a mechanism containing the minimal number of internal states that correctly accepts only the sentences created by the FSM (theorem 3.1 in Casey, 1996). Also, all cycles in the minimal FSM induce attractors in the hidden unit state-space of the recurrent network (theorem 3.3, Casey, 1996). These surprising theorems seem to suggest that learning an FSM may provide the opportunity for a neural network to factor out the underlying rules of a state machine in its attractor space from the actual symbols that cause the state transitions. Without Casey's theorems, it is possible to imagine that one sort of representation
that a neural network could learn about an FSM would involve the input encoding of the symbols being bound with the state representation in the hidden unit attractor space. This type of representation of an FSM would be very limited in its productivity or ability to transfer to other symbol sets or FSMs, since each state would depend on a specific input symbol or set of symbols. Casey's theorems inspire us to investigate a network's general ability to transfer between surface symbols of the same FSM. We investigate this possibility in the following studies. In all of the connectionist modeling works cited above (Elman, 1990, 1991; Cleeremans, 1993; Giles et al., 1995; Dienes et al., 1999), the networks have feedback connections, such as connections from a hidden layer to an input layer. One obvious question that arises is whether a network memory created by recurrent connections is essential not only for accomplishing the task but for extracting abstract knowledge. Networks without feedback connections or slow decays do not have the means to maintain a short-term memory and thus merely map inputs to outputs. Feedback connections make the past history of the network available at the present time. Our initial studies in this area that investigated similar tasks, including the Penzias problem, a simple counting type of FSM induced by a feedforward network (Denker et al., 1987), implied that feedback connections could be important for extracting and transferring abstract knowledge. In this task, the network was trained to count the number of clusters of a particular symbol. For instance, if the network was trained to count clusters of 1s, the output corresponding to 11111 was 1, the output corresponding to 01011 was 2, the output corresponding to 10101 was 3, and so on. We trained both feedforward and recurrent networks on the Penzias task. A permutation task was defined in a similar way to the vocabulary transfer task.
That is, the network that had been trained with 0s and 1s was trained with As and Bs, for example. The recurrent network showed transfer effects (savings in the number of training trials required to reach a performance criterion), but the feedforward network showed only interference effects (an increase in the number of training trials required to reach a performance criterion) even with significant increases in the capacity of the network. In the Penzias task, the best (and most transferable) strategy is to find 0-to-1 transitions; the position of each transition is irrelevant. Such a strategy can be achieved by a memory, which keeps information as to whether the network has seen a series of one or more 0 symbols or a series of one or more 1 symbols most recently. Context dependence could also be accomplished in a feedforward network with shared weights, in which the weight pattern that is used to represent left and right context dependence at one position is copied to weight patterns that are used in other positions. However, in shared weight networks, all inputs (or at least the past several inputs) have to be presented to the network simultaneously, which is just a simple form of memory. Context dependence is also necessary for the efficient processing of regular grammars, which would be made possible by a network memory that keeps track of the current FSM state. At least the results from our preliminary studies, including the Penzias task, would suggest that memory in the network is an important component of the ability of a neural network to transfer its abstract knowledge. Note that the transfer of knowledge between tasks is possible without feedback connections if inputs are not completely novel. It has been shown that learning in a new task is significantly accelerated by transferring weights in neural networks without feedback connections if the same input domain is used (Pratt, Mostow, & Kamm, 1991; Pratt, 1993). Analogy is another process that is closely related to knowledge abstraction and transfer. Once enough evidence is accumulated to infer mappings from the old domain to the new domain, abstract knowledge in the old domain can be freely applied in the new domain. Most neural networks that have been used for analogical processing are embedded in a symbol processing system that has complex predefined architectures and typically do simple constraint satisfaction between the source and target domains (Thagard & Verbeurgt, 1998; Burns, Hummel, & Holyoak, 1993). In this article, a simple neural network with recurrent connections from the hidden layer to the input is examined for its ability to abstract knowledge, and the behavior of the hidden layer is analyzed in an effort to understand how the abstract knowledge is acquired. This type of network provides a simpler basis for studying the network's complexity and representational bias than the highly structured networks of previous works.

3 Simulations

3.1 The Training Paradigm. In this work, we train a recurrent neural network (RNN; see Figure 3) on symbol sequences generated from FSMs. The network is trained with randomly generated grammatical sentences until performance of the network meets a learning criterion.5 Each symbol in the sentences is presented to the network sequentially. Each symbol is
5 Specifically, after each training sentence, the network was tested with 1,000 randomly generated sentences. The criterion for termination of training was the correct prediction of all allowable end points in these sentences. A margin of error was chosen in order to reduce false positives. Two thresholds were used to bound network responses that were either too low to be certain of an end point or too high to be certain of a nonendpoint. Falling within the margin was classified as an error. Thresholds were initialized to 0.20 (high-value threshold) and 0.17 (low-value threshold) and were adapted at the end of each test phase using output values that were generated during processing of the test sentences by the network. The high threshold was modified to the minimum value yielded at the actual end of the sentence, minus a margin (0.01). Note that actual sentence end points rather than potential end points were used. Explicitly providing the network with information concerning potential end points was seen as providing too much information about the FSM in the external error signal. These modified thresholds were used for the next training and test sentences.
On the Emergence of Rules in Neural Networks
2253
Figure 3: The recurrent network architecture used in the simulations. The feedback-layer activity is a copy of the hidden-layer activity at a previous time step and thus provides a simple form of memory. Weights from the hidden layer to the feedback layer are one to one and are of a fixed strength. All other weights are subject to adaptation during both source and target grammar and vocabulary learning.
represented as an activation value of 1.0 of a unique node in the input layer, while all other node activations are set to 0.0. The task for the network is to predict if the sentence can end with the current word (in which case the output-layer node activation was trained to become 0.85) or not (output should be 0.15). Note that when the FSM is at an accepting state, a sentence can either end or continue. Therefore, at this state, the network is sometimes taught to predict the end of a sentence and sometimes not. However, the network eventually learns to yield higher output node activation when the sentence can end, as will be shown. Input sentences are generated by an FSM, with equal probabilities of state transitions emanating from each state. Sentences longer than 20 symbols are filtered out.6 Because of this limitation in the input sentence length, we do not claim that the network has acquired full grammatical competence. Rather, we are interested in the formation of the representation of the grammatical knowledge that faithfully corresponds to the FSM through limited exposure to inputs.

3.2 The Network. The network architecture (see Figure 3) embodies the basic mechanism of any FSM: the combination of current input (the input

6 Because of this generation scheme, sentences with shorter lengths are generated more often than longer ones during training and test. However, because the network is tested with 1,000 sentences, it is in practice impossible for the network to pass the test phase by being tested on only relatively short sentences given the distribution of sentence lengths sampled.
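The sentence-generation scheme just described can be sketched as follows. The particular three-state FSM below is a hypothetical stand-in (not one of the paper's grammars), and the probability of stopping at an accepting state is an illustrative choice:

```python
import random

# A hypothetical 3-state FSM: TRANSITIONS[state] = [(symbol, next_state), ...]
# All arcs leaving a state are taken with equal probability.
TRANSITIONS = {
    1: [("A", 2), ("B", 3)],
    2: [("B", 3), ("C", 1)],
    3: [("A", 1), ("C", 2)],
}
ACCEPTING = {3}

def generate_sentence(max_len=20, start=1, p_stop=0.5):
    """Random walk over the FSM, optionally stopping at an accepting
    state; sentences exceeding max_len are filtered out (None)."""
    state, symbols = start, []
    while True:
        if state in ACCEPTING and random.random() < p_stop:
            return symbols
        if len(symbols) == max_len:
            return None  # filtered out, as in the text
        symbol, state = random.choice(TRANSITIONS[state])
        symbols.append(symbol)
```

In training, each emitted symbol would then be presented as a one-hot input, with the end-prediction target 0.85 at positions where the sentence may end and 0.15 elsewhere.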
Stephen José Hanson and Michiro Negishi
layer) and the current state (the feedback layer) determines the current output (the output layer) and the next state (the hidden layer). The network has an input layer with 13 nodes (three symbols times four vocabulary sets, plus the START symbol), an output layer with one node, one hidden layer with four nodes, and a feedback layer with four nodes. The network architecture is a special case of a second-order recurrent neural network (Giles et al., 1992) in which the output layer is not connected to the feedback layer. This modification enables us to analyze hidden-layer activities independent of the desired output. Both the output layer and the hidden layer receive first-order and second-order connections. Each first-order connection connects one source node to one destination node; the weight is multiplied by the source node activity before it is added to the destination node potential. In contrast, a second-order connection connects two source nodes (one node in the input layer and another in the feedback layer in the current network) to one destination node; the weight is multiplied by the product of the two source node activities before it is added to the destination node potential. (See the appendix for a complete set of activation and training equations.) At the beginning of each sentence, the hidden- and feedback-layer activities are reset to zero, and the inputs are processed one symbol at a time, beginning with the START symbol. The weights are modified using a standard learning algorithm developed by Williams and Zipser (1989), extended to second-order connections (Giles et al., 1992). The learning rate was set to 0.2 and the momentum coefficient to 0.8. Except for the "copy-back" weights from the hidden layer to the feedback layer, all weights in the RNN were candidates for adaptation; no weights were fixed prior to or during learning.

3.3 Simulation 1: The Vocabulary Transfer Task.
In the first task, new symbols were cyclically assigned to the arcs in the FSM. The network was trained on three regular grammars (the source grammar) that had the same syntactic structure defined on three unique sets of symbols ({A,B,C}, {D,E,F}, and {G,H,I}), and the effect of prior training on these symbol sets was measured as the network was trained with yet another new set of symbols ({J,K,L}). This task explicitly examined the ability of the network to transfer to novel, unknown symbol sets owing to prior training on the same source grammar. The method of training and the stopping criteria are exactly as described in section 3.1. One indicator of the vocabulary transfer effect is the savings in the number of trials needed to meet the completion criterion. In the simulation, three source grammars were used for three cycles, resulting in nine training sessions (or eight vocabulary switchings) before the sentences from the target grammar were presented. Figure 4 shows the number of trials for both the source grammar training (vocabulary switching = 0, 1, . . . , 8 in the figure) and the target grammar training (vocabulary switching = 9), averaged over 20 networks with different initial random weights. The error
Figure 4: The learning savings from subsequent relearnings of the vocabularies. Each data point represents the average of 20 networks trained to the criterion on the same grammar. The relearning cycle shows an immediate transfer to the novel symbol set, which continues to improve to near-perfect transfer through the eighth vocabulary switch (the ninth session: three cycles of the three symbol sets), until the ninth switching, where a completely novel set is used with the same grammar. Over 60% of the original learning on the grammar, independent of symbol set, is saved.
bars show standard errors. The result of eight vocabulary switchings (50,000 sentences) is a complete accommodation of the new symbol sets with 98% savings (as shown in the eighth cycle of Figure 4, where the number of trials for learning has dropped to 2% relative to the number of trials required for the first learning of the source FSM). This accommodation represents the network's ability to create a data structure that is consistent with a number of independent vocabularies. Although this does not seem necessarily surprising, it does mean that there was no catastrophic forgetting (McCloskey & Cohen, 1989). We think this is worth pointing out, considering that Dienes et al. (1999), who employed local encoding as we did, needed to freeze some weights in order to prevent catastrophic forgetting in the vocabulary transfer task (which they call a domain transfer task). One difference between our network and theirs is that our network has second-order connections in addition to first-order connections, whereas Dienes et al.'s network has only first-order connections. This difference in the architecture may have an impact on the generalization among vocabularies and forgetting of old vocabularies. Every second-order connection to the output layer and to the hidden layer is associated with one input-layer node. This means that while the network is trained on one vocabulary, second-order connections that are associated with other sets of vocabularies participate in neither the computation of activities nor the modification of weights. Thus, second-order weights do not contribute to generalization, but they resist forgetting. In contrast, first-order weights from the feedback layer are independent of input-layer activities and thus contribute to generalization but are prone to forgetting. We postulate that first-order weights are more protected from forgetting by having second-order weights because during the learning of a new vocabulary, second-order weights can be modified in a way that is specific to the combination of the input symbol and the current hidden-layer state. Thus, on average, second-order weights, having more degrees of freedom, accommodate much of the input- and state-dependent changes. By contrast, first-order weights, which are affected by all input words in all vocabularies, undergo fewer vocabulary-specific changes overall, although they are not perfectly protected from modification. More critically, however, there was a 63% reduction in the amount of required training for the new, unseen vocabulary (again as compared to the first learning of the source FSM). At first glance, this savings may not look impressive, given that humans are known to be able to infer grammatical features of unknown symbols in a sentence even without learning (Berko, 1958). However, it is unclear how well humans do if all symbols in a sentence are novel.
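The savings percentages quoted here reduce to a one-line comparison of trial counts. The definition below is a sketch, and the trial counts in the usage note are illustrative, not the measured values:

```python
def savings(trials_first, trials_later):
    """Fraction of training trials saved relative to the first
    learning of the source FSM (0.98 means 98% savings)."""
    return 1.0 - trials_later / trials_first
```

For example, a network that first needed 10,000 sentences and later only 200 would show `savings(10000, 200)` of 0.98, the regime reached after eight vocabulary switchings.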
We are unaware of an experiment that demonstrated one-shot or very quick learning in language,7 but in an artificial grammar learning experiment reported by Brooks and Vokey (1991), subjects were able to label only 64% of new grammatical sentences as grammatical even if the vocabulary was known and the new sentence was similar to training sentences. For sentences that had the same syntactic structures as the training set but consisted of unknown symbols, they labeled only 59% of grammatical sentences as grammatical, showing that the learning was far from instantaneous. Apparently, the RNN is able to partially factor the transition rules from their constituent symbol assignments after mere exposure to a diversity of vocabularies. How is the neural network accomplishing these abstractions? Note that in the vocabulary transfer task, the network has no possible way to transfer based on the external symbol set. It follows that the network must abstract away from the input encoding. In effect, the network must find a way to recode the input in order to defer binding of variables in rules to symbols

7 Marcus, Vijayan, Bandi Rao, and Vishton's (1999) experiment, in which seven-month-old infants could transfer phonological rules to novel sounds, is pertinent to this problem. However, the infants may have been attending to phonological features that were not novel (Negishi, 1999).
until enough symbol sequences have been observed (and simply continue to produce errors on what appear at the surface to be new sentences). If the network extracted the common syntactic structure, the hidden-layer activities would be expected to represent the corresponding FSM states regardless of vocabularies. This is in fact shown by linear discriminant analysis (LDA).8 After the network learned the first vocabulary, activity of the hidden layer was sensitive to FSM states (see Figure 5a). In the figure, values of two linear discriminants after each input symbol presentation during the test phase are plotted. Different FSM states are plotted with different colors. The figure shows a clear separation among colors, indicating that the state-space is organized in terms of the FSM states. We observed that cyclic trajectories through these clusters represent attractors in the state-space. If two transitions corresponding to the same input symbol start at nearby points in the state-space within one cluster that corresponds to the current FSM state, the ending points after the transitions are also close to each other within a cluster that corresponds to the next FSM state. Furthermore, we observed that even if one starts at the periphery of a cluster (a state far from the center of a cluster), the state at the end of the transition will be closer to the center of the new cluster (these cases are not shown). Hence, this space possesses context sensitivity in that coordinate positions encode both state and trajectory information.9 After each of the three vocabularies was learned in three cycles, LDA of the hidden-layer node activities with respect to FSM states (see Figure 5b; colors signify FSM states and shapes signify vocabularies) was contrasted with LDA with respect to vocabulary sets (see Figure 6a; colors signify vocabularies and shapes signify FSM states).
In Figures 5a, 5b, 6a, and 6b, LDA was applied to the hidden unit activations of one network to demonstrate the development of the phase-space structure. To show that such phase-space organization was not formed by chance, LDA was applied to 20 networks, and a measure of separability of the state-space with respect to FSM states was computed. One measure of such separability is the correct
8 LDA of the hidden unit states does a complete search of linear projections that are optimally consistent with organizations based on FSM states or on vocabulary. Like PCA, Fisher's LDA extracts the first linear combination of variables (the first discriminant function), the one that maximizes the ratio of between-group to within-group variance (between and within states or vocabularies in this case) over all groups, then extracts the second combination for the next largest ratio, and so on. Evidence from the LDA for state representations would indicate that the RNN found a solution to the multiple vocabularies by referencing them hierarchically within each state based on context sensitivity within each vocabulary cluster. The first two linear discriminants accounted for the bulk of the variance (more than 90%) and are used for reprojection of sample-sentence hidden unit activity patterns.

9 Here the phrase context sensitivity is used as in dynamical systems theory and not as in linguistics: it denotes the effect of the attractor structure in a dynamical system, not the effect of environments on a local syntactic structure.
Figure 5: LDA of hidden-layer activities. (a) LDA of hidden activities of a network that learned a single FSM-symbol set. The colors code state (1: red, 2: blue, 3: green), and the + sign codes the single symbol set. It can be seen that FSM states are pairwise separable in the hidden-layer activity space. (b) LDA of the hidden units with respect to states after the RNN has learned three FSMs with three different symbol sets, using state as the discriminant variable. Notice how the hidden unit space spreads out compared to a. The space is organized by clusters corresponding to states (coded by color: red = 1, blue = 2, green = 3), which are internally differentiated by symbol sets (+ = ABC, triangle = DEF, square = GHI).
Figure 6: LDA with respect to vocabularies and LDA after transfer to a new symbol set. (a) LDA of hidden unit activities with respect to vocabularies after the RNN has learned three FSMs and three different symbol sets. The LDA used the symbol set as the discriminant. The color codes the vocabularies (red: ABC, blue: DEF, green: GHI), and the shape codes the FSM states (+: 1, triangle: 2, square: 3). In this case, the discriminant function based on the symbol set produces no simple spatial classification compared to Figure 5b, which shows the same activations classified by the state of the FSM. (b) LDA of the hidden state-space after training on three independent symbol sets for three cycles and then transfer to a new untrained symbol set (coded by stars, circled for clarity). Note how the new symbol set slides into the gaps between symbol sets previously learned (see Figure 5b, which is essentially identical except for the new symbol set). Apparently, this provides initial context sensitivity for the new symbol set, creating the 60% savings.
rate of discrimination using linear discriminants. One linear discriminant divides the state-space into two regions: one in which the linear discriminant is positive and another in which it is negative. Two linear discriminants divide the space into four regions. Because there are only three FSM states, if FSM states are completely (pairwise) linearly separable in the phase-space, two linear discriminants with respect to the FSM states should be sufficient to discriminate all FSM states correctly. Similarly, two linear discriminants with respect to vocabularies should be sufficient to discriminate the three vocabularies correctly. The correct rate of discrimination after eight vocabulary switchings clearly shows that the state-space is organized by FSM states rather than by vocabulary sets: FSM states could be correctly classified by the two linear discriminants with respect to the FSM states with an accuracy of 91% (SD = 7%, n = 20), whereas the vocabulary sets could be correctly classified only 73% of the time (SD = 10%, n = 20). Note that LDA takes the number of classes in each case into account when determining accuracy, although here there was an equal number of classes for both vocabulary and state. This means that the hidden layer learned a mapping from the input and current state-space, which is not pairwise separable by states, into the next state-space, which is pairwise separable by states, greatly facilitating the state transition and output computation at the next input. Notice in both Figures 5b and 6b, relative to 5a, that the symbol sets have spread out and occupy more of the hidden unit space, with significant gaps between clusters of the same and different symbol sets. Moreover, from Figure 5b, one can also see that in each state (coded by the colors), each vocabulary (coded by the plotting symbols) clusters together.
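The two-discriminant accuracy measure just described can be illustrated on synthetic data. This sketch uses numpy only; the clustered "hidden activities" below are fabricated stand-ins for the networks' actual states, and nearest-class-mean classification in the discriminant space is one simple way to score the projection:

```python
import numpy as np

def lda_accuracy(X, labels, n_disc=2):
    """Project onto the leading Fisher discriminants and classify by
    the nearest class mean; return the correct rate."""
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - grand_mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    W = np.real(evecs[:, np.argsort(-np.real(evals))[:n_disc]])
    Z = X @ W
    means = {c: Z[labels == c].mean(axis=0) for c in classes}
    preds = np.array([min(classes, key=lambda c: np.linalg.norm(z - means[c]))
                      for z in Z])
    return float(np.mean(preds == labels))

# Synthetic stand-in: one tight cluster per FSM state, with each
# vocabulary a small sub-cluster offset inside its state cluster.
rng = np.random.default_rng(0)
X, state_lab, vocab_lab = [], [], []
for s in range(3):
    center = 6.0 * np.eye(4)[s]
    for v in range(3):
        offset = 0.5 * rng.standard_normal(4)   # vocabulary sub-cluster
        X.append(center + offset + 0.1 * rng.standard_normal((40, 4)))
        state_lab += [s] * 40
        vocab_lab += [v] * 40
X = np.vstack(X)
acc_state = lda_accuracy(X, np.array(state_lab))
acc_vocab = lda_accuracy(X, np.array(vocab_lab))
```

With state-organized clusters like these, the state-based discriminants classify nearly perfectly while the vocabulary-based ones do not, mirroring the 91% versus 73% contrast in the text.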
Although vocabularies are not always pairwise separable in this plot, the separation is remarkable considering that the linear discriminants are optimized solely to discriminate the states. We can test the vocabulary separability within each state directly by doing an LDA within each state using vocabulary as the discriminant variable. In this case, vocabularies were correctly discriminated with an average accuracy of 95% (SD = 7%, n = 20). This is strong evidence that the hidden-layer activity states are hierarchically organized into FSM states and vocabularies. This hierarchical structure allows for the accommodation of the already-learned vocabularies and any new ones the RNN is asked to learn. LDA after the test vocabulary is learned once also shows that the network state is predominantly organized by FSM states (see Figure 6b), although the linear separation by FSM states of a small fraction of activities is compromised, as can be seen at the boundary of red crosses, blue stars, and triangles in the figure. This interference by the new vocabulary is not surprising considering that old vocabularies were not relearned after the new vocabulary was learned. What is more interesting is the spatial location of the new vocabulary ("stars"). The hidden unit activity clearly shows that the state discriminant structure is dominant and organizes the symbol sets. It seems that the fourth vocabulary simply fills empty spots in the hidden unit
space to code it in a position relative to the existing state structure. This can be seen by comparison to Figure 5b, which is almost identical to the linear projections that were found without the new symbol set. This retention of the projections of the old space is not surprising, since the configuration prior to exposure to the new symbol set should not change because of the extensive prior vocabulary training. Apparently, the precedence of the existing abstraction encourages use of the hierarchical state representation and its existing context sensitivity. The network seems to bootstrap from existing nearby vocabularies, which allows generalization to the same FSM that all symbol sets are using. Several factors may make such bootstrapping possible. First, initial connection weights from novel symbols are small random weights, creating quite different patterns of activation in the hidden layer than old symbols do through the more articulated learned weights (weight strengths with a wider dynamic range). Second, the probability distribution of FSM states is the same regardless of the vocabulary set, and this information is reflected in the first-order weights. Third, the gradient-descent algorithm finds a most effective modification to the weights to perform the prediction task. For these reasons, we can view what the network is doing as a kind of analogical learning process (e.g., D is to E as A is to B).

3.4 Simulation 2: The Syntax Transfer Task. Humans can apply abstract knowledge (such as the structure of symbol sequences) to solving problems even if the abstract structure required for the task is slightly different from the acquired one (in playing card games, for example, the knowledge of strategy in one game often transfers to another). Does a neural network have the same flexibility?
In the second simulation, we carried out the syntax transfer task, in which the target syntactic structure was changed from the acquired one while the vocabulary was kept constant. In the first syntax transfer task, only the initial and the final states in the FSM were changed, so the subsequence distributions are almost the same (see Figure 7a). In this case, 47% and 49% savings were observed in the two transfer directions. In the second syntax transfer task, the directions of all arcs in the FSM were reversed (see Figure 7b); therefore, the mirror image of a sentence accepted in one grammar is accepted in the other grammar. Although the grammars were very different, there is a significant amount of overlap in the permissible subsequences, and there were 19% and 25% savings in training. In the third and fourth syntax transfer tasks, the source and the target grammars share fewer subsequences. In the third case (see Figure 7c), the subsequences were very different because the source grammar has two one-state loops (at states 1 and 3) with the same symbol A, whereas the two one-state loops in the target grammar consist of different symbols (A and B). Also, the two-state loops (states 2 and 3) consist of different symbols (BCBC… in the source grammar and CCCC… in the target grammar). In this case, there was an 18% reduction in the number of trials required in one transfer direction but a 29% increase in the other
Figure 7: Syntax transfer results as a function of source-to-target difference in grammar. Numbers at the horizontal arrow tails signify the number of training sentences required for the source grammar, and numbers at the arrow heads signify the number required for the target grammar. For instance, the left grammar in a required 20,456 training sentences when it was the source grammar and 10,760 when it was the target grammar; thus, the effect of learning the grammar on the right-hand side was a 47.4% reduction in required training sentences. Numbers in parentheses are standard errors (N = 20).
direction. In the fourth case (see Figure 7d), one grammar includes two one-state loops with the same symbol (AAA… in states 1 and 3), whereas in the other grammar they consist of different symbols (AAA… in state 1 and BBB… in state 3). In this case, there were 13% and 14% increases in the number of trials required. The fact that there is interference (an increase in the number of required training trials) as well as transfer indicates that finding a correct mapping is a hard problem for the network. From these observations, we speculated that if the acquired grammar allows many subsequences of symbols that are also allowed by the target grammar, the transfer is easier, and therefore there will be more savings.10 The simulation results are consistent with the finding in human artificial grammar learning that the transfer effect persists even when the new strings violate the syntactic rules slightly (Brooks & Vokey, 1991).

4 Conclusion
It has been shown that the amount of training needed to learn the end-prediction task on a new grammar is reduced in the vocabulary transfer paradigm as well as in the syntax transfer paradigm (given that the source and the target grammars were similar). We have shown that previous experience with examples drawn from rules (governing an FSM) can shape the acquisition process and speed of new, though similar, rules in simple associationist neural networks. The linear discriminant analysis of the hidden-layer activities showed that the activity space was hierarchically organized in terms of FSM states and vocabularies. The trajectories in the state-space showed context sensitivity, and the reorganization of the state-space showed a simple type of analogical process that would support the vocabulary transfer process. It has been argued that unstructured (minimal-bias) neural networks with general learning mechanisms are incapable of representing, processing, and generalizing symbolic information (Pinker, 1994; Fodor & Pylyshyn, 1988; Marcus et al., 1999). Evolutionary psychologists argue that humans have an innate symbol-processing mechanism that has been shaped by evolution. Pinker, for one, argues that there must be two distinct mechanisms for language, an associative mechanism and a rule-based mechanism, the latter being equipped with a nonassociative learning mechanism or arising from a genetic basis (Pinker, 1991). The other alternative, as demonstrated by our experiments, is that neural networks that incorporate associative mechanisms can both be sensitive to the statistical substrate of the world and exploit data structures that have the property of following a deterministic rule. As demonstrated, these new data structures can arise even when that
10 To confirm this conjecture, we sought a measure of the overlap of subsequences between the source and the target grammar. Our efforts to produce such a numeric measure have met with only moderate success (see Negishi & Hanson, 2001, for more details).
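One simple stand-in for such an overlap measure (an illustrative sketch, not the measure explored in Negishi & Hanson, 2001) is the Jaccard overlap of the adjacent-symbol pairs found in samples drawn from the two grammars:

```python
def bigrams(sentences):
    """Set of adjacent symbol pairs occurring in a sample of sentences."""
    return {(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)}

def subsequence_overlap(sample_a, sample_b):
    """Jaccard overlap of the bigram sets of two sentence samples:
    |A ∩ B| / |A ∪ B|, between 0 (disjoint) and 1 (identical)."""
    a, b = bigrams(sample_a), bigrams(sample_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

By the conjecture above, grammar pairs with a higher overlap (such as the mirror-image pair) would be expected to show more savings than pairs whose loops use different symbols.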
explicit rule has been expressed only implicitly by examples and learned by mere exposure to regularities in the data.

Appendix: Network Equations

A.1 Output-Layer Node i Activation.

$$X_i^O(t) = f\bigl(P_i^O(t)\bigr) = f\Bigl(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Bigr)$$

In the equations, $t$ is a time step that indicates the number of input words that have been seen; for instance, $t = 1$ when the network is processing the first word. In the equation above, the product of a feedback-layer node $j$ activity $X_j^S(t-1)$ (which is the state hidden-layer node $j$ activity at the previous time step) and an input-layer node $k$ activity $I_k(t)$ is weighted by a second-order weight $w_{ijk}^O$ and added to the output-layer node $i$ potential $P_i^O(t)$. A transfer function $f()$ is applied to the potential and yields the output-layer node $i$ activity $X_i^O(t)$. In the current network, there is only one output-layer node ($i = 1$). $X_0^S(t-1)$ and $I_0(t)$ are special constant nodes whose output values are always 1.0. As a result, $w_{i0k}^O$ serves as a first-order connection from the input node $k$ to the output-layer node $i$, $w_{ij0}^O$ serves as a first-order connection from the feedback-layer node $j$ to the output-layer node $i$, and $w_{i00}^O$ serves as a bias term for the output-layer node $i$.
A.2 State Hidden-Layer Node Activation.

$$X_i^S(t) = f\bigl(P_i^S(t)\bigr) = f\Bigl(\sum_{jk} w_{ijk}^S\, X_j^S(t-1)\, I_k(t)\Bigr)$$

The product of a feedback-layer node $j$ activity $X_j^S(t-1)$ and an input-layer node $k$ activity $I_k(t)$ is weighted by a second-order weight $w_{ijk}^S$ and added to the state hidden-layer node $i$ potential $P_i^S(t)$. The transfer function $f()$ is applied to the potential and yields the state hidden-layer node $i$ activity $X_i^S(t)$. Because of the special definitions of $X_0^S(t-1)$ and $I_0(t)$ described above, $w_{i0k}^S$ serves as a first-order connection from the input node $k$ to the state hidden-layer node $i$, $w_{ij0}^S$ serves as a first-order connection from the feedback-layer node $j$ to the state hidden-layer node $i$, and $w_{i00}^S$ serves as a bias term for the state hidden-layer node $i$.
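Equations A.1 and A.2 amount to a bilinear (second-order) forward step. A minimal numpy sketch, with illustrative shapes; index 0 of the feedback and input vectors is the constant-1 node, so first-order weights and biases appear as the $j = 0$ and $k = 0$ slices:

```python
import numpy as np

def logistic(p):
    """A common choice of transfer function f(); illustrative here."""
    return 1.0 / (1.0 + np.exp(-p))

def forward(symbols, W_S, W_O):
    """One sentence through the second-order network of A.1/A.2.
    symbols: input-node indices (>= 1; index 0 is the constant node).
    W_S: (H+1, H+1, K+1) state weights; W_O: (1, H+1, K+1) output weights."""
    X_S = np.zeros(W_S.shape[0])
    X_S[0] = 1.0                                  # constant feedback node
    outputs = []
    for k in symbols:
        I = np.zeros(W_S.shape[2])
        I[0], I[k] = 1.0, 1.0                     # constant node + symbol node
        bilinear = np.outer(X_S, I)               # X_j^S(t-1) * I_k(t)
        outputs.append(logistic(np.einsum("ijk,jk->i", W_O, bilinear))[0])
        hidden = logistic(np.einsum("ijk,jk->i", W_S, bilinear))
        X_S = np.concatenate(([1.0], hidden[1:])) # keep the constant node at 1
    return outputs
```

Resetting `X_S` to zero (apart from the constant node) at the start of each sentence corresponds to the per-sentence reset of hidden- and feedback-layer activities described in section 3.2.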
A.3 Output-Layer Weight Update.

$$\Delta w_{lmn}^O = -\alpha \sum_t \frac{\partial J(t)}{\partial w_{lmn}^O} = -\alpha \sum_i \sum_t E_i(t)\, \frac{\partial X_i^O(t)}{\partial w_{lmn}^O}$$

$$\frac{\partial X_i^O(t)}{\partial w_{lmn}^O} = \frac{\partial}{\partial w_{lmn}^O}\, f\Bigl(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Bigr) = f'\bigl(P_i^O(t)\bigr)\, \delta_{il}\, X_m^S(t-1)\, I_n(t)$$

The change in a second-order weight $w_{lmn}^O$ (from the feedback-layer node $m$ and the input-layer node $n$ to the output-layer node $l$) is the negative of the learning constant $\alpha$ times the partial derivative of the cost function $J(t)$, summed over the whole sentence. The partial derivative is equal to the error (output value minus desired output) of the output-layer nodes, $E_i(t)$, times the partial derivatives of the outputs $X_i^O(t)$, summed over all outputs $i$ (in the current network, there is only one output node). The partial derivative of the output node activity is computed using the chain rule ($dx/dy = (dx/dz)(dz/dy)$). See the equations for the output-layer node activation for variable descriptions. In the last line above, $f'()$ denotes the derivative of $f()$, and $\delta_{il}$ is a Kronecker delta whose value is one if $i = l$ and zero otherwise.
A.4 State Hidden Weight Update.

$$\Delta w_{lmn}^S = -\alpha \sum_t \frac{\partial J(t)}{\partial w_{lmn}^S} = -\alpha \sum_i \sum_t E_i(t)\, \frac{\partial X_i^O(t)}{\partial w_{lmn}^S}$$

$$\frac{\partial X_i^O(t)}{\partial w_{lmn}^S} = \frac{\partial}{\partial w_{lmn}^S}\, f\Bigl(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Bigr) = f'\bigl(P_i^O(t)\bigr) \sum_{jk} w_{ijk}^O\, I_k(t)\, \frac{\partial X_j^S(t-1)}{\partial w_{lmn}^S}$$

$$\frac{\partial X_i^S(t)}{\partial w_{lmn}^S} = f'\bigl(P_i^S(t)\bigr) \Bigl(\delta_{il}\, X_m^S(t-1)\, I_n(t) + \sum_{jk} w_{ijk}^S\, I_k(t)\, \frac{\partial X_j^S(t-1)}{\partial w_{lmn}^S}\Bigr)$$

The change in a second-order weight $w_{lmn}^S$ (from the feedback-layer node $m$ and the input-layer node $n$ to the state hidden-layer node $l$) is the negative of the learning constant $\alpha$ times the partial derivative of the cost function $J(t)$, summed over the whole sentence. The partial derivative is equal to the error (output value minus desired output) of the output nodes $E_i(t)$ times the partial derivatives of the output-layer nodes $X_i^O(t)$, this time with respect to a state hidden-layer weight, summed over all outputs $i$. The partial derivative of the output node activity is computed using the chain rule; in this case, we also need the partial derivatives of the state hidden-layer node activities $X_i^S(t)$, which are again computed using the chain rule. The time suffix $(t-1)$ of the state hidden-layer node activation $X_j^S(t-1)$ in the last line indicates that the partial derivative of a hidden-layer node activity with respect to a state hidden-layer weight is computed recursively. The initial value of the partial derivative (at the beginning of the sentence, $t = 0$) is assumed to be zero.
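The recursion in A.4 can be implemented by carrying the sensitivity tensor $\partial X_j^S/\partial w_{lmn}^S$ forward in time, in the style of Williams and Zipser (1989). This is a sketch with illustrative shapes: the state weights are assumed to be an array of shape (H+1, H+1, K+1) and the output weights (1, H+1, K+1), with node 0 the constant-1 node and $f$ taken to be the logistic function:

```python
import numpy as np

def logistic(p):
    return 1.0 / (1.0 + np.exp(-p))

def state_weight_gradient(symbols, targets, W_S, W_O):
    """Accumulate dJ/dw^S over one sentence, with J(t) = 0.5 * sum_i E_i(t)^2,
    by carrying sens[j, l, m, n] = dX_j^S(t-1)/dw_lmn^S forward in time."""
    H1, _, K1 = W_S.shape
    X_S = np.zeros(H1); X_S[0] = 1.0
    sens = np.zeros((H1, H1, H1, K1))        # zero at the sentence start
    grad = np.zeros_like(W_S)
    for k, target in zip(symbols, targets):
        I = np.zeros(K1); I[0], I[k] = 1.0, 1.0
        bilinear = np.outer(X_S, I)
        X_O = logistic(np.einsum("ijk,jk->i", W_O, bilinear))
        fpO = X_O * (1.0 - X_O)              # f'(P^O) for the logistic f
        # dX_i^O/dw_lmn^S = f'(P_i^O) sum_jk w_ijk^O I_k dX_j^S(t-1)/dw_lmn^S
        dXO = fpO[:, None, None, None] * np.einsum("ijk,jlmn->ilmn", W_O * I, sens)
        grad += np.einsum("i,ilmn->lmn", X_O - target, dXO)
        # recursion of A.4 for the next step's sensitivities
        hidden = logistic(np.einsum("ijk,jk->i", W_S, bilinear))
        fpS = hidden * (1.0 - hidden)
        delta_term = np.zeros_like(sens)
        idx = np.arange(H1)
        delta_term[idx, idx] = bilinear      # delta_il * X_m^S(t-1) * I_n(t)
        sens = fpS[:, None, None, None] * (
            delta_term + np.einsum("ijk,jlmn->ilmn", W_S * I, sens))
        sens[0] = 0.0                        # the constant node never changes
        X_S = np.concatenate(([1.0], hidden[1:]))
    return grad
```

The weight change of A.4 is then $-\alpha$ times this accumulated gradient (momentum omitted in this sketch).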
Acknowledgments
We thank Michael Casey for contributions to an earlier version of this work and Ben Martin Bly for comments and editing of earlier versions of this article. We also acknowledge an anonymous reviewer for reading over our manuscript and providing many useful comments and edits, and Gary Cottrell for many useful discussions and corrections to the article.
References

Berko, J. (1958). The child’s learning of English morphology. Word, 14, 150–177.
Brooks, L. R., & Vokey, J. R. (1991). Abstract analogies and abstracted grammars: Comments on Reber (1989) and Mathews et al. (1990). Journal of Experimental Psychology: General, 120, 316–323.
Burns, B. D., Hummel, J. E., & Holyoak, K. J. (1993). Establishing analogical mappings by synchronizing oscillators. In Proceedings of the Fourth Australian Conference on Neural Networks. Sydney, Australia.
Casey, M. (1996). The dynamics of discrete-time computation, with applications to recurrent neural networks and finite state machine extraction. Neural Computation, 8, 1135–1178.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Cleeremans, A. (1993). Mechanisms of implicit learning. Cambridge, MA: MIT Press.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., & Hopfield, J. J. (1987). Automatic learning, rule extraction and generalization. Complex Systems, 1, 877–922.
Dienes, Z., Altmann, G., & Gao, S-J. (1999). Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53–82.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
On the Emergence of Rules in Neural Networks
Elman, J. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–225.
Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness. Cambridge, MA: MIT Press.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. In S. Pinker & J. Mehler (Eds.), Connections and symbols. Cambridge, MA: MIT Press.
Giles, C. L., Horne, B. G., & Lin, T. (1995). Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8, 1359–1365.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.
Hanson, S. J. (1990). A stochastic version of the delta rule. Physica D, 42, 265–272.
Hanson, S. J., & Burr, D. (1990). What connectionist models learn: Learning and representation in connectionist models. Behavioral and Brain Sciences, 13, 471–518.
Hanson, S. J., & Kegl, J. (1987). PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. In Proceedings of the Ninth Annual Conference on Cognitive Science. Seattle, WA.
Harnad, S. (1990). The symbol grounding problem. Physica D, 42, 335–346.
Marcus, G. F., Vijayan, S., Bandi Rao, P., & Vishton, M. (1999). Rule learning by seven-month-old infants. Science, 283, 77–80.
McCloskey, M., & Cohen, N. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165.
Miller, G. A., & Stein, M. (1963). Grammarama I: Preliminary studies and analysis of protocols (Tech. Rep. No. CS-2). Cambridge, MA: Harvard University.
Negishi, M. (1999). A comment on G. F. Marcus, S. Vijayan, S. Bandi Rao, & P. M. Vishton "Rule learning by seven-month-old infants." Science, 284, 435.
Negishi, M., & Hanson, S. J. (2001). A study of grammar transfer effects in a second order recurrent network. Proceedings of International Joint Conference on Neural Networks 2001, 1, 326–330.
Pinker, S. (1991). Rules of language. Science, 253, 530–535.
Pinker, S. (1994). The language instinct. New York: Morrow.
Pinker, S. (1997). How the mind works. New York: Norton.
Pratt, L. Y. (1993). Discriminability-based transfer between neural networks. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 204–211). San Mateo, CA: Morgan Kaufmann.
Pratt, L. Y., Mostow, J., & Kamm, C. A. (1991). Direct transfer of learned information among neural networks. Proceedings of the American Association for Artificial Intelligence, 2, 584–589.
Reber, A. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Reber, A. (1969). Transfer of syntactic structure in synthetic languages. Journal of Experimental Psychology, 81, 115–119.
Redington, J., & Chater, N. (1996). Transfer in artificial grammar learning: A reevaluation. Journal of Experimental Psychology: General, 125, 123–138.
Stephen José Hanson and Michiro Negishi
Thagard, P., & Verbeurgt, K. (1998). Coherence as constraint satisfaction. Cognitive Science, 22, 1–24.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received June 13, 2000; accepted March 27, 2002.
ARTICLE
Communicated by Jonathan Victor
Information-Geometric Measure for Neural Spikes Hiroyuki Nakahara [email protected]
Shun-ichi Amari [email protected] Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Saitama, 351-0198, Japan
This study introduces information-geometric measures to analyze neural firing patterns by taking not only the second-order but also higher-order interactions among neurons into account. Information geometry provides useful tools and concepts for this purpose, including the orthogonality of coordinate parameters and the Pythagoras relation in the Kullback-Leibler divergence. Based on this orthogonality, we show a novel method for analyzing spike firing patterns by decomposing the interactions of neurons of various orders. As a result, purely pairwise, triple-wise, and higher-order interactions are singled out. We also demonstrate the benefits of our proposal by using several examples.
1 Introduction

One of the central challenges in neuroscience is to understand what and how information is carried by a population of neural firing (Georgopoulos, Schwartz, & Kettner, 1986; Abeles, 1991; Aertsen & Arndt, 1993; Singer & Gray, 1995; Deadwyler & Hampson, 1997; Parker & Newsome, 1998). Many experimental studies have shown, as a first step toward this end, that the mean firing rate of each single neuron can be significantly modulated by experimental conditions and may thereby carry information about these experimental conditions, that is, sensory and motor signals. Information conveyed by a population of firing neurons, however, may be not only a sum of mean firing rates. Other statistical structures embedded in the neural firing may also carry behavioral information. In particular, growing attention has been paid to the possibility that coincident firing, correlated firing, synchronization, or specific firing patterns may alter conveyed information or carry significant behavioral information, whether such a possibility is supported or discarded (Gerstein, Bedenbaugh, & Aertsen, 1989; Engel, König, Kreiter, Schillen, & Singer, 1992; Wilson & McNaughton, 1993; Zohary, Shadlen, & Newsome, 1994; Vaadia et al., 1995; Nicolelis, Ghazanfar, Faggin, Votaw, & Oliveira, 1997; Riehle, Grün, Diesmann, & Aertsen, 1997; Lisman, 1997;
Neural Computation 14, 2269–2316 (2002). © 2002 Massachusetts Institute of Technology
Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Maynard et al., 1999; Nadasdy, Hirase, Czurko, Csicsvari, & Buzsaki, 1999; Kudrimoti, Barnes, & McNaughton, 1999; Oram, Wiener, Lestienne, & Richmond, 1999; Nawrot, Aertsen, & Rotter, 1999; Baker & Lemon, 2000; Reinagel & Reid, 2000; Steinmetz et al., 2000; Laubach, Wessberg, & Nicolelis, 2000; Salinas & Sejnowski, 2001; Oram, Hatsopoulas, Richmond, & Donoghue, 2001). For this purpose, it is important to develop a sound statistical method for analyzing neural data. An obvious first step is to investigate a significant coincident firing between two neurons, that is, the pairwise correlation (Perkel, Gerstein, & Moore, 1967; Palm, 1981; Gerstein & Aertsen, 1985; Palm, Aertsen, & Gerstein, 1988; Aertsen, Gerstein, Habib, & Palm, 1989; Grün, 1996; Ito & Tsuji, 2000; Pauluis & Baker, 2000; Roy, Steinmetz, & Niebur, 2000; Grün, Diesmann, & Aertsen, 2002a, 2002b; Gütig, Aertsen, & Rotter, 2002). In general, however, it is not sufficient to test a pairwise correlation of neural firing because there can be triplewise and higher correlations. For example, three variables (neurons) are not independent in general even when they are pairwise independent. We need to establish a systematic method of analysis that includes these higher-order correlations (Abeles & Gerstein, 1988; Abeles, Bergman, Margalit, & Vaadia, 1993; Martignon, Von Hasseln, Grün, Aertsen, & Palm, 1995; Grün, 1996; Tetko & Villa, 1992; Victor & Purpura, 1997; Prut et al., 1998; Del Prete & Martignon, 1998; MacLeod, Bäcker, & Laurent, 1998; Martignon et al., 2000; Bohte, Spekreijse, & Roelfsma, 2000; Roy et al., 2000).
We are mostly interested in methods able to (1) analyze correlated firing of neurons, including higher-order interactions, and (2) connect such a technique with behavioral events, for which we use mutual information between firing and behavior (Tsukada, Ishii, & Sato, 1975; Optican & Richmond, 1987; Richmond, Optican, & Spitzer, 1990; McClurkin, Gawne, Optican, & Richmond, 1991; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Gawne & Richmond, 1993; Tovee, Rolls, Treves, & Bellis, 1993; Gochin, Colombo, Dorfman, Gerstein, & Gross, 1994; Abbott, Rolls, & Tovee, 1996; Rolls, Treves, & Tovee, 1997; Richmond & Gawne, 1998; Kitazawa, Kimura, & Yin, 1998; Sugase, Yamane, Ueno, & Kawano, 1999; Panzeri, Schultz, Treves, & Rolls, 1999; Panzeri, Treves, Schultz, & Rolls, 1999; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Samengo, Montagnini, & Treves, 2000; Panzeri & Schultz, 2001). To address these issues, this study uses the orthogonality of the natural and expectation parameters in the exponential family of distributions and proposes methods useful for analyzing a population of neural firing in a systematic manner, based on information geometry (Amari, 1985; Amari & Nagaoka, 2000) and the theory of hierarchical structure (Amari, 2001). By use of orthogonal coordinates, we will show that both hypothesis testing of neural interaction and calculation of mutual information can be drastically simplified. (For an extended abstract, see Nakahara & Amari, 2002.) This article is organized as follows. In section 2, we briefly give our perspective on the merits of using an information-geometric measure. In
section 3, we begin with an introductory description of information geometry, using two random binary variables, and treat the application of this two-variables case to the analysis of two neurons' firing. Section 4 discusses the interaction of three binary variables and shows how to extract pure triplewise correlation, which is different from pairwise correlation. Section 5 gives a general theory of decomposition of correlations among n variables and discusses some approaches to overcome practical difficulties that arise in this case. Section 6 provides illustrative examples. Section 7 contains the discussion.

2 Perspective

In this section, we state our perspective on the merits of using an information-geometric measure, briefly referring to a general case of $n$ neurons. A detailed discussion in the general case is given in section 5. We represent a neural firing pattern by a binary random vector variable so that the probability distribution of firing (of any number of neurons) can be exactly expanded by a log-linear model. Let $X = (X_1, \ldots, X_n)$ be $n$ binary variables, and let $p = p(x)$, $x = (x_1, \ldots, x_n)$, $x_i = 0, 1$, be its probability, where we assume $p(x) > 0$ for all $x$. Each $X_i$ indicates that the $i$th neuron is silent ($X_i(t_i) = 0$) or has a spike ($X_i(t_i) = 1$) in a short time bin, which is denoted by $t_i$. In general, $t_i$ can be different for each neuron, but here we assume $t_i = t$ for $i = 1, \ldots, n$ for simplicity and drop $t$ in the following notation (see section 6). Each $p(x)$ is given by $2^n$ probabilities,
\[
p_{i_1 \cdots i_n} = \mathrm{Prob}\{X_1 = i_1, \ldots, X_n = i_n\}, \quad i_k = 0, 1,
\qquad \text{subject to} \quad \sum_{i_1, \ldots, i_n} p_{i_1 \cdots i_n} = 1,
\]
and hence, the set of all the probability distributions $\{p(x)\}$ forms a $(2^n - 1)$-dimensional manifold $S_n$. One coordinate system of $S_n$ is given by the expectation parameters,
\[
\eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\}, \quad i = 1, \ldots, n,
\]
\[
\eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\}, \quad i < j,
\]
\[
\eta_{12 \cdots n} = E[x_1 \cdots x_n] = \mathrm{Prob}\{x_1 = x_2 = \cdots = x_n = 1\},
\]
which have $2^n - 1$ components. This coordinate system is called the $\eta$-coordinates and, in a more general term, defines the m-flat structure in $S_n$ (see section 5). On the other hand, $p(x)$ can be exactly expanded by
\[
\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{1 \cdots n} x_1 \cdots x_n - \psi,
\]
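The expansion above can be inverted explicitly: each $\theta$ coefficient is an alternating (inclusion-exclusion) sum of log-probabilities over the subsets of its index set. A minimal sketch for small $n$ (the helper names are ours, not the article's):

```python
import itertools
import numpy as np

def theta_coordinates(p):
    """Solve log p(x) = sum_A theta_A prod_{i in A} x_i - psi for all theta_A,
    given the full probability table p over {0,1}^n in lexicographic order.
    Inclusion-exclusion: theta_A = sum_{B subset A} (-1)^{|A|-|B|} log p(1_B),
    where 1_B is the pattern with ones exactly on B."""
    n = int(np.log2(len(p)))
    logp = {x: np.log(p[i])
            for i, x in enumerate(itertools.product((0, 1), repeat=n))}
    def indicator(subset):
        return tuple(1 if i in subset else 0 for i in range(n))
    theta = {}
    for r in range(1, n + 1):
        for A in itertools.combinations(range(n), r):
            t = 0.0
            for s in range(len(A) + 1):
                for B in itertools.combinations(A, s):
                    t += (-1) ** (len(A) - len(B)) * logp[indicator(B)]
            theta[A] = t
    psi = -logp[indicator(())]   # normalization: -log p(0,...,0)
    return theta, psi
```

For $n = 2$ this reproduces the pairwise formulas given later in the text, for example $\theta_{12} = \log(p_{11} p_{00} / (p_{01} p_{10}))$.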
where the indices of $\theta_{ijk}$, etc., satisfy $i < j < k$, etc., and $\psi$ is a normalization term, corresponding to $-\log p(x_1 = x_2 = \cdots = x_n = 0)$. All $\theta_{ijk}, \ldots$, together have $2^n - 1$ components and form another coordinate system, called the $\theta$-coordinates, corresponding to the e-flat structure in $S_n$ (see section 5). Findings in information geometry assure us that e-flat and m-flat manifolds are dually flat. The $\eta$-coordinates and $\theta$-coordinates are dually orthogonal coordinates. The properties of the dual orthogonal coordinates remarkably simplify some apparently complicated issues. For example, the generalized Pythagoras theorem gives a decomposition of the Kullback-Leibler divergence by which we can inspect different contributions in the discrepancy of two probability distributions or contributions of different-order interactions in neural firing. This is a global property of the dual orthogonal coordinates in the probability space. As a local property, the dual orthogonal coordinates give a simple form of the Fisher information metric, which is useful, for example, in hypothesis testing. This study exploits these properties. In the next section, we start with the case of two neurons.

3 Pairwise Interaction, Mutual Information, and Orthogonal Decomposition

3.1 Orthogonal Coordinates. Let us begin with two binary random variables $X_1$ and $X_2$ whose joint probability $p(x)$, $x = (x_1, x_2)$, is given by
\[
p_{ij} = \mathrm{Prob}\{x_1 = i;\ x_2 = j\} > 0, \quad i, j = 0, 1.
\]
Among the four probabilities, $\{p_{00}, p_{01}, p_{10}, p_{11}\}$, only three are free, because of the constraint $p_{00} + p_{01} + p_{10} + p_{11} = 1$. Thus, the set of all such distributions of $x$ forms a three-dimensional manifold $S_2$, where the suffix 2 refers to the number of random variables in $x$. Any three of the $p_{ij}$ can be used as a coordinate system of $S_2$, which we call P-coordinates for later convenience. In the context of neural firing, the random variables $X_1$ and $X_2$ stand for two neurons: neuron 1 and neuron 2.
$X_i = 1$ and $X_i = 0$ indicate whether neuron $i$ ($i = 1, 2$) has a spike in a short time bin. A distribution $p(x)$ can be decomposed into marginal and (pairwise) correlational components. The two quantities,
\[
\eta_i = \mathrm{Prob}\{x_i = 1\} = E[x_i], \quad i = 1, 2,
\]
specify the marginal distributions of $x_i$, where $E$ denotes the expectation. Obviously, we have $\eta_1 = p_{10} + p_{11}$, $\eta_2 = p_{01} + p_{11}$. Let us put $\eta_{12} = E[x_1 x_2] = p_{11}$. The three quantities,
\[
\eta = (\eta_1, \eta_2, \eta_{12}), \qquad (3.1)
\]
form another coordinate system of $S_2$, called the $\eta$-coordinates. They are the coordinates of the expectation parameters in an exponential probability family in general (Cox & Hinkley, 1974; Barndorff-Nielsen, 1978; Lehmann, 1983). In the context of neural data, $\eta_1$ and $\eta_2$ are the mean firing rates of neurons 1 and 2, respectively, whereas $\eta_{12}$ is the mean rate of their coincident firing. The covariance,
\[
\mathrm{Cov}[X_1, X_2] = E[(x_1 - \eta_1)(x_2 - \eta_2)] = \eta_{12} - \eta_1 \eta_2,
\]
may also be considered as a quantity representing the degree of correlation of $X_1$ and $X_2$. Therefore, $(\eta_1, \eta_2, \mathrm{Cov}[X_1, X_2])$ can be another coordinate system. The term $\mathrm{Cov}[X_1, X_2]$ becomes zero when the probability distribution is independent, because we have $\eta_{12} = \eta_1 \eta_2$ in that case. There are many candidates to specify the correlation component. The correlation coefficient
\[
\rho = \frac{\eta_{12} - \eta_1 \eta_2}{\sqrt{\eta_1 (1 - \eta_1)\, \eta_2 (1 - \eta_2)}}
\]
is also such a quantity. The triplet $(\eta_1, \eta_2, \rho)$ then forms another coordinate system of $S_2$. The correlation coefficient is used to show the pairwise correlation of two neurons in N-JPSTH (Aertsen et al., 1989). Which quantity is convenient for representing the pairwise correlational component? It is desirable to define the degree of pairwise interaction independent of the marginals $\eta_1$ and $\eta_2$. To this end, we use the orthogonal coordinates $(\eta_1, \eta_2, \theta)$ such that the coordinate curve of $\theta$ is always orthogonal to those of $\eta_1$ and $\eta_2$. This characteristic is particularly desirable in the context of neural data, as shown later. Once such a $\theta$ is defined, we have a subset $E(\theta)$ for each $\theta$, a family of distributions having the same $\theta$ value (see Figure 1A). The $E(\theta)$ is a two-dimensional submanifold on which $(\eta_1, \eta_2)$ can vary freely but $\theta$ is fixed. We put the origin $\theta = 0$ where there is no correlation (i.e., $\eta_{12} = \eta_1 \eta_2$) for convenience (see below), and then $E(0)$ is the set of all the independent distributions. Similarly, we consider the set of all the probability distributions whose marginals are common, specified by $(\eta_1, \eta_2)$, but only $\theta$ is free. This is denoted by $M(\eta_1, \eta_2)$, forming a one-dimensional submanifold in $S_2$. The tangential direction of $M(\eta_1, \eta_2)$ represents the direction in which only the pure correlation changes, while the tangential directions of $E(\theta)$ span the directions in which only $\eta_1$ and $\eta_2$ change but $\theta$ is fixed. We now require that $E(\theta)$ and $M(\eta_1, \eta_2)$ be orthogonal at any point; that is, the directions of changes in the correlation and the marginals are to be mutually orthogonal. The orthogonality of two directions in $S_2$ is defined by using the Riemannian metric due to the Fisher information matrix (Rao, 1945; Barndorff-Nielsen, 1978; Amari, 1982; Nagaoka & Amari, 1982; Amari & Han, 1989; Amari & Nagaoka, 2000). Here, we define the orthogonality directly. Let us specify the probability distributions by $p(x; \eta_1, \eta_2, \theta)$. The directions of
Figure 1: (A) Schematic diagram of $E(\theta)$ and $M(\eta_1, \eta_2)$. (B) A simple example of the generalized Pythagoras decomposition.
small changes in the coordinates $\eta_i$ and $\theta$ are represented, respectively, by the score functions
\[
\frac{\partial}{\partial \eta_i}\, l(x; \eta_1, \eta_2, \theta) \quad (i = 1, 2), \qquad
\frac{\partial}{\partial \theta}\, l(x; \eta_1, \eta_2, \theta),
\]
where $l(x; \eta_1, \eta_2, \theta) = \log p(x; \eta_1, \eta_2, \theta)$. They are random variables, denoting how the log probability changes by small changes in the parameters in the respective directions. These directions are said to be orthogonal when the corresponding random variables
are uncorrelated,
\[
E\left[ \frac{\partial}{\partial \theta}\, l(x; \eta_1, \eta_2, \theta)\,
        \frac{\partial}{\partial \eta_i}\, l(x; \eta_1, \eta_2, \theta) \right] = 0, \qquad (3.2)
\]
where $E$ denotes the expectation with respect to $p(x; \eta_1, \eta_2, \theta)$. This implies that the cross components of $\theta$ and $\eta_i$ in the Fisher information matrix vanish. When the coordinate $\theta$ is defined to be orthogonal to the coordinates $\eta_1$ and $\eta_2$ of the marginals, we say that $\theta$ represents the pure correlation independent of the marginals. Such a $\theta$ is given by the following theorem.

Theorem 1. The coordinate
\[
\theta = \log \frac{p_{11} p_{00}}{p_{01} p_{10}} \qquad (3.3)
\]
is orthogonal to the marginals $\eta_1$ and $\eta_2$.

The proof can be shown by direct calculations, which is omitted here. A more general result is shown later. We have another interpretation of $\theta$. Let us expand $\log p(x)$ in the polynomial of $x$,
\[
\log p(x) = \sum_{i=1}^{2} \theta_i x_i + \theta_{12} x_1 x_2 - \psi. \qquad (3.4)
\]
Since $x_i$ takes on the binary values 0, 1, this is an exact expansion. The coefficient $\theta_{12}$ is given by equation 3.3, while
\[
\theta_1 = \log \frac{p_{10}}{p_{00}}, \qquad
\theta_2 = \log \frac{p_{01}}{p_{00}}, \qquad
\psi = -\log p_{00}. \qquad (3.5)
\]
We remark here that the above $\theta_{12}$ is well known, having frequently been used in the additive decomposition of log probabilities. It is 0 when and only when $X_1$ and $X_2$ are independent. The triple
\[
\theta = (\theta_1, \theta_2, \theta_{12})
\]
forms another coordinate system of $S_2$, called the $\theta$-coordinates. They are the coordinates of the natural parameters in the exponential probability family in general (Cox & Hinkley, 1974; Barndorff-Nielsen, 1978; Lehmann, 1983). Furthermore, the triple
\[
\zeta = (\eta_1, \eta_2, \theta_{12})
\]
forms an orthogonal coordinate system of $S_2$, called the mixed coordinates (Amari, 1985; Amari & Nagaoka, 2000).
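For two neurons, the coordinate transformations are easy to sketch. The forward map below transcribes equation 3.3 and the marginal formulas; the inverse map is solved numerically by bisection, which is our own device for the sketch (at fixed marginals, $\theta$ is strictly increasing in $p_{11}$):

```python
import numpy as np

def p_to_mixed(p00, p01, p10, p11):
    """P-coordinates -> mixed coordinates (eta1, eta2, theta12)."""
    eta1 = p10 + p11
    eta2 = p01 + p11
    theta12 = np.log(p11 * p00 / (p01 * p10))   # equation 3.3
    return eta1, eta2, theta12

def mixed_to_p(eta1, eta2, theta12, tol=1e-12):
    """Invert the map: find p11 by bisection on the monotone relation
    between p11 and theta12 at fixed marginals."""
    lo = max(0.0, eta1 + eta2 - 1.0) + tol
    hi = min(eta1, eta2) - tol
    while hi - lo > tol:
        p11 = 0.5 * (lo + hi)
        p00 = 1.0 - eta1 - eta2 + p11
        th = np.log(p11 * p00 / ((eta2 - p11) * (eta1 - p11)))
        if th < theta12:
            lo = p11
        else:
            hi = p11
    p11 = 0.5 * (lo + hi)
    return 1.0 - eta1 - eta2 + p11, eta2 - p11, eta1 - p11, p11
```

The round trip P-coordinates → mixed coordinates → P-coordinates recovers the original distribution, which is what "the MLE is retained by any coordinate transformation" relies on later in the article.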
3.2 KL-Divergence, Projections, and Pythagoras Relation. The Kullback-Leibler (KL) divergence between two probabilities $p(x)$ and $q(x)$ is defined by
\[
D[p : q] = \sum_x p(x) \log \frac{p(x)}{q(x)}. \qquad (3.6)
\]
The KL divergence provides a quasi-distance between two probability distributions, $D[p : q] \ge 0$ with equality if and only if $p(x) = q(x)$, whereas the symmetrical relationship does not generally hold; that is, $D[p : q] \ne D[q : p]$. Let $\bar{p}(x)$ be the independent distribution that is closest to a distribution $p(x)$,
\[
\bar{p}(x) = \operatorname{argmin}_{q \in E(0)} D[p : q],
\]
where $E(0)$ is the set of all the independent distributions. We call $\bar{p}(x) = \prod_i p_i(x_i)$ the m-projection of $p$ to $E(0)$ (see Figure 1B). Let the mixed coordinates of $p$ be $(\eta_1, \eta_2, \theta)$. The coordinates of $\bar{p}$ are given by $(\eta_1, \eta_2, 0)$ because of the orthogonality, so that
\[
\bar{p}(x) = \prod_i p_i(x_i; \eta_i) = p_1(x_1; \eta_1)\, p_2(x_2; \eta_2),
\]
where $p_i(x_i; \eta_i)$ is the marginal distribution of $p$. Interestingly, the minimized divergence is given by the mutual information
\[
D[p : \bar{p}] = I(X_1; X_2) = \sum p(x_1, x_2) \log \frac{p(x_1, x_2)}{p_1(x_1)\, p_2(x_2)}.
\]
We have another characterization of $\bar{p}$. Let $p_0$ be the uniform distribution whose mixed coordinates are $(0.5, 0.5, 0)$. Let $M(\eta_1, \eta_2)$ be the subspace that includes $p$. Then,
\[
\bar{p} = \operatorname{argmin}_{q \in M(\eta_1, \eta_2)} D[q : p_0].
\]
Such a $\bar{p}$ is called the e-projection of $p_0$ to $M(\eta_1, \eta_2)$, and it belongs to $E(0)$. Since we easily have $D[q : p_0] = -H[q] + H_0$, where $H[q]$ is the entropy of $q$ and $H_0 = 2 \log 2$ is a constant, $\bar{p}$ has the maximal entropy among those belonging to $M(\eta_1, \eta_2)$. This fact is called the maximum entropy principle (Jaynes, 1982). It is well known that we have the decomposition
\[
D[p : p_0] = D[p : \bar{p}] + D[\bar{p} : p_0].
\]
Now let us generalize the above observation and let $p(x)$ and $q(x)$ be two probability distributions whose mixed coordinates are $\zeta_p = (\eta_1^p, \eta_2^p, \theta_3^p)$ and $\zeta_q = (\eta_1^q, \eta_2^q, \theta_3^q)$, respectively. Let $r^*(x)$ be the m-projection of $p(x)$ to $E(\theta^q)$ and $r^{**}(x)$ be the e-projection of $p(x)$ to $M(\eta_1^q, \eta_2^q)$:
\[
r^*(x) = \operatorname{argmin}_{r \in E(\theta^q)} D[p : r], \qquad
r^{**}(x) = \operatorname{argmin}_{r \in M(\eta_1^q, \eta_2^q)} D[r : p].
\]
The mixed coordinates of $r^*$ and $r^{**}$ are explicitly given by $(\eta_1^p, \eta_2^p, \theta_3^q)$ and $(\eta_1^q, \eta_2^q, \theta_3^p)$, respectively. Hence, the following Pythagoras relation holds (see Figure 1B).

Theorem 2.
\[
D[p : q] = D[p : r^*] + D[r^* : q], \qquad (3.7)
\]
\[
D[q : p] = D[q : r^{**}] + D[r^{**} : p]. \qquad (3.8)
\]

Theorem 2 shows that the divergence $D[p : q]$ from $p$ to $q$ is decomposed into two terms, $D[p : r^*]$ and $D[r^* : q]$, where the former represents the degree of difference in their correlation and the latter the difference in their marginals.
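The simplest instance of this decomposition, $D[p : p_0] = D[p : \bar{p}] + D[\bar{p} : p_0]$ from the previous paragraphs, together with the identity $D[p : \bar{p}] = I(X_1; X_2)$, can be checked numerically; the example distribution below is arbitrary:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D[p : q] for distributions on {0,1}^2,
    given as arrays ordered (p00, p01, p10, p11)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def independent_projection(p):
    """m-projection of p onto E(0): the product of its marginals."""
    p00, p01, p10, p11 = p
    eta1, eta2 = p10 + p11, p01 + p11
    return np.array([(1 - eta1) * (1 - eta2), (1 - eta1) * eta2,
                     eta1 * (1 - eta2), eta1 * eta2])

p = np.array([0.3, 0.2, 0.1, 0.4])    # arbitrary example distribution
p_bar = independent_projection(p)
p0 = np.full(4, 0.25)                 # uniform distribution, (0.5, 0.5, 0)
# generalized Pythagoras relation: D[p : p0] = D[p : p_bar] + D[p_bar : p0]
lhs = kl(p, p0)
rhs = kl(p, p_bar) + kl(p_bar, p0)
```

Because $\bar{p}$ shares the marginals of $p$, the cross term in the divergence cancels exactly, so the relation holds to machine precision rather than only approximately.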
3.3 Local Orthogonality and Fisher Information. For any parameterization $p(x; \xi)$, the Fisher information matrix $G = (g_{ij})$ in terms of the coordinates $\xi$ is given by
\[
g_{ij}(\xi) = E\left[ \frac{\partial \log p(x; \xi)}{\partial \xi_i}\,
                      \frac{\partial \log p(x; \xi)}{\partial \xi_j} \right].
\]
This $G(\xi)$ plays the role of a Riemannian metric tensor. The squared distance $ds^2$ between two nearby distributions $p(x; \xi)$ and $p(x; \xi + d\xi)$ is given by the quadratic form of $d\xi$,
\[
ds^2 = \sum_{i,j \in (1,2,3)} g_{ij}(\xi)\, d\xi_i\, d\xi_j.
\]
It is known that this is approximately twice the KL divergence,
\[
ds^2 \approx 2 D[p(x; \xi) : p(x; \xi + d\xi)].
\]
When we use the mixed coordinates $\zeta$, the Fisher information is of the form
\[
G^\zeta = (g_{ij}^\zeta) =
\begin{pmatrix}
g_{11}^\zeta & g_{12}^\zeta & 0 \\
g_{12}^\zeta & g_{22}^\zeta & 0 \\
0 & 0 & g_{33}^\zeta
\end{pmatrix},
\]
as is seen from equation 3.2. This is the local property induced by the orthogonality of $\theta$ and $\eta_i$. In this case, by putting
\[
ds_1^2 = g_{33}^\zeta (d\zeta_3)^2, \qquad
ds_2^2 = \sum_{i,j \in (1,2)} g_{ij}^\zeta\, d\zeta_i\, d\zeta_j,
\]
we have the orthogonal decomposition
\[
ds^2 = ds_1^2 + ds_2^2, \qquad (3.9)
\]
corresponding to equation 3.2.
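The block-diagonal form of the Fisher matrix in the mixed coordinates can be verified numerically with finite-difference score functions; the bisection inversion of the mixed coordinates below is our own device for the sketch, not part of the article:

```python
import numpy as np

def dist(eta1, eta2, theta, tol=1e-14):
    """Distribution (p00, p01, p10, p11) with mixed coordinates
    (eta1, eta2, theta), found by bisection on p11 (theta is strictly
    increasing in p11 at fixed marginals)."""
    lo = max(0.0, eta1 + eta2 - 1.0) + 1e-12
    hi = min(eta1, eta2) - 1e-12
    while hi - lo > tol:
        p11 = 0.5 * (lo + hi)
        th = np.log(p11 * (1 - eta1 - eta2 + p11)
                    / ((eta2 - p11) * (eta1 - p11)))
        lo, hi = (p11, hi) if th < theta else (lo, p11)
    p11 = 0.5 * (lo + hi)
    return np.array([1 - eta1 - eta2 + p11, eta2 - p11, eta1 - p11, p11])

def fisher_mixed(eta1, eta2, theta, eps=1e-5):
    """Fisher information matrix in the mixed coordinates, estimated from
    central-difference score functions."""
    p = dist(eta1, eta2, theta)
    scores = []
    for k in range(3):
        z = np.array([eta1, eta2, theta], float)
        zp, zm = z.copy(), z.copy()
        zp[k] += eps
        zm[k] -= eps
        scores.append((np.log(dist(*zp)) - np.log(dist(*zm))) / (2 * eps))
    return np.array([[np.sum(p * si * sj) for sj in scores] for si in scores])
```

At an arbitrary point, the $(\eta_i, \theta)$ cross terms come out numerically zero, exhibiting the local orthogonality, while $g_{33}^\zeta$ matches the harmonic-mean form $1/\sum_{ij} p_{ij}^{-1}$ implied by moving $p_{11}$ at fixed marginals.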
We show the merits of the orthogonal coordinates for statistical inference. Let us estimate the parameters $\eta = (\eta_1, \eta_2)$ and $\theta$ from $N$ observed data $x_1, \ldots, x_N$. The maximum likelihood estimator is asymptotically unbiased and efficient, where the covariance of the estimation errors, $\Delta\eta$ and $\Delta\theta$, is given asymptotically by
\[
\mathrm{Cov}\begin{pmatrix} \Delta\eta \\ \Delta\theta \end{pmatrix}
= \frac{1}{N} \left( G^\zeta \right)^{-1}.
\]
Since the cross terms of $G^\zeta$ or $(G^\zeta)^{-1}$ vanish for the orthogonal coordinates, we have
\[
\mathrm{Cov}[\Delta\eta, \Delta\theta] = 0, \qquad (3.10)
\]
implying that the estimation error $\Delta\eta$ of the marginals and that of the interaction are mutually independent. Such a property does not hold for other, nonorthogonal parameterizations such as the correlation coefficient $\rho$ and the covariance. This property greatly simplifies procedures of hypothesis testing, as shown below.

3.4 Hypothesis Testing. Let us consider the estimation of $\theta$ and $\eta$ more directly. A natural estimate for the $\eta$-coordinates is
\[
\hat{\eta}_i = \frac{1}{N} \#\{x_i = 1\} \quad (i = 1, 2), \qquad
\hat{\eta}_{12} = \frac{1}{N} \#\{x_1 x_2 = 1\}. \qquad (3.11)
\]
This is the maximum likelihood estimator. The estimator $\hat{\theta}$ is obtained by the coordinate transformation from $\eta$- to $\theta$-coordinates,
\[
\hat{\theta} = \log \frac{\hat{\eta}_{12} (1 - \hat{\eta}_1 - \hat{\eta}_2 + \hat{\eta}_{12})}{(\hat{\eta}_1 - \hat{\eta}_{12})(\hat{\eta}_2 - \hat{\eta}_{12})}.
\]
Notably, the estimation of $\theta$ can be performed independently of the estimation of $\eta$ in the sense of equation 3.10. This brings a simple procedure of hypothesis testing concerning the null hypothesis,
\[
H_0: \theta = \theta_0, \quad \text{against} \quad H_1: \theta \ne \theta_0.
\]
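Equation 3.11 and the coordinate transformation to $\hat{\theta}$ can be sketched on simulated data; the generating distribution and seed below are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated joint spiking of two neurons over N time bins, drawn from a
# known distribution (p00, p01, p10, p11); true theta = log(1.25).
p_true = np.array([0.5, 0.2, 0.2, 0.1])
N = 50_000
x = rng.choice(4, size=N, p=p_true)      # 0:(0,0) 1:(0,1) 2:(1,0) 3:(1,1)
x1, x2 = x // 2, x % 2

# equation 3.11: maximum likelihood estimates of the eta-coordinates
eta1 = np.mean(x1)
eta2 = np.mean(x2)
eta12 = np.mean(x1 * x2)

# coordinate transformation to the interaction parameter theta
theta = np.log(eta12 * (1 - eta1 - eta2 + eta12)
               / ((eta1 - eta12) * (eta2 - eta12)))
```

With the chosen distribution the true values are $\eta_1 = \eta_2 = 0.3$ and $\theta = \log(0.1 \cdot 0.5 / (0.2 \cdot 0.2))$, so the estimates should land close to these for large $N$.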
In previous studies, under different frameworks (e.g., using N-JPSTH), the null hypothesis of independent firing is often examined. This corresponds to the null hypothesis $\theta_0 = 0$ in the current framework.
Let the log-likelihoods of the models $H_0: \theta = \theta_0$ and $H_1: \theta \ne \theta_0$, respectively, be
\[
l_0 = \max_{\eta} \log p(x_1, \ldots, x_N; \eta, \theta_0), \qquad
l_1 = \max_{\eta, \theta} \log p(x_1, \ldots, x_N; \eta, \theta),
\]
where $N$ is the number of observations. The likelihood ratio test uses the test statistic
\[
\lambda = 2 \log \frac{l_0}{l_1}, \qquad (3.12)
\]
which is subject to the $\chi^2$ distribution. With the orthogonal coordinates, the likelihood maximization with respect to $\eta = (\eta_1, \eta_2)$ and $\theta$ can be performed independently, so that we have
\[
l_0 = \log p(x_1, \ldots, x_N; \hat{\eta}, \theta_0), \qquad
l_1 = \log p(x_1, \ldots, x_N; \hat{\eta}, \hat{\theta}),
\]
where $\hat{\eta}$ denotes the same marginals in both models. If a nonorthogonal parameterization is used, this property does not hold. A similar situation holds in the case of testing $\eta = \eta_0$ against $\eta \ne \eta_0$ for unknown $\theta$. Now let us calculate the test statistic $\lambda$ in more detail. Under the hypothesis $H_0$, $\lambda$ is approximated for a large $N$ as
\[
\lambda = 2 \sum_{i=1}^{N} \log \left[ \frac{p(x_i; \hat{\eta}, \theta_0)}{p(x_i; \hat{\eta}, \hat{\theta})} \right]
\approx 2 N E_Q \left[ \log \frac{p(x; \hat{\eta}, \theta_0)}{p(x; \hat{\eta}, \hat{\theta})} \right]
\approx 2 N D\left[ p(x; \hat{\eta}, \theta_0) : p(x; \hat{\eta}, \hat{\theta}) \right]
\approx N g_{33}^\zeta (\hat{\theta} - \theta_0)^2, \qquad (3.13)
\]
where $E_Q$ is the expectation over the empirical distribution, and the approximation in the third line comes from our assumption of the null hypothesis $H_0$. Here, $g_{33}^\zeta$ is the Fisher information of the mixed coordinates $\zeta$ in the $\theta$ direction at $\zeta_0 = (\hat{\eta}; \theta_0)$, which is easily calculated as
\[
g_{33}^\zeta = g_{33}^\zeta(\zeta_0)
= \frac{\hat{\eta}_3 (\hat{\eta}_1 - \hat{\eta}_3)(\hat{\eta}_2 - \hat{\eta}_3)(\hat{\eta}_1 + \hat{\eta}_2 - \hat{\eta}_3 - 1)}{\hat{\eta}_1 \hat{\eta}_2 (\hat{\eta}_1 + \hat{\eta}_2 - 1 - 2\hat{\eta}_3) + \hat{\eta}_3^2}.
\]
Asymptotically, we have $\sqrt{N g_{33}^\zeta}\, (\hat{\theta}_3 - \theta_3) \sim \mathcal{N}(0, 1)$ and hence
\[
\lambda \sim \chi^2(1),
\]
where $m$ in $\chi^2(m)$ indicates the degrees of freedom of the $\chi^2$ distribution; in our case, the degree of freedom is 1. We must note that the above approach is valid regardless of whether $\theta_3 = 0$ or $\theta_3 \ne 0$. In contrast, the decomposition shown in equation 3.9 cannot exist, for example, for the coordinate system $(\eta_1, \eta_2, \rho)$, where $\rho$ is the correlation coefficient. The plane $\theta_3 = 0$, or $E(0)$, coincides with the plane $\rho = 0$, which is $\eta_3 = \eta_1 \eta_2$. However, $E(c)$ ($c = \mathrm{const} \ne 0$) cannot be equal to any plane defined by $\rho = c'$, where $c' = \mathrm{const}$. Only in the case of $\rho = 0$ is it possible to formulate testing for $\rho$ similarly to the above discussion, which is testing against the hypothesis of independent firing.

3.5 Application to Firing of Two Neurons. Here, we discuss the application of the theoretical results already noted to the firing of two neurons and relate different choices of the null hypothesis with the corresponding hypothesis testings. Given $N$ trials of an experiment, the probability distribution of $X$ in a time bin $[t, t + \Delta t]$ can be estimated, denoted by $p(x; \hat{\xi}) = p(x; \hat{\xi}(t, t + \Delta t))$, where $\hat{\xi}$ can be any coordinate system. If stationarity is assumed in a certain time interval, we obtain the probability distribution in the interval by averaging the estimated probabilities over many bins of the interval. The maximum likelihood estimate (MLE) of the P-coordinates is given by
\[
\hat{p}_{ij} = \frac{N_{ij}}{N},
\]
where $N_{ij}$ ($i, j = 0, 1$) indicates the number of trials in which the event $(X_1 = i, X_2 = j)$ occurs. The MLE is retained by any coordinate transformation. Any coordinate transformation is easy in the case of two neurons, so we freely change the coordinate systems in this section. Let us denote our estimated probability distribution by the mixed coordinates $\hat{\zeta}$. We also denote by $\zeta_0$ the probability distribution according to our null hypothesis. Then, we have
\[
D[\zeta_0 : \hat{\zeta}] = D[\zeta_0 : \hat{\zeta}'] + D[\hat{\zeta}' : \hat{\zeta}] = D_1 + D_2, \qquad (3.14)
\]
where $D_1 = D[\zeta_0 : \hat{\zeta}']$, $D_2 = D[\hat{\zeta}' : \hat{\zeta}]$, and $\hat{\zeta}' = (\zeta_1^0, \zeta_2^0, \hat{\zeta}_3)$. In equation 3.14, we use the abbreviation $D[\zeta_0 : \hat{\zeta}'] = D[p(x; \zeta_0) : p(x; \hat{\zeta}')]$. Here, $D_1$ and $D_2$ are the quantities representing the discrepancies of $p(\hat{\zeta})$ from $p(\zeta_0)$ with respect to the coincident firing and the marginals, respectively. We have
\[
\lambda_1 = 2 N D_1 \approx N g_{33}(\zeta_0)\, (\zeta_3^0 - \hat{\zeta}_3)^2 \sim \chi^2(1),
\]
\[
\lambda_2 = 2 N D_2 \approx N \sum_{i,j=1}^{2} g_{ij}(\zeta_0)\, (\zeta_i^0 - \hat{\zeta}_i)(\zeta_j^0 - \hat{\zeta}_j) \sim \chi^2(2).
\]
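The statistic $\lambda_1 \approx N g_{33}(\zeta_0)(\zeta_3^0 - \hat{\zeta}_3)^2$, with the closed-form $g_{33}$ given in section 3.4, can be sketched as follows; the function names are ours, and the inputs are the estimated $\eta$-coordinates $(\eta_1, \eta_2, \eta_{12})$:

```python
import numpy as np

def g33_mixed(e1, e2, e3):
    """Fisher information in the theta direction at the mixed coordinates,
    written in terms of (e1, e2, e3) = (eta1, eta2, eta12); a transcription
    of the closed-form expression in section 3.4."""
    num = e3 * (e1 - e3) * (e2 - e3) * (e1 + e2 - e3 - 1)
    den = e1 * e2 * (e1 + e2 - 1 - 2 * e3) + e3 ** 2
    return num / den

def theta_of(e1, e2, e3):
    """Interaction parameter theta from the eta-coordinates."""
    return np.log(e3 * (1 - e1 - e2 + e3) / ((e1 - e3) * (e2 - e3)))

def lr_statistic(e1, e2, e3, theta0, N):
    """lambda ~ N * g33 * (theta_hat - theta0)^2, chi-square(1) under H0."""
    return N * g33_mixed(e1, e2, e3) * (theta_of(e1, e2, e3) - theta0) ** 2
```

A sanity check: the closed-form $g_{33}$ reduces to the harmonic form $1/\sum_{ij} p_{ij}^{-1}$, and the statistic vanishes exactly when the estimated interaction equals the hypothesized one.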
Here, $\lambda_1$ tests whether the estimated coincident firing significantly differs from that of the null hypothesis, while $\lambda_2$ tests whether the estimated marginals differ significantly from the hypothesized marginals. In particular, a test of whether the estimated coincident firing $\hat{\zeta}_3$ is significantly different from zero is given by $\zeta_0 = (\hat{\zeta}_1, \hat{\zeta}_2, 0)$. This $p(x; \zeta_0)$ is the probability distribution that gives the same marginals as those of $p(x; \hat{\zeta})$ but with independent firing. In this case, $\lambda_1 = 2ND_1 = 2ND[\hat{\zeta} : \zeta_0]$ gives a test statistic against $\hat{\theta}_3 = 0$, while $D_2 = 0$.

Let us consider another typical situation, where we need to compare two estimated probability distributions. This case is very important but somewhat ignored in the testing of coincident firing. Many previous studies have assumed independent firing as the null hypothesis. However, to call a single neuron's firing task related, for example, in a memory-guided saccade task (Hikosaka & Wurtz, 1983), the existence of firing in the task period alone does not guarantee that the firing is task related. It is normal to examine the firing in the task period against that in the control period. The firing in the control period serves as the resting-level activity, that is, as the null hypothesis. We hence propose that a procedure for testing coincident firing should be performed in a similar manner: we should test whether two neurons have any significant pairwise interaction in one period in comparison with the other (control) period. Investigation of coincident firing in the task period against the null hypothesis of independent firing may lead to a wrong interpretation of its significance when there is already a weak correlation in the control period (see the examples in section 6). Similar arguments apply to different tasks. One example would be a rat's maze task: in one period, the rat is in the left room, while in the other period, it is in the right room. We may want to test whether the coincident firing of two neurons, say, in the hippocampus, is significantly larger or smaller in one room than in the other. The null hypothesis of independent firing is not plausible in this case.

Let us denote the estimated probability distributions in the two periods by $p(x; \hat{\xi}^1)$ and $p(x; \hat{\xi}^2)$. Using the mixed coordinates, by theorem 2, we have
\[
D[\hat{\zeta}^1 : \hat{\zeta}^2] = D[\hat{\zeta}^1 : \hat{\zeta}^3] + D[\hat{\zeta}^3 : \hat{\zeta}^2],
\]
where $\hat{\zeta}^3 = (\hat{\zeta}_1^1, \hat{\zeta}_2^1, \hat{\zeta}_3^2) = (\hat{\eta}_1^1, \hat{\eta}_2^1, \hat{\theta}_3^2)$. Here, $\hat{\zeta}^1$ is an estimated probability distribution. If we can guarantee that $\hat{\zeta}^1$ is the true underlying distribution, denoted by $\zeta^1$, we have
\[
\lambda = 2 N D[\zeta^1 : \hat{\zeta}^3] \approx N g_{33}(\zeta^1)\, (\hat{\theta}_3^2 - \theta_3^1)^2 \sim \chi^2(1). \qquad (3.15)
\]
This $\chi^2$ test is, precisely speaking, to examine whether $\hat{\theta}_3^2$ is significantly different from $\theta_3^1$ when $\hat{\zeta}^1$ is the true distribution.
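For the simplest special case discussed above, testing the estimated coincident firing against independent firing with the same marginals ($\zeta_0 = (\hat{\zeta}_1, \hat{\zeta}_2, 0)$, so $D_2 = 0$ and $\lambda_1 = 2ND[\hat{\zeta} : \zeta_0]$), a sketch with hypothetical trial counts looks as follows:

```python
import numpy as np

# Hypothetical trial counts N_ij of the joint events (X1 = i, X2 = j);
# the numbers are invented for the illustration.
counts = np.array([[520, 190], [180, 110]], float)   # [[N00, N01], [N10, N11]]
N = counts.sum()
p_hat = counts / N                                   # MLE of the P-coordinates

# Null hypothesis: same marginals, independent firing (theta = 0).
marg1 = p_hat.sum(axis=1)     # Prob{X1 = i}
marg2 = p_hat.sum(axis=0)     # Prob{X2 = j}
p_null = np.outer(marg1, marg2)

# lambda_1 = 2 N D[p_hat : p_null], chi-square(1) under independence
lam = 2 * N * np.sum(p_hat * np.log(p_hat / p_null))
significant = lam > 3.84      # 5% critical value of chi-square(1)
```

This is the KL-divergence form of the independence test; the article's point is that a significant result here does not by itself establish task-related interaction if a weak correlation is already present in the control period.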
In general, when $\hat\zeta^1$ is an estimated distribution, we should test whether $\hat\theta_3^1$ and $\hat\theta_3^2$ come from the same interaction component, which we denote by $\theta_3$. In this case, the MLEs, denoted by $\hat\zeta^{1'}$ and $\hat\zeta^{2'}$, are given by

$$(\hat\zeta^{1'}, \hat\zeta^{2'}) = \arg\max \sum_{j=1}^{N} \log p(X_j; \zeta^1)\, p(X_j; \zeta^2) \quad \text{subject to } \zeta_3^1 = \zeta_3^2 = \theta_3.$$

Then our likelihood ratio test against this null hypothesis yields

$$\lambda' = 2N D[\hat\zeta^{1'} : \hat\zeta^1] + 2N D[\hat\zeta^{2'} : \hat\zeta^2] \approx N \tilde g_{33}(\hat\zeta^{1'})(\hat\theta_3^1 - \hat\theta_3^{1'})^2 + N \tilde g_{33}(\hat\zeta^{2'})(\hat\theta_3^2 - \hat\theta_3^{2'})^2, \tag{3.16}$$
where $\hat\theta_3 = \hat\theta_3^{1'} = \hat\theta_3^{2'}$. In equation 3.15, we can convert $\lambda$ into a $\chi^2$ test because $\tilde g_{33}$ is the true value by our assumption. In equation 3.16, however, rigorously speaking, we cannot convert $\lambda'$ into a $\chi^2$ test, because both $\tilde g_{33}$ terms are estimates, determined at each estimated point $\hat\zeta$, that is, depending on $\hat\theta_3^{1'}$ and $\hat\theta_3^{2'}$, respectively. This issue is analogous to the famous Behrens-Fisher problem in the context of the t-test (Stuart, Ord, & Arnold, 1999). Yet since all of the terms in equation 3.16 asymptotically converge to their true values, we suggest using

$$\lambda' \approx N \tilde g_{33}(\hat\zeta^{1'})(\hat\theta_3^1 - \hat\theta_3^{1'})^2 + N \tilde g_{33}(\hat\zeta^{2'})(\hat\theta_3^2 - \hat\theta_3^{2'})^2 \sim \chi^2(2).$$

This $\chi^2(2)$ formulation gives a more appropriate test under the null hypothesis against the average activity in the control period. At the same time, to compare significant events between the two null hypotheses (against independent firing and against the average activity in the control period), we still suggest using the $\chi^2(1)$ formulation for the latter hypothesis.

3.6 Relationship Between Neural Firing and Behavior. The orthogonality between the $\theta$- and $\eta$-parameters played a fundamental role in the above results, so that pairwise coincident firing, characterized by $\theta_3$, can be examined by a simple hypothesis-testing procedure. In the analysis of neural data, it is also important to investigate whether coincident firing has any behavioral significance. For this purpose, we use the mutual information to relate neural firing to behavioral events; the above orthogonality again plays an important role. Let us denote by Y a discrete random variable representing behavioral choices, for example, making a saccade right or left, or presented stimuli, for
Information-Geometric Measure for Neural Spikes
2283
example, red dots, blue rectangles, or green triangles. The mutual information between $X = (X_1, X_2)$ and $Y$ is defined by

$$I(X, Y) = E_{p(X,Y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right],$$

which is equivalent to

$$I(X, Y) = E_{p(Y)}\big[D[p(X \mid y) : p(X)]\big] = E_{p(X)}\big[D[p(Y \mid x) : p(Y)]\big].$$

We can apply the Pythagorean decomposition to the above equation. We use the mixed coordinates for $p(X \mid y)$ and $p(X)$, denoted by $\zeta(X \mid y)$ and $\zeta(X)$, respectively. Then we have

$$D[p(X \mid y) : p(X)] = D[\zeta(X \mid y) : \zeta(X)] = D[\zeta(X \mid y) : \zeta^0] + D[\zeta^0 : \zeta(X)],$$

where

$$\zeta^0 = \zeta^0(X, y) = (\zeta_1(X \mid y), \zeta_2(X \mid y), \zeta_3(X)) = (\eta_1(X \mid y), \eta_2(X \mid y), \theta_3(X)).$$

Thus, $\zeta^0$ has the same first two components (i.e., $\eta_1, \eta_2$) as $\zeta(X \mid y)$ and the same third component (i.e., $\theta_3$) as $\zeta(X)$. Using this relationship, the mutual information between $X$ and $Y$ is decomposed as follows.

Theorem 3.

$$I(X, Y) = I_1(X, Y) + I_2(X, Y), \tag{3.17}$$

where $I_1(X, Y)$ and $I_2(X, Y)$ are given by

$$I_1(X, Y) = E_{p(Y)}\big[D[\zeta(X \mid y) : \zeta^0(X, y)]\big], \qquad I_2(X, Y) = E_{p(Y)}\big[D[\zeta^0(X, y) : \zeta(X)]\big].$$

A similar result holds with respect to the conditional distribution $p(Y \mid X)$. The above decomposition states that the mutual information $I(X, Y)$ is the sum of two terms: $I_1(X, Y)$ is the mutual information carried by modulation of the correlation component of $X$, while $I_2(X, Y)$ is the mutual information carried by modulation of the marginal means of $X$. This observation helps us investigate the behavioral significance of each modulation, of the coincident firing and of the mean firing rate.

4 Triple Interactions Among Three Variables

The previous section discussed pairwise interaction between two variables. Given more than two variables, we need to look into not only pairwise but also higher-order interactions. It is useful to study triplewise interactions before stating the general case.
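Theorem 3 can be checked numerically. The sketch below is our own toy example, not code from the paper: `from_mixed` is a hypothetical helper that inverts the mixed coordinates by bisection (the interaction $\theta_3$ is strictly increasing in $p_{11}$ at fixed marginals). It builds $\zeta^0(X, y)$ from the marginals of $p(X \mid y)$ and the $\theta_3$ of $p(X)$, and verifies $I = I_1 + I_2$.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def theta3(p):
    p11, p10, p01, p00 = p
    return math.log(p11 * p00 / (p10 * p01))

def from_mixed(eta1, eta2, t, tol=1e-13):
    """Distribution with marginals (eta1, eta2) and interaction theta3 = t,
    found by bisection on p11 (theta3 is monotone increasing in p11)."""
    lo = max(0.0, eta1 + eta2 - 1.0) + 1e-12
    hi = min(eta1, eta2) - 1e-12
    while hi - lo > tol:
        p11 = (lo + hi) / 2
        if theta3([p11, eta1 - p11, eta2 - p11, 1 - eta1 - eta2 + p11]) < t:
            lo = p11
        else:
            hi = p11
    p11 = (lo + hi) / 2
    return [p11, eta1 - p11, eta2 - p11, 1 - eta1 - eta2 + p11]

# Toy conditionals over (X1, X2) for a binary behavior Y with P(y) = 0.5 each
p_y = [0.5, 0.5]
p_x_given_y = [[0.30, 0.20, 0.20, 0.30],   # y = 0: positively correlated firing
               [0.10, 0.30, 0.30, 0.30]]   # y = 1: negatively correlated firing
p_x = [sum(p_y[y] * p_x_given_y[y][s] for y in range(2)) for s in range(4)]

t_marg = theta3(p_x)
I = I1 = I2 = 0.0
for y in range(2):
    cond = p_x_given_y[y]
    eta1, eta2 = cond[0] + cond[1], cond[0] + cond[2]
    zeta0 = from_mixed(eta1, eta2, t_marg)  # marginals of p(X|y), theta3 of p(X)
    I += p_y[y] * kl(cond, p_x)
    I1 += p_y[y] * kl(cond, zeta0)          # information in correlation modulation
    I2 += p_y[y] * kl(zeta0, p_x)           # information in rate modulation

print(I, I1 + I2)  # theorem 3: the two should agree
```

The equality holds exactly (up to the bisection tolerance) because $\zeta^0$ sits at the intersection of the m-flat and e-flat submanifolds through $p(X \mid y)$ and $p(X)$.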
4.1 Orthogonal Coordinates and Pure Triple Interaction. Let us consider three binary random variables $X_1, X_2, X_3$, and let $p(x) > 0$, $x = (x_1, x_2, x_3)$, be their joint probability distribution. We put $p_{ijk} = \mathrm{Prob}\{x_1 = i, x_2 = j, x_3 = k\} > 0$, $i, j, k = 0, 1$. The set of all such distributions forms a seven-dimensional manifold $S_3$, because $\sum p_{ijk} = 1$ among the eight $p_{ijk}$'s. The single and pairwise marginal distributions of the $X_i$ are defined by

$$\eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\} \quad (i = 1, 2, 3), \qquad \eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\} \quad (i, j = 1, 2, 3).$$

The three quantities $\eta_i$, $\eta_j$, and $\eta_{ij}$ together determine the joint marginal distribution of any two random variables $X_i$ and $X_j$. Let us further put $\eta_{123} = E[x_1 x_2 x_3] = \mathrm{Prob}\{x_1 = x_2 = x_3 = 1\}$. All of these give seven degrees of freedom,

$$\eta = (\eta_1, \eta_2, \ldots, \eta_7) = (\eta_1, \eta_2, \eta_3;\ \eta_{12}, \eta_{23}, \eta_{13};\ \eta_{123}), \tag{4.1}$$

which specify any distribution $p(x)$ in $S_3$. Hence, this $\eta$ is a coordinate system of $S_3$, called the m- or $\eta$-coordinates. The pairwise correlation between any two of $X_1, X_2, X_3$ is determined from the marginal distributions of $X_i$ and $X_j$, that is, from $\eta_i$, $\eta_j$, and $\eta_{ij}$. However, even when all the pairwise correlations vanish, this does not imply that $X_1, X_2, X_3$ are independent. Therefore, one should define an intrinsic triplewise interaction independent of the pairwise correlations. The coordinate $\eta_{123}$ itself does not directly give the degree of pure triplewise interaction.

In order to define the degree of pure triplewise interaction, orthogonality plays a fundamental role. Let us fix the three pairwise marginal distributions, specified by the six coordinates

$$\eta_2 = (\eta_1, \eta_2, \eta_3;\ \eta_{12}, \eta_{23}, \eta_{13}).$$

There are many distributions with the same $\eta_2$. Let us consider the set $M_2(\eta_2)$ of all the distributions that have the same single and pairwise marginals $\eta_2$, while $\eta_{123}$ may take any value. This is a one-dimensional submanifold specified by $\eta_2$. Let us introduce a coordinate $\theta$ in $M_2(\eta_2)$; then $(\eta_2, \theta)$ is a coordinate system of $S_3$. When the coordinate $\theta$ is orthogonal to $\eta_2$, that is, when a change in the log likelihood along $\theta$ is not correlated with that along any of the components of $\eta_2$, we may say that $\theta$ represents the degree of pure triple interaction regardless of the pairwise marginals $\eta_2$, and we require that $\theta$ have this property. The tangent direction of $M_2$, that is, the direction in which only $\theta$ changes while the second-order marginals $\eta_2$ are fixed, represents a change in the pure triple interaction among $X_1, X_2, X_3$. To see this geometrically, let us consider a family of submanifolds $E_{2*}(\theta)$ in which all the distributions have
the same $\theta$ but the single and pairwise marginals $\eta_2$ are free. An $E_{2*}(\theta)$ is a six-dimensional submanifold transversal to $M_2(\eta_2)$. Tangent directions of $E_{2*}(\theta)$ represent changes in the marginals $\eta_2$ with $\theta$ fixed, and $E_{2*}(\theta)$ and $M_2(\eta_2)$ are orthogonal at any $\theta$ and $\eta_2$.

In order to obtain such a $\theta$, let us expand $\log p(x)$ as a polynomial in $x$:

$$\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \theta_{123}\, x_1 x_2 x_3 - \psi. \tag{4.2}$$

This is an exact formula, since each $x_i$ ($i = 1, 2, 3$) is binary. One can check that the coefficient $\theta = \theta_{123}$ is given by

$$\theta_{123} = \log \frac{p_{111}\, p_{100}\, p_{010}\, p_{001}}{p_{110}\, p_{101}\, p_{011}\, p_{000}}. \tag{4.3}$$

The other coefficients are

$$\theta_1 = \log \frac{p_{100}}{p_{000}}, \quad \theta_2 = \log \frac{p_{010}}{p_{000}}, \quad \theta_3 = \log \frac{p_{001}}{p_{000}}, \tag{4.4}$$

$$\theta_{12} = \log \frac{p_{110}\, p_{000}}{p_{100}\, p_{010}}, \quad \theta_{23} = \log \frac{p_{011}\, p_{000}}{p_{010}\, p_{001}}, \quad \theta_{13} = \log \frac{p_{101}\, p_{000}}{p_{100}\, p_{001}}, \tag{4.5}$$

$$\psi = -\log p_{000}. \tag{4.6}$$
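Equations 4.2 through 4.6 translate directly into code. The sketch below is our own illustration (the dictionary layout and helper names are assumptions, not from the paper): it computes all $\theta$ coefficients from a joint table $p_{ijk}$ and confirms that every interaction term vanishes for independent neurons.

```python
import math

def theta_coeffs(p):
    """p[(i, j, k)] = Prob{x1=i, x2=j, x3=k}; returns the coefficients of
    log p(x) = sum theta_i x_i + sum theta_ij x_i x_j
               + theta_123 x1 x2 x3 - psi   (eqs. 4.2-4.6)."""
    L = {s: math.log(p[s]) for s in p}
    th = {
        '1':  L[1, 0, 0] - L[0, 0, 0],
        '2':  L[0, 1, 0] - L[0, 0, 0],
        '3':  L[0, 0, 1] - L[0, 0, 0],
        '12': L[1, 1, 0] + L[0, 0, 0] - L[1, 0, 0] - L[0, 1, 0],
        '23': L[0, 1, 1] + L[0, 0, 0] - L[0, 1, 0] - L[0, 0, 1],
        '13': L[1, 0, 1] + L[0, 0, 0] - L[1, 0, 0] - L[0, 0, 1],
        '123': (L[1, 1, 1] + L[1, 0, 0] + L[0, 1, 0] + L[0, 0, 1]
                - L[1, 1, 0] - L[1, 0, 1] - L[0, 1, 1] - L[0, 0, 0]),
    }
    psi = -L[0, 0, 0]
    return th, psi

# Independent neurons with firing rates a, b, c: all interaction terms vanish
a, b, c = 0.2, 0.3, 0.4
p = {(i, j, k): (a if i else 1 - a) * (b if j else 1 - b) * (c if k else 1 - c)
     for i in (0, 1) for j in (0, 1) for k in (0, 1)}
th, psi = theta_coeffs(p)
print(th['123'], th['12'], th['1'])  # 0, 0, log(a/(1-a))
```

For this independent table, $\theta_{123}$ and all pairwise $\theta_{ij}$ are zero, while $\theta_1 = \log\frac{a}{1-a}$, as equation 4.4 predicts.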
Information geometry gives the following theorem.

Theorem 4. The quantity $\theta_{123}$ represents the pure triplewise interaction in the sense that it is orthogonal to any changes in the single and pairwise marginals.

We can prove this directly by calculating the derivatives of the log likelihood. Equation 4.2 shows that $S_3$ is an exponential family with the canonical parameters $\theta = (\theta_1, \theta_2, \theta_3;\ \theta_{12}, \theta_{23}, \theta_{13};\ \theta_{123})$. The corresponding expectation parameters are $\eta = (\eta_2;\ \eta_{123})$, so that they are orthogonal. We can compose the mixed orthogonal coordinates, denoted by $\zeta_2$, as

$$\zeta_2 = (\eta_2;\ \theta_{123}) = (\eta_1, \eta_2, \eta_3;\ \eta_{12}, \eta_{23}, \eta_{13};\ \theta_{123}). \tag{4.7}$$

In this coordinate system, $\eta_2$ and $\theta = \theta_{123}$ are orthogonal. Note that $\theta_{123}$ is not orthogonal to $\theta_{12}, \theta_{23}, \theta_{13}$. Hence, except when there is no triplewise interaction ($\theta_{123} = 0$), the quantities $\theta_{12}, \theta_{23}, \theta_{13}$ in equation 4.5 do not directly represent the degrees of pairwise correlation of the respective pairs of random variables.

Notably, the submanifold $E_{2*}(0)$ consists of all the distributions having pairwise but no triple interactions. The log probability is then quadratic: $\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi$. A stable distribution of a Boltzmann machine in neural networks belongs to this class,
because there are no triple interactions among the neurons (Amari, Kurata, & Nagaoka, 1992). The submanifold $E_{2*}(0)$ is characterized by $\theta_{123} = 0$ or, in terms of $\eta$ (see equation 4.3), by

$$\eta_{123} = \frac{(\eta_{12} - \eta_{123})(\eta_{13} - \eta_{123})(\eta_{23} - \eta_{123})(1 - \eta_1 - \eta_2 - \eta_3 + \eta_{12} + \eta_{23} + \eta_{13} - \eta_{123})}{(\eta_1 - \eta_{12} - \eta_{13} + \eta_{123})(\eta_2 - \eta_{12} - \eta_{23} + \eta_{123})(\eta_3 - \eta_{13} - \eta_{23} + \eta_{123})}.$$
4.2 Another Orthogonal Coordinate System. In the above, we extracted the pure triple interaction by using the coordinate $\theta_{123}$, such that $\eta_2$ and $\theta_{123}$ are orthogonal. If we are interested instead in separating the simple marginals from the various kinds of interactions, we can use another decomposition. Let us summarize the three simple marginals in $\eta_1 = (\eta_1, \eta_2, \eta_3)$ and all of the interaction terms in $\theta_{1*} = (\theta_{12}, \theta_{23}, \theta_{13}, \theta_{123})$. Here, $\theta_{1*}$ denotes the coordinates complementary to $\eta_1$. Using this pair, we have another mixed coordinate system, denoted by $\zeta_1$:

$$\zeta_1 = (\zeta_{11}, \zeta_{12}, \ldots, \zeta_{17}) = (\eta_1,\ \theta_{1*}). \tag{4.8}$$

Here, $\eta_1$ and $\theta_{1*}$ are orthogonal. Geometrically, let $M_1(\eta_1)$, specified by $\eta_1 = (\eta_1, \eta_2, \eta_3)$, be the set of all the distributions having the same simple marginals $\eta_1$ but arbitrary pairwise and triplewise correlations. $M_1(\eta_1)$ is a four-dimensional submanifold in which $\theta_{1*}$ takes arbitrary values. On the other hand, let $E_{1*}(\theta_{1*})$ be the three-dimensional submanifold in which all the distributions have the same $\theta_{1*} = (\theta_{12}, \theta_{23}, \theta_{13}, \theta_{123})$ but different marginals $\eta_1$. We have the following theorem.

Theorem 5. The coordinates $\eta_1$ and $\theta_{1*}$ are orthogonal; that is, $E_{1*}(\theta_{1*})$ is orthogonal to $M_1(\eta_1)$.

Here, $\theta_{1*}$ represents the degrees of pure correlation independent of the marginals $\eta_1$ and includes correlations resulting from the triplewise interaction in addition to the pairwise interactions. Because of the non-Euclidean character of $S_3$ (Amari & Nagaoka, 2000; Amari, 2001), we cannot have a coordinate system $(\eta_1, \eta_2, \eta_3;\ \theta_{12}', \theta_{23}', \theta_{13}';\ \theta_{123}')$ with $\{\eta_i\}$, $\{\theta_{ij}'\}$, and $\theta_{123}'$ mutually orthogonal. The submanifold $E_{1*}(0)$ has zero pairwise and triplewise correlations and hence consists entirely of independent distributions, in which $\eta_{ij} = \eta_i \eta_j$ and $\eta_{123} = \eta_1 \eta_2 \eta_3$ hold. The function $\log p(x)$ is linear in $x$ because $\theta_{1*} = 0$ (see equation 4.2).

4.3 Projections and Decompositions of Divergence. Using the above two mixed coordinate systems, we can decompose a probability distribution in the following two ways. Let us consider two probability distributions, $p(x)$ and $q(x)$, whose coordinates in any coordinate system $\xi$ are denoted by $\xi_p$ and $\xi_q$, respectively.
First, let us consider the case where $q$ is the independent uniform distribution. Using the mixed orthogonal coordinate system $\zeta_2$, we seek to extract the pure triplewise interaction $\theta_{123}$. For $q$, we have

$$\theta_{123}^q = 0, \qquad \eta_1^q = \eta_2^q = \eta_3^q = \frac{1}{2}, \qquad \eta_{12}^q = \eta_{23}^q = \eta_{13}^q = \frac{1}{4}.$$

Furthermore, we note that $q \in E_{2*}(0)$ and also $q \in E_{1*}(0)$. Let us m-project $p$ to $E_{2*}(0)$ by

$$\bar p(x) = \arg\min_{r \in E_{2*}(0)} D[p(x) : r(x)].$$

This $\bar p$ has the same pairwise marginals as $p$ but does not include any triplewise interaction; its mixed coordinates are given by $\zeta_2^{\bar p} = (\eta_2^p;\ \theta_{123}^q) = (\eta_2^p;\ 0)$. The Pythagorean theorem gives us

$$D[p : q] = D[p : \bar p] + D[\bar p : q],$$

where $D[p : \bar p]$ represents the degree of pure triplewise interaction, while $D[\bar p : q]$ represents how $p$ differs from $q$ in the simple marginals and pairwise correlations.

Let us next extract the pairwise interactions in $p(x)$ by using the other mixed coordinate system $\zeta_1$. To this end, let us project $p$ to $E_{1*}(0)$, which is composed of the independent distributions:

$$\tilde p(x) = \arg\min_{s \in E_{1*}(0)} D[p(x) : s(x)].$$
More explicitly, we have $\zeta_1^{\tilde p} = (\eta_1^p;\ \theta_{1*}^q) = (\eta_1^p;\ 0)$ and

$$D[p : q] = D[p : \tilde p] + D[\tilde p : q].$$

Here, $D[p : \tilde p]$ summarizes the effect of all the pairwise and triplewise interactions, while $D[\tilde p : q]$ represents the deviation of the simple marginals from uniformity. Taking the two decompositions together, we have

$$D[p : q] = D[p : \bar p] + D[\bar p : \tilde p] + D[\tilde p : q]. \tag{4.9}$$
Here, $D[p : \bar p]$ represents the degree of pure triplewise interaction in the probability distribution $p$, $D[\bar p : \tilde p]$ that of the pairwise interactions, and $D[\tilde p : q]$ that of the nonuniformity of the firing rates.

Let us generalize equation 4.9 by dropping the assumption that $q$ is the independent uniform distribution. We then redefine $\bar p$ and $\tilde p$ as

$$\bar p(x) = \arg\min_{r \in E_{2*}(\theta_{123}^q)} D[p(x) : r(x)], \qquad \tilde p(x) = \arg\min_{s \in E_{1*}(\theta_{1*}^q)} D[p(x) : s(x)].$$
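The projections can be computed numerically: $\tilde p$ is simply the product of the marginals of $p$, and $\bar p$ (same pairwise marginals as $p$, no triplewise interaction) can be obtained by iterative proportional fitting, a standard routine we use here in place of a closed form. Everything below is our own illustrative sketch, not code from the paper.

```python
import math
from itertools import product

STATES = list(product((0, 1), repeat=3))

def kl(p, q):
    return sum(p[s] * math.log(p[s] / q[s]) for s in STATES)

def pair_marginal(p, i, j):
    m = {}
    for s in STATES:
        m[s[i], s[j]] = m.get((s[i], s[j]), 0.0) + p[s]
    return m

def m_project_E2(p, iters=300):
    """m-projection of p onto E_2*(0): same pairwise marginals, theta_123 = 0.
    Iterative proportional fitting starting from the uniform distribution."""
    r = {s: 1.0 / 8 for s in STATES}
    for _ in range(iters):
        for i, j in ((0, 1), (1, 2), (0, 2)):
            target, cur = pair_marginal(p, i, j), pair_marginal(r, i, j)
            for s in STATES:
                r[s] *= target[s[i], s[j]] / cur[s[i], s[j]]
    return r

# A toy distribution with a genuine triplewise interaction (our own example)
w = {s: math.exp(0.5 * s[0] + 0.3 * s[1] + 0.2 * s[2]
                 + 0.4 * s[0] * s[1] + 1.2 * s[0] * s[1] * s[2]) for s in STATES}
Z = sum(w.values())
p = {s: w[s] / Z for s in STATES}

q = {s: 1.0 / 8 for s in STATES}          # independent uniform distribution
p_bar = m_project_E2(p)                   # removes only theta_123
marg = [sum(p[s] for s in STATES if s[i]) for i in range(3)]
p_tilde = {s: math.prod(marg[i] if s[i] else 1 - marg[i] for i in range(3))
           for s in STATES}               # product of the marginals of p

# Orthogonal decomposition, eq. 4.9:
lhs = kl(p, q)
rhs = kl(p, p_bar) + kl(p_bar, p_tilde) + kl(p_tilde, q)
print(lhs, rhs)
```

The two sides agree to the precision of the fitting, and $2N\,D[p : \bar p]$ is the likelihood-ratio-type quantity used for testing a triplewise interaction against the null of matched pairwise marginals (cf. section 4.4).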
We now have theorem 6.

Theorem 6.

$$D[p : q] = D[p : \bar p] + D[\bar p : q] \tag{4.10}$$
$$\phantom{D[p : q]} = D[p : \tilde p] + D[\tilde p : q] \tag{4.11}$$
$$\phantom{D[p : q]} = D[p : \bar p] + D[\bar p : \tilde p] + D[\tilde p : q]. \tag{4.12}$$

The decompositions in the first and second lines are particularly interesting for neural data analysis, as shown in the next section. Any coordinate transformation can be done freely in this three-variable case in a numerical sense. In general, however, coordinate transformations between $\theta$ and $\eta$ are not easy when the dimension is high. Later, we discuss several practical approaches for the n-neuron case in neural data analysis.

4.4 Applications to Firing of Three Neurons. Here, we briefly discuss the application of the above results to the firing of three neurons. The discussion in section 3.5 extends naturally. We consider three binary random variables, $X = (X_1, X_2, X_3)$, and denote the estimated probability distribution and the distribution of the null hypothesis by $p(x; \hat\xi)$ and $p(x; \xi^0)$, respectively, where $\xi$ is now a seven-dimensional coordinate system. We use the following decompositions:

$$D[\xi^0 : \hat\xi] = D[\zeta_2^0 : \hat\zeta_2''] + D[\hat\zeta_2'' : \hat\zeta] = D[\zeta_1^0 : \hat\zeta_1''] + D[\hat\zeta_1'' : \hat\zeta],$$

where $\hat\zeta_1'' = (\eta_1^0;\ \hat\theta_{1*})$ and $\hat\zeta_2'' = (\eta_2^0;\ \hat\theta_{123})$.

In the first decomposition, $D[\zeta_2^0 : \hat\zeta_2'']$ represents the discrepancy in the triplewise interaction of $p(x; \hat\xi)$ from $p(x; \xi^0)$, fixing the pairwise interactions and marginals as specified by $p(x; \xi^0)$. $D[\hat\zeta_2'' : \hat\zeta]$ then collects all the residual discrepancy; more precisely, it represents the discrepancy of the distribution $p(x; \hat\xi)$ from $p(x; \hat\zeta_2'')$, which has the same single and second-order marginals as $p(x; \xi^0)$ (i.e., $\eta_2^0$) and the same triplewise interaction $\hat\theta_{123}$ as $p(x; \hat\zeta)$. Therefore, $D[\zeta_2^0 : \hat\zeta_2'']$ is particularly useful for investigating whether there is any significant triplewise interaction in the data, that is, in $p(x; \hat\xi)$, in comparison with the null hypothesis $p(x; \xi^0)$. A significant triplewise interaction, for example, may be considered indicative of three neurons functioning together. For hypothesis testing, we can use

$$\lambda_2 = 2N D[\zeta_2^0 : \hat\zeta_2''] \approx N \tilde g_{77}(\zeta_2^0)\,(\theta_{123}^0 - \hat\theta_{123})^2 \sim \chi^2(1), \tag{4.13}$$

where $N$ is the number of trials and the indices refer to $\zeta_2 = (\zeta_1, \ldots, \zeta_6;\ \zeta_7) = (\eta_2;\ \theta_{123})$.
In the second decomposition, $D[\zeta_1^0 : \hat\zeta_1'']$ represents the discrepancy in both the triplewise and pairwise interactions of $p(x; \hat\xi)$ from $p(x; \xi^0)$, fixing the marginals as specified by $p(x; \xi^0)$, while $D[\hat\zeta_1'' : \hat\zeta]$ collects all the residual discrepancy. $D[\zeta_1^0 : \hat\zeta_1'']$ is useful for investigating whether there is significant coincident firing, taking the pairwise and triplewise interactions together, compared with the null hypothesis. We now have

$$\lambda_1 = 2N D[\zeta_1^0 : \hat\zeta_1''] \approx N \sum_{i,j=4}^{7} \tilde g_{ij}(\zeta_1^0)\,(\zeta_i^0 - \hat\zeta_i)(\zeta_j^0 - \hat\zeta_j) \sim \chi^2(4), \tag{4.14}$$

where the indices are given by $\zeta_1 = (\zeta_1, \ldots, \zeta_7) = (\eta_1;\ \theta_{1*})$.

We can also compare two probability distributions estimated under different experimental conditions. Let us denote the two estimated distributions by $p(x; \hat\xi^1)$ and $p(x; \hat\xi^2)$. We first examine the triplewise interaction. The MLE of our null hypothesis, that is, $\hat\theta_{123}^1 = \hat\theta_{123}^2$, denoted by $\hat\zeta_2^{1'}$ and $\hat\zeta_2^{2'}$, is given by

$$(\hat\zeta_2^{1'}, \hat\zeta_2^{2'}) = \arg\max \sum_{j=1}^{N} \log p(X_j; \zeta_2^1)\, p(X_j; \zeta_2^2) \quad \text{subject to } \theta_{123}^1 = \theta_{123}^2. \tag{4.15}$$

Then we have

$$\lambda_2' = 2N D[\hat\zeta_2^{1'} : \hat\zeta_2^1] + 2N D[\hat\zeta_2^{2'} : \hat\zeta_2^2] \approx N \tilde g_{77}(\hat\zeta_2^{1'})(\hat\theta_{123}^1 - \hat\theta_{123}^{1'})^2 + N \tilde g_{77}(\hat\zeta_2^{2'})(\hat\theta_{123}^2 - \hat\theta_{123}^{2'})^2 \sim \chi^2(2). \tag{4.16}$$
When we investigate the coincident firing, taking the pairwise and triplewise interactions together, we use the second decomposition above. The MLE of our null hypothesis in this case is given by

$$(\hat\zeta_1^{1'}, \hat\zeta_1^{2'}) = \arg\max \sum_{j=1}^{N} \log p(X_j; \zeta_1^1)\, p(X_j; \zeta_1^2) \quad \text{subject to } \theta_{1*}^1 = \theta_{1*}^2.$$

For hypothesis testing, we can use

$$\lambda_1' = 2N D[\hat\zeta_1^{1'} : \hat\zeta_1^1] + 2N D[\hat\zeta_1^{2'} : \hat\zeta_1^2] \approx N \sum_{i,j=4}^{7} \tilde g_{ij}(\hat\zeta_1^{1'})(\zeta_i^{1'} - \hat\zeta_i^1)(\zeta_j^{1'} - \hat\zeta_j^1) \tag{4.17}$$
$$\phantom{\lambda_1'} \quad + N \sum_{i,j=4}^{7} \tilde g_{ij}(\hat\zeta_1^{2'})(\zeta_i^{2'} - \hat\zeta_i^2)(\zeta_j^{2'} - \hat\zeta_j^2) \sim \chi^2(8). \tag{4.18}$$
The decompositions of the KL divergence also allow us to decompose the mutual information between the firing pattern of three neurons $X = (X_1, X_2, X_3)$ and the behavior $Y$, in a manner similar to section 3.6.

Theorem 7.

$$I(X, Y) = E_{p(X,Y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right] = I_1(X, Y) + I_2(X, Y) \tag{4.19}$$
$$\phantom{I(X, Y)} = I_3(X, Y) + I_4(X, Y), \tag{4.20}$$

where

$$I_1(X, Y) = E_{p(Y)}\big[D[\zeta_1(X \mid y) : \zeta_1(X, y)]\big], \qquad I_2(X, Y) = E_{p(Y)}\big[D[\zeta_1(X, y) : \zeta_1(X)]\big],$$

and we define $\zeta_1(X, y) = (\eta_1(X \mid y);\ \theta_{1*}(X))$. Similarly,

$$I_3(X, Y) = E_{p(Y)}\big[D[\zeta_2(X \mid y) : \zeta_2(X, y)]\big], \qquad I_4(X, Y) = E_{p(Y)}\big[D[\zeta_2(X, y) : \zeta_2(X)]\big],$$

and we define $\zeta_2(X, y) = (\eta_2(X \mid y);\ \theta_{2*}(X))$.

In equation 4.19, the mutual information $I(X, Y)$ is decomposed into two parts: $I_1$, the mutual information conveyed by the pairwise and triplewise interactions of the firing, and $I_2$, the mutual information conveyed by the mean firing-rate modulation. In equation 4.20, $I(X, Y)$ is decomposed differently: $I_3$ is conveyed by the triplewise interaction, and $I_4$ by the other terms, that is, the pairwise and mean firing-rate modulations.

5 General Case: Joint Distributions of X_1, ..., X_n

Here we study the general case of $n$ neurons. Let $X = (X_1, \ldots, X_n)$ be $n$ binary variables, and let $p = p(x)$, $x = (x_1, \ldots, x_n)$, $x_i = 0, 1$, be its probability distribution, where we assume $p(x) > 0$ for all $x$. We begin by briefly recapitulating Amari (2001) for the theoretical framework and then move to its applications.
5.1 Coordinate Systems of $S_n$. As mentioned in section 2, the set of all probability distributions $\{p(x)\}$ forms a $(2^n - 1)$-dimensional manifold $S_n$. Any $p(x)$ in $S_n$ can be represented by the P-coordinate system, the $\theta$-coordinate system, or the $\eta$-coordinate system. The P-coordinate system is defined by

$$p_{i_1 \cdots i_n} = \mathrm{Prob}\{X_1 = i_1, \ldots, X_n = i_n\}, \quad i_k = 0, 1, \quad \text{subject to } \sum_{i_1, \ldots, i_n} p_{i_1 \cdots i_n} = 1.$$

The $\theta$-coordinate system is defined by the expansion of $\log p(x)$ as

$$\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{12\cdots n}\, x_1 \cdots x_n - \psi, \tag{5.1}$$

where the indices of $\theta_{ijk}, \ldots$ satisfy $i < j < k < \cdots$, and then

$$\theta = (\theta_i, \theta_{ij}, \theta_{ijk}, \ldots, \theta_{12\cdots n}) \tag{5.2}$$

has $2^n - 1$ components and forms the $\theta$-coordinate system. It is easy to compute any component of $\theta$; for example, $\theta_1 = \log \frac{p_{10\cdots 0}}{p_{00\cdots 0}}$. For later convenience, we use the notation $\theta_1 = (\theta_i)$, $\theta_2 = (\theta_{ij})$, $\theta_3 = (\theta_{ijk})$, ..., $\theta_n = \theta_{12\cdots n}$, where the indices in $\theta_l$ run over the $l$-tuples among the $n$ binary variables, yielding $\binom{n}{l}$ components ($\binom{n}{l}$ is the binomial coefficient). Then we can write

$$\theta = (\theta_1, \theta_2, \ldots, \theta_n).$$

The $\eta$-coordinate system, on the other hand, is defined by using

$$\eta_i = E[x_i] \ (i = 1, \ldots, n), \quad \eta_{ij} = E[x_i x_j] \ (i < j), \quad \ldots, \quad \eta_{12\cdots n} = E[x_1 \cdots x_n],$$

which has $2^n - 1$ components (see section 2); in other words,

$$\eta = (\eta_i, \eta_{ij}, \ldots, \eta_{12\cdots n})$$

forms the $\eta$-coordinate system in $S_n$. We also write $\eta = (\eta_1, \eta_2, \ldots, \eta_n)$, which is linearly related to $\{p_{i_1 \cdots i_n}\}$.

In the rest of this section, we provide some general concepts of information geometry; readers interested in more detail can refer to Amari and Nagaoka (2000). When a submanifold of $S_n$, denoted by $E$, is represented by linear constraints among the $\theta$-coordinates, $E$ is called exponentially flat, or e-flat. When a submanifold of $S_n$, denoted by $M$, is represented by linear constraints among the $\eta$-coordinates, $M$ is called mixture flat, or m-flat.
The Fisher information matrices in the respective coordinate systems play the role of Riemannian metric tensors. The two coordinate systems $\theta$ and $\eta$ are dually coupled in the following sense. Let $A, B, C, \ldots$ denote ordered subsets of indices, which label the components of $\theta$ and $\eta$, that is, $\theta = (\theta_A)$, $\eta = (\eta_B)$.

Theorem 8. The two metric tensors $G(\theta)$ and $\bar G(\eta)$ are mutually inverse:

$$\bar G(\eta) = G(\theta)^{-1}, \tag{5.3}$$

where $G(\theta) = (g_{AB}(\theta))$ and $\bar G(\eta) = (\bar g_{AB}(\eta))$ are defined by

$$g_{AB}(\theta) = E\left[\frac{\partial \log p(x; \theta)}{\partial \theta_A} \frac{\partial \log p(x; \theta)}{\partial \theta_B}\right], \qquad \bar g_{AB}(\eta) = E\left[\frac{\partial \log p(x; \eta)}{\partial \eta_A} \frac{\partial \log p(x; \eta)}{\partial \eta_B}\right].$$

The following generalized Pythagorean theorem is known in $S_n$ (Csiszár, 1967, 1975; Amari et al., 1992; Amari & Han, 1989). It holds in more general cases and plays an important role in information geometry (Amari, 1987; Amari & Nagaoka, 2000).

Theorem 9. Let $p(x)$, $q(x)$, and $r(x)$ be three distributions such that the m-geodesic connecting $p(x)$ and $q(x)$ is orthogonal to the e-geodesic connecting $q(x)$ and $r(x)$. Then

$$D[p : q] + D[q : r] = D[p : r]. \tag{5.4}$$
5.2 Higher-Order Interactions. This section defines the higher-order interactions using the k-cut mixed coordinate system. Section 5.1 introduced $\theta = (\theta_1, \ldots, \theta_n)$ and $\eta = (\eta_1, \ldots, \eta_n)$, each of which spans $S_n$. Let us define their partitions, called a k-cut, as follows:

$$\theta = (\theta_{k-};\ \theta_{k+}), \qquad \eta = (\eta_{k-};\ \eta_{k+}), \tag{5.5}$$

where $\theta_{k-}$ and $\eta_{k-}$ consist of the coordinates whose subindices have no more than $k$ indices, that is, $\theta_{k-} = (\theta_1, \ldots, \theta_k)$, $\eta_{k-} = (\eta_1, \ldots, \eta_k)$, and $\theta_{k+}$ and $\eta_{k+}$ consist of the coordinates whose subindices have more than $k$ indices, that is, $\theta_{k+} = (\theta_{k+1}, \ldots, \theta_n)$, $\eta_{k+} = (\eta_{k+1}, \ldots, \eta_n)$.

First, note that $\eta_{k-}$ specifies the marginal distributions of any $k$ (or fewer) random variables among $X_1, \ldots, X_n$. Let us consider a family of m-flat submanifolds in $S_n$:

$$M_k(m_k) = \{\eta \mid \eta_{k-} = m_k\}.$$
It consists of all the distributions having the same k-marginals, specified by a fixed $\eta_{k-} = m_k$; they differ from one another only in the higher-order interactions of more than $k$ variables. Second, all coordinate curves represented by $\theta_{k+}$ are orthogonal to $\eta_{k-}$, that is, to any component of $\eta_{k-}$. Hence, $\theta_{k+}$ represents the interactions among more than $k$ variables independently of the k-marginals $\eta_{k-}$. Then, for a constant vector $c_k$, let us compose a family of e-flat submanifolds,

$$E_{k+}(c_k) = \{\theta \mid \theta_{k+} = c_k\}.$$

Third, $E_{k+}(c_k)$ and $M_k(m_k)$ are mutually orthogonal, and we introduce a new coordinate system, called the k-cut mixed coordinate system, defined by

$$\zeta_k = (\eta_{k-};\ \theta_{k+}).$$

Any k-cut mixed coordinate system forms a coordinate system of $S_n$. A change in the $\theta_{k+}$ part preserves the k-marginals of $p(x)$ (i.e., $\eta_{k-}$), while a change in the $\eta_{k-}$ part preserves the interactions among more than $k$ variables. These changes are mutually orthogonal. Thus, $E_{k+}(\theta_{k+})$ is regarded as the submanifold consisting of the distributions having the same degree of higher-order interactions. When $\theta_{k+} = 0$, $E_{k+}(0)$ denotes the set of all the distributions having no intrinsic interactions of more than $k$ variables.

5.3 Projections and Decompositions of Higher-Order Interactions. Given $p(x)$, we define $p^{(k)}(x) = p^{(k)}$ by

$$p^{(k)}(x) = \arg\min_{q \in E_{k+}(0)} D[p : q].$$

This is the point closest to $p$ among those that have no intrinsic interactions of more than $k$ variables. We note that another characterization of $p^{(k)}$ is given by

$$p^{(k)}(x) = \arg\min_{q \in M_k(\eta_{k-}^p)} D[q : p^{(0)}],$$

where it is easy to see that $p^{(0)}$ is the uniform distribution, by the definition of $p^{(0)}$. The e-geodesic connecting $p^{(k)}$ and $p^{(0)}$ is orthogonal to $M_k(\eta_{k-}^p)$, to which the original $p$ belongs. The k-cut mixed coordinates of $p^{(k)}$ are given by $\zeta_k(p^{(k)}) = (\eta_{k-}^p,\ \theta_{k+} = 0)$. The degree of the interactions higher than $k$ is hence defined by $D[p : p^{(k)}]$. Since the m-geodesic connecting $p$ and $p^{(k)}$ is orthogonal to $E_{k+}(0)$, the Pythagorean theorem guarantees the following decomposition:

$$D[p : p^{(0)}] = D[p : p^{(k)}] + D[p^{(k)} : p^{(0)}].$$

Let us put

$$D_k(p) = D[p^{(k)} : p^{(k-1)}].$$
Then $D_k(p)$ is interpreted as the degree of the interactions purely among $k$ variables. We then have the following decomposition, in which $D_k(p)$ denotes the degree of interaction among $k$ variables.

Theorem 10.

$$D[p : p^{(0)}] = \sum_{k=1}^{n} D_k(p). \tag{5.6}$$

It is straightforward to generalize the above results when we are given two distributions $p(x)$ and $q(x)$. Let us define $p^{(k')}$ by its k-cut mixed coordinates,

$$\zeta_k(p^{(k')}) = (\eta_{k-}(p);\ \theta_{k+}(q)).$$

Then we have

$$D[p : q] = D[\zeta_k^p : \zeta_k^q] = D[\zeta_k^p : \zeta_k(p^{(k')})] + D[\zeta_k(p^{(k')}) : \zeta_k^q],$$

which follows from theorem 9. By defining

$$D_k'(p) = D[p^{(k')} : p^{((k-1)')}],$$

we obtain

$$D[p : q] = \sum_{k=1}^{n} D_k'(p). \tag{5.7}$$
The decompositions shown in equations 5.6 and 5.7 are obviously similar to each other. A critical difference, however, lies in their interpretation. Each term $D_k(p)$ in equation 5.6 represents the degree of the purely kth-order interaction, whereas $D_k'(p)$ in equation 5.7 does not necessarily do so. This is because $\zeta_k(p^{(k)})$ always has $\theta_{k+} = 0$, that is, zero in the coordinates of order higher than $k$, whereas $\zeta_k(p^{(k')})$ does not necessarily have zero in the corresponding part. In other words, $\theta_k$ represents the pure kth-order interaction only if $\theta_{k+} = 0$.

5.4 Application to Neural Firing. To apply the results of the above sections to neural firing data, the discussion for the case of three neurons can be directly extended; hence, we mainly provide some remarks in this section. First, suppose that we have an estimated probability distribution of $n$ neurons, denoted by $p(x; \hat\xi)$, and the probability distribution of our null hypothesis, $p(x; \xi^0)$. Then, using the k-cut mixed coordinates, we obtain the decomposition

$$D[\xi^0 : \hat\xi] = D[\zeta^0 : \hat\zeta_k'] + D[\hat\zeta_k' : \hat\zeta],$$
where we define $\hat\zeta_k' = (\eta_{k-}^0;\ \hat\theta_{k+})$. In this decomposition, $D[\zeta^0 : \hat\zeta_k']$ represents the discrepancy between $\xi^0$ and $\hat\xi$ in the interactions of order higher than $k$, and $D[\hat\zeta_k' : \hat\zeta]$ that of order $k$ and lower. We can also convert these divergences into $\chi^2$ tests. For example, we have

$$\lambda_{k+} = 2N D[\zeta^0 : \hat\zeta_k'] \approx N \sum_{A, B > k} \tilde g_{AB}(\zeta^0)\,(\zeta_A^0 - \hat\zeta_A)(\zeta_B^0 - \hat\zeta_B) \sim \chi^2(m), \tag{5.8}$$

where $N$ is the number of trials, $A$ and $B$ run over all of the indices of order higher than $k$ (those included in $\theta_{k+}$), and the degrees of freedom $m$ of the $\chi^2$ test are given by $m = \sum_{l=k+1}^{n} \binom{n}{l}$. By using $\lambda_{k+}$, we can test whether there is any significant contribution of higher-order interactions, summing over all interactions of order higher than $k$.

In equation 5.8, $\tilde g_{AB}$ corresponds to the block, denoted by $D_{\zeta_k}$, of the Fisher information matrix $G_{\zeta_k}$ of the k-cut mixed coordinates. Let us write the Fisher information matrices of the different coordinate systems in block form as follows:

$$G_{\zeta_k} = \begin{pmatrix} A_{\zeta_k} & O \\ O & D_{\zeta_k} \end{pmatrix}, \qquad G_\theta = \begin{pmatrix} A_\theta & B_\theta \\ B_\theta^T & D_\theta \end{pmatrix}, \qquad G_\eta = \begin{pmatrix} A_\eta & B_\eta \\ B_\eta^T & D_\eta \end{pmatrix}.$$

The following theorem gives an explicit form of $G_{\zeta_k}$.

Theorem 11. The Fisher information matrix of the k-cut mixed coordinates, $G_{\zeta_k}$, is given by

$$A_{\zeta_k} = A_\theta^{-1}, \qquad D_{\zeta_k} = D_\eta^{-1}. \tag{5.9}$$
Note that $G_\theta$ is easy to obtain from the experimental data, because component-wise $g_{AB}^\theta = E[X_A X_B] - \eta_A \eta_B$. Given $G_\theta$, the computation of $G_\eta$ may also be said to be easy, in the sense that $G_\eta = G_\theta^{-1}$. Thus, we can easily compute $A_{\zeta_k}$ and $D_{\zeta_k}$.

Suppose two probability distributions are given, estimated under different experimental conditions and denoted by $p(x; \hat\xi^1)$ and $p(x; \hat\xi^2)$. Our task is to test whether any interaction of order higher than $k$ differs significantly between the two distributions. In this case, we first need to solve the following MLE problem:

$$(\hat\zeta_k^{1'}, \hat\zeta_k^{2'}) = \arg\max \sum_{j=1}^{N} \log p(X_j; \zeta_k^1)\, p(X_j; \zeta_k^2) \quad \text{subject to } \theta_{k+}^1 = \theta_{k+}^2. \tag{5.10}$$
The $\chi^2$ test is then given by

$$\lambda_{k+}' = 2N D[\hat\zeta_k^{1'} : \hat\zeta_k^1] + 2N D[\hat\zeta_k^{2'} : \hat\zeta_k^2] \approx N \sum_{A,B > k} \tilde g_{AB}(\hat\zeta_k^{1'})(\zeta_A^{1'} - \hat\zeta_A^1)(\zeta_B^{1'} - \hat\zeta_B^1) + N \sum_{A,B > k} \tilde g_{AB}(\hat\zeta_k^{2'})(\zeta_A^{2'} - \hat\zeta_A^2)(\zeta_B^{2'} - \hat\zeta_B^2) \sim \chi^2(2m),$$

where

$$m = \sum_{l=k+1}^{n} \binom{n}{l}.$$

To relate the neural firing to a discrete behavioral choice, denoted by $Y$, we have the following decomposition of the mutual information.

Theorem 12.

$$I(X, Y) = I_{k+}(X, Y) + I_{k-}(X, Y), \tag{5.11}$$

where we define

$$I_{k+}(X, Y) = E_{p(Y)}\big[D[\zeta_k(X \mid y) : \zeta_k(X, y)]\big], \qquad I_{k-}(X, Y) = E_{p(Y)}\big[D[\zeta_k(X, y) : \zeta_k(X)]\big],$$

and $\zeta_k(X, y) = (\eta_{k-}(X \mid y);\ \theta_{k+}(X))$.

5.5 Homogeneous Case. This section discusses an approach under the assumption of homogeneity of the neural activities, which is useful for avoiding some practical difficulties (Grün & Diesmann, 2000). Here, homogeneity refers to the assumption of homogeneous neural firing: that the interaction of a given kth order is the same among all k-tuples of neurons. For simplicity, we mostly assume below that the interaction of any kth order is the same among all k-tuples of neurons, which will be referred to as the full homogeneity assumption.

It is easy, using Rota's method from set theory in relation to the principle of inclusion and exclusion (Amari, 2001), to write down explicitly the coordinate transformations between the P- and $\eta$-coordinates and between the P- and $\theta$-coordinates. Hence, it is possible in principle to transform between the $\eta$- and $\theta$-coordinates, and between the k-cut mixed and P-coordinates. Two practical difficulties remain, however. First, the computational cost of these coordinate transformations increases exponentially. Second, the limitation in the number of samples becomes severe in estimating the values of any coordinate system as the number of neurons increases.
One approach to overcoming these difficulties is to use the homogeneity assumption. Under the full homogeneity assumption, the $2^n - 1$ dimensions of the n-neuron case reduce to $n$ dimensions. The coordinate transformations in this case are given by theorem 13.

Theorem 13.

$$\eta_k = \sum_{l=0}^{n-k} \binom{n-k}{l} P_{k+l}, \qquad \log P_k = \sum_{l=1}^{k} \binom{k}{l} \theta_l - \psi, \tag{5.12}$$

where $\binom{n}{k}$ denotes the binomial coefficient. Equivalently, we also have

$$P_k = \sum_{l=0}^{n-k} \binom{n-k}{l} (-1)^l\, \eta_{k+l}, \qquad \theta_k = \sum_{l=0}^{k} \binom{k}{l} (-1)^{k-l} \log P_l. \tag{5.13}$$
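Under full homogeneity, the transforms of theorem 13 are a few lines each. The sketch below is our own illustration; we take $P_k$ to be the probability of one specific firing pattern with exactly $k$ ones (an assumption about the paper's notation). For independent homogeneous neurons with rate $a$, the transforms give $\eta_k = a^k$, $\theta_1 = \log\frac{a}{1-a}$, and $\theta_k = 0$ for $k \ge 2$, and the $\eta \leftrightarrow P$ round trip is exact.

```python
import math

def eta_from_P(P, n):
    """eta_k = sum_l C(n-k, l) P_{k+l}  (theorem 13, eq. 5.12)."""
    return [sum(math.comb(n - k, l) * P[k + l] for l in range(n - k + 1))
            for k in range(n + 1)]

def P_from_eta(eta, n):
    """P_k = sum_l C(n-k, l) (-1)^l eta_{k+l}  (eq. 5.13)."""
    return [sum(math.comb(n - k, l) * (-1) ** l * eta[k + l]
                for l in range(n - k + 1)) for k in range(n + 1)]

def theta_from_P(P, n):
    """theta_k = sum_l C(k, l) (-1)^(k-l) log P_l  (eq. 5.13), k = 1, ..., n."""
    return [sum(math.comb(k, l) * (-1) ** (k - l) * math.log(P[l])
                for l in range(k + 1)) for k in range(1, n + 1)]

# Independent, fully homogeneous neurons with rate a:
# P_k = a^k (1-a)^(n-k) is the probability of one specific pattern with k ones.
n, a = 4, 0.3
P = [a ** k * (1 - a) ** (n - k) for k in range(n + 1)]

eta = eta_from_P(P, n)         # eta_k = a^k for independent neurons
P_back = P_from_eta(eta, n)    # round trip recovers P
theta = theta_from_P(P, n)     # theta_1 = log(a/(1-a)); theta_k = 0 for k >= 2
print(eta[:3], theta)
```

Because $\log P_k$ is linear in $k$ for independent neurons, all higher-order $\theta_k$ (finite differences of order $k \ge 2$) vanish, as expected.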
All of the results in the previous sections can be rewritten by using these simplied coordinates. 5.6 Interesting Set of Neurons. One important question in this n-neurons case is how to nd a set of neurons whose ring shows signicant coincident ring. We discuss one practical approach here, mentioning several mathematical tricks to overcome some practical issues. In practice, this question may occur either when (1) we simply want to give an answer to the question by some rigorous tests or (2) when, in facing a vast amount of data, we only want to nd some candidates of interesting sets of neurons for further investigation. In this second case, it is perhaps unnecessary to be too rigorous; simplicity is preferred, although some complexity is inevitably involved. A method for this purpose is particularly needed given the increasing number of neurons simultaneously recorded in experiments. This section is formulated accordingly. One obvious approach is to nd a signicant kth order interaction (k < n) and group a signicant k-tuple of the neurons as an interesting set of neurons. A basic procedure is given as follows: 1. Given the population of recorded neurons, calculate the h coordinates and nd nonzero h of the largest order, denoted by h kA , where A indicates a specic k-tuple and k is the order. 2. There can be multiple h kA in the same kth order, but we discard any interaction between them. Under this circumstance, each A is specied by ³ k D ( ´ A I h A ). k¡ k 3. We set our null hypothesis, ³ 0k D ( ´ A I h 0 ) , and then compute the  2 k¡ k value to test whether hkA is signicant against hk0 .
4. If A is found to be significant, we nominate A as an interesting set of neurons, denoted by A*.
2298
Hiroyuki Nakahara and Shun-ichi Amari
5. If A is insignificant, we go down to the lower order. In doing so, we omit the neurons that are already registered in {A*} (i.e., any subset of A*) from consideration. Given these, when we find the nonzero h of the next largest order, h_k^{A′}, we repeat from step 2.

In the above procedure, there are some practical concerns to resolve. First, the coordinate transformation from ζ_k to p becomes computationally expensive as the number of neurons increases. In our procedure, this transformation is needed to compute the χ² value. The χ² value is obtained by using the Fisher metric component at ζ_k^0, which is computed by use of this transformation (i.e., ζ_k^0 to p^0). We now recall that the reason we choose to compute the Fisher metric component at ζ_k^0 comes from the hypothesis-testing formalism and that the χ² value is a quadratic approximation of the KL divergence. Therefore, for our testing, we can in fact use the Fisher metric at ζ_k to compute the χ² value,

    λ = 2N D[ζ_k^0 : ζ_k] ≈ N g(ζ_k) (h_k^0 − h_k^A)² ~ χ²(1).    (5.14)
The value of λ is easy to compute because all coordinates at ζ_k are easily obtained from the data.

Second, we should ask how to set h_k^0, that is, our null hypothesis. There are two choices. One is to choose h_k^0 = 0, which is the hypothesis of no purely kth-order interaction. The other is to obtain the value of h_k under the homogeneity assumption, denoted by ζ_k^* = (η_{k−}^*; h_k^*), and set h_k^0 = h_k^*. Both choices are feasible, but the underlying assumption differs in each choice.

Third, a practical concern is the limited number of trials. One practical solution is to use the homogeneity assumption. For example, when we are interested only in which order should be considered as the order of the interesting set of neurons, we suggest using ζ_k^* to compare against the null hypothesis of no purely kth-order interaction, that is, ζ_k^0 = (η_{k−}^*; 0). As another example, in case we happen to find a specific k-tuple A firing (h_k^A ≠ 0), we may use ζ_k = (η_1^A; η_2^*, …, η_{(k−1)}^*; h_k^A) under the partial homogeneity assumption to compare against the null hypothesis ζ_k^0 = (η_1^A; η_2^*, …, η_{(k−1)}^*; 0).

6 Examples

In this section, we demonstrate our method using artificial data. (More examples, including application of the proposed method to experimental data and also to autocorrelation, are available as a technical report: Nakahara, Amari, Tatsuno, Kang, & Kobayashi, 2002.)
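The artificial data in the examples below are generated bin by bin from assumed coordinates (g1, g2, g12), as detailed in the Figure 2 legend. A minimal sketch of such a generator (our own illustration; the function and variable names are not from the article):

```python
import random

def joint_from_g(g1, g2, g12):
    """Joint probabilities (p00, p01, p10, p11) from the coordinates
    g1 = P(x1=1), g2 = P(x2=1), g12 = P(x1=1, x2=1)."""
    p11 = g12
    p10 = g1 - g12
    p01 = g2 - g12
    p00 = 1.0 - p11 - p10 - p01
    return p00, p01, p10, p11

def sample_bin(g1, g2, g12, rng):
    """Draw one binary pattern (x1, x2) for a single time bin."""
    p00, p01, p10, p11 = joint_from_g(g1, g2, g12)
    u = rng.random()
    if u < p11:
        return 1, 1
    if u < p11 + p10:
        return 1, 0
    if u < p11 + p10 + p01:
        return 0, 1
    return 0, 0

rng = random.Random(0)
# Period d of Figure 2: (g1, g2, g12) = (0.12, 0.12, 0.0260).
draws = [sample_bin(0.12, 0.12, 0.0260, rng) for _ in range(200000)]
```

Over many draws, the empirical frequencies of x1 = 1, x2 = 1, and the coincidence x1 = x2 = 1 approach the assumed g1, g2, and g12.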
Information-Geometric Measure for Neural Spikes
2299
6.1 Example 1: Firing of Two Neurons.

6.1.1 Coincident Firing. In this simulation, we aim to demonstrate the relation between the correlation coefficient and h, and hypothesis testing under different null hypotheses. Figures 2A and 2B give the mean firing frequency of two neurons and their correlation coefficients (COR; the N-JPSTH), respectively (see the legend for spike generation). Period a is the control period. The neural firing in this period presumably indicates the resting-level activity, which was set to have a very weak correlation here. The firing is almost independent in both periods b and c, whereas it is not independent in period d, whose COR is larger than that of period a. Figure 2C shows h3, the quantity in our measure that indicates the pairwise interaction. At first glance, the time course of h3 may look similar to the COR. A careful inspection, however, shows that they are different (e.g., the relative magnitudes between periods a and d are different for COR and h). This is because (g1, g2, h3) and (g1, g2, COR) form different coordinate systems, although both h3 and COR represent the correlational component.

The KL divergence is used to measure the discrepancy between two probability distributions: the distribution to be examined and that of the null hypothesis. Using the orthogonality of h3 with (g1, g2), we can decompose the KL divergence into two terms, one representing the discrepancy in the correlational component and the other the discrepancy in the mean firing. We call the former the KL divergence in correlation for convenience. We first examine the KL divergence in correlation against the null hypothesis of independent firing (i.e., h3 = 0), which is the probability distribution with the same marginals as the examined probability but with independent firing. This KL divergence, denoted by KL1, is indicated by the solid line in Figure 2D. Compared with h3, KL1 takes into account the metric in the probability space.
In Figure 2D, we also indicate by the dashed line the corresponding p-value, derived from the χ²(1) distribution (e.g., if the p-value reaches 0.95, it is significant with p < 0.05). We observe that period d is significant in Figure 2D.

We now examine the KL divergence in correlation against the null hypothesis of the averaged activity (including both averaged mean firing rates and averaged coincident firing) in the control period. In Figure 2E, this KL divergence in correlation, denoted by KL2(1), is indicated by a solid line, while the corresponding p-value, derived from χ²(1), is indicated by a dashed line. In Figure 2E, period d is no longer significant under this null hypothesis. Period c now is clearly significant, whereas period b is barely significant, although h3 is the same between the two periods (see Figure 2C). This is because the mean firing rates are different between periods b and c, and the Fisher information metric takes the geometrical structure into account: not only the degree of the coincident firing (h3) but also the marginal probability (i.e., the mean firing rates).
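The KL1 test against independent firing can be sketched directly from the joint probabilities, with the p-value obtained from 2N·KL1 and the χ²(1) distribution. This is our own illustration (not code from the article), using the period parameters given in the Figure 2 legend:

```python
from math import log, sqrt, erf

def kl1_independence(g1, g2, g12):
    """KL divergence of the distribution with coordinates (g1, g2, g12)
    from the independent distribution with the same marginals
    (the null hypothesis h3 = 0)."""
    p = (1 - g1 - g2 + g12, g2 - g12, g1 - g12, g12)            # p00, p01, p10, p11
    q = ((1 - g1) * (1 - g2), (1 - g1) * g2, g1 * (1 - g2), g1 * g2)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def p_value_chi2_1(lam):
    """CDF of the chi-square distribution with 1 degree of freedom at lam."""
    return erf(sqrt(lam / 2.0))

N = 2000  # number of trials, as in Figure 2
lam_d = 2 * N * kl1_independence(0.12, 0.12, 0.0260)  # period d
lam_b = 2 * N * kl1_independence(0.04, 0.04, 0.0016)  # period b: g12 = g1 * g2
```

With these numbers, period d is significant (its p-value exceeds 0.95, i.e., p < 0.05), while in period b the coincidence probability equals the product of the marginals, so λ vanishes.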
A comparison between Figures 2D and 2E illustrates a simple fact: what should be considered a significant event depends on what is taken as the null hypothesis. In Figure 2E, where the p-value is from the χ²(1) formulation, the underlying assumption is that the averaged activity in the control period, which is estimated from data in practice, is regarded as the true average activity in the control period (see section 3.5). When we are interested in comparing significances between the two null hypotheses of independent firing and of the averaged activity in the control period, we suggest that the comparison of p-values from χ²(1) is informative, as in Figures 2D and 2E. In contrast, since we estimate the average activity in the control period from data in practice, a more proper and more conservative hypothesis test is to use a different formulation of the likelihood ratio test. This results in using the p-value associated with χ²(2) (see section 3.5). In Figure 2F, we indicate by a solid line KL2(2), the KL divergence in correlation that corresponds to this χ²(2) formulation. The dashed line indicates the p-value, now from χ²(2). We can observe that the p-value in Figure 2F gives a more conservative estimate compared with the p-value in Figure 2E.

Figure 2: Example of a two-neuron case to detect significant pairwise correlation. The spikes of two neurons were generated such that whether a spike exists or not in each bin (1 ms bin width is used) was probabilistically determined in each trial (where the number of all trials was 2000), given an assumed probability (g1, g2, g12) in each period a–d. Period a (0–100 ms) is with (g1, g2, g12) = (0.04, 0.04, 0.0031). Period b (100–300 ms) is with (0.04, 0.04, 0.0016). Period c (300–500 ms) is with (0.12, 0.12, 0.0144). Period d (500–700 ms) is with (0.12, 0.12, 0.0260). To estimate the probabilities from artificially sampled data, averaged values in each bin were obtained over all trials, and the value in each bin was then finally determined by smoothing over several bins (set as 25 ms). (A) Mean firing frequency for two neurons. Because the mean firing frequency of the two neurons was the same, the two lines are superimposed with very little fluctuation. (B) Correlation coefficient. (C) h3 (= h12). (D) KL divergence in correlation against the null hypothesis of independent firing and the corresponding p-value, derived from χ²(1), are indicated by solid and dashed lines, respectively (see the text). (E) KL divergence in correlation against the null hypothesis of the averaged activity in the control period and the corresponding p-value, from χ²(1), are indicated by solid and dashed lines, respectively. (F) Using the formulation that the firing in the control period and in other periods is from the same correlation level (see section 3.5), the p-value, derived from χ²(2), is indicated by a dashed line, while the sum of the corresponding KL divergences is indicated by a solid line. Arrows at the right-hand side of B and C indicate the true values (note that some arrows are superimposed since they are close to each other). In all examples in this section, the values of the g coordinates are provided in each figure legend. The g coordinates can be easily converted to the P coordinates used to generate sampled data, by which we obtained the estimated values of any coordinates in the figures, and given the P coordinates, it is simple to compute true h values (e.g., equations 4.3–4.6 for Figure 4). In all examples, the estimated h values reasonably match the true ones (e.g., Figure 6C).
6.1.2 Mutual Information Between Firing and Behavior. Figure 3 shows the decomposition of mutual information (MI) between firing and behavior, using artificial data. We assumed only two choices for the behavior, denoted by s1 and s2. Figures 3A and 3B show the mean firing frequency with respect to s1 and s2, respectively. The mean firing of both neurons is the same between s1 and s2 in period a, assumed to be the control period. In periods b
Figure 3: Example of a two-neuron case to obtain mutual information (MI) between firing and behavior and its decomposition. The spikes of two neurons were generated and estimated in a similar manner to Figure 2. The number of stimulus conditions is assumed to be two, denoted by s1 and s2, and the number of trials per stimulus was 500. In period a (0–100 ms), the assumed probabilities (g1, g2, g12) for s1 and s2 were given as (0.02, 0.02, 0.002) and (0.02, 0.02, 0.002), respectively. In period b (100–300 ms), (0.08, 0.02, 0.004) and (0.02, 0.08, 0.004); in period c (300–500 ms), (0.08, 0.02, 0.002) and (0.02, 0.08, 0.015). (A, B) Mean firing frequency of the two neurons with respect to s1 (A) and s2 (B), respectively. A solid line indicates the mean firing of one neuron and a dashed-dotted line the mean firing of the other neuron. A dashed line indicates the mean rate of coincident firing. (C) MI. The (total) MI is indicated by a solid line and is decomposed into two terms: the MI by the modulation of the mean firing rate (dashed line) and the MI by the modulation of the pairwise correlation (dashed-dotted line).
and c, assumed to be the test periods, we have set the mean firing of the two neurons as a somewhat mirror image between s1 and s2. The mean firing of each neuron stays the same in the test periods. Notably, however, the mean coincident firing increases in period c only when s2 is given (period c in Figure 3B).

In Figure 3C, we show the (total) MI between firing and behavior and its decomposition. The total MI (the solid line in Figure 3C) exists in periods b and c. Its magnitude is larger in period c than in period b, although the mean firing of each neuron stays the same in both periods. This is because
the coincident firing is modulated by the behavioral choices only in period c. This observation can be directly examined by looking into the decomposed MIs (see the figure legend). In brief, we observe that the increase in the total MI in period c comes almost exclusively from the part of the MI due to the modulation of the coincident firing (indicated by the dashed-dotted line in Figure 3C).

6.2 Example 2: Firing of Three Neurons.

6.2.1 Inspection of Triplewise Interaction. Figure 4A gives η = (η1, η2, η3), where the firing is assumed to be homogeneous for simplicity. Period a is assumed to be the control period. Figure 4B shows COR. The COR in the control period is almost zero, while the CORs in periods b and c are almost the same as each other, both being different from zero. Yet when we look more carefully into the interaction, using hij (see Figure 4C) and h123 (see Figure 4D), we see that the nature of the interaction is largely different between periods b and c. The triplewise interaction, h123, shown in Figure 4D, is nearly zero in periods a and b. Hence, hij in Figure 4C indicates the purely pairwise correlation in these periods. On the other hand, since h123 is not zero in period c, hij in this period no longer represents the purely pairwise correlation, although it is still correct to say that the pairwise correlation differs between periods b and c simply by observing that the hij differ in the two periods. The fact that h123 is not zero in period c indicates that a purely triplewise interaction exists in this period, where h123 is negative so that the triplewise interaction is negative.

We can make the above observations more quantitative. In Figure 4E, the p-value, derived from χ²(2), measures the triplewise interaction against the null hypothesis of the activity in the control period.
Here we used the decomposition of k-cut = 2, which separates the triplewise interaction from the other orders, which take the mean firing rates and the pairwise interaction together (see section 4.1). We observe that the triplewise coincident firing becomes significant only in period c. Figure 4F indicates the p-value from χ²(8), which measures the triplewise and pairwise interactions together. Here we used the decomposition of k-cut = 1, which separates the mean firing rates (first order) from the other orders, which take the pairwise and triplewise interactions together (see section 4.2). We can see that both periods b and c now become significant.

6.2.2 Mutual Information Between Behavior and Firing. Figure 5 shows the decomposition of the MI between firing and behavior. In each of two stimulus conditions, denoted by s1 and s2, the neuron firing is assumed to be homogeneous for simplicity. Figures 5A and 5B show the mean frequency of single-neuron firing, pairwise firing, and triplewise firing with respect to s1 and s2, respectively. Period a was assumed to be the control period,
while periods b through d were assumed to be test periods. For s2, the mean frequency is the same over all periods. For s1, compared with period a, only the mean frequency of single-neuron firing is different in period b; only the mean frequency of pairwise firing is different in period c; and only the mean frequency of triplewise firing is different in period d. The total MI is shown by the solid line in Figures 5C and 5D, while the decomposed MIs by k-cut = 2 and k-cut = 1 are shown in Figures 5C and 5D, respectively. In period b, Figure 5D indicates that most of the behavioral information is carried by modulation of single-neuron firing (the dashed
line in the figure), although some information is also carried by modulation of the other orders together (dotted line; see the legend). By inspection of period b in Figure 5C, we can further observe that the behavioral information is mostly carried by the modulation taking single-neuron firing and pairwise firing together (the dashed line, which is almost the same as the solid line, i.e., the total MI) but is not really carried by triplewise firing (dotted line). We can inspect periods c and d in a similar manner. In brief, Figures 5C and 5D together indicate that most behavioral information is carried by the modulation taking pairwise and triplewise firing together in period c and is carried by the modulation of triplewise firing in period d.

6.3 Example 3: Interesting Set of Neuron Firing. Here, we demonstrate our method to find an interesting set of neuron firing and also illustrate some practical issues in data analysis. There are two starting assumptions for this demonstration. First, the number of simultaneously recorded neurons is assumed to be 10. Second, the number of trials is assumed to be 1200. The number of trials is a severe limitation in real data for detecting higher-order interactions, and we consider 1200 conceivable, or at least not impossible. We follow the procedure set out in section 5.6 under the assumption of fully homogeneous firing and ask whether any significant order of firing exists against the null hypothesis of independent firing at the questioned order. Note that it would be very difficult to investigate our data faithfully, because the assumed number of trials is 1200 and the dimension of 10 neurons is 2^10 − 1 = 1023. The full homogeneity assumption reduces the dimension of n-neuron firing from 2^n − 1 to n and thereby circumvents the undersampling problem.

Figure 4: Example of a three-neuron case to detect significant triplewise interaction. The spikes of three neurons were generated and estimated in a similar manner to Figure 2. (A) The g coordinates. Since we treated a homogeneous case here for simplicity, (gi, gij, gijk) is shown from top to bottom, superimposed with the same order of g coordinates. (B) Correlation coefficients of the three pairs of neural firing are shown, superimposed. (C) The second order, hij = {h12, h13, h23}. (D) The third order, h123. (E) The p-value, from χ²(2), indicating triplewise coincident firing against the null hypothesis of the average activity in the control period. (F) The p-value, from χ²(8), indicating pairwise and triplewise coincident firing together against the null hypothesis of the average activity in the control period. As for spike data generation, the number of trials is set as 2000, and the spike probability is assumed to be homogeneous in each period; (gi, gij, gijk) = (0.0450, 0.00258, 0.00020) in period a, (0.0450, 0.0090, 0.0040) in period b, and (0.0449, 0.0090, 0.0001) in period c. The arrows at the right-hand side in B, C, and D indicate true values (see the Figure 1 legend).
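The sign and magnitude of h123 in the three periods can be checked numerically from the seed probabilities in this legend. A sketch (our own illustration, not code from the article) that builds the homogeneous joint distribution over the 2³ patterns and applies the log-linear expansion:

```python
from itertools import product
from math import log

def h123_homogeneous(gi, gij, gijk):
    """Triplewise interaction h123 for a homogeneous three-neuron distribution.
    P[k] is the probability of one particular pattern with exactly k spikes,
    obtained from (gi, gij, gijk) by inclusion-exclusion."""
    P3 = gijk
    P2 = gij - gijk
    P1 = gi - 2 * gij + gijk
    P0 = 1 - 3 * gi + 3 * gij - gijk
    P = [P0, P1, P2, P3]
    # Log-linear expansion: h123 = sum over patterns x of
    # (-1)^(3 - #spikes(x)) * log p(x).
    return sum((-1) ** (3 - sum(x)) * log(P[sum(x)])
               for x in product((0, 1), repeat=3))
```

With the seed values above, h123 comes out close to zero in periods a and b and clearly negative in period c, matching the discussion in section 6.2.1.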
We specified probability distributions over three periods of a trial to generate spike data (see the figure legend) and call them seed probabilities for convenience. Figure 6A indicates the estimated mean firing frequency of a single neuron (i.e., the first-order mean firing frequency) given the generated sample data, and Figure 6B indicates the estimated mean firing frequencies from the second to the fifth order, which appear from top to bottom in the figure. Period a is assumed to be the control period, in which the seed probability is independent firing. While the first-order mean firing frequency is the same in periods b and c as in period a, the higher-order mean firing frequencies are somewhat different, yet do not look so different in Figure 6B. We will see, however, that the intrinsic probabilistic structure is significantly different.

We can compute the exact p-values in χ²(1) from the seed probabilities (see equation 5.14), some of which (only the relevant ones) are shown in Figure 6C. No significant order exists in period a, because the firing in that period is independent. In period b, the p-value of the fourth-order firing exceeds the 0.95 significance level. In period c, while the p-value of the fourth order drops far below, the p-value of the tenth order exceeds the significance level, and the p-value of the seventh order stays around the significance level.
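Under full homogeneity, the P and g coordinates can be estimated directly from the pooled spike-count histogram. A sketch of such an estimator (our own illustration; the function name is not from the article):

```python
from math import comb

def homogeneous_estimates(trials, n):
    """Estimate P_k and g_k under the full homogeneity assumption.
    trials : iterable of length-n binary tuples (one pattern per trial/bin).
    P_k    : estimated probability of one particular pattern with k spikes.
    g_k    : estimated probability that a given set of k neurons all fire,
             averaged over all k-subsets."""
    trials = list(trials)
    N = len(trials)
    counts = [0] * (n + 1)          # histogram of spike counts per pattern
    for x in trials:
        counts[sum(x)] += 1
    P = [counts[k] / (N * comb(n, k)) for k in range(n + 1)]
    # g_k = E[ C(m, k) ] / C(n, k), where m is the spike count in a trial.
    g = [sum(counts[m] * comb(m, k) for m in range(n + 1)) / (N * comb(n, k))
         for k in range(n + 1)]
    return P, g
```

These estimates are exactly consistent with equation 5.12 on any sample, and an order whose estimated P is zero in the sample (as happens for the eighth to tenth orders below) simply cannot be tested.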
In Figure 6D, we show the corresponding p-values estimated from the sample data for the fourth and seventh orders. The indicated significant periods in both orders (see Figure 6D) overall follow the exact values (see Figure 6C). Although the tenth-order firing should be significant (as in Figure 6C), theoretically at least, it could not be observed in our sampled data due to a sampling problem. Indeed, none of the eighth-, ninth-, or tenth-order firings could be observed in our sampled data (i.e., P̂_8^(10) = P̂_9^(10) = P̂_10^(10) = 0), and we started our procedure from the seventh order in the sampled data. This kind of situation will most likely be encountered in real data analysis and relates to the sampling problem even under the full homogeneity assumption. This example also indicates that significant coincident firing, even if it exists, may not be detected due to the sampling problem. Finally, the other orders do not reach the significance level by either the exact or the estimated values (results not shown).
Figure 5: Example of a three-neuron case to obtain mutual information (MI) between firing and behavior and its decomposition. The spikes of three neurons were generated and estimated in a similar manner to Figure 2. (A, B) Mean firing frequency of the three neurons with respect to s1 (A) and s2 (B). In each figure, the mean rates of single-neuron firing, of pairwise coincident firing, and of triplewise coincident firing are indicated from top to bottom. (C) MI and its decomposition by k-cut = 2. The total MI is indicated by a solid line. The two decomposed MIs, one by modulation taking the mean firing rate and the pairwise interaction together and the other by modulation of the triplewise interaction, are indicated by dashed and dotted lines, respectively. (D) MI and its decomposition by k-cut = 1. The total MI is indicated by a solid line. The two decomposed MIs, one by modulation of the mean firing rate and the other by modulation taking the pairwise and triplewise interactions together, are indicated by dashed and dotted lines, respectively. As for spike data generation, the number of trials per stimulus was 1000, and in each stimulus, the spike probability is assumed to be homogeneous. For s2, the assumed probabilities are (gi, gij, gijk) = (0.08, 0.00704, 0.00041) over all the periods, a–d. For s1, the assumed probabilities are (gi, gij, gijk) = (0.08, 0.00704, 0.00041) in period a. Compared with these probabilities, gi changed to 0.12 in period b, gij changed to 0.00049 in period c, and gijk changed to 0.00700 in period d.

7 Discussion

In this study, we investigated the nature of an information-geometric measure and its application to spike data analysis. By using the dual orthogonality of the natural and expectation parameters, we have shown that we can systematically investigate neuronal firing patterns, considering not only second-order but also higher-order interactions, and we provided a method of hypothesis testing. For this purpose, we used the log-linear model (e.g., equation 3.4 for the two-neuron case and equation 4.2 for the three-neuron case). In this respect, our approach shares some features with previous work (Martignon et
al., 1995, 2000; Del Prete & Martignon, 1998; Deco, Martignon, & Laskey, 1998), but our study explicitly uses the above orthogonality so that the methods become more transparent, more systematic, and easier to use. As a reference, we simply mention that the method we propose here can be related to a maximum entropy principle (Jaynes, 1982) and to testing goodness of fit to a maximum entropy distribution with some constraints, which is also related to the derivation of different types of free energies in statistical physics.

In the case of two neurons, we first showed that our method is able to test whether any pairwise correlation in one period is significantly different from that in another period, where the correlation, as the null hypothesis, is not necessarily zero. Second, the method was shown to be able to relate behavior directly with neuronal firing, using their mutual information (MI). The MI is decomposed into two types of information, conveyed by the mean firing rate and by coincident firing, respectively. Third, the method was extended to the three-neuron case, where we described the details, and we went one step further to the general case of n neurons, where we proposed several approaches to meet practical concerns (also see Gütig, Rotter, & Aertsen, 2001). Fourth, we demonstrated the merits of the method with artificial data.

The notion of third-order and higher-order interactions may be difficult to understand at a glance. For example, even with three neurons, investigation of only the pairwise interaction cannot fully determine their interaction, whatever measures are used. In addition, when we ask whether three neurons bind features of a single object together (Singer & Gray, 1995) or fire together, it seems more natural to ask whether a triplewise interaction exists. Hence, to investigate the functions of even only three neurons' interaction, it is more conclusive and more robust to investigate both pairwise and triplewise interactions.
Thus, to explore the functions of neural interaction in general, we consider that a method of analysis that can fully investigate interactions of any order in relation to behavior is required, with the method as simple, flexible, and systematic to use as possible. That was a motivation of this study.

This study discussed coincident firing, or simultaneous firing, that is, ti = tj (for any i, j) in X = (X1(t1), …, Xi(ti), …, Xn(tn)). This was only for simplicity of presentation. Our method is applicable to a more general case, that is, to any set {ti}, i = 1, …, n. It is then intriguing to ask how many specific firing patterns, or synfire chains, become significant and how much behavioral information these significant firing patterns carry (Richmond et al., 1990; McClurkin & Optican, 1996; McClurkin, Zarbock, & Optican, 1996; Abbott et al., 1996; Oram et al., 1999, 2001; Baker & Lemon, 2000). We are also pursuing this question using experimental data from the prefrontal and dorsal extrastriate visual cortices (Nakahara et al., 2001).

This study represented spike firing patterns by a binary vector; in other words, we dissected continuous time into short time bins. Obviously, we should be careful about how much information we lose by this dissection (Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999) or by the use of a short time window (Panzeri, Schultz, et al., 1999; Panzeri, Treves, et al., 1999; Panzeri & Schultz, 2001). The latter studies use a Taylor series expansion of MI with respect to time. In contrast, the decomposition of different order interactions and MI in
Figure 6: Example of a 10-neuron case to find an interesting set of neurons. The spikes of 10 neurons were generated and estimated in a similar manner to Figure 2. The number of trials was set as 1200. The spike probabilities are assumed to be homogeneous, so we define only 10-dimensional coordinates, which we indicate here by the g coordinates. In all periods (a–c), g1 is fixed as 0.010. In period a, the firing is assumed to be independent, that is, gk = g1^k (k = 2, …, 10). In period b, gk = 0.0125 × (0.285)^(k−2) (k = 2, …, 10). In period c, g2 = 0.0125, g3 = 0.0020, and gk = 0.0003 × (0.5)^(k−4) (k = 4, …, 10), to which there are very small adjustments. (A) Mean firing frequency (of the first order). (B) Mean firing frequencies of the second, third, fourth, and fifth orders, which appear from top to bottom. (C) Theoretically expected p-values in χ²(1) for the fourth (dashed line), seventh (solid line), and tenth (dotted line) orders. (D) Estimated p-values for the fourth (dashed line) and seventh (solid line) orders.
our study is exact given a fixed time-bin width, regardless of its size. Mathematically speaking, coincident firing depends on the time-bin width. The effect of a given bin width in analyzing data is an important subject but is not addressed in this study. In addition, we did not address bias correction of the estimates. MI is known to be overestimated given a limited number of samples. A correcting procedure is needed to estimate MI in our method, too. We can use, in principle, the previously proposed procedures (Optican, Gawne, Richmond, & Joseph, 1991; Kjaer, Hertz, & Richmond, 1994; Treves & Panzeri, 1995; Golomb, Hertz, Panzeri, Treves, & Richmond, 1997; Panzeri & Treves, 1996).

We discussed the method only in cases where no event has zero probability (i.e., pA ≠ 0), for which our theory is mathematically rigorously valid. However, in cases where some event has zero probability, we need some care. Since the dimension of the binary vector increases rapidly, this issue may be a concern. There are at least two cases. In the first case, where we know a certain firing structure (e.g., for some specific sets of neurons), we can incorporate that knowledge in our estimation. Hence, if we know a priori that some probabilities are zero, the method is applicable with a little modification. In the second case, where some zero probabilities are found in estimates due to a limited number of samples, we emphasize that it is a limitation of the data, not of the method of analysis. For example, we cannot estimate all 1023 coordinate components of 10 neurons given only 800 samples. Yet we may still want to overcome such a situation in some way. As one solution, we discussed using the assumption of homogeneous firing. Most studies in the literature simply ignore higher-order interactions, say, triplewise and higher, and focus on analyzing the first-order and pairwise interactions. This corresponds to assuming hk = 0 (k ≥ 3), that is, a partial homogeneity assumption. Certainly, the method can be used with such an assumption. Furthermore, although the decomposition was shown only for mixed coordinates of the type ζk = (g_{k−}; h_{k+}) in our study, the decomposition property has more generality. Any combination of subsets gA and hB that spans the probability space of Sn can be used similarly.
Extension of the method from binary to k-valued discrete random variable vectors is also possible. In addition, we note that other approaches are available, including some bootstrap methods and a Bayesian framework (Del Prete & Martignon, 1998; Martignon et al., 2000). Finally, our measure is model-free, so it can be used without any assumption about the underlying neuron models and their connections. At the same time, even if significant events are detected by the measure, the measure itself cannot determine the underlying neural interaction. This is a different issue but important in its own right, and we are interested in it too (Amari, Nakahara, Wu, & Sakai, in press). Overall, the method presented in this article is just a measure. Its ultimate test is whether it will lead to exciting findings in neuroscience (Nakahara et al., 2001).
Acknowledgments

We thank the anonymous reviewers for their helpful comments. H. N. is grateful to K. Siu and K. Kobayashi for technical assistance; M. Tatsuno for technical assistance and his corrections of some proofs; and T. Poggio, Y. Kubuta, R. Gütig,
Information-Geometric Measure for Neural Spikes
2311
K. Anderson, E. Miller, and I. Kitori for their comments. H. N. is supported by Grants-in-Aid for Scientific Research on Priority Areas (C) of the Ministry of Education, Japan.
References

Abbott, L. F., Rolls, E. T., & Tovee, M. J. (1996). Representational capacity of face coding in monkeys. Cerebral Cortex, 6(3), 498–505.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. Journal of Neurophysiology, 70(4), 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. Journal of Neurophysiology, 60(3), 909–924.
Aertsen, A., & Arndt, M. (1993). Response synchronization in the visual cortex. Current Opinion in Neurobiology, 3(4), 586–594.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61(5), 900–917.
Amari, S. (1982). Differential geometry of curved exponential families—curvature and information loss. Annals of Statistics, 10, 357–385.
Amari, S. (1985). Differential geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (1987). Differential geometry of a parametric family of invertible linear systems—Riemannian metric, dual affine connections and divergence. Mathematical Systems Theory, 20, 53–82.
Amari, S. (2001). Information geometry on hierarchical decomposition of stochastic interactions. IEEE Transactions on Information Theory, 47, 1701–1711.
Amari, S., & Han, T. S. (1989). Statistical inference under multi-terminal rate restrictions—a differential geometrical approach. IEEE Transactions on Information Theory, IT-35, 217–227.
Amari, S., Kurata, K., & Nagaoka, H. (1992). Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 3(2), 260–271.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Amari, S., Nakahara, H., Wu, S., & Sakai, Y. (in press).
Synfiring and higher-order interactions in neuron pool.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84, 1770–1780.
Barndorff-Nielsen, O. (1978). Information and exponential families in statistical theory. New York: Wiley.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857.
2312
Hiroyuki Nakahara and Shun-ichi Amari
Bohte, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effects of pair-wise and higher order correlations on the firing rate of a postsynaptic neuron. Neural Computation, 12(1), 153–179.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Computation, 12(7), 1531–1552.
Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: CRC Press.
Csiszár, I. (1967). On topological properties of f-divergence. Studia Scientiarum Mathematicarum Hungarica, 2, 329–339.
Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3, 146–158.
Deadwyler, S. A., & Hampson, R. E. (1997). The significance of neural ensemble codes during behavior and cognition. Annual Review of Neuroscience, 20, 217–244.
Deco, G., Martignon, L., & Laskey, K. B. (1998). Comparing different measures of spatio-temporal patterns in neural activity. In ICANN98 (pp. 943–948). Skövde, Sweden.
Del Prete, V., & Martignon, L. (1998). Methods to estimate couplings and correlations in the activity of ensembles of neurons. SISSA preprint, 115/98.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends in Neuroscience, 15(6), 218–226.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13(7), 2758–2771.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Gerstein, G. L., & Aertsen, A. M. (1985). Representation of cooperative firing activity among simultaneously recorded neurons. Journal of Neurophysiology, 54(6), 1513–1528.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36(1), 4–14.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., & Gross, C. G. (1994). Neural ensemble coding in inferior temporal cortex. Journal of Neurophysiology, 71(6), 2325–2337.
Golomb, D., Hertz, J., Panzeri, S., Treves, A., & Richmond, B. (1997). How well can we estimate the information carried in neuronal responses from limited samples? Neural Computation, 9(3), 649–665.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance, and interpretation. Thun, Frankfurt am Main: Verlag Harri Deutsch.
Grün, S., & Diesmann, M. (2000). Evaluation of higher-order coincidences in multiple parallel processes. Society for Neuroscience Abstracts, 26, 828.3.
Grün, S., Diesmann, M., & Aertsen, A. (2002a). "Unitary events" in multiple single-neuron spiking activity: I: Detection and significance. Neural Computation, 14(1), 43–80.
Grün, S., Diesmann, M., & Aertsen, A. (2002b). "Unitary events" in multiple single-neuron spiking activity: II: Nonstationary data. Neural Computation, 14(1), 81–120.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. Journal of Neuroscience Methods, 94(1), 67–79.
Gütig, R., Aertsen, A., & Rotter, S. (2002). Statistical significance of coincident spikes: Count-based versus rate-based statistics. Neural Computation, 14(1), 121–154.
Gütig, R., Rotter, S., & Aertsen, A. (2001). Analysis of higher-order neuronal interactions based on conditional inference. Manuscript submitted for publication.
Hikosaka, O., & Wurtz, R. (1983). Visual and oculomotor functions of monkey substantia nigra pars reticulata. I. Relation of visual and auditory responses to saccades. Journal of Neurophysiology, 49, 1230–1253.
Ito, H., & Tsuji, S. (2000). Model dependence in quantification of spike interdependence by joint peri-stimulus time histogram. Neural Computation, 12, 195–217.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proc. IEEE, 70, 939–952.
Kitazawa, S., Kimura, T., & Yin, P. B. (1998). Cerebellar complex spikes encode both destinations and errors in arm movements. Nature, 392(6675), 494–497.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network models, information estimation, and spatial tuning. Journal of Computational Neuroscience, 1, 109–139.
Kudrimoti, H. S., Barnes, C. A., & McNaughton, B. L. (1999). Reactivation of hippocampal cell assemblies: Effects of behavioral state, experience, and EEG dynamics. Journal of Neuroscience, 19(10), 4090–4101.
Laubach, M., Wessberg, J., & Nicolelis, M. A. L. (2000). Cortical ensemble activity increasingly predicts behavior outcomes during learning of a motor task. Nature, 405, 567–571.
Lehmann, E. L. (1983). Theory of point estimation. London: Chapman & Hall.
Lisman, J. E. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. Trends in Neuroscience, 20(1), 38–43.
MacLeod, K., Bäcker, A., & Laurent, G. (1998).
Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395, 693–698.
Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W. A., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Computation, 12(11), 2621–2653.
Martignon, L., Von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73(1), 69–81.
Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. Journal of Neuroscience, 19(18), 8083–8093.
McClurkin, J. W., Gawne, T. J., Optican, L. M., & Richmond, B. J. (1991). Lateral geniculate neurons in behaving primates. II. Encoding of visual information
in the temporal shape of the response. Journal of Neurophysiology, 66(3), 794–808.
McClurkin, J. W., & Optican, L. M. (1996). Primate striate and prestriate cortical neurons during discrimination. I. Simultaneous temporal encoding of information about color and pattern. Journal of Neurophysiology, 75(1), 481–495.
McClurkin, J. W., Zarbock, J. A., & Optican, L. M. (1996). Primate striate and prestriate cortical neurons during discrimination. II. Separable temporal codes for color and pattern. Journal of Neurophysiology, 75(1), 496–507.
Nadasdy, Z., Hirase, H., Czurko, A., Csicsvari, J., & Buzsaki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience, 19(21), 9497–9507.
Nagaoka, H., & Amari, S. (1982). Differential geometry of smooth families of probability distributions (Tech. Rep.). Tokyo: University of Tokyo.
Nakahara, H., & Amari, S. (2002). Information-geometric decomposition in spike analysis. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Nakahara, H., Amari, S., Tatsuno, M., Kang, S., & Kobayashi, K. (2002). Examples of applications of information geometric measure to neural data (Tech. Rep. No. 02-1). Available on-line: www.mns.brain.riken.go.jp/nakahara/papers/TR IGspike.ps, TR IGspike.pdf.
Nakahara, H., Amari, S., Tatsuno, M., Kang, S., Kobayashi, K., Anderson, K. C., Miller, E. K., & Poggio, T. (2001). Information geometric measures for spike firing. Society for Neuroscience Abstracts, 27, 821.46.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates: From single-neuron spike trains to population activity. Journal of Neuroscience Methods, 94(1), 81–92.
Nicolelis, M. A. L., Ghazanfar, A. A., Faggin, B. M., Votaw, S., & Oliveira, L. M. O. (1997). Reconstructing the engram: Simultaneous, multisite, many single neuron recordings. Neuron, 18, 529–537.
Optican, L. M., Gawne, T.
J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures of transmitted information and channel capacity from multivariate neuronal data. Biological Cybernetics, 65(5), 305–310.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of Neurophysiology, 57(1), 162–178.
Oram, M. W., Hatsopoulos, N. G., Richmond, B. J., & Donoghue, J. P. (2001). Excess synchrony in motor cortical neurons provides redundant direction information with that from coarse temporal measures. Journal of Neurophysiology, 86(4), 1700–1716.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. Journal of Neurophysiology, 81(6), 3021–3033.
Palm, G. (1981). Evidence, information, and surprise. Biological Cybernetics, 42, 57–68.
Palm, G., Aertsen, A. M., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biological Cybernetics, 59(1), 1–11.
Panzeri, S., & Schultz, S. R. (2001). A unified approach to the study of temporal, correlational, and rate coding. Neural Computation, 13(6), 1311–1349.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of information in the nervous system. Proceedings of the Royal Society of London Series B: Biological Sciences, 266(1423), 1001–1012.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999). On decoding the responses of a population of neurons from short time windows. Neural Computation, 11(7), 1553–1577.
Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Annual Review of Neuroscience, 21, 227–277.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Computation, 12(3), 647–669.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes: II: Simultaneous spike trains. Biophysical Journal, 7, 419–440.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. Journal of Neurophysiology, 79(6), 2857–2874.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
Reinagel, P., & Reid, C. R. (2000). Temporal coding of visual information in the thalamus. Journal of Neuroscience, 20, 5392–5400.
Richmond, B. J., & Gawne, T. J. (1998). The relationship between neuronal codes and cortical organization. In H. B. Eichebaum, J. L. Davis, & H. Eichenbaum (Eds.), Neuronal ensembles: Strategies for recording and decoding (pp. 57–79). New York: Wiley-Liss.
Richmond, B. J., Optican, L. M., & Spitzer, H. (1990).
Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. I. Stimulus-response relations. Journal of Neurophysiology, 64(2), 351–369.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in primate temporal visual cortex. Experimental Brain Research, 114(1), 149–162.
Roy, A., Steinmetz, P. N., & Niebur, E. (2000). Rate limitations of unitary event analysis. Neural Computation, 12(9), 2063–2082.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nature Reviews Neuroscience, 2(8), 539–550.
Samengo, I., Montagnini, A., & Treves, A. (2000). Information-theoretical description of the representational capacity of N independent neurons. Society for Neuroscience Abstracts, 26, 739.1.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404(6774), 187–190.
Stuart, A., Ord, K., & Arnold, S. (1999). Kendall's advanced theory of statistics, Vol. 2A: Classical inference and the linear model. London: Arnold.
Sugase, Y., Yamane, S., Ueno, S., & Kawano, K. (1999). Global and fine information coded by single neurons in the temporal visual cortex. Nature, 400(6747), 869–873.
Tetko, I. V., & Villa, A. E. P. (1992). Fast combinatorial methods to estimate the probability of complex temporal patterns of spikes. Biological Cybernetics, 76, 397–407.
Tovee, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399–407.
Tsukada, M., Ishii, N., & Sato, R. (1975). Temporal pattern discrimination of impulse sequences in the computer-simulated nerve cells. Biological Cybernetics, 17, 19–28.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms and application. Network: Computation in Neural Systems, 8, 127–164.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261(5124), 1055–1058.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994).
Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.
Received September 6, 2001; accepted April 23, 2002.
LETTER
Communicated by Kechen Zhang
Optimal Short-Term Population Coding: When Fisher Information Fails

M. Bethge [email protected]
D. Rotermund [email protected]
K. Pawelzik [email protected]
Institute of Theoretical Physics, University of Bremen, Bremen, D-28334 Germany

Efficient coding has been proposed as a first principle explaining neuronal response properties in the central nervous system. The shape of optimal codes, however, strongly depends on the natural limitations of the particular physical system. Here we investigate how optimal neuronal encoding strategies are influenced by the finite number of neurons $N$ (place constraint), the limited decoding time window length $T$ (time constraint), the maximum neuronal firing rate $f_{\max}$ (power constraint), and the maximal average firing rate $\langle f \rangle_{\max}$ (energy constraint). While Fisher information provides a general lower bound for the mean squared error of unbiased signal reconstruction, its use to characterize the coding precision is limited. Analyzing simple examples, we illustrate some typical pitfalls and thereby show that Fisher information provides a valid measure for the precision of a code only if the dynamic range $(f_{\min} T, f_{\max} T)$ is sufficiently large. In particular, we demonstrate that the optimal width of gaussian tuning curves depends on the available decoding time $T$. Within the broader class of unimodal tuning functions, it turns out that the shape of a Fisher-optimal coding scheme is not unique. We solve this ambiguity by taking the minimum mean squared error into account, which leads to flat tuning curves. The tuning width, however, remains to be determined by energy constraints rather than by the principle of efficient coding.

1 Introduction
Neural Computation 14, 2317–2351 (2002) © 2002 Massachusetts Institute of Technology

Since Attneave (1954) and Barlow (1959) proposed redundancy reduction or coding efficiency as a major principle for neuronal representations governing the formation of the central nervous system, much theoretical and experimental work has attempted to identify this principle within various structures in the brain. Neurons in cortical circuits are typically coupled to many other neurons ($10^3$–$10^4$) (Braitenberg & Schüz, 1991), and their activity is propagated and converted in a highly distributed manner. This
raises the question of how signal processing can in principle be performed by large populations of neurons. Technically, this question corresponds to characterizing the class of computations (i.e., filters or functions) that can be realized by the combination of elementary units (model neurons). Most theoretical studies, however, deal only with noise-free units, which cannot account for the fact that the reliable communication of a signal (that is, the realization of the identity, which may be seen as the simplest case of noisy computation) may require a large number of neurons. Here, we address the issue of optimal signal representation in populations of neurons. Population codes (Rolls & Cowey, 1970; Georgopoulos, Schwartz, & Kettner, 1986; Paradiso, 1988) constitute an important class of neural coding schemes, in which the signal is represented by the number of spikes emitted from each of the neurons within a given counting time window. In other words, a population code is completely determined by the spike counting statistics and may be considered as a simplified snapshot description of the time-continuous neuronal filtering process of temporal integration over synaptic inputs. In this article, we seek to identify optimal encoding strategies (i.e., a set of tuning functions) under the assumption of a Poisson noise model, given a limited number of neurons $N$ and a finite decoding time window of length $T$. Similar to previous studies by Panzeri et al. (Panzeri, Bielle, Rolls, Skaggs, & Treves, 1996; Panzeri, Treves, Schultz, & Rolls, 1999), we are particularly interested in short time windows, since psychophysical experiments have shown that efficient computations can be performed in cortex at a rate where each neuron has fired on average only once (Rolls & Tovee, 1994; Thorpe, Fize, & Marlot, 1996).
In order to make the optimization problem well posed, we specify both the objective function (see section 2) and the set of tuning function arrays within which the optimum is determined. With respect to the latter, we intend to make the class of candidate encoding strategies as large as possible in order to learn something about the arrangement and shape of tuning functions that correspond to the ultimate limit of precision. In particular, this means that we do not restrict the class of encoding strategies a priori to those tuning functions that are considered biologically plausible; rather, we intend to check the explanatory power of the coding-efficiency principle. Previous studies of optimal population coding have focused on whether broad or narrow tuning widths are advantageous for the representation of stimulus features (Hinton, McClelland, & Rumelhart, 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992). In particular, an optimal scale for populations of neurons with arbitrary radially symmetric tuning functions and arbitrary dimensions of the encoded parameter (Zhang & Sejnowski, 1999) has been determined with respect to the average Fisher information. Subsequently, this analysis was further generalized in Eurich and Wilke (2000) by allowing for a different scale for each stimulus dimension. In contrast to the apparent generality of this approach, its validity is
substantially restricted by two problems that make further analysis necessary. First, the optimization of the scale or width was based on a comparison between tuning functions with exactly the same shape, neglecting the fact that the precision may depend much less on the scale or width than on other aspects of the shape of the tuning functions (see section 6). In contrast, here we seek to find population codes that are optimal with respect to the entire class of tuning functions specified by very basic constraints only, such as a limited maximum firing rate or unimodality. The other problem results from the limited validity of Fisher information $J$ (see equation 2.7) as a measure for the precision of a population code. The precision of a population code is usually defined independently of Fisher information on the basis of the conditional mean squared error (c.m.s.e.) of an ideal observer as a function of the presented stimulus (Baldi & Heiligenberg, 1988). While the Cramér-Rao bound (see equation 2.8) provides a general lower bound on the c.m.s.e. of any estimator (Aitken & Silverstone, 1942; Fréchet, 1943; Rao, 1946; Cramér, 1946), this inequality does not justify the use of Fisher information as a general measure for the precision of population codes, because the Cramér-Rao bound is not unique for biased estimators, and it often cannot be attained. Furthermore, the notion of an ideal observer is not unique, so that no general definition of the resolution exists that fits all problems; any definition necessarily depends on further specifications. Our analysis investigates the question of optimal encoding taking these two aspects into account. Another goal of this article is to strengthen intuition about the extent to which Fisher information can be used to judge the precision of different encodings. In the next section, we describe the methods and notations used throughout the article.
In particular, a thorough justification of Fisher information as a measure for the precision of population codes on the basis of asymptotic efficiency is presented. In section 3, we determine the optimal scale for the example of gaussian tuning curves with respect to Fisher information, on the one hand, and with respect to the minimum mean squared error (MMSE) in the case of a small counting time window, on the other hand. This example indicates that Fisher information does not account for the MMSE in the case of short-term population coding. Subsequently, we show that this problem becomes especially relevant for Fisher-optimal codes if one drops the a priori restriction to gaussian-shaped tuning curves, by presenting an example in section 4 where two neurons are sufficient to achieve arbitrarily large Fisher information. The conditions under which Fisher information can be used to determine the MMSE are discussed in section 5. In section 6, we investigate the case where the tuning functions are constrained to have a single maximum only (i.e., unimodal tuning functions), and the crucial role of energy constraints for the tuning width is demonstrated in section 7. Finally, we discuss the prerequisites and implications of our results in section 8.
2 Methods and Notations
We give a brief overview of the relevant quantities and methods used throughout this article. The encoded random variable will be denoted by $x$ and the observable spike count vector by $\mathbf{k}$, whose $N$ components are the numbers of spikes of the $N$ neurons (see Figure 1). Because all observable quantities take values within a limited range only, we set the range of $x$ to the open unit interval $x \in (0, 1)$ without loss of generality. For convenience, we assume $x$ to be uniformly distributed with density $p(x) = H(x)H(1 - x)$, where $H(y)$ denotes the Heaviside function, which is one if $y > 0$ and zero otherwise. The distribution specified by $p(x)$ is also called the a priori distribution, because it determines general properties of the signal $x$ that are independent of the observed $\mathbf{k}$. The encoding of $x$ is specified by the set of tuning functions $\{f_j(x)\}_{j=1}^{N}$ that give the mean number of spikes for each neuron $j$ divided by the length $T$ of the counting time window. Together with the assumption of independent Poisson noise, the response statistics of the entire population is then described by

$$p(\mathbf{k} \mid x) = \prod_{j=1}^{N} p_{\mu_j}(k_j) = \prod_{j=1}^{N} \frac{(T f_j(x))^{k_j}}{k_j!} \exp\{-T f_j(x)\}, \qquad (2.1)$$
where $p_{\mu_j}(k_j)$ denotes the probability mass function of the Poisson distribution with parameter $\mu_j = T f_j(x)$, which gives the mean and the variance of the spike count. Apart from the asymptotic cases $T \to 0$ and $T \to \infty$, we will frequently consider the case $f_{\max} T = 1$, that is, each neuron fires no more than one spike on average. As we will discuss in section 8, we suspect that $f_{\max} T = 1$ is of the relevant order for signal transmission in cortex. Throughout the article, the mean of a random variable is denoted by $E[\cdot]$. In particular, we will have

$$E[A(x, \mathbf{k})] = \int p(x) \sum_{\mathbf{k}} p(\mathbf{k} \mid x)\, A(x, \mathbf{k})\, dx \qquad (2.2)$$

$$= \int p(x) \sum_{k_1 = 0}^{\infty} \cdots \sum_{k_N = 0}^{\infty} p(\mathbf{k} \mid x)\, A(x, \mathbf{k})\, dx, \qquad (2.3)$$
which reduces to $E[A(x)] = \int p(x) A(x)\, dx$ if $A$ is a function of $x$ only. $E[A \mid x] = \sum_{\mathbf{k}} p(\mathbf{k} \mid x) A(\mathbf{k}, x)$ is called the conditional mean given $x$. We will also consider the conditional mean $E[A \mid \mathbf{k}] = \int p(x \mid \mathbf{k}) A(\mathbf{k}, x)\, dx$, where the so-called a posteriori distribution $p(x \mid \mathbf{k})$ can be obtained by using the Bayes formula:

$$p(x \mid \mathbf{k}) = \frac{p(\mathbf{k} \mid x)\, p(x)}{\int p(\mathbf{k} \mid \tilde{x})\, p(\tilde{x})\, d\tilde{x}}. \qquad (2.4)$$
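Equation 2.4 can be evaluated numerically on a grid. The sketch below is not from the article: it assumes gaussian tuning curves (which the article only introduces as one example later, in section 3), and the centers, width, and grid resolution are illustrative choices.

```python
import numpy as np

def posterior_grid(k, centers, T=1.0, fmax=1.0, width=0.1, n_grid=1000):
    """Posterior density p(x | k) of eq. 2.4 on a grid, assuming the
    independent-Poisson likelihood of eq. 2.1 and a uniform prior on (0, 1)."""
    x = np.linspace(1e-3, 1 - 1e-3, n_grid)
    mu = T * fmax * np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    # log p(k | x) up to terms independent of x (the k_j! factors cancel on normalizing)
    loglik = (k[None, :] * np.log(mu) - mu).sum(axis=1)
    w = np.exp(loglik - loglik.max())
    dx = x[1] - x[0]
    return x, w / (w.sum() * dx)          # normalized so the density integrates to 1

# Example: decode a stimulus at x = 0.5 from one noisy population response.
rng = np.random.default_rng(0)
centers = np.linspace(0.05, 0.95, 10)     # N = 10 preferred stimuli on (0, 1)
T = 20.0                                  # a long window, so the posterior is sharp
k = rng.poisson(T * np.exp(-0.5 * ((0.5 - centers) / 0.1) ** 2))
x, post = posterior_grid(k, centers, T=T)
x_hat = np.sum(x * post) * (x[1] - x[0])  # posterior mean estimate of x
```

The posterior mean computed here is the conditional mean $E[x \mid \mathbf{k}]$ of the surrounding text, evaluated by quadrature rather than in closed form.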
[Figure 1 diagram: stimulus $x$ $\to$ likelihood $p(\mathbf{k} \mid x)$, decomposed into the encoding $T, \{f_j(x)\}_{j=1}^{N}$ and the noise model $p(\mathbf{k} \mid \{\mu_j(x)\}_{j=1}^{N})$ $\to$ response $\mathbf{k}$ $\to$ decoding $\hat{x}(\mathbf{k})$ $\to$ mean squared error risk $E[(\hat{x} - x)^2 \mid x]$.]

Figure 1: General scheme of encoding and decoding. The relationship between a stimulus signal $x$ and the neuronal response $\mathbf{k}$ is determined by the likelihood $p(\mathbf{k} \mid x)$, which can be decomposed into the encoding and the noise model. The mean spike counts $\mu_j(x) \equiv E[k_j \mid x] = T f_j(x)$ as functions of $x$ specify the encoding, while for the noise model, we almost always choose the product of Poisson distributions in this article. Any function $\hat{x}: \mathbf{k} \mapsto \hat{x}(\mathbf{k})$ may be considered as a candidate estimator of $x$. The performance of an estimator $\hat{x}$ with respect to a given $x$ is judged by its mean squared error risk.
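The encoding and noise-model stages of Figure 1 can be sketched in a few lines. This is an illustration, not the article's code; the gaussian tuning curves and all parameter values are assumptions made for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def tuning(x, centers, fmax=1.0, width=0.1):
    """Illustrative gaussian tuning functions f_j(x) (an assumed shape)."""
    return fmax * np.exp(-0.5 * ((x - centers) / width) ** 2)

def encode(x, centers, T=1.0, rng=rng):
    """Sample a spike-count vector k from the independent-Poisson model of
    eq. 2.1, with mean counts mu_j(x) = T * f_j(x) as in the Figure 1 caption."""
    return rng.poisson(T * tuning(x, centers))

centers = np.linspace(0.05, 0.95, 10)   # N = 10 preferred stimuli on (0, 1)
k = encode(0.5, centers, T=1.0)         # one noisy response to the stimulus x = 0.5
```

With $f_{\max} T = 1$, as in this example, each neuron fires at most one spike on average, which is the short-time regime the article is mainly concerned with.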
2.1 Defining an Ideal Observer. The precision of a neural code is usually defined on the basis of the mean squared error risk $r(x) = E[(x - \hat{x})^2 \mid x]$, with which a certain estimator $\hat{x}$ reconstructs the presented stimulus $x$ from the response of the neuronal population. While the estimator is intended to represent an ideal observer, it is actually not possible to determine an estimator that is preferable among all possible estimators independently of the problem at hand. This is because no estimator exists that minimizes the risk $r(x)$ for all $x$, apart from trivial cases where error-free estimation $r(x) \equiv 0$ is possible (Lehmann & Casella, 1999). However, if one considers sequences of estimators $(\hat{x}_m)_{m=1}^{\infty}$, where each element $\hat{x}_m(\mathbf{k}^{(1)}, \ldots, \mathbf{k}^{(m)})$ refers to $m$ independent and identically distributed (i.i.d.) spike count vectors $(\mathbf{k}^{(1)}, \ldots, \mathbf{k}^{(m)})$, the corresponding sequence of risk functions asymptotically decreases proportionally to $1/m$ for many estimators.¹ This can essentially be explained by the central limit theorem. More precisely, it can be shown under some rather weak assumptions about $p(\mathbf{k} \mid x)$ and $p(x)$ (Lehmann & Casella, 1999) that the rescaled error $\sqrt{m}\,(\hat{x}_m - x)$ converges in law to a normal distribution with zero mean and a variance that is the reciprocal of the Fisher information:
$$J[p(\mathbf{k} \mid x)] \equiv E[(\partial_x \log p(\mathbf{k} \mid x))^2 \mid x]. \qquad (2.5)$$
Such (sequences of) estimators are called asymptotically efficient. In the case of a homogeneous Poisson process as considered in this article, it is equivalent to consider sequences of increasing decoding time windows $T_m = m T_0$. While the corresponding estimators are functions of a single spike count vector $\mathbf{k}$ only, this spike count vector can be interpreted as the sum $\sum_{t=1}^{m} \mathbf{k}^{(t)}$ of $m$ spike count vectors that are independently drawn from a Poisson distribution corresponding to the time window $T_0$. Thereby, no information gets lost, because $\sum_{t=1}^{m} k_j^{(t)}$ is a sufficient statistic for the parameter of the Poisson distribution (Lehmann & Casella, 1999). Accordingly, for asymptotically efficient estimators,

$$\lim_{T \to \infty} r(x, T) \cdot J[\{f_j(x)\}_{j=1}^{N}] = 1, \qquad (2.6)$$
¹ Note that the term estimator is used with different notions. While basically any arbitrary function of the observable random variables is called an estimator if it is intended to predict some quantity, we here have a somewhat different meaning: if one talks about the maximum likelihood estimator or the mean square estimator, one does not refer to a unique function, but to a certain set of estimators defined by a unique construction rule. If the latter is specified, a sequence of observations naturally leads to a unique sequence of estimators.
Optimal Short-Term Population Coding
where $r(x, T)$ denotes the risk depending on $T$, and the Fisher information is determined by

$$J[\{f_j(x)\}_{j=1}^{N}] = T \sum_{j=1}^{N} \frac{f_j'^2(x)}{f_j(x)}, \qquad (2.7)$$
which is obtained by inserting equation 2.1 into equation 2.5 (Paradiso, 1988; Seung & Sompolinsky, 1993). While Fisher information also shows up in the Cramér-Rao bound for unbiased estimators in a similar way, the latter by itself is not sufficient to justify the use of Fisher information as a general measure of coding precision. In its general version, the Cramér-Rao bound,

$$r(x) \geq \frac{(\partial_x \mathrm{E}[\hat{x}(k) \mid x])^2}{J[p(k \mid x)]} + (\mathrm{E}[\hat{x}(k) \mid x] - x)^2, \qquad (2.8)$$
is not unique for different estimators. Furthermore, even if uniqueness is given, as, for example, in the case of uniformly unbiased estimators, a comparison of different encodings cannot necessarily be traced back to a comparison of some lower bounds on the decoding risks. Instead, it is indispensable to determine a sufficiently close approximation of the actual values. Since the exact equality $r(x) = J(x)^{-1}$ holds true only in very rare cases (see section A.2), the notion of asymptotic efficiency presented above is crucial for the use of Fisher information (for a more detailed discussion of differences and relationships between asymptotic theory and the Cramér-Rao bound, see, e.g., Lehmann & Casella, 1999). While some previous publications on population coding relate their results to the asymptotic limit (Paradiso, 1988; Seung & Sompolinsky, 1993; Dayan & Abbott, 2001), a thorough discussion is still lacking of the conditions under which this limit can be used to estimate the risk of an optimal estimator in cases where the number of neurons is arbitrarily large but finite. Before we go deeper into the discussion of asymptotic efficiency, we should first determine a notion of an ideal observer that holds beyond the scope of asymptotic efficiency. In principle, there are two approaches that enable a unique selection of an optimal estimator on the basis of the risk: either the class of allowed estimators is restricted a priori such that the order between the risks $r_1(x), r_2(x)$ of any two estimators becomes completely independent of $x$, or the estimators are compared on the basis of a loss functional $F[r(x)]$ over the whole range of $x$. In other words, the unique selection of an estimator is possible only on the basis of further specifications. Here we consider the average risk,

$$\chi^2 = \mathrm{E}[(\hat{x} - x)^2] = \mathrm{E}\big[\mathrm{E}[(\hat{x} - x)^2 \mid x]\big] = \mathrm{E}\big[\mathrm{E}[(\hat{x} - x)^2 \mid k]\big], \qquad (2.9)$$
which is well established in information theory (Cover & Thomas, 1991) and neuroscience (Dayan & Abbott, 2001; Salinas & Abbott, 1994; Johnson,
M. Bethge, D. Rotermund, and K. Pawelzik
1996; Roddey, Girish, & Miller, 2000). According to $\chi^2$, the best estimator $\hat{x}_{MS}$ is given by

$$\hat{x}_{MS}(k) \equiv \operatorname*{argmin}_{\hat{x}} \mathrm{E}[(\hat{x} - x)^2 \mid k] = \operatorname*{argmin}_{\hat{x}} \int (\hat{x}(k) - x)^2\, p(x \mid k)\, dx, \qquad (2.10)$$
which is termed the mean square estimator (MS estimator). Equation 2.10 can formally be solved, so that the best estimator with respect to the mean squared error loss is known to be the conditional mean (Lehmann & Casella, 1999):

$$\hat{x}_{MS}(k) = \mathrm{E}[x \mid k] = \int x\, p(x \mid k)\, dx. \qquad (2.11)$$
The conditional mean squared error $\mathrm{E}[(\hat{x}_{MS} - x)^2 \mid k]$ of $\hat{x}_{MS}$ given an observation $k$ equals the variance $\mathrm{E}[x^2 \mid k] - \mathrm{E}[x \mid k]^2$ of the a posteriori distribution. Hence, the MMSE is generally determined by

$$\chi^2_{MS} = \mathrm{E}[(\hat{x}_{MS} - x)^2] = \mathrm{E}[x^2] - \mathrm{E}[\hat{x}_{MS}^2]. \qquad (2.12)$$
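As a concrete illustration (ours, not from the article), the MS estimator of equations 2.10 and 2.11 can be approximated by a posterior mean on a discretized stimulus grid, assuming a uniform prior on (0, 1) and independent Poisson spike counts. The gaussian tuning array and all parameter values below are illustrative assumptions; the sketch also checks the two expressions for the MMSE in equation 2.12 against each other:

```python
import numpy as np

# Sketch of the MS estimator (2.10)-(2.11) as a posterior mean on a grid.
# Prior: uniform on (0,1). Noise: independent Poisson counts. The gaussian
# tuning array (N, sigma, fmax, T) is an illustrative choice, not the paper's.
rng = np.random.default_rng(1)
N, sigma, fmax, T = 10, 0.1, 1.0, 1.0
centers = np.arange(1, N + 1) / N
grid = np.linspace(0.001, 0.999, 500)

rates = T * fmax * np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / sigma) ** 2)
rates = np.maximum(rates, 1e-12)   # floor avoids log(0) far from all centers

def ms_estimate(k):
    """Posterior mean E[x | k], equation 2.11 (the k! terms cancel on normalizing)."""
    loglik = k @ np.log(rates) - rates.sum(axis=0)
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    return post @ grid

# Monte Carlo check of equation 2.12: E[(xhat - x)^2] = E[x^2] - E[xhat^2]
xs = rng.uniform(0, 1, 3000)
ks = rng.poisson(T * fmax * np.exp(-0.5 * ((xs[None, :] - centers[:, None]) / sigma) ** 2))
xhats = np.array([ms_estimate(ks[:, i]) for i in range(len(xs))])
direct = np.mean((xhats - xs) ** 2)
via_moments = np.mean(xs**2) - np.mean(xhats**2)
```

Up to Monte Carlo and discretization noise, `direct` and `via_moments` agree, and both stay below the a priori variance 1/12.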
2.2 MS Estimator and Fisher Information. Unfortunately, it is often not possible to derive the moments of the a posteriori distribution analytically, and numerical efforts can be immense. Provided a set of regularity conditions holds, however, the MS estimator is known to be asymptotically efficient (Lehmann & Casella, 1999). This means that with an increasing number of observations, the risk asymptotically approaches $1/J(x)$ for all $x$. Because this implies that the mean values of both converge, the MMSE is then asymptotically equal to the mean asymptotic error (MASE):

$$\chi^2_{AS} \equiv \mathrm{E}\left[\frac{1}{J[\{f_j(x)\}_{j=1}^{N}]}\right] = \frac{1}{T} \int_0^1 \left( \sum_{j=1}^{N} \frac{f_j'^2(x)}{f_j(x)} \right)^{-1} dx. \qquad (2.13)$$
Note, however, that even if one can prove asymptotic normality for a certain sequence of estimators $(\hat{x}_m)_{m=1}^{\infty}$, it is still necessary to know how large $m$ has to be so that the MASE becomes a good approximation of the MMSE (i.e., so that the relative difference between both satisfies $|\chi^2_{MS} - \chi^2_{AS}| / \chi^2_{MS} < \epsilon \ll 1$). We will demonstrate that $T$ typically has a large effect on the shape of the MMSE-optimal code, although we have already shown that the large-$T$ limit can be considered as the limit of an asymptotically normal sequence of estimators. In contrast, the shape of Fisher-optimal codes is necessarily independent of the available decoding time, because $T$ appears only as a constant factor in equation 2.13. Previous articles on population coding using Fisher information referred to the limit of large $N$ rather than to the limit of large $T$ considered above.
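How fast the risk approaches its asymptotic value can be probed in the simplest possible setting. The following sketch (our own, with an illustrative linear tuning curve $f(x) = 1 + 2x$ and $T_0 = 1$) estimates the risk of the posterior-mean estimator from $m$ i.i.d. Poisson observations, whose sum is a sufficient statistic, and compares $m \cdot r(x) \cdot J_1(x)$ with the asymptotic value 1:

```python
import numpy as np

# Sketch: asymptotic efficiency of the posterior-mean estimator for m i.i.d.
# Poisson observations of a single neuron with linear tuning f(x) = 1 + 2x
# (T0 = 1). All parameter values are illustrative.
rng = np.random.default_rng(2)
x_true, m = 0.5, 50
grid = np.linspace(0.001, 0.999, 400)

f = lambda x: 1.0 + 2.0 * x
K = rng.poisson(m * f(x_true), size=10_000)      # summed counts (sufficient statistic)
loglik = K[:, None] * np.log(m * f(grid))[None, :] - m * f(grid)[None, :]
post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)
xhat = post @ grid                                # MS estimate for each trial
risk = np.mean((xhat - x_true) ** 2)

J1 = 2.0**2 / f(x_true)   # per-observation Fisher information f'^2 / f
# equation 2.6: m * risk * J1 should approach 1 for large m
```

With $m = 50$ the product is already close to 1 at this interior stimulus value; the sections below show that such fast convergence is by no means guaranteed for other codes.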
There is an important difference between these two kinds of limiting processes: as long as the tuning functions are taken to be static, the integration of spikes over time corresponds to a sum over i.i.d. spike count vectors. In contrast, for the spike counts of different neurons, we typically have $p(k_i \mid x) \neq p(k_j \mid x)$ for all $i \neq j$. This diversity of the tuning functions can crucially slow the convergence of the MMSE to the MASE, and it may even destroy the property of asymptotic efficiency. In fact, it is possible to construct sequences of tuning functions so that the difference between the MASE and the MMSE becomes larger and larger the more tuning functions are taken into account by the MS estimator (an example will be given). Furthermore, we will show that tuning functions that lead to large Fisher information are particularly likely to underestimate the MMSE by far. This becomes a severe problem when Fisher information is used as an objective function in order to determine optimal encodings. Another fundamental problem is that Fisher information can be used only for those encodings for which $x$ is identifiable, which here means that the mapping of the tuning functions is one-to-one. If $x$ is not identifiable, Fisher information may either underestimate or overestimate the true error by far. For example, a single symmetric tuning function centered at $1/2$ (the middle of the interval $(0, 1)$) cannot improve the mean squared error at all for any $x$, while Fisher information can be arbitrarily large everywhere. Conversely, the Fisher information of a single tuning curve that is constant somewhere within an arbitrarily small but finite interval predicts a diverging error within this interval, while the conditional mean squared error in fact depends on the length of this interval, and $\chi^2_{MS}$ can never be larger than the variance of the a priori distribution (see equation 2.12).
Therefore, Fisher optimality in this article is defined to require both a minimal MASE and a one-to-one mapping of the tuning function array.

3 Optimal Gaussian Tuning Depends on Available Decoding Time
Consider the example of equidistant gaussian tuning curves on the unit interval,

$$f_j^{gauss}(x) = f_{max} \exp\left\{ -\frac{1}{2} \left( \frac{x - j/N}{\sigma} \right)^2 \right\}, \quad j = 1, \ldots, N. \qquad (3.1)$$
In order to determine the optimal scale with respect to the MASE, the corresponding Fisher information is calculated by inserting equation 3.1 into equation 2.7:

$$J[\{f_j^{gauss}(x)\}_{j=1}^{N}] = \frac{T f_{max}}{\sigma^2} \sum_{j=1}^{N} \left( \frac{x - j/N}{\sigma} \right)^2 \exp\left\{ -\frac{1}{2} \left( \frac{x - j/N}{\sigma} \right)^2 \right\}. \qquad (3.2)$$
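As a sanity check (ours; all parameter values are illustrative), the Fisher information of the gaussian code can be evaluated at a single stimulus value in three ways: the closed form (3.2), the generic sum (2.7) with finite-difference derivatives, and a Monte Carlo average of the defining expectation (2.5):

```python
import numpy as np

# Sketch (illustrative parameters): three routes to the population Fisher
# information of the gaussian code at a single stimulus value x.
rng = np.random.default_rng(0)
N, sigma, fmax, T, x = 10, 0.1, 1.0, 2.0, 0.43
centers = np.arange(1, N + 1) / N

def f(x):
    """Gaussian tuning curves, equation 3.1."""
    return fmax * np.exp(-0.5 * ((x - centers) / sigma) ** 2)

# closed form, equation 3.2
z = (x - centers) / sigma
J_closed = T * fmax / sigma**2 * np.sum(z**2 * np.exp(-0.5 * z**2))

# generic sum, equation 2.7, with numerical derivatives f_j'(x)
h = 1e-6
df = (f(x + h) - f(x - h)) / (2 * h)
J_sum = T * np.sum(df**2 / f(x))

# definition, equation 2.5: for independent Poisson counts,
# d/dx log p(k|x) = sum_j (k_j / f_j(x) - T) * f_j'(x)
k = rng.poisson(T * f(x), size=(200_000, N))
score = np.sum((k / f(x) - T) * df, axis=1)
J_mc = np.mean(score**2)
```

All three values agree, the Monte Carlo one up to sampling noise.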
Figure 2: (Left) Log-log plot of the minimum mean squared error $(\chi^2_{MS})^{gauss}$ as a function of the scale $\sigma$ for $f_{max}T = 1$ (solid), compared with the mean asymptotic error $f_{max}T(\chi^2_{AS})^{gauss}$, $(\chi^2_{AS})^{gauss} = \mathrm{E}[1/J]$ (dashed), in the case of $N = 10$ (upper) and $N = 100$ neurons (lower). The variance of the a priori distribution, $\mathrm{Var}[x] = 1/12$ (dotted), provides an upper bound for $\chi^2_{MS}$. The arrows indicate the different minima. (Right) Comparison of the optimal tuning curves with respect to the mean asymptotic error $(\chi^2_{AS})^{gauss}$ (dashed) and the minimum mean squared error $(\chi^2_{MS})^{gauss}$ (solid) in the case of $N = 10$ (upper) and $N = 100$ neurons (lower). The gray lines indicate the adjacent tuning curves.
Numerical integration over $1/J[\{f_j(x)\}_{j=1}^{N}]$ then yields the MASE $(\chi^2_{AS})^{gauss}$. If it is multiplied by $f_{max}T$, the resulting expression becomes independent of time, which implies that the optimal scale with respect to Fisher information is independent of time, too. In Figure 2, $f_{max}T(\chi^2_{AS}(\sigma))^{gauss}$ is plotted as a function of the scale in the case of $N = 10$ and $N = 100$, exhibiting a unique minimum, for which the corresponding values are as follows:

N      $\sigma_{AS}$     $f_{max}T(\chi^2_{AS})^{gauss}$
10     0.045             $2 \cdot 10^{-3}$
100    0.004             $2 \cdot 10^{-5}$
The use of the MASE as an objective function is justified only in the case $T \to \infty$ of asymptotic normality. For finite $T$, however, it is necessary to check whether the MASE agrees with the MMSE. Hence, we computed
$(\chi^2_{MS})^{gauss}$ directly for the case of $f_{max}T = 1$ using Monte Carlo methods (see the appendix). The MMSE as a function of the scale for $f_{max}T = 1$ is also plotted in Figure 2 (left, solid line), and the values of the minima are as follows:

N      $\sigma_{MS}$     $(\chi^2_{MS})^{gauss}$
10     0.11              $1.2 \cdot 10^{-2}$
100    0.04              $2 \cdot 10^{-4}$
By comparison, we find that the optimal scales with respect to $(\chi^2_{MS})^{gauss}$ are about one order of magnitude larger than one would conclude from the MASE. In particular, this difference between the short-term and the long-term optimal scale becomes even larger when the number of neurons is increased from 10 to 100. Figure 2 (right) shows the corresponding tuning curves illustrating this relative increase. While the MASE is close to $(\chi^2_{MS})^{gauss}$ for scales that are larger than the optimal scale, the difference between both increases rapidly the more the scale is reduced below the optimal scale, and it reaches a maximum at the minimum of the MASE.
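The qualitative effect behind Figure 2 can be reproduced with a small Monte Carlo experiment (our own sketch, with a much coarser grid and far fewer trials than the article's appendix): for $f_{max}T = 1$ and $N = 10$, a near-optimal scale $\sigma = 0.11$ yields a far smaller MMSE than a very narrow scale, where most stimuli evoke no spikes and the error approaches the a priori variance:

```python
import numpy as np

# Sketch: small Monte Carlo version of the Figure 2 comparison.
# Grid resolution and trial counts are kept deliberately modest.
rng = np.random.default_rng(3)
N, fmax, T = 10, 1.0, 1.0
centers = np.arange(1, N + 1) / N
grid = np.linspace(0.001, 0.999, 400)

def mmse(sigma, trials=1500):
    rates = T * fmax * np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / sigma) ** 2)
    rates = np.maximum(rates, 1e-12)    # floor avoids log(0) for narrow tuning
    err = 0.0
    for _ in range(trials):
        x = rng.uniform()
        k = rng.poisson(T * fmax * np.exp(-0.5 * ((x - centers) / sigma) ** 2))
        loglik = k @ np.log(rates) - rates.sum(axis=0)
        post = np.exp(loglik - loglik.max())
        post /= post.sum()
        err += (post @ grid - x) ** 2
    return err / trials
```

Comparing `mmse(0.11)` with `mmse(0.005)` shows the narrow code performing several times worse, in line with the solid curves of Figure 2.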
4 Fisher-Optimal Codes Without Tuning Curve Shape Constraints
The analysis of optimal gaussian tuning in the previous section indicates that Fisher-optimal codes are particularly likely to underestimate the MMSE if the time window is small. This mismatch between the MASE and the MMSE becomes even more dramatic in the case of Fisher-optimal codes if one does not stick to the restriction of gaussian-shaped tuning functions. This can be demonstrated by considering Fisher-optimal population codes where the tuning curves are not subjected to a priori constraints apart from a limitation of their dynamic range by a minimum firing rate $f_{min}$ and a maximum firing rate $f_{max}$. We will first determine the optimal tuning function in the case of a single neuron and then for multiple neurons.

4.1 Single Neuron. A way to find the tuning function that minimizes the MASE is to start with a calculus of variations for the MASE functional:

$$\frac{1}{T} \int_0^1 \frac{f(x)}{(f'(x))^2}\, dx. \qquad (4.1)$$
A necessary condition for a minimum of the MASE functional is given by the corresponding Euler-Lagrange differential equation,

$$\frac{f(x)}{(f'(x))^2} + 2 f'(x)\, \frac{f(x)}{(f'(x))^3} = C, \qquad (4.2)$$
which is equivalent to the requirement of a constant Fisher information, because the left-hand side is proportional to 1 / J[ f ]. The unique solution
satisfying the boundary conditions $f(0) = f_{min}$ and $f(1) = f_{max}$ reads

$$f^{opt}(x) = \left[ \left( \sqrt{f_{max}} - \sqrt{f_{min}} \right) x + \sqrt{f_{min}} \right]^2. \qquad (4.3)$$
While the calculus of variations does not account for solutions with kinks, one can prove with some additional effort that $f^{opt}$ in fact constitutes the Fisher-optimal tuning function in the case of the Poisson noise model. Its Fisher information is $J[f^{opt}(x)] = 4T(\sqrt{f_{max}} - \sqrt{f_{min}})^2$. For constant additive gaussian noise, an analogous analysis leads to a linear tuning function $f(x) = (f_{max} - f_{min})x + f_{min}$, and in general it can be shown that a constant Fisher information is a necessary condition for Fisher-optimal codes.

4.2 Many Neurons. If $x$ is encoded by more than one neuron, the requirement of identifiability of $x$ no longer implies that the tuning functions are monotonic. In particular, if at least one neuron has a strictly monotone tuning curve, all other neurons may have arbitrarily shaped tuning functions. This makes it easy to construct Fisher-optimal codes for which the MASE vanishes. In particular, we will show that this is already possible for two neurons if they have, for example, the following tuning functions (see Figure 3),
$$f_1^{wave}(x) = f_{max}\, x^2, \qquad (4.4)$$

$$f_{2,v}^{wave}(x) = f_{max}\, [(vx) \bmod 1]^2, \qquad (4.5)$$

where we have set $f_{min} = 0$ for convenience. In this example, the total Fisher information is also constant:

$$J[\{f_1^{wave}(x), f_{2,v}^{wave}(x)\}] = 4 f_{max} T\, (v^2 + 1). \qquad (4.6)$$
Hence, the mean asymptotic error of this wave function encoding scheme equals $(\chi^2_{AS})^{wave} = [4 f_{max} T (v^2 + 1)]^{-1}$, which becomes arbitrarily small with increasing $v$. If Fisher information were a general measure for the precision of population codes, this would imply that all coding problems could be solved with two neurons only. However, if we compare Fisher information with the precision of the MS estimator in the case of $f_{max}T = 1$, we find that $(\chi^2_{MS})^{wave} > 0.06$ for all $v \geq 1$. In summary, our analysis of Fisher-optimal codes bears two important conclusions. First, with respect to Fisher information, gaussian tuning curves are particularly bad codes, however large or small their tuning widths are. Second, Fisher-optimal codes are not necessarily advantageous in the case of finite time windows.
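The constancy claimed in equation 4.6 is easy to verify numerically. The following sketch (ours, with illustrative $x$ and $v$) evaluates $T f'^2/f$ for both wave tuning curves at a point away from the discontinuities:

```python
import numpy as np

# Sketch: the population Fisher information of the wave code (4.4)-(4.5),
# evaluated numerically at one point and compared with equation 4.6,
# J = 4 f_max T (v^2 + 1). The values of x and v are illustrative.
fmax, T, v, x = 1.0, 1.0, 7.0, 0.3
h = 1e-7

f1 = lambda x: fmax * x**2                      # equation 4.4
f2 = lambda x: fmax * ((v * x) % 1.0) ** 2      # equation 4.5

J = 0.0
for f in (f1, f2):
    df = (f(x + h) - f(x - h)) / (2 * h)
    J += T * df**2 / f(x)
# equation 4.6 predicts 4 * f_max * T * (v^2 + 1) = 200
```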
Figure 3: Fisher-optimal wave coding scheme consisting of two neurons. One tuning function ensures the identifiability (top). The other tuning function is a wave function that leads to arbitrarily large Fisher information when the wavelength is decreased (bottom).
5 When and Why the Mean Asymptotic Error Fails
In the space of possible arrays of tuning functions, it is difficult to name clear-cut decision boundaries that tell precisely where $\chi^2_{AS}$ is a fairly good approximation of $\chi^2_{MS}$ and where it is not. Therefore, the goal of this section is to gain intuition about which features of an encoding strategy are most relevant for the correspondence between the MASE and the MMSE. We begin with a simple example that demonstrates why the MASE and the MMSE need not converge in the large-$N$ limit. To this purpose, we construct a sequence of encodings $(f_j^{spec})_{j=1}^{N}$ by simply extending the wave coding scheme (equations 4.4 and 4.5) to

$$f_j^{spec}(x) := f_{2,v(j)}^{wave}(x), \qquad (5.1)$$
with $v(1) = 1$. For any given $T$, $v(2)$ can be chosen sufficiently large that $(\chi^2_{AS})^{spec}_{N=2} \ll (\chi^2_{MS})^{spec}_{N=2}$. The subsequent $v(j)$, $j \geq 3$, are defined recursively by $v(j+1) := v(j) + j^a$, where we introduced the exponent $a \geq 0$ in order to indicate that Fisher information can increase with $N$ arbitrarily fast. Even in the case of $a = 0$, however, it holds that $(\chi^2_{AS})^{spec} \ll (\chi^2_{MS})^{spec}$ for arbitrary $N$, because $(\chi^2_{AS})^{spec}$ decreases faster than $1/N$ for all $N$, which is in contradiction to the scaling of the mean squared error risk in the case of asymptotic efficiency.
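Since each tuning curve of this code is a wave function, its population Fisher information follows from equation 4.6, $J = 4 f_{max} T \sum_j v(j)^2$, and the scaling argument can be checked directly (our sketch; the recursion and $a = 0$ follow the construction above):

```python
import numpy as np

# Sketch: the MASE of the spec code is the reciprocal of
# J = 4 f_max T * sum_j v(j)^2 (each wave curve contributes 4 f_max T v^2,
# cf. equation 4.6). Even for a = 0, N * MASE shrinks with N, contradicting
# the 1/N scaling required for asymptotic efficiency.
fmax, T, a = 1.0, 1.0, 0.0

def mase(N):
    v = np.empty(N)
    v[0] = 1.0                          # v(1) = 1
    for j in range(1, N):
        v[j] = v[j - 1] + j**a          # recursion v(j+1) = v(j) + j^a
    return 1.0 / (4 * fmax * T * np.sum(v**2))
```

For $a = 0$ this gives $v(j) = j$, so $s(N) \propto 1/\sum_{j} j^2 \sim N^{-3}$, much faster than $N^{-1}$.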
In general, Fisher information, and hence the MASE, is a separable function of $N$ and $T$:

$$\chi^2_{AS} = \frac{1}{T}\, s(N). \qquad (5.2)$$
If asymptotic efficiency holds for large $N$, the MMSE has to decrease as $N^{-1}$. Therefore, it is a necessary condition for asymptotic efficiency with respect to the limit of large $N$ that $s(N)$ decreases proportionally to $N^{-1}$, too. This condition is not necessarily fulfilled; as in the example, the population Fisher information can grow much faster with $N$ than linearly. Moreover, it is possible to construct encodings for which the MMSE decreases substantially faster than $N^{-1}$ (an example is given below), so that asymptotic efficiency holds only for particularly suboptimal codes, which exhibit a high degree of redundancy. In the example above, there are tuning functions that map very distant values of $x$ to the same firing rate, and the mismatch between the MASE and the MMSE increases with an increasing number of maxima and minima in the tuning functions. In the following example, we will show that a restriction of the number of maxima is not sufficient to ensure $\chi^2_{AS} \approx \chi^2_{MS}$; the matching of these two quantities crucially depends on nonlinearities in the tuning functions. Consider the following class of Fisher-optimal codes built with monotonic tuning functions:²

$$f_{j,\nu}^{mono}(x) = \begin{cases} f_{max} \left( Nx - \dfrac{(l-1)N + j - l}{\nu} \right)^2, & \dfrac{(l-1)N + j - 1}{\nu N} < x < \dfrac{(l-1)N + j}{\nu N}, \\[6pt] f_{max} \left( \dfrac{l}{\nu} \right)^2, & \dfrac{(l-1)N + j}{\nu N} < x < \dfrac{lN + j - 1}{\nu N}, \end{cases} \qquad (5.3)$$
where $j$ denotes the neuron index and $\nu = 1, 2, 3, \ldots$ specifies the shape of the tuning function array (see Figure 4). Each tuning curve is completely determined if one lets $l$ run through all integer values $l = 1, \ldots, \nu$. The Fisher information of these encodings is independent of $\nu$:

$$J[\{f_{j,\nu}^{mono}(x)\}_{j=1}^{N}] = 4 f_{max} T N^2. \qquad (5.4)$$
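Equation 5.4 can be checked numerically. The sketch below (ours, with illustrative parameter values) writes the code of equation 5.3 via its square root, a piecewise-linear "staircase" that climbs by $1/\nu$ on each of its $\nu$ ramps, and evaluates $T f'^2/f$ on a ramp:

```python
import numpy as np

# Sketch: the staircase code (5.3) via its square root. On any ramp,
# T f'^2 / f = 4 f_max T N^2, reproducing equation 5.4. Values illustrative.
fmax, T, N, nu = 1.0, 1.0, 4, 5

def f_mono(x, j):
    u = nu * N * x
    # ramp l of neuron j covers u in ((l-1)N + j - 1, (l-1)N + j), l = 1..nu
    ramp_starts = np.arange(nu) * N + (j - 1)
    s = np.sum(np.clip(u - ramp_starts, 0.0, 1.0)) / nu
    return fmax * s**2

x, j = 0.5 / (nu * N), 1               # midpoint of the first ramp of neuron 1
h = 1e-7
df = (f_mono(x + h, j) - f_mono(x - h, j)) / (2 * h)
J = T * df**2 / f_mono(x, j)
# expected: 4 * f_max * T * N^2 = 64
```

Since the ramps of the $N$ neurons tile the interval, exactly one neuron contributes this amount at any generic $x$, which gives the constant total of equation 5.4.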
In the limit $\nu \to \infty$, this coding scheme cannot be distinguished from $N$ identical tuning functions $f_{j,\infty}^{mono}(x) = f_{max} x^2$ (see Figure 4). However, the population Fisher information $J[\{f_{j,\nu}^{mono}(x)\}_{j=1}^{N}]$ is $N$ times larger than the population Fisher information of the asymptotic tuning functions $\{f_{j,\infty}^{mono}(x)\}_{j=1}^{N}$
2 The proof of Fisher optimality is based on the same reasoning as the proof of Fisher optimality for unimodal tuning functions given in section 6. However, the set of encodings $\{\{f_{j,\nu}^{mono}(x)\}_{j=1}^{N} : \nu = 1, 2, 3, \ldots\}$ does not contain all Fisher-optimal tuning function arrays.
Figure 4: Four examples of Fisher-optimal encodings built with monotonic tuning functions as described by equation 5.3. The left column shows the case of $N = 2$, and the right column shows the case of $N = 100$, where we plotted only the first tuning function ($j = 1$, gray) and the last one ($j = N$, black). The intermediate tuning functions ($j = 2, \ldots, N-1$, not shown) lie between the first and the last one. Independent of $N$, all tuning functions converge to $f_{\infty}^{mono}(x) = f_{max} x^2$, which is illustrated by the comparison of the case $\nu = 5$ (upper row) with the case $\nu = 20$ (lower row).
(this is possible because limiting values are not invariant under a change in the order of limiting processes). This example nicely demonstrates that Fisher information behaves as if all structures in the tuning functions were of the same relevance, independent of their length scale. In fact, however, nonlinearities in the tuning functions become relevant only if they are observable at a scale that is naturally set by the scattering of the noise distribution. Correspondingly, the critical decoding time $T_c$ that is necessary to approach the asymptotic normal case, which is described correctly by Fisher information, increases with increasing $\nu$. For large $\nu$, $T_c$ has to be roughly proportional to $\nu^2$, because the squared deviations of the tuning functions $f_{j,\nu}^{mono}(x)$ from the smoothed tuning functions $f_{j,\infty}^{mono}(x)$ are of the order of $(1/\nu)^2$ and become relevant only if the mean squared error, which scales like $1/T$, is of the same order (or smaller). Therefore, the critical decoding time diverges for diverging $\nu$, which explains the difference in the population Fisher information between $\{f_{j,\nu}^{mono}(x)\}_{j=1}^{N}$ and $\{f_{j,\infty}^{mono}(x)\}_{j=1}^{N}$. Finally, it is worthwhile to note that the ramp coding scheme obtained for $\nu = 1$ can be considered the best among the class of
Figure 5: The MMSE of $\{f_{j,\nu}^{mono}(x)\}_{j=1}^{20}$ is displayed as a function of the decoding time $T$ for $\nu = 1, 2, 3, 4, 5$ and $\nu = 20$ (solid, from dark to pale). Although all encodings have the same Fisher information (dashed), the critical decoding time increases with increasing $\nu$. If $T$ is smaller than the critical decoding time and larger than $3/(f_{max}N)$, the MMSE curves are well described by the Fisher information of the asymptotic tuning functions $J[\{f_{j,\infty}^{mono}(x)\}_{j=1}^{20}]$ (dot-dashed). For $T < 3/(f_{max}N)$, the bound given by the a priori variance $1/12$ (dotted) is most relevant.
Fisher-optimal coding schemes built with nondecreasing tuning functions, because it has the smallest critical decoding time. This is demonstrated in Figure 5, where we show how the dependency of $(\chi^2_{MS})^{mono}$ on the decoding time is affected by the parameter $\nu$. Taken together, Fisher information is in the first place a measure of the long-term coding precision of population codes, while in the case of finite $T$, one has to check carefully whether Fisher information provides correct results for the minimum mean squared error. As a rule of thumb, one can say that the smoother the tuning functions, the higher the probability that the MASE matches the MMSE. Hitherto, we have considered only examples where $\chi^2_{AS} \leq \chi^2_{MS}$, and one might suspect that this holds true in general according to the Cramér-Rao bound. Apart from the trivial fact that $\chi^2_{AS}$ diverges in the limit $T \to 0$ in contrast to $\chi^2_{MS} \leq \mathrm{Var}[x]$, we will now show that for arbitrarily large $T$, arrays of tuning functions exist for which $\chi^2_{AS} \gg \chi^2_{MS}$. In order to show this, we consider a generalized ramp coding scheme,

$$f_{j,a}^{ramp}(x) = f_{max} \left( [Nx - j + 1]_+ - [Nx - j]_+ \right)^a, \qquad (5.5)$$
Figure 6: Illustration of the ramp coding schemes for different values of $a$. The tuning functions differ only in the shape of the ramp, which is determined by $a$. Apart from the Fisher-optimal encoding, which is given for $a = 2$ (solid), we also plotted some other shapes of the ramp (dashed) that correspond to $a = 0, 1, 3, \infty$ (from pale to dark).
where $[y]_+ = y H(y)$ is the rectifier function. The parameter $a \in [0, \infty)$ can be used to change the tuning curves smoothly from linear ramp functions ($a = 1$) to step functions ($a \to 0$ or $a \to \infty$), which is illustrated in Figure 6. Furthermore, it is important to note that $f_{j,2}^{ramp}(x)$ is identical to $f_{j,1}^{mono}(x)$, which is the Fisher-optimal encoding for nondecreasing tuning functions with the smallest critical decoding time. According to equation 5.5, the MASE becomes

$$(\chi^2_{AS}(a))^{ramp} = \frac{1}{f_{max} T N^2} \cdot \begin{cases} \dfrac{1}{a^2 (3 - a)}, & a \in (0, 3), \\[4pt] \infty, & \text{otherwise,} \end{cases} \qquad (5.6)$$

which implies that for all $T$, there is an $a < 3$ so that $(\chi^2_{AS}(a))^{ramp} \gg \mathrm{Var}[x] \geq (\chi^2_{MS}(a))^{ramp}$. In the case of $a \geq 3$, the MASE diverges however large $T$ may be. The strong dependence of $(\chi^2_{AS}(a))^{ramp}$ on $a$ is not likely to hold for the MMSE, too. In particular, it is quite surprising that $(\chi^2_{AS}(3))^{ramp}$ diverges, although the corresponding tuning function array looks very similar to that in the Fisher-optimal case ($a = 2$). The reason for this huge discrepancy is that Fisher information in general cannot account for the precision of encodings for which $J(x)$ has first-order zeros (or higher order). One could say that the latter is a weaker form of nonidentifiability: although the encoding is one-to-one, Fisher information cannot account for the precision if the slope of all tuning functions becomes too small somewhere.
Figure 7: The MMSE (solid) of the Fisher-optimal ramp encoding ($a = 2$) is shown in the case of $f_{max}T = 1$. It is very close to the upper bound (dashed) that is the minimum of the two upper bounds given by equation 5.7 and the a priori variance. The dotted line indicates the MASE of the Fisher-optimal ramp encoding, which may be considered a lower bound on the MMSE for all $a$, provided $N$ is sufficiently large. Therefore, all ramp encodings perform similarly well in the case of $f_{max}T = 1$.
In contrast to the strong dependence of the MASE on $a$, the MMSE in the case of $f_{max}T = 1$ is very similar for all $a$. It can be shown analytically (see the appendix) that the following inequality holds for all $a$:

$$(\chi^2_{MS})^{ramp} \leq \frac{1}{N^2 p} + \frac{1}{N^3} \left( 1 - \frac{1 - q^N}{p^2} \right), \qquad (5.7)$$

where $p = 1 - e^{-f_{max}T}$ increases and $q = 1 - p$ decreases with the length $T$ of the time window. As one can see in Figure 7, this bound is quite close to $(\chi^2_{AS}(2))^{ramp}$ of the Fisher-optimal code.

To summarize, all examples discussed in this section demonstrate that the matching of the MASE with the MMSE critically depends on the effect of nonlinearities of the tuning functions on the stimulus reconstruction. This is also suggested by the fact that Fisher information is calculated only on the basis of the local shape of the likelihood function $p(k \mid x)$, which corresponds to a linear extrapolation around the true value $x_{true}$. In the case of the Poisson noise model considered in this article, it is possible to give this statement about the locality of Fisher information a precise meaning: if $g_j = f_j^{-1}$ denotes the local inverse function of a tuning
curve $f_j$, then the conditional variance $\mathrm{Var}[g_j(k_j/T) \mid x]$ at point $x$ can be expressed by

$$\mathrm{Var}[g_j(k_j/T) \mid x] = \mathrm{Var}\left[ g_j(f_j(x)) + g_j'(f_j(x))\, \frac{k_j - T f_j(x)}{T} + O(k_j^2) \,\middle|\, x \right] \qquad (5.8)$$

$$= \frac{1}{J[f_j(x)]} + \mathrm{Var}[O(k_j^2) \mid x]. \qquad (5.9)$$
In a similar way, Fisher information shows up if one determines the error of the MS estimator in the limit of vanishing noise. For any given $x$, the MS estimator can be approximated by a linear function of $k$ in this limit. In particular, this linear function can be set to the form of a superposition of the inverse tuning functions $g_j = f_j^{-1}$, because the MS estimator is asymptotically unbiased. Therefore, it holds that

$$\hat{x}_{MS}(k) \approx x + \sum_j W_j\, g_j'(f_j(x))\, (k_j/T - f_j(x)) = x + \sum_j W_j\, \frac{k_j/T - f_j(x)}{f_j'(x)}, \qquad (5.10)$$
where the $\{W_j\}_{j=1}^{N}$ stand for an arbitrary weighting with $\sum_{j=1}^{N} W_j = 1$. Accordingly, the conditional error variance of the MS estimator is given by

$$\mathrm{E}[(\hat{x}_{MS}(k) - x)^2 \mid x] = \sum_j W_j^2\, \frac{\mathrm{Var}[k_j \mid x]}{(T f_j'(x))^2}. \qquad (5.11)$$
Minimizing equation 5.11 under the constraint $\sum_{j=1}^{N} W_j = 1$ yields

$$W_j = \frac{(T f_j'(x))^2 / \mathrm{Var}[k_j \mid x]}{\sum_j (T f_j'(x))^2 / \mathrm{Var}[k_j \mid x]}, \qquad (5.12)$$
and correspondingly, the conditional error variance becomes

$$\mathrm{E}[(\hat{x}_{MS}(k) - x)^2 \mid x] = \frac{1}{\sum_j (T f_j'(x))^2 / \mathrm{Var}[k_j \mid x]} = \frac{1}{J[\{f_j(x)\}_{j=1}^{N}]}. \qquad (5.13)$$
This somewhat heuristic calculation suggests that the inverse Fisher information is a good approximation of the risk of the MS estimator, when the scattering of the MS estimator around the true value of x is restricted to a region within which the tuning functions may be considered linear.
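The step from equation 5.11 to 5.13 is a weighted-harmonic-mean identity that can be verified directly; the tuning values in this sketch are arbitrary positive numbers of our choosing:

```python
import numpy as np

# Sketch: the optimal weights (5.12) turn the linearized readout error (5.11)
# into 1/J (equation 5.13) for independent Poisson counts, Var[k_j|x] = T f_j(x).
# The tuning values below are purely illustrative.
T = 2.0
f = np.array([0.5, 1.2, 3.0, 0.8])     # f_j(x) at the evaluation point
df = np.array([1.0, -0.7, 2.5, 0.3])   # f_j'(x)

var_k = T * f                           # Poisson count variance
w = (T * df) ** 2 / var_k               # unnormalized weights, equation 5.12
w /= w.sum()                            # normalize so that sum_j W_j = 1
err = np.sum(w**2 * var_k / (T * df) ** 2)   # conditional error, equation 5.11
J = T * np.sum(df**2 / f)               # population Fisher information, equation 2.7
```

The computed `err` equals `1/J` exactly, which is the content of equation 5.13.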
6 Are Fisher-Optimal Codes Unique?
In order to test the idea of efficient coding as a first principle for population codes in the brain, it is important to deduce from it characteristic predictions for the shape of neuronal tuning functions that can be directly compared with experimental data. Therefore, it is important to know whether the tuning functions of Fisher-optimal codes exhibit unique features that can be observed experimentally. Several theoretical studies have focused on whether a large or a small tuning width is advantageous with respect to coding efficiency. While Fisher information can diverge in the case of only two neurons if no particular constraints are imposed on the shape of the tuning functions, the MASE of Fisher-optimal codes remains finite if the number of maxima of each tuning function is limited. This is clearly the case for unimodal tuning functions, which have one maximum only. For such encodings, Fisher information cannot increase faster than proportionally to $N^2$, provided identifiability of $x$. Here we derive Fisher-optimal unimodal tuning curves. In order to avoid asymmetries due to the boundaries of the interval, we switch to the case of a circular random variable (which could represent the angle of an oriented bar). The ring topology of the circular random variable requires modifying the Euclidean distance slightly,

$$D(x_1, x_2) = \min\{|x_1 - x_2|,\; 1 - |x_1 - x_2|\}, \qquad (6.1)$$
by always choosing the path of smaller distance (Lehmann & Casella, 1999). Accordingly, the MMSE is then given by

$$(\chi^2_{MS})^{uni} = \mathrm{E}[D(\hat{x}_{MS}, x)^2]. \qquad (6.2)$$
While this modification reduces the a priori error compared to the case without periodic boundary conditions, it has no effect asymptotically. For convenience, we assume that $x$ is encoded by unimodal symmetric tuning curves of identical shape with equidistantly distributed centers $c_j = j/N$:

$$f_j^{uni}(x) = \begin{cases} f_{max}, & D(x, c_j) \leq a, \\[2pt] g\!\left( \dfrac{D(x, c_j) - a}{b - a} \right), & a < D(x, c_j) < b, \\[6pt] f_{min}, & b \leq D(x, c_j) \leq \dfrac{1}{2}, \end{cases} \qquad (6.3)$$
where $g: (0, 1) \to [f_{min}, f_{max}]$ is a monotone decreasing and otherwise arbitrary function. Since the Fisher information $J[f_j^{uni}(x)]$ can be positive only for $a < D(x, c_j) < b$, we refer to the corresponding regions as Fisher information regions (F regions) of the tuning functions. Independent of the function $g(z)$,
$J[f_j^{uni}(x)]$ is proportional to the inverse squared F-region width $(b - a)^{-2}$. Therefore, it is a necessary condition for a minimum of the MASE that the F regions of different neurons do not overlap, because the contributions of different neurons add at most linearly to the total Fisher information. Then the evaluation of the MASE can be decomposed,

$$\mathrm{E}\left[\frac{1}{J^{uni}(x)}\right] = \int_{D(x,c_j) < a} \frac{dx}{J^{uni}(x)} + \int_{a < D(x,c_j) < b} \frac{dx}{J[f_j^{uni}(x)]} + \int_{b < D(x,c_j) < 0.5} \frac{dx}{J^{uni}(x)}, \qquad (6.4)$$
and we can conclude from section 4.1 that $g(z) = \left( (\sqrt{f_{max}} - \sqrt{f_{min}})(1 - z) + \sqrt{f_{min}} \right)^2$ is the minimizer of the second term. It remains to determine the optimal choice of $a$ and $b$. As we will show, the MASE becomes minimal if the F-region width is set to $b - a = 1/(2N)$, with $b = k/(2N)$ for any $k \in \{1, 2, \ldots, N\}$ (see Figure 8). In this case, the total Fisher information $J^{uni}$ does not depend on $x$, so that it holds that

$$\mathrm{E}\left[\frac{1}{J^{uni}}\right] = \frac{1}{\mathrm{E}_x[J^{uni}]} = \frac{1}{16 \left( \sqrt{f_{max}} - \sqrt{f_{min}} \right)^2 T N^2}. \qquad (6.5)$$
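A short numerical check (ours, with illustrative parameters) confirms that the edge profile taken from section 4.1 makes the Fisher information constant inside an F region of width $1/(2N)$, as used in equation 6.5:

```python
import numpy as np

# Sketch: within an F region of width b - a = 1/(2N), the edge profile
# g(z) = ((sqrt(fmax) - sqrt(fmin)) * (1 - z) + sqrt(fmin))^2 from section 4.1
# yields the constant Fisher information 16 T (sqrt(fmax) - sqrt(fmin))^2 N^2
# used in equation 6.5. Parameter values are illustrative.
fmin, fmax, T, N = 0.0, 1.0, 1.0, 10
a, b = 0.1, 0.1 + 1.0 / (2 * N)

def f_edge(x):
    z = (x - a) / (b - a)               # position inside the F region, z in (0,1)
    return ((np.sqrt(fmax) - np.sqrt(fmin)) * (1 - z) + np.sqrt(fmin)) ** 2

x = np.linspace(a + 0.005, b - 0.005, 50)
h = 1e-7
df = (f_edge(x + h) - f_edge(x - h)) / (2 * h)
J = T * df**2 / f_edge(x)
# expected: constant 16 * T * N^2 = 1600 for fmin = 0, fmax = 1
```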
Because $\mathrm{E}[J^{uni}]$ decreases if $b - a$ is increased, it follows with the Jensen inequality,

$$\mathrm{E}\left[\frac{1}{J}\right] \geq \frac{1}{\mathrm{E}[J]}, \qquad (6.6)$$
that the optimal F-region width $b - a$ cannot be larger than $1/(2N)$ (see Figure 9). However, $b - a$ cannot be smaller than $1/(2N)$, due to the requirement of identifiability as well as the fact that the MASE diverges for all encoding strategies with $b - a < 1/(2N)$. Since the identical minimal MASE is achieved for sharp tuning as well as for broad tuning, the tuning width $w := a + b$ is not appropriate to characterize the Fisher-optimal code in the class of unimodal tuning functions. Obviously, this conclusion holds true also for suboptimal coding schemes, such as gaussian or cosine tuning functions. In fact, they have to be considered a special choice out of many equivalent³ codes that all have the same shape of decay $g(z)$ but differ with respect to their tuning width. Instead of the tuning width, we find that in general, the length of the
Equivalence with respect to the population Fisher information.
Figure 8: The Fisher-optimal unimodal tuning curve with the smallest critical decoding time is flat with small edges of length $1/(2N)$. Fisher information and the special type of noise model are relevant for the shape of optimal tuning curves only within these edge regions. The solid line refers to a Poisson noise model and the dashed line to additive gaussian noise of arbitrary variance.
Figure 9: Sketch of the MASE (solid) as a function of the F-region width $(b - a)$. If $(b - a) < 1/(2N)$, there is an interval with zero Fisher information, and hence the MASE diverges (gray region). For $(b - a) = 1/(2N)$, we have derived the encoding with minimal MASE, for which we found that it equals $1/\mathrm{E}[J]$ of that encoding (dot). Since the MASE is larger than or equal to $1/\mathrm{E}[J]$ (dashed), and this lower bound is an increasing function of $(b - a)$, the MASE is an increasing function of $(b - a)$ too, independent of the particular shape of the encoding. It follows that the encoding that minimizes the MASE in the case of $(b - a) = 1/(2N)$ is also Fisher optimal compared to coding schemes with different F-region widths.
F-region width is crucial for Fisher optimality, because it has to be as small as possible. Accordingly, steep changes and flat plateaus are the main signatures of Fisher-optimal tuning curves, the more so the larger the population is, because the minimal average F-region width scales like 1/N. While the tuning width fails to constitute a characteristic feature of Fisher-optimal codes, other global aspects may be suitable for this purpose. The derivation above suggests that Fisher-optimal unimodal tuning functions are approximately box shaped if N is sufficiently large. However, the set of Fisher-optimal codes with unimodal tuning functions considered above is not complete, because we imposed various additional constraints on the tuning functions there. In fact, there are many more Fisher-optimal unimodal tuning functions with the same MASE as given by equation 6.5 if we drop these assumptions. For example, if we require monotonicity instead of strict monotonicity for g(z) only, g may have arbitrarily many constant parts. Then, similar to the idea underlying the Fisher-optimal class of monotonic tuning function encodings given by equation 5.3, this allows the construction of various Fisher-optimal codes that have no features in common that could be tested experimentally.
7 Optimal Tuning Width—a Question of Energy?
While it was not possible to determine an optimal tuning width with respect to Fisher information under the constraints considered above, we suspect that in reality, energy consumption plays a crucial role for the tuning widths, favoring sparse codes (Levy & Baxter, 1996). Before we derive an optimal width under this additional constraint, we want to know how much the MMSE of the Fisher-optimal tuning curves {f_j^uni}_{j=1}^N depends on the width. Therefore, we computed the minimum of equation 6.2 numerically for f_min = 0 and f_max T = 1 (this corresponds to f_max = 200 Hz and T = 5 ms) in the case of N = 10 and N = 20 neurons. It turns out that w ≈ 1/2 is optimal, while the objective function is flat in a wide region around this optimum (see Figure 10). Furthermore, there is a slight asymmetry: (χ²_MS(w))^uni increases more slowly in the direction of sparse codes than in the other direction. This asymmetry is due to the Poisson noise model, for which the noise variance increases proportionally to the average firing rate. While the broadness of the minimum of the MMSE as a function of the tuning width does not indicate a substantial advantage of a certain receptive field size, this is likely to change if energy consumption is taken into account. Hitherto, the coding efficiency was limited only by a constraint on the power (f(x) ≤ f_max), which can be motivated, for example, by the refractory period of a neuron after the generation of an action potential. On the other hand, it is likely that energy constraints are relevant, because the average interspike intervals of cortical neurons are much larger than their refractory period.
Figure 10: MMSE as a function of the tuning width in the case of N = 10 neurons (solid) and N = 20 neurons (dot-dashed). The dashed line indicates the a priori variance Var[x] = 1/12, which is an upper bound for χ²_MS. The dotted lines denote the results for N = 10 (upper) and N = 20 (lower) neurons if the energy constraint (see equation 7.1) is taken into account.
This fact can be taken into account if one assumes an additional upper bound for the mean firing rates, E[f_j(x)] ≤ ⟨f⟩_max. In order to demonstrate the effect of this energy constraint, we have also calculated the minimum mean squared error in the case where the mean firing rate is limited by ⟨f⟩_max = f_max/20 (this corresponds to a maximum firing rate of 200 Hz and a mean firing rate of 10 Hz). Both constraints together can then be expressed by a width-dependent maximum firing rate:

\tilde{f}_{\max}(w) = \min\left( f_{\max},\ \frac{f_{\max}}{20\, w} \right). \qquad (7.1)

The resulting energy-constrained (χ²_MS)^uni is shown for f_max T = 1 in Figure 10 by the dotted line, and it has a distinct minimum located at the maximal w for which f̃_max(w) = f_max. Furthermore, the MASE exhibits a clear asymmetry toward smaller tuning widths, being minimal for all w ≤ ⟨f⟩_max/f_max = 1/20. This preference for smaller tuning widths gives a precise meaning to the statement that sparse coding can be explained by constrained energy consumption. In contrast to previous conclusions in van Vreeswijk (2001), this result does not rely critically on the Poisson noise model, but also holds true in the case of a gaussian noise model. In fact, the Fisher-optimal code for the gaussian noise model differs from the Poisson case only by a small change of the function g(z), which becomes g(z) = (f_max − f_min)(1 − z) + f_min (see Figure 8, dashed line). This leads to a MASE v/[2N(f_max − f_min)]², where v = Var[k_j | x] denotes the constant noise variance. Correspondingly, the MASE is again minimal for all w ≤ ⟨f⟩_max/f_max = 1/20, in the same way as in the Poisson case. Moreover, the Poisson model itself is not sufficient to explain small receptive fields, because (χ²_AS)^uni is identical for all tuning
widths w = 1/N, 2/N, …, 1, and the asymmetry of (χ²_MS)^uni in the absence of energy constraints is very weak.
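The width-dependent firing-rate bound of equation 7.1 can be sketched in a few lines. This is a hedged illustration assuming the values used in the text (f_max = 200 Hz, mean-rate budget ⟨f⟩_max = f_max/20 = 10 Hz); the function name is ours.

```python
# Hedged sketch of the width-dependent firing-rate bound of equation 7.1,
# assuming f_max = 200 Hz and a mean-rate budget <f>_max = f_max/20 (10 Hz),
# the illustrative values used in the text.

def effective_max_rate(w, f_max=200.0, budget_ratio=20.0):
    """f~_max(w) = min(f_max, f_max / (20 w)).

    A tuning curve of width w saturating at rate f has mean rate roughly
    w * f under a uniform stimulus, so capping the mean at f_max/20 caps
    the peak rate at f_max / (20 w).
    """
    if w <= 0:
        raise ValueError("tuning width must be positive")
    return min(f_max, f_max / (budget_ratio * w))

# The bound is flat (= f_max) for all w <= 1/20 and decays like 1/w beyond,
# which is why the energy-constrained MMSE in Figure 10 favors sparse codes.
assert effective_max_rate(0.05) == 200.0   # w = 1/20: power constraint binds
assert effective_max_rate(0.5) == 20.0     # broad tuning: energy budget binds
```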
8 Discussion and Conclusion
This article addresses several questions that arise if efficient coding is used as a first principle (Attneave, 1954; Barlow, 1959) in order to explain response properties of neuronal populations. The questions range from the problem of defining an ideal observer to the issue of how relevant the need for efficient signal transmission is as a determinant of the shape of neural codes, compared to other constraints that are additionally imposed. The ideal observer paradigm corresponds to the idea of evaluating all available information about the stimulus that in principle can be read out from the neuronal response. In almost all relevant cases, however, it is not possible to determine an estimator that is optimal for all possible stimuli. Therefore, it is not a trivial question which objective function is best suited to the problem of optimal coding. Although many studies use Fisher information as a measure for the precision of an ideal observer, this approach is justified only with respect to the limit of asymptotic normality. In this limit, the risk functions of many relevant estimators become equal to the inverse of Fisher information. Among others, this is the case for the risk of the maximum likelihood estimator and the MS estimator. Out of the asymptotically efficient estimators, we chose the MS estimator as an ideal observer because it accounts for all available prior knowledge of the stimulus statistics p(x) and thus reflects the ultimate limit of optimal stimulus reconstruction. The use of prior knowledge and its proper combination with the transmitted data is indispensable for efficient coding at short timescales. This is due not only to the bias-variance trade-off; in the case of large N, prior knowledge, such as the limitations on the dynamic range, cannot be neglected either, because it may have a substantial effect on the shape of optimal codes.
Therefore, the MS estimator appears to be a reasonable choice because it allows for a very flexible incorporation of all sorts of prior information in a rather straightforward way. As with any Bayesian type of estimator, one has to take great care with the choice of the a priori distribution. There are, however, well-developed methods to determine noninformative a priori distributions that do not introduce more specifications than are actually known (Lehmann & Casella, 1999). While the uniform prior distribution chosen in this article plays the role of a dummy distribution, it can be seen as a simple choice of a noninformative prior given the dynamic range of the signal x, because it has no effect on the posterior distribution apart from cutting it to the range of x. For the needs of signal processing, it is natural to require a constant upper bound for the risk function in order to achieve a certain precision for all x. This in turn suggests choosing the minimax estimator that minimizes
the worst-case loss functional F[r(x)] := sup_{x∈[0,1]} r(x) instead of the mean. Due to the uniform prior, however, the risk function of the MS estimator depends only slightly on x, and its maximum rapidly becomes close to that of the minimax estimator with increasing data. In comparison to the minimax error, the MMSE has the advantage of being less sensitive to special assumptions of the coding model. In particular, the average risk does not depend on singularities of the risk function, and its computation is by far not as demanding. Furthermore, the MS estimator is always admissible, and it is asymptotically efficient, which allows for relating the MMSE to the average inverse Fisher information. Taken together, we believe that the MMSE provides a reasonable objective function for optimal population codes that holds beyond the scope of asymptotic efficiency. Moreover, it can be estimated from neurophysiological experiments without the need to specify an estimator prior to the data (Roddey et al., 2000). We also computed the mutual information for all examples presented in this article. Like Fisher information, mutual information becomes equivalent to χ²_MS in the case of gaussian distributions, so that all three measures can be used equivalently as objective functions in the case of asymptotic normality (see also Clarke & Barron, 1990; Brunel & Nadal, 1998). As an interesting fact, however, we note that mutual information becomes similar to the MMSE more rapidly than the MASE does for all examples considered in this article (not shown). While asymptotic efficiency is obtained for many population codes, we demonstrated that the use of Fisher information to characterize the precision of an encoding for a given decoding time T strongly depends on the particular coding scheme. As an example, we have shown that the optimal width of a population of gaussian tuning curves depends on the available decoding time, while Fisher-optimal codes are always independent of T.
The choice of the counting time window length can be related to the timescale at which neurons integrate over their synaptic inputs. Since the high degree of irregularity of neuronal discharge in cortex (Softky & Koch, 1993) implies that the effective integration time constant is of the order of a few milliseconds, we were most interested in the situation where the spike count of a single neuron is of the order of one. We demonstrated that the critical decoding time T_c that is necessary for a sufficient matching of the MASE and the MMSE is typically increased by nonlinearities of the tuning functions. In particular, T_c grows with the frequency with which the tuning functions rise and decay between their minimum and maximum firing rate. If Fisher information is used as an objective function in order to determine optimal coding schemes, one is led to tuning functions with a slope that is as large as possible. In fact, the Fisher information of a (bounded) tuning function behaves like a penalty term for regularization. Because Fisher information is intended to become as large as possible, however, Fisher optimality has quite the opposite effect of regularization, and hence it cannot be expected to rule out a large number of coding schemes on the basis of Fisher information only. As we showed with an example in section 4.2, where only f_max T and N are given, the MASE can be reduced to zero with only two neurons. This means that any additional neuron is completely superfluous with respect to asymptotic optimality, and hence no substantial constraint is imposed by Fisher information. In contrast, a clear optimum within the class of bounded functions can be found on the basis of the MMSE, which we will show in a forthcoming paper (Bethge, Rotermund, & Pawelzik, 2002). If the total number of maxima of the tuning functions is finite, the MASE remains finite too. However, in these cases there is no unique Fisher-optimal code; rather, very many encodings achieve the same minimal MASE. This can be conceived from the case of nondecreasing tuning functions, for which we presented an infinitely large set of Fisher-optimal encodings (that was still not complete). The method with which Fisher-optimal codes can be derived was presented in the case of unimodal tuning functions. It is not restricted to monotonic or unimodal tuning functions, but can also be applied to tuning functions with more maxima. The crucial point is that the F regions of Fisher-optimal tuning functions do not overlap, and the total length of the F regions of a single tuning function f_j equals d_j / Σ_{i=1}^N d_i, where d_j denotes the number of times f_j(x) is allowed to traverse the dynamic range.⁴ In the simplest case, the tuning function f_j then has d_j regions within which the tuning function increases (or decreases) quadratically, as given by equation 4.3. We therefore have the general formula for the Fisher information of Fisher-optimal codes:

J = 4T \left( \sqrt{f_{\max}} - \sqrt{f_{\min}} \right)^2 \left( \sum_{j=1}^{N} d_j \right)^2. \qquad (8.1)
It is also possible that the quadratic increase itself is interrupted by flat regions, as is the case for f^mono_{j,ν}(x) with ν > 1, so that the F regions can be scattered over the entire range of x. Due to this freedom, the number of Fisher-optimal codes is very large, and there are no common features that could be tested experimentally. Fisher-optimal codes do not perform equally well with respect to the MMSE. Therefore, it is natural to choose the Fisher-optimal encoding that has the smallest critical decoding time. Accordingly, one generally obtains tuning functions with flat maxima and quadratically decaying edges that are as steep as possible.
⁴ For example, d_j = 1 in the case of monotonic tuning functions and d_j = 2 in the case of unimodal tuning functions.
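Equation 8.1 can be sanity-checked numerically in the simplest case, N neurons each traversing the dynamic range once (d_j = 1) via a quadratic ramp on its own subinterval: the population Fisher information is then constant at 4T(√f_max − √f_min)²N² away from the interval boundaries. This is a hedged sketch; the parameter values are illustrative, not taken from the article's figures.

```python
import math

# Hedged numeric check of equation 8.1 for the simplest Fisher-optimal code:
# N Poisson neurons, neuron j+1 rising quadratically from f_min to f_max on
# [j/N, (j+1)/N] (so d_j = 1 for all j). Parameter values are illustrative.

def fisher_info(x, N=4, f_min=10.0, f_max=200.0, T=0.005):
    """Population Fisher information J(x) = T * sum_j f_j'(x)^2 / f_j(x)."""
    d = math.sqrt(f_max) - math.sqrt(f_min)
    J = 0.0
    for j in range(N):
        z = min(max(N * x - j, 0.0), 1.0)    # ramp coordinate of neuron j+1
        s = math.sqrt(f_min) + d * z         # sqrt of the firing rate
        slope = 2.0 * s * d * N if 0.0 < N * x - j < 1.0 else 0.0
        J += T * slope * slope / (s * s)     # f'^2 / f for a Poisson neuron
    return J

# Equation 8.1 with sum_j d_j = N, checked pointwise in the interior:
N, f_min, f_max, T = 4, 10.0, 200.0, 0.005
predicted = 4.0 * T * (math.sqrt(f_max) - math.sqrt(f_min)) ** 2 * N ** 2
for x in (0.1, 0.37, 0.62, 0.9):
    assert abs(fisher_info(x, N, f_min, f_max, T) - predicted) < 1e-9
```

Within each ramp, f = s² with s linear in x, so f'²/f = 4(√f_max − √f_min)²N² exactly, independent of x; this is the algebraic reason the square-root parameterization is Fisher optimal.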
Furthermore, we found that the question of whether small or broad tuning widths are advantageous cannot be decided if only the number of neurons N, the available decoding time T, and the maximum firing rate f_max are given. Instead, we demonstrated that a limitation of the average firing rate, which can be motivated by energy consumption, naturally breaks the symmetry toward sparse codes with small tuning widths. Throughout this article, we considered the estimation of a single parameter only. In the case of multiparameter estimation, the choice of the squared error distance is not sufficient to enable a well-posed comparison of different coding schemes; additional specifications become necessary. While the optimization can be very complicated if the loss functional L[x̂, x] depends on different dimensions in a nonlinear fashion, it becomes rather simple if it is multilinear, that is,
L[\hat{x}, x] = L(\chi_1^2, \ldots, \chi_D^2) = \sum_{d=1}^{D} c_d\, \chi_d^2, \qquad (8.2)
which makes the individual losses χ²_d of the different parameters commensurable by an appropriate choice of weightings {c_d}_{d=1}^D. If one further assumes that no statistical dependencies among the different stimulus components exist (that is, p(x) = Π_{m=1}^D p(x_m)), it is easy to extend our results to the case where many parameters, say D, have to be inferred simultaneously from the neuronal population activity.⁵ Without assuming special constraints, the optimization problem reduces to the single-parameter case, where optimal encoding is simply given by D subpopulations that encode each parameter independently and the number of neurons in each subpopulation has to be chosen such that the contributions c_d χ²_d to the total loss become equal. The more general case, where statistical dependencies among the variables exist, can often be traced back to the case without correlations, provided the weightings c_d are all identical. For example, if p(x) is given by an arbitrary multivariate normal distribution, for which all correlations are determined by the covariance matrix, one can always find another coordinate system x̃ by the Karhunen-Loève transformation (Jolliffe, 1986), for which all variables become independent (p(x̃) = Π_{m=1}^D p(x̃_m)). From these considerations, we conclude that the shape of optimal tuning functions is not necessarily related to the number of dimensions. Under specific assumptions, however, dependencies on the number of dimensions can emerge. For example, the result in Zhang and Sejnowski (1999) that decreasing the scale maximizes Fisher information in the case of D < 2 while the opposite holds in the case of D > 2 is a direct consequence of the
⁵ This corresponds to the case considered in Zhang and Sejnowski (1999) and Eurich and Wilke (2000).
restriction to radially symmetric tuning curves and the density approximation of the encoding, by which continuous translation invariance of the coding problem over the whole real axis is enforced artificially. Although the goal of this work was to understand the general principles of optimal population coding constraining the possibilities of interneuronal signal processing in cortex, one might also relate these results to the shape of measured tuning functions, as has been done in the literature. For unimodal tuning functions, which are ubiquitous in the sensory pathways, the Fisher-optimal shape with the smallest critical decoding time is flat and has steep quadratic edges (see Figure 8). For large N, these tuning curves become very similar to boxes, standing in remarkable contrast to most measured tuning functions of sensory neurons in mammals. This apparent contradiction suggests essentially two alternative explanations. On the one hand, it is by no means evident whether the boxlike tuning curve is indeed favorable. It has been derived with the use of Fisher information, and we have shown how the optimality of a neural encoding relies on additional constraints. Constraints other than those considered in this article may be necessary to explain the shape of experimentally observed tuning functions. For example, it is likely that one needs to consider wiring constraints and natural limitations of the subsequent neuronal readout that prevent minimum mean square estimation. If in the end, however, the neuronal encoding strategy is determined by the choice of additional constraints rather than by optimal coding, the honest conclusion would be that the principle of efficient coding is of rather little relevance for explaining neuronal response properties. On the other hand, it is also likely that the quantity x actually encoded by the neuron is somehow correlated with the stimulus feature s considered in the experimental tuning curve.
If x = s + c, where c stands for an arbitrary contextual quantity that is not under the control of the experimentalist, its variability during the measurement would lead to a smoothed version f(s) of the actual tuning function f(x). In summary, this means that both the uncertainty in the assumed constraints and the limited control in experiments may often counteract the goal of explaining measured tuning functions by the method of optimal coding. If we nevertheless speculate about theoretical implications of efficient coding that might be observed experimentally, we would conclude from our investigations that place coding is better than intensity coding (using the terms place coding and intensity coding as defined in Snippe, 1996) in cases where the constraint of decoding time is stronger than the cost of the required number of neurons. This means that it is advantageous if neurons are activated in an all-or-nothing manner rather than in a smoothly graded way. Experimentally, this idea can be supported by the fact that bursting cells, which are activated in a bistable way, are ubiquitous in the brain.
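The smoothing effect of an uncontrolled contextual variable c, as described above, can be sketched directly: averaging a sharp tuning curve over the context turns box-like edges into graded ones. The box-shaped true tuning curve and the gaussian context below are illustrative assumptions, not the article's model.

```python
import numpy as np

# Hedged sketch of the smoothing argument: if the encoded quantity is
# x = s + c with c an uncontrolled contextual variable, the measured tuning
# curve is f_bar(s) = E_c[f(s + c)], a smoothed version of the true f(x).
# The box-like tuning curve and gaussian context are illustrative choices.

rng = np.random.default_rng(3)

def f_true(x):
    """Box-like tuning curve: 200 Hz on [0.4, 0.6], zero elsewhere."""
    return np.where((x >= 0.4) & (x <= 0.6), 200.0, 0.0)

def f_measured(s, sigma_c=0.05, n=100_000):
    """Monte Carlo estimate of E_c[f(s + c)] with c ~ N(0, sigma_c^2)."""
    c = rng.normal(0.0, sigma_c, size=n)
    return f_true(s + c).mean()

# At the center the box height is nearly preserved; at an edge the smoothed
# curve sits near half maximum, so the sharp profile looks graded.
center, edge = f_measured(0.5), f_measured(0.4)
assert center > 180.0
assert 80.0 < edge < 120.0
```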
Appendix

A.1 Monte Carlo Integration. While the evaluation of the integrals determining the MS estimator,

\hat{x}(k) = \frac{\int_0^1 x\, p(k \mid x)\, dx}{\int_0^1 p(k \mid x)\, dx}, \qquad (A.1)
can be computed by classical integration routines, the evaluation of the mean,

E[(\hat{x}(k) - x)^2] = \int_0^1 p(x) \sum_{k_1=0}^{\infty} \cdots \sum_{k_N=0}^{\infty} (\hat{x}(k) - x)^2\, p(k \mid x)\, dx, \qquad (A.2)
requires a summation over an N-dimensional space. According to the Monte Carlo technique (Bishop, 1995), we estimate the value of equation A.2 by an average over n trials, for which we draw a particular (x, k)_i randomly from the joint distribution p(k | x) p(x). Then it holds that

E[(\hat{x}(k) - x)^2] \approx \frac{1}{n} \sum_{i=1}^{n} (\hat{x}(k_i) - x_i)^2. \qquad (A.3)
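The Monte Carlo procedure of equations A.1 through A.3 can be sketched as follows: draw (x, k) from p(k | x)p(x), evaluate the MS estimator x̂(k) = E[x | k] on a stimulus grid, and average the squared error over trials. The gaussian-bump tuning curves and all parameter values are illustrative assumptions, not the article's exact setup.

```python
import numpy as np

# Hedged sketch of the Monte Carlo MMSE estimate (equations A.1-A.3) for a
# population of Poisson neurons with illustrative gaussian-bump tuning.

rng = np.random.default_rng(1)
N, f_max, T = 8, 200.0, 0.005
centers = (np.arange(N) + 0.5) / N
grid = np.linspace(0.0, 1.0, 401)           # integration grid for x in [0, 1]

def rates(x):
    """Per-neuron firing rates f_j(x); here gaussian bumps of width 0.15."""
    return f_max * np.exp(-0.5 * ((np.atleast_2d(x).T - centers) / 0.15) ** 2)

mean_counts = rates(grid) * T               # (401, N) expected spike counts

def ms_estimate(k):
    """Posterior mean E[x|k] under a uniform prior (equation A.1)."""
    log_lik = (k * np.log(mean_counts + 1e-300) - mean_counts).sum(axis=1)
    w = np.exp(log_lik - log_lik.max())     # unnormalized posterior on the grid
    return (grid * w).sum() / w.sum()

n_trials, sq_err = 5000, 0.0
for _ in range(n_trials):
    x = rng.uniform()                       # draw x ~ p(x) (uniform)
    k = rng.poisson(rates(x)[0] * T)        # draw k ~ p(k|x)
    sq_err += (ms_estimate(k) - x) ** 2
mmse = sq_err / n_trials                    # equation A.3
assert mmse < 0.05                          # well below the prior variance 1/12
```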
The error of this approximation drops with the number of trials. We evaluated the right-hand side of equation A.3 up to the second relevant digit. As a termination criterion, we stopped the averaging process when there was no change in the value of the second relevant digit during the last 10,000 trials.

A.2 Fisher Information and the Exponential Family. The mean squared error of any estimator can always be decomposed into its bias b_x̂(x) = (g(x) − x)² and its variance v_x̂(x) = E[(x̂ − g(x))² | x], where we introduced g(x) = E[x̂ | x] for the sake of clarity. Accordingly, the Cramér-Rao bound (see equation 2.8) can also be given in the form
v_{\hat{x}}(x) \ge \frac{g'(x)^2}{J(x)}. \qquad (A.4)
For this lower bound, it is known that equality holds if and only if p(k | x) constitutes an exponential family. While the Poisson distribution constitutes an exponential family with respect to the mean spike count μ_j = f_j(x)T for each neuron j, it depends on the shape of the tuning functions whether
an estimator of x exists for which equality holds in equation A.4. Such an estimator has to satisfy the following equation (Lehmann & Casella, 1999):

\hat{x}(k) = g(x) + \frac{g'(x)}{J(x)}\, \partial_x \log p(k \mid x). \qquad (A.5)
Therefore, the mean squared error of an estimator is completely determined by Fisher information if it is unbiased (i.e., g(x) = x) and the right-hand side of equation A.5 is independent of x, that is,

x + \frac{\partial_x \log p(k \mid x)}{J(x)} = \text{const.} \qquad (A.6)
Inserting equation 2.1 and taking the derivative with respect to x yields

\frac{(k - \mu)\, \mu''}{\mu'^2} = 0 \qquad (A.7)
in the case of N = 1. From this, it follows that μ'' equals zero, and hence the tuning function f(x) = μ(x)/T is required to be linear. While we did not solve equation A.6 for N > 1, this short calculation may hint at the strong restrictions that it imposes on the shape of the tuning functions.

A.3 Derivation of Equation 5.7. The upper bound results from a calculation of the error of a suboptimal estimator:

\hat{x}(k) := \max\left\{ \frac{1}{N},\ \frac{j}{N}\, H\!\left(k_j - \frac{1}{2}\right) \;\Big|\; j = 1, \ldots, N \right\}. \qquad (A.8)
We decompose the mean squared error,

(\chi^2_{(a)})^{\mathrm{ramp}} = \int_0^1 \sum_{k} (x - \hat{x}(k))^2\, p(k \mid x)\, dx
= \frac{1}{N} \sum_{a=1}^{N} \underbrace{N \int_{(a-1)/N}^{a/N} \sum_{k} (x - \hat{x}(k))^2\, p(k \mid x)\, dx}_{\chi_a^2}, \qquad (A.9)

and consider the parts \chi_a^2, a = 1, \ldots, N separately:

\chi_1^2 = N \int_0^{1/N} \left( x - \frac{1}{N} \right)^2 dx \le \max_{x \in [0, 1/N]} \left( x - \frac{1}{N} \right)^2 = \frac{1}{N^2}. \qquad (A.10)
\chi_2^2 = N\, p(k_2 > 0 \mid x \in [1/N, 2/N]) \int_{1/N}^{2/N} \left( x - \frac{2}{N} \right)^2 dx \qquad (A.11)
\quad + N\, p(k_2 = 0 \mid x \in [1/N, 2/N]) \int_{1/N}^{2/N} \left( x - \frac{1}{N} \right)^2 dx
\le p(k_2 > 0 \mid x \in [1/N, 2/N]) \max_{x \in [1/N, 2/N]} \left( x - \frac{2}{N} \right)^2
\quad + p(k_2 = 0 \mid x \in [1/N, 2/N]) \max_{x \in [1/N, 2/N]} \left( x - \frac{1}{N} \right)^2
= p(k_2 > 0 \mid x \in [1/N, 2/N]) \frac{1}{N^2} + p(k_2 = 0 \mid x \in [1/N, 2/N]) \frac{1}{N^2} = \frac{1}{N^2},

\chi_3^2 \le N\, p(k_3 > 0 \mid x \in [2/N, 3/N]) \int_{2/N}^{3/N} \left( \frac{3}{N} - x \right)^2 dx
\quad + p(k_3 = 0 \mid x \in [2/N, 3/N]) \left( \chi_2^2 + \frac{1}{N^2} \right)
\le \frac{1}{N^2} \left( 1 + p(k_3 = 0 \mid x \in [2/N, 3/N]) \right). \qquad (A.12)
In general, it holds for all a \ge 2:

\chi_{a+1}^2 \le \frac{1}{N^2} + p(k_{a+1} = 0 \mid x \in [a/N, (a+1)/N])\, \chi_a^2 = \frac{1}{N^2} + q\, \chi_a^2, \qquad (A.13)

where we introduced the abbreviation q for the probability p(k_{a+1} = 0 \mid x \in [a/N, (a+1)/N]) = e^{-f_{\max} T}, which does not depend on the neuron index. Together with p := 1 − q, it then follows by induction that

\chi_a^2 \le \frac{1}{N^2}\, \frac{1 - q^{a-1}}{1 - q} = \frac{1}{N^2}\, \frac{1 - q^{a-1}}{p}, \qquad (A.14)
because it holds that

\chi_{a+1}^2 \overset{\text{(A.13)}}{\le} \frac{1}{N^2} + q\, \frac{1}{N^2}\, \frac{1 - q^{a-1}}{p} = \frac{1}{N^2}\, \frac{p + q - q^a}{p} = \frac{1}{N^2}\, \frac{1 - q^a}{p}. \qquad (A.15)
According to equation A.9, we finally obtain

(\chi^2_{(a)})^{\mathrm{ramp}} = \frac{1}{N} \sum_{a=1}^{N} \chi_a^2
\le \frac{1}{N} \left( \frac{1}{N^2} + \sum_{a=2}^{N} \frac{1}{N^2}\, \frac{1 - q^{a-1}}{p} \right)
= \frac{1}{N^3} \left( 1 + \frac{1}{p} \sum_{a=2}^{N} (1 - q^{a-1}) \right)
= \frac{1}{N^3} \left( 1 + \frac{N - 1}{p} - \frac{1}{p} \sum_{a=0}^{N-2} q^{a+1} \right)
= \frac{1}{N^3} \left( 1 + \frac{N - 1}{p} - \frac{q}{p}\, \frac{1 - q^{N-1}}{1 - q} \right)
= \frac{1}{N^2 p} + \frac{1}{N^3} \left( 1 - \frac{p + q - q^N}{p^2} \right)
= \frac{1}{N^2 p} + \frac{1}{N^3} \left( 1 - \frac{1 - q^N}{p^2} \right). \qquad (A.16)
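The closed form at the end of equation A.16 can be verified numerically against the explicit sum of the per-interval bounds. This is a hedged check with arbitrary illustrative values of N and f_max T; the function names are ours.

```python
import math

# Hedged numeric check of equation A.16: averaging chi_1^2 <= 1/N^2 together
# with chi_a^2 <= (1/N^2)(1 - q^(a-1))/p for a >= 2 should reproduce the
# closed form 1/(N^2 p) + (1/N^3)(1 - (1 - q^N)/p^2).

def bound_sum(N, q):
    """(1/N) * [1/N^2 + sum_{a=2}^N (1/N^2)(1 - q^(a-1))/p]."""
    p = 1.0 - q
    terms = [1.0 / N**2]
    terms += [(1.0 - q ** (a - 1)) / p / N**2 for a in range(2, N + 1)]
    return sum(terms) / N

def bound_closed(N, q):
    """Closed form from the last line of equation A.16."""
    p = 1.0 - q
    return 1.0 / (N**2 * p) + (1.0 - (1.0 - q**N) / p**2) / N**3

q = math.exp(-1.0)               # q = exp(-f_max T) with f_max T = 1
for N in (2, 5, 10):
    assert abs(bound_sum(N, q) - bound_closed(N, q)) < 1e-12
```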
Acknowledgments
We thank S. Wilke and C. Eurich for fruitful discussions. This work was supported by the DFG (SFB 517, Neurocognition).

References

Aitken, A. C., & Silverstone, H. (1942). On the estimation of statistical parameters. Proc. Roy. Soc. Edin., A, 62, 369.
Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In The mechanisation of thought processes (pp. 535–539). London: Her Majesty's Stationery Office.
Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal neural rate coding leads to bimodal firing rate distributions. Manuscript submitted for publication.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Braitenberg, V., & Schüz, A. (1991). The anatomy of the cortex: Statistics and geometry. Berlin: Springer-Verlag.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36, 453–471.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cramér, H. (1946). A contribution to the theory of statistical estimation. Aktuariestidskrift, 29, 458–463.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Eurich, C. W., & Wilke, S. D. (2000). Multi-dimensional encoding strategy of spiking neurons. Neural Computation, 12, 1519–1529.
Fréchet, M. (1943). Sur l'extension de certaines évaluations statistiques de petits échantillons. Rev. Int. Statist., 11, 182–205.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Johnson, D. H. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–299.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Lehmann, E. L., & Casella, G. (1999). Theory of point estimation. New York: Springer-Verlag.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network: Comp. in Neural Syst., 7, 365–370.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999). On decoding the responses of a population of neurons from short time windows. Neural Computation, 11, 1553–1577.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Rao, C. R. (1946). Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37, 81–91.
Roddey, J. C., Girish, B., & Miller, J. P. (2000). Assessing the performance of neural encoding models in the presence of noise. J. Comput. Neurosci., 8, 95–112.
Rolls, E. T., & Cowey, A. (1970). Topography of the retina and striate cortex and its relationship to visual acuity in rhesus monkeys and squirrel monkeys. Exp. Brain Research, 10, 298–310.
Rolls, E. T., & Tovee, M. J. (1994). Processing speed in the cerebral cortex and the neurophysiology of visual masking. Proc. R. Soc. Lond. B, 257, 9–15.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. J. Comput. Neurosci., 1, 89–107.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. PNAS, 90, 10749–10753.
Snippe, H. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Snippe, H., & Koenderink, J. J. (1992). Discrimination thresholds for channel-coded systems. Biol. Cybern., 66, 543–551.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature (London), 381, 520–522.
van Vreeswijk, C. (2001). Whence sparseness? In T. Leen, T. Dietterich, & V. Tresp (Eds.), Neural information processing systems, 13. Cambridge, MA: MIT Press.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received December 20, 2000; accepted April 12, 2002.
LETTER
Communicated by Kwabena Boahen
Coupling an aVLSI Neuromorphic Vision Chip to a Neurotrophic Model of Synaptic Plasticity: The Development of Topography

Terry Elliott
[email protected]
Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton, SO17 1BJ, United Kingdom

Jörg Kramer
[email protected]
Institute of Neuroinformatics, University of Zürich and ETH Zürich, Winterthurerstrasse 190, CH-8057 Zürich, Switzerland

We couple a previously studied, biologically inspired neurotrophic model of activity-dependent competitive synaptic plasticity and neuronal development to a neuromorphic retina chip. Using this system, we examine the development and refinement of a topographic mapping between an array of afferent neurons (the retinal ganglion cells) and an array of target neurons. We find that the plasticity model can indeed drive topographic refinement in the presence of afferent activity patterns generated by a real-world device. We examine the resilience of the developing system to the presence of high levels of noise by adjusting the spontaneous firing rate of the silicon neurons.
1 Introduction

Exposing biological models to the full complexity of real environments provides far better and more rigorous tests of their underlying assumptions and mechanisms than simulating their environments on computers. Such simulations generally fail to model details of inherent noise and irregularities in the environment, some of which may have a significant influence on the development and functioning of the modeled biological system and on the performance of the model. For the purposes of sensory-processing models in the neurosciences, it is therefore desirable to generate afferent electrical activity patterns for input into neuronal networks by using real-world sensory-transduction devices. Most work done according to this paradigm uses standard electronic sensing devices, such as video cameras and microphones, which do not bear much similarity to biological sensors beyond the fact that they transduce the sensory signal into an electrical one. A recent alternative to using such traditional sensors is provided by the relatively

Neural Computation 14, 2353–2370 (2002) © 2002 Massachusetts Institute of Technology
2354
Terry Elliott and Jörg Kramer
new concept of neuromorphic engineering (Mead, 1989). Neuromorphic engineering typically uses analog VLSI (aVLSI) technology to implement, at the level of neuronal processing elements and the electrical signals they generate, key features of the low-level sensory processing found, for example, in the visual or auditory systems. Perhaps the best-known product of neuromorphic engineering is the so-called silicon retina, a device that generates retinal ganglion cell–like output in response to optical stimuli falling on its surface. After an early attempt to build a retina model from discrete electronic components (Fukushima, Yamaguchi, Yasuda, & Nagata, 1970) that already showed the basic spatiotemporal filtering behavior observed in biological retinae, a host of retina-like integrated circuits have been developed. These circuits differ in the types of retinal functions they implement and the detail with which they model these functions. Depending on the intended applications, some implementations focus on modeling spatial properties of the inner and outer plexiform layers (Boahen, 1999), while others, at the expense of detailed spatial modeling, put more emphasis on reproducing temporal dynamics, such as adaptation in the photoreceptors (Mahowald, 1991; Delbrück & Mead, 1994). For more extensive reviews of neuromorphic vision chips, we refer to the literature (Etienne-Cummings & Van der Spiegel, 1996; Moini, 1999). For the work examined here, a model has been used that restricts itself to temporal aspects of retinal processing in favor of high processing densities and low fixed-pattern noise (Kramer, 2002a).
It implements the separation into ON and OFF channels found in the retina and the lateral geniculate nucleus, shows quickly decaying transient responses, as found in the magnocellular pathway projecting from the retina into the visual cortex, and employs the spike-based activity encoding of the retinal ganglion cells and later visual processing stages (Kramer, 2002b). Silicon retinas have so far mainly been used to visualize the retinal output and to demonstrate its qualities. But they were also instrumental in driving the development of an event-driven, asynchronous communication infrastructure for neuromorphic chips and other systems (Lazzaro, Wawrzynek, Mahowald, Sivilotti, & Gillespie, 1993; Mortara & Vittoz, 1994; Deiss, Douglas, & Whatley, 1998; Boahen, 2000), which has been adopted for this work. A few attempts have been made to use this infrastructure to couple retina chips to other neuromorphic circuits implementing higher-level visual processing (Mahowald, 1994; Venier, Mortara, Arreguit, & Vittoz, 1997; Indiveri, Whatley, & Kramer, 1999; Higgins & Koch, 2000; Indiveri, Mürer, & Kramer, 2001; Liu et al., 2001). Recently, an autoconfigurable neuromorphic chip that models the activity-dependent release and uptake of a diffusible neurotrophic factor has been developed. It uses a fixed set of gradient-sensing silicon growth cones, interspersed within a two-dimensional array of target neurons, to wire together neurons that fire together (Kwabena Boahen, personal communication, 2001). However, to our knowledge, no attempt has
Developing Topography Using a Neuromorphic Vision Chip
been reported so far that uses a neuromorphic vision chip in conjunction with a detailed model of synaptic plasticity that permits synaptic sprouting and retraction. Here, we explore the coupling of a silicon retina to a biologically inspired model of anatomical, activity-dependent competitive synaptic plasticity and neuronal development. This model is inspired by recent data implicating a class of molecular factors, called neurotrophic factors (NTFs), in activity-dependent synaptic competition in the developing vertebrate nervous system (for a review, see, for example, McAllister, Katz, & Lo, 1999). The model has been developed, extensively analyzed, and applied in simulation to the development of various systems, including the visual system and the neuromuscular junction (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott, Maddison, & Shadbolt, 2001). Coupling this model of synaptic plasticity to a neuromorphic chip permits us not only to explore further the properties of the chip but also to determine whether the model can successfully drive normal development in the presence of realistic, rather than simulated, evoked afferent activity patterns. We consider in particular the competitive, activity-dependent development of an orderly topographic projection from an array of afferent neurons, representing the neuromorphic vision chip, to an array of target neurons. Such a system could represent, for example, the development of the retinotectal system of amphibians and fish, which is known to develop and regenerate in an activity-dependent, competitive manner (Meyer, 1983; Schmidt & Edwards, 1983; Schmidt & Eisele, 1985). We (Elliott & Shadbolt, 1999) and others (see, for example, Goodhill, 1993; Sirosh & Miikkulainen, 1997) have previously considered the development of retinotopic maps, although only in the presence of simulated afferent input.

2 The Neuromorphic Chip
The neuromorphic chip consists of a 16 × 16 square array of pixels (picture elements), each consisting of an adaptive photoreceptor with rectifying and threshold differentiating elements. The chip responds to ON and OFF transient optical signals at different output terminals. In response to a sufficiently large ON or OFF transient, which in an imaging setup typically corresponds to a moving edge, a pixel generates a single binary spike or a train of binary spikes, depending on the shape of the transient and the pixel's refractory period, which can be globally set with a bias control. For short refractory periods, a pixel emits a burst of spikes, even for sharp transients, and is thus said to be in the bursting mode, while for long refractory periods, even a prolonged transient can evoke only a single spike, and the chip is then said to operate in the nonbursting or single-spike mode. The response threshold and the spontaneous firing rate of the pixel can be varied separately for the ON and OFF pathways with bias controls. For example, the thresholds can be set such that only one of the pathways shows any response, and spontaneous firing can be completely suppressed if desired. The spikes from the two outputs of the different pixels in the array are filtered through an arbiter system, determining the order in which they are multiplexed to be sent off chip. Communication with other devices obeys a standard protocol, according to which the spikes are placed on an asynchronous communication bus as binary addresses that code for the identity of the sending neurons. In our case, the address codes for the location of the pixel in the array and its output terminal (ON or OFF). Data traffic on the communication bus is regulated using a four-phase handshaking protocol. Bus addresses rapidly decay to zero in the absence of further spiking activity. The chip is coupled to a standard PC running under Linux for data acquisition purposes. Due to unresolved difficulties in the rate of processing interrupts by the Linux kernel, it has been found simplest to set the chip in free running mode by shorting its handshaking pins and reading the address bus, connected to the computer's parallel port, at a rate set by the speed of the computer simulation. Such an approach, however, entails two problems. First, without reading the status of the handshaking pins, it is impossible to determine whether a zero address (X = 0, Y = 0, these being pixel coordinates on the chip) on the bus corresponds to a decayed address in the absence of recent spiking activity or to a recent spike generated by the (0,0) pixel. All zero address reads are therefore discarded, so that the (0,0) pixel is ignored. Second, in free running mode, the computer can read the same address, generated from a single-spike event, multiple times before it decays to zero.
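The zero-address and repeated-read workarounds can be sketched as follows. This is an illustration only: the function names are ours, and the address packing (4 bits each for X and Y, one polarity bit) is an assumption, since the chip's actual bus format is not specified here.

```python
def decode_address(addr):
    """Split a bus address into (x, y, polarity).

    Hypothetical packing: bits 0-3 = X, bits 4-7 = Y, bit 8 = ON/OFF.
    """
    x = addr & 0xF
    y = (addr >> 4) & 0xF
    polarity = (addr >> 8) & 0x1  # 0 = OFF, 1 = ON
    return x, y, polarity

def filter_reads(raw_reads):
    """Discard zero addresses (thereby ignoring the (0,0) pixel) and
    consecutive duplicate reads of the same address."""
    events = []
    previous = None
    for addr in raw_reads:
        if addr == 0:
            # Decayed bus or the (0,0) pixel: drop either way.  After a
            # decay, the next nonzero read is treated as a new event.
            previous = None
            continue
        if addr != previous:  # keep only the first read of each new address
            events.append(decode_address(addr))
        previous = addr
    return events
```

In nonbursting mode, each pixel spikes at most once per edge, so dropping consecutive duplicates removes exactly the repeated reads of a single spike event.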
In bursting mode, when each pixel generates several spike requests in response to one ON or OFF edge, it is therefore impossible, without examining the handshaking pins, to distinguish multiple reads of the same spike event from multiple reads of the same address generated by several spike events from the same pixel. The second problem can be eliminated by setting the chip to nonbursting mode, so that precisely one spike is generated in response to each ON or OFF edge, and by discarding subsequent consecutive reads of any new address on the bus. Using the nonbursting mode also practically eliminates a third possible problem: that spikes may be ignored by the PC due to interspike intervals being shorter than the PC's read period. This would happen in the presence of strong activity on the bus, given that the maximum communication rate of the chip is significantly larger than the maximum read rate of the PC. We find that our approach generates data that reflect very accurately the optical stimuli presented to the chip. Optical stimuli for the chip are generated on the LCD monitor of the same PC as is used for data acquisition. The stimuli consist of eight separate sequences of images, each sequence corresponding to a white bar sweeping across a black background in a window on the monitor. The sequences are distinguished only by the orientation and direction of motion of the bar. A bar can be horizontal, vertical, or diagonal (on either diagonal) with respect to the orientation of the rows of the array and move in either direction
perpendicular to the bar's orientation. All bars are sufficiently long to fill the field of view of the chip (i.e., the bars are effectively infinitely long, although of finite width). These sequences are repeatedly played without interruption, each new sequence being selected randomly after the completion of the previous one. The speed of the bar is largely unimportant. Provided that the speed is neither so low nor so high that the chip does not respond faithfully to an edge, any variation in intermediate speeds merely tests the computer's capacity to process the captured data rapidly enough before another set of data is captured. The optical stimuli are focused onto the chip's light-sensitive surface by a variable-aperture, variable-focus lens. These stimuli consist of both a leading ON edge and a trailing OFF edge. However, because the chip's response to ON and OFF edges is not quite symmetric and because we require only one response type for this work, the ON responses from the chip are suppressed.

3 The Neurotrophic Model of Plasticity
We briefly describe our neurotrophic model of anatomical, activity-dependent, competitive synaptic plasticity and its application to the development of topography. A fuller description, derivation, and discussion of the model and its application to various systems can be found elsewhere (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001). Our discussion of the model is at first general. We then discuss specific features of the implementation of the model necessitated by its coupling to the neuromorphic chip. Let letters such as i and j label afferent or presynaptic cells and letters such as x and y label target or postsynaptic cells; the vector character of these labels is left implicit for the purposes of notational simplicity unless indicated otherwise. Let s_xi denote the number of synapses projected from afferent i to target x, and let a_i ∈ [0, 1] denote the activity of afferent i. Then our basic equation for the evolution of s_xi is given by

$$\frac{ds_{xi}}{dt} = \epsilon\, s_{xi}\left[\frac{(a + a_i)\,\rho_i}{\sum_j s_{xj}\,(a + a_j)\,\rho_j}\,\sum_y D_{xy}\left(T_0 + T_1\,\frac{\sum_j s_{yj}\,a_j}{\sum_j s_{yj}}\right) - 1\right]. \tag{3.1}$$

In this equation, T_0 and T_1 represent, respectively, an activity-independent and a maximum activity-dependent release of NTF by target cells; the parameter a represents a resting NTF uptake capacity by afferents. These three parameters, through the combination c = T_0/(aT_1), determine whether competitive dynamics are present in the model: for values of c below a calculable threshold, competition occurs, while for values of c exceeding this threshold, competition breaks down (Elliott & Shadbolt, 1998a). Analysis shows that a must be neither too large nor too small, so we set a = 1 as in most previous work. T_1 is an overall scale factor for the number of synapses and may be set for reasons of numerical convenience, but without any loss of generality, to T_1 = 20. Selecting a value of T_0 = 0 will ensure the maximum rate of
map development, so this is what we use. The function D_xy characterizes the diffusion of NTF between target cells. Alternatively, it may be thought of as encoding lateral interactions between target cells, with excitatory lateral interactions (positive D_xy) enhancing target cell NTF release and inhibitory lateral interactions (negative D_xy) suppressing NTF release. For simplicity, we take D_xy to be a gaussian with characteristic width σ = 0.75, consistent with previous work (Elliott & Shadbolt, 1999). NTF receptor dynamics are encoded in the function ρ_i = ā_i / Σ_x s_xi, where ā_i denotes the recent average activity of afferent cell i. It can be shown (Elliott & Shadbolt, 1998a) that if D_xy is nonnegative, then NTF receptor dynamics are required to ensure that each afferent cell always innervates at least one target cell. However, if D_xy possesses a Mexican hat profile, with an excitatory center and inhibitory surround, NTF receptor dynamics can be discarded. Finally, ε sets the overall rate of plasticity. Here, we set ε = 0.02, which is sufficiently small to prevent development from being disrupted by too great a dependence on temporally recent stimulation and sufficiently large to allow adequately short run times. To apply this model to the development of a topographic mapping between an afferent structure and a target structure, we suppose that the afferent and target cells are separately organized into two square arrays of cells. Although not necessary, for convenience we take these arrays to be of the same size, so that the number of target cells equals the number of afferent cells. Were the topographic mapping between these two sheets of cells perfect, each afferent cell would innervate precisely one target cell—the target cell that would be in register with it were the two sheets superimposed without distortion.
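As a concrete sketch, one forward-Euler step of equation 3.1 can be written in a few lines of NumPy. The function name, the [target, afferent] array layout, and the Euler discretization are our own choices; the default parameter values are those quoted above (a = 1, T_1 = 20, T_0 = 0, ε = 0.02).

```python
import numpy as np

def update_synapses(s, a, abar, D, alpha=1.0, T0=0.0, T1=20.0, eps=0.02):
    """One forward-Euler step of equation 3.1.

    s[x, i] -- synapses from afferent i to target x
    a[i]    -- current afferent activities in [0, 1]
    abar[i] -- recent average afferent activities (for the receptor term)
    D[x, y] -- gaussian NTF-diffusion / lateral-interaction kernel
    """
    rho = abar / s.sum(axis=0)                    # rho_i = abar_i / sum_x s_xi
    uptake = (alpha + a) * rho                    # (a + a_i) rho_i
    denom = s @ uptake                            # sum_j s_xj (a + a_j) rho_j
    release = T0 + T1 * (s @ a) / s.sum(axis=1)   # NTF released by each target y
    drive = (uptake[None, :] / denom[:, None]) * (D @ release)[:, None]
    return s + eps * s * (drive - 1.0)
```

Each term in the bracket maps directly onto equation 3.1: the uptake fraction competes afferents for the NTF that diffuses to target x through D.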
Such a mapping ensures that neighborhood relations on the afferent sheet are preserved on the target sheet and vice versa and is expected to develop from an initially unrefined mapping in which these relations are not at first respected. We establish an initial unrefined pattern of connectivity between the afferent and target sheets by following the method used by Goodhill (1993). Let d_max denote the maximum distance (in units of cell spacing) between any pair of cells in the target sheet. For a given afferent cell, let d denote the distance between any randomly selected target cell and the target cell to which the afferent cell would project were topography perfect. Then the number of synapses between the afferent cell and the randomly selected target cell is set to be proportional to

$$b\left(1 - \frac{d}{d_{\max}}\right) + (1 - b)\,n, \tag{3.2}$$

where n ∈ [0, 1] is a randomly selected number for each such pair of target and afferent cells. The parameter b ∈ [0, 1] defines the topographic bias in the initial projections (Goodhill, 1993). A value of b = 0 gives no initial topographic bias, with entirely random connections between the two sheets. A value of b = 1 gives maximum bias, so that an afferent cell projects maximally to its topographically preferred target cell, with projections falling linearly with increasing distance away from this cell. Consistent with previous work, we usually set b = 0.5 (Elliott & Shadbolt, 1999); we will also examine the impact of decreasing the value of b on our results. In order to depict in graphical form the topographic representation of the afferent sheet on the target sheet, we use the following method. First, for each target cell, we calculate the center of mass (CoM) of afferent projections to that cell, so that we evaluate

$$\vec{M}^{\mathrm{CoM}}_{\vec{x}} = \frac{\sum_{\vec{\imath}} s_{\vec{x}\vec{\imath}}\;\vec{\imath}}{\sum_{\vec{\imath}} s_{\vec{x}\vec{\imath}}}, \tag{3.3}$$

where we have restored the vectorial nature of the afferent and target cell labels for clarity. The positions defined by these vectors are then plotted in space and connected by lines according to the nearest-neighbor relations on the target sheet: the positions corresponding to $\vec{M}^{\mathrm{CoM}}_{\vec{x}}$ and $\vec{M}^{\mathrm{CoM}}_{\vec{y}}$ are connected by a line if, and only if, cells $\vec{x}$ and $\vec{y}$ are nearest neighbors on the target sheet; this is repeated for all such pairs. Increasing deviations of the resulting pattern of lines away from a regular square grid of lines represent increasing disruption of the topographic representation of the afferent sheet on the target sheet. We now turn to a discussion of the features of the implementation of the model that are particular to its coupling to the neuromorphic chip. The silicon retina is a square array of 16 × 16 pixels. Hence, the array of target cells is also 16 × 16. In free running mode, we discard data corresponding to a zero address on the bus and so ignore the (0,0) pixel on the retina. Correspondingly, we discard the (0,0) neuron in the target sheet. This is achieved by removing all projections from the (0,0) afferent pixel to the target sheet and all projections to the (0,0) target cell from the afferent sheet. Our plasticity model is not a spiking model, but rather uses an instantaneous firing rate, a_i ∈ [0, 1].
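The biased initialization of equation 3.2, the center-of-mass map of equation 3.3, and the topographic error used to quantify map quality later in the text can all be sketched together in NumPy. The function names are ours, and equation 3.2 is used here as a raw proportionality, without any extra normalization of synapse numbers.

```python
import numpy as np

def grid_positions(size):
    """Flattened (row, col) positions of a size x size sheet of cells."""
    return np.array([(u, v) for u in range(size) for v in range(size)], float)

def initial_synapses(size=16, b=0.5, rng=None):
    """Initial synapse numbers proportional to b(1 - d/d_max) + (1 - b)n
    (equation 3.2); s is indexed [target x, afferent i]."""
    rng = np.random.default_rng() if rng is None else rng
    pos = grid_positions(size)
    # d[x, i]: distance from target x to the perfect target of afferent i
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    n = rng.random(d.shape)
    return b * (1.0 - d / d.max()) + (1.0 - b) * n

def centers_of_mass(s, size=16):
    """Equation 3.3: CoM of afferent projections to each target cell."""
    pos = grid_positions(size)
    return (s @ pos) / s.sum(axis=1, keepdims=True)

def topographic_error(s, size=16):
    """Mean distance (in cell spacings) between each target cell's CoM
    input and its topographically perfect afferent position."""
    pos = grid_positions(size)
    return np.linalg.norm(centers_of_mass(s, size) - pos, axis=1).mean()
```

With b = 1 the initial projection is perfectly ordered and the error is near zero; with b = 0 it is fully random.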
In principle, we could capture one spike at a time from the chip and process this single spike according to equation 3.1. However, single spikes processed in this way by our model would lack any spatial or temporal information about the structure of the optical stimuli presented to the chip, and consequently topographic refinement would not occur. We therefore capture, or "bin," several spikes before simultaneously processing all of them according to equation 3.1. In bursting mode, it is possible that each pixel could spike several times during a binning read, thus necessitating a somewhat ad hoc procedure for converting spike numbers with variable maxima per bin into instantaneous rates. In nonbursting mode, for a typical moving edge, each stimulated pixel spikes precisely once. In this mode, therefore, a_i conveniently takes only binary values, so that a_i ∈ {0, 1}, without the need for any arbitrariness in the conversion to an instantaneous rate.
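In nonbursting mode, converting a captured bin of spike events into the binary activity vector a_i ∈ {0, 1} is then straightforward. A minimal sketch, assuming events arrive as (x, y) pixel coordinates (our own convention; bin_spikes is a hypothetical name):

```python
import numpy as np

def bin_spikes(events, size=16, n_spikes=32):
    """Group a stream of (x, y) spike events into bins of n_spikes and
    yield one binary activity vector per bin.  In nonbursting mode each
    stimulated pixel spikes at most once per edge, so entries are 0 or 1.
    A trailing partial bin is simply not emitted."""
    a = np.zeros(size * size)
    count = 0
    for x, y in events:
        a[y * size + x] = 1.0
        count += 1
        if count == n_spikes:
            yield a
            a = np.zeros(size * size)
            count = 0
```

Each yielded vector can be fed directly into the update rule of equation 3.1 as the activities a_i.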
Figure 1: The development of the topographic mapping between the neuromorphic chip's retinal sheet and a target sheet of neurons, with 32 spikes binned at each iteration. Each map is generated as discussed in the text and represents the state of topography at the iteration number indicated immediately above it.
Because the retinal array is 16 × 16, a horizontally or vertically moving edge generates successive sequences of 16 spikes, the spikes being generated by the line of pixels on which the edge instantaneously falls. We therefore bin integral multiples of 16 spikes during the data acquisition process, before processing the spike data using equation 3.1. In fact, we will show results from the binning of 16, 32, and 64 spikes.

4 Results
We now turn to a presentation of our results. A run typically consists of 10,000 to 20,000 iterations, each iteration corresponding to the capturing and processing, via equation 3.1, of a fixed number of spikes from the chip. The number of spikes captured per iteration influences the final quality of the topographic mapping from the retinal sheet to the target sheet and affects the overall rate of topographic error reduction. We consider the binning of 16, 32, and 64 spikes and perform several complete runs for each bin size. For each bin size, all runs are qualitatively very similar. We therefore characterize the results for each bin size by randomly selecting a representative run from each case. For the binning of 32 spikes, we show, in Figure 1, the refinement of topography over the period of development. By approximately 10,000 iterations, topographic development is largely complete, with no significant further changes occurring with increased iteration numbers. Initially, the topographic map is tightly folded and disrupted. For example, for target cells near the edges of the sheet, considerable afferent input is derived initially from the opposite edge of the afferent sheet. Hence, the center of mass of afferent projections at first is skewed toward the center of the afferent sheet. As development proceeds, the map unfolds, with the centers of mass drifting in the direction of perfect projections, although nonuniformities in the emerging topographic grid remain. By about 6000 iterations, the map is nearing its final structure. Edge effects remain even at 10,000 iterations, although these would likely be reduced by greatly increasing the number of iterations. The "hole" at the top left corner of the map corresponds to the missing (0,0) afferent and target cells. Although the afferent input is derived from a real-world device and not simulated, these results compare very favorably with those obtained from simulated retinal inputs (Goodhill, 1993; Elliott & Shadbolt, 1999). Corresponding to Figure 1, in Figure 2 we show the development of the receptive field of a typical target cell near the center of the target sheet. The development of this cell's receptive field shows that, as with the development of the overall topographic map, the receptive field is nearly mature at 6000 iterations. In Figure 3, we show the final topographic maps obtained from binning 16, 32, and 64 spikes, where the data for 32 spikes are identical to those used to generate Figures 1 and 2; these maps represent 17,500, 10,000, and 10,000 iterations of development, respectively. There is little qualitative difference between the maps for the binning of 32 or 64 spikes; both are nearly, although not quite, perfect.
For 16 spikes, however, the map, although well developed, does not exhibit the quality of the maps for larger numbers of spikes. This is because the captured activity patterns represent only the leading edge of a horizontally or vertically moving bar and lack any structure in a direction perpendicular to the orientation of the bar's edge. A very similar result would be obtained with simulated retinal data in which just single lines of retinal cells are activated across the retina. Figure 4 shows the change in the overall topographic error in the maps as development proceeds for the 16, 32, and 64 spike data sets shown in Figure 3. At each iteration, the topographic error is calculated simply as the average over all target cells of the distance (in units of cell spacings) between a target cell's center of mass input and the afferent cell that would project to the target cell were topography perfect. The greatest rate of error reduction is achieved with the binning of 64 spikes. Nevertheless, although 32 spikes result in a slower decrease in topographic error, the final errors in the mature states for both 32 and 64 spikes are nearly identical. The residual error in these systems is due largely to the edge effects previously discussed. With 16 spikes, the error reduces and stabilizes by around 10,000 iterations but remains higher than the final error for 32 or 64 spikes. In addition to edge effects, this higher error reflects the slightly greater disruption in the maps generated by the binning of fewer spikes, as seen in Figure 3.

Figure 2: The development of the receptive field of a target cell near the center of the target sheet for the same system as shown in Figure 1. Each square represents an afferent cell's input to the target cell, with the gray-scale value of the square indicating the number of synapses projected by that afferent cell. A white square ("Min") denotes no synapses, a black square ("Max") denotes the largest number of synapses projected by an afferent to the target cell, and shades of gray interpolate between these extremes.

Figure 3: Final states of the topographic mappings between the neuromorphic chip's retinal array and a target sheet for three different levels of spike binning per iteration. The number of spikes per bin used to generate a map is indicated immediately above the map.

Figure 4: The change in the overall topographic error in the development of the topographic maps, calculated as discussed in the text, plotted against iteration number, for the same three data sets used to generate Figure 3. The number attached to each curve indicates the number of spikes per bin used to generate the data.

In terms of maximizing the rate of topographic error reduction, for the parameter values used in the simulations of our plasticity model, bins of 64 spikes appear to be approximately optimal. If we bin larger numbers of spikes per iteration (96 or 128 spikes, for example; data not shown), we find that the error curves initially follow very closely that for 64 spikes in Figure 4 but eventually depart from it, leaving higher residual errors. This behavior reflects the way in which the spike bin size affects the correlations in the afferent activity patterns experienced by the developing neuronal network. Once afferent correlations (and hence spike bin size) exceed some threshold, competition between spatially neighboring afferent cells begins to break down or becomes tempered, although competition between more distant afferent cells remains unchanged. Hence, a gross, overall retinotopy continues to emerge rapidly, but topographic refinement at a fine-grained level does not occur because nearby afferents' activities are too well correlated. For a sufficiently large number of spikes per bin, no error reduction will occur at all, corresponding to a complete breakdown of competitive dynamics. For bins of 32 spikes and 10,000 iterations in each case, we examine the dependence of the final maps on the parameter b, which determines the topographic bias in the initial, unrefined mapping between the afferent and
Figure 5: Final states of the topographic mappings between the neuromorphic chip’s retinal array and a target sheet for three different values of the initial topographic bias parameter b. The value of b used to generate a map is indicated immediately above the map. In all cases, 32 spikes are binned per iteration.
target sheets. All the data above are generated for a value b = 0.5, which falls midway between no initial bias, b = 0.0, giving a completely random initial map, and maximal bias, b = 1.0, giving a well-ordered initial map. Figure 5 shows representative examples of the maps generated with smaller values of b, so that the initial topographic bias is smaller. For b = 0.4 and b = 0.3 (data not shown), the final maps are essentially identical to those generated with b = 0.5. For b = 0.2, we see a well-ordered map, although distortions, particularly around the edges of the map, are beginning to appear. For b = 0.1, these edge distortions become more significant, but the center of the map remains well structured. For b = 0.0, corresponding to an initially completely random map, a few small patches of locally refined topography emerge, but there is no global order to the map. Figure 6 shows the change in the overall topographic error in these maps during development for these three smaller values of b. The topographically refined patches for b = 0.0 explain why the topographic error actually increases over time in this case: the absence of global order means that synaptic connections may be removed from areas of lower topographic error in favor of higher correlations in areas of greater topographic error. Finally, we examine the robustness of our results to noise. For all data presented above, spontaneous firing of the silicon retina has been completely suppressed, so that pixels generate a spike only in response to a moving edge. Turning up the spontaneous firing rate introduces noise into the captured activity patterns, allowing us to determine how robust our results are to such noise. We bin 64 rather than 32 spikes in order to allow a reasonable number of both edge and spontaneous events to be caught during each capture phase, and set b to its usual value, b = 0.5. In Figure 7, we show typical examples of captured events in response to a horizontally moving vertical edge.
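The spontaneous-to-total ratio quoted as a percentage in Figure 7 can be estimated per bin along the following lines. This is a sketch only: how edge-related pixels are identified is our assumption, since the classification procedure is not spelled out in the text.

```python
def noise_fraction(active_pixels, edge_pixels):
    """Estimate of bin noise: spontaneous spike events / total events.

    `active_pixels` is the set of pixels that spiked during the bin;
    `edge_pixels` is the (assumed known) set of pixels the moving edge
    is expected to sweep during the bin.  In nonbursting mode each
    active pixel corresponds to exactly one spike event.
    """
    active = set(active_pixels)
    if not active:
        return 0.0
    spontaneous = active - set(edge_pixels)
    return len(spontaneous) / len(active)
```

A value of about 0.4 then corresponds to the ~40% spontaneous-firing condition discussed below.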
The percentage figure associated with each set of data gives an estimate of the amount of noise in the data, calculated simply as the ratio of the number of spontaneous to total spike events per bin. Figure 8 shows
Figure 6: The change in the overall topographic error in the development of the topographic maps, calculated as discussed in the text, plotted against iteration number, for the same three data sets used to generate Figure 5. The number attached to each curve indicates the value of b used to generate the data.
Figure 7: Representative examples of captured spikes (64 spikes binned per iteration) in response to a moving edge in the presence of spontaneous firing from the chip's pixels. Each 16 × 16 array of squares represents the array of pixels on the chip, a black square denoting a pixel that has spiked during the binning period and a white one a pixel that has not spiked. The percentage figure above each array gives an estimate of the level of spontaneous firing for that data set.
the maps generated in the presence of spontaneous activity from the chip after 10,000 iterations. For a spontaneous firing rate of ≈40% (so that ≈40% of active pixels are not associated with an edge), the final map, although exhibiting slight distortions, is quite good. The topographic error remaining in this map is ≈0.5 (in units of cell spacings), which compares very favorably to the results above despite the high level of noise. Even for a spontaneous firing rate of ≈50%, the final map still possesses well-developed structure,
Figure 8: Final states of the topographic mappings between the neuromorphic chip’s retinal array and a target sheet for three different levels of spontaneous activity on the chip. The spontaneous activity level present in generating a map is indicated immediately above the map.
Figure 9: The change in the overall topographic error in the development of the topographic maps, calculated as discussed in the text, plotted against iteration number, for the same three data sets used to generate Figure 8. The number attached to each curve indicates the spontaneous activity level used to generate the data.
although the distortions are much larger, giving a remaining topographic error of ≈1.0. For ≈60% spontaneous firing, however, the final map has barely unfolded, the remaining topographic error being ≈4.5, close to its initial value. Figure 9 shows the change in the overall topographic error in these maps during development for the data sets shown in Figure 8.
5 Discussion
We have shown that a silicon retina, responding to a moving bar generated on an LCD display, can be used, in conjunction with a biologically inspired neurotrophic model of neuronal development, to drive the topographic refinement of afferent projections to a sheet of target neurons and establish a nearly perfect one-to-one mapping between afferent and target cells. The emergence of a refined topographic map occurs robustly, largely independent of the size of spike bins, provided that the binned data contain sufficient structure in a direction perpendicular to the bar's edge. However, larger spike bins do lead to greater rates of topographic error reduction, although the final errors associated with the mature maps do not exhibit much variation as a function of spike bin size. For our choice of parameters, bins of 64 spikes give roughly optimal performance, resulting in the greatest rate of topographic error reduction. Larger numbers of spikes per bin begin to correlate the activities of nearby afferents too strongly, resulting in a breakdown or tempering of competitive dynamics between spatially neighboring afferent cells above some threshold in the strength of afferent correlations. The precise location of this threshold depends in part on the function D_xy, which characterizes the diffusion of NTF between target cells. Narrowing D_xy increases the threshold but weakens the coupling between target cells, so that there is less tendency for the receptive fields of nearby target cells to be similar. Overall, therefore, while narrowing D_xy will increase this threshold, it will also tend to disrupt the emergence of topography. Our results, generated from a real-world device that captures some of the key features of retinal processing, compare very favorably with results obtained by simulating afferent input (Elliott & Shadbolt, 1999).
This is despite the intrinsic noise present in all real-world devices—for example, the occasional failure of some pixels to respond to the stimulus and the temporal jitter of the responses. Indeed, in many respects, the system presented here performs better than earlier simulations. In some earlier work (Elliott & Shadbolt, 1999), we considered afferent input to consist of randomly activated pixels smeared by a gaussian smoothing function (cf. Goodhill, 1993). For the same learning rate used here, that system required well in excess of 100,000 iterations in order to generate a refined topographic map (see figure 6 in Elliott & Shadbolt, 1999). In contrast, our study here requires only on the order of 10,000 iterations to produce a mature map when coupled to the silicon retina. It may be argued that our stimulus (a moving bar) is rather unnatural. However, similar differences between the rates of development in systems with and without preprocessing of the raw optical image are also seen in robotic studies (unpublished observations). Retinal preprocessing of the raw optical image therefore appears, from a developmental point of view, to be of considerable advantage, allowing maps to achieve maturity more quickly than would otherwise be possible.
Terry Elliott and Jörg Kramer
Spontaneous activity is a key feature of real nervous systems that allows, for example, the encoding of negative quantities by mechanisms that reduce spontaneous activity. Although we have not considered such encoding here, any model of neuronal development must be robust to the presence of spontaneous activity. By turning up the level of spontaneous activity on the silicon retina, we have shown that reasonable maps are generated even in the presence of high levels of spontaneous firing, with up to approximately half of all spike events not being stimulus related. The retina chip used here proves to be a suitable input sensor for the plasticity model. It provides the possibility of varying such parameters as the refractory period and spontaneous firing rate of the afferent neurons and therefore allows us to evaluate the performance of the model under different input conditions. In the nonbursting mode, the retina chip implements very sparse coding, which makes the topographic refinement process more efficient. If, in addition, spontaneous firing is suppressed, the output signal is robust and reproducible, with an error rate (number of missed and spurious events divided by total number of events) of approximately 5%.

6 Conclusion
The bar-stimulated silicon retina coupled to a neurotrophic model is a comparatively simple system that has been used here to demonstrate, for the first time, that a biologically inspired model of neuronal plasticity and a neuromorphic chip can operate successfully in unison. Future work might see the mounting of the chip on a mobile robotic platform for more realistic patterns of stimulation, the activation of both the ON and the OFF pathways in the chip and an examination of their segregation on some target structure, and the use of two such chips in a binocular configuration to examine the development of structures such as ocular dominance columns.

Acknowledgments
T.E. thanks the Royal Society for the support of a University Research Fellowship. J.K. was supported in part by the Swiss National Foundation Research SPP grant. We thank David Lawrence of the Institute of Neuroinformatics for his invaluable help with interfacing the chip to the PC. We thank the Telluride 2000 Neuromorphic Engineering Workshop for having facilitated this work.

References

Boahen, K. (1999). Retinomorphic chips that see quadruple images. In Proc. 7th Int. Conf. on Microelectronics for Neural, Fuzzy, and Bio-Inspired Systems (pp. 12–20). Los Alamitos, CA: IEEE Computer Society Press.
Boahen, K. A. (2000). Point-to-point connectivity between neuromorphic chips using address events. IEEE Trans. Circuits and Systems II, 47, 416–434.
Developing Topography Using a Neuromorphic Vision Chip
Deiss, S. R., Douglas, R. J., & Whatley, A. M. (1998). A pulse-coded communications infrastructure for neuromorphic systems. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 157–178). Cambridge, MA: MIT Press.
Delbrück, T., & Mead, C. A. (1994). Analog VLSI adaptive logarithmic wide-dynamic-range photoreceptor. In Proc. 1994 IEEE Int. Symp. on Circuits and Systems (pp. 339–342). Piscataway, NJ: IEEE Press.
Elliott, T., Maddison, A. C., & Shadbolt, N. R. (2001). Competitive anatomical and physiological plasticity: A neurotrophic bridge. Biological Cybernetics, 84, 13–22.
Elliott, T., & Shadbolt, N. R. (1998a). Competition for neurotrophic factors: Mathematical analysis. Neural Computation, 10, 1939–1981.
Elliott, T., & Shadbolt, N. R. (1998b). Competition for neurotrophic factors: Ocular dominance columns. Journal of Neuroscience, 18, 5850–5858.
Elliott, T., & Shadbolt, N. R. (1999). A neurotrophic model of the development of the retinogeniculocortical pathway induced by spontaneous retinal waves. Journal of Neuroscience, 19, 7951–7970.
Etienne-Cummings, R., & Van der Spiegel, J. (1996). Neuromorphic vision sensors. Sensors and Actuators A, SNA056, 19–29.
Fukushima, K., Yamaguchi, Y., Yasuda, M., & Nagata, S. (1970). An electronic model of the retina. Proc. IEEE, 58, 1950–1951.
Goodhill, G. J. (1993). Topography and ocular dominance: A model exploring positive correlations. Biological Cybernetics, 69, 109–118.
Higgins, C., & Koch, C. (2000). A modular multi-chip neuromorphic architecture for real-time visual processing. Journal of Analog Integrated Circuits and Signal Processing, 26, 195–211.
Indiveri, G., Mürer, R., & Kramer, J. (2001). Active vision using an analog VLSI model of selective attention. IEEE Trans. Circuits and Systems II, 48, 492–500.
Indiveri, G., Whatley, A. M., & Kramer, J. (1999). Reconfigurable neuromorphic VLSI multi-chip system applied to visual motion computation. In Proc. 7th Int. Conf. on Microelectronics for Neural, Fuzzy, and Bio-Inspired Systems (pp. 37–44). Los Alamitos, CA: IEEE Computer Society Press.
Kramer, J. (2002a). An integrated optical transient sensor. Submitted for publication.
Kramer, J. (2002b). An on/off transient imager with event-driven, asynchronous read-out. In Proc. 2002 IEEE International Symposium on Circuits and Systems, vol. II (pp. 165–168). Piscataway, NJ: IEEE Computer Society Press.
Lazzaro, J., Wawrzynek, J., Mahowald, M., Sivilotti, M., & Gillespie, D. (1993). Silicon auditory processors as computer peripherals. IEEE Trans. Neural Networks, 4, 523–528.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., Burg, T., & Douglas, R. J. (2001). Orientation-selective aVLSI spiking neurons. Neural Networks, 14, 629–643.
Mahowald, M. A. (1991). Silicon retina with adaptive photoreceptors. Proc. SPIE, 1473, 52–58.
Mahowald, M. A. (1994). An analog VLSI system for stereoscopic vision. Norwell, MA: Kluwer.
McAllister, A. K., Katz, L. C., & Lo, D. C. (1999). Neurotrophins and synaptic plasticity. Annual Review of Neuroscience, 22, 295–318.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Meyer, R. L. (1983). Tetrodotoxin inhibits the formation of refined retinotopography in goldfish. Developmental Brain Research, 6, 293–298.
Moini, A. (1999). Vision chips. Norwell, MA: Kluwer.
Mortara, A., & Vittoz, E. A. (1994). A communication architecture tailored for analog VLSI neural networks: Intrinsic performance and limitations. IEEE Trans. Neural Networks, 5, 459–466.
Schmidt, J. T., & Edwards, D. L. (1983). Activity sharpens the map during the regeneration of the retinotectal projection in goldfish. Brain Research, 209, 29–39.
Schmidt, J. T., & Eisele, L. E. (1985). Stroboscopic illumination and dark rearing block the sharpening of the regenerated retinotectal map in goldfish. Neuroscience, 14, 535–546.
Sirosh, J., & Miikkulainen, R. (1997). Topographic receptive fields and patterned lateral interactions in a self-organizing model of the primary visual cortex. Neural Computation, 9, 577–594.
Venier, P., Mortara, A., Arreguit, X., & Vittoz, E. A. (1997). An integrated cortical layer for orientation enhancement. IEEE J. Solid-State Circuits, 32, 177–186.

Received October 23, 2001; accepted April 18, 2002.
LETTER
Communicated by Laurence Abbott
Hebbian Imprinting and Retrieval in Oscillatory Neural Networks

Silvia Scarpetta
[email protected]
Department of Physics "E. R. Caianiello," Salerno University, Baronissi (SA), Italy, and INFM, Sezione di Salerno (SA), Italy

L. Zhaoping
[email protected]
Gatsby Computational Neuroscience Unit, UCL, London, U.K.

John Hertz
[email protected]
Nordita, DK-2100 Copenhagen, Denmark

We introduce a model of generalized Hebbian learning and retrieval in oscillatory neural networks modeling cortical areas such as hippocampus and olfactory cortex. Recent experiments have shown that synaptic plasticity depends on spike timing, especially on synapses from excitatory pyramidal cells, in hippocampus and in sensory and cerebellar cortex. Here we study how such plasticity can be used to form memories and input representations when the neural dynamics are oscillatory, as is common in the brain (particularly in the hippocampus and olfactory cortex). Learning is assumed to occur in a phase of neural plasticity, in which the network is clamped to external teaching signals. By suitable manipulation of the nonlinearity of the neurons or the oscillation frequencies during learning, the model can be made, in a retrieval phase, either to categorize new inputs or to map them, in a continuous fashion, onto the space spanned by the imprinted patterns. We identify the first of these possibilities with the function of olfactory cortex and the second with the observed response characteristics of place cells in hippocampus. We investigate both kinds of networks analytically and by computer simulations, and we link the models with experimental findings, exploring, in particular, how the spike-timing dependence of the synaptic plasticity constrains the computational function of the network and vice versa.

1 Introduction
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2371–2396 (2002).

It has long been known that the brain is a dynamical system in which nonstatic activities are common. In particular, oscillatory neural activity has
been observed and is believed to play significant functional roles in, for example, the hippocampus and the olfactory cortex. The inputs to these areas can be oscillatory, and the intra-areal connections also make these systems prone to intrinsic oscillatory dynamics. Networks of interacting excitatory and inhibitory neurons (E-I networks) are ubiquitous in the brain, and oscillatory activity is not unexpected in such networks because of the intrinsically asymmetric character of the interactions between excitatory and inhibitory cells (see, e.g., Menschik & Finkel, 1999; Fellous & Sejnowski, 2000; Tiesinga, Fellous, Jose, & Sejnowski, 2001). Recent experimental findings further underscored the importance of dynamics by showing that long-term changes in synaptic strengths depend on the relative timing of pre- and postsynaptic firing (Magee & Johnston, 1997; Debanne, Gahwiler, & Thompson, 1994, 1998; Bi & Poo, 1998; Markram, Lubke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997, 1999; Feldman, 2000). For instance, in neocortical and hippocampal pyramidal cells (Magee & Johnston, 1997; Debanne et al., 1998; Bi & Poo, 1998; Markram et al., 1997; Feldman, 2000), the synaptic strength increases (long-term potentiation, LTP) or decreases (long-term depression, LTD), depending on whether the presynaptic spike precedes or follows the postsynaptic one. The computational role and functional implications of this type of plasticity have been explored in recent work (Rao & Sejnowski, 2001; Kempter, Gerstner, & van Hemmen, 1999; Song, Miller, & Abbott, 2000, and references therein). This synaptic modification is largest for differences between pre- and postsynaptic spike times on the order of 10 ms. Since the scale of this relative timing is comparable to the period of neural oscillations, the oscillatory dynamics should affect the resulting synaptic modifications.
In particular, the relative phases between the oscillating neurons ought to constrain the synaptic changes that can occur. These synaptic strengths should in turn determine the nature of the network dynamics. This interplay seems likely to have significant functional consequences. While we have achieved some understanding of the computational power of oscillatory networks (see Li & Dayan, 1999 and references therein), they are poorly understood in comparison with networks that always converge to static states, such as feedforward networks or recurrent networks with symmetric connections. In particular, although we know a lot about appropriate learning algorithms for associative memory in symmetrically connected and feedforward networks (Hertz, Krogh, & Palmer, 1991), there is little previous work on learning, in the context of the synaptic physiological findings mentioned above, in asymmetrically connected networks with oscillatory dynamics. In this article, we introduce a model for spike-timing-dependent learning in an oscillatory neural network and show how such a network can perform associative memory or input representation after learning. The experimental findings dictate the general form of our model. It is an E-I network, with asymmetric interactions between excitatory and inhibitory cells, that exhibits input-driven oscillatory activity. We describe the long-term synaptic changes induced by a pair of pre- and postsynaptic spikes at times t_pre and t_post by a function, which we denote A(τ), of the difference in spike times τ = t_post − t_pre. Hence, A(τ) is positive or negative for LTP or LTD at a particular τ value. According to the experiments (Magee & Johnston, 1997; Debanne et al., 1994, 1998; Bi & Poo, 1998; Markram et al., 1997; Bell et al., 1997, 1999; Feldman, 2000), A(τ) varies in different preparations. For instance, the synapses between hippocampal pyramidal cells have A(τ) > 0 when τ > 0 and A(τ) < 0 when τ < 0. We will consider a general A(τ) in order to be able to explore the consequences of different forms of this function that may be relevant to different areas or conditions. We study analytically and by simulation how oscillatory activities influence synaptic changes and how these changes influence the network oscillations and their functions. In particular, we ask the following questions: (1) How can the system function as an associative memory or as a substrate for a map of an input space? (2) What constraints do these functions place on the form of A(τ)? (3) What constraints would particular experimental findings about A(τ) impose on the function of networks like this one? In the next section, we present the model E-I network and describe its dynamics for arbitrary synaptic strengths, making use of a linearized analysis. Section 3 then applies the spike-timing-dependent synaptic dynamics to the firing states evoked by oscillatory input. We obtain general expressions for the resulting learned synaptic strengths and use the linearized theory to study the response properties of the network. We show how the learning rates can be adjusted so that after learning, the network responds strongly (resonantly) to inputs similar to those used to drive it during learning and weakly to unlearned inputs. In addition to this pattern tuning, the system exhibits tuning with respect to driving frequency: the response is weakened for driving frequencies different from that used during learning.
We show further that depending on the kind of nonlinearity in the neuronal input-output function, the model can perform two qualitatively different kinds of computations. One is associative memory, in which an input to the network is categorized by identifying it with the learned pattern most similar to it. (Olfactory cortex is believed to operate in something like this way.) The other is to make a representation of the input pattern as a continuous mapping onto the space spanned by the learned patterns. For this mode of operation, which we call input representation, it is not necessary for the network to have learned explicitly all patterns to which it should respond; it performs a kind of interpolation between a much smaller number of prototypes. (In the hippocampus, we identify these prototypes with place cell fields.) Section 4 presents the nonlinear analysis of the network for both these cases. In section 5, we examine the consequences of various possible constraints on the signs and plasticities of the synapses. Despite the primitive character of the model we use, we believe our findings may have relevance to the dynamics of many cortical areas. In the final section, we discuss our results in the context of other modeling and experimental findings, indicating some interesting directions, both experimental and theoretical, for future work.
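The spike-timing window A(τ) discussed above is commonly modeled as a pair of exponentials: potentiation for positive delays, depression for negative ones. A minimal Python sketch (the amplitudes and the roughly 10 ms time constants are illustrative assumptions, not values fixed by this article):

```python
import math

def stdp_kernel(tau_ms, a_plus=1.0, a_minus=1.0, t_plus=10.0, t_minus=10.0):
    """Spike-timing window A(tau), tau = t_post - t_pre in milliseconds:
    LTP for tau > 0, LTD for tau < 0. All parameters are illustrative."""
    if tau_ms > 0:
        return a_plus * math.exp(-tau_ms / t_plus)    # pre before post: potentiation
    if tau_ms < 0:
        return -a_minus * math.exp(tau_ms / t_minus)  # post before pre: depression
    return 0.0
```

With equal amplitudes and time constants, this window is antisymmetric, A(−τ) = −A(τ), the shape the text cites for synapses between hippocampal pyramidal cells; unequal parameters give the more general kernels the analysis allows for.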
2 The Model
We base our model on one formulated recently by two of us (Li & Hertz, 2000) to describe olfactory cortex. For completeness, we summarize its main features here. In the brain regions we model, hippocampus or olfactory cortex, pyramidal cells make long-range connections to both other pyramidal cells and inhibitory interneurons, while inhibitory interneurons generally project only locally (see Figure 1A). The elementary module of the system is an E-I pair consisting of one excitatory and one inhibitory unit, mutually connected. Each such unit represents a local assembly of pyramidal cells or local interneurons sharing common, or at least highly correlated, input. (The number of neurons represented by the excitatory units is in general different from the number represented by the inhibitory units.) Such E-I pairs, without connections between them, form independent damped local oscillators. The connections between units in different pairs, which we term long-range connections, are subject to learning or plasticity in this study. They couple the pairs and determine the normal modes of the coupled-oscillator system. The input to the system is an oscillating pattern I_i driving the excitatory units. It models the bulb input to the pyramidal cells in the olfactory cortex or the input from the entorhinal cortex (perforant path) and the dentate gyrus (mossy fiber) to CA3 pyramidal cells. The system outputs are from the excitatory cells. The state variables, modeling the membrane potentials, are u = {u_1, ..., u_N} and v = {v_1, ..., v_N}, respectively, for the excitatory and inhibitory units. (We denote vectors by bold type.) The unit outputs, representing the probabilities of the cells firing (or instantaneous firing rates), are given by g_u(u_1), ..., g_u(u_N) and g_v(v_1), ..., g_v(v_N), where g_u and g_v are sigmoidal activation functions that model the neuronal input-output relations.
The equations of motion are

\dot{u}_i = -\alpha u_i - \beta_i^0 g_v(v_i) + \sum_j J_{ij}^0 g_u(u_j) + I_i,   (2.1)

\dot{v}_i = -\alpha v_i + \gamma_i^0 g_u(u_i) + \sum_{j \neq i} W_{ij}^0 g_u(u_j),   (2.2)
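A forward-Euler sketch of equations 2.1 and 2.2 for a single uncoupled E-I pair (all long-range terms and external input set to zero, tanh activations, illustrative parameter values not taken from the article) exhibits the damped local oscillation these pairs are described as producing:

```python
import math

def simulate_ei_pair(alpha=0.1, beta=1.0, gamma=1.0, dt=0.01, steps=2000):
    """Forward-Euler integration of one isolated E-I pair:
        du/dt = -alpha*u - beta*g(v),   dv/dt = -alpha*v + gamma*g(u),
    i.e., equations 2.1-2.2 with J^0 = W^0 = 0 and zero external input.
    All parameter values are illustrative."""
    g = math.tanh                 # sigmoidal activation, odd about the fixed point
    u, v = 1.0, 0.0               # displacement from the fixed point (0, 0)
    trace = []
    for _ in range(steps):
        du = -alpha * u - beta * g(v)
        dv = -alpha * v + gamma * g(u)
        u, v = u + dt * du, v + dt * dv
        trace.append(u)
    return trace

trace = simulate_ei_pair()
# u oscillates at an angular frequency near sqrt(beta*gamma) while its
# envelope decays at rate ~alpha: a damped local oscillator.
```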
Figure 1: Facing page. (A) The model elements. Input is to the excitatory units u, which also provide the network output. There are local excitatory-inhibitory connections (vertical solid lines) and nonlocal connections (indicated by dashed lines) between the excitatory units (J_ij) and from excitatory to inhibitory units (W_ij). (B–D) The activation functions for model neurons. Class I and class II nonlinearities are shown in B and C, respectively. Crosses mark the equilibrium point (ū, v̄) of the system (see section 2.1) used in our numerical simulations. The slopes of all activation functions used in these calculations are taken to be 1 at the equilibrium point. (E–G) An example of kernel shape A(τ) (Kempter et al., 1999) and the real and imaginary parts of its Fourier transform.
where α⁻¹ is the membrane time constant (for simplicity, assumed the same for excitatory and inhibitory units), J_ij^0 is the synaptic strength from excitatory unit j to excitatory unit i, W_ij^0 is the synaptic strength from excitatory unit j to inhibitory unit i, β_i^0 and γ_i^0 are the local inhibitory and excitatory connections within the E-I pair i, and I_i(t) is the net input from other parts of the brain. We omit inhibitory connections between pairs here, since the
real anatomical long-range connections appear to come predominantly from excitatory cells. (The parameter γ_i^0 could be identified as W_ii^0 and the term γ_i^0 g_u(u_i) absorbed into the following sum over j, but for later convenience, we have written this local term explicitly.) All these parameters are nonnegative; the inhibitory character of the second term on the right-hand side of equation 2.1 is indicated by the minus sign preceding it.

2.1 Linearization. The static part Ī of the input determines a fixed point (ū, v̄), given by the solution of equations u̇ = 0, v̇ = 0 with I = Ī. Linearizing equations 2.1 and 2.2 around the fixed point leads to
\dot{u}_i = -\alpha u_i - \beta_i v_i + \sum_j J_{ij} u_j + \delta I_i
\dot{v}_i = -\alpha v_i + \gamma_i u_i + \sum_j W_{ij} u_j,   (2.3)

where u_i and v_i are now measured from their fixed-point values, δI ≡ I − Ī, β_i = g_v′(v̄_i) β_i^0, γ_i = g_u′(ū_i) γ_i^0, W_ij = g_u′(ū_j) W_ij^0, and J_ij = g_u′(ū_j) J_ij^0. Henceforth, for simplicity, we assume β_i = β, γ_i = γ, independent of i. Eliminating the v_i from equation 2.3, we have the second-order differential equations

[(\partial_t + \alpha)^2 + \beta\gamma]\,u = M u + (\partial_t + \alpha)\,\delta I,   (2.4)

where

M = (\partial_t + \alpha)\,J - \beta W,   (2.5)

or, equivalently,

\ddot{u} + (2\alpha - J)\,\dot{u} + [\alpha^2 - \alpha J + \beta(\gamma + W)]\,u = (\partial_t + \alpha)\,\delta I.   (2.6)
(We use sans serif type to denote matrices.) Given a stable fixed point, an oscillatory drive δI ≡ δI⁺ + δI⁻, where δI⁺ ∝ e^{−iωt} and δI⁻ = δI⁺*, will lead eventually to a sustained oscillatory response u ≡ u⁺ + u⁻ with the same frequency ω, with u⁺ ∝ e^{−iωt} and u⁻ = u⁺*. Then from equation 2.4,

[(-i\omega + \alpha)^2 + \beta\gamma]\,u_i^+ = \sum_j M_{ij} u_j^+ + (\alpha - i\omega)\,\delta I_i^+,   (2.7)

where

M = (\alpha - i\omega)\,J - \beta W   (2.8)
is now the M in equation 2.5 applied to the e^{−iωt} modes. The terms in the square bracket describe the local E-I pair contribution, while M gives an effective coupling between the oscillating E-I pairs. A zero M makes u proportional to δI with a constant phase shift; that is, each individual damped oscillator is driven independently by a component of the external drive. Learning imprints patterns into M through the long-range connections J and W. After learning, u depends on how δI⁺ decomposes into the eigenvectors of M. Thus, the network can selectively amplify or distort δI in an imprinted-pattern-specific manner and thereby function as an associative memory or input representation.

2.2 Nonlinearity. At large response amplitudes, nonlinearity in g_u and g_v significantly modifies the response. We will focus on the nonlinearity in g_u only, since g_v affects only the local synaptic input while g_u also affects the long-range input mediated by J and W. We categorize the nonlinearity into two general classes in terms of how g_u deviates from linearity near the fixed point ū:
class I:  g_u(u_i) \sim u_i - a u_i^3
class II: g_u(u_i) \sim u_i + a u_i^3 - b u_i^5,   (2.9)
where a, b > 0, and u_i is measured from the fixed-point value ū_i. Class I and class II nonlinearities differ in whether the gain g_u′ decreases or increases (before saturation) as one moves away from the equilibrium point and will lead to qualitatively different behavior, as will be shown. We will not treat the more general case where g_u(u) is not an odd function of u. However, to lowest order, a quadratic term acts just to shift the equilibrium point, and a quartic one does not affect our results qualitatively.

3 Learning, Neural Dynamics, and Model Behavior
In our treatment, we distinguish a learning mode, in which the oscillating patterns are imprinted in the synaptic connections J and W, from a recall mode, in which connection strengths do not change. Of course, this distinction is somewhat artificial; real neural dynamics may not be separated so cleanly into such distinct modes. Nevertheless, among other effects (Menschik & Finkel, 1999; Fellous & Sejnowski, 2000), cholinergic modulatory effects probably do weaken synapses during learning (Hasselmo, 1993, 1999), so there is an experimental basis for the distinction, and it is conceptually indispensable. In what follows, we will consider learning of oscillation patterns of two kinds. In one, two local oscillators are either in phase with each other or 180 degrees out of phase; we can write u_i(t) ∝ ξ_i cos ωt, where the ξ_i are real numbers (either positive or negative) describing the amplitudes on the different sites. In the second kind of pattern, different local oscillators can have different phases: u_i(t) ∝ |ξ_i| cos(ωt − φ_i). We can describe both cases by writing u_i(t) = ξ_i e^{−iωt} + c.c., taking the ξ_i real in the first case and complex (ξ_i = |ξ_i| e^{iφ_i}) in the second. Thus, we will often call the first case real patterns and the second complex patterns.

3.1 Learning Mode. Let C_ij be the synaptic strength from presynaptic unit j to postsynaptic unit i. Let x_j(t) and y_i(t) represent the corresponding activities relative to some stationary levels at which no changes in synaptic strength occur. Then C_ij changes during the learning interval [0, T] according to
\delta C_{ij}(t) = \langle y_i(t)\, A(t - t')\, x_j(t') \rangle = \frac{\eta}{T} \int_0^T dt \int_{-\infty}^{\infty} dt'\; y_i(t)\, A(t - t')\, x_j(t'),   (3.1)
where η is the learning rate and T may be taken equal to the period of the oscillating input. The kernel A(t − t′) is the measure of the strength of synaptic change at time delay τ = t − t′. For example, conventional Hebbian learning, with A(τ) ∝ δ(τ) (used, e.g., in Li & Hertz, 2000), gives δC_ij ∝ ∫₀ᵀ dt u_i(t) u_j(t). Some experiments (Bi & Poo, 1998; Markram et al., 1997) suggest A(τ) to be a nearly antisymmetric function of τ, positive (LTP) for τ > 0 and negative (LTD) for τ < 0 (see Figures 1E–1G). However, for the moment we do not restrict its shape. In equation 3.1, the contributions of all pre-post spike pairs are combined additively. The polarity and the magnitude of each contribution are determined by the time delay of each pair of spikes via the kernel A(τ). This weighted sum makes the resulting synaptic change depend on both the time delay and the spike rates. Our preliminary investigations indicate that with an appropriate choice of kernel, this simple linear model can reproduce the essential features of the results of Sjöström, Turrigiano, and Nelson (2001). Applying the learning rule to our connections J_ij and W_ij, we use equation 3.1 with x = u, C = J or W, and y = u or v, respectively, giving

J_{ij} = \frac{1}{NT} \int_0^T dt \int_{-\infty}^{\infty} dt'\; u_i(t)\, A_J(t - t')\, u_j(t')
W_{ij} = \frac{1}{NT} \int_0^T dt \int_{-\infty}^{\infty} dt'\; v_i(t)\, A_W(t - t')\, u_j(t').   (3.2)
We have absorbed the learning rates into the definition of the kernels A_{J,W} and added the conventional normalizing factor 1/N for convenience in doing the mean-field calculations. Cholinergic modulation can affect the strengths of long-range connections in the brain; these are apparently almost ineffective during learning (Hasselmo, 1993, 1999). The neural dynamics is then simplified in our model by turning off J and W (and thus M) in the learning phase.
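The learning rule of equation 3.2 can be approximated by direct quadrature on sampled activities. The sketch below uses real oscillatory patterns u_i(t) = ξ_i cos(ωt) and an illustrative kernel; it is a numerical illustration of the rule, not the article's implementation:

```python
import math

def learn_J(xi, omega, kernel, nt=200, tau_max=20.0, ntau=200):
    """Direct quadrature of equation 3.2 for real patterns u_i(t) = xi_i cos(omega t),
    written with the substitution tau = t - t':
        J_ij = (1/(N T)) * int_0^T dt int dtau  u_i(t) A_J(tau) u_j(t - tau),
    with the kernel support truncated to |tau| <= tau_max.
    Grid sizes and the kernel are illustrative choices."""
    N = len(xi)
    T = 2.0 * math.pi / omega            # one oscillation period
    dt = T / nt
    dtau = 2.0 * tau_max / ntau
    J = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            s = 0.0
            for a in range(nt):
                t = a * dt
                ui = xi[i] * math.cos(omega * t)
                for b in range(ntau):
                    tau = -tau_max + b * dtau
                    uj = xi[j] * math.cos(omega * (t - tau))
                    s += ui * kernel(tau) * uj * dt * dtau
            J[i][j] = s / (N * T)
    return J

# Symmetric (Hebbian-like) kernel: J_ij comes out proportional to xi_i * xi_j.
J = learn_J([1.0, -1.0], omega=1.0, kernel=lambda tau: math.exp(-tau * tau / 8.0))
```

For real patterns, a purely antisymmetric kernel yields zero here, since only the symmetric part of A survives the average over a full oscillation period.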
Consider the learning of a single input pattern, δI = ξ^μ e^{−iω_μ t} + c.c. We calculate separately the responses u_i^± and v_i^± to the positive- and negative-frequency parts of the input, add them together, and use the resulting u_i(t) and v_i(t) in equations 3.2 to calculate J_ij and W_ij. For the positive-frequency response, we obtain

u_i^+ = \frac{(\alpha - i\omega_\mu)\,\xi_i^\mu e^{-i\omega_\mu t}}{(\alpha - i\omega_\mu)^2 + \beta\gamma} \equiv \chi_0(\omega_\mu)\,\xi_i^\mu e^{-i\omega_\mu t}   (3.3)

v_i^+ = \frac{\gamma}{\alpha - i\omega_\mu}\,\chi_0(\omega_\mu)\,\xi_i^\mu e^{-i\omega_\mu t}.   (3.4)
The quantity χ_0(ω) is the output-to-input ratio for the network with J and W equal to zero. The responses u_i^− and v_i^− to the corresponding negative-frequency driving pattern δI⁻ = ξ^μ* e^{iω_μ t} are the complex conjugates of equations 3.3 and 3.4, respectively. Substituting these responses into equations 3.2 yields the connections

J_{ij} = \frac{2}{N}\,\mathrm{Re}\left[\tilde A_J(\omega_\mu)\,\xi_i^\mu \xi_j^{\mu*}\right]
W_{ij} = \frac{2\gamma}{N}\,\mathrm{Re}\left[\frac{\tilde A_W(\omega_\mu)}{\alpha - i\omega_\mu}\,\xi_i^\mu \xi_j^{\mu*}\right],   (3.5)
where

\tilde A_{J,W}(\omega) = |\chi_0(\omega)|^2 \int_{-\infty}^{\infty} d\tau\; A_{J,W}(\tau)\, e^{-i\omega\tau}   (3.6)
can be thought of as an effective learning rate at a frequency ω. The factor of the Fourier transform of the learning kernel carries the information about different efficacies of learning for different postsynaptic-presynaptic spike time differences, while the factor |χ_0(ω)|² reflects the responsiveness of the uncoupled local oscillators (J = W = 0) in the learning phase. Note that Im Ã_{J,W}(ω) = 0 if A_{J,W}(τ) is symmetric in τ and that Re Ã_{J,W}(ω) = 0 if A_{J,W}(τ) is antisymmetric. We will sometimes denote the real and imaginary parts of Ã_{J,W} by Ã′_{J,W} and Ã″_{J,W}, respectively. The resulting effective coupling M between oscillators after learning, under positive-frequency external drive δI⁺ of frequency ω > 0 (in general ω ≠ ω_μ), is

M_{ij}^\mu = \frac{2(\alpha - i\omega)}{N}\,\mathrm{Re}[\tilde A_J(\omega_\mu)\,\xi_i^\mu \xi_j^{\mu*}] - \frac{2\beta\gamma}{N}\,\mathrm{Re}\left[\frac{\tilde A_W(\omega_\mu)}{\alpha - i\omega_\mu}\,\xi_i^\mu \xi_j^{\mu*}\right].   (3.7)
The dependence of the neural connections J and W and the oscillator couplings M on ξ_i^μ ξ_j^{μ*} is a natural generalization of the Hebb-Hopfield factor ξ_i^μ ξ_j^μ for (real) static patterns. This becomes particularly clear if we consider the special case when there is the following matching condition between the two kernels:

\tilde A_J(\omega_\mu) = \frac{\beta\gamma}{\alpha^2 + \omega_\mu^2}\,\tilde A_W(\omega_\mu),   \omega = \omega_\mu.   (3.8)

Then the oscillator coupling simplifies into a familiar outer-product form for complex vectors ξ^μ:

M_{ij} = -2i\omega_\mu\,\tilde A_J(\omega_\mu)\,\xi_i^\mu \xi_j^{\mu*} / N.   (3.9)
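Summed over several random complex patterns, couplings of the outer-product form in equation 3.9 act approximately as a projector onto the pattern subspace. A numerical sketch with numpy (the sizes are illustrative, and the scalar prefactor −2iω_μ Ã_J(ω_μ) is replaced by 1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 5                        # network size and pattern count, N >> P
# Random complex patterns with unit-variance components; for N >> P they are
# effectively orthogonal. All sizes here are illustrative.
xi = (rng.standard_normal((P, N)) + 1j * rng.standard_normal((P, N))) / np.sqrt(2)
M = sum(np.outer(xi[m], xi[m].conj()) for m in range(P)) / N

ev = np.linalg.eigvals(M)
ev = ev[np.argsort(-np.abs(ev))]     # sort by decreasing magnitude
# The first P eigenvalues cluster near the common nonzero value (here ~1);
# the remaining N - P are numerically zero: M projects onto the pattern subspace.
```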
To construct the corresponding matrices for multiple patterns (which we will always take to be random and independent), we simply sum equation 3.5 over input patterns, labeled by the index μ, as for the Hopfield model. We restrict attention to the case where the number P of stored patterns is negligible in comparison with N, the size of the network (though it may be ≫ 1). So far, all our results apply for both real and complex patterns.

3.2 Recall Mode. After learning, the connections are fixed and the response u⁺ to an input δI⁺ ∝ e^{−iωt} is described by equation 2.7. To solve it, we need to know how the M matrix acts on input vectors. We consider uncorrelated patterns all learned at the same frequency (ω_μ independent of μ). Then it is easy to see that M is a projector onto the space spanned by the imprinted patterns. It has P eigenvectors (the imprinted patterns) with the same nonzero eigenvalue and N − P with eigenvalue zero. These are standard properties of outer-product constructions for orthogonal vectors; we can treat our ξ^μ as effectively orthogonal here because we are taking the components ξ_i^μ to be independent and N ≫ P (Amit, Gutfreund, & Sompolinsky, 1985). The nonvanishing eigenvalue of M, which we denote P(ω; ω_μ), is simply computed as
P(\omega; \omega_\mu) = (\alpha - i\omega)\,\tilde A_J(\omega_\mu) - \frac{\beta\gamma\,\tilde A_W(\omega_\mu)}{\alpha - i\omega_\mu}   (3.10)

for complex patterns and

P(\omega; \omega_\mu) = 2(\alpha - i\omega)\,\mathrm{Re}\,\tilde A_J(\omega_\mu) - 2\beta\gamma\,\mathrm{Re}\left[\frac{\tilde A_W(\omega_\mu)}{\alpha - i\omega_\mu}\right]   (3.11)
for real patterns. Thus, from equation 2.7, the response u⁺ to an input δI⁺ in the imprinted-pattern subspace is

u^+ = \chi(\omega; \omega_\mu)\,\delta I^+,   (3.12)
with the linear response coefficient, or susceptibility,

\chi(\omega; \omega_\mu) = \frac{\alpha - i\omega}{\alpha^2 + \beta\gamma - \omega^2 - 2i\omega\alpha - P(\omega; \omega_\mu)}.   (3.13)
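The susceptibility of equation 3.13 is easy to evaluate numerically. In this sketch the learned eigenvalue P is treated as a constant chosen, purely illustratively, to nearly cancel the denominator at the imprinting frequency; |χ| then peaks sharply at ω = ω_μ:

```python
def chi(omega, omega_mu, alpha=0.2):
    """chi(omega; omega_mu) = (alpha - i omega) /
       (alpha^2 + beta*gamma - omega^2 - 2 i omega alpha - P).
    Illustrative choices: beta*gamma = omega_mu^2 puts the bare resonance at
    omega_mu, and the constant P cancels the real part of the denominator at
    omega_mu along with 90% of the damping term."""
    betagamma = omega_mu ** 2
    P = alpha ** 2 - 2j * omega_mu * alpha * 0.9
    den = alpha ** 2 + betagamma - omega ** 2 - 2j * omega * alpha - P
    return (alpha - 1j * omega) / den

omega_mu = 1.0
gains = {w: abs(chi(w, omega_mu)) for w in (0.5, 0.9, 1.0, 1.1, 1.5)}
# gains is largest at the imprinting frequency omega = omega_mu and falls
# off on either side: the frequency tuning discussed in the text.
```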
To achieve a resonant response to an input at the imprinting frequency (ω = ω_μ), the learning kernels should be adjusted so that both the real and imaginary parts of the denominator in χ(ω_μ; ω_μ) are close to zero, that is,

\epsilon \equiv \alpha^2 + \beta\gamma - \omega_\mu^2 - \mathrm{Re}\,P(\omega_\mu; \omega_\mu) \to 0,   (3.14)

\Delta \equiv 2\omega_\mu\alpha + \mathrm{Im}\,P(\omega_\mu; \omega_\mu) \to 0.   (3.15)
For real patterns, Im P(ω_μ; ω_μ) = −2ω_μ Ã′_J, so Δ = 2ω_μ(α − Ã′_J). Thus, small Δ requires a positive Ã′_J > 0, that is, stronger positive-τ LTP than negative-τ LTD for excitatory-excitatory couplings (again, provided the typical values of τ for which A_J(τ) is sizable are small compared to the oscillation period). From equation 3.14,

\epsilon = \alpha^2 + \beta\gamma - \omega_\mu^2 - 2\,\mathrm{Re}\left[\alpha\,\tilde A_J(\omega_\mu) - \frac{\beta\gamma\,\tilde A_W(\omega_\mu)}{\alpha - i\omega_\mu}\right].   (3.16)
Thus, for a given ω_μ, the resonance condition enforces a constraint on a linear combination of Ã′_J, Ã′_W, and Ã″_W. However, we note that Ã″_J does not appear anywhere; it is simply irrelevant to learning real patterns. For complex patterns,

\epsilon = \alpha^2 + \beta\gamma - \omega_\mu^2 - (\alpha \tilde A'_J + \omega_\mu \tilde A''_J) + \frac{\beta\gamma\,(\alpha \tilde A'_W - \omega_\mu \tilde A''_W)}{\alpha^2 + \omega_\mu^2}   (3.17)

\Delta = 2\omega_\mu\alpha + (\alpha \tilde A''_J - \omega_\mu \tilde A'_J) - \frac{\beta\gamma\,(\alpha \tilde A''_W + \omega_\mu \tilde A'_W)}{\alpha^2 + \omega_\mu^2}.   (3.18)
(We have temporarily suppressed the ω_μ-dependence of Ã′_{J,W} and Ã″_{J,W} to save space.) One can get some insight here by considering the time-shifted learning kernels A_J(t + η_μ/ω_μ) and A_W(t − η_μ/ω_μ), where η_μ = tan⁻¹(a/ω_μ). (For a ≈ ω_μ ≈ 40 Hz, these shifts are around 3 ms.) In terms of the associated frequency-domain quantities,

\tilde B_{J,W}(\omega) = \int_{-\infty}^{\infty} dt\, A_{J,W}(t \pm \eta_\mu/\omega_\mu)\, e^{-i\omega t} = \tilde A_{J,W}(\omega)\, e^{\pm i\eta_\mu},  (3.19)
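The quoted ~3 ms shift can be confirmed by direct computation (a sketch assuming a ≈ ω_μ and reading 40 Hz as the angular frequency 2π × 40 rad/s):

```python
import math

# Shift of the time-shifted learning kernels: eta_mu / omega_mu,
# with eta_mu = arctan(a / omega_mu) = pi/4 when a = omega_mu.
f = 40.0
w_mu = 2 * math.pi * f        # angular imprinting frequency (rad/s)
a = w_mu                      # a ~ omega_mu, as in the text
eta = math.atan(a / w_mu)     # = pi/4
shift_ms = 1e3 * eta / w_mu
print(round(shift_ms, 1))     # -> 3.1
```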
2382
Silvia Scarpetta, L. Zhaoping, and John Hertz
we can write equations 3.17 and 3.18 as

\epsilon = a^2 + bc - \omega_\mu^2 - \sqrt{a^2 + \omega_\mu^2}\,\tilde B''_J(\omega_\mu) - \frac{bc}{\sqrt{a^2 + \omega_\mu^2}}\,\tilde B''_W(\omega_\mu),  (3.20)

\Delta = 2\omega_\mu a - \sqrt{a^2 + \omega_\mu^2}\,\tilde B'_J(\omega_\mu) - \frac{bc}{\sqrt{a^2 + \omega_\mu^2}}\,\tilde B'_W(\omega_\mu).  (3.21)
Thus, the imaginary parts of the frequency-domain kernels B̃_{J,W}(ω_μ) shift the resonant frequency, and the real parts control the damping. In particular, one needs at least one of B̃′_{J,W} to be positive to achieve good frequency tuning. Negative (positive) imaginary parts B̃″_{J,W} increase (decrease) the resonant frequency. When the learning window widths in the kernels A_{J,W}(t) are much smaller than the oscillation period, the shifts by ±η_μ/ω_μ do not affect the real parts of B̃_{J,W} strongly. However, for window shapes (see Figure 1E) that change rapidly from negative to positive around t = 0, the imaginary parts can be strongly suppressed, even for fairly small shifts. The explicit form of the resonant response can be seen by expanding the denominator of χ(ω; ω_μ) around ω = ω_μ:

\chi(\omega;\omega_\mu) = \frac{a - i\omega_\mu}{\epsilon - i\Delta - Z(\omega_\mu)(\omega - \omega_\mu)},  (3.22)
where

Z(\omega_\mu) = 2\omega_\mu + 2ia + \left.\frac{\partial P(\omega;\omega_\mu)}{\partial\omega}\right|_{\omega = \omega_\mu}.  (3.23)
Thus, χ has a pole at

\omega = \omega_\mu + \frac{\epsilon - i\Delta}{Z} = \omega_\mu + \frac{\epsilon Z' - \Delta Z'' - i(\Delta Z' + \epsilon Z'')}{|Z|^2},  (3.24)
and, as the driving frequency ω in the recall phase is varied, the system exhibits a resonant tuning, with a peak near ω_μ and a line width equal to (ΔZ′ + εZ″)/|Z|². One has to check that the desired learning rates and kernels do not violate the condition that the response function, equation 3.13, be causal, that is, that small perturbations decay in time. Analytically, the requirement is that all singularities of χ(ω; ω_μ) must lie in the lower half of the complex ω plane. Thus, in equation 3.24, we need ΔZ′ + εZ″ to be positive. For real patterns, the analysis is fairly simple. From equations 3.11, 3.15, and 3.23, we obtain Z = 2ω_μ + 2i(a − Ã′_J) and Δ = 2ω_μ(a − Ã′_J). Thus, for
Δ → 0, Z → 2ω_μ, and the stability condition is simply that Δ be positive, that is, Ã′_J(ω_μ) < a.
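The resonant tuning described by equations 3.22 through 3.24 can be sketched numerically. The parameters below (a, bc, ω_μ) are illustrative assumptions, not values from the paper; the imprinting is taken in J only (first term of equation 3.10), with Ã_J chosen to satisfy the resonance conditions 3.14 and 3.15 exactly and then scaled by 0.98 so that ε and Δ are small but nonzero:

```python
import numpy as np

# Illustrative (assumed) parameters
a, bc, w_mu = 1.0, 4.0, 1.8

# J-only imprinting: P(w; w_mu) = (a - i w) * A_J(w_mu).
# Solve eps = 0 and Delta = 0 (eqs. 3.14-3.15) for the real and
# imaginary parts of A_J, then detune slightly by the factor 0.98.
AJp = a * (a**2 + bc + w_mu**2) / (a**2 + w_mu**2)
AJpp = (w_mu / a) * (AJp - 2 * a)
AJ = 0.98 * (AJp + 1j * AJpp)

def chi(w):
    """Susceptibility of eq. 3.13 with the J-only P(w; w_mu)."""
    P = (a - 1j * w) * AJ
    return (a - 1j * w) / ((a - 1j * w)**2 + bc - P)

w = np.linspace(0.5, 3.5, 3001)
resp = np.abs(chi(w))
w_peak = w[np.argmax(resp)]
# The tuning curve peaks close to the imprinting frequency w_mu,
# and the peak dwarfs the off-resonance response.
print(round(w_peak, 2), resp.max() / resp[0] > 10)
```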
For complex patterns, we get Z = 2ω_μ + Ã″_J + i(2a − Ã′_J). Requiring ΔZ′ + εZ″ to be positive then imposes constraints on the signs and relative magnitudes of ε and Δ, depending on Z′ and Z″. We omit the details. Notice that for both the real and complex cases, the stability analysis does not depend on the W-learning kernel A_W at all (except insofar as it affects ε and Δ). This is because Z involves the derivative ∂P(ω; ω_μ)/∂ω, and in both cases the only ω-dependence of P is in the factors (a − iω) in the first terms of equations 3.10 and 3.11, which do not involve Ã_W. Figure 2 shows examples of the frequency tuning described by equation 3.22, as obtained from simulations of small networks, including the nonlinearities described in section 2.2. Nonlinearity makes the response deviate from the linear prediction when the amplitude is larger, as happens near the resonance frequency. In particular, class I and class II nonlinearities lead to reduced and enhanced responses, respectively, relative to the linear prediction, as will be analyzed in detail.

3.2.1 Examples of Plasticity. We use examples to illustrate the constraints that the resonance and stability conditions place on the shape of the kernels for complex patterns.

J only. If the patterns are imprinted only in the excitatory-excitatory connections, we have only the first terms on the right-hand sides of equations 3.10 and 3.11. For real patterns, Δ = 2ω_μ(a − Ã′_J) (no different from the general case),
so using Δ → 0 in equation 3.16 yields ε = bc − a² − ω_μ². This then (from ε = 0) fixes the imprinting frequency ω_μ = √(bc − a²).

For complex patterns, we have Δ = 2ω_μ a − √(a² + ω_μ²) B̃′_J, similar to the real-pattern case but with a shifted learning window and a different effective strength. The resonant frequency shifts down or up according to the sign of B̃″_J.

Same-Sign Plasticities in J and W.
Let us consider the case where the kernels are related by the matching condition, equation 3.8. While the exact match is clearly a special case, the simplification it yields in the algebra permits some insight that can be expected to carry over qualitatively to other cases where the two kernels have similar shapes and comparable magnitudes. Here we find, from equations 3.10 and 3.11,

P(\omega_\mu;\omega_\mu) = -2i\omega_\mu \tilde A_J(\omega_\mu),  (3.25)
for both real and complex patterns. Applying the resonance conditions, equations 3.14 and 3.15, we have

\tilde A''_J(\omega_\mu) = \frac{a^2 + bc - \omega_\mu^2 - \epsilon}{2\omega_\mu},  (3.26)

\tilde A'_J(\omega_\mu) = a - \Delta/2\omega_\mu.  (3.27)
Thus, Ã′_J(ω_μ) reduces the effective damping from a to Δ/2ω_μ ≪ a, and this requires Ã′_J(ω_μ) ≈ a > 0.

[Figure 2 about here; its caption appears below.]

When the width of the learning kernel A_J(t) is much smaller than the oscillation period, Ã′_J(ω_μ) ≈ ∫A_J(t) dt; thus, a positive Ã′_J(ω_μ) requires that LTP dominate LTD in total strength. We observe that a negative Ã″_J(ω_μ), as, for example, for an A_J(t) like those in Figures 1E through 1G, forces ω_μ to be greater than √(a² + bc) and thus greater than the intrinsic E-I pair frequency √(bc) (a shift in the opposite direction from that in the J-only, real-pattern case). In general, when the width of A_J(t) is not small, the resonance frequency has to be determined from equations 3.26 and 3.27 by Ã″_J(ω_μ)/Ã′_J(ω_μ) ≈ (a² + bc − ω_μ²)/2aω_μ.

Opposite-Sign Plasticities in J and W.
We turn now to the case where A_J(t) and A_W(t) have opposite signs (for all t). Again, we turn to a particular matching of the magnitudes of the two kernels to find a simple case that can give some general qualitative insight. We use our old matching condition, equation 3.8, but with a minus sign. For complex patterns, we now find P(ω_μ; ω_μ) = 2a Ã_J(ω_μ), and, applying the resonance conditions, equations 3.14 and 3.15,

\tilde A'_J(\omega_\mu) = \frac{a^2 + bc - \omega_\mu^2 - \epsilon}{2a},  (3.28)

-\tilde A''_J(\omega_\mu) = \omega_\mu - \Delta/2a.  (3.29)
Comparing these with equations 3.26 and 3.27 and the accompanying analysis, we see that the roles of the real and imaginary parts of Ã_J have been reversed. Now it is the imaginary part that is constrained to be near a fixed value (−ω_μ) by the Δ → 0 condition, and the real part that enters in the ε equation. We note that we need Ã″_J(ω_μ) < 0 (like the case shown in Figure 1E) in order to obtain a small Δ, and that the sign of Ã′_J(ω_μ) determines whether the resonance frequency is larger or smaller than √(a² + bc).

Figure 2: Facing page. Frequency tuning, shown as response to δI⁺ = ξ^μ e^{−iωt} after imprinting ξ^μ e^{−iω_μ t} with ω_μ = 41 Hz. The network has 10 excitatory and 10 inhibitory units. In all figures in this article, except where explicitly stated, learning kernels are matched so that Ã_J = [bc/(a² + ω_μ²)] Ã_W = 0.5 − 0.028i, and ω_μ ≈ 41 Hz. (A) Temporal activities of 3 of the 10 excitatory cells. g_u and g_v are as in Figures 1B and 1D. (B) Frequency tuning curve. Response amplitude |⟨u⁺|ξ^μ e^{−iωt}⟩| (simply |χ(ω; ω_μ)| in the linearized theory) to input δI⁺ = ξ^μ e^{−iωt}. (B.1) Using matched kernels, from linearized theory (solid line) and from class I (stars) and II (circles) nonlinearities. (B.2) Opposite-plasticities case (Ã_J = −Ã_W = 0.99(a − iω_μ)), complex patterns. Solid line, circles, and triangles are results from the linearized theory and the class II nonlinear model with ω_μ = 41 Hz and ω_μ = 68 Hz, respectively.
Another interesting special case for complex patterns is when A_W = −A_J, with the particular choice

\tilde A_J(\omega_\mu) = a - i\omega_\mu.  (3.30)

This leads to the remarkably simple result

\chi(\omega;\omega_\mu) = \frac{i}{\omega - \omega_\mu}.  (3.31)

(Indeed, with Ã_W = −Ã_J, the second term of equation 3.10 becomes +bc, so the denominator of equation 3.13 factorizes as (a − iω)[(a − iω) − (a − iω_μ)] = −i(a − iω)(ω − ω_μ).)
That is, the choice 3.30 satisfies both constraints, ε, Δ → 0, and, in addition, puts the resonance right at the original driving frequency. Figure 2.B.2 shows a frequency tuning curve obtained with Ã_J(ω_μ) = 0.99(a − iω_μ). To understand the prescription Ã_J(ω_μ) = a − iω_μ, consider an oscillation period much greater than the temporal width of the learning kernel. Then Ã′_J(ω_μ) ≈ ∫A_J(t) dt and Ã″_J(ω_μ) ≈ −ω_μ ∫A_J(t) t dt. Thus, the prescription just requires ∫A_J(t) dt ≈ a > 0 (LTP dominates LTD in total strength) and ∫A_J(t) t dt ≈ 1 > 0 (LTP when postsynaptic spikes follow presynaptic ones, and LTD for the opposite order). This means that A_J(t) should look like Figure 1E and A_W(t) like its negative. For real patterns, this choice of kernels does not produce resonant oscillations; in fact, it leads to instability.

3.2.2 Pattern Selectivity. We now consider an input δI⁺ = ξ e^{−iωt} that does not match the imprinted pattern ξ^μ. In general, we can decompose it into a component along ξ^μ and a component in the complementary subspace: δI⁺ ≡ δI⁺_∥ + δI⁺_⊥, with δI⁺_∥ ≡ ⟨ξ^μ|ξ⟩ ξ^μ e^{−iωt} ≡ N^{−1} (Σ_j ξ_j^{μ*} ξ_j) ξ^μ e^{−iωt}. Then M δI⁺ = P(ω; ω_μ) δI⁺_∥, and

u^+ = \chi(\omega;\omega_\mu)\,\delta I^+_\parallel + \chi_0(\omega)\,\delta I^+_\perp.  (3.32)

The first term will be resonant at ω = ω_μ, but the second will not. Thus, the system amplifies the component of the input along the stored pattern

Figure 3: Facing page. Pattern selectivity. (A) Response evoked on 3 of the 10 neurons of the network by input patterns a, b, and c matching the imprinted pattern a in frequency. The class I activation functions shown in Figures 1B and 1D have been used. (B) Response amplitude |⟨u⁺|ξ^μ e^{−iω_μ t}⟩| versus input overlap |⟨ξ^μ|ξ⟩|, under input δI⁺ = ξ e^{−iω_μ t}. Results from linearized theory (solid line) and from models with class I (stars) and II (circles) nonlinearities. (C) Hysteresis effects in class II simulations. The response amplitude depends on the history of the system: the output remains at the resonance level after input withdrawal (circles connected by dotted line). Circles connected by the solid line correspond to the case of random or zero-overlap initial conditions. The connecting lines are drawn for clarity only.
relative to the orthogonal one, as shown in Figure 3. Again, nonlinearity makes the response deviate from the linear prediction at high response amplitudes, reducing and enhancing the responses for the class I and II nonlinearities, respectively. Class II nonlinearity also leads to hysteresis, with sustained responses even after the input is withdrawn, that is, as |⟨ξ^μ|ξ⟩| → 0.
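The amplification of the stored-pattern component expressed by equation 3.32 can be sketched numerically. The two susceptibility magnitudes below are illustrative assumptions (a large resonant χ and a small non-resonant χ₀), not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
xi = np.exp(2j * np.pi * rng.random(N))     # imprinted pattern xi^mu
zeta = np.exp(2j * np.pi * rng.random(N))   # unrelated input pattern

# Assumed response magnitudes at omega = omega_mu:
chi_par = 30.0    # resonant chi(w_mu; w_mu), large because eps, Delta -> 0
chi_perp = 1.5    # non-resonant chi_0(w_mu)

# Decompose the input along xi^mu and its orthogonal complement (eq. 3.32)
overlap = np.vdot(xi, zeta) / N
I_par = overlap * xi
I_perp = zeta - I_par
u = chi_par * I_par + chi_perp * I_perp

# Cosine similarity to the stored pattern, before and after the network
cos_in = abs(overlap)
cos_out = abs(np.vdot(xi, u)) / (np.linalg.norm(xi) * np.linalg.norm(u))
print(cos_out > 4 * cos_in)   # the stored-pattern component is strongly amplified
```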
The pattern selectivity can be measured by the ratio

\frac{|\chi(\omega_\mu;\omega_\mu)|}{|\chi_0(\omega_\mu)|} = \frac{\sqrt{(a^2 + bc - \omega_\mu^2)^2 + (2\omega_\mu a)^2}}{\sqrt{\Delta^2 + \epsilon^2}},  (3.33)
where we used the resonance conditions 3.14 and 3.15. When the input frequency ω deviates from ω_μ, the pattern selectivity ratio |χ(ω; ω_μ)|/|χ₀(ω)| is reduced.

3.2.3 Interpolation and Categorization. With multiple imprinted patterns, an input δI⁺ = ξ e^{−iωt} that overlaps with several of them will evoke a correspondingly mixed resonant linear response u⁺ = χ(ω; ω_μ) δI⁺_∥ + χ₀(ω) δI⁺_⊥, where δI⁺_∥ = Σ_μ ⟨ξ^μ|ξ⟩ ξ^μ e^{−iωt}. That is, any input in the pattern subspace produces a resonant linear response just like that to an input proportional to a single pattern. This is a standard property of linear associative memories for orthogonal patterns. (When the number of patterns is much smaller than the number of units in the network, independent random patterns may be taken as effectively orthogonal.) This feature enables the system to interpolate between imprinted patterns, that is, to perform an elementary form of generalization from the learned set of patterns. This property can be useful for input representation. A similar property also holds in the class I nonlinear model but not in the class II model. To see this, suppose the drive I = ξ e^{−iω_μ t} + c.c. overlaps two imprinted patterns, with ξ ∝ ξ¹ cos ψ + ξ² sin ψ, and write the response u⁺ as u⁺ ∝ ξ¹ cos φ + ξ² sin φ. For a linear model, φ = ψ. The class I nonlinear model gives 45° ≥ φ > ψ when ψ < 45° and 45° < φ < ψ when ψ > 45° (see Figure 4). Thus, it tends to equalize the response amplitudes to ξ¹ and ξ² even when they contribute unequally to the input. In contrast, the class II nonlinear model amplifies the difference in input strengths to give higher gain to the stronger input component, ξ¹ or ξ², thus performing a kind of categorization of the input. Thus, the two nonlinearity classes lead to different computational properties.
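The linear interpolation property (φ = ψ) can be checked directly. A sketch with two orthogonalized random patterns and an assumed common resonant gain:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256
p1 = np.exp(2j * np.pi * rng.random(N))     # first imprinted pattern
p2 = np.exp(2j * np.pi * rng.random(N))
p2 -= (np.vdot(p1, p2) / N) * p1            # Gram-Schmidt: orthogonalize to p1
p2 *= np.sqrt(N) / np.linalg.norm(p2)       # renormalize so |p2|^2 = N

chi_res = 25.0                              # assumed common resonant gain
psi = np.deg2rad(30.0)
inp = np.cos(psi) * p1 + np.sin(psi) * p2   # mixture of the two stored patterns

u = chi_res * inp                           # linear response: same gain on both
phi = np.arctan2(abs(np.vdot(p2, u) / N), abs(np.vdot(p1, u) / N))
print(round(float(np.rad2deg(phi)), 1))     # -> 30.0: phi = psi in the linear model
```

Both components of the mixture receive the same resonant gain, so the response angle equals the input angle; the class I and II nonlinearities bend φ toward or away from 45°, as described above.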
For the case shown in Figure 4B, the parameters are such that the categorization is into three categories, corresponding to outputs near ξ¹, ξ², and their symmetric combination. For stronger nonlinearity, bipartite classification is possible. Another way to prevent undesirable interpolation between imprinted patterns (or classes of them) is to store different patterns or classes at frequencies that differ by more than the frequency tuning width. Suppose ξ¹ e^{−iω₁ t} and ξ² e^{−iω₂ t} are imprinted, with ω₁ ≠ ω₂ and ξ¹ · ξ² ≈ 0. Then we have J_ij = J_ij¹ + J_ij² and W_ij = W_ij¹ + W_ij², where J_ij^μ and W_ij^μ are given by equations 3.5 with the corresponding frequencies for μ = 1, 2. The resonance and stability conditions should be enforced separately for each pattern.
Figure 4: Input-output relationship when two orthogonal patterns, ξ¹ and ξ², have been imprinted at the same frequency ω_μ = 41 Hz. Input ∝ ξ¹ cos ψ + ξ² sin ψ and response u ∝ ξ¹ cos φ + ξ² sin φ. Circles show the simulation results; lines show the analytical prediction for the linearized model. (A) Class I. (B) Class II.
After learning, an input I⁺ = (ξ¹ + ξ²) e^{−iω₁ t} at frequency ω₁ will evoke a response

u^+ \approx \chi(\omega_1,\omega_1)\,\xi^1 + \chi(\omega_1,\omega_2)\,\xi^2 \approx \chi(\omega_1,\omega_1)\,\xi^1,  (3.34)

since χ(ω₁, ω₂) ≪ χ(ω₁, ω₁) by design when |ω₁ − ω₂| is much larger than the tuning width. Hence, as illustrated in Figure 5, the system filters out the oscillation patterns learned at an oscillation frequency different from the input frequency.

4 Nonlinear Analysis
Nonlinearity affects the response mainly at large amplitudes, which occur during resonant recall but not (we assume) in learning mode. Hence, in the following analysis, we leave the formulas for J, W, and M unchanged, ignore nonlinearity in response components orthogonal to the pattern subspace, and examine the corrections to the linear response u = χ(ω; ω_μ) δI_∥. We take the input to be along the imprinted pattern ξ^μ: δI⁺_∥ = I ξ^μ e^{−iωt}. We focus on the nonlinearity in g_u, since g_v affects only the local synaptic input, while g_u also affects the long-range input. Equation 2.7 then becomes

(a - i\omega)^2 u = M g_u(u) + (a - i\omega)\,\delta I,  (4.1)

where, for simplicity, we include c as a diagonal element of W, and by g_u(u) we mean a vector with components [g_u(u)]_i = g_u(u_i). Making the ansatz
Figure 5: Categorization using different imprinting frequencies. Plotted are responses of 3 of the 10 excitatory units to various input patterns and frequencies. Patterns ξ¹ e^{−iω₁ t} and ξ² e^{−iω₂ t}, where ξ¹ ⊥ ξ², ω₁ = 41 Hz and ω₂ = 63 Hz, have been imprinted. Matched kernels are used, with Ã_J(ω₁) = 0.5 − 0.025i and Ã_J(ω₂) = 0.5 − 0.43i satisfying the resonance conditions. For the mixed input (fourth column), a = 1/√17 and b = 4/√17.

u = q ξ^μ e^{−iωt} + c.c. + higher-order harmonics, we have

u_j^3 \approx 3 q^2 q^* \xi_j^\mu |\xi_j^\mu|^2\, e^{-i\omega t} + \mathrm{c.c.} + \text{higher-order harmonics},  (4.2)
and analogously for u_j^5. The quantity q is the response amplitude of interest; in the linearized theory, q → χ(ω; ω_μ) I. For the two nonlinearity classes (see equation 2.9), we have, respectively,

M g(u) \approx \Big(1 - 3a|q|^2 \sum_j |\xi_j^\mu|^4\Big) M u,  (4.3)

M g(u) \approx \Big(1 + 3a|q|^2 \sum_j |\xi_j^\mu|^4 - 5b|q|^4 \sum_j |\xi_j^\mu|^6\Big) M u.  (4.4)
Thus, at a given response strength |q|, the imprinting strengths are effectively multiplied by the factors in parentheses. Consequently, class I nonlinearity reduces the response at large amplitude, whereas class II nonlinearity enhances it as long as the quadratic term in |q| is larger than the quartic one.
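These multiplicative factors are easy to visualize. A sketch assuming unit-modulus pattern components (so that Σ_j |ξ_j^μ|⁴ = Σ_j |ξ_j^μ|⁶ = N) and hypothetical nonlinearity coefficients a, b:

```python
import numpy as np

# Unit-modulus pattern components: sum_j |xi_j|^4 = sum_j |xi_j|^6 = N
N = 100
S4 = S6 = float(N)
a, b = 0.002, 0.0004   # illustrative nonlinearity coefficients (assumed)

q = np.linspace(0, 1.5, 301)                          # response amplitude |q|
gain_I  = 1 - 3 * a * q**2 * S4                       # class I factor (eq. 4.3)
gain_II = 1 + 3 * a * q**2 * S4 - 5 * b * q**4 * S6   # class II factor (eq. 4.4)

# Class I always weakens the effective imprinting as |q| grows;
# class II first enhances it, until the quartic term takes over.
print(gain_I[-1] < 1, gain_II.max() > 1)   # -> True True
```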
A consequence for class II is the fact that a system that is very close to resonance (ε, Δ → 0) in the linear regime can become unstable at higher response levels. The system will then jump to a new state in which the (negative) quartic term in equation 4.4 is large enough that stability is restored, as seen in Figures 2 and 3. Substituting equations 4.3 and 4.4 into 4.1 and matching the coefficients of ξ^μ e^{−iωt} on the left and right sides, we obtain, for the two nonlinearity classes, respectively,

\chi^{-1}(\omega;\omega_\mu)\, q + 3aB \sum_j |\xi_j^\mu|^4\, |q|^2 q = I,  (4.5)

\chi^{-1}(\omega;\omega_\mu)\, q - 3aB \sum_j |\xi_j^\mu|^4\, |q|^2 q + 5bB \sum_j |\xi_j^\mu|^6\, |q|^4 q = I,  (4.6)

where B ≡ P(ω; ω_μ)/(a − iω). These equations can be solved for q. It is apparent that, in general, both the phase and the amplitude of q are modified by the nonlinearity.

5 Effects of Synaptic Weight Constraints
Because of the excitatory character of the presynaptic unit, the J_ij and W_ij connections have to be nonnegative, a condition not respected by our learning formula, 3.5, so far. As a remedy, one may add an initial background weight, J̄/N or W̄/N, independent of i and j, to each connection to make it positive,

J_{ij} = \bar J/N + \sum_\mu 2\operatorname{Re}[\tilde A_J\, \xi_i^\mu \xi_j^{\mu*}]/N \ge 0,
W_{ij} = \bar W/N + \sum_\mu 2c\operatorname{Re}\!\left[\frac{\tilde A_W\, \xi_i^\mu \xi_j^{\mu*}}{a - i\omega_\mu}\right]\!\Big/N \ge 0,  (5.1)
or delete all net negative weights, or both. It is clear from equations 5.1 that adding a background weight is like learning an extra pattern ξ⁰ that is uniform and synchronous, with ξ_i⁰ = 1 for all i, with learning kernels Ã_J^(0)(ω₀) and Ã_W^(0)(ω₀) that satisfy 2 Re Ã_J^(0)(ω₀) = J̄ and 2c Re[Ã_W^(0)(ω₀)/(a − iω₀)] = W̄. We assume that these kernels are the same as those with which the patterns ξ^μ are imprinted, up to an overall learning strength factor, Ã_{J,W}^(0)(ω) = g Ã_{J,W}(ω), and that the imprinting frequencies are the same: ω_μ = ω₀. Thus, if the ξ_i^μ are of unit magnitude, in order to guarantee that no J_ij or W_ij are negative, we need g ≥ 1. This strategy can be effective provided that the imprinting of the uniform extra pattern does not lead to violation of any stability condition. Since we have assumed the imprinted patterns ξ^μ (μ > 0) are (roughly) orthogonal to ξ⁰, we can treat the extra pattern independently of the others, and we just have to satisfy the same stability conditions for it that we previously found
for the imprinted patterns. That is, the singularities of χ(ω; ω_μ) have to lie in the lower half of the ω plane, where now χ(ω; ω_μ) (see equation 3.13) has to be computed from a P(ω; ω_μ) that is a factor g larger than before. For g → 1, we get no change in the stability conditions.

Nonnegativity can be more practically achieved by simply deleting the net negative weights. For random patterns, and without background weights J̄ and W̄, this leads to deleting half of the weights J_ij and W_ij obtained from the learning rule, which weakens their effect, quantified by the function P(ω; ω_μ), by a factor of 2. In simulations, we have found that increasing the learning strength by this factor leads to results like those found earlier when negative J_ij and W_ij were permitted. Finally, we remark that negative weights can also be simply implemented by inhibitory interneurons with very short membrane time constants.

6 Summary and Discussion

6.1 Summary. We have presented a model of learning and retrieval for associative memory or input representation in recurrent neural networks that exhibit input-driven oscillatory activities. The model structure is an abstraction of the hippocampus or the olfactory cortex. The learning rule is based on the synaptic plasticity observed experimentally, in particular, long-term potentiation and long-term depression of the synaptic efficacies depending on the relative timing of the pre- and postsynaptic spikes during learning. After learning, the model's retrieval is characterized by its selective strong responses to inputs that resemble the learned patterns or their generalizations. Our work generalizes the outer-product Hebbian learning rule in the Hopfield model to network states characterized by complex state variables, representing both amplitudes and phases. Our work differs from previous modeling in the following respects: (1) We allow the stored patterns to vary in both amplitudes and phases, as well as in oscillation frequency.
(2) We imprint input patterns into the synapses using a generalized Hebbian rule that gives LTP or LTD according to the relative timing of pre- and postsynaptic activity. (3) We explore two qualitatively different functions of the network: one (associative memory) is to classify inputs into distinct categories corresponding to the individual learned examples, and the other is to represent inputs as interpolations between or generalizations of learned examples. The same model structure was used previously, with a conventional Hebbian rule with A_{J,W}(t) ∝ δ(t), by two of the authors in a model for odor recognition/classification and segmentation in the olfactory cortex (Li & Hertz, 2000). The principal new contributions in the current work are (1) linking the model with the recent experimental data on neural plasticity and LTP/LTD and dissecting the role of the functional form of the learning kernel A_{J,W}(t) in determining the selectivity to input patterns and frequencies, (2) an extended analysis of input selectivity and tuning, (3) exploration
of the two different computational functions (associative memory and input representation) of the model, and (4) a detailed analysis of nonlinearity in the model.

6.2 Discussion. By using both amplitude and phase to code information, it is possible either to encode additional information or to increase robustness by redundantly coding the same information coded by the amplitudes. Indeed, hippocampal place cells, which code the spatial location of the animal, fire at different phases of the theta wave depending on the location of the animal in the place fields (O'Keefe & Recce, 1993). In this case, the information encoding is redundant, since the location is in principle already encoded by the firing rates (i.e., oscillation amplitudes) in the neural population. In our model, combined phase and amplitude coding requires matching both the amplitude and phase patterns of the inputs with the learned inputs during recall, making matching more specific. This scheme necessitates learning both excitatory-to-excitatory connections and excitatory-to-inhibitory ones. Thus, in a system of N coupled oscillators, the stored items are coded by 2N variables (N amplitudes and N phases), requiring the specification of 2N² synaptic strengths: N² excitatory-to-excitatory synapses and another N² excitatory-to-inhibitory ones. Omitting phase coding would require learning of only N² synapses, for example, of the excitatory-to-excitatory connections, as in previous models (Hendin, Horn, & Tsodyks, 1998; Wang, Buhmann, & von der Malsburg, 1990). Our model's frequency selectivity adds matching specificity during recall. Furthermore, frequency matching can modulate spike timing reliability, since higher or lower oscillation amplitudes, caused by better or worse frequency matching, should make the firing probabilities of the cells more or less modulated or locked by oscillation phases.
Frequency dependence of spike timing reliability has been observed in cortical pyramidal cells and interneurons (Fellous et al., 2001). In our model, the frequency tuning is a network property imprinted in long-range connections, although frequency tuning as a resonance phenomenon could in principle exist in a single neural oscillator or a local circuit. In our model, both excitatory-to-excitatory and excitatory-to-inhibitory synapses are modifiable. Experimentally, there is as yet little evidence concerning plasticity in pyramidal-to-interneuron synapses. More experimental investigations are needed. In experiments by Bell and colleagues (Bell et al., 1997, 1999), plasticity of the excitatory-to-inhibitory synapses between parallel fibers and medium ganglion cells in the cerebellum-like structure of the electric fish has been observed, although these synapses are not part of a recurrent oscillatory circuit. We explored the constraints on the learning kernel functions A_{J,W}(t) imposed by the requirement of a resonant response. A condition that came up in almost all the variants of the model that we explored was that Ã′_J(ω_μ)
should be positive in order to achieve a strong, narrow resonance. This means, roughly, that for excitatory-excitatory synapses, LTP should dominate LTD in overall strength for spike time differences smaller than 1/ω_μ. Another condition we considered was that the resonant frequency should be the same as the driving frequency ω_μ during learning. We saw that for real patterns and learning only of the excitatory-excitatory connections, this could not be satisfied for general ω_μ. However, with learning of the excitatory-to-inhibitory connections, it could, for a suitable (negative) value of B̃″_W(ω_μ). For complex patterns (see equation 3.20), the imaginary parts of both B̃_J and B̃_W contribute to the shift, so if they have opposite signs (of the correct relative magnitude), the condition can be satisfied independently of ω_μ. These features should be looked for in investigations of plasticity of excitatory-to-inhibitory synapses.

An interesting property we have identified in the model is its ability to subserve two different computational functions: to classify inputs into distinct learned categories and to represent input patterns as interpolations and generalizations of the learned prototype examples. Categorization is appropriate for associative memories and has been applied in our previous model of olfactory cortex (Li & Hertz, 2000). In this context, interpolation between different learned patterns is not desired; individually learned odors should have specific roles. It is more desirable to perceive individual odors within an odor mixture than to perceive an unspecific blend. On the other hand, interpolation is advantageous in some circumstances. Consider an animal learning an internal representation of a region of space. If particular spatial locations are represented as particular imprinted patterns, then locations in between them will be represented as linear combinations of these patterns.
Thus, the network is able to represent a continuum of positions in a natural way. Hippocampal place cells seem to employ such a representation. A network that interpolates can generalize from the learned place fields to represent spatial locations between the learned place fields by superposition of the neural activities of the place cells. Because the place fields are localized, the generalization is conservative (and thus robust). It does not extend beyond the spatial range of the learned locations or to regions between distant, disjoint place fields. We showed that our network can serve one or the other of these two computational functions, depending on the nonlinearity in the neuronal activation functions. Class I leads to the interpolation or input representation operation mode, while class II leads to categorization. The form of g(u) could be subject to modulatory control, permitting the network to switch function when appropriate. The switch could even be accomplished, for a suitable form of g(u), simply by a change in the DC input level, since it is possible to change the effective nature of the nonlinearity near the operating point by shifting the resting point. It seems likely to us that the brain may
employ different kinds and degrees of nonlinearity in different areas or at different times to enhance the versatility of its computations.

We have seen that it is possible to store different classes of patterns at different oscillation frequencies and that the network does not interpolate between patterns stored at different frequencies. This feature can increase the capacity of the network and gives the system the possibility of performing several different forms of input representation or categorization without interference between them. For instance, all place fields could be stored at one frequency, while odor memories could be stored at another, and there would be no cross talk between the two modalities if the frequencies differed by much more than the resonance linewidth. Complex neuromodulatory mechanisms that control the frequency of hippocampal oscillations (Fellous & Sejnowski, 2000) could be involved in implementing this scheme.

In conclusion, we have seen that this rather simple network is endowed with interesting computational properties that are consequences of the combination of its oscillatory dynamics and the spike-timing-dependent synaptic modification rule. Although experiments to date have not clearly uncovered examples of networks in the brain that function in just this fashion, we hope that our findings here will stimulate further investigation, both theoretical and experimental.
References

Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Storing an infinite number of patterns in a spin-glass model of neural networks. Phys. Rev. Lett., 55, 1530.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387(6630), 278–281.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1999). Synaptic plasticity in the mormyrid electrosensory lobe. J. Exper. Biology, 202, 1339–1347.
Bi, G. Q., & Poo, M. M. (1998). Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. J. Neurosci., 18, 10464–10472.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1994). Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proc. Nat. Acad. Sci. USA, 91, 1148–1152.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fellous, J. M., Houweling, A. R., Modi, R. H., Rao, R. P. N., Tiesinga, P. H. E., & Sejnowski, T. J. (2001). The frequency dependence of spike timing reliability
in cortical pyramidal cells and interneurons. Journal of Neurophysiology, 85, 1782–1787.
Fellous, J. M., & Sejnowski, T. J. (2000). Cholinergic induction of oscillations in the hippocampal slice in the slow (0.5–2 Hz), theta (5–12 Hz), and gamma (35–70 Hz) bands. Hippocampus, 10(2), 187–197.
Hasselmo, M. E. (1993). Acetylcholine and learning in a cortical associative memory. Neural Computation, 5, 32–44.
Hasselmo, M. E. (1999). Neuromodulation: Acetylcholine and memory consolidation. Trends in Cognitive Sciences, 3, 351.
Hendin, O., Horn, D., & Tsodyks, M. V. (1998). Associative memory and segmentation in an oscillatory neural model of the olfactory bulb. J. Comput. Neurosci., 5(2), 157–169.
Hertz, J. A., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
O'Keefe, J., & Recce, M. L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Physical Review E, 59, 4498–4514.
Li, Z., & Dayan, P. (1999). Computational differences between asymmetric and symmetric nets. Network: Computation in Neural Systems, 10, 59.
Li, Z., & Hertz, J. (2000). Odor recognition and segmentation by a model olfactory bulb and cortex. Network: Computation in Neural Systems, 11, 83–102.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–212.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213.
Menschik, E. D., & Finkel, L. H. (1999). Cholinergic neuromodulation and Alzheimer's disease: From single cells to network simulations. Prog. Brain Res., 121, 19–45.
Rao, R. P., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comput., 13(10), 2221–2237.
Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3(9), 919–926.
Tiesinga, P. H., Fellous, J. M., José, J. V., & Sejnowski, T. J. (2001). Computational model of carbachol-induced delta, theta, and gamma oscillations in the hippocampus. Hippocampus, 11(3), 251–274.
Wang, D. L., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Comput., 2, 94–106.

Received November 14, 2001; accepted April 4, 2002.
LETTER
Communicated by John Platt
A New Discriminative Kernel from Probabilistic Models

Koji Tsuda
[email protected]
AIST Computational Biology Research Center, Koto-ku, Tokyo, 135-0064, Japan, and Fraunhofer FIRST, 12489 Berlin, Germany

Motoaki Kawanabe
[email protected]
Fraunhofer FIRST, 12489 Berlin, Germany

Gunnar Rätsch
[email protected]
Australian National University, Research School for Information Sciences and Engineering, Canberra, ACT 0200, Australia, and Fraunhofer FIRST, 12489 Berlin, Germany

Sören Sonnenburg
[email protected]
Fraunhofer FIRST, 12489 Berlin, Germany

Klaus-Robert Müller
[email protected]
Fraunhofer FIRST, 12489 Berlin, Germany, and University of Potsdam, 14469 Potsdam, Germany

Recently, Jaakkola and Haussler (1999) proposed a method for constructing kernel functions from probabilistic models. Their so-called Fisher kernel has been combined with discriminative classifiers such as support vector machines and applied successfully in, for example, DNA and protein analysis. Whereas the Fisher kernel is calculated from the marginal log-likelihood, we propose the TOP kernel derived from tangent vectors of posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing the TOP kernel. In experiments, our new discriminative TOP kernel compares favorably to the Fisher kernel.

1 Introduction
Neural Computation 14, 2397–2414 (2002) © 2002 Massachusetts Institute of Technology

In classification tasks, the purpose of learning is to predict the output y ∈ {-1, +1} of some unknown system given the input x ∈ X, based on the
2398
Koji Tsuda et al.
training samples {(x_i, y_i)}_{i=1}^n. A feature extractor is a vector-valued function f: X → R^D designed to convert the representation of the data without losing the information necessary for discrimination. When X is a vector space like R^d, many feature extraction methods have been proposed (Fukunaga, 1990). However, they are typically not applicable when X is a set of sequences of symbols and does not have the structure of a vector space, as in DNA or protein analysis (Durbin, Eddy, Krogh, & Mitchison, 1998). In such cases, the similarity (or proximity) between two samples plays an important role (Cox & Ferry, 1993; Graepel, Herbrich, Bollmann-Sdorra, & Obermayer, 1999; Hofmann & Buhmann, 1997). The simplest method is to prepare several prototype samples and compose a feature vector from the similarities to these samples (Graepel et al., 1999). Alternatively, in multidimensional scaling (MDS; Cox & Ferry, 1993), the samples are mapped such that the given dissimilarity is approximated by the Euclidean distance in feature space. However, similarities are often not available, and defining a "good" similarity measure in terms of the classification task in feature space is therefore difficult and requires a fair amount of prior knowledge. Recently, the Fisher kernel (FK; Jaakkola & Haussler, 1999) was proposed, which allows measuring distances between symbol sequences by computing features from probabilistic models p(x | θ). First, a parameter estimate θ̂ is obtained from the training examples. Then the tangent vector of the log marginal likelihood log p(x | θ̂) is used as a feature vector. The Fisher kernel refers to the inner product in this feature space, but the method is effectively a feature extractor (also since the features are computed explicitly).
The FK can be combined with discriminative classifiers such as support vector machines (SVMs), and it has achieved excellent classification results in several fields, for example, in DNA and protein analysis (Jaakkola & Haussler, 1999; Jaakkola, Diekhans, & Haussler, 2000). Empirically, it is reported that the FK-SVM system often outperforms the classification performance of a plug-in estimate, that is, the pure probabilistic approach.¹ Note that the FK is only one possible member of the family of feature extractors f_θ̂: X → R^D that can be derived from a probabilistic model. We call this family model-dependent feature extractors because different probabilistic models lead to different feature vectors. Exploring this family is a very important and interesting subject.

Since model-dependent feature extractors are newly developed, performance measures for them have not yet been established. In this article, we therefore propose two performance measures. Then we define a new kernel (or, equivalently, a feature extractor) derived from the tangent vector of posterior log-odds, which we denote as the TOP kernel.

¹ In classification by plug-in estimate, x is classified by thresholding the posterior probability: ŷ = sign(P(y = +1 | x, θ̂) - 1/2) (Devroye, Györfi, & Lugosi, 1996).

We will analyze
the performance of the TOP kernel in terms of our performance measures. Finally, the TOP kernel is compared, favorably, to the FK in experiments with artificial data and protein data.

2 Performance Measures
To begin, let us describe the notation. Let x ∈ X be the input point and y ∈ {-1, +1} be the class label. X may be a finite set or an infinite set like R^d. Let us assume that we know the parametric model of the joint probability p(x, y | θ), where θ ∈ R^p is the parameter vector. Assume that the model p(x, y | θ) is regular (Murata, Yoshizawa, & Amari, 1994) and contains the true distribution. Then the true parameter θ* is uniquely determined. Let θ̂ be a consistent estimator (Devroye et al., 1996) of θ*, obtained from n training examples drawn independent and identically distributed (i.i.d.) from p(x, y | θ*). Let ∂_{θ_i} f = ∂f/∂θ_i, ∇_θ f = (∂_{θ_1} f, …, ∂_{θ_p} f)^T, and let ∇²_θ f denote the p × p Hessian matrix whose (i, j)th element is ∂²f/(∂θ_i ∂θ_j).

As the FK is commonly used in combination with linear classifiers such as SVMs, one reasonable performance measure is the classification error of a linear classifier w^T f_θ̂(x) + b in the feature space R^D, where w ∈ R^D and b ∈ R. Usually w and b are determined by a learning algorithm, so the optimal feature extractor differs with regard to each learning algorithm. To cancel out this ambiguity and make a theoretical analysis possible, we assume that the optimal learning algorithm is used. When w and b are optimally chosen, the classification error is

    R(f_θ̂) = min_{w ∈ S, b ∈ R} E_{x,y} Ψ[-y(w^T f_θ̂(x) + b)],    (2.1)

where S = {w | ‖w‖ = 1, w ∈ R^D}, Ψ[a] is the step function, which is 1 if a > 0 and 0 otherwise, and E_{x,y} denotes the expectation with respect to the true distribution p(x, y | θ*). R(f_θ̂) is at least as large as the Bayes error L* (Fukunaga, 1990), and R(f_θ̂) = L* only if the linear classifier implements the same decision rule as the Bayes optimal rule. From a geometrical point of view, R(f_θ̂) - L* describes how linear the optimal boundary is in the feature space.

Now that we have a performance measure, it is natural to design a feature extractor that minimizes R(f_θ̂). This task is difficult because of the nondifferentiable function Ψ. So we construct another measure, which upper-bounds R(f_θ̂): we consider the estimation error of the posterior probability by a logistic regressor F(w^T f_θ̂(x) + b), with F(t) = 1/(1 + exp(-t)):

    D(f_θ̂) = min_{w ∈ R^D, b ∈ R} E_x |F(w^T f_θ̂(x) + b) - P(y = +1 | x, θ*)|.    (2.2)
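To make the two measures concrete, here is a small Monte Carlo sketch (our own construction, not from the article) that estimates the expectations in equations 2.1 and 2.2 for a fixed (w, b), using a one-dimensional identity feature and a toy two-gaussian task whose true posterior is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_posterior(x):
    # P(y = +1 | x) for unit-variance gaussians at +1 and -1, equal priors.
    return 1.0 / (1.0 + np.exp(-2.0 * x))

# Sample (x, y) from the true joint distribution.
n = 100_000
y = rng.choice([-1, 1], size=n)
x = rng.normal(loc=y.astype(float), scale=1.0)

def F(t):
    return 1.0 / (1.0 + np.exp(-t))

def R_hat(f, w, b):
    # Monte Carlo version of the expectation in eq. 2.1 (step-function loss).
    margin = y * (w * f(x) + b)
    return np.mean(margin < 0)

def D_hat(f, w, b):
    # Monte Carlo version of the expectation in eq. 2.2.
    return np.mean(np.abs(F(w * f(x) + b) - true_posterior(x)))

f_id = lambda z: z          # identity feature extractor
r = R_hat(f_id, 2.0, 0.0)   # w = 2, b = 0 realizes the Bayes rule here
d = D_hat(f_id, 2.0, 0.0)   # and reproduces the true posterior exactly
```

Since F(2x) equals the true posterior of this toy task, the estimated D is zero and the estimated R is near the Bayes error; the definitions minimize over (w, b), which this sketch does not attempt.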
The relationship between D(f_θ̂) and R(f_θ̂) is illustrated as follows. Let L̂ be the classification error rate of an arbitrary posterior probability estimator
P̂(y = +1 | x). The following inequality is known (Devroye et al., 1996):

    L̂ - L* ≤ 2 E_x |P̂(y = +1 | x) - P(y = +1 | x, θ*)|.    (2.3)

When we use P̂(y = +1 | x) := F(w^T f_θ̂(x) + b), this inequality leads to the following relationship between the two measures:

    R(f_θ̂) - L* ≤ 2 D(f_θ̂).    (2.4)

By this bound, it is useful to derive a new feature extractor that minimizes D(f_θ̂), as will be done in section 4.

3 The Fisher Kernel
The Fisher kernel (FK) is defined² as K(x, x′) = s(x, θ̂)^T Z(θ̂)^{-1} s(x′, θ̂), where s is the Fisher score,

    s(x, θ̂) = (∂_{θ_1} log p(x | θ̂), …, ∂_{θ_p} log p(x | θ̂))^T = ∇_θ log p(x | θ̂),

and Z is the Fisher information matrix, Z(θ) = E_x[s(x, θ) s(x, θ)^T | θ]. The theoretical foundation of the FK is described in the following theorem (Jaakkola & Haussler, 1999): "A kernel classifier employing the Fisher kernel derived from a model that contains the label as a latent variable is, asymptotically, at least as good a classifier as the MAP labeling based on the model" (p. 491). The theorem says that the FK can perform at least as well as the plug-in estimate if the parameters of the linear classifier are properly determined (cf. appendix A of Jaakkola & Haussler, 1999). With our performance measure, this theorem can be stated more concisely: R(f_θ̂) is bounded by the classification error of the plug-in estimate, R_p(θ̂):

    R(f_θ̂) ≤ R_p(θ̂) = E_{x,y} Ψ[-y(P(y = +1 | x, θ̂) - 1/2)].    (3.1)

Note that the classification rule constructed by the plug-in estimate P(y = +1 | x, θ̂) can also be realized by a linear classifier in feature space. Property 3.1 is important since it guarantees that the FK performs better when the optimal w and b are available. In the next section, we present a new kernel that also satisfies property 3.1 and has a more appealing theoretical property as well.

² In practice, some variants of the FK are used. For example, if the derivative of each class distribution, not the marginal, is taken, the feature vector of the FK is quite similar to that of our kernel. However, these variants should be deliberately discriminated from the FK in theoretical discussions. Throughout this article, including the experiments, we adopt the original definition of the Fisher kernel from Jaakkola and Haussler (1999).
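As a toy illustration (our own, not from the article), the Fisher score and the FK for a univariate gaussian model p(x | θ) with θ = (μ, σ²) can be sketched as follows, with the Fisher information matrix Z estimated empirically as E[s s^T]:

```python
import numpy as np

def fisher_score(x, mu, var):
    # Score s(x, theta) = grad_theta log p(x | theta) for N(mu, var):
    #   d/dmu  log p = (x - mu) / var
    #   d/dvar log p = -1/(2 var) + (x - mu)^2 / (2 var^2)
    return np.array([(x - mu) / var,
                     -0.5 / var + 0.5 * (x - mu) ** 2 / var ** 2])

def fisher_kernel(x1, x2, mu, var, Z_inv):
    # K(x, x') = s(x)^T Z^{-1} s(x')
    s1 = fisher_score(x1, mu, var)
    s2 = fisher_score(x2, mu, var)
    return s1 @ Z_inv @ s2

# Fit theta by maximum likelihood and estimate Z from the sample.
rng = np.random.default_rng(0)
data = rng.normal(1.0, 2.0, size=1000)
mu_hat, var_hat = data.mean(), data.var()
scores = np.stack([fisher_score(xi, mu_hat, var_hat) for xi in data])
Z_hat = scores.T @ scores / len(data)   # empirical E[s s^T]
Z_inv = np.linalg.inv(Z_hat)
k = fisher_kernel(data[0], data[1], mu_hat, var_hat, Z_inv)
```

At the maximum likelihood estimate, the empirical mean of the score vanishes, which is a useful sanity check for any model plugged into this construction.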
4 The TOP Kernel

4.1 Definition. Now we proceed to propose a new kernel. Our aim is to obtain a feature extractor that achieves small D(f_θ̂). When a feature extractor f_θ̂(x) satisfies³

    w^T f_θ̂(x) + b = F^{-1}(P(y = +1 | x, θ*)) for all x ∈ X,    (4.1)

with certain values of w and b, we have D(f_θ̂) = 0. However, since the true parameter θ* is unknown, all we can do is construct f_θ̂ such that equation 4.1 is approximately satisfied. Let us define⁴

    v(x, θ) = F^{-1}(P(y = +1 | x, θ))
            = log P(y = +1 | x, θ) - log P(y = -1 | x, θ),    (4.2)

which is called the posterior log-odds of a probabilistic model (Devroye et al., 1996). By Taylor expansion around the estimate θ̂ up to the first order, we can approximate v(x, θ*) as

    v(x, θ*) ≈ v(x, θ̂) + Σ_{i=1}^p ∂_{θ_i} v(x, θ̂)(θ_i* - θ̂_i).    (4.3)

Thus, by setting

    f_θ̂(x) := (v(x, θ̂), ∂_{θ_1} v(x, θ̂), …, ∂_{θ_p} v(x, θ̂))^T    (4.4)

and

    w := w* = (1, θ_1* - θ̂_1, …, θ_p* - θ̂_p)^T, b = 0,    (4.5)

equation 4.1 is approximately satisfied. Since a tangent vector of the posterior log-odds constitutes the main part of the feature vector, we call the inner product of the two feature vectors the TOP kernel:

    K(x, x′) = f_θ̂(x)^T f_θ̂(x′).    (4.6)

It is easy to verify that the TOP kernel satisfies equation 3.1, because we can construct the same decision rule as the plug-in estimate by using the first element only (w = (1, 0, …, 0)^T, b = 0).

³ Notice that F^{-1}(t) = log t - log(1 - t).
⁴ One can easily derive TOP kernels from higher-order Taylor expansions. However, we deal only with the first-order expansion here, because higher-order expansions would induce extremely high-dimensional feature vectors in practical cases.
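To make the construction concrete, here is a sketch (our own toy model, not the article's code) of equations 4.2 through 4.6 for a one-dimensional problem with two unit-variance gaussian classes and equal priors, the tangent vector being computed by central finite differences:

```python
import numpy as np

def log_odds(x, theta):
    # v(x, theta), eq. 4.2, for theta = (mu_plus, mu_minus):
    # unit-variance gaussian class-conditionals, equal priors,
    # so the log-normalizers cancel in the difference.
    mu_p, mu_m = theta
    return -0.5 * (x - mu_p) ** 2 + 0.5 * (x - mu_m) ** 2

def top_features(x, theta, eps=1e-5):
    # f(x) = (v(x, theta), dv/dtheta_1, ..., dv/dtheta_p), eq. 4.4,
    # with the derivatives taken numerically for simplicity.
    theta = np.asarray(theta, dtype=float)
    v = log_odds(x, theta)
    grads = []
    for i in range(len(theta)):
        t_hi, t_lo = theta.copy(), theta.copy()
        t_hi[i] += eps
        t_lo[i] -= eps
        grads.append((log_odds(x, t_hi) - log_odds(x, t_lo)) / (2 * eps))
    return np.concatenate([[v], grads])

def top_kernel(x1, x2, theta):
    # K(x, x') = f(x)^T f(x'), eq. 4.6
    return top_features(x1, theta) @ top_features(x2, theta)
```

With w = (1, 0, …, 0)^T and b = 0, the sign of w^T f(x) is the sign of the estimated log-odds, that is, exactly the plug-in decision rule, which is the property claimed above.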
4.2 A Theoretical Analysis. In this section, we compare the TOP kernel with the plug-in estimate in terms of our performance measures. In what follows, we assume that 0 < P(y = +1 | x, θ) < 1 to prevent |v(x, θ)| from going to infinity. It is also assumed that ∇_θ P(y = +1 | x, θ) and ∇²_θ P(y = +1 | x, θ) are bounded. Substituting the plug-in estimate (denoted by the subscript p) into D, we have

    D_p(θ̂) = E_x |P(y = +1 | x, θ̂) - P(y = +1 | x, θ*)|.

Define Δθ = θ̂ - θ*. By Taylor expansion around θ*, we have

    D_p(θ̂) = E_x |(Δθ)^T ∇_θ P(y = +1 | x, θ*) + (1/2)(Δθ)^T ∇²_θ P(y = +1 | x, θ′)(Δθ)| = O(‖Δθ‖),    (4.7)

where θ′ = θ* + cΔθ (0 ≤ c ≤ 1). When the TOP kernel is used,

    D(f_θ̂) ≤ E_x |F((w*)^T f_θ̂(x)) - P(y = +1 | x, θ*)|,    (4.8)

where w* is defined as in equation 4.5. Since F is Lipschitz continuous, there is a finite positive constant M such that |F(a) - F(b)| ≤ M|a - b|. Thus,

    D(f_θ̂) ≤ M E_x |(w*)^T f_θ̂(x) - F^{-1}(P(y = +1 | x, θ*))|.    (4.9)

Since (w*)^T f_θ̂(x) is the first-order Taylor expansion of F^{-1}(P(y = +1 | x, θ*)) (equation 4.3), the first-order terms in Δθ are excluded from the right side of equation 4.9; thus, D(f_θ̂) = O(‖Δθ‖²).

Since both the plug-in estimate and the TOP kernel depend on the parameter estimate θ̂, the errors D_p(θ̂) and D(f_θ̂) become smaller as ‖Δθ‖ decreases. However, the rate of convergence of the TOP kernel is much faster than that of the plug-in estimate if w and b are optimally chosen. This result is closely related to large sample performance. Assuming that θ̂ is an n^{1/2}-consistent estimator with asymptotic normality (e.g., the maximum likelihood estimator), we have ‖Δθ‖ = O_p(n^{-1/2}) (Murata et al., 1994), where O_p denotes stochastic order (Barndorff-Nielsen & Cox, 1989). We can directly derive the convergence orders as D_p(θ̂) = O_p(n^{-1/2}) and D(f_θ̂) = O_p(n^{-1}). By relation 2.4, it follows that R_p(θ̂) - L* = O_p(n^{-1/2}) and R(f_θ̂) - L* = O_p(n^{-1}) (for a detailed discussion of the convergence orders of the classification error, see Devroye et al., 1996). Therefore, the TOP kernel has a much better convergence rate in R(f_θ̂). However, we must note that this fast rate is
possible only when the optimal linear classifier is combined with the TOP kernel. Since nonoptimal linear classifiers typically have the rate O_p(n^{-1/2}) (Devroye et al., 1996), the overall rate is dominated by the slower rate and turns out to be O_p(n^{-1/2}). But this theoretical analysis is still meaningful, because it shows the existence of a very efficient linear boundary in the TOP feature space. This result encourages practical efforts to improve linear boundaries by engineering loss functions and regularization terms with cross-validation, bootstrapping, or other model selection criteria (Devroye et al., 1996).

4.3 Exponential Family: A Special Case. When the distributions of the two classes belong to the exponential family, the TOP kernel can achieve an even better result than shown above. Distributions of the exponential family can be written as q(x, η) = exp(η^T t(x) + ψ(η)), where t(x) is a vector-valued function called the sufficient statistics and ψ(η) is a normalization factor (Geiger & Meek, 1998). Let α denote the parameter for the class prior probability of the positive model, P(y = +1). Then the probabilistic model is described as

    p(x, y = +1 | θ) = α q_{+1}(x, η_{+1}),
    p(x, y = -1 | θ) = (1 - α) q_{-1}(x, η_{-1}),

where θ = {α, η_{+1}, η_{-1}}. The posterior log-odds reads

    v(x, θ) = η_{+1}^T t_{+1}(x) + ψ_{+1}(η_{+1}) - η_{-1}^T t_{-1}(x) - ψ_{-1}(η_{-1}) + log(α / (1 - α)).    (4.10)

The TOP feature vector is described as

    f_θ̂(x) = (v(x, θ̂), ∂_α v(x, θ̂), ∇_{η_{+1}} v(x, θ̂)^T, ∇_{η_{-1}} v(x, θ̂)^T)^T,

where ∇_{η_s} v(x, θ̂) = s(t_s(x) + ∇_{η_s} ψ_s(η̂_s)) for s ∈ {+1, -1}. So, when

    w* = (1, 0, (η*_{+1} - η̂_{+1})^T, (η*_{-1} - η̂_{-1})^T)^T

and b is properly set, the true log-odds F^{-1}(P(y = +1 | x, θ*)) can be constructed as a linear function in the feature space, satisfying equation 4.1. Thus, D(f_θ̂) = 0 and R(f_θ̂) = L*. Furthermore, since each feature is represented as a linear function of the sufficient statistics t_{+1}(x) and t_{-1}(x), one can construct an equivalent feature space as (t_{+1}(x)^T, t_{-1}(x)^T)^T without knowing θ̂.⁶ This result has some importance because all graphical models without hidden states can be represented as members of the exponential family, for example, Markov models (Geiger & Meek, 1998).

⁶ It is well known that the Bayes optimal boundary of exponential family distributions forms a hyperplane in the space of sufficient statistics (Devroye et al., 1996).
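As a check of this special case (our own toy instance, not from the article), the posterior log-odds of two univariate gaussians can be written in the exponential-family form of equation 4.10 with sufficient statistics t(x) = (x², x):

```python
import numpy as np

def nat_params(mu, sigma):
    # Natural parameters eta and log-normalizer psi of N(mu, sigma^2)
    # written as q(x, eta) = exp(eta^T t(x) + psi(eta)), t(x) = (x^2, x).
    eta = np.array([-1.0 / (2 * sigma**2), mu / sigma**2])
    psi = -mu**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2)
    return eta, psi

def log_odds_expfam(x, mu_p, sig_p, mu_m, sig_m, alpha):
    # Posterior log-odds v(x, theta) in the form of eq. 4.10.
    t = np.array([x**2, x])
    eta_p, psi_p = nat_params(mu_p, sig_p)
    eta_m, psi_m = nat_params(mu_m, sig_m)
    return (eta_p @ t + psi_p) - (eta_m @ t + psi_m) + np.log(alpha / (1 - alpha))
```

Because t(x) enters only linearly, v(x, θ) is a linear function of the sufficient statistics, so the Bayes boundary is a hyperplane in sufficient-statistics space (a conic in x itself), which is the point made in the text.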
5 Experiment with Artificial Data

In this section, we present a classification experiment with artificial data. Here, the probabilistic model of each class is a mixture of two gaussians,

    q(x, ξ) = β g(x, η_1) + (1 - β) g(x, η_2),    (5.1)

where ξ = [β, η_1, η_2] and g is the natural parameter representation of a gaussian distribution:

    g(x, η) = exp( η_1 ‖x‖² + Σ_{i=1}^d η_{i+1} x_i + Σ_{i=1}^d η_{i+1}² / (4η_1) - (d/2) log(-π/η_1) ).    (5.2)

Notice that the natural parameter η corresponds to the conventional parameters (mean μ and standard deviation σ) as η_1 = -1/(2σ²) and η_i = μ_{i-1}/σ² (i ≥ 2). The true parameters of the two classes are defined as

    μ_1 = μ_2 = (0, …, 0), σ_1 = 1, σ_2 = 1/2, β = 1/2    (first class)
    μ_1 = μ_2 = (1/10, …, 1/10), σ_1 = 4/5, σ_2 = 2/5, β = 1/2    (second class).
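The correspondence between the natural and the conventional parameters can be sketched as follows (our own helper functions, isotropic case), checking equation 5.2 against the usual gaussian log-density:

```python
import numpy as np

def to_natural(mu, sigma):
    # eta_1 = -1/(2 sigma^2); eta_{i+1} = mu_i / sigma^2
    return np.concatenate([[-1.0 / (2 * sigma**2)], mu / sigma**2])

def log_g(x, eta):
    # Log of the natural-parameter gaussian g(x, eta), eq. 5.2.
    d = len(x)
    e1, rest = eta[0], eta[1:]
    return (e1 * (x @ x) + rest @ x
            + rest @ rest / (4 * e1)
            - (d / 2) * np.log(-np.pi / e1))
```

Substituting back μ = -η_{2:}/(2η_1) and σ² = -1/(2η_1) recovers the standard density, so log_g must agree with -d/2 log(2πσ²) - ‖x - μ‖²/(2σ²).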
Also, the true class prior probability is defined as α = 1/2. The derivatives in the TOP and Fisher kernels are calculated with respect to the parameters ξ_{+1}, ξ_{-1} of both classes and the class prior probability α as well. In this experiment, the dimensionality of the input space is set to 100. The number of training samples is 30 in the first and 240 in the second experiment. The performance is measured by the error rate on a test set with 1000 samples.

We compared the TOP kernel with the FK. As the subsequent classifier, an SVM is chosen, which has a regularization parameter C. As candidate values of C, 10 equally spaced points on the log scale are taken from [10^{-6}, 10^{-1}]. The value of C is chosen from the candidate values according to the error rate on a validation set (100 samples).⁷ The parameters ξ_{+1} and ξ_{-1} are estimated by the expectation-maximization algorithm (Bishop, 1995), and α is estimated by the ratio of training samples of the two classes.

⁷ See Rätsch, Onoda, and Müller (2001) for details on how model selection is conducted in this type of experiment.

The test errors over 30 different samplings of training, validation, and test sets are shown in Figure 1. For reference, we also show the test errors of the Bayes optimal classifier and the test errors of the plug-in estimate. To illustrate the difference between FK and TOP in more detail, Figure 2 shows
[Figure 1 appears here: two bar plots of test error rates, for n = 30 (top) and n = 240 (bottom), each with entries OPT, P, FK, TOP, FK*, TOP*, and W*.]
Figure 1: The error rates of various classifiers in the artificial data experiment. The top and bottom figures correspond to the results when n = 30 and n = 240, respectively. OPT: test errors of the Bayes optimal classifier; P: probabilistic models only; FK: the Fisher kernel with SVM; TOP: the TOP kernel with SVM; FK*: the Fisher kernel with the nearly optimal linear boundary constructed from a sufficient number of additional samples; TOP*: the TOP kernel with the nearly optimal linear boundary; W*: the TOP kernel with the weight vector w*.
[Figure 2 appears here: four scatter plots of paired test errors, FK versus TOP (top) and FK* versus TOP* (bottom), for n = 30 (left) and n = 240 (right).]
Figure 2: Comparison of error rates of SVMs with the Fisher kernel and the TOP kernel in the artificial data experiment. Every point corresponds to one of the 30 different training, validation, and test sets. The upper two figures correspond to FK and TOP in Figure 1, and the lower two correspond to FK* and TOP*.
comparative plots of test errors. Clearly, the TOP kernel has the smaller error rates in many cases. In order to investigate whether the differences in error rates are significant, two kinds of statistical tests are applied (see Table 1). One is the t-test (T), which compares the averages of error rates under the assumption that both distributions are gaussian. The other is the Wilcoxon signed-rank test (WX), which uses the ranks of the differences of error rates. This is a nonparametric test, so it can be applied to any distribution. Because the distributions of error rates seem to be skewed (see Figure 1), we favor this test over the t-test. When n = 30, both tests judged that the difference between the average error rates of TOP and FK is significant, whereas when n = 240, only the Wilcoxon signed-rank test judged that the difference is significant. So it is observed that the TOP kernel is especially effective in small sample cases.

So far, we have shown the performance of the combination of a feature extractor (FK or TOP) and an SVM. In order to focus on feature extraction performance itself, it is desirable to observe the minimum achievable error by
Table 1: P Values of Statistical Tests in the Artificial Data Experiment.

    Methods     Test   n = 30             n = 240
    TOP, FK     T      0.0090**           0.12
                WX     2.2 × 10^{-5}**    0.031*
    TOP*, FK*   T      1.97 × 10^{-5}**   0.024*
                WX     2.1 × 10^{-6}**    0.0082**

Notes: The t-test is denoted as T and the Wilcoxon signed-rank test as WX. *p < 0.05. **p < 0.01.
the optimal linear boundary. However, since it is difficult to derive the optimal boundaries analytically, a nearly optimal one is constructed by means of 3000 additional training samples, which are used not in constructing features but in determining linear boundaries. As the learning algorithm in feature space, we used linear discriminant analysis (Fukunaga, 1990) without regularization. In Figures 1 and 2, these nearly optimal results are shown as FK* and TOP*. For reference, we also show the results of TOP when the Taylor coefficient w* (see equation 4.5) is used as the weight vector (W* in Figure 1).⁸ As seen in the test results (see Table 1), the difference between the average error rates of TOP* and FK* is significant in both cases (n = 30 and 240). Thus, we are led to the conclusion that the TOP kernel extracts better discriminative features, at least in this experiment.

6 Experiments on Protein Data
In order to illustrate that the TOP kernel also works well for real-world problems, we present results on protein classification. The protein sequence data are obtained from the Superfamily web site, which provides sequence files with different degrees of redundancy filtering.⁹ We used the one with 10% redundancy filtering. Here, 4541 sequences are hierarchically labeled into 7 classes, 558 folds, 845 superfamilies, and 1343 families according to the SCOP (1.53) scheme. In our experiment, we take the top-level classes as the learning target. The numbers of sequences in the classes are 791, 1277, 1015, 915, 84, 76, and 383. We use only the first four classes, and six two-class problems are generated from all pairs among the
⁸ When n = 30, the error rates of W* are often larger than those of TOP* and have very large variance, because the Taylor expansion, equation 4.3, has large higher-order terms when the number of samples is small.
⁹ Available on-line at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY/.
four classes.¹⁰ The fifth and sixth classes are not used because the number of examples is too small. Also, the seventh class is omitted because it is quite different from the others and too easy to classify. In each two-class problem, the examples are randomly divided into a 25% training set, a 25% validation set, and a 50% test set. The validation set is used for model selection.

As a probabilistic model for protein sequences, we train hidden Markov models (HMMs; Durbin et al., 1998) with fully connected states¹¹ by the Baum-Welch algorithm.¹² To construct the FK and TOP kernels, the derivatives with respect to all parameters of the HMMs from both classes are included. The derivative with respect to the class prior probability is included as well. Let q(x, ξ) be the probability density function of an HMM. Then the marginal distribution is written as p(x | θ̂) = α̂ q(x, ξ̂_{+1}) + (1 - α̂) q(x, ξ̂_{-1}), where α is a parameter corresponding to the class prior. The feature vector of the FK consists of the following:

    ∇_{ξ_s} log p(x | θ̂) = P(y = s | x, θ̂) ∇_{ξ_s} log q(x, ξ̂_s),  s ∈ {-1, +1},    (6.1)

    ∂_α log p(x | θ̂) = (1/α̂) P(y = +1 | x, θ̂) - (1/(1 - α̂)) P(y = -1 | x, θ̂),    (6.2)
while the feature vector of TOP includes ∇_{ξ_s} v(x, θ̂) = s ∇_{ξ_s} log q(x, ξ̂_s) for s = ±1.¹³ We obtained ξ̂_{+1} and ξ̂_{-1} from the training examples of the respective classes and set α̂ = 1/2.

In the definition of the TOP kernel, equation 4.6, we did not include any normalization of the feature vectors. However, in practical situations, it is effective to normalize features to improve classification performance. Although it is difficult to determine a fair way of normalization, we chose a simple one: each feature of the TOP and the FK is normalized to have mean 0 and variance 1. Both the TOP kernel and the FK are combined with SVMs using a bias term. For the calculation of feature vectors from HMMs, see the appendix.

When classifying with HMMs, one observes the difference of the log-likelihoods for the two classes and discriminates by thresholding at an ap-

¹⁰ When the TOP kernel is applied for separating two classes, the class-conditional models of both classes need to be known. In contrast, the FK is often used in filtering problems where the model of only one class is known (Jaakkola & Haussler, 1999).
¹¹ Several HMM models have been engineered for protein classification (Durbin et al., 1998). However, we do not use such HMMs because the main purpose of the experiment is to compare the FK and TOP. Furthermore, the performances achieved with plain HMM models are lower than the ones presented here using discriminative training, which is well in line with results by Jaakkola, Diekhans, et al. (2000).
¹² We mainly followed the implementation of Durbin et al. (1998). For implementation details, see Sonnenburg (2001).
¹³ Note that ∂_α v(x, θ̂) is a constant, which does not depend on x, so it is not included in the feature vector.
A New Discriminative Kernel from Probabilistic Models
2409
Table 2: P Values of Statistical Tests in the Protein Classification Experiments.

    Methods   Test  1-2               1-3               1-4      2-3                2-4        3-4
    P, FK     T     0.95              0.14              0.78     0.0032**           0.79       0.059
              WX    0.85              0.041*            0.24     0.0040**           0.80       0.035*
    P, TOP    T     0.015*            1.7 × 10^{-8}**   0.11     3.0 × 10^{-12}**   0.12       5.3 × 10^{-5}**
              WX    4.3 × 10^{-4}**   6.1 × 10^{-5}**   0.030*   6.1 × 10^{-5}**    0.026*     3.1 × 10^{-4}**
    FK, TOP   T     0.0093**          2.2 × 10^{-4}**   0.21     2.6 × 10^{-8}**    0.079      0.0031**
              WX    8.5 × 10^{-4}**   6.1 × 10^{-5}**   0.048*   6.1 × 10^{-5}**    0.0034**   1.8 × 10^{-4}**

Notes: The t-test is denoted as T and the Wilcoxon signed-rank test as WX. *p < 0.05. **p < 0.01.
propriate value. Theoretically, this threshold should be determined by the (true) class prior probability, which is typically unavailable. Furthermore, the estimation of the prior probability from training data often leads to poor results (Durbin et al., 1998). To avoid this problem, the threshold is determined such that the false-positive rate and the false-negative rate are equal on the test set. This threshold is determined in the same way for FK-SVMs and TOP-SVMs.

The hybrid HMM-TOP-SVM system has several model parameters: the number of HMM states, the pseudo-count value (Durbin et al., 1998), and the regularization parameter C of the SVM. We determine these parameters as follows. First, the number of states and the pseudo-count value are determined such that the error of the HMM on the validation set (i.e., the validation error) is minimized. Based on the chosen HMM model, the parameter C is determined such that the validation error of the TOP-SVM is minimized. Here, the number of states and the pseudo-count value are chosen from {3, 5, 7, 10, 15, 20, 30, 40, 60} and {10^{-10}, 10^{-7}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}}, respectively. For C, 15 equally spaced points on the log scale are taken from [10^{-4}, 10^{1}]. Note that the model selection is performed in the same manner for the FK as well.

The error rates over 15 different training, validation, and test divisions are shown in Figures 3 and 4. The results of the statistical tests are shown in Table 2 as well. In several settings (i.e., 1-3, 2-3, 3-4), the FK performed better than the plug-in estimate, with a significant difference in average error rates. This result partially agrees with observations in Jaakkola and Haussler (1999). However, our TOP approach outperforms the FK. According to the Wilcoxon signed-rank test, the TOP kernel was better in all settings with a significant difference. Also, the t-test judged that the difference is significant except for 1-4 and 2-4.
This indicates that the TOP kernel was able to capture discriminative information better than the FK could.
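The two significance tests used here (T and WX) are readily run with SciPy; a sketch on synthetic paired error rates (not the article's data; we assume the paired t-test `ttest_rel`, since the error rates are paired by data split):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic paired error rates for two methods over 30 splits,
# with the second method ("TOP") consistently slightly better.
err_fk = rng.normal(0.12, 0.02, size=30)
err_top = err_fk - rng.gamma(2.0, 0.005, size=30)

t_stat, p_t = stats.ttest_rel(err_fk, err_top)   # paired t-test (T)
w_stat, p_wx = stats.wilcoxon(err_fk, err_top)   # Wilcoxon signed-rank (WX)
```

The Wilcoxon test uses only the ranks of the paired differences, which is why it is preferred in the text when the error distributions are skewed.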
[Figure 3 appears here: six panels of error rates, one per two-class problem (1-2, 1-3, 1-4, 2-3, 2-4, 3-4), each comparing P, FK, and TOP.]
Figure 3: The error rates of SVMs with two feature extractors in the protein classification experiments. The labels P, FK, and TOP denote the plug-in estimate, the Fisher kernel, and the TOP kernel, respectively. Each graph is labeled with the two protein classes used for the experiment.

7 Conclusion
In this study, we presented the new discriminative TOP kernel derived from probabilistic models. We proposed two performance measures to analyze such kernels and gave bounds and rates to gain better insight into model-dependent feature extractors from probabilistic models. Experimentally, we
[Figure 4 appears here: six scatter plots of paired FK versus TOP error rates, one per two-class problem.]
Figure 4: Comparison of the error rates of the SVMs with two feature extractors in the protein classification experiments. Every point corresponds to one of the 15 different training, validation, and test set splits. Each graph is labeled with the two protein classes used for the experiment.
showed that the TOP kernel compares favorably to the FK and the plug-in estimate on toy data and in a realistic protein classification experiment.

Future research will focus on constructing small sample bounds for the TOP kernel to extend the validity of this work. Since other nonlinear transformations F are available for obtaining different and possibly even better features, we will consider learning the nonlinear transformation F from training samples. An interesting point is that so far, TOP kernels perform local linear
approximations; it would be interesting to move in the direction of local or even global nonlinear expansions. Recently, it was reported that FK-based classifiers can be understood in the Bayesian framework of maximum entropy discrimination (Jaakkola, Meila, & Jebara, 1999, 2000) when the prior distribution of the parameters is chosen in an appropriate way. It is therefore interesting to explore the relationship between the techniques established in this work for the TOP kernel and such Bayesian inference methods.

Appendix: Derivatives with Respect to HMM Parameters
We will illustrate how to compute derivatives of the likelihood with respect to HMM parameters (Rabiner, 1989; Durbin et al., 1998). Let n and m be the number of states and the number of symbols in the alphabet of the HMM, respectively. Typically, an HMM has the following parameters:

$a \in \mathbb{R}^{n \times n}$: the transition matrix ($a_{ij}$ denotes the probability of a transition from state i to j)

$b \in \mathbb{R}^{n \times m}$: the emission matrix ($b_{ik}$ denotes the probability of emitting symbol k in state i)

$\pi \in \mathbb{R}^{n}$: the initial state distribution ($\pi_i$ denotes the probability of the HMM to start in state i)

$q \in \mathbb{R}^{n}$: the terminal or end state distribution ($q_i$ denotes the probability of the HMM to terminate in state i)

Let us define $\lambda = \{a, b, \pi, q\}$ for convenience. Let o denote an observed sequence of length T: $o = (o_0, \ldots, o_{T-1})$, $1 \le o_i \le m$.
Then the probability that the sequence o is generated by the HMM is described as

$$\Pr[o \mid \lambda] = \sum_{s} \pi_{s_0} b_{s_0 o_0} \left( \prod_{t=0}^{T-2} a_{s_t s_{t+1}} b_{s_{t+1} o_{t+1}} \right) q_{s_{T-1}}, \tag{A.1}$$

where $\sum_s$ denotes the sum over all possible state sequences $s_0, \ldots, s_{T-1}$: $\sum_s = \sum_{s_0=1}^{n} \cdots \sum_{s_{T-1}=1}^{n}$. We will describe the derivative of $\Pr[o \mid \lambda]$ with respect to $\lambda$ using forward and backward variables, where the forward variable $\alpha_t^i$ is defined as $\alpha_t^i := \Pr[o_0, o_1, \ldots, o_t, s_t = i \mid \lambda]$, and the backward variable $\beta_t^i$ is defined as $\beta_t^i := \Pr[o_{t+1}, o_{t+2}, \ldots, o_{T-1} \mid s_t = i, \lambda]$.
These variables are efficiently computed by the standard forward-backward algorithm (Durbin et al., 1998). Then the derivatives with respect to the parameters are obtained as follows:

$$\frac{\partial \Pr[o \mid \lambda]}{\partial \pi_k} = b_{k o_0} \beta_0^k \tag{A.2}$$

$$\frac{\partial \Pr[o \mid \lambda]}{\partial q_k} = \alpha_{T-1}^k \tag{A.3}$$

$$\frac{\partial \Pr[o \mid \lambda]}{\partial a_{kl}} = \sum_{t=0}^{T-2} \alpha_t^k b_{l o_{t+1}} \beta_{t+1}^l \tag{A.4}$$

$$\frac{\partial \Pr[o \mid \lambda]}{\partial b_{kl}} = \sum_{t=0}^{T-1} \frac{\alpha_t^k \beta_t^k}{b_{k o_t}} I(o_t = l), \tag{A.5}$$
where $I(o_t = l) = 1$ if $o_t = l$ and 0 otherwise. The parameters in standard HMMs must satisfy the stochasticity conditions:

$$\sum_{j=1}^{n} a_{ij} = 1, \quad \sum_{j=1}^{m} b_{ij} = 1, \quad \sum_{j=1}^{n} \pi_j = 1, \quad \sum_{j=1}^{n} q_j = 1.$$
For computations of FK and TOP, we use the derivatives with respect to unconstrained parameters $\lambda'$ as in Jaakkola, Diekhans, et al. (2000). These unconstrained parameters are related to the original ones as

$$\pi_i = \pi_i' \Big/ \sum_{j=1}^{n} \pi_j',$$

where the other parameters $a_{ij}'$, $b_{ij}'$, $q_i'$ have the same relations (the formulas are not shown for brevity). By the chain rule, the derivative with respect to $\pi_i'$ at the point $\pi_i' = \pi_i$ is obtained as

$$\frac{\partial \Pr[o \mid \lambda]}{\partial \pi_i'} = \frac{\partial \Pr[o \mid \lambda]}{\partial \pi_i} - \sum_{j=1}^{n} \pi_j \frac{\partial \Pr[o \mid \lambda]}{\partial \pi_j}. \tag{A.6}$$
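As a concreteness check, the forward-backward recursions and equations A.2 to A.5 can be implemented directly and verified against numerical differentiation. The sketch below is illustrative, not the authors' code; the small two-state HMM and its parameter values are invented for the test. Note that the backward variable here absorbs the terminal distribution q at t = T − 1, which is what makes the derivative formulas come out in the form given above.

```python
import numpy as np

def forward_backward(a, b, pi, q, o):
    # alpha[t, i] = Pr[o_0..o_t, s_t = i | lambda]
    # beta[t, i]  = Pr[o_{t+1}..o_{T-1} | s_t = i, lambda], with the terminal
    # distribution q folded in at t = T-1, so sum_i alpha[t,i]*beta[t,i] = Pr[o|lambda]
    T, n = len(o), len(pi)
    alpha = np.zeros((T, n)); beta = np.zeros((T, n))
    alpha[0] = pi * b[:, o[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * b[:, o[t]]
    beta[T - 1] = q
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[:, o[t + 1]] * beta[t + 1])
    return alpha, beta

def likelihood(a, b, pi, q, o):
    # Equation A.1 with the state sequences summed out by the forward recursion
    alpha, _ = forward_backward(a, b, pi, q, o)
    return float(alpha[-1] @ q)

def derivatives(a, b, pi, q, o):
    # Equations A.2-A.5: derivatives of Pr[o | lambda] w.r.t. pi, q, a, b
    T = len(o)
    alpha, beta = forward_backward(a, b, pi, q, o)
    d_pi = b[:, o[0]] * beta[0]                                      # (A.2)
    d_q = alpha[T - 1].copy()                                        # (A.3)
    d_a = sum(np.outer(alpha[t], b[:, o[t + 1]] * beta[t + 1])
              for t in range(T - 1))                                 # (A.4)
    d_b = np.zeros_like(b)
    for t in range(T):
        d_b[:, o[t]] += alpha[t] * beta[t] / b[:, o[t]]              # (A.5)
    return d_pi, d_q, d_a, d_b
```

A central finite-difference check (perturb one parameter, recompute the likelihood) confirms each formula on the toy model.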
The derivatives with respect to the other unconstrained parameters can be obtained in the same way.

Acknowledgments
We thank T. Tanaka, M. Sugiyama, S.-I. Amari, K. Karplus, R. Karchin, F. Sohler, and A. Zien for valuable discussions. We gratefully acknowledge partial support from DFG (JA 379/9-1 and 9-2, MU 987/1-1) and travel grants from EU (Neurocolt II). Thanks also to the anonymous referees for their valuable comments and suggestions that led to a strong improvement of this article.
References

Barndorff-Nielsen, O., & Cox, D. (1989). Asymptotic techniques for use in statistics. London: Chapman and Hall.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Cox, T., & Ferry, G. (1993). Discriminant analysis using non-metric multidimensional scaling. Pattern Recognition, 26, 145–153.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego: Academic Press.
Geiger, D., & Meek, C. (1998). Graphical models and exponential families (Tech. Rep. No. MSR-TR-98-10). Redmond, WA: Microsoft Research.
Graepel, T., Herbrich, R., Bollmann-Sdorra, P., & Obermayer, K. (1999). Classification on pairwise proximity data. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 438–444). Cambridge, MA: MIT Press.
Hofmann, T., & Buhmann, J. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1–14.
Jaakkola, T., Diekhans, M., & Haussler, D. (2000). A discriminative framework for detecting remote protein homologies. J. Comp. Biol., 7, 95–114.
Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 487–493). Cambridge, MA: MIT Press.
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination (Tech. Rep. No. AITR-1668). Cambridge, MA: MIT.
Jaakkola, T., Meila, M., & Jebara, T. (2000). Maximum entropy discrimination. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 470–476). Cambridge, MA: MIT Press.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–285.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320.
Sonnenburg, S. (2001). Hidden Markov model for genome analysis [project report]. Berlin: Humboldt University.

Received March 22, 2001; accepted March 25, 2002.
LETTER
Communicated by Zoubin Ghahramani
Factorial Hidden Markov Models and the Generalized Backfitting Algorithm

Robert A. Jacobs [email protected] Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.

Wenxin Jiang [email protected] School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY 14853, U.S.A.

Martin A. Tanner [email protected] Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.

Previous researchers developed new learning architectures for sequential data by extending conventional hidden Markov models through the use of distributed state representations. Although exact inference and parameter estimation in these architectures is computationally intractable, Ghahramani and Jordan (1997) showed that approximate inference and parameter estimation in one such architecture, factorial hidden Markov models (FHMMs), is feasible in certain circumstances. However, the learning algorithm proposed by these investigators, based on variational techniques, is difficult to understand and implement and is limited to the study of real-valued data sets. This article proposes an alternative method for approximate inference and parameter estimation in FHMMs based on the perspective that FHMMs are a generalization of a well-known class of statistical models known as generalized additive models (GAMs; Hastie & Tibshirani, 1990). Using existing statistical techniques for GAMs as a guide, we have developed the generalized backfitting algorithm. This algorithm computes customized error signals for each hidden Markov chain of an FHMM and then trains each chain one at a time using conventional techniques from the hidden Markov models literature. Relative to previous perspectives on FHMMs, we believe that the viewpoint taken here has a number of advantages.
First, it places FHMMs on firm statistical foundations by relating them to a class of models that are well studied in the statistics community, yet it generalizes this class of models in an interesting way. Second, it leads to an understanding of how FHMMs can be applied to many different types of time-series data, including Bernoulli

Neural Computation 14, 2415–2437 (2002) © 2002 Massachusetts Institute of Technology
and multinomial data, not just data that are real valued. Finally, it leads to an effective learning procedure for FHMMs that is easier to understand and easier to implement than existing learning procedures. Simulation results suggest that FHMMs trained with the generalized backfitting algorithm are a practical and powerful tool for analyzing sequential data.

1 Introduction
Researchers in the neural computation community often find it useful to distinguish between single-cause and multiple-cause generative models. In single-cause models, each data item is generated by an individual source. Mixture models are a good example of single-cause models because individual data items are generated by sampling from a single mixture component. In multiple-cause models, in contrast, each data item is generated by combining the influences of multiple sources. Factor models are an important example of multiple-cause models. Individual data items in this case are generated by linearly combining the values of a set of latent variables and then adding gaussian noise to this sum. Recently the distinction between single-cause and multiple-cause models has become important when studying sequential data such as time-series data. A conventional method for summarizing sequential data is to use hidden Markov models (HMMs), which are a generalization of mixture models. In addition to several mixture components, they include a state variable that indicates the mixture component active at the current time step and a set of transition probabilities that indicates the likelihood that each component will become active given the previously active component. Consider the time-series data set $Y = \{y^{(t)}\}_{t=1}^{T}$, where $y^{(t)}$ is a scalar and T is the duration of the series.1 The assumed generative model is that at each time step, a new active component is selected according to the transition probabilities, and an output of the model is sampled from the selected component. This is illustrated in Figure 1(A). The vector q in this figure is the model's state variable, and y is its output variable. The elements of q are equal to zero except for the element corresponding to the active mixture component; this element is set to one.
Initially (at $t = 1$), the active mixture component is sampled from a multinomial distribution, and the output $y^{(1)}$ is sampled from the chosen component. Subsequently (for $t > 1$), a new active component is sampled from a multinomial distribution whose parameters depend on the previous active component; then the output $y^{(t)}$ is generated by the new active mixture component. This process continues for the duration of the time series.
1 Without loss of generality, we limit our discussion to one-dimensional data. The extension to multidimensional data is straightforward.
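The generative process just described is easy to state in code. The following is a minimal illustrative sketch (the parameter values are invented for the example); each mixture component is taken to emit a gaussian-distributed output around its own mean.

```python
import numpy as np

def sample_hmm(pi, a, means, sigma, T, rng):
    # pi: initial distribution over mixture components
    # a:  transition probabilities (row = previously active component)
    # means[i], sigma: gaussian output parameters of component i
    s = rng.choice(len(pi), p=pi)            # active component at t = 1
    y = np.empty(T)
    for t in range(T):
        y[t] = rng.normal(means[s], sigma)   # sample output from the active component
        s = rng.choice(len(a[s]), p=a[s])    # select the next active component
    return y
```

With sticky transition probabilities, the sampled series dwells near one component mean for a while before jumping to another, which is the qualitative behavior the text describes.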
Figure 1: (A) A graphical representation of a hidden Markov model. (B) A graphical representation of a factorial hidden Markov model with three hidden Markov chains.
HMMs have several attractive features. They can model data that violate the stationarity assumptions characteristic of many other time-series models. In addition, there exists an expectation-maximization (EM) algorithm, known as the Baum-Welch algorithm, for finding maximum likelihood estimates of the parameter values of an HMM (Baum, Petrie, Soules, & Weiss, 1970; Rabiner & Juang, 1993). This algorithm is easy to understand and easy to implement. Unfortunately, HMMs have disadvantages too, and researchers continue to look for better models. An important disadvantage of HMMs is that they are instances of single-cause models and thus are inefficient from a representational viewpoint. The history of a time series up to time step t is represented by the active mixture component at time t indicated by the multinomial state variable $q^{(t)}$. As Williams and Hinton (1991) pointed out, the multinomial assumption makes HMMs representationally inefficient. Due to this assumption, HMMs need $2^N$ possible values for the state variable (i.e., $2^N$ mixture components) to represent N bits of information about the history of a time series. To remedy the representational inefficiency of HMMs, Williams and Hinton (1991) proposed generalizing HMMs through the use of distributed state representations. Ghahramani and Jordan (1997) studied a special case of this generalization, which they referred to as factorial hidden Markov models (FHMMs), and considered the use of FHMMs for summarizing real-valued time-series data. The most important feature of FHMMs is that they consist of multiple hidden Markov chains (see Figure 1B). Each of the chains contains a number of components along with a multinomial state variable that indicates the component that is active at the current time step and a set of transition probabilities that gives the likelihood that each component will become active given the previously active component.
Chains are independent of each other; they do not interact when generating data. The output of the model at each time step is, however, dependent on the values of the state variables of all the chains. Hence, FHMMs are instances of multiple-cause generative models. Real-valued time-series data are generated as follows. At each time step, each chain determines its active component by sampling from a multinomial distribution whose parameters depend on the previous active component. Then an output contribution is made by each chain that is dependent on the chain's currently active component. The sum of the contributions of all the chains is the mean of a gaussian distribution, and a sample from this distribution is the output of the entire model. In contrast to conventional HMMs, FHMMs are efficient from a representational viewpoint. If each chain has two components, meaning that each chain's state variable is binary, then an FHMM needs N state variables (i.e., N hidden Markov chains with a total of 2N components) to represent N bits of information about the history of a time series. Factorial hidden Markov models are a special case of dynamic Bayesian networks (DBNs). These networks are directed graphical models of stochastic processes that generalize HMMs by representing the hidden state in terms of state variables with possibly complex interdependencies. Due to these interdependencies, exact maximum likelihood estimation is typically computationally intractable in DBNs, though the study of learning procedures that perform approximate estimation is an active area of research (Boyen & Koller, 1999; Dean & Kanazawa, 1988; Murphy & Weiss, 2001). FHMMs are a special case of DBNs in the sense that they use multiple state variables, though they do not permit the state variables to interact with each other. Although FHMMs are appealing generative models, they are difficult to use for the purposes of statistical inference. In particular, the EM algorithm for finding maximum likelihood estimates of an FHMM's parameter values is computationally intractable (Ghahramani & Jordan, 1997). The problem occurs during the E-step of the algorithm when one is attempting to infer the expected value of each chain's state variable at each time step given the time-series data. Although the state variables of the hidden Markov chains are marginally independent, these variables are conditionally dependent given the time-series data. Intuitively, the computational intractability stems from the cooperative nature of the model; the settings of all the state variables cooperate in determining the mean of the time-series data at each time step. When inferring the expected value of one chain's state variable given the time-series data, it is therefore necessary to sum over all possible values of the other chains' state variables. Ghahramani and Jordan developed a procedure for performing inference in FHMMs through the use of the junction tree algorithm, a popular algorithm for performing inference in graphical models. The time complexity of the procedure is $O(TMK^{M+1})$, where T is the duration of the time series, M is the number of hidden Markov chains, and K is the number of components in each chain.
This exponential time complexity makes exact inference and maximum likelihood parameter estimation intractable. Because exact inference and parameter estimation in FHMMs is intractable, Ghahramani and Jordan (1997) studied approximate inference and parameter estimation. The primary focus of their article was on the use of variational techniques to approximate the posterior distribution of the state variables given the time series. In short, the idea is to approximate this distribution with another distribution that can be computed efficiently and can be shown using Jensen's inequality to provide a basis for a lower bound on the log-likelihood of the time-series data. The parameters of the approximating distribution are found by minimizing the Kullback-Leibler distance between this distribution and the true posterior distribution (Jordan, Ghahramani, Jaakkola, & Saul, 1998). The resulting approximating distribution can be used to obtain an efficient parameter estimation algorithm. Ghahramani and Jordan showed that an FHMM trained using the variational approximation captured statistical structure in a time-series data set that a conventional HMM did not.
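The FHMM generative process for real-valued data, as described above, can be sketched as follows. This is an illustrative toy example, not the authors' code; the chains, contribution vectors, and noise level are invented. Each chain evolves independently, and the output mean is the sum of the chains' contributions.

```python
import numpy as np

def sample_fhmm(pis, As, mus, sigma, T, rng):
    # pis[m]: initial distribution of chain m; As[m]: its transition matrix
    # mus[m][k]: output contribution of chain m when its component k is active
    # At each step the output is gaussian with mean = sum of chain contributions.
    M = len(pis)
    states = [rng.choice(len(p), p=p) for p in pis]
    y = np.empty(T)
    for t in range(T):
        mean = sum(mus[m][states[m]] for m in range(M))    # additive combination
        y[t] = rng.normal(mean, sigma)
        states = [rng.choice(len(As[m][s]), p=As[m][s])    # chains evolve independently
                  for m, s in enumerate(states)]
    return y
```

With two binary chains contributing {0, 1} and {0, 10}, the output mean takes one of four values (0, 1, 10, 11), illustrating how M binary chains encode 2^M effective states.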
In this article, we propose an alternative method for approximate inference and parameter estimation in FHMMs that we call the generalized backfitting algorithm. This method is based on the perspective that FHMMs are a generalization of a well-known class of statistical models known as generalized additive models (GAMs; see Hastie & Tibshirani, 1990). GAMs are a method for mapping covariate or input variables to response or output variables and can be used in any setting in which generalized linear models are applicable, such as linear, logistic, or log-linear regression, the analysis of multinomial data, or the analysis of censored survival data. Consider a data set in which covariate variables $x_1$, $x_2$, and $x_3$ are mapped to response variable y: $\{x_1^{(n)}, x_2^{(n)}, x_3^{(n)} \to y^{(n)}\}_{n=1}^{N}$, where n indexes individual data items. When regarded as a generative model, GAMs assume that each data item is generated by first summing contributions made on the basis of individual covariate variables and then mapping this sum to a conditional mean of the response using a function h:

$$E[y^{(n)} \mid x_1^{(n)}, x_2^{(n)}, x_3^{(n)}] = h[f_1(x_1^{(n)}) + f_2(x_2^{(n)}) + f_3(x_3^{(n)})], \tag{1.1}$$

where $f_1(x_1^{(n)})$, $f_2(x_2^{(n)})$, and $f_3(x_3^{(n)})$ are the contributions made on the basis of the individual covariates, and h is an appropriately selected monotonic and differentiable function. The response $y^{(n)}$ is then sampled from an appropriate distribution with the given conditional mean. When used for statistical inference, GAMs are appealing because they are simple. Rather than consider the relationship between all covariate variables and a response variable, GAMs attempt to circumvent the "curse of dimensionality" by iteratively considering each individual covariate variable one at a time. This training procedure for GAMs is known as the backfitting algorithm. When the parameters of a GAM are estimated, a mapping is learned from each covariate variable to a residual error, where the error is based on the value of the response variable and the sum of the contributions based on the remaining covariate variables. Consider the case where the response is real valued, and we are interested in updating the parameters of $f_1$, the function giving the contribution based on covariate $x_1$. Then the residual error $e_1^{(n)}$ on data item n that corresponds to $x_1^{(n)}$ is

$$e_1^{(n)} = y^{(n)} - [f_2(x_2^{(n)}) + f_3(x_3^{(n)})]. \tag{1.2}$$

The parameters of $f_1$ are updated so that $f_1(x_1^{(n)})$ more closely approximates $e_1^{(n)}$ for each data item. The backfitting algorithm cycles through updates of the functions $f_1$, $f_2$, and $f_3$ until convergence. The parameter estimation method for FHMMs proposed in this article takes advantage of the fact that FHMMs are additive models; at each time step, each hidden Markov chain makes an output contribution based on the value of its state variable, and the sum of these contributions is used to
determine the conditional mean of the time-series data. As a result, lessons learned from the study of GAMs can be applied to FHMMs. The method treats learning in FHMMs and GAMs as similar in the sense that it uses an iterative procedure that considers each chain's state variable one at a time. More precisely, it considers the mapping at each time step between each chain's state variable and a residual error where the error is based on the current value of the time-series data and the sum of the contributions based on the remaining chains' state variables. However, learning in FHMMs is also different from learning in GAMs. Whereas each stage of learning in GAMs considers the known value of a covariate variable, each stage of learning in FHMMs considers the unknown value of a hidden state variable. Fortunately, however, because each chain's state variable is considered separately, the expected value of this variable can be computed efficiently using standard hidden Markov model techniques. The end result is a learning algorithm that closely resembles the backfitting algorithm used to train GAMs. Like the conventional backfitting algorithm, the proposed procedure is simple and effective. Unlike the conventional backfitting algorithm, however, the proposed procedure does not produce maximum likelihood parameter estimates. The need to estimate the expected values of the chains' state variables means that the procedure is an approximate method. The article is organized as follows. Section 2 provides an overview of GAMs, describes how FHMMs can be viewed as generalizations of them, and describes the generalized backfitting algorithm. Relative to previous perspectives on FHMMs, we believe that the viewpoint taken here has a number of advantages. First, it places FHMMs on a firm statistical foundation by relating FHMMs to a class of models that is well studied in the statistics community, yet it generalizes this class of models in an interesting way. Second, it leads to an understanding of how FHMMs can be applied to many different types of time-series data, including Bernoulli and multinomial data, not just data that are real valued. Finally, it leads to an effective learning procedure for FHMMs that is easier to understand and easier to implement than existing learning procedures. Using this learning procedure, we believe that FHMMs are now a practical tool for analyzing sequential data. Section 3 provides some analytical results regarding the application of the generalized backfitting algorithm to FHMMs. This section clarifies several of the theoretical properties of the algorithm. Section 4 presents simulation results comparing the performances of FHMMs trained with the generalized backfitting algorithm to FHMMs trained with other learning procedures. The results suggest that FHMMs trained with any of the candidate algorithms have similar levels of performance when summarizing real-valued time series and that the new learning procedure proposed here outperforms alternative algorithms when summarizing multinomial time series. A summary and conclusions are presented in section 5.
2 FHMMs and GAMs
We begin this section by providing a brief overview of GAMs. A thorough treatment can be found in Hastie and Tibshirani (1990). GAMs are closely related to generalized linear models (GLIMs) and can be understood by comparing them with GLIMs (McCullagh & Nelder, 1989). From a generative perspective, GLIMs generate data items using a three-stage process. First, a linear combination of the covariate variables, denoted g, is formed,

$$g = \sum_{m=1}^{M} x_m \beta_m, \tag{2.1}$$

where $x_1, \ldots, x_M$ is a set of covariates and $\beta_1, \ldots, \beta_M$ is a set of coefficients. Next, a monotonic and differentiable function, denoted h, maps the linear combination g to the expected value of the response variable y given the values of the covariates:

$$E(y \mid x_1, \ldots, x_M) = h(g). \tag{2.2}$$

Finally, the value of the response y is sampled from an output distribution. The choices of the distribution and the function h depend on the nature of the response variable. If, for example, the response y is real valued, then it is common to assume that y has a gaussian distribution and that the function h is the identity function. As a second example, if the response y is binary, then it is assumed that y has a Bernoulli distribution and that the function h is the logistic function. In general, the distribution of the response is a member of the exponential family of distributions. If h is chosen so that g equals the natural parameter of the distribution, then $h^{-1}$ is known as the canonical link function for the distribution. For the purposes of statistical inference, estimates of the values of the coefficients $\beta_1, \ldots, \beta_M$ are typically obtained using the iteratively reweighted least-squares (IRLS) algorithm. This algorithm is an instance of a Newton-Raphson algorithm, and if h is chosen so that its inverse is a canonical link function, then the algorithm is also a special case of the Fisher scoring procedure. In short, the linearized response $z^{(n)}$ for the nth data item is computed as the first-order Taylor series approximation to $h^{-1}(y^{(n)})$ about the current estimate $E(y^{(n)} \mid x_1^{(n)}, \ldots, x_M^{(n)})$. New estimates of the coefficients are formed by weighted linear regression of the linearized responses for all data items onto the covariate variables, with weights that are inversely proportional to the variances of the linearized responses (see McCullagh & Nelder, 1989, for details). This process is repeated until the parameter estimates converge. The IRLS algorithm is attractive because no special optimization software is required, just a function that computes weighted least-squares estimates.
Consider the case when the response variable y is real valued with a gaussian distribution and the function h is the identity function. Then the linear combination $g = \sum_m x_m \beta_m$ is the conditional mean of a gaussian distribution given values for the covariates. When the IRLS algorithm is used to estimate the values of the coefficients, the linearized response z is equal to the actual response y and the weights are equal to one. That is, the IRLS algorithm reduces to conventional linear regression. As a second example, consider the case when the response y is binary with a Bernoulli distribution and the function h is the logistic function. In this case, the linearized response (i.e., the first-order Taylor series approximation described above) is

$$z = g + \frac{y - h(g)}{h(g)(1 - h(g))}, \tag{2.3}$$

and the weights (i.e., the inverse variances of the linearized responses) are

$$w = h(g)(1 - h(g)). \tag{2.4}$$
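For concreteness, here is a minimal sketch of IRLS for the Bernoulli/logistic case, built directly from equations 2.1 to 2.4. It is illustrative rather than definitive: the data-generating setup in the test is invented, and the small floor on w is an implementation detail guarding against zero weights, not part of the source.

```python
import numpy as np

def irls_logistic(X, y, iters=25):
    # Iteratively reweighted least squares for logistic regression.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X @ beta                               # linear combination, eq. 2.1
        h = 1.0 / (1.0 + np.exp(-g))               # logistic inverse link, eq. 2.2
        w = np.maximum(h * (1.0 - h), 1e-10)       # weights, eq. 2.4 (floored)
        z = g + (y - h) / w                        # linearized response, eq. 2.3
        # weighted least-squares regression of z onto X
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta
```

Each pass recomputes z and w from the current coefficient estimates, which is why the weighted regression must be iterated, exactly as the text notes below for the general case.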
Intuitively, a weight is small when the variance is small because the response y is close to its expected value $h = E[y]$ in this case and, thus, the residual $y - h$ is small. In contrast, a weight is large when the variance is large. The IRLS algorithm performs a weighted least-squares regression of the linearized responses onto the covariates. This regression must be iterated because new estimates of the coefficients lead to new linearized responses and weights. GAMs differ from generalized linear models in that an additive combination replaces the linear combination. From a generative perspective, GAMs generate data items using a three-stage process. First, an additive combination based on the covariates, denoted g, is formed:

$$g = \sum_{m=1}^{M} f_m(x_m), \tag{2.5}$$

where the $f_m$s are arbitrary univariate functions, one for each covariate. Next, a monotonic and differentiable function h maps the additive combination g to the expected value of the response variable y given the values of the covariates:

$$E(y \mid x_1, \ldots, x_M) = h(g). \tag{2.6}$$

Finally, the value of the response y is sampled from an output distribution. As in the case of GLIMs, the choices of the distribution and the function h depend on the nature of the response variable. When statistical inference is performed, the parameters of the $f_m$s need to be estimated. The IRLS algorithm for GAMs is essentially unchanged from
its form for GLIMs with the exception that a weighted additive regression is used instead of a weighted linear regression. First, linearized responses z and weights w are computed for all data items. Then the parameters of the $f_m$s are updated by performing a weighted additive regression of the linearized responses onto the covariates. The backfitting algorithm is used to perform this weighted additive regression. When applied to GAMs, the IRLS algorithm defines the linearized response z and the weight w in the same way as when the algorithm is applied to GLIMs. For example, if the response variable is real valued with a gaussian distribution and the function h is the identity function, then the linearized response z equals the actual response y and the weights equal one. If the response variable is binary with a Bernoulli distribution and the function h is the logistic function, then the linearized response and weight are given by equations 2.3 and 2.4, respectively. The backfitting algorithm iteratively considers each individual covariate variable one at a time in order to perform the weighted additive regression. It first forms a residual error equal to the difference between the value of the linearized response and the sum of the contributions based on the remaining covariate variables. The residual error corresponding to the mth covariate on the nth data item is

$$e_m^{(n)} = z^{(n)} - \sum_{i \neq m} f_i(x_i^{(n)}). \tag{2.7}$$
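The backfitting loop built around equation 2.7 can be sketched as follows. This is an illustrative toy version in which each univariate smoother is a weighted polynomial fit standing in for a generic smoother; the data, degree, and cycle count are invented, and constant terms are left absorbed in the $f_m$s rather than centered out for identifiability.

```python
import numpy as np

def backfit(X, z, w, degree=3, cycles=50):
    # Weighted additive regression z ~ sum_m f_m(x_m) by backfitting.
    # F[:, m] holds the current fitted contribution f_m(x_m) for every data item.
    N, M = X.shape
    F = np.zeros((N, M))
    for _ in range(cycles):
        for m in range(M):
            e = z - (F.sum(axis=1) - F[:, m])      # residual error, eq. 2.7
            # np.polyfit's w multiplies the unsquared residuals, so pass
            # sqrt of the statistical weights for a weighted least-squares fit
            coef = np.polyfit(X[:, m], e, degree, w=np.sqrt(w))
            F[:, m] = np.polyval(coef, X[:, m])
    return F
```

Cycling through the covariates repeatedly, each $f_m$ is refit to the residual left by the others, so the summed fit converges toward the best additive approximation of z.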
Next, the parameters of $f_m$ are updated so that the contribution based on the mth covariate approximates the error $e_m^{(n)}$ for each data item in proportion to the weight for that data item. This process is repeated for all of the covariates, possibly multiple times. An important point of this article is that factorial hidden Markov models can be regarded as a generalization of GAMs. From a generative perspective, the three-stage process by which FHMMs generate data is similar to the process by which GAMs generate data. First, at each time step, each hidden Markov chain makes an output contribution based on the value of its hidden state variable, and an additive combination of these contributions is formed. The contribution of chain m, denoted $f_m(q_m)$, is given by $f_m(q_m) = \mu_m' q_m$, where $\mu_m$ is a K-dimensional vector of parameters ($\mu_m'$ is the transpose of $\mu_m$). The additive combination of the output contributions from all M chains, denoted g, is

$$g = \sum_{m=1}^{M} f_m(q_m). \tag{2.8}$$
Next, a monotonic and differentiable function h maps the additive combination g to the expected value of the response variable y given the values of the state variables:

$$E(y \mid q_1, \ldots, q_M) = h(g). \tag{2.9}$$
Finally, the value of y is sampled from an output distribution. As in the case of GLIMs and GAMs, the choices of the distribution and the function h depend on the nature of the response variable. Note that both GAMs and FHMMs map a set of variables—covariate variables $x_1, \ldots, x_M$ in the case of GAMs and hidden state variables $q_1, \ldots, q_M$ in the case of FHMMs—to a response variable. Both of these classes of models also perform this mapping using an additive combination, not a linear combination, and then map this combination to a response using the inverse of a canonical link function. When performing statistical inference using FHMMs, it is important to keep in mind that GAMs and FHMMs have both similarities and differences. The most significant difference is that GAMs consider the known values of covariate variables, whereas FHMMs consider the unknown values of hidden state variables. The similarities between the two classes of models mean that the parameter estimation algorithm for GAMs can be a useful guide to parameter estimation in FHMMs. The differences mean that the estimation algorithm for GAMs cannot be applied to FHMMs in its exact form, but some modifications will be required. Because the values of the hidden state variables are unknown, the algorithm for parameter estimation in FHMMs will require estimates of the expected values of these state variables. Fortunately, estimates of these expected values can be computed efficiently using standard techniques from the hidden Markov models literature. The values of the free parameters of FHMMs need to be estimated during statistical inference. The IRLS algorithm for FHMMs is similar to its form for GAMs. In particular, a modified version of the backfitting algorithm is used to perform weighted additive regressions. We refer to this modified version as the generalized backfitting algorithm. The algorithm first defines the linearized responses and the weights.
Once again, if the response variable is real valued with a gaussian distribution and the function h is the identity function, then the linearized response z equals the actual response y, and the weights equal one. If the response variable is binary with a Bernoulli distribution and the function h is the logistic function, then the linearized response and weight are given by equations 2.3 and 2.4. The algorithm then iteratively considers each individual state variable one at a time (or, equivalently, each individual hidden Markov chain one at a time). For each state variable, it forms a residual error equal to the difference between the value of the linearized response and the sum of the contributions based on the expected values of the remaining state variables. The residual error corresponding to the mth Markov chain at time t is

$$e_m^{(t)} = z^{(t)} - \sum_{i \neq m} f_i\big(E[q_i^{(t)}]\big), \qquad (2.10)$$
where E[q_i^(t)] is the expected value of the hidden state variable for chain i and f_i(E[q_i^(t)]) is the expected contribution of chain i. Next, the parameters of the mth Markov chain are updated so that the contribution based on the mth state variable approximates the error e_m^(t) for all time steps, in proportion to the weight for each time step. This process is repeated for all of the Markov chains, possibly multiple times.

In order to understand learning in FHMMs better, it is worth pointing out an important feature of the IRLS algorithm that leads to an interesting viewpoint regarding FHMMs and multicomponent or modular architectures. This viewpoint follows from the fact that the set of residual errors {e_m^(t)}_{t=1}^T is itself a time-series data set. Because this is a real-valued time series, it can be modeled by a standard hidden Markov model with gaussian responses. In other words, FHMMs can be regarded as instances of multicomponent or modular architectures where the modules are the hidden Markov chains. Our analysis indicates that these chains should be implemented as HMMs with gaussian responses. From a generative perspective, the output of an FHMM as a whole at time t is formed by first summing the expected outputs of each of the individual HMM modules, mapping this sum to an expected value of an output distribution using an inverse canonical link function (i.e., the function h), and then sampling from this distribution in order to generate the target time-series data item y^(t). From the perspective of statistical inference, the fact that each HMM module's set of residual errors is itself a time-series data set is good news. The individual HMM modules can be quickly and effectively trained using conventional parameter estimation algorithms from the literature on hidden Markov models, such as the Baum-Welch algorithm. As a result, FHMMs as a whole can be efficiently trained. This viewpoint highlights the strengths of the backfitting approach to training additive models. When considering learning in any modular architecture, it is important to consider how error signals can be calculated that are customized for each of the architecture's individual modules.
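The residual computation of equation 2.10 is a one-liner over arrays. The sketch below assumes each chain's expected contribution f_i(E[q_i^(t)]) has already been evaluated and stored row-wise in a matrix; the function name and data layout are our own, not the paper's:

```python
import numpy as np

def backfitting_residuals(z, contributions, m):
    """Residual time series for chain m (the quantity in equation 2.10):
    e_m^(t) = z^(t) - sum_{i != m} f_i(E[q_i^(t)]).

    z:             array of shape (T,), linearized responses.
    contributions: array of shape (M, T); row i holds f_i(E[q_i^(t)])
                   for chain i at every time step.
    """
    others = np.delete(contributions, m, axis=0)  # drop chain m's own row
    return z - others.sum(axis=0)

# Illustrative numbers: T = 4 time steps, M = 3 chains.
z = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([[0.5, 0.5, 0.5, 0.5],
              [0.2, 0.1, 0.0, 0.1],
              [0.1, 0.2, 0.3, 0.4]])
e_0 = backfitting_residuals(z, f, 0)  # residual series tailored for chain 0
print(e_0)  # → [0.7 1.7 2.7 3.5]
```

The resulting series {e_0^(t)} is exactly the real-valued time series that module 0's gaussian HMM is then asked to summarize.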
In the case of FHMMs, the generalized backfitting equation solves this problem (see equation 2.10). The time-series data {e_m^(t)}_{t=1}^T are residual errors that are specifically tailored for HMM module m. HMM module m is then trained using the Baum-Welch algorithm to summarize the time series {e_m^(t)}_{t=1}^T, and this process is repeated for all modules. The pseudocode in Figure 2 outlines the IRLS procedure applied to FHMMs. When implementing this procedure, we have found that it is relatively easy to go from a computer program for simulating conventional HMM models to a program for simulating FHMM models. The majority of the code for FHMMs is a subroutine that implements the Baum-Welch algorithm in individual HMM modules.

To complete the specification of the IRLS procedure, two more issues need to be dealt with. First, training data for HMMs typically do not include weights {w^(t)}_{t=1}^T on the responses, and so we need to describe how the Baum-Welch algorithm can be modified so that its updates of an HMM's parameters take these weights into account. To do this, we introduce some
repeat until convergence do
    for hidden Markov chain m = 1 to m = M do
        compute linearized responses {z^(t)}_{t=1}^T and weights {w^(t)}_{t=1}^T
        compute time series {e_m^(t)}_{t=1}^T
        Baum-Welch algorithm: train chain m to summarize time series {e_m^(t)}_{t=1}^T using weights {w^(t)}_{t=1}^T
    end
end

Figure 2: Pseudocode outlining the IRLS procedure applied to FHMMs.
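A deliberately degenerate but runnable Python rendering of this loop: if each "chain" is collapsed to a single state, the Baum-Welch step reduces to taking the mean of the residual series, and for a gaussian response the linearized response z equals y with unit weights. The sketch therefore shows only the control flow and the equation 2.10 residuals; all names are ours:

```python
import numpy as np

def generalized_backfitting_means(y, M, n_sweeps=50):
    """One-state-per-chain instance of the Figure 2 loop. Each chain's
    'model' is a single constant contribution, so training a chain on its
    residual series just sets that contribution to the residual mean."""
    contrib = np.zeros(M)              # each chain's (constant) contribution
    for _ in range(n_sweeps):          # "repeat until convergence"
        for m in range(M):             # one hidden Markov chain at a time
            e_m = y - (contrib.sum() - contrib[m])  # eq. 2.10 residuals
            contrib[m] = e_m.mean()    # Baum-Welch step collapses to a mean
    return contrib

y = np.array([2.0, 4.0, 6.0])
c = generalized_backfitting_means(y, M=3)
print(c.sum())  # ≈ 4.0, the mean of y: the chains' contributions add up
```

With real K-state chains, the inner assignment becomes a weighted Baum-Welch fit of chain m to {e_m^(t)}, but the control structure is unchanged.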
notation. Let q_mi^(t) denote the ith element of state vector q_m^(t). Let γ_mi^(t) denote the probability that component i of the mth HMM module is active at time t given the residual error time series: γ_mi^(t) = p(q_mi^(t) = 1 | {e_m^(t)}_{t=1}^T). Finally, let ξ_mij^(t) denote the probability that component i of HMM module m is active at time t and that component j is active at time t + 1 given the residual error time series: ξ_mij^(t) = p(q_mi^(t) = 1, q_mj^(t+1) = 1 | {e_m^(t)}_{t=1}^T). At each iteration of the Baum-Welch algorithm, the transition probability that component j will become active at time t + 1 given that component i is active at time t is given by

$$p(q_{mj}^{(t+1)} = 1 \mid q_{mi}^{(t)} = 1) = \frac{\sum_{t=1}^{T-1} \xi_{mij}^{(t)} w^{(t)}}{\sum_{t=1}^{T-1} \gamma_{mi}^{(t)} w^{(t)}}. \qquad (2.11)$$
The update equations for the mean and variance associated with the ith component of the mth HMM module, denoted μ_mi and σ_mi², respectively, are:

$$\mu_{mi} = \frac{\sum_{t=1}^{T} \gamma_{mi}^{(t)} w^{(t)} e_m^{(t)}}{\sum_{t=1}^{T} \gamma_{mi}^{(t)} w^{(t)}} \qquad (2.12)$$

$$\sigma_{mi}^2 = \frac{\sum_{t=1}^{T} \gamma_{mi}^{(t)} w^{(t)} \big(e_m^{(t)} - \mu_{mi}\big)^2}{\sum_{t=1}^{T} \gamma_{mi}^{(t)} w^{(t)}}. \qquad (2.13)$$

If w^(t) = 1 for all time steps, then equations 2.11 through 2.13 are identical to the standard Baum-Welch update equations.
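The weighted updates above can be sketched in a few lines, assuming the posterior quantities γ and ξ have already been produced by a forward-backward pass; the array shapes and function names are our own convention:

```python
import numpy as np

def weighted_gaussian_updates(gamma, w, e):
    """Weighted mean/variance updates (equations 2.12 and 2.13) for one
    HMM module. gamma has shape (T, K): gamma[t, i] is the posterior
    probability that component i is active at time t; w has shape (T,);
    e has shape (T,) and holds the module's residual series."""
    gw = gamma * w[:, None]                       # gamma_mi^(t) * w^(t)
    denom = gw.sum(axis=0)                        # shared denominator, per i
    mu = (gw * e[:, None]).sum(axis=0) / denom               # eq. 2.12
    var = (gw * (e[:, None] - mu) ** 2).sum(axis=0) / denom  # eq. 2.13
    return mu, var

def weighted_transition_updates(xi, gamma, w):
    """Weighted transition update (equation 2.11). xi has shape
    (T-1, K, K): xi[t, i, j] is the posterior probability of component i
    at time t and component j at time t + 1. When gamma and xi come from
    a genuine forward-backward pass, each row of the result sums to one."""
    num = (xi * w[:-1, None, None]).sum(axis=0)
    denom = (gamma[:-1] * w[:-1, None]).sum(axis=0)
    return num / denom[:, None]

# With unit weights these reduce to the standard Baum-Welch updates.
T, K = 5, 2
rng = np.random.default_rng(0)
gamma = rng.dirichlet(np.ones(K), size=T)
xi = rng.dirichlet(np.ones(K * K), size=T - 1).reshape(T - 1, K, K)
e = rng.normal(size=T)
mu, var = weighted_gaussian_updates(gamma, np.ones(T), e)
A = weighted_transition_updates(xi, gamma, np.ones(T))
```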
We also need to describe how to compute the expected values of each hidden Markov chain's state variable at each time step (i.e., the set {E[q_m^(t)]}_{t=1}^T for chains m = 1, ..., M; these values are used in equation 2.10). We have studied two different methods for computing these values. Empirical results suggest that the methods lead to similar levels of performance. The first method is to set the ith element of the vector E[q_m^(t)] equal to γ_mi^(t), the probability that the ith component of chain m is active at time t given the residual error time series:

$$E[q_{mi}^{(t)}] = \gamma_{mi}^{(t)}. \qquad (2.14)$$

We refer to the generalized backfitting algorithm where the expected values of the state variables are computed in this manner as the GBF-γ algorithm. The second method uses the Viterbi algorithm, another common technique in the HMM literature, to compute the single "best" state sequence given the time-series data for each hidden Markov chain. For chain m, this state sequence is defined as the sequence {q_m^(t)}_{t=1}^T that maximizes the probability of the sequence given the residual error time series. The expected values {E[q_m^(t)]}_{t=1}^T are simply set to this sequence:

$$\{E[q_m^{(t)}]\}_{t=1}^T = \arg\max_{\{q_m^{(t)}\}_{t=1}^T} p\big(\{q_m^{(t)}\}_{t=1}^T \mid \{e_m^{(t)}\}_{t=1}^T\big). \qquad (2.15)$$

The generalized backfitting algorithm where the expected values of the state variables are computed using the Viterbi algorithm is referred to as the GBF-V algorithm.

3 Some Analytical Results
This section considers some theoretical properties of FHMMs trained with the generalized backfitting algorithm GBF-γ. For simplicity, we consider the case where the observed time series Y = {y^(t)}_{t=1}^T is real valued, and thus its expected value is a linear function of the contributions of the individual hidden Markov chains. In the case of a nonlinear model, a linearization transform from y to z (e.g., see equation 2.3) leads to approximately the same conclusions.

Suppose that the observed data are generated by an FHMM with M independent Markov chains determining the data as described above. Let y_m^(t) = f_m(q_m^(t)) = μ_m' q_m^(t) denote the output contribution of chain m at time t. The output of the FHMM as a whole is

$$y^{(t)} \mid \{q_m^{(t)}\}_{m=1}^M \sim N\!\left(\sum_{m=1}^{M} f_m(q_m^{(t)}),\; \sigma^2\right). \qquad (3.1)$$
When statistical inference is performed, the parameters θ_1, ..., θ_M need to be estimated, where θ_m is the set of parameters associated with chain m, including its mean and variance parameters, its component transition probabilities, and possibly its initial component probabilities. For chain m, the GBF-γ algorithm uses the residual error

$$e_m^{(t),\mathrm{new}} = y^{(t)} - \sum_{i \neq m} E\big[\mu_i' q_i^{(t)} \mid \{e_i^{(t),\mathrm{old}}\}_{t=1}^{T}\big]\Big|_{\theta = \theta^{\mathrm{old}}}, \qquad (3.2)$$
and then fits the time series {e_m^(t),new}_{t=1}^T using a single-chain HMM and the Baum-Welch procedure.

Let {y_m^(t)}_{t=1}^T be any time series generated by a single-chain HMM with parameters θ_m. What will be the estimated mean and variance of this time series? Is the residual error time series {e_m^(t)}_{t=1}^T modeled correctly in mean and variance? In other words, is E(e_m^(t)) = E(y_m^(t))? Is var(e_m^(t)) = var(y_m^(t))? We demonstrate that the means of the residuals are modeled correctly.

Proposition 1. E(e_m^(t)) = E(y_m^(t)).
Proof.

$$E(e_m^{(t)}) = E(y^{(t)}) - \sum_{i \neq m} E\Big( E\big[\mu_i' q_i^{(t)} \mid \{e_i^{(t),\mathrm{old}}\}_{t=1}^{T}\big]\Big|_{\theta=\theta^{\mathrm{old}}} \Big) \qquad (3.3)$$

$$= E(y^{(t)}) - \sum_{i \neq m} E\big(\mu_i' q_i^{(t)}\big) \qquad (3.4)$$

$$= \sum_{i=1}^{M} E\big(\mu_i' q_i^{(t)}\big) - \sum_{i \neq m} E\big(\mu_i' q_i^{(t)}\big) \qquad (3.5)$$

$$= E\big(\mu_m' q_m^{(t)}\big) \qquad (3.6)$$

$$= E\big(y_m^{(t)}\big). \qquad (3.7)$$

This shows that the means are equal.
Proposition 1 does not imply that the distribution of e_m^(t) is modeled correctly. To the contrary, we believe that in general, var(e_m^(t)) ≠ var(y_m^(t)). However, we cannot write an analytic expression for var(e_m^(t)) (expressed in terms of y^(t) and the q_i^(t)'s), and so we cannot prove this conjecture.

We now consider a more careful specification of what the generalized backfitting algorithm is estimating. Let P_{y_m^(t)}[{ỹ_m^(t)} | θ_m] denote the likelihood of observed data {ỹ_m^(t)}_{t=1}^T for a single-chain HMM time series:

$$P_{\{y_m^{(t)}\}}\big[\{\tilde{y}_m^{(t)}\} \mid \theta_m\big] = \sum_{\{q_m^{(t)}\}_{t=1}^T} \pi_{q_m^{(1)}} P_{q_m^{(1)}, q_m^{(2)}} \cdots P_{q_m^{(T-1)}, q_m^{(T)}} (2\pi\sigma^2)^{-T/2} \times \exp\!\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{T} \big(\tilde{y}^{(t)} - \mu_m' q_m^{(t)}\big)^2\right], \qquad (3.8)$$

where the sum over {q_m^(t)}_{t=1}^T denotes a sum over all possible state sequences, π_{q_m^(1)} is the initial (t = 1) probability for state variable q_m^(1), and P_{q_m^(t-1), q_m^(t)} is the transition probability of moving from state q_m^(t-1) at time t - 1 to state q_m^(t) at time t. Although seemingly complicated, this equation has the standard form for an HMM likelihood function.

Proposition 2. The limiting point of the GBF-γ algorithm satisfies the following fixed-point equation: 0 = ∇_{θ_m} P_{y_m^(t)}[{e_m^(t)} | θ_m], m = 1, ..., M.

Proof. Using the fact that the Baum-Welch procedure finds maximum likelihood parameter estimates, this is obvious since at each step of the algorithm, the new estimate θ_m^new maximizes the single-chain HMM likelihood and therefore satisfies 0 = ∇_{θ_m^new} P_{y_m^(t)}[{e_m^(t),new} | θ_m^new], where the residuals {e_m^(t),new} are computed using parameter values and residual errors of other chains from the "old" step.
Several points are worth noting:

- It is easy to extend proposition 2 to the case where there is more than one independent time series, since the overall likelihood can be formed by taking products of the likelihoods for the individual series.

- The set of fixed-point equations may not correspond to maximizing any scalar objective function (integrability conditions would need to be satisfied, but this is difficult to check analytically). This lack of an objective function is perhaps the most important drawback of the generalized backfitting algorithm relative to other approximate maximum likelihood algorithms such as the variational algorithm. Fortunately, however, we find that the generalized backfitting algorithm always converges in practice and that it tends to increase the likelihood of the data monotonically at each iteration.

- Using the fact that E(e_m^(t)) = E(y_m^(t)) from proposition 1, we conjecture that if the limit point of the algorithm is used to form an estimator of the mean of the time-series data at each time step, denoted m̂^(t), then this estimator will be consistent (i.e., m̂^(t) converges in probability to E(y^(t))). Let θ_N^∞ be a limit point of the algorithm when N independent time series are used. Then m̂^(t) = E[Σ_{m=1}^M μ_m' q_m^(t) | θ = θ_N^∞]. In other words, although θ_N^∞ is not a maximum likelihood estimator (it may not maximize any objective function, and it is obtained by modeling the variance of e_m^(t) incorrectly), it will produce a consistent estimator of the mean function for the observed data. Unfortunately, we cannot prove this conjecture for the general case, though it can be proved for the case of K = 2 (each hidden Markov chain has two components) and T = 1 (time series of length 1), in which case iteratively fitting HMM models for each chain is equivalent to iteratively fitting mixtures of gaussian distributions.

- Finally, and as an aside, it may be worth noting that there could be important similarities between the generalized backfitting algorithm and the variational approximation proposed by Ghahramani and Jordan (1997). Interestingly, both learning procedures make use of the same set of error residuals {e_m^(t)}_{t=1}^T, m = 1, ..., M (see equations 9b and 12b of Ghahramani & Jordan, 1997). In addition, both procedures ignore correlations among state variables (similar to mean-field methods). It is difficult, however, to make concrete statements about the similarities and differences of the procedures because the fixed-point equations for the variational approximation involve auxiliary parameters of an approximating distribution that are not part of the original factorial hidden Markov model.

4 Simulation Results
This section reports simulation results using three artificial data sets; the observations in the first data set are real valued, and the observations in the remaining data sets are multinomial. When attempting to predict how well the generalized backfitting algorithm will work on these data sets, it is important to keep in mind that although the algorithm is new, its constituent parts are not. Methods such as the IRLS procedure, the backfitting procedure, and the Baum-Welch procedure have all been successfully used in scores of applications over a period of many years. Consequently, we predict that FHMMs trained with the generalized backfitting algorithm should show good performance.

4.1 Real-Valued Data Sets. When studying real-valued data sets, several algorithms exist for estimating the parameter values of an FHMM. Ghahramani and Jordan (1997) found that an EM algorithm worked best and that an approximate EM algorithm in which Gibbs sampling is used during the E-step and an approximate algorithm based on variational techniques each worked nearly as well. We conducted a set of simulations based on a related set of simulations conducted by Ghahramani and Jordan (1997). Training and test data were real valued and generated by an FHMM. We compared the training and test
set log-likelihoods of five models:

- HMM: an HMM trained using the Baum-Welch algorithm
- Exact: an FHMM trained using an exact EM algorithm
- SVA: an FHMM trained using the structured variational approximation of Ghahramani and Jordan
- GBF-γ: an FHMM trained using the GBF-γ algorithm
- GBF-V: an FHMM trained using the GBF-V algorithm

The FHMM that generated the data had M hidden Markov chains, each containing K components. All of its parameter values, except for the covariance matrices associated with the hidden Markov chains and the output covariance matrix, were sampled from a uniform [0, 1] distribution and, if needed, appropriately normalized to satisfy the sum-to-one constraint of probability distributions. The covariance matrices were set to a multiple of the identity matrix, C = 0.01I. The training and test sets consisted of 20 sequences of length 20, where the observable vector at each time step was a four-dimensional vector. For each randomly sampled set of parameters, a training and test set were generated, and each model was run once. Fifteen sets of parameters, and thus 15 training and test sets, were generated for each of four problem sizes: {M = 3, K = 2}, {M = 3, K = 3}, {M = 5, K = 2}, and {M = 5, K = 3}. The FHMMs trained to summarize the data consisted of M hidden Markov chains with K states each, whereas the HMMs had K^M states. Their parameter values were randomly initialized, except for the covariance matrices, which were set and fixed to 0.01I. Learning algorithms were run for 100 iterations (in the case of FHMMs trained with either of the generalized backfitting algorithms, a cycle is defined as five iterations of the Baum-Welch algorithm for each hidden Markov chain, and the algorithm was run for 20 cycles). At the end of training, the log-likelihoods on the training and test sets were computed for all models using the exact algorithm. The mean (± standard error of the mean) log-likelihood of each model on each of the four problem sizes is shown in Table 1.
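A sketch of how such training data might be generated. For simplicity it samples a scalar sequence according to equation 3.1 rather than the four-dimensional observables used in the experiments, and the parameter layout is our own assumption, not the paper's:

```python
import numpy as np

def sample_fhmm(M, K, T, mu, trans, init, sigma, rng):
    """Sample one real-valued sequence from an FHMM with M independent
    chains of K components: at each step the chains' contributions are
    summed and gaussian noise of variance sigma**2 is added (eq. 3.1).
    mu[m, k] is chain m's contribution when its component k is active;
    trans[m] is chain m's K x K transition matrix; init[m] its initial
    component distribution."""
    y = np.empty(T)
    states = np.array([rng.choice(K, p=init[m]) for m in range(M)])
    for t in range(T):
        if t > 0:  # each chain evolves independently of the others
            states = np.array([rng.choice(K, p=trans[m][states[m]])
                               for m in range(M)])
        y[t] = mu[np.arange(M), states].sum() + rng.normal(0.0, sigma)
    return y

# Parameters drawn uniformly and normalized to sum to one, as in the text.
rng = np.random.default_rng(1)
M, K, T = 3, 2, 20
mu = rng.uniform(size=(M, K))
trans = rng.uniform(size=(M, K, K))
trans /= trans.sum(axis=2, keepdims=True)
init = rng.uniform(size=(M, K))
init /= init.sum(axis=1, keepdims=True)
y = sample_fhmm(M, K, T, mu, trans, init, sigma=0.1, rng=rng)
```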
Also included in this table is the log-likelihood of the training and test sets under the true model that generated the data. Taken as a whole, the simulation results are highly compatible with the results reported by Ghahramani and Jordan. Model HMM tended to suffer from overfitting; its test set log-likelihood is significantly worse than its training set log-likelihood. This is evident for the smallest problem size and is extreme for the largest problem size. Among the models using FHMMs, model Exact tended to perform best. This result is unsurprising because it performs exact inference and maximum likelihood parameter estimation. Among the approximate models, models SVA, GBF-γ, and GBF-V showed similar levels of performance.
Table 1: Mean (± Standard Error of the Mean) Log-Likelihood of Each Model on the Four Problem Sizes for the Real-Valued Data Set.

M   K   Model    Training Set        Test Set
3   2   True     806.9 ± 22.6        797.2 ± 24.1
        HMM      544.0 ± 107.2       286.4 ± 151.3
        Exact    630.2 ± 103.0       592.8 ± 103.9
        SVA      604.3 ± 94.0        574.2 ± 93.5
        GBF-γ    644.1 ± 90.5        596.5 ± 91.5
        GBF-V    625.1 ± 90.3        582.8 ± 91.0

3   3   True     318.9 ± 14.0        319.0 ± 20.0
        HMM      356.4 ± 34.3        -1170.8 ± 110.7
        Exact    -73.5 ± 83.4        -175.0 ± 97.1
        SVA      -421.0 ± 78.5       -525.6 ± 92.1
        GBF-γ    -373.0 ± 57.4       -518.5 ± 69.9
        GBF-V    -424.9 ± 56.4       -555.5 ± 75.9

5   2   True     332.7 ± 30.8        336.8 ± 31.7
        HMM      172.0 ± 120.7       -1861.9 ± 233.8
        Exact    218.3 ± 73.4        148.0 ± 91.0
        SVA      -447.5 ± 166.1      -519.1 ± 174.5
        GBF-γ    -666.6 ± 206.1      -756.8 ± 208.2
        GBF-V    -722.6 ± 180.2      -812.1 ± 176.9

5   3   True     -335.9 ± 24.8       -327.7 ± 28.2
        HMM      205.9 ± 104.7       -9262.4 ± 606.4
        Exact    -973.5 ± 95.3       -1267.0 ± 120.9
        SVA      -1425.7 ± 115.8     -1656.0 ± 138.5
        GBF-γ    -1358.4 ± 122.5     -1554.9 ± 128.9
        GBF-V    -1407.2 ± 125.4     -1595.1 ± 135.0
4.2 Multinomial Data Sets. The variational techniques of Ghahramani and Jordan (1997) are specifically designed for real-valued time series and do not extend in an obvious way to other types of data. These investigators offered some speculative suggestions as to how their techniques might be extended but ultimately left this as an open research question. In contrast, the generalized backfitting algorithm can be regarded as general purpose, because it is easily applied to any sequential data set in which the distribution of the response variable is a member of the exponential family of distributions. We demonstrate this by considering multinomial time series.

Training and test data were generated by an FHMM as described above. We compared the training and test set log-likelihoods of two models: HMM, an HMM trained using the Baum-Welch algorithm, and GBF-γ, an FHMM trained using the GBF-γ algorithm. The FHMM that generated the data had M hidden Markov chains, each containing K components. All of its parameter values were sampled from a uniform [0, 1] distribution and, if needed, appropriately normalized to satisfy the sum-to-one constraint of probability distributions. The training and test sets consisted of 20 sequences of length 20. For the first multinomial data set, the response variable could
take one of four possible values at each time step; that is, the response variable was a four-dimensional vector with one component equal to one and the remaining three components set to zero. The response variable could take one of eight possible values at each time step in the second multinomial data set. For each randomly sampled set of parameters, a training and test set were generated, and each model was run once. Fifteen sets of parameters, and thus 15 training and test sets, were generated for each of four problem sizes: {M = 3, K = 2}, {M = 3, K = 3}, {M = 5, K = 2}, and {M = 5, K = 3}. Model HMM contained K^M states, whereas model GBF-γ was an FHMM with M hidden Markov chains with K states each. Their parameter values were randomly initialized, except for the covariance matrices of model GBF-γ, which were set and fixed to the identity matrix. In the case of model HMM, the Baum-Welch algorithm was run for 100 iterations. Model GBF-γ was run for 100 cycles, where each cycle consisted of one iteration of the Baum-Welch algorithm for each hidden Markov chain. At the end of training, the exact log-likelihoods on the training and test sets were computed for all models.

The mean (± standard error of the mean) log-likelihood of each model on each of the four problem sizes is shown in Table 2 for the data set with an alphabet size of four and in Table 3 for the data set with an alphabet size of eight. Also included in these tables are the log-likelihoods of the training and test sets under the true model that generated the data. On the data set with an alphabet size of four, models HMM and GBF-γ showed similar levels of performance on the test sets. In contrast, on the data set with an alphabet size of eight, model HMM tended to overfit the training data, and thus model GBF-γ achieved a significantly better level of performance on the test sets. This result suggests that FHMMs scale better than HMMs as the alphabet size of the data set increases.
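The scaling contrast between the two model classes can be made concrete: an HMM that matches an FHMM's joint state space needs K^M states (as the setup above notes), while the FHMM keeps M separate chains of K states each:

```python
# State-count comparison for the four problem sizes used in the experiments.
def hmm_states(M, K):
    """States needed by a flat HMM to represent the FHMM's joint state."""
    return K ** M

def fhmm_states(M, K):
    """Total states across the FHMM's M chains of K components each."""
    return M * K

for M, K in [(3, 2), (3, 3), (5, 2), (5, 3)]:
    print(M, K, hmm_states(M, K), fhmm_states(M, K))
# For (M, K) = (5, 3): 243 flat-HMM states versus 15 FHMM states.
```

The exponential-versus-linear gap is one reason FHMMs with relatively few parameters can retain a large representational capacity.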
Table 2: Mean (± Standard Error of the Mean) Log-Likelihood of Each Model on the Four Problem Sizes for the Multinomial Data Set Whose Alphabet Size Was Four.

M   K   Model    Training Set     Test Set
3   2   True     -531.4 ± 4.6     -530.1 ± 4.1
        HMM      -510.9 ± 5.1     -548.2 ± 3.3
        GBF-γ    -547.2 ± 8.4     -554.8 ± 11.1

3   3   True     -528.7 ± 6.0     -529.0 ± 5.7
        HMM      -497.3 ± 10.6    -566.2 ± 13.7
        GBF-γ    -534.1 ± 6.7     -543.3 ± 6.5

5   2   True     -543.4 ± 2.7     -544.6 ± 2.4
        HMM      -526.5 ± 3.6     -558.3 ± 3.8
        GBF-γ    -548.9 ± 2.4     -561.9 ± 3.4

5   3   True     -534.9 ± 6.3     -531.1 ± 7.5
        HMM      -514.9 ± 11.5    -554.9 ± 15.5
        GBF-γ    -542.8 ± 4.5     -549.1 ± 6.6
Table 3: Mean (± Standard Error of the Mean) Log-Likelihood of Each Model on the Four Problem Sizes for the Multinomial Data Set Whose Alphabet Size Was Eight.

M   K   Model    Training Set     Test Set
3   2   True     -805.9 ± 5.2     -809.0 ± 3.2
        HMM      -740.8 ± 5.1     -925.0 ± 9.7
        GBF-γ    -880.0 ± 10.9    -880.6 ± 8.6

3   3   True     -817.5 ± 2.0     -818.3 ± 2.1
        HMM      -612.6 ± 2.3     -1655.2 ± 43.9
        GBF-γ    -842.8 ± 7.5     -867.4 ± 9.8

5   2   True     -799.5 ± 4.2     -796.9 ± 4.6
        HMM      -310.4 ± 26.9    -2972.0 ± 174.2
        GBF-γ    -805.7 ± 3.3     -821.1 ± 5.7

5   3   True     -796.4 ± 5.4     -799.0 ± 4.9
        HMM      -566.8 ± 5.2     -1803.7 ± 40.9
        GBF-γ    -845.0 ± 16.4    -865.7 ± 21.2
Considering all simulation results as a whole, we conclude, as do Ghahramani and Jordan (1997), that FHMMs represent a potentially important advance in the modeling of sequential data. Due to their use of distributed representations, FHMMs with a relatively small number of parameters can have a large representational capacity. Consequently, FHMMs can often learn structure in data sets that conventional HMMs cannot. In addition, we conclude that the generalized backfitting algorithm is an effective method for training FHMMs that is applicable to a wide variety of data types. As emphasized above, the constituent parts of the algorithm are familiar techniques in the neural computation, machine learning, and statistics communities, and so the efficacy of the algorithm is unsurprising.

5 Summary and Conclusion
Previous researchers developed new learning architectures for sequential data by extending conventional hidden Markov models through the use of distributed state representations (Williams & Hinton, 1991; Ghahramani & Jordan, 1997). An advantage of these architectures is that they are efficient from a representational viewpoint. A disadvantage is that exact inference and parameter estimation in these architectures are computationally intractable. Ghahramani and Jordan (1997) showed, however, that approximate inference and parameter estimation in one such architecture, FHMMs, are feasible in certain circumstances. Unfortunately, the learning algorithm proposed by these investigators, based on variational techniques, is difficult to understand and implement and is limited to the study of real-valued data sets.
This article has proposed an alternative method for approximate inference and parameter estimation in FHMMs based on the perspective that FHMMs are a generalization of a well-known class of statistical models known as GAMs (Hastie & Tibshirani, 1990). Using existing statistical techniques for GAMs as a guide, we have developed the generalized backfitting algorithm. This algorithm computes customized error signals for each hidden Markov chain of an FHMM and then trains each chain one at a time using conventional techniques from the hidden Markov models literature.

Relative to previous perspectives on FHMMs, we believe that the viewpoint taken here has a number of advantages. First, it places FHMMs on firm statistical foundations by relating FHMMs to a class of models well studied in the statistics community, yet it generalizes this class of models in an interesting way. Second, it leads to an understanding of how FHMMs can be applied to many different types of time-series data, including Bernoulli and multinomial data, not just real-valued data. Finally, it leads to an effective learning procedure for FHMMs that is easier to understand and easier to implement than existing learning procedures. Simulation results suggest that FHMMs trained with the generalized backfitting algorithm are a practical and powerful tool for analyzing sequential data.

Although the study of sequential data has always been important, it has recently received renewed attention due to interest from relatively new scientific fields such as computational finance and bioinformatics. These fields make extensive use of conventional hidden Markov models due to their ease of use and attractive computational properties. We conjecture, however, that many problems that are currently approached through the use of HMMs may be better studied through the use of FHMMs trained with the generalized backfitting algorithm.
The application of FHMMs and the generalized backfitting algorithm to real-world problems in the area of bioinformatics is our current focus of research.

Acknowledgments
We are grateful to Z. Ghahramani for making software available on his web site that was useful in this project and to the anonymous reviewers for their helpful comments on this manuscript. This work was supported by NIH research grant R01-EY13149 and NSF grant DMS-0102636.

References

Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.

Boyen, X., & Koller, D. (1999). Approximate learning of dynamic models. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Dean, T., & Kanazawa, K. (1988). Probabilistic temporal reasoning. In Proceedings of the Seventh National Conference on Artificial Intelligence. St. Paul, MN: AAAI Press.

Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245–273.

Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman and Hall.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. (1998). An introduction to variational methods for graphical models. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.

McCullagh, P., & Nelder, J. (1989). Generalized linear models. London: Chapman and Hall.

Murphy, K., & Weiss, Y. (2001). The factored frontier algorithm for approximate inference in DBNs. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann.

Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall.

Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate temporally distorted strings. In D. Touretzky, J. Elman, & G. Hinton (Eds.), Connectionist models: Proceedings of the 1990 Summer School. San Francisco: Morgan Kaufmann.

Received October 2, 2001; accepted April 17, 2002.
LETTER
Communicated by Manfred Opper
Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities

Aki Vehtari
Aki.Vehtari@hut.
Jouko Lampinen
Jouko.Lampinen@hut.
Laboratory of Computational Engineering, Helsinki University of Technology, FIN-02015, HUT, Finland

In this work, we discuss practical methods for the assessment, comparison, and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate because it describes the uncertainty in the estimate. The distributions of the expected utility estimates can also be used to compare models, for example, by computing the probability of one model having a better expected utility than some other model. We propose an approach using cross-validation predictive densities to obtain expected utility estimates and Bayesian bootstrap to obtain samples from their distributions. We also discuss the probabilistic assumptions made and properties of two practical cross-validation methods, importance sampling and k-fold cross-validation. As illustrative examples, we use multilayer perceptron neural networks and gaussian processes with Markov chain Monte Carlo sampling in one toy problem and two challenging real-world problems.
1 Introduction
Whatever way the model building and the selection have been done, the goodness of the final model should be assessed in order to find out whether it is useful in a given problem. Cross-validation methods for model assessment and comparison have been proposed by several authors (for early accounts, see Stone, 1974, and Geisser, 1975, and for more recent reviews, see Gelfand, Dey, & Chang, 1992, and Shao, 1993). The cross-validation predictive density dates at least to Geisser and Eddy (1979), and a review of cross-validation and other predictive densities appears in Gelfand and Dey (1994) and Gelfand (1996). Bernardo and Smith (1994) also discuss briefly how cross-validation approximates the formal Bayes procedure of computing the expected utilities.

Neural Computation 14, 2439–2468 (2002) © 2002 Massachusetts Institute of Technology
We synthesize and extend the previous work in many ways. We give a unified presentation from the Bayesian viewpoint emphasizing the assumptions made and propose practical methods to obtain the distributions of the expected utility estimates. We review expected utilities in section 1.1 and cross-validation predictive densities in section 1.2 and discuss assumptions made about the future data distribution in section 1.3. We discuss the properties of two practical methods: importance sampling leave-one-out in section 2.1 and k-fold cross-validation in section 2.2. We propose a quick and generic approach based on the Bayesian bootstrap for obtaining samples from the distributions of the expected utility estimates in section 2.3. If there is a collection of models under consideration, the distributions of the expected utility estimates can also be used for comparison (section 2.4). We also discuss the relation of the proposed method to prior predictive densities and Bayes factors in section 3.1 and to posterior predictive densities in section 3.2.

Because the estimation of the expected utilities requires a full model fitting (or k model fittings) for each model candidate, the approach is useful only when selecting among a few models. If we have many model candidates, for example, when doing variable selection, we can use some other method, such as the variable-dimension Markov chain Monte Carlo (MCMC) methods (Green, 1995; Carlin & Chib, 1995; Stephens, 2000), for model selection and still use the expected utilities for final model assessment. This approach is discussed in more detail by Vehtari (2001). We have tried to follow the notation of Gelman, Carlin, Stern, and Rubin (1995) and assume that readers have a basic knowledge of Bayesian methods (for a short introduction, see Lampinen & Vehtari, 2001). To illustrate the discussion, we use multilayer perceptron (MLP) networks and gaussian processes (GP) with MCMC in one toy problem and two real-world problems (in section 4).
1.1 Expected Utilities. In prediction and decision problems, it is natural to assess the predictive ability of the model by estimating the expected utilities (Good, 1952; Bernardo & Smith, 1994). Utility measures the relative values of consequences. By using application-specific utilities, the expected benefit or cost of using the model for predictions or decisions (e.g., by financial criteria) can be readily computed. In the absence of application-specific utilities, many general discrepancy and likelihood utilities can be used. The posterior predictive distribution of output y for the new input x^{(n+1)} given the training data D = {(x^{(i)}, y^{(i)}); i = 1, 2, ..., n} is obtained by integrating the predictions of the model with respect to the posterior distribution of the model,
p(y | x^{(n+1)}, D, M) = \int p(y | x^{(n+1)}, \theta, D, M) p(\theta | D, M) d\theta,   (1.1)
Bayesian Model Assessment and Comparison
where \theta denotes all the model parameters and hyperparameters of the prior structures and M is all the prior knowledge in the model specification (including all the implicit and explicit prior specifications). If the predictions are independent of the training data given the parameters of the model (e.g., in parametric models), then p(y | x^{(n+1)}, \theta, D, M) = p(y | x^{(n+1)}, \theta, M). We would like to estimate how good our model is by estimating how good its predictions (i.e., the predictive distributions) are for future observations from the same process that generated the set of training data D. The goodness of the predictive distribution p(y | x^{(n+h)}, D, M) can be measured by comparing it to the actual observation y^{(n+h)} with the utility u,

u_h = u(y^{(n+h)}, x^{(n+h)}, D, M).   (1.2)
The goodness of the whole model can then be summarized by computing some summary quantity of the distribution of the u_h's over all future samples (h = 1, 2, ...), for example, the mean

\bar{u} = E_h[u_h]   (1.3)

or an \alpha-quantile

\bar{u}_\alpha = Q_{\alpha,h}[u_h].   (1.4)
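As a concrete illustration, the summaries of equations 1.3 and 1.4 can be computed from a batch of utility values. This is only a sketch: the utilities below are synthetic absolute errors, not the output of any model discussed in this article.

```python
import numpy as np

# Synthetic utilities standing in for |prediction error| values from a model.
rng = np.random.default_rng(10)
u = np.abs(rng.normal(0.0, 1.0, size=5000))

# Equation 1.3: the mean of the utilities over the samples.
u_mean = u.mean()

# Equation 1.4 with alpha = 90%: roughly 90% of predictions have an
# error smaller than this value, which makes the quantile easy to interpret.
u_q90 = np.quantile(u, 0.90)
```

The 90% quantile of an absolute-error utility is often easier to communicate than the mean, as the text notes below for equation 1.6.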
We call all such summary quantities the expected utilities of the model. Note that considering the expected utility just for the next sample (or a one-time decision) and taking the expectation over the distribution of x^{(n+1)} is equivalent to taking the expectation over all future samples. Preferably, the utility u would be application specific, measuring the expected benefit or cost of using the model. For simplicity, we mention here some general utilities. Both the square error,

u_h = (E_y[y | x^{(n+h)}, D, M] - y^{(n+h)})^2,   (1.5)

and the absolute error,

u_h = abs(E_y[y | x^{(n+h)}, D, M] - y^{(n+h)}),   (1.6)
measure the accuracy of the expectation of the predictive distribution, but the absolute error is more easily understandable, especially when summarized using a quantile (e.g., \alpha = 90%), because most of the predictions will have an error less than the given value. The predictive likelihood measures how well the model models the predictive distribution,

u_h = p(y^{(n+h)} | x^{(n+h)}, D, M),   (1.7)
and it is especially useful in model comparison (see section 2.4) and in nonprediction problems, in which the goal is to get scientific insights into the modeled phenomena. Maximization of the expected predictive likelihood corresponds to minimization of the information-theoretic Kullback-Leibler (KL) divergence between the model and the unknown distribution of the data, and equivalently, it corresponds to maximization of the expected information gained (Bernardo, 1979). An application-specific utility may measure the expected benefit or cost. For simplicity, we use the term utility also for costs, although a better word would be risk. Also, instead of negating cost, we represent the utilities in the form that is most appealing for the application expert. It should be obvious in each case whether a smaller or larger value for the utility is better. 1.2 Cross-Validation Predictive Densities. Because the future observations (x^{(n+h)}, y^{(n+h)}) are not yet available, we have to approximate the expected utilities by reusing samples we already have. We assume that the future distribution of the data (x, y) is stationary and that it can be reasonably well approximated using the (weighted) training data {(x^{(i)}, y^{(i)}); i = 1, 2, ..., n}. In the case of conditionally independent observations, to simulate the fact that the future observations are not in the training data, the ith observation (x^{(i)}, y^{(i)}) in the training data is left out, and then the predictive distribution for y^{(i)} is computed with a model that is fitted to all of the observations except (x^{(i)}, y^{(i)}). By repeating this for every point in the training data, we get a collection of leave-one-out cross-validation (LOO-CV) predictive densities,

{p(y | x^{(i)}, D^{(\i)}, M); i = 1, 2, ..., n},   (1.8)
where D^{(\i)} denotes all the elements of D except (x^{(i)}, y^{(i)}). To get the LOO-CV predictive density estimated expected utilities, these predictive densities are compared to the actual y^{(i)}'s using utility u and summarized, for example, with the mean,

\bar{u}_{LOO} = E_i[u(y^{(i)}, x^{(i)}, D^{(\i)}, M)].   (1.9)
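The bookkeeping behind equations 1.8 and 1.9 can be sketched in a few lines when refitting is cheap. The "model" below is a deliberately trivial mean predictor on synthetic data, chosen only so the refit is a one-liner; only the leave-one-out pattern is faithful to the text.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=50)  # synthetic observations

loo_utils = []
for i in range(len(y)):
    y_rest = np.delete(y, i)       # D^(\i): all observations except the i-th
    pred = y_rest.mean()           # "refit" the stand-in model and predict
    loo_utils.append((pred - y[i]) ** 2)  # squared-error utility (eq. 1.5)

# Equation 1.9 with the mean as the summary quantity.
u_loo = np.mean(loo_utils)
```

For a real Bayesian model, each refit would be a full posterior sampling, which is exactly the cost that sections 2.1 and 2.2 aim to reduce.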
If the future distribution of x is expected to be different from the distribution of the training data, the observations could be weighted appropriately (demonstrated in section 4.2). With appropriate modifications to the algorithm, the cross-validation predictive densities can also be computed for data with finite-range dependencies (see sections 2.2 and 4.3). The LOO-CV predictive densities are computed by the equation (compare to equation 1.1)

p(y | x^{(i)}, D^{(\i)}, M) = \int p(y | x^{(i)}, \theta, D^{(\i)}, M) p(\theta | D^{(\i)}, M) d\theta.   (1.10)

For simple models, the LOO-CV predictive densities may be computed quickly using analytical solutions (Shao, 1993; Orr, 1996; Peruggia, 1997),
but models that are more complex usually require a full model fitting for each of the n predictive densities. When using Monte Carlo methods, this means that we have to sample from p(\theta | D^{(\i)}, M) for each i, which would normally take n times the time of sampling from the full posterior. If sampling is slow (e.g., when using MCMC methods), the importance sampling LOO-CV (IS-LOO-CV) discussed in section 2.1 or the k-fold-CV discussed in section 2.2 can be used to reduce the computational burden. 1.3 On Assumptions Made on Future Data Distribution. In this section, we briefly discuss the assumptions made on the future data distribution in the approach described in this work and in related approaches (Rasmussen et al., 1996; Neal, 1998; Dietterich, 1998; Nadeau & Bengio, 2000), where the goal is to compare (not assess) the performance of methods (algorithms) instead of single models conditioned on the given training data. Assume that the training data D have been produced from the distribution V. In model assessment and comparison, we condition the results on the given realization of the training data D and assume that the future data for which we want to make predictions come from the same distribution as the training data, that is, V (see section 1.1). The method comparison approaches try to answer the question: "Given two methods A and B and training data D, which method (algorithm) will produce a more accurate model when trained on new training data of the same size as D?" (Dietterich, 1998). In probabilistic terms, the predictive distribution of output for every new input in the future is (compare to equation 1.1)
p(y | x^{(n+h)}, D*_h, M) = \int p(y | x^{(n+h)}, \theta, D*_h, M) p(\theta | D*_h, M) d\theta,   (1.11)
where D*_h is the new training data of the same size as D. Although not explicitly stated in the question, all the approaches have assumed that D*_h can be approximated using the training data D, that is, that D*_h comes from the distribution V. The method comparison approaches use various resampling, cross-validation, and data-splitting methods to produce proxies for D*_h. The reuse of training samples is more difficult than in model comparison because the proxies should be as independent as possible in order to be able to estimate well the variability due to a random choice of training data. Because the goal of the method comparison is methodological research and not solving a real problem, it is useful to choose problems with large data sets, from which it is possible to select several independent training and test data sets of various sizes (Rasmussen et al., 1996; Neal, 1998). Note that after the method has been chosen and a model has been produced for a real problem, there is still a need to assess the performance of the model. When solving a real problem, is there a need to retrain the model on new training data of the same size and from the same distribution as D? This kind
of situation would rarely appear in practical applications, as it would mean that for every prediction, we would use new training data and previously used training data would be discarded. If the new training data come from the same distribution as the old training data, we could just combine the data and reestimate the expected utilities. The performance of the model with additional training data could be estimated roughly before getting those data, but it may be difficult because of the difficulties in estimating the shape of the learning curve. We might want to discard the old training data if we assume that the future data come from some other distribution V+ and the new training data D+ would come from that distribution too. Uncertainty due to using new training data could be estimated in the same way as in the method comparison approaches, but in order to estimate how well the results will hold in the new domain, we should be able to quantify the difference between the distributions V and V+. If we do not assume anything about the distribution V+, we cannot predict the behavior of the model in a new domain, as stated by the no-free-lunch theorem (Wolpert, 1996a, 1996b). Even if the distributions V and V+ have only a few dimensions, it is very hard to quantify the differences and estimate their effect on the expected utilities. If the applications are similar (e.g., a paper mill and a cardboard mill), it may be possible for an expert to give a rough estimate of the model performance in the new domain (it is probably easier to estimate the relative performance of two models than the absolute performance of a single model). In this case, it would also be possible to use information from the old domain as the basis for a prior in the new domain (Spiegelhalter, Myles, Jones, & Abrams, 2000). 2 Methods
In this section, we discuss the importance sampling leave-one-out (in section 2.1) and k-fold cross-validation (in section 2.2). We also propose an approach for obtaining samples from the distribution of the expected utility estimates (in section 2.3) and discuss model comparison based on expected utilities (in section 2.4). 2.1 Importance Sampling Leave-One-Out Cross-Validation. In IS-LOO-CV, instead of sampling directly from p(\theta | D^{(\i)}, M), samples \theta_j from the full posterior p(\theta | D, M) are reused. The additional computation time in IS-LOO-CV compared to sampling from the full posterior distribution is negligible. If we want to estimate the expectation of a function h(\theta) and have samples \theta_j from distribution g(\theta), we can write the expectation as

E(h(\theta)) = \int h(\theta) f(\theta) d\theta = \int h(\theta) \frac{f(\theta)}{g(\theta)} g(\theta) d\theta,   (2.1)
and approximate it with the Monte Carlo method,

E(h(\theta)) \approx \frac{\sum_{j=1}^{L} h(\theta_j) w(\theta_j)}{\sum_{j=1}^{L} w(\theta_j)},   (2.2)
where the factors w(\theta_j) = f(\theta_j) / g(\theta_j) are called importance ratios or importance weights. (See Geweke, 1989, for the conditions of the convergence of the importance sampling estimates.) The quality of the importance sampling estimates depends heavily on the variability of the importance sampling weights, which depends on how similar f(\theta) and g(\theta) are. A new idea in Gelfand et al. (1992) and Gelfand (1996) was to use the full posterior as the importance sampling density for the leave-one-out posterior densities. By drawing samples {y_j; j = 1, ..., m} from p(y | x^{(i)}, D^{(\i)}, M), we can calculate the Monte Carlo approximation of the expectation:

E_y[g(y) | x^{(i)}, D^{(\i)}, M] \approx \frac{1}{m} \sum_{j=1}^{m} g(y_j).   (2.3)
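The self-normalized estimator of equations 2.1 and 2.2 can be sketched directly. The target f and proposal g below are arbitrary gaussians chosen for illustration, not the posteriors discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 20000
theta = rng.normal(0.0, 1.5, size=L)  # samples from the proposal g = N(0, 1.5^2)

# Log importance ratios log f - log g, for target f = N(1, 1); the
# normalizing constants cancel in the self-normalized estimator.
log_w = (-0.5 * (theta - 1.0) ** 2) - (-0.5 * (theta / 1.5) ** 2)
w = np.exp(log_w - log_w.max())       # subtract the max for numerical stability

# Equation 2.2 with h(theta) = theta, so the true value is E_f[theta] = 1.
estimate = np.sum(theta * w) / np.sum(w)
```

Because g has heavier tails than f here, the weights are well behaved; the reverse choice would illustrate the weight-variability problem discussed below.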
If \theta_{ij} is a sample from p(\theta | D^{(\i)}, M) and we draw y_j from p(y | x^{(i)}, \theta_{ij}, M), then y_j is a sample from p(y | x^{(i)}, D^{(\i)}, M). If \theta_j is a sample from p(\theta | D, M), then samples \theta_{ij} can be obtained by resampling \theta_j using importance resampling with weights

w_j^{(i)} = \frac{p(\theta_j | D^{(\i)}, M)}{p(\theta_j | D, M)} \propto \frac{1}{p(y^{(i)} | x^{(i)}, \theta_j, D^{(\i)}, M)}.   (2.4)
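For conditionally independent observations, equation 2.4 says the weight of posterior sample \theta_j for the ith leave-one-out posterior is proportional to the reciprocal of the per-point likelihood. A sketch, with a synthetic likelihood matrix standing in for p(y^{(i)} | x^{(i)}, \theta_j, M):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 1000, 20  # posterior samples x data points

# Hypothetical per-sample, per-point likelihoods p(y_i | x_i, theta_j);
# in practice these come from the fitted model's likelihood evaluations.
lik = rng.lognormal(mean=-1.0, sigma=0.5, size=(m, n))

w = 1.0 / lik                          # unnormalized IS-LOO weights (eq. 2.4)
w = w / w.sum(axis=0, keepdims=True)   # normalize over samples for each i
```

The normalized columns of `w` are exactly what the effective-sample-size diagnostic of equation 2.5 consumes.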
The reliability of the importance sampling can be estimated by examining the expected variability of the importance weights. For simple models, this may be computed analytically (Peruggia, 1997), but if analytical solutions are inapplicable, we have to estimate this from the weights obtained. It is customary to examine the distribution of weights with various plots (Newton & Raftery, 1994; Gelman et al., 1995; Peruggia, 1997). We prefer plotting the cumulative normalized weights (see the examples in section 4.1). As we get n such plots for IS-LOO-CV, it would be useful to be able to summarize the quality of importance sampling for each i with just one value. For this, we use a heuristic measure of effective sample sizes based on an approximation of the variance of importance weights computed as

m_{eff}^{(i)} = 1 / \sum_{j=1}^{m} (w_j^{(i)})^2,   (2.5)

where the w_j^{(i)} are normalized weights (Kong, Liu, & Wong, 1994; Liu, 2001). We propose to examine the distribution of the effective sample sizes by checking the minimum and some quantiles and by plotting m_{eff}^{(i)} in increasing order (see the examples in section 4). Note that a small variance estimate of the obtained sample weights does not guarantee that importance sampling is giving the correct answer. On the other hand, the same problem applies to any variance or convergence diagnostics method based on finite samples of any indirect Monte Carlo method (Neal, 1993; Robert & Casella, 1999). Even in simple models like the Bayesian linear model, leaving one very influential data point out may change the posterior so much that the variance of the weights is very large or infinite (Peruggia, 1997). Moreover, even if leave-one-out posteriors are similar to the full posterior, importance sampling in high dimensions suffers from large variation in importance weights (MacKay, 1998). Flexible nonlinear models like MLP usually have a high number of parameters and a large number of degrees of freedom (all data points may be influential). We demonstrate in section 4.1 a simple case where IS-LOO-CV works well for flexible nonlinear models and in section 4.2 a case that is more difficult and where IS-LOO-CV fails. In section 4.3, we illustrate that the importance sampling does not work if the data have such dependencies that several samples have to be left out at a time. In some cases, the use of importance link functions (ILF) (MacEachern & Peruggia, 2000) might improve the importance weights substantially. The idea is to use transformations that bring the importance sampling distribution closer to the desired distribution. (See MacEachern & Peruggia, 2000, for an example of computing case-deleted posteriors for a Bayesian linear model.) For complex models, it may be difficult to find good transformations, but the approach seems to be quite promising. If n is moderate, it might also be viable to use adaptive importance sampling methods (Zhang, 1996) for each i.
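The diagnostic of equation 2.5 is easy to compute once the weights are normalized. The raw weights below are synthetic stand-ins for the IS-LOO-CV ratios of equation 2.4; the uniform case shows the diagnostic's upper bound.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 1000
# Hypothetical raw importance weights for one left-out point.
raw_w = rng.lognormal(mean=0.0, sigma=1.0, size=m)

w = raw_w / raw_w.sum()            # normalize so the weights sum to one
m_eff = 1.0 / np.sum(w ** 2)       # equation 2.5

# Sanity check: equal weights give the maximal effective sample size, m.
uniform = np.full(m, 1.0 / m)
m_eff_uniform = 1.0 / np.sum(uniform ** 2)
```

A value of `m_eff` far below `m` flags the unreliable leave-one-out posteriors that the text proposes checking with the minimum and quantiles.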
If there is reason to suspect the reliability of the importance sampling, we suggest using the predictive densities from the k-fold-CV, discussed in section 2.2.
2.2 k-Fold Cross-Validation. In k-fold-CV, instead of sampling from n leave-one-out distributions p(\theta | D^{(\i)}, M), we sample only from k (e.g., k = 10) k-fold-CV distributions p(\theta | D^{(\s(i))}, M), and then the k-fold-CV predictive densities are computed by the equation (compare to equations 1.1 and 1.10)

p(y | x^{(i)}, D^{(\s(i))}, M) = \int p(y | x^{(i)}, \theta, D^{(\s(i))}, M) p(\theta | D^{(\s(i))}, M) d\theta,   (2.6)
where s(i) is a set of data points defined as follows: the data are divided into k groups so that their sizes are as nearly equal as possible, and s(i) is the set of data points in the group to which the ith data point belongs. Approximately n/k data points are left out at a time, and thus, if k ≪ n, the computational savings are considerable. Since the k-fold-CV predictive densities are based on smaller training data sets than the full data set, the expected utility estimate,

\bar{u}_{CV} = E_i[u(y^{(i)}, x^{(i)}, D^{(\s(i))}, M)],   (2.7)

is biased. This bias has usually been ignored, maybe because k-fold-CV has been used mostly in model (method) comparison, where biases effectively cancel out if the models (methods) being compared have similarly steep learning curves. However, in the case of learning curves of different steepness and in model assessment, this bias should not be ignored. To get more accurate results, the bias-corrected expected utility estimate \bar{u}_{CCV} can be computed by using a less well-known first-order bias correction (Burman, 1989):

\bar{u}_{tr} = E_i[u(y^{(i)}, x^{(i)}, D, M)]   (2.8)

\bar{u}_{cvtr} = E_j[E_i[u(y^{(i)}, x^{(i)}, D^{(\s_j)}, M)]];   j = 1, ..., k   (2.9)

\bar{u}_{CCV} = \bar{u}_{CV} + \bar{u}_{tr} - \bar{u}_{cvtr},   (2.10)
where \bar{u}_{tr} is the expected utility evaluated with the training data given the training data, that is, the training error or the expected utility computed with the posterior predictive densities (see section 3.2), and \bar{u}_{cvtr} is the average of the expected utilities evaluated with the training data given the k-fold-CV training sets. The correction term can be computed by using samples from the full posterior and the k-fold-CV posteriors, and no additional sampling is required. Although the bias can be corrected when k gets smaller, the disadvantage of a small k is the increased variance of the expected utility estimate. The variance increases with smaller k for the following reasons: the k-fold-CV training data sets are worse proxies for the full training data, there are more ways to divide the training data randomly but they are divided in just one way, and the variance of the bias correction increases. Values of k between 8 and 16 seem to give a good balance between increased accuracy and increased computational load. In LOO-CV (k = n), the bias is usually negligible unless n is very small. (See the discussion in the next section and some related discussion in Burman, 1989.) We demonstrate in section 4.1 a simple case where the IS-LOO-CV and the (bias-corrected) k-fold-CV give equally good results and in section 4.2 a more difficult case where the k-fold-CV works well and the IS-LOO-CV fails. In section 4.3, we demonstrate a case where k-fold-CV works but IS-LOO-CV fails, since group dependencies in the data require leaving groups of data out at a time. For time series with unknown finite-range dependencies, the k-fold-CV can be combined with the h-block-CV proposed by Burman, Chow, and Nolan (1994). Instead of just leaving the ith point out, a block of h cases from either side of the ith point is removed from the training data for the ith point. The value of h depends on the dependence structure, and it could be estimated, for example, from autocorrelations. The approach could also be applied in other models with finite-range dependencies (e.g., some spatial models) by removing a block of h cases from around the ith point. When more than one data point is left out at a time, importance sampling probably does not work, and either full h-block-CV or k-fold-h-block-CV should be used. 2.3 Distribution of the Expected Utility Estimate. To assess the reliability of the expected utility estimate, we estimate its distribution. Let us first ignore the variability due to Monte Carlo integration and consider the variability due to the approximation of the future data distribution with a finite number of training data points. We are trying to estimate the expected utilities given the training data D, but the cross-validation predictive densities p(y | x^{(i)}, D^{(\s_j)}, M) are based on training data sets D^{(\s_j)}, which are each slightly different. This makes the u_i's slightly dependent in a way that will increase the estimate of the variability of \bar{u}. In the case of LOO-CV, this increase is negligible (unless n is very small), and in the case of k-fold-CV, it is practically negligible with reasonable values of k (illustrated in section 4.1). If in doubt, this increase could be estimated as mentioned in section 4.1. (See also the comments in the next section.) If the utilities u_i are summarized with the mean
\bar{u} = E_i[u_i],   (2.11)
a simple approximation would be to assume the u_i's to have an approximately gaussian distribution (described by the mean and the variance) and to compute the variance of the expected utility estimate as (Breiman, Friedman, Olshen, & Stone, 1984)

Var[\bar{u}] = Var_i[u_i] / n.   (2.12)
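Equation 2.12 in code, with synthetic per-point utilities standing in for utilities computed from a real model:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 400
# Hypothetical per-point squared-error utilities.
u_i = rng.normal(0.0, 0.5, size=n) ** 2

u_bar = u_i.mean()                  # the expected utility estimate (eq. 2.11)
var_u_bar = u_i.var(ddof=1) / n     # equation 2.12
se_u_bar = np.sqrt(var_u_bar)       # standard error of u_bar
```

Note that the u_i's here are visibly nongaussian (squares of gaussians), which is exactly the situation where the text warns this approximation may be rough.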
Of course, the distribution of the u_i's is not necessarily gaussian, but still this (or a more robust variance estimate based on quantiles) is an adequate approximation in many cases. A variation of this, applicable in the k-fold-CV case, is that first the mean expected utility \bar{u}_j for each of the k folds is computed, and then the variance of the expected utility estimate is computed as (Dietterich, 1998)

Var[\bar{u}] \approx Var_j[\bar{u}_j] / k.   (2.13)
Here, the distribution of the \bar{u}_j's tends to be closer to gaussian (due to the central limit theorem), but a drawback is that this estimator has a larger variance than the estimator of equation 2.12. If the summary quantity is other than the mean (e.g., an \alpha-quantile) or the distribution of the u_i's is far from gaussian, the above approximations may fail. In addition, the above approximations ignore the uncertainty in the estimates of the u_i's due to the Monte Carlo error. We propose a quick and generic approach based on the Bayesian bootstrap (BB) (Rubin, 1981), which can handle the variability due to Monte Carlo integration, bias correction estimation, and the approximation of the future data distribution, as well as arbitrary summary quantities, and also gives a good approximation in the case of nongaussian distributions. The BB makes a simple nonparametric approximation to the distribution of a random variable. Having samples z_1, ..., z_n of a random variable Z, it is assumed that the posterior probabilities for the z_i have the Dirichlet distribution Di(1, ..., 1) (Gelman et al., 1995) and that values of Z that are not observed have zero posterior probability. Sampling from the Dirichlet distribution gives BB samples from the distribution of the distribution of Z, and thus samples of any parameter of this distribution can be obtained. For example, with w = E[Z], for each BB sample b we calculate the mean of Z as if g_{i,b} were the probability that Z = z_i; that is, we calculate w_b = \sum_{i=1}^{n} g_{i,b} z_i. The distribution of the values w_b; b = 1, ..., B is the BB distribution of the mean E[Z]. (See Lo, 1987; Weng, 1989; and Mason & Newton, 1992, for some important properties of the BB.) The assumption that all possible distinct values of Z have been observed is usually wrong, but with moderate n and not very thick tailed distributions, inferences should not be very sensitive to this unless extreme tail areas are examined.
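A minimal sketch of the Bayesian bootstrap for the distribution of a mean, as described above; the data z are synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.normal(2.0, 1.0, size=100)   # observed samples z_1, ..., z_n
B = 4000                             # number of BB draws

# B draws of Dirichlet(1, ..., 1) probabilities over the observed z_i.
g = rng.dirichlet(np.ones(len(z)), size=B)

# w_b = sum_i g_{i,b} z_i for each draw b: the BB distribution of E[Z].
bb_means = g @ z

# Any summary of that distribution is now available, e.g. a 95% interval.
ci_lo, ci_hi = np.quantile(bb_means, [0.025, 0.975])
```

The same weight draws can be reused for any other summary quantity (a quantile of Z, say), which is the generality the text emphasizes.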
If in doubt, we could use a more complex model (e.g., a mixture model) that would smooth the probabilities (discarding also the assumption of a priori independent probabilities). Of course, fitting the parameters of the more complex model would require extra work, and it still may be hard to model the tail of the distribution well. To get samples from the distribution of the expected utility estimate \bar{u}, we first sample from the distributions of each u_i (variability due to Monte Carlo integration) and then from the distribution of \bar{u} (variability due to the approximation of the future data distribution). From the obtained samples, it is easy to compute, for example, credible intervals (CI), highest probability density intervals (HPDI; see Chen, Shao, & Ibrahim, 2000), histograms, and kernel density estimates. Note that the variability due to Monte Carlo integration can be reduced by sampling more Monte Carlo samples, but this can sometimes be computationally too expensive. If the variability due to Monte Carlo integration is negligible, samples from the distributions of each u_i could be replaced by the expectations of the u_i. To simplify computations (and save storage space), we have used thinning to get near-independent MCMC samples (estimated by autocorrelations; Neal, 1993; Chen et al., 2000). However, if the MCMC samples were highly dependent, we could use dependent weights in the BB (Künsch, 1989, 1994). 2.4 Model Comparison with Expected Utilities. The distributions of the expected utility estimates can be used for comparing different models. The difference of the expected utilities of two models M_1 and M_2 is
\bar{u}_{M_1 - M_2} = E_i[u_{M_1,i} - u_{M_2,i}].   (2.14)
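Sampling the difference of equation 2.14 with shared Dirichlet weights (the analogue of using the same random number generator seed for both models) can be sketched as follows; the per-point predictive log-likelihoods are synthetic stand-ins for real model output.

```python
import numpy as np

rng = np.random.default_rng(12)
n, B = 80, 2000
# Hypothetical per-point predictive log-likelihood utilities for two models.
u_m1 = rng.normal(-1.0, 0.3, size=n)
u_m2 = rng.normal(-1.2, 0.3, size=n)

# One set of Dirichlet weights applied to both models, so the pairing
# over i is preserved and the difference is taken pointwise.
g = rng.dirichlet(np.ones(n), size=B)
diff_samples = g @ (u_m1 - u_m2)         # BB samples of u_bar_{M1 - M2}

p_m1_better = np.mean(diff_samples > 0)  # estimate of p(u_bar_{M1-M2} > 0)
```

Beyond the probability, the spread of `diff_samples` shows whether a "significant" difference is also practically non-negligible.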
If the variability due to Monte Carlo integration is assumed negligible and a gaussian approximation is used for the distributions of the expected utility estimates (see equation 2.12 or equation 2.13), the probability p(\bar{u}_{M_1 - M_2} > 0) can be computed analytically. With the Bayesian bootstrap, we can sample directly from the distribution of the differences, or if the same random number generator seed has been used for both models when sampling over i (the variabilities due to the Monte Carlo integrations are independent, but the variabilities due to the approximations of the future data distribution are dependent through i), we can get samples from the distribution of the difference of the expected utility estimates as

\bar{u}_{(M_1 - M_2),b} = \bar{u}_{M_1,b} - \bar{u}_{M_2,b}.   (2.15)
Then we can, for example, plot the distribution of \bar{u}_{M_1 - M_2} or compute the probability p(\bar{u}_{M_1 - M_2} > 0). Following the simplicity postulate (parsimony principle), it is useful to start from simpler models and then test whether a more complex model would give significantly better predictions. (See the discussion of the simplicity postulate in Jeffreys, 1961.) Although possible overestimation of the variability due to the training sets' being slightly different (see the previous section) makes these comparisons slightly conservative, the error is small, and in model choice, it is better to be conservative than too optimistic. An extra advantage of comparing the expected utilities is that even if there is a high probability that one model is better, it might be determined that the difference between the expected utilities is still practically negligible. For example, it is possible that using a statistically better model would save only a negligible amount of money. The expected predictive densities have an important relation to Bayes factors, which are commonly used in Bayesian model comparison. If the utility u is the predictive log-likelihood and (mean) expected utilities are computed by using the cross-validation predictive densities, then

PsBF(M_1, M_2) \equiv \prod_{i=1}^{n} \frac{p(y^{(i)} | x^{(i)}, D^{(\i)}, M_1)}{p(y^{(i)} | x^{(i)}, D^{(\i)}, M_2)} = \exp(n \bar{u}_{M_1 - M_2}),   (2.16)
where PsBF stands for pseudo-Bayes factor (Geisser & Eddy, 1979; Gelfand, 1996). Because we are interested in the performance of predictions for an unknown number of future samples, we like to report scaled PsBF by taking the nth root to get a ratio of “mean” predictive likelihoods (see the examples in section 4). Note that previously, only point estimates for PsBF have been used, but with the proposed approach, it is also possible to compute the distribution of the PsBF estimate. Because the method we have proposed is based on numerous approximations and assumptions, the results in the model comparison should be applied with care when making decisions. It should also be remembered that “selecting a single model is always a complex procedure involving background knowledge and other factors as the robustness of inferences to alternative models with similar support” (Spiegelhalter, Best, & Carlin, 1998, p. 3). 3 Relations to Other Predictive Approaches
In this section, we discuss the relations of the cross-validation predictive densities to prior predictive densities and Bayes factors (section 3.1) and posterior predictive densities (section 3.2). (See Vehtari, 2001, for a discussion of relations to other predictive densities, information criteria, and the estimation of the effective number of parameters.) 3.1 Prior Predictive Densities and Bayes Factors. The marginal prior
predictive densities (compare to equation 1.1),

p(y | x^{(i)}, M) = \int p(y | x^{(i)}, \theta, M) p(\theta | M) d\theta,   (3.1)
are conditioned only on the prior, not on the data. The expected utilities computed with the marginal prior predictive densities would measure the goodness of the predictions without any training samples used and could be used as an estimate of the lower (or upper, if a smaller value is better) limit for the expected utility. The joint prior predictive densities \int p(D | \theta, M) p(\theta | M) d\theta = p(D | M) are used to compute the Bayes factors (BF), BF(M_1, M_2) = p(D | M_1) / p(D | M_2), which are commonly used in Bayesian model comparison (Jeffreys, 1961; Kass & Raftery, 1995). Even if the posterior is not sensitive to changes in the prior, when using BF in model comparison, the parameters for the priors have to be chosen with great care. The prior sensitivity of the Bayes factor has long been known (Jeffreys, 1961) and is sometimes called Lindley's paradox or Bartlett's paradox (see a nice review and historical comments in Hill, 1982). If the prior and the likelihood are very different, normalized prior predictive densities may be very difficult to compute (Kass & Raftery, 1995). However,
it may be possible to estimate unnormalized prior predictive likelihoods for a large number of models relatively quickly (Ntzoufras, 1999; Han & Carlin, 2001), so that a prior predictive approach may be used to aid model selection, as discussed by Vehtari (2001). 3.2 Posterior Predictive Densities. Posterior predictive densities are naturally used for new data (see equation 1.1). When used for the training data, the expected utilities computed with the marginal posterior predictive densities would measure the goodness of the predictions as if the future data samples were exact replicates of the training data samples. This is equal to evaluating the training error, which is well known to underestimate the generalization error of flexible models (see also the examples in section 4). Comparison of the joint posterior predictive densities leads to the posterior Bayes factor (PoBF) (Aitkin, 1991). The posterior predictive densities should generally not be used for assessing model performance, except as an estimate of the upper (or lower, if a smaller value is better) limit for the expected utility, or in model comparison, because they favor overfitted models (see also Aitkin, 1991). Only if the effective number of parameters is relatively small, that is, p_eff ≪ n, may the posterior predictive densities be a useful approximation to the cross-validation predictive densities (Vehtari, 2001) and thus be used to save computational resources. The posterior predictive densities are also useful in the Bayesian posterior analysis advocated, for example, by Rubin (1984), Gelman and Meng (1996), Gelman et al. (1995), and Gelman, Meng, and Stern (1996). In the Bayesian posterior analysis, the goal is to compare posterior predictive replications to the data and examine the aspects of the data that might not be accurately described by the model. Thus, the Bayesian posterior analysis is complementary to the use of the expected utilities in model assessment.
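The underestimation by the training error can be seen with a deliberately flexible model. The degree-9 polynomial and the data below are illustrative choices, not models from this article; LOO-CV supplies the honest estimate that the training error misses.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 15
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)  # noisy synthetic data

# Training error: fit the flexible model on all data, score the same data.
coefs_full = np.polyfit(x, y, 9)
train_err = np.mean((np.polyval(coefs_full, x) - y) ** 2)

# LOO-CV error: refit with each point held out and score the held-out point.
loo_errs = []
for i in range(n):
    xr, yr = np.delete(x, i), np.delete(y, i)
    c = np.polyfit(xr, yr, 9)
    loo_errs.append((np.polyval(c, x[i]) - y[i]) ** 2)
loo_err = np.mean(loo_errs)
```

The gap between `train_err` and `loo_err` is the optimism that makes posterior predictive (training-data) assessment unsuitable for flexible models.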
To avoid using the data twice, we have also used the cross-validation predictive densities for such analysis. This approach has also been used in some form by Gelfand et al. (1992), Gelfand (1996), and Draper (1995, 1996). 4 Illustrative Examples
As illustrative examples, we use MLP networks and gaussian processes with MCMC sampling (Neal, 1996, 1997, 1999; Lampinen & Vehtari, 2001) in one toy problem (MacKay's robot arm) and two real-world problems (concrete quality estimation and forest scene classification). See Vehtari and Lampinen (2001) for details of the models, priors, and MCMC parameters. The MCMC sampling was done with the FBM1 software and Matlab code partly derived from the FBM and Netlab2 toolbox. For convergence diagnostics, we used a
1 Available on-line at: http://www.cs.toronto.edu/~radford/fbm.software.html.
2 Available on-line at: http://www.ncrg.aston.ac.uk/netlab/.
visual inspection of trends, the potential scale-reduction method (Gelman, 1996), and the Kolmogorov-Smirnov test (Robert & Casella, 1999). Importance weights for MLP and GP were computed as described in Vehtari (2001).

4.1 Toy Problem: MacKay's Robot Arm. In this section we illustrate some basic issues of the expected utilities computed by using the cross-validation predictive densities. A very simple robot arm toy problem (first used by MacKay, 1992) was selected, so that the complexity of the problem would not hide the main points that we want to illustrate. The task is to learn the mapping from joint angles to position for an imaginary robot arm. Two real input variables, x1 and x2, represent the joint angles, and two real target values, y1 and y2, represent the resulting arm position in rectangular coordinates. The relationship between inputs and targets is
y1 = 2.0 cos(x1) + 1.3 cos(x1 + x2) + e1,
y2 = 2.0 sin(x1) + 1.3 sin(x1 + x2) + e2,        (4.1)
where e1 and e2 are independent gaussian noise variables of standard deviation 0.05. As training data sets, we used the same data sets that MacKay (1992) used.3 There are three data sets, each containing 200 input-target pairs, which were randomly generated by picking x1 uniformly from the ranges [−1.932, −0.453] and [+0.453, +1.932], and x2 uniformly from the range [0.534, 3.142]. To get more accurate estimates of the "true future utility," we generated an additional 10,000 input-target pairs having the same distribution for x1 and x2 as above, but without noise added to y1 and y2. The true future utilities were then estimated using this test data set and integrating analytically over the noise in y1 and y2. We used an eight-hidden-unit MLP and a GP with a normal (N) residual model. Figure 1 shows the expected utilities, where the utility is root mean square error. The IS-LOO-CV and the 10-fold-CV give quite similar error estimates. Figure 2 shows that the importance sampling probably works very well for the GP but might produce unreliable results for the MLP. Although the importance sampling weights for the MLP are not very good, the IS-LOO-CV results are not much different from the 10-fold-CV results in this simple problem. Note that in this case, small location errors and even a large underestimation of the variance in the IS-LOO-CV predictive densities are swamped by the uncertainty from not knowing the noise variance. Figure 1 also shows the realized, estimated, and theoretical noise in each data set. Note that the estimated error is lower if the realized noise is lower, and the uncertainty in the estimated errors is about the same size as the uncertainty in the noise estimates. This demonstrates that most of the uncertainty in the estimate of the expected utility comes from not knowing the true
Available on-line at: http://www.inference.phy.cam.ac.uk/mackay/Bayes FAQ.html.
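The robot arm data described above can be regenerated as follows. This is a sketch under the stated sampling scheme (equation 4.1 with the given input ranges); the variable names and random seed are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Joint angles as in MacKay (1992): x1 uniform on [-1.932, -0.453] U [0.453, 1.932],
# x2 uniform on [0.534, 3.142].
x1 = rng.uniform(0.453, 1.932, size=n) * rng.choice([-1.0, 1.0], size=n)
x2 = rng.uniform(0.534, 3.142, size=n)

# Arm position (equation 4.1) with independent gaussian noise of sd 0.05.
e1, e2 = rng.normal(0, 0.05, size=(2, n))
y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + e1
y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + e2
```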
Figure 1: Robot arm. The expected utilities (root mean square errors) for MLPs and GPs. Results are shown for three different realizations of the data. The IS-LOO-CV and the 10-fold-CV give quite similar error estimates. Realized noise and estimated noise in each data set are also shown. The dotted vertical line shows the level of the theoretical noise.
Figure 2: Robot arm. (Top plots) The total cumulative mass assigned to the k largest importance weights versus k (one line for each data point i). The MLP has more mass attached to fewer weights. (Bottom plots) The effective sample size of the importance sampling m_eff^(i) for each data point i (sorted in increasing order). The MLP has fewer effective samples. These two plots show that in this problem, the IS-LOO-CV may be unstable for the MLP but is probably stable for the GP.
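The effective sample size diagnostic plotted in Figure 2 can be computed from the per-draw log-likelihoods. A minimal sketch, assuming IS-LOO-CV importance weights w_j^(i) ∝ 1/p(y_i | x_i, θ_j) (Gelfand et al., 1992) and m_eff^(i) = 1 / Σ_j (w̃_j^(i))² with normalized weights (Kong, Liu, & Wong, 1994); the function name is ours.

```python
import numpy as np

def eff_sample_size(log_lik):
    """Effective sample sizes for IS-LOO-CV.

    log_lik: array (m, n) of log p(y_i | x_i, theta_j) for m posterior draws
    and n data points.
    """
    log_w = -log_lik                      # unnormalized log importance weights
    log_w = log_w - log_w.max(axis=0)     # stabilize before exponentiating
    w = np.exp(log_w)
    w_norm = w / w.sum(axis=0)            # normalize per data point
    return 1.0 / (w_norm ** 2).sum(axis=0)

# Uniform weights give m_eff = m; skewed weights give a smaller m_eff.
m, n = 100, 5
print(eff_sample_size(np.zeros((m, n))))  # -> all entries 100.0
```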
noise variance. Figure 3 verifies this; it shows the different components that contribute to the uncertainty in the estimate of the expected utility. The variability due to having slightly different training sets in the 10-fold-CV and the variability due to the Monte Carlo approximation are negligible compared to the variability due to not knowing the true noise variance. The estimate of the variability due to having slightly different training sets in the 10-fold-CV was computed by using the knowledge of the true function. In real-world cases where the true function is unknown, this variability could be approximated using the CV terms calculated for the bias correction, although this estimate might be slightly optimistic. The estimate of the variability due to the Monte Carlo approximation was computed directly from the Monte Carlo samples using the Bayesian bootstrap. Figure 3 also shows that the bias in the 10-fold-CV is quite small. Because the true function was known, we also computed estimates for the biases using the test data. For all the GPs, the bias corrections and the "true" biases were the same to about 2% accuracy. For the MLPs, there was much more variation, but still the "true" biases were inside the 90% credible interval of the bias correction estimate.
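The Bayesian bootstrap step mentioned above can be sketched as follows: given per-data-point utilities from the cross-validation predictions, each replication draws uniform Dirichlet weights over the n points and computes the weighted mean (Rubin, 1981). The function and variable names, and the toy utilities, are ours.

```python
import numpy as np

def bb_expected_utility(u, n_rep=4000, seed=0):
    """Bayesian bootstrap samples of the expected utility.

    u: per-data-point utilities u_i (e.g., squared errors of CV predictions).
    Each replication draws Dirichlet(1,...,1) weights over the n points and
    returns the weighted mean; the draws approximate the distribution of the
    expected utility estimate.
    """
    rng = np.random.default_rng(seed)
    u = np.asarray(u)
    weights = rng.dirichlet(np.ones(u.size), size=n_rep)  # (n_rep, n)
    return weights @ u

u = np.random.default_rng(2).normal(1.0, 0.1, size=200) ** 2  # toy squared errors
samples = bb_expected_utility(u)
lo, hi = np.quantile(samples, [0.05, 0.95])  # 90% credible interval
```

For a root mean square error utility, one would take the square root of each weighted mean of squared errors before forming the interval.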
Figure 3: Robot arm. The different components that contribute to the uncertainty and the bias correction for the expected utility (root mean square errors) for MLP and GP. Results are shown for data set 1. The variability due to having slightly different training sets in the 10-fold-CV and the variability due to the Monte Carlo approximation are negligible compared to the variability due to not knowing the true noise variance. The bias correction is quite small; it is about 0.6% of the mean error and about 6% of the 90% credible interval of the error.
Although in this example there would be no practical difference in reporting the expected utility estimates without the bias correction, the bias may be significant in other problems. For example, in the examples in sections 4.2 and 4.3, the bias correction had a notable effect. Figures 4 and 5 demonstrate the comparison of models using paired comparison of the expected utilities. Figure 4 shows the expected difference of root mean square errors and Figure 5 the expected ratio of mean predictive likelihoods (nth root of the pseudo-Bayes factors). The IS-LOO-CV and the 10-fold-CV give quite similar estimates, but the disagreement shows slightly more clearly here when comparing models than when estimating expected utilities (compare to Figure 1). The disagreement between the IS-LOO-CV and the 10-fold-CV might be caused by bad importance weights of the IS-LOO-CV for the MLPs (see Figure 2). Figure 6 shows the different components that contribute to the uncertainty in the paired comparison of the expected utilities. The variability due to having slightly different training sets in the 10-fold-CV and the variability due to the Monte Carlo approximation have a larger effect in the pairwise comparison, but they are almost negligible compared to the variability due to not knowing the true noise variance. Figure 6 also shows that in this case, the bias in the 10-fold-CV is negligible.

4.2 Case 1: Concrete Quality Estimation. In this section, we present results from a real-world problem of predicting the quality properties of
Figure 4: Robot arm. The expected difference of root mean square errors for MLP versus GP. Results are shown for three realizations of the data. The disagreement between the IS-LOO-CV and the 10-fold-CV shows slightly more clearly when comparing models than when estimating expected utilities (compare to Figure 1).
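The paired comparison behind Figure 4 can be sketched in the same Bayesian-bootstrap style: bootstrap the paired per-point differences of the utilities and count how often the weighted mean favors one model. The code below is our illustration, not the authors' implementation.

```python
import numpy as np

def prob_better(u1, u2, n_rep=4000, seed=0):
    """Probability that model 1 has a lower expected utility (error) than model 2.

    u1, u2: per-point utilities for the two models on the same CV predictions.
    Bayesian-bootstrap the paired differences and return the fraction of
    replications in which the weighted mean difference is negative.
    """
    d = np.asarray(u1) - np.asarray(u2)          # paired per-point differences
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.ones(d.size), size=n_rep)
    return np.mean(w @ d < 0)

rng = np.random.default_rng(3)
e2 = rng.normal(0, 1, 200) ** 2
e1 = 0.8 * e2                                    # model 1 uniformly better here
print(prob_better(e1, e2))  # -> 1.0
```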
Figure 5: Robot arm. The expected ratio of mean predictive likelihoods (nth root of the pseudo-Bayes factors) for MLP versus GP. Results are shown for three realizations of the data. The disagreement between the IS-LOO-CV and the 10-fold-CV shows slightly more clearly when comparing models than when estimating expected utilities (compare to Figure 1).
Figure 6: Robot arm. The different components that contribute to the uncertainty and the bias correction for the expected difference of the expected root mean square errors for MLP versus GP. Results are shown for data set 1. The variability due to having slightly different training sets in the 10-fold-CV and the variability due to the Monte Carlo approximation are almost negligible compared to the variability from not knowing the true noise variance. In this case, the biases effectively cancel out, and the combined bias correction is negligible.
concrete. The goal of the project was to develop a model for predicting the quality properties of concrete as part of a large quality control program of the industrial partner of the project. The quality variables included, for example, compressive strengths and densities for 1, 28, and 91 days after casting, and bleeding (water extraction), spread, slump, and air percentage, which measure the properties of fresh concrete. These quality measurements depend on the properties of the stone material (natural or crushed, size and shape distributions of the grains, mineralogical composition), additives, and the amount of cement and water. In the study, we had 27 explanatory variables and 215 samples designed to cover the practical range of the variables, collected by the concrete manufacturing company. (The details of the problem and the conclusions made by the concrete expert are in Järvenpää, 2001.) In the following, we report the results for the air percentage (the volume percentage of air in the concrete). Similar results were obtained for the other variables. We tested 10-hidden-unit MLP networks and GP models with Normal (N), Student's t_ν, input-dependent Normal (in.dep.-N), and input-dependent t_ν residual models. The Normal model was used as the standard reference model, and Student's t_ν, with an unknown degree of freedom ν, was used as a longer-tailed robust residual model that allows a small portion of samples to have large errors. When analyzing the results from these first two residual models, it was noticed that the size of the residual variance varied considerably depending on three inputs, which were zero/one variables indicating
Figure 7: Concrete quality estimation. The expected utilities for MLP and GP with the Normal (N) residual model. (Top) The expected normalized root mean square errors. (Bottom) The expected 90% quantiles of absolute errors.
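The 90% quantile of absolute error utility can be computed directly from the CV residuals, and its uncertainty can again be assessed with the Bayesian bootstrap, now using a weighted quantile. A sketch with hypothetical residuals; the helper name weighted_q90 and the toy data are ours.

```python
import numpy as np

# Hypothetical CV residuals standing in for the concrete predictions' errors.
resid = np.random.default_rng(4).normal(0.0, 0.5, size=215)
abs_err = np.abs(resid)

# The application-oriented utility: 90% quantile of absolute error.
u_q90 = np.quantile(abs_err, 0.90)

def weighted_q90(u, w):
    """90% quantile under Bayesian-bootstrap weights w (which sum to 1)."""
    order = np.argsort(u)
    cum = np.cumsum(w[order])
    return u[order][np.searchsorted(cum, 0.90)]

# Distribution of the utility estimate via the Bayesian bootstrap.
rng = np.random.default_rng(5)
samples = np.array([weighted_q90(abs_err, w)
                    for w in rng.dirichlet(np.ones(abs_err.size), size=1000)])
lo, hi = np.quantile(samples, [0.05, 0.95])
```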
the use of additives. In the input-dependent residual models, the parameters of the Normal or Student's t_ν were made dependent on these three inputs with a common hyperprior. Figure 7 shows the expected normalized root mean square errors and the expected 90% quantiles of absolute errors for MLP and GP with the Normal (N) residual model. The root mean square error was selected as a general discrepancy utility, and the 90% quantile of absolute error was chosen after discussion with the concrete expert, who preferred this utility because it is easily understandable. The IS-LOO-CV gives much lower estimates for the MLP and somewhat lower estimates for the GP than the 10-fold-CV. Figure 8 shows that the IS-LOO-CV for both MLP and GP has many data points with a small (or very small) effective sample size, which indicates that the IS-LOO-CV cannot be used in this problem. Figure 9 shows the expected normalized root mean square errors, the expected 90% quantiles of absolute errors, and the expected mean predictive likelihoods for GP models with Normal (N), Student's t_ν, input-dependent Normal (in.dep.-N), and input-dependent t_ν residual models. There is not much difference in the expected utilities if root mean square error is used (it is easy to guess the mean of the prediction), but there are larger differences if mean predictive likelihood is used instead (it is harder to guess the whole predictive distribution). The bias corrections are not shown, but they were
Figure 8: Concrete quality estimation. The effective sample sizes of the importance sampling m_eff^(i) for each data point i (sorted in increasing order) for MLP and GP with the Normal (N) noise model. Both models have many data points with a small effective sample size, which implies that the IS-LOO-CV fails.
about 3 to 5% of the median values; that is, they have a notable effect in model assessment. The biases were similar in the different models, so they more or less cancel out in model comparison. Table 1 shows the results for the pairwise comparisons of the residual models. In this case, the uncertainties in the comparison of the normalized root mean square errors and the 90% quantiles of absolute errors are so big that no clear difference can be made among the models. Because we get similar performance with all models (measured with these utilities), we could choose any one of them without the fear of choosing a poor model. With the mean predictive likelihood utility, there is more difference because it also measures the goodness of the tails. If, in addition to point estimates, the predictive distributions (or credible intervals for predictions) are wanted, an input-dependent t_ν model would probably be the best choice. Knowing that the additives have a strong influence on the quality of concrete, it was useful to report the expected utilities separately for samples with different additives, that is, assuming that in all future casts, no additives or just one of the additives will be used (see Figure 10).

4.3 Case 2: Forest Scene Classification. In this section, we illustrate that if, due to dependencies in the data, several data points should be left out at a time, the k-fold-CV has to be used to get more accurate results. The case problem is the classification of forest scenes with MLP (Vehtari, Heikkonen, Lampinen, & Juujärvi, 1998). The final objective of the project was to assess the accuracy of estimating the volumes of growing trees from digital images. To locate the tree trunks and initialize the fitting of the trunk contour model, a classification of the image pixels into tree and nontree classes was necessary.
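Leaving several dependent data points out at a time amounts to a grouped data division, which for this case means splitting by image. The division can be sketched as follows (our own code under the stated group structure, not the authors' implementation):

```python
import numpy as np

def group_kfold(groups, k=8, seed=0):
    """Assign a fold to every data point so that a group (here: all pixels of
    one image) is never split across folds."""
    uniq = np.unique(groups)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(uniq)
    fold_of_group = {g: i % k for i, g in enumerate(perm)}
    return np.array([fold_of_group[g] for g in groups])

# 48 images with 100 pixels each, as in the forest scene data.
groups = np.repeat(np.arange(48), 100)
folds = group_kfold(groups, k=8)

# Every image's pixels share one fold, so no image leaks across folds.
assert all(len(set(folds[groups == g])) == 1 for g in range(48))
```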
Figure 9: Concrete quality estimation. The expected utilities for GP models with Normal (N), Student's t_ν, input-dependent Normal (in.dep.-N), and input-dependent t_ν residual models. (Top) Expected normalized root mean square errors (smaller value is better). (Middle) Expected 90% quantiles of absolute errors (smaller value is better). (Bottom) Expected mean predictive likelihoods (larger value is better).
We extracted a total of 84 potentially useful features: 48 Gabor filters (with different orientations and frequencies), which are generic features related to shape and texture, and 36 common statistical features (mean, variance, and skewness with different window sizes). Forty-eight images were collected by using an ordinary digital camera in varying weather conditions. The labeling of the image data was done by hand by identifying many types of tree and background image blocks with different textures and lighting conditions. In this study, only pines were considered.
Table 1: Concrete Quality Estimation: Pairwise Comparison of Expected Utilities of GP Models with Different Residual Models.

Normalized root mean square error:
                             1     2     3     4
  1 N                        -  0.40  0.22  0.33
  2 t_ν                   0.60     -  0.18  0.31
  3 input-dependent N     0.78  0.82     -  0.85
  4 input-dependent t_ν   0.67  0.69  0.15     -

90% quantile of absolute error:
                             1     2     3     4
  1 N                        -  0.17  0.53  0.21
  2 t_ν                   0.83     -  0.87  0.67
  3 input-dependent N     0.47  0.13     -  0.23
  4 input-dependent t_ν   0.79  0.33  0.77     -

Mean predictive likelihood:
                             1     2     3     4
  1 N                        -  0.02  0.01  0.00
  2 t_ν                   0.98     -  0.22  0.06
  3 input-dependent N     0.99  0.78     -  0.32
  4 input-dependent t_ν   1.00  0.94  0.68     -

Note: The values in the matrices are probabilities that the model in the row is better than the model in the column.
Textures and lighting conditions are more similar in different parts of one image than in different images. If the LOO-CV is used or data points are divided randomly in the k-fold-CV, training and test sets may have data points from the same image, which would lead to overly optimistic estimates of the expected utility. This is caused by the fact that instead of
Figure 10: Concrete quality estimation. The expected utilities for the GP with the input-dependent t_ν residual model. The plot shows the expected 90% quantiles of absolute errors for samples with no additives, with additive A or B, and for all samples.
Figure 11: Forest scene classification. The effective sample sizes of the importance sampling m_eff^(i) for each data point i (sorted in increasing order) for the 84-input logistic MLP. The effective sample sizes are calculated for both the leave-one-point-out (IS-LOO-CV) and the leave-one-image-out (IS-LOIO-CV) methods. In both cases, there are many data points with a small effective sample size, which implies that importance sampling is unstable in this problem.
having 4800 independent data points, we have 48 sample images, each with 100 highly dependent data points. This increases our uncertainty about the future data. To get a more accurate estimate of the expected utility for new unseen images, the training data set has to be divided by images. We tested two 20-hidden-unit MLPs with a logistic likelihood model. The first MLP used all 84 inputs, and the second MLP used a reduced set of 18 inputs selected using the reversible jump MCMC (RJMCMC) method (see Vehtari, 2001). As discussed in section 2.1 and demonstrated in section 4.2, leaving one point out can change the posterior so much that importance sampling does not work. Leaving one image (100 data points) out will change the posterior even more. Figure 11 shows the effective sample sizes of the importance sampling for the 84-input MLP for the IS-LOO-CV and the IS-LOIO-CV (leave-one-image-out). For the 18-input MLP, the result was similar. In this case, neither the IS-LOO-CV nor the IS-LOIO-CV can be used. The expected classification errors for the 84- and 18-input MLPs are shown in Figure 12. The expected utilities computed by using the posterior predictive densities (training error) give estimates that are too low. The IS-LOO-CV and the eight-fold-CV with random data division give estimates that are too low because the data points from one image are highly dependent. The IS-LOO-CV also suffers from somewhat poor importance weights, and the IS-LOIO-CV suffers from very poor importance weights (see Figure 11). In the group eight-fold-CV, the data division was made by handling all the data points from one image as one indivisible group. The
Figure 12: Forest scene classification. The expected utilities (classification errors) for the 84- and 18-input logistic MLPs.
bias corrections are not shown, but for the 84- and 18-input MLPs they were about 9% and 3% of the median values, respectively. Note that the more complex model had a steeper learning curve and a correspondingly larger bias correction. In this case, the biases did not cancel out in model comparison. The pairwise comparison computed by using the group eight-fold-CV predictive densities gave a probability of 0.86 that the 84-input model has a lower expected classification error than the 18-input model. We still might use the smaller model for classification, as it would not be much worse but would be slightly faster.

5 Conclusion
The main goal of this article was to give a unified and formal presentation, from a Bayesian viewpoint, of how to compute the distribution of the expected utility estimate, which can be used to describe, in terms of the application field, how good the predictive ability of a Bayesian model is and how large the uncertainty in our estimate is. The importance sampling leave-one-out predictive densities are a quick way to estimate the expected utilities; the approach is also useful in some cases with flexible nonlinear models such as MLP and GP. If diagnostics hint that the importance weights are not good, we can instead use the k-fold cross-validation predictive densities with the bias correction. Using k-fold-CV takes k times more time, but it is more reliable. In addition, if the data points have certain dependencies, k-fold-CV has to be used to get reasonable results. We proposed a quick and generic approach based on the Bayesian bootstrap for obtaining samples from the distributions of the expected utility estimates. With the proposed method, it is also easy to compute the probability that one model has a better expected utility than another one.

Acknowledgments
This study was partly funded by TEKES grant 40888/97 (Project PROMISE, Applications of Probabilistic Modeling and Search) and the Graduate School in Electronics, Telecommunications and Automation (GETA). We thank H. Järvenpää for providing her expertise on the concrete case study and J. Kaipio, H. Tirri, E. Arjas, and the anonymous reviewers for helpful comments.

References

Aitkin, M. (1991). Posterior Bayes factors (with discussion). Journal of the Royal Statistical Society B, 53(1), 111–142.
Bernardo, J. M. (1979). Expected information as expected utility. Annals of Statistics, 7(3), 686–690.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. London: Chapman and Hall.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3), 503–514.
Burman, P., Chow, E., & Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2), 351–358.
Carlin, B. P., & Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 57(3), 473–484.
Chen, M.-H., Shao, Q.-M., & Ibrahim, J. Q. (2000). Monte Carlo methods in Bayesian computation. New York: Springer-Verlag.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
Draper, D. (1995). Model uncertainty, data mining and statistical inference: Discussion. Journal of the Royal Statistical Society A, 158(3), 450–451.
Draper, D. (1996). Posterior predictive assessment of model fitness via realized discrepancies: Discussion. Statistica Sinica, 6(4), 760–767.
Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350), 320–328.
Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74(365), 153–160.
Gelfand, A. E. (1996). Model determination using sampling-based methods. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 145–162). London: Chapman & Hall.
Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society B, 56(3), 501–514.
Gelfand, A. E., Dey, D. K., & Chang, H. (1992). Model determination using predictive distributions with implementation via sampling-based methods (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 147–167). New York: Oxford University Press.
Gelman, A. (1996). Inference and monitoring convergence. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 131–144). London: Chapman & Hall.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman & Hall.
Gelman, A., & Meng, X.-L. (1996). Model checking and model improvement. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 189–202). London: Chapman & Hall.
Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica, 6(4), 733–807.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6), 1317–1339.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society B, 14(1), 107–114.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4), 711–732.
Han, C., & Carlin, B. P. (2001). Markov chain Monte Carlo methods for computing Bayes factors: A comparative review. Journal of the American Statistical Association, 96(455), 1122–1132.
Hill, B. M. (1982). Lindley's paradox: Comment. Journal of the American Statistical Association, 77(378), 344–347.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Järvenpää, H. (2001). Quality characteristics of fine aggregates and controlling their effects on concrete. Acta Polytechnica Scandinavica, Civil Engineering and Building Construction Series No. 122. Espoo: Finnish Academy of Technology.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.
Kong, A., Liu, J. S., & Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425), 278–288.
Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17(3), 1217–1241.
Künsch, H. R. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap: Discussion. Journal of the Royal Statistical Society B, 56(1), 39.
Lampinen, J., & Vehtari, A. (2001). Bayesian approach for neural networks—review and case studies. Neural Networks, 14(3), 7–24.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. Berlin: Springer-Verlag.
Lo, A. Y. (1987). A large sample study of the Bayesian bootstrap. Annals of Statistics, 15(1), 360–375.
MacEachern, S. N., & Peruggia, M. (2000). Importance link function estimation for Markov chain Monte Carlo methods. Journal of Computational and Graphical Statistics, 9(1), 99–121.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
MacKay, D. J. C. (1998). Introduction to Monte Carlo methods. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Mason, D. M., & Newton, M. A. (1992). A rank statistics approach to the consistency of a general bootstrap. Annals of Statistics, 20(3), 1611–1624.
Nadeau, C., & Bengio, S. (2000). Inference for the generalization error. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 307–313). Cambridge, MA: MIT Press.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Neal, R. M. (1996). Bayesian learning for neural networks. Berlin: Springer-Verlag.
Neal, R. M. (1997). Monte Carlo implementation of gaussian process models for Bayesian regression and classification (Tech. Rep. No. 9702). Toronto: Department of Statistics, University of Toronto.
Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In C. M. Bishop (Ed.), Neural networks and machine learning (pp. 97–129). Berlin: Springer-Verlag.
Neal, R. M. (1999). Regression and classification using gaussian process priors (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 6 (pp. 475–501). New York: Oxford University Press.
Newton, M. A., & Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society B, 56(1), 3–48.
Ntzoufras, I. (1999). Aspects of Bayesian model and variable selection using MCMC. Unpublished doctoral dissertation, Athens University of Economics and Business.
Orr, M. J. L. (1996). Introduction to radial basis function networks (Tech. Rep.). Edinburgh: Centre for Cognitive Science, University of Edinburgh. Available on-line: http://www.anc.ed.ac.uk/~mjo/papers/intro.ps.gz.
Peruggia, M. (1997). On the variability of case-deletion importance sampling weights in the Bayesian linear model. Journal of the American Statistical Association, 92(437), 199–207.
Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Version 1.1. Available on-line: ftp://ftp.cs.utoronto.ca/pub/neuron/delve/doc/manual.ps.gz.
Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. Berlin: Springer-Verlag.
Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9(1), 130–134.
2468
Aki Vehtari and Jouko Lampinen
Rubin, D. B. (1984). Bayesianly justiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12(4), 1151–1172. Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494. Spiegelhalter, D. J., Best, N. G., & Carlin, B. P. (1998). Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models (Research Rep. No. 98-009). Minneapolis: Division of Biostatistics, University of Minnesota. Spiegelhalter, D. J., Myles, J. P., Jones, D. R., & Abrams, K. R. (2000). Bayesian methods in health technology assessment: A review. Health Technology Assessment 2000, 4(38), 1–121. Stephens, M. (2000). Bayesian analysis of mixtures with an unknown number of components—an alternative to reversible jump methods. Annals of Statistics, 28(1), 40–74. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society B, 36(2), 111–147. Vehtari, A. (2001). Bayesian model assessment and selection using expected utilities. Unpublished doctor of science in technology dissertation, Helsinki University of Technology. Available on-line: http://lib.hut./Diss/2001/ isbn9512257653/. Vehtari, A., Heikkonen, J., Lampinen, J., & Juuj¨arvi, J. (1998).Using Bayesian neural networks to classify forest scenes. In D. P. Casasent (Ed.), Intelligent Robots and Computer Vision XVII: Algorithms, techniques, and active vision, (pp. 66–73). Bellingham, WA: SPIE. Vehtari, A., & Lampinen, J. (2001). On Bayesian model assessment and choice using cross-validation predictive densities (Tech. Rep. No. B23). Espoo, Finland: Helsinki University of Technology, Laboratory of Computational Engineering. Weng, C.-S. (1989). On a second-order asymptotic property of the Bayesian bootstrap mean. Annals of Statistics, 17(2), 705–710. Wolpert, D. H. (1996a). 
The existence of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1391–1420. Wolpert, D. H. (1996b). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390. Zhang, P. (1996). Nonparametric importance sampling. Journal of the American Statistical Association, 91(435), 1245–1253. Received April 12, 2001; accepted March 25, 2002.
LETTER
Communicated by John Platt
Robust Regression with Asymmetric Heavy-Tail Noise Distributions

Ichiro Takeuchi
[email protected]
Department of Information Engineering, Mie University, Tsu 514-8507, Japan

Yoshua Bengio
[email protected]
Université de Montréal, DIRO, Montréal, Québec, Canada

Takafumi Kanamori
[email protected]
Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8552, Japan

In the presence of a heavy-tail noise distribution, regression becomes much more difficult. Traditional robust regression methods assume that the noise distribution is symmetric, and they downweight the influence of so-called outliers. When the noise distribution is asymmetric, these methods yield biased regression estimators. Motivated by data-mining problems for the insurance industry, we propose a new approach to robust regression tailored to deal with asymmetric noise distributions. The main idea is to learn most of the parameters of the model using conditional quantile estimators (which are biased but robust estimators of the regression) and to learn a few remaining parameters to combine and correct these estimators, to minimize the average squared error in an unbiased way. Theoretical analysis and experiments show the clear advantages of the approach. Results are given on artificial data as well as insurance data, using both linear and neural network predictors.

1 Introduction
Neural Computation 14, 2469–2496 (2002). © 2002 Massachusetts Institute of Technology.

In a variety of practical applications, we often find data distributions with an asymmetric heavy tail (see the shape of the distribution in Figure 1 (ii)). Modeling data with such an asymmetric heavy-tail distribution is essentially difficult because outliers, which are sampled from the tail of the distribution, have a strong influence on parameter estimation. When the distribution is symmetric (around the mean), the problems caused by outliers can be reduced using robust estimation techniques (Huber, 1982; Hampel, Ronchetti, Rousseeuw, & Stahel, 1986), which basically intend to ignore or
put less weight on outliers. Note that these techniques are highly biased for asymmetric distributions: most outliers are on the same side of the mean, so downweighting them introduces a strong bias into the estimation of the mean. We are concerned in this article with the problem of linear or nonlinear regression, that is, estimating the conditional expectation E[Y | X], in the presence of asymmetric, heavy-tail, and unknown noise. Regression problems also suffer from the effect of outliers when the distribution of the noise (the variations of the output variable Y that are not explainable by the input variable X) has an asymmetric heavy tail. As in the unconditional case, the robust methods that symmetrically downweight outliers (Huber, 1982; Hampel et al., 1986) are highly biased for asymmetric noise distributions. In this article, we propose a new robust regression method that can be applied to data with an asymmetric, heavy-tail, and unknown noise distribution and does not suffer from the above bias problem. We present a theoretical analysis, numerical experiments with synthetic data, and an application to automobile insurance premium estimation. These experiments show when and how the proposed method generalizes significantly better than least squares and M estimators. Throughout the article, we use the following notation: X and Y for the input and output random variables, and F_W(·) and P_W(·) for the cumulative distribution function (cdf) and the probability density function (pdf) of a random variable W.

2 Traditional Robust Methods
Let us first consider the problem of estimating from a finite sample the unconditional mean E[Y] of a heavy-tail distribution (i.e., the density decays slowly to zero when going toward ∞ or −∞). The empirical average may be a poor estimator here because a few points will be sampled from the tails and may have very different values, thereby introducing a great deal of variability (from sample to sample) in the empirical average. In the case of a symmetric distribution, we can downweight or ignore the effect of these outliers in order to reduce this variability greatly. For example, the median estimator is much less sensitive to outliers than the empirical average for heavy-tail distributions. This idea can be generalized to conditional estimators; for example, one can estimate the conditional mean E[Y | X] from the empirical conditional median by minimizing absolute errors,

\hat{f}_{0.5} = \operatorname*{argmin}_{f \in F} \sum_i |y_i - f(x_i)|,   (2.1)

where F is a set of functions (e.g., a class of neural networks), {(x_i, y_i), i = 1, 2, ..., N} is the training sample, and a hat (^) denotes estimation on data. Minimizing the above over P(X, Y) and a large enough class of functions
estimates the conditional median f_{0.5}, that is, P(Y < f_{0.5}(X) | X) = 0.5. For regression problems with a heavy-tail noise distribution, the estimated conditional median is much less sensitive to outliers than least squares (LS) regression (which provides an estimate of the conditional average of Y given X). However, we want to tackle problems where the goal is to estimate the conditional mean, not the conditional median, and where they differ. In the presence of thick tails, M estimators (Huber, 1973) are among the most successful robust estimators for regression.¹ The basic idea of M estimators is to use the minimization scheme

\hat{f} = \operatorname*{argmin}_{f \in F} \sum_i \rho(y_i - f(x_i)),   (2.2)
where ρ is a loss function with a unique minimum at zero, which is generally chosen symmetric (ρ(−t) = ρ(t)) when the actual noise distribution is unknown. As particular cases, ρ(t) = (1/2)t² yields the LS estimator, and ρ(t) = |t| yields the median estimator, equation 2.1. Unfortunately, these robust estimators that symmetrically downweight the influence of outliers do not work well for asymmetric distributions. Whereas removing outliers symmetrically on both sides of the mean does not change the average, removing them on one side changes the average considerably. In order to provide some quantitative evidence on this, we show in Table 1 a small Monte Carlo study on unconditional mean estimation. The values in the table are the mean (variance) of 100 trials of estimation, each with 10 samples, by the LS, median, Cauchy (Huber, 1982), Huber (Huber, 1973), and Tukey's biweight (Beaton & Tukey, 1974) estimators. The samples are from a positively skewed asymmetric distribution that is shifted so its mean equals zero.²

Table 1: A Comparison of Estimators on Unconditional Mean Estimation Problems Under Asymmetric Distribution.

                      Degree of Asymmetry
Estimator    Low             Intermediate     High
LS            0.03 (0.18)     0.08 (3.02)      0.22 (134.83)
Median       −0.26 (0.08)    −1.35 (0.33)     −7.13 (1.72)
Cauchy       −0.25 (0.07)    −1.25 (0.20)     −6.36 (1.65)
Huber        −0.13 (0.07)    −0.41 (0.62)     −2.41 (35.82)
Tukey        −0.47 (0.13)    −1.39 (0.15)     −4.76 (4.57)

Notes: The values are the mean (variance) of 100 trials of estimation, each with 10 samples, by several estimators. The samples are from a positively skewed log-normal distribution (see note 2 and section 4), shifted so that its mean equals zero. Note that all but the LS estimator are negatively biased, and the variances of the LS estimator are much larger than the others.

Table 1 clearly illustrates the negative biases in all but the LS estimator. Those negative biases are due to the loss functions (see equation 2.2) of the robust estimators, which in this case put less weight on data from the positive side of the distribution. As the example suggests, traditional robust estimators are designed under an assumption of a symmetric distribution; if we use them with an asymmetric distribution, they yield a strong bias. In unconditional estimation problems under an asymmetric distribution, it has been proved that the LS estimator (the sample average) is the "best" in the sense of being a uniformly minimum variance unbiased estimator (Lehmann, 1983). For many regression problems as well, under asymmetric unknown noise distributions, the straightforward use of "robust" loss functions (see equation 2.2) is generally not helpful, as confirmed in section 4.

Footnote 1: In the context of robust regression, many efforts have been devoted to dealing with outliers in the input variable X, called leverage points, as well as those in the output variable Y. In this article, we do not consider outliers in X and therefore do not refer to those techniques, including bounded-influence estimators (Krasker & Welsch, 1982) and S estimators (Rousseeuw & Leroy, 1987).

Footnote 2: Log-normal distributions (see section 4), shifted so their mean equals zero and their pm (a degree of asymmetry introduced in the next section) is 0.65 (low), 0.75 (intermediate), and 0.85 (high), respectively.

3 Robust Regression for Asymmetric Tails

3.1 Motivation. We saw in the previous section that the median estimator is robust but biased under asymmetric distributions. We first consider an extension of the median estimator as a motivation. Under asymmetric distributions, the median does not coincide with the mean, but another quantile does. We call its order pm: for a distribution P(Y), we define pm := F_Y(E[Y]) = P(Y < E[Y]). Note that pm > 0.5 (< 0.5) suggests that P(Y) is positively (negatively) skewed. When the distribution is symmetric, pm = 0.5. Figure 1 illustrates this idea. For regression problems with an asymmetric noise distribution, we may extend the median regression of equation 2.1 to p-th quantile regression (Koenker & Bassett, 1978), which estimates the conditional p-th quantile; that is, we would like P(Y < f̂_p(X) | X) = p:

\hat{f}_p = \operatorname*{argmin}_{f \in F_q} \left\{ \sum_{i:\, y_i \ge f(x_i)} p\,|y_i - f(x_i)| + \sum_{i:\, y_i < f(x_i)} (1 - p)\,|y_i - f(x_i)| \right\},   (3.1)
where F_q is a set of quantile regression functions. One could be tempted to obtain a robust regression with asymmetric noise distributions by estimating f_{pm} instead of f_{0.5}. But what is pm for regression problems? It must be defined as pm(x) := F_{Y|X}(E[Y | X = x]) = P(Y < E[Y | X] | X = x).
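As a concrete illustration of equation 3.1, the check (pinball) loss can be minimized by plain subgradient descent. This is only a minimal NumPy sketch (the paper's own experiments use conjugate-gradient optimization); the learning rate, iteration count, and synthetic data below are illustrative assumptions, not the authors' setup:

```python
import numpy as np

def fit_linear_quantile(X, y, p, lr=0.05, n_iter=4000):
    """Fit a linear model of the conditional p-th quantile by subgradient
    descent on the check (pinball) loss of equation 3.1 (illustrative sketch)."""
    Xb = np.column_stack([X, np.ones(len(X))])   # append an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        r = y - Xb @ w
        # Subgradient of the pinball loss: weight p where r >= 0, p - 1 where r < 0.
        g = -(np.where(r >= 0, p, p - 1.0)[:, None] * Xb).mean(axis=0)
        w -= lr * g
    return w

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)
y = 2.0 * x + 1.0 + rng.lognormal(0.0, 1.0, 2000)   # skewed additive noise

w = fit_linear_quantile(x[:, None], y, p=0.9)
below = np.mean(y < x * w[0] + w[1])   # fraction of targets below the fit
```

At the optimum, roughly a fraction p of the targets lies below the fitted line, which is a quick empirical check that a quantile fit converged.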
Figure 1: Schematic illustration of empirical averages and empirical medians for (i) a symmetric distribution and (ii) an asymmetric distribution: distributions of a heavy-tail random variable Y, its empirical average and empirical median, and their expectations (black circles). In (i), those expectations coincide, while in (ii) they do not. As indicated by the hatched areas, we define pm as the order at which the quantile coincides with the mean. In (i), pm = 0.5.
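The definition of pm can be checked numerically. The sketch below (an assumed setup, not from the paper) estimates pm = P(Y < E[Y]) for a positively skewed log-normal sample, where the exact value is Φ(σ/2) ≈ 0.69 for σ = 1, and for a symmetric gaussian sample, where pm = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# A positively skewed heavy-tail sample versus a symmetric one.
skewed = rng.lognormal(0.0, 1.0, 200_000)
symmetric = rng.normal(0.0, 1.0, 200_000)

# pm is the order of the quantile that coincides with the mean: P(Y < E[Y]).
pm_skewed = np.mean(skewed < skewed.mean())          # near Phi(1/2) ~ 0.69
pm_symmetric = np.mean(symmetric < symmetric.mean())  # near 0.5
```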
So the above idea raises three problems that we address with the algorithm proposed in section 3.2: (1) pm(x) of P(Y | X) may depend on x in general; (2) unless the noise distribution is known, pm itself must be estimated, which may be as difficult as estimating E[Y | X]; and (3) if the noise density at pm is low (because of the heavy tail and a large value of pm), the estimator in equation 3.1 may itself be very unstable. (See Figure 2 and the discussion in section 3.2.)

3.2 Algorithm. To overcome the difficulties raised above, we propose a new robust regression algorithm that can be applied to data with an unknown but asymmetric and heavy-tail noise. The main idea is to learn most of the parameters of the model using conditional quantile estimators (which are biased but robust) and to learn a few remaining parameters to combine and correct these estimators, hence the name robust regression for asymmetric tails (RRAT):

Algorithm RRAT(n)

Input: data pairs {(x_i ∈ R^d, y_i ∈ R)} and hyperparameters: the number of quantile regressors n, the quantile orders (p_1, ..., p_n), and the function classes F_q and F_c.

(1) Fit n quantile regressions at orders p_1, p_2, ..., p_n, each as in equation 3.1, yielding functions f̂_{p1}, f̂_{p2}, ..., f̂_{pn}, with f̂_{pi}: R^d → R and f̂_{pi} ∈ F_q.

(2) Fit a least-squares regression (not necessarily linear) with inputs q(x_i) = (f̂_{p1}(x_i), ..., f̂_{pn}(x_i)) and targets y_i, yielding a function f̂_c: R^n → R, with f̂_c ∈ F_c.

Output: the conditional expectation estimator f̂(x) = f̂_c(q(x)).
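The two steps can be sketched for the additive-noise case of Figure 2: a conditional-median fit captures the slope robustly, and a single LS-estimated constant c shifts the median line up to the mean. A minimal NumPy sketch; the subgradient optimizer, sample sizes, and parameter values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fit_linear_quantile(X, y, p, lr=0.05, n_iter=4000):
    # Step 1: linear quantile regression via subgradient descent on eq. 3.1.
    Xb = np.column_stack([X, np.ones(len(X))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        r = y - Xb @ w
        w -= lr * -(np.where(r >= 0, p, p - 1.0)[:, None] * Xb).mean(axis=0)
    return w

rng = np.random.default_rng(3)
a, b = 2.0, 1.0
x = rng.uniform(0.0, 1.0, 3000)
z = rng.lognormal(0.0, 1.0, 3000) - np.exp(0.5)    # zero-mean, positively skewed
y = a * x + b + z

# Step 1: robust conditional-median fit captures the slope.
w = fit_linear_quantile(x[:, None], y, p=0.5)
f_med = lambda t: t * w[0] + w[1]

# Step 2: a single LS parameter, f_c(v) = v + c, shifts the median to the mean.
c = np.mean(y - f_med(x))
f_hat = lambda t: f_med(t) + c

grid = np.linspace(0.0, 1.0, 50)
err = np.max(np.abs(f_hat(grid) - (a * grid + b)))   # deviation from E[Y | x]
```

The shape parameter a is estimated robustly through the quantile fit, and only the scalar c is estimated by (nonrobust) least squares, which is the intuition behind RRAT's variance reduction.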
Figure 2: An introductory example of RRAT for a simple scalar linear regression problem with additive noise: Y = aX + b + Z, where Z is a zero-mean noise random variable (independent of X) with an asymmetric heavy-tail distribution whose (unknown) pm is 0.9, and a and b are parameters to estimate. With additive noise, pm is constant with regard to X: P(Y < E[Y | X] | X) = P(aX + b + Z < aX + b) = P(Z < 0) = pm. We could estimate the pm-th quantile regression (thick solid line), but it might be unreliable for large pm at low densities. Instead, consider a simple RRAT with n = 1, p_1 = 0.5, f̂_{p1} linear in x, and f̂_c having just one additive parameter. f_{p1} (thick dotted line) has the same slope parameter a as f_{pm} (but a different intercept, b′ versus b), but f̂_{pm} (thin solid line) is more variable than f̂_{p1} (thin dotted line) because of the variability of the empirical conditional quantiles (small circles at a few values of x, with underlying pdfs). With RRAT, (1) estimate a with f̂_{p1}; (2) keeping â fixed, estimate b by LS regression.
An outer model selection loop is often required in applications in order to select hyperparameters such as n, (p_1, ..., p_n), and the capacity control for the function classes F_q and F_c. For example, this can be achieved with a validation set (used in the experiments), K-fold cross-validation, or the bootstrap. An introductory example of a very simple version of the RRAT algorithm is illustrated in Figure 2. Some of the parameters are estimated through conditional quantile estimators f_{p1}, ..., f_{pn}, and they are combined and corrected
by the function f_c in order to estimate E[Y | X]. In general, this method yields more robust regressions when the number of parameters required to characterize f̂_c is small compared to the number of parameters required to characterize the quantile regressions f_{pi} (this statement is justified in more detail in section 3.5). The intuitive explanation is that the f̂_{pi} parameters are estimated robustly (quantile estimators), whereas the f̂_c parameters are not (LS estimator). That is also the reason that RRAT can beat LS: having most of its parameters estimated robustly decreases its variance. Problem 2 above is dealt with by doing quantile regressions of orders p_1, ..., p_n not necessarily equal to pm. Problem 3 is dealt with if p_1, ..., p_n are in high-density areas (where estimation will be robust). The issue raised with problem 1 is discussed in section 3.3. Note that RRAT with a linear f_c (without constant term) is similar to some L estimators (Bickel, 1973), that is, linear combinations of order statistics. However, the coefficients of the second level, f_c, are not fixed but estimated, so RRAT might be considered an adaptive L estimator. Note that RRAT also bears some similarities with ensemble methods but relies crucially on each of the combined functions being estimated robustly (here, a quantile regression). Unlike in most ensemble methods, the training set is used to estimate the parameters of the combination function f_c.

3.3 An Applicable Class of Problems. A surprising result is that with additive or multiplicative but otherwise arbitrary noise, it is sufficient to take a single quantile function (n = 1) with any chosen quantile p_1, and it is sufficient to take an affine (two-parameter) function f_c, in order to map the true quantile function to the true conditional expectation. In this section and the next, we explain and generalize this result.
A large class of regression problems for which RRAT works (in the sense that the analog of problem 1 is eliminated) can be described as follows:
Y = g_m(X) + Z\,g_s(g_m(X)),   (3.2)
where Z is a zero-mean random variable drawn from any form of (possibly asymmetric) continuous distribution, independent of X. The conditional expectation is characterized by an arbitrary function g_m, and the conditional standard deviation of the noise distribution is characterized by an arbitrary positive range function g_s. Note that the regression in equation 3.2 is a subclass of heteroscedastic regression (Greene, 1997); the standard deviation of the noise distribution is not directly conditioned on X, but only on g_m(X). This specification narrows the class of applicable problems, but equation 3.2 still covers a wide variety of noise structures, as explained later. In equation 3.2, E[Y | X] = E[g_m(X) + Z g_s(g_m(X)) | X] = g_m(X), and pm of the distribution P(Z) coincides with pm of P(Y | X) and does not depend on x; that is, P(Y < E[Y | X] | X) = P(Z g_s(g_m(X)) < 0 | X) = P(Z < 0 | X) =
P(Z < 0). Since the conditional expectation E[Y | X] coincides with the pm-th quantile regression of P(Y | X), we have g_m(x) ≡ f_{pm}(x) = E[Y | x]. The following two theorems show that RRAT(n) "works" when large enough classes F_q and F_c are provided, by guaranteeing the existence of the function f_c that transforms the outputs of the p_i-th quantile regressions f_{pi}(x) (i = 1, ..., n) into the conditional expectation E[Y | x] for all x. The cases of n = 1 and n = 2 are explained in theorem 1 and theorem 2, respectively. The applicable class of problems with only one quantile regression f_{p1}, that is, RRAT(1), is smaller than the class satisfying equation 3.2, but it is very important for practical applications because it includes additive and multiplicative noise (see section 3.4).

Theorem 1. If the noise structure is as in equation 3.2, then there exists a function f_c such that E[Y | X] = f_c(f_{p1}(X)), where f_{p1}(X) is the p_1-th quantile regression, if and only if the function h(ȳ) = ȳ + F_Z^{-1}(p_1) · g_s(ȳ) is strictly monotonic with respect to ȳ (taken in the range of g_m, i.e., {E[Y | x] | ∀x}). (The proof is in appendix A.)
With the use of two quantile regressions f_{p1} and f_{p2}, we show that RRAT(2) covers the whole of the class satisfying equation 3.2.

Theorem 2. If the noise structure is as in equation 3.2 and

p_1 \ne p_m, \quad p_2 \ne p_m,   (3.3)

p_1 \ne p_2,   (3.4)

then there exists a function f_c such that E[Y | X] = f_c(f_{p1}(X), f_{p2}(X)), where the f_{pi}(X) are the p_i-th quantile regressions (i = 1, 2). (The proof is in appendix B.)

In comparison to theorem 1, we see that when using n = 2 quantile regressions, the monotonicity condition can be dropped. We conjecture that even the assumption on the noise structure in equation 3.2 can be dropped when combining a sufficient number of quantile regressions. When Y > 0, we have

E[Y \mid X] = \int_0^{\infty} P(Y > y \mid X)\,dy = \int_0^1 P(Y > f_q(X) \mid X)\,\frac{d f_q(X)}{dq}\,dq

by substituting y = f_q(X) s.t. P(Y > f_q(X) | X) = q. The integral can be unbiasedly estimated by a number of numerical integration schemes (the simplest being uniform spacing of q; better results might be obtained using, for example, a gaussian quadrature). Such approximations correspond to a linear combination of the quantile regressors. The derivative df_q/dq can be unbiasedly (and efficiently) estimated with (f_{q_{i+1}}(X) − f_{q_{i−1}}(X)) / (q_{i+1} − q_{i−1}).
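The identity behind this construction, which for Y > 0 is equivalent to the inverse-cdf form E[Y] = ∫₀¹ f_q dq, can be checked with the simplest uniform-spacing quadrature: averaging empirical quantiles over a uniform grid of orders q. A small sketch (the grid resolution and distribution are illustrative choices); note the grid is clipped just below 1, which slightly truncates the heavy tail:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(0.0, 1.0, 100_000)     # positive heavy-tail sample

# Uniform spacing of q: the simplest quadrature of the integral over quantile
# orders, i.e., averaging empirical quantiles on a uniform grid of q.
q = np.linspace(0.0005, 0.9995, 1000)
mean_from_quantiles = np.quantile(y, q).mean()
mean_direct = y.mean()                    # should agree up to tail truncation
```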
However, generally, increasing n in RRAT may add more complexity (and parameters) to f_c, thereby reducing the robustness gains brought by RRAT.

3.4 Additive and Multiplicative Noise. Consider the case where g_s (in equation 3.2) is affine and g_m(x) ≥ 0 for all x:

Y = g_m(X) + Z \times (c + d\,g_m(X)),   (3.5)

where c and d are constants such that c ≥ 0, d ≥ 0, (c, d) ≠ (0, 0). The conditions of theorem 1 (including the monotonicity of h(y)) are verified for additive noise (d = 0, e.g., Y = E[Y | X] + Z), multiplicative noise (c = 0, e.g., Y = E[Y | X](1 + Z)), or combinations of both (c > 0, d > 0), for any form of noise distribution P(Z) (continuous and independent of X).
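For the multiplicative case (c = 0), the mapping can be made explicit: the p_1-th quantile regression is f_{p1}(x) = g_m(x)(1 + F_Z^{-1}(p_1)), so f_c is the one-coefficient linear map v ↦ v / (1 + F_Z^{-1}(p_1)). The sketch below verifies this algebra for an assumed positive mean function g_m (our illustration, not from the paper) and a zero-mean log-normal Z:

```python
import numpy as np
from statistics import NormalDist

# Zero-mean multiplicative noise: Z = exp(W) - exp(s^2/2), W ~ N(0, s^2).
s = 1.0
z_inv_cdf = lambda p: np.exp(s * NormalDist().inv_cdf(p)) - np.exp(s**2 / 2)

# For Y = g_m(X) * (1 + Z), the p1-th quantile regression is
# f_p1(x) = g_m(x) * (1 + F_Z^{-1}(p1)), so the single linear coefficient
# d = 1 / (1 + F_Z^{-1}(p1)) recovers g_m(x) = E[Y | x] for every x at once.
g_m = lambda x: 1.0 + 2.0 * x            # hypothetical positive mean function
p1 = 0.5
x = np.linspace(0.0, 1.0, 11)
f_p1 = g_m(x) * (1.0 + z_inv_cdf(p1))    # true p1-th quantile regression
d = 1.0 / (1.0 + z_inv_cdf(p1))
recovered_mean = d * f_p1

# Empirical sanity check that Z is indeed zero mean.
rng = np.random.default_rng(0)
z = np.exp(rng.normal(0.0, s, 500_000)) - np.exp(s**2 / 2)
```

A single coefficient d corrects the quantile into the mean simultaneously for all x, which is why only the f_c parameters need nonrobust estimation.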
Property 1. If the noise structure is affine (see equation 3.5) and g_m(x) ≥ 0 for all x, then a linear function f_c is sufficient to map f_{p1} to f; that is, only two parameters need to be estimated by least squares. (The proof is in appendix C.)
Additive and multiplicative noise models cover a very large variety of practical applications,³ so this result shows that RRAT(1) already enjoys wide applicability. The fact that a linear combination function f_c is sufficient in the case of additive or multiplicative noise is encouraging. The previous conjecture suggests that a linear combination could be sufficient in more general cases, albeit with more quantile regressors.

Footnote 3: The specification of the conditional expectations to be nonnegative is a trivial constraint because we can shift the output data. Also, note that in many practical applications with asymmetric heavy-tail noise distributions (including ours), the output range is nonnegative.

3.5 Asymptotic Behavior. Let us call risk of an estimator f̂(X) the expected squared difference between E[Y | X] and f̂(X) (expectation over X, Y, and the training set), in the limit of large sample size. Let us write f̂_c(f̂_{p1}(X)) for the conditional expectation obtained by RRAT(1) with finite variance.

Theorem 3. Consider the class of regression problems with Y = f(X; a*) + b* + Z, where f is an arbitrary function characterized by a set of parameters a* ∈ R^d (with the assumption that the second derivative matrix of f(X; a*) with respect to a* is full rank), b* ∈ R is a scalar parameter, and Z is the zero-mean noise random variable drawn from a (possibly asymmetric) heavy-tail distribution (independent of X). Also assume that the model (F_q) used in the first step of RRAT is well specified. Then the risk of LS regression is given by

\frac{1}{n}(d + 1)\,\mathrm{Var}[Z],

while the risk of RRAT(1) is given by

\frac{1}{n}\left(\mathrm{Var}[Z] + d\,\frac{p_1(1 - p_1)}{P_Z^2(F_Z^{-1}(p_1))}\right).

It follows that the risk of RRAT(1) is less than that of LS regression if and only if

\frac{p_1(1 - p_1)}{P_Z^2(F_Z^{-1}(p_1))} < \mathrm{Var}[Z].

(The proof is in appendix D.)
For instance, as we have also verified numerically, if Z is log-normal (see section 4), RRAT beats LS regression on average when pm > 0.608 (recall that for symmetric distributions, pm = 0.5).

4 Numerical Experiments
To investigate and demonstrate the proposed algorithm, we did three types of numerical experiments. In section 4.1, we compare RRAT, M estimators, and the LS estimator. In section 4.2, we show when RRAT works: we illustrate the performance of RRAT under several classes of data-generating processes. In section 4.3, we examine which RRAT works: we demonstrate the performance of RRAT under several choices of hyperparameters (F_q, F_c, n, p_1, p_2, ...). In Monte Carlo simulations, N = 1000 training pairs⁴ (x_i, y_i), (i = 1, 2, ..., N) are generated as per equation 3.2 with

x_i \sim U[0, 1]^d,   (4.1)

y_i = g_m(x_i) + z_i\,g_s(g_m(x_i)),   (4.2)

where g_m: R^d → R, g_s: R → (0, ∞), and z_i is a random sample from the log-normal distribution LN(z_0, 0, s²) (Antle, 1985), Z = z_0 + e^W, where W ~ N(0, s²). The location parameter z_0 is chosen so that E[Z] = 0, and the spread parameter s is chosen so that Z has a prescribed pm, that is, pm = P(Z < E[Z]). We tried only positive skews (pm > 0.5), without loss of generality. We measured performance with the mean squared error (MSE) on 1000 noise-free test samples (without the second term on the right-hand side of equation 4.2), against the true conditional expectation. In each experimental setting, a number of experiments are repeated using completely independent data sets. The statistical comparisons between methods are given by the Wilcoxon signed rank test.⁵ (In this section, we use the term significant for the 0.01 level.)

4.1 Experiment 1: Comparison. We compared the performance of RRAT, M estimators, and the LS estimator to see how the robust estimators behave
Footnote 4: We also tried the same experiments with different sizes of training samples. The results in experiment 1 (section 4.1) highly depend on the training sample size, which we detail later. On the other hand, the training sample size did not affect the comparative results in experiment 2 (section 4.2) and experiment 3 (section 4.3).

Footnote 5: We used a nonparametric test rather than a Student t-test because the normality assumption in the t-test does not hold in the presence of heavy tails.
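The noise construction of section 4 admits simple closed forms (our derivation, not stated explicitly in the paper): since P(Z < 0) = P(W < s²/2) = Φ(s/2), a prescribed pm > 0.5 gives s = 2Φ⁻¹(pm), and z_0 = −e^{s²/2} makes the mean zero. With p_1 = 0.5, the theorem 3 condition can then be evaluated analytically (the density of Z at its median is 1/(s√(2π))), reproducing the pm ≈ 0.608 break-even point:

```python
import numpy as np
from statistics import NormalDist

def zero_mean_lognormal(pm, size, rng):
    """Z = z0 + exp(W), W ~ N(0, s^2), with E[Z] = 0 and P(Z < 0) = pm.
    Closed forms (our derivation): s = 2 * Phi^{-1}(pm) for pm > 0.5,
    and z0 = -exp(s^2 / 2) so that the mean is zero."""
    s = 2.0 * NormalDist().inv_cdf(pm)
    z0 = -np.exp(s**2 / 2.0)
    return z0 + np.exp(rng.normal(0.0, s, size)), s

rng = np.random.default_rng(0)
z, s = zero_mean_lognormal(0.75, 1_000_000, rng)

# Theorem 3 condition at p1 = 0.5: RRAT(1) beats LS iff
# p1 (1 - p1) / P_Z(F_Z^{-1}(p1))^2 < Var[Z].  At the median the density of Z
# is 1 / (s * sqrt(2 pi)), so the left side reduces to pi * s^2 / 2.
lhs = np.pi * s**2 / 2.0
rhs = np.exp(s**2) * (np.exp(s**2) - 1.0)     # Var[Z] of the log-normal

# Near the reported break-even point pm = 0.608, the two sides almost match.
s608 = 2.0 * NormalDist().inv_cdf(0.608)
lhs608 = np.pi * s608**2 / 2.0
rhs608 = np.exp(s608**2) * (np.exp(s608**2) - 1.0)
```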
[Figure 3 about here. Vertical axis: log{testMSE(LS)} − log{testMSE(*)}; horizontal axis: pm, from 0.5 to 1.0 in steps of 0.05. Legend: LS, RRAT, Median, Cauchy, Logistic, Tukey.]
Figure 3: Performance of RRAT and the M estimators compared with the LS estimator. The horizontal axis (pm) indicates the degree of asymmetry (the larger pm, the more asymmetric). Values above the zero line indicate that the estimator was better than the LS estimator. Note that under intermediate asymmetric noise distributions, which are practically important, the M estimators were worse even than the LS estimator, while RRAT worked better.
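The bias pattern behind Figure 3 (and Table 1) is easy to reproduce in miniature: on zero-mean, positively skewed log-normal noise, the sample average (LS) is unbiased while the sample median is strongly negatively biased. A small Monte Carlo sketch; the trial count and spread s are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 1.0
z0 = -np.exp(s**2 / 2.0)                  # shift so that the noise mean is zero
samples = z0 + np.exp(rng.normal(0.0, s, (2000, 10)))   # 2000 trials of 10 draws

bias_ls = samples.mean(axis=1).mean()               # LS (sample average): unbiased
bias_median = np.median(samples, axis=1).mean()     # median: negatively biased
```

The median sits well below zero because the distribution's median lies below its mean, which is exactly the systematic bias that symmetric robust estimators suffer under asymmetry.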
under asymmetric noise distributions with well-specified models. We tried several types of M estimators: the median, Cauchy (Huber, 1982), Huber (Huber, 1973), and Tukey's biweight (Beaton & Tukey, 1974) estimators, each of which has its own loss function (see equation 2.2). We employed a simple setting in this series of experiments: g_m was affine with two inputs (with parameters sampled from U[0, 1] in each experiment), g_s was chosen to be additive, and we tried positive skews, with pm ∈ {0.55, 0.60, ..., 0.95}. We fixed n = 1 and p_1 = 0.50 in RRAT and used a well-specified f_c: f_c(w) = c + w. LS estimates and the second step in RRAT were analytically computed; the others were iteratively computed with a Polak-Ribière conjugate gradient method. The number of independent experiments was 500. The results are summarized in Figure 3. Figure 3 graphically shows the effect of asymmetric noise distributions on the performance of several M estimators compared with the LS estimator. To measure the degree of asymmetry on the horizontal axis, we employed pm, where pm = 0.5 implies symmetry and larger pm more asymmetry. On the
vertical axis, the comparative performance is measured by the difference in the logarithm of the test-sample MSE, where values above the zero line indicate that the robust estimator performed better than the LS estimator, and vice versa. Figure 3 illustrates that when pm < 0.6, LS was the best, RRAT was the second best, and the M estimators fared much worse. Under intermediate asymmetric noise distributions, where 0.6 < pm < 0.9, RRAT was the best. For 0.6 < pm < 0.8, LS was the second best, and the M estimators were much worse. When pm was around 0.85, the Huber and Tukey estimators beat LS regression but were slightly worse than RRAT. Finally, when the noise distribution is extremely asymmetric (pm = 0.95), all four M estimators beat both RRAT and LS. The results for large pm were not what we initially expected. We conjecture that in this case, the large biases in the M estimators are not as strong as the effects of the variance due to the nonrobustness of the LS estimator in the second step of RRAT.⁶ Note, however, that the variances of a log-normal whose pm ∈ {0.85, 0.90, 0.95} are, respectively, 5 × 10³, 5 × 10⁵, and 3 × 10⁷. In this series of experiments, the expectation ranges in [0.0, 3.0], which means that the noise scale is 10³ to 10⁷ times larger than the deterministic part of the system. For many applications with a signal-to-noise ratio that is not extremely small (including such noisy cases as the insurance data studied here), the M estimators are worse than the LS estimator, because of bias. In the following series of experiments, we compare RRAT only with the LS estimator. Concerning statistical significance in Figure 3, when 0.55 ≤ pm ≤ 0.75, LS was significantly (0.01 level) better than any M estimator, and when 0.55 ≤ pm ≤ 0.85, RRAT was significantly (0.01 level) better than any M estimator. The comparison between LS and RRAT is detailed in the following sections.

4.2 Experiment 2: When Does RRAT Work?
A series of experiments was designed to investigate the performance of RRAT with respect to g_m, g_s, and p_m in equation 4.2. We chose g_m either from a class of affine functions with 2, 5, or 10 inputs (with parameters sampled from U[0, 1] in each experiment) or from the class of 6-input nonlinear functions described in Friedman, Grosse, and Stuetzle (1983).⁷ g_s was chosen to be either additive (g_s(w) = 1.0), multiplicative (g_s(w) = 0.1w), or a combination of both (g_s(w) = 1.0 + 0.1w). We tried positive skews, with p_m ∈ {0.55, 0.60, …, 0.95}. We fixed n = 1 and p_1 = 0.50. In the case of affine functions, we used well-specified affine models for f_q, and in the case of the nonlinear function, we used a neural network (NN) model for f_q. (NNs can be shown to be good approximators not only for ordinary LS regression but also for quantile regression; White, 1992.) We used well-specified f_c: for additive noise, f_c(w) = c + w; for multiplicative noise, f_c(w) = dw; and for the combination, f_c(w) = c + dw. LS estimates of affine functions and the second step of RRAT are computed analytically. LS estimates of NN functions and the first step of RRAT are computed iteratively, with stochastic gradient descent for NNs⁸ and Polak-Ribière conjugate gradients for the affine classes. In this implementation, we did not use specific capacity control (no weight decay, and the number of hidden units in NNs was fixed to 10, both for the first step in RRAT and for LS). The number of independent experiments is 500 for the affine g_m and 50 for the nonlinear g_m. The results for additive noise, multiplicative noise, and their combination are summarized in Figures 4, 5, and 6, respectively. In these figures, values above the zero line indicate RRAT faring better than LS, and vice versa. In each figure, experimental curves for the 2-, 5-, and 10-dimensional affine models and for the nonlinear model are given. For more asymmetric noise (larger p_m), RRAT fares better. Note also that as the number of parameters estimated through the first step of RRAT (quantile regression f_{p_1}) increases, the relative improvement brought by RRAT over LS regression increases. The numbers of parameters of the affine models are 3, 6, and 11, respectively, and that of the NN is 81. The relative improvements decrease or vanish when the predictor is nonlinear and p_m is fairly large.

⁶ We tried the same experiments with different sizes of training samples. The comparative performance of the M estimators with respect to LS and RRAT depends highly on sample size: the overall relative performance of M estimators decreases with increasing sample size, and the p_m threshold beyond which M estimators do better decreases with increasing sample size. On the other hand, the comparative performance of RRAT versus LS did not depend on sample size. These results are consistent with our conjecture that RRAT wins over M estimators when its variance is not too large (because it is unbiased), and its variance increases when the sample size is smaller or p_m is larger. This variance is likely due to the second step of RRAT.

⁷ g_m(x) = 10 sin(π x₁x₂) + 20(x₃ − 0.5)² + 10x₄ + 5x₅ (does not depend on x₆).
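For concreteness, the two-step RRAT(1) procedure evaluated in these experiments can be sketched numerically. This is a minimal illustration with invented coefficients and additive log-normal noise: the first step here is solved exactly as the standard quantile-regression linear program (rather than by the iterative optimization used in the experiments above), and the second step is the additive LS correction of equation D.4.

```python
# Minimal RRAT(1) sketch on synthetic data (all constants are illustrative).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, p1 = 800, 5, 0.5

X = np.hstack([np.ones((n, 1)), rng.uniform(0, 1, (n, d))])
w_true = np.array([1.0, 2.0, -1.0, 3.0, 0.5, -2.0])  # invented parameters
z = rng.lognormal(0.0, 2.0, n) - np.exp(2.0)  # zero mean: E[lognormal(0,s)] = exp(s^2/2)
y = X @ w_true + z                            # additive, asymmetric heavy-tail noise

def quantile_regression(X, y, p):
    """Exact p-quantile regression via the standard LP formulation."""
    n, k = X.shape
    # variables: w (free), u_plus >= 0, u_minus >= 0 with X w + u_plus - u_minus = y
    c = np.concatenate([np.zeros(k), p * np.ones(n), (1 - p) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

# Step 1: fit the conditional p1-quantile.
w_q = quantile_regression(X, y, p1)
# Step 2: additive LS correction (equation D.4), valid for additive noise.
b_hat = np.mean(y - X @ w_q)
rrat_mean = X @ w_q + b_hat

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.sum((w_q[1:] - w_true[1:]) ** 2), np.sum((w_ls[1:] - w_true[1:]) ** 2))
```

With noise this heavy-tailed, the median-based first step is expected (by theorem 3) to recover the slope parameters far more accurately than LS, while the second step restores an unbiased estimate of the conditional mean.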
The possible explanations for this apparent inconsistency with theorem 3 are that the second-order approximation or the full-rank assumption was not satisfied, that the NN approximation violates the assumption of theorem 3 that the model is well specified, or that the number of samples is not large enough for the asymptotic behavior to apply, or all of these. For additive noise (see Figure 4), the theoretical curves derived from theorem 3 are also indicated; they almost overlap with the experimental results. Concerning statistical significance in Figures 4, 5, and 6, RRAT was significantly (0.01 level) better than LS regression when p_m ≥ 0.65, as analytically expected, except when g_m is nonlinear and p_m is 0.90 or 0.95.

4.3 Experiment 3: Which RRAT Works?

Another series of experiments was designed to study the performance of RRAT when varying choices of the
⁸ We employed stochastic gradient, which worked better than conjugate gradient for the nonlinear classes. As usual, the initial learning rate and the learning-rate decrease factor were roughly selected by a few trials on the training set.
Figure 4: Performance of RRAT compared with the LS estimator under additive noise. Note that the more asymmetric the noise distribution, the better RRAT (relatively) worked in most cases. Note also that the more parameters in f_{p_1}, the better RRAT (relatively) worked. The theoretical curves derived from theorem 3 are also indicated; they almost overlap with the empirical results.
hyperparameters n, p_1, …, p_n. We tried n ∈ {1, 2, 3} and p_i ∈ {0.2, 0.5, 0.8}. In this series of experiments, we fixed g_m as a two-dimensional affine function, g_s additive (g_s(w) = 1.0), and p_m = 0.75. F_p is the class of two-dimensional affine models, and F_c is the class of additive constant models (i.e., they are well specified). For parameter estimation, we introduced capacity control with a weight-decay parameter wd (a penalty on the 2-norm of the parameters) chosen from {10⁻⁵, 10⁻⁴, …, 10⁺⁵}. The best weight decay was chosen with another 1000 validation samples, independent of the training and test samples. From 100 independent experiments, we obtained the results summarized in Table 2, which shows the mean and its standard error of the test-sample MSE for each method. The p-values from the Wilcoxon signed rank test (null hypothesis of no difference) are also indicated. From the given p-values, it is clear that all variants of RRAT worked significantly better than LS regression. Note that the choice of p_i does not change the performance considerably. The true p_m = 0.75, but RRAT(1) with p_1 = 0.20 or 0.50 worked as well as RRAT(1) with p_1 = 0.80. On the other hand,
Figure 5: Performance of RRAT compared with the LS estimator under multiplicative noise. Note that the more asymmetric the noise distribution, the better RRAT (relatively) worked in most cases. Note also that the more parameters in f_{p_1}, the better RRAT (relatively) worked.
the choice of n changes the performance considerably. (The p-values of the significance of the difference between RRAT(1) and RRAT(2) were in the range (4.31 × 10⁻⁹, 8.65 × 10⁻⁸), those between RRAT(1) and RRAT(3) in the range (7.73 × 10⁻⁸, 1.62 × 10⁻⁷), and those between RRAT(2) and RRAT(3) in the range (3.50 × 10⁻¹, 4.67 × 10⁻¹).) As we assumed an additive noise structure here, RRAT(1) is sufficient and RRAT(n), n ≥ 2, are redundant, as explained with property 1. When the noise structure is more complicated (as with our insurance data), RRAT(1) might not be sufficient, and RRAT(n) with larger n might be more suitable.

5 Application to Insurance Premium Estimation
We applied the proposed method to a problem in automobile insurance pure-premium estimation: estimate the risk of a driver given his or her profile (e.g., age, type of car). In this application, we want the conditional expectation (and not another conditional moment) because it corresponds to the average claim amount that customers with a certain profile will make in the future (and this expectation is the main determinant of cost). One of
Figure 6: Performance of RRAT compared with the LS estimator under the combination of additive and multiplicative noise. Note that the more asymmetric the noise distribution, the better RRAT (relatively) worked in most cases. Note also that the more parameters in f_{p_1}, the better RRAT (relatively) worked, except in the nonlinear case with extremely asymmetric noise.
the challenges in this problem is that the premium must take into account the tiny fraction of drivers who cause serious accidents and file large claim amounts. That is, the data (claim amounts) have noise (unexplainable by the input, i.e., the customer's profile) distributed with an asymmetric heavy tail extending out toward positive values. We used real-world data from an insurance company. The distribution is unknown, so the models are probably not correctly specified. The number of input variables of the data set is 39, all discrete except one. A discrete variable with value i is one-hot encoded (in a vector with all entries zero except for a 1 at position i), yielding input vectors with 266 elements. We repeated the experiment 10 times, each time using an independent data set, by randomly splitting a large data set with 150,000 samples into 10 independent subsets with 15,000 samples. Each subset is then randomly split into three equal subsets with 5000 samples each, respectively for training, validation (model selection), and testing (model comparison). In the experiment, we compared RRAT and LS regression using linear and NN quantile predictors; that is, F_q is affine or an NN, fitted using conjugate gradients. Capacity is controlled by weight decay ∈ {10⁻⁵, 10⁻⁴, …, 10⁵} (and, in the case of the NN, early stopping and number of hidden units ∈ {5, 10, …, 25}), selected using the validation set. The correction-combination function f_c is always affine (also with weight decay chosen using the validation set), and its parameters are estimated analytically. We tried RRAT(n) with n = 1, 2, 3 and p_i = 0.20, 0.50, 0.80. For RRAT(1), we tried additive, affine, or quadratic correction-combination functions: f_c ∈ {c₀ + f_{p₁}, c₀ + c₁f_{p₁}, c₀ + c₁f_{p₁} + c₂f²_{p₁}}. For RRAT(n), n ≥ 2, we tried affine f_c ∈ {c₀ + c₁f_{p₁} + c₂f_{p₂} + ···}.

Table 2: The Mean and Its Standard Error for the Average MSE of Each Method.

LS regression        Mean 8.96 × 10⁻²       Standard error 1.04 × 10⁻²

RRAT(1)              p₁ = .2                p₁ = .5                p₁ = .8
Mean                 3.45 × 10⁻²            3.41 × 10⁻²            3.45 × 10⁻²
Standard error       3.54 × 10⁻³            3.51 × 10⁻³            3.52 × 10⁻³
p-value              9.76 × 10⁻¹⁵           8.31 × 10⁻¹⁵           9.25 × 10⁻¹⁵

RRAT(2)              p₁ = .2, p₂ = .5       p₁ = .2, p₂ = .8       p₁ = .5, p₂ = .8
Mean                 4.56 × 10⁻²            4.61 × 10⁻²            4.52 × 10⁻²
Standard error       4.22 × 10⁻³            4.11 × 10⁻³            4.20 × 10⁻³
p-value              1.73 × 10⁻¹¹           4.87 × 10⁻¹¹           2.34 × 10⁻¹¹

RRAT(3)              p₁ = .2, p₂ = .5, p₃ = .8
Mean                 4.66 × 10⁻²
Standard error       4.17 × 10⁻³
p-value              3.97 × 10⁻¹¹

Notes: In the tables for RRAT(n), n = 1, 2, 3, the p-values for the Wilcoxon signed rank test are also indicated. All variants of RRAT work significantly better than LS regression. Note that the choice of p_i does not change the performance considerably, but the choice of n does.

Table 3 (linear model cases) and Table 4 (NN model cases) show the p-values from the Wilcoxon signed rank test. In RRAT(1), the choice of p₁ = 0.20 does not work well, which suggests either that the underlying distribution of the data set is outside the class of equation 3.2 or that the noise structure is more complicated than those tried. When p₁ = 0.50, the choice of f_c significantly changed the performance. The worse performance of additive f_c and better performance of affine f_c suggest that the noise structure of the data set is more multiplicative than additive. When p₁ = 0.80, RRAT worked better independently of the choice of f_c, which suggests that the true p_m of the data set (if it does not vary too much with x) stays around 0.80. Note that RRAT(n), n ≥ 2, always worked better than
Table 3: Experimental Comparison of RRAT and LS Regression on Linear Predictors.

RRAT(1), f_c(f_{p₁}) = c₀ + f_{p₁}                    p₁ = .2: 2.97 × 10⁻² *      p₁ = .5: 2.34 × 10⁻² ◦      p₁ = .8: 8.30 × 10⁻³ ◦◦
RRAT(1), f_c(f_{p₁}) = c₀ + c₁f_{p₁}                  p₁ = .2: 1.01 × 10⁻¹ —      p₁ = .5: 3.72 × 10⁻² ◦      p₁ = .8: 3.46 × 10⁻³ ◦◦
RRAT(1), f_c(f_{p₁}) = c₀ + c₁f_{p₁} + c₂f²_{p₁}      p₁ = .2: 1.93 × 10⁻¹ —      p₁ = .5: 4.67 × 10⁻³ ◦◦     p₁ = .8: 2.53 × 10⁻³ ◦◦
RRAT(2), f_c(f_{p₁}, f_{p₂}) = c₀ + c₁f_{p₁} + c₂f_{p₂}        (.2, .5): 1.42 × 10⁻² ◦     (.2, .8): 3.46 × 10⁻³ ◦◦    (.5, .8): 3.46 × 10⁻³ ◦◦
RRAT(3), f_c(f_{p₁}, f_{p₂}, f_{p₃}) = c₀ + c₁f_{p₁} + c₂f_{p₂} + c₃f_{p₃}    (.2, .5, .8): 2.53 × 10⁻³ ◦◦
Best model on validation: 2.53 × 10⁻³ ◦◦

Notes: p-values from the Wilcoxon signed rank test, where * (**) denotes LS regression being significantly better than RRAT at the 0.05 (0.01) level, — denotes no significant difference between them, and ◦ (◦◦) denotes RRAT being significantly better than LS regression at the 0.05 (0.01) level. Note that RRAT(n), n ≥ 2, always beats LS regression.
LS, even though one of its components, f_{0.20}, was very poor by itself. It follows that the choice of p_i is not critical to the success of the application; we can choose several p_i and combine them. Furthermore, when we pick the best of the RRAT models (choice of n, p_i, and f_c from above) based on the validation-set average squared error, even better results are obtained.
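The paired significance tests reported in Tables 2 through 4 can be reproduced in outline as follows; the per-split MSE values below are invented for illustration and are not the paper's data.

```python
# Wilcoxon signed rank test on paired per-split test MSEs (made-up numbers).
from scipy.stats import wilcoxon

mse_ls   = [9.10, 8.72, 9.41, 8.93, 9.04, 9.35, 8.86, 9.27, 9.58, 8.69]
mse_rrat = [3.40, 3.60, 3.30, 3.50, 3.40, 3.70, 3.20, 3.60, 3.50, 3.30]

# Null hypothesis: no difference between the paired samples.
stat, p_value = wilcoxon(mse_ls, mse_rrat)
print(p_value)
```

Because the test uses only signs and ranks of the paired differences, it remains valid under the heavy-tailed, asymmetric errors considered here.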
6 Conclusion
We have introduced a new robust regression method, RRAT, that is well suited to asymmetric heavy-tail unknown noise distributions (where previous robust regression methods are not well suited). It can be applied to both linear and nonlinear regressions. A large class of generating models
Table 4: Experimental Comparison of RRAT and LS Regression on NN Predictors.

RRAT(1), f_c(f_{p₁}) = c₀ + f_{p₁}                    p₁ = .2: 1.83 × 10⁻² *      p₁ = .5: 1.66 × 10⁻¹ —      p₁ = .8: 1.09 × 10⁻² ◦
RRAT(1), f_c(f_{p₁}) = c₀ + c₁f_{p₁}                  p₁ = .2: 3.99 × 10⁻¹ —      p₁ = .5: 4.67 × 10⁻³ ◦◦     p₁ = .8: 1.83 × 10⁻² ◦
RRAT(1), f_c(f_{p₁}) = c₀ + c₁f_{p₁} + c₂f²_{p₁}      p₁ = .2: 4.39 × 10⁻¹ —      p₁ = .5: 4.67 × 10⁻³ ◦◦     p₁ = .8: 1.42 × 10⁻² ◦
RRAT(2), f_c(f_{p₁}, f_{p₂}) = c₀ + c₁f_{p₁} + c₂f_{p₂}        (.2, .5): 3.46 × 10⁻³ ◦◦    (.2, .8): 1.42 × 10⁻² ◦     (.5, .8): 6.26 × 10⁻³ ◦◦
RRAT(3), f_c(f_{p₁}, f_{p₂}, f_{p₃}) = c₀ + c₁f_{p₁} + c₂f_{p₂} + c₃f_{p₃}    (.2, .5, .8): 8.30 × 10⁻³ ◦◦
Best model on validation: 4.67 × 10⁻³ ◦◦

Notes: p-values from the Wilcoxon signed rank test, where * (**) denotes LS regression being significantly better than RRAT at the 0.05 (0.01) level, — denotes no significant difference between them, and ◦ (◦◦) denotes RRAT being significantly better than LS regression at the 0.05 (0.01) level. Note that RRAT(n), n ≥ 2, always beats LS regression.
for which a universal approximation property holds has been characterized (and it includes additive and multiplicative noise with arbitrary conditional expectation). An analysis of known asymmetric heavy-tail noise distributions reveals when the proposed method beats least-squares regression and M estimators. The proposed method has been tested and analyzed on both synthetic data and insurance data (the motivation for this research), showing it to outperform both least-squares regression and M estimators significantly over a wide range of asymmetric distributions. Future work should investigate estimators of confidence intervals around the RRAT estimator (e.g., using the bootstrap or other resampling methods). Another important direction of future work is understanding and improving the behavior of nonlinear predictors for a very small signal-to-noise ratio (large p_m).
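One simple form the suggested resampling could take is a percentile bootstrap; the sketch below uses plain LS on invented toy data as a stand-in for the RRAT estimator.

```python
# Percentile-bootstrap confidence interval for a regression slope
# (hypothetical toy data; LS stands in for the estimator of interest).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x + (rng.lognormal(0.0, 1.0, n) - np.exp(0.5))  # skewed noise

def fit_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

boot = []
for _ in range(500):
    idx = rng.integers(0, n, n)          # resample (x, y) pairs with replacement
    boot.append(fit_slope(x[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(lo, hi)
```

Resampling whole (x, y) pairs avoids any distributional assumption on the noise, which matters precisely in the heavy-tailed settings this paper targets.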
Appendix A: Proof of Theorem 1
For all x, f_{p₁}(x) can be represented as a function of f_{p_m}(x) as follows:

$$
\begin{aligned}
\forall x,\quad f_{p_1}(x) &= f_{p_m}(x) - \Big( F^{-1}_{Z\cdot g_s\{g_m(x)\}}(p_m) - F^{-1}_{Z\cdot g_s\{g_m(x)\}}(p_1) \Big) \\
&= f_{p_m}(x) + F^{-1}_{Z\cdot g_s\{g_m(x)\}}(p_1) \\
&= f_{p_m}(x) + F_Z^{-1}(p_1)\cdot g_s\{g_m(x)\} \\
&= g_m(x) + F_Z^{-1}(p_1)\cdot g_s\{g_m(x)\},
\end{aligned}
\tag{A.1}
$$

where F_W^{-1}(p) is the pth quantile of the random variable W. In equation A.1, from the first line to the second, we used E[Z g_s{g_m(X)} | X] = 0. From the second to the third, we used the following property of cdfs: if F_{aW}(av) = F_W(v), then F_{aW}^{-1}(p) = a · F_W^{-1}(p), where a > 0 and 0 < p < 1 are constants and W is a random variable drawn from a continuous distribution. From the third to the fourth, we used f_{p_m}(x) ≡ g_m(x) for all x. If the monotonicity of h(ȳ) in the theorem holds, there is a one-to-one mapping between f_{p₁}(x) and g_m(x) (= E[Y | x]). It follows that there exists a function f_c such that E[Y | X] = f_c(f_{p₁}(X)).
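The two quantile facts used in the steps of equation A.1 (translation equivariance and the scaling property F_{aW}^{-1}(p) = a · F_W^{-1}(p)) are easy to confirm numerically; all numbers below are made up for illustration.

```python
# Numerical check of the quantile translation and scaling properties.
import numpy as np

rng = np.random.default_rng(0)
z = rng.lognormal(0.0, 1.0, 100_000) - np.exp(0.5)  # zero-mean, skewed noise
p, g, a = 0.3, 2.0, 1.7                             # arbitrary illustrative values

shift_lhs = np.quantile(g + z, p)   # quantile of g_m(x) + Z
shift_rhs = g + np.quantile(z, p)
scale_lhs = np.quantile(a * z, p)   # quantile of a*Z, with a > 0
scale_rhs = a * np.quantile(z, p)
```

Both identities hold exactly (up to floating-point rounding), since the interpolated empirical quantile is equivariant under positive affine transformations of the sample.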
Appendix B: Proof of Theorem 2
As in equation A.1 in the proof of theorem 1, f_{p₁}(x) and f_{p₂}(x) are given, for all x, by

$$ f_{p_1}(x) = g_m(x) + F_Z^{-1}(p_1)\cdot g_s(g_m(x)), \tag{B.1} $$
$$ f_{p_2}(x) = g_m(x) + F_Z^{-1}(p_2)\cdot g_s(g_m(x)). \tag{B.2} $$

Therefore, the proof of theorem 1 already shows the existence of the mapping from g_m(x) to {f_{p₁}(x), f_{p₂}(x)}, with equations B.1 and B.2. Here we want to show that the mapping is one-to-one for all x. Assume that it is not one-to-one; then there is at least one case where two different values ȳ and ȳ′ are mapped to identical {f_{p₁}(x), f_{p₂}(x)}, that is,

$$ \bar y + F_Z^{-1}(p_1)\cdot g_s(\bar y) = \bar y' + F_Z^{-1}(p_1)\cdot g_s(\bar y'), \tag{B.3} $$
$$ \bar y + F_Z^{-1}(p_2)\cdot g_s(\bar y) = \bar y' + F_Z^{-1}(p_2)\cdot g_s(\bar y'). \tag{B.4} $$

Dividing both sides of equations B.3 and B.4 by F_Z^{-1}(p₁) and F_Z^{-1}(p₂), respectively (from the continuity of Z and equation 3.4, F_Z^{-1}(p₁) and F_Z^{-1}(p₂) are not zero), and subtracting equation B.3 from B.4 yields

$$ \left( \frac{1}{F_Z^{-1}(p_1)} - \frac{1}{F_Z^{-1}(p_2)} \right) (\bar y - \bar y') = 0. \tag{B.5} $$

From the continuity of Z and equation 3.4, 1/F_Z^{-1}(p₁) ≠ 1/F_Z^{-1}(p₂), and from the assumption ȳ ≠ ȳ′, we find that the assumption that the mapping between g_m(x) and {f_{p₁}(x), f_{p₂}(x)} is not one-to-one is false. It follows that there exists a function f_c such that E[Y | X] = f_c(f_{p₁}(X), f_{p₂}(X)).
Appendix C: Proof of Property 1
Applying the g_s of equation 3.5 to equation A.1, we get

$$ f_{p_1}(x) = g_m(x) + F_Z^{-1}(p_1)\cdot\big(c + d\cdot g_m(x)\big). \tag{C.1} $$

We can obtain an exactly linear function f_c,

$$ g_m(x) = f_c\big(f_{p_1}(x)\big) = \frac{-F_Z^{-1}(p_1)\cdot c}{1 + F_Z^{-1}(p_1)\cdot d} + \frac{1}{1 + F_Z^{-1}(p_1)\cdot d}\cdot f_{p_1}(x), \tag{C.2} $$
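A quick empirical check of this linear correction (with made-up constants; the recovery requires c + d·g_m(x) > 0 so that the noise scale is positive):

```python
# Verify numerically that g_m(x) = (f_p1(x) - F^{-1}(p1)*c) / (1 + F^{-1}(p1)*d)
# for g_s(w) = c + d*w, using empirical quantiles at a fixed x.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)    # any zero-mean continuous noise works
g, c, d, p1 = 2.0, 1.0, 0.5, 0.8    # illustrative values; note c + d*g > 0

y = g + z * (c + d * g)             # samples of Y at this fixed x
f_p1 = np.quantile(y, p1)           # empirical conditional p1-quantile
Fq = np.quantile(z, p1)             # empirical F_Z^{-1}(p1)
recovered = (f_p1 - Fq * c) / (1.0 + Fq * d)
print(recovered)                    # should recover g_m(x) = 2.0
```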
except when the denominator appearing on the right-hand side of equation C.2 is zero.⁹

⁹ This problem can be solved by choosing another order p₂ ≠ p₁. In applying this method, we suggest trying several orders p₁, p₂, …, p_n and choosing the best one, or using RRAT(n), n ≥ 2. This is demonstrated in sections 4 and 5 for the application to insurance premium estimation.

Appendix D: Proof of Theorem 3

We consider a class of regression problems of the following form:

$$ Y = f(X; \alpha^*) + \beta^* + Z, \tag{D.1} $$

where X and Y are the input and output random variables, respectively, f is an arbitrary function characterized by a set of parameters α* ∈ R^d, and β* ∈ R is a scalar parameter. Z is the (zero-mean) noise random variable, drawn from a (possibly) asymmetric heavy-tail distribution, independent of X. We assume that the model is well specified, that is, that the true function f(x; α*) + β* is a member of the class of parametric models M = { f(x; α) + β : α ∈ H ⊂ R^d, β ∈ R }, and that the training data are independently and identically distributed. The risk of the model estimated by LS regression is given by a standard asymptotic calculation:

$$ \mathrm{risk}_{LS} = E_{Data}\Big[ E_X\big\{ \big( f(x;\alpha^*) + \beta^* - f(x;\hat\alpha_{LS}) - \hat\beta_{LS} \big)^2 \big\} \Big] = \frac{1}{n}\,\{ \mathrm{Var}(Z) + d\,\mathrm{Var}(Z) \} + o\!\Big(\frac{1}{n}\Big), \tag{D.2} $$
where n is the number of training data and α̂_LS, β̂_LS denote the corresponding parameters estimated by LS regression. As explained, RRAT provides the following estimates:

$$ \{\hat\alpha_{RRAT}, \hat c_{RRAT}\} = \arg\min_{\alpha,\,c} \Big\{ \sum_{i:\, y_i \ge f(x_i;\alpha)+c} p_1\,| y_i - f(x_i;\alpha) - c | \;+\; \sum_{i:\, y_i < f(x_i;\alpha)+c} (1-p_1)\,| y_i - f(x_i;\alpha) - c | \Big\}, \tag{D.3} $$

$$ \hat\beta_{RRAT} = \frac{1}{n}\sum_{i=1}^{n} \big\{ y_i - f(x_i;\hat\alpha_{RRAT}) \big\}. \tag{D.4} $$
Using these estimated parameters, the risk of the model estimated by RRAT is given by

$$ \mathrm{risk}_{RRAT} = E_{Data}\Big[ E_X\big\{ \big( f(x;\alpha^*) + \beta^* - f(x;\hat\alpha_{RRAT}) - \hat\beta_{RRAT} \big)^2 \big\} \Big] = \frac{1}{n}\Big\{ \mathrm{Var}(Z) + d\cdot\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)} \Big\} + o\!\Big(\frac{1}{n}\Big), \tag{D.5} $$

where P_Z and F_Z are the pdf and cdf of the noise distribution. From equations D.2 and D.5, RRAT yields more efficient estimates than LS regression when

$$ \mathrm{Var}(Z) > \frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}. \tag{D.6} $$

To prove equation D.5, let us introduce some notation. We define |w|₊ = max(0, w), and we omit the subscript RRAT on the estimated parameters.
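As a quick sanity check of the condition in equation D.6 (with illustrative distributions, not the paper's experimental settings), one can evaluate the right-hand side at p₁ = 0.5:

```python
# Evaluate p1*(1-p1) / P_Z(F_Z^{-1}(p1))^2 at p1 = 0.5 for two zero-mean noises.
import math

def rhs(p, pdf_at_quantile):
    return p * (1 - p) / pdf_at_quantile ** 2

# Standard gaussian: density at the median is 1/sqrt(2*pi) and Var(Z) = 1,
# so the condition fails (pi/2 > 1) and LS is more efficient.
rhs_gauss = rhs(0.5, 1.0 / math.sqrt(2.0 * math.pi))   # = pi/2

# Log-normal(0, 1) shifted to zero mean: the density at its median equals the
# log-normal pdf at 1, also 1/sqrt(2*pi), while Var(Z) = (e - 1)*e ~ 4.67,
# so the condition holds and RRAT(1) with p1 = 0.5 is more efficient.
var_lognormal = (math.e - 1.0) * math.e
```

The Gaussian case recovers the classical mean-versus-median efficiency ratio, while even a moderately skewed log-normal already satisfies the condition.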
Asymmetric Fat-Tail Robust Regression
First, we define the matrix K as

$$ K = \int \begin{pmatrix} \frac{d}{d\alpha} f(x;\alpha^*)\,\frac{d}{d\alpha} f(x;\alpha^*)' & \frac{d}{d\alpha} f(x;\alpha^*) \\ \frac{d}{d\alpha} f(x;\alpha^*)' & 1 \end{pmatrix} p(x)\,dx = \begin{pmatrix} M_2 & m_1 \\ m_1' & 1 \end{pmatrix}, $$

where M₂ is a d × d matrix and m₁ is a d × 1 vector. The risk of RRAT is written as

$$ \mathrm{risk} = \mathrm{Tr}\big\{ \mathrm{Var}(\hat\alpha, \hat\beta)\, K \big\} + o\!\Big(\frac{1}{n}\Big). $$

To obtain the variance of α̂, we can use the following relation between the variance and the influence function (Hampel et al., 1986):

$$ \lim_{n\to\infty} n\,\mathrm{Var}(\hat\alpha, \hat c) = \int IF(x,y)\,IF(x,y)'\,p(y\,|\,x)\,p(x)\,dy\,dx, $$

where IF(x, y) is the influence function of the estimator (α̂, ĉ). The influence function is defined as

$$ IF(\tilde x, \tilde y) = \lim_{\kappa\to 0} \frac{(\alpha_\kappa, c_\kappa) - (\alpha^*, c^*)}{\kappa}, \tag{D.7} $$

where (α_κ, c_κ) is given by minimizing the following with respect to (α, c):

$$ (1-\kappa) \int \big\{ p_1\,| y - f(x;\alpha) - c |_+ + (1-p_1)\,| f(x;\alpha) + c - y |_+ \big\}\, p(y\,|\,x)\,p(x)\,dy\,dx \;+\; \kappa\,\big\{ p_1\,| \tilde y - f(\tilde x;\alpha) - c |_+ + (1-p_1)\,| f(\tilde x;\alpha) + c - \tilde y |_+ \big\}. $$

To obtain the influence function, we use d|x|₊/dx = σ(x)¹⁰ and dσ(x)/dx = δ(x).¹¹ Using these relations, we obtain (α_κ, c_κ) as

$$ (\alpha_\kappa, c_\kappa)' = (\alpha^*, c^*)' - \kappa\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)}\,\big\{ 1 - p_1 - \sigma\big( \tilde y - f(\tilde x;\alpha^*) - c^* \big) \big\}\, K^{-1} \begin{pmatrix} \frac{d}{d\alpha} f(\tilde x;\alpha^*) \\ 1 \end{pmatrix} + o(\kappa). \tag{D.8} $$

Then we obtain the influence function as

$$ IF(\tilde x, \tilde y) = -\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)}\,\big\{ 1 - p_1 - \sigma\big( \tilde y - f(\tilde x;\alpha^*) - c^* \big) \big\}\, K^{-1} \begin{pmatrix} \frac{d}{d\alpha} f(\tilde x;\alpha^*) \\ 1 \end{pmatrix}. $$

¹⁰ σ(x) is 1 when x ≥ 0 and 0 when x < 0.
¹¹ δ(x) is Dirac's delta function.
Then we find that the covariance matrix of the estimator (α̂, ĉ) is written as

$$ \mathrm{Var}(\hat\alpha, \hat c) = \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\,K^{-1} + o\!\Big(\frac{1}{n}\Big). $$

Here we decompose the matrix K⁻¹ as

$$ K^{-1} = \begin{pmatrix} H & t \\ t' & u \end{pmatrix}. $$

Then the variance matrix of α̂ is written as

$$ \mathrm{Var}(\hat\alpha) = \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\,H + o\!\Big(\frac{1}{n}\Big). \tag{D.9} $$
Next we calculate the variance of β̂ (written as Var(β̂)) in equation D.4, which is decomposed as follows:

$$ \mathrm{Var}(\hat\beta) = \mathrm{Var}\Big( \frac{1}{n}\sum_{i=1}^{n} \big( y_i - f(x_i;\hat\alpha) \big) \Big) = \frac{1}{n^2}\sum_{i,j} E_{Data}\{ z_i z_j \} \tag{D.10} $$
$$ \qquad + \frac{1}{n^2}\sum_{i,j} E_{Data}\Big\{ (\hat\alpha-\alpha^*)' \frac{d}{d\alpha} f(x_i;\alpha^*)\; (\hat\alpha-\alpha^*)' \frac{d}{d\alpha} f(x_j;\alpha^*) \Big\} \tag{D.11} $$
$$ \qquad - \frac{2}{n^2}\sum_{i,j} E_{Data}\Big\{ z_i\,(\hat\alpha-\alpha^*)' \frac{d}{d\alpha} f(x_j;\alpha^*) \Big\} \tag{D.12} $$
$$ \qquad - \frac{2}{n^2}\sum_{i,j} E_{Data}\Big\{ z_i\,(\hat\alpha-\alpha^*)' \frac{\partial^2 f(x_j;\alpha^*)}{\partial\alpha^2}\,(\hat\alpha-\alpha^*) \Big\} + o\!\Big(\frac{1}{n}\Big). \tag{D.13} $$

The first term, equation D.10, is equal to Var(z)/n. The second term, equation D.11, is calculated as follows. First,

$$ E_{z|X}\big\{ (\hat\alpha-\alpha^*)(\hat\alpha-\alpha^*)' \big\} = \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\,H + o_p\!\Big(\frac{1}{n}\Big), \tag{D.14} $$
where o_p(·) is the probabilistic order with respect to p(x₁, …, x_n). Substituting equation D.14 into D.11, we obtain

$$
\begin{aligned}
\frac{1}{n^2}\sum_{i,j} E_{Data}\Big\{ (\hat\alpha-\alpha^*)' \frac{d}{d\alpha} f(x_i;\alpha^*)\, (\hat\alpha-\alpha^*)' \frac{d}{d\alpha} f(x_j;\alpha^*) \Big\}
&= \frac{1}{n^2}\sum_{i,j} \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\, \mathrm{Tr}\Big\{ H\, E_X\Big[ \frac{d}{d\alpha} f(x_i;\alpha^*)\,\frac{d}{d\alpha} f(x_j;\alpha^*)' \Big] \Big\} + o\!\Big(\frac{1}{n}\Big) \\
&= \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\, m_1' H m_1 + o\!\Big(\frac{1}{n}\Big).
\end{aligned}
\tag{D.15}
$$
The third term, equation D.12, is calculated as follows. First, we calculate the conditional expectation E_{z|X}{ z_i (α̂ − α*) }. We define (α̂⁽ⁱ⁾, ĉ⁽ⁱ⁾) as the estimator obtained from all the training data except (x_i, y_i). By definition, (α̂⁽ⁱ⁾, ĉ⁽ⁱ⁾) is independent of z_i. By simple calculation we find

$$ \begin{pmatrix} \hat\alpha \\ \hat c \end{pmatrix} - \begin{pmatrix} \hat\alpha^{(i)} \\ \hat c^{(i)} \end{pmatrix} = -\frac{1}{n}\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)}\,\big\{ 1 - p_1 - \sigma\big( y_i - f(x_i;\hat\alpha) - \hat c \big) \big\}\, K^{-1} \begin{pmatrix} \frac{d}{d\alpha} f(x_i;\alpha^*) \\ 1 \end{pmatrix} + o_p\!\Big(\frac{1}{n}\Big). \tag{D.16} $$

Then the difference between α̂ and α̂⁽ⁱ⁾ gives

$$ \hat\alpha - \alpha^* = \hat\alpha^{(i)} - \alpha^* - \frac{1}{n}\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)}\,\big\{ 1 - p_1 - \sigma\big( y_i - f(x_i;\hat\alpha) - \hat c \big) \big\} \Big( H\,\frac{d}{d\alpha} f(x_i;\alpha^*) + t \Big) + o_p\!\Big(\frac{1}{n}\Big). \tag{D.17} $$

Substituting equation D.17 into E_{z|X}{ z_i (α̂ − α*) }, we obtain

$$ E_{z|X}\{ z_i (\hat\alpha - \alpha^*) \} = \frac{1}{n}\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)} \Big( \int_0^{\infty} \tilde z\, p_z(\tilde z)\, d\tilde z \Big) \Big( H\,\frac{d}{d\alpha} f(x_i;\alpha^*) + t \Big) + o_p\!\Big(\frac{1}{n}\Big). \tag{D.18} $$
Then we obtain

$$
\begin{aligned}
-\frac{2}{n^2}\sum_{i,j} E_{Data}\Big\{ z_i\,(\hat\alpha-\alpha^*)' \frac{d}{d\alpha} f(x_j;\alpha^*) \Big\}
&= -\frac{2}{n^2}\,\frac{1}{n}\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)} \Big( \int_0^{\infty} \tilde z\, p_z(\tilde z)\, d\tilde z \Big) \sum_{i,j} E_X\,\mathrm{Tr}\Big\{ \Big( H\,\frac{d}{d\alpha} f(x_i;\alpha^*) + t \Big) \frac{d}{d\alpha} f(x_j;\alpha^*)' \Big\} + o\!\Big(\frac{1}{n}\Big) \\
&= -\frac{2}{n}\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)} \Big( \int_0^{\infty} \tilde z\, p_z(\tilde z)\, d\tilde z \Big) \big( m_1' H m_1 + m_1' t \big) + o\!\Big(\frac{1}{n}\Big) \\
&= o\!\Big(\frac{1}{n}\Big).
\end{aligned}
\tag{D.19}
$$
The last equality follows from the definition of H and t; namely, one finds that Hm₁ + t = 0. The fourth term, equation D.13, is also o(1/n), which can be verified by substituting equation D.17 into D.13. Then, from the previous discussion, we obtain the variance of β̂ as

$$ \mathrm{Var}(\hat\beta) = \frac{\mathrm{Var}(Z)}{n} + \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\, m_1' H m_1 + o\!\Big(\frac{1}{n}\Big). \tag{D.20} $$

Next, we calculate the covariance between α̂ and β̂. β̂ − β* is written as

$$ \hat\beta - \beta^* = \frac{1}{n}\sum_{i=1}^{n} z_i - \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\alpha} f(x_i;\alpha^*)'\, (\hat\alpha - \alpha^*) + o_p\!\Big(\frac{1}{n}\Big). \tag{D.21} $$

Substituting this into E_Data{ (β̂ − β*)(α̂ − α*) }, we obtain

$$
\begin{aligned}
E_{Data}\{ (\hat\beta - \beta^*)(\hat\alpha - \alpha^*) \}
&= \frac{1}{n}\sum_{i=1}^{n} E_X\big\{ E_{z|X}\{ z_i (\hat\alpha - \alpha^*) \} \big\} - \frac{1}{n}\sum_{i=1}^{n} E_X\Big\{ E_{z|X}\{ (\hat\alpha - \alpha^*)(\hat\alpha - \alpha^*)' \}\, \frac{d}{d\alpha} f(x_i;\alpha^*) \Big\} + o\!\Big(\frac{1}{n}\Big) \\
&= \frac{1}{n}\,\frac{1}{P_Z\big(F_Z^{-1}(p_1)\big)} \Big( \int_0^{\infty} \tilde z\, p_z(\tilde z)\, d\tilde z \Big) ( H m_1 + t ) - \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\, H m_1 + o\!\Big(\frac{1}{n}\Big) \\
&= -\frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\, H m_1 + o\!\Big(\frac{1}{n}\Big),
\end{aligned}
\tag{D.22}
$$

again using Hm₁ + t = 0.
Now we have all the elements for calculating the risk of the estimator (α̂, β̂). The variance of (α̂, β̂) is given as

$$ \mathrm{Var}(\hat\alpha, \hat\beta) = \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)} \begin{pmatrix} H & -H m_1 \\ -m_1' H & \dfrac{P_Z^2\big(F_Z^{-1}(p_1)\big)}{p_1(1-p_1)}\,\mathrm{Var}(Z) + m_1' H m_1 \end{pmatrix} + o\!\Big(\frac{1}{n}\Big). \tag{D.23} $$

On the other hand, the risk is calculated as Tr{Var(α̂, β̂) K} + o(1/n). The risk is then as follows:

$$
\begin{aligned}
\mathrm{Tr}\big\{ \mathrm{Var}(\hat\alpha, \hat\beta)\, K \big\}
&= \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)}\, \mathrm{Tr}\left\{ \begin{pmatrix} H & -H m_1 \\ -m_1' H & \dfrac{P_Z^2(F_Z^{-1}(p_1))}{p_1(1-p_1)}\,\mathrm{Var}(Z) + m_1' H m_1 \end{pmatrix} \begin{pmatrix} M_2 & m_1 \\ m_1' & 1 \end{pmatrix} \right\} \\
&= \frac{1}{n}\,\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)} \Big\{ \mathrm{Tr}(H M_2) - m_1' H m_1 + \frac{P_Z^2\big(F_Z^{-1}(p_1)\big)}{p_1(1-p_1)}\,\mathrm{Var}(Z) \Big\} \\
&= \frac{1}{n} \Big\{ \mathrm{Var}(Z) + d\cdot\frac{p_1(1-p_1)}{P_Z^2\big(F_Z^{-1}(p_1)\big)} \Big\},
\end{aligned}
\tag{D.24}
$$

where we use the relation

$$ H = M_2^{-1} + \frac{1}{1 - m_1' M_2^{-1} m_1}\, M_2^{-1} m_1 m_1' M_2^{-1}, $$

so that Tr(H M₂) − m₁' H m₁ = Tr{ H (M₂ − m₁ m₁') } = d. Thus we obtain the assertion of equation D.5.
Acknowledgments
We thank Léon Bottou for his stimulating suggestions, as well as the following agencies for funding: NSERC, MITACS, and JSPS. Most of this work was done while I. T. and T. K. were at Université de Montréal.

References

Antle, C. (1985). Lognormal distribution. In Encyclopedia of statistical sciences (Vol. 5, pp. 134-136). New York: Wiley.
Beaton, A., & Tukey, J. (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16, 147-185.
Bickel, P. (1973). On some analogues to linear combinations of order statistics in the linear model. Annals of Statistics, 1, 597-616.
Friedman, J., Grosse, E., & Stuetzle, W. (1983). Multidimensional additive spline approximation. SIAM Journal on Scientific and Statistical Computing, 4(2), 291-301.
Greene, W. (1997). Econometric analysis (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Huber, P. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Annals of Statistics, 1, 799-821.
Huber, P. (1982). Robust statistics. New York: Wiley.
Koenker, R., & Bassett, G., Jr. (1978). Regression quantiles. Econometrica, 46(1), 33-50.
Krasker, W., & Welsch, R. (1982). Efficient bounded-influence regression estimation. Journal of the American Statistical Association, 77, 595-604.
Lehmann, E. (1983). Theory of point estimation. New York: Wiley.
Rousseeuw, P., & Leroy, A. (1987). Robust regression and outlier detection. New York: Wiley.
White, H. (1992). Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the 23rd Symposium on the Interface, Computer Science and Statistics (pp. 190-199). New York: Springer-Verlag.
Received August 6, 2001; accepted April 4, 2002.
LETTER
Communicated by Robert Jacobs
Many-Layered Learning

Paul E. Utgoff
[email protected]
David J. Stracuzzi
[email protected]
Department of Computer Science, University of Massachusetts, Amherst, MA 01003, U.S.A.

We explore incremental assimilation of new knowledge by sequential learning. Of particular interest is how a network of many knowledge layers can be constructed in an on-line manner, such that the learned units represent building blocks of knowledge that serve to compress the overall representation and facilitate transfer. We motivate the need for many layers of knowledge, and we advocate sequential learning as an avenue for promoting the construction of layered knowledge structures. Finally, our novel STL algorithm demonstrates a method for simultaneously acquiring and organizing a collection of concepts and functions as a network from a stream of unstructured information.

1 Introduction
Neural Computation 14, 2497-2529 (2002) © 2002 Massachusetts Institute of Technology

Learning is an essential element of intelligent behavior. We know that a human cannot learn an arbitrary piece of knowledge at any time. Instead, one is receptive to those ideas that would not be too difficult to learn with a reasonably small amount of effort. Other ideas remain unfathomable and distant until the agent's knowledge develops further, rendering such formerly difficult knowledge now simple enough to absorb. This is the starting point for our discussion: that knowledge accumulates indefinitely, seemingly as a result of a very basic kind of learning mechanism. Our discussion focuses on possible computational processes that can model long-term layered learning. Knowledge that could be acquired readily on presentation constitutes a frontier of receptivity, and that which has already been learned by the agent provides a basis on which to assimilate new knowledge. As currently simple knowledge is assimilated, the frontier of receptivity advances, improving the basis for further understanding of currently complex knowledge. We explore the idea that knowledge can accumulate incrementally in a virtually unbounded number of layers, and we refer to this view and its approaches as many-layered learning. How can an agent process its input stream so that it structures its knowledge in a usefully layered
organization, and how can it do so over long periods of time, say, measured in decades? We discuss why many layers of knowledge are necessary for learning nontrivial concepts. After this background perspective, we illustrate several important points with a concrete example. One of these points is that layered learning can benefit from an input stream that is the result of an organized curriculum. The second is that simple learning mechanisms can drive a knowledge-organization process very effectively. An important conclusion is that it is possible to design algorithms that model sequential learning of a large number of interdependent concepts over a long period of time.

2 Background and Motivation
We proceed with a discussion of why a many-layered knowledge representation is essential for maximizing knowledge compression and hence generalization. Then we comment on common approaches as practiced in artificial neural network learning. Following that is a short review of recent work that considers how to model learning of many layers of knowledge.

2.1 Learnability and Compression. Learning is a process of compressing observations and experiences into a form that can be applied advantageously thereafter. A general statement or hypothesis may explain a great many observations succinctly, and because it exploits regularity to achieve compression, it will likely be an excellent predictor of future events (Rissanen & Langdon, 1979). To the extent that a hypothesis is a correct theory, it can help the agent to predict consequences and therefore improve the agent's projective reasoning. Structural and procedural knowledge can each be compressed, and this has important implications not only for space consumption but also for time consumption and learnability. One critical means of achieving compactness is to refer to previously acquired knowledge whenever possible rather than replicate it in place. One finds this notion in structured programming: coding useful procedures or functions and then referring to them where needed. This leads to great compression of executable code. It also results in large coding efficiencies, as the functionality needed in multiple locales is produced and debugged independently just once and is used by reference thereafter. Indeed, this approach to modular programming led to the notion of data abstraction, the sharing of code libraries, and the general elevation of the functionality of programmable machines. There is no arbitrary or practical constraint placed on the depth of the functional nesting. Shapiro (1987) applied this model of structured programming to learning, calling it structured induction.
He saw that reuse of knowledge facilitated compression and that a learner could benefit by applying this idea to classification tasks. By decomposing a learning problem into learning subproblems, one can learn the subproblems individually (in a context-independent
Many-Layered Learning
2499
setting) and then return to a higher-level learning problem at a workable level of abstraction. For example, in learning whether a king-pawn-versus-king chess position can be won, one of the important criteria is whether the pawn can outrun the opposing king to the far side of the board in order to become a queen without being captured. This is itself a boolean predicate to be learned, which discriminates the “can outrun” from the “cannot outrun” positions. Upon learning this concept (Bruner, Goodnow, & Austin, 1956) of outrun, it can be used as a primitive in the original learning problem. This is a very powerful approach with respect to producing compression. An additional important view of this process is that the outrun predicate is a boolean feature that is true or false of a position, giving a new dimension for discrimination. As with modular programming, there is no arbitrary limit on the number of layers of knowledge nested in this manner. Shapiro demonstrated dramatic improvements in compression compared to learning the same classification task without a structural decomposition. This is much more than a matter of saving space. The total time required to learn the outrun concept and the can-win concept is very much less than the total time needed to learn the can-win concept without the decomposition. The concept of outrun is independent of the larger problem. It can be learned once, free of the contexts in which it appears, and then be reused as needed within a variety of contexts. Otherwise, the equivalent functionality of the outrun concept must be learned in each and every context in which it appears. The concept of outrun constitutes a building block because it is an element of knowledge that can be used to simplify learning an expression of higher-level knowledge (predicates and functions) in more than one context.
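As an illustration of what such a building block might look like, here is a hypothetical sketch of our own (not from the text) of an outrun predicate, based on the simplified "rule of the square": the pawn outruns the king when the king cannot reach the promotion square in time. The function name, coordinates, and simplifications are all our assumptions.

```python
# Hypothetical sketch of an "outrun" building-block predicate (names and
# simplifications are ours): can a white pawn reach the eighth rank before
# the black king catches it? A simplified "rule of the square" check that
# ignores the pawn's optional double first step and any interference from
# the other pieces.
def outrun(pawn_file, pawn_rank, king_file, king_rank, white_to_move=True):
    pawn_moves = 8 - pawn_rank                    # steps to the eighth rank
    king_moves = max(abs(king_file - pawn_file),  # king distance to the
                     abs(king_rank - 8))          # promotion square (Chebyshev)
    if not white_to_move:
        king_moves -= 1                           # the king gets a head start
    return pawn_moves < king_moves
```

The point is not the chess rule itself but that the learned predicate supplies a new boolean input dimension for the higher-level can-win learning problem.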
Pagallo and Haussler’s (1990) FRINGE algorithm attempted to find useful subconcepts by searching for the pathological effects of omitting them. Whereas Shapiro provided the task (concept) decomposition by hand, Pagallo did not. She noticed that without decomposition, knowledge would replicate itself. In particular, fragments of replicated structure can be observed at the fringe of a decision tree. By reducing each smallest replicated subtree to a boolean function and then rebuilding the tree with that function as a new feature (variable), a more compact tree would often result. In this way, important subconcepts could be identified automatically in an iterative manner. However, a significant difficulty is that one must see enough data to cause the replicated subtrees to form in each context. Thus, one must first suffer the consequences before obtaining the benefit. This nevertheless remains a promising direction for further research. Reducing replicated structure is a general approach to compression. Cook and Holder’s (1994) SUBDUE system induces a graph grammar, guided by the minimum-description-length compression measure, beam search, and background knowledge. There is no arbitrary depth limit for the grammar. These authors mention specifically the important idea of building block knowledge—for example, a benzene ring that was identified in a chemistry
2500
Paul E. Utgoff and David J. Stracuzzi
application. One achieves compression by being able to reference something by name more than enough times to overcome the small cost of maintaining a name for that substructure. Zupan, Bohanec, Demšar, and Bratko’s (1999) HINT algorithm searches for a decomposition of the partial function indicated by a collection of labeled training instances. The algorithm considers limited subsets of the variables and subfunctions of those variables, picking the configuration that compresses the data best. The algorithm locks into each such decomposition step in a greedy manner. This approach is designed to find a functional decomposition, which differs from grammatical structure replacement rules because a new function definition is synthesized. Although there are practical limitations on the number of arguments of the decomposed functions, there are no constraints on the depth of the decomposition (layering). In summary, one important avenue for achieving compression is to avoid replication of knowledge. This can be implemented by storing an element of knowledge as a single definition of some kind, such as a procedure or a function or a concept, and then referring to that element as needed. Generalization includes not only the process of grouping and abstracting data elements in the classical sense, but also, very importantly, the process of organizing knowledge into usefully referenceable entities that can serve as building blocks. One would like to benefit from the widest applicability and reusability of the knowledge. Composition of individual knowledge elements facilitates compression.

2.2 Artificial Neural Networks. A variety of artificial neural network (ANN) algorithms have been devised (Rumelhart & McClelland, 1986; Freeman & Skapura, 1991; Fausett, 1994). They generally impose a severe constraint on the number of layers of computation, which causes compression and hence learnability to suffer.
We shall refer to algorithms and approaches that strongly limit the number of layers as few-layered learning. ANN approaches that use few layers continue to receive a great deal of attention, because they can be applied to a useful class of problems and because there is still much to learn about them. These networks have much to offer, and our own work reported here fits generally into this category, though not with the restriction of few layers. Functionally shallow networks preclude forms of knowledge reuse that would facilitate compression and learnability, and such shallow networks are unattractive in this regard, particularly when the goal is to model lifelong accumulation of knowledge. Kaas (1982) discusses various neural organizations. Connectivity constraints and proximity constraints argue against neural plausibility of shallow networks. The constraint of few layers limits the ability to reuse knowledge learned previously as building blocks. Consider a boolean function expressible by n layers of combinational logic. To re-express the function in just two layers may require an exponential expansion in the number of gates and connections. The combinations implicit in the deeper circuit must be made explicit
in the shallower circuit. To undertake the learning of a boolean function subject to the constraint of just a few layers cripples the learning fatally by forcing it to learn an exponential number of subconcepts. This affects both space and time consumption. One must observe an exponential number of training instances in order to sample all the special case learning subproblems. These principles apply equally well to layers of hard or soft threshold functions. Consider a simple illustration of the organizational trade-off. Suppose we wish to build a boolean logic circuit patterned by the expression or(and(or(A, B), or(C, D)), and(or(E, F), or(G, H))), where the letters indicate boolean input values. As written, the corresponding circuit would have three layers of computation, using 7 two-input logic gates and 15 wires, counting inputs and output. In contrast, by distributing the two ands, a functionally equivalent logic circuit could be patterned by the expression or(and(A, C), and(A, D), and(B, C), and(B, D), and(E, G), and(E, H), and(F, G), and(F, H)). The corresponding circuit would have two layers of computation, using 8 two-input logic gates, 1 eight-input gate, and 25 wires. The three-layered circuit requires less hardware than the two-layered circuit. For the boolean logic case, the constraint of two layers is analogous to requiring that a function be expressed in disjunctive normal form, which provides poor compression and tedious learning of many special cases. Gradient descent for global error minimization of shallowly nested functional forms has not been shown to scale to deeply nested forms. There is anecdotal evidence that additional layers may degrade the learning process (Tesauro, 1992).
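The equivalence of the two circuit forms is easy to check mechanically; the following small sketch of ours verifies that the three-layer expression and its two-layer distributed form agree on all 256 input combinations.

```python
# Check of the circuit example from the text: the three-layer form
# or(and(or(A,B), or(C,D)), and(or(E,F), or(G,H))) is equivalent to
# its two-layer distributed (DNF) form, which needs more gates.
from itertools import product

def three_layer(A, B, C, D, E, F, G, H):
    # ((A or B) and (C or D)) or ((E or F) and (G or H))
    return (A or B) and (C or D) or (E or F) and (G or H)

def two_layer(A, B, C, D, E, F, G, H):
    # Distributing the two ands yields eight two-input conjunctions.
    return ((A and C) or (A and D) or (B and C) or (B and D) or
            (E and G) or (E and H) or (F and G) or (F and H))

assert all(three_layer(*bits) == two_layer(*bits)
           for bits in product([False, True], repeat=8))
```

The exhaustive check confirms the functions are identical, while the two-layer form replicates each input across several conjunctions, which is the hardware (and learnability) cost the text describes.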
For networks of a fixed architecture, trained by global error minimization, theoretical results have shown that loading data into such a network is an NP-complete problem, independent of the training algorithm and regardless of whether the network is shallow or deep (Judd, 1990; Blum & Rivest, 1988; Šíma, 1994). However, when the network architecture is allowed to grow during learning or is trained with localized signals, these results do not apply. White (1990) has shown that networks that grow during learning can learn arbitrary functions in the limit. Various shallow network architectures are often called universal approximators for certain classes of functions (Mitchell, 1997). Although a function may be representable in such an architecture, the learnability of such a function with such a representation is not guaranteed. It is similar to saying that any continuous function can be approximated arbitrarily accurately by a sum of monomials; one may require an infeasibly large number of such
monomials. Similarly, any boolean function can be represented by a disjunctive normal form (and hence a three-layered network), but such a form may suffer with respect to learnability and compression. We must remain concerned with what is feasibly learnable in a given representation. Jacobs, Jordan, and Barto (1991) presented a modular architecture that partitions a space of tasks in such a way that one few-layered network handles each disjoint subset of the tasks. This is a form of decomposition in the sense that the tasks are partitioned once at the same level, with one gating function to select the output of a unit from those among the single subtask layer. A less sophisticated model of a similar kind is a piecewise-linear fit of training data (Nilsson, 1965). Each linear threshold unit competes to classify an instance and is trained accordingly so that each linear discriminant applies to a subset of the training instances. Each linear discriminant serves no other purpose than to provide an answer for its subdomain (of expertise). These shallow networks do not produce building blocks of knowledge. Another kind of nonconstructive approach places multiple related tasks at the output layer with a few-layered architecture (Suddarth & Holden, 1991; Caruana, 1997). This enriches the error gradient at the hidden units, thereby hastening learning. Suddarth explored putting extra tasks on the output layer that did not actually need to be learned. Their mere presence during the training process sped learning for the actual task of interest. However, problems associated with using few layers remain. Several constructive methods add hidden units during learning, increasing the width of one or more existing layers (Ash, 1989; Hanson, 1990; Frean, 1990; Wynne-Jones, 1991; Utgoff & Precup, 1998). For nontrivial problems, these constructive methods generally bog down fatally.
This is often explained as becoming stuck at a local minimum or being forced to traverse a very shallow error gradient. However, the potentially exponential learnability requirements imposed by so few layers are likely to be the major contributor. Other methods add hidden units in a manner that increases the number of layers (Gallant, 1986; Fahlman & Lebiere, 1990; Frean, 1990), driven by the single goal of reducing residual global error for the single task at hand. Reiterating, to impose a constraint on the maximum depth of knowledge nesting imposes a very strong constraint on the amount of compression and generalization that can be achieved. One cannot simply resort to trying to train deep networks from the outset by gradient descent for one or more advanced concepts at the output layer. Nonrecursively partitioning a large task into a single set of special cases does not produce building blocks of knowledge. Existing constructive methods have been designed for single-task learning and are designed to remove residual error, not form building blocks. Systems that are designed to solve multiple tasks using shared hidden units in a fixed architecture benefit from sharing hidden units but suffer from having few layers of computational units. We need deep networks in order to learn complex sets of concepts over a long period of time, yet gradient descent works only for shallow networks, leaving us with the ability to learn only simple concepts with shallow networks.

2.3 Sequential Learning. To facilitate learnability and compression, it is important to eliminate hindrances where possible, particularly any constraint on the number of layers of knowledge. We have mentioned several systems that have no such constraint but that solve one or more tasks fixed ahead of time. What of the longer view, in which we wish agents to learn new tasks in terms of old? The ability to form building-block concepts is critical. One means of forming building blocks is to learn more than one concept in a sequential manner, so that old concepts are available for use in expressing new concepts. Some learning systems attempt to learn a succession of concepts instead of just a single target concept. Sammut and Banerji’s (1986) MARVIN is able to use previously learned concepts as building blocks when learning a new concept. The system assumes the presence of a wise teacher. That teacher decides which concepts to teach and in what order. This has the positive effect of progressively improving the basis for subsequent learning. Banerji (1980) refers to this layering of concepts as a growing language. Clark and Thornton (1997) discuss the need for layers of representation based on the need to map one representation to another. They do not propose a specific algorithm, instead discussing the problem more theoretically. They offer a very helpful distinction between two classes of learning problems, which they call Type-1 and Type-2 learning. For Type-1 learning problems, our well-studied statistical methods capture regularity that is directly observable, even if only faintly. However, for Type-2 learning, a mapping of the given variables to new variables is absolutely necessary in order to uncover otherwise unobservable regularity.
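The Type-2 idea can be made concrete with a small example of our own (not from Clark and Thornton): for 4-bit parity, no single input variable carries any observable regularity with respect to the label, but a remapping to new variables makes the label directly observable.

```python
# Our own illustration of the Type-1/Type-2 distinction using 4-bit
# parity: each raw input agrees with the label only at chance level,
# but remapped variables expose the regularity directly.
from itertools import product

def parity(bits):
    return sum(bits) % 2

# No single original input is informative about the label (Type-2):
for i in range(4):
    agree = sum(bits[i] == parity(bits)
                for bits in product([0, 1], repeat=4))
    assert agree == 8  # 8 of 16 cases: exactly chance level

# After mapping to new variables u = x1 xor x2 and v = x3 xor x4,
# the label is a plainly observable (Type-1) function of (u, v):
for bits in product([0, 1], repeat=4):
    u, v = bits[0] ^ bits[1], bits[2] ^ bits[3]
    assert parity(bits) == (u ^ v)
```

Learning the intermediate xor concepts first, then the label in terms of them, is exactly the layering of Type-1 steps that the text advocates.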
Of course, more than one level of mapping to new variables may be needed, further complicating the learning problem. Off-line methods for searching for such mappings will generally be intractable because by definition there is no information available to guide the process. A clear implication is that such Type-2 mappings can arise from learning a variety of concepts or functions in a Type-1 manner, some of which happen to provide useful mappings for problems that will arise sometime thereafter. Clark and Thornton’s perspective is very important, and we make use of their Type-1/Type-2 distinction here. New work is beginning to appear that approaches larger learning problems in a bottom-up manner, by learning a progression of tasks. This is very much in the spirit of Shapiro’s work on structured induction, and it addresses Type-2 learning problems by learning a sequence of tasks. For example, Stone and Veloso (2000) have explored many-layered learning in the domain of robotic soccer. They observed that the learning tasks they were tackling were intractable with standard (Type-1) methods. By teaching their system a progression of simpler (Type-1) tasks, the larger (Type-2) task could be learned. The system relies on a teacher to decide what to teach,
and when. Stone and Veloso do not propose a uniform approach to learning at each layer. Instead, any learning algorithm and representation can be used at any layer of computational units. A recent approach to nesting of learning tasks is the KBCC system of Shultz and Rivest (2000). They extended cascade correlation (Fahlman & Lebiere, 1990) by training a set of networks ahead of time to solve a variety of useful tasks. Their KBCC system can add such a learned network (encapsulated), instead of a single unit, to the overall network being constructed. This produces a nested form of learning. Valiant (2000a, 2000b) has proposed a neuroidal architecture in which concepts are represented in layers of linear threshold units. He discusses the idea that each unit should correspond to a concept and that each unit can be trained individually (a localized training signal). It remains to the designer to organize the units and their connections and to decide how to train them. There is no limit to the depth of the nesting, and the goal is to express concepts in terms of other building-block concepts. This is an important step toward deep networks of building blocks, and away from limitations imposed by employing only gradient descent driven by output error. In a somewhat different vein, there is work that studies how a developing nervous system affects learning. For example, Elman (1993) has suggested that the less developed mental capacity of infants helps learning by admitting only small chunks of knowledge. Simulations with language learning in artificial neural networks indicated that starting with a small network capable of processing short sentences helps to accelerate learning. When development continues, modeled by enlarging the network, the ability to learn to process longer sentences is considerably enhanced by having already learned to handle short sentences. Starting with the larger network at the outset impairs learning.
More recently, Dominguez and Jacobs (2001) showed that a developmental approach to learning improves performance. Learning at one granularity of vision (spatial frequency range), followed by learning at another, is more efficient than learning with both granularities from the outset. These demonstrations of advantages that accrue from a developing nervous system are compelling. Some physical developmental stages are obvious in animals. For example, Turkewitz and Kenny (1982) discuss generally how some neural subsystems are programmed to have a head start over others. Kittens are born with a fully functional visual system, yet with eyelids that remain sealed for six to seven days after birth. This gives weaker sensory systems a chance to develop before stronger systems are enabled that would otherwise dominate. Quartz and Sejnowski (1997) discuss patterns of neurological growth, including axonal and dendritic arborization and synapse formation. They relate various studies that support the notion that nerve activity (use) and correlation among signals of proximal dendrites promote growth and branching. Their view is that development and learning are very much driven
by the agent’s experiences and interactions with its environment. Learning remains a nonstationary problem throughout the life of the agent. Recapitulating, there are Type-2 problems that are too difficult to learn as a shallow Type-1 mapping from the inputs to the outputs. Indeed, this accounts for the common approach of manually engineering an input representation in order to reduce the learning task to something simple enough for one of our weak algorithms to handle. A handful of researchers are examining how to nest learning, so that new learning problems can be made easier by what has been learned previously. Developmentally, limited processing capability can facilitate early learning. As processing capability develops, new learning can build on top of, or be influenced by, what has already formed.

3 Design Goals and Assumptions
Our primary goal is to design a single learning mechanism that can exhibit difficult (Type-2) learning by way of layered simple (Type-1) learning. This constitutes a different paradigm from the more typical approach of applying a Type-1 method to a Type-1 problem, possibly with hand-engineering of the input representation, or futilely attempting to apply a Type-1 method to a Type-2 problem. In our view, learning of difficult concepts takes place only after learning of prerequisites renders them not difficult. This is very different in scope from the common attitude, much evident in practice, that one should be able to turn on a learning system and watch it run to completion. We share this goal, but hold that systems capable of Type-2 learning will require more than Type-1 learning algorithms. We conjecture that they will also require a bottom-up layering mechanism, so that Type-1 problems and results can be composed to realize Type-2 learning. Our paradigm does not mean that such a learning system could not be used to learn a single difficult concept of interest, but it does mean that preparatory learning would need to occur as a prerequisite. In any case, we envision a system oriented toward long-term learning of a large number of concepts that are too difficult to learn by any other means now known. To this end, we are also interested in how knowledge can accumulate in a set of data structures that do not lose their utility (Minton, 1990). Organization of knowledge in terms of building blocks is an essential element of our design. Of particular interest is how building blocks can be identified in an online, bottom-up manner, without resorting to off-line analyses of large data collections to search for useful decompositions. We shall distinguish on-line composition from off-line decomposition, even though a retrospective view of local knowledge organization may bear some strong similarities.
We make three basic practical assumptions in order to produce a workable scope for experimentation. The first is that linear threshold units and linear combination units are the only unit types and that they are individually trainable. The second is that such a unit can be adjusted at any time by delivering (presenting) a training instance to it wherever the unit may be located in the network. We do not propagate errors backward; gradient descent is applied only locally at each unit to train its adjustable parameters (weights). The third (very common) assumption is that the instance (input) representation consists of a set of propositional and numeric variables. The remaining sections present a domain that requires Type-2 learning, two algorithms that accomplish Type-2 learning by layering of Type-1 learning, and a second well-known domain in which our bottom-up approach outperforms a gradient-descent approach. We conclude with a discussion of the main lessons that we have learned.

4 Two Concepts from a Card-Stackability Domain
We explore several aspects of many-layered learning. Of interest is how to build an on-line learning algorithm that organizes its concepts (linear threshold units and linear combination units) as it acquires them. To ground the discussion initially, we employ a domain in which the most advanced of the concepts are two kinds of card stackability found in many forms of card solitaire. These concepts are rich enough for purposes of study and illustration. The first kind of card stackability, called column stackable, pertains to cards that are still in play. A card c1 can be stacked onto a card c2 already at the bottom of a column if two conditions hold. First, the color (red or black) of the suit of card c1 and the color of the suit of card c2 must differ, and second, the rank (ace..king) of card c1 must be exactly one fewer than that of card c2. We shall ignore the rules for which cards may be placed at the head of a column, as they are immaterial here. The second kind of card stackability, called bank stackable, applies to cards that become out of play upon being stacked onto a bank. A card c2 that is still in play may be placed onto a card c1 that is out of play in a bank if two conditions hold: the suit of card c2 and the suit of card c1 must be identical, and the rank of card c2 must be exactly one more than that of card c1. Again we shall ignore the rules for which cards may start a bank (typically the aces), as they are unimportant here. Notice that these concepts depend on the properties of each card individually and each pair of cards collectively. The terms suit, rank, suit color, suit colors differ, rank successor, and suits identical are mentioned and are building blocks themselves. For humans, these concepts and functions are not difficult to compute, primarily because the rank, suit, and suit color are indicated plainly on each card.
However, to make the problem slightly richer yet nevertheless understandable, suppose that the deck of cards to be used does not have these standard indications. Imagine instead that each of the 52 cards has solely one of the integers in the interval [0, 51] indicated, without rank, suit, or color. The rank of each card is implicitly an integer in the interval [0, 12], and the suit is implicitly an integer in the interval [0, 3]. Suits 0 and 2 are grouped into one color, as are suits 1 and 3. This produces a problem
that is simple enough to understand in its entirety yet rich enough to lend itself to requiring many layers of knowledge. Deeply nested knowledge is our goal here; we do not wish to hand-engineer an input representation that leaves a problem simple enough for a few-layered (Type-1) method. For completeness, we state these definitions formally, but we shall not attempt to learn them exactly this way. We can define column stackable and bank stackable as:

suit(x) = (x div 13)
rank(x) = (x mod 13)
suit color(x) = (suit(x) mod 2)
column stackable(c1,c2) ↔ (suit color(c1) ≠ suit color(c2)) ∧ (1+rank(c1) = rank(c2))
bank stackable(c2,c1) ↔ (suit(c1) = suit(c2)) ∧ (1+rank(c1) = rank(c2))
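The formal definitions above transcribe directly into code; the following sketch of ours checks them on a pair of examples.

```python
# The formal definitions of the card-stackability domain, transcribed
# into Python: cards are integers in [0, 51], suit = x div 13,
# rank = x mod 13, and suits 0 and 2 share one color.
def suit(x):        return x // 13
def rank(x):        return x % 13
def suit_color(x):  return suit(x) % 2

def column_stackable(c1, c2):
    # Suit colors must differ and rank(c1) must be one fewer than rank(c2).
    return suit_color(c1) != suit_color(c2) and 1 + rank(c1) == rank(c2)

def bank_stackable(c2, c1):
    # Suits must be identical and rank(c2) must be one more than rank(c1).
    return suit(c1) == suit(c2) and 1 + rank(c1) == rank(c2)

# Card 5 (rank 5, suit 0) onto card 19 (rank 6, suit 1): colors differ
# and the ranks are consecutive, so the pair is column stackable.
assert column_stackable(5, 19)
# Card 6 onto bank card 5 (same suit, rank one higher): bank stackable.
assert bank_stackable(6, 5)
```

Note the argument order of bank_stackable matches the definition in the text: c2 is the card still in play being placed onto the out-of-play card c1.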
Notice that we have given a nested set of definitions. It is more compact to express the target concepts in such a manner. However, in our discussion, we shall avoid the integer and modular arithmetic that we have employed here.

4.1 A Hand-Designed Many-Layered Network. Figure 1 shows a hand-designed many-layered network consisting of the inputs, a variety of building-block units, and the two target concepts. All the units are shown in a column of boxes at the left, with the units of each layer shown as a group. For any unit, its output line ascends diagonally to the right, and its input line descends diagonally to the right. An output line of one unit is connected to the input line of another unit only where a dot appears at their intersection. Lines crossing without such a dot are not connected. A linear threshold unit is shown as a clear box, and a linear combination unit (no threshold) is shown as a shaded box. For example, rank(c1) has as its inputs (following its input line diagonally downward and looking for connecting dots) input(c1), club(c1), diamond(c1), heart(c1), and spade(c1). The building-block concepts and functions successively map the inputs to increasingly useful representations. Notice that there are six layers of computation, indicated by the seven groups of units. These subconcepts are learning problems in their own right. An agent should be able to acquire a structure of many layers that represents this knowledge. We shall see in section 6 that some of these units turn out to be not strictly necessary. This is simply a network that a human might (did) reasonably construct, and it serves our purpose for the moment. The propositional concepts for each card individually are symmetric. Looking at those for card c1, the concept of suit corresponds to fixed subintervals of the domain interval [0, 51]. The less13(c1), less26(c1), and less39(c1) concepts enunciate these critical subinterval boundaries.
With knowledge of these subintervals it is straightforward to compute the suit of c1 by testing whether it falls into a particular subinterval but not the next
Figure 1: Hand-designed many-layered network.
one smaller. The suits are computed as follows from the subintervals for an integer card value c1:

less13(c1)  less26(c1)  less39(c1)  suit
T           T           T           spade(c1)
F           T           T           heart(c1)
F           F           T           club(c1)
F           F           F           diamond(c1)
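As a quick mechanical check of the subinterval-to-suit table (a sketch of ours, not the authors' network), the three threshold tests recover each card's suit, and subtracting the corresponding multiple of 13, as the network's rank unit does, recovers the rank:

```python
# Check of the subinterval table: recover the suit indicators from the
# three threshold tests, then recover rank by subtracting the
# corresponding multiple of 13 (spade = suit 0, heart = 1, club = 2,
# diamond = 3).
def suits_from_thresholds(c1):
    less13, less26, less39 = c1 < 13, c1 < 26, c1 < 39
    spade   = less13
    heart   = (not less13) and less26
    club    = (not less26) and less39
    diamond = not less39
    return spade, heart, club, diamond

for c1 in range(52):
    spade, heart, club, diamond = suits_from_thresholds(c1)
    assert spade + heart + club + diamond == 1   # exactly one suit fires
    rank = c1 - 39*diamond - 26*club - 13*heart - 0*spade
    assert rank == c1 % 13
```

The loop confirms that, over all 52 card values, the interval tests select exactly one suit and the linear combination reproduces rank = c1 mod 13.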
If one were to compute the suit value as an integer in the interval [0, 3] and reference the input card value c1, then one could compute rank from the linear combination c1 − 13·suit(c1). The weight from each suit unit subtracts the corresponding multiple of 13 from the rank unit. However, in the hand-designed network above, there is no suit unit. Instead, there are four boolean units, one for each suit. Assuming here that TRUE maps to 1 and FALSE maps to 0, a suitable linear combination to compute rank would be c1 − 39·diamond(c1) − 26·club(c1) − 13·heart(c1) − 0·spade(c1). Each of the suit colors red and black is a simple disjunction of the relevant suits. The suits black red unit, the suits red black unit, and the suit colors differ unit collectively compute an exclusive-or of the suit color. The rank successor concept is somewhat opaque because it uses hard-threshold units to test this relation. To test for a difference of exactly one using only inequalities, it is necessary to test simultaneously whether the difference in rank is at least one and whether the difference is at most one. If each is true, then the difference is exactly one. The concept of suits identical is a disjunction of the four suit-equivalence tests. Finally, column stackable is the conjunction of the suit colors differ and rank successor concepts, and bank stackable is the conjunction of the suits identical and rank successor concepts.

4.2 A View of the Target Concepts. Figure 2 depicts the two target concepts column stackable and bank stackable as a matrix. Card c1 indexes the row, and card c2 indexes the column. A large solid dot indicates an ordered pair that is column stackable, a large hollow dot indicates an ordered pair that is bank stackable, and a small solid dot indicates a pair that is either impossible or is not a member of either target stackability concept. No ordered pair can be both column stackable and bank stackable.
One can see that in these two input dimensions, each of the concepts requires some care to specify exactly. There are 12 decision regions, each with boundaries that are not aligned with either of the axes. The depicted decision boundaries and enclosed regions are discussed below.

5 Learning from an Organized Input Stream
We would like to design an on-line learning algorithm that learns new concepts using previously learned concepts as additional inputs to each learning task. Our goal is to mimic a process of being receptive to those new ideas
Figure 2: Target concepts for column stackable (large solid dots) and bank stackable (large hollow dots).
that are not too difficult to understand given the current state of knowledge. How can we model mechanisms of this kind? This section presents an illustration of the economies that accrue from learning each layer, one after the other.

5.1 A Curriculum Algorithm. The hand-designed network of Figure 1 can be learned sequentially, one layer at a time. Each concept or function to be learned at a layer is trained individually in a supervised manner, as though it were an independent learning task. It is possible to learn each concept as a single unit, one at a time, but because the units at each layer are not connected to each other, it is also possible to take on an entire layer at a time. The stream of training examples is processed by delivering (presenting) each training example to its corresponding unit. One waits until the concepts at a layer are learned sufficiently well before proceeding to the
next. This supposes a good teacher or other mechanism for organizing the training in such a sequential manner and deciding when to proceed to the next layer. Although in our example the main goal for the agent is to learn the two concepts regarding stackability, these are too difficult to learn immediately. One needs to learn the simpler concepts first, to build a satisfactory basis for subsequent Type-1 learning. In this domain, it is important to learn first that certain intervals of integer values are important to recognize. From that basis, it becomes much easier for the agent to learn the suit concepts. So it goes, each new layer of knowledge advancing the frontier of receptivity, preparing the agent to acquire the next. It is the layering of Type-1 learning that produces Type-2 learning. We implemented an algorithm to train the layers successively as described above. This is an instance of a curriculum algorithm, which we shall characterize as any algorithm designed to provide instruction in an order that corresponds to a workable progression of an agent's frontier of receptivity. When an ordered pair of cards is presented, a class label (or function value) is included so that the corresponding unit can be trained. If the unit is on the first computational layer, it is trained in a straightforward manner, using the appropriate error correction rule. Each linear threshold unit is adjusted by stochastic gradient descent to reduce the absolute error, using step size 0.1, real inputs normalized by the largest magnitude for each input variable individually, and boolean inputs mapped onto 1 for TRUE and −1 for FALSE. Each linear combination unit is adjusted by stochastic gradient descent to reduce the mean-squared error, using the same step size, normalization, and boolean input encoding.
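A minimal sketch of the threshold-unit training rule just described might look as follows, with targets and boolean inputs in {+1, −1} and a leading bias input of 1. The helper names and constants are illustrative, not the authors' implementation.

```python
import random

def train_threshold_unit(examples, dim, step=0.1, epochs=100):
    """Sketch of the error-correction rule described above: stochastic gradient
    descent on the absolute error of a linear threshold unit. Each example is
    (x, t) with x[0] = 1 as the bias input and target t in {+1, -1}."""
    w = [random.uniform(-0.05, 0.05) for _ in range(dim)]
    for _ in range(epochs):
        for x, t in examples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if y != t:  # wrong side of the threshold: nudge weights toward t
                for i in range(dim):
                    w[i] += step * t * x[i]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

A first-layer boolean concept such as a conjunction of two ±1 inputs is learnable this way in a handful of epochs.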
If the unit is beyond the first computational layer, all the units preceding it are first evaluated in a feedforward manner, so that its inputs are available, given the ordered pair of cards (Rivest & Sloan, 1994). Then the error correction rule is applied.

5.2 Experiment 1: Perfect Connectivity. In a simple experiment, all the exact dependencies of the knowledge elements (concepts and functions modeled as linear units) were known, sidestepping any problems of connectivity. As shown in Figure 1, the units of each layer were learned successively in 2, 2, 557, 3, 2, and 2 epochs for each layer, respectively, using all the examples as the training corpus. Our interest here is in memory organization, not classification accuracy, so there is no need to use less than the full corpus. Total CPU time was 13.2 seconds on a 1.13 GHz Pentium III.

5.3 Experiment 2: Complete Connectivity. A second experiment was run in which the connectivity was not known in advance. Instead, for each new layer of unlearned units, the output of every previously learned unit (including the input units) was connected as an input to each unlearned unit of the new layer. In this case, the layers were learned successively in 4, 2, 537, 4, 2, and 3 epochs, respectively, in 41.6 seconds. Even with
this highly connected approach, the target concepts were still learned very rapidly when the layers were trained in this sequential manner. The approach of allowing all inputs or previously learned concepts to serve as an input basis for subsequent learning has the desirable property of allowing the agent to draw on whatever it already knows in order to understand what is new, to find regularity where it would otherwise be obfuscated. However, an undesirable property of this massive connectivity is an ever-growing dimensionality for subsequent learning, which will not scale to large problems. One can adopt a method for learning in the presence of many irrelevant subconcepts (Littlestone, 1988), or implement a mechanism for eliminating connections, or devise a scheme for adding connections selectively. We present a connection elimination mechanism below, but we did not use it for our curriculum algorithms.

5.4 Other Experiments. We conducted experiments with a variety of feedforward artificial neural networks and backprop (Werbos, 1977; Rumelhart & McClelland, 1986), which are too numerous to report in detail. In summary, putting aside speed issues, network topologies with more than two layers of hidden units failed. This was so even when providing a topology with perfect connectivity, as given in the hand-designed network of Figure 1. When confronting the task with the common two layers of hidden units, convergence was also elusive. The curriculum algorithm we used was given specific information for each of its units, and backprop was not. We are not offering a classical comparison of any kind. Rather, we are illustrating that there are limitations to backpropagation of error, a form of top-down learning, that are avoided with a curriculum algorithm, a form of bottom-up learning. Reflecting on our experiments, the first layer provides decision boundaries, and the second layer combines groups of boundaries to form decision regions.
These regions, when found, are specific to the task at hand. Would such units, if learned, constitute building blocks? We think not.

5.5 Discussion. Figure 2 depicts possible decision boundaries (and implicitly the decision regions) for the two target card-stackability concepts. These 12 regions partition the instance space but do not provide multiple layers of composed knowledge. Instead, these regions are tailor-made for the two target card-stackability concepts. While partitioning instance space for a particular task may seem like a satisfactory accomplishment, we would rather that learning take place in a way that provides successive levels of mapping based on using earlier concepts where possible. For example, if a two-hidden-layer network is coaxed to learn column stackable, leaving out bank stackable, decision boundaries other than those depicted in Figure 2 may be found. For example, the boundaries (horizontal and vertical in the figure) might instead transect the bank stackable regions (have slope −1), making them useless for the subsequent purpose of learning bank stackable.
The decision boundaries that are shown in Figure 2 would happen to work for either task alone because they conveniently bound regions that are relevant to both stackability concepts. Alas, learning one of the tasks alone may not result in such serendipitous boundaries that are useful for each. Examining this figure, we see that a 2/16/12/2 network is capable of representing both of the target solitaire concepts. There would be 2 input units, 16 computational units at layer 1 (for 16 region boundaries), 12 computational units at layer 2 (for 12 regions), and 2 output units, each computing a disjunction of the needed regions. Although it appears in the figure that there are four lines of fat black dots, those lines are punctuated with some thin black dots, corresponding to the fact that no king can be stacked onto another card (for column stackable). The three horizontal and three vertical lines are boundaries that are used to carve out these illegal cases, seen at the upper left dot of some of the apparent squares in the figure. Similarly, there are four regions along the major diagonal for bank stackable, because an ace cannot be placed onto another card. As we noted, the ability to represent something does not mean that learning will succeed. Indeed, theoretical results from computational learning theory show that we should not expect a global training algorithm to perform well on this problem. Notice that each of the two outputs in the 2/16/12/2 network architecture described above is in a k-term DNF format. Pitt and Valiant (1988) proved that k-term DNF concept representations cannot be efficiently learned. Although an equivalent k-CNF representation may be learned efficiently, converting from k-term DNF to k-CNF causes a worst-case exponential explosion in the representation size. Returning to Figure 1, imagine lopping off the bank stackable unit and its unique predecessors.
That would excise bank stackable, suits identical, both spade, both heart, both club, and both diamond. Now, to learn bank stackable, it would be necessary to learn just these six concepts. The modularity of the subconcepts is apparent, making the learning of bank stackable relatively easy, given the existing knowledge of column stackable. To illustrate this point further, suppose that we wanted to reuse these networks for recognizing concepts in poker. The network in Figure 1 contains many concepts that can be applied to the new problem. For example, rank successor is needed to evaluate which elements of a possible straight may be present in a hand. Similarly, suits identical can be used in evaluating a flush. The same cannot be said of the few-layered networks discussed above. Few, if any, of the divisions are relevant to the new concepts, forcing learning to begin anew. For sequential Type-2 learning to work well, the decomposition of the presumably final targets into useful subconcepts must already have occurred. Does an agent ever really learn a final target? One must learn the building-block knowledge for future tasks yet to be encountered. In some sense, this seems impossible, but it is only a matter of viewpoint. It is only in the top-down-decomposition view of the world that time must run backward.
In the bottom-up-composition view, building blocks are created based on experience. A new block is learnable if and only if the prerequisites are in place. Learning simple useful concepts in a many-layered organization lays the foundation for whatever else may come. 6 Learning from an Unorganized Input Stream
We present and discuss our novel stream-to-layers (STL) algorithm, which demonstrates a mechanism for organizing concepts in terms of each other as they are acquired. This models quite directly the notion of an advancing frontier of receptivity, even without a teacher prescribing the layering. Although an agent can benefit greatly from receiving information in an order that conforms to the agent's receptivity, one cannot expect life's experiences to be ordered so well. How can learning of multiple layers of building-block knowledge work in the absence of a good ordering of experiences? One approach is to try at all times to make sense of all that one can. This entails great inefficiency, but it can account for learning useful building blocks. Agents make different use of the information that they receive, presumably because their current knowledge differs, giving each its own frontier of receptivity. For example, two people attending the same lecture will hear and understand it differently. How can this process be modeled?

6.1 A Stream-to-Layers Algorithm. Consider again the two target concepts column stackable and bank stackable. Suppose now that when an ordered pair (c1,c2) instance is presented, a set of observed relations is also stated, each as a positive or negative atom. For example, consider the following instance, in which c1 is bound to 6 (the 7♠) and c2 is bound to 8 (the 9♠): {c1/6, c2/8}, ∧(less13(c1), less26(c1), less39(c1), spade(c1), ¬heart(c1), ¬club(c1), ¬diamond(c1), black(c1), ¬red(c1), rank(c1,6), less13(c2), less26(c2), less39(c2), spade(c2), ¬heart(c2), ¬club(c2), ¬diamond(c2), black(c2), ¬red(c2), rank(c2,8), both spade(c1,c2), ¬both heart(c1,c2), ¬both club(c1,c2), ¬both diamond(c1,c2), ¬black red(c1,c2), ¬red black(c1,c2), ¬rank one or more(c1,c2), rank one or fewer(c1,c2), suits identical(c1,c2), ¬suit colors differ(c1,c2), ¬rank successor(c1,c2), ¬column stackable(c1,c2), ¬bank stackable(c1,c2)).
In the following instance, c1 is bound to 46 (the 8♦), and c2 is bound to 34 (the 9♣): {c1/46, c2/34}, ∧(¬less13(c1), ¬less26(c1), ¬less39(c1), ¬spade(c1), ¬heart(c1), ¬club(c1), diamond(c1), ¬black(c1), red(c1), rank(c1,7), ¬less13(c2), ¬less26(c2), less39(c2), ¬spade(c2), ¬heart(c2), club(c2), ¬diamond(c2), black(c2), ¬red(c2), rank(c2,8), ¬both spade(c1,c2), ¬both heart(c1,c2), ¬both club(c1,c2), ¬both diamond(c1,c2), ¬black red(c1,c2), red black(c1,c2), rank one or more(c1,c2), rank one or fewer(c1,c2),
¬suits identical(c1,c2), suit colors differ(c1,c2), rank successor(c1,c2), column stackable(c1,c2), ¬bank stackable(c1,c2)).
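In code, such an observation might be represented as variable bindings plus a mapping from atom to truth value. This encoding is hypothetical (only a few of the atoms are shown), but it captures the (Bt, Lt) structure used by the algorithm below.

```python
# Hypothetical encoding of one observation o_t = (B_t, L_t): variable bindings,
# plus a mapping from atom to truth value (True for positive, False for negated).
observation = (
    {"c1": 6, "c2": 8},  # B_t: bindings (the 7 and 9 of spades)
    {
        "spade(c1)": True, "heart(c1)": False, "club(c1)": False,
        "diamond(c1)": False, "black(c1)": True, "red(c1)": False,
        "rank(c1)": 6,  # an atom with a numeric argument denotes a function value
        "spade(c2)": True, "rank(c2)": 8,
        "suits identical(c1,c2)": True, "suit colors differ(c1,c2)": False,
        "rank successor(c1,c2)": False,
        "column stackable(c1,c2)": False, "bank stackable(c1,c2)": False,
    },
)
```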
This may be more information than is strictly necessary because the agent may already know how to infer some of these atoms from (c1,c2) due to earlier successful learning of some of the building-block concepts. In terms of level of discourse, this would be a mismatch between sender and receiver. Information that the agent could already infer is irrelevant, as is information that is currently too difficult to absorb. Providing such irrelevant information is a matter of communication inefficiency, which is not a major concern of ours here. We simply provide the truth values for all the stated atoms, thereby avoiding all the problems related to discourse. We assume that the stream of observations holds the atoms that correspond to the concepts. One might say that an important segmentation of the agent's observations has therefore already occurred. This is so and is one of the basic assumptions that we have already discussed. However, an agent's waking hours include a stream of observations, which are represented in some as-yet unknown form. Our world provides a stream of sensory and perceptual information, and our interest is in being able to learn from such a stream. To this end, we manufacture such a stream, putting aside the problem of what mechanisms in an agent could produce such a stream. Our stackability example already assumes knowledge of integer values and their ordering. The stream of information that washes over an agent can provide primitives and higher-level information, and the STL algorithm described below suggests one way in which this information can be assimilated and organized over time. Can an agent learn the subconcepts and organize them into layers that correspond to computational dependencies? Yes, if one is guided by an assumption that only simple (Type-1) learning mechanisms are available to learn and refine a knowledge element. Table 1 shows the STL algorithm.
For every predicate or function name observed, the algorithm updates the unit as described here. If the unit does not yet exist, it is created and added to the list of unlearned units, which is initially empty. Its inputs are initially the distinguished input values as provided in an observation. For each set of atoms presented as a training instance (observation), the algorithm attempts to learn each atom as a linear threshold unit or an unthresholded linear combination. For an atom with only bound variables as arguments, the atom indicates a boolean concept, and the positive or negative value indicated for the atom (absence or presence of negation connective) is its training label. However, for an atom with a single numeric argument, the atom indicates a numerical function, with the numeric argument being its training value. In this model, a function can have at most one numerical argument. The STL algorithm tries to learn all the concepts or functions that come its way. Of course, some concepts are learned more easily (sooner) than others. For example, the concepts that one sees in the second layer of Figure 1 will
Table 1: The STL Algorithm.

Input: A stream O of observations, each of the form ot = (Bt, Lt), where Bt is a set of variable bindings and Lt is a conjunction of literals.
Initially: U ← ∅, where U is the set of all defined units. M ← ∅, where M is the set of all learned units or base inputs.
On-line algorithm: For each observation ot:
1. Compute the value of every uk ∈ U using the bound input variables.
2. M ← M ∪ {h} ∪ Vt, where h is the distinguished bias input, which is always 1, and Vt is the set of input variables in Bt.
3. For each literal lj ∈ Lt: (Let A denote the predicate or function to be learned corresponding to the atom name in lj.)
(a) If there is a numeric argument ai of lj, then A is a linear combination unit with target T ← ai. Otherwise, A is a linear threshold unit; if lj is positive, then T ← 1, else T ← −1.
(b) If A ∉ U, then create a unit for A, U ← U ∪ {A}, set A′ to be undefined, Ainputs ← M, Aweights ← W, where each wk is sampled from a uniform density over [−0.05, 0.05].
(c) Update unit A using target T, the appropriate gradient-descent correction, step size (0.1 for combination, 0.01 for threshold), and input values each normalized to maximum magnitude 1.0.
(d) If A has been learned sufficiently well (see discussion), then M ← M ∪ {A}.
(e) If A is unlearnable over Ainputs (see discussion), then N ← M − Ainputs, Ainputs ← Ainputs ∪ N, initialize new weights for the new inputs N as in step 3b, reset A as learnable, and go to step 3c.
(f) If A ∈ M and A has some inputs not tried for deletion, then:
i. If A′ is undefined (see step 3b), then A′ ← A (a copy of A); remove one of A′inputs selected at random and mark the deleted input as "tried."
ii. Update A′ as in step 3c.
iii. If A′ has been learned sufficiently well, as in step 3d, then M ← M − {A}, U ← U − {A}, discard A, set A to refer to A′, set A′ to be undefined, M ← M ∪ {A}, U ← U ∪ {A}.
iv. Otherwise, if A′ is unlearnable over A′inputs, as in step 3e, then discard A′ and set A′ to be undefined.
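A much-simplified, executable sketch of the on-line loop in Table 1 follows. It uses threshold units only, fixed streak and patience counts in place of the VC-based success and failure criteria, and no connection removal; all names and constants are illustrative, not the authors' implementation.

```python
import random

class Unit:
    """A linear threshold unit over a named subset of the available inputs."""
    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.w = {i: random.uniform(-0.05, 0.05) for i in self.inputs}
        self.streak = 0   # consecutive correct evaluations
        self.seen = 0     # examples seen since the input basis last changed

    def predict(self, values):
        return 1 if sum(self.w[i] * values[i] for i in self.inputs) >= 0 else -1

def stl(stream, need_streak=100, patience=500, step=0.05):
    """Simplified STL: each observed concept gets a unit; a unit with a long
    streak of correct outputs is 'learned' and its output joins the input
    basis; a unit stuck too long is reconnected to everything now available."""
    units, learned = {}, []
    for base_values, labels in stream:
        values = dict(base_values)
        values["bias"] = 1
        for name in learned:                # feed-forward the learned concepts
            values[name] = units[name].predict(values)
        for name, t in labels.items():
            if name not in units:           # new concept: unit over current basis
                units[name] = Unit(values.keys())
            u = units[name]
            y = u.predict(values)
            u.streak = u.streak + 1 if y == t else 0
            u.seen += 1
            if y != t:                      # error-correction update
                for i in u.inputs:
                    u.w[i] += step * t * values[i]
            if u.streak >= need_streak and name not in learned:
                learned.append(name)        # frontier of receptivity advances
            elif u.seen >= patience and u.streak < need_streak:
                for i in values:            # unlearnable: widen the input basis
                    if i not in u.w and i != name:
                        u.inputs.append(i)
                        u.w[i] = random.uniform(-0.05, 0.05)
                u.seen = 0
    return units, learned
```

As a toy stream, labeling or, and, and xor of two ±1 inputs illustrates the layering: xor is not linearly separable over the raw inputs, so its unit stalls until or and and are learned and joined to its basis, after which it becomes a simple (Type-1) problem.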
presumably be learned reliably before the others. Any concept or function that is learned successfully has its output value connected as an input to those concepts or functions that have not yet been learned reliably. This has the effect of pushing the as-yet-unlearned concepts to a deeper layer. This process continues, always pushing the unlearned concepts deeper and providing each with an improved basis. This models an advancing frontier of receptivity. The agent is receptive to what can be learned simply, given what has already been acquired successfully. This approach embodies an assumption that those concepts that can be learned early should be considered as potential building blocks (inputs) when learning other concepts later. The STL algorithm operates in an on-line manner. The algorithm must make two important decisions. The first is to determine when a unit has successfully acquired its target concept and is therefore eligible to become an input to other, unsuccessful units. The criterion for successful learning in STL is that the unit must have produced a correct evaluation for at least n consecutive examples, where n = 1000 · VC(u) for unit u. The VC dimension of a linear unit is simply d + 1 for a unit with d inputs. We chose 1000 empirically for the problems at hand. We are examining how to formulate a more principled criterion. The second decision that STL must make is to determine when a unit cannot learn a target concept sufficiently well. One possible approach would be simply to connect a trained unit to an untrained unit as soon as a trained unit becomes available. This is unnecessarily aggressive; one unit may train faster than another even though both are capable of learning given their current input connections. STL relies on sample complexity to determine when a unit requires additional input connections. If a unit is presented m examples without satisfying the above learning criterion, the unit is considered to have failed, and new connections are added before training resumes. The number of required examples is m ≥ (c/ε²)(VC(u) + ln(1/δ)), with c = 0.8, confidence parameter δ = 0.01, and accuracy parameter ε = 0.01 for thresholded linear units. The number of examples required for an unthresholded linear combination is described by a similar formula, m ≥ (128/ε²)(2 Pdim(u) log₂(34/ε) + log₂(16/δ)), where Pdim(u) is the pseudodimension of u, with confidence δ = 0.01 and accuracy ε = 0.1 (Anthony & Bartlett, 1999). STL adds connections from all previously learned units to a unit that is not currently learnable.
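As a numeric sketch, the two criteria might be computed as follows. The closed forms here are reconstructed from the garbled originals, so treat the exact constants as assumptions.

```python
from math import ceil, log, log2

def success_streak(d):
    """Consecutive correct evaluations required: n = 1000 * VC(u), VC(u) = d + 1
    for a linear unit with d inputs."""
    return 1000 * (d + 1)

def failure_bound_threshold(d, eps=0.01, delta=0.01, c=0.8):
    """Examples before a thresholded linear unit with d inputs is declared
    unlearnable: m >= (c / eps^2) * (VC(u) + ln(1/delta))."""
    return ceil((c / eps ** 2) * ((d + 1) + log(1 / delta)))

def failure_bound_combination(pdim, eps=0.1, delta=0.01):
    """Analogous bound for an unthresholded linear combination, following
    Anthony & Bartlett (1999): m >= (128/eps^2)*(2*Pdim*log2(34/eps) + log2(16/delta))."""
    return ceil((128 / eps ** 2) * (2 * pdim * log2(34 / eps) + log2(16 / delta)))
```

Both bounds grow with the size of the unit's input basis, which is why widening a unit's connections also raises the patience extended to it.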
Units representing high-level linear functions quickly acquire the input connections required for successful learning. As discussed above, a disadvantage to this "connect-to-everything" approach is that initially units will have many more connections than is strictly necessary. This is particularly true of units representing high-level concepts. To combat this problem, STL employs a novel method for removing unnecessary input connections from each unit that it has successfully learned. When an individual unit satisfies the criterion for successful learning, it begins the process of removing any inputs that are not required for correct evaluation. The unit first generates a copy of itself, selects one of the input connections at random, and removes it from the copy. Thus, the copy is a duplicate of the unit minus one input. The duplicate is then trained, and if it satisfies the learning criterion, the duplicate replaces the original unit and the process continues. Otherwise, the failed duplicate is discarded, and the removal process continues with a newly created duplicate. Each input connection, including the bias, is tested exactly once for a total of d + 1 removal attempts per network unit. The bias connection is always tested
last in order to prevent spurious relationships among other inputs from conspiring to make the bias appear falsely useless. At first glance, training d + 1 copies of each unit in the network may appear to be an overly expensive solution to the connectivity problem in STL. Indeed, the cost of removing unnecessary connections from the network outweighs the cost of generating the initial (learned) network. However, several aspects of the connection removal process work together to make the price quite reasonable. First, each duplicate is initialized with the same weights (minus one input) as the learned unit. This means that the duplicates begin training with a very beneficial set of weight values and that training proceeds very quickly when the removed input is irrelevant. Second, STL's reliance on simple concepts and simple learning units means that most of the concepts in a network will depend on only a small number of inputs. Most of the inputs to a given unit will be irrelevant; therefore, most of the d + 1 copies of the unit will train quickly. Finally, the process of removing connections takes place after the unit has successfully learned. Thus, the concepts quickly become available for use by higher-level concepts, and only afterward spend computational resources optimizing their representations.

6.2 Experiment 1: Using All the Hand-Designed Concepts. We presented all distinct (c1,c2) pairs and the corresponding atoms as training instances. Although STL is intended to be an on-line algorithm, we have only a finite amount of data: 52 · 52 = 2704 observations. An infinite stream of input data was simulated by treating this collection of observations as a circular list. This is very much like an off-line algorithm making multiple passes over the data, each pass being an epoch. However, the algorithm has no knowledge of epoch, and it operates in an on-line manner.
The algorithm learned all concepts and functions in 17,026,156 instances, requiring 47:40 minutes (47 minutes and 40 seconds) on a 1.13-gigahertz Pentium III, and 132 total connections. The network had learned perfectly after 8:39 minutes, using the remaining time to remove connections. The time to complete the learning is very close to the 8:52 minutes used when connection elimination is disabled. Notice that the time to complete the learning is actually lower when connections are removed when possible. The constructed network, shown in Figure 3, has four computational layers and a different knowledge organization from that of the hand-designed network. Remarkably, the rank, suit colors differ, suits identical, and rank successor units are learned but not used (no output lines). It is somewhat unsatisfying to see these building blocks as superfluous. It is explained in part by the difficulty in learning. Something that takes much longer to learn, such as rank, will be pushed to a deeper layer. Meanwhile, a different basis for learning an advanced concept may be found. The integer interval units, such as less13, were not needed for learning the suit concepts. The spade and diamond suits can be learned easily without the
Figure 3: STL using all the hand-designed concepts.
interval units. After spade has been mastered, heart can be learned readily because it is any card value less than 26 that is not a spade. Similarly, club can be learned after diamond has been acquired. None of the interval concepts was required for learning the suits. 6.3 Experiment 2: Using a Reduced Set of Hand-Designed Concepts.
Having seen that the interval units were not needed, we reran STL while leaving them out. Figure 4 shows the resulting network, which was learned in 13,232,552 observations, taking 31:49 minutes, with correctness achieved after just 4:48 minutes. There are six computational layers with 116 connections. Notice that black(c1) is learned without depending on the club(c1) concept. This makes sense, given that spade(c1) being true would cinch it, or if c1 is neither spade nor diamond, then a simple test for c1 being in the club range will suffice. It is less appealing semantically, but it is sensible. The rank units are used, but both club, rank one or more, and rank successor are not (no output lines). Of interest is the relationship between the number of instances presented to the network and the number of connections in the network when connection removal is enabled. Figure 5 shows a graph of training instances presented versus connections in the network. The vertical line represents the point at which all units in the network are considered sufficiently learned. Notice the sharp rise and fall in connections during early training as connections are simultaneously added to unlearned units and removed from learned units. The six peaks in the graph correspond to the six computational layers in the resulting network. Finally, a majority of the instances are presented after the units in the network have already been learned.

6.4 Discussion. When exposed to a stream of observations of various boolean predicates and linear functions, the STL algorithm can learn these predicates and functions, organize them into a layered network of building blocks, and repeatedly advance its level of receptivity. Network organization occurs naturally as a result of simple Type-1 learning mechanisms. It is somewhat surprising that simple Type-1 mechanisms are apparently all that is needed to build efficient deeply layered knowledge structures that represent difficult Type-2 concepts.
7 The Two-Clumps Problem
The two-or-more clumps problem is a synthetic boolean problem in which the n inputs are arranged sequentially, x0, …, xn−1. The objective of the problem is to decide whether there are two or more groups of adjacent "on" inputs. The input string is circular, so that x0 and xn−1 are adjacent. As an example of the two-or-more clumps problem with n = 8, the input string 00110100 is a positive instance, while 00111110 is a negative instance. Several different learning algorithms have been applied to the two-or-more
Figure 4: STL using a reduced set of hand-designed concepts.
Figure 5: STL total connections while training the stackability targets (connections versus instances presented, ×10,000).
clumps problem (Mezard & Nadal, 1989; Frean, 1990; Redding, Kowalczyk, & Down, 1993). The essence of the problem is to count the number of edges, or on-to-off transitions, encountered while scanning along the string of bits. A two-layer network with n nodes in the hidden layer acting as edge detectors and a single output element implementing a threshold-2 function accurately represents the target concept. Data reflecting this structure were generated for n = 25 according to Monte Carlo simulations as described by Mezard and Nadal (1989). Eight sets of training data, with 50, 100, 200, 300, 400, 500, 600, and 800 instances, along with a set of 600 instances for testing, were generated independently. Each set had a mean of 1.5 clumps per instance. Five different STL networks with 25 inputs and 26 concepts were each trained on the 600-instance set of two-or-more clumps data. The average overall training time was 39:28 minutes, the average number of training instances was 9,227,762, and the average number of network connections was just 104, which is just three more than the ideal network structure. A correct network was achieved after 1:36 minutes on average. Figure 6 shows a typical network structure after connection removal. All of the edge-detection units have learned their concepts exactly. Each edge unit has pruned every irrelevant input and found the correct edge pattern. The top-level threshold concept very nearly implements the correct concept, including only a few irrelevant inputs. Figure 7 shows a graph of training instances presented versus the number of network connections. The single peak indicates the addition of connections to the top-level disjunction, while the sharp drop during early training
Figure 6: A two-or-more clumps network trained via STL.
Figure 7: STL total connections while training two-or-more clumps (connections versus instances presented, ×10,000).
corresponds to the fast learning and optimization of the edge detector concepts. The large number of instances required to remove connections from the top-level concept reflects the large number (25) of relevant inputs. In a second experiment, an STL network was trained on each of the eight sets of training data and then tested on the 600-instance testing set. Figure 8 shows the accuracy of each STL network along with the published accuracy measures from the Tiling algorithm (Mezard & Nadal, 1989), the Upstart algorithm (Frean, 1990), and the constructive higher-order network (CHON) algorithm (Redding et al., 1993). Recall that STL is given training information for the edge units that the other algorithms are not. If one wants to learn to recognize clumps, a useful prerequisite is to learn to recognize edges. Notice how STL performs comparatively well even with only 50 training instances and quickly rises to achieve 99% accuracy at just 300 instances. Eventually, by 800 training instances, STL achieves a perfect 100% accuracy.

8 Applying STL to a Larger Task
Can STL scale up to richer input streams? As a simple test, we aggregated the card stackability state and the two-or-more clumps state into one. An ordered pair of cards is represented by two integer variables, and a clumps state consists of 25 boolean predicates, so the aggregate state has 27 components. The card pairs are sampled randomly from the complete space of pairs, and the 25-tuples are sampled randomly from a single large sample of the space of 25-tuples, as discussed above.
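For reference, the two-or-more clumps labeling used for these 25-tuples can be sketched as a count of on-to-off transitions around the circular bit string, as described in section 7:

```python
def clump_count(bits):
    """Number of maximal runs of 1s in a circular bit string, counted as the
    number of on-to-off transitions. (An all-ones string has no transition,
    which still classifies correctly for the two-or-more test.)"""
    n = len(bits)
    return sum(1 for i in range(n) if bits[i] == '1' and bits[(i + 1) % n] == '0')

def two_or_more_clumps(bits):
    return clump_count(bits) >= 2
```

For the n = 8 examples of section 7, 00110100 has two clumps (positive) and 00111110 has one (negative); a string such as 10000001 wraps into a single clump across the circular boundary.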
Figure 8: Two-or-more clumps accuracy for several algorithms (STL, CHON, Upstart, and Tiling; accuracy versus training instances).
The resulting network (omitted due to size) consists of eight layers of computation. To learn all the predicates and functions correctly required 31,438,449 observations and 5:31:28 hours of CPU time, with another 81,526,067 instances and 11:19:02 hours of CPU time to conclude the connection reduction process. The separation of units by relevant inputs is quite good. No units for the stackability concepts are inputs to the clumps concepts. For the stackability concepts, there are still 70 useless connections, of 51 × 27 = 1377 possible. The main bottleneck is the problem of determining relevant inputs for each unit, which is the classic single-unit feature selection problem. Although our removal method is of lower computational cost than previous methods (Stracuzzi & Utgoff, 2002), it is nevertheless doomed computationally. Methods for adding inputs are of the same complexity. Indeed, no scheme that entertains all possible inputs with equal probability will scale up for lifelong learning. We expect that a workable approach will be to take into account constraints imposed by the physical location of connections and consideration of only nearby connection points (Quartz & Sejnowski, 1997).

9 Summary and Conclusion
We examined two approaches for modeling many-layered learning. The first involves learning from a curriculum, and simply illustrates that difficult problems can be learned when broken into a sequence of simple problems. It is remarkable that so much of the human academic enterprise is devoted
Paul E. Utgoff and David J. Stracuzzi
to organizing knowledge for presentation in an orderly graspable manner. This fits well the supposition that humans do indeed have a frontier of receptivity and that new knowledge is laid down in terms of old, to the extent possible. We do not observe our teachers starting a semester with the very last chapter of a text and then hammering away at it week after week, waiting for all the subconcepts (hidden in the earlier chapters) to form themselves. Instead, teachers start quite sensibly at chapter 1 and progress through the well-designed layered presentation. Although agent autonomy is a laudable goal, in moderation, as scientists we impose a serious handicap when we deprive our agents of books and teachers (including parents). The second approach dispensed with organized instruction, offering a possible mechanism for extracting structure from an unstructured stream of rich information. We showed in the STL algorithm how adoption of simple Type-1 learning mechanisms can learn and organize concepts into a network of building blocks in an on-line manner. As simple concepts on the agent's frontier are mastered, the basis for understanding grows, enabling subsequent mastering of concepts that were formerly too difficult. The approach accepts the seeming paradox that our apparent ability to do just Type-1 learning and layering is the bedrock of our intelligence because it produces Type-2 learning. An agent can benefit greatly by following a curriculum. Were STL to process a stream that was generated to contain progressively higher-level information, it would spend a great deal less time futilely trying to master concepts that were currently hopelessly difficult. Exposing an agent to just what should be acquired next helps focus effort and can lead to a better structuring of the learned knowledge. While it has been informative for us to explore how to model learning of knowledge in many layers, some of the problems suggest new approaches.
For example, STL relies on a kind of race to produce a knowledge organization. Whatever can be learned next using simple means achieves the status of building block, which means it has earned the right to be considered as an input to all units yet to be learned. We have noted that this can drive a mechanism for organizing knowledge as it is acquired. However, this strategy does not necessarily lead to the best possible organization. Furthermore, the successfully learned portion of an organized structure becomes statically cast. We would rather have a mechanism in which each unit can continue to consider which other units will serve it best as inputs and revise its selection of inputs dynamically. Finally, while we have advocated a building-block approach designed to eliminate replication of knowledge structures, one can see quite plainly in Figure 1 that many concepts learned for just one card were learned identically for the other. A mechanism for applying learned functions to a variety of arguments would be highly useful. Much of the work in inductive logic programming addresses this problem. It will be useful to explore further how variable binding mechanisms can be modeled in networks of simple
computational devices (Valiant, 2000a, 2000b; Khardon, Roth, & Valiant, 1999). Our main results are an argument in favor of many-layered learning, a demonstration of the advantages of using localized training signals, and a method for self-organization of building-block concepts into a many-layered artificial neural network. Learning of complex structures can be guided successfully by assuming that local learning methods are limited to simple tasks and that the resulting building blocks are available for subsequent learning. Acknowledgments
This work was supported by grants IRI-9711239 and IRI-0097218 from the National Science Foundation. Marcus Frean provided helpful information on generating 2-clumps data. Robbie Jacobs, Andy Barto, and Margie Connell gave helpful comments on an earlier version of this article. The anonymous reviewers added to the clarity of the presentation. References Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press. Ash, T. (1989). Dynamic node creation in backpropagation networks. Connection Science, 1, 365–375. Banerji, R. B. (1980). Artificial intelligence: A theoretical approach. Amsterdam: Elsevier. Blum, A., & Rivest, R. L. (1988). Training a three-node neural network is NP-complete. In Proceedings of the 1988 IEEE Conference on Neural Information (pp. 494–501). San Mateo, CA: Morgan Kaufmann. Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: Wiley. Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–75. Clark, A., & Thornton, C. (1997). Trading spaces: Computation, representation, and the limits of uninformed learning. Behavioral and Brain Sciences, 20, 57–90. Cook, D. J., & Holder, L. B. (1994). Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1, 231–255. Dominguez, M., & Jacobs, R. A. (2001). Visual development and the acquisition of binocular disparity sensitivities. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 114–121). Williamstown, MA: Morgan Kaufmann. Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99. Fahlman, S. E., & Lebiere, C. (1990). The cascade correlation architecture. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.
Fausett, L. (1994). Fundamentals of neural networks: Architectures, algorithms, and applications. Upper Saddle River, NJ: Prentice Hall. Frean, M. (1990). The Upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2, 198–209. Freeman, J. A., & Skapura, D. M. (1991). Neural networks: Algorithms, applications, and programming techniques. Reading, MA: Addison-Wesley. Gallant, S. I. (1986). Optimal linear discriminants. In Proceedings of the International Conference on Pattern Recognition (pp. 849–852). Piscataway, NJ: IEEE Computer Society Press. Hanson, S. J. (1990). Meiosis networks. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 533–541). San Mateo, CA: Morgan Kaufmann. Jacobs, R. A., Jordan, M. I., & Barto, A. G. (1991). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science, 15, 219–250. Judd, J. S. (1990). Neural network design and the complexity of learning. Cambridge, MA: MIT Press. Kaas, J. K. (1982). The segregation of function in the nervous system: Why do sensory systems have so many subdivisions? Contributions to Sensory Physiology, 7, 201–240. Khardon, R., Roth, D., & Valiant, L. (1999). Relational learning for NLP using linear threshold elements. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (pp. 911–919). San Mateo, CA: Morgan Kaufmann. Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318. Mezard, M., & Nadal, J. P. (1989). Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A, 22, 2191–2203. Minton, S. (1990). Quantitative results concerning the utility of explanation-based learning. Artificial Intelligence, 42, 363–391. Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill. Nilsson, N. J. (1965). Learning machines.
New York: McGraw-Hill. Pagallo, G., & Haussler, D. (1990). Boolean feature discovery in empirical learning. Machine Learning, 5, 71–99. Pitt, L., & Valiant, L. G. (1988). Computational limitations on learning from examples. Journal of the ACM, 35, 965–984. Quartz, S. R., & Sejnowski, T. J. (1997). The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences, 20, 537–596. Redding, N. J., Kowalczyk, A., & Downs, T. (1993). Constructive higher-order network algorithm that is polynomial in time. Neural Networks, 6, 997–1010. Rissanen, J., & Langdon, G. G. (1979). Arithmetic coding. IBM Journal of Research and Development, 23, 149–162. Rivest, R. L., & Sloan, R. (1994). A formal model of hierarchical concept learning. Information and Computation, 114, 88–114. Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Sammut, C., & Banerji, R. B. (1986). Learning concepts by asking questions. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. San Mateo, CA: Morgan Kaufmann. Shapiro, A. D. (1987). Structured induction in expert systems. Reading, MA: Addison-Wesley. Shultz, T. R., & Rivest, F. (2000). Using knowledge to speed learning: A comparison of knowledge-based cascade-correlation and multi-task learning. In Proceedings of the Seventeenth International Conference on Machine Learning (pp. 871–878). Palo Alto, CA: Morgan Kaufmann. Šíma, J. (1994). Loading deep networks is hard. Neural Computation, 6, 842–850. Stone, P., & Veloso, M. (2000). Layered learning. In Proceedings of the Eleventh European Conference on Machine Learning (pp. 369–381). Berlin: Springer-Verlag. Stracuzzi, D. J., & Utgoff, P. E. (2002). Randomized variable elimination. In Proceedings of the Nineteenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Suddarth, S. C., & Holden, A. D. C. (1991). Symbolic-neural systems and the use of hints for developing complex systems. International Journal of Man-Machine Studies, 35, 291–311. Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257–277. Turkewitz, G., & Kenny, P. A. (1982). Limitations on input as a basis for neural organization and perceptual development: A preliminary theoretical statement. Developmental Psychobiology, 15, 357–368. Utgoff, P. E., & Precup, D. (1998). Constructive function approximation. In H. Motoda & H. Liu (Eds.), Feature extraction, construction, and selection: A data-mining perspective. Norwell, MA: Kluwer. Valiant, L. G. (2000a). A neuroidal architecture for cognitive computation. Journal of the Association for Computing Machinery, 47, 854–882. Valiant, L. G. (2000b). Robust logics. Artificial Intelligence, 117, 231–253. Werbos, P. J. (1977).
Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22, 25–38. White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3, 535–549. Wynne-Jones, M. (1991). Node splitting: A constructive algorithm for feedforward neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 1072–1079). Zupan, B., Bohanec, M., Demšar, J., & Bratko, I. (1999). Learning by discovering concept hierarchies. Artificial Intelligence, 211–242. Received September 12, 2001; accepted April 10, 2002.
ARTICLE
Communicated by Rodney Douglas
Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations
Wolfgang Maass
[email protected]
Thomas Natschläger
[email protected]
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria
Henry Markram
[email protected]
Brain Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
A key challenge for neural modeling is to explain how a continuous stream of multimodal input from a rapidly changing environment can be processed by stereotypical recurrent circuits of integrate-and-fire neurons in real time. We propose a new computational model for real-time computing on time-varying input that provides an alternative to paradigms based on Turing machines or attractor neural networks. It does not require a task-dependent construction of neural circuits. Instead, it is based on principles of high-dimensional dynamical systems in combination with statistical learning theory and can be implemented on generic evolved or found recurrent circuitry. It is shown that the inherent transient dynamics of the high-dimensional dynamical system formed by a sufficiently large and heterogeneous neural circuit may serve as universal analog fading memory. Readout neurons can learn to extract in real time from the current state of such a recurrent neural circuit information about current and past inputs that may be needed for diverse tasks. Stable internal states are not required for giving a stable output, since transient internal states can be transformed by readout neurons into stable target outputs due to the high dimensionality of the dynamical system. Our approach is based on a rigorous computational model, the liquid state machine, that, unlike Turing machines, does not require sequential transitions between well-defined discrete internal states. It is supported, as the Turing machine is, by rigorous mathematical results that predict universal computational power under idealized conditions, but for the biologically more realistic scenario of real-time processing of time-varying inputs. Our approach provides new perspectives for the interpretation of neural coding, the design of experiments and data analysis in neurophysiology, and the solution of problems in robotics and neurotechnology.
Neural Computation 14, 2531–2560 (2002) © 2002 Massachusetts Institute of Technology
1 Introduction Intricate topographically organized feedforward pathways project rapidly changing spatiotemporal information about the environment into the neocortex. This information is processed by extremely complex but surprisingly stereotypic microcircuits that can perform a wide spectrum of tasks (Shepherd, 1988; Douglas & Martin, 1998; von Melchner, Pallas, & Sur, 2000). The microcircuit features that enable this seemingly universal computational power are a mystery. One particular feature, the multiple recurrent loops that form an immensely complicated network using as many as 80% of all the synapses within a functional neocortical column, has presented an intractable problem for computational models inspired by current artificial computing machinery (Savage, 1998) and for attractor neural network models. The difficulty of understanding computations within recurrent networks of integrate-and-fire neurons comes from the fact that their dynamics takes on a life of its own when challenged with rapidly changing inputs. This is particularly true for the very high-dimensional dynamical system formed by a neural microcircuit, whose components are highly heterogeneous and where each neuron and each synapse add degrees of freedom to the dynamics of the system. The most common approach for modeling computing in recurrent neural circuits has been to try to take control of their high-dimensional dynamics. Methods for controlling the dynamics of recurrent neural networks through adaptive mechanisms are reviewed in Pearlmutter (1995). So far, no one has been able to apply these to the case of networks of spiking neurons. Other approaches to modeling computation in biological neural systems are based on constructions of artificial neural networks that simulate Turing machines or other models for digital computation (Pollack, 1991; Giles, Miller, Chen, Sun, & Lee, 1992; Siegelmann & Sontag, 1994; Hyoetyniemi, 1996; Moore, 1998).
Among these are models, such as dynamical recognizers, capable of real-time computing on on-line input (in discrete time). None of these approaches has been demonstrated to work for networks of spiking neurons or any more realistic models for neural microcircuits. Maass (1996) showed that one also can construct recurrent circuits of spiking neurons that can simulate arbitrary Turing machines. But all of these approaches require synchronization of all neurons by a central clock, a feature that appears to be missing in neural microcircuits. In addition, they require the construction of particular recurrent circuits and cannot be implemented by evolving or adapting a given circuit. Furthermore, the results of Maass and Sontag (1999) on the impact of noise on the computational power of recurrent neural networks suggest that all of these approaches break down as soon as one assumes that the underlying analog computational units are subject to gaussian or other realistic noise distributions. Attractor neural networks, on the other hand, allow noise-robust computation, but their attractor landscape is generally hard to control, and they need a very large set of attractors in order to store
salient information on past inputs (for example, 1024 attractors in order to store 10 bits). In addition, they are less suitable for real-time computing on rapidly varying input streams because of the time required for convergence to an attractor. Finally, none of these approaches allows several real-time computations to be carried out in parallel within the same circuitry, which appears to be a generic feature of neural microcircuits. In this article, we analyze the dynamics of neural microcircuits from the point of view of a readout neuron whose task is to extract information and report results from a neural microcircuit to other circuits. A human observer of the dynamics in a neural microcircuitwould be looking for clearly distinct and temporally stable features, such as convergence to attractors. We show that a readout neuron that receives inputs from hundreds or thousands of neurons in a neural microcircuit can learn to extract salient information from the high-dimensional transient states of the circuit and can transform transient circuit states into stable readouts. In particular, each readout can learn to dene its own notion of equivalence of dynamical states within the neural microcircuit and can then perform its task on novel inputs. This unexpected nding of readout-assigned equivalent states of a dynamical system explains how invariant readout is possible despite the fact that the neural microcircuit may never revisit the same state. Furthermore, we show that multiple readout modules can be trained to perform different tasks on the same state trajectories of a recurrent neural circuit, thereby enabling parallel real-time computing. 
We present the mathematical framework for a computational model that does not require convergence to stable internal states or attractors (even if they do occur), since information about past inputs is automatically captured in the perturbations of a dynamical system (i.e., in the continuous trajectory of transient internal states). Special cases of this mechanism have already been reported in Buonomano and Merzenich (1995) and Dominey, Arbib, and Joseph (1995). Similar ideas have been discovered independently by Jaeger (2001) in the context of artificial neural networks. 2 Computing Without Attractors As an illustration for our general approach to real-time computing, consider a series of transient perturbations caused in an excitable medium (see Holden, Tucker, & Thompson, 1991), for example, a liquid, by a sequence of external disturbances (inputs) such as wind, sound, or sequences of pebbles dropped into the liquid. Viewed as an attractor neural network, the liquid has only one attractor state—the resting state—and may therefore seem useless for computational purposes. However, the perturbed state of the liquid, at any moment in time, represents present as well as past inputs, potentially providing the information needed for an analysis of various dynamic aspects of the environment. In order for such a liquid to serve as a source of salient information about present and past stimuli without relying on stable states, the perturbations must be sensitive to saliently different inputs but nonchaotic. The manner in which perturbations are formed and maintained would vary for different types of liquids and would determine how useful the perturbations are for such retrograde analysis. Limitations on the computational capabilities of liquids are imposed by their time constant for relaxation and the strictly local interactions and homogeneity of the elements of a liquid. Neural microcircuits, however, appear to be ideal liquids for computing on perturbations because of the large diversity of their elements, neurons, and synapses (see Gupta, Wang, & Markram, 2000) and the large variety of mechanisms and time constants characterizing their interactions, involving recurrent connections on multiple spatial scales (“loops within loops”). The foundation for our analysis of computations without stable states is a rigorous computational model, the liquid state machine. Two macroscopic properties emerge from our theoretical analysis and computer simulations as necessary and sufficient conditions for powerful real-time computing on perturbations: a separation property, SP, and an approximation property, AP. SP addresses the amount of separation between the trajectories of internal states of the system that are caused by two different input streams (in the case of a physical liquid, SP could reflect the difference between the wave patterns resulting from different sequences of disturbances). AP addresses the resolution and recoding capabilities of the readout mechanisms—more precisely, their capability to distinguish and transform different internal states of the liquid into given target outputs. (Whereas SP depends mostly on the complexity of the liquid, AP depends mostly on the adaptability of the readout mechanism to the required task.)
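The separation property can be made concrete with a toy simulation. The random leaky recurrent network below is our illustrative stand-in for a liquid (it is not the paper's spiking-neuron model); all names and parameters are our assumptions. Two input streams that differ only in their first ten steps drive the state trajectories apart, and because the dynamics fade, the difference then decays again:

```python
import numpy as np

def liquid_trajectory(inputs, W_in, W, leak=0.3):
    """Run a fixed random recurrent 'liquid' over an input sequence
    (shape T x n_in) and return the state trajectory (T x n_liquid)."""
    x = np.zeros(W.shape[0])
    states = []
    for u_t in inputs:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u_t + W @ x)
        states.append(x.copy())
    return np.array(states)

rng = np.random.default_rng(1)
W_in = rng.normal(size=(50, 1))
W = rng.normal(size=(50, 50)) * 0.05   # weak coupling, so perturbations fade
u = rng.normal(size=(100, 1))
v = u.copy()
v[:10] += 1.0                          # same stream, different early inputs
sep = np.linalg.norm(liquid_trajectory(u, W_in, W)
                     - liquid_trajectory(v, W_in, W), axis=1)
assert sep[10] > 0.0                   # SP: different pasts, different states
assert sep[-1] < sep[10]               # fading memory: the difference decays
```

The decay of `sep` over time is the fading-memory side of the picture: a readout that needs information about the early inputs must extract it before the state difference fades below its resolution (the AP side).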
3 Liquid State Machines Like the Turing machine (Savage, 1998), the model of a liquid state machine (LSM) is based on a rigorous mathematical framework that guarantees, under idealized conditions, universal computational power. Turing machines, however, have universal computational power for off-line computation on (static) discrete inputs, while LSMs have in a very specific sense universal computational power for real-time computing with fading memory on analog functions in continuous time. The input function u(·) can be a continuous sequence of disturbances, and the target output can be some chosen function y(·) of time that provides a real-time analysis of this sequence. In order for a machine M to map input functions of time u(·) to output functions y(·) of time, we assume that it generates, at every time t, an internal “liquid state” x^M(t), which constitutes its current response to preceding perturbations, that is, to preceding inputs u(s) for s ≤ t (see Figure 1). In contrast to the finite state of a finite state machine (or finite automaton), this liquid state consists of analog values that may change continuously over time. Whereas the state
Figure 1: Architecture of an LSM. A function of time (time series) u(·) is injected as input into the liquid filter L^M, creating at time t the liquid state x^M(t), which is transformed by a memoryless readout map f^M to generate an output y(t).
set and the state transition function of a finite state machine are in general constructed for a specific task, the liquid states and the transitions between them need not be customized for a specific task. In a physical implementation, this liquid state consists of all information about the current internal state of a dynamical system that is accessible to the readout modules. In mathematical terms, this liquid state is simply the current output of some operator or filter^1 L^M that maps input functions u(·) onto functions x^M(t): x^M(t) = (L^M u)(t). In the following, we will refer to this filter L^M often as a liquid filter or liquid circuit if it is implemented by a circuit. If it is implemented by a neural circuit, we refer to the neurons in that circuit as liquid neurons. The second component of an LSM M is a memoryless readout map f^M that transforms, at every time t, the current liquid state x^M(t) into the output y(t) = f^M(x^M(t)). In contrast to the liquid filter L^M, this readout map f^M is in general chosen in a task-specific manner (and there may be many different readout maps that extract different task-specific information in parallel from the current output of L^M). Note that in a finite state machine, there exists no analog
[Footnote 1] Functions F that map input functions of time u(·) on output functions y(·) of time are usually called operators in mathematics, but are commonly referred to as filters in engineering and neuroscience. We use the term filter in the following and write (Fu)(t) for the output of the filter F at time t when F is applied to the input function u(·). Formally, such a filter F is a map from U^n into (R^R)^k, where R^R is the set of all real-valued functions of time, (R^R)^k is the set of vectors consisting of k such functions of time, U is some subset of R^R, and U^n is the set of vectors consisting of n functions of time in U.
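The two-component structure x^M(t) = (L^M u)(t), y(t) = f^M(x^M(t)) can be sketched in code. Here a fixed random leaky recurrent network is an illustrative stand-in for the liquid filter (the paper's liquid is a spiking microcircuit model), and an ordinary least-squares linear map is a stand-in for the memoryless readout; all class, parameter, and variable names are our assumptions:

```python
import numpy as np

class TinyLSM:
    """Sketch of an LSM: a fixed, task-independent liquid filter L^M plus a
    task-specific, memoryless readout f^M that sees only the current state."""

    def __init__(self, n_in, n_liquid, seed=0, leak=0.3):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(size=(n_liquid, n_in))
        W = rng.normal(size=(n_liquid, n_liquid))
        # scale the recurrent weights so perturbations fade rather than explode
        self.W = W * (0.9 / np.max(np.abs(np.linalg.eigvals(W))))
        self.leak = leak
        self.x = np.zeros(n_liquid)

    def step(self, u):
        """One update of the liquid state x^M(t) = (L^M u)(t)."""
        self.x = (1 - self.leak) * self.x + \
            self.leak * np.tanh(self.W_in @ u + self.W @ self.x)
        return self.x.copy()

    @staticmethod
    def fit_readout(states, targets):
        """Train a memoryless readout f^M by least squares on liquid states;
        the output states[t] @ w depends only on the current state."""
        w, *_ = np.linalg.lstsq(states, targets, rcond=None)
        return w

lsm = TinyLSM(n_in=1, n_liquid=50)
u_seq = np.sin(0.2 * np.arange(200))[:, None]    # example input stream
states = np.array([lsm.step(u) for u in u_seq])
w = TinyLSM.fit_readout(states, u_seq[:, 0])     # readout target: echo u(t)
mse = np.mean((states @ w - u_seq[:, 0]) ** 2)
```

Because the liquid is fixed and only the readout is trained per task, several readout maps can be fitted to the same state trajectory, which mirrors the parallel-readout idea discussed in the text.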
to such task-specific readout maps, since there, the internal finite states are already constructed in a task-specific manner. According to the preceding definition, readout maps are in general memoryless.^2 Hence, all information about inputs u(s) from preceding time points s ≤ t that is needed to produce a target output y(t) at time t has to be contained in the current liquid state x^M(t). Models for computation that have originated in computer science store such information about the past in stable states (e.g., in memory buffers or tapped delay lines). We argue, however, that this is not necessary since large computational power on functions of time can also be realized even if all memory traces are continuously decaying. Instead of worrying about the code and location where information about past inputs is stored and how this information decays, it is enough to address the separation question: For which later time points t will any two significantly different input functions of time u(·) and v(·) cause significantly different liquid states x_u^M(t) and x_v^M(t)? Good separation capability, in combination with an adequate readout map f^M, allows us to discard the requirement of storing bits until further notice in stable states of the computational system. 4 Universal Computational Power of LSMs for Time-Varying Inputs We say that a class of machines has universal power for computations with fading memory on functions of time if any filter F, that is, any map from functions of time u(·) to functions of time y(·), that is time invariant^3 and has fading memory^4 can be approximated by machines from this class to
[Footnote 2] The term memoryless refers to the fact that the readout map f^M is not required to retain any memory of previous states x^M(s), s < t, of the liquid. However, in a biological context, the readout map will in general be subject to plasticity and may also contribute to the memory capability of the system. We do not explore this issue here because the differentiation into a memoryless readout map and a liquid that serves as a memory device is made for conceptual clarification and is not essential to the model.
[Footnote 3] A filter F is called time invariant if any temporal shift of the input function u(·) by some amount t_0 causes a temporal shift of the output function y = Fu by the same amount t_0, that is, (F u_{t_0})(t) = (Fu)(t + t_0) for all t, t_0 ∈ R, where u_{t_0}(t) := u(t + t_0). Note that if U is closed under temporal shifts, then a time-invariant filter F: U^n → (R^R)^k can be identified uniquely by the values y(0) = (Fu)(0) of its output functions y(·) at time 0.
[Footnote 4] Fading memory (Boyd & Chua, 1985) is a continuity property of filters F that demands that for any input function u(·) ∈ U^n, the output (Fu)(0) can be approximated by the outputs (Fv)(0) for any other input functions v(·) ∈ U^n that approximate u(·) on a sufficiently long time interval [−T, 0]. Formally, one defines that F: U^n → (R^R)^k has fading memory if for every u ∈ U^n and every ε > 0, there exist δ > 0 and T > 0 so that ‖(Fv)(0) − (Fu)(0)‖ < ε for all v ∈ U^n with ‖u(t) − v(t)‖ < δ for all t ∈ [−T, 0]. Informally, a filter F has fading memory if the most significant bits of its current output value (Fu)(0) depend just on the most significant bits of the values of its input function u(·) from some finite time window [−T, 0] into the past. Thus, in order to compute the most significant bits of (Fu)(0), it is not necessary to know the precise value of the input function u(s) for any time s, and it is also not necessary to know anything about the values of u(·) for more than a finite time interval back into the past.
Note that a filter that has fading memory is automatically causal.
any degree of precision. Arguably, these filters F that are approximated according to this definition include all maps from input functions of time to output functions of time that a behaving organism might need to compute. A mathematical theorem (see appendix A) guarantees that LSMs have this universal computational power regardless of specific structure or implementation, provided that two abstract properties are met: the class of basis filters from which the liquid filters L^M are composed satisfies the pointwise separation property, and the class of functions from which the readout maps f^M are drawn satisfies the approximation property. These two properties provide the mathematical basis for the separation property SP and the approximation property AP that were previously discussed. Theorem 1 in appendix A implies that there are no serious a priori limits for the computational power of LSMs on continuous functions of time, and thereby provides a theoretical foundation for our approach to modeling neural computation. In particular, since this theorem makes no specific requirement regarding the exact nature or behavior of the basis filters, as long as they satisfy the separation property (for the inputs in question), it provides theoretical support for employing, instead of circuits that were constructed for a specific task, partially evolved or even rather arbitrary “found” computational modules for purposeful computations. This feature highlights an important difference to computational theories based on Turing machines or finite state machines, which are often used as a conceptual basis for modeling neural computation. The mathematical theory of LSMs can also be extended to cover computation on spike trains (discrete events in continuous time) as inputs. Here the ith component u_i(·) of the input u(·) is a function that assumes only the values 0 and 1, with u_i(t) = 1 if the ith preceding neuron fires at time t. Thus, u_i(·) is not a continuous function but a sequence of point events.
Theorem 2 in appendix A provides a theoretical foundation for approximating any biologically relevant computation on spike trains by LSMs. 5 Neural Microcircuits as Implementations of LSMs In order to test the applicability of this conceptual framework to modeling computation in neural microcircuits, we carried out computer simulations where a generic recurrent circuit of integrate-and-fire neurons (see appendix B for details) was employed as a liquid filter. In other words, computer models for neural microcircuits were viewed as an implementation of the liquid filter L^M of an LSM. In order to test the theoretically predicted universal real-time computing capabilities of these neural implementations of LSMs, we evaluated their performance on a wide variety of challenging benchmark tasks. The input to the neural circuit was via one or several input spike trains, which diverged to inject current into 30% randomly chosen “liquid neurons”. The amplitudes of the input synapses were chosen from a gaussian distribution, so that each neuron in the liquid circuit received a slightly different input (a form of topographic injection). The liquid state of
2538
Wolfgang Maass, Thomas Natschläger, and Henry Markram
the neural microcircuit at time t was defined as all information that a readout neuron could extract at time t from the circuit; that is, the output at time t of all the liquid neurons represented the current liquid state of this instantiation of an LSM. More precisely, since the readout neurons were modeled as integrate-and-fire (I&F) neurons with a biologically realistic membrane time constant of 30 ms, the liquid state x^M(t) at time t consisted of the vector of contributions of all the liquid neurons to the membrane potential at time t of a generic readout neuron (with unit synaptic weights). Mathematically, this liquid state x^M(t) can be defined as the vector of output values at time t of linear filters with exponential decay (time constant 30 ms) applied to the spike trains emitted by the liquid neurons. Each readout map f^M was implemented by a separate population P of I&F neurons (referred to as readout neurons) that received input from all the liquid neurons but had no lateral or recurrent connections.5 The current firing activity p(t) of the population P, that is, the fraction of neurons in P firing during a time bin of 20 ms, was interpreted as the analog output of f^M at time t (one often refers to such representation of analog values by the current firing activity in a pool of neurons as space rate coding). Theoretically, the class of readout maps that can be implemented in this fashion satisfies the approximation property AP (Maass, 2000; Auer, Burgsteiner, & Maass, 2001) and is, according to theorem 1, in principle sufficient for approximating arbitrary given fading memory filters F. In cases where a readout with the discrete values 1 and 0 suffices, one can implement a readout map even by a single I&F neuron that represents these discrete output values by firing or nonfiring at time t. In cases where the target output consists of slowly varying analog values, a single readout neuron can be trained to represent these values through its time-varying firing rate.
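The definition of the liquid state just given is easy to make concrete. The following is a minimal sketch, not the simulation code used in the article; the function name and array layout are our own. It applies a causal exponential filter (time constant 30 ms) to each liquid neuron's spike train and samples the result at time t:

```python
import numpy as np

def liquid_state(spike_times, t, tau=0.030):
    """Liquid state x^M(t): one exponentially filtered value per liquid neuron.

    spike_times -- list of arrays; spike_times[i] holds the spike times (in s)
                   of liquid neuron i
    t           -- time (in s) at which the state is read out
    tau         -- decay time constant of the generic readout membrane (30 ms)
    """
    state = np.empty(len(spike_times))
    for i, spikes in enumerate(spike_times):
        past = spikes[spikes <= t]          # causal filter: only earlier spikes count
        state[i] = np.sum(np.exp(-(t - past) / tau))
    return state

# A neuron that fired at 90 ms and 95 ms contributes more to the state at
# t = 100 ms than one that fired only once at 10 ms.
x = liquid_state([np.array([0.090, 0.095]), np.array([0.010])], t=0.100)
```

A readout map f^M then only needs to compute a (learned) function of this vector at the current time t.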
In any case, the readout neurons can be trained to perform a specific task by adjusting the strengths of the synapses projected onto them from the liquid neurons using a perceptron-like local learning rule (Auer et al., 2001). The final learned state of the readout neurons enables them to take particular weighted sums of the current outputs x^M(t) of the liquid neurons and generate a response f^M(x^M(t)) that approximates the target value y(t). As a first test of these neural implementations of LSMs, we evaluated the separation property SP of computer models for neural microcircuits on spike train inputs. A large set of pairs of Poisson spike trains u(·) and v(·) was randomly generated and injected (in separate trials) as input to the recurrent neural circuit. The resulting trajectories x_u^M(·) and x_v^M(·) of liquid states of the recurrent circuit were recorded for each of these time-varying inputs u(·) and v(·). The average distance ‖x_u^M(t) − x_v^M(t)‖ between these liquid states was plotted in Figure 2 as a function of the time t after the onset of the input, for various fixed values of the distance d(u, v)6 between the two spike train inputs u and v. These curves show that the distance between these liquid states is well above the noise level, that is, above the average liquid state differences caused by the same spike train applied with two different randomly chosen initial conditions of the circuit (indicated by the solid curve). Furthermore, these curves show that the difference in liquid states is, after the first 30 ms, roughly proportional to the distance between the corresponding input spike trains. Note in particular the absence of chaotic effects for these generic neural microcircuit models with biologically realistic intermediate connection lengths.

Figure 2: Average distance of liquid states for two different input spike trains u and v (given as input to the neural circuit in separate trials, each time with an independently chosen random initial state of the neural circuit; see appendix B) plotted as a function of time t. The state distance increases with the distance d(u, v) between the two input spike trains u and v. Plotted on the y-axis is the average value of ‖x_u^M(t) − x_v^M(t)‖, where ‖·‖ denotes the Euclidean norm, and x_u^M(t), x_v^M(t) denote the liquid states at time t for input spike trains u and v. The plotted results for the values 0.1, 0.2, 0.4 of the input difference d_0 represent the average over 200 randomly generated pairs u and v of spike trains such that |d_0 − d(u, v)| < 0.01. Parameters of the liquid: 1 column, degree of connectivity λ = 2 (see appendix B for details).

5 For conceptual purposes, we separate the "liquid" and "readout" elements in this article, although dual liquid-readout functions can also be implemented.

6 In order to define the distance d(u, v) between two spike trains u and v, we replaced each spike by a gaussian exp(−(t/τ)²) for τ = 5 ms (to be precise, u and v are convolved with the gaussian kernel exp(−(t/τ)²)) and defined d(u, v) as the distance of the resulting two continuous functions in the L²-norm (divided by the maximal length 0.5 sec of the spike trains u and v).
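The spike train distance of footnote 6 can be sketched as follows. This is a minimal illustration under our own assumptions (the time grid, discretization step, and function name are not from the article): each spike is replaced by a gaussian bump with τ = 5 ms, and the L²-distance of the two smoothed functions is taken, divided by the maximal train length of 0.5 sec:

```python
import numpy as np

def spike_train_distance(u, v, tau=0.005, t_max=0.5, dt=0.0005):
    """Distance d(u, v) between two spike trains (cf. footnote 6).

    u, v  -- sequences of spike times (in s)
    tau   -- width of the gaussian kernel exp(-(t/tau)^2), 5 ms
    t_max -- maximal length of the spike trains, 0.5 sec
    dt    -- discretization step of the time grid (our assumption)
    """
    grid = np.arange(0.0, t_max, dt)

    def smooth(spikes):
        # sum of gaussian bumps centered on the spike times
        if len(spikes) == 0:
            return np.zeros_like(grid)
        return sum(np.exp(-((grid - s) / tau) ** 2) for s in spikes)

    diff = smooth(u) - smooth(v)
    return np.sqrt(np.sum(diff ** 2) * dt) / t_max

# Identical spike trains have distance 0; displacing a spike increases d(u, v).
d0 = spike_train_distance([0.1, 0.2], [0.1, 0.2])
d1 = spike_train_distance([0.1, 0.2], [0.1, 0.25])
```

The measure is symmetric in u and v and grows smoothly as spikes are displaced, which is what makes the fixed-d(u, v) averages of Figure 2 well defined.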
[Figure 3 appears on this page: for the input patterns "zero" and "one," panels show the 40 input spike trains, the response of the 135 liquid neurons, the response of the 50 readout neurons, and the correctness and certainty traces over 0-0.5 s; see the Figure 3 caption below.]
6 Exploring the Computational Power of Models for Neural Microcircuits

As a first test of its computational power, this simple generic circuit was applied to a previously considered classification task (Hopfield & Brody, 2001), where spoken words were represented by noise-corrupted spatiotemporal spike patterns over a rather long time interval (40-channel spike patterns over 0.5 sec). This classification task had been solved in Hopfield and Brody (2001) by a network of neurons designed for this task (relying on unknown mechanisms that could provide smooth decays of firing activity over longer time periods, and apparently requiring substantially larger networks of I&F neurons if fully implemented with I&F neurons). The architecture of that network, which had been customized for this task, limited its classification power to spike trains consisting of a single spike per channel. We found that the same, but also a more general, version of this spatiotemporal pattern recognition task that allowed several spikes per input channel can be solved by a generic recurrent circuit as described in the previous section. Furthermore, the output of this network was available at any time and was usually correct as soon as the liquid state of the neural circuit had absorbed enough information about the input (the initial value of the correctness just reflects the initial guess of the readout). Formally, we defined the correctness of the neural readout at time s by the term 1 − |target output y(s) − readout activity p(s)|, where the target output y(s) consisted in this case of the constant values 1 or 0, depending on the input pattern. Plotted in Figure 3, for any time t during the presentation of the input patterns, is in addition to the correctness as a function of t also the certainty of the output at time t, which is defined as the average correctness up to that time t.

Figure 3: Facing page. Application of a generic recurrent network of I&F neurons, modeled as an LSM, to a more difficult version of a well-studied classification task (Hopfield & Brody, 2001). Five randomly drawn patterns (called "zero," "one," "two," . . .), each consisting of 40 parallel Poisson spike trains over 0.5 sec, were chosen. Five readout modules, each consisting of 50 I&F neurons, were trained with 20 noisy versions of each input pattern to respond selectively to noisy versions of just one of these patterns (noise was injected by randomly moving each spike by an amount drawn independently from a gaussian distribution with mean 0 and SD 32 ms; in addition, the initial state of the liquid neurons was chosen randomly at the beginning of each trial). The response of the readout that had been trained to detect the pattern "zero" is shown for new, previously not shown, noisy versions of two of the input patterns.a The correctness and certainty (= average correctness so far) are shown as functions of time from the onset of the stimulus at the bottom. The correctness is calculated as 1 − |p(t) − y(t)|, where p(t) is the normalized firing activity in the readout pool (normalized to the range [0, 1]; 1 corresponding to an activity of 180 Hz; bin width 20 ms) and y(t) is the target output. (Correctness starts at a level of 0 for pattern "zero," where this readout pool is supposed to become active, and at a value of 1 for pattern "one," because the readout pool starts in an inactive state.) In contrast to most circuits of spiking neurons that have been constructed for a specific computational task, the spike trains of liquid and readout neurons shown in this figure look rather realistic.

a The familiar delta rule was applied or not applied to each readout neuron, depending on whether the current firing activity in the readout pool was too high, too low, or about right, thus requiring at most two bits of global communication. The precise version of the learning rule was the p-delta rule discussed in Auer et al. (2001).

Whereas the network constructed by Hopfield and Brody (2001) was constructed to be invariant with regard to linear time warping of inputs (provided that only one spike arrives in each channel), the readouts of the generic recurrent circuit that we considered could be trained to be invariant with regard to a large class of different types of noise. The results shown in Figure 3 are for a noise where each input spike is moved independently by an amount drawn from a gaussian distribution with mean 0 and SD 32 ms. Giving a constant output for a time-varying liquid state (caused by a time-varying input) is a serious challenge for an LSM, since it cannot rely on attractor states, and the memoryless readout has to transform the transient and continuously changing states of the liquid into a stable output (see section 10 and Figure 9 for details). In order to explore the limits of this simple neural implementation of an LSM for computing on time-varying input, we chose another classification task where all information of the input is contained in its temporal evolution, more precisely in the interspike intervals of a single input spike train. In this test, eight randomly generated Poisson spike trains over 250 ms or, equivalently, two Poisson spike trains over 1000 ms partitioned into four segments each (see the top of Figure 4) were chosen as template patterns. Other spike trains over 1000 ms were generated by choosing for each 250 ms segment one of the two templates for this segment and by jittering each spike in the templates (more precisely, each spike was moved by an amount drawn from a gaussian distribution with mean 0 and an SD that we refer to as jitter; see the bottom of Figure 4). A typical spike train generated in this way is shown in the middle of Figure 4.
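The template-and-jitter input generation just described can be sketched as follows. This is a minimal illustration; the function name, the number of spikes per template, and the array layout are our own assumptions, not the article's simulation code:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_input(templates, jitter=0.004):
    """Generate one 1000 ms input spike train from per-segment templates.

    templates -- list of four segments; each segment is a pair of candidate
                 template spike-time arrays (times local to the 250 ms segment)
    jitter    -- SD (in s) of the gaussian displacement applied to each spike
    """
    spikes, choices = [], []
    for seg, pair in enumerate(templates):
        choice = rng.integers(0, 2)           # pick template 1 or template 2
        times = pair[choice] + 0.250 * seg    # shift into the segment's window
        spikes.append(times + rng.normal(0.0, jitter, size=len(times)))
        choices.append(int(choice))
    return np.sort(np.concatenate(spikes)), choices

# Four segments, two randomly drawn 250 ms templates (5 spikes each) per segment.
templates = [tuple(np.sort(rng.uniform(0.0, 0.250, 5)) for _ in range(2))
             for _ in range(4)]
spike_train, segment_labels = make_input(templates)
```

The classification task is then to recover, at t = 1000 ms, the entries of `segment_labels` from the liquid state alone.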
Because of the noisy dislocation of spikes, it was impossible to recognize a specific template from a single interspike interval (and there were no spatial cues contained in this single-channel input). Instead, a pattern formed by several interspike intervals had to be recognized and classified retrospectively. Furthermore, readouts were not only trained to classify at time t = 1000 ms (i.e., after the input spike train had entered the circuit) the template from which the last 250 ms segment of this input spike train had been generated, but other readouts were trained to classify simultaneously the templates from which preceding segments of the input (which had entered the circuit several hundred ms earlier) had been generated. Obviously, the latter classification task is substantially more demanding, since the corresponding earlier segments of the input spike train may have left a clear trace in the current firing activity of the recurrent circuit just after they had entered the circuit, but this trace was subsequently overwritten by the next segments of the input spike train (which had no correlation with the choice of the earlier segments). Altogether there were in this experiment four readouts f_1 to f_4, where f_i had been trained to classify at time t = 1000 ms the ith independently chosen 250 ms segment of the preceding input spike train. The performance of the LSM, with a generic recurrent network of 135 I&F neurons as liquid filter (see appendix B), was evaluated after training of
[Figure 4 appears here: the two possible spike train segments (templates 1 and 2) for each of the four 250 ms segments; a typical resulting input spike train (templ. 2, templ. 1, templ. 1, templ. 2) over 1 sec; and template 2 for the first segment together with a jittered version over 0-0.25 sec; see the caption below.]
Figure 4: Evaluating the fading memory of a generic neural microcircuit: the task. In this more challenging classification task, all spike trains are of length 1000 ms and consist of four segments of length 250 ms each. For each segment, two templates were generated randomly (Poisson spike trains with a frequency of 20 Hz); see the upper traces. The actual input spike trains of length 1000 ms used for training and testing were generated by choosing for each segment one of the two associated templates and then generating a noisy version by moving each spike by an amount drawn from a gaussian distribution with mean 0 and an SD that we refer to as jitter (see the lower trace for a visualization of the jitter with an SD of 4 ms). The task is to output, with four different readouts at time t = 1000 ms, for each of the preceding four input segments the number of the template from which the corresponding segment of the input was generated. Results are summarized in Figures 5 and 6.
the readout pools on inputs from the same distribution (for jitter = 4 ms), but with an example that the LSM had not seen before. The accuracy of the four readouts is plotted in Figure 5A. It demonstrates the fading memory of a generic recurrent circuit of I&F neurons, where information about inputs that occurred several hundred ms ago can be recovered even after that input segment was subsequently overwritten. Since readout neurons (and neurons within the liquid circuit) were modeled with a realistic time constant of just 30 ms, the question arises where this information about earlier inputs had been stored for several hundred ms. As a control, we repeated the same experiment with a liquid circuit where
[Figure 5 appears here: (A) average readout correctness with dynamic synapses; (B) liquid firing response with dynamic synapses; (C) average readout correctness with static synapses; (D) liquid firing response with static synapses; see the caption below.]
Figure 5: Evaluating the fading memory of a generic neural microcircuit: results. Four readout modules f_1 to f_4, each consisting of a single perceptron, were trained for their task by linear regression. The readout module f_i was trained to output 1 at time t = 1000 ms if the ith segment of the previously presented input spike train had been constructed from the corresponding template 1, and to output 0 at time t = 1000 ms otherwise. Correctness (percentage of correct classifications on an independent set of 500 inputs not used for training) is calculated as the average over 50 trials. In each trial, new Poisson spike trains were drawn as templates, a new randomly connected circuit was constructed (1 column, λ = 2; see appendix B), and the readout modules f_1 to f_4 were trained with 1000 training examples generated by the distribution described in Figure 4. (A) Average correctness of the four readouts for novel test inputs drawn from the same distribution. (B) Firing activity in the liquid circuit (time interval [0.5 sec, 0.8 sec]) for a typical input spike train. (C) Results of a control experiment where all dynamic synapses in the liquid circuit had been replaced by static synapses (the mean values of the synaptic strengths were uniformly rescaled so that the average liquid activity was approximately the same as for dynamic synapses). The liquid state of this circuit contained substantially less information about earlier input segments. (D) Firing activity in the liquid circuit with static synapses used for the classification results reported in C. The circuit response to each of the four input spikes that entered the circuit during the observed time interval [0.5 sec, 0.8 sec] is quite stereotypical without dynamic synapses (except for the second input spike, which arrives just 20 ms after the first one). In contrast, the firing response of the liquid circuit with dynamic synapses, B, is different for each of the four input spikes, showing that dynamic synapses endow these circuits with the capability to process new input differently depending on the context set by preceding input, even if that preceding input occurred several hundred ms before.
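The dynamic synapses referred to here follow a short-term plasticity model in the style of Markram and Tsodyks, with parameters U (utilization of synaptic efficacy), D (depression time constant), and F (facilitation time constant). The per-spike recurrence below is a common discrete formulation of that model; it is a sketch for illustration, with illustrative parameter values, not the exact equations or values of appendix B:

```python
import numpy as np

def synaptic_amplitudes(spike_times, U=0.5, D=1.1, F=0.05, w=1.0):
    """Relative amplitudes w * u_k * R_k of successive postsynaptic responses.

    u tracks facilitation (effective release probability), R tracks depression
    (fraction of available resources); between spikes, u relaxes toward U with
    time constant F and R recovers toward 1 with time constant D.
    """
    u, R = U, 1.0
    amplitudes = [w * u * R]
    for delta in np.diff(spike_times):            # inter-spike intervals
        R = 1 + (R - u * R - 1) * np.exp(-delta / D)  # resources recover toward 1
        u = U + u * (1 - U) * np.exp(-delta / F)      # utilization relaxes toward U
        amplitudes.append(w * u * R)
    return amplitudes

# With these parameters a 20 Hz spike train yields a depressing synapse:
# successive response amplitudes shrink, so the synapse's output depends on
# the recent history of presynaptic activity, not just the current spike.
amps = synaptic_amplitudes(np.arange(0, 0.5, 0.05))
```

It is this history dependence that makes the liquid response to each input spike in Figure 5B context dependent rather than stereotypical.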
the dynamic synapses had been replaced by static synapses (with synaptic weights that achieved about the same level of firing activity as the circuit with dynamic synapses). Figure 5C shows that this results in a significant loss in performance for the classification of all except the last input segment. A possible explanation is provided by the raster plots of firing activity in the liquid circuit with (see Figure 5B) and without dynamic synapses (see Figure 5D), shown here with high temporal resolution. In the circuit with dynamic synapses, the recurrent activity differs for each of the four spikes that entered the circuit during the time period shown, demonstrating that each new spike is processed by the circuit in an individual manner that depends on the context defined by preceding input spikes. In contrast, the firing response is very stereotypical for the same four input spikes in the circuit without dynamic synapses, except for the response to the second spike, which arrives within 20 ms of the first one (see the period between 500 and 600 ms in Figure 5D). This indicates that the short-term dynamics of synapses may play an essential role in the integration of information for real-time processing in neural microcircuits. Figure 6 examines another aspect of neural microcircuits that appears to be important for their separation property: the statistical distribution of connection lengths within the recurrent circuit. Six types of liquid circuits, each consisting of 135 I&F neurons but with different values of the parameter λ that regulated the average number of connections and the average spatial length of connections (see appendix B), were trained and evaluated according to the same protocol and for the same task as in Figure 5. Shown in Figure 6 for each of these six types of liquid circuits is the average correctness of the readout f_3 on novel inputs, after it had been trained to classify the second-to-last segment of the input spike train.
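The role of λ can be made concrete with a distance-dependent connectivity rule of the kind used in appendix B, where the probability of a connection between two neurons on a 3 × 3 × 15 grid falls off with their distance roughly as C · exp(−(D(a, b)/λ)²). The sketch below is our own simplified rendering of that rule (the constant C and the exact profile are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def connect(shape=(3, 3, 15), lam=2.0, C=0.3):
    """Random connectivity matrix for a column of neurons on a 3D grid.

    The probability of a connection from neuron a to neuron b decays with
    their Euclidean grid distance D(a, b) as C * exp(-(D(a, b)/lam)^2), so
    lam controls both the number and the typical length of connections.
    """
    coords = np.array([(x, y, z) for x in range(shape[0])
                       for y in range(shape[1]) for z in range(shape[2])])
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    prob = C * np.exp(-(dist / lam) ** 2)
    return rng.random(prob.shape) < prob

# Small lam: few, short-range connections. Large lam: many, largely
# length-independent connections (the regime that tends toward chaos).
sparse = connect(lam=1.0)
dense = connect(lam=8.0)
```

λ = 0 corresponds to no recurrent connections at all, while large λ homogenizes the circuit, which is the trade-off examined in Figure 6.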
The performance was fairly low for circuits without recurrent connections (λ = 0). It was also fairly low for recurrent circuits with large values of λ, whose largely length-independent distribution of connections homogenized the microcircuit and facilitated chaotic behavior. Hence, for this classification task, the ideal "liquid circuit" is a microcircuit that has, in addition to local connections to neighboring neurons, also a few long-range connections, thereby interpolating between the customarily considered extremes of strictly local connectivity (as in a cellular automaton), on one hand, and the locality-ignoring global connectivity of a Hopfield net, on the other hand. The performance results of neural implementations of LSMs reported in this section should not be viewed as absolute data on the computational power of recurrent neural circuits. Rather, the general theory suggests that their computational power increases with any improvement in their separation or approximation property. Since the approximation property AP was already close to optimal for these networks (increasing the number of neurons in the readout module did not increase the performance significantly; not shown), the primary limitation in performance lay in the separation property SP. Intuitively, it is clear that the liquid circuit needs to be sufficiently complex to hold the details required for the particular task but should reduce information that is not relevant to the task (e.g., spike time jitter). SP can be engineered in many ways, such as incorporating neuron diversity, implementing specific synaptic architectures, altering microcircuit connectivity, or simply recruiting more columns. The last option is of particular interest because it is not available in most computational models. It is explored in the next section.

[Figure 6 appears here: a plot of average correctness as a function of λ; see the caption below.]

Figure 6: Average correctness depends on the parameter λ that controls the distribution of random connections within the liquid circuit. Plotted is the average correctness (at time t = 1000 ms, calculated as the average over 50 trials as in Figure 5; same number of training and test examples) of the readout module f_3 (which is trained to classify retroactively the second-to-last segment of the preceding spike train) as a function of λ. The poor performance for λ = 0 (no recurrent connections within the circuit) shows that recurrent connections are essential for achieving a satisfactory separation property in neural microcircuits. Values of λ that are too large also decrease the performance because they support a chaotic response.

7 Adding Computational Power

An interesting structural difference between neural systems and our current generation of artificial computing machinery is that the computational power of neural systems can apparently be enlarged by recruiting more circuitry (without the need to rewire old or new circuits). We explored the consequences of recruiting additional columns for neural implementations of LSMs (see Figure 7B) and compared it with the option of adding further connections to the primary one-column liquid that we used so far (135 I&F
neurons with λ = 2; see Figure 7A). Figure 7C demonstrates that the recruitment of additional columns increases the separation property of the liquid circuit in a desirable manner, where the distance between subsequent liquid states (always recorded at time t = 1000 ms in this experiment) is proportional to the distance between the spike train inputs that had previously entered the liquid circuit (spike train distance is measured in the same way as for Figure 2). In contrast, the addition of more connections to a single column (λ = 8; see appendix B) also increases the separation between subsequent liquid states, but in a quasi-chaotic manner, where small input distances cause about the same distances between subsequent liquid states as large input distances. In particular, the subsequent liquid state distance is about equally large for two jittered versions of the input spike train (yielding typically a value of d(u, v) around 0.1) as for significantly different input spike trains that require different outputs of the readouts. Thus, improving SP by altering the intrinsic microcircuitry of a single column increases sensitivity for the task but also increases sensitivity to noise. The performance of these different types of liquid circuits for the same classification task as in Figure 6 is consistent with this analysis of their characteristic separation property. Shown in Figure 7D is their performance for various values of the spike time jitter in the input spike trains. The optimization of SP for a specific distribution of inputs and a specific group of readout modules is likely to arrive at a specific balance between the intrinsic complexity of the microcircuitry and the number of repeating columns.

8 Parallel Computing in Real-Time on Novel Inputs

Since the liquid of the LSM does not have to be trained for a particular task, it supports parallel computing in real time.
This was demonstrated by a test in which multiple spike trains were injected into the liquid, and multiple readout neurons were trained to perform different tasks in parallel. We added six readout modules to a liquid consisting of two columns with different values of λ.7 Each of the six readout modules was trained independently for a completely different on-line task that required an output value at any time t. We focused here on tasks that require diverse and rapidly changing analog output responses y(t). Figure 8 shows that after training, each of these six tasks can be performed in real time with high accuracy. The performance shown is for a novel input that was not drawn from the same distribution as the training examples and differs in several aspects from the training examples (thereby demonstrating the possibility of extrageneralization in neural microcircuits, due to their inherent bias, that goes beyond the usual definition of generalization in statistical learning theory).

7 In order to combine high sensitivity with good generalization performance, we chose here a liquid consisting of two columns as before, one with λ = 2, the other with λ = 8, and the interval [14.0, 14.5] for the uniform distribution of the nonspecific background current I_b.

[Figure 7 appears here: schematic drawings of (A) a one-column and (B) a four-column LSM; (C) state distance as a function of input distance for the variants "4 col., λ = 2," "1 col., λ = 2," and "1 col., λ = 8"; (D) average correctness as a function of input jitter for the same three variants; see the caption below.]

Figure 7: Separation property and performance of liquid circuits with larger numbers of connections or neurons. Schematic drawings of LSMs consisting of (A) one column and (B) four columns. Each column consists of 3 × 3 × 15 = 135 I&F neurons. (C) The separation property depends on the structure of the liquid. Average state distance (at time t = 100 ms) calculated as described in Figure 2. A column with high internal connectivity (high λ) achieves higher separation than a single column with lower connectivity but tends toward chaotic behavior, where it becomes equally sensitive to small and large input differences d(u, v). On the other hand, the characteristic curve for a liquid consisting of four columns with small λ is lower for values of d(u, v) lying in the range of jittered versions u and v of the same spike train pattern (d(u, v) ≤ 0.1 for jitter ≤ 8 ms) and higher for values of d(u, v) in the range typical for spike trains u and v from different classes (mean: 0.22). (D) Evaluation of the same three types of liquid circuits for noise-robust classification. Plotted is the average performance for the same task as in Figure 6, but for various values of the jitter in input spike times. Several columns (not interconnected) with low internal connectivity yield a better-performing implementation of an LSM for this computational task, as predicted by the analysis of their separation property.

9 Readout-Assigned Equivalent States of a Dynamical System

Real-time computation on novel inputs implies that the readout must be able to generate an invariant or appropriately scaled response for any input even though the liquid state may never repeat. Indeed, Figure 3 already showed that the dynamics of readout pools can become quite independent from the dynamics of the liquid even though the liquid neurons are the only source of input. To examine the underlying mechanism for this relatively independent readout response, we reexamined the readout pool from Figure 3. Whereas the firing activity within the liquid circuit was highly dynamic, the firing activity in the readout pool was almost constant after training (see Figures 9B-9D). The stability of the readout response does not simply come about because the readout samples only a few "unusual" liquid neurons, as shown by the distribution of synaptic weights onto a sample readout neuron (see Figure 9E). Since the synaptic weights do not change after learning, this indicates that the readout neurons have learned to define a notion of equivalence for dynamic states of the liquid. Indeed, equivalence classes are an inevitable consequence of collapsing the high-dimensional space of liquid states into a single dimension, but what is surprising is that the equivalence classes are meaningful in terms of the task, allowing invariant and appropriately scaled readout responses and therefore real-time computation on novel inputs. Furthermore, although the input rate may contain salient information that is constant for a particular readout element, it may not be for another (see, for example, Figure 8), indicating that equivalence classes and dynamic stability exist purely from the perspective of the readout elements.
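The multitasking scheme of section 8 (one shared liquid, several independently trained linear readouts) can be sketched as follows. This is a toy illustration, not the article's simulation code: a random matrix stands in for the liquid-state trajectory, and the two target functions are invented for the example. The point is only that several readouts are fitted independently to the same states:

```python
import numpy as np

rng = np.random.default_rng(1)

# Surrogate liquid-state trajectory: n_steps time points, 270 liquid neurons
# (two columns of 135). In the article these states would come from the
# simulated circuit; here random features stand in for them.
n_steps, n_neurons = 500, 270
states = rng.normal(size=(n_steps, n_neurons))

# Two different target functions y(t) defined on the same state trajectory.
targets = {
    "task_a": states[:, :30].mean(axis=1),       # a smoothed, rate-like signal
    "task_b": (states[:, 0] > 0).astype(float),  # a binary detection signal
}

# Each readout is trained independently, by linear regression, on the SAME
# liquid states; the liquid itself is never adapted to either task.
X = np.hstack([states, np.ones((n_steps, 1))])   # add a bias term
readouts = {name: np.linalg.lstsq(X, y, rcond=None)[0]
            for name, y in targets.items()}
outputs = {name: X @ w for name, w in readouts.items()}
```

Each weight vector in `readouts` implicitly partitions the high-dimensional state space into its own equivalence classes: states mapped to the same output value are equivalent for that readout, while a different readout may separate them.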
10 Discussion

We introduce the LSM, a new paradigm for real-time computing on time-varying input streams. In contrast to most other computational models, it does not require the construction of a circuit or program for a specific computational task. Rather, it relies on principles of high-dimensional dynamical systems and learning theory that allow it to adapt unspecific evolved or found recurrent circuitry for a given computational task. Since only the readouts, not the recurrent circuit itself, have to be adapted for specific computational tasks, the same recurrent circuit can support completely different real-time computations in parallel. The underlying abstract computational model of an LSM emphasizes the importance of perturbations in dynamical systems for real-time computing, since even without stable states or attractors, the separation property and the approximation property may endow a dynamical system with virtually unlimited computational power on time-varying inputs.
[Figure 8 appears on this page: four input spike trains at the top, followed by six readout output traces over a 2 sec interval: the sum of rates integrated over the past 30 ms; the sum of rates integrated over the past 200 ms; detection of a spatiotemporal pattern; detection of a switch in the spatial distribution of rates; spike coincidences of inputs 1 & 3 over the last 75 ms; and spike coincidences of inputs 1 & 2 over the last 75 ms; see the Figure 8 caption below.]
In particular, we have demonstrated the computational universality of generic recurrent circuits of I&F neurons (even with quite arbitrary connection structure) if viewed as special cases of LSMs. Apparently, this is the first stable and generally applicable method for using generic recurrent networks of I&F neurons to carry out a wide family of complex real-time computations on spike trains as inputs. Hence, this approach provides a platform for exploring the computational role of specific aspects of biological neural microcircuits. The computer simulations reported in this article provide
possible explanations not only for the computational role of the highly recurrent connectivity structure of neural circuits, but also for their characteristic distribution of connection lengths, which places their connectivity structure between the extremes of strictly local connectivity (cellular automata or coupled map lattices) and uniform global connectivity (Hopfield nets) that are usually addressed in theoretical studies. Furthermore, our computer simulations suggest an important computational role of dynamic synapses for real-time computing on time-varying inputs. Finally, we reveal a most unexpected and remarkable principle: readout elements can establish their own equivalence relationships on high-dimensional transient states of a dynamical system, making it possible to generate stable and appropriately scaled output responses even if the internal state never converges to an attractor state. In contrast to virtually all computational models from computer science or artificial neural networks, this computational model is enhanced rather than hampered by the presence of diverse computational units. Hence, it may also provide insight into the computational role of the complexity and diversity of neurons and synapses (see, e.g., Gupta et al., 2000). While there are many plausible models for spatial aspects of neural computation, a biologically realistic framework for modeling temporal aspects of neural computation has been missing. In contrast to models inspired by computer science, the LSM does not try to reduce these temporal aspects to transitions between stable states or limit cycles, and it does not require delay lines or buffers. Instead, it proposes that the trajectory of internal states of a recurrent neural circuit provides a raw, unbiased, and universal source of temporally integrated information, from which specific readout elements can extract specific information about past inputs for their individual task. Hence, the notorious trial-to-trial stimulus response variations in single neurons and in populations of neurons observed experimentally may reflect an accumulation of information from previous inputs in the trajectory of internal states, rather than noise (see also Arieli, Sterkin, Grinvald, & Aertsen, 1996).

Figure 8: Facing page. Multitasking in real time. Four input spike trains of length 2 sec (shown at the top) are injected into a liquid module consisting of two columns (randomly constructed with the same parameters; see appendix B), which is connected to multiple readout modules. Each readout module is trained to extract information for a different real-time computing task. The target functions are plotted as dashed lines and the population responses of the corresponding readout modules as solid lines. The tasks assigned to the six readout modules were the following. Represent the sum of rates: at time t, output the sum of firing rates of all four input spike trains within the last 30 ms. Represent the integral of the sum of rates: at time t, output the total activity in all four inputs integrated over the last 200 ms. Pattern detection: output a high value if a specific spatiotemporal spike pattern appears. Represent a switch in spatial distribution of rates: output a high value if a specific input pattern occurs where the rate of input spike trains 1 and 2 goes up and simultaneously the rate of input spike trains 3 and 4 goes down; otherwise remain low. Represent the firing correlation: at time t, output the number of spike coincidences (normalized into the range [0, 1]) during the last 75 ms for inputs 1 and 3 and, separately, for inputs 1 and 2. Target readout values are plotted as dashed lines, and the actual outputs of the readout modules as solid lines, all on the same timescale as the four spike trains shown at the top that enter the liquid circuit during this 2 sec time interval. Results shown are for a novel input that was not drawn from the same distribution as the training examples; 150 training examples were drawn randomly from the following distribution. Each input spike train was an independently drawn Poisson spike train with a time-varying rate of r(t) = A + B sin(2πft + α). The parameters A, B, and f were drawn randomly from the following intervals (the phase was fixed at α = 0 deg): A: [0 Hz, 30 Hz] and [70 Hz, 100 Hz]; B: [0 Hz, 30 Hz] and [70 Hz, 100 Hz]; f: [0.5 Hz, 1 Hz] and [3 Hz, 5 Hz]. On this background activity, four different patterns had been superimposed (always in the same order during training): a rate switch to inputs 1 and 3, a burst pattern, a rate switch to inputs 1 and 2, and a spatiotemporal spike pattern. The results shown are for a test input that could not be generated by the same distribution as the training examples, because its base level (A = 50 Hz), as well as the amplitude (B = 50 Hz), frequency (f = 2 Hz), and phase (α = 180 deg) of the underlying time-varying firing rate of the Poisson input spike trains, were chosen to lie in the middle of the gaps between the two intervals used for these parameters during training. Furthermore, the spatiotemporal patterns (a burst pattern, a rate switch to inputs 1 and 3, and a rate switch to inputs 1 and 2), which were superimposed to achieve more input variation within the observed 2 sec, never occurred in this order and at these time points for any training input. Hence, the accurate performance for this novel input demonstrates substantial generalization capabilities of the readouts after training.
This would imply that averaging over trials or binning peels out most of the information processed by recurrent microcircuits and leaves mostly topographic information. This approach also offers new ideas for models of the computational organization of cognition. It suggests that it may not be necessary to scatter all information about sensory input by recoding it through feedforward processing as the output vector of an ensemble of feature detectors with fixed receptive fields (thereby creating the "binding problem"). It proposes that at the same time, more global information about preceding inputs can be preserved in the trajectories of very high-dimensional dynamical systems, from which multiple readout modules extract and combine the information needed for their specific tasks. This approach is nevertheless compatible with experimental data that confirm the existence of special maps of feature detectors. These could reflect specific readouts, but also specialized components of a liquid circuit, that have been optimized genetically and through development to enhance the separation property of a neural microcircuit for a particular input distribution. The new conceptual framework presented in this article suggests complementing the experimental investigation of neural coding by a systematic investigation of the trajectories of internal states of neural microcircuits or systems, which are compared on one hand with inputs to the circuit and on the other hand with responses of different readout projections.

Figure 9: Readout assigned equivalent states of a dynamical system. An LSM (liquid circuit as in Figure 3) was trained for the classification task as described in Figure 3. Results shown are for a novel test input (drawn from the same distribution as the training examples). (A) The test input consists of 40 Poisson spike trains, each with a constant rate of 5 Hz. (B) Raster plot of the 135 liquid neurons in response to this input. Note the large variety of liquid states that arise during this time period. (C) Population rate of the liquid (bin size 20 ms). Note that this population rate changes quite a bit over time. (D) Readout response (solid line) and target response (dashed line). The target response had a constant value of 1 for this input. The output of the trained readout module is also almost constant for this test example (except for the beginning), although its input, the liquid states of the recurrent circuit, varied quite a bit during this time period. (E) Weight distribution of a single readout neuron.

The liquid computing framework suggests that recurrent neural microcircuits, rather than individual neurons, might be viewed as basic computational units of cortical computation and therefore may give rise to a new generation of cortical models that link LSM "columns" to form cortical areas, where neighboring columns read out different aspects of another column and each of the stereotypic columns serves both liquid and readout functions. In fact, the classification of neurons into liquid and readout neurons is primarily made for conceptual reasons. Another conceptual simplification was made by restricting plasticity to synapses onto readout neurons. However, synapses in the liquid circuit are also likely to be plastic, for example, to support the extraction of independent components of information about preceding time-varying inputs for a particular distribution of natural stimuli and thereby enhance the separation property of neural microcircuits. This plasticity within the liquid would be input driven and less task specific and might be most prominent during development of an organism. In addition, the information processing capabilities of hierarchies or other structured networks of LSMs remain to be explored, which may provide a basis for modeling larger cortical areas. Apart from biological modeling, the computational model discussed in this article may also be of interest for some areas of computer science. In computer applications where real-time processing of complex input streams is required, such as in robotics, there is no need to work with complicated heterogeneous recurrent networks of I&F neurons as in biological modeling.
Instead, one can use simple devices such as tapped delay lines for storing information about past inputs. Furthermore, one can use any one of a large selection of powerful tools for static pattern recognition (such as feedforward neural networks, support vector machines, or decision trees) to extract from the current content of such a tapped delay line information about a preceding input time series, in order to predict that time series, classify that time series, or propose actions based on that time series. This works fine, except that one has to deal with the problems caused by local minima in the error functions of such highly nonlinear pattern recognition devices, which may result in slow learning and suboptimal generalization. In general, the escape from such local minima requires further training examples or time-consuming off-line computations, such as repetition of backpropagation for many different initial weights or the solution of a quadratic optimization problem in the case of support vector machines. Hence, these approaches tend to be incompatible with real-time requirements, where a classification or prediction of the past input time series is needed instantly. Furthermore, these standard approaches provide no support for multitasking, since one has to run a separate copy of the time-consuming pattern recognition algorithm for each individual classification or prediction task. In contrast, the alternative computational paradigm discussed in this article suggests replacing the tapped delay line by a nonlinear on-line projection of the input time series into a high-dimensional space, in combination with linear readouts from that high-dimensional intermediate space. The nonlinear on-line preprocessing could even be implemented by inexpensive (even partially faulty) analog circuitry, since the details of this on-line preprocessing do not matter as long as the separation property is satisfied for all relevant inputs. If this task-independent on-line preprocessing maps input streams into a sufficiently high-dimensional space, all subsequent linear pattern recognition devices, such as perceptrons, receive essentially the same classification and regression capability for the time-varying inputs to the system as nonlinear classifiers without preprocessing. The training of such linear readouts has an important advantage compared with training nonlinear readouts. While the error minimization for a nonlinear readout is likely to get stuck in local minima, the sum of squared errors for a linear readout has just a single local minimum, which is automatically the global minimum of this error function. Furthermore, the weights of linear readouts can be adapted in an on-line manner by very simple local learning rules so that the weight vector moves toward this global minimum. Related mathematical facts are exploited by support vector machines in machine learning (Vapnik, 1998), although the boosting of the expressive power of linear readouts is implemented there in a different fashion that is not suitable for real-time computing. Finally, the new approach to real-time neural computation presented in this article may provide new ideas for neuromorphic engineering and analog VLSI.
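The computational advantage claimed here for linear readouts (a single, global minimum of the squared-error function) is easy to check numerically. The sketch below is not the circuit of this article: it replaces the liquid by a generic random tanh network, and the network size, seeds, input scaling, and moving-average target filter are all illustrative assumptions. The point is only that a linear readout trained by ordinary least squares recovers a fading-memory function of the input from the projected states, with no local-minima problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed random recurrent network serves as a stand-in "liquid" that projects
# the input stream into a high-dimensional state space; only the readout learns.
N, T, warm = 100, 2000, 100
W = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1: fading memory
w_in = rng.normal(0.0, 0.5, N)

u = rng.uniform(-1.0, 1.0, T)                     # input time series
x = np.zeros(N)
states = np.empty((T, N))
for t in range(T):
    x = np.tanh(W @ x + w_in * u[t])              # nonlinear on-line projection
    states[t] = x

# Target: a fading-memory filter of the input (mean of the last 5 samples).
y = np.convolve(u, np.ones(5) / 5, mode="full")[:T]

# Linear readout: the sum of squared errors is convex in w_out, so least
# squares finds the global minimum of the error function directly.
w_out, *_ = np.linalg.lstsq(states[warm:], y[warm:], rcond=None)
pred = states[warm:] @ w_out
mse = np.mean((pred - y[warm:]) ** 2)
print("readout MSE:", mse, " target variance:", np.var(y[warm:]))
```

The same state matrix could serve several readouts at once, one least-squares problem per task, which is the multitasking point made above.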
Besides implementing recurrent circuits of spiking neurons in silicon, one could examine a wide variety of other materials and circuits that may enable inexpensive implementation of liquid modules with suitable separation properties, to which a variety of simple adaptive readout devices may be attached to execute multiple tasks.

Appendix A: Mathematical Theory

We say that a class CB of filters has the point-wise separation property with regard to input functions from U^n if for any two functions u(·), v(·) ∈ U^n with u(s) ≠ v(s) for some s ≤ 0, there exists some filter B ∈ CB that separates u(·) and v(·), that is, (Bu)(0) ≠ (Bv)(0). Note that it is not required that there exists a filter B ∈ CB with (Bu)(0) ≠ (Bv)(0) for any two functions u(·), v(·) ∈ U^n with u(s) ≠ v(s) for some s ≤ 0. Simple examples for classes CB of filters that have this property are the class of all delay filters u(·) ↦ u^{t0}(·) (for t0 ∈ ℝ), the class of all linear filters with impulse responses of the form h(t) = e^{−at} with a > 0, and the class of filters defined by standard models for dynamic synapses (see Maass & Sontag, 2000). A liquid filter L^M of an LSM M is said to be composed of filters from CB if there are finitely many filters B1, ..., Bm in CB, to which we refer as basis filters in this context, so that (L^M u)(t) = ⟨(B1 u)(t), ..., (Bm u)(t)⟩ for all t ∈ ℝ and all input functions u(·) in U^n. In other words, the output of L^M for a particular input u is simply the vector of outputs given by these finitely many basis filters for this input u. A class CF of functions has the approximation property if for any m ∈ ℕ, any compact (i.e., closed and bounded) set X ⊆ ℝ^m, any continuous function h: X → ℝ, and any given r > 0, there exists some f in CF so that |h(x) − f(x)| ≤ r for all x ∈ X. The definition for the case of functions with multidimensional output is analogous.

Theorem 1. Consider a space U^n of input functions where U = {u: ℝ → [−B, B] : |u(t) − u(s)| ≤ B′ · |t − s| for all t, s ∈ ℝ} for some B, B′ > 0 (thus, U is a class of uniformly bounded and Lipschitz-continuous functions). Assume that CB is some arbitrary class of time-invariant filters with fading memory that has the point-wise separation property. Furthermore, assume that CF is some arbitrary class of functions that satisfies the approximation property. Then any given time-invariant filter F that has fading memory can be approximated by LSMs with liquid filters L^M composed from basis filters in CB and readout maps f^M chosen from CF. More precisely, for every ε > 0, there exist m ∈ ℕ, B1, ..., Bm ∈ CB, and f^M ∈ CF so that the output y(·) of the LSM M with liquid filter L^M composed of B1, ..., Bm, that is, (L^M u)(t) = ⟨(B1 u)(t), ..., (Bm u)(t)⟩, and readout map f^M satisfies ‖(Fu)(t) − y(t)‖ ≤ ε for all u(·) ∈ U^n and all t ∈ ℝ.

The proof of this theorem follows from the Stone-Weierstrass approximation theorem, like the proof of theorem 1 in Boyd and Chua (1985). One can easily show that the converse of theorem 1 also holds: if the functions in CF are continuous, then any filter F that can be approximated by the LSMs considered in theorem 1 is time invariant and has fading memory.
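The point-wise separation property can be illustrated numerically with the second basis-filter class mentioned above, linear filters with impulse response h(t) = e^{−at}. In this sketch the particular inputs, the interval on which they differ, and the values of a are arbitrary choices: two input histories agree everywhere except on a short stretch of the past, and each filter's output at time 0 already separates them.

```python
import numpy as np

# Two input histories u, v with u(s) != v(s) only for s in (2, 2.5) in the past.
dt = 0.001
s = np.arange(0.0, 10.0, dt)        # s >= 0 indexes time -s (the past)
u = np.sin(s)
v = u.copy()
v[(s > 2.0) & (s < 2.5)] += 1.0

def exp_filter_at_zero(w, a):
    """(Bw)(0) for the linear filter with impulse response h(t) = exp(-a*t):
    numerically, the integral over the past of w(-s) * exp(-a*s)."""
    return np.sum(w * np.exp(-a * s)) * dt

for a in (0.5, 1.0, 2.0):
    bu, bv = exp_filter_at_zero(u, a), exp_filter_at_zero(v, a)
    print(f"a={a}: (Bu)(0)={bu:.5f}, (Bv)(0)={bv:.5f}, separated: {bu != bv}")
```

Because e^{−as} is strictly positive, any extra mass anywhere in the past shifts the output; a delay filter u(·) ↦ u^{t0}(·) with t0 inside the differing interval would separate these two inputs even more directly.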
In combination with theorem 1, this provides a complete characterization of the computational power of LSMs. In order to extend theorem 1 to the case where the inputs are finite or infinite spike trains rather than continuous functions of time, one needs to consider an appropriate notion of fading memory for filters on spike trains. The traditional definition, given in note 4, is not suitable for the following reason. If u(·) and v(·) are functions with values in {0, 1} that represent spike trains and if δ ≤ 1, then the condition ‖u(t) − v(t)‖ < δ is too strong: it would require that u(t) = v(t). Hence, we define for the case where the domain U consists of spike trains (i.e., 0–1 valued functions) that a filter F: U^n → (ℝ^ℝ)^k has fading memory on spike trains if for every u = ⟨u1, ..., un⟩ ∈ U^n and every ε > 0 there exist δ > 0 and m ∈ ℕ so that ‖(Fv)(0) − (Fu)(0)‖ < ε for all v = ⟨v1, ..., vn⟩ ∈ U^n with the property that for i = 1, ..., n the last m spikes in vi (before time 0) each have a distance of at most δ from the corresponding ones among the last m spikes in ui. Intuitively, this says that a filter has fading memory on spike trains if the most significant bits of the filter output can already be determined from the approximate times of the last few spikes in the input spike train. For this notion of fading memory on spike trains, one can prove:

Theorem 2. Consider the space U^n of input functions where U is the class of spike trains with some minimal distance D between successive spikes (e.g., D = 1 ms). Assume that CB is some arbitrary class of time-invariant filters with fading memory on spike trains that has the point-wise separation property. Furthermore, assume that CF is some arbitrary class of functions that satisfies the approximation property. Then any given time-invariant filter F that has fading memory on spike trains can be approximated by LSMs with liquid filters L^M composed from basis filters in CB and readout maps f^M chosen from CF.

The proof for theorem 2 is obtained by showing that for all filters F, fading memory on spike trains is equivalent to continuity with regard to a suitable metric on spike trains that turns the domain of spike trains into a compact metric space. Hence, one can also apply the Stone-Weierstrass approximation theorem to this case of computations on spike trains.

Appendix B: Details of the Computer Simulation

We used a randomly connected circuit consisting of 135 I&F neurons, 20% of which were randomly chosen to be inhibitory, as a single "column" of neural circuitry (Tsodyks, Uziel, & Markram, 2000). Neuron parameters were membrane time constant 30 ms, absolute refractory period 3 ms (excitatory neurons) or 2 ms (inhibitory neurons), threshold 15 mV (for a resting membrane potential assumed to be 0), reset voltage 13.5 mV, constant nonspecific background current Ib = 13.5 nA, and input resistance 1 MΩ.
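The neuron model just specified can be reproduced in a few lines. The sketch below simulates a single leaky I&F neuron with the parameters listed above (membrane time constant 30 ms, threshold 15 mV, reset 13.5 mV, 3 ms refractory period, input resistance 1 MΩ, background current 13.5 nA); the forward-Euler step, the simulation length, and the extra 3 nA test current are choices made here for illustration. Note that R·Ib = 13.5 mV lies just below the 15 mV threshold, so the background current alone cannot drive the neuron to fire.

```python
def simulate_lif(i_ext_na, t_max_ms=500.0, dt=0.1):
    """Forward-Euler simulation of one leaky I&F neuron; returns spike times (ms)."""
    tau_m = 30.0          # membrane time constant (ms)
    r_in = 1.0            # input resistance (MOhm), so mV = MOhm * nA
    v_th, v_reset = 15.0, 13.5
    t_ref = 3.0           # absolute refractory period (ms), excitatory value
    i_b = 13.5            # constant nonspecific background current (nA)
    v, refr, spikes = 0.0, 0.0, []
    for step in range(int(t_max_ms / dt)):
        if refr > 0.0:    # hold at the reset voltage during the refractory period
            refr -= dt
            continue
        v += dt / tau_m * (-v + r_in * (i_b + i_ext_na))
        if v >= v_th:
            spikes.append(step * dt)
            v, refr = v_reset, t_ref
    return spikes

print("spikes, background only:", len(simulate_lif(0.0)))   # subthreshold drive
print("spikes, extra 3 nA     :", len(simulate_lif(3.0)))   # steady-state 16.5 mV
```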
Regarding connectivity structure, the probability of a synaptic connection from neuron a to neuron b (as well as that of a synaptic connection from neuron b to neuron a) was defined as C · e^{−(D(a,b)/λ)²}, where λ is a parameter that controls both the average number of connections and the average distance between neurons that are synaptically connected. We assumed that the 135 neurons were located on the integer points of a 15 × 3 × 3 column in space, where D(a, b) is the Euclidean distance between neurons a and b. Depending on whether a and b were excitatory (E) or inhibitory (I), the value of C was 0.3 (EE), 0.2 (EI), 0.4 (IE), or 0.1 (II). In the case of a synaptic connection from a to b, we modeled the synaptic dynamics according to the model proposed in Markram, Wang, and Tsodyks (1998), with the synaptic parameters U (use), D (time constant for depression), and F (time constant for facilitation) randomly chosen from gaussian distributions that were based on empirically found data for such connections. Depending on whether a, b were excitatory (E) or inhibitory (I), the mean values of these three parameters (with D, F expressed in seconds) were chosen to be .5, 1.1, .05 (EE); .05, .125, 1.2 (EI); .25, .7, .02 (IE); and .32, .144, .06 (II). The SD of each parameter was chosen to be 50% of its mean (with negative values replaced by values chosen from an appropriate uniform distribution). The mean of the scaling parameter A (in nA) was chosen to be 30 (EE), 60 (EI), −19 (IE), and −19 (II). In the case of input synapses, the parameter A had a value of 18 nA if projecting onto an excitatory neuron and 9.0 nA if projecting onto an inhibitory neuron. The SD of the A parameter was chosen to be 100% of its mean and was drawn from a gamma distribution. The postsynaptic current was modeled as an exponential decay exp(−t/τs) with τs = 3 ms (6 ms) for excitatory (inhibitory) synapses. The transmission delays between liquid neurons were chosen uniformly to be 1.5 ms (EE) and 0.8 ms for the other connections. For each simulation, the initial conditions of each leaky I&F neuron, that is, the membrane voltage at time t = 0, were drawn randomly (uniform distribution) from the interval [13.5 mV, 15.0 mV]. Together with the spike time jitter in the input, these randomly drawn initial conditions served as the implementation of noise in our simulations (in order to test the noise robustness of our approach). Readout elements used in the simulations of Figures 3, 8, and 9 were made of 51 I&F neurons (unconnected). A variation of the perceptron learning rule (the delta rule; see Hertz, Krogh, & Palmer, 1991) was applied to scale the synapses of these readout neurons: the p-delta rule discussed in Auer et al. (2001). The p-delta rule is a generalization of the delta rule that trains a population of perceptrons to adopt a given population response (in terms of the number of perceptrons that are above threshold), requiring very little overhead communication.
This rule, which formally requires adjusting the weights and the threshold of perceptrons, was applied in such a manner that the background current of an I&F neuron was adjusted instead of the threshold of a perceptron (while the firing threshold was kept constant at 15 mV). In Figures 8 and 9, the readout neurons are not fully modeled as I&F neurons, but just as perceptrons (with a low-pass filter in front that transforms synaptic currents into postsynaptic potentials, time constant 30 ms), in order to save computation time. In this case, the "membrane potential" of each perceptron is checked every 20 ms, and it is said to "fire" at this time point if this "membrane potential" is currently above the 15 mV threshold. No refractory effects are modeled, and there is no reset after firing. The percentage of readout neurons that fire during a 20 ms time bin is interpreted as the current output of this readout module (assuming values in [0, 1]). In the simulations for Figures 5, 6, and 7, we used just single perceptrons as readout elements. The weights of such a single perceptron were trained using standard linear regression: the target value for the linear regression problem was +1 (−1) if the perceptron should output 1 (0) for the given input. The output of the perceptron after learning was 1 (0) if the weighted sum of inputs was ≥ 0 (< 0).
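The geometric part of the column construction above is straightforward to reproduce. The sketch below samples one such connectivity matrix: 135 neurons on the integer points of a 15 × 3 × 3 grid, roughly 20% inhibitory, with connection probability C · exp(−(D(a,b)/λ)²) using the C values given above. The value λ = 2 and the random seed are illustrative assumptions; the text treats λ as a free parameter.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# 135 neurons on the integer points of a 15 x 3 x 3 column.
pos = np.array(list(itertools.product(range(15), range(3), range(3))), dtype=float)
n = len(pos)
inhibitory = rng.random(n) < 0.2          # ~20% randomly chosen to be inhibitory

# C depends on the (pre, post) types: EE, EI, IE, II, as specified above.
C = {(False, False): 0.3, (False, True): 0.2, (True, False): 0.4, (True, True): 0.1}
lam = 2.0                                  # assumed value of the lambda parameter

conn = np.zeros((n, n), dtype=bool)
for a in range(n):
    for b in range(n):
        if a == b:
            continue
        d = np.linalg.norm(pos[a] - pos[b])              # Euclidean distance D(a, b)
        c = C[(bool(inhibitory[a]), bool(inhibitory[b]))]
        conn[a, b] = rng.random() < c * np.exp(-((d / lam) ** 2))

print("neurons:", n, " inhibitory:", int(inhibitory.sum()),
      " connections:", int(conn.sum()))
```

Per existing connection, the synaptic parameters U, D, F and the scaling parameter A would then be drawn from the gaussian and gamma distributions described above.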
Acknowledgments

We thank Rodney Douglas, Herbert Jaeger, Wulfram Gerstner, Alan Murray, Misha Tsodyks, Thomas Poggio, Lee Segal, Tali Tishby, Idan Segev, Phil Goodman, and Mark Pinsky for their comments on a draft of this article. The work was supported by project P15386 of the Austrian Science Fund, the NeuroCOLT project of the EU, the Office of Naval Research, HFSP, the Dolfi & Ebner Center, and the Edith Blum Foundation. H. M. is the incumbent of the Diller Family Chair in Neuroscience at the Weizmann Institute, Rehovot, Israel.
References

Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Auer, P., Burgsteiner, H., & Maass, W. (2001). The p-delta rule for parallel perceptrons. Manuscript submitted for publication. Available on-line: http://www.igi.TUGraz.at/maass/p delta learning.pdf.
Boyd, S., & Chua, L. O. (1985). Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Trans. on Circuits and Systems, 32, 1150–1161.
Buonomano, D. V., & Merzenich, M. M. (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science, 267, 1028–1030.
Dominey, P., Arbib, M., & Joseph, J. P. (1995). A model of corticostriatal plasticity for learning oculomotor associations and sequences. J. Cogn. Neurosci., 7(3), 311–336.
Douglas, R., & Martin, K. (1998). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 459–509). Oxford: Oxford University Press.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.
Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Holden, A. V., Tucker, J. V., & Thompson, B. C. (1991). Can excitable media be considered as computational systems? Physica D, 49, 240–246.
Hopfield, J. J., & Brody, C. D. (2001). What is a moment? Transient synchrony as a collective mechanism for spatio-temporal integration. Proc. Natl. Acad. Sci. USA, 89(3), 1282.
Hyoetyniemi, H. (1996). Turing machines are recurrent neural networks. In J. Alander, T. Honkela, & M. Jacobsson (Eds.), Proc. of SteP'96—Genes, Nets and Symbols (pp. 13–24). VTT Tietotekniikka: Finnish Artificial Intelligence Society.
Jaeger, H. (2001). The "echo state" approach to analyzing and training recurrent neural networks. Manuscript submitted for publication.
Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural Computation, 8(1), 1–40.
Maass, W. (2000). On the computational power of winner-take-all. Neural Computation, 12(11), 2519–2536.
Maass, W., & Sontag, E. D. (1999). Analog neural nets with gaussian or other common noise distributions cannot recognize arbitrary regular languages. Neural Computation, 11, 771–782.
Maass, W., & Sontag, E. D. (2000). Neural systems as nonlinear filters. Neural Computation, 12(8), 1743–1772.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci., 95, 5323–5328.
Moore, C. (1998). Dynamical recognizers: Real-time language recognition by analog computers. Theoretical Computer Science, 201, 99–136.
Pearlmutter, B. A. (1995). Gradient calculation for dynamic recurrent neural networks: A survey. IEEE Trans. on Neural Networks, 6(5), 1212–1228.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.
Savage, J. E. (1998). Models of computation: Exploring the power of computing. Reading, MA: Addison-Wesley.
Shepherd, G. M. (1988). A basic circuit for cortical organization. In M. Gazzaniga (Ed.), Perspectives in memory research (pp. 93–134). Cambridge, MA: MIT Press.
Siegelmann, H., & Sontag, E. D. (1994). Analog computation via neural networks. Theoretical Computer Science, 131, 331–360.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neuroscience, 20, RC50.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
von Melchner, L., Pallas, S. L., & Sur, M. (2000). Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature, 404, 871–876.

Received August 14, 2001; accepted May 3, 2002.
NOTE
Communicated by Irwin Sandberg
Universal Approximation of Multiple Nonlinear Operators by Neural Networks Andrew D. Back [email protected] Windale Technologies, Brisbane, QLD 4075, Australia
Tianping Chen [email protected] Department of Mathematics, Fudan University, Shanghai, 200433, China
Recently, there has been interest in the observed capabilities of some classes of neural networks with fixed weights to model multiple nonlinear dynamical systems. While this property has been observed in simulations, open questions exist as to how this property can arise. In this article, we propose a theory that provides a possible mechanism by which this multiple modeling phenomenon can occur.

1 Introduction

Understanding the mechanisms by which neural networks can approximate functions, functionals, and operators has generated significant interest in recent years. The property of universal function approximation has been proven for various feedforward neural network models (Hornik, Stinchcombe, & White, 1989). In dynamic networks with embedded time delays, these results have been extended to the case of universal functional approximation (Chen & Chen, 1993) and universal operator approximation (Sandberg, 1991; Chen & Chen, 1995).¹ Universal approximation properties for recurrent neural networks over a finite time interval have been proven by Sontag (1992). Related work in continuously or discretely time-varying dynamical systems exists in the literature under various names, for example, multiple controllers (Narendra, Balakrishnan, & Ciliz, 1995), modular neural networks (Jacobs, Jordan, Nowlan, & Hinton, 1991; Schmidhuber, 1992), and hybrid systems (Branicky, Borkar, & Mitter, 1994; Brockett, 1993; Branicky, 1996; Lemmon & Antsaklis, 1995). Recently, several researchers have shown independently that some types of neural network are capable of learning to model several different dynamical systems within one structure (Feldkamp, Puskorius, & Moore, 1997; Younger, Conwell, & Cotter, 1999; Hochreiter & Schmidhuber, 1997; Hochreiter, Younger, & Conwell, 2001). In this article, we provide a theoretical explanation of a mechanism by which neural network models can approximate dynamical systems that change continuously or switch between some form of characteristic behaviors. We refer to such systems as multiple nonlinear operator (MNO) models. In the next section, we present results on the universal approximation of MNOs by neural networks, followed by a brief example.

¹ We refer to neural networks with time-delay or recurrent connections as dynamic neural network models.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2561–2566 (2002)

2 Universal Approximation of Multiple Nonlinear Operators

2.1 Functional and Operator Approximation. It is well known that nonlinear dynamical systems can be treated as functionals and operators (Chen & Chen, 1995; Sandberg, 1991; Stiles, Sandberg, & Ghosh, 1997). The time delay neural network (TDNN) functional (Chen & Chen, 1993) is defined on a compact set in C[a,b] (the space of all continuous functions) and is given by

$$G(u) = \sum_{i=1}^{N} c_i\, g\!\left(\sum_{j=1}^{m} \xi_{ij}\, u(t_j) + \theta_i\right), \tag{2.1}$$
where u(t) is a continuous real-valued output at time t. A more general class of models, which has significant implications for modeling dynamic systems, comprises those capable of approximating arbitrary nonlinear operators. For dynamic models, this means mapping from one time-varying sequence (a function of time) to another. Chen and Chen (1995) have proved universal approximation for the following operator model,

$$G(u)(y) = \sum_{k=1}^{N}\sum_{i=1}^{M} c_i^k\, g\!\left(\sum_{j=1}^{m} \xi_{ij}^k\, u(x_j) + \theta_i^k\right) g(v_k \cdot y + \zeta_k), \tag{2.2}$$
where G(u)(y) is an arbitrary nonlinear operator with inputs u and y. The Chen network can be regarded as a feedforward network with two weight layers, in which the output weights have each been replaced by a two-layer functional approximating network. The idea of these parameter replacement networks has also been proposed independently elsewhere (Priestley, 1980; Back & Chen, 1998; Schmidhuber, 1992). In this previous work, however, universal operator approximation properties were not proven.

2.2 Multiple Nonlinear Operator Approximation. Consider a multiple nonlinear operator H, defined broadly as a set of mappings of the form H: F_i → F_{i+1}, where i = 1, 2, ... is the index of the discrete functionals in the discrete multiple model case, or H: F_a → F_b, corresponding to the continuous multiple model case. Such a model encapsulates the basic framework of the multiple model capabilities observed in various neural networks (Feldkamp et al., 1997). Following a similar approach to Chen and Chen (1995), it is possible to construct a network that performs the task of universally approximating a multiple nonlinear operator. We obtain the following theorem:

Theorem 1. Suppose that g is a nonpolynomial, continuous, bounded function, X is a Banach space, K1 ⊆ X is a compact set in X, and K2 ⊆ ℝ^n, K3 ⊆ ℝ^n, K4 ⊆ ℝ^n are compact sets in ℝ^n; V is a compact set in C(K1), Z is a compact set in C(K3), and G is a nonlinear continuous operator that maps V into C(K2) × C(K3). Then for any ε > 0, there are positive integers M, N, P, m, p, constants c_i^{kh}, Q_{il}^{kh}, r_i^{kh}, ζ_k, ξ_{ij}^k, θ_i^k ∈ ℝ, and points v_k ∈ ℝ^n, x_j ∈ K1, x_l ∈ K3 (i = 1, ..., M, k = 1, ..., N, h = 1, ..., P, j = 1, ..., m, l = 1, ..., p) such that

$$\left| G(u)(y)(z) - \sum_{k=1}^{N}\sum_{i=1}^{M}\sum_{h=1}^{P} c_i^{kh}\, g\!\left(\sum_{l=1}^{p} Q_{il}^{kh} z_l + r_i^{kh}\right) g\!\left(\sum_{j=1}^{m} \xi_{ij}^k\, u(x_j) + \theta_i^k\right) g(v_k \cdot y + \zeta_k) \right| < \varepsilon \tag{2.3}$$
holds for all u ∈ V, y ∈ K_2, and z ∈ Z.

Proof. From theorems 3 and 5 in Chen and Chen (1995), and following the same approach as in theorem 5 there, we assume that G is a continuous operator that maps a compact set Z in C(K_3) into C(K_4). Hence, G(Z) = {G(z) : z ∈ Z} is also a compact set in C(K_4). From theorem 3 (Chen & Chen, 1995), we have for any ε > 0 a positive integer N, real numbers c_i^{k}(G(z)), θ_i^{k}, ζ_k, ξ_{ij}^{k} ∈ R, and vectors v_k ∈ R^n (i = 1, …, M; j = 1, …, m; k = 1, …, N) such that

\[ \Bigl| G(u)(y)(z) - \sum_{k=1}^{N}\sum_{i=1}^{M} c_i^{k}(G(z))\, g\Bigl(\sum_{j=1}^{m} \xi_{ij}^{k}\, u(x_j) + \theta_i^{k}\Bigr)\, g(v_k \cdot y + \zeta_k) \Bigr| < \frac{\varepsilon}{2} \]
(2.4)
holds for all y ∈ K_2, u ∈ V, and z ∈ Z. As before, we note that since G is a continuous operator, for each k = 1, …, N, c_i^{k}(G(z)) is a continuous functional defined on Z. Hence, by applying theorem 4 (Chen & Chen, 1995) for each i = 1, …, M, k = 1, …, N, it is possible to determine positive integers N_k, m_k and constants c_i^{kh}, ξ̃_{ij}^{kh}, ρ_i^{kh} ∈ R,
2564
Andrew D. Back and Tianping Chen
and points x_j ∈ K_3 (h = 1, …, N_k; j = 1, …, m_k) such that

\[ \Bigl| c_i^{k}(G(z)) - \sum_{h=1}^{N_k} c_i^{kh}\, g\Bigl(\sum_{j=1}^{m_k} \tilde{\xi}_{ij}^{kh}\, z(x_j) + \rho_i^{kh}\Bigr) \Bigr| < \frac{\varepsilon}{2L} \]
(2.5)
holds for all i = 1, …, M, k = 1, …, N, and z ∈ Z, where

\[ L = \sum_{k=1}^{N} \sup_{y \in K_2} \bigl| g(v_k \cdot y + \zeta_k) \bigr|. \qquad (2.6) \]
Therefore, substituting equation 2.5 into 2.4, we have

\[ \Bigl| G(u)(y)(z) - \sum_{k=1}^{N}\sum_{i=1}^{M}\sum_{h=1}^{N_k} c_i^{kh}\, g\Bigl(\sum_{j=1}^{m_k} \tilde{\xi}_{ij}^{kh}\, z(x_j) + \rho_i^{kh}\Bigr)\, g\Bigl(\sum_{j=1}^{m} \xi_{ij}^{k}\, u(x_j) + \theta_i^{k}\Bigr)\, g(v_k \cdot y + \zeta_k) \Bigr| < \varepsilon. \]
(2.7)
Let P = max_k {N_k}, p = max_k {m_k}, c_i^{kh} = 0 for all N_k < h ≤ P, and ξ̃_{il}^{kh} = 0 for all m_k < l ≤ p. Hence, we have

\[ \Bigl| G(u)(y)(z) - \sum_{k=1}^{N}\sum_{i=1}^{M}\sum_{h=1}^{P} c_i^{kh}\, g\Bigl(\sum_{l=1}^{p} \tilde{\xi}_{il}^{kh}\, z(x_l) + \rho_i^{kh}\Bigr)\, g\Bigl(\sum_{j=1}^{m} \xi_{ij}^{k}\, u(x_j) + \theta_i^{k}\Bigr)\, g(v_k \cdot y + \zeta_k) \Bigr| < \varepsilon. \qquad (2.8) \]

This model can be interpreted as the z input acting as a dynamic gating mechanism that, by virtue of its universal functional approximation capabilities on a compact set, allows the selection of a particular operator defined by the encompassed Chen network. The mapping of the z input allows the effective output weights c_i^{kh} to be varied such that the desired operators can be chosen. Thus, the operator map given by the mapping on u(x_j) can now be adjusted arbitrarily for every z input. The implication is that G(u)(y)(z) ∈ K_o, and we obtain a network capable of universally approximating any multiple nonlinear operator on a compact set by allowing the input z to select any unique operator mapping. Hence, a fixed-weight network of this type is capable of approximating multiple dynamic systems, as observed in some classes of neural networks.

2.3 Example. An example multiple nonlinear operator, demonstrating the model, can be synthesized as follows. Let there exist an MNO defined by H_m(u)(x)(t) = H_m(u) x(t), where
\[ H_m(u) = \frac{1}{1 - f_1(u, t)\, q^{-1} + f_2(u, t)\, q^{-2}}. \]
(2.9)
We introduce coefficient functions f_i(t), i = 1, 2, given by

\[ f_i(t) = c_{i0} + c_{i1}\, f_i'(t), \qquad f_i'(t) = \frac{b_{i0}\, u(t) + b_{i1}\, u(t-1)}{1 + a_{i1}\, q^{-1} + a_{i2}\, q^{-2}}. \]
(2.10)
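As an illustration, the example system of equations 2.9 and 2.10 can be simulated by unrolling equation 2.9 into the recursion y(t) = x(t) + f_1(t) y(t−1) − f_2(t) y(t−2), with each f_i' realized as a second-order filter driven by u. The sketch below is ours; the function name and all parameter values are illustrative, not taken from the article:

```python
import numpy as np

def simulate_mno(x, u, a, b, c):
    """Sketch of the example MNO of equations 2.9-2.10 (illustrative).

    a[i] = (a_i1, a_i2), b[i] = (b_i0, b_i1), c[i] = (c_i0, c_i1).
    The ancillary input u selects the effective linear operator applied to x.
    """
    T = len(x)
    fp = np.zeros((2, T))  # f'_i(t): second-order linear filters driven by u
    f = np.zeros((2, T))   # f_i(t) = c_i0 + c_i1 * f'_i(t)  (equation 2.10)
    y = np.zeros(T)
    for t in range(T):
        for i in range(2):
            fp[i, t] = (b[i][0] * u[t]
                        + (b[i][1] * u[t - 1] if t >= 1 else 0.0)
                        - (a[i][0] * fp[i, t - 1] if t >= 1 else 0.0)
                        - (a[i][1] * fp[i, t - 2] if t >= 2 else 0.0))
            f[i, t] = c[i][0] + c[i][1] * fp[i, t]
        # Equation 2.9 written as a difference equation:
        # y(t) = x(t) + f1(t) y(t-1) - f2(t) y(t-2)
        y[t] = (x[t]
                + (f[0, t] * y[t - 1] if t >= 1 else 0.0)
                - (f[1, t] * y[t - 2] if t >= 2 else 0.0))
    return y
```

With c_i1 = 0 the coefficients are constant and the model reduces to a fixed linear filter; varying u then switches the operator, which is the multiple-model behavior the construction is meant to capture.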
The characteristics of the model H_m vary as a function of the ancillary input signal u. It is not difficult to select model parameters that yield widely varying model characteristics. The resulting MNO model encompasses multiple operators of the general form given in equation 2.9. The model can also be viewed as an extension of the usual bilinear structure, described in terms of the difference-equation form of a nonlinear pole-zero model given by

\[ y(t) = x(t) + f_1(t)\, y(t-1) + f_2(t)\, y(t-2), \]
\[ f_i(t) = c_{i1} + c_{i2}\bigl(b_0\, x(t) + b_1\, x(t-1) - a_{i1}\, f_i(t-1) + a_{i2}\, f_i(t-2)\bigr), \]
(2.11)
where x(t) is the input and y(t) is the output.

3 Conclusion

In this article, we have proposed a new theorem of multiple nonlinear operator approximation. This theorem provides an explanation of how multiple models can occur in dynamic neural networks. The proposed theorem is an extension of Chen and Chen's earlier results on universal operator approximation. It is hoped that the work presented here will serve to stimulate additional research into this potentially powerful capability of neural networks.

Acknowledgments

This work was performed while A. D. B. was at the RIKEN Brain Science Institute, Institute of Physical and Chemical Research, Japan. A. D. B. acknowledges fruitful discussions with S. Amari and L. Feldkamp. T. C. is supported by NSF of China under grant numbers 69982003 and 60074005.

References

Back, A., & Chen, T. (1998). Approximation of hybrid systems by neural networks. In N. Kasabov, R. Kozma, K. Ko, R. O'Shea, G. Coghill & T. Gedeon (Eds.), Proc. of 1997 Int. Conf. on Neural Information Processing ICONIP'97 (Vol. 1, pp. 326–329). Berlin: Springer-Verlag.
Branicky, M. (1996). Studies in hybrid systems. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Branicky, M. S., Borkar, V. S., & Mitter, S. K. (1994). A unified framework for hybrid control: Background, model, and theory (Techn. Rep. No. LIDS-P-223). Cambridge, MA: Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, MIT.
Brockett, R. (1993). Hybrid models for motion control systems. In H. Trentelman & J. C. Willems (Eds.), Perspectives in control (pp. 29–54). Boston: Birkhauser.
Chen, T., & Chen, H. (1993). Approximations of continuous functionals by neural networks with application to dynamic systems. IEEE Trans. Neural Networks, 4(6), 910–918.
Chen, T., & Chen, H. (1995). Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans. Neural Networks, 6(4), 911–917.
Feldkamp, L., Puskorius, G., & Moore, P. (1997). Adaptive behaviour from fixed weight dynamic networks. Information Sciences, 98, 217–235.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hochreiter, S., Younger, A. S., & Conwell, P. R. (2001). Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001) (pp. 87–94). Berlin: Springer-Verlag.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Lemmon, M., & Antsaklis, P. (1995). Inductively inferring valid logical models for continuous-state dynamical systems. Theoretical Computer Science, 138, 201–220.
Narendra, K., Balakrishnan, J., & Ciliz, M. (1995). Adaptation and learning using multiple models switching and tuning. IEEE Control Systems Magazine, 15(3), 37–51.
Priestley, M. (1980).
State-dependent models: A general approach to non-linear time series analysis. J. Time Series Analysis, 1(1), 47–71.
Sandberg, I. (1991). Approximation theorems for discrete-time systems. IEEE Trans. Circuits, Syst., 38(5), 564–566.
Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1), 131–139.
Sontag, E. (1992). Neural nets as systems models and controllers. In Proc. Seventh Yale Workshop on Adaptive and Learning Systems (pp. 73–79). New Haven, CT: Yale University.
Stiles, B. W., Sandberg, I. W., & Ghosh, J. (1997). Complete memory structures for approximating nonlinear discrete-time mappings. IEEE Transactions on Neural Networks, 8(6), 1397–1409.
Younger, A., Conwell, P., & Cotter, N. (1999). Fixed-weight on-line learning. IEEE Trans. on Neural Networks, 10(2), 272–283.

Received November 15, 2001; accepted May 8, 2002.
LETTER
Communicated by Andrew Barto
Long-Term Reward Prediction in TD Models of the Dopamine System

Nathaniel D. Daw
[email protected]
David S. Touretzky
[email protected]
Computer Science Department and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

This article addresses the relationship between long-term reward predictions and slow-timescale neural activity in temporal difference (TD) models of the dopamine system. Such models attempt to explain how the activity of dopamine (DA) neurons relates to errors in the prediction of future rewards. Previous models have been mostly restricted to short-term predictions of rewards expected during a single, somewhat artificially defined trial. Also, the models focused exclusively on the phasic pause-and-burst activity of primate DA neurons; the neurons' slower, tonic background activity was assumed to be constant. This has led to difficulty in explaining the results of neurochemical experiments that measure indications of DA release on a slow timescale, results that seem at first glance inconsistent with a reward prediction model. In this article, we investigate a TD model of DA activity modified so as to enable it to make longer-term predictions about rewards expected far in the future. We show that these predictions manifest themselves as slow changes in the baseline error signal, which we associate with tonic DA activity. Using this model, we make new predictions about the behavior of the DA system in a number of experimental situations. Some of these predictions suggest new computational explanations for previously puzzling data, such as indications from microdialysis studies of elevated DA activity triggered by aversive events.

1 Introduction
Neural Computation 14, 2567–2583 (2002) © 2002 Massachusetts Institute of Technology

Recent neurophysiological and modeling work has suggested parallels between natural and artificial reinforcement learning (Montague, Dayan, & Sejnowski, 1996; Houk, Adams, & Barto, 1995; Schultz, Dayan, & Montague, 1997; Suri & Schultz, 1999). Activity recorded from primate dopamine (DA) neurons (Schultz, 1998), a system associated with motivation and addiction, appears qualitatively similar to the error signal from temporal difference (TD) learning (Sutton, 1988). This algorithm allows a system to learn
from experience to predict some measure of reward expected in the future (known as a return). One aspect of the TD algorithm that has received little attention in the modeling literature is what sort of prediction the system is learning to make. TD algorithms have been devised for learning a variety of returns, differing, for example, as to the time frame over which the predicted rewards accumulate and whether rewards expected far in the future, if they are included at all, are discounted with respect to more immediate rewards. The DA models that have been the subject of simulation (Montague et al., 1996; Schultz et al., 1997; Suri & Schultz, 1999) all assume for convenience a simple discrete-trial setting, with returns accumulating only over the course of a single trial. Although the modeled experiments have this structure, it is not clear whether or how animals segment a continuous stream of experience into a set of disjoint, repeating trials. And, as this article chiefly explores, the focus on predictions within a single trial eliminates subtler effects that a longer-timescale aspect to predictions would produce in the modeled DA signal. A second weakness of current TD models of the DA signal is that they focus entirely on phasic activity. DA neurons exhibit tonic background firing punctuated by phasic bursts and pauses. Previous DA models explain the locations of the phasic events but treat the background firing rate as constant. Advocates of the TD models sometimes explicitly distinguish tonic DA as a separate phenomenon to which the models do not apply (Schultz, 1998). Nonetheless, experimental methods that measure DA activity over a very slow timescale have revealed effects that appear hard to reconcile with the TD models. Most critically, microdialysis studies have found evidence of elevated DA activity in response to aversive situations (see Horvitz, 2000, for a review).
That such activity seems inconsistent with an error signal for reward prediction has fueled criticism of the TD models of DA (Redgrave, Prescott, & Gurney, 1999; Horvitz, 2000). Another slow-timescale effect that is not predicted by the TD models is the habituation, seen in voltammetry recordings, of DA release over the course of about a minute of unpredictable, rewarding brain stimulation (Kilpatrick, Rooney, Michael, & Wightman, 2000). This article addresses both of these weaknesses of existing TD models together. We show that rather than being inapplicable to tonic DA, these models have all along contained the germ of an account of it in the relationship between slow-timescale DA release and long-term reward predictions. We study the tonic behavior of TD models modified only so far as necessary to allow them to predict a longer-term return that includes rewards expected in future trials. When the artificial horizon on the return is removed, the TD error includes a slowly changing background term, which we associate with tonic DA. These slow changes suggest a new computational explanation for the seemingly paradoxical data on DA responses to aversive events and for the slow habituation of DA release in brain stimulation experiments. We
also suggest a number of predictions about experiments that have not yet been performed. Finally, we offer some thoughts about what these computational considerations suggest about the functional anatomy of DA and related brain systems.

2 TD Models of the Dopamine System

2.1 Return Definitions for TD Models. TD learning is a reward prediction algorithm for Markov decision processes (MDPs; Sutton, 1988). In a series of discrete time steps, the process moves through a series of states, and the goal is to learn the value of these states. (In the context of modeling animal brains, time steps can be taken to correspond to intervals of real time and states to animals' internal representations of sensory observations.) Knowledge of values can, for instance, guide a separate action selection mechanism toward the most lucrative areas of the environment. More formally, systems use the TD algorithm to learn to estimate a value function V(s(t)), which maps the state at some instant, s(t), to a measure of the expected reward that will be received in the future. We refer to the function's value at time t, V(s(t)), using the abbreviated notation V(t). There are a number of common measures of future reward, known as returns, differing mainly in how rewards expected at different times are combined. The simplest return, used in the DA models of Montague et al. (1996) and Schultz et al. (1997), is expected cumulative undiscounted reward,

\[ V(t) = E\Bigl[\sum_{\tau > t} r(\tau)\Bigr], \qquad (2.1) \]
where r(τ) is the reward received at time τ. This return makes sense only in finite horizon settings—those in which time is bounded. It is not informative to sum rewards over a period without end, since V(t) can then be infinite. To usefully measure the value of a state in an infinite horizon problem, it is necessary either to modify the problem—by subdividing events into a series of time-bounded episodes, with returns accumulating only within these trials—or to introduce a different notion of value that takes into account predictions stretching arbitrarily far into the future. The most common return that accommodates unbounded time discounts rewards exponentially in their delays; these discounted values can be summed over an infinite window without diverging:

\[ V_{\exp}(t) = E\Bigl[\sum_{\tau > t} \gamma^{\tau - t}\, r(\tau)\Bigr]. \qquad (2.2) \]
The parameter γ < 1 controls the steepness of discounting. The published TD models were all simulated using a return truncated on trial boundaries (Montague et al., 1996; Schultz et al., 1997; Suri & Schultz,
1999), except for the model of Houk et al. (1995), which included no simulations. The horizon obviated the need for discounting, and indeed most of the experiments and expositions used the undiscounted cumulative return, equation 2.1. All of the reports also mentioned that V_exp could be used to extend the models to a more realistic infinite-horizon situation, and Suri and Schultz (1999) used that return in their simulations. However, since even these investigations took place in an episodic setting, this work left unexamined how the behavior of the modeled DA system would be affected by longer-term predictions about rewards expected after the trial was complete. As we show here, the introduction of an infinite horizon gives rise to long-term predictions that would produce subtle variations in DA behavior, some previously puzzling indications of which have been observed in experiment. To demonstrate this, we examine a model based on a different return that is closely related to equation 2.2 but makes the predictions we will be examining more obvious. This return assumes a constant baseline expectation of r̄ reward per time step and measures the sum of differences between observed rewards and the baseline (Schwartz, 1993; Mahadevan, 1996; Tsitsiklis & Van Roy, 1999):

\[ V_{\mathrm{rel}}(t) = E\Bigl[\sum_{\tau > t} \bigl(r(\tau) - \bar{r}\bigr)\Bigr]. \qquad (2.3) \]
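For concreteness, the three returns can be estimated numerically from a sample reward sequence. The sketch below is ours, with the infinite sums truncated at the end of the sequence and all names and values illustrative:

```python
import numpy as np

def undiscounted_return(r, t):
    """Equation 2.1: cumulative reward after time t (finite horizon only)."""
    return float(np.sum(r[t + 1:]))

def discounted_return(r, t, gamma=0.95):
    """Equation 2.2: exponentially discounted sum of future rewards."""
    taus = np.arange(t + 1, len(r))
    return float(np.sum(gamma ** (taus - t) * r[taus]))

def relative_return(r, t, rbar):
    """Equation 2.3: future rewards measured relative to the average rate rbar."""
    return float(np.sum(r[t + 1:] - rbar))
```

For a constant reward of 0.5 per step, the undiscounted return grows without bound as the horizon recedes, the discounted return approaches 0.5γ/(1−γ), and the relative return with r̄ = 0.5 stays at zero, illustrating why only the latter two remain meaningful in an infinite-horizon setting.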
This relative value function is finite if r̄ is taken to be the long-term future average reward per time step, lim_{n→∞} (1/n) E[Σ_{τ=t}^{t+n−1} r(τ)]. This value r̄, the discrete analog of the long-term reward rate, is the same for all t under some assumptions about the structure of the MDP (Mahadevan, 1996). For horizonless problems, V_rel (rather than the standard undiscounted return V) properly extends V_exp to the undiscounted case where γ = 1. Tsitsiklis and Van Roy (2002) prove that their TD algorithm for learning V_rel represents the limit as the discounting parameter γ approaches one in a version of exponentially discounted TD. This proof assumes V_exp is estimated with a linear function approximator, using a constant bias term in the discounted algorithm but not in the average-reward one. Assuming function approximators that meet these constraints, there can thus be no in-principle difference between TD models based on lightly discounted V_exp and those based on V_rel, so approximate versions of the DA neuron behaviors we predict in section 3 are also expected under the exponentially discounted return in an infinite-horizon setting. We use V_rel throughout this article to make it easier to examine the role of long-term predictions, since it segregates one component of such predictions in the r̄ term. We return to the details of the relationship between the two models in the discussion.

2.2 Specifying a TD Model of Dopamine. In this article, we analyze the DA model of Montague et al. (1996), modified to use the TD rule of
Tsitsiklis and Van Roy (1999) to learn an estimate V̂_rel of V_rel instead of V. The TD algorithm defines an error signal by which a function approximator can be trained to learn the value function. The model uses a linear function approximator, V̂_rel(t) = w(t) · s(t), where w is a trainable weight vector and s is a state vector. At each time step, weights are updated according to the delta rule,

\[ w(t+1) = w(t) + \nu \cdot s(t) \cdot \delta_{\mathrm{rel}}(t), \]
where ν is a learning rate and δ_rel is the average-reward TD error:

\[ \delta_{\mathrm{rel}}(t) = \hat{V}_{\mathrm{rel}}(t+1) - \hat{V}_{\mathrm{rel}}(t) + r(t) - \bar{r}(t). \]
(2.4)
In this equation, the average reward r̄(t) is time dependent because it must be estimated on-line; we use an exponentially windowed running average with learning rate κ ≪ ν:

\[ \bar{r}(t+1) = (1 - \kappa)\, \bar{r}(t) + \kappa\, r(t). \]

Under the model, DA neurons are assumed to fire at a rate proportional to δ_rel(t) + b for some background rate b. Negative prediction error is thus taken to correspond to firing at a rate slower than b. Finally, we must define the state vector s(t). The simple conditioning experiments we consider here involve at most a single discrete stimulus, whose timing provides all available information for predicting rewards. We can thus delineate all possible states of the tasks by how long ago the stimulus was last seen. We represent this interval using a tapped delay line. Specifically, we define the state vector's ith element s_i(t) to be one if the stimulus was last seen at time t − i, and zero otherwise. This effectively reduces the linear function approximator to a table-lookup representation in the complete state space of the tasks. We do not use a constant bias term, which would be superfluous since the average-reward TD algorithm learns V_rel only up to an arbitrary additive constant. For similar reasons, in the very simple experiments we discuss where rewards or punishments are delivered randomly without any stimuli, we leave s(t) empty, which means that all predictions learned in this context reside in r̄(t) rather than w(t). (This is reasonable since reward delivery in these tasks can be viewed as controlled by a one-state Markov process whose V_rel is arbitrary.) Representation of state and timing is one of the more empirically underconstrained aspects of DA system models, and we have elsewhere suggested an alternative scheme based on semi-Markov processes (Daw, Courville, & Touretzky, 2002). We use the tapped delay line architecture here for its simplicity, generality, and consistency with previous models.
In particular, this is essentially the same stimulus representation used by Montague et al.
(1996), following the suggestion of Sutton and Barto (1990). The one important difference between the representational scheme we use here and that of Montague et al. is in the length of s(t). In their experiments, this vector was shorter than the interval between trials (which they took to be constant). The effect was that at some point during the intertrial interval, s(t) became a vector of all zeros, and so the prediction V̂(t) = w(t) · s(t) was also necessarily zero until the next stimulus occurred. This prevented predictions of rewards in subsequent trials from backing up over the intertrial interval, effectively enforcing a horizon of a single trial on the learned value function. (Backups might still have occurred had the model contained either a constant bias term in s(t) or eligibility traces in the learning rule, but it did not.) In the Montague et al. (1996) model, then, the learned value function had a finite horizon as a side effect of the state representation and learning rule being used, even though it was exposed to a continuous series of trials. In our model, we instead take s(t) to be long enough that it never empties out between trials. This requires us to randomize the interval between trials (which was done in most studies of DA neurons but not in previous models), since otherwise, stimulus delivery would itself be predictable from events in the previous trial, and a reward-predicting stimulus would not, after learning, continue to evoke prediction error. In fact, supporting our view that predictions should encompass more than a single trial, in highly overtrained monkeys performing a task with minimal variability in the intertrial interval, DA neurons ceased responding to reward-predicting stimuli (Ljungberg, Apicella, & Schultz, 1992).
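The pieces specified above (the tapped delay line state, the delta-rule weight update, the average-reward TD error of equation 2.4, and the running-average estimate of r̄) can be combined in a minimal simulation sketch. The code is ours; the function name, parameter values, and task timings are illustrative, not taken from the article:

```python
import numpy as np

def run_average_reward_td(stim, r, n_taps=30, nu=0.1, kappa=0.001):
    """Minimal sketch of the average-reward TD model described above.

    stim, r: arrays of stimulus indicators and rewards per time step.
    State is a tapped delay line: s_i(t) = 1 iff the stimulus was last
    seen i steps ago.  All parameter values are illustrative.
    """
    T = len(r)
    w = np.zeros(n_taps)        # weights; one per delay-line tap
    rbar = 0.0                  # running estimate of the average reward
    last_stim = -10**9          # time the stimulus was last seen
    delta = np.zeros(T)

    def value(t, last):
        # Linear value: w . s(t) reduces to a table lookup on the tap index.
        i = t - last
        return w[i] if 0 <= i < n_taps else 0.0

    for t in range(T - 1):
        if stim[t]:
            last_stim = t
        last_next = t + 1 if stim[t + 1] else last_stim
        # TD error (equation 2.4): delta = V(t+1) - V(t) + r(t) - rbar
        delta[t] = value(t + 1, last_next) - value(t, last_stim) + r[t] - rbar
        i = t - last_stim
        if 0 <= i < n_taps:     # delta-rule weight update
            w[i] += nu * delta[t]
        rbar = (1 - kappa) * rbar + kappa * r[t]  # running average, kappa << nu
    return delta, w, rbar
```

Trained on a schedule in which a stimulus precedes reward by a fixed delay and intertrial intervals are randomized, the phasic error at the reward time should shrink with learning while a small negative baseline of roughly −r̄ appears between trials, consistent with the patterns in Figures 1 and 3.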
3 Results
We first verify that the average-reward TD version of the DA model we examine here preserves the ability of previous models to capture the basic phasic response properties of DA neurons. Figure 1 shows the modeled DA signal (as the points of the discrete signal connected by lines) for three trials during the acquisition and extinction of a stimulus-reward association, reproducing some key qualitative properties of phasic DA neuron responses (Schultz, 1998). The results are not visibly different from those reported by Montague et al. (1996) and others, which is to be expected given the close relationship of the models. In the rest of this section, we examine how long-term reward predictions would affect DA system behavior, by finding differences between the undiscounted TD model (which is useful only in an episodic setting and was the basis of most previous investigations) and the average-reward TD model (which learns to predict an infinite-horizon return; we show later that all of the same properties are expected under an exponentially discounted TD model). The undiscounted TD error signal, used by Montague et al. (1996) to model the DA response, is δ(t) = V(t+1) − V(t) + r(t). This differs from
Figure 1: Error signal δ_rel(t) from an average reward TD model, displaying basic phasic properties of DA neurons. S: stimulus, R: reward. (a) Before learning, activation occurs in response to a reward but not to the stimulus that precedes it. (b) After learning, activation transfers to the time of the stimulus, and there is no response to the reward. (c) When a predicted reward is omitted, negative prediction error, corresponding to a pause in background firing, occurs at the expected time of the reward.
δ_rel (see equation 2.4) only in that r̄(t) is subtracted from the latter. The extra effect of learning this infinite-horizon return is to depress the TD error signal by the slowly changing long-term reward estimate r̄(t). Manifestations of this effect would be most noticeable when r̄(t) is large or changing, neither of which is true of the experiment modeled above. The average-reward TD simulations pictured in Figures 2 through 4 demonstrate the predicted behavior of the DA response in situations where the effect of r̄(t) should be manifest. Figure 2 depicts the basic effect, a decrease in the DA baseline as the experienced reward rate increases. Empirically, DA neurons respond phasically to randomly delivered, unsignaled rewards; the figure shows how the tonic inhibitory effect of r̄(t) on the error signal should reduce DA activity when such rewards are delivered
Figure 2: Tonic effects in modeled DA signal. Increasing the rate of randomly delivered, unsignaled rewards should depress baseline neuronal activity, an effect that is also visible as a lower peak response to rewards. (Left) Modeled DA response to rewards delivered at a low rate. (Right) Modeled DA response to rewards delivered at a higher rate, showing depressed activity. These snapshots are from the end of a long sequence of rewards delivered at the respective rates.
Figure 3: Tonic effects in modeled DA signal after conditioning and extinction. S: stimulus, R: reward. (a) After conditioning, the value prediction V̂_rel(t) jumps when the stimulus is received, ramps up further in anticipation of reward, and drops back to baseline when the reward is received. (b) The corresponding TD error signal—the modeled DA response—shows the classic phasic burst to the stimulus and is then zero until the reward is received. Between trials, the error signal is negative, corresponding to a predicted tonic depression in DA firing below its normal background rate. (c) After extinction by repeated presentation of stimuli alone, the tonic depression disappears (along with the phasic responses) and the error signal is zero everywhere. (d) If extinction is instead accomplished by presenting unpaired rewards and stimuli, the error signal remains tonically depressed to a negative baseline between phasic responses to the unpredicted rewards.
at a fast Poisson rate compared to a slow one. If the experienced reward rate changes suddenly, the estimate r̄(t) will slowly adapt to match it, and DA activity will smoothly adapt accordingly; we show the system's stable behavior when this adaptation asymptotes. (Note that although the prediction parameters have reached asymptotic values, the rewards themselves are unpredictable, and so they continue to evoke phasic prediction error.) In Figure 2, the inhibition resulting from a high average reward estimate r̄(t) can be seen in both the level of tonic background firing between rewards and the absolute peak magnitude of the phasic responses to rewards. Below, we discuss some evidence for the latter seen in voltammetry experiments measuring DA concentrations in the nucleus accumbens. A more complex effect emerges when randomly delivered rewards are signaled by a stimulus preceding them by a constant interval. In this case, the phasic DA response is known from experiment to transfer to the predictive stimulus. In addition, according to the average-reward TD model, the
Figure 4: Tonic effects in modeled DA signal in an aversive situation. Receipt of primary punishment (such as shock) decreases the average reward rate estimate, producing a small but long-lasting elevation in modeled DA activity. (Left) The reward rate estimate r̄(t) responding to three punishments. (Right) The corresponding TD error signal—the modeled DA response—shows a phasic depression followed by a prolonged excitation.
inhibitory effect of r̄(t) should manifest itself as before in a sustained negative prediction error (i.e., a reduced baseline firing rate) between trials (see Figure 3b). During the fixed interval between stimulus onset and reward, the learning algorithm can eliminate this prediction error by smoothly ramping up V̂_rel(t) (see Figure 3a), so that V̂_rel(t+1) − V̂_rel(t) exactly cancels the −r̄(t) term in equation 2.4 (see Figure 3b). But outside the stimulus-reward interval, the algorithm cannot predict events, since the interval between trials is random, so it cannot similarly compensate for the negative prediction error due to r̄(t). Hence, r̄(t) drives δ_rel(t) slightly negative between trials. The predicted pattern of results—a jump in the DA baseline on stimulus presentation, followed by a drop when the reward is received—would be noticeable only when the rate of reward delivery is high enough to make its inhibitory effect between trials visible. For instance, it cannot be seen in Figure 1b, which was produced using the same task but at a much lower reward rate. This may be why, to our knowledge, it has not so far been observed in experiments. Another experimental situation involving a change of reward rate is the extinction of a stimulus-reward association. Previous TD models suggest that when a stimulus-reward association is extinguished by repeatedly presenting the stimulus but omitting the reward, the phasic DA response to the formerly predictive stimulus should wane and disappear. The concomitant reduction of the experienced reward rate means that under the average-reward model, the inhibition of the DA baseline by r̄(t) should also extinguish, slowly increasing the baseline firing rate between trials until it matches that within trials (see Figure 3c).
Were extinction instead accomplished while preserving the reward rate—by presenting both stimuli and rewards, but unpaired—the reinforcers, now unpredictable, would trigger phasic excitation, but the tonic inhibitory effect of r̄(t) would remain (see Figure 3d).
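The contrast between the two extinction regimes can be sketched numerically. Between trials the value prediction is flat, so equation 2.4 reduces to δ_rel(t) ≈ r(t) − r̄(t), and the tonic baseline is just −r̄. The sketch below is ours, with all constants (including the pre-extinction rate) illustrative:

```python
def tonic_baseline(reward_rate, steps=5000, kappa=0.01, rbar0=0.04):
    """Tonic TD error between trials after extinction begins.

    rbar0 is a hypothetical average reward learned before extinction.
    With rewards omitted (reward_rate = 0), rbar decays and the baseline
    returns to zero; with unpaired rewards at the old rate, the tonic
    depression persists.
    """
    rbar = rbar0
    for t in range(steps):
        r = 1.0 if reward_rate > 0 and t % round(1 / reward_rate) == 0 else 0.0
        rbar = (1 - kappa) * rbar + kappa * r  # running-average estimate
    return -rbar  # between-trial delta_rel, given flat value predictions
```

Here tonic_baseline(0.0) relaxes to roughly zero, as in Figure 3c, while tonic_baseline(1/25) remains tonically negative near −r̄, as in Figure 3d.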
A final prediction about DA activity in an infinite-horizon model emerges if we treat aversive stimuli such as shock as equivalent to negative reward. This is an oversimplification, but some behavioral evidence that it is nonetheless reasonable comes from animal learning experiments in which a stimulus that predicts the absence of an otherwise expected reward (a conditioned inhibitor for reward) can, when subsequently presented together with a neutral stimulus and a shock, block the learning of an association between the neutral stimulus and the aversive outcome (Dickinson & Dearing, 1979; Goodman & Fowler, 1983). This is presumably because the conditioned inhibitor, predictive of a different but still aversive outcome, is assumed to account for the shock. If aversive stimuli are treated as negative rewards, their appearance will decrease the predicted reward rate r̄(t), causing a prolonged increase in the tonic DA baseline. This effect is shown in Figure 4. In the simulations, phasic negative error is also seen to the negative rewards. As we discuss below, there is some experimental evidence for both sorts of responses.

4 Discussion
Previous models of the DA system were exercised in a fairly stylized setting, where events were partitioned into a set of independent trials and the system learned to predict rewards only within a trial. Though it is computationally straightforward to extend these models to a more realistic setting with longer-term predictions and no such partitions, and indeed the original articles all suggested how this could be done, no previous work has examined how such an improvement to the model would affect its predictions about the behavior of the DA system. By using a version of the TD algorithm that explicitly segregates long-term average reward predictions, we show here that such an extension suggests several new predictions about slow timescale changes in DA activity. Some of these predictions have never been tested, while others explain existing data that had been puzzling under older trial-based models. In our model, the long-term reward predictions that are learned if returns are allowed to accumulate across trial boundaries would manifest themselves largely as slow changes in the tonic DA baseline. Though little attention has been paid to tonic firing rates in the electrophysiological studies that the DA models address—and so the predictions we suggest here have for the most part not been addressed in this literature—a related experimental technique may offer some clues. Microdialysis studies examine chemical evidence of DA activity in samples taken very slowly, on the order of once per 10 minutes. It is likely that at least part of what these experiments measure is tonic rather than phasic activity. A key mystery from the microdialysis literature that has been used to criticize TD models of DA is evidence of increased DA activity in target brain areas, such as the striatum, in response to aversive events such as foot
Long-Term Reward Prediction in TD Models
shocks (reviewed by Horvitz, 2000). Were this activity phasic, it would contradict the notion of DA as an error signal for the prediction of reward. But we show here that the slower tonic component of the modeled DA signal should indeed increase in the face of aversive stimuli, because they should reduce the average reward prediction r̄ (see Figure 4). This may explain the paradox of an excitatory response to aversive events in a putatively appetitive signal. In the model as presented here, to be fair, the small and slow positive error would on average be offset by a larger but shorter duration burst of negative error, time-locked to the aversive stimulus. However, what microdialysis actually measured in such a situation would depend on such vagaries as the nonlinearities of transmitter release and reuptake. Also, instead of being fully signaled by DA pauses, phasic information about aversive events may partially be carried by excitation in a parallel, opponent neural system (Kakade, Daw, & Dayan, 2000; Daw, Kakade, & Dayan, in press). All of these ideas about the nature of the dopaminergic response to aversive stimuli would ideally be tested with unit recordings of DA neurons in aversive paradigms; the very sparse and problematic set of such recordings so far available reveals some DA neurons with excitatory and some with inhibitory responses (Schultz & Romo, 1987; Mirenowicz & Schultz, 1996; Guarraci & Kapp, 1999). In accord with our explanation, there is some evidence that the excitatory responses tend to last longer (Schultz & Romo, 1987). Another set of experiments used a much faster voltammetry technique to measure DA release in the striatum with subsecond resolution, in response to bursts of rewarding intracranial stimulation (Kilpatrick et al., 2000).
In these experiments, no phasic DA release is measured when stimulation is triggered by the animal's own lever presses, but DA responses are seen to (presumably unpredictable) stimulation bouts delivered on the schedule of lever presses made previously by another animal. In this case, transmitter release habituates over time, gradually reducing the magnitude of the phasic DA responses to stimulations over the course of a train lasting about a minute. This is essentially a dynamic version of the experiment shown in Figure 2, where a sudden increase in the delivered reward rate gives rise to a gradual increase in r̄(t) and a corresponding decrease in modeled DA activity. Under a TD model involving only short-term predictions, it would be difficult to find any computational account for this habituation, but it is roughly consistent with what we predict from the average-reward model. However, the effect seen in these experiments seems to be mainly identifiable with a decline in the peak phasic response to the rewards (which is only part of what is predicted in Figure 2); the corresponding decline in tonic baseline DA levels, if any, is exceedingly small. This may also have to do with the nonlinearity of transmitter release, or with a floor effect in the measurements or in extracellular DA concentrations. The interpretation of this experiment points to an important and totally unresolved issue of timescale in the model presented here. Voltammetry experiments reveal information about DA activity on an intermediate timescale—slower than electrophysiology but faster than microdialysis. Voltammetry is inappropriate for monitoring the kind of long-term changes in basal DA concentrations measured by microdialysis; perhaps for this reason, the two methodologies give rather different pictures of DA activity during intracranial stimulation, with microdialysis suggesting a very long-lasting excitation (Kilpatrick et al., 2000; Fiorino, Coury, Fibiger, & Phillips, 1993). We have argued that understanding DA responses to aversive stimuli requires distinguishing between phasic and tonic timescales, but in reality there are surely many timescales important to the DA system and at which the system might show distinct behaviors. The model we have presented here is too simple to cope with such a proliferation of timescales; it works by contrasting predictions learned at only two fixed timescales: quickly adapting phasic reward predictions that vary from state to state, against a slowly changing average reward prediction that is the same everywhere. The speed at which the average reward estimate r̄(t) adapts controls the timescale of tonic effects in the model; it is unclear to exactly what real-world timescale this should correspond. A more realistic model might learn not just a single "long-term" average reward prediction, but instead predictions of future events across a spectrum of different timescales. We speculate that in such a model, the opponent interactions between fast and slow predictions that we study in this article could survive as competition to account for the same observations between predictions with many different characteristic timescales. The rigidity of timescale in our model (and other TD models) also makes it an unsatisfying account of animal learning behavior, since animals seem much more flexible about timescales.
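The single tonic timescale can be made concrete with the usual exponential moving-average estimator of the reward rate (a sketch; the name sigma and the constants are ours, not the authors'):

```python
def update_rbar(rbar, r, sigma=0.001):
    """One step of an exponential moving-average reward-rate estimate:
    sigma fixes the single tonic timescale (time constant ~ 1/sigma steps)."""
    return (1.0 - sigma) * rbar + sigma * r

# A step change in the delivered reward rate is tracked only over roughly
# 1/sigma steps, so tonic effects unfold on that one fixed timescale.
rbar = 0.0
for _ in range(5000):
    rbar = update_rbar(rbar, 1.0)   # reward rate jumps from 0 to 1
```

A multiscale variant of the kind speculated about above would simply maintain several such estimates with different values of sigma.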
Not only can they cope with events happening on timescales ranging from seconds to days, but their learning processes, as exhibited behaviorally, are often invariant to a very wide range of dilations or contractions of the speed of events (Gallistel & Gibbon, 2000). These considerations again suggest a multiscale model. The idea that animals are learning to predict a horizonless return rather than just the rewards in the current trial has also appeared in the behavioral literature. Kacelnik (1997) uses this idea to explain animals' choices on experiments designed to measure their discounting of delayed rewards (Mazur, 1987). Under Kacelnik's model, however, it is necessary to assume that animals neglect the intervals between trials in computing their long-term expectations, an assumption that would be difficult to incorporate convincingly in a TD model like the one presented here. Results like this suggest that the notion of a trial is not as behaviorally meaningless as we treat it here. More work on this point is needed. We have presented all of our results in terms of the average reward TD algorithm, because it is easiest to see how long- and short-term predictions interact in this context. However, with some caveats, all of these effects would also be seen in the more common exponentially discounted formulation of TD learning, which was used for at least one previously published DA model (Suri & Schultz, 1999). This will be true if the algorithm is used in an infinite horizon setting and with discounting light enough to allow rewards expected in future trials to contribute materially to value estimates. We briefly sketch here why this holds asymptotically (i.e., when the value functions are well learned, so that V̂ = V) using a table representation for the value function; Tsitsiklis and Van Roy (2002) give a more sophisticated proof that the algorithms are equivalent on an update-by-update basis in the limit as γ → 1, assuming linear value function approximation and that the state vector in the discounted case contains a constant bias element whose magnitude is chosen so that its learned weight plays a role similar to that of r̄ in the average reward model. The key point is that with an infinite horizon, V_exp grows as the reward rate increases, since it sums all discounted future rewards; a portion of the value of each state thus encodes the long-term future reward rate, weighted by the degree of discounting. In particular, the exponentially discounted value function V_exp can be viewed as the sum of a state-dependent term that approximates V_rel (better as γ increases) and a state-independent baseline magnitude r̄/(1 − γ) that encodes the long-term future reward rate expected everywhere. All of the DA behaviors we consider depend on r̄ being subtracted in the error signal δ_rel(t). Though this term seems to be missing from the exponentially discounted error signal δ_exp(t) = γV̂_exp(t + 1) − V̂_exp(t) + r(t), it is actually implicit: the state-independent portions of V̂_exp(t) and V̂_exp(t + 1) are equal to r̄/(1 − γ), so the state-independent portion of γV̂_exp(t + 1) − V̂_exp(t) in the error signal is equal to (γ − 1) · r̄/(1 − γ), which is −r̄. This is easiest to see when rewards are Poisson with rate r, as in Figure 2.
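This limiting case is easy to check numerically with a one-state tabular discounted TD simulation (our own sketch; gamma, p, and the learning rate are illustrative choices, not the authors' parameters):

```python
import random

def run_td(gamma=0.95, p=0.3, alpha=0.01, steps=200_000, seed=0):
    """Tabular discounted TD(0) with a single state and Bernoulli rewards
    delivered at rate p. With an infinite horizon, the learned value should
    approach p / (1 - gamma), so the TD error approaches r(t) - p: reward
    minus the average reward rate, just as in the average reward model."""
    rng = random.Random(seed)
    V = 0.0
    for _ in range(steps):
        r = 1.0 if rng.random() < p else 0.0
        V += alpha * (r + gamma * V - V)   # same (single) successor state
    return V

gamma, p = 0.95, 0.3
V = run_td()
print(V, p / (1 - gamma))            # learned value vs. analytic p/(1 - gamma)
print(1.0 + gamma * V - V, 1.0 - p)  # error on a reward step ~ r(t) - rbar
print(gamma * V - V, -p)             # error on an empty step ~ -rbar
```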
In this case, V_exp(t) is exactly r̄/(1 − γ) for all t. The exponentially discounted TD error signal is

δ_exp(t) = r(t) + γV̂_exp(t + 1) − V̂_exp(t)
         = r(t) + (γ − 1)V̂_exp(t)
         = r(t) − r̄,
just as in the average reward case. While this example holds for any γ, in more structured state-spaces, the value of γ can matter. For the predicted DA behaviors described in this article, the effect of steep discounting would be to modulate the tonic inhibition depending on proximity to reward; the correspondence between discounted and average reward models improves smoothly as γ increases. In summary, the introduction of an infinite horizon increases the value of V_exp at all states, leading to the subtraction of approximately r̄ from the error signal and to just the sort of tonic behaviors we have discussed in this article. Note that the equivalence between the models depends on the exponentially discounted model having a sufficient state representation; as we have discussed, the model of Montague et al. (1996) effectively had a finite horizon due to gaps in its state representation and the lack of a bias term. One significant model of DA neurons that is physiologically more detailed than the TD models is that of Brown, Bullock, and Grossberg (1999). Though this is an implementational model and not grounded in reinforcement learning theory, it actually has much the same computational structure as a TD model. In particular, the DA response is modeled as the sum of a phasic primary reward signal like r(t) and the positively rectified time derivative of a striatal reward anticipation signal that, like V̂(t), shows sustained elevation between the occurrence of a stimulus and the reward it predicts. The chief structural elaboration of this model over the TD models is that it separates into a distinct pathway the transient inhibitory effects on DA that result, in TD models, from the negatively rectified portion of the time difference V̂(t + 1) − V̂(t). This distinction vanishes at the more algorithmic level of analysis of this article, but the Brown et al. model provides a detailed proposal for how the brain may implement TD-like computations. The average reward TD model presented here would require additional physiological substrates. In particular, the DA system would need the average reward estimate r̄(t) to compute the error signal. One candidate substrate for this signal is serotonin (5HT), a neuromodulator that seems to act as an opponent to DA. Kakade et al. (2000) and Daw et al. (in press) detail this proposal under the assumption that 5HT reports r̄(t) directly to DA target structures rather than (as we envision here) channeling this portion of the error signal through DA.
Since 5HT is known to oppose DA not only at the level of target structures but also by inhibiting DA firing directly (reviewed by Kapur & Remington, 1996), it seems likely that any effects of the average reward would be visible in the DA system as well, as we predict here. Alternatively, the average reward could be computed entirely within the DA system, through activity habituation mechanisms in DA neurons. In a habituation model, some variable internal to DA neurons, such as calcium concentration or autoreceptor activation, could track r̄(t) (by accumulating with neuronal firing) and inhibit further spiking or transmitter release. Intriguingly, if the inhibition worked at the level of transmitter release, the effects of the average reward signal predicted in this article might be visible only using methods that directly measure transmitter release rather than spiking. Such is the case for the two pieces of experimental evidence so far most consistent with our predictions: dialysis studies showing elevated DA activity in aversive situations (Horvitz, 2000) and voltammetry measurements of DA release habituating to high rates of intracranial stimulation (Kilpatrick et al., 2000). These considerations about how the DA system computes the hypothesized error signal could be tested by direct comparison of voltammetric
and electrophysiological measurements of DA activity. More generally, our modeling suggests that greater attention should be paid to DA activity over slower timescales, in electrophysiological as well as neurochemical recordings. We particularly suggest that this would be useful in experiments that manipulate the rates of delivery of reward or punishment since these, in the model, control tonic DA release.

Acknowledgments
This research was supported by NSF IRI-9720350, IIS-9978403, DGE-9987588, and an NSF Graduate Fellowship. The authors particularly thank Peter Dayan, and also Rudolf Cardinal, Barry Everitt, and Sham Kakade for many helpful discussions and comments on this work. We also thank two anonymous reviewers for copious and substantive comments that helped shape the final article.

References

Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. Journal of Neuroscience, 19(23), 10502–10511.
Daw, N. D., Courville, A. C., & Touretzky, D. S. (2002). Dopamine and inference about timing. In Proceedings of the Second International Conference on Development and Learning (pp. 271–276). Los Alamitos, CA: IEEE Computer Society.
Daw, N. D., Kakade, S., & Dayan, P. (in press). Opponent interactions between serotonin and dopamine. Neural Networks.
Dickinson, A., & Dearing, M. F. (1979). Appetitive-aversive interactions and inhibitory processes. In A. Dickinson & R. A. Boakes (Eds.), Mechanisms of learning and motivation (pp. 203–231). Hillsdale, NJ: Erlbaum.
Fiorino, D. F., Coury, A., Fibiger, H. C., & Phillips, A. G. (1993). Electrical stimulation of reward sites in the ventral tegmental area increases dopamine transmission in the nucleus accumbens of the rat. Behavioral Brain Research, 55, 131–141.
Gallistel, C. R., & Gibbon, J. (2000). Time, rate and conditioning. Psychological Review, 107(2), 289–344.
Goodman, J., & Fowler, H. (1983). Blocking and enhancement of fear conditioning by appetitive CSs. Animal Learning and Behavior, 11, 75–82.
Guarraci, F., & Kapp, B. (1999). An electrophysiological characterization of ventral tegmental area dopaminergic neurons during differential Pavlovian fear conditioning in the awake rabbit. Behavioral Brain Research, 99, 169–179.
Horvitz, J. C. (2000).
Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward events. Neuroscience, 96, 651–656. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Kacelnik, A. (1997). Normative and descriptive models of decision making: Time discounting and risk sensitivity. In G. R. Bock & G. Cardew (Eds.), Characterizing human psychological adaptations (pp. 51–70). New York: Wiley.
Kakade, S., Daw, N. D., & Dayan, P. (2000). Opponent interactions between serotonin and dopamine for classical and operant conditioning. Society for Neuroscience Abstracts, 26, 1763.
Kapur, S., & Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. American Journal of Psychiatry, 153, 466–476.
Kilpatrick, M. R., Rooney, M. B., Michael, D. J., & Wightman, R. M. (2000). Extracellular dopamine dynamics in rat caudate-putamen during experimenter-delivered and intracranial self-stimulation. Neuroscience, 96, 697–706.
Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 67, 145–163.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms and empirical results. Machine Learning, 22, 1–38.
Mazur, J. E. (1987). An adjusting procedure for studying delayed reinforcement. In M. L. Commons, J. E. Mazur, J. A. Nevin, & H. Rachlin (Eds.), Quantitative analyses of behavior (Vol. 5, pp. 55–73). Hillsdale, NJ: Erlbaum.
Mirenowicz, J., & Schultz, W. (1996). Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature, 379, 449–451.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947.
Redgrave, P., Prescott, T. J., & Gurney, K. (1999). Is the short latency dopamine burst too short to signal reinforcement error? Trends in Neurosciences, 22, 146–151.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27.
Schultz, W., Dayan, P., & Montague, P. R. (1997).
A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., & Romo, R. (1987). Responses of nigrostriatal dopamine neurons to high intensity somatosensory stimulation in the anesthetized monkey. Journal of Neurophysiology, 57, 201–217.
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning (pp. 298–305). San Mateo, CA: Morgan Kaufmann.
Suri, R. E., & Schultz, W. (1999). A neural network with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience, 91, 871–890.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 497–537). Cambridge, MA: MIT Press.
Tsitsiklis, J. N., & Van Roy, B. (1999). Average cost temporal-difference learning. Automatica, 35, 319–349.
Tsitsiklis, J. N., & Van Roy, B. (2002). On average versus discounted reward temporal-difference learning. Machine Learning, 49, 179–191.

Received June 30, 2000; accepted April 30, 2002.
LETTER
Communicated by Marian Stewart-Bartlett
Invariant Object Recognition in the Visual System with Novel Views of 3D Objects

Simon M. Stringer [email protected]
Edmund T. Rolls [email protected], web: www.cns.ox.ac.uk
Oxford University, Centre for Computational Neuroscience, Department of Experimental Psychology, Oxford OX1 3UD, England

To form view-invariant representations of objects, neurons in the inferior temporal cortex may associate together different views of an object, which tend to occur close together in time under natural viewing conditions. This can be achieved in neuronal network models of this process by using an associative learning rule with a short-term temporal memory trace. It is postulated that within a view, neurons learn representations that enable them to generalize within variations of that view. When three-dimensional (3D) objects are rotated within small angles (up to, e.g., 30 degrees), their surface features undergo geometric distortion due to the change of perspective. In this article, we show how trace learning could solve the problem of in-depth rotation-invariant object recognition by developing representations of the transforms that features undergo when they are on the surfaces of 3D objects. Moreover, we show that having learned how features on 3D objects transform geometrically as the object is rotated in depth, the network can correctly recognize novel 3D variations within a generic view of an object composed of a new combination of previously learned features. These results are demonstrated in simulations of a hierarchical network model (VisNet) of the visual system that show that it can develop representations useful for the recognition of 3D objects by forming perspective-invariant representations to allow generalization within a generic view.

1 Introduction
There is now much evidence demonstrating that over successive stages, the visual system develops neurons that respond with view, size, and position (translation) invariance to objects or faces (Desimone, 1991; Rolls, 1992, 2000; Rolls & Tovee, 1995; Tanaka, Saito, Fukada, & Moriya, 1991; Tanaka, 1996; Logothetis, Pauls & Poggio, 1995; Booth & Rolls, 1998). Rolls (1992, 2000) has proposed a biologically plausible mechanism to explain this behavior based on the following: (1) a series of hierarchical competitive networks

Neural Computation 14, 2585–2596 (2002) © 2002 Massachusetts Institute of Technology
with local graded inhibition, (2) convergent connections to each neuron from a topologically corresponding region of the preceding layer, leading to an increase in the receptive field size of cells through the visual processing areas, and (3) synaptic plasticity based on a modified Hebb-like learning rule with a temporal trace of each cell's previous activity. The idea underlying the trace learning rule is to learn from the natural statistics of real-world visual input, where, for example, the successive transformed versions of the same image tend to occur close together in time (Földiák, 1991; Rolls, 1992; Rolls & Tovee, 1995; Rolls & Deco, 2002). Rolls's hypothesis about the functional architecture and operation of the ventral visual system (the "what" pathway, in which representations of objects are formed) was tested in a model, VisNet, of ventral stream cortical visual processing, where it was found that invariant neurons did indeed develop as long as the Hebbian learning rules incorporated a trace of recent cell activity, where the trace is a form of temporal average (Wallis & Rolls, 1997). The types of invariance demonstrated included translation, size, and the view of an object. This hypothesis of how object recognition could be implemented in the brain postulates that trace rule learning helps invariant representations to form in two ways (Rolls, 1992, 2000; Rolls & Deco, 2002). The first process enables associations to be learned between different generic 3D views of an object where there are different qualitative shape descriptors. One example of different generic views is usually provided by the front and back of an object. Another example is when different surfaces come into view, and new surfaces define the viewed boundary, when most 3D objects are rotated in three dimensions. For example, a catastrophic rearrangement of the shape descriptors (Koenderink, 1990) occurs when a cup is tilted so that one can see inside it.
The second process is that within a generic view, as the object is rotated in depth, there will be no catastrophic changes in the qualitative 3D boundary shape descriptors, but instead the quantitative (metric) values of the shape descriptors alter (see further Koenderink, 1990; Biederman, 1987). For example, while the cup is being rotated within a generic view seen from somewhat below, the curvature of the cusp forming the top boundary will alter, but the qualitative shape descriptor will remain a cusp. In addition to the changes of the metric values of the object boundary shape descriptors that occur within a generic view, the surface features on a 3D object also undergo geometric transforms as it rotates in depth, as described below and illustrated in Figure 3. Trace learning could potentially help with both the between- and within-generic view processes. That is, trace learning could help to associate together qualitatively different sets of shape descriptors that occur close together in time and describe the generically different views of a cup. Trace learning could also help with the second process and learn to associate together the different quantitative (or metric) values of shape descriptors that typically occur when objects are rotated within a generic view. The main aim of this article is to show that trace learning in an appropriate architecture can indeed learn the geometric shape transforms that
are characteristic of the surface markings on objects as they rotate within a generic view and use this knowledge in invariant object recognition. We are able to address this particular issue here by using an object, a sphere, in which there are no catastrophic changes between different generic views as the object is rotated or metric changes in the defining boundary of the object as it is rotated. We also show here that once the network has learned the types of geometric transform characteristic of features on the surface of 3D objects, the network can generalize to novel views of the object when the object has been shown in only a limited number of views, and the surface markings are composed of new combinations of the previously learned features. Examples of the types of perspective transforms in the surface markings on 3D objects that are typically encountered as the objects rotate within a generic view and that we investigated in this article are shown in Figure 3. The surface markings on the sphere that consist of combinations of three features in different spatial arrangements undergo characteristic transforms as the sphere is rotated from 0 degree toward −60 degrees and +60 degrees. Each object is identified by surface markings that consist of a different spatial arrangement of the same three features (a horizontal, vertical, and diagonal line, which becomes an arc on the surface of the object). Boundary shape changes and shading and stereo cues are excluded from the stimuli, so that the invariant learning must be about the surface marking transforms. An interesting issue about the properties of feature hierarchy networks used for learning invariant representations is whether they can generalize to transforms of objects on which they have not been trained, that is, whether they assume that an initial exposure is required during learning to every transformation of the object to be recognized (Wallis & Rolls, 1997; Rolls & Deco, 2002; Riesenhuber & Poggio, 1998).
We show here that this is not the case, for such feature hierarchy models can generalize to novel within-generic views of an object. VisNet can achieve this when it is given invariant training on the features from which the new object will be composed. After the new object has been shown in some views, VisNet generalizes correctly to other views of the object. This part of the research described here builds on an earlier result of Elliffe, Rolls, and Stringer (2002) that was tested with a purely isomorphic transform, translation.

2 Methods

2.1 The VisNet Model. In this section, we provide an overview of the VisNet model. Further details may be found in Wallis and Rolls (1997), Rolls and Milward (2000), and Rolls and Deco (2002). The simulations performed here use the latest version of the VisNet model (VisNet2), with the same model parameters that Rolls and Milward (2000) used. VisNet is a four-layer feedforward network of the primate ventral visual system, with the separate layers corresponding to V2, V4, the posterior inferior temporal cortex, and the anterior inferior temporal cortex, as shown in Figure 1.

Figure 1: (Left) Stylized image of the VisNet four-layer network. Convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina. (Right) Convergence in the visual system (adapted from Rolls, 1992). V1: visual cortex area V1; TEO: posterior inferior temporal cortex; TE: inferior temporal cortex (IT).

For each layer, the connections to individual cells are derived from a topologically corresponding region of the preceding layer, with connection probabilities based on a gaussian distribution. Within each layer there is competition between neurons, which is graded rather than winner-take-all and is implemented in two stages. First, to implement lateral inhibition, the activation of neurons within a layer is convolved with a local spatial filter that operates over several pixels. Next, contrast enhancement is applied by means of a sigmoid activation function where the sigmoid threshold is adjusted to control the sparseness of the firing rates. The trace learning rule (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997) encourages neurons to develop invariant responses to input patterns that tended to occur close together in time, because these are likely to be from the same object. The rule used was
Δw_j = α ȳ^(t−1) x_j^t,    (2.1)

where the trace ȳ^t is updated according to

ȳ^t = (1 − η) y^t + η ȳ^(t−1),    (2.2)

and we have the following definitions:

x_j: jth input to the neuron
ȳ^t: Trace value of the output of the neuron at time step t
w_j: Synaptic weight between jth input and the neuron
y: Output from the neuron
α: Learning rate, annealed between unity and zero
η: Trace value; the optimal value varies with presentation sequence length

The parameter η may be set anywhere in the interval [0, 1], and for the simulations described here, η was set to 0.8. (A discussion of the good performance of this rule and its relation to other versions of trace learning rules is provided by Rolls & Milward, 2000, and Rolls & Stringer, 2001.)

2.2 Training and Test Procedure. The stimuli were designed to allow higher-order representations (of combinations of three features) to be built from lower-order feature representations (of pairs of features) (see Elliffe et al., 2002). The stimuli take the form of images of surface features on 3D rotating spheres, with each image presented to VisNet's retina being a 2D projection of the surface features of one of the spheres. (For the actual simulations described here, the surface features and their deformations were what VisNet was trained and tested with, and the remaining blank surface of each sphere was set to the same gray scale as the background.) Each stimulus is uniquely identified by two or three surface features, where the surface features are (1) vertical, (2) diagonal, and (3) horizontal arcs and where each feature may be centered at three different spatial positions, designated A, B, and C, as shown in Figure 2. The stimuli are thus defined in terms of what features are present and their precise spatial arrangement with respect to each other. We refer to the two- and three-feature stimuli as pairs and triples, respectively. Individual stimuli are denoted by three numbers that refer to the individual features present in positions A, B, and C, respectively. For example, a stimulus with positions A and C containing a vertical and diagonal bar, respectively, would be referred to as stimulus 102, where the 0 denotes no feature present in position B.
In total, there are 18 pairs (120, 130, 210, 230, 310, 320, 012, 013, 021, 023, 031, 032, 102, 103, 201, 203, 301, 302) and 6 triples (123, 132, 213, 231, 312, 321). Further image construction details are as follows, with an illustration of the images used shown in Figure 3. In the experiments presented later, the end points of the individual surface features subtend an angle of approximately 10 degrees with respect to the center of the spheres. In addition, each pair of spatial positions (A, B, C) subtends an angle of approximately 15 degrees with respect to the centers of the spheres. The 2D projections of the stimuli are scaled to be 128 × 128 pixels, and these are placed on a blank background and preprocessed by a set of input filters before the final images are presented to VisNet's 128 × 128 pixel input retina. The input filters accord with the general tuning profiles of simple cells in V1 (see Rolls & Milward, 2000, for more details).
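The pair and triple counts follow directly from the combinatorics of the encoding; a quick sketch (ours, not from the paper) enumerates them:

```python
from itertools import permutations

FEATURES = "123"  # 1 = vertical, 2 = diagonal, 3 = horizontal arc

# Triples: all orderings of the three distinct features over positions A, B, C.
triples = {"".join(p) for p in permutations(FEATURES)}

# Pairs: two distinct features in two of the three positions; 0 marks the empty slot.
pairs = set()
for a, b in permutations(FEATURES, 2):   # 6 ordered feature pairs
    for empty in range(3):               # 3 choices of empty position
        s = [a, b]
        s.insert(empty, "0")
        pairs.add("".join(s))
```

Six ordered feature pairs times three choices of empty position gives the 18 pairs listed above, and the 3! orderings of the three features give the 6 triples.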
Simon M. Stringer and Edmund T. Rolls
Figure 2: Details of the three-dimensional visual stimuli used in the rotation invariance experiments of section 3. Each stimulus is a sphere that is identified by its unique combination of two or three surface features. There are three possible types of features, which take the form of (1) a vertical arc, (2) a diagonal arc, and (3) a horizontal arc and lie on the surfaces of the spheres (a blank space is denoted by 0). These three features can be placed in any of three relative positions A, B, and C, which are themselves separated by 15 degrees of arc.
To train the network, each stimulus is presented to VisNet in a randomized sequence of five orientations with respect to VisNet's input retina, where the different orientations are obtained from successive in-depth rotations of the stimulus through 30 degrees. That is, each stimulus is presented to VisNet's retina from the following rotational views: (i) −60 degrees, (ii) −30 degrees, (iii) 0 degrees (central position, with surface features facing directly toward VisNet's retina), (iv) 30 degrees, and (v) 60 degrees. Figure 3 shows representations of the six visual stimuli with three surface features (triples) presented to VisNet during the simulations. Each row shows one of the stimuli rotated through the five different rotational views in which the stimulus is presented to VisNet. At each presentation, the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all training orientations, a new stimulus is chosen at random, and the process is repeated. The presentation of all the stimuli through all five orientations constitutes one epoch of training. In this manner, the network is trained one layer at a time, starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1 through 4 were 50, 100, 100, and 75, respectively.
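The presentation schedule can be summarized in a few lines (a sketch under our own naming, using the six triples for illustration; not the VisNet code):

```python
import random

VIEWS_DEG = [-60, -30, 0, 30, 60]          # in-depth rotations in 30-degree steps
EPOCHS_PER_LAYER = {1: 50, 2: 100, 3: 100, 4: 75}

def one_epoch(stimuli, rng, train_step):
    """One epoch: every stimulus shown in all five views, stimuli in random order."""
    order = list(stimuli)
    rng.shuffle(order)
    for stim in order:
        views = list(VIEWS_DEG)
        rng.shuffle(views)                  # randomized sequence of orientations
        for angle in views:
            train_step(stim, angle)         # activate, compute rates, update weights

rng = random.Random(0)
triples = ["123", "132", "213", "231", "312", "321"]
log = []
for _ in range(EPOCHS_PER_LAYER[1]):        # e.g., the 50 epochs used for layer 1
    one_epoch(triples, rng, lambda s, a: log.append((s, a)))
```

One epoch therefore comprises every stimulus shown in all five views, and each layer is trained for its own number of epochs before the next layer is allowed to learn.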
Figure 3: Representations of the six visual stimuli with three surface features (triples) presented to VisNet during the simulations. Each stimulus is a sphere that is identified by a unique combination of three surface features (a vertical, a diagonal, and a horizontal arc) that occur in three relative positions A, B, and C. Each row shows one of the stimuli rotated through the five different rotational views in which the stimulus is presented to VisNet. From left to right, the rotational views shown are −60 degrees, −30 degrees, 0 degrees (central position), 30 degrees, and 60 degrees.
Two measures of performance were used to assess the ability of the output layer of the network to develop neurons that are able to respond with view invariance to individual stimuli or objects (see Rolls & Milward, 2000). A single cell information measure was applied to individual cells in layer 4 and measures how much information is available from the response of a single cell about which stimulus was shown independently of view. A multiple cell information measure, the average amount of information that is obtained about which stimulus was shown from a single presentation of a stimulus from the responses of all the cells, enabled measurement of
whether, across a population of cells, information about every object in the set was provided. Procedures for calculating the multiple cell information measure are given in Rolls, Treves, and Tovee (1997) and Rolls and Milward (2000). In the experiments presented later, the multiple cell information was calculated from only a small subset of the output cells. There were five cells selected for each stimulus, and these were the five cells that gave the highest single cell information values for that stimulus.

3 Results
The aim of the first experiment was to test whether the network could learn invariant representations of the surface markings on objects seen from different within-generic views and whether the learning would generalize to new views of objects (triples) after pretraining on feature subsets (pairs). To realize this aim, the VisNet network was trained in two stages. In the first stage, the 18 feature pairs were used as input stimuli, with each stimulus being presented to VisNet's retina in sequences of five orientations as described in section 2.2. During this stage, learning was allowed to take place only in layers 1 and 2. This led to the formation of neurons in layer 2 that responded to the feature pairs with some rotation invariance. In the second stage, we used the six feature triples as stimuli, with learning allowed only in layers 3 and 4. During this second training stage, the triples were presented to VisNet's input retina in only the first four orientations, i through iv. After the two stages of training were completed, we examined whether the output layer of VisNet had formed top-layer neurons that responded invariantly to the six triples when presented in all five orientations, not just the four in which the triples had been presented during training. To provide baseline results for comparison, the results from experiment 1 were compared with results from experiment 2, which involved no training in any of layers 1 through 4, with the synaptic weights left unchanged from their initial random values. In Figure 4, we present numerical results for the two experiments described. On the left are the single cell information measures for all top (fourth) layer neurons, ranked in order of their invariance to the triples, and on the right side are multiple cell information measures. To help interpret these results, we can compute the maximum single cell information measure according to

Maximum single cell information = log2(number of triples),    (3.1)
where the number of triples is six. This gives a maximum single cell information measure of 2.6 bits for these test cases. The information results from experiment 1 shown in Figure 4 demonstrate that even with the triples presented to the network in only four of the five orientations during training, layer 4 is indeed capable of developing rotation-invariant neurons that can discriminate effectively among the six different feature triples in all five orientations, that is, with correct recognition from all five perspectives. Indeed, from the single cell information measures, it can be seen that a number of cells have reached the maximum level of performance in experiment 1. In addition, the multiple cell information for experiment 1 reaches the maximal level of 2.6 bits, indicating that the network as a whole is capable of perfect discrimination between the six triples in any of the five orientations.

Figure 4: Numerical results for experiments 1 and 2: (Left) Single cell information measures. (Right) Multiple cell information measures. There were six stimuli, each shown in five rotational views (6s 5r). In experiment 1, the first two layers were trained on feature pairs in all transforms, and layers 3 and 4 were trained on the six triple stimuli in four of the five transforms. For comparison, experiment 2 shows the performance with no training.

The finding that some single neurons showed perfect performance on all five instances of every one of the six stimuli (as shown by the single cell information analysis of Figure 4) provides evidence that the network can indeed generalize to novel deformations of stimuli when the first two layers have been trained on the component feature combinations but the top two layers have been trained on only some of the transforms of the complete stimuli. Further results from experiment 1 are presented in Figure 5, where we show the response profiles of a top-layer neuron to the six triple-feature stimuli. This neuron has achieved excellent invariant responses to the six triple-feature stimuli. The performance was perfect in that the response profiles are independent of the orientation of the sphere but differentiate between triples: the responses are maximal for triple 132 and minimal for all other triples. In particular, the cell responses are maximal for triple 132 presented in all five of the orientation transforms. The perfect performance of the neuron occurred even though the network was tested with five transforms of the stimuli, only four of which had been trained in layers 3 and 4. We performed a control experiment to show that the network really had learned invariant representations specific to the kinds of 3D deformations
Figure 5: Numerical results for experiment 1: Response profiles of a top-layer neuron to the 6 triples in all 5 orientations.
undergone by the surface features as the objects rotated in depth. The design of the control experiment was to train the network on “spheres” with nondeformed surface features and then to test whether the network failed to operate correctly when it was tested with objects with the features present in the transformed way that they appear on the surface of a real 3D object. The training set of 2D feature triples was generated from the view of each feature triple present at 0 degrees, which was 2D translated without deformation to the position on a sphere at which it would appear if the sphere had been rotated. It was found that when VisNet was first trained on undeformed triple stimuli and then tested on the true 3D triple images, performance was poor, as shown by the single and multiple cell information measures and by the fact that most layer 4 neurons did not respond to all five instances of each rotated triple stimulus. The results of this control experiment thus showed that the network had learned invariant representations specific to the kinds of 3D deformations undergone by the surface features as the objects rotated in depth in the first experiment.

4 Discussion
In this article, we were able to show first how trace learning can form neurons that respond invariantly to the perspective transforms that surface markings on 3D objects show when the object is rotated within a generic view. The invariant learning was specifically about surface markings, in that boundary curvature changes were excluded by the use of spherical objects. Thus, VisNet can learn how the surface features on 3D objects transform as the object is rotated in depth and can use knowledge of the characteristics of these transforms to perform 3D object recognition.
Second, the results show that this could occur for a novel view of an object that was not an interpolation from previously seen views. This was possible because the low-order feature combination sets from which an object was composed had previously been learned in the early layers of VisNet. That is, it was shown that invariant learning on every possible transform of the triple-feature stimuli is not necessary. Once the invariance has been trained for the feature pairs, just a few presentations of the triples (needed to train the weights from the intermediate to the higher layers) are required, and the invariance is present not because of any invariant learning about the triples but because of invariant learning about the feature pairs. The approach inherent in a feature hierarchy system such as that exemplified by VisNet is that the representation of an object at an upper level is produced by a combination of the firing of feature-sensitive neurons at lower levels that themselves have some invariant properties (Rolls & Deco, 2002). The concept is thus different from that suggested by Edelman (1999), who used existing invariant representations of different whole objects as a basis set for new whole objects. Effectively, a new whole-object neuron at the whole-object representation level in the network was trained to respond to linear combinations of the activity of the other pretrained object neurons. Provided that the new object was similar to some of the pretrained objects, the interpolation worked. The system we propose effectively learns invariant representations of low-order feature combinations at an early stage of the network. Then, at an upper level, the system is trained on a new object constituted by an entirely new combination of lower-level neurons firing. The system then generalizes invariantly to different transforms of the new object.
The system we describe is more closely related to the hierarchical nature of the cortical visual system and is also more powerful, in that the new object need not be very similar at the object level to any previously learned object. It must just be composed of similar features.

Acknowledgments
This research was supported by the Wellcome Trust and the MRC Interdisciplinary Research Centre for Cognitive Neuroscience.

References

Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2), 115–147.
Booth, M. C. A., & Rolls, E. T. (1998). View invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cerebral Cortex, 8, 510–523.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.
Edelman, S. (1999). Representation and recognition in vision. Cambridge, MA: MIT Press.
Elliffe, M. C. M., Rolls, E. T., & Stringer, S. M. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics, 86, 59–71.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Koenderink, J. J. (1990). Solid shape. Cambridge, MA: MIT Press.
Logothetis, N. K., Pauls, J., & Poggio, T. (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5, 552–563.
Riesenhuber, M., & Poggio, T. (1998). Just one view: Invariances in inferotemporal cell tuning. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 215–221). Cambridge, MA: MIT Press.
Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical areas. Philosophical Transactions of the Royal Society, London [B], 335, 11–21.
Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 205–218.
Rolls, E. T., & Deco, G. (2002). Computational neuroscience of vision. New York: Oxford University Press.
Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition and information-based performance measures. Neural Computation, 12, 2547–2572.
Rolls, E. T., & Stringer, S. M. (2001). Invariant object recognition in the visual system with error correction and temporal difference learning. Network, 12, 111–129.
Rolls, E. T., & Tovee, M. J. (1995). Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. Journal of Neurophysiology, 73, 713–726.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997).
The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research, 114, 177–185.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139.
Tanaka, K., Saito, H., Fukada, Y., & Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66, 170–189.
Wallis, G., & Rolls, E. T. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Received December 6, 2000; accepted April 26, 2002.
LETTER
Communicated by Mike Arnold
Dynamical Working Memory and Timed Responses: The Role of Reverberating Loops in the Olivo-Cerebellar System

Werner M. Kistler
[email protected]
Chris I. De Zeeuw
[email protected]
Department of Neuroscience, Erasmus University Rotterdam, The Netherlands

This article explores dynamical properties of the olivo-cerebellar system that arise from the specific wiring of inferior olive (IO), cerebellar cortex, and deep cerebellar nuclei (DCN). We show that the irregularity observed in the firing pattern of the IO neurons is not necessarily produced by noise but can instead be the result of a purely deterministic network effect. We propose that this effect can serve as a dynamical working memory or as a neuronal clock with a characteristic timescale of about 100 ms that is determined by the slow calcium dynamics of IO and DCN neurons. This concept provides a novel explanation of how the cerebellum can solve timing tasks on a timescale that is two orders of magnitude longer than the millisecond timescale usually attributed to neuronal dynamics. One of the key ingredients of our model is the observation that, due to postinhibitory rebound, DCN neurons can be driven by GABAergic (“inhibitory”) input from cerebellar Purkinje cells. Topographic projections from the DCN to the IO form a closed reverberating loop with an overall synaptic transmission delay of about 100 ms that is in resonance with the intrinsic oscillatory properties of the inferior olive. We use a simple time-discrete model based on McCulloch-Pitts neurons in order to investigate in a first step some of the fundamental properties of a network with delayed reverberating projections. The macroscopic behavior is analyzed by means of a mean-field approximation. Numerical simulations, however, show that the microscopic dynamics has a surprisingly rich structure that does not show up in a mean-field description.
We have thus performed extensive numerical experiments in order to quantify the ability of the network to serve as a dynamical working memory and its vulnerability to noise. In a second step, we develop a more realistic conductance-based network model of the inferior olive consisting of about 20 multicompartment neurons that are coupled by gap junctions and receive excitatory and inhibitory synaptic input via AMPA and GABAergic synapses. The simulations show that the results for the time-discrete model hold true in a time-continuous description.

© 2002 Massachusetts Institute of Technology. Neural Computation, 14, 2597–2626 (2002)
Figure 1: Schematic drawing of the cerebellar network. Excitatory and inhibitory neurons are shown as empty and filled circles, respectively. The following abbreviations are used: CBCX, cerebellar cortex; IO, inferior olive; MDJ, mesodiencephalic junction; DCN, deep cerebellar nuclei; PC, Purkinje cell; StC, stellate cell; BaC, basket cell; GoC, Golgi cell; GrC, granule cell; pf, parallel fiber; cf, climbing fiber; mf, mossy fiber.
1 Introduction
From a morphological, a physiological, and a functional point of view, the inferior olive (IO) is a particularly interesting structure of the mammalian brain. The neurons of the IO are densely coupled by gap junctions (Sotelo, Llinás, & Baker, 1974); excitatory and inhibitory synapses meet each other in dendritic glomeruli located at the site of the gap junctions (De Zeeuw et al., 1998); slice recordings reveal subthreshold oscillations of the membrane potential (Llinás & Yarom, 1986); and, finally, the IO is the sole source of climbing fibers that control synaptic plasticity in cerebellar Purkinje cells (Szentágothai & Rajkovits, 1959; Desclin, 1974). Cerebellar Purkinje cells (PCs) receive excitatory synaptic input from two different fiber systems: parallel fibers and climbing fibers (Ito, 1984). Each Purkinje cell receives input from up to 200,000 parallel fibers, the axons of the granule cells that relay information from the mossy fibers (see Figure 1). This massive input is contrasted by the input from a single climbing fiber that each Purkinje cell receives. While parallel fibers form only rather weak en passant synapses with PCs, climbing fibers literally climb up into the
dendritic tree of “their” PC and produce several synaptic transmission sites. The result is an extraordinarily strong synaptic coupling between climbing fiber and PC, so that a single action potential in a climbing fiber is sufficient to trigger a sodium spike, followed by a short burst of spikelets, in the corresponding PC. This so-called complex spike can easily be distinguished from “regular” simple spikes that are triggered by the parallel fiber input. The complex spike activity in the cerebellar cortex is thus a faithful image of the underlying IO activity. Since the early 1970s, a lot of information about the electrophysiology of the olivo-cerebellar system has been collected. Climbing fibers deliver irregular trains of action potentials at a rather low rate of about one or two spikes per second. Multiple electrode recordings from the cerebellar cortex show that complex spikes in nearby PCs within a common microzone can be highly synchronized (Bell & Kawasaki, 1972; Llinás & Sasaki, 1989; Lang, Sugihara, Welsh, & Llinás, 1999; Lang, 2001). It is commonly accepted that this synchrony is (at least partially) due to the dense coupling of IO neurons by gap junctions (Sotelo et al., 1974; De Zeeuw et al., 1996, 1998). Despite the irregular firing of individual IO neurons in vivo, IO neurons do have a tendency to fire rhythmically at a frequency of about 10 Hz. The intrinsic oscillatory properties of IO neurons are manifest in slice recordings, where IO neurons show pronounced subthreshold oscillations of the membrane potential (Llinás & Yarom, 1986; Lampl & Yarom, 1993; Bal & McCormick, 1997). Autocorrelograms of the complex spike activity also show an oscillatory component at about 10 Hz that becomes very pronounced in the presence of tremorgenic drugs like harmaline or blockage of GABAergic synapses with picrotoxin (Llinás & Sasaki, 1989; Lang et al., 1999; Lang, 2001; Lang, Sugihara, & Llinás, 1996).
The above findings are readily taken into account by the time window model of IO activity (Kistler & van Hemmen, 1999; Kistler, van Hemmen, & De Zeeuw, 2000). In summary, the activity of IO neurons that innervate the same microzone is characterized by narrow time windows that encompass their potential firing times. Due to the oscillatory properties of IO neurons, there is a time window every 100 ms during which neurons can fire an action potential. Since the firing frequency of individual IO neurons is only about 1 Hz, each neuron fires on average only in 1 out of 10 time windows. Consequently, the population activity shows a 10 Hz oscillation with a high degree of synchrony, whereas individual neurons are firing irregularly at a much lower rate (see Figure 2). The oscillatory component of the IO activity is not only due to intrinsic electrophysiological properties of IO neurons but can also be enhanced by the network in which the IO is embedded. Activity from the IO can reverberate back to the IO with a delay of about 100 ms, which matches the period of the intrinsic oscillation (Kistler & van Hemmen, 1999; Kistler et al., 2000). In short, complex spikes in cerebellar Purkinje cells are often followed by a pause in their simple spike activity (Armstrong, Cogdell, &
Figure 2: The time window hypothesis of inferior-olivary activity. IO neurons tend to fire action potentials during narrow time windows every 100 ms. Since the firing frequency of individual IO neurons is much lower than 10 Hz, they fire in only one out of about 10 windows. Spikes within each time window are highly synchronized.
Harvey, 1975, 1979; Loewenstein, Yarom, & Sompolinsky, 2000). During this pause, neurons from the deep cerebellar nuclei recover from the inhibition produced by the Purkinje cells and respond after a delay of about 100 ms with a short burst of rebound spikes (Jahnsen, 1986; Llinás & Mühlethaler, 1988; Ruigrok, 1997; Aizenman, Manis, & Linden, 1998; Aizenman & Linden, 1999; Czubayko, Sultan, Thier, & Schwarz, 2001). The rebound activity is conveyed by topographic projections from the DCN either directly or via the mesodiencephalic junction back to the IO (De Zeeuw & Ruigrok, 1994; De Zeeuw et al., 1998).1 (See Figure 1.)

The aim of this article is to investigate the role of the resonant feedback the IO receives via delayed reverberating loops. In section 2, a simple mathematical model of networks with delayed reverberating projections is developed that gives some interesting insights into network dynamics and the role of divergence and convergence within the network. It turns out that the network can produce complex sequences of ever-differing firing patterns that look random but are completely determined by the wiring of the reverberating loop. Macroscopically, the dynamics is (in most cases) simply attracted by a fixed point for the population activity. Microscopically, however, the system exhibits limit cycle behavior with numerous extremely large attractors, a behavior that has been studied to some extent in the context of spin-glass models (Kirkpatrick & Sherrington, 1978; Nützel, 1991) and attractor neural network models (Nützel, Kien, Bauer, Altman, & Krey, 1994). We demonstrate by means of numerical experiments that even in the presence of noisy synaptic transmission failures, the microscopic dynamics is still partially determined by the topology of the reverberating loop. This observation, which goes beyond the more straightforward conclusion that reverberating loops can result in sustained network activity (Boylls, 1978; Crick, 1994; Billock, 1997), has interesting functional implications for working memory and timing tasks. In section 3 we develop a conductance-based network model for a subpopulation of IO neurons. This model is used to validate the intrinsic assumptions of the time-discrete model, in particular, assumptions regarding synchrony and rhythmicity. The model also provides insight into the role of gap junctions in synchronization and spike timing. In the last section, we discuss functional implications and potential applications for working memory and timing tasks.

1 Note that the flocculus, which projects to the vestibular nuclei instead of the DCN, is not part of this reverberating loop (De Zeeuw, Wylie, DiGiorgi, & Simpson, 1994). With cerebellum or cerebellar, we are thus referring primarily to the neocerebellum.

2 Networks with Delayed Reverberating Projections
In this section, we investigate the dynamical properties of neuronal networks that are part of a delayed reverberating loop. We assume that the delayed feedback is in resonance with the oscillation of the population activity and that the neurons stay synchronized, firing only during narrow time windows every T milliseconds. We furthermore assume that the set of neurons that is active in each cycle depends only on their current synaptic input due to the reverberating loop and thus only on the activity of the previous cycle. We employ a time-discrete description based on McCulloch-Pitts neurons. Each time step corresponds to one cycle of length T. The wiring of the reverberating loop is represented by a random coupling matrix. The statistical properties of the coupling matrix reflect the level of divergence and convergence within the reverberating network. We introduce three models that differ slightly with respect to their treatment of inhibitory projections. In a first step, we derive mean-field equations for each of these models and discuss their macroscopic behavior. In a second step, we look more closely into the microscopic dynamics. It will turn out that subtle changes in the density of excitatory and inhibitory projections can have dramatic effects on the microscopic dynamics that do not show up in a mean-field description. Similar models have been studied in various contexts by Kirkpatrick and Sherrington (1978), Derrida, Gardner, and Zippelius (1987), Crisanti and Sompolinsky (1988), Nützel (1991), and Kree and Zippelius (1991), to mention only a few.

2.1 Mean-Field Dynamics.
2.1.1 Excitatory Projections Only. We consider a population of N McCulloch-Pitts neurons (McCulloch & Pitts, 1943) that is described by a state vector S ∈ {0, 1}^N. In each time step t_n, any given neuron i is either active
[S_i(t_n) = 1] or inactive [S_i(t_n) = 0]. Due to the reverberating loop, neurons receive (excitatory) synaptic input h that depends on the wiring of the loop, described by a coupling matrix J, and on the activity during the previous cycle, that is,

h_i(t_n) = ∑_{j=1}^{N} J_{ij} S_j(t_{n−1}).    (2.1)
Since the wiring of the reverberating loop at the neuronal level is unknown, we adopt a random coupling matrix with binary entries. More precisely, we take all entries J_{ij} to be identically and independently distributed (i.i.d.) with

prob{J_{ij} = 1} = λ/N.
We thus neglect possible differences in the synaptic coupling strength and content ourselves with a description that accounts only for the presence or absence of a projection. In that sense, λ is the convergence and divergence ratio of the network, that is, the average number of synapses that each neuron receives from and connects to other neurons, respectively. The neurons are modeled as deterministic threshold elements. The dynamics is given by

S_i(t_n) = H[h_i(t_n) − θ],    (2.2)

with θ being the firing threshold and H the Heaviside step function, with H(x) = 1 if x ≥ 0 and H(x) = 0 for x < 0. Starting with a random initial firing pattern, S_i(t_0) ∈ {0, 1} i.i.d. with

prob{S_i(t_0) = 1} = a_0,
we can easily calculate the expectation value of the activity a_1 = N^{−1} ∑_{i=1}^{N} S_i(t_1) in the next time step. According to equation 2.2, a neuron is active if it receives input from at least θ neurons that have been active during the last cycle. The initial firing pattern S(t_0) and the coupling matrix J are independent, so the synaptic input h in equation 2.1 follows a binomial distribution. The probability a_1 of any given neuron to be active in the next cycle is thus

a_1 = ∑_{k=θ}^{N} \binom{N}{k} (a_0 λ N^{−1})^k (1 − a_0 λ N^{−1})^{N−k}.    (2.3)
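A quick numerical sanity check of equations 2.1 through 2.3 (our own sketch with illustrative parameter values): simulate one update of a random network and compare the empirical activity with the binomial prediction.

```python
import math
import random

rng = random.Random(1)
N, lam, theta, a0 = 2000, 5.0, 2, 0.3   # illustrative values, not from the paper
p = a0 * lam / N                         # prob. that a given input line is 1 and active

# Adjacency lists: each entry J_ij = 1 independently with probability lam / N.
inputs = [[j for j in range(N) if rng.random() < lam / N] for _ in range(N)]

# Random initial pattern with prob{S_i(t0) = 1} = a0.
S0 = [1 if rng.random() < a0 else 0 for _ in range(N)]

# One step of S_i = H[h_i - theta], with h_i = sum_j J_ij S_j  (eqs. 2.1-2.2).
S1 = [1 if sum(S0[j] for j in nbrs) >= theta else 0 for nbrs in inputs]
empirical_a1 = sum(S1) / N

# Binomial prediction of eq. 2.3, evaluated via the complement (cf. eq. 2.4)
# to avoid summing thousands of vanishing terms.
predicted_a1 = 1.0 - sum(math.comb(N, k) * p**k * (1 - p) ** (N - k)
                         for k in range(theta))
```

With these parameters the binomial prediction is about 0.44, and the empirical one-step activity of the simulated network lands close to it, as equation 2.3 requires for independent S(t_0) and J.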
This equation gives the network activity a_1 as a function of the activity a_0 in the previous cycle. It is tempting to generalize this expression so as to relate the activity a_n in cycle n recursively to the activity in cycle n − 1,

a_{n+1} = 1 − ∑_{k=0}^{θ−1} \binom{N}{k} (a_n λ N^{−1})^k (1 − a_n λ N^{−1})^{N−k}.    (2.4)
Unfortunately, this is in general not possible because the activity pattern S in cycle n ¸ 1 and the coupling matrix J are no longer independent, and correlations in the ring patterns can occur. For sparse networks with l ¿ N, however, these correlations can be neglected, and equation 2.4 can be used as an approximation (see Kree & Zippelius, 1991, for a precise denition of “l ¿ N”). Fortunately, the case with l ¿ N is the more interesting one anyway, because otherwise, a1 is a steep sigmoidal function of a0 , and the network activity either saturates (a1 ¼ 1) or dies out (a1 ¼ 0) after only one iteration. Furthermore, l ¿ N is a realistic assumption for the reverberating loops of the olivo-cerebellar system. In the following, we thus assume that l ¿ N so that the network activity is given by equation 2.4, or, if we approximate the binomial distribution by the corresponding Poisson distribution, by the recursion
    a_{n+1} = 1 - \sum_{k=0}^{\theta-1} \frac{(a_n \lambda)^k}{k!}\, e^{-a_n \lambda}.    (2.5)
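The recursion 2.5 is straightforward to iterate numerically. A minimal sketch (pure Python; the three parameter pairs are those used in Figure 3):

```python
import math

def meanfield_step(a, lam, theta):
    """One iteration of equation 2.5: new activity = 1 minus the
    probability that a Poisson(a*lambda) input stays below theta."""
    below = sum((a * lam) ** k / math.factorial(k) for k in range(theta))
    return 1.0 - below * math.exp(-a * lam)

results = {}
for lam, theta in [(2, 1), (3, 2), (8, 4)]:   # regimes of Figures 3A-C
    a = 0.5
    for _ in range(100):
        a = meanfield_step(a, lam, theta)
    results[(lam, theta)] = a
```

Starting from a_0 = 0.5, the three runs approach the fixed points a ≈ 0.8, a = 0, and a ≈ 0.95, in line with the three regimes discussed below.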
The dynamics of the population activity is completely characterized by the mean-field equation 2.5. For instance, it can easily be shown that a_n = 0 is a stable fixed point except if θ = 1 and λ > 1. Furthermore, a_n is a monotonically increasing function of a_{n−1}. Therefore, no macroscopic oscillations can be expected. In summary, three different regimes can be discerned (see Figure 3). First, for θ = 1 and λ > 1, there is a stable fixed point at high levels of a_n; the fixed point at a_n = 0 is unstable. Second, if the firing threshold θ is large compared to the convergence λ, only a_n = 0 is stable. Finally, if θ > 1 and λ are sufficiently large, bistability of a_n = 0 and a_n > 0 can be observed. Figure 4 summarizes the dynamical properties of equation 2.5 for different combinations of the parameters θ and λ. As can be seen from the graphs, the fixed points of a_n are always reached rather rapidly, except in the few cases where the graph of a_n(a_{n−1}) lies closely below the diagonal a_n = a_{n−1}, that is, for θ = 2 and λ = 3. In this case, the dynamics approaches the fixed point at a_n = 0 in small steps after a large number of iterations (see Figure 3B).

2.1.2 Excitatory and Inhibitory Projections. Apart from excitatory reverberating projections via cerebellar cortex, deep cerebellar nuclei, and mesodiencephalic junction, there are also inhibitory projections from the deep cerebellar nuclei directly back to the inferior olive. All these projections are topographically organized so that both excitatory and inhibitory projections come simultaneously into play. Our previous model can easily be extended so as to account for both excitatory and inhibitory projections. The wiring of the excitatory loop is
Werner M. Kistler and Chris I. De Zeeuw
Figure 3: Dynamics of a reverberating loop with purely excitatory projections. The upper row shows the mean-field approximation of the population activity a_{n+1} as a function of the activity a_n in the previous cycle; see equation 2.5. The raster diagrams in the lower row give examples of the underlying microscopic dynamics in a simulation of N = 100 neurons. The time steps where a given neuron is active are marked by black bars. (A) λ = 2, θ = 1: stable fixed point at a ≈ 0.8. (B) λ = 3, θ = 2: only a_n = 0 is stable. Note the long transient until the fixed point is reached. (C) λ = 8, θ = 4: bistability of a_n = 0 and a_n ≈ 0.95.
characterized, as before, by a random matrix J^exc_{ij} ∈ {0, 1} with

    prob{J^exc_{ij} = 1} = λ_exc / N,    i.i.d.

Similarly, the wiring of the inhibitory loop is given by a random matrix J^inh_{ij} ∈ {0, 1} with

    prob{J^inh_{ij} = 1} = λ_inh / N,    i.i.d.

The parameters λ_exc and λ_inh describe the divergence or convergence of excitatory and inhibitory projections, respectively. We assume that a neuron is activated if the difference between excitatory and inhibitory input exceeds its firing threshold θ. The dynamics is thus given by

    S_i(t_n) = H\Big[\sum_{j=1}^{N} J_{ij}^{exc} S_j(t_{n-1}) - \sum_{j=1}^{N} J_{ij}^{inh} S_j(t_{n-1}) - \theta\Big].    (2.6)
Figure 4: Overview of the mean-field behavior of a network with purely excitatory reverberating projections for various firing thresholds θ and projection densities λ. Each of the small diagrams within the table shows the activity a_n as a function of the activity a_{n−1} in the previous cycle. Axes range from 0 to 1.
As in the previous section, we can calculate the mean-field activity in cycle n+1 as a function of the activity in the previous cycle. We obtain

    a_{n+1} = \sum_{k=\theta}^{\infty} \sum_{l=0}^{k-\theta} \frac{(a_n \lambda_{exc})^k (a_n \lambda_{inh})^l}{k!\, l!}\, e^{-a_n (\lambda_{exc} + \lambda_{inh})}.    (2.7)
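Equation 2.7 can be evaluated by truncating the sums at a sufficiently large k. A sketch (the truncation bound is an assumption chosen so the neglected Poisson tail is negligible; the pair λ_exc = 6, λ_inh = 4, θ = 1 is the one used in Figure 5A):

```python
import math

def meanfield_exc_inh(a, lam_exc, lam_inh, theta, kmax=80):
    """Equation 2.7: probability that excitatory minus inhibitory
    Poisson input reaches the firing threshold theta."""
    total = 0.0
    for k in range(theta, kmax):
        for l in range(0, k - theta + 1):
            total += ((a * lam_exc) ** k / math.factorial(k)
                      * (a * lam_inh) ** l / math.factorial(l))
    return total * math.exp(-a * (lam_exc + lam_inh))

a = 0.5
for _ in range(100):
    a = meanfield_exc_inh(a, 6.0, 4.0, 1)
```

The iteration settles near the stable fixed point a_n ≈ 0.6 reported in Figure 5A.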
The mean-field approximation is valid for sparse networks, that is, if λ_exc ≪ N and λ_inh ≪ N. As compared to the situation with purely excitatory feedback, equation 2.7 does not produce new modes of behavior. The only difference is that a_{n+1} is no longer a monotonic function of a_n. The slope da_{n+1}/da_n, however, cannot take values less than −1, so that macroscopic oscillations can be excluded (see Figures 5 and 6).

2.1.3 Shunting Inhibition. In the previous section, we implicitly assumed that excitatory and inhibitory synapses are equally strong and that excitatory and inhibitory input interact linearly. In reality, however, the reversal
Figure 5: Dynamics of a reverberating loop with excitatory and inhibitory projections (similar plots as in Figure 3; see equations 2.6 and 2.7). (A) λ_exc = 6, λ_inh = 4, θ = 1: stable fixed point at a_n ≈ 0.6. (B) λ_exc = 4, λ_inh = 10, θ = 1: stable fixed point at a_n ≈ 0.15. (C) λ_exc = 10, λ_inh = 4, θ = 3: bistability between a_n = 0 and a_n ≈ 0.73. Note the high level of irregularity in the raster diagrams. Although the mean-field dynamics is characterized by a simple fixed point, the corresponding limit cycle of the microscopic dynamics can have an extremely long period.
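The long microscopic limit cycles behind such a mean-field fixed point can be measured directly by iterating equation 2.6 and hashing the visited states. A sketch (network size and projection densities are illustrative; with N = 16 the state space is small enough that a repeat is guaranteed within 2^16 steps):

```python
import numpy as np

def attractor_period(J_exc, J_inh, S0, theta, max_steps=2**16 + 1):
    """Iterate the subtractive-inhibition dynamics (equation 2.6)
    until a state repeats; return the length of the limit cycle."""
    seen = {}
    S = S0.copy()
    for n in range(max_steps):
        key = S.tobytes()
        if key in seen:
            return n - seen[key]          # cycle length
        seen[key] = n
        S = ((J_exc @ S - J_inh @ S) >= theta).astype(int)
    return None

rng = np.random.default_rng(2)
N, lam_exc, lam_inh, theta = 16, 3.0, 2.0, 1   # illustrative values
J_exc = (rng.random((N, N)) < lam_exc / N).astype(int)
J_inh = (rng.random((N, N)) < lam_inh / N).astype(int)
S0 = (rng.random(N) < 0.5).astype(int)
period = attractor_period(J_exc, J_inh, S0, theta)
```

Averaging `period` over many realizations of the coupling matrices reproduces the kind of statistics shown in Figure 10 below.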
potential of inhibitory synapses is usually close to the resting potential of the neuron. The effect of activating an inhibitory synapse is therefore rather an increase of the membrane conductance than a lowering of the membrane potential. In addition, inhibitory synapses are often located directly at the soma, so that the activity of the neuron is virtually shunted by inhibitory input. We can easily modify equation 2.6 so as to mimic the different mechanisms of excitation and inhibition. Instead of using a subtractive interaction, we now write

    S_i(t_n) = H\Big[\sum_{j=1}^{N} J_{ij}^{exc} S_j(t_{n-1}) - \theta\Big]\, H\Big[-\sum_{j=1}^{N} J_{ij}^{inh} S_j(t_{n-1})\Big].    (2.8)
Thus, a neuron is active in cycle n if it receives at least θ action potentials via excitatory synapses and no inhibitory input. A single inhibitory presynaptic neuron is sufficient to shunt the neuron.
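In vectorized form, the shunting update of equation 2.8 reads as follows (a sketch with NumPy; the parameter values mirror Figure 7A):

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam_exc, lam_inh, theta = 100, 6.0, 3.0, 1   # as in Figure 7A

J_exc = (rng.random((N, N)) < lam_exc / N).astype(int)
J_inh = (rng.random((N, N)) < lam_inh / N).astype(int)

def step_shunting(S):
    """Equation 2.8: a neuron fires on >= theta excitatory spikes,
    but a single inhibitory spike shunts it completely."""
    return ((J_exc @ S >= theta) & (J_inh @ S == 0)).astype(int)

S = (rng.random(N) < 0.5).astype(int)
for _ in range(20):
    S = step_shunting(S)
```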
Figure 6: Overview of the mean-field behavior of a network with both excitatory and inhibitory reverberating projections. (A) Similar plots as in Figure 4 for θ = 1 and various projection densities of excitatory (λ_exc) and inhibitory (λ_inh) projections. (B) Same as in A but θ = 2. (C) θ = 4. Axes range from 0 to 1.
Figure 7: Dynamics of a reverberating loop with shunting inhibition (similar plots as in Figure 3; see equations 2.8 and 2.9). (A) λ_exc = 6, λ_inh = 3, θ = 1: fixed point at a ≈ 0.35. (B) λ_exc = 15, λ_inh = 5, θ = 1: macroscopic oscillation of the population activity. The oscillations are also clearly visible in the simulation shown in the raster diagram. (C) λ_exc = 25, λ_inh = 10, θ = 1: chaotic behavior. Note that the activity in the simulation suddenly dies out at n = 20, which is due to the finite size of the network.
Since the excitatory and inhibitory coupling matrices are independent, the mean-field activity is given by

    a_{n+1} = \Big[1 - \sum_{k=0}^{\theta-1} \frac{(a_n \lambda_{exc})^k}{k!}\, e^{-a_n \lambda_{exc}}\Big]\, e^{-a_n \lambda_{inh}}.    (2.9)
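The oscillatory regime of equation 2.9 can be reproduced in a few lines. A sketch (the parameters λ_exc = 15, λ_inh = 5, θ = 1 are those of the oscillating case in Figure 7B):

```python
import math

def meanfield_shunting(a, lam_exc, lam_inh, theta):
    """Equation 2.9: excitatory Poisson tail probability times the
    probability of receiving no inhibitory input at all."""
    below = sum((a * lam_exc) ** k / math.factorial(k) for k in range(theta))
    tail = 1.0 - below * math.exp(-a * lam_exc)
    return tail * math.exp(-a * lam_inh)

a = 0.5
trajectory = []
for _ in range(300):
    a = meanfield_shunting(a, 15.0, 5.0, 1)
    trajectory.append(a)
last_iterates = trajectory[-20:]    # after the transient
```

Instead of settling on a fixed point, the late iterates keep alternating between a low and a high activity value, the macroscopic oscillation of Figure 7B.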
As compared to the situation with subtractive inhibition or with purely excitatory feedback, equation 2.9 has a richer dynamical structure (see Figure 7). Most notably, a_{n+1} is no longer a monotonic function of a_n, and the slope da_{n+1}/da_n can take values less than −1, so that oscillations and even chaotic behavior can be observed (see Figures 8 and 9).

2.2 Microscopic Dynamics. As is already apparent from the examples shown in Figures 3, 5, and 7, the irregularity of the spike trains produced by different reverberating loops can be quite different. Numerical experiments have shown that in the case of purely excitatory projections, fixed points of the mean-field dynamics almost always correspond to a fixed point of the microscopic dynamics, or at least to a limit cycle with a short period. As soon
Figure 8: Similar plots as in Figure 6 but for shunting instead of subtractive inhibition.
Figure 9: Bifurcation diagram of the mean-field activity in a reverberating network with shunting inhibition; see equation 2.9. A classical period-doubling scenario leads to chaotic behavior as the density of the excitatory projections λ_exc is increased. The density of the inhibitory projections λ_inh = 10 and the firing threshold θ = 1 are kept constant.
as inhibitory projections are introduced, this situation changes dramatically. Fixed points in the mean-field dynamics still correspond to limit cycles in the microscopic dynamics; the length of the periods, however, is substantially larger and grows rapidly with the network size. A similar phenomenon was observed in a spin-glass model with random gaussian couplings (Nützel, 1991). We have performed numerical experiments in order to quantify the dependency of the attractor size on the network architecture. Figure 10A shows the period of the microscopic limit cycle, averaged over 100 realizations of the network and the initial firing pattern, as a function of the density of inhibitory projections. Starting from a period of 1 for λ_inh = 0, the averaged period quickly rises to very large values depending on the actual network size. For small networks, the probability of ending up in the all-off fixed point increases to significant levels as the density of the inhibitory projections exceeds 3 or 4. Therefore, the averaged length of the limit cycle drops back to 1 as λ_inh increases even further. There is a large variability in the length of the attractors produced by different realizations of the network or reached from different initial patterns in the same network. Figure 10B exhibits the maximum period length encountered in 100 randomly chosen realizations of network and initial pattern. The maximal values are about one order of magnitude larger than the mean.

With respect to potential applications, it is particularly interesting to see how information about the initial firing pattern is preserved in the sequence of patterns generated by the reverberating network. To this end, we have generated random initial patterns S_0 and estimated the transinformation
Figure 10: Attractor length as a function of the inhibitory projection density λ_inh and the network size N = 10, 20, 50, 100. (A) Length of the attractor averaged over 100 realizations of the coupling matrices and the initial pattern. The density of excitatory projections is kept constant at λ_exc = 3; the firing threshold is θ = 1. The dynamics (subtractive inhibition) is given by equation 2.6. (B) Maximal length of the attractor of 100 randomly chosen realizations of coupling matrices and initial patterns. Comparison of A and B shows that there is a large variability in the actual attractor length.
I(S_0, S_n) between the initial pattern and the pattern S_n generated by n iterations,

    I(S_0, S_n) = H(S_n) - H(S_n \mid S_0),    (2.10)

with H(S_n) being the entropy of the random variable S_n and H(S_n | S_0) the conditional entropy of S_n given S_0. Recall that the entropy of a random variable x with realizations x_1, x_2, ... and probability distribution prob{x_i} is defined as

    H(x) = -\sum_i \mathrm{prob}\{x_i\} \log_2(\mathrm{prob}\{x_i\}).    (2.11)

Similarly, the conditional entropy H(x | y) of x given y is (Ash, 1990)

    H(x \mid y) = -\sum_{i,j} \mathrm{prob}\{x_i, y_j\} \log_2(\mathrm{prob}\{x_i \mid y_j\}).    (2.12)
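Equations 2.10 through 2.12 translate directly into a plug-in estimator over sampled trajectories, using the identity H(S_n | S_0) = H(S_0, S_n) − H(S_0). A sketch (the network, sample size, and iteration depth are illustrative; for the noise-free dynamics the estimate reduces to H(S_n), since trajectories are deterministic and information is lost only when they merge):

```python
import math
from collections import Counter
import numpy as np

def entropy(counts, total):
    """Plug-in estimate of H(x), equation 2.11, from empirical counts."""
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def transinformation(step, N, n_iter, n_samples, rng):
    """Monte Carlo estimate of I(S_0, S_n) = H(S_n) - H(S_n | S_0),
    equation 2.10, with states stored as byte strings."""
    joint = Counter()
    for _ in range(n_samples):
        S0 = (rng.random(N) < 0.5).astype(int)
        S = S0.copy()
        for _ in range(n_iter):
            S = step(S)
        joint[(S0.tobytes(), S.tobytes())] += 1
    h0, hn = Counter(), Counter()
    for (k0, kn), c in joint.items():
        h0[k0] += c
        hn[kn] += c
    # H(S_n) - [H(S_0, S_n) - H(S_0)]
    return (entropy(hn, n_samples)
            - (entropy(joint, n_samples) - entropy(h0, n_samples)))

rng = np.random.default_rng(4)
N, theta = 10, 1
J_exc = (rng.random((N, N)) < 5.0 / N).astype(int)
J_inh = (rng.random((N, N)) < 5.0 / N).astype(int)
step = lambda S: ((J_exc @ S - J_inh @ S) >= theta).astype(int)
I5 = transinformation(step, N, 5, 2000, rng)
```

The same estimator works unchanged for the noisy dynamics of equation 2.13 below, where H(S_n | S_0) becomes strictly positive.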
Figure 11 shows numerical results for the transinformation I(S_0, S_n) obtained for small networks (N ∈ {16, 20}) with subtractive or shunting inhibition, averaged over 10 different realizations of the coupling matrices. We have paid particular attention not to overestimate the transinformation by using a sufficiently large number of samples (up to n_samples = 500,000 randomly chosen initial conditions per network) and extrapolating the results to n_samples → ∞ (Treves & Panzeri, 1995). We found that the transinformation decays within the first few iterations from the maximal value
Figure 11: Preservation of information about the initial firing pattern in a reverberating loop with N = 16 (open circles) and N = 20 (filled squares) neurons. The transinformation I(S_0, S_n) between the initial pattern and the pattern after n iterations is normalized by the maximum I(S_0, S_0). Error bars give the standard deviation of 10 different realizations of the coupling matrices. (A) Subtractive inhibition with λ_exc = 5, λ_inh = 5, θ = 1; see equation 2.6. (B) Shunting inhibition with λ_exc = 6, λ_inh = 3, θ = 1; see equation 2.8.
I_0 = I(S_0, S_0) to about 0.2 I_0. The initial decay reflects the length of the transient until the limit cycle is reached. The asymptotic value of I(S_0, S_n) as n → ∞ depends on the averaged number of initial patterns that end up on the same limit cycle.

Once the state of the network has reached a limit cycle, it will stay there forever due to the purely deterministic dynamics given by equations 2.2, 2.6, or 2.8. In reality, however, the presence of noise leads to mixing in the phase space, so that the information about the initial state will finally be lost. There are several sources of noise in a biological network. The most prominent are uncorrelated noisy synaptic input from other neurons and synaptic noise caused by synaptic transmission failures. In order to investigate the impact of noise on the preservation of information in the reverberating network, we have repeated the calculation of the transinformation for a network with synaptic noise, that is, a network where synaptic transmission fails with a certain probability p_fail. More precisely, we determined for each neuron the number of spikes h_i^{exc}(t_n) = Σ_j J_{ij}^{exc} S_j(t_{n−1}) a neuron would receive without noise via excitatory synapses and subtracted from this number a random number k_i^{exc} drawn from a binomial distribution with parameters N = h_i^{exc}(t_n) (number of trials) and p = p_fail (probability of success per trial). The same was done (independently) for the inhibitory input. The thus reduced amount of input was then plugged into the dynamical equation. In the case of subtractive inhibition, the above procedure gives
    S_i(t_n) = H\Big[\sum_{j=1}^{N} J_{ij}^{exc} S_j(t_{n-1}) - \sum_{j=1}^{N} J_{ij}^{inh} S_j(t_{n-1}) - k_i^{exc} + k_i^{inh} - \theta\Big].    (2.13)
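The noisy update of equation 2.13 amounts to thinning each neuron's input spike counts binomially before the threshold comparison. A sketch (parameters follow Figure 12A; NumPy's generator accepts array-valued trial counts):

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam_exc, lam_inh, theta, p_fail = 16, 5.0, 5.0, 1, 0.05   # Figure 12A

J_exc = (rng.random((N, N)) < lam_exc / N).astype(int)
J_inh = (rng.random((N, N)) < lam_inh / N).astype(int)

def step_noisy(S):
    """Equation 2.13: every arriving spike is dropped independently
    with probability p_fail before the threshold is applied."""
    h_exc = J_exc @ S
    h_inh = J_inh @ S
    k_exc = rng.binomial(h_exc, p_fail)   # failed excitatory spikes
    k_inh = rng.binomial(h_inh, p_fail)   # failed inhibitory spikes
    return ((h_exc - k_exc) - (h_inh - k_inh) >= theta).astype(int)

S = (rng.random(N) < 0.5).astype(int)
for _ in range(10):
    S = step_noisy(S)
```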
Figure 12: Preservation of information about the initial pattern in a reverberating network with synaptic noise. The solid line is the noise-free reference (p_fail = 0); the dashed lines correspond to p_fail = 0.001, p_fail = 0.01, and p_fail = 0.05 (from top to bottom). (A) Subtractive inhibition with N = 16, λ_exc = 5, λ_inh = 5, θ = 1; see equation 2.6. (B) Shunting inhibition with N = 16, λ_exc = 6, λ_inh = 3, θ = 1; see equation 2.8.
Figure 12 shows the amount of information about the initial pattern that is left after n iterations in the presence of synaptic noise in a small network with N = 16 neurons. As expected, unreliable synapses lead to a faster decay of the initial information. A failure probability of 5% already leads to a significantly reduced capacity. We discuss the role of synaptic reliability more extensively in section 4.

3 A Model of the Inferior Olive
We have seen that a simple time-discrete model of a delayed reverberating loop can produce complex sequences of ever-differing spike patterns. This model can provide an explanation for the experimentally observed irregularity of complex spikes. Due to the deterministic origin of the spike patterns, there may also be interesting functional implications for working memory and timing tasks. In order to check whether the properties of the time-discrete description carry over to a model continuous in time, we have built a conductance-based model of a subpopulation of neurons from the inferior olive.

The model is based on a two-compartment model of inferior olivary neurons developed by Schweighofer, Doya, and Kawato (1999). The simulations comprise up to 20 neurons coupled by gap junctions in an all-to-all fashion. Each neuron receives excitatory and inhibitory input via time-dependent conductances that mimic GABA- and AMPA-sensitive synapses. In IO neurons, gap junctions and excitatory and inhibitory synapses are tightly packed together in glomeruli located on distal dendrites (De Zeeuw et al., 1998). Therefore, the original IO model has been extended by an extra dendritic compartment that contains the synaptic conductances together with the gap junctions.
Each neuron thus consists of three compartments: a somatic compartment, a large dendritic compartment that represents the dendritic tree, and the extra compartment that mimics a single dendrite. Delayed reverberating projections have been implemented phenomenologically by the activation of synaptic conductances with a delay of about 100 ms after the occurrence of an action potential in one of the IO neurons. Since the excitatory reverberating loop via the mesodiencephalic junction contains one more synaptic transmission than the inhibitory loop, the overall delay of the excitatory projections is 5 ms longer than that of the inhibitory loop. The topology of both reverberating loops is described once more by means of sparse random matrices (see the appendix for details).

Figure 13 shows an example of a simulation of the IO network. The initially silent network has been stimulated by a short depolarizing current pulse delivered to neurons 1 through 5. Due to the reverberating projections, this results in sustained activity in the form of irregular spike trains in each individual neuron; the population activity, however, shows a clear 10 Hz oscillation. Coupling by gap junctions and common synaptic input ensures that neurons that fire an action potential within a certain cycle remain synchronized despite the heterogeneity in the network (some neurons receive more excitatory input spikes than others and thus tend to fire earlier). The mechanisms underlying synchronization by synaptic and electrical coupling are well understood and described elsewhere (Gerstner, van Hemmen, & Cowan, 1996; Chow & Kopell, 2000). The simulation shows that the central assumption of the time-discrete model is indeed justified: neurons firing within one cycle do stay synchronized. Moreover, the conductance-based model generates, after appropriate tuning of the synaptic conductances, the very same sequence of spike patterns as the simple time-discrete model.
Figures 13B and 13C show a spike raster diagram for the conductance-based model and the corresponding time-discrete model with θ = 1, respectively. Since there is perfect agreement, the spike patterns observed in the conductance-based model are solely determined by the topology of the excitatory and inhibitory reverberating loops. It is interesting to check whether the correspondence between the two types of model holds true also for other configurations, for example, for other values of the firing threshold. Since the firing threshold cannot be directly accessed in conductance-based models, we have reduced the synaptic conductances in order to increase the amount of synaptic input required to trigger a spike. Figure 14 shows the results. The time-discrete model with θ = 2 predicts that the activity dies out within four iterations. Again, there is perfect agreement between both models on the level of individual spikes.

Although postinhibitory rebound in DCN neurons seems to be a robust mechanism to trigger action potentials with a delay of about 100 ms (Kistler & van Hemmen, 1999; Kistler et al., 2000), there will always be a certain jitter
Figure 13: Simulation of a network of 20 IO neurons embedded in a reverberating loop with λ_exc = 6 and λ_inh = 3. (A) Membrane potentials as a function of time. The simulation was started by a short depolarizing current pulse in neurons 1–5 at time t = 0. (The number of the neuron is given in the upper-right corner of each graph.) The strength of the excitatory and inhibitory synapses is g_AMPA = 8 nS and g_GABA = 18 nS, respectively. (B) Spike raster diagram of the same simulation as in A. (C) Sequence of spike patterns as predicted by the time-discrete model, equation 2.8, with θ = 1 for this specific choice of the coupling matrices. Comparison with B shows perfect agreement.
in the arrival time of the reverberated spike activity. Postsynaptic AMPA receptors have rather short time constants of a few milliseconds, so that the amplitude of the overall postsynaptic potential triggered by a volley of incoming spikes can depend sensitively on the temporal dispersion of the volley. In the present context, for instance, activation of several excitatory synapses can fail to trigger the neuron if the temporal spread is too large. Furthermore, inhibitory spikes can prevent the triggering of an action potential only if they arrive before the excitatory volley. Any variability in the spike timing thus translates into noise in the amplitude of the postsynaptic
Figure 14: (A) Similar simulation as in Figure 13 but with weaker synapses (g_AMPA = 5 nS and g_GABA = 12 nS). The initial activity thus dies out after a few cycles. (B) Corresponding spike raster diagram. (C) Spike pattern predicted by the time-discrete model, equation 2.8, with θ = 2. As in Figure 13, there is perfect agreement between B and C.
potentials, similar to the noise produced by synaptic transmission failures discussed in section 2.2. A precise synchronization of IO neurons is thus essential in order to keep the dynamical working memory operational. In fact, multielectrode recordings have shown that complex spikes in nearby Purkinje cells can be synchronized on a millisecond timescale (Bell & Kawasaki, 1972; Llinás & Sasaki, 1989; Lang et al., 1999), a remarkable observation that usually is attributed to the gap junctional coupling of IO neurons. Recently, the role of gap junctions became experimentally accessible in genetically modified mice that are deficient of a certain protein (connexin36) that is part of gap junctions connecting cortical neurons (Rash et al., 2000; Güldenagel et al., 2001; Deans, Gibson, Sellitto, Connors, & Paul, 2001; Hormuzdi et al., 2001). Forthcoming experimental results on the degree of synchrony of complex
Figure 15: Similar simulation as in Figure 13 but with nonuniform delays of the reverberating projections. (A) Spike pattern predicted by the time-discrete model, equation 2.8. (B) Results of the corresponding simulation of the conductance-based IO model. (B1) Spike raster. Note that there are some discrepancies between A1 and B1, but both simulations finally reach the same attractor. (B2) Close-up of the spikes near t = 1010 ms. The jitter in the firing time is mostly due to the nonuniform delays of the reverberating projections. (B3) Autocorrelation of the population activity. The width of the peak is a measure of the degree of synchrony within each cycle. (C) Same simulation as in B but without gap junctions (g_c = 0). Compared to the time-discrete model, the spike raster (C1) shows several deviations that are due to the reduced level of synchrony within each cycle; see the spike train in C2 and the autocorrelogram in C3.
spikes in these mice can be compared directly to simulations where gap junctions have been switched off by setting their conductance to zero. Figure 15 shows an example of a simulation with noise in the timing of the reverberated spikes in the presence and absence of gap junctional coupling between IO neurons. The uniform delay of the reverberating projections has been replaced by a 6 ms wide distribution of delays. Comparison of the spike pattern generated by the time-discrete model in Figure 15A and the result from the conductance-based network model with gap junctions in Figure 15B shows a few missing and misplaced spikes near t = 300, 400, and 500 ms. Interestingly, however, the spike patterns at t = 600, ..., 1000 ms are again identical. If the gap junctions are switched off (g_c = 0), then significant
deviations in the spike patterns are apparent. Nevertheless, the synchrony of spikes within one cycle does not break down. The width of the peaks in the autocorrelation of the population activity is merely about twice as large as in the presence of gap junctions (see Figure 15C).

4 Discussion
The conclusions that can be drawn are twofold. First, purely deterministic mechanisms can generate irregular spike trains. Second, these spike trains can carry information about the initial pattern and the elapsed time even in the presence of noise. We have studied this phenomenon for a concrete biological substrate, the olivo-cerebellar system. Activity from the IO triggers complex spikes in cerebellar Purkinje cells followed by a pause in the simple spike activity, which gives rise to a burst of rebound spikes in the cerebellar nuclei with a delay of about 100 ms. The rebound activity can reverberate back to the IO either directly via inhibitory projections or indirectly via excitatory neurons of the mesodiencephalic junction. The delayed reverberation is in resonance with the intrinsic oscillatory properties of the IO neurons.

The activity of the IO is characterized by a periodic arrangement of narrow time windows that encompass potential firing times. The set of neurons active during one of these time windows determines, via the reverberating loop, the set of neurons that fire a spike in the next time window. Given the distinct firing characteristics of the IO neurons, a time-discrete model based on McCulloch-Pitts neurons with parallel dynamics is the natural choice. In this model, each time step of 100 ms corresponds to one cycle of the oscillating population activity. The corresponding mean-field dynamics can easily be solved and describes the macroscopic behavior as a function of the firing threshold and the divergence or convergence of the reverberating loops. Although the macroscopic dynamics is mostly characterized by a simple fixed-point behavior, the microscopic dynamics can be very complex. Numerical experiments show that the period of the microscopic attractors corresponding to a fixed point of the population activity can be extremely long (Nützel, 1991).
This has interesting implications for a long-standing discussion on the origin of the experimentally observed irregularity of neuronal activity, and it provides, to our knowledge, the first explanation for the irregularity of the complex spike activity in Purkinje cells. It has been argued that neurons that integrate synaptic input from many presynaptic neurons cannot produce irregular spike trains unless the membrane time constant is unrealistically short or the presynaptic activity contains fluctuations with an amplitude that is comparable to the mean input (Softky & Koch, 1993; Shadlen & Newsome, 1994). The latter condition can be met by balancing the amount of excitation and inhibition, which can be achieved in a sparse recurrent network without fine-tuning of parameters (van Vreeswijk & Sompolinsky,
1998; Brunel, 2000). This work demonstrates most clearly that the quenched noise in the random couplings of a sparse network is indeed capable of producing highly irregular spike trains without any additional sources of noise. Similar results have been obtained experimentally by Fusi, Del Giudice, and Amit (2000) with a hardware implementation of a recurrent network. It is a priori impossible to decide whether the irregularity in neuronal spike trains is a result of noisy neurons and synapses or whether the irregularity is the signature of a highly efficient temporal code, similar to the noise produced by a computer modem.

The second conclusion, which is intimately related to the first one, does not provide the ultimate answer to this question, but it demonstrates that there is a solution in the middle: the spike trains contain information, for example, on the initial firing pattern and the elapsed time even in the presence of (synaptic) noise. The reverberating network preserves this information for a certain amount of time depending on the level of noise. This is an emergent property of reverberating networks that is closely related to their chaotic microscopic dynamics. In an asynchronous network with balanced excitation and inhibition, two initially nearby states will diverge rapidly (van Vreeswijk & Sompolinsky, 1998). In the olivo-cerebellar system, these states will also diverge, but this takes at least a few iterations. Due to the delayed triggering of spikes via postinhibitory rebound, each iteration corresponds to about 100 ms. The timescale on which information is lost is thus long enough to provide a useful storage device for past input. Similar to the idea of a liquid state machine (Maass, Natschläger, & Markram, in press), readout neurons can use this information for all sorts of real-time computation.
As a concrete example, we want to show that conclusion 2 provides a novel mechanism for the implementation of a working memory and the solution of timing tasks. Suppose that external (sensory) events trigger activity in certain subsets of neurons that are part of the reverberating loop. Different events trigger different neurons, therewith starting transients that end up in different attractors. The spike patterns thus simultaneously represent information about what happened and when it happened. The "what" information is implicitly stored by the attractor that is finally reached, and the "when" information is contained in the actual position of the network on the attractor or on the transient leading to the attractor (see Figure 16). The network can hence serve as a working memory for external events. Note that this mechanism, in contrast to traditional attractor neural networks (Hopfield, 1982; Amit, 1989), does not rely on any form of synaptic plasticity, though synaptic plasticity might help to stabilize certain transients and their attractors.

The critical question in this context is the vulnerability of the system to noise. In section 2.2, we saw that even a low failure rate of the synaptic transmission impairs the preservation of information from one cycle to the next. Nevertheless, a failure rate of 5% leaves, after five iterations, more than 10% of the information about the initial pattern (see Figure 12). This means
2620
Werner M. Kistler and Chris I. De Zeeuw
Figure 16: Simultaneous representation of “what” and “when” information. An external event (event 1) triggers a subset of neurons (4, 5, and 6) leading to a certain sequence of spike patterns. A detector neuron (D) from a superordinated layer can easily be trained to recognize a particular pattern from this sequence, for example, the pattern that occurs exclusively 300 ms after event 1 (upper diagram). Other events trigger a different sequence of spike patterns that do not trigger the detector neuron (lower diagram).
that 10 neurons are enough to discern two different events half a second after they actually occurred. Conditioning experiments where test animals are trained to react with a certain delay to a visual or auditory cue show that this is indeed the physiologically relevant timescale (Schneiderman & Gormezano, 1964). Furthermore, the cerebellum seems to be critically involved in the acquisition and performance of the timed response (McCormick, Clark, Lavond, & Thompson, 1982; Perret, Ruiz, & Mauk, 1993; Yeo & Hesslow, 1998; Ahn & Welsh, 1999). The described mechanism provides a new explanation for how the cerebellar network can solve timing tasks on a timescale that is almost two orders of magnitude larger than the
Dynamical Working Memory and Timed Responses
2621
millisecond timescale usually attributed to neuronal dynamics (Kistler & van Hemmen, 1999; Kistler et al., 2000). In reality, time is a continuous variable, and therefore sources of noise other than synaptic transmission failures come into play—noise in the timing of action potentials, nonperfect synchronization, or heterogeneously distributed delays in the reverberating loop. In order to address these issues, we developed in section 3 a conductance-based network model for a population of IO neurons. This model can exactly reproduce the spike patterns of the (noise-free) time-discrete model, at least as long as a narrow distribution of delays in the reverberating loop is used. The assumptions on synchrony and rhythmicity implicitly made in the time-discrete model are thus confirmed; neurons stay synchronized, though only a subpopulation is active during each cycle. The mechanisms that keep the population synchronized—common synaptic input and electrotonic coupling—are well understood and studied elsewhere (Gerstner et al., 1996; Chow & Kopell, 2000). In the context of the olivo-cerebellar system, timing issues might actually be more important than transmission failures because some of the involved synapses, such as the climbing fiber–Purkinje cell synapses, seem to be particularly suitable for a reliable transmission of action potentials. Nevertheless, the dense coupling of IO neurons via gap junctions indicates that nature has taken care to keep IO neurons well synchronized. Simulations of the conductance-based network model show that the strength of the electrotonic coupling is a delicate issue. If the coupling is too strong, then synaptic input will no longer trigger only the postsynaptic neuron but a whole bunch of electrotonically coupled cells. On the other hand, if the coupling is too weak, then spikes within one cycle are only poorly synchronized.
The coupling finally used in the simulations presented in Figures 13 through 15 is in keeping with a rather weak coupling of IO cells in rat (Devor & Yarom, 2001). In this article, we have studied the olivo-cerebellar system. The general principle of a dynamical working memory, however, can carry over to other structures of the brain. Thalamic relay neurons, for instance, provide another example of neurons that show postinhibitory rebound and thus form a potential substrate for a delayed triggering of action potentials. These neurons are embedded in topographically organized loops that reciprocally connect the thalamic nuclei with the cerebral cortex (Sherman & Koch, 1990; Billock, 1997). Similarities with the loops connecting inferior olive and cerebellar cortex are not necessarily coincidental.

Appendix: Conductance-Based Network Model
The simulations presented in section 3 are based on a model of inferior olivary neurons developed by Schweighofer et al. (1999). We have adapted this model as follows.
In order to account for the co-location of synapses and gap junctions in dendritic glomeruli, we have added an extra dendritic compartment to the original two-compartment model. The somatic compartment and the (large) dendritic compartment are equipped with the same set of ion channels as in the model of Schweighofer et al. (1999). The parameter values are g_Na = 70, g_K,dr = 22, g_Ca,l = 0.75, g_h = 1.5, g_Ca,h = 4, g_K,Ca = 35, g_ls = g_ld = 0.015 (all conductances in mS/cm²). The morphology of the somatic and the large dendritic compartment corresponds to the “standard cell” of Schweighofer et al. (1999): g_int = 0.13 mS/cm² and p = 0.2 with a total area of 10,000 µm². The extra compartment, which contains gap junctions and synaptic conductances, is connected to the soma and is supposed to represent a single (passive) dendrite with a length of 100 µm and radius 0.6 µm. Gap junctional coupling within the network is assumed to be all-to-all, in keeping with the experimental result that each IO neuron is coupled to 10 to 40 neighboring cells (Devor & Yarom, 2001). The current through a gap junction is described as in the original model with a conductivity of g_c = 0.5 nS, which results in a coupling coefficient of about 0.1, similar to values reported by Devor and Yarom (2001). Synaptic input is described by the activation of time-dependent conductances in the extra compartment. The synaptic current is of the form

\[
I_{\mathrm{syn}}(t) = g\,[v(t) - E]\; S\!\left(\sum_i \bigl[e^{-(t-t_i)/\tau_2} - e^{-(t-t_i)/\tau_1}\bigr]\, H(t-t_i) \big/ c_{\tau_1,\tau_2}\right). \tag{A.1}
\]
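The double-exponential synaptic conductance of equation A.1 can be sketched in a few lines of Python. This is a minimal sketch only: the spike times, peak conductance, and reversal potential below are illustrative, with τ₁ = 0.5 ms, τ₂ = 2 ms (the AMPA values given in the text), and k = 2 for the saturating function S.

```python
import numpy as np

# Sketch of equation A.1: a normalized difference of exponentials per
# presynaptic spike, summed, saturated by S, and scaled by the driving
# force. Parameter values are illustrative.
tau1, tau2, k = 0.5, 2.0, 2.0   # ms; k is the saturation constant of S
g_peak, E_rev = 1.0, 0.0        # peak conductance (a.u.), reversal (mV)

# c normalizes the peak of exp(-t/tau2) - exp(-t/tau1) to unity.
t_peak = tau1 * tau2 / (tau2 - tau1) * np.log(tau2 / tau1)
c = np.exp(-t_peak / tau2) - np.exp(-t_peak / tau1)

def S(x):
    # Sublinear superposition of simultaneously triggered PSCs.
    return (1.0 - np.exp(-x / k)) / (1.0 - np.exp(-1.0 / k))

def I_syn(t, v, spike_times):
    s = sum((np.exp(-(t - ti) / tau2) - np.exp(-(t - ti) / tau1)) / c
            for ti in spike_times if t >= ti)
    return g_peak * (v - E_rev) * S(s)

# A single spike at t = 0 gives a unit-height conductance transient,
# so at its peak I_syn = g_peak * (v - E) * S(1) = v - E.
print(I_syn(t_peak, -65.0, [0.0]))
```

By construction S(1) = 1, so a single presynaptic spike produces the same peak current as in the unsaturated model; the sublinearity only affects coincident spikes.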
Here, the sum is over all arrival times t_i of presynaptic action potentials, H is the Heaviside step function, c_{τ1,τ2} is a constant that normalizes the height of the peak of e^{−(t−t_i)/τ2} − e^{−(t−t_i)/τ1} to unity, and g is the peak conductance. S(x) = (1 − exp[−x/k]) / (1 − exp[−1/k]) with k = 2 is a saturating function that describes a sublinear superposition of postsynaptic currents that have been triggered by simultaneously arriving presynaptic spikes. GABAergic synapses are modeled by a rise time of τ1 = 0.5 ms, an exponential decay with time constant τ2 = 20 ms, and a reversal potential of E = −75 mV. AMPA-sensitive synapses have τ1 = 0.5 ms, τ2 = 2 ms, and E = 0 mV. The resulting (440-dimensional) differential equation is solved by custom-made software based on a modified Bulirsch-Stoer integration scheme (Press, Teukolsky, Vetterling, & Flannery, 1992). One second of real time requires about 15 to 30 minutes of CPU time on an 800 MHz PIII Linux PC.

Acknowledgments
Part of this work was completed during a visit of W. M. K. at the Hebrew University of Jerusalem, and it is his great pleasure to thank Idan Segev and Yosef Yarom for their hospitality. We also thank Moritz Franosch, Christian
Leibold, and Oliver Wenisch from the group of Leo van Hemmen, Technical University of Munich, for computing time on some of their Linux PCs. W. M. K. is supported by the NWO-Pioneer Program to C. I. De Zeeuw.

References

Ahn, E. S., & Welsh, J. P. (1999). Inferior olive optimizes but is not required for associative learning as measured by eyeblink conditioning. Soc. Neurosci. Abstr., 25, 1560.
Aizenman, C. D., & Linden, D. J. (1999). Regulation of the rebound depolarization and spontaneous firing patterns of deep nuclear neurons in slices of rat cerebellum. J. Neurophysiol., 82, 1697–1709.
Aizenman, C. D., Manis, P. B., & Linden, D. J. (1998). Polarity of long-term synaptic gain change is related to postsynaptic spike firing at a cerebellar inhibitory synapse. Neuron, 21, 827–835.
Amit, D. J. (1989). Modeling brain function—the world of attractor neural networks. Cambridge: Cambridge University Press.
Armstrong, D. M., Cogdell, B., & Harvey, R. J. (1975). Effects of afferent volleys from the limbs on the discharge patterns of interpositus neurones in cats anaesthetized with α-chloralose. J. Physiol. (Lond.), 248, 489–517.
Armstrong, D. M., Cogdell, B., & Harvey, R. J. (1979). Discharge patterns of Purkinje cells in cats anaesthetized with α-chloralose. J. Physiol. (Lond.), 291, 351–366.
Ash, R. (1990). Information theory. New York: Dover.
Bal, T., & McCormick, D. A. (1997). Synchronized oscillations in the inferior olive are controlled by the hyperpolarization-activated cation current I_h. J. Neurophysiol., 77, 3145–3156.
Bell, C. C., & Kawasaki, T. (1972). Relations among climbing fiber responses of nearby Purkinje cells. J. Neurophysiol., 35, 155–169.
Billock, V. A. (1997). Very short-term visual memory via reverberation: A role for the cortico-thalamic excitatory circuit in temporal filling-in during blinks and saccades? Vision Res., 37, 949–953.
Boylls, C. C. (1978).
Prolonged alterations of muscle activity in locomoting premammillary cats by microstimulation of the inferior olive. Brain Res., 159, 445–450.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Chow, C. C., & Kopell, N. (2000). Dynamics of spiking neurons with electrical coupling. Neural Comput., 12, 1643–1678.
Crick, F. (1994). The astonishing hypothesis. New York: Scribner.
Crisanti, A., & Sompolinsky, H. (1988). Dynamics of spin systems with randomly asymmetric bonds: Ising spins and Glauber dynamics. Phys. Rev. A, 37, 4865–4874.
Czubayko, U., Sultan, F., Thier, P., & Schwarz, C. (2001). Two types of neurons in the rat cerebellar nuclei as distinguished by membrane potentials and intracellular fillings. J. Neurophysiol., 85, 2017–2029.
Deans, M. R., Gibson, J. R., Sellitto, C., Connors, B. W., & Paul, D. L. (2001). Synchronous activity of inhibitory networks in neocortex requires electrical synapses containing connexin36. Neuron, 31, 477–485.
Derrida, B., Gardner, E., & Zippelius, A. (1987). An exactly solvable asymmetric neural network model. Europhys. Lett., 2, 167–173.
Desclin, J. C. (1974). Histological evidence supporting the inferior olive as the major source of cerebellar climbing fibers in the cat. Brain Res., 77, 365–384.
Devor, A., & Yarom, Y. (2001). Electrotonic coupling in the inferior olivary nucleus revealed by simultaneous double patch recordings. Unpublished.
De Zeeuw, C. I., Lang, E. J., Sugihara, I., Ruigrok, T. J. H., Eisenman, L. M., Mugnaini, E., & Llinás, R. (1996). Morphological correlates of bilateral synchrony in the rat cerebellar cortex. J. Neurosci., 16, 3412–3426.
De Zeeuw, C., & Ruigrok, T. J. (1994). Olivary projecting neurons in the nucleus of Darkschewitsch in the cat receive excitatory monosynaptic input from the cerebellar nuclei. Brain Res., 653, 345–350.
De Zeeuw, C. I., Simpson, J. I., Hoogenraad, C. C., Galjart, N., Koekkoek, S. K. E., & Ruigrok, T. J. H. (1998). Microcircuitry and function of the inferior olive. Trends Neurosci., 21, 391–400.
De Zeeuw, C., Wylie, D. R., Digiorgi, P. L., & Simpson, J. I. (1994). Projections of individual Purkinje cells of identified zones in the flocculus to the vestibular and cerebellar nuclei in the rabbit. J. Comp. Neurol., 349, 428–447.
Fusi, S., Del Giudice, P., & Amit, D. J. (2000). Neurophysiology of a VLSI spiking neural network: LANN21. In S. Amari, C. L. Giles, & V. Piuri (Eds.), Neural computing: New challenges and perspectives for the new millennium. Piscataway, NJ: IEEE Computer Society Press.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comput., 8, 1689–1712.
Güldenagel, M., Ammermüller, J., Feigenspan, A., Teubner, B., Degen, J., Söhl, G., Willecke, K., & Weiler, R. (2001). Visual transmission deficits in mice with targeted disruption of the gap junction gene connexin36. J. Neurosci., 21, 6036–6044.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hormuzdi, S. G., Pais, I., LeBeau, F. E. N., Towers, S. K., Rozov, A., Buhl, E. H., Whittington, M. A., & Monyer, H. (2001). Impaired electrical signaling disrupts gamma frequency oscillations in connexin36-deficient mice. Neuron, 31, 487–495.
Ito, M. (1984). The cerebellum and neural control. New York: Raven Press.
Jahnsen, H. (1986). Electrophysiological characteristics of neurones in the guinea-pig deep cerebellar nuclei in vitro. J. Physiol. (Lond.), 372, 129–147.
Kirkpatrick, S., & Sherrington, D. (1978). Infinite-ranged models of spin-glasses. Phys. Rev. B, 17, 4384–4403.
Kistler, W. M., & van Hemmen, J. L. (1999). Delayed reverberation through time windows as a key to cerebellar function. Biol. Cybern., 81, 373–380.
Kistler, W. M., van Hemmen, J. L., & De Zeeuw, C. I. (2000). Time window control: A model for cerebellar function based on synchronization, reverberation, and time slicing. In N. M. Gerrits, T. J. H. Ruigrok, & C. I. De Zeeuw
(Eds.), Cerebellar modules: Molecules, morphology, and function (pp. 275–297). Amsterdam: Elsevier.
Kree, R., & Zippelius, A. (1991). Asymmetrically diluted neural networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks. Berlin: Springer-Verlag.
Lampl, I., & Yarom, Y. (1993). Subthreshold oscillations of the membrane potential: A functional synchronizing and timing device. J. Neurophysiol., 70, 2181–2186.
Lang, E. J. (2001). Organization of olivocerebellar activity in the absence of excitatory glutamatergic input. J. Neurosci., 21, 1663–1675.
Lang, E. J., Sugihara, I., & Llinás, R. (1996). GABAergic modulation of complex spike activity by the cerebellar nucleoolivary pathway in rat. J. Neurophysiol., 76, 255–275.
Lang, E. J., Sugihara, I., Welsh, J. P., & Llinás, R. (1999). Patterns of spontaneous Purkinje cell complex spike activity in the awake rat. J. Neurosci., 19, 2728–2739.
Llinás, R., & Mühlethaler, M. (1988). Electrophysiology of guinea-pig cerebellar nuclear cells in the in vitro brain stem–cerebellar preparation. J. Physiol. (Lond.), 404, 241–258.
Llinás, R., & Sasaki, K. (1989). The functional organization of the olivocerebellar system as examined by multiple Purkinje cell recordings. Eur. J. Neurosci., 1, 587–602.
Llinás, R., & Yarom, Y. (1986). Oscillatory properties of guinea-pig inferior olivary neurones and their pharmacological modulation: An in vitro study. J. Physiol. (Lond.), 376, 163–182.
Loewenstein, Y., Yarom, Y., & Sompolinsky, H. (2000). Complex spikes and bursting activity of simple spikes of cerebellar Purkinje cells are highly correlated. Neurosci. Lett., 55, 35.
Maass, W., Natschläger, T., & Markram, H. (in press). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation.
McCormick, D. A., Clark, G. A., Lavond, D. G., & Thompson, R. F. (1982). Initial localization of the memory trace for a basic form of learning. Proc. Natl.
Acad. Sci. USA, 79, 2731–2735.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bull. Math. Biophys., 5, 115–133.
Nützel, K. (1991). The length of attractors in asymmetric random neural networks with deterministic dynamics. J. Phys. A: Math. Gen., 24, L151–L157.
Nützel, K., Kien, J., Bauer, K., Altman, J. S., & Krey, U. (1994). Dynamics of diluted attractor neural networks with delays. Biol. Cybern., 70, 553–561.
Perret, S. P., Ruiz, B. P., & Mauk, M. D. (1993). Cerebellar cortex lesions disrupt learning-dependent timing of conditioned eyelid response. J. Neurosci., 13, 1708–1718.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Rash, J. E., Staines, W. A., Yasumura, T., Patel, D., Furman, C. S., Stelmack, G. L., & Nagy, J. I. (2000). Immunogold evidence that neuronal gap junctions in adult brain and spinal cord contain connexin-36 but not connexin-32 or connexin-43. Proc. Natl. Acad. Sci. USA, 97, 7573–7578.
Ruigrok, T. J. H. (1997). Cerebellar nuclei: The olivary connection. In C. I. De Zeeuw & J. Voogd (Eds.), Progress in brain research (Vol. 114, pp. 167–192). Amsterdam: Elsevier Science.
Schneiderman, N., & Gormezano, I. (1964). Conditioning of the nictitating membrane response of the rabbit as a function of CS-US interval. J. Comp. Physiol. Psychol., 57, 188–195.
Schweighofer, N., Doya, K., & Kawato, M. (1999). Electrophysiological properties of inferior olive neurons: A compartmental model. J. Neurophysiol., 82, 804–817.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Sherman, S. M., & Koch, C. (1990). Thalamus. In G. Shepherd (Ed.), The synaptic organization of the brain (pp. 246–278). New York: Oxford University Press.
Softky, W., & Koch, C. (1993). The highly irregular firing pattern of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Sotelo, C., Llinás, R., & Baker, R. (1974). Structural study of inferior olivary nucleus of the cat: Morphological correlations of electrotonic coupling. J. Neurophysiol., 37, 541–559.
Szentágothai, J., & Rajkovits, K. (1959). Über den Ursprung der Kletterfasern des Kleinhirns. Z. Anat. Entwickl. Gesch., 121, 130–141.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comput., 7, 399–407.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Yeo, C. H., & Hesslow, G. (1998). Cerebellum and conditioned reflexes. Trends Cognit. Sci., 2, 322–330.
Received July 5, 2001; accepted April 30, 2002.
LETTER
Communicated by Peter Latham
Selectively Grouping Neurons in Recurrent Networks of Lateral Inhibition

Xiaohui Xie
[email protected]
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Richard H. R. Hahnloser
[email protected]

H. Sebastian Seung
[email protected]
Department of Brain and Cognitive Sciences and Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Winner-take-all networks have been proposed to underlie many of the brain’s fundamental computational abilities. However, not much is known about how to extend the grouping of potential winners in these networks beyond single neurons or uniformly arranged groups of neurons. We show that competition between arbitrary groups of neurons can be realized by organizing lateral inhibition in linear threshold networks. Given a collection of potentially overlapping groups (with the exception of some degenerate cases), the lateral inhibition results in network dynamics such that any permitted set of neurons that can be coactivated by some input at a stable steady state is contained in one of the groups. The information about the input is preserved in this operation. The activity level of a neuron in a permitted set corresponds to its stimulus strength, amplified by some constant. Sets of neurons that are not part of a group cannot be coactivated by any input at a stable steady state. We analyze the storage capacity of such a network for random groups—the number of random groups the network can store as permitted sets without creating too many spurious ones. In this framework, we calculate the optimal sparsity of the groups (maximizing group entropy). We find that for dense inputs, the optimal sparsity is unphysiologically small. However, when the inputs and the groups are equally sparse, we derive a more plausible optimal sparsity. We believe our results are the first steps toward attractor theories in hybrid analog-digital networks.

1 Introduction
Neural Computation 14, 2627–2646 (2002) © 2002 Massachusetts Institute of Technology

It has long been known that lateral inhibition in neural networks can lead to winner-take-all competition, so that only a single neuron is active at
a steady state (Amari & Arbib, 1977; Feng & Hadeler, 1996; Hahnloser, 1998; Sum & Tam, 1996; Coultrip, Granger, & Lynch, 1992; Maass, 2000). When used for unsupervised learning, such winner-take-all networks enforce grandmother-cell representations as in vector quantization (Kohonen, 1989). Recently, much research has focused on unsupervised learning algorithms for sparsely distributed representations (Olshausen & Field, 1996; Lee & Seung, 1999). These algorithms lead to representations where multiple neurons participate in the encoding of an object and so are more distributed than vector quantization. Therefore, it is of interest to find ways of using lateral inhibition to mediate winner-take-all competition between groups of neurons, enforcing the sparse representation at a network level. Competing groups of neurons are the essence of attractor models of associative memory. Selectively grouped neurons correspond to patterns that are stored as attractors in the network, with only one of these patterns retrieved at a steady state (Hopfield, 1982; Willshaw, Buneman, & Longuet-Higgins, 1969; Willshaw & Longuet-Higgins, 1970). In this case, the input to the network is represented in the initial conditions of the dynamic system, and the winning group is the resulting steady state. However, the binary behavior of an individual neuron in associative memory models is much different from, and computationally less informative than, that of a biophysical neuron, whose firing rate encodes information on the signal it is processing. Although there have been extensions of these discrete and digital attractor networks to networks with graded (Hopfield, 1984; Miller & Zucker, 1999) or stochastic neurons (Golomb, Rubin, & Sompolinsky, 1990), the behavior of the individual neuron tends to be inactive or saturated and thus remains binary in essence. In this article, we show how winner-take-all competition between groups of neurons can be realized in networks of nonbinary, analog neurons.
In a network model introduced later, neurons at a steady state can be either active or inactive and form a binary pattern representing a permitted grouping of the neurons. At the same time, the activated neurons carry analog values resulting from computations implemented by the network. We present a natural way of wiring the network to group neurons selectively by adding strong lateral inhibition between them. Given a collection of potentially overlapping groups, the inhibitory connectivity is set by a simple formula that can be interpreted as arising from an on-line learning rule. To show that the resulting network functions as group winner-take-all, we perform a stability analysis. If the strength of inhibition is sufficiently great and the group organization satisfies certain conditions, one and only one group of neurons can be activated at a stable steady state. In general, the identity of the winning group depends on the network inputs and also on the initial conditions of the dynamics. We characterize the storage capacity, the maximum number of groups the network can mediate to produce winner-take-all competition, for random sparse groups in which each neuron has probability p of being included in each group. Let n be the total number of neurons in the network. We
determine the optimal sparsities p that maximize group entropy in two cases: (1) when the input is dense, the optimal sparsity scales as ln(n)/n, and (2) when the inputs are of equal sparsity as the groups themselves, the optimal sparsity scales as √(ln(n)/n). In the first case, the storage capacity roughly scales as n², and in the second case, the storage capacity scales as n/ln(n).

2 Basic Definitions
Let m groups of neurons be given, where the group membership of the ath group is specified by

\[
\xi_i^a = \begin{cases} 1 & \text{if the $i$th neuron is in the $a$th group} \\ 0 & \text{otherwise,} \end{cases} \tag{2.1}
\]
for i = 1, …, n. We will assume that every neuron belongs to at least one group¹ and that every group contains at least one neuron. A neuron is allowed to belong to more than one group, so that the groups are potentially overlapping. The inhibitory synaptic connectivity of the network is defined in terms of the group membership,

\[
J_{ij} = \prod_{a=1}^{m} \bigl(1 - \xi_i^a \xi_j^a\bigr) = \begin{cases} 0 & \text{if $i$ and $j$ both belong to a group} \\ 1 & \text{otherwise.} \end{cases} \tag{2.2}
\]
The matrix J basically states that an inhibitory connection between neurons i and j is established only if they do not belong to any of the same groups. This pattern of connectivity could arise from a simple learning mechanism. Suppose that all elements of J are initialized to unity and that the groups are presented sequentially as binary vectors ξ¹, …, ξᵐ. The ath pattern is learned through the update

\[
J_{ij} \leftarrow J_{ij}\,\bigl(1 - \xi_i^a \xi_j^a\bigr). \tag{2.3}
\]
In other words, if both neurons i and j belong to pattern a, then the connection between them is removed. After presentation of all m patterns, this leads to equation 2.2. At the start of the learning process, the initial state of J corresponds to uniform inhibition, which is known to implement winner-take-all competition between individual neurons. It will be seen that as inhibitory connections are removed during learning, the network evolves to mediate competition between groups of neurons rather than between individual neurons.

1
This condition can be relaxed but is kept for simplicity.
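The closed form of equation 2.2 and the on-line update of equation 2.3 can be checked against each other in a few lines. This is a minimal Python sketch; the five-neuron membership matrix is a hypothetical example, not taken from the paper.

```python
import numpy as np

# Hypothetical 5-neuron network with two overlapping groups:
# xi[a, i] = 1 if neuron i belongs to group a (equation 2.1).
xi = np.array([[1, 1, 1, 0, 0],   # group 1: neurons 0-2
               [0, 0, 1, 1, 1]])  # group 2: neurons 2-4
m, n = xi.shape

# Closed form of equation 2.2: J_ij = prod_a (1 - xi^a_i xi^a_j).
J_closed = np.prod(1 - xi[:, :, None] * xi[:, None, :], axis=0)

# Equivalent on-line learning rule (equation 2.3): start from uniform
# inhibition and remove the connection whenever i and j share a pattern.
J = np.ones((n, n))
for a in range(m):
    J *= 1 - np.outer(xi[a], xi[a])

assert np.array_equal(J, J_closed)
# Neurons 0 and 4 share no group, so they inhibit each other (J = 1);
# neurons 0 and 2 share group 1, so there is no inhibition (J = 0).
print(J[0, 4], J[0, 2])
```

Note that after learning, inhibition survives only between cross-group pairs, which is exactly what turns uniform winner-take-all into competition between groups.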
Let x_i be the activity of neuron i. The dynamics of the network is given by

\[
\frac{dx_i}{dt} + x_i = \Bigl[\, b_i + \alpha x_i - \beta \sum_{j=1}^{n} J_{ij}\, x_j \Bigr]^{+}, \tag{2.4}
\]

for all i = 1, …, n, where [z]⁺ ≡ max{z, 0} denotes rectification, α > 0 is the strength of self-excitation, and β > 0 is the strength of lateral inhibition. b_i is the external input. Equivalently, the dynamics can be written in matrix-vector form as

\[
\dot{\mathbf{x}} + \mathbf{x} = [\mathbf{b} + W\mathbf{x}]^{+}, \tag{2.5}
\]

where W = αI − βJ includes both self-excitation and lateral inhibition. The state of the network is specified by the vector x = [x₁, …, x_n]ᵀ and the external input by the vector b = [b₁, …, b_n]ᵀ. Recurrent networks with linear threshold units have been used in a variety of neural modeling studies (Hansel & Sompolinsky, 1998; Salinas & Abbott, 1996; Wersing, Beyn, & Ritter, 2001; Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). A vector v is said to be nonnegative, v ≥ 0, if all of its components are nonnegative. The nonnegative orthant is the set of all nonnegative vectors. Notice that in equation 2.4, ẋ_i ≥ 0 whenever x_i = 0. Moreover, the linear threshold function is obviously Lipschitz continuous. These two properties are sufficient to guarantee that the nonnegative orthant is a positively invariant set of the dynamics, that is, any trajectory of equation 2.4 starting in the nonnegative orthant remains there (Khalil, 1996). Furthermore, even if the initial state of x is negative, it will become nonnegative after some transient period. Therefore, for simplicity, we consider trajectories that are confined to the nonnegative orthant x ≥ 0. However, we consider input vectors b whose components are of arbitrary sign.

3 Network Performance
Next, we briefly state some of the properties of the network. The detailed analysis is deferred to later sections. We start with a simple case with n different groups, each containing one of the n neurons, which is the traditional winner-take-all network. Suppose k > 1 neurons are active initially. After proper ordering, the interaction matrix between these k active neurons is W = (α + β)I − β𝟙𝟙ᵀ, where 𝟙 is the column vector consisting of all ones. One eigenvector of W is 𝟙, with eigenvalue α − (k − 1)β. The other k − 1 eigenvectors are differential modes whose components sum to zero, with eigenvalue α + β. If the inhibition strength is strong enough, β > 1 − α, the differential modes are unstable; therefore, the network cannot have more than one neuron active at a steady
state. Moreover, the network is guaranteed to converge to a steady state provided that α < 1. Under these conditions, we can conclude that for all b and initial conditions of x, the network always converges to one of the given groups. For the general case with an arbitrary group membership matrix ξ, the above conclusion still holds true, except in some degenerate cases (which will be described in the next section). If the lateral inhibition is strong enough (β > 1 − α), as in the previous case, any steady state with two active neurons not contained in the same group is unstable. If α < 1, the network is again guaranteed to converge to a steady state. Therefore, one and only one of the given groups can be active at each steady state. Which groups could potentially be the winner is specified by the input b. In the case of nonoverlapping groups, the potential winners are determined by the aggregate positive input B^a = Σ_{i=1}^n [b_i]⁺ ξ_i^a that each group receives. Any group with B^a ≥ (1 − α)β⁻¹ b_max could end up as the winning group, where b_max ≡ max_i {b_i}. Which group wins in the end depends on the initial conditions. It is possible for a specific group to win for all initial conditions if its inputs are sufficiently large. The synaptic connections between neurons within a group are restricted to self-excitation. This causes the activities of the winning neurons to be equal to their rectified inputs, amplified by a gain factor 1/(1 − α). Thus, the network implements a form of hybrid analog-digital computation, selectively amplifying activities in only one group of neurons.

4 Analysis of the Network Dynamics

4.1 Convergence to a Steady State. This section characterizes the steady-state responses of the network equation 2.4 to an input b that is constant in time. For this to be a sensible goal, we need some guarantee that the dynamics converges to a steady state and does not diverge. This is provided by the following theorem.

Theorem 1. Consider the network equation 2.4. The following statements are equivalent:

1. For any input b, the network state x converges to a steady state that is stable in the sense of Lyapunov, except for initial conditions in a set of measure zero consisting of unstable equilibria.

2. The strength α of self-excitation is less than one.

Proof.
To prove 2 ⇒ 1: if α < 1, the function

\[
E(\mathbf{x}) = \frac{1-\alpha}{2}\,\mathbf{x}^{T}\mathbf{x} + \frac{\beta}{2}\,\mathbf{x}^{T} J \mathbf{x} - \mathbf{b}^{T}\mathbf{x} \tag{4.1}
\]
is bounded below and radially unbounded in the nonnegative orthant. Furthermore, E is nonincreasing following the dynamics:

\[
\begin{aligned}
\frac{dE}{dt} &= -\bigl((I - W)\mathbf{x} - \mathbf{b}\bigr)^{T}\bigl(\mathbf{x} - [W\mathbf{x} + \mathbf{b}]^{+}\bigr) \\
&= -\sum_{i \in M} \bigl((I - W)\mathbf{x} - \mathbf{b}\bigr)_i^2 - \sum_{i \notin M} \bigl(x_i^2 - (W\mathbf{x} + \mathbf{b})_i\, x_i\bigr) \\
&\le -\sum_{i \in M} \bigl((I - W)\mathbf{x} - \mathbf{b}\bigr)_i^2 - \sum_{i \notin M} x_i^2 \;\le\; 0,
\end{aligned}
\]
where M ≡ {i | (Wx + b)_i > 0, i = 1, …, n}. The notation (z)_i denotes the ith component of the vector z. Equality above holds if and only if x is at a steady state. Therefore, E(x) is a Lyapunov-like function ensuring convergence to a stable steady state, except for initial conditions in a set of measure zero. To prove 1 ⇒ 2, let us suppose that statement 2 is false, that is, α ≥ 1. Choose b = (1, 0, …, 0)ᵀ and initial conditions x(0) = (1, 0, …, 0)ᵀ, so that the dynamics of the first neuron reduces to ẋ₁ + x₁ = [αx₁ + 1]⁺ ≥ x₁ + 1, in which case x₁ diverges. In addition, x₁ diverges for initial conditions in a set of nonzero measure, so statement 1 is contradicted. Therefore, α < 1 is both the necessary and sufficient condition for convergence to a stable steady state. In the following, we restrict the network to α < 1.

4.2 Permitted and Forbidden Sets. In general, the network may have many fixed points. However, only those that are stable are typically observed at a steady state. We will call a set of neurons that can be coactivated by some input at a stable (in the sense of Lyapunov) steady state a permitted set. Otherwise, it is termed a forbidden set. For a set of neurons to be a permitted set, two conditions have to be satisfied: its neurons have to be steadily coactivated by some input, and the steady coactivation must be stable. For the network we are considering, it is always possible to choose an input that realizes a steady coactivation of the given set of neurons. Hence, the first condition is readily satisfied. Consequently, whether a set is permitted or forbidden depends only on its stability, which is determined by the synaptic connection matrix between the coactivated neurons. If the largest eigenvalue of that matrix is less than unity, then the set is permitted. Otherwise, it is forbidden.
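The eigenvalue criterion for permitted sets, together with the group winner-take-all behavior described in section 3, can be illustrated with a small simulation. This is a minimal Python sketch only; the five-neuron network, its two overlapping groups, and the inputs are hypothetical, chosen so that α < 1 (convergence) and β > 1 − α (group competition).

```python
import numpy as np

alpha, beta, dt = 0.5, 1.0, 0.01

xi = np.array([[1, 1, 1, 0, 0],   # group 1: neurons 0-2
               [0, 0, 1, 1, 1]])  # group 2: neurons 2-4
n = xi.shape[1]
J = np.prod(1 - xi[:, :, None] * xi[:, None, :], axis=0)  # equation 2.2
W = alpha * np.eye(n) - beta * J

def permitted(active):
    # A set is permitted iff the largest eigenvalue of W restricted
    # to the set is less than unity (W is symmetric here).
    sub = W[np.ix_(active, active)]
    return bool(np.max(np.linalg.eigvalsh(sub)) < 1.0)

# Forward Euler on equation 2.4; group 1 receives the larger input.
b = np.array([1.0, 0.9, 0.2, 0.8, 0.7])
x = np.full(n, 0.1)
for _ in range(20000):
    x += dt * (-x + np.maximum(b + W @ x, 0.0))

winners = [i for i in range(n) if x[i] > 1e-6]
print(winners)             # the coactive set is contained in group 1
print(permitted(winners))  # and it is a permitted set
print(permitted([0, 3]))   # a cross-group pair is forbidden
```

At the steady state, each winning neuron's activity equals its rectified input amplified by 1/(1 − α), as stated in section 3, while the neurons of the losing group are silenced.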
One special property of the permitted and forbidden sets is that any superset of a forbidden set is forbidden, and any subset of a permitted set is permitted (Hahnloser et al., 2000). An intuitive understanding of this property is that by inactivating a neuron, its feedback is removed. Because the connections in a symmetric network form effectively positive feedback
Generalized Winner-Take-All Networks
loops, in the form of mutual excitation or disinhibition, removing feedback increases the stability of the network. Similarly, adding positive feedback decreases stability, in agreement with the property that any superset of a forbidden set is forbidden. This property makes it convenient to verify whether a set is permitted or forbidden; for example, if we know that a subset of a set is forbidden, then the set itself is forbidden. We will use this property in the following sections.

4.3 Relationship Between Groups and Permitted Sets. The network in equation 2.4 is constructed to make the groups and their subgroups the only permitted sets of the network. To determine whether this is the case, we must answer two questions. First, are all groups and their subgroups permitted? Second, are all permitted sets contained in the given groups? The first question is answered by the following lemma.

Lemma 1. All groups and their subgroups are permitted.

Proof. If a set is contained in a group, then there is no lateral inhibition between the neurons in the set. Provided that $\alpha < 1$, all eigenvalues of the interaction matrix between neurons in the group are less than unity, so the set is permitted.
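The eigenvalue criterion of section 4.2 gives a direct way to enumerate permitted sets for a small network. In the sketch below, the parameter values and the two groups are illustrative choices of ours, with $W_{ii} = \alpha$ and $W_{ij} = -\beta$ exactly when $i \ne j$ share no group:

```python
import numpy as np
from itertools import combinations

alpha, beta = 0.4, 1.0
groups = [{0, 1}, {2}]          # illustrative grouping of three neurons
n = 3
J = np.array([[0.0 if i == j or any({i, j} <= g for g in groups) else 1.0
               for j in range(n)] for i in range(n)])
W = alpha * np.eye(n) - beta * J

def is_permitted(S):
    # A set is permitted iff the largest eigenvalue of W restricted to S is < 1.
    sub = W[np.ix_(S, S)]
    return np.max(np.linalg.eigvalsh(sub)) < 1.0

permitted = [S for k in range(1, n + 1) for S in combinations(range(n), k)
             if is_permitted(list(S))]
print(permitted)   # -> [(0,), (1,), (2,), (0, 1)]
```

Only the groups and their subsets come out permitted; any set mixing the two groups, such as $\{0, 2\}$, has eigenvalue $\alpha + \beta = 1.4 > 1$ and is forbidden, illustrating both lemma 1 and the superset property.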
The answer to the second question, whether all permitted sets are contained in the groups, is not necessarily affirmative. For example, consider the network defined by the group membership matrix $\xi = \{(1,1,0), (0,1,1), (1,0,1)\}$. Since every pair of neurons belongs to some group, there is no lateral inhibition ($J = 0$), which means that there are no forbidden sets. As a result, $(1,1,1)$ is a permitted set, but obviously it is not contained in any group. Let us define a spurious permitted set to be a permitted set that is not contained in any group; $(1,1,1)$ is a spurious permitted set in the above example. To eliminate all spurious permitted sets in the network, certain conditions on the group membership matrix $\xi$ have to be satisfied.

Definition 1. The membership matrix $\xi$ is degenerate if there exists a set of $k \ge 3$ neurons that is not contained in any group, but all of its subsets with $k - 1$ neurons belong to some group. Otherwise, $\xi$ is called nondegenerate.

For example, $\xi = \{(1,1,0), (0,1,1), (1,0,1)\}$ is degenerate. Using this definition, we can formulate the following theorem.

Theorem 2. The neural dynamics, equation 2.4, with $\alpha < 1$ and $\beta > 1 - \alpha$ has a spurious permitted set if and only if $\xi$ is degenerate.
Xiaohui Xie, Richard H. R. Hahnloser, and H. Sebastian Seung
To prove the theorem, we need the following lemma.

Lemma 2. If $\beta > 1 - \alpha$, any set containing two neurons that are not in any same group is forbidden under the neural dynamics, equation 2.4.

Proof. We start with the simple case of two neurons belonging to two different groups. Let the group membership be $\xi = \{(1,0), (0,1)\}$. In this case, $W = \{(\alpha, -\beta); (-\beta, \alpha)\}$. This matrix has eigenvectors $(1,1)^T$ and $(1,-1)^T$, with eigenvalues $\alpha - \beta$ and $\alpha + \beta$, respectively. Since $\alpha < 1$ for convergence to a steady state and $\beta > 0$ by definition, the $(1,1)^T$ mode is always stable; but if $\beta > 1 - \alpha$, the $(1,-1)^T$ mode is unstable. This means that the two neurons cannot be coactivated at a stable steady state. Since any superset of a forbidden set is also forbidden, the result generalizes to more than two neurons.
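The mutual exclusion asserted by lemma 2 can be observed by integrating the dynamics directly. In this sketch, the parameter values, group sizes, and inputs are illustrative assumptions of ours:

```python
import numpy as np

def simulate(J, alpha, beta, b, x0, dt=0.01, steps=40000):
    # Euler integration of dx_i/dt = -x_i + [b_i + alpha*x_i - beta*(J x)_i]^+
    x = x0.copy()
    for _ in range(steps):
        x += dt * (-x + np.maximum(b + alpha * x - beta * (J @ x), 0.0))
    return x

n = 6
member = [0, 0, 0, 1, 1, 1]                    # group label of each neuron
J = np.array([[1.0 if member[i] != member[j] else 0.0
               for j in range(n)] for i in range(n)])
alpha, beta = 0.4, 1.0                         # beta > 1 - alpha: strong inhibition
b = np.array([1.0, 1.0, 1.0, 0.9, 0.9, 0.9])   # group 0 receives the larger input
x = simulate(J, alpha, beta, b, np.full(n, 0.1))
print(x > 1e-6)   # coactivation across the two groups is forbidden
```

Starting from a symmetric state, the group with the larger input wins: its neurons settle at $b_i/(1-\alpha)$ while the losing group is driven to zero.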
Now we are ready to prove theorem 2 using lemma 2.

Proof of Theorem 2. If $\xi$ is degenerate, there must exist a set of $k \ge 3$ neurons that is not contained in any group, but all of its subsets with $k - 1$ neurons belong to some group. There is no lateral inhibition between these $k$ neurons, since every pair of them belongs to some group. Thus, the set containing all $k$ neurons is permitted and spurious. Conversely, if there exists a spurious permitted set $P$, we need to prove that $\xi$ must be degenerate. We prove this by contradiction and induction. Assume $\xi$ is nondegenerate. $P$ must contain at least two neurons, since any one-neuron subset is permitted and not spurious. By lemma 2, these two neurons must be contained in some group, or else the set is forbidden. Thus, $P$ must contain at least three neurons to be spurious, and any pair of neurons in $P$ belongs to some group by lemma 2. If $P$ contains at least $k$ neurons and all of its subsets with $k - 1$ neurons belong to some group, then the set of these $k$ neurons must itself belong to some group; otherwise, $\xi$ would be degenerate. Thus, $P$ must contain at least $k + 1$ neurons to be spurious, and all of its $k$-neuron subsets must belong to some group. By induction, $P$ must contain all neurons in the network, in which case $P$ is either forbidden or nonspurious. This contradicts the assumption that $P$ is a spurious permitted set.

Remark. The group winner-take-all competition described above holds only in the case of strong inhibition, $\beta > 1 - \alpha$. If $\beta$ is small, the competition is weak and may not result in group winner-take-all. In particular, if $\beta < (1-\alpha)/\lambda_{\max}(-J)$, where $\lambda_{\max}(-J)$ is the largest eigenvalue of $-J$, then the set of all neurons is permitted. Since every subset of a permitted set is permitted, there are then no forbidden sets, and the network is monostable.
Hence, group winner-take-all does not hold. In the intermediate regime, $(1-\alpha)/\lambda_{\max}(-J) < \beta < 1 - \alpha$, the network has forbidden sets, but the possibility of spurious permitted sets cannot be excluded.

5 The Potential Winners
We have seen that if $\xi$ is nondegenerate, any stable coactive set of neurons must be contained in a group, provided that lateral inhibition is strong ($\beta > 1 - \alpha$). The group that contains the coactive set is the "winner" of the competition between groups. The identity of the winner depends on the input $b$ and also on the initial conditions of the dynamics. Suppose the $a$th group is the winner. For all neurons not in this group to be inactive, the self-consistent condition reads

$$\sum_i [b_i]^+ \xi_i^a J_{ij} \ge (1-\alpha)\,\beta^{-1}\,[b_j]^+, \qquad (5.1)$$

for all $j \notin a$. If group $a$ contains the neuron with the largest input, this condition is always satisfied; hence, any group containing the neuron with the largest input is always a potential winner. In the case of nonoverlapping groups, the condition in equation 5.1 simplifies to

$$\sum_i [b_i]^+ \xi_i^a \ge (1-\alpha)\,\beta^{-1}\,\max_{j \notin a}\{[b_j]^+\}, \qquad (5.2)$$
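Condition 5.2 is easy to evaluate numerically. In the sketch below, the membership matrix, inputs, and parameters are illustrative assumptions of ours:

```python
import numpy as np

alpha, beta = 0.4, 1.0
xi = np.array([[1, 1, 1, 0, 0, 0],
               [0, 0, 0, 1, 1, 1]])            # one row per (non-overlapping) group
b = np.array([2.0, 0.1, 0.1, 0.5, 0.5, 0.5])   # inputs; neuron 0 carries b_max

B = xi @ np.maximum(b, 0.0)                    # aggregate group inputs
threshold = (1 - alpha) / beta * b.max()       # (1 - alpha)/beta * b_max
potential_winners = np.nonzero(B >= threshold)[0]
print(B, threshold, potential_winners)
```

Here $B = (2.2, 1.5)$ and the threshold is $0.6 \times 2.0 = 1.2$: group 0 contains the neuron with the largest input and so is always a potential winner, and group 1 also qualifies because its aggregate input exceeds the threshold.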
The potential winners are therefore determined by the aggregate group inputs $B_a = \sum_i [b_i]^+ \xi_i^a$. Denote the largest input by $b_{\max} = \max_i\{b_i\}$, and assume $b_{\max} > 0$. Only those groups whose aggregate inputs are not smaller than $(1-\alpha)\beta^{-1} b_{\max}$ can win, with the exact identity of the winner determined by the initial conditions of the dynamics.

6 An Example: The Ring Network
In this section, we take the ring network as an example to illustrate several of the results obtained so far. Let $n$ neurons be organized into a ring, and let every set of $d$ contiguous neurons form a group, so that in total there are $n$ patterns to be stored. In the special case $d = 1$, this network reduces to a traditional winner-take-all network. For $d > 1$, the groups overlap and $\xi$ can be degenerate; in fact, it can be shown that $\xi$ becomes degenerate when $d \ge n/3 + 1$. This is illustrated in Figure 1, which shows the permitted sets of a ring network with 15 neurons. If the group width is $d = 5$ neurons, there are no spurious permitted sets (see Figures 1A–1C). However, when the group width is 6, the network contains 5 spurious permitted sets (see Figure 1F).
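The degeneracy threshold $d \ge n/3 + 1$ can be confirmed by brute force for the 15-neuron ring (the helper functions are ours; the exhaustive search is exponential and feasible only for small $n$):

```python
from itertools import combinations

def ring_groups(n, d):
    # Every set of d contiguous neurons on the ring forms a group.
    return [{(i + j) % n for j in range(d)} for i in range(n)]

def in_some_group(subset, groups):
    return any(set(subset) <= g for g in groups)

def is_degenerate(groups, n):
    # Brute-force test of Definition 1.
    for k in range(3, n + 1):
        for cand in combinations(range(n), k):
            if not in_some_group(cand, groups):
                if all(in_some_group(s, groups)
                       for s in combinations(cand, k - 1)):
                    return True
    return False

print(is_degenerate(ring_groups(15, 5), 15))   # -> False: d = 5 < 15/3 + 1
print(is_degenerate(ring_groups(15, 6), 15))   # -> True:  d = 6 >= 15/3 + 1
```

For $d = 6$, the neurons $\{0, 5, 10\}$ witness the degeneracy: each pair fits inside a window of 6 contiguous neurons, but the triple does not.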
Figure 1: Permitted sets of the ring network. The ring network consists of 15 neurons with $\alpha = 0.4$ and $\beta = 1$. (A, D) The 15 groups are represented by columns. Black refers to active neurons, and white refers to inactive neurons. (A) Fifteen groups of width $d = 5$. (B) All permitted sets corresponding to the groups in A. (C) The 15 permitted sets in B that have no permitted supersets. They are the same as the groups in A. (D) Fifteen groups of width $d = 6$. (E) All permitted sets corresponding to the groups in D. (F) There are 20 permitted sets in E that have no permitted supersets. Note that there are five spurious permitted sets.
Figure 2 shows the effect of changing the strength of lateral inhibition. When inhibition is strong ($\beta > 1 - \alpha$), there are no spurious permitted sets, provided that $\xi$ is nondegenerate (see Figure 2D). At the other extreme, when $\beta < (1-\alpha)/\lambda_{\max}(-J)$, there is no unstable differential mode in the network; all neurons can be active at a stable steady state, given a suitable input (see Figure 2A). Between these two critical values, $(1-\alpha)/\lambda_{\max}(-J) < \beta < 1 - \alpha$, there exist both unstable differential modes and spurious permitted sets (see Figure 2C).

7 Storage Capacity for Random Sparse Groups
Figure 2: Lateral inhibition strength $\beta$ determines the behavior of the network. The network is a ring network of 15 neurons with width $d = 5$, $\alpha = 0.6$, and input $b_i = 1$ for all $i$. The panels show the steady-state activities of the 15 neurons. (A) There are no forbidden sets. (B) The marginal state $\beta = (1-\alpha)/\lambda_{\max}(-J) = 0.0874$, in which the network forms a continuous attractor. (C) Forbidden sets exist, and so do spurious permitted sets. (D) Group winner-take-all; no spurious permitted sets.

An important characterization of any attractor network is its storage capacity for random patterns, that is, random groups (Amit, Gutfreund, & Sompolinsky, 1985; Miller & Zucker, 1999). In our case, as the number of groups grows, the probability of the groups' being degenerate increases. We define the error probability as the probability that a neuron outside a group is activated by mistake. We choose random sparse groups: $p \ll 1$ is the probability that a particular neuron is part of a particular group. The storage capacity is defined to be the maximum number of groups the network can store such that the error probability remains smaller than a given bound. After constructing the synaptic weight matrix, we present random inputs to the network. We assume that each component of the input $b$ has probability $q$ of being positive. The expected number of neurons receiving positive inputs is $\bar{n} = nq$. Since a neuron receiving a nonpositive input can never become active in our network, the error probability is effectively determined by the network of the $\bar{n}$ neurons receiving positive inputs.
Figure 3: Diagram of $m$ random groups. Filled circles represent active neurons. The first $c$ neurons in group 1 are coactivated. For perfect retrieval, all the other $\bar{n} - c$ neurons must be inactive; that is, each must be inhibited by at least one of the $c$ active neurons. The error probability $P_E$ is the probability that at least one of the $\bar{n} - c$ neurons is active.
Under the randomness in both the groups and the inputs, the expected number of coactive neurons at a stable steady state is $c = \bar{n}p = npq$. Next, we assume that $c$ neurons are coactivated and calculate the probability $P_E$ of mistakenly activating any of the other $\bar{n} - c$ neurons. We use $X(i,j)$ to denote the existence of synaptic inhibition between neurons $i$ and $j$, which in our network implies that neurons $i$ and $j$ are not contained in any same group (see Figure 3). According to lemma 2, $X(i,j)$ also represents mutual exclusion of neurons $i$ and $j$ at any stable steady state. Without loss of generality, we index the $c$ active neurons from 1 to $c$. For a neuron $j$ among the other $\bar{n} - c$ neurons to be inactive, it must receive an inhibitory connection from at least one of the $c$ active neurons. The probability of this happening is $\Pr\{\bigvee_{i=1}^{c} X(i,j)\}$, where $\bigvee$ represents logical OR. Extending this to all the other $\bar{n} - c$ neurons, we derive the probability that all $\bar{n} - c$ neurons are inactive as follows:
$$P_C = \Pr\left\{\bigwedge_{j=1}^{\bar{n}-c} \bigvee_{i=1}^{c} X(i,j)\right\}, \qquad (7.1)$$
where $\bigwedge$ represents logical AND. The error probability $P_E$, the probability that at least one neuron is mistakenly activated, is then

$$P_E = 1 - P_C = \Pr\left\{\bigvee_{j=1}^{\bar{n}-c} \bigwedge_{i=1}^{c} \overline{X(i,j)}\right\}, \qquad (7.2)$$
where the overbar denotes logical complement. Next, we find an upper bound on $P_E$ and use it to estimate the capacity of the network.

7.1 Capacity. The error probability is bounded above by
$$P_E \le (\bar{n}-c)\,\Pr\left\{\overline{\bigvee_{i=1}^{c} X(i,j)}\right\} = (\bar{n}-c)\left(1 - \Pr\left\{\bigvee_{i=1}^{c} X(i,j)\right\}\right), \qquad (7.3)$$
where $\Pr\{\bigvee_{i=1}^{c} X(i,j)\}$ can be calculated exactly using the inclusion-exclusion principle (Stanley, 1999) as follows:

$$\Pr\left\{\bigvee_{i=1}^{c} X(i,j)\right\} = \sum_{k=1}^{c} (-1)^{k+1} \binom{c}{k} \Pr\left\{\bigwedge_{l=1}^{k} X(i_l, j)\right\} \qquad (7.4)$$

$$= \sum_{k=1}^{c} (-1)^{k+1} \binom{c}{k} \left[1 - p + p(1-p)^k\right]^{m-1}. \qquad (7.5)$$
In the above equation, the term $1 - p + p(1-p)^k$ represents the probability that a given group does not make neuron $j$ coexist with any of $k$ given neurons. This can happen in two ways: neuron $j$ does not belong to the group (with probability $1-p$), or neuron $j$ belongs to the group but none of the other $k$ neurons do (with probability $p(1-p)^k$). Equation 7.5 can be further simplified:
$$\Pr\left\{\bigvee_{i=1}^{c} X(i,j)\right\} = 1 + \sum_{k=0}^{c} (-1)^{k+1} \binom{c}{k} \left[1 - kp^2 + O(p^3)\right]^{m-1} \qquad (7.6)$$

$$\approx 1 - \sum_{k=0}^{c} (-1)^{k} \binom{c}{k} \exp(-kmp^2) \qquad (7.7)$$

$$= 1 - \left[1 - \exp(-mp^2)\right]^c. \qquad (7.8)$$
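The passage from equation 7.5 to equation 7.8 can be checked numerically in the sparse regime; the parameter values below are illustrative choices of ours satisfying the stated assumptions ($cp \ll 1$ and $m(cp^2)^2$ small):

```python
from math import comb, exp

c, m, p = 5, 200, 0.02
# Exact alternating sum, equation 7.5:
exact = sum((-1) ** (k + 1) * comb(c, k)
            * (1 - p + p * (1 - p) ** k) ** (m - 1)
            for k in range(1, c + 1))
# Approximation, equation 7.8:
approx = 1 - (1 - exp(-m * p ** 2)) ** c
print(exact, approx)   # the two agree to within the neglected O(m (c p^2)^2) terms
```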
Figure 4: The error probability $P_E$ plotted as a function of the number of groups $m$. Here, $q = 1$ and the number of neurons is $n = 100$. The solid curves show the results of numerical simulations, and the dashed curves are the upper bounds calculated from equation 7.9. Two different sparsities $p$ are used.

We have made two approximations in the above calculation. To derive equation 7.6, we assumed that $cp$ is sufficiently small, implying sparse groups. In the approximation made in equation 7.7, we assumed $m(cp^2)^2 \to 0$ in the large-$n$ limit; that is, the number of groups $m$ should scale in order less than $1/(cp^2)^2$. We will see later, after deriving the capacity $m^*$, that these assumptions are indeed satisfied. Substituting equation 7.8 into equation 7.3, we derive the upper bound on the error probability:
$$P_E \le (\bar{n} - c)\left[1 - \exp(-mp^2)\right]^c. \qquad (7.9)$$
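A small Monte Carlo sketch of the Figure 3 setup (with group 1 fixed to be exactly the $c$ active neurons, and all parameter values our own illustrative choices) reproduces the marginal per-neuron error probability predicted by equation 7.5:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, m, p, c, trials = 100, 20, 0.15, 5, 1000
errors = 0
for _ in range(trials):
    xi = rng.random((m, n)) < p          # xi[g, i]: neuron i belongs to group g
    xi[0] = False
    xi[0, :c] = True                     # group 1 = exactly the c active neurons
    co = xi[:, :c].astype(int).T @ xi.astype(int)    # c x n co-membership counts
    errors += np.sum(np.all(co[:, c:] > 0, axis=0))  # j shares a group with every
                                                     # active neuron: no inhibition
mc = errors / (trials * (n - c))

# Marginal error 1 - Pr{V_i X(i,j)} from the inclusion-exclusion formula 7.5:
exact = 1 - sum((-1) ** (k + 1) * comb(c, k)
                * (1 - p + p * (1 - p) ** k) ** (m - 1)
                for k in range(1, c + 1))
print(mc, exact)
```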
We observed that this bound is close to the true error probability obtained from numerical simulations of random groups (see Figure 4). Given some small number $\delta$, the error probability $P_E$ is guaranteed not to exceed $\delta$, provided that $m < m^*(\delta)$, where

$$m^*(\delta) = -p^{-2}\ln\left\{1 - \left[\delta/(\bar{n}-c)\right]^{1/c}\right\} \qquad (7.10)$$

$$\approx -p^{-2}\ln\left\{1 - \left[\delta/(nq)\right]^{1/(npq)}\right\}, \qquad (7.11)$$

where equation 7.11 follows from $c \ll \bar{n}$. Given $n$, $p$, and $q$, equation 7.11 estimates the maximum number of random groups the network can store such that the probability of incorrect retrieval remains smaller than $\delta$.
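For concreteness, equation 7.11 can be evaluated directly; the numbers below are illustrative choices of ours:

```python
from math import log

# n = 100 neurons, dense inputs q = 1, sparsity p = 0.05, tolerance delta = 0.1.
n, q, p, delta = 100, 1.0, 0.05, 0.1
c = n * p * q                                        # expected coactive neurons
m_star = -p ** -2 * log(1 - (delta / (n * q)) ** (1 / c))
print(m_star)   # roughly 116 groups can be stored at this error tolerance
```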
7.2 Optimal Sparsity. How sparse should the random groups optimally be? We define the optimal sparsity $p^*$ as the sparsity $p$ that maximizes the information capacity of the network. We measure the information capacity $I$ by the normalized entropy of the $m^*$ random groups,

$$I = -m^* n\left[p\log_2 p + (1-p)\log_2(1-p)\right]/n^2. \qquad (7.12)$$
In other words, $I$ is the entropy of $m^*$ binary words of length $n$, each bit being 1 with probability $p$, normalized by the number of synaptic connections. The denominator $n^2$ corresponds to the total entropy the binary synaptic weight matrix $J$ can hold. Thus, $I$ is expressed as the fraction of the possible entropy of $J$ used to store groups. The optimal sparsity is given by $p^* = \arg\max_p\{I\}$. To calculate $p^*$, we first have to choose a value for $q$, the probability that a neuron receives an excitatory input. We consider two cases. First, $q$ is independent of $p$; without loss of generality, we take $q = 1$, corresponding to inputs that are excitatory and nonsparse. Second, $q$ depends on $p$; for simplicity, we choose $q = p$, corresponding to inputs of the same sparsity as the groups. The optimal sparsity calculations for both cases are derived in the appendix. Here we state only the results:
$$p^* \approx \begin{cases} \log_2(n)/n & \text{when } q = 1 \\ \sqrt{\kappa\ln(n)/n} & \text{when } q = p, \end{cases} \qquad (7.13)$$
where $\kappa = 2.86$ is a constant. The approximation becomes exact as the number of neurons goes to infinity. This result shows that to achieve the maximum information capacity, $p^*$ should scale as $\ln(n)/n$ when $q = 1$ and as $\sqrt{\ln(n)/n}$ when $q = p$. Correspondingly, the average number of neurons in each pattern scales as $\ln(n)$ for $q = 1$ and as $\sqrt{n\ln(n)}$ for $q = p$. Substituting $p^*$ into equation 7.11, we derive the storage capacity for these optimal sparsities,
$$m^* \approx \begin{cases} a\,n^{2-1/c} & \text{when } q = 1 \\ \kappa_m\,n/\ln(n) & \text{when } q = p, \end{cases} \qquad (7.14)$$

where $c \approx \log_2(n)$, $a = \delta^{1/c}/\log_2^2(n)$, and $\kappa_m = -\ln\left[1 - \exp(-1/(2\kappa^2))\right]/\kappa^2 \approx 0.35$. Since $\ln(n)$ hardly increases for large $n$, the capacity roughly scales as $n^2$ in the $q = 1$ case and as $n$ in the $q = p$ case. Equation 7.7 was derived under the assumption that $m(cp^2)^2 \to 0$ in the large-$n$ limit. We now check the validity of this assumption. Self-consistently, in the case $q = 1$ we find $m^*(cp^{*2})^2 \sim 1/n^2$, and in the case $q = p$, $m^*(cp^{*2})^2 \sim 1/n$. Both approach zero in the large-$n$ limit.
8 Discussion
We have presented a network that uses structured lateral inhibition to mediate winner-take-all competition between potentially overlapping groups of neurons. Our construction uses the distinction between permitted and forbidden sets of neurons and identifies the allowed groupings with the permitted sets inherent in the network. Our capacity calculation in the $q = 1$ case reveals a similarity to the Willshaw model (Willshaw & Longuet-Higgins, 1970): we find that the optimal sparsity scales as $\ln(n)/n$; for example, for a network of $10^{10}$ neurons, an optimal group consists of fewer than 30 neurons and is thus unrealistically small. In the case where inputs are sparse, $q = p$, we find that the optimal group size scales roughly as $\sqrt{n}$ and is thus within the realm of real networks. A distinctive feature of our generalized winner-take-all network is the coexistence of discrete pattern selection and analog computation. We use strong lateral inhibitory interactions to constrain certain groupings of neurons but leave the analog values of the active neurons unconstrained, except by the input. It might be interesting to apply our principle of constraining active groups to the problem of data reconstruction using a constrained set of basis vectors. The constraints on the linear combination of basis vectors could, for example, implement sparsity or nonnegativity constraints (Lee & Seung, 1999). The coexistence of analog filtering with logical constraints on neural activation represents a form of hybrid analog-digital computation that may be especially appropriate for perceptual tasks. Using this network model for object recognition, the perception of an object could be represented by the set of active neurons, while the activities of these neurons correspond to continuous instantiations of the object such as viewpoint, illumination, and scale (Seung & Lee, 2000).
In addition, this type of network may constitute a neural mechanism for feature binding and sensory segmentation, as suggested by Wersing et al. (Wersing, Steil, & Ritter, 2001; Wersing, 2002). In the domain of olfactory perception, recent experimental data on odor-evoked population responses in the olfactory bulb also suggest promising applications of our model (Hildebrand & Shepherd, 1997; Christensen, Pawlowski, Lei, & Hildebrand, 2000; Mori, Nagao, & Yoshihara, 1999). As we have shown, there are some degenerate cases of overlapping groups to which our method does not apply. It is an interesting open question whether there exists a general way of translating arbitrary groups of coactive neurons into permitted sets without introducing spurious permitted sets. There are several possible approaches. For example, we could use a more sophisticated interaction matrix, including both lateral inhibition and excitation. For instance, in the three-neuron degenerate example given earlier, if we choose the interaction matrix $W = a\,vv^T$ with $v = [1, -1, 1]^T$ and $1/3 < a < 1/2$, then the spurious set $(1,1,1)$ is forbidden, whereas
its subsets are still permitted. Another possible approach would be to use higher-order interactions. Take again the three-neuron degenerate case as an example. If we added quadratic interactions to the dynamics, $\dot{x}_i + x_i = [b_i + \alpha x_i - \beta\sum_j J_{ij}x_j - \gamma\sum_{j,k} x_j x_k]^+$, then for large enough inputs and suitable parameters, the set $(1,1,1)$ would not be permitted, but its subsets would. One more possible approach would be to use hierarchical networks with interlayer excitation and intralayer inhibition. In the past, a great deal of research has been inspired by the idea of storing memories as fixed-point attractors in neural networks with a fixed input. Our model suggests an alternative viewpoint: permitted sets are memories latent in the synaptic connections, while the fixed points corresponding to permitted sets can change continuously with the input. From this viewpoint, the contribution of this article is a method of storing and retrieving memories as permitted sets in neural networks.

Appendix: Calculation of the Optimal Sparsity for Random Patterns
Start from the information capacity of the network, given by

$$I = -m^* n\left[p\log_2 p + (1-p)\log_2(1-p)\right]/n^2 \approx \log_2(p)\,(pn)^{-1}\ln\left\{1 - \left[(nq)^{-1}\delta\right]^{1/(npq)}\right\}.$$

Here, $m^*$ is taken from equation 7.11, and the approximation is made in the small-$p$ limit. Next, we consider two cases for the choice of $q$ and find the optimal $p^* = \arg\max_p\{I\}$ in each case. The calculation is done under the condition that the number of neurons $n$ is sufficiently large.

A.1 Dense Inputs, $q = 1$. The information capacity can be written as $I = c^{-1}\ln\left(1 - (\delta/n)^{1/c}\right)\log_2(c/n)$, where $c = pn$. Setting the derivative of $I$ with respect to $c$ equal to zero, we find
$$\ln(\delta/n)^{1/c}\,\ln(c/n) + \left[(\delta/n)^{-1/c} - 1\right]\ln\left[1 - (\delta/n)^{1/c}\right]\left[1 - \ln(c/n)\right] = 0.$$
Let $z = (\delta/n)^{1/c}$. Then we have

$$\ln z\,\ln(c/n) + (z^{-1} - 1)\ln(1-z)\left[1 - \ln(c/n)\right] = 0.$$
Under the sparsity assumption $p = c/n \ll 1$, we have $|\ln(c/n)| \gg 1$. Hence, the above equation simplifies to

$$(1-z)\ln(1-z) = z\ln(z). \qquad (A.1)$$

The solutions of this equation are $z = 0$, $1/2$, and $1$. Given a fixed $n$, $c$ can only be a finite number, so the solution $z = 1$ is impossible. The other two solutions lead to $c = 0$ or $c = \log_2(n)$; correspondingly, $p = 0$ or $p = \log_2(n)/n$. Substituting $p = 0$ into $I$, we find that $p = 0$ corresponds to a local minimum. Furthermore, the boundary value $p = 1$ also corresponds to a local minimum. We conclude that the optimal probability is $p^* = \log_2(n)/n$; notice that it satisfies the sparsity assumption ($p^* \ll 1$).

A.2 Sparse Inputs, $q = p$. The derivative of $I$ with respect to $p$ is
$$I'(p) = \left\{\ln(1-t)\left[1 - \ln(p)\right] + t(1-t)^{-1}\frac{\ln(p)}{np^2}\left[1 + 2\ln\!\left(\delta/(pn)\right)\right]\right\}\Big/\left(np^2\ln 2\right)$$

$$\approx -\left[1 - \frac{2\kappa\ln(pn)}{np^2}\right]\ln(p)\ln(1-t)\Big/\left(np^2\ln 2\right),$$
where $\kappa \equiv -t\left[(1-t)\ln(1-t)\right]^{-1}$ and $t \equiv \left[\delta/(pn)\right]^{1/(p^2 n)}$. To derive the above equation, we have neglected small terms by assuming that $n$ is sufficiently large. Setting $I'(p^*) = 0$, we find that $p^*$ obeys

$$\frac{\ln(p^* n)}{n p^{*2}} = \frac{1}{2\kappa}.$$
Deriving the exact form of $p^*$ as a function of $n$ from this equation is not easy. However, when $n$ is sufficiently large, we can simplify the calculation by assuming that $\kappa$ is independent of $n$. Under this ansatz, $p^*$ scales as

$$p^* = \sqrt{\kappa\ln(n)/n}. \qquad (A.2)$$

Next, we verify self-consistently that the ansatz holds by substituting $p^*$ into the definition of $t$:

$$\ln t = \frac{\ln(\delta) - \ln\left[\kappa n\ln(n)\right]/2}{\kappa\ln(n)} \approx -\frac{1}{2\kappa}. \qquad (A.3)$$
The approximation becomes exact as $n$ goes to infinity. Thus, we have verified that $t$ is approximately constant, equal to $\exp(-1/(2\kappa))$; this completes the ansatz. It remains to determine the value of $\kappa$. Substituting equation A.3 into the definition of $\kappa$, we derive

$$\frac{1-t}{2t} = \frac{\ln(t)}{\ln(1-t)}. \qquad (A.4)$$

The root of this algebraic equation can be found numerically; the result is $t = 0.8396$ and $\kappa = 2.86$. We can further check that the boundary values of $p$ at 0 and 1 lead only to local minima of $I$. Therefore, we conclude that $p^*$ in equation A.2 is the optimal sparsity.
Acknowledgments
We acknowledge comments on the manuscript for this article by M. Goldman and R. Tedrake.

References

Amari, S., & Arbib, M. A. (1977). Competition and cooperation in neural nets. In J. Metzler (Ed.), Systems neuroscience (pp. 119–165). San Diego, CA: Academic Press.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Statistical mechanics of neural networks near saturation. Ann. Phys. (New York), 173, 30.
Christensen, T. A., Pawlowski, V. M., Lei, H., & Hildebrand, J. G. (2000). Multi-unit recordings reveal context-dependent modulation of synchrony in odor-specific neural ensembles. Nature Neuroscience, 12, 927–931.
Coultrip, R., Granger, R., & Lynch, G. (1992). A cortical model of winner-take-all competition via lateral inhibition. Neural Networks, 5, 47–54.
Feng, J., & Hadeler, K. (1996). Qualitative behaviour of some simple networks. J. Phys. A, 29, 5019–5033.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Physical Review A, 41, 1843–1854.
Hahnloser, R. H. (1998). About the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11, 691–697.
Hahnloser, R. H., Sarpeshkar, R., Mahowald, M., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analog amplification coexist in a silicon circuit inspired by cortex. Nature, 405, 947–951.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 499–567). Cambridge, MA: MIT Press.
Hildebrand, J. G., & Shepherd, G. M. (1997). Mechanisms of olfactory discrimination: Converging evidence for common principles across phyla. Annu. Rev. Neurosci., 20, 595–631.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Khalil, H. K. (1996). Nonlinear systems (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.
Maass, W. (2000). On the computational power of winner-take-all. Neural Computation, 12, 2519–2535.
Miller, D. A., & Zucker, S. W. (1999). Computing with self-excitatory cliques: A model and an application to hyperacuity-scale computation in visual cortex. Neural Computation, 11, 21–66.
Mori, K., Nagao, H., & Yoshihara, Y. (1999). The olfactory bulb: Coding and processing of odor molecule information. Science, 286, 711–715.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. USA, 93, 11956–11961.
Seung, H. S., & Lee, D. D. (2000). Cognition: The manifold ways of perception. Science, 290, 2268–2269.
Stanley, R. P. (1999). Enumerative combinatorics (Vol. 1). Cambridge: Cambridge University Press.
Sum, J. P. F., & Tam, P. K. S. (1996). Note on the maxnet dynamics. Neural Computation, 8(3), 491–499.
Wersing, H. (2002). Learning lateral interactions for feature binding and sensory segmentation. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Wersing, H., Beyn, W.-J., & Ritter, H. (2001). Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13(8), 1811–1825.
Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 13, 357–387.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Willshaw, D., & Longuet-Higgins, H. (1970). Associative memory models. In B. Meltzer & D. Michie (Eds.), Machine intelligence (Vol. 5).

Received January 8, 2002; accepted May 7, 2002.
LETTER
Communicated by Volker Tresp
An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models

Harri Valpola
[email protected]
Juha Karhunen
[email protected]
Helsinki University of Technology, Neural Networks Research Centre, FIN-02015 HUT, Espoo, Finland

A Bayesian ensemble learning method is introduced for unsupervised extraction of dynamic processes from noisy data. The data are assumed to be generated by an unknown nonlinear mapping from unknown factors. The dynamics of the factors are modeled using a nonlinear state-space model. The nonlinear mappings in the model are represented using multilayer perceptron networks. The proposed method is computationally demanding, but it allows the use of higher-dimensional nonlinear latent variable models than other existing approaches. Experiments with chaotic data show that the new method is able to blindly estimate the factors and the dynamic process that generated the data. It clearly outperforms currently available nonlinear prediction techniques in this very difficult test problem.

1 Introduction
Neural Computation 14, 2647–2692 (2002) © 2002 Massachusetts Institute of Technology

In unsupervised learning, the goal is to build a model that captures the statistical regularities in the observed data and provides a compact and meaningful representation for the data. From such a representation, it is often much easier to understand the basic characteristics of the data than directly from the raw data. In our case, we wish to find a good nonlinear model for the dynamic process and the underlying factors or sources that have generated the observed data. The proposed method can be viewed as a nonlinear dynamic extension of standard linear principal component analysis (PCA) or factor analysis (FA). It is also related to nonlinear blind source separation (BSS) and independent component analysis (ICA). For these problems, we have recently developed several methods, described in more detail in Hyvärinen, Karhunen, and Oja (2001); Lappalainen and Honkela (2000); Valpola, Giannakopoulos, Honkela, and Karhunen (2000); and Valpola (2000b). In particular, nonlinear factor analysis (NFA) (Lappalainen & Honkela, 2000) serves as the starting point of the new method introduced in this
article. In NFA, it is assumed that the noisy data vectors $x(t)$ have been generated by the factors or source signals $s(t)$ through a nonlinear mapping $f$. Hence the data vectors are modeled by

$$x(t) = f(s(t)) + n(t), \qquad (1.1)$$
where $n(t)$ denotes additive gaussian noise. More specifically, in the NFA data model, equation 1.1, it is assumed that the factors or sources $s_i(t)$ (components of the vector $s(t)$) are gaussian, and the nonlinear mapping $f$ is modeled by a standard multilayer perceptron (MLP) network. The MLP network (Bishop, 1995; Haykin, 1998) is used here because of its universal approximation property and its ability to model both mildly and strongly nonlinear mappings $f$. However, the cost function used in Bayesian ensemble learning is quite different from the standard mean-square error used in the backpropagation algorithm and its variants (Bishop, 1995; Haykin, 1998). Another important difference is that the learning procedure is completely unsupervised, or blind, because in NFA only the outputs $x(t)$ of the MLP network are known. We have extended NFA to so-called nonlinear independent factor analysis (NIFA) (Hyvärinen et al., 2001; Lappalainen & Honkela, 2000; Valpola et al., 2000; Valpola, 2000b). There the same model form, equation 1.1, is employed, but the sources $s_i(t)$ are assumed to be statistically independent and are modeled as mixtures of gaussians instead of plain gaussians. NIFA can be understood as a nonlinear BSS method, and it appears applicable to much larger-scale problems than most previously introduced nonlinear BSS methods, discussed in Hyvärinen et al. (2001). We have successfully applied the NFA and NIFA methods to finding useful compact representations for both artificial and real-world data sets (Lappalainen & Honkela, 2000; Valpola et al., 2000; Honkela & Karhunen, 2001). However, the NFA and NIFA methods still have the drawback that they do not take into account possible dependences between subsequent data vectors. This means that, as in standard principal or independent component analysis, the data vectors $x(t)$ can be randomly shuffled and the found factors or sources remain the same.
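A toy instance of the generative model in equation 1.1 can be written in a few lines; the layer sizes, the tanh nonlinearity, and the noise level below are illustrative choices of ours, not the parameterization used in the article:

```python
import numpy as np

# Gaussian sources pushed through a random one-hidden-layer MLP plus noise:
# x(t) = f(s(t)) + n(t).
rng = np.random.default_rng(0)
T, n_src, n_hid, n_obs = 200, 3, 10, 8
s = rng.normal(size=(T, n_src))                 # gaussian factors s(t)
W1 = rng.normal(size=(n_src, n_hid))
W2 = rng.normal(size=(n_hid, n_obs))
f = lambda s: np.tanh(s @ W1) @ W2              # MLP mapping f
x = f(s) + 0.1 * rng.normal(size=(T, n_obs))    # observed data vectors x(t)
print(x.shape)
```

In the blind setting of the article, only `x` would be available, and both the factors and the mapping would have to be estimated.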
While this may be acceptable in certain situations, temporally (or spatially) adjacent data vectors are often in practice more or less statistically dependent. This dependence should then be modeled appropriately, too, and used in learning to improve the results. In this article, we extend the NFA method so that temporal dependences are taken into account by using a nonlinear state-space model for the factors s(t). An important advantage of the proposed new method is its ability to learn a high-dimensional latent (source) space. We have also reasonably solved the computational and overfitting problems that have been major obstacles in developing this kind of unsupervised method thus far. Potential applications for our method include prediction and process monitoring, control, and identification.
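As a concrete, purely illustrative sketch of the data model, equation 1.1, the following pure-Python snippet draws gaussian sources, maps them through a small randomly weighted MLP of the form later used for f, and adds observation noise. All sizes, seeds, and weights here are made up for the example and are not from the article.

```python
import math
import random

random.seed(0)

def mlp(s, A, a, B, b):
    """y = B tanh(A s + a) + b, the MLP form used for the mapping f."""
    h = [math.tanh(sum(w * sj for w, sj in zip(row, s)) + ai)
         for row, ai in zip(A, a)]
    return [sum(w * hj for w, hj in zip(row, h)) + bi
            for row, bi in zip(B, b)]

# Illustrative (made-up) sizes: 2 sources, 3 hidden units, 4 observations.
m, nh, n = 2, 3, 4
A = [[random.gauss(0, 1) for _ in range(m)] for _ in range(nh)]
a = [random.gauss(0, 1) for _ in range(nh)]
B = [[random.gauss(0, 1) for _ in range(nh)] for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

# x(t) = f(s(t)) + n(t): gaussian factors mapped through f, plus noise.
data = []
for t in range(5):
    s_t = [random.gauss(0, 1) for _ in range(m)]
    data.append([fi + random.gauss(0, 0.1) for fi in mlp(s_t, A, a, B, b)])
```

Because the mapping is applied independently at each t, shuffling the rows of `data` would not change anything a static factor model could recover, which is exactly the limitation the state-space extension addresses.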
Learning Nonlinear State-Space Models
Harri Valpola and Juha Karhunen
In the next section, we briefly overview Bayesian methods, in particular ensemble learning, giving references to related work. The structure of the dynamic model is presented in more detail in section 3. After this, we consider the evaluation of the cost function. The overall description and a more detailed implementation of the learning method are presented in section 5, followed by experimental results in section 6. Future directions and potential problems are treated in section 7, and the conclusions of the study are given in the last section. Some detailed derivations have been moved to appendixes for better readability.

2 Bayesian Methods and Ensemble Learning

2.1 Bayesian Inference. In Bayesian data analysis and estimation methods (see, e.g., Bishop, 1995; Jordan, 1999; Gelman, Carlin, Stern, & Rubin, 1995; Neal, 1996) for continuous variables, all the uncertain quantities are modeled in terms of their joint probability density function (pdf). The key principle is to construct the joint posterior pdf for all the unknown quantities in a model, given the data sample. This posterior density contains all the relevant information on the parameters to be estimated in parametric models, or on the predictions in nonparametric prediction or classification tasks. Denote by H the particular model under consideration and by θ the set of model parameters that we wish to infer from a given data set X. The posterior probability density p(θ | X, H) of the parameters given the data X and the model H can be computed from Bayes' rule:
p(θ | X, H) = p(X | θ, H) p(θ | H) / p(X | H).    (2.1)
Here p(X | θ, H) is the likelihood of the parameters θ, p(θ | H) is the prior pdf of the parameters, and p(X | H) is a normalizing constant.¹ The term H denotes all the assumptions made in defining the model, such as the choice of an MLP network and a specific noise model. We have explicitly included it in Bayes' rule, equation 2.1, to emphasize the dependence of the results on these modeling assumptions. Using the above notation, the normalizing term

p(X | H) = ∫ p(X | θ, H) p(θ | H) dθ    (2.2)
can be directly understood as the marginal probability of the observed data X assuming the model H (Bishop, 1995). Therefore it is called the evidence

¹ We have dropped the subscripts of all pdf's to simplify the notation; they are assumed to be the same as the arguments of the pdf's in question unless otherwise stated.
of the model H. By evaluating the evidences p(X | H_i) for different models, one can choose the model with the highest evidence. This is the model that describes the observed data with the highest probability among the compared models. The parameters θ of a particular model H_i are often estimated using the maximum likelihood (ML) method. This non-Bayesian method chooses the estimates θ̂ that maximize the likelihood p(X | θ, H) of the data, assuming that the parameters θ are nonrandom constants and omitting any potential prior information on them. The maximum a posteriori (MAP) method is the simplest Bayesian method. It applies the same strategy to the posterior distribution p(θ | X, H) by finding the parameter values that maximize this distribution. This can be done by maximizing the numerator in equation 2.1, because the normalizing denominator in equation 2.2 does not depend on the parameters θ. However, using point estimates like those provided by the ML or MAP methods is often problematic. In these methods, model order estimation and overfitting (choosing too complicated a model for the given data) are severe problems. In high dimensions, the width of a peak of the posterior density is far more important to the probability mass than its height. For complex models, it is usually computationally prohibitive to express the posterior pdf exactly. This prevents the use of classical Bayesian methods, for example, minimum mean-square error estimation, in all but the simplest problems. Furthermore, such classical methods still provide a point estimate θ̂, which is somewhat arbitrarily chosen from the possible values of θ in the posterior density (Sorenson, 1980). Instead of searching for some point estimate, the correct Bayesian procedure is to perform estimation by averaging over the posterior distribution p(θ | X, H).
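The contrast between the height of a posterior peak and the posterior mass can be illustrated with a small sketch that is not from the article: a discrete posterior over a one-dimensional parameter grid is normalized with Bayes' rule, and the MAP value is compared with the posterior mean. The spike-plus-bump shape below is entirely made up for the demonstration.

```python
import math

# Hypothetical unnormalized posterior over a one-dimensional parameter grid:
# a tall, narrow spike near 0.1 plus a broad bump near 0.7 (both made up).
grid = [i / 1000 for i in range(1, 1000)]

def unnorm_posterior(t):
    spike = 5.0 if abs(t - 0.1) < 5e-4 else 0.0
    bump = math.exp(-((t - 0.7) ** 2) / (2 * 0.05 ** 2))
    return spike + bump

# Bayes' rule on the grid: normalize by the evidence (here a plain sum).
evidence = sum(unnorm_posterior(t) for t in grid)
posterior = [unnorm_posterior(t) / evidence for t in grid]

# The MAP estimate sits on the spike; almost all the mass is under the bump.
map_est = grid[max(range(len(grid)), key=lambda i: posterior[i])]
mean_est = sum(t * p for t, p in zip(grid, posterior))
assert abs(map_est - 0.1) < 1e-9
assert 0.6 < mean_est < 0.75
```

The MAP estimate picks the tallest point of the density, while averaging over the posterior responds to where the probability mass actually lies.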
This means that the estimates will be sensitive to regions where the probability mass is large instead of being sensitive to high values of the pdf (Lappalainen & Honkela, 2000; Miskin & MacKay, 2001). One can go even one step further if possible and reasonable (for example, in supervised learning problems) and make use of the complete set of models. This means that predicted values are obtained by computing a weighted sum over the predictions of several models, where the weighting coefficients are proportional to the evidence of each model (Bishop, 1995). More probable models therefore contribute more to the predicted or estimated values. Such a procedure appropriately solves the issues related to model complexity and the choice of a specific model among several candidates. The evidences p(X | H) of too simple models are low, because they leave a lot of the data unexplained. On the other hand, models that are too complex occupy little probability mass, although they often show a high but very narrow peak in their posterior pdf's corresponding to the overfitted parameters. The first practical Bayesian method for neural networks was introduced by MacKay (1992). For the posterior density of the parameters, he used
a gaussian approximation around their MAP estimate to evaluate the evidence of the model. However, such parametric approximation methods (Bishop, 1995) often do not perform well enough in practical real-world problems. Another group of Bayesian methods used in neural network models consists of Markov chain Monte Carlo (MCMC) techniques for numerical integration (Neal, 1996; Robert & Casella, 1999). They perform the integrations needed in evaluating the evidence, equation 2.2, and the posterior density, equation 2.1, numerically by drawing samples from the true posterior distribution. MCMC techniques have been successfully applied to practical supervised learning problems, for example, in Lampinen and Vehtari (2001) using an MLP network structure. However, MCMC methods cannot be used in large-scale unsupervised learning problems, because the number of parameters grows far too large for estimating them in any reasonable time. Other fully Bayesian approaches include various variational methods (Jordan, Ghahramani, Jaakkola, & Saul, 1999) for approximating the integration by a tractable problem, and the mean-field approach (Winther, 1998), where the problem is simplified by neglecting certain dependences between the random variables. In this article, we use a particularly powerful variational approximation method, ensemble learning, which is applicable to unsupervised learning problems too.

2.2 Ensemble Learning. Ensemble learning (Barber & Bishop, 1998; Lappalainen & Miskin, 2000), also called variational Bayes, is a method developed recently (Hinton & van Camp, 1993; MacKay, 1995) for approximating the posterior density (see equation 2.1). It can be used for both parametric and variational approximation. In the former, some parameters characterizing the posterior pdf are optimized, while in the latter, a whole function is optimized.
More specifically, ensemble learning employs the Kullback-Leibler (KL) divergence (also called KL distance or information) between two probability distributions q(v) and p(v). This is defined by the cost function

J_KL(q ‖ p) = E_q{ln(q/p)} = ∫ q(v) ln [q(v) / p(v)] dv.    (2.3)
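A discrete sketch, constructed here only to illustrate equation 2.3, shows the basic properties of the KL divergence that the text relies on:

```python
import math

def kl(q, p):
    """Discrete KL divergence: sum_i q_i ln(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Two made-up probability vectors over three outcomes.
q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]

assert kl(q, q) == 0.0                      # minimum 0 when the densities coincide
assert kl(q, p) > 0.0 and kl(p, q) > 0.0    # otherwise strictly positive
assert abs(kl(q, p) - kl(p, q)) > 1e-9      # not symmetric, so not a true metric
```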
The KL divergence J_KL measures the difference in probability mass between the densities q(v) and p(v). Its minimum value 0 is achieved when the two densities are the same. A key idea in ensemble learning is to minimize the misfit between the actual posterior pdf and its parametric approximation. The exact posterior pdf, equation 2.1, is written in the form p(S, θ | X, H), where we have now separated the source signals or latent variables S corresponding to the data X from the other parameters θ for clarity and later use. Denoting the approximation of this posterior pdf by q(S, θ) yields for the KL divergence of these two distributions:

J_KL(q ‖ p) = ∫∫ q(S, θ) ln [ q(S, θ) / p(S, θ | X, H) ] dθ dS.    (2.4)
The cost function used in ensemble learning is defined as follows (Hinton & van Camp, 1993; MacKay, 1995):

C_KL = J_KL(q ‖ p) − ln p(X | H).    (2.5)
The main reason for using equation 2.5 instead of 2.4 is that the posterior pdf p(S, θ | X, H) includes the evidence term p(X | H). This requires an integration, equation 2.2, which is usually intractable. Inserting equation 2.4 into 2.5 yields, after slight manipulation,

C_KL = ∫∫ q(S, θ) ln [ q(S, θ) / p(X, S, θ | H) ] dθ dS,    (2.6)
which does not need p(X | H). In fact, the cost function provides a bound for the evidence p(X | H). Recalling that the KL divergence J_KL(q ‖ p) is always nonnegative, it follows directly from equation 2.5 that the cost function C_KL satisfies the inequality

C_KL ≥ −ln p(X | H).    (2.7)
This shows that minimizing the cost function C_KL used in ensemble learning is equivalent to maximizing the lower bound −C_KL on the log evidence ln p(X | H) of the model H. For computational reasons, the approximation q(S, θ) of the posterior pdf must be simple enough. This is achieved by neglecting some or even all of the dependences between the variables. Our choice for q(S, θ) is a multivariate gaussian density with an almost diagonal covariance matrix. A gaussian distribution with mutually uncorrelated, and hence also statistically independent, components is often used as the approximation of the posterior pdf in ensemble learning. Even this crude approximation is adequate for finding the region where the mass of the actual posterior density is concentrated. The mean value of each component of the gaussian approximation can be used as a reasonably good point estimate of the corresponding parameter or variable if necessary, and the respective variance provides a useful simple measure of the reliability of this estimate. The main motivation for resorting to ensemble learning is that it avoids overfitting, which would be difficult to avoid in our problem if ML or MAP estimates were used. Ensemble learning allows one to select a model having
appropriate complexity, often making it possible to infer the correct number of sources or factors in our case. Bayes' rule yields for the posterior probability of the model H_i

p(H_i | X) = p(X | H_i) p(H_i) / p(X).    (2.8)
Assume now that all the models H_i have the same constant prior probability p(H_i). Then the model with the highest posterior probability p(H_i | X) is the model with the highest evidence p(X | H_i). The inequality, equation 2.7, shows that minimization of the cost function C_KL provides a bound on the logarithm of the evidence of the model. If this bound is close to the correct value, the model H_i that maximizes the posterior probability, equation 2.8, is also the one that minimizes the cost function C_KL used in ensemble learning (Miskin & MacKay, 2001). The cost function, equation 2.6, also has a simple but insightful information-theoretic interpretation (Hinton & van Camp, 1993; Valpola, 2000a). It represents the combined cost of coding the model and the data using that model. The optimal solution uses the minimum total number of bits for the encoding. If the model H is made more complicated, the bits needed for coding the data usually decrease because the model is able to represent the data more accurately. On the other hand, the bits needed for coding the model then increase. Hence, ensemble learning tries to find the optimum balance between the representation power and the complexity of the model. In this respect it behaves qualitatively quite similarly to the well-known minimum description length method (Bishop, 1995) of model selection. Until recently, ensemble learning has been applied mainly to supervised learning problems (Barber & Bishop, 1998), but it can be used for unsupervised learning as well. H. V. employed it for standard linear independent component analysis in Lappalainen (1999) using a fixed-form approximation, with application to real-world speech data. Recently, several authors have studied ensemble learning in similar problems using slightly different assumptions and methods (Choudrey, Penny, & Roberts, 2000; Miskin & MacKay, 2000, 2001; Roberts & Everson, 2001).
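The relationship between the cost function and the evidence, equations 2.5 through 2.7, can be checked numerically on a minimal discrete toy model constructed here for illustration (it is not from the article): the cost never drops below −ln p(X | H), and the bound becomes tight exactly when q equals the true posterior.

```python
import math

# Toy model (made up): one binary hidden variable z, uniform prior p(z),
# and fixed likelihoods p(X | z) for some observed data X.
p_z = [0.5, 0.5]
p_X_given_z = [0.8, 0.1]
evidence = sum(pz * px for pz, px in zip(p_z, p_X_given_z))   # p(X | H)

def cost(q):
    """Discrete analogue of equation 2.6: sum_z q(z) ln [q(z) / p(X, z)]."""
    return sum(qz * math.log(qz / (pz * px))
               for qz, pz, px in zip(q, p_z, p_X_given_z) if qz > 0)

# The cost never drops below -ln p(X | H), as in equation 2.7 ...
for q0 in (0.1, 0.5, 0.9, 0.99):
    assert cost([q0, 1.0 - q0]) >= -math.log(evidence) - 1e-12

# ... and the bound is tight exactly at the true posterior p(z | X).
posterior = [pz * px / evidence for pz, px in zip(p_z, p_X_given_z)]
assert abs(cost(posterior) + math.log(evidence)) < 1e-12
```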
The work by Attias (1999a, 1999b, 2000, summarized in Attias, 2001) is also closely related to ours, although point estimates are partly used in his early work. In Attias (2000), this type of method has been extended to take into account temporal dependences, and in Miskin and MacKay (2000, 2001) spatial dependences in image deconvolution applications. However, in all of this previous work, unsupervised ensemble learning or related Bayesian methods have been applied to linear generative data models only. The learning process then becomes simpler, but the representation power of such linear models is weaker than in this work.
3 The Dynamic Model

3.1 Model Structure. In this section, we introduce our dynamic model and its neural network realization using a graphical Bayesian model and define the distributions of the parameters needed in the Bayesian approach. In the next section, we justify our choices and discuss the properties of the selected model. Finally, we deal briefly with relevant related work. The generative model for the observations x(t) is as follows:

x(t) = f(s(t)) + n(t)    (3.1)
s(t) = g(s(t − 1)) + m(t).    (3.2)
Equation 3.1 is the same as in the nonlinear factor analysis model, equation 1.1. Hence we assume that the factors or sources s_i(t) (the components of s(t)) are gaussian. The gaussian additive noise n(t) models the imperfections and errors in the nonlinear mapping f. The model 3.2 for the dynamics of the factors s(t) is fairly similar. The current factors s(t) are generated through another nonlinear transformation g from the factors s(t − 1) at the previous time instant. This means that the factors can be interpreted as the states of a dynamical system. The noise or error term m(t) is again gaussian. In the dynamic model, equation 3.2, it is often called process noise or the innovation process. The nonlinear mappings f and g are modeled by MLP networks. The functional form of the observation or data mapping f in equation 3.1 is

f(s(t)) = B tanh[As(t) + a] + b,    (3.3)
where A and B are the weight matrices of the hidden and output layers of the MLP network, respectively, and a and b are the corresponding bias vectors. The hyperbolic tangent nonlinearity tanh is applied separately to each component of its argument vector. The dimensions of the vectors and matrices are chosen so that they match each other and the number of neurons in each layer. The dynamic mapping g in equation 3.2 is modeled using another MLP network as follows:

g(s(t − 1)) = s(t − 1) + D tanh[Cs(t − 1) + c] + d.    (3.4)

The matrices C and D contain the weights of the hidden and output layers, respectively, and the vectors c and d their biases. Because the factors s(t) usually do not change much from their previous values s(t − 1) during a single time step, we have used the MLP network in equation 3.4 for modeling only the change in the factors.
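A minimal pure-Python sketch of generating data from the model, equations 3.1 through 3.4, is given below. The dimensions, seeds, and random weights are made up for the example; zero biases are used for brevity.

```python
import math
import random

random.seed(1)

def matvec(M, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in M]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def f(s, A, a, B, b):
    """Observation mapping, equation 3.3: B tanh[A s + a] + b."""
    return add(matvec(B, [math.tanh(h) for h in add(matvec(A, s), a)]), b)

def g(s, C, c, D, d):
    """Dynamic mapping, equation 3.4: s + D tanh[C s + c] + d."""
    return add(s, add(matvec(D, [math.tanh(h) for h in add(matvec(C, s), c)]), d))

# Made-up sizes: 2 factors, 3 hidden units, 4 observed components.
m, nh, n = 2, 3, 4
mat = lambda r, c_: [[random.gauss(0, 0.3) for _ in range(c_)] for _ in range(r)]
A, B, C, D = mat(nh, m), mat(n, nh), mat(nh, m), mat(m, nh)
a, b, c, d = [0.0] * nh, [0.0] * n, [0.0] * nh, [0.0] * m

s = [0.5, -0.5]
xs = []
for t in range(10):
    s = add(g(s, C, c, D, d), [random.gauss(0, 0.05) for _ in range(m)])       # eq. 3.2
    xs.append(add(f(s, A, a, B, b), [random.gauss(0, 0.05) for _ in range(n)]))  # eq. 3.1
```

Note how g returns the previous state plus an MLP correction, matching the residual form of equation 3.4.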
In order to apply the Bayesian approach, each unknown variable in the network is assigned a probability density function of its own. We apply the usual hierarchic definition of priors (Bishop, 1995; Gelman et al., 1995; Neal, 1996). For many parameters, for instance the biases, it is difficult to assign a prior distribution. We can then use the fact that each individual bias has a similar role in the MLP network by assuming that each element of a bias vector has the same, albeit unknown, distribution, which is then further modeled by a parametric distribution. These new parameters need to be assigned priors too, but there are far fewer of them. Consider first the observation model, equation 3.1. The noise vector n(t) in that equation is assumed to be i.i.d. at different time instants and gaussian with a zero mean. However, the variances of the different components of the noise vector n(t) can be different. Given the source vector s(t), the variance of the data vector x(t) in the equation is due to the noise. Therefore, x(t) otherwise has the same distribution as the noise, but its mean is now f(s(t)). Denote generally the multivariate gaussian pdf of a random vector x having the mean vector μ and covariance matrix Σ by N[x; μ; Σ]. The pdf of the vector x(t) in equation 3.1 is then

p(x(t) | s(t), θ) = N[x(t); f(s(t)); diag[exp(2v_n)]],    (3.5)
where diag denotes a diagonal matrix. The components of the vector exp(2v_n) specify the diagonal covariance matrix of x(t) and therefore also the variances of the respective components of x(t). Hence, the components of x(t) are assumed to be uncorrelated (which here implies independence, due to the gaussianity assumption) in this prior model for x(t). The exponential parameterization of the variance corresponds to a log-normal distribution when the variance parameter has a gaussian prior. This choice is motivated later, in section 3.2. Assuming again that the noise term m(t) is i.i.d. and gaussian with a zero mean and a diagonal covariance matrix, we get from equation 3.2 for the distribution of the source vector s(t), in a similar way,

p(s(t) | s(t − 1), θ) = N[s(t); g(s(t − 1)); diag[exp(2v_m)]],    (3.6)
where the vector v_m is used to define the variances of the components of s(t). Denote by A_ij the element (i, j) of the weight matrix A and by a_i the ith component of the bias vector a. The elements of the weight matrices B, C, and D are denoted respectively by B_ij, C_ij, and D_ij, and b_i, c_i, and d_i denote the components of the bias vectors b, c, and d. The distributions of these parameters defining the MLP network mappings 3.3 and 3.4 are again taken to be gaussian. They are specified as:

p(A_ij) = N[A_ij; 0; 1]    (3.7)
p(B_ij) = N[B_ij; 0; exp(2v_Bj)]    (3.8)
p(a_i) = N[a_i; m_a; exp(2v_a)]    (3.9)
p(b_i) = N[b_i; m_b; exp(2v_b)]    (3.10)
p(C_ij) = N[C_ij; 0; exp(2v_Cj)]    (3.11)
p(D_ij) = N[D_ij; 0; exp(2v_Dj)]    (3.12)
p(c_i) = N[c_i; m_c; exp(2v_c)]    (3.13)
p(d_i) = N[d_i; m_d; exp(2v_d)].    (3.14)
The prior distributions of the parameters m_a, m_b, m_c, m_d and v_a, v_b, v_c, v_d defining the distributions of the bias vectors are assumed to be gaussian with zero mean and standard deviation 100. Hence these distributions are assumed to be very flat.² The distributions 3.8 through 3.14 are also conditional, because all of the other parameters in θ are assumed to be nonrandom in defining them. This has not been shown explicitly, to keep the notation simpler. For specifying the model completely, we still need to define the pdf's of the hyperparameters controlling the noise variances exp(2v_n) and exp(2v_m) in the prior distributions 3.5 and 3.6 of x(t) and s(t). For doing this, we denote the ith components of the vectors v_n and v_m respectively by v_ni and v_mi. Furthermore, we specify the distributions of the hyperparameters m_vB, v_vB, m_vC, v_vC, m_vD, and v_vD, which in turn define the variances of the elements of the matrices B, C, and D. These distributions are given by

p(v_ni) = N[v_ni; m_vn; exp(2v_vn)]    (3.15)
p(v_mi) = N[v_mi; m_vm; exp(2v_vm)]    (3.16)
p(v_Bj) = N[v_Bj; m_vB; exp(2v_vB)]    (3.17)
p(v_Cj) = N[v_Cj; m_vC; exp(2v_vC)]    (3.18)
p(v_Dj) = N[v_Dj; m_vD; exp(2v_vD)].    (3.19)
The prior distributions of the 10 hyperparameters m_vn, . . . , v_vD defining the distributions 3.15 through 3.19 are, as before, chosen to be gaussian with a zero mean and standard deviation 100.

3.2 Properties of the Model. We have parameterized all the distributions so that the resulting parameters have a roughly gaussian posterior distribution when possible. This is because the posterior will be approximated by a gaussian distribution. For example, the variances of the gaussian distributions have been parameterized on a logarithmic scale for this reason. In ensemble learning, conjugate gamma distributions are often used

² Each observed component is prenormalized to have unit variance.
(Miskin & MacKay, 2001), since they yield simple update equations. However, our choice of a log-normal model for the variances makes it much easier to build a hierarchical model for the variances (Valpola, Raiko, & Karhunen, 2001). It should be noted that assuming all the parameters to be gaussian is not too restrictive, because the tanh nonlinearities in the MLP networks are able to transform the gaussian distributions to virtually any output distribution. On the other hand, the use of gaussian distributions makes the evaluation of the KL cost function and the learning process mathematically tractable. The gaussian posterior approximation can be inaccurate, particularly for the sources, if the mappings f and g are strongly nonlinear. In this case, the misfit between the approximation and the true posterior is large. The learning procedure aims at minimizing the misfit and consequently favors smooth mappings, which result in a roughly gaussian posterior distribution for the sources. Model indeterminacies are handled by restricting some of the distributions. For instance, there is the well-known scaling indeterminacy (Hyvärinen et al., 2001) between the matrix A and the source vector s(t) in the term As(t) + a in equation 3.3. This is taken care of by setting the variance of A to unity instead of parameterizing and estimating it. The weight matrix B of the output layer does not suffer from such an indeterminacy, and so the variance of each of its columns is parameterized as exp(2v_Bj). The network can effectively prune out some of the hidden neurons by setting their outgoing weights to zero, and this is easier if the variance of the corresponding columns of B can be given small values. In principle at least, one could use any sufficiently general neural network or graphical model structure for modeling the nonlinear mappings f and g. We have chosen MLP networks because they are well suited to both mildly and strongly nonlinear mappings.
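The scaling indeterminacy can be verified with a two-line check on toy numbers (not from the article): rescaling a column of A while inversely rescaling the corresponding source leaves the product As(t) unchanged.

```python
# Toy numbers. Multiplying a column of A by a constant and dividing the
# corresponding source by the same constant leaves A s(t) unchanged.
A = [[1.0, 2.0], [3.0, 4.0]]
s = [0.5, -1.5]
c = 10.0
A2 = [[row[0] * c, row[1]] for row in A]   # rescale the first column of A
s2 = [s[0] / c, s[1]]                      # compensate in the first source
As = [sum(w * sj for w, sj in zip(row, s)) for row in A]
A2s2 = [sum(w * sj for w, sj in zip(row, s2)) for row in A2]
assert all(abs(u - v) < 1e-12 for u, v in zip(As, A2s2))
```

Since the data cannot distinguish the two parameterizations, fixing the variance of A to unity removes this degree of freedom.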
Another, even more important, benefit here is that the number of parameters grows only quadratically with the dimension of the latent space. In the networks 3.3 and 3.4, the hidden layer is nonlinear, but the output layer is linear. This is because using a nonlinear mapping also in the output layer would only complicate the learning process without bringing any notable improvement in the generality of the modeled dynamic system. It should also be noted that the MLP network equation 3.4, modeling the dynamics, can learn to overrule the term s(t − 1) if it turns out that it does not bear information about the current value s(t) of the source vector. The model defined by equations 3.1 and 3.2 aims at finding factors that facilitate prediction. It is often easier to predict future factors from past factors than to predict future observations directly from past observations. Learning of the factors s(t) at the time instant t takes into account three sources of information: (1) the factors should be able to represent the observations x(t), (2) the factors should be able to predict the factors s(t + 1) at the next time step, and (3) the factors should be well predicted by the factors s(t − 1) at the previous time step. This is depicted schematically in Figure 1.
Figure 1: The causal relations assumed by the model are depicted by solid arrows. Observations x(t) give information about the values of the factors. The flow of information to the factor s(t) is represented by dashed arrows. Bayes' rule is used for reversing the solid arrows when necessary.
The model, equation 3.2, assumes that the factors s(t) can be predicted from the immediately preceding factors s(t − 1) without knowing the earlier factors. This is not a restriction, because any dynamic model depending on older factors s(t − 2), s(t − 3), . . . can be converted into an equivalent model of the type 3.2 in the way shown in Figure 2. It is well known (Haykin & Principe, 1998) that the nonlinear model 3.1 and 3.2 is not uniquely identifiable. This is because any smooth transformation of the latent space can be absorbed into the mappings f and g. If several different parameter values result in equivalent external behavior of the model, no amount of data can distinguish among the alternatives. On the other hand, if only the external behavior is of interest, the models agree, and the choice between them becomes irrelevant. Ensemble learning can exploit the indeterminacies in the prior model by rendering the posterior closer to the functional forms allowed by the posterior approximation. A manifestation of this behavior is seen later in the experiments.
Figure 2: (a) The factor s(t) depends on the three previous values: s(t − 3), s(t − 2), and s(t − 1). (b) This dynamics can be transformed into an equivalent state representation where s1(t) corresponds to s(t), while s2(t) and s3(t) store the values s(t − 1) and s(t − 2), respectively.
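The conversion of Figure 2 can be sketched for a scalar third-order dynamics; the linear function h below is made up for the illustration, and the stacked state simply stores the shifted history:

```python
# Third-order dynamics s(t) = h(s(t-1), s(t-2), s(t-3)) rewritten as a
# first-order model over the stacked state (s1, s2, s3), as in Figure 2.
def h(u, v, w):                     # hypothetical scalar dynamics
    return 0.5 * u - 0.2 * v + 0.1 * w

def step(state):
    s1, s2, s3 = state              # s(t-1), s(t-2), s(t-3)
    return (h(s1, s2, s3), s1, s2)  # new state shifts the history forward

# Direct third-order recursion, for comparison.
hist = [1.0, 0.5, 0.25]             # s(1), s(2), s(3)
state = (hist[2], hist[1], hist[0])
for t in range(20):
    hist.append(h(hist[-1], hist[-2], hist[-3]))
    state = step(state)
assert abs(state[0] - hist[-1]) < 1e-12   # the two formulations agree
```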
3.3 Related Work. Before proceeding, we give relevant references on nonlinear dynamic modeling. A review article on nonlinear predictive dynamic models, containing comparisons of various techniques for chaotic time series, is Casdagli (1989). A tutorial paper on using neural networks, in particular the well-known radial basis function (RBF) networks (Bishop, 1995; Haykin, 1998), to dynamically model nonlinear chaotic time series is Haykin and Principe (1998). Dynamic modeling using various neural network structures such as recurrent or time-delayed neural networks is discussed on an introductory level in Principe, Euliano, and Lefebvre (2000), and on a somewhat more advanced level in Haykin (1998). Quite recently, Bayesian techniques have been introduced for the problem of learning the mappings f and g in equations 3.1 and 3.2 in Briegel and Tresp (1999), Ghahramani and Roweis (1999), and Roweis and Ghahramani (2001). In Ghahramani and Roweis (1999) and Roweis and Ghahramani (2001), the nonlinear mappings are modeled by RBF networks, and only the linear output layers of the RBF networks are adapted. This yields a very efficient learning algorithm, but in practice it restricts the dimension of the dynamic system that can be learned, because the required number of radial basis functions increases exponentially with the dimension of the internal state of the process. Briegel and Tresp (1999) model the nonlinearities using MLP networks with sampling. The most important difference between these earlier methods and the one described in this article is that the parameters of the mappings f and g have so far been summarized using point estimates, while here full distributions are used. Bayesian techniques for hidden Markov models are reviewed in Ghahramani (2001). These models comprise another popular technique of dynamic modeling, which in practice often provides fairly similar results to state-space models.
See, for instance, Ghahramani and Hinton (2000), where variational methods resembling ensemble learning are applied to a state-space model in which a hidden Markov model switches between different linear dynamical models. Cichocki, Zhang, Choi, and Amari (1999) have considered a nonlinear dynamic extension of the standard linear model used in ICA and BSS by applying state-space models and hyper-RBF networks. However, their learning algorithms are not Bayesian, and they are only partly blind or unsupervised, requiring some amount of known training data in the beginning of learning. An interesting line of research stems from traditional Kalman filtering, where the distribution of the unknown state is inferred given the observations and the known mappings f and g. It has also been modified for joint estimation of the model parameters and the state. A review of these methods can be found in Wan and Nelson (2001), and Kalman filtering is also briefly discussed in section 5.1. An interesting modification, the unscented Kalman filter, is presented in Wan and Merwe (2001).
In Wan and Nelson (2001) and Wan and Merwe (2001), the unknown state and dynamics are estimated using extended Kalman filtering, while the observation mapping f is assumed to be the identity mapping. This means that the observations are assumed to be noisy versions of the underlying states. It is unclear whether the approach is suitable for our problem, where the unknown observation mapping is a general nonlinear function and the dimensionality of the state-space is high. More research is needed to investigate the relative merits of the approaches, but the advantages of ensemble learning include its ability to handle state-spaces of large dimensionality, to learn the structure of the model, and to find a meaningful solution among an infinite family of possible state representations.

4 The Cost Function
Let us denote the two parts of the cost function 2.6 arising from the numerator and the denominator of the logarithm, respectively, by

C_q = E_q{ln q} = ∫∫ q(S, θ) ln q(S, θ) dθ dS    (4.1)

C_p = E_q{−ln p} = −∫∫ q(S, θ) ln p(X, S, θ) dθ dS.    (4.2)
Hence the total cost, equation 2.6, is

C_KL = C_q + C_p.    (4.3)
In equation 4.2 and after, we no longer show the explicit dependence of the pdf p(X, S, θ) on the model H, in order to simplify the notation. However, it should be remembered that such a dependence still exists. For approximating and then minimizing the total cost C_KL, we need two things: the exact formulation of the joint density p(X, S, θ) of the data X, sources S, and other parameters θ, and the approximation of the posterior pdf q(S, θ). Assume that T data vectors x(t) are available. The complete data set is then X = {x(1), . . . , x(T)}, and the respective set of source vectors is S = {s(1), . . . , s(T)}. The joint density can be expressed as the product

p(X, S, θ) = p(X | S, θ) p(S | θ) p(θ).    (4.4)
The probability density p(X | S, θ) of the data X given the sources S and the parameters θ can be evaluated from equations 3.5 and 3.3. Clearly, the ith
component of the mapping vector f(s(t)) in equation 3.3 is

f_i(s(t)) = b_i^T tanh[As(t) + a] + b_i,    (4.5)
where b_i^T denotes the ith row vector of the weight matrix B and b_i is the ith component of the bias vector b. From equation 3.5, the distribution of the ith component x_i(t) of the data vector x(t) is thus gaussian:

p(x_i(t) | s(t), θ) = N[x_i(t); f_i(s(t)); exp(2v_ni)].    (4.6)
Assuming that the data vectors x(t) are independent at different time instants and taking into account the independence of the components of the noise vector, we get
$$p(X \mid S, \mu) = \prod_{t=1}^{T} \prod_{i=1}^{n} p(x_i(t) \mid \mathbf{s}(t), \mu), \tag{4.7}$$
where n is the common dimension of the vectors x(t), f(s(t)), and n(t) in equation 3.1. The distribution p(S | µ) can be derived similarly as p(X | S, µ) above, but now taking into account that the subsequent values s(t) and s(t − 1) of the source vector depend on each other due to the model 3.2 and 3.4. Thus, we get
$$p(S \mid \mu) = p(\mathbf{s}(1) \mid \mu) \prod_{t=2}^{T} p(\mathbf{s}(t) \mid \mathbf{s}(t-1), \mu), \tag{4.8}$$
where p(s(t) | s(t − 1), µ) can be evaluated quite similarly as above from equations 3.6 and 3.4, yielding
$$p(\mathbf{s}(t) \mid \mathbf{s}(t-1), \mu) = \prod_{i=1}^{m} p(s_i(t) \mid \mathbf{s}(t-1), \mu) = \prod_{i=1}^{m} N[s_i(t);\, g_i(\mathbf{s}(t-1));\, \exp(2v_{m_i})]. \tag{4.9}$$
Here m is the number of components of the vectors s(t), g(s(t − 1)), and m(t) in equation 3.2, and g_i(s(t − 1)) is the ith component of the vector g(s(t − 1)) obtained directly from equation 3.4. The prior distribution of the parameters p(µ) can be evaluated as the product of the gaussian distributions of the parameters specified in equations 3.7 through 3.14, but here one must take into account their dependences on the distributions of the hyperparameters given in equations 3.15 through 3.19.

Harri Valpola and Juha Karhunen

Finally, inserting the expressions obtained in this way for p(X | S, µ), p(S | µ), and p(µ) into equation 4.4 provides the desired joint density p(X, S, µ). We shall not write it out explicitly here due to its complicated form.

Another distribution needed in evaluating the cost functions C_q and C_p is the approximation q(S, µ) of the posterior density of µ and S = {s(1), ..., s(T)}. It must be simple enough for mathematical tractability and computational efficiency. First, we assume that the source vectors are independent of the other parameters µ, so that q(S, µ) decouples into
$$q(S, \mu) = q(S)\, q(\mu). \tag{4.10}$$
For the parameters µ, a gaussian density q(µ) with a diagonal covariance matrix is used. This implies that the approximation q(µ) is a product of independent gaussian distributions q_j(h_j) of each individual parameter h_j:
$$q(\mu) = \prod_j q_j(h_j) = \prod_j N[h_j;\, \bar{h}_j;\, \tilde{h}_j]. \tag{4.11}$$
The mean h̄_j and variance h̃_j define each component pdf q_j(h_j) = N[h_j; h̄_j; h̃_j]. Quite similarly as in modeling the pdf p(S | µ), the components of each source vector s(t) are assumed to be statistically independent. The subsequent source vectors are also otherwise independent, but their corresponding components s_j(t) and s_j(t − 1), j = 1, ..., m, are pairwise correlated. This is a realistic minimum assumption for modeling the dependence of the vectors s(t) and s(t − 1) that follows from the chosen dynamic model 3.2, and it does not increase the computational load significantly. These considerations lead to the approximation
$$q(S) = \prod_{t=1}^{T} \prod_{i=1}^{m} q(s_i(t) \mid s_i(t-1)), \tag{4.12}$$
where the conditional pdf q(s_i(t) | s_i(t − 1)) describing the dependence is again modeled as a gaussian distribution. Conditioned on s_i(t − 1), its variance is s̊_i(t), and its mean,
$$m_{s_i}(t) = \bar{s}_i(t) + \breve{s}_i(t, t-1)\,[s_i(t-1) - \bar{s}_i(t-1)], \tag{4.13}$$
depends linearly on the previous source value s_i(t − 1). The form of equation 4.13 is chosen such that s̄_i(t) is the mean of q(s_i(t)) when s_i(t − 1) is marginalized out. Hence,
$$q(s_i(t) \mid s_i(t-1)) = N[s_i(t);\, m_{s_i}(t);\, \mathring{s}_i(t)]. \tag{4.14}$$
The approximate density q(s_i(t) | s_i(t − 1)) is thus parameterized by the mean s̄_i(t), the linear dependence s̆_i(t, t − 1), and the variance s̊_i(t), as defined in equations 4.14 and 4.13. This concludes the specification of the approximation of the pdf q(S) of the sources S in equation 4.12. Inserting equations 4.12 and 4.11 into 4.10 finally provides the desired approximating posterior density q(S, µ) of the sources S and the parameters µ, completing the specification of the cost functions 4.1 and 4.2.

5 Learning Method
The overall learning procedure somewhat resembles the batch version of the standard backpropagation algorithm (see Haykin, 1998) in that each iteration consists of a forward and a backward phase. In the forward phase, the value of the cost function 4.3 is computed using the current values of the source signals S and other parameters µ. Essentially, in this phase the posterior means and variances of all the sources are propagated through the MLP networks 3.3 and 3.4, and the value of the cost function is then calculated. In the backward phase, the parameters defining the distributions of the sources s_i(t) and the parameters µ of the MLP networks are updated using EM-type algorithms. This takes place in a certain order, which we will describe later. The special properties of the problem are used for developing efficient updating algorithms. The updating procedure is simpler for the hyperparameters. The forward and backward phases are then repeated over sufficiently many sweeps until the method has converged with the desired accuracy. The learning is of batch type, so that all the data vectors are used for updating all the parameters and sources during each sweep. In the following sections, we describe in more detail the forward and backward phases and the learning algorithms used in them, as well as the initialization of the learning process.

5.1 Relation to Kalman Filtering and Smoothing. The standard technique for estimating the state of linear gaussian state-space models is Kalman filtering or smoothing. In smoothing, the posterior probability of the states is computed in a two-phase forward and backward pass that collects the information from past and future observations (filtering consists of the forward pass only). The posterior probability of the state can be solved in this way because the effect of the value of the observation x(t) on the posterior mean of the states is linear and does not depend on the values of the other observations.
The posterior variance depends only on the mappings and noise levels, not on the observations. Neither of these properties holds for nonlinear models, because the values of the states affect the way in which the information passes through the dynamics. It is possible to extend linear Kalman filtering to nonlinear
dynamics, for instance, by linearizing the dynamics around the current estimate of the sources. Several iterations are required because the linearized model and the estimated operating point depend on each other. It is also possible that the iterations become unstable. For linear gaussian models, it is possible to derive algorithms similar to Kalman smoothing using ensemble learning, as was done in Ghahramani and Beal (2001). Our algorithm is designed for learning nonlinear models and propagates information only one step forward and backward in time in the forward and backward phases. This makes learning stable but does not significantly slow convergence, because the slowest part is, in any case, the adaptation of the MLP networks. Each iteration is also more efficient than a Kalman smoothing step, since no matrix inversions are required. This is due to the factorial form of the posterior approximation q(S) of the sources.

5.2 Forward Phase: Computation of the Cost Function. Consider now in more detail how the current value of the KL cost function C_KL, defined in equations 4.1 through 4.3, is calculated for our model in practice. Both the joint density p(S, µ, X) and the approximation of the posterior density q(S, µ) are products of many simple terms. This property facilitates the calculation of the cost functions 4.1 and 4.2 greatly, because they split into expectations of many simple terms. This can be seen as follows. Assume that h(Γ) and f(Γ) are some joint pdf's depending on a set of parameters Γ = {γ_1, ..., γ_l}. Assuming further that these parameters are statistically independent, the joint densities h(Γ) and f(Γ) decouple into products of densities h_i(γ_i) and f_j(γ_j) of the independent parameters,
$$h(\Gamma) = \prod_{i=1}^{l} h_i(\gamma_i), \qquad f(\Gamma) = \prod_{j=1}^{l} f_j(\gamma_j), \tag{5.1}$$
and the integral
$$\begin{aligned}
\int h(\Gamma) \ln f(\Gamma)\, d\Gamma &= \int_{\gamma_1} \cdots \int_{\gamma_l} \prod_{i=1}^{l} h_i(\gamma_i) \sum_{j=1}^{l} \ln f_j(\gamma_j)\, d\gamma_1 \cdots d\gamma_l \\
&= \sum_{j=1}^{l} \int_{\gamma_j} h_j(\gamma_j) \ln f_j(\gamma_j)\, d\gamma_j \prod_{i \neq j} \int_{\gamma_i} h_i(\gamma_i)\, d\gamma_i \\
&= \sum_{j=1}^{l} \int_{\gamma_j} h_j(\gamma_j) \ln f_j(\gamma_j)\, d\gamma_j. \tag{5.2}
\end{aligned}$$
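Equation 5.2 can be checked numerically: for a two-parameter product density, the full 2-D integral of h ln f equals the sum of two 1-D integrals. The sketch below does this with gaussian factors on a simple Riemann grid; the particular densities and grid settings are illustrative.

```python
import numpy as np

def gauss(x, m, v):
    """Density of N(x; m; v)."""
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

# Grid wide enough that the gaussians have negligible mass outside it.
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

# Factor densities h_i and f_i for two independent parameters.
h1, h2 = gauss(x, 0.0, 1.0), gauss(x, 0.5, 2.0)
f1, f2 = gauss(x, -0.3, 1.5), gauss(x, 1.0, 0.8)

# Left-hand side: full 2-D integral of h(g1, g2) * ln f(g1, g2).
H = np.outer(h1, h2)                              # h(g1, g2) = h1(g1) h2(g2)
lnF = np.log(f1)[:, None] + np.log(f2)[None, :]   # ln f = ln f1 + ln f2
lhs = np.sum(H * lnF) * dx * dx

# Right-hand side of equation 5.2: sum of 1-D integrals.
rhs = np.sum(h1 * np.log(f1)) * dx + np.sum(h2 * np.log(f2)) * dx

print(lhs, rhs)  # the two values agree to quadrature accuracy
```

This factorization is what reduces the cost of evaluating 4.1 and 4.2 from a high-dimensional integral to a sum of scalar integrals.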
Thus, one must compute integrals of the densities of each independent scalar parameter γ_i only (or, more generally, integrals that are functions of mutually
dependent parameters only). Typically, terms in the prior probability of the model are conditional, but the same principle can be used efficiently in evaluating the costs 4.1 and 4.2: if the term is of the form f_j(γ_j | γ_k, γ_l), for instance, then the integral needs to be computed over h_j(γ_j), h_k(γ_k), and h_l(γ_l) only. The computations in the forward phase start with the parameters of the posterior approximation q(S, µ) of the unknown variables of the model. The posterior mean of each parameter h_j belonging to the parameter set µ is denoted by h̄_j, and h̃_j denotes the respective posterior variance. For the components s_i(t) of the source vectors s(t) in the set S, the parameters of their posterior approximation are the posterior mean s̄_i(t), the posterior variance s̊_i(t), and the dependence s̆_i(t, t − 1). The first stage of the forward computations is the iteration
$$\tilde{s}_i(t) = \mathring{s}_i(t) + \breve{s}_i^2(t, t-1)\, \tilde{s}_i(t-1) \tag{5.3}$$
to obtain the marginalized posterior variances s̃_i(t) of the sources. It is easy to see by induction that if the past values s_i(t − 1) of the sources are marginalized out, the posterior mean of s_i(t) is indeed s̄_i(t), and its marginalized variance is given by equation 5.3. The derivation is presented in section A.1. The marginalized variances s̃_i(t) are computed recursively in a sweep forward in time t. Thereafter, the posterior means and variances are propagated through the MLP networks. The final stage is the computation of the KL cost function C_KL in equation 4.3.

The first part (equation 4.1) of the KL cost function 4.3 depends only on the approximation q(S, µ) of the posterior density. It is straightforward to calculate, because in the previous section it was shown that q(S, µ) consists of products of independent gaussian pdf's of the form q_j(h_j) or q(s_i(t) | s_i(t − 1)). Applying equation 5.2, we see that the cost 4.1 becomes the sum of the negative entropies of gaussian random variables h_j and s_i(t) given s_i(t − 1). Recalling that the entropy H(y) of a gaussian random variable y is
$$H(y) = -\int_{-\infty}^{+\infty} p(y) \ln p(y)\, dy = \frac{1}{2}\left[1 + \ln(2\pi\sigma_y^2)\right] = \frac{1}{2}\ln(2\pi e \sigma_y^2), \tag{5.4}$$
where σ_y² is the variance of y, we get
$$C_q = -\frac{1}{2}\sum_j \ln(2\pi e\, \tilde{h}_j) - \frac{1}{2}\sum_i \sum_t \ln(2\pi e\, \mathring{s}_i(t)), \tag{5.5}$$
where h̃_j is the posterior variance of the parameter h_j, and s̊_i(t) is the posterior variance of s_i(t) given s_i(t − 1).
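As a sanity check on equation 5.3, the sketch below computes the marginalized variances s̃_i(t) recursively for a single source chain and compares them with the sample variances of draws from the Markov posterior defined by equations 4.13 and 4.14; it also evaluates this source's contribution to C_q from equation 5.5. The chain length, variances, and dependences are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
s_bar = np.array([0.0, 1.0, -0.5, 0.3, 0.8])    # posterior means  s_bar(t)
s_ring = np.array([1.0, 0.5, 0.4, 0.6, 0.3])    # conditional variances  s_ring(t)
s_breve = np.array([0.0, 0.7, -0.4, 0.5, 0.6])  # dependences  s_breve(t, t-1)

# Equation 5.3: marginalized posterior variances, swept forward in time.
s_tilde = np.empty(T)
s_tilde[0] = s_ring[0]
for t in range(1, T):
    s_tilde[t] = s_ring[t] + s_breve[t] ** 2 * s_tilde[t - 1]

# Equation 5.5: this source's contribution to C_q (sum of negative entropies
# of the conditional gaussians).
C_q_source = -0.5 * np.sum(np.log(2 * np.pi * np.e * s_ring))

# Monte Carlo check: sample the chain defined by equations 4.13 and 4.14.
n = 200_000
s = np.empty((n, T))
s[:, 0] = s_bar[0] + np.sqrt(s_ring[0]) * rng.standard_normal(n)
for t in range(1, T):
    mean = s_bar[t] + s_breve[t] * (s[:, t - 1] - s_bar[t - 1])
    s[:, t] = mean + np.sqrt(s_ring[t]) * rng.standard_normal(n)

print(s_tilde)
print(s.var(axis=0))  # close to s_tilde
```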
Evaluation of the data part (equation 4.2) of the KL cost function C_KL is more involved. For notational simplicity, we denote by E_q{y} the expectation of the random variable y over the distribution q(S, µ). This notation has already been used in equations 4.1 and 4.2. Consider now evaluation of expectations of the form E_q{−ln p(γ | ···)}, where p(γ | ···) is the pdf of a random variable γ conditioned on some quantities. Assume further that the prior distribution of γ is gaussian, having the general form N[γ; m_γ; exp(2v_γ)]. Using the fact that γ, m_γ, and v_γ are independent, one can show (see section A.2) that
$$E_q\{-\ln p(\gamma \mid m_\gamma, v_\gamma)\} = \frac{1}{2}\left[(\bar{\gamma} - \bar{m}_\gamma)^2 + \tilde{\gamma} + \tilde{m}_\gamma\right] \exp(2\tilde{v}_\gamma - 2\bar{v}_\gamma) + \bar{v}_\gamma + \frac{1}{2}\ln(2\pi). \tag{5.6}$$
Here γ̄, m̄_γ, and v̄_γ denote the posterior means of γ, m_γ, and v_γ, respectively, and γ̃, m̃_γ, and ṽ_γ their posterior variances. Formula 5.6 can be directly applied to the distributions 3.7 through 3.19 of all the parameters and hyperparameters µ. Moreover, it can be applied to the posterior distributions 4.6 of the components x_i(t) of the data vectors x(t) as well, yielding
$$E_q\{-\ln p(x_i(t) \mid \mathbf{s}(t), \mu)\} = \frac{1}{2}\left\{[x_i(t) - \bar{f}_i(\mathbf{s}(t))]^2 + \tilde{f}_i(\mathbf{s}(t))\right\} \exp(2\tilde{v}_{n_i} - 2\bar{v}_{n_i}) + \bar{v}_{n_i} + \frac{1}{2}\ln(2\pi). \tag{5.7}$$
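Formula 5.6 is easy to verify by Monte Carlo: draw γ, m_γ, and v_γ from their (independent) posteriors and average −ln p(γ | m_γ, v_γ). The posterior means and variances below are arbitrary illustrative values, not quantities from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior summaries (illustrative): means and variances of
# gamma, m_gamma, and v_gamma under the factorized approximation q.
g_bar, g_til = 0.5, 0.2
m_bar, m_til = 0.1, 0.3
v_bar, v_til = 0.2, 0.05

# Closed form, equation 5.6.
closed = (0.5 * ((g_bar - m_bar) ** 2 + g_til + m_til)
          * np.exp(2 * v_til - 2 * v_bar)
          + v_bar + 0.5 * np.log(2 * np.pi))

# Monte Carlo estimate of E_q{-ln p(gamma | m, v)}, where
# p(gamma | m, v) = N[gamma; m; exp(2v)].
n = 400_000
g = g_bar + np.sqrt(g_til) * rng.standard_normal(n)
m = m_bar + np.sqrt(m_til) * rng.standard_normal(n)
v = v_bar + np.sqrt(v_til) * rng.standard_normal(n)
neg_log_p = 0.5 * np.log(2 * np.pi) + v + 0.5 * (g - m) ** 2 * np.exp(-2 * v)

print(closed, neg_log_p.mean())  # the two values agree closely
```

The exp(2ṽ − 2v̄) factor is simply E[exp(−2v)] for a gaussian v, that is, the posterior expectation of the precision parameterized by v.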
In this case, γ = x_i(t) is a nonrandom quantity, so its mean is simply γ̄ = x_i(t) and its variance γ̃ = 0. For computing the expectations 5.7, one must evaluate the posterior means f̄_i and variances f̃_i of the components f_i of the nonlinear mapping f. These, in turn, depend on the posterior means and variances of the weights and sources. The posterior variances of the sources are obtained from equation 5.3. Approximate formulas for f̄_i and f̃_i can be derived by expanding f into its Taylor series (Lappalainen & Honkela, 2000). These formulas are given and discussed in section A.3. The calculation of E_q{−ln p(s(t) | s(t − 1), µ)} is otherwise similar to equation 5.7, but the interaction of the conditional distributions 4.9 and 4.12 of the sources s(t) gives rise to extra terms. More specifically, the posterior dependence of both the source signal s_i(t) and the nonlinear mapping g_i(s(t − 1)) on the previous value s_i(t − 1) of that source results in a cross-term that can be approximated by making use of the Taylor series expansion of the nonlinearity g (see section A.4). The terms due to the posterior variances of the outputs of the nonlinear mappings f and g have the highest computational
load in evaluating the cost function, because they require evaluating the Jacobian matrices of these nonlinearities. Collecting and summing up the terms contributing to the cost function 4.2, the cost C_p due to the data can be computed in a similar manner as was done in equation 5.5 for the cost C_q. Probably the easiest way to think of the data part C_p of the KL cost function C_KL is the following. Each single parameter h_j contributes one term to the cost C_p, which depends on the posterior mean h̄_j and variance h̃_j of that parameter. Quite similarly, each source signal s_i(t) gives rise to a term depending on the posterior mean s̄_i(t) and variance s̃_i(t) of that particular source. Furthermore, each component x_i(t) of the data vectors also contributes one term to the cost C_p. The cost C_q due to the approximation, given in equation 5.5, also has one term for each parameter h_j and source signal s_i(t). Because these terms are entropies of gaussian pdf's, they depend solely on the posterior variances h̃_j and s̊_i(t). The full form of the cost function C_KL is given in section A.5.

5.3 Backward Phase: Updating the Parameters and the Sources. In our ensemble learning method, a posterior pdf of its own is assigned to each variable appearing in the computations. This posterior pdf is described in terms of a small number of parameters characterizing it, typically its mean and variance. Such a representation is called a posterior summary. If the posterior pdf in question is gaussian, the summary defines it completely; otherwise, it gives at least a rough idea of the pdf. In the backward phase, the posterior summary of each variable is updated by assuming that the summaries of the other variables remain constant. This learning stage takes place in the following order:
1. The sources and weights are updated.
2. The parameters of the noise and weight distributions are updated.
3. The hyperparameters are updated.

The posterior summaries are updated by solving for the zero of the gradients of the cost function C_KL. The gradients are evaluated for the mean and variance of the posterior summary of each needed quantity. This takes place in terms of a backpropagation-type algorithm by reversing the steps in the forward computations and propagating the gradients backward. The update formulas are given and discussed in sections 5.4 and A.6. Computation of the new posterior summaries is approximate for the values before the nonlinearities, that is, for the sources and for the weight matrices A, C and bias vectors a, b of the first layer. In practice, it is necessary to update the weights and sources simultaneously to keep the computation time tolerable. However, the learning process must then be modified to take into account the couplings between these quantities. Section 5.3.3 shows how this can be done.
Updating the hyperparameters is simpler because the zero of the gradient can be solved exactly. These updates are discussed in section 5.3.2. Finally, we briefly discuss initialization of the learning process in the last subsection.

5.3.1 Updating the Sources and Weights. All the weights of the MLP networks 3.3 and 3.4 are assumed to have gaussian pdf's detailed in equations 3.7 through 3.14. Denoting any of these weights generically by w_i, the mean w̄_i and variance w̃_i of w_i specify its gaussian posterior pdf N[w_i; w̄_i; w̃_i] completely. The variances w̃_i are obtained by differentiating the KL cost function C_KL with respect to w̃_i (Lappalainen & Honkela, 2000; Valpola, 2000b):
$$\frac{\partial C_{KL}}{\partial \tilde{w}_i} = \frac{\partial C_q}{\partial \tilde{w}_i} + \frac{\partial C_p}{\partial \tilde{w}_i} = \frac{\partial C_p}{\partial \tilde{w}_i} - \frac{1}{2\tilde{w}_i}. \tag{5.8}$$
Equating this to zero yields a fixed-point iteration for updating the variances:
$$\tilde{w}_i = \left[ 2\, \frac{\partial C_p}{\partial \tilde{w}_i} \right]^{-1}. \tag{5.9}$$
For a linear gaussian model, the cost C_p would be quadratic with respect to the posterior means w̄_i, and a Newton iteration would thus converge in one step. It would also hold that ∂²C_p/∂w̄_i² = 2 ∂C_p/∂w̃_i. In the nonlinear model, we can assume that this holds approximately (Lappalainen & Honkela, 2000; Valpola, 2000b) and use the following update rule:
$$\bar{w}_i \leftarrow \bar{w}_i - \frac{\partial C_p}{\partial \bar{w}_i} \left[ \frac{\partial^2 C_{KL}}{\partial \bar{w}_i^2} \right]^{-1} \approx \bar{w}_i - \frac{\partial C_p}{\partial \bar{w}_i}\, \tilde{w}_i. \tag{5.10}$$
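In the linear gaussian case, updates 5.9 and 5.10 indeed reach the optimum in a single step. The sketch below checks this for a single weight w with data x_n = w + noise of known variance, where C_p = Σ_n [(x_n − w̄)² + w̃]/(2σ²) up to constants. This toy setup is illustrative, not the article's MLP model.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.5                       # known observation noise variance
x = 1.7 + np.sqrt(sigma2) * rng.standard_normal(100)
N = len(x)

w_bar, w_til = 0.0, 1.0            # arbitrary starting posterior summary

# dC_p/dw_til = N / (2*sigma2), so update 5.9 gives the exact posterior
# variance of w under a flat prior in one step:
dCp_dvar = N / (2 * sigma2)
w_til = 1.0 / (2 * dCp_dvar)       # equation 5.9 -> sigma2 / N

# dC_p/dw_bar = N * (w_bar - mean(x)) / sigma2; update 5.10 then lands
# exactly on the sample mean in a single step:
dCp_dmean = N * (w_bar - x.mean()) / sigma2
w_bar = w_bar - dCp_dmean * w_til  # equation 5.10

print(w_bar, x.mean())    # equal
print(w_til, sigma2 / N)  # equal
```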
Formulas 5.9 and 5.10 have a central role in learning. For linear gaussian models, they give the optimum in one step, assuming that only one parameter is changing. In the static nonlinear factor analysis (NFA) method, which does not include the dynamic model 3.2 for the sources s(t), the means s̄_i(t) and variances s̃_i(t) of the gaussian posterior distributions of the sources s_i(t) are updated quite similarly as in equations 5.9 and 5.10. However, the dynamic model 3.2 makes these updates somewhat more involved. In the computation of the KL cost function C_KL, the marginalized posterior variances s̃_i(t) of the sources are used as intermediate variables. Hence, the gradients of the parameters of the posterior pdf's of the sources must also be computed through these variables. These gradients can be evaluated by applying the chain rule to the forward-phase recursion formula 5.3 for the
variances s̃_i(t). The derivations and formulas are presented in section A.6. Here we give the resulting adaptation formulas for the linear dependence parameter s̆_i(t, t − 1) and the conditional posterior variance s̊_i(t):
$$\breve{s}_i(t, t-1) = \frac{\partial g_i(t)}{\partial s_i(t-1)}\, \exp(2\tilde{v}_i - 2\bar{v}_i) \left[ 2\, \frac{\partial C_{KL}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} \right]^{-1} \tag{5.11}$$
$$\mathring{s}_i(t) = \left[ 2\, \frac{\partial C_{KL}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} \right]^{-1}. \tag{5.12}$$
5.3.2 Updating the Other Parameters. Updating the parameters of the noise and weight distributions and the hyperparameters is simpler because the forward and backward phases are not needed for them. For these parameters, the means of their gaussian posterior distributions are updated directly. This is done after every sweep, once all the sources and the weights have been updated. The respective posterior variances are adapted using Newton's iteration. Consider first the updating of the means, recalling that all the parameters considered in this section have gaussian pdf's. A well-known result from classical estimation theory states that the mean of a gaussian pdf has a gaussian posterior pdf when the other parameters are assumed to be fixed and the prior distributions are gaussian (Gelman et al., 1995; Sorenson, 1980). Updating is then simple. In particular, let γ_i, i = 1, 2, ..., be parameters that are gaussian distributed with a common mean m_γ and variance exp(2v_γ):
$$p(\gamma_i \mid m_\gamma, v_\gamma) = N[\gamma_i;\, m_\gamma;\, \exp(2v_\gamma)]. \tag{5.13}$$
Then the contribution arising from the parameters γ_i, i = 1, 2, ..., to the cost function C_p is
$$\sum_i C_p(\gamma_i) = \sum_i E_q\{-\ln p(\gamma_i \mid m_\gamma, v_\gamma)\} = \sum_i \left\{ \frac{1}{2}\ln(2\pi) + \bar{v}_\gamma + \frac{1}{2}\left[(\bar{\gamma}_i - \bar{m}_\gamma)^2 + \tilde{\gamma}_i + \tilde{m}_\gamma\right] \exp(2\tilde{v}_\gamma - 2\bar{v}_\gamma) \right\}, \tag{5.14}$$
where we have used the result 5.6. Similar terms, but without the summation, are obtained for the hyperparameters m_γ and v_γ:
$$C_p(m_\gamma) = \frac{1}{2}\ln(2\pi) + \bar{v}_{m_\gamma} + \frac{1}{2}\left[(\bar{m}_\gamma - \bar{m}_{m_\gamma})^2 + \tilde{m}_\gamma + \tilde{m}_{m_\gamma}\right] \exp(2\tilde{v}_{m_\gamma} - 2\bar{v}_{m_\gamma}) \tag{5.15}$$
$$C_p(v_\gamma) = \frac{1}{2}\ln(2\pi) + \bar{v}_{v_\gamma} + \frac{1}{2}\left[(\bar{v}_\gamma - \bar{m}_{v_\gamma})^2 + \tilde{v}_\gamma + \tilde{m}_{v_\gamma}\right] \exp(2\tilde{v}_{v_\gamma} - 2\bar{v}_{v_\gamma}). \tag{5.16}$$
The hyperparameters m_γ and v_γ are updated by differentiating the cost function C_KL with respect to m̄_γ, m̃_γ, v̄_γ, and ṽ_γ and solving for the zero. For m̄_γ and m̃_γ, the equivalents of equations 5.9 and 5.10 give the optimum in one step, but for v̄_γ and ṽ_γ the resulting equations need to be solved numerically by iteration. For v̄_γ, for instance, the problem is essentially to solve v from a + bv + c exp(−2v) = 0, where a, b, and c are constants. This equation is solved by Newton iteration.

5.3.3 Correction of the Step Size. Formulas 5.9 and 5.10 can be used as such for updating the posterior pdf's of the sources and weights if each of these quantities is adapted sequentially, independently of the other weights and sources. However, in practice, it is necessary to update the sources and weights simultaneously. This reduces the computational load of the proposed method dramatically, because it is almost as easy to compute new posterior summaries for all these variables simultaneously as for a single variable. Even then, the computation time required by our method tends to be high, because the evaluation of the cost function in the forward step requires a lot of computation. The drawback of parallel updating of the weights and sources is that the assumption of independent updates is violated for them. The error caused by parallel updates is compensated as follows. For the weights w_i, a learning parameter α_i is introduced for each posterior mean w̄_i and, quite similarly, β_i for each posterior variance w̃_i, to stabilize the oscillations caused by parallel updating. This is the standard method used in the context of fixed-point algorithms. The oscillations can be detected by monitoring the directions of the updates of individual parameters. The learning parameter α_i is updated using the rule
$$\alpha_i \leftarrow \begin{cases} 0.8\,\alpha_i & \text{if the sign of the update changed} \\ \min(1,\; 1.05\,\alpha_i) & \text{if the sign of the update remains the same.} \end{cases} \tag{5.17}$$
The parameter β_i is computed using the same rule.
This provides the following refined fixed-point update rules for the posterior means and variances of the weights:
$$\bar{w}_i \leftarrow \bar{w}_i - \alpha_i\, \frac{\partial C_p}{\partial \bar{w}_i}\, \tilde{w}_i \tag{5.18}$$
$$\tilde{w}_i \leftarrow \left[ 2\, \frac{\partial C_p}{\partial \tilde{w}_i} \right]^{-\beta_i} \tilde{w}_i^{\,1-\beta_i}. \tag{5.19}$$
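Rules 5.17 through 5.19 can be sketched as follows; oscillation detection compares the sign of the current and previous gradient for each parameter. The function and variable names are illustrative.

```python
import numpy as np

def update_rates(rates, grad, prev_grad):
    """Equation 5.17: shrink the rate where the update direction flipped,
    grow it (capped at 1) where the direction is unchanged."""
    flipped = np.sign(grad) != np.sign(prev_grad)
    return np.where(flipped, 0.8 * rates, np.minimum(1.0, 1.05 * rates))

def refined_updates(w_bar, w_til, dCp_dmean, dCp_dvar, alpha, beta):
    """Equations 5.18 and 5.19: damped mean update, and a weighted
    geometric mean between the old variance and its fixed point."""
    new_mean = w_bar - alpha * dCp_dmean * w_til
    new_var = (2.0 * dCp_dvar) ** (-beta) * w_til ** (1.0 - beta)
    return new_mean, new_var

# Small demonstration of the sign rule 5.17:
rates = np.array([0.5, 0.5, 1.0])
grad = np.array([1.0, -2.0, 0.3])
prev = np.array([-1.0, -1.0, 0.2])
print(update_rates(rates, grad, prev))  # [0.4, 0.525, 1.0]
```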
In equation 5.19, a weighted geometric rather than arithmetic mean is applied to the posterior variances of the weights because variance is a scale parameter: the relative effect of adding a constant to the variance varies with the magnitude of the variance, whereas the relative effect of multiplying it by a constant is invariant to the magnitude. For the sources, the Jacobian matrices computed during the updates are used to infer the corrections needed in parallel updating. First, the effect of all the tentative updates of the posterior means of the sources on the vectors x(t) and s(t) is computed, and then this effect is projected back to the sources. This is compared with the result that would be obtained by sequential updating, that is, by updating only one posterior source mean at a time. The amount of the update is then corrected by the ratio of the assumed and actual effects of the proposed updates. The corrections made to the sources are explained in detail in Lappalainen and Honkela (2000) for the static NFA method. For the dynamic model of this article, the situation is slightly more complex. This follows from the fact that the source values s(t − 1) at the previous time step also affect the "prior" p(s(t) | s(t − 1)) of the sources s(t). However, essentially the same correction procedure can be used.

5.4 Initialization with Embedding. The learning procedure starts by initializing the sources to meaningful values. They are not adapted until the estimates of the nonlinear mappings f and g have been improved from their rather random initial guesses. In order to find sources s(t) that describe not only the static features but also the dynamics of the data x(t), the data are augmented by using embedding in the initialization stage. Embedding is a standard technique in the analysis of nonlinear dynamic systems (Haykin & Principe, 1998).
In embedding, time-shifted versions x(t − k) of the data vectors are used in addition to the original ones. This is based on the theoretical result that under suitable conditions, a sequence [x(t) x(t − 1) ... x(t − D)] of the data vectors contains all the information needed to reconstruct the original state if the number D of delays is large enough (Takens, 1981). The solution used here is to initialize the sources to the principal components of 2d + 1 concatenated subsequent data vectors:
$$\mathbf{z}^T(t) = [\mathbf{x}^T(t+d)\; \mathbf{x}^T(t+d-1)\; \cdots\; \mathbf{x}^T(t-d+1)\; \mathbf{x}^T(t-d)]. \tag{5.20}$$
The dimension of z(t) is 2d + 1 times that of the original data vectors. In general, the state of a dynamic system is nonlinearly embedded in z(t). However, the principal components of the concatenated vectors z already contain useful information about the temporal behavior, giving good initial values for the sources. During the first sweeps, the data vectors x(t) are replaced by the concatenated vectors z(t). This encourages the model to find sources that describe the dynamics even when the dynamic mapping does not make use of these sources.
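A minimal sketch of this initialization, assuming a data matrix X of shape (T, n): build the concatenated vectors of equation 5.20 and project them onto their leading principal components to obtain initial sources. The dimensions, the delay d, and the choice of m below are illustrative.

```python
import numpy as np

def embed(X, d):
    """Stack 2d+1 time-shifted copies of the rows of X (equation 5.20).
    Returns Z with shape (T - 2d, (2d + 1) * n)."""
    T = X.shape[0]
    cols = [X[d + k : T - d + k] for k in range(d, -d - 1, -1)]
    return np.hstack(cols)

def pca_init(Z, m):
    """Initial sources: projections of Z onto its m leading principal
    components, obtained from the SVD of the centered data."""
    Zc = Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:m].T

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 10))  # toy data: 1000 steps, 10 dimensions
Z = embed(X, d=2)                    # 5 concatenated vectors, dimension 50
S0 = pca_init(Z, m=9)
print(Z.shape, S0.shape)             # (996, 50) (996, 9)
```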
After some iterations, the dynamic mapping starts using them to predict the future values of the sources and the augmented part of the data. The corresponding parts of the observation network (matrix B and vector b) can then be removed. After that, the learning process continues using the original data x(t).

6 Experiments

6.1 Data. The working hypothesis tested was that our method should find a representation in which the dynamics of the process that generated the observed data can be modeled more easily than from the data directly. The dynamic process used to test our method was a combination of three independent dynamic processes with a total dimension of eight. One of the processes was a harmonic oscillator with angular velocity 1/3. The harmonic oscillator has a two-dimensional state representation and linear dynamics. The two other dynamic processes were chosen to be independent Lorenz processes (Lorenz, 1963). The time series of the eight states of the combined process are shown in the top part of Figure 3. There, the three uppermost signals correspond to the first Lorenz process, the next three to the second Lorenz process, and the last two to the harmonic oscillator. The Lorenz system has a three-dimensional state-space whose dynamics are governed by the following set of differential equations:
$$\frac{dz_1}{dt} = \sigma(z_2 - z_1) \tag{6.1}$$
$$\frac{dz_2}{dt} = r z_1 - z_2 - z_1 z_3 \tag{6.2}$$
$$\frac{dz_3}{dt} = z_1 z_2 - b z_3. \tag{6.3}$$
The parameter vectors [σ r b] of the Lorenz processes were [3 26.5 1] and [4 30 1]. The Lorenz system has three-dimensional chaotic nonlinear dynamics. The time evolution of the state of the sampled first Lorenz process is shown in Figure 4. To make the learning problem more challenging, the dimensionality of the original state-space was first reduced to five using a linear projection. These five projections are depicted in the middle part of Figure 3. Now the dynamics of the observations are needed in order to reconstruct the original state-space, since only five of the eight dimensions are visible. Finally, the 10-dimensional data vectors x(t) shown in the bottom part of Figure 3 were generated by nonlinearly mixing the five linear projections and then adding gaussian noise. The standard deviations of the signal and noise were 1 and 0.1, respectively. The nonlinear mixing was carried out using an MLP network that had randomly chosen weights and used the sinh⁻¹ nonlinearity.
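The data-generation procedure can be sketched as follows: simulate the two Lorenz processes and the oscillator, project the eight states to five dimensions, and mix them through a random sinh⁻¹ MLP with additive noise. The Lorenz parameters, projection dimension, and noise level are taken from the text; the integration step, normalization, and network width are illustrative guesses.

```python
import numpy as np

def lorenz(T, sigma, r, b, dt=0.01, z0=(1.0, 1.0, 1.0)):
    """Euler-integrate the Lorenz equations 6.1 through 6.3."""
    z = np.empty((T, 3))
    z[0] = z0
    for t in range(1, T):
        z1, z2, z3 = z[t - 1]
        z[t] = z[t - 1] + dt * np.array(
            [sigma * (z2 - z1), r * z1 - z2 - z1 * z3, z1 * z2 - b * z3])
    return z

rng = np.random.default_rng(4)
T = 1000
phase = np.arange(T) / 3.0                   # harmonic oscillator, angular velocity 1/3
states = np.hstack([lorenz(T, 3.0, 26.5, 1.0),
                    lorenz(T, 4.0, 30.0, 1.0),
                    np.c_[np.cos(phase), np.sin(phase)]])  # (T, 8) combined states
states = (states - states.mean(0)) / states.std(0)

P = rng.standard_normal((8, 5))              # linear projection to 5 dimensions
proj = states @ P

# Random MLP mixing with the sinh^-1 (arcsinh) nonlinearity, plus noise.
A, B = rng.standard_normal((5, 20)), rng.standard_normal((20, 10))
X = np.arcsinh(proj @ A) @ B
X = X / X.std(0)                             # signal standard deviation 1
X = X + 0.1 * rng.standard_normal(X.shape)   # noise standard deviation 0.1

print(X.shape)  # (1000, 10)
```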
Figure 3: (Top) The eight plots show the eight states defining the dynamics of the underlying process. The process is composed of two Lorenz processes, each having three states, and a harmonic oscillator that has two states. (Middle) The five projections used for generating the observations. (Bottom) The 10 plots represent the noisy observations obtained through a nonlinear mapping from the five projections.
Figure 4: The original sampled Lorenz process.
The processes used in the experiments are deterministic, but equation 3.2 includes a noise term. The noise term is needed even in the case of a deterministic underlying process because there is always some degree of modeling error. The method can also learn noisy processes, but it would then be much more difficult to evaluate the results. We use prediction of the future data to measure the performance of the learned dynamical process, but there would be no single true continuation of the data if the process were noisy.

6.2 Results. The proposed method was used for learning a dynamic model of the observations. (Matlab source code for the simulations is available at http://www.cis.hut.fi/projects/ica/bayes/.) Several different random initializations of the MLP networks and structures of the model were tested. For the first 500 sweeps, the concatenated vectors z(t) defined in equation 5.20 were used instead of the original data vectors x(t), as explained in section 5.4. The suitable number of concatenated data vectors depends on the problem at hand. In these experiments, the dimension of z(t) was 50; that is, five consecutive original data vectors form z(t). After 500 iterations, the time-lagged parts of z(t) were removed, leaving only x(t), and the observation MLP (equation 3.3) was reduced accordingly. In all the simulations, both the MLP network (equation 3.3) describing the observations x(t) and the MLP network (equation 3.4) describing the dynamics of the sources s(t) had one hidden layer of 30 neurons, but the size
Figure 5: The best values of the cost function and the averages of eight simulations with different initializations are shown for different numbers of sources after 1 million iterations.
of the number of sources was varied between 8 and 11. It turned out that in the beginning of the learning, a model with 8 sources attained the lowest values of the cost function, but as the learning continued, models with 9 sources achieved the lowest values. Figure 5 shows the best values of the cost function and the average over eight different random initializations as a function of the number of sources. After 7500 sweeps, the method had learned a dynamic process that was able to represent the observation vectors x(t). The standard deviation of the observations was estimated to be 0.106 on average, which is in reasonably good agreement with the actual value of 0.1. The quality of the dynamic model learned by our method was tested by predicting 1000 new source vectors using the estimated mapping g from the recursion s(t) = g(s(t − 1)). The estimated and predicted posterior means of the sources are shown in the top part of Figure 6. Our earlier experiments (Honkela & Karhunen, 2001; Hyvärinen et al., 2001; Lappalainen & Honkela, 2000; Valpola et al., 2000) with the static nonlinear factor analysis (NFA) and nonlinear independent factor analysis (NIFA) methods indicated that 7500 sweeps were sufficient for learning the factors or sources and the nonlinear mapping f. Clearly, more sweeps
2676
Harri Valpola and Juha Karhunen
Figure 6: (Top) The eight time series show the sources s(t) of the best model after 7500 iterations. (Bottom) The nine time series depict the sources of the best model after 500,000 iterations. The first 1000 values have been estimated based on the observations, and the following 1000 values have been predicted using s(t) = g(s(t−1)), that is, without the innovation process.
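The prediction scheme described in the caption, running the learned dynamics forward without the innovation process, amounts to a simple recursion. A sketch, with a stand-in linear map in place of the learned MLP g:

```python
import numpy as np

def predict_sources(g, s_last, n_steps):
    """Predict future sources by iterating s(t) = g(s(t-1)),
    that is, by running the learned dynamics without the
    innovation (noise) process."""
    preds, s = [], s_last
    for _ in range(n_steps):
        s = g(s)
        preds.append(s)
    return np.array(preds)

# Stand-in dynamics: a stable linear map (in the paper, the
# learned g is an MLP network).
g = lambda s: 0.9 * s
future = predict_sources(g, np.ones(3), 1000)
print(future.shape)  # (1000, 3)
```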
are needed for completely learning the underlying dynamic process and the mapping g. Most of the learning process had ended after 100,000 sweeps, but some progress was observed even after that. After 500,000 sweeps, there was no significant improvement, and the simulations were finished after 1,000,000 sweeps. In any case, the experiments confirmed that the introduced ensemble learning method is robust against overlearning. Hence, there is no need to control the complexity of the resulting mappings by stopping learning early. The lower part of Figure 6 shows the posterior means of the estimated and predicted sources at the end of learning, after 500,000 sweeps. Visual inspection of the time series in Figure 6 confirms that the method is able to capture the characteristics of the dynamics of the data-generating process. It also shows that only eight of the nine factors are actually used at the end. However, it is difficult to compare the estimated dynamics with the original one by looking only at the predicted sources s(t). This is because the
Figure 7: The estimated original eight-dimensional state of the dynamic process used for generating the data, reconstructed from the predicted sources s(t) shown in Figure 6. After 7500 sweeps (top), the estimated model is able to follow the dynamics of the process for a while. After 500,000 sweeps (bottom), the essential characteristics of the dynamics are captured with high fidelity.
model learned by the method uses a state representation that differs from the original one. Two processes can be considered equivalent if their state representations differ only by an invertible nonlinear transformation. Because the original states of the studied process are known, it is possible to examine the dynamics in the original state-space. An MLP network was used for finding the mapping from the learned eight- and nine-dimensional sources or factors to the original eight-dimensional states. The mapping was then used for visualizing the dynamics in the original state-space. Figure 7 shows the reconstruction of the original states made from the predicted states s(t). It is evident that the sources contain all the required information about the state of the underlying dynamic process, because the reconstructions are quite good for 1 ≤ t ≤ 1000 even after 7500 sweeps. Initially, our method does not model the dynamics accurately enough to be able to simulate the long-term behavior of the process, but after 500,000 sweeps, the dynamics of all three underlying subprocesses are well captured.
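The visualization step described above, fitting a regressor from the learned sources to the known original states, can be sketched as follows. The data, the tiny MLP, and its plain gradient-descent training are stand-ins for illustration, not the authors' setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: "learned sources" S and known "original states" Y,
# related here by a simple invertible nonlinear transformation.
S = rng.uniform(-1, 1, size=(500, 2))
Y = np.tanh(S @ np.array([[1.0, 0.5], [-0.3, 1.2]]))

# Tiny one-hidden-layer MLP regressor fitted by full-batch gradient
# descent; a sketch of the visualization step only.
W1 = 0.5 * rng.standard_normal((2, 20)); b1 = np.zeros(20)
W2 = 0.5 * rng.standard_normal((20, 2)); b2 = np.zeros(2)
lr = 0.1
for _ in range(5000):
    H = np.tanh(S @ W1 + b1)        # hidden activations
    P = H @ W2 + b2                 # predicted original states
    E = (P - Y) / len(S)            # scaled residual
    dH = (E @ W2.T) * (1 - H ** 2)  # backpropagated error
    W2 -= lr * H.T @ E; b2 -= lr * E.sum(0)
    W1 -= lr * S.T @ dH; b1 -= lr * dH.sum(0)

mse = np.mean((np.tanh(S @ W1 + b1) @ W2 + b2 - Y) ** 2)
print("reconstruction MSE:", float(mse))
```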
Figure 8: Three sources of the estimated process that correspond to the original process shown in Figure 4.
In Figure 8, the estimated states corresponding to the Lorenz process shown in Figure 4 are plotted after 500,000 sweeps. Note that the MLP networks modeling f and g could have represented any rotation of the state-space. Hence, the separation of the independent processes is due only to the form of the posterior approximation. The ability to separate independent processes is important from a practical point of view, because it usually makes it much easier to interpret the state representation. This makes it possible, for example, to analyze the type of change in a process change-detection problem, as was proposed in Iline, Valpola, and Oja (2001). The quality of the estimate of the underlying process was tested by studying the prediction accuracy for new samples. It should be noted that since the Lorenz processes are chaotic, the best that any method can do is to capture their general long-term behavior; exact numerical prediction is possible only in the short term. Figure 7 confirms that our method achieves this goal. The proposed approach was compared with a nonlinear autoregressive (NAR) model,

$$x(t) = h(x(t-1), \ldots, x(t-d)) + n(t), \tag{6.4}$$

which makes its predictions directly in the observation space. The nonlinear mapping h(·) was again modeled by an MLP network, but with the standard backpropagation algorithm applied in learning the NAR model. Several strategies
Figure 9: The 10 plots show the original observations x(t) (1 ≤ t ≤ 1000) and the prediction results (t ≥ 1001) given by a nonlinear autoregressive model trained using the standard backpropagation algorithm.
were tested, and the best performance was provided by an MLP network with 20 inputs and one hidden layer of 30 neurons. The number of delays was d = 10, and the dimension of the inputs to the MLP network h(·) was compressed from 100 to 20 using standard PCA. Figure 9 shows the performance of this NAR model. It is evident that the prediction of new observations is a challenging problem. The principal component analysis used in the model can extract features useful in predicting the future observations, but clearly the features are not adequate for modeling the long-term behavior of the observations. The noise process was omitted, as in the results shown in Figures 6 and 7. Both the NAR model and the model used in our method contain a noise process that can be taken into account in the prediction by Monte Carlo simulation. Figure 10 shows the results averaged over 100 Monte Carlo simulations. The average predicted observations are compared to the actual continuation of the process, and the figure shows the average cumulative squared prediction error. Since the variance of the noise on the observations is 0.01, the results can be considered perfect if the average squared prediction error is 0.01. The signal variance is 1, which gives the practical upper bound of the prediction error. The figure confirms that our method has been able to model the dynamics of the process, because the short-term average prediction error is close to the noise variance. Even the long-term prediction error falls below the signal variance, which indicates that at least some part of the process can be predicted in the more distant future. The steady progress made during learning is also evident. Already after 7500 sweeps, our method is better than the NAR-based method at predicting the process x(t) a few steps ahead. Performance improves greatly when learning is continued. The final predictions of our method after 500,000 sweeps are excellent up to time t = 1010 and good up
Figure 10: The average cumulative squared prediction error for the nonlinear autoregressive model (solid with dots) and for our dynamic algorithm with different iteration numbers. Each result is for the model having the lowest value of the cost function at that stage.
to t = 1022, while the NAR method is quite inaccurate already for t ≥ 1003. Here, the prediction started at time t = 1000. We have also conducted experiments with recurrent neural networks, which provide slightly better results than the NAR model but significantly worse than our method. Results for individual time series have been presented in Valpola (2000c). An interesting finding was that the method always converged to almost the same representation of the state-space. This cannot be completely due to similar initializations of the sources, because the retrieved sources had different orderings and signs. Although the state representations were similar, the prediction performance varied significantly. This is probably due to local minima of the MLP network modeling the dynamics and to the chaotic nature of the underlying process, which emphasizes any differences in the quality of the estimated model. It seems that the value of the cost function for the learning data is a relatively reliable indicator of the quality of the model, as the model with the best value of the cost function was also among the best in terms of prediction accuracy.
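The evaluation used for Figure 10, averaging noisy forward simulations over Monte Carlo runs and accumulating the squared prediction error, can be sketched like this, again with a stand-in linear map in place of the learned dynamics:

```python
import numpy as np

def mc_predict(g, s0, n_steps, noise_std, n_runs, rng):
    """Average predictions over Monte Carlo runs of the noisy
    dynamics s(t) = g(s(t-1)) + n(t)."""
    out = np.zeros((n_runs, n_steps, len(s0)))
    for r in range(n_runs):
        s = s0.copy()
        for t in range(n_steps):
            s = g(s) + noise_std * rng.standard_normal(len(s0))
            out[r, t] = s
    return out.mean(axis=0)

def cumulative_mse(pred, actual):
    """Average cumulative squared prediction error up to each horizon."""
    se = ((pred - actual) ** 2).mean(axis=1)
    return np.cumsum(se) / np.arange(1, len(se) + 1)

rng = np.random.default_rng(1)
g = lambda s: 0.95 * s                    # stand-in dynamics

# Noise-free continuation serves as the "actual" future here.
s, traj = np.ones(2), []
for _ in range(20):
    s = g(s)
    traj.append(s)
actual = np.array(traj)

pred = mc_predict(g, np.ones(2), 20, noise_std=0.1, n_runs=100, rng=rng)
err = cumulative_mse(pred, actual)
print(err.shape)  # (20,)
```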
7 Discussion
Our experiments have shown that in its current form, the described method is able to learn up to about 15-dimensional latent spaces, or to blindly extract roughly 15 source processes. Unsupervised linear techniques such as PCA and ICA can handle clearly higher-dimensional latent spaces. However, our method is nonlinear and should therefore be compared with other unsupervised nonlinear techniques. In nonlinear kernel models such as the self-organizing map (Haykin, 1998; Principe et al., 2000), the number of parameters scales exponentially with the dimension of the latent space. This restricts in practice the use of such methods to low-dimensional latent spaces, the typical dimension being only two. This is also the case with almost all the methods proposed thus far for nonlinear BSS and ICA (Hyvärinen et al., 2001). In our method, latent (source) space dimensions of the order of tens are attainable by splitting the model into parts. This approach was applied to modeling magnetoencephalographic data in Särelä, Valpola, Vigário, and Oja (2001). The emphasis in our method is on finding a good model that describes the observed data well. A sufficiently good posterior approximation is important, but in large problems at least, resources are better invested in improving the model structure and prior information than in improving the quality of the posterior approximation. The model is nonlinear, and consequently the actual posterior cannot be described accurately by any simple analytic form. We have used a gaussian approximation mainly because of its computational tractability. Posterior densities are asymptotically gaussian, but that result is not applicable to the sources, as the ratio of observations to unknown source variables remains constant. Gamma distributions could have been used as prior distributions for the variance parameters, but that modification would be of little practical significance. The observations induce dependences among variables that were independent a priori.
A key idea in ensemble learning is to neglect many of these dependences and instead try to find the best possible summary of the posterior probability having the given restricted form. In this work, almost all the dependences are neglected; only the posterior dependences of consecutive values of the sources are taken into account. This results in automatic pruning: some sources or parts of the network can be shut down. This can be a useful property, but at the beginning of learning it can be harmful. If the estimates of the mappings are still nearly random, the sources seem useless and are pruned away, and the learning algorithm is unable to escape this local minimum later. Premature pruning is prevented by the initialization procedure described in section 5.4. Future work is needed to decrease the computational load of the method. Currently, learning requires a lot of computer time, easily taking days or even more. The most time-consuming part is the computation of the Kullback-Leibler cost in the forward phase, because every source value,
network weight, and parameter gives rise to a term of its own in the cost function. Especially the evaluation of the Jacobian matrices needed in the Taylor series approximation increases the computation time. At the moment, the computational load in practice increases quadratically with the dimension of the latent space. It is possible to reach linear complexity in this respect, as proposed by Valpola et al. (2001), although the complexity of the mappings that can then be learned may be restricted in practice. A related issue is the need to simplify the method itself. It is quite complicated, and therefore learning can sometimes fail. We are currently working toward that goal (Valpola et al., 2001). As usual, our method is not well suited to all types of data. For achieving good results, the model (equations 3.1 and 3.2) should describe the observed data reasonably well.
8 Conclusion
We have introduced a Bayesian ensemble learning method for the unsupervised extraction of dynamic processes from noisy data. The data are assumed to be generated by an unknown nonlinear mapping from unknown sources or factors. The dynamics of the sources are modeled using a nonlinear state-space model. The nonlinear mappings in the model are represented using MLP networks, which can describe both mild and strong nonlinearities well. Usually, MLP networks learn a nonlinear input-output mapping in a supervised manner from known input-output pairs, for which the mean-square error (MSE) of the mapping is minimized using the well-known backpropagation algorithm. Our learning procedure resembles standard backpropagation in that in the forward phase, the cost function is computed by propagating the input values through the network while keeping its weights frozen, and in the backward phase, the parameters of the network are then updated. However, our learning method differs from backpropagation in several significant aspects. In our case, learning is unsupervised, because only the outputs of the MLP networks are known. The MLP networks are used as generative models for the observations and sources rather than as nonlinear signal transformation tools. The cost function employed is the Kullback-Leibler information instead of the MSE. Probably the most significant difference lies in the Bayesian philosophy, where each variable is described using its own posterior distribution instead of a single point estimate. The proposed method is computationally demanding, but it allows the use of higher-dimensional nonlinear latent variable models than other existing approaches. The new method is able to learn a dynamic process responsible for generating the observations. Experiments with difficult chaotic data show excellent estimation and prediction performance compared with currently available nonlinear prediction techniques.
Appendix: Derivations

A.1 Derivation of Equation 5.3. The result in equation 5.3 follows from equations 4.12 through 4.14. Note that according to equation 4.14, $s_i(t) - \mu_{s_i(t)}$ is gaussian with zero mean and variance $\mathring{s}_i(t)$ and independent of $\mu_{s_i(t)}$. Therefore,

$$\begin{aligned}
\tilde{s}_i(t) &= \mathrm{Var}\{s_i(t) - \mu_{s_i(t)} + \mu_{s_i(t)}\} = \mathrm{Var}\{s_i(t) - \mu_{s_i(t)}\} + \mathrm{Var}\{\mu_{s_i(t)}\} \\
&= \mathring{s}_i(t) + \mathrm{Var}\{\bar{s}_i(t) + \breve{s}_i(t, t-1)[s_i(t-1) - \bar{s}_i(t-1)]\} \\
&= \mathring{s}_i(t) + \mathrm{Var}\{\breve{s}_i(t, t-1)\, s_i(t-1)\} = \mathring{s}_i(t) + \breve{s}_i^2(t, t-1)\, \tilde{s}_i(t-1),
\end{aligned} \tag{A.1}$$
where $\mathrm{Var}\{\cdot\}$ denotes the variance over the distribution $q(S, \theta)$.

A.2 Derivation of Equation 5.6. Assume that the prior distribution of c is gaussian, having the general form $N[c;\, m_c;\, \exp(2v_c)]$. Then

$$-\ln p(c \mid m_c, v_c) = \frac{1}{2}(c - m_c)^2 \exp(-2v_c) + v_c + \frac{1}{2}\ln(2\pi). \tag{A.2}$$

Using the fact that $c$, $m_c$, and $v_c$ are independent, taking the expectation of equation A.2 yields

$$E_q\{-\ln p(c \mid m_c, v_c)\} = \frac{1}{2} E_q\{(c - m_c)^2\}\, E_q\{\exp(-2v_c)\} + E_q\{v_c\} + \frac{1}{2}\ln(2\pi). \tag{A.3}$$

From the identity $\tilde{\theta} = \mathrm{Var}\{\theta\} = E_q\{\theta^2\} - E_q\{\theta\}^2$, it follows that

$$\begin{aligned}
E_q\{(c - m_c)^2\} &= E_q\{c^2\} - 2E_q\{c\, m_c\} + E_q\{m_c^2\} \\
&= \bar{c}^2 + \tilde{c} - 2\bar{c}\,\bar{m}_c + \bar{m}_c^2 + \tilde{m}_c = (\bar{c} - \bar{m}_c)^2 + \tilde{c} + \tilde{m}_c.
\end{aligned} \tag{A.4}$$
The result $E_q\{\exp(-2v_c)\} = \exp(2\tilde{v}_c - 2\bar{v}_c)$ can be derived starting from the definition

$$E_q\{\exp(-2v_c)\} = \frac{1}{\sqrt{2\pi\tilde{v}_c}} \int \exp\!\left(-\frac{(v_c - \bar{v}_c)^2}{2\tilde{v}_c}\right) \exp(-2v_c)\, dv_c \tag{A.5}$$

and manipulating the exponent of the integrand as follows:

$$\begin{aligned}
-\frac{(v_c - \bar{v}_c)^2}{2\tilde{v}_c} - 2v_c &= -\frac{v_c^2 - 2v_c\bar{v}_c + \bar{v}_c^2 + 4v_c\tilde{v}_c}{2\tilde{v}_c} \\
&= -\frac{(v_c - \bar{v}_c + 2\tilde{v}_c)^2 + 4\bar{v}_c\tilde{v}_c - 4\tilde{v}_c^2}{2\tilde{v}_c} \\
&= -\frac{(v_c - \bar{v}_c + 2\tilde{v}_c)^2}{2\tilde{v}_c} + 2\tilde{v}_c - 2\bar{v}_c.
\end{aligned} \tag{A.6}$$
This can be recognized as the expectation of the term $\exp(2\tilde{v}_c - 2\bar{v}_c)$ over $v_c$, where $v_c$ has a gaussian distribution with mean $\bar{v}_c - 2\tilde{v}_c$ and variance $\tilde{v}_c$. As the term is constant with respect to $v_c$, it follows that $E_q\{\exp(-2v_c)\} = \exp(2\tilde{v}_c - 2\bar{v}_c)$. Inserting this result, equation A.4, and $E_q\{v_c\} = \bar{v}_c$ into equation A.3 then yields equation 5.6.

A.3 Posterior Mean and Variance After a Nonlinear Mapping. According to the prior model 3.5 for the observations x(t), $x_i(t)$ is gaussian with mean $f_i(s(t))$ and variance $v_{n_i}$ (here $f_i$ denotes the ith component of the vector f, and similarly $v_{n_i}$ for $v_n$). Reasoning similar to that in section A.2 yields equation 5.7, which requires $\bar{f}_i(s(t))$ and $\tilde{f}_i(s(t))$, the posterior mean and variance of the nonlinear mapping f(s(t)). A detailed derivation of these terms is given in Lappalainen and Honkela (2000). Here, we discuss only the main results. The derivation is based on the following approximations for the mean and variance of the output of a nonlinear scalar function $\varphi(\xi)$:

$$\bar{\varphi}(\xi) \approx \varphi(\bar{\xi}) + \tfrac{1}{2}\varphi''(\bar{\xi})\,\tilde{\xi}, \tag{A.7}$$

$$\tilde{\varphi}(\xi) \approx [\varphi'(\bar{\xi})]^2\,\tilde{\xi}. \tag{A.8}$$

A second-order Taylor series expansion about $\bar{\xi}$ yields equation A.7, while equation A.8 is derived from a first-order expansion. The computations with the mapping f(s(t)) proceed layer by layer much as in an ordinary MLP network, except that means and variances are propagated through the network instead of simple scalar values. The nonlinearities in the hidden layer are handled by the approximations A.7 and A.8. In the first linear mapping from the sources to the hidden layer, the computation of the means and variances of the inputs of the hidden-layer neurons is relatively simple, as all the inputs and weights are mutually independent according to the posterior approximation. This is not the case in the second
layer, as the outputs of the hidden neurons are dependent due to the correlations induced by common inputs. The variance of the output is computed using the formula

$$\tilde{f}_i(s(t)) = \tilde{f}_i^*(s(t)) + \sum_j \tilde{s}_j(t) \left[\frac{\partial f_i(s(t))}{\partial s_j(t)}\right]^2, \tag{A.9}$$
where $\tilde{f}_i^*(s(t))$ denotes the variance resulting from the weights. In an MLP network with one hidden layer, the value of a weight cannot propagate through multiple paths, and therefore $\tilde{f}_i^*(s(t))$ can be computed as in the first layer. Equation A.9 is computationally the most expensive part of the algorithm, as it requires the Jacobian matrix of $f(s(t))$, that is, the terms $\partial f_i(s(t))/\partial s_j(t)$. This essentially requires multiplying two matrices A and B, and the nonlinearity makes it impossible to carry out these computations in advance. As in the standard backpropagation algorithm, the computation of the gradients involves layer-wise computations in reverse order, starting from terms such as $\partial C_p/\partial \tilde{f}_i(s(t))$ and yielding terms such as $\partial C_p/\partial \bar{s}_i(t)$ in the end. The computational complexity of these backward-phase computations is of the same order as that of the forward phase.

A.4 Computation of the Terms $E_q\{-\ln p(s(t) \mid s(t-1), \theta)\}$. The computation of the terms $E_q\{-\ln p(s(t) \mid s(t-1), \theta)\} = \sum_i E_q\{-\ln p(s_i(t) \mid s(t-1), \theta)\}$ in the cost function $C_p$ is very similar to equation 5.7, and the posterior mean $\bar{g}_i(t)$ and variance $\tilde{g}_i(t)$ of the mapping are computed analogously to section A.3. There are two differences. First, the sources $s_i(t)$ are random variables with variance $\tilde{s}_i(t)$. The term thus includes $\tilde{s}_i(t)$ in addition to $\tilde{g}_i(s(t-1))$. More important, the source $s_i(t)$ is assumed to have a posterior dependence on $s_i(t-1)$ according to equations 4.12 through 4.14. This results in an extra term. We can approximate the dependence of $g_i(s(t-1))$ on $s_i(t-1)$ by the linear approximation $g_i(s(t-1)) \approx \bar{g}_i(s(t-1)) + [\partial g_i(s(t-1))/\partial s_i(t-1)]\,[s_i(t-1) - \bar{s}_i(t-1)]$. This means that the counterpart of equation A.4 produces a cross term of the form
$$\begin{aligned}
-2E_q\{s_i(t)\, g_i(s(t-1))\} &\approx -2E_q\Big\{ \big[\bar{s}_i(t) + \breve{s}_i(t, t-1)\,(s_i(t-1) - \bar{s}_i(t-1))\big] \\
&\quad\times \Big[\bar{g}_i(s(t-1)) + \frac{\partial g_i(s(t-1))}{\partial s_i(t-1)}\,(s_i(t-1) - \bar{s}_i(t-1))\Big] \Big\} \\
&= -2\,\bar{s}_i(t)\,\bar{g}_i(s(t-1)) - 2\,\breve{s}_i(t, t-1)\,\frac{\partial g_i(s(t-1))}{\partial s_i(t-1)}\,\tilde{s}_i(t-1).
\end{aligned} \tag{A.10}$$
This differs from equation A.4 by the second term, and the final result is thus

$$\frac{1}{2}\, a_i(t)\, \exp(2\tilde{v}_i - 2\bar{v}_i) + \bar{v}_i + \frac{1}{2}\ln 2\pi, \tag{A.11}$$

$$a_i(t) = [\bar{s}_i(t) - \bar{g}_i(t)]^2 + \tilde{s}_i(t) + \tilde{g}_i(t) - 2\,\breve{s}_i(t, t-1)\,\frac{\partial g_i(t)}{\partial s_i(t-1)}\,\tilde{s}_i(t-1), \tag{A.12}$$
where the ith component of the vector g(s(t−1)) is denoted by $g_i(t)$, and the variance parameter of the ith factor by $v_i$.

A.5 The Full Cost Function. The joint pdf $p(X, S, \theta)$ of the variables of the model is a product of many simple terms defined by equations 3.5 through 3.19. There is one term for each variable of the model. The approximation $q(S, \theta)$ of the posterior pdf of the unknown variables has a similar factorial form, and hence the cost function splits into a sum of many simple terms due to the logarithm in equation 2.6. All the distributions in the model are assumed gaussian, and thus the terms have very similar forms. Each unknown variable generates terms for the numerator and denominator of equation 2.6, while the observed variables X appear only in the denominator. The terms originating from the sources differ from the ones originating from other unknown variables, since some of the posterior correlations of the sources are modeled. Therefore, there are three kinds of terms ($C_h$, $C_s$, and $C_x$), corresponding to general parameters, sources, and data. In general, an unknown variable $\theta$ with prior mean $m$ and prior standard deviation $\exp(v)$ generates the term

$$\begin{aligned}
C_h(\theta, m, v) &= \frac{1}{2}\big[(\bar{\theta} - \bar{m})^2 + \tilde{\theta} + \tilde{m}\big]\exp(2\tilde{v} - 2\bar{v}) + \bar{v} + \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln 2\pi e\tilde{\theta} \\
&= \frac{1}{2}\big[(\bar{\theta} - \bar{m})^2 + \tilde{\theta} + \tilde{m}\big]\exp(2\tilde{v} - 2\bar{v}) + \bar{v} - \frac{1}{2}\ln e\tilde{\theta},
\end{aligned} \tag{A.13}$$
which is a combination of equation 5.6 and the appropriate term from equation 5.5. Note that $C_h(\cdot)$ is a function of the posterior means and variances $\bar{\theta}$, $\tilde{\theta}$, $\bar{m}$, $\tilde{m}$, $\bar{v}$, and $\tilde{v}$, but for short the arguments are written as $\theta$, $m$, and $v$. The terms for the sources are given by equations A.11 and A.12 together with the appropriate term from equation 5.5:
$$C_s(s(t), g(t), v) = \frac{1}{2}\, a(t)\, \exp(2\tilde{v} - 2\bar{v}) + \bar{v} - \frac{1}{2}\ln e\mathring{s}(t), \tag{A.14}$$

$$a(t) = (\bar{s}(t) - \bar{g}(t))^2 + \tilde{s}(t) + \tilde{g}(t) - 2\,\breve{s}(t, t-1)\,\frac{\partial g(t)}{\partial s(t-1)}\,\tilde{s}(t-1). \tag{A.15}$$
The term for an observed variable is similar to equation A.13, except that observed variables do not have a posterior variance and there are no terms originating from equation 5.5:

$$C_x(x, m, v) = \frac{1}{2}\big[(x - \bar{m})^2 + \tilde{m}\big]\exp(2\tilde{v} - 2\bar{v}) + \bar{v} + \frac{1}{2}\ln 2\pi. \tag{A.16}$$
Some of the unknown parameters have hand-set prior means or variances. For instance, all hyperparameters in the model are gaussian with zero mean and standard deviation 100. We shall use a shorthand notation where $\bar{m} = m$ and $\tilde{m} = 0$ if $m$ is a given constant. Thus, the terms for the hyperparameters can be given as

$$C_{\text{hyper}}(\theta) = C_h(\theta, 0, \ln 100) = \frac{1}{2}\,\frac{\bar{\theta}^2 + \tilde{\theta}}{100^2} + \ln 100 - \frac{1}{2}\ln e\tilde{\theta}. \tag{A.17}$$
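The generic term A.13 and its hyperparameter special case A.17 are straightforward to compute; a sketch (the function names are my own):

```python
import numpy as np

def C_h(th_bar, th_var, m_bar, m_var, v_bar, v_var):
    """Generic cost term of equation A.13 for an unknown variable
    theta with prior mean m and prior log-std v (all gaussian under q)."""
    return (0.5 * ((th_bar - m_bar) ** 2 + th_var + m_var)
            * np.exp(2 * v_var - 2 * v_bar)
            + v_bar - 0.5 * np.log(np.e * th_var))

def C_hyper(th_bar, th_var):
    """Hyperparameter term of equation A.17: fixed prior mean 0 and
    standard deviation 100, i.e. m = 0 and v = ln 100 as constants."""
    return C_h(th_bar, th_var, 0.0, 0.0, np.log(100.0), 0.0)

val = C_hyper(1.0, 0.5)
expected = 0.5 * (1.0 + 0.5) / 100**2 + np.log(100) - 0.5 * np.log(np.e * 0.5)
print(np.isclose(val, expected))  # True
```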
We can now write down the full cost function used in this article. The terms of the cost function can be grouped into parts related to the data, the sources, and the mappings f and g. Let us denote these by $C_{\text{data}}$, $C_{\text{sources}}$, $C_f$, and $C_g$, respectively. The full cost function is a sum of these terms:

$$C_{\text{KL}} = C_{\text{data}} + C_{\text{sources}} + C_f + C_g. \tag{A.18}$$
The individual terms are:

$$C_{\text{data}} = \sum_i \sum_t C_x(x_i(t), f_i(t), v_{n_i}) + \sum_i C_h(v_{n_i}, m_{v_n}, v_{v_n}) + C_{\text{hyper}}(m_{v_n}) + C_{\text{hyper}}(v_{v_n}), \tag{A.19}$$

$$C_{\text{sources}} = \sum_i \sum_t C_s(s_i(t), g_i(t), v_{m_i}) + \sum_i C_h(v_{m_i}, m_{v_m}, v_{v_m}) + C_{\text{hyper}}(m_{v_m}) + C_{\text{hyper}}(v_{v_m}), \tag{A.20}$$

$$\begin{aligned}
C_f &= \sum_i \sum_j C_h(A_{ij}, 0, \ln 1) + \sum_i C_h(a_i, m_a, v_a) + C_{\text{hyper}}(m_a) + C_{\text{hyper}}(v_a) \\
&\quad + \sum_i \sum_j C_h(B_{ij}, 0, v_{B_j}) + \sum_j C_h(v_{B_j}, m_{v_B}, v_{v_B}) + C_{\text{hyper}}(m_{v_B}) + C_{\text{hyper}}(v_{v_B}) \\
&\quad + \sum_i C_h(b_i, m_b, v_b) + C_{\text{hyper}}(m_b) + C_{\text{hyper}}(v_b).
\end{aligned} \tag{A.21}$$

The term $C_g$ is like $C_f$, except that A, a, B, and b are replaced by C, c, D, and d, respectively.
A.6 Gradients of the Cost Function $C_{\text{KL}}$ with Respect to the Source Parameters. Let us use the notation $C_{\text{KL}}(\tilde{s}_i(t))$ to mean that the cost function $C_{\text{KL}}$ is considered to be a function of the intermediate variables $\tilde{s}_i(1), \ldots, \tilde{s}_i(t)$ in addition to the parameters of the posterior approximation. The gradient computations resulting from equation 5.3 by the chain rule are then as follows:

$$\frac{\partial C_{\text{KL}}}{\partial \mathring{s}_i(t)} = \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \mathring{s}_i(t)} + \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)}\,\frac{\partial \tilde{s}_i(t)}{\partial \mathring{s}_i(t)} = \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \mathring{s}_i(t)} + \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)}, \tag{A.22}$$

$$\frac{\partial C_{\text{KL}}}{\partial \breve{s}_i(t, t-1)} = \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \breve{s}_i(t, t-1)} + \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)}\,\frac{\partial \tilde{s}_i(t)}{\partial \breve{s}_i(t, t-1)} = \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \breve{s}_i(t, t-1)} + 2\,\frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)}\,\breve{s}_i(t, t-1)\,\tilde{s}_i(t-1), \tag{A.23}$$

$$\frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} = \frac{\partial C_{\text{KL}}(\tilde{s}_i(t+1))}{\partial \tilde{s}_i(t)} + \frac{\partial C_{\text{KL}}(\tilde{s}_i(t+1))}{\partial \tilde{s}_i(t+1)}\,\frac{\partial \tilde{s}_i(t+1)}{\partial \tilde{s}_i(t)} = \frac{\partial C_{\text{KL}}(\tilde{s}_i(t+1))}{\partial \tilde{s}_i(t)} + \frac{\partial C_{\text{KL}}(\tilde{s}_i(t+1))}{\partial \tilde{s}_i(t+1)}\,\breve{s}_i^2(t+1, t). \tag{A.24}$$
The gradient $\partial C_{\text{KL}}(\tilde{s}_i(t))/\partial \mathring{s}_i(t)$ can be computed from equation 5.5. Equating $\partial C_{\text{KL}}/\partial \mathring{s}_i(t)$ to zero results in

$$\frac{\partial C_{\text{KL}}}{\partial \mathring{s}_i(t)} = 0 \;\Rightarrow\; -\frac{1}{2\mathring{s}_i(t)} + \frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} = 0 \;\Rightarrow\; \mathring{s}_i(t) = \left[ 2\,\frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} \right]^{-1}. \tag{A.25}$$

This fixed-point update rule resembles equation 5.8, and a similar learning parameter is applied for stability as in equation 5.19.
Solving the equation $\partial C_{\text{KL}}/\partial \breve{s}_i(t, t-1) = 0$ with the help of equation A.23 yields the update rule 5.11 for the linear dependence parameter $\breve{s}_i(t, t-1)$:

$$\begin{aligned}
\frac{\partial C_{\text{KL}}}{\partial \breve{s}_i(t, t-1)} = 0 \;\Rightarrow\; 2\,\frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)}\,\breve{s}_i(t, t-1)\,\tilde{s}_i(t-1) &= -\frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \breve{s}_i(t, t-1)} = \frac{\partial g_i(t)}{\partial s_i(t-1)}\,\tilde{s}_i(t-1)\,\exp(2\tilde{v}_i - 2\bar{v}_i) \\
\Rightarrow\; \breve{s}_i(t, t-1) &= \frac{\partial g_i(t)}{\partial s_i(t-1)}\,\exp(2\tilde{v}_i - 2\bar{v}_i)\left[ 2\,\frac{\partial C_{\text{KL}}(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} \right]^{-1},
\end{aligned} \tag{A.26}$$
where $\partial C_{\text{KL}}(\tilde{s}_i(t))/\partial \breve{s}_i(t, t-1)$ is obtained from equations A.11 and A.12. The gradient $\partial C_{\text{KL}}(\tilde{s}_i(t+1))/\partial \tilde{s}_i(t)$ includes a term $\exp(2\tilde{v}_i - 2\bar{v}_i)$ from equations A.11 and A.12. There are also terms originating from the nonlinear mappings f and g, because their forward computation starts with the posterior means $\bar{s}_i(t)$ and variances $\tilde{s}_i(t)$. The evaluation of these terms is discussed in section A.3.

The marginalized variances $\tilde{s}_i(t)$ are computed recursively forward in time from equation 5.3, and this has a counterpart in the computation of the gradients. The term $\partial C_{\text{KL}}(\tilde{s}_i(t))/\partial \tilde{s}_i(t)$ needed for the updates of $\mathring{s}_i(t)$ and $\breve{s}_i(t, t-1)$ depends on $\partial C_{\text{KL}}(\tilde{s}_i(t+1))/\partial \tilde{s}_i(t+1)$ according to equation A.24. Similarly, equation 5.11 shows that $\breve{s}_i(t, t-1)$ depends on $\partial C_{\text{KL}}(\tilde{s}_i(t))/\partial \tilde{s}_i(t)$, which in turn depends on $\breve{s}_i(t+1, t)$ according to equation A.24. Hence, the linear dependence parameters $\breve{s}_i(t, t-1)$ and the gradients with respect to the marginalized variances $\tilde{s}_i(t)$ are computed recursively backward in time.

Acknowledgments
We thank Antti Honkela and J.-P. Barnard for useful comments and Alexander Ilin for performing some additional experiments. This research has been funded by the European Commission project BLISS and the Finnish Center of Excellence Programme (2000–2005) under the project New Information Processing Principles.

References

Attias, H. (1999a). Independent factor analysis. Neural Computation, 11(4), 803–851.
Attias, H. (1999b). Learning a hierarchical belief network of independent factor analyzers. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 361–367). Cambridge, MA: MIT Press.
Attias, H. (2000). Independent factor analysis with temporally structured sources. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 386–392). Cambridge, MA: MIT Press.
Attias, H. (2001). ICA, graphical models and variational methods. In S. Roberts & R. Everson (Eds.), Independent component analysis: Principles and practice (pp. 95–112). Cambridge: Cambridge University Press.
Barber, D., & Bishop, C. (1998). Ensemble learning in Bayesian neural networks. In M. Jordan, M. Kearns, & S. Solla (Eds.), Neural networks and machine learning (pp. 215–237). Berlin: Springer-Verlag.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Briegel, T., & Tresp, V. (1999). Fisher scoring and a mixture of modes approach for approximate inference and learning in nonlinear state space models. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 403–409). Cambridge, MA: MIT Press.
Casdagli, M. (1989). Nonlinear prediction of chaotic time series. Physica D, 35, 335–356.
Choudrey, R., Penny, W., & Roberts, S. (2000). An ensemble learning approach to independent component analysis. In Proc. of the IEEE Workshop on Neural Networks for Signal Processing, Sydney, Australia, December 2000. Piscataway, NJ: IEEE Press.
Cichocki, A., Zhang, L., Choi, S., & Amari, S.-I. (1999). Nonlinear dynamic independent component analysis using state-space and neural network models. In Proc. of the 1st Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 99–104). Aussois, France.
Gelman, A., Carlin, J., Stern, H., & Rubin, D. (1995). Bayesian data analysis. Boca Raton, FL: Chapman & Hall/CRC Press.
Ghahramani, Z. (2001). An introduction to hidden Markov models and Bayesian networks. Int. J. of Pattern Recognition and Artificial Intelligence, 15(1), 9–42.
Ghahramani, Z., & Beal, M. (2001). Propagation algorithms for variational Bayesian learning. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 507–513). Cambridge, MA: MIT Press.
Ghahramani, Z., & Hinton, G. (2000). Variational learning for switching state-space models. Neural Computation, 12(4), 963–996.
Ghahramani, Z., & Roweis, S. (1999). Learning nonlinear dynamical systems using an EM algorithm. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 599–605). Cambridge, MA: MIT Press.
Haykin, S. (1998). Neural networks—A comprehensive foundation (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Haykin, S., & Principe, J. (1998, May). Making sense of a complex world. IEEE Signal Processing Magazine, 15(3), 66–81.
Hinton, G., & van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory (pp. 5–13). Santa Cruz, CA.
Honkela, A., & Karhunen, J. (2001). An ensemble learning approach to nonlinear independent component analysis. In Proc. European Conf. on Circuit Theory and Design (ECCTD'01) (pp. I–41/I–44). Espoo, Finland.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Iline, A., Valpola, H., & Oja, E. (2001). Detecting process state changes by nonlinear blind source separation. In Proc. of the 3rd Int. Workshop on Independent Component Analysis and Signal Separation (ICA2001) (pp. 704–709). San Diego, CA.
Jordan, M. (Ed.). (1999). Learning in graphical models. Cambridge, MA: MIT Press.
Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1999). An introduction to variational methods for graphical models. In M. Jordan (Ed.), Learning in graphical models (pp. 105–161). Cambridge, MA: MIT Press.
Lampinen, J., & Vehtari, A. (2001). Bayesian approach for neural networks—review and case studies. Neural Networks, 14(3), 7–24.
Lappalainen, H. (1999). Ensemble learning for independent component analysis. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 7–12). Aussois, France.
Lappalainen, H., & Honkela, A. (2000). Bayesian nonlinear independent component analysis by multi-layer perceptrons. In M. Girolami (Ed.), Advances in independent component analysis (pp. 93–121). Berlin: Springer-Verlag.
Lappalainen, H., & Miskin, J. (2000). Ensemble learning. In M. Girolami (Ed.), Advances in independent component analysis (pp. 75–92). Berlin: Springer-Verlag.
Lorenz, E. (1963). Deterministic nonperiodic flow. Journal of Atmospheric Sciences, 20, 130–141.
MacKay, D. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
MacKay, D. (1995). Developments in probabilistic modelling with neural networks—ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications. Proc. of the 3rd Annual Symposium on Neural Networks (pp. 191–198). Berlin: Springer-Verlag.
Miskin, J., & MacKay, D. (2000). Ensemble learning for blind image separation and deconvolution. In M. Girolami (Ed.), Advances in independent component analysis (pp. 123–141). Berlin: Springer-Verlag.
Miskin, J., & MacKay, D. (2001). Ensemble learning for blind source separation. In S. Roberts & R. Everson (Eds.), Independent component analysis: Principles and practice (pp. 209–233). Cambridge: Cambridge University Press.
Neal, R. (1996). Bayesian learning for neural networks. Berlin: Springer-Verlag.
Principe, J., Euliano, N., & Lefebvre, C. (2000). Neural and adaptive systems—Fundamentals through simulations. New York: Wiley.
Robert, C., & Casella, G. (1999). Monte Carlo statistical methods. Berlin: Springer-Verlag.
Roberts, S., & Everson, R. (2001). Introduction. In S. Roberts & R. Everson (Eds.), Independent component analysis: Principles and practice (pp. 1–70). Cambridge: Cambridge University Press.
Roweis, S., & Ghahramani, Z. (2001). An EM algorithm for identification of nonlinear dynamical systems. In S. Haykin (Ed.), Kalman filtering and neural networks (pp. 175–220). New York: Wiley.
Särelä, J., Valpola, H., Vigário, R., & Oja, E. (2001). Dynamical factor analysis of rhythmic magnetoencephalographic activity. In Proc. of the 3rd Int. Workshop on
2692
Harri Valpola and Juha Karhunen
Independent Component Analysis and Signal Separation (ICA2001) (pp. 451–456). San Diego, CA. Sorenson, H. (1980). Parameter estimation—Principles and problems. New York: Marcel Dekker. Takens, F. (1981). Detecting strange attractors in turbulence. In D. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: SpringerVerlag. Valpola, H. (2000a). Bayesian ensemble learning for nonlinear factor analysis. Unpublished doctoral dissertation, Helsinki University of Technology, Espoo, Finland. Valpola, H. (2000b).Nonlinear independent component analysis using ensemble learning: theory. In Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000) (pp. 251–256). Espoo, Finland. Valpola, H. (2000c). Unsupervised learning of nonlinear dynamic state-space models (Tech. Rep. No. A59). Helsinki: Lab of Computer and Information Science, Helsinki University of Technology. Valpola, H., Giannakopoulos, X., Honkela, A., & Karhunen, J. (2000). Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Proc. of the 2nd Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000) (pp. 351–356). Espoo, Finland. Valpola, H., Raiko, T., & Karhunen, J. (2001). Building blocks for hierarchical latent variable models. In Proc. 3rd Int. Workshop on Independent Component Analysis and Signal Separation (ICA2001) (pp. 710–715). San Diego, CA: Available on-line: http://www.cis.hut./harri/. Wan, E., & Merwe, R. (2001). The unscented Kalman lter. In S. Haykin (Ed.), Kalman ltering and neural networks (pp. 221–280). New York: Wiley. Wan, E., & Nelson, A. (2001).Dual extended Kalman lter methods. In S. Haykin (Ed.), Kalman ltering and neural networks (pp. 123–173). New York: Wiley. Winther, O. (1998). Bayesian mean eld algorithms for neural networks and gaussian processes. Unpublished doctoral dissertation, University of Copenhagen, Denmark.
Received August 14, 2001; accepted April 30, 2002.
LETTER
Communicated by Gerard Dreyfus
Data-Reusing Recurrent Neural Adaptive Filters

Danilo P. Mandic
[email protected]
School of Information Systems, University of East Anglia, Norwich, NR4 7TJ, U.K.

A class of data-reusing learning algorithms for real-time recurrent neural networks (RNNs) is analyzed. The analysis is undertaken for a general sigmoid nonlinear activation function of a neuron for the real-time recurrent learning training algorithm. Error bounds and convergence conditions for such data-reusing algorithms are provided for both contractive and expansive activation functions. The analysis is undertaken for various configurations that are generalizations of a linear structure infinite impulse response adaptive filter.

1 Introduction
Recurrent neural networks (RNNs) represent an emerging technique in nonlinear adaptive signal processing. They are also suitable for implementing nonlinear autoregressive moving average (NARMA) models (Connor, Martin, & Atlas, 1994; Nerrand, Roussel-Ragot, Personnaz, & Dreyfus, 1993; Nerrand, Roussel-Ragot, Urbani, Personnaz, & Dreyfus, 1994). However, RNNs for real-time adaptive filtering applications, such as prediction, encounter problems due to the slow convergence of their gradient adaptive algorithms (Bengio, Simard, & Frasconi, 1994). The so-called data-reusing algorithms, which have been considered for linear adaptive filters, offer an increased convergence rate as compared to standard algorithms (Treichler, Johnson, & Larimore, 1987; Roy & Shynk, 1989; Schnaufer & Jenkins, 1993). For a recurrent perceptron, which is a nonlinear version of an infinite impulse response (IIR) linear filter, whose nonlinear activation function is $\Phi$, the output of a neural adaptive filter $y$ is given by

\[ y(k) = \Phi\bigl(u^T(k)\, w(k)\bigr), \tag{1.1} \]

where $u(k)$, $w(k)$, and $(\cdot)^T$ denote, respectively, the input vector, the weight vector, and the vector transpose operator. Since the updated weight vector $w(k+1)$ is available before the next input vector $u(k+1)$, a new estimate $\bar{y}$, which is also known as an a posteriori estimate, can be calculated as (Mandic & Chambers, 1998)

\[ \bar{y}(k) = \Phi\bigl(u^T(k)\, w(k+1)\bigr). \tag{1.2} \]

Neural Computation 14, 2693–2707 (2002) © 2002 Massachusetts Institute of Technology
The corresponding instantaneous output errors for the two cases above are given, respectively, as $e(k) = d(k) - y(k)$ and $\bar{e}(k) = d(k) - \bar{y}(k)$, where $d(k)$ is some teaching signal. This technique can be repeated and is called a data-reusing technique (Roy & Shynk, 1989). A data-reusing algorithm for the recurrent nonlinear case can hence be expressed as

\[ w_{i+1}(k) = w_i(k) - \eta\, \nabla_{w_i(k)} E\bigl(e_i(k)\bigr), \qquad
   e_i(k) = d(k) - \Phi\bigl(u^T(k)\, w_i(k)\bigr), \qquad i = 1, \dots, L, \tag{1.3} \]

where the cost function $E(e_i(k))$ is typically $E_i(k) = \frac{1}{2} e_i^2(k)$, the index $i$ denotes the $i$th iteration of algorithm 1.3, and $\eta$ is the learning rate. Nerrand et al. (1993) provide an extensive study of learning algorithms for neural networks for nonlinear adaptive filtering. A case with only one iteration of equation 1.3 is addressed in Mandic and Chambers (2000b). We wish to preserve the useful feature of a data-reusing algorithm that the magnitude of the output error uniformly converges along the iteration 1.3, that is,

\[ |e_{i+1}(k)| \le c\, |e_i(k)|, \qquad 0 < c < 1, \qquad i = 1, \dots, L, \tag{1.4} \]
which represents a constraint on equation 1.3. It follows that for $L = 1$, algorithm 1.3 reduces to the standard gradient-descent algorithm: $w_1(k) = w(k)$. The $(L+1)$th iteration of equation 1.3 provides the values of quantities that relate to the time instant $(k+1)$, that is, $w_{L+1}(k) = w(k+1)$. Unlike the case of linear adaptive filters (Roy & Shynk, 1989; Schnaufer & Jenkins, 1993), the data-reusing approach for RNNs working as nonlinear adaptive filters depends also on the characteristics of the nonlinear activation function of a neuron, that is, on whether it is a contraction. Here, an extension to the approach of Mandic and Chambers (2000b) is provided that considers bounds on the $i$th data-reusing prediction error $e_i(k)$, $i = 1, 2, \dots, L$, for recurrent neural networks, from a recurrent perceptron up to a fully connected recurrent neural network. Both the contractive and the expansive nonlinear activation function of a neuron are addressed. In this context, the range of values for the learning rate that provides convergence and the change undergone by the weights of a network in the data-reusing case are analyzed.
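The a priori/a posteriori mechanism of equations 1.1 to 1.4 can be sketched numerically. The fragment below iterates the gradient update of equation 1.3 on a single static logistic neuron (the recurrent feedback is omitted for brevity); the input, target, and hyperparameter values are invented for illustration:

```python
import numpy as np

def logistic(v, beta=1.0):
    # Logistic activation Phi(v) = 1 / (1 + exp(-beta * v)).
    return 1.0 / (1.0 + np.exp(-beta * v))

def data_reusing_step(w, u, d, eta=0.3, L=5, beta=1.0):
    """Reuse the same (u, d) pair L times, as in equation 1.3:
    each inner iteration refines w before the next sample arrives."""
    errors = []
    for _ in range(L):
        y = logistic(u @ w, beta)
        e = d - y                              # instantaneous output error
        # gradient of E = e^2 / 2 w.r.t. w for the logistic neuron
        grad = -e * beta * y * (1.0 - y) * u
        w = w - eta * grad
        errors.append(abs(e))
    return w, errors

rng = np.random.default_rng(0)
u = rng.standard_normal(4)                     # toy input vector
w, errs = data_reusing_step(np.zeros(4), u, d=0.8)
# The error magnitude shrinks monotonically along the data-reusing
# iteration, i.e., the constraint of equation 1.4 holds here.
assert all(b < a for a, b in zip(errs, errs[1:]))
```

With a contractive setting such as this one, each reuse of the sample moves the a posteriori error below the a priori one, which is exactly the behavior demanded by the constraint in equation 1.4.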
2 Bounds on the Data-Reusing Error Adaptation for Recurrent Neural Networks
From equation 1.3, a gradient-descent algorithm for a single-neuron recurrent network (recurrent perceptron) is given by Haykin (1999) as

\[ w_{i+1}(k) = w_i(k) + \eta(k)\, e_i(k)\, \Pi_i(k), \qquad
   e_i(k) = d(k) - \Phi\bigl(u^T(k)\, w_i(k)\bigr), \qquad i = 1, \dots, L, \tag{2.1} \]
where $w_i(k)$ is the weight vector at the $i$th iteration of equation 2.1, $\eta$ is the learning rate, $u(k)$ is the input vector consisting of external data, feedback, and bias, $d(k)$ is some desired signal, $\Pi_i(k)$ is a gradient (sensitivities) vector, and $e_i(k)$ is the prediction error after the $i$th iteration of the data-reusing algorithm, 2.1, at the time instant $k$. A brief overview of the real-time recurrent learning (RTRL) algorithm is given in the appendix.

An insight into the RTRL algorithm and algorithm 2.1 shows that the standard gradient algorithm that runs in discrete time $k$ represents an outer loop, whereas the data-reusing part, equation 2.1, which runs iteratively on the input data from the discrete time instant $k$, represents an inner loop of the whole algorithm. In this way, the data-reusing algorithm described here is a combination of a recursive and an iterative algorithm. In Nerrand et al. (1993, 1994) and Nerrand, Roussel-Ragot, Personnaz, and Dreyfus (1991), a unifying concept of algorithms for training neural networks is provided. In their taxonomy, the data-reusing algorithms presented here are referred to as unidirected–directed algorithms.

Starting from the last iteration of algorithm 2.1, for $i = L$, we obtain the final data-reusing weight update as

\[ w(k+1) = w_{L+1}(k) = w_L(k) + \eta(k)\, e_L(k)\, \Pi_L(k)
   = w_{L-1}(k) + \eta(k)\, e_{L-1}(k)\, \Pi_{L-1}(k) + \eta(k)\, e_L(k)\, \Pi_L(k)
   = w(k) + \sum_{i=1}^{L} \eta(k)\, e_i(k)\, \Pi_i(k). \tag{2.2} \]
The consecutive output errors in the data-reusing iteration are expressed as

\[ e_1(k) = d(k) - \Phi\bigl(u^T(k)\, w(k)\bigr), \quad
   e_2(k) = d(k) - \Phi\bigl(u^T(k)\, w_2(k)\bigr), \quad \dots, \quad
   e_L(k) = d(k) - \Phi\bigl(u^T(k)\, w_L(k)\bigr). \tag{2.3} \]
The instantaneous error at the output neuron can be further expressed as

\[ e_i(k) = d(k) - \Phi\bigl(u^T(k)\, w_i(k)\bigr)
   = \bigl[d(k) - \Phi\bigl(u^T(k)\, w_{i-1}(k)\bigr)\bigr]
   - \bigl[\Phi\bigl(u^T(k)\, w_i(k)\bigr) - \Phi\bigl(u^T(k)\, w_{i-1}(k)\bigr)\bigr], \tag{2.4} \]

clearly indicating the dependence of the $i$th data-reusing error on the features of the nonlinear activation function$^1$ of the neuron, $\Phi$. Premultiplying the first equation in equation 2.1 by $u^T(k)$ and applying the nonlinear activation function $\Phi$ on either side, we obtain

\[ \Phi\bigl(u^T(k)\, w_{i+1}(k)\bigr) = \Phi\bigl(u^T(k)\, w_i(k) + \eta(k)\, e_i(k)\, u^T(k)\, \Pi_i(k)\bigr). \tag{2.5} \]
Further analysis depends on whether $\Phi$ is a contraction or an expansion. A brief insight into the contraction mapping theorem (CMT) and its consequences for stability in a recurrent perceptron is given in the appendix.

2.1 The Case of a Contractive Activation Function. For a broad class of contractive sigmoid activation functions, we have (Mandic & Chambers, 2000b)

\[ \Phi(a + b) \le \Phi(a) + \Phi(b). \tag{2.6} \]
Notice that both $e(k)$ and $e_i(k)$, $i = 2, \dots, L$, have the same sign (Roy & Shynk, 1989; Douglas & Rupp, 1997), as seen in Figure 1. From equation 1.3, the direction of the vectors $\Delta w_i(k)$ can be assumed the same as the direction of the input vector $u(k)$. As a gradient-descent algorithm gives only an approximate solution to the optimization problem, the quadratic surface defined by $e^2(k)$ has a solution set that is a linear variety

$^1$ Clearly, equation 2.4 can be expressed as
\[ e_i(k) = \bigl[d(k) - \Phi\bigl(u^T(k)\, w_{i-1}(k)\bigr)\bigr]
   - \bigl[\Phi\bigl(u^T(k)\, w_i(k)\bigr) - \Phi\bigl(u^T(k)\, w_{i-1}(k)\bigr)\bigr]
   = e_{i-1}(k) - \bigl[\Phi\bigl(u^T(k)\, w_i(k)\bigr) - \Phi\bigl(u^T(k)\, w_{i-1}(k)\bigr)\bigr]. \]
By the CMT (Gill, Murray, & Wright, 1981; see the appendix), $\exists\, \xi \in \bigl(u^T(k)\, w_{i-1}(k),\, u^T(k)\, w_i(k)\bigr)$ such that $|\Phi(u^T(k)\, w_i(k)) - \Phi(u^T(k)\, w_{i-1}(k))| \le \Phi'(\xi)\, |u^T(k)\, \Delta w_i(k)|$, which connects the output errors at the $i$th and $(i-1)$th iterations, the first derivative of the nonlinear activation function of the neuron, and the norm of the input data. This relationship is elaborated later in the article. A logistic sigmoid function is assumed, although the results are general.
Figure 1: Geometric interpretation of data-reusing techniques.
instead of a single minimum. Hence, the solution space is a hypersurface (Schnaufer & Jenkins, 1993), whose dimension is one less than that of the space in which it rests, and the vector $u(k)$ is perpendicular to the solution hypersurface $S(k)$. Figure 1 gives a geometric interpretation of the relation among the standard gradient, the normalized gradient (Mandic & Chambers, 2000a), and the data-reusing gradient algorithms for on-line training of neural nonlinear adaptive filters. From Figure 1, it is evident that repeating an a posteriori technique for a sufficient number of iterations approaches the normalized gradient-based algorithm (Mandic & Chambers, 2000a).

Theorem 1. The lower bound for the data-reusing estimation error obtained by algorithm 2.1 with $L$ iterations and a contractive nonlinear activation function $\Phi$ is

\[ e_{L+1}(k) > \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L} e(k), \tag{2.7} \]

where $\Pi(k)$ is such that $\max_{i=1,\dots,L}\bigl(u^T(k)\, \Pi_i(k)\bigr)$ is obtained.
Proof. Let $a = u^T(k)\, w_i(k)$ and $b = \eta(k)\, e(k)\, u^T(k)\, \Pi_i(k)$. Applying equation 2.6 to equation 2.5, subtracting $d(k)$ from both sides of the resulting equation, and recognizing that for $\Phi$ a contraction, $|\Phi(\xi)| < |\xi|$, $\forall \xi \in \mathbb{R}$, we obtain

\[ e_{i+1}(k) > \bigl[1 - \eta(k)\, u^T(k)\, \Pi_i(k)\bigr]\, e_i(k). \tag{2.8} \]
Notice that the whole process is a fixed-point iteration around the fixed point defined by the information vector $u(k)$, and hence the sequence $u^T(k)\, \Pi_i(k)$, $i = 1, \dots, L$, has its maximum $u^T(k)\, \Pi(k)$ along the iteration. This yields

\[ e_i(k) > \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]\, e_{i-1}(k)
   > \cdots >
   \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{i-1} e(k), \tag{2.9} \]

and equation 2.8 finally becomes

\[ e_{L+1}(k) > \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L} e(k), \tag{2.10} \]

which is the lower bound for the data-reusing algorithm with a contractive activation function of a neuron.

For the algorithm given in equation 2.1 with the constraint $|e_i(k)| < |e_{i-1}(k)|$ to be feasible, the term $[1 - \eta(k)\, u^T(k)\, \Pi(k)]$ in equation 2.9 must have a norm of less than unity, that is, it must be a contraction operator (Bharucha-Reid, 1976). Notice that for $L \to \infty$, the error from equation 2.10 becomes $e_\infty(k) = 0$; that is, after an infinite number of iterations, this algorithm becomes a normalized RTRL (NRTRL) algorithm, as derived in Mandic and Chambers (2000a). In that case, the whole procedure is a fixed-point iteration, and the necessary condition that guarantees convergence is $|1 - \eta(k)\, u^T(k)\, \Pi(k)| < 1$ (Mandic & Chambers, 1998). The following corollary gives the range of the learning rate $\eta(k)$ for which the contraction mapping properties are satisfied.

Corollary 1. The range allowed for the learning rate $\eta(k)$ in the data-reusing algorithm, 2.1, to ensure convergence under the conditions given in theorem 1 is

\[ 0 < \eta(k) < \frac{2}{|u^T(k)\, \Pi(k)|}. \tag{2.11} \]

It is assumed that the value of the learning rate $\eta(k)$ is held fixed during the fixed-point iteration.
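Corollary 1 can be checked numerically in the scalar case. In the sketch below, a single number stands in for $u^T(k)\,\Pi(k)$; its value, and the learning rates tried, are arbitrary illustrative choices:

```python
import numpy as np

u_pi = 3.7                       # stands in for u^T(k) Pi(k) > 0 (arbitrary)
eta_max = 2.0 / abs(u_pi)        # upper end of the range in equation 2.11

# Inside the range of equation 2.11 the factor 1 - eta * u^T Pi
# is a contraction; outside the range, it is an expansion.
for eta in np.linspace(0.01, eta_max - 0.01, 50):
    assert abs(1.0 - eta * u_pi) < 1.0
assert abs(1.0 - (eta_max + 0.1) * u_pi) > 1.0

# With a valid learning rate, the bound of equation 2.10 decays
# geometrically in the number of data-reusing iterations L.
eta = 0.2
rho = 1.0 - eta * u_pi
assert abs(rho) ** 10 < 1e-5
```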
2.2 The Case of an Expansive Activation Function. If the function $\Phi$ is an expansion, then (Mandic & Chambers, 2000b)

\[ \Phi(a + b) \ge \Phi(a) + \Phi(b). \tag{2.12} \]

This is not desirable, since the second term in equation 2.4 can grow without bound. However, a statement analogous to that of theorem 1 can be expressed by replacing the sign $>$ in equation 2.7 by $<$. Although this provides an upper bound, it is not of practical significance, since the errors grow with the number of iterations due to the expansion of $\Phi$ in equation 2.12. An account of the relationship among the norm of a weight matrix, the learning rate, and the slope of the activation function is given in Mandic and Chambers (1999b).

2.3 The Case of a Linear Activation Function. In this case, the problem degenerates into the problem of data-reusing linear adaptive filters, which are extensively studied in Roy and Shynk (1989), Schnaufer and Jenkins (1993), and Sheu et al. (1992).

3 A General RNN
In the case of a general RNN, we have a weight matrix $W$ comprising the weight vectors $w$ of the individual neurons. For the case of a single-output network, as is common in nonlinear prediction, the gradient matrix of the first neuron, $\Pi_1$, is of special interest. Due to the similarity of the expressions for a recurrent perceptron and a general RNN, the equation of interest now becomes (see the appendix)

\[ \Delta W(k) = \eta(k)\, e(k)\, \frac{\partial y_1(k)}{\partial W(k)} = \eta(k)\, e(k)\, \Pi_1(k), \tag{3.1} \]

where $\Pi_1(k)$ represents the matrix of gradients at the output neuron with respect to $W(k)$. The correction to the weight vector of the $j$th neuron at the time instant $k$ becomes

\[ \Delta w_j(k) = \eta(k)\, e(k)\, \Pi_1^{(j)}(k), \tag{3.2} \]

where $\Pi_1^{(j)}$ represents the $j$th row of the gradient matrix $\Pi_1(k)$. Hence, from the analysis for a recurrent perceptron, and defining the gradients $\Pi_1^{(1)}(k)$ in a way similar to that in theorem 1, we have theorem 2.

Theorem 2. For a data-reusing algorithm based on equation 2.1 and a contractive nonlinear activation function $\Phi$, the lower bound for the error obtained by equation 2.1 and the range of the allowed step size for which the algorithm converges are, respectively,

\[ e_{L+1}(k) > \bigl[1 - \eta(k)\, u^T(k)\, \Pi_1^{(1)}(k)\bigr]^{L} e(k), \tag{3.3} \]

\[ 0 < \eta(k) < \frac{2}{|u^T(k)\, \Pi_1^{(1)}(k)|}. \tag{3.4} \]
4 On the Weight Change in the Data-Reusing Algorithm
For simplicity, we consider the case of a recurrent perceptron. After applying the data-reusing algorithm $L$ times, from equation 2.9 we have

\[ \sum_{i=1}^{L} e_i(k) > \sum_{i=1}^{L} \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{i-1} e(k)
   = \frac{1 - \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L}}{\eta(k)\, u^T(k)\, \Pi(k)}\, e(k). \tag{4.1} \]

This sum converges for a contractive activation function $\Phi$ and $|1 - \eta(k)\, u^T(k)\, \Pi(k)| < 1$. Now, from equations 2.2 and 4.1, we obtain the amount by which the weights change after $L$ iterations of equation 2.1 (element by element):

\[ \Delta w(k) < \frac{1 - \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L}}{u^T(k)\, \Pi(k)}\, e(k)\, \Pi(k). \tag{4.2} \]
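The closed form in equation 4.1 is a geometric series in the contraction factor $1 - \eta(k)\, u^T(k)\, \Pi(k)$; a quick numerical check (all values are arbitrary):

```python
import numpy as np

eta, u_pi, e0, L = 0.15, 2.5, 0.7, 8   # arbitrary illustrative values
rho = 1.0 - eta * u_pi                 # contraction factor, |rho| < 1 here

# Left-hand side of equation 4.1: sum of the per-iteration bounds.
direct = sum(rho ** (i - 1) * e0 for i in range(1, L + 1))
# Right-hand side: the closed geometric-series form
# (note that 1 - rho = eta * u^T Pi).
closed = (1.0 - rho ** L) / (eta * u_pi) * e0

assert np.isclose(direct, closed)
```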
A similar relationship can be obtained for the case of a general Williams–Zipser type RNN.

5 Convergence of the Data-Reusing Approach
In the area of linear adaptive filters, the analysis of convergence consists of the analysis of convergence in the mean, convergence in the mean square, and convergence in the steady state. However, in the nonlinear case, a Wiener solution does not generally exist, and hence convergence is mainly considered via Lyapunov stability (DeRusso, Roy, Close, & Desrochers, 1998; Zurada & Shen, 1990) or through contraction mapping. Here, owing to the assumption that the standard and the data-reusing errors have the same sign throughout the iteration, convergence of the data-reusing prediction error is defined by convergence of the underlying learning algorithm for the standard error, given a contractive activation function of a neuron. Additional stability and robustness of the data-reusing algorithms are ensured because the denominators of the corresponding relationships between the two errors are greater than unity (Mandic & Chambers, 1999a).
From equation 4.2 we have

\[ w(k+1) < w(k) + \frac{1 - \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L}}{u^T(k)\, \Pi(k)}\, e(k)\, \Pi(k). \tag{5.1} \]

Introducing the weight error vector $v(k) = w(k) - w^{*}(k)$, where $w^{*}(k)$ is some optimal value, and using the contractivity condition of the activation function, we have

\[ e(k) \approx e^{*}(k) - \Phi\bigl(u^T(k)\, v(k)\bigr). \tag{5.2} \]

Hence, we obtain

\[ v(k+1) < v(k) + \frac{1 - \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L}}{u^T(k)\, \Pi(k)}\, e^{*}(k)\, \Pi(k)
   - \frac{1 - \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L}}{u^T(k)\, \Pi(k)}\, \Phi\bigl(u^T(k)\, v(k)\bigr)\, \Pi(k). \tag{5.3} \]

With the assumption of contractivity of $\Phi$ ($|\Phi(\xi)| < |\xi|$, $\forall \xi$), the homogeneous part of equation 5.3 becomes

\[ v(k+1) = \left[ I - \frac{1 - \bigl[1 - \eta(k)\, u^T(k)\, \Pi(k)\bigr]^{L}}{u^T(k)\, \Pi(k)}\, \Pi(k)\, u^T(k) \right] v(k). \tag{5.4} \]
In order for equation 5.4 to converge, the term in square brackets has to be a contraction mapping operator, which has already been shown. On the other hand, the data-reusing algorithms presented here have been shown to have a uniformly smaller prediction error than the standard ones, and hence they converge by continuity. Further details on the convergence of standard algorithms can be found in Kuan and Hornik (1991), Bershad, Shynk, and Feintuch (1993a, 1993b), and Yuille and Kosowsky (1993), whereas the convergence of normalized gradient-descent nonlinear algorithms is addressed in Mandic and Chambers (2000a).

6 Discussion
It has been shown that repeating the data-reusing algorithm leads to a normalized gradient algorithm; however, the NRTRL algorithm for recurrent neural networks derived in Mandic and Chambers (2000a) is sensitive to the characteristics of the input signals and to the choice of the constant in the algorithm. One way to achieve a performance similar to that of the normalized algorithm is to use a contractive activation function of a neuron in the class of data-reusing algorithms (see Figure 1). Mandic and Chambers (1999b) illustrate the data-reusing approach for only one iteration (a posteriori). Other algorithms, such as extended Kalman filter (EKF) training
for recurrent neural networks, have been analyzed and compared with standard RTRL (Mandic, Baltersee, & Chambers, 1998). A comparison between NRTRL and the EKF algorithm shows that the NRTRL exhibits as good a performance as the EKF for nonlinear signals. Hence, the benefit of data-reusing techniques is that they achieve near-NRTRL performance in a few iterations, albeit with an additional computational burden as compared to NRTRL training. Unlike the NRTRL, however, this class of algorithms does not suffer from stability problems for contractive nonlinear activation functions. Therefore, the computational complexity of the data-reusing RTRL increases with the number of iterations but is still of the order $O(N^4)$ for a relatively small number of iterations. The computational complexity considerations and numerical examples for a simple data-reusing technique for RNNs are given in Mandic and Chambers (1998).

In Figure 2, the RTRL and data-reusing RTRL were compared for prediction of a benchmark nonlinear input given by Narendra and Parthasarathy (1990),

\[ z(k) = \frac{z(k-1)}{1 + z^{2}(k-1)} + r^{3}(k), \tag{6.1} \]

where $r(k)$ was a normally distributed $N(0, 1)$ white noise $n(k)$ passed through a stable AR filter given by

\[ r(k) = 1.79\, r(k-1) - 1.85\, r(k-2) + 1.27\, r(k-3) - 0.41\, r(k-4) + n(k). \tag{6.2} \]
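A sketch of how the benchmark signal of equations 6.1 and 6.2 can be generated; the sample length, seed, and burn-in handling below are our choices, not taken from the article:

```python
import numpy as np

def benchmark_signal(n, seed=0):
    """Nonlinear benchmark of Narendra and Parthasarathy (1990):
    z(k) = z(k-1) / (1 + z(k-1)^2) + r(k)^3 (equation 6.1), where
    r(k) is N(0, 1) white noise n(k) coloured by the AR(4) filter
    of equation 6.2."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n + 4)
    r = np.zeros(n + 4)
    for k in range(4, n + 4):
        r[k] = (1.79 * r[k - 1] - 1.85 * r[k - 2]
                + 1.27 * r[k - 3] - 0.41 * r[k - 4] + noise[k])
    z = np.zeros(n + 4)
    for k in range(4, n + 4):
        z[k] = z[k - 1] / (1.0 + z[k - 1] ** 2) + r[k] ** 3
    return z[4:]                  # drop the zero-initialized burn-in

z = benchmark_signal(500)
assert z.shape == (500,) and np.all(np.isfinite(z))
```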
The logarithm of the averaged squared prediction error, obtained by a Monte Carlo simulation with 100 trials of the experiment for both contractive and expansive activation functions, is shown in Figure 2. The top part of Figure 2 shows the performance of a data-reusing algorithm for a recurrent perceptron with a contractive activation function for $L = 1$, $L = 2$, $L = 5$, and $L = 10$. The data-reusing algorithm outperformed the standard algorithm ($L = 1$). The performance of this algorithm improves with increasing order of the data-reusing iteration and saturates for large $L$, confirming the analysis and the diagram shown in Figure 1. The bottom part of Figure 2 shows the performance of a data-reusing algorithm for a recurrent perceptron with an expansive activation function for $L = 1$, $L = 3$, and $L = 10$. The performance deteriorates with the order of iteration, confirming the above analysis.

7 Conclusion
Figure 2: (Top) Performance of RTRL and data-reusing RTRL with L = 2, L = 5, and L = 10 for prediction of a nonlinear input for a contractive nonlinear activation function. (Bottom) Performance of RTRL and data-reusing RTRL with L = 3 and L = 10 for prediction of a nonlinear input for an expansive nonlinear activation function.

Relationships among the prediction error, learning rate, and slope of the nonlinear activation function of a neuron have been provided for a nonlinear adaptive filter realized by a recurrent neural network trained with a data-reusing gradient algorithm. Such relationships establish a lower bound on the error in recurrent neural predictors, whose instantaneous output error is uniformly smaller in magnitude along the data-reusing iteration. This has been achieved for the RTRL learning algorithm, for a recurrent perceptron and a general RNN. The analysis is general, although it has been performed for a logistic sigmoid. It has been shown that convergence in this case is preserved for a nonlinear activation function exhibiting contractive behavior. The dynamical behavior of such data-reusing gradient-based algorithms lies between the standard and the normalized gradient algorithm.

Appendix

A.1 The RTRL Algorithm. For nonlinear adaptive prediction of nonstationary signals through recurrent neural networks, the RTRL algorithm is best suited for real-time applications (Williams & Zipser, 1989; Haykin, 1994). The weight matrix update of a general RNN can be expressed as (Haykin, 1994)
\[ \Delta W(k) = \eta\, e(k)\, \frac{\partial y_1(k)}{\partial W(k)} = \eta\, e(k)\, \Pi_1(k), \tag{A.1} \]
where $\eta$ is a learning rate and $\Pi_1(k)$ represents the matrix of gradients at the output neuron $y_1$ with respect to $W(k)$. To simplify the presentation, we introduce three new matrices: the $N \times (N + p + 1)$ matrix $\Pi^{(j)}(k)$, where $p$ is the number of external input signals; the $N \times (N + p + 1)$ matrix $U_j(k)$; and the $N \times N$ diagonal matrix $F(k)$, as

\[ \Pi^{(j)}(k) = \frac{\partial y(k)}{\partial w_j(k)}, \qquad y = [y_1(k), \dots, y_N(k)], \qquad j = 1, 2, \dots, N, \tag{A.2} \]

\[ U_j(k) = \begin{bmatrix} 0 \\ \vdots \\ u(k) \\ \vdots \\ 0 \end{bmatrix} \leftarrow j\text{th row}, \qquad j = 1, 2, \dots, N, \tag{A.3} \]

\[ F(k) = \operatorname{diag}\Bigl(\Phi'\bigl(u^T(k)\, w^{(1)}(k)\bigr), \dots, \Phi'\bigl(u^T(k)\, w^{(N)}(k)\bigr)\Bigr). \tag{A.4} \]
Hence, the gradient updating equation regarding the recurrent neuron can be symbolically expressed as (Williams & Zipser, 1989; Haykin, 1994)

\[ \Pi^{(j)}(k+1) = F(k)\bigl[ U_j(k) + \Pi^{(j)}(k)\, W_a(k) \bigr], \qquad j = 1, 2, \dots, N, \tag{A.5} \]
where $W_a$ denotes the set of those entries in $W$ that correspond to the feedback connections. The correction to the weight vector of the $j$th neuron at the time instant $k$ becomes

\[ \Delta w^{(j)}(k) = \eta(k)\, e(k)\, \Pi_1^{(j)}(k), \tag{A.6} \]

where $\Pi_1^{(j)}$ represents the $j$th row of the gradient matrix $\Pi_1(k)$.
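A minimal sketch of data-reusing RTRL for a single recurrent perceptron, specializing equations 2.1 and A.5 to one neuron so that the sensitivity matrix collapses to a vector. The toy signals and hyperparameters are invented, and holding the sensitivity vector fixed during the inner (data-reusing) loop is our simplification:

```python
import numpy as np

def logistic(v, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * v))

def rtrl_perceptron(x, d, eta=0.1, L=3, beta=1.0, seed=0):
    """Data-reusing RTRL for one recurrent neuron.
    Input vector u(k) = [1, y(k-1), x(k)] (bias, feedback, external input).
    pi(k) is the sensitivity vector dy/dw, updated in the spirit of
    equation A.5: pi(k+1) = phi'(.) * (u(k) + w_fb * pi(k)),
    where w_fb = w[1] is the feedback weight."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(3)
    pi = np.zeros(3)
    y_prev = 0.0
    errors = []
    for k in range(len(x)):
        u = np.array([1.0, y_prev, x[k]])
        for _ in range(L):                 # inner, data-reusing loop (eq. 2.1)
            y = logistic(u @ w, beta)
            w = w + eta * (d[k] - y) * pi
        y = logistic(u @ w, beta)
        dphi = beta * y * (1.0 - y)
        pi = dphi * (u + w[1] * pi)        # sensitivity recursion
        y_prev = y
        errors.append(d[k] - y)
    return w, np.array(errors)

t = 0.3 * np.arange(200)
x = np.sin(t)
d = 0.5 + 0.25 * np.sin(t + 0.1)           # toy teaching signal in (0, 1)
w, e = rtrl_perceptron(x, d)
assert np.all(np.isfinite(w)) and np.all(np.abs(e) < 1.0)
```

The outer loop over $k$ is the recursive part of the algorithm; the inner loop reuses the sample $(u(k), d(k))$ $L$ times, as described in section 2.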
Figure 3: The contraction mapping.
A.2 Contraction Mapping and Nonlinear Activation Functions. By the CMT, a function $K$ is a contraction on $[a, b] \subset \mathbb{R}$ if (Gill et al., 1981):

i) $x \in [a, b] \Rightarrow K(x) \in [a, b]$;

ii) $\exists\, c < 1$, $c \in \mathbb{R}^{+}$, such that $|K(x) - K(y)| \le c\, |x - y|$, $\forall x, y \in [a, b]$,

as shown in Figure 3. Using the mean value theorem, for all $x, y \in [a, b]$, $\exists\, \xi \in (a, b)$ such that

\[ |K(x) - K(y)| = |K'(\xi)\,(x - y)| = |K'(\xi)|\, |x - y|. \tag{A.7} \]
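The mean value theorem bound of equation A.7 involves the derivative $K'$; for the logistic activation, the maximal slope is $\beta/4$, attained at $v = 0$, which can be checked numerically (the evaluation grid below is an arbitrary choice):

```python
import numpy as np

def max_slope(beta):
    # Maximal derivative of phi(v) = 1 / (1 + exp(-beta * v)),
    # evaluated on a dense grid; the analytic maximum is beta / 4 at v = 0.
    v = np.linspace(-10.0, 10.0, 20001)
    phi = 1.0 / (1.0 + np.exp(-beta * v))
    return float(np.max(beta * phi * (1.0 - phi)))

assert np.isclose(max_slope(1.0), 0.25)    # beta/4 < 1: a contraction
assert np.isclose(max_slope(3.9), 0.975)   # still a contraction
assert max_slope(4.1) > 1.0                # beta > 4: not a contraction
```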
Now, the clause $c < 1$ in ii) becomes $c \ge |K'(\xi)|$, $\xi \in (a, b)$. For the example of the logistic nonlinear activation function of a neuron, $\Phi(v) = 1/(1 + e^{-\beta v})$, with slope $\beta$, the clause $c < 1$ gives $\beta < 4$ as the condition for the function $\Phi$ to be a contraction.

Acknowledgments
I thank Andrew Hanna of the University of East Anglia, Norwich, for his help with the simulations and useful comments.

References

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Bershad, N. J., Shynk, J. J., & Feintuch, P. L. (1993a). Statistical analysis of the single-layer backpropagation algorithm: Part I—Mean weight behaviour. IEEE Transactions on Signal Processing, 41(2), 573–582.
Bershad, N. J., Shynk, J. J., & Feintuch, P. L. (1993b). Statistical analysis of the single-layer backpropagation algorithm: Part II—MSE and classification performance. IEEE Transactions on Signal Processing, 41(2), 583–591.
Bharucha-Reid, A. T. (1976). Fixed point theorems in probabilistic analysis. Bulletin of the American Mathematical Society, 82(5), 641–657.
Connor, J. T., Martin, R. D., & Atlas, L. E. (1994). Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 5(2), 240–254.
DeRusso, P. M., Roy, R. J., Close, C. M., & Desrochers, A. A. (1998). State variables for engineers (2nd ed.). New York: Wiley.
Douglas, S. C., & Rupp, M. (1997). A posteriori updates for adaptive filters. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers (Vol. 2, pp. 1641–1645). Pacific Grove, CA.
Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimization. London: Academic Press.
Haykin, S. (1994). Neural networks—A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (1999). Neural networks—A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Kuan, C.-M., & Hornik, K. (1991). Convergence of learning algorithms with constant learning rates. IEEE Transactions on Neural Networks, 2(5), 484–489.
Mandic, D. P., Baltersee, J., & Chambers, J. A. (1998). Nonlinear prediction of speech with a pipelined recurrent neural network and advanced learning algorithms. In A. Prochazka, J. Uhlir, P. J. W. Rayner, & N. G. Kingsbury (Eds.), Signal analysis and prediction (pp. 291–309). Boston: Birkhauser.
Mandic, D. P., & Chambers, J. A. (1998). A posteriori real time recurrent learning schemes for a recurrent neural network based non-linear predictor. IEE Proceedings—Vision, Image and Signal Processing, 145(6), 365–370.
Mandic, D. P., & Chambers, J. A. (1999a). A posteriori error learning in non-linear adaptive filters. IEE Proceedings—Vision, Image and Signal Processing, 146(6), 293–296.
Mandic, D. P., & Chambers, J. A. (1999b). Relationship between the slope of the activation function and the learning rate for the RNN. Neural Computation, 11(5), 1069–1077.
Mandic, D. P., & Chambers, J. A. (2000a). A normalised real time recurrent learning algorithm. Signal Processing, 80(11), 1909–1916.
Mandic, D. P., & Chambers, J. A. (2000b). Relations between the a priori and a posteriori errors in nonlinear adaptive neural filters. Neural Computation, 12(6), 1285–1292.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4–27.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., & Dreyfus, G. (1991). Neural network training schemes for non-linear adaptive filtering and modelling. In Proceedings of the International Joint Conference on Neural Networks (Vol. 1, pp. 61–66). Seattle, WA.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., & Dreyfus, G. (1993). Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Computation, 5, 165–199.
Nerrand, O., Roussel-Ragot, P., Urbani, D., Personnaz, L., & Dreyfus, G. (1994). Training recurrent neural networks: Why and how? An illustration in dynamical process modelling. IEEE Transactions on Neural Networks, 5(2), 178–184.
Roy, S., & Shynk, J. J. (1989). Analysis of the data-reusing LMS algorithm. In Proceedings of the 32nd Midwest Symposium on Circuits and Systems (Vol. 2, pp. 1127–1130). Champaign, IL.
Schnaufer, B. A., & Jenkins, W. K. (1993). New data-reusing LMS algorithms for improved convergence. In Conference Record of the Twenty-Seventh Asilomar Conference on Signals and Systems (Vol. 2, pp. 1584–1588). Pacific Grove, CA.
Sheu, M.-H., Wang, J.-F., Chen, J.-S., Suen, A.-N., Jeang, Y.-L., & Lee, J.-Y. (1992). A data-reuse architecture for gray-scale morphologic operations. IEEE Transactions on Circuits and Systems—II: Analog and Digital Signal Processing, 39(10), 753–756.
Treichler, J. R., Johnson, Jr., C. R., & Larimore, M. G. (1987). Theory and design of adaptive filters. New York: Wiley.
Williams, R., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.
Yuille, A. L., & Kosowsky, J. J. (1993). Statistical physics algorithms that converge. In R. J. Mammone (Ed.), Artificial neural networks for speech and vision (pp. 19–35). London: Chapman & Hall.
Zurada, J. M., & Shen, W. (1990). Sufficient condition for convergence of a relaxation algorithm in actual single-layer neural networks. IEEE Transactions on Neural Networks, 1(4), 300–303.

Received May 9, 2000; accepted April 18, 2002.
LETTER
Communicated by Christian Kuhlmann
Training a Single Sigmoidal Neuron Is Hard

Jiří Šíma
[email protected]
Institute of Computer Science, Academy of Sciences of the Czech Republic, P.O. Box 5, 18207 Prague 8, Czech Republic

We first present a brief survey of hardness results for training feedforward neural networks. These results are then completed by the proof that the simplest architecture containing only a single neuron that applies a sigmoidal activation function σ: ℝ → [a, b], satisfying certain natural axioms (e.g., the standard (logistic) sigmoid or saturated-linear function), to the weighted sum of n inputs is hard to train. In particular, the problem of finding the weights of such a unit that minimize the quadratic training error within (b − a)² or its average (over a training set) within 5(b − a)²/(12n) of its infimum proves to be NP-hard. Hence, the well-known backpropagation learning algorithm appears not to be efficient even for one neuron, which has negative consequences in constructive learning.

1 The Complexity of Neural Network Loading
Neural networks establish an important class of learning models that are widely applied in practice to solving artificial intelligence tasks (Haykin, 1999). The most prominent position among the successful neural learning heuristics is occupied by the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986), which is often used for training feedforward networks. This algorithm is based on the gradient-descent method, which minimizes the quadratic regression error of a network with respect to given training data. For this purpose, each unit (neuron) in the network applies a differentiable activation function, for example, the standard (logistic) sigmoid, to the weighted sum of its local inputs (the so-called sigmoidal or standard sigmoid unit) rather than the discrete Heaviside (threshold) function with binary outputs (the so-called Heaviside or threshold neuron). However, the underlying optimization process appears to be very time-consuming even for small networks and training tasks. This was confirmed by an empirical study of the learning time required by the backpropagation algorithm to converge, which suggested its exponential scaling with the size of training sets (Tesauro, 1987) and networks (Tesauro & Janssens, 1988). Its slow convergence is probably caused by the inherent complexity of training feedforward networks.

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2709–2728 (2002)
The first attempt to theoretically analyze the time complexity of learning in feedforward networks is due to Judd (1990), who introduced the so-called loading problem: the problem of finding the weight parameters for a given fixed network architecture and a training task so that the network responses are perfectly consistent with all training data. For example, an efficient loading algorithm is needed for proper PAC (probably approximately correct) learnability (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989), besides the polynomial VC-dimension that the most common neural network models possess (Anthony & Bartlett, 1999; Roychowdhury, Siu, & Orlitsky, 1994; Vidyasagar, 1997). However, Judd (1990) proved the loading problem for feedforward networks to be NP-complete even if very strong restrictions are imposed on their architectures and training tasks. The drawback of Judd's proofs is in using quite unnatural network architectures with irregular interconnection patterns having a fixed input dimension and a growing number of outputs, which do not appear in practice. However, his arguments are valid for practically all the common unit types, including the standard sigmoid neurons. Eventually, Judd (1988) provided a polynomial-time loading algorithm for restricted shallow architectures, whose practical applicability was probably ruled out by the hardness result for loading deep triangular networks (Šíma, 1994), which was further extended to pyramidal networks (de Souto & de Oliveira, 1999). Furthermore, Parberry (1992) proved a similar NP-completeness result for loading feedforward networks with irregular interconnections and only a small constant number of units. In addition, Wiklicky (1994) has shown that the loading problem for higher-order networks with integer weights is not even algorithmically solvable.
In order to achieve hardness results for common layered architectures with complete connectivity between neighboring layers, Blum and Rivest (1992), in their seminal work, considered the smallest conceivable two-layer network with only three binary neurons (two hidden and one output unit) employing the Heaviside activation function. They proved the loading problem for such a three-node network with n inputs to be NP-complete and generalized the proof to a polynomial number of hidden units (in terms of n) when the output neuron computes logical AND. Hammer (1998a) further replaced the output AND gate by a threshold unit, while Kuhlmann (2000) achieved the proof for an output unit implementing any subclass of the Boolean functions depending on all the outputs from the hidden nodes. Lin and Vitter (1991) extended the NP-completeness result even to a two-node cascade architecture with one hidden unit connected to the output neuron, which also receives the inputs. However, Megiddo (1988) has shown that the loading problem for two-layer networks with a fixed number of real inputs, Heaviside hidden nodes, and an output unit implementing an arbitrary Boolean function is solvable in polynomial time.

Much effort has been spent on generalizing the hardness results to continuous activation functions, especially the standard sigmoid employed in the backpropagation heuristics, for which the loading problem is probably at least algorithmically solvable (Macintyre & Sontag, 1993). DasGupta, Siegelmann, and Sontag (1995) proved that loading a three-node network whose two hidden units apply the continuous saturated-linear activation function, while the output threshold unit is used for dichotomic classification, is NP-complete. Further, Höffgen (1993) showed the NP-completeness of loading a three-node network employing the standard sigmoid for exact interpolation, but with the severe restriction to binary weights. A more realistic setting as concerns backpropagation learning was first considered by Šíma (1996), who proved that loading a three-node network with two standard sigmoid hidden neurons is NP-hard, although he assumed an additional constraint on the weights of the output threshold unit used for binary classification, which is satisfied, for example, when the output bias is zero. Hammer (1998b) replaced this constraint by requiring the output unit with bounded weights to respond with outputs whose absolute value is greater than a given accuracy, which excludes a small output interval around zero from the binary classification. This approach also provides the hardness result for a more general class of sigmoidal activation functions than just the standard sigmoid. On the other hand, there exists a fixed sigmoidal activation function that, although artificial, has appropriate mathematical properties and for which the feedforward networks are always loadable (Sontag, 1992). The loading problem assumes a correct classification of all training data, while in practice one is typically satisfied by weights yielding a small training error. Therefore, the complexity of approximately interpolating a training set by feedforward neural networks with, in general, real outputs has been studied.
Jones (1997) considered a three-node network with n inputs, two hidden neurons employing any monotone Lipschitzian sigmoidal activation function (e.g., the standard sigmoid), and one linear output unit with bounded weights and zero bias. For such a three-node network, he proved that learning each pattern with real outputs from [0, 1] within a small absolute error 0 < ε < 1/10 is NP-hard, implying that the problem of finding the weights that minimize the total quadratic regression error within a fixed ε of its infimum (or absolutely) is also NP-hard. He extended this NP-hardness proof to a polynomial number k of sigmoidal hidden neurons (in terms of n) and a convex linear output unit (with nonnegative weights whose sum is 1) when the total quadratic error is required to be within 1/(16k⁵) of its infimum (or within 1/(4k³) of its infimum for Heaviside hidden units). In addition, Vu (1998) found relative error bounds (with respect to the error infimum) for hard approximate interpolation that are independent of the training set size p, by considering the average quadratic error defined as the total error divided by p. In particular, he proved that it is NP-hard to find the weights of a two-layer network with n inputs, k hidden sigmoidal neurons (satisfying certain Lipschitzian conditions), and one linear output unit with zero bias and positive weights such that, for given training data,
the average quadratic error is within a fixed bound of order O(1/(nk⁵)) of its infimum. Moreover, for two-layer networks with k hidden neurons employing Heaviside activations and one sigmoidal (or threshold) output unit, Bartlett and Ben-David (1999) improved this bound to O(1/k³), which is independent of the input dimension. In the case of the threshold output unit used for classification, DasGupta and Hammer (2000) proved the same relative error bound O(1/k³) on the fraction of correctly classified training patterns, which is NP-hard to achieve for training sets of size k^3.5 ≤ p ≤ k⁴ relative to the number k of hidden units. They also showed that it is NP-hard to approximate this success ratio within a relative error smaller than 1/2244 for three-node networks with n inputs, two hidden standard sigmoid neurons, and one output threshold unit (with bounded weights) exploited for classification with accuracy 0 < ε < 0.5. On the other hand, minimizing the failure ratio, that is, the number of misclassified training patterns divided by the minimum possible number of misclassifications (when the presence of errors is assumed), within every constant c > 1 is NP-hard for feedforward threshold networks with zero biases in the first hidden layer (DasGupta & Hammer, 2000). The preceding results suggest that training feedforward networks with fixed architectures is hard indeed. However, a possible way out of this situation might be constructive learning algorithms that adapt the network architecture to a particular training task. It is conjectured that for successful generalization the network size should be kept small; otherwise, a training set can easily be wired into a network implementing a lookup table (Sontag, 1992).
A constructive learning algorithm usually requires an efficient procedure for minimizing the training error by adapting the weights of only a single unit that is being added to the architecture, while the weights of the remaining units in the network are already fixed, for example, the cascade-correlation learning architecture (Fahlman & Lebiere, 1990). Clearly, for a single binary neuron employing the Heaviside activation function, weights that are consistent with given training data can be found in polynomial time by linear programming, provided that they exist, although this problem restricted to binary weights is NP-complete (Pitt & Valiant, 1988), and deciding whether the Heaviside unit can implement a Boolean function given in a disjunctive or conjunctive normal form is also co-NP-complete (Hegedüs & Megiddo, 1996). Such weights often do not exist, but a good approximate solution would be sufficient for constructive learning. Nevertheless, several authors provided NP-completeness proofs for the problem of finding the weights for a single Heaviside unit so that the number of misclassified training patterns is at most a given constant (Höffgen, Simon, & Van Horn, 1995; Roychowdhury, Siu, & Kailath, 1995), which remains NP-complete even if the bias is assumed to be zero (Amaldi, 1991; Johnson & Preparata, 1978). In this case, minimizing the failure ratio within every constant c > 1 is NP-hard (Arora, Babai, Stern, & Sweedyk, 1997).
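As an aside, the consistency search for a single Heaviside unit mentioned above is easy to experiment with. The polynomial-time claim in the text rests on linear programming; the sketch below (illustrative Python, invented here, not the paper's construction) instead uses the classical perceptron rule as a lightweight stand-in, which also terminates with consistent weights whenever the data are linearly separable, although only the LP formulation carries the polynomial-time guarantee.

```python
def heaviside(xi):
    # Threshold activation with binary outputs.
    return 1 if xi >= 0 else 0

def perceptron_fit(patterns, max_epochs=1000):
    """Search for weights (w0, w1, ..., wn) consistent with binary-labeled
    patterns; return None if none were found within the epoch budget."""
    n = len(patterns[0][0])
    w = [0.0] * (n + 1)  # w[0] is the bias
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:
            y = heaviside(w[0] + sum(wi * xv for wi, xv in zip(w[1:], x)))
            if y != d:  # classical perceptron update on a misclassified pattern
                delta = d - y
                w[0] += delta
                w[1:] = [wi + delta * xv for wi, xv in zip(w[1:], x)]
                errors += 1
        if errors == 0:  # all patterns classified consistently
            return w
    return None

# Logical AND is linearly separable, so consistent weights exist.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_fit(AND)
assert all(heaviside(w[0] + sum(wi * xv for wi, xv in zip(w[1:], x))) == d
           for x, d in AND)
```

On a non-separable set such as XOR, no consistent weights exist and the search necessarily fails, which is exactly where the approximate versions of the problem discussed above become relevant.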
Hush (1999) further generalized these results to a single sigmoidal neuron by showing that it is NP-hard to minimize the training error under the L1 norm strictly within 1 of its infimum. He conjectured that a similar result holds for the quadratic error corresponding to the L2 norm, which is used in backpropagation learning. In this article, this conjecture is proved. In particular, it will be shown that deciding whether there exist weights of a single neuron employing a sigmoidal activation function σ: ℝ → [a, b] satisfying certain natural axioms (e.g., the standard sigmoid or saturated-linear function), so that the total quadratic error with respect to the training data is at most a given constant, is NP-hard. The presented reduction also provides an argument that the problem of finding the weights that minimize the quadratic training error within (b − a)² or its average within 5(b − a)²/(12n) of its infimum is NP-hard. This implies that the popular backpropagation learning algorithm may not be efficient even for a single neuron and thus has negative consequences in constructive learning. A preliminary version of this article (Šíma, 2001) focused only on the standard sigmoid, while the results presented now are valid for a more general class of sigmoidal activation functions.

2 Training a Sigmoidal Neuron
In this section, the basic definitions regarding a sigmoidal neuron and its training will be reviewed. A single (perceptron) unit (neuron) with n real inputs x_1, ..., x_n ∈ ℝ first computes its real excitation,

ξ = w₀ + Σ_{i=1}^n w_i x_i,   (2.1)

where w = (w₀, w_1, ..., w_n) ∈ ℝ^{n+1} is its weight vector, including bias w₀. The unit then applies an activation function σ to the excitation and outputs

y(w, x) = σ(ξ).   (2.2)
Here we assume any sigmoidal activation function σ: ℝ → [a, b] (a < b) that can be normalized as

σ₁(ξ) = σ(ξ₀ + sξ)  for all ξ ∈ ℝ,   (2.3)

where ξ₀ ∈ ℝ is a shift, s ∈ {−1, 1} is a sign, and function σ₁: ℝ → [a, b] satisfies the following four axioms:

1. σ₁ is nondecreasing.

2. σ₁(−ξ) − a = b − σ₁(ξ) for all ξ ∈ ℝ.
3. lim_{ξ→−∞} σ₁(ξ) = a.

4. ξ_d ∈ ℝ is given such that σ₁(ξ_d) > a and, for all ξ_1, ξ_2 ≤ ξ_d and Δ > 0, ξ_1 < ξ_1 + Δ ≤ ξ_2 − Δ < ξ_2 implies (σ₁(ξ_1 + Δ) − a)² − (σ₁(ξ_1) − a)² ≤ (σ₁(ξ_2) − a)² − (σ₁(ξ_2 − Δ) − a)².

Correspondingly, we call such a neuron the sigmoidal unit. Note that conditions 2 and 3 also imply lim_{ξ→+∞} σ₁(ξ) = b. These axioms are satisfied by most commonly used activation functions, especially by the standard (logistic) sigmoid σ_s: ℝ → [0, 1] (a = 0, b = 1),

σ_s(ξ) = 1 / (1 + e^{−ξ}),   (2.4)

which is employed in the widely applied backpropagation learning heuristics. In this case, σ_s ≡ σ₁ is normalized, and we can choose ξ_d = ln 2, for which

(σ_s²)″(ξ) = 2e^{−ξ}(2e^{−ξ} − 1) / (1 + e^{−ξ})⁴ ≥ 0   (2.5)

for all ξ ≤ ξ_d, conforming to axiom 4. Another well-known example is the saturated-linear activation function σ_ℓ: ℝ → [0, 1] (a = 0, b = 1):

σ_ℓ(ξ) = 0 for ξ ≤ 0,  ξ for 0 < ξ < 1,  1 for ξ ≥ 1,   (2.6)

which can be normalized as σ₁(ξ) = σ_ℓ(1/2 + ξ) to satisfy axiom 2, where ξ_d = 1/2 then meets axiom 4. Furthermore, a training set,

T = {(x^k, d^k); x^k = (x^k_1, ..., x^k_n) ∈ ℝ^n, d^k ∈ [a, b], k = 1, ..., p},   (2.7)

is introduced containing p pairs (training patterns), each composed of an n-dimensional real input x^k and the corresponding desired scalar output value d^k from [a, b], consistent with the range of activation function σ. Given a weight vector w, the quadratic training error

E_T(w) = Σ_{k=1}^p (y(w, x^k) − d^k)² = Σ_{k=1}^p (σ(w₀ + Σ_{i=1}^n w_i x^k_i) − d^k)²   (2.8)

of the sigmoidal neuron with respect to training set T is defined as the difference between the actual outputs y(w, x^k), depending on the current weights w, and the desired outputs d^k over all training patterns k = 1, ..., p, measured by the L2 regression norm. The main goal of learning is to minimize training error 2.8 in the weight space. The decision version of the problem of minimizing the error of the sigmoidal neuron with respect to a given training set is formulated as follows:

Minimum Sigmoidal-Unit Error (MSUE)

Instance: A training set T and a positive real number ε > 0.

Question: Is there a weight vector w ∈ ℝ^{n+1} such that E_T(w) ≤ ε?
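To make these objects concrete, the sketch below (illustrative Python, not from the paper; the tiny training set and learning rate are invented for the demonstration) evaluates the quadratic training error 2.8 of a single unit under the standard sigmoid 2.4 and runs plain gradient descent on it, that is, backpropagation specialized to one neuron. On this easy separable instance the error is driven toward its infimum; the point of the hardness result is that nothing guarantees this in general.

```python
import math

def sigmoid(xi):
    # Standard (logistic) sigmoid, equation 2.4: a = 0, b = 1.
    return 1.0 / (1.0 + math.exp(-xi))

def output(w, x):
    # Equations 2.1 and 2.2: excitation xi = w0 + sum_i w_i x_i, then sigma(xi);
    # w[0] plays the role of the bias w0.
    return sigmoid(w[0] + sum(wi * xv for wi, xv in zip(w[1:], x)))

def training_error(w, T):
    # Equation 2.8: quadratic (L2) error summed over all training patterns.
    return sum((output(w, x) - d) ** 2 for x, d in T)

def gradient_step(w, T, lr=1.0):
    # One batch gradient-descent step on 2.8, using sigma'(xi) = y * (1 - y).
    grad = [0.0] * len(w)
    for x, d in T:
        y = output(w, x)
        g = 2.0 * (y - d) * y * (1.0 - y)  # dE/d(excitation) for this pattern
        grad[0] += g                       # the bias sees a constant input 1
        for i, xv in enumerate(x):
            grad[i + 1] += g * xv
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# A tiny training set for a unit with n = 2 inputs.
T = [((0.0, 1.0), 0.0), ((1.0, 0.0), 1.0)]
w = [0.1, 0.0, 0.0]
for _ in range(2000):
    w = gradient_step(w, T)
print(training_error(w, T))  # small on this easy instance
```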
In this section, the main result, that training even a single sigmoidal neuron is hard, will be proved:

Theorem 1. MSUE is NP-hard.
Proof. In order to achieve the NP-hardness result, a known NP-complete problem will be reduced to the MSUE problem in polynomial time (Garey & Johnson, 1979). In particular, the following feedback arc set problem is employed, which is known to be NP-complete (Karp, 1972):
Feedback Arc Set (FAS)
Instance: A directed graph G = (V, A) and a positive integer α ≤ |A|.

Question: Is there a subset A′ ⊆ A containing at most α ≥ |A′| directed edges such that the graph G′ = (V, A∖A′) is acyclic?

The FAS problem was also exploited for a similar result concerning the Heaviside unit (Roychowdhury et al., 1995). However, the reduction is adapted here for general sigmoidal activation functions, and its verification differs substantially.

3.1 The Construction of Graph G_r. Given a FAS instance G = (V, A), α, a corresponding graph G_r = (V_r, A_r) is first constructed so that every directed edge (u, v) ∈ A in G is replaced by c_σ parallel oriented paths,

P_{(u,v),h} = {(u, uv), (uv, uv_{h1}), (uv_{h1}, uv_{h2}), ..., (uv_{h,r−1}, v)},   (3.1)

for h = 1, ..., c_σ in G_r, sharing only their first edge (u, uv) and vertices u, uv, v, as depicted in Figure 1. The constant number c_σ of these parallel paths depends on the choice of activation function σ: ℝ → [a, b]:

c_σ = ⌈2((b − a) / (σ₁(ξ_d) − a))²⌉,   (3.2)
Figure 1: A subgraph of G_r with edges A_{(u,v)} corresponding to (u, v) in G.
where ξ_d is given by axiom 4 for a corresponding normalized function σ₁ (for example, c_σ = 5 for the standard sigmoid σ_s, or c_σ = 2 for the saturated-linear function σ_ℓ). In addition, each path P_{(u,v),h} includes

r = 8α + 6   (3.3)

additional vertices uv, uv_{h1}, uv_{h2}, ..., uv_{h,r−1} unique to (u, v) ∈ A; that is, the subsets of edges

A_{(u,v)} = ⋃_{h=1}^{c_σ} P_{(u,v),h}   (3.4)
corresponding to different (u, v) ∈ A are pairwise disjoint. Thus,

V_r = V ∪ {uv; (u, v) ∈ A} ∪ {uv_{h1}, uv_{h2}, ..., uv_{h,r−1}; (u, v) ∈ A, h = 1, ..., c_σ}   (3.5)

A_r = ⋃_{(u,v)∈A} A_{(u,v)}.   (3.6)

It follows that

|V_r| = (c_σ(r − 1) + 1)|A| + |V| = (c_σ(8α + 5) + 1)|A| + |V| = O(|A|² + |V|)   (3.7)

|A_r| = (c_σ r + 1)|A| = (c_σ(8α + 6) + 1)|A| = O(|A|²),   (3.8)

according to equation 3.3 (recall that c_σ is a constant and α ≤ |A|). Obviously, the FAS instance G, α has a solution iff the FAS problem is solvable for G_r, α.

3.2 The Reduction to the MSUE Problem. Graph G_r is then exploited for constructing the corresponding MSUE instance, which has p = 2|A_r| training patterns, two for each edge in A_r, for a sigmoidal unit with n = |V_r| inputs, one for each vertex in V_r. In particular, training set
T(G) = {(−s·x_{(i,j)}, a), (s·x_{(i,j)}, b); (i, j) ∈ A_r, x_{(i,j)} = (x_{(i,j),1}, ..., x_{(i,j),n}) ∈ {−1, 0, 1}^n}   (3.9)

contains one pair (−s·x_{(i,j)}, a), (s·x_{(i,j)}, b) for each edge (i, j) ∈ A_r, where s ∈ {−1, 1} comes from equation 2.3 and

x_{(i,j),ℓ} = −1 for ℓ = i,  1 for ℓ = j,  0 for ℓ ≠ i, j,   ℓ = 1, ..., n,  (i, j) ∈ A_r,   (3.10)

that is, input vector

x_{(i,j)} = (0, ..., 0, −1, 0, ..., 0, 1, 0, ..., 0) ∈ {−1, 0, 1}^n   (3.11)

(with −1 at position i and 1 at position j) actually encodes oriented edge (i, j) ∈ A_r. In addition, the error in the MSUE instance is required to be at most

ε = (2α + 1)(b − a)².   (3.12)
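The construction of sections 3.1 and 3.2 is entirely mechanical, and writing it out helps check the bookkeeping; the sketch below (illustrative Python, with the sign s = 1 and all identifier names invented here) builds G_r and the training set T(G) from a FAS instance and verifies the size formulas 3.7 and 3.8.

```python
import math

def c_sigma(a, b, sigma1_xid):
    # Equation 3.2: c_sigma = ceil(2 * ((b - a) / (sigma1(xi_d) - a))^2).
    return math.ceil(2.0 * ((b - a) / (sigma1_xid - a)) ** 2)

def build_msue_instance(V, A, alpha, a=0.0, b=1.0, c=5):
    """Construct G_r = (V_r, A_r) and training set T(G) per eqs. 3.1-3.12;
    c = 5 corresponds to the standard sigmoid (equation 3.2)."""
    r = 8 * alpha + 6                      # equation 3.3
    V_r, A_r = list(V), []
    for (u, v) in A:
        uv = ("uv", u, v)                  # vertex shared by all c paths
        V_r.append(uv)
        for h in range(1, c + 1):          # parallel paths P_{(u,v),h}, eq. 3.1
            inner = [("uvh", u, v, h, i) for i in range(1, r)]
            V_r.extend(inner)
            path = [uv] + inner + [v]
            if h == 1:
                A_r.append((u, uv))        # shared first edge, added once
            A_r.extend(zip(path, path[1:]))
    index = {vertex: i for i, vertex in enumerate(V_r)}
    n = len(V_r)
    T = []
    for (i, j) in A_r:                     # input vectors per eqs. 3.10-3.11
        x = [0] * n
        x[index[i]], x[index[j]] = -1, 1
        T.append(([-xv for xv in x], a))   # pair (-s*x, a), here s = 1
        T.append((x, b))                   # pair ( s*x, b)
    eps = (2 * alpha + 1) * (b - a) ** 2   # equation 3.12
    return V_r, A_r, T, eps

V, A, alpha = [1, 2, 3], [(1, 2), (2, 3), (3, 1)], 1
V_r, A_r, T, eps = build_msue_instance(V, A, alpha)
c, r = 5, 8 * alpha + 6
assert len(V_r) == (c * (r - 1) + 1) * len(A) + len(V)   # equation 3.7
assert len(A_r) == (c * r + 1) * len(A)                  # equation 3.8
assert len(T) == 2 * len(A_r)                            # p = 2|A_r|
```

The helper `c_sigma` reproduces the two constants quoted for equation 3.2: σ₁(ln 2) = 2/3 for the standard sigmoid gives c_σ = 5, and σ₁(1/2) = 1 for the normalized saturated-linear function gives c_σ = 2.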
Clearly, the construction of the corresponding MSUE instance can be achieved in polynomial time in terms of the size of the original FAS instance, according to equations 3.7 and 3.8. Now the correctness of the reduction will be verified; it will be shown that the FAS instance has a solution iff the corresponding MSUE instance is solvable.

3.3 The MSUE Instance Is Solvable. First assume that there exists a solution A′ ⊆ A of the FAS instance containing at most α ≥ |A′| directed edges, making graph G′ = (V, A∖A′) acyclic. Define a subset

A′_r = {(u, uv); (u, v) ∈ A′}   (3.13)
containing |A′_r| = |A′| ≤ α edges from A_r. Clearly, graph G′_r = (V_r, A_r∖A′_r) is also acyclic, and hence its vertices i ∈ V_r can be evaluated by integers w′_i so that any directed edge (i, j) ∈ A_r∖A′_r satisfies w′_i < w′_j. The corresponding weight vector w is defined as

w₀ = ξ₀,  w_i = K·w′_i  for i ∈ V_r,   (3.14)

where ξ₀ is introduced in equation 2.3 and K > 0 is a sufficiently large positive constant satisfying inequality

(σ₁(−K) − a)² ≤ (b − a)² / (2|A_r|)   (3.15)
(the existence of such K follows from axiom 3 for σ₁), which will be proved to be a solution for the MSUE instance. The error E_{T(G)}(w) is expressed by using equations 2.8, 3.9, 3.10, and 3.14:

E_{T(G)}(w) = Σ_{(i,j)∈A_r} (σ(ξ₀ + s·(w_i − w_j)) − a)² + Σ_{(i,j)∈A_r} (σ(ξ₀ + s·(w_j − w_i)) − b)²,   (3.16)

which can further be rewritten as

E_{T(G)}(w) = Σ_{(i,j)∈A_r} (σ₁(w_i − w_j) − a)² + Σ_{(i,j)∈A_r} (σ₁(w_j − w_i) − b)² = 2 Σ_{(i,j)∈A_r} (σ₁(w_i − w_j) − a)²   (3.17)
by using equation 2.3 and axiom 2 for σ₁. For (i, j) ∈ A_r∖A′_r, it holds that

w_i − w_j = K(w′_i − w′_j) ≤ −K < 0   (3.18)
according to equation 3.14, where w′_i − w′_j ≤ −1 because w′_i, w′_j are integers. This implies

(σ₁(w_i − w_j) − a)² ≤ (σ₁(−K) − a)² ≤ (b − a)² / (2|A_r|)  for (i, j) ∈ A_r∖A′_r   (3.19)

by inequality 3.15, because σ₁ is nondecreasing (axiom 1). Based on whether (i, j) ∈ A′_r or (i, j) ∈ A_r∖A′_r, error 3.17 can be split into two parts that are upper bounded separately, as follows:

E_{T(G)}(w) = 2 Σ_{(i,j)∈A′_r} (σ₁(w_i − w_j) − a)² + 2 Σ_{(i,j)∈A_r∖A′_r} (σ₁(w_i − w_j) − a)²
           ≤ 2|A′_r|(b − a)² + 2|A_r|(σ₁(−K) − a)²
           ≤ (2α + 1)(b − a)² = ε   (3.20)

by using |A′_r| ≤ α, (σ₁(ξ) − a)² ≤ (b − a)², |A_r∖A′_r| ≤ |A_r|, and inequality 3.19. Hence, w is a solution of the MSUE problem.
3.4 The FAS Instance Is Solvable. On the other hand, assume that there exists a weight vector w ∈ ℝ^{n+1} that is a solution of the MSUE instance, that is,

E_{T(G)}(w) ≤ ε.   (3.21)

Define a subset of edges

A′ = {(u, v) ∈ A; w_u ≥ w_v} ⊆ A   (3.22)

in G. First, observe that graph G′ = (V, A∖A′) is acyclic, since each vertex u ∈ V ⊆ V_r is evaluated by a real weight w_u ∈ ℝ so that any directed edge (u, v) ∈ A∖A′ in G′ satisfies w_u < w_v. Moreover, it must be checked that |A′| ≤ α. This is accomplished by proving that the error E_{T(G)}(w) on training set T(G) is bounded below,

E_{T(G)}(w) > |A′| · ε / (α + 1),   (3.23)

which combined with inequality 3.21 implies that |A′| < α + 1, or equivalently |A′| ≤ α. For this purpose, error E_{T(G)}(w) is again expressed:

E_{T(G)}(w) = Σ_{(i,j)∈A_r} (σ(w₀ + s·(w_i − w_j)) − a)² + Σ_{(i,j)∈A_r} (σ(w₀ + s·(w_j − w_i)) − b)²,   (3.24)
which can be rewritten as follows:

E_{T(G)}(w) = Σ_{(i,j)∈A_r} (σ₁(w″₀ + w_i − w_j) − a)² + Σ_{(i,j)∈A_r} (σ₁(−w″₀ + w_i − w_j) − a)²,   (3.25)

where substitution w″₀ = s·(w₀ − ξ₀) (i.e., w₀ = ξ₀ + s·w″₀), equation 2.3, and axiom 2 for σ₁ are applied. According to disjoint union 3.6, the error is split among the contributions from the subsets of edges A_{(u,v)} corresponding to edges (u, v) ∈ A in G and is further lower bounded by considering only the edges (u, v) from A′ ⊆ A:

E_{T(G)}(w) = Σ_{(u,v)∈A} E_{A_{(u,v)}} ≥ Σ_{(u,v)∈A′} E_{A_{(u,v)}},   (3.26)

where

E_{A_{(u,v)}} = Σ_{(i,j)∈A_{(u,v)}} ((σ₁(w″₀ + w_i − w_j) − a)² + (σ₁(−w″₀ + w_i − w_j) − a)²).   (3.27)

In the following, inequality

E_{A_{(u,v)}} > ε / (α + 1)   (3.28)

will be proved for any (u, v) ∈ A′, which is sufficient for proving lower bound 3.23 according to inequality 3.26. Clearly, w″₀ ≥ 0 can be assumed here without loss of generality, because definition 3.27 with w″₀ replaced by −w″₀ remains the same. For (u, v) ∈ A′, let P_{(u,v)} ⊆ A_{(u,v)} be a path associated with the minimum error,

E_{P_{(u,v)}} = Σ_{(i,j)∈P_{(u,v)}} ((σ₁(w″₀ + w_i − w_j) − a)² + (σ₁(−w″₀ + w_i − w_j) − a)²),   (3.29)

among paths P_{(u,v),h} for h = 1, ..., c_σ. Furthermore, denote by (c, d), (e, f) ∈ P_{(u,v)} the edges on this path with the first two maximal so-called decrements, w_c − w_d ≥ w_e − w_f ≥ w_i − w_j for all (i, j) ∈ P_{(u,v)}∖{(c, d), (e, f)}. Note that at least one decrement along path P_{(u,v)} must be nonnegative, since w_u − w_v ≥ 0 for (u, v) ∈ A′; that is, w_c − w_d ≥ 0.
3.4.1 Case w″₀ + w_e − w_f ≥ ξ_d. First, consider the case when w″₀ + w_e − w_f ≥ ξ_d, that is,

(σ₁(w″₀ + w_e − w_f) − a)² ≥ (σ₁(ξ_d) − a)²,   (3.30)

since σ₁ is nondecreasing (axiom 1). Recall that there are c_σ parallel paths for (u, v) ∈ A′ sharing only their first edge (u, uv) ∈ A_{(u,v)} (see Figure 1). Therefore, error E_{A_{(u,v)}} can be lower bounded by c_σ times the partial contribution (σ₁(w″₀ + w_e − w_f) − a)² from the single edge (e, f) that has the second maximal decrement w_e − w_f on path P_{(u,v)} with minimal error E_{P_{(u,v)}}. Note that in this lower bound, edge (e, f) with the second maximal decrement on path P_{(u,v)} is used, since edge (c, d) with the maximal decrement might coincide with (u, uv), which is shared by all parallel paths P_{(u,v),h} for h = 1, ..., c_σ. On the other hand, if (e, f) coincides with (u, uv), then edge (c, d) ∈ P_{(u,v)}∖{(u, uv)} could be used here, whose decrement w_c − w_d is still lower bounded by w_e − w_f. Hence,

E_{A_{(u,v)}} ≥ c_σ·(σ₁(w″₀ + w_e − w_f) − a)² ≥ c_σ·(σ₁(ξ_d) − a)² ≥ 2(b − a)² > ε / (α + 1),   (3.31)

according to inequality 3.30 and equations 3.2 and 3.12.

3.4.2 Case w″₀ + w_e − w_f < ξ_d. On the other hand, suppose that w″₀ + w_e − w_f < ξ_d. In this case, vertices i ∈ V_r∖V on path P_{(u,v)} for (u, v) ∈ A′ will possibly be relabeled with new weights w′_i ∈ ℝ so that the following three conditions are met:

1. w_u ≥ w_v are fixed.

2. E_{P_{(u,v)}} does not increase (possibly decreases).

3. (a) At most one edge (c, d) ∈ P_{(u,v)} has positive decrement w′_c − w′_d > 0, or (b) all edges (i, j) ∈ P_{(u,v)} are associated with nonnegative decrements w′_i − w′_j ≥ 0.

Note that cases 3a and 3b can happen simultaneously when all the decrements w′_i − w′_j = 0 are zero except for w′_c − w′_d > 0. Furthermore, notice that error E_{P_{(u,v)}} actually depends on the decrements w_i − w_j rather than on the actual weights w_i, w_j. For example, these decrements can arbitrarily be permuted along path P_{(u,v)}, producing new weights, whereas E_{P_{(u,v)}} and w_u, w_v do not change. Now suppose that there exists an edge (i, j) ∈ P_{(u,v)}∖{(c, d)} with positive decrement 0 < w_i − w_j ≤ w_c − w_d, together with an edge (ℓ, m) ∈ P_{(u,v)} associated with negative decrement w_ℓ − w_m < 0, as depicted in Figure 2. Such a situation breaks condition 3 above, and hence the
Figure 2: Weight relabeling procedure.
underlying decrements are updated as follows:

w′_i − w′_j = w_i − w_j − Δ   (3.32)

w′_ℓ − w′_m = w_ℓ − w_m + Δ,   (3.33)

where

Δ = min(w_i − w_j, w_m − w_ℓ) > 0,   (3.34)

which leads to at least one new zero decrement. This can be achieved by permuting the decrements along path P_{(u,v)} so that w_i − w_j follows immediately after w_ℓ − w_m (this produces new weights but preserves E_{P_{(u,v)}}) and by decreasing the weight of the middle vertex common to both decrements by Δ, which may influence error E_{P_{(u,v)}}. However, axiom 4 of σ₁ can be employed for ξ_1 = w″₀ + w_ℓ − w_m < ξ_2 = w″₀ + w_i − w_j ≤ w″₀ + w_e − w_f < ξ_d and for Δ defined in equation 3.34, which implies

(σ₁(w″₀ + w_i − w_j) − a)² + (σ₁(w″₀ + w_ℓ − w_m) − a)² ≥ (σ₁(w″₀ + w′_i − w′_j) − a)² + (σ₁(w″₀ + w′_ℓ − w′_m) − a)².   (3.35)

Similarly, inequality 3.35 with w″₀ ≥ 0 replaced by −w″₀ holds, which ensures that error E_{P_{(u,v)}} may only decrease, while w′_i − w′_j = 0 or w′_ℓ − w′_m = 0. By repeating this relabeling procedure, eventually at most one positive decrement w′_c − w′_d > 0 remains (case 3a), or all the negative decrements are eliminated (case 3b). Furthermore,

w′_c − w′_d + Σ_{(i,j)∈P_{(u,v)}∖{(c,d)}} (w′_i − w′_j) = w_u − w_v ≥ 0   (3.36)
due to (u, v) ∈ A′, which implies

w′_d − w′_c ≤ Σ_{(i,j)∈P_{(u,v)}∖{(c,d)}} (w′_i − w′_j).   (3.37)

Hence,

w′_d − w′_c ≤ w′_i − w′_j   (3.38)

for all (i, j) ∈ P_{(u,v)}, since the decrements w′_i − w′_j for (i, j) ∈ P_{(u,v)}∖{(c, d)} in sum 3.37 are all nonpositive (case 3a) or all nonnegative (case 3b). According to formula 3.38, inequality 3.28 would follow from

E_{A_{(u,v)}} ≥ E_{P_{(u,v)}} ≥ (σ₁(w″₀ + w′_c − w′_d) − a)² + r·(σ₁(w″₀ + w′_d − w′_c) − a)² + (σ₁(−w″₀ + w′_c − w′_d) − a)² + r·(σ₁(−w″₀ + w′_d − w′_c) − a)² > ε / (α + 1)   (3.39)

because there are r edges (i, j) on path P_{(u,v)} apart from (c, d), and σ₁ is nondecreasing (axiom 1). Particular terms of sum 3.39 can suitably be coupled so that it suffices to show

(σ₁(ξ) − a)² + r·(σ₁(−ξ) − a)² = (σ₁(ξ) − a)² + r·(b − σ₁(ξ))² > ε / (2(α + 1))   (3.40)

for any excitation ξ ∈ ℝ, where axiom 2 of σ₁ is used. For this purpose, domain ℝ of nondecreasing σ₁ is split into two disjoint intervals X and ℝ∖X so that

(σ₁(ξ) − a)² > ε / (2(α + 1))  for ξ ∈ ℝ∖X,   (3.41)

for which inequality 3.40 follows immediately. For ξ ∈ X,

(σ₁(ξ) − a)² ≤ ε / (2(α + 1)) = (2α + 1)(b − a)² / (2(α + 1)) < (b − a)²   (3.42)

holds according to equations 3.41 and 3.12, which implies

σ₁(ξ) ≤ a + (b − a)·√((2α + 1) / (2(α + 1))),   (3.43)
and σ₁(ξ) < b, that is,

(b − σ₁(ξ))² > 0.   (3.44)

Thus, if r is sufficiently large, then inequality 3.40 is satisfied, and the proof is complete. It will be proved that choice 3.3 of r is sufficient. In this case, it will even be shown that

(σ₁(ξ) − a)² + r·(b − σ₁(ξ))² ≥ (b − a)² > ε / (2(α + 1)),   (3.45)

which reduces to

r ≥ ((b − a)² − (σ₁(ξ) − a)²) / (b − σ₁(ξ))² = (b − 2a + σ₁(ξ)) / (b − σ₁(ξ))   (3.46)

for inequality 3.44. For every y ∈ Y, where

Y = σ₁(X) = {σ₁(ξ); ξ ∈ X} ⊆ [a, a + (b − a)·√((2α + 1) / (2(α + 1)))]   (3.47)

follows from inequality 3.43, define function F: Y → ℝ by the formula from equation 3.46:

F(y) = (b − 2a + y) / (b − y).   (3.48)

Function F proves to be increasing; for any y_1, y_2 ∈ Y, inequality F(y_1) < F(y_2) can be shown equivalent to y_1 < y_2 by using its explicit formula 3.48. Hence, according to inclusion 3.47, it suffices to show inequality 3.46 with σ₁(ξ) replaced by

a + (b − a)·√((2α + 1) / (2(α + 1))),   (3.49)

that is,

r ≥ (1 + √((2α + 1) / (2(α + 1)))) / (1 − √((2α + 1) / (2(α + 1)))).   (3.50)

Inequality 3.50 can further be rewritten as

((r − 1) / (r + 1))² ≥ (2α + 1) / (2(α + 1)),   (3.51)

which can be verified by using equation 3.3. This completes the argument for inequality 3.40, and consequently for inequality 3.28, providing the proof that A′ is a solution of the FAS problem.
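As a final sanity check, inequality 3.51 for the choice r = 8α + 6 can be confirmed mechanically; the sketch below (illustrative Python) does so in exact rational arithmetic, which matters because after clearing denominators the two sides differ by exactly 1, so the inequality is extremely tight and floating point could misjudge the comparison for large α.

```python
from fractions import Fraction

def inequality_351_holds(alpha):
    # Inequality 3.51 with r from equation 3.3:
    # ((r - 1) / (r + 1))^2 >= (2*alpha + 1) / (2*(alpha + 1)).
    r = 8 * alpha + 6
    return Fraction(r - 1, r + 1) ** 2 >= Fraction(2 * alpha + 1, 2 * (alpha + 1))

# Holds for every positive alpha (checked here over a range); with r reduced
# to 8*alpha + 5 it already fails at alpha = 1, so the choice 3.3 is tight.
assert all(inequality_351_holds(alpha) for alpha in range(1, 2001))
```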
3.5 The Relative Error. The proof of theorem 1 also gives the NP-hardness result regarding the relative (average) error bounds:
Given a training set T containing p = |T| training patterns, it is NP-hard to find a weight vector w ∈ ℝ^{n+1} that minimizes the quadratic training error E_T(w) within (b − a)², or its average E_T(w)/p within 5(b − a)²/(12n), of its infimum.
Given an FAS instance G D ( V, A ) , a, a corresponding MSUE instance T ( G) , e is constructed according to equations 3.9 and 3.12 in polynomial time. Assume that a weight vector w ¤ 2
ET ( G ) ( w ¤ ) ·
inf ET ( G) ( w ) C (b ¡ a) 2 .
w 2
(3.52)
The corresponding subset of edges A* ⊆ A making the graph G* = (V, A∖A*) acyclic can then be read from w* according to equation 3.22. It will be proved in the following that |A*| ≤ α iff the original FAS instance has a solution. This means that finding a weight vector w* that satisfies condition 3.52 is NP-hard. It suffices to show that for |A*| ≥ α + 1 there is no subset A′ ⊆ A such that |A′| ≤ α and G′ = (V, A∖A′) is acyclic, since the opposite implication is trivial. On the contrary, suppose that such a subset A′ exists. Thus, a weight vector w′ can be constructed for which

$$E_{T(G)}(w') \;\le\; 2|A'|\,(\sigma_1(-K)-a)^2 + 2\alpha(b-a)^2 \tag{3.53}$$
according to inequality 3.20. On the other hand, for |A*| ≥ α + 1, the inequality

$$\inf_{w} E_{T(G)}(w) \;\ge\; E_{T(G)}(w^*) - (b-a)^2 \;>\; |A^*| \cdot \frac{(2\alpha+1)(b-a)^2}{\alpha+1} - (b-a)^2 \;\ge\; 2\alpha(b-a)^2 \tag{3.54}$$

follows from inequalities 3.52, 3.23, and definition 3.12. Hence, by axiom 3 of σ₁, there exists K > 0 such that

$$2|A'|\,(\sigma_1(-K)-a)^2 \;<\; -2\alpha(b-a)^2 + \inf_{w} E_{T(G)}(w), \tag{3.55}$$
Jiří Šíma
which provides the contradiction $E_{T(G)}(w') < \inf_{w} E_{T(G)}(w)$.
4 Conclusion
The hardness results for loading feedforward networks are completed by the proof that approximate training of even a single sigmoidal neuron, for example, by using the popular backpropagation heuristics, is hard. This suggests that constructive learning algorithms that minimize the training error gradually by adapting unit by unit may not be efficient. These results are valid for a general class of sigmoidal activation functions characterized by axioms, including the most important standard (logistic) sigmoid.

Acknowledgments
This research was partially supported by project LN00A056 of the Ministry of Education of the Czech Republic.
Received February 19, 2002; accepted May 2, 2002.
LETTER
Communicated by Andrew Barto
Two Timescale Analysis of the Alopex Algorithm for Optimization

P. S. Sastry
[email protected]
Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India

M. Magesh
[email protected]
Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India

K. P. Unnikrishnan
[email protected]
General Motors R&D Center, Warren, MI 48090-9055, U.S.A.

Alopex is a correlation-based gradient-free optimization technique useful in many learning problems. However, there are no analytical results on the asymptotic behavior of this algorithm. This article presents a new version of Alopex that can be analyzed using techniques of the two timescale stochastic approximation method. It is shown that the algorithm asymptotically behaves like a gradient-descent method, though it does not need (or estimate) any gradient information. It is also shown, through simulations, that the algorithm is quite effective.

1 Introduction
Finding the values of parameters to optimize a performance index is a method of wide applicability in many disciplines, such as operations research, adaptive control, and neural networks. Most such optimization algorithms perform an iterative search in the parameter space. If the value of the objective function and its derivatives can be easily computed at any arbitrary point in the parameter space, then a number of methods, such as gradient descent, conjugate directions, and quasi-Newton, can be used. The computational costs and the convergence properties of such classical algorithms are fairly well understood (Bertsekas, 1995). However, there are many situations where optimization techniques that do not require derivative information are attractive. Examples include regression function learning and reinforcement learning, where one can observe only the noise-corrupted value of the objective function at any point

Neural Computation 14, 2729–2750 (2002) © 2002 Massachusetts Institute of Technology
in the parameter space. Even for deterministic problems, where the exact value of the objective function is available, the gradient information may be difficult to get. An example is the problem of obtaining the optimal hyperplane at each node while learning an oblique decision tree (Murthy, Kasif, & Salzberg, 1994; Shah & Sastry, 1999). Even for learning weights in a feedforward neural network, one may wish to avoid calculating the gradient, for example, through backpropagation, to make the learning algorithm somewhat independent of the network architecture (Dembo & Kailath, 1990) or for better hardware implementation (Jabri & Fowler, 1991). One of the standard approaches in such situations is to estimate the gradient by means of local perturbations. Parameter perturbation methods have a long history in adaptive control (Draper & Li, 1951). Other examples of such gradient estimation techniques include stochastic approximations (Spall, 1992; Kushner & Yin, 1997; Bharath & Borkar, 1999) and some perturbation-based methods for learning in neural networks (Dembo & Kailath, 1990; Jabri & Fowler, 1991). In all these methods, multiple function evaluations are done in the neighborhood of the current point in the parameter space, and the gradient information is estimated from these measurements. Another set of closely related methods are the reinforcement learning techniques (Narendra & Thathachar, 1989; Barto & Jordan, 1987; Williams, 1992; Santharam, Sastry, & Thathachar, 1994; Rajaraman & Sastry, 1999). These algorithms do not explicitly estimate the gradient; instead, they make stochastic moves in the parameter space using some distributions (which may be adapted at each iteration) so as to follow the gradient in an expected sense. Many of these algorithms are proved to converge to a local optimum. In this article, we consider Alopex, another similar optimization method.
This algorithm can be used for optimizing a multivariate function without needing (or explicitly estimating) any derivative information. Alopex has had one of the longest histories of such methods since its introduction for mapping visual receptive fields (Harth & Tzanakou, 1974). It has subsequently been modified and used in many applications, such as models of visual perception (Harth & Unnikrishnan, 1985; Harth, Unnikrishnan, & Pandya, 1986) and visual development (Nine & Unnikrishnan, 1994), pattern recognition (Venugopal, Pandya, & Sudhakar, 1992; Micheli-Tzanakou, 2000), adaptive control (Venugopal, Pandya, & Sudhakar, 1994), learning in neural networks (Unnikrishnan & Venugopal, 1994; Bia, 2001), and learning decision trees (Shah & Sastry, 1999). Empirically, the Alopex algorithm has been shown to be robust and effective in many applications. For example, Unnikrishnan and Venugopal (1994), through extensive simulations, showed it to be an efficient technique to learn weights in feedforward and recurrent neural networks. The algorithm is independent of network architecture, the objective function to be minimized, and the choice of activation functions. Thus, it is an attractive model-free distributed algorithm. However, its asymptotic behavior is poorly understood. At present, there is no theoretical result to guarantee
that the algorithm even proceeds toward a local optimum. In this article, we present a modified Alopex algorithm whose asymptotic behavior can be understood using the two timescale analysis techniques (Borkar, 1997). We also show through simulations the effectiveness of this algorithm. In any general optimization algorithm that does an iterative search over the parameter space, there are two decisions to be made at each iteration: the direction of movement from the current point and the step size along the chosen direction. The Alopex algorithm that we discuss here uses a constant step size, and the choice of direction of movement is stochastic. The probability of choosing a direction depends nonlinearly on the correlation between changes in parameter values and changes in the objective function over a few past iterations. Further, the choice of direction is computed in a distributed manner. Unlike the perturbation methods referred to earlier, Alopex does not explicitly compute any gradient estimate. In fact, Alopex needs only one function measurement per iteration. While the reinforcement learning methods also do not explicitly estimate the gradient, unlike them, the stochastic direction that Alopex takes depends on correlations of past changes in parameter values and the objective function values. As a result, Alopex does not have any simple Markovian property, rendering its analysis more difficult. We begin with a brief description of the original Alopex algorithm following Unnikrishnan and Venugopal (1994) in section 2. In section 3, we present a new version of Alopex (which may be called two timescale Alopex or 2t-Alopex) and explain the motivation for this version. We also present an analysis that clearly brings out the nature of the search performed by this algorithm and shows that the algorithm goes to a local minimum. In section 4, we present some simulation results and conclude in section 5.

2 The Alopex Algorithm
Let E: ℜ^m → ℜ be the objective function. We are interested in finding a w = (w₁, …, w_m)ᵗ ∈ ℜ^m where E attains a minimum. Let w(k) = (w₁(k), …, w_m(k))ᵗ be the current guess of a minimum at the kth stage of the iterative process. Then, under the Alopex algorithm, w(k) is updated as follows:

$$w(k+1) = w(k) + \eta\, x(k), \tag{2.1}$$

where x(k) = (x₁(k), …, x_m(k))ᵗ is given by

$$x_i(k) = \begin{cases} -1 & \text{with probability } p_i(k)\\ +1 & \text{with probability } 1 - p_i(k), \end{cases} \tag{2.2}$$

with p_i(k) given by

$$p_i(k) = \frac{1}{1 + \exp(-c_i(k)/T(k))} \tag{2.3}$$
and c_i(k) is given by

$$\begin{aligned} c_i(k) &= \Delta w_i(k)\, \Delta E(k)\\ \Delta w_i(k) &= w_i(k) - w_i(k-1) = \eta\, x_i(k-1)\\ \Delta E(k) &= E(w(k)) - E(w(k-1)). \end{aligned} \tag{2.4}$$

T(k) in equation 2.3 is a "temperature" parameter updated every N iterations (for a suitably chosen N) using the following annealing schedule:

$$T(k) = \begin{cases} T(k-1) & \text{if } k \text{ is not a multiple of } N\\[4pt] \dfrac{1}{Nm} \displaystyle\sum_{i=1}^{m} \sum_{k'=k-N}^{k} |c_i(k')| \;\left( = \dfrac{\eta}{N} \sum_{k'=k-N}^{k} |\Delta E(k')|, \text{ because } |\Delta w_i(k)| = \eta \right) & \text{otherwise.} \end{cases} \tag{2.5}$$
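The update rules 2.1 through 2.5 fit in a few lines of code. The sketch below is only an illustration of the equations, not the authors' implementation: the quadratic objective, step size, and variable names are invented for the example, and it uses the N = 1 annealing variant that the text attributes to Bia (2001).

```python
import numpy as np

def alopex_step(E, w, w_prev, E_prev, eta, rng):
    """One Alopex iteration (equations 2.1-2.5), with annealing window
    N = 1, so the temperature is the mean |c_i| of the current step."""
    E_curr = E(w)
    c = (w - w_prev) * (E_curr - E_prev)            # c_i(k) = dw_i(k) dE(k), eq. 2.4
    T = np.mean(np.abs(c)) + 1e-12                  # eq. 2.5 with N = 1
    p = 1.0 / (1.0 + np.exp(-np.clip(c / T, -30, 30)))  # eq. 2.3: Prob(x_i = -1)
    x = np.where(rng.random(w.size) < p, -1.0, 1.0)     # eq. 2.2
    return w + eta * x, E_curr                      # eq. 2.1

rng = np.random.default_rng(0)
E = lambda w: float(w @ w)                          # toy objective with minimum at 0
eta = 0.01
w_prev = rng.standard_normal(3) + 2.0
w = w_prev + eta * rng.choice([-1.0, 1.0], size=3)  # random first move
E_prev = E(w_prev)
E0 = E_prev                                         # initial error, for comparison
for _ in range(5000):
    w_next, E_prev = alopex_step(E, w, w_prev, E_prev, eta, rng)
    w_prev, w = w, w_next
# w should end up hovering within a few step sizes of the minimizer.
```

With the temperature tied to the mean |c_i| of the current step, the ratios c_i/T stay of order one, which is the self-scaling property noted in the text.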
The algorithm is initialized by randomly choosing w(0) and obtaining w(1) by making each x_i(0) take the value +1 or −1 with equal probability. From equations 2.2 and 2.3, it is clear that at each iteration, the algorithm makes a random choice of direction x(k), computed in a distributed fashion, and takes a constant-sized step along that direction. The x_i are independent of each other (given the p_i's). The probability of x_i(k) taking the value −1, p_i(k), depends on the correlation between Δw_i(k) and ΔE(k) (which is c_i(k)), as given by equation 2.3. For example, if, over the previous two iterations, w_i has been increased and E has decreased, then with probability greater than 0.5, x_i would be +1. How sharply this probability goes toward 0 or 1 with increasing magnitude of correlation depends on the temperature. From the annealing schedule given by equation 2.5, it is clear that the algorithm has some self-scaling property. This is because in determining p_i, we are comparing the current ΔE with its recent averaged value. Recently, Bia (2001) has shown, through extensive empirical investigations, that one can get very good performance by having N = 1 in the annealing schedule. There is another way to view the strategy underlying the Alopex algorithm. At each iteration, the algorithm selects a random direction to move from the current point. Since this is done in a distributed fashion, each of the w_i is to be increased or decreased (randomly) independent of the others. From the algorithm, it is easy to see that this is done as follows. If (over the previous two iterations) the objective function E is decreasing, then we move w_i(k) in the same direction as in the previous iteration with a probability greater than 0.5. Similarly, if E is increasing, then each w_i is moved in a direction opposite to that in the previous iteration, with a probability greater than 0.5. As is easy to see, if E were a function of only one variable, even if
the above kind of step is taken deterministically, we will go toward a local minimum because ΔE Δw would have the same sign as the derivative. In the multidimensional case, we have to make the above kind of independent movement in each coordinate direction stochastic so that we will explore all directions of movement. The Alopex algorithm searches over only countably many w if the step size η is constant. However, {w(k), k ≥ 0} is not Markovian because w(k) depends on both w(k−1) and w(k−2). We can make it Markovian by choosing Z(k) = ⟨w(k), w(k−1)⟩ as the state. Suppose we modify equation 2.1 so as to keep w in a compact set by introducing self-loops in the chain at all end points. Then Z(k) becomes a finite-state ergodic Markov chain. However, due to the form of Z(k), finding closed-form solutions to the steady-state probabilities does not appear to be feasible. In the next section, we suggest a small modification to the basic Alopex algorithm. We can understand the asymptotic behavior of this modified algorithm by using ideas from two timescale stochastic approximations (Borkar, 1997).

3 A Two Timescale Version of Alopex
In this section, we present our modified version of the Alopex (which will be termed 2t-Alopex) algorithm. The main modification is that p_i(k) depends on p_i(k−1) in addition to being dependent on the correlation between ΔE(k) and Δw_i(k). After specifying the algorithm, we discuss the motivation for the modification and present some analysis to elucidate the asymptotic behavior of the algorithm. The 2t-Alopex algorithm is as follows:

$$w(k+1) = w(k) + \eta\, x(k), \tag{3.1}$$

where x(k) = (x₁(k), …, x_m(k))ᵗ is given by

$$x_i(k) = \begin{cases} -1 & \text{with probability } p_i(k)\\ +1 & \text{with probability } 1 - p_i(k), \end{cases} \tag{3.2}$$

with p_i(k) given by

$$p_i(k) = (1-\lambda)\, p_i(k-1) + \lambda\, b_i(k) = p_i(k-1) + \lambda\,\bigl(b_i(k) - p_i(k-1)\bigr), \tag{3.3}$$

where

$$b_i(k) = \frac{1}{1 + \exp\bigl(-x_i(k-1)\,[E(w(k)) - E(w(k) - \eta\, x(k-1))]/\eta T(k)\bigr)}, \tag{3.4}$$

with T(k) given by equation 2.5, and λ and η are step-size parameters such that η = o(λ). As in the original Alopex algorithm, for initialization, we
choose a random w(0) and make p_i(0) = 0.5 for all i. From equation 3.3, p_i(k) is a weighted average of p_i(k−1) and b_i(k). From equation 3.4, we see that b_i(k) is essentially the same as the right-hand side of equation 2.3, which is used to calculate p_i(k) in the original algorithm. (Since c_i(k) = η x_i(k−1) ΔE(k), the only difference between the right-hand side of equation 2.3 and that of 3.4 is that we moved the η factor to the denominator inside the exponential function; the reason for this will become clear from the analysis presented later.) To understand the motivation behind the modified Alopex, consider the original Alopex algorithm as given by equations 2.1 through 2.5. For sufficiently small η, from equation 2.1, we can expect the algorithm to take w(k) toward a local minimum if, for each i, x_i(k) = −1 whenever ∂E/∂w_i(w(k)) > 0, and vice versa. By writing ΔE(k) = E(w(k)) − E(w(k) − η x(k−1)) and approximating E(w(k) − η x(k−1)) by a first-order Taylor series, we can write, from equation 2.4,

$$c_i(k) \approx \eta\, x_i(k-1) \left[ \eta \sum_{j=1}^{m} \frac{\partial E}{\partial w_j}(w(k))\, x_j(k-1) \right] = \eta^2 \frac{\partial E}{\partial w_i}(w(k)) + \eta^2 \sum_{j \ne i} \frac{\partial E}{\partial w_j}(w(k))\, x_i(k-1)\, x_j(k-1), \tag{3.5}$$
where we have used x_i(k)² = 1 for all i, k. When η is small, we can expect that many successive random x's would be computed using essentially the same w. Hence, we can expect that the second term in equation 3.5 would be very small in magnitude due to the x_i x_j terms averaging close to zero. Then c_i would essentially be determined by the ith partial derivative of E, thus providing (with a high probability) the correct descent direction for equation 2.1. One can say that this is the heuristic idea behind the Alopex algorithm. However, there is no analytical result to show that the algorithm takes us to a local minimum. In our modification of the Alopex algorithm, the above intuitive idea is forced on the behavior of the algorithm. For this, we introduced some dynamics in p_i(k). Since λ is much larger than η (recall that we stipulated in the modified algorithm that η = o(λ)), we run the dynamics of p_i(k) much faster than that of w(k). This can ensure that the kind of averaging referred to earlier takes place and that x(k) gives the correct descent direction.

3.1 Analysis of the Two Timescale Alopex Algorithm. In this section, we discuss the asymptotic behavior of the 2t-Alopex algorithm with a fixed temperature parameter (that is, T(k) = T for all k in equation 3.4). Define the vectors p(k) = (p₁(k) ⋯ p_m(k))ᵗ and b(k) = (b₁(k) ⋯ b_m(k))ᵗ. The 2t-Alopex algorithm given by equations 3.1 through 3.4 can be written in vector notation as

$$w(k+1) = w(k) + \eta\,[F(w(k), p(k)) + M(k)] \tag{3.6}$$
$$p(k+1) = p(k) + \lambda\,[G(w(k), p(k)) + N(k)], \tag{3.7}$$

where

$$F(w, p) = \mathbb{E}[x(k) \mid w(k) = w,\ p(k) = p] \tag{3.8}$$
$$G(w, p) = \mathbb{E}[b(k) - p(k) \mid w(k) = w,\ p(k) = p]. \tag{3.9}$$

Here 𝔼 denotes the expectation, and M(k), N(k) are zero-mean independent and identically distributed (i.i.d.) sequences given by

$$M(k) = x(k) - F(w(k), p(k)) \tag{3.10}$$
$$N(k) = [b(k) - p(k)] - G(w(k), p(k)). \tag{3.11}$$
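With the temperature held fixed, as in the analysis of section 3.1, the 2t-Alopex updates 3.1 through 3.4 can be sketched the same way. Again this is only an illustration with an invented objective and invented constants, not the authors' code:

```python
import numpy as np

def alopex_2t_step(E, w, x_prev, p, eta, lam, T, rng):
    """One 2t-Alopex iteration (equations 3.1-3.4) at fixed temperature T.
    p runs on the fast timescale (lam), w on the slow one (eta = o(lam))."""
    dE = E(w) - E(w - eta * x_prev)                 # E(w(k)) - E(w(k) - eta x(k-1))
    arg = np.clip(x_prev * dE / (eta * T), -30, 30)
    b = 1.0 / (1.0 + np.exp(-arg))                  # b_i(k), eq. 3.4
    p = (1.0 - lam) * p + lam * b                   # eq. 3.3
    x = np.where(rng.random(w.size) < p, -1.0, 1.0) # eq. 3.2: Prob(x_i = -1) = p_i
    return w + eta * x, x, p                        # eq. 3.1

rng = np.random.default_rng(1)
E = lambda w: float(w @ w)                          # toy objective with minimum at 0
eta, lam, T = 0.001, 0.1, 0.1                       # eta much smaller than lam
w = rng.standard_normal(4) + 2.0
x = rng.choice([-1.0, 1.0], size=4)                 # x(0) chosen uniformly
p = np.full(4, 0.5)                                 # p_i(0) = 0.5
E0 = E(w)                                           # initial error, for comparison
for _ in range(20000):
    w, x, p = alopex_2t_step(E, w, x, p, eta, lam, T, rng)
# w drifts toward the minimizer, as the two timescale analysis predicts.
```

Because η is much smaller than λ, the vector p(k) equilibrates between moves of w(k), which is exactly the timescale separation the analysis exploits.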
From equations 3.6 and 3.7, we can intuitively understand the dynamics of the algorithm as follows. Since η is an order of magnitude smaller than λ, p(k) evolves much faster than w(k). Thus, we can say that the fast-evolving dynamics of p(k) given by equation 3.7 sees w(k) as "almost constant," while the slowly evolving dynamics of w(k) given by equation 3.6 sees p(k) as "almost equilibrated." This intuitive idea can be formalized using the techniques of the two timescale stochastic approximations (Borkar, 1997). First consider the dynamics of p(k) given by equation 3.7 with w(k) held constant at, say, w. (It may be noted here that in defining b_i through equation 3.4, we made sure that the p(k) dynamics is meaningful even when w(k) is held constant.) Then, using standard techniques for analyzing such stochastic difference equations (Borkar, 1997), for sufficiently small λ, the asymptotic behavior of a suitably interpolated continuous-time version of the process p(k), denoted by p_w(t),¹ is well approximated by the solution of the ordinary differential equation (ODE)

$$\dot{p}_w = G(w, p_w), \qquad p_w(0) = p(0), \tag{3.12}$$

where G(w, p) is given by the limit of the right-hand side of equation 3.9 as λ → 0 (though, with a little abuse of notation, we use the same symbol G for both functions).

¹ An interpolation that is suitable here is p_w(t) = p(k) for t ∈ [kλ, (k+1)λ).

Here the subscript w on p is intended to emphasize that we are considering the dynamics of p(k) when w(k) is held constant at a specific w. Suppose the ODE 3.12 has a globally asymptotically stable equilibrium point, denoted by p*(w). (Note that the equilibrium point of the equation
would be a function of w.) The intuition that the slowly evolving process w(k) sees the process p(k) as "almost equilibrated" leads us to think that we can replace p(k) in equation 3.6 by p*(w(k)). Hence, once again, a suitably interpolated continuous-time version of the process w(k), to be denoted by w̄(t), would be well approximated, for sufficiently small η, by the ODE

$$\frac{d\bar{w}(t)}{dt} = F\bigl(\bar{w}(t),\ p^*(\bar{w}(t))\bigr), \qquad \bar{w}(0) = w(0). \tag{3.13}$$
All of these intuitive arguments can be formalized using the results in Borkar (1997). (Specifically, we can conclude from the results in Borkar, 1997, that if the step-size parameters are replaced by time-varying step sizes η_k, λ_k that satisfy the usual stochastic approximation conditions, Σ_k η_k = Σ_k λ_k = ∞ and Σ_k η_k² < ∞, Σ_k λ_k² < ∞, and if we maintain η_k = o(λ_k), and if the ODE 3.12 has a globally asymptotically stable solution for each w, then the asymptotic behavior of w(k) is well approximated by the solution of equation 3.13 with an almost sure sense of convergence. We can also extend this result to constant step-size algorithms by thinking of the approximation in a weak convergence sense.) Thus, we can understand the asymptotic behavior of the 2t-Alopex algorithm if we can characterize the functions p*(w) and hence F(w, p*(w)). Specifically, if we can show that F_i(w, p*(w)), the ith component of the vector F(w, p*(w)), has the same sign as −∂E/∂w_i(w) for all w ∈ ℜ^m, then we can conclude that the 2t-Alopex algorithm takes us toward a local minimum of E. This is what we do in the rest of this section. We begin by showing that there exists a stationary point for equation 3.12. That is, we show that given any w ∈ ℜ^m, there is a p (which depends on w) such that the right-hand side of equation 3.12 is zero. Any stationary point of ODE 3.12 satisfies
$$\mathbb{E}[b \mid w, p] = p, \tag{3.14}$$

where b = (b₁, …, b_m)ᵗ and b_i is given by equation 3.4 under the limit η → 0. (Note that η = o(λ) and hence η → 0 as λ → 0.) Thus, b_i (which depends on w and on the random vector x whose distribution is determined by p) is given by

$$b_i = \frac{1}{1 + \exp[-x_i\, x^t\, \nabla E(w)/T]}, \tag{3.15}$$

where ∇E denotes the gradient of E with respect to the parameter vector w. The expression for b_i given by equation 3.15 is obtained by expanding E(w − η x) in equation 3.4 in a Taylor series and taking the limit as η → 0. (As remarked earlier, comparing equations 2.3 and 3.4, we see that in the 2t-Alopex algorithm, we shifted the η factor to the denominator inside the exponential function. This is done to ensure that the right-hand side of equation 3.4 has a proper limit as η → 0.)
Since w is fixed, to keep the notation simple, we write 𝔼[b | p] for 𝔼[b | w, p]. Using equations 3.4 and 3.2, we have

$$\mathbb{E}[b_i \mid p] = \sum_{x \in X} \frac{1}{1 + \exp[-x_i\, x^t\, \nabla E(w)/T]}\; g_x(p), \tag{3.16}$$

where X is the set of all m-dimensional vectors whose components are either +1 or −1, and g_x is given by

$$g_x(p) = \prod_{j=1}^{m} p_j^{(1-x_j)/2}\, (1-p_j)^{(1+x_j)/2}. \tag{3.17}$$

It is easily seen that for each x ∈ X, g_x is a continuous function from [0, 1]^m to [0, 1], and hence 𝔼[b_i | p] is a continuous function from [0, 1]^m to [0, 1]. Thus, 𝔼[b | p] is a continuous function from the convex compact set [0, 1]^m into itself. Hence, by Brouwer's fixed-point theorem, there exists a p* ∈ [0, 1]^m such that 𝔼[b | p*] = p*. This p* would be a stationary point of equation 3.12 because it satisfies equation 3.14. (It may be remembered that this p* would be a function of w, though we are not showing this dependence explicitly to keep the notation simple.)
It is easily seen that for each x 2 X , gx is a continuous function from [0, 1]m to [0, 1] and hence E [bi | p ] is a continuous function from [0, 1]m to [0, 1]. Thus, E [b | p ] is a continuous function from the convex compact set [0, 1]m into itself. Hence, by Brouwer ’s xed-point theorem, there exists a p ¤ 2 [0, 1]m such that E [b | p ¤ ] D p ¤ . This p ¤ would be a stationary point of equation 3.12 because it satises equation 3.14. (It may be remembered that this p ¤ would be a function of w , though we are not showing this dependence explicitly to keep the notation simple.) Having shown the existence of stationary point, we next investigate some relevant properties of (any) stationary point of ODE 3.12. Denote Ei ( w ) D @@wEi (w ) , i D 1, . . . , m. To keep the notation simple, we will not explicitly show the argument w all the time. Let E0i D Ei / T. Dene bN by bN D
1 P . 1 C exp[¡ j xj Ej0 ]
(3.18)
From equations 3.18 and 3.15, we have bi D bN
D 1 ¡ bN
if
xi D 1
if
xi D ¡1.
(3.19)
Now we can write an expression for E [bi | p ] using equation 3.19 as
E [bi | p ] D E [bi | p , xi D 1]Prob(xi D 1) C E [b i | p , xi D ¡1]Prob(xi D ¡1) D (1 ¡ pi ) E [bN | p , xi D 1] C pi E [1 ¡ bN | p , xi D ¡1]
D (1 ¡ pi ) E [bN | p , xi D 1] C pi ¡ pi E [bN | p , xi D ¡1].
(3.20)
Let h(x, y) = x^{(1−y)/2} (1−x)^{(1+y)/2}. Now we have

$$\begin{aligned} \mathbb{E}[\bar{b} \mid p,\ x_i = 1] &= \sum_{x} \frac{1}{1 + \exp[-E_i' - \sum_{j \ne i} x_j E_j']} \prod_{j \ne i} h(p_j, x_j)\\ \mathbb{E}[\bar{b} \mid p,\ x_i = -1] &= \sum_{x} \frac{1}{1 + \exp[E_i' - \sum_{j \ne i} x_j E_j']} \prod_{j \ne i} h(p_j, x_j). \end{aligned} \tag{3.21}$$
Suppose E_i(w) < 0. Then for all x ∈ X,

$$-E_i'(w) - \sum_{j \ne i} x_j E_j'(w) \;>\; E_i'(w) - \sum_{j \ne i} x_j E_j'(w).$$
Hence, for any w such that E_i(w) < 0, we have

$$\mathbb{E}[\bar{b} \mid p,\ x_i = 1] < \mathbb{E}[\bar{b} \mid p,\ x_i = -1], \qquad \forall p. \tag{3.22}$$

Using equations 3.20 and 3.22, we have, for any w such that E_i(w) < 0,

$$\mathbb{E}[b_i \mid p] < (1-p_i)\, \mathbb{E}[\bar{b} \mid p,\ x_i = -1] + p_i - p_i\, \mathbb{E}[\bar{b} \mid p,\ x_i = -1]. \tag{3.23}$$

Thus, we have (for any w such that E_i(w) < 0)

$$\mathbb{E}[b_i \mid p] < (1-2p_i)\, \mathbb{E}[\bar{b} \mid p,\ x_i = -1] + p_i, \qquad \forall p. \tag{3.24}$$

Applying the above for p* and using the fact that 𝔼[b | p*] = p*,

$$p_i^* = \mathbb{E}[b_i \mid p^*] < (1-2p_i^*)\, \mathbb{E}[\bar{b} \mid p^*,\ x_i = -1] + p_i^*.$$

Since 0 < 𝔼[b̄ | p*, x_i = −1] < 1, we can now conclude that

$$E_i(w) < 0 \;\Rightarrow\; (1-2p_i^*) > 0 \quad \left(\text{or } p_i^* < \tfrac{1}{2}\right). \tag{3.25}$$
In a similar manner, we can conclude that

$$E_i(w) > 0 \;\Rightarrow\; (1-2p_i^*) < 0 \quad \left(\text{or } p_i^* > \tfrac{1}{2}\right). \tag{3.26}$$
Thus, (any) stationary point of ODE 3.12 is such that p*_i < 1/2 if E_i < 0, and vice versa. We can now use this to understand the asymptotic behavior of our algorithm using ODE 3.13. From equation 3.8, we see that F_i(w, p*(w)), the ith component of the right-hand side of ODE 3.13, is

$$F_i(w, p^*(w)) = -1 \cdot p_i^*(w) + 1 \cdot (1 - p_i^*(w)) = 1 - 2p_i^*(w).$$

From equations 3.25 and 3.26, we see that

$$E_i(w) > 0 \;\Rightarrow\; 1 - 2p_i^*(w) < 0 \;\Rightarrow\; F_i(w, p^*(w)) < 0 \tag{3.27}$$
$$E_i(w) < 0 \;\Rightarrow\; 1 - 2p_i^*(w) > 0 \;\Rightarrow\; F_i(w, p^*(w)) > 0. \tag{3.28}$$
Hence, we can conclude that the algorithm implements the correct descent direction and thus takes w ( k) toward a local minimum of E.
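For small m, a stationary point p* of ODE 3.12 can be computed numerically by enumerating X, which makes the sign properties 3.25 through 3.28 easy to check. In the sketch below, the gradient vector and the damped fixed-point iteration are invented for the illustration:

```python
import itertools
import numpy as np

def expected_b(p, grad, T=1.0):
    """E[b_i | p] from equations 3.15-3.17, by enumerating x in {-1,+1}^m."""
    m = len(p)
    total = np.zeros(m)
    for xs in itertools.product([-1.0, 1.0], repeat=m):
        x = np.array(xs)
        g_x = np.prod(np.where(x < 0, p, 1.0 - p))     # eq. 3.17: Prob(x_j = -1) = p_j
        b = 1.0 / (1.0 + np.exp(-x * (x @ grad) / T))  # eq. 3.15
        total += b * g_x
    return total

grad = np.array([1.0, -0.5, 2.0])    # hypothetical gradient of E at the fixed w
p = np.full(3, 0.5)
for _ in range(500):                 # damped iteration toward the fixed point
    p += 0.2 * (expected_b(p, grad) - p)   # stationary point of ODE 3.12 (eq. 3.14)
print(np.round(p, 3))
```

At the fixed point, each p*_i sits above 1/2 exactly where the corresponding partial derivative is positive, so F_i(w, p*(w)) = 1 − 2p*_i has the sign opposite to that of ∂E/∂w_i, as equations 3.25 through 3.28 state.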
So far we have shown that ODE 3.12 has at least one stationary point and that any stationary point is such that we get the correct descent direction (see equations 3.27 and 3.28). To complete the analysis, we need to show that ODE 3.12 has a unique globally asymptotically stable stationary point. For this, we show that $p^*$, the stationary point of ODE 3.12, is globally asymptotically stable and hence is also unique. From equation 3.20, we have

$$E[b_i \mid p] - p_i = (1 - p_i)\,E[\tilde b \mid p, x_i = 1] - p_i\,E[\tilde b \mid p, x_i = -1]$$
$$\qquad = E[\tilde b \mid p, x_i = 1] - p_i\bigl(E[\tilde b \mid p, x_i = 1] + E[\tilde b \mid p, x_i = -1]\bigr). \tag{3.29}$$
Define the function $q_i : [0, 1]^m \to \mathbb{R}$ by $q_i(p) = E[b_i \mid p] - p_i$. From equation 3.21, it is easily seen that $0 < E[\tilde b \mid p, x_i = 1],\, E[\tilde b \mid p, x_i = -1] < 1$. Now from equation 3.29, we have that $q_i(p) > 0$ if $p_i = 0$ and $q_i(p) < 0$ if $p_i = 1$. Also, we know that there is a $p^* \in [0, 1]^m$ such that $q_i(p^*) = 0$. It is clear from equation 3.21 that $E[\tilde b \mid p, x_i = 1]$ and $E[\tilde b \mid p, x_i = -1]$ do not depend on $p_i$ (they are functions of all the other components of $p$). Hence, using equation 3.29, we see that

$$\frac{\partial q_i}{\partial p_i}(p) < 0, \quad \forall p. \tag{3.30}$$
Since in the analysis above the index $i$ is arbitrary, equation 3.30 holds for all $i = 1, \ldots, m$. Define the function $q : [0, 1]^m \to \mathbb{R}^m$ by $q(p) = [q_1(p) \cdots q_m(p)]^t$. Now, from equation 3.30 and the mean value theorem, we can conclude that

$$(q(p) - q(p^*))^t (p - p^*) < 0, \quad \forall p \neq p^*. \tag{3.31}$$
(Note that we have used $q(p^*) = 0$.) Now consider a Lyapunov function for ODE 3.12 given by $V(p) = \tfrac{1}{2}(p - p^*)^t (p - p^*)$. It is easy to see that $V(p) > 0,\ \forall p \neq p^*$, and $V(p^*) = 0$. We have

$$\frac{dV(p(t))}{dt} = (p(t) - p^*)^t \bigl(q(p(t)) - q(p^*)\bigr) < 0, \quad \forall p \neq p^*,$$

by equation 3.31. Hence, there is a unique $p^*$ satisfying $E[b_i \mid p^*] = p_i^*$, $i = 1, \ldots, m$, which is globally asymptotically stable for ODE 3.12, thus completing our analysis.

The analysis makes explicit the motivation for the 2t-Alopex algorithm. By stipulating $\eta = o(\lambda)$, the dynamics of $p(k)$ evolve much faster. The equilibrium of these dynamics is such that the resulting $x$ gives the negative gradient direction (in an expected sense). Thus, although the algorithm does not have (or estimate) gradient information, it implements a biased random walk in parameter space that moves toward a local minimum.
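The biased random walk just described can be made concrete with a small sketch. The following is an illustrative single-timescale, Alopex-style update, not the authors' exact 2t-Alopex iteration; the acceptance rule, the running-average temperature, the step size, and the quadratic test function are our assumptions.

```python
import numpy as np

def alopex_minimize(f, w0, delta=0.01, steps=30000, seed=0):
    """Correlation-based random walk: each parameter moves by +/-delta,
    and the probability of moving up next is small when past up-moves
    correlated with increases in the error."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    x = rng.choice([-1.0, 1.0], size=w.shape)   # current step directions
    E = f(w)
    T = None                                    # annealed "temperature"
    for _ in range(steps):
        dw = delta * x
        w = w + dw
        E_new = f(w)
        C = dw * (E_new - E)                    # per-parameter correlation
        m = float(np.mean(np.abs(C)))
        T = m if T is None else 0.99 * T + 0.01 * m
        T = max(T, 1e-12)
        # P(next move is up) = 1 / (1 + exp(C / T)): small when C > 0
        p_up = 1.0 / (1.0 + np.exp(np.clip(C / T, -60.0, 60.0)))
        x = np.where(rng.random(w.shape) < p_up, 1.0, -1.0)
        E = E_new
    return w, E

# minimize a simple quadratic without ever computing a gradient
target = np.array([1.0, 2.0, 3.0])
w_est, err = alopex_minimize(lambda v: np.sum((v - target) ** 2), np.zeros(3))
```

Keeping the temperature proportional to a running average of $|C|$ holds the acceptance probabilities in a usefully stochastic range at every scale, which is the role the annealing schedule plays in the original algorithm.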
One can say that introducing the dynamics in $p(k)$ has the same effect as obtaining many random $x$'s at the same $w$ by using correlations and then moving $w(k)$ in the direction of the mean of these random vectors. The two-timescale idea gives us the theoretical handle to ensure that such averaging does indeed take effect.

4 Simulation Results
Unnikrishnan and Venugopal (1994) showed, through extensive simulations, that the Alopex algorithm is efficient and attractive for learning weights in both feedforward and recurrent neural networks. We have observed that, as expected, the 2t-Alopex algorithm also performs equally well in such neural network applications. For the problem of learning weights in a feedforward neural network with sigmoidal activation functions, there are efficient procedures for obtaining gradient information. Thus, although the performance of Alopex on such problems is impressive (particularly because it demands much less information about the model than, say, backpropagation), it does not effectively illustrate the utility of the algorithm. Hence, in this section, we start with a problem where a gradient-free optimization technique is attractive.

4.1 Learning Oblique Decision Trees. The problem we consider is that of learning decision tree classifiers from examples. Almost all algorithms for learning decision trees grow the tree recursively, starting with the root. The main computation in such algorithms is the following. At any node, given the set of examples that would land in that node, we use some function to assign a figure of merit to any hyperplane that could be used at that node in the decision tree. The objective is then to find the hyperplane with the best value of the figure of merit among all allowable hyperplanes. The classic ID3 algorithm (and its later versions, such as C4.5) allows only hyperplanes that are parallel to one of the axes of the feature space. Decision trees that allow only such hyperplanes can be called axis-parallel decision trees. If we allow any general hyperplane at each node, the tree is called an oblique decision tree or a multivariate decision tree (Murthy et al., 1994; Brodley & Utgoff, 1995). In either case, we need a figure of merit to evaluate hyperplanes.
Some of the possibilities that have been successfully employed are the information gain and the decrease in the so-called Gini impurity (Quinlan, 1983; Fayyad & Irani, 1992; Breiman, Friedman, Olshen, & Stone, 1984; Murthy et al., 1994). These measures assess the increase in the skewness of the class distribution in the subsets of examples that land in the children nodes, as compared to that at the parent node, as a result of using the hyperplane at the parent node. Another function proposed recently is based on a notion of improving the degree of linear separability (Shah & Sastry, 1999). In all these cases, it is difficult to obtain the gradient of the performance measure with respect to
the hyperplane parameters (Breiman et al., 1984; Murthy et al., 1994; Shah & Sastry, 1999).² In the case of learning axis-parallel decision trees using either the ID3 or the CART methodology, brute-force optimization techniques are used (Quinlan, 1983; Fayyad & Irani, 1992; Breiman et al., 1984). Here one evaluates the figure of merit for every relevant hyperplane and thus finds the optimum. For learning axis-parallel decision trees, this is feasible because the number of allowable hyperplanes at a node is not unmanageably large. However, for learning oblique decision trees, such a strategy is not feasible. Often what is done is to perturb the hyperplane parameters iteratively, evaluate the figure of merit for some randomly chosen hyperplanes around the current point, and choose the best of them (Breiman et al., 1984; Murthy et al., 1994). Hence, a gradient-free optimization technique whose asymptotic behavior is well understood is certainly useful for this problem.

For the simulations presented here, we consider learning oblique decision trees. We choose two synthetic two-class pattern recognition problems with a two-dimensional feature space. In both problems, the classes are separable by a piecewise linear surface. This allows one to judge the learned decision tree by viewing the regions in the feature space represented by it. The results obtained are shown in Figures 1, 2, and 3. In all cases, the same set of examples is used by the algorithms to learn the trees. Each figure shows the learned decision trees on the left side and the underlying class regions on the right. Each node in the tree is labeled with the name of the hyperplane learned for that node, and this hyperplane is also shown plotted in the feature space on the right half of the figure.
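The Gini-impurity figure of merit discussed above can be sketched as follows; the data set, the function names, and the tiny brute-force axis-parallel search are our illustrative choices, not code from the paper.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_c f_c^2 of a label vector."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    f = counts / len(y)
    return 1.0 - float(np.sum(f ** 2))

def gini_decrease(X, y, w, w0):
    """Figure of merit: impurity at the parent minus the weighted
    impurity of the two children induced by the hyperplane w.x + w0 > 0."""
    side = X @ w + w0 > 0
    n = len(y)
    child = (side.sum() / n) * gini(y[side]) + ((~side).sum() / n) * gini(y[~side])
    return gini(y) - child

def best_axis_parallel_split(X, y):
    """Brute-force ID3/CART-style search: evaluate every split of the
    form x_d > threshold and keep the best figure of merit."""
    best = (-1.0, None, None)
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d]):
            w = np.zeros(X.shape[1])
            w[d] = 1.0
            m = gini_decrease(X, y, w, -t)
            if m > best[0]:
                best = (m, d, t)
    return best

X = np.array([[0.1, 0.9], [0.2, 0.3], [0.8, 0.7], [0.9, 0.2]])
y = np.array([0, 0, 1, 1])
merit, dim, thr = best_axis_parallel_split(X, y)
```

On this toy set, the search recovers the split $x_0 > 0.2$; its impurity decrease equals the full parent impurity of 0.5 because both children are pure.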
Looking at the hyperplanes corresponding to different nodes in the tree plotted in the feature space, one can assess how well the learned tree captures the separating surface between the classes. Figure 1 compares the trees obtained with the OC1 algorithm and our method on one of the problems, and Figure 2 does the same for the second problem. OC1 is a standard method for learning oblique decision trees (Murthy et al., 1994). This algorithm uses the decrease in Gini impurity as the objective function and an optimization technique based on evaluating the objective function at many points obtained through specialized random perturbation of the current parameters. For the comparison here, we have used the same objective function (reduction in Gini impurity) but the 2t-Alopex algorithm as the optimization technique. As is easy to see,

² By this, we mean that there is no formula or efficient algorithm for obtaining the gradient. Since function measurements are available, one can always estimate the gradient by multiple function measurements. It may also be noted here that whether it is easy to obtain the gradient depends on the function used to evaluate the goodness of a hyperplane at a node. For example, we may choose the objective to be learning a vector $w$ and a scalar $w_0$ such that $\sum_i (w^t x_i + w_0 - y_i)^2$ is minimized. (Here, the summation is over all training examples $(x_i, y_i)$, where $x_i$ is the feature vector and $y_i$ is the class label.) Then the gradient is easy to obtain, and we can use, for example, the LMS algorithm for optimization. However, most successful decision tree algorithms use functions based on the distribution of different classes in the children nodes, as cited above.
Figure 1: Comparison of decision trees learned with the OC1 algorithm (a, b) with those learned with 2t-Alopex (c, d). In both cases, the objective function is a reduction in Gini impurity. (a, c) The learned trees. (b, d) The corresponding class regions with the hyperplanes learned at different nodes of the tree superimposed.
using Alopex results in trees that are as good as or better than those obtained with OC1.³ Figure 3 shows the trees learned using the separability-based

³ The OC1 algorithm is simulated using the freely downloadable code available from Murthy et al. (1994). Both the perturbation-based optimization technique of OC1 and the
Figure 2: Comparison of decision trees learned with OC1 algorithm (a, b) with those learned with 2t-Alopex (c, d). In both cases, the objective function is a reduction in Gini impurity. (a, c) The learned trees. (b, d) The corresponding class regions with the hyperplanes learned at different nodes of the tree superimposed.
2t-Alopex are stochastic algorithms. Hence, with the same set of examples, we would, in general, get different trees on different runs. For the results shown here, we ran each algorithm 10 times with different random seeds and chose the best tree in terms of accuracy on a separately generated test set of examples. All trees shown have comparable accuracies (on the test set) that vary between 98.7% and 99.2%.
Figure 3: Decision trees learned with a separability-based objective function and Alopex as the optimization technique. (a, c) The learned trees. (b, d) The corresponding class regions with the hyperplanes learned at different nodes of the tree superimposed.
figure of merit proposed in Shah and Sastry (1999), with the 2t-Alopex algorithm used as the optimization technique. As can be seen, the trees learned are much more compact than those in Figures 1 and 2. The Gini impurity measure is continuous in the fractions of different class patterns present in the example set. Hence, assuming that we have a large number of i.i.d. samples and that the underlying class-conditional densities are continuous, one can expect the figure of merit based on Gini impurity to be differentiable with respect to the hyperplane parameters. However, the separability-based figure of merit uses counts of the patterns that are most likely responsible for a set of patterns being linearly nonseparable (see Shah & Sastry, 1999, for details), and hence it is difficult to say whether this figure-of-merit function is differentiable in the parameters of the hyperplane.⁴ However, making use of the 2t-Alopex optimization algorithm on this separability-based objective function, we are able to learn good hyperplanes at the various nodes and arrive at compact and accurate decision trees. One can generalize from this experience and say that the existence of a good gradient-free optimization technique can encourage researchers to consider more complex figure-of-merit functions for learning better oblique decision trees.

4.2 Linear System Identification. The second problem on which we illustrate the performance of the 2t-Alopex algorithm is one for which the gradient of the objective function is available and an efficient gradient-descent technique exists. The problem is that of identifying a linear system using squared error as the objective function. The LMS algorithm is a very effective and efficient technique for solving this problem. We show that the performance of the 2t-Alopex algorithm is comparable to that of LMS on this problem.
Considering that Alopex does not need any gradient information and that it is also capable of learning nonlinear models, it is very interesting that its performance is comparable to that of LMS on linear system identification. Here we also show the results obtained with the original Alopex algorithm (Unnikrishnan & Venugopal, 1994) for comparison. In this problem, the specific system to be identified has the transfer function given by
$$H(z^{-1}) = \frac{0.05 - 0.4z^{-1}}{1.0 - 1.1314z^{-1} + 0.25z^{-2}}. \tag{4.1}$$
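Equation 4.1 corresponds to the difference equation $y(n) = 1.1314\,y(n-1) - 0.25\,y(n-2) + 0.05\,u(n) - 0.4\,u(n-1)$. As a hedged sketch of the identification setup (not the authors' code), the following simulates the noiseless system and fits the four parameters $b_0, b_1, a_1, a_2$ of the second-order model described below, using a normalized, equation-error form of LMS; the step size and run length are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50000
u = rng.uniform(0.0, 1.0, N)        # i.i.d. input, uniform over [0, 1]

# simulate H(z^-1) = (0.05 - 0.4 z^-1) / (1 - 1.1314 z^-1 + 0.25 z^-2)
y = np.zeros(N)
for n in range(N):
    y[n] = (1.1314 * (y[n - 1] if n >= 1 else 0.0)
            - 0.25 * (y[n - 2] if n >= 2 else 0.0)
            + 0.05 * u[n]
            - 0.4 * (u[n - 1] if n >= 1 else 0.0))

# normalized equation-error LMS over theta = [b0, b1, a1, a2], where the
# model output is b0*u(n) + b1*u(n-1) - a1*y(n-1) - a2*y(n-2)
theta = np.zeros(4)
mu = 0.5
for n in range(2, N):
    phi = np.array([u[n], u[n - 1], -y[n - 1], -y[n - 2]])
    e = y[n] - phi @ theta
    theta += mu * e * phi / (1e-8 + phi @ phi)
```

In this noiseless, sufficient-order setting the estimate converges to the true parameters $(0.05, -0.4, -1.1314, 0.25)$; with output noise, or with the reduced-order model, only an approximate fit is possible, which is the regime compared in Figures 4 and 5.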
⁴ For our analysis, we assumed that the objective function has continuous first partial derivatives. This is needed because the analysis is intended to show that the algorithm behaves like gradient descent. However, since we do not explicitly estimate gradient information, we do not run into problems such as numerical instability of the iterates, even if the objective function is not differentiable at some points.

We need to estimate this transfer function using only input-output data. We
explore both sufficient-order and reduced-order modeling. In the former case, we assume that the order of the system is known, and thus we use the class of models given by

$$H(z^{-1}) = \frac{b_0 + b_1 z^{-1}}{1.0 + a_1 z^{-1} + a_2 z^{-2}}. \tag{4.2}$$
This class of models is parameterized by four parameters: $b_0$, $b_1$, $a_1$, and $a_2$. Identification here involves estimating the four parameters to optimize some chosen performance measure. In the case of reduced-order modeling, we chose the model as a first-order system given by

$$H(z^{-1}) = \frac{b}{1.0 - az^{-1}}. \tag{4.3}$$
This class of models is parameterized by two parameters: $b$ and $a$. Here, identification involves estimating the two parameters. In both cases, we have used the squared difference between the instantaneous system output and the model output, averaged over the previous 50 iterations, as the error to be minimized. In the case of sufficient-order modeling, the optimization is over $\mathbb{R}^4$, and in the case of reduced-order modeling, it is over $\mathbb{R}^2$. In each case, we considered identification in the presence of noise. For simulating the noisy case, we added i.i.d. zero-mean gaussian noise with $\sigma = 0.1$ to the output of the system given by equation 4.1. During training, the input to the system is an i.i.d. sequence uniform over the interval [0, 1]. The learning curves for this problem are shown in Figures 4 and 5 for the two cases of sufficient-order modeling and reduced-order modeling, respectively. These show the plot of the error (averaged over 10 experiments) at each instant during the training phase. As can be seen, the performance of the 2t-Alopex algorithm is as good as that of LMS. From the figures, it is also easily seen that the 2t-Alopex algorithm learns as fast as the original Alopex in the case of sufficient-order modeling and performs better in the case of reduced-order modeling. Also, the variance in the learning curve of 2t-Alopex is somewhat smaller than that seen in the learning curves of the original Alopex.

5 Conclusion
We have presented a two-timescale version of the Alopex algorithm. We presented an analysis of the asymptotic behavior of the algorithm to justify its use as an optimization technique. Apart from proving that the algorithm can attain local optima, the analysis is interesting because it clearly brings out how correlations between changes in parameters and changes in the
Figure 4: Learning curves for the linear system identification: sufficient-order model with noise.
Figure 5: Learning curves for the linear system identification: reduced-order model with noise.
objective function can be used in a stochastic search technique to yield good descent directions, even though we do not explicitly estimate the gradient at any point. We also presented some simulation results to illustrate the effectiveness of this algorithm. The first problem we considered is that of learning an oblique decision tree from examples. Here, it is difficult to obtain gradient information, and hence 2t-Alopex is effective. We also showed that even
for linear system identification under noise, the performance of the 2t-Alopex algorithm is comparable to that of LMS. Recently, the 2t-Alopex algorithm was shown to be very effective for learning optimal cepstral compensating vectors for speaker adaptation in speech recognition (Srivatsan, 2002), which is also an optimization problem in which it is not feasible to obtain gradient information. The examples we have considered here represent different types of learning problems. However, the update equations for the parameters under the Alopex algorithm are the same, as is to be expected when one views it as a general gradient-free optimization technique. Normally, one chooses learning algorithms based on the model used. In general, this is because different parameters in the model affect the error (the objective function to be minimized) in different ways. When one uses a gradient-based optimization technique, the updating of different parameters is automatically treated differently by making use of the derivative information that characterizes the sensitivity of the objective function to changes in different parameters. Thus, one can say that the difficulty in designing gradient-free optimization techniques lies in finding ways by which proper descent directions are automatically obtained even though the sensitivity information in terms of derivatives is not available. In this context, the analysis presented in section 3.1 shows how correlations between changes in individual parameter values and changes in the objective function can be used to obtain good descent directions. Also, computations based on such correlations are likely to be numerically more stable than computations that involve explicit estimation of gradients through multiple function evaluations, especially when function measurements are corrupted with noise. In this sense, we feel that the Alopex algorithm analyzed here is a good general-purpose technique for gradient-free optimization.

Acknowledgments
We thank V. S. Borkar for helpful comments on the text and Srivatsan Laxman for help with the figures.

References

Barto, A. G., & Jordan, M. I. (1987). Gradient following without backpropagation in layered networks. In Proc. IEEE First Intl. Conf. Neural Networks (pp. II629–II636). Piscataway, NJ: IEEE Press.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Bharath, B., & Borkar, V. S. (1999). Stochastic approximation algorithms: Overview and recent trends. Sadhana, 24, 425–452.
Bia, A. (2001). Alopex-B: A new, simpler, but yet faster version of the Alopex training algorithm. Int. J. Neural Systems, 11, 497–507.
Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems and Control Letters, 29, 291–294.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77.
Dembo, A., & Kailath, T. (1990). Model free distributed learning. IEEE Transactions on Neural Networks, 1, 58–70.
Draper, C. S., & Li, Y. T. (1951). Principles of optimalising control systems and an application to internal combustion engine. New York: ASME Publications.
Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous valued attributes in decision tree generation. Machine Learning, 8, 87–102.
Harth, E., & Tzanakou, E. (1974). Alopex: A stochastic method for determining visual receptive fields. Vision Research, 14, 1475–1485.
Harth, E., & Unnikrishnan, K. P. (1985). Brainstem control of sensory information: A mechanism for perception. Int. J. Psychophysiology, 3, 110–119.
Harth, E., Unnikrishnan, K. P., & Pandya, A. S. (1986). The inversion of sensory information processing by feedback pathways: A model of visual cognitive functions. Science, 237, 187–189.
Jabri, M., & Fowler, B. (1991). Weight perturbation: An optimal architecture and learning technique for analog VLSI feed-forward and recurrent multilayer networks. Neural Computation, 3, 546–565.
Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications. New York: Springer-Verlag.
Micheli-Tzanakou, E. (2000). Supervised and unsupervised pattern recognition: Feature extraction in computational intelligence. Boca Raton, FL: CRC Press.
Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. J. Artif. Intell. Res., 2, 1–32.
Narendra, K. S., & Thathachar, M. A. L. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice Hall.
Nine, H. S., & Unnikrishnan, K. P. (1994).
The role of subplate feedback in the development of ocular dominance columns. In J. M. Bower (Ed.), Computation in neurons and neural systems (pp. 389–393). Norwell, MA: Kluwer.
Quinlan, J. R. (1983). Learning efficient classification procedures and their applications to chess end games. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 1). Palo Alto, CA: Tioga.
Rajaraman, K., & Sastry, P. S. (1999). Stochastic optimisation over continuous and discrete variables with applications to concept learning under noise. IEEE Transactions on Systems, Man and Cybernetics, 29, 542–553.
Santharam, G., Sastry, P. S., & Thathachar, M. A. L. (1994). Continuous action set learning automata for stochastic optimisation. Journal of the Franklin Institute, 331, 607–628.
Shah, S., & Sastry, P. S. (1999). New algorithms for learning and pruning oblique decision trees. IEEE Transactions on Systems, Man and Cybernetics, Part C, 29, 495–505.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
Srivatsan, L. (2002). Speaker specific compensation for speech recognition. Unpublished master's thesis, Indian Institute of Science, Bangalore.
Unnikrishnan, K. P., & Venugopal, K. P. (1994). Alopex: A correlation based learning algorithm for feed-forward and recurrent neural networks. Neural Computation, 6, 469–490.
Venugopal, K. P., Pandya, A. S., & Sudhakar, R. (1992). Invariant recognition of 2D objects using Alopex neural networks. SPIE Proc., 1708, 182–190.
Venugopal, K. P., Pandya, A. S., & Sudhakar, R. (1994). A recurrent neural network controller and learning algorithm for the online control of autonomous underwater vehicles. Neural Networks, 7, 833–846.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Received July 27, 2001; accepted May 3, 2002.
LETTER
Communicated by Gudrun Klinker
A New Color 3D SFS Methodology Using Neural-Based Color Reflectance Models and Iterative Recursive Method

Siu-Yeung Cho
[email protected]
Department of Electronic and Information Engineering, Hong Kong Polytechnic University, Hong Kong

Tommy W. S. Chow
[email protected]
Department of Electronic Engineering, City University of Hong Kong, Hong Kong

In this article, a new methodology for the color shape from shading (SFS) problem is proposed. The problem of color SFS refers to the well-known fact that most real objects usually contain mixtures of diffuse and specular color reflections and are affected by multicolored interreflection under unknown reflectivity. In this article, these limitations are addressed, and a new color SFS methodology is proposed. The proposed approach focuses on two main parts. First, a generalized neural-based color reflectance model is developed. Second, an iterative recursive method is developed to reconstruct a multicolor 3D surface. Experiments on synthetic-colored objects and real-colored objects were performed to demonstrate the performance of the proposed methodology.

1 Introduction
Human vision is quite adept at turning a planar source of information into a fairly accurate representation of solid objects, thereby reconstructing the missing third dimension. The idea of extracting shape or depth information from an image has been studied in the field of computer vision since the late 1960s. The basic idea of the shape from shading (SFS) technique was first suggested by Horn (1970) and further studied by Horn and Brooks (1989) and Horn and Sjoberg (1990). The classical formulation of the SFS problem is based on the conventional Lambertian model together with minimization of a cost function using the calculus of variations (Horn & Brooks, 1986). It was realized at that time that, in the absence of considerable necessary information, the problem of shape recovery is highly ill posed. Determining the actual shape of objects in a scene is therefore extremely difficult. In order to tackle this ill-posed SFS problem, other approaches were developed for direct shape reconstruction. For example, in Szeliski (1991), Cheung, Lee, and Li (1997), and Bichsel and Pentland (1992), investigations were performed to increase the rate of convergence and overcome the ill-posed problem (Horn, 1990).

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2751–2789 (2002)

In fact, the success of SFS depends on two major elements: research into a better reflectance model and the development of an effective SFS algorithm. The importance of developing a better reflectance model lies in whether the model is able to describe the surface shape accurately from the image intensity. Thus far, major breakthroughs in formulating new reflectance models have been rather inadequate. Despite its simplicity and immense popularity, the Lambertian model is unable to generalize to strong specular components. Healy and Binford (1988) employed the Torrance-Sparrow (1967) model as the specular model, which assumes that a surface is composed of small, randomly oriented, mirror-like facets. Another non-Lambertian model has been proposed by Lee and Kuo (1997) that depends not only on the gradient variable but also on the position and distance of the light source. There is no doubt that obtaining a completely generalized model is difficult, because the image intensity depends on many parameters, including the surface property, the light source direction, the specular direction, the viewing direction, and the distance between the surface point and the light source. More recently, a novel approach that uses a self-learning feedforward neural network (FNN) has been proposed to model the reflectance model, which avoids some of the Lambertian assumptions (Cho & Chow, 1999a). The breakthrough is essential because, under the developed FNN-based model, the illuminant source direction is no longer required. Promising results indicate that the accuracy of the reconstructed depth is satisfactory from a practical applications point of view. Apart from tackling the Lambertian assumptions, other analytical reflectance models have been developed for tackling the problematic specular components.
These include the Torrance-Sparrow model and the Beckmann-Spizzichino (1963) model, but their overall performance is significantly degraded when the illumination and viewer information are unknown. We thus developed a radial basis function (RBF) type of model to generalize the pure specular components so that the basic parameters of the specular terms are not required (Cho & Chow, 2000). In addition, Nayar, Ikeuchi, and Kanade (1990a) have proposed a reflectance model that consists of three components: the diffuse lobe, the specular lobe, and the specular spike. The Lambertian model was used to represent the diffuse lobe, while the Torrance-Sparrow model was used to model the other specular components. In order to obtain a more sophisticated generalized reflectance model, the effects of other parameters, such as the distribution of multiple light sources, perspective projection, and complex geometries, have to be taken into account. Recently, Cho and Chow (in press) developed a novel neural-based hybrid reflectance model. The idea of this method is to optimize a proper reflectance model by using a hybrid structure of FNN and RBF networks. As a result, the performance of the 3D shape reconstruction is significantly enhanced.
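For reference, the baseline Lambertian model that these non-Lambertian approaches extend predicts image irradiance proportional to the cosine of the angle between the surface normal $n$ and the light direction $s$, that is, $I = \rho \max(0, n \cdot s)$. A minimal sketch follows; the albedo, light direction, and spherical test surface are our illustrative choices.

```python
import numpy as np

def lambertian(normals, light, albedo=1.0):
    """Lambertian image irradiance: I = albedo * max(0, n . s)."""
    s = np.asarray(light, dtype=float)
    s = s / np.linalg.norm(s)
    return albedo * np.clip(normals @ s, 0.0, None)

# render the shading of a unit sphere under a distant light source
n_grid = 101
xs = np.linspace(-1.0, 1.0, n_grid)
X, Y = np.meshgrid(xs, xs)
mask = X**2 + Y**2 <= 1.0
Z = np.sqrt(np.clip(1.0 - X**2 - Y**2, 0.0, None))
normals = np.stack([X, Y, Z], axis=-1)          # unit normals on the sphere
img = lambertian(normals.reshape(-1, 3), light=[0.0, 0.0, 1.0]).reshape(n_grid, n_grid)
img[~mask] = 0.0
```

The brightest pixel is the one whose normal is aligned with the source; SFS inverts this map, which is what makes the problem ill posed, since many distinct normals share the same intensity.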
Most previous work on the SFS method has dealt mainly with gray-scale images. Instead of simply treating a color image as a gray-scale image, much research has aimed at developing color reflectance models to cope with color images. By using color images in the SFS method, the accuracy and noise robustness of the algorithms have been improved. Bui-Tuong (1975) developed a popular illumination model, the Phong model, for color surfaces, which assumes that maximum specular reflectance occurs when the angle between the light reflection and the viewpoint is zero and falls off sharply as the angle increases. Warnock (1969) used a cosine term to model the specular reflection using the Phong model. In general, the color reflected from a patch of surface depends on the illumination, the position of the viewer, and the orientation, together with the reflectivity of the surface, if the interreflections in a scene are negligible. These dependencies are described by a shading function that maps orientations to colors. The shading function that corresponds to the material of a given object, under given illumination conditions, can be determined in a number of ways. Ward (1992) described a method for measuring the reflection properties of a surface using a goniometer. The shading function can be computed using a suitable reflection model if the physical properties of the material are known. Klinker, Shafer, and Kanade (1987) proposed a color reflection model in which the color of every pixel of an object is described as a linear combination of the object color (diffuse component) and the highlight color (specular component). Their model and color histogram demonstrated that in real scenes, a transition area exists between the purely matte area and what is generally considered the highlight. This transition area determines the characteristic shapes of the color clusters, which is the information they used to detect and remove the highlights.
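The linear-combination (dichromatic) model of Klinker et al. described above can be illustrated numerically: if every pixel of a single-colored region is $m_d\,c_{\text{body}} + m_s\,c_{\text{highlight}}$, its RGB values lie on the two-dimensional plane spanned by the two color vectors, which is what gives the color clusters their characteristic shape. A small sketch with made-up colors and coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
c_body = np.array([0.7, 0.2, 0.1])       # diffuse (object) color
c_highlight = np.array([1.0, 1.0, 1.0])  # specular (illuminant) color

# dichromatic model: each pixel is a nonnegative combination of two colors
m_d = rng.uniform(0.0, 1.0, 500)
m_s = rng.uniform(0.0, 0.3, 500)
pixels = np.outer(m_d, c_body) + np.outer(m_s, c_highlight)

# the color cluster is (at most) two-dimensional: the third singular
# value of the pixel matrix vanishes
s = np.linalg.svd(pixels, compute_uv=False)
```

The vanishing third singular value is the planarity that Klinker et al. exploit; real multicolored scenes break this single-plane assumption, which motivates the segmentation-based approach proposed in this article.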
In a dichromatic reflection model, all parameters of the reflectance model (maximum specular and diffuse color and roughness) can be determined from a color histogram of the images (Klinker, Shafer, & Kanade, 1990; Novak & Shafer, 1992). But these existing colored reflectance models are purely monochromatic and are unable to deal with multicolored objects. Lee and Bajcsy (1992) introduced an algorithm for detecting specularities from Lambertian reflection using multiple color images taken from different viewing directions. The algorithm, which is based on a Lambertian consistency test called spectral differencing, is a pixelwise parallel algorithm that detects specularities by color differences between a number of images. These two algorithms mainly deal with the removal of the specular components from color images. In terms of color SFS developments, very limited research has been conducted. For instance, Christensen and Shapiro (1994) proposed a method of color photometric stereo, which uses several images taken from the same viewpoint but with different lighting to determine the 3D shape of an object. Tian and Tsui (1996) proposed an approach for non-Lambertian surfaces. It used an extended light model with the information of the hue channel to detect the specular and diffuse components. These two methods are applicable to multicolored
2754
Siu-Yeung Cho and Tommy W. S. Chow
objects, but the color reflection parameters, as well as the positions of the light sources, must be determined a priori. In practice, most objects are multicolored, and the color reflection parameters are not known and must be estimated. Clearly, it would be erroneous to treat a multicolored object as a gray-scale image. It is also difficult to separate the multicolored intensity into diffuse and specular components under different color levels. The multicolored intensity can correspond to an infinite number of orientations, which makes the job immensely difficult. Hitherto, most color images have been used only for single-colored objects composed of dielectric materials under white illumination. Clearly, there is a need to establish a new methodology to tackle the multicolored SFS problem. It is also worth pointing out that the specular reflection area and the diffuse reflection area generate different colors when the light source color differs from that of the object. These factors pose additional difficulties for the multicolored SFS problem. In modeling the color image, the specular and diffuse component areas consist of different color components. In the physical-based color reflectance model, the specular component due to the interface reflection does not change across the color regions, even though the diffuse component caused by body reflection does. In color space, the color regions project to color planes, and the color of the specular component is found as the intersection of these planes. The spectrum of the interreflected light is formed as a product of the illumination spectrum and the spectra of the two surface reflectance functions. In fact, the surface reflectance functions multiply together as many times as the light is interreflected, with the total mutual illumination contribution equaling the sum of an infinite number of interreflections.
Analysis of the color space in the physical-based color model is especially difficult for modeling multicolored objects. Nayar, Ikeuchi, and Kanade (1990b) assumed that in the extreme case, with a concave edge that is red on one side and blue on the other, the spectral reflectance function of the red side will be zero at the blue end of the spectrum and vice versa. The product of the red and blue spectra will then be zero in the physical-based model. As a result, while interreflection is to be expected based on the edge geometry, the reconstruction algorithm for Nayar's physical-based model will overestimate the interreflection and lead to an incorrect calculation of the surface shape. Thus, it is necessary to segment the color image into different color components for determining the orientation of a multicolored surface. Based on color segmentation, the advantages of using color information at each pixel are twofold: more constraints on the orientation can be formed, and the result is less sensitive to noise. As a result, the shape of multicolored real images can be determined more accurately. The proposed method has three steps. First, the color information at each pixel is separated into three primary color components (the red, green, and blue bands). Second, each color band forms the corresponding shading information, and the corresponding surfaces are subsequently reconstructed by using the proposed
A New Color 3D SFS Methodology
2755
neural-based color reflectance models. Third, the final surface orientation is computed from these three reconstructed surfaces by using our proposed iterative recursive (IR) method. In this article, a new color SFS methodology is proposed. This article focuses on two main parts. First, a generalized hybrid neural-type color reflectance model is derived. This part of the article extends a neural-type reflectance model (Cho & Chow, in press) to color reflectance models. Using this proposed approach, all components of color diffuse reflection and specular reflection are generalized by the proposed neural-type color reflectance model. Under this novel approach, the illuminant information is no longer required, and the evaluation of the surface depth is performed iteratively. Second, a new IR method for reconstructing multicolored 3D surfaces is derived. This method is based on the idea of recursive inverse filtering algorithms (Kundur & Hatzinakos, 1998; Chow, Li, & Cho, 2000) for blind image restoration to determine the final surface orientation. Using these proposed methodologies, our results indicate that the shape of a multicolored object can be accurately determined, and the overall results are noise robust.

2 Background of Reflectance Models

2.1 Basic Reflectance Models. The Lambertian model is a useful reflectance model commonly adopted in the area of computer vision as an ideal reflectance model. Equation 2.1 shows the conventional Lambertian reflectance model representing a surface illuminated by a single point light source,
\[ R_{\mathrm{diffuse}}\big(p(x,y), q(x,y)\big) = \eta\, \mathbf{n} \cdot \mathbf{s}^{T}, \qquad \forall (x,y) \in \Omega, \tag{2.1} \]
where R_diffuse is the diffuse intensity, \eta is the composite albedo, \mathbf{s} = (\cos\tau \sin\sigma,\ \sin\tau \sin\sigma,\ \cos\sigma) is the illuminant source direction, and \tau and \sigma denote the tilt and slant angles, respectively. The surface normal can be represented as

\[ \mathbf{n} = \left( \frac{-p}{\sqrt{p^2+q^2+1}},\; \frac{-q}{\sqrt{p^2+q^2+1}},\; \frac{1}{\sqrt{p^2+q^2+1}} \right), \]

where p = \partial z / \partial x and q = \partial z / \partial y are the surface gradients and z is the height field of the surface, with x and y forming a 2D pseudoplane over the domain \Omega of the image plane. The specular component occurs when the incident angle of the light source is equal to the reflected angle. This component is formed by two terms: the specular spike and the lobe. Healey and Binford (1988) derived a specular model by simplifying the Torrance-Sparrow (1967) model, in which a gaussian distribution is used to model the facet orientation function. A more sophisticated model based on the geometrical optics approach was
also presented as the specular reflectance model (Nayar, Ikeuchi, & Kanade, 1991), such that

\[ R_{\mathrm{specular}} = k_{\mathrm{spec}}\, \frac{L_i\, d\omega_i}{\cos\theta_r}\, \exp\!\left( -\frac{\alpha^2}{2\beta^2} \right), \tag{2.2} \]
where k_spec represents the fraction of incident energy determined by the Fresnel coefficients and the geometrical attenuation factor, and \cos\theta_r describes the emitting angle at which the radiance of the surface in the viewing direction is determined. Because most object surfaces in the real world are neither purely Lambertian nor purely specular, they are hybrid surfaces that include a linear combination of diffuse and specular components. Nayar et al. (1990a) formed a hybrid model to tackle the problem, in which the model consists of three components: the diffuse lobe, the specular lobe, and the specular spike. In contrast, this article uses a straightforward representation of the hybrid surface in which the total intensity is the summation of the specular intensity and the Lambertian (diffuse) intensity, as follows:

\[ R_{\mathrm{hybrid}} = (1 - \omega)\, R_{\mathrm{diffuse}} + \omega\, R_{\mathrm{specular}}, \tag{2.3} \]
where R_hybrid is the total intensity of the hybrid surface, R_diffuse and R_specular are the diffuse intensity and the specular intensity, respectively, and \omega is the weight of the specular component.

2.2 Color Reflectance Model for Determining Surface Orientations.
A dichromatic reflection model (DRM) is a color reflectance model that describes hybrid reflection properties without specifying the diffuse and the specular reflection components. This type of model can be used for inhomogeneous dielectric materials. The surface structure of dielectric materials can be modeled as being composed of an interface and an optically neutral medium containing color pigments. Mathematically, the DRM can be formulated as

\[ R_{\mathrm{dichromatic}} = R_s(\lambda, \mathbf{n}, \mathbf{s}, \mathbf{v}) + R_b(\lambda, \mathbf{n}, \mathbf{s}, \mathbf{v}) = m_s(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_s(\lambda) + m_b(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_b(\lambda). \tag{2.4} \]
The modeled scene radiance R_dichromatic is a quantity that depends on the wavelength \lambda. R_dichromatic is an additive composition of the radiance of the interface reflection R_s and the body reflection R_b. The assumption of the DRM is that both reflection components are factorized into geometrical components (m_s and m_b) and spectral components (c_s and c_b). The factor c_s is the interface reflection color, and c_b is the body reflection color. With
the additional assumption of a neutral reflecting interface, c_s describes the spectral color of the light source; the geometrical component m_s can be made explicit by the Torrance-Sparrow model, the Phong model, or other types of model, whereas the scaling factor m_b can often be modeled by the Lambertian model. The scene radiance can be represented by the 3D color vector \mathbf{R}_{\mathrm{dichromatic}} shown in equation 2.5, given that the description of the scene radiance R_dichromatic is restricted to three narrow wavelength bands in the red (R), green (G), and blue (B) spectral ranges of visible light:

\[ \mathbf{R} = \begin{pmatrix} R(\mathrm{Red}, \mathbf{n}, \mathbf{s}, \mathbf{v}) \\ R(\mathrm{Green}, \mathbf{n}, \mathbf{s}, \mathbf{v}) \\ R(\mathrm{Blue}, \mathbf{n}, \mathbf{s}, \mathbf{v}) \end{pmatrix} = \mathbf{R}_s + \mathbf{R}_b = m_s(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot \begin{pmatrix} c_{s,\mathrm{Red}} \\ c_{s,\mathrm{Green}} \\ c_{s,\mathrm{Blue}} \end{pmatrix} + m_b(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot \begin{pmatrix} c_{b,\mathrm{Red}} \\ c_{b,\mathrm{Green}} \\ c_{b,\mathrm{Blue}} \end{pmatrix}. \tag{2.5} \]
Here m_s and m_b represent arbitrary scaling factors, and the vectors \mathbf{c}_s and \mathbf{c}_b span a two-dimensional subspace of the RGB color space (the dichromatic plane). It can be observed that the dichromatic model must be derived from the following three single-color reflectance component models:

\[ R_{\mathrm{Red}} = m_s(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_{s,\mathrm{Red}} + m_b(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_{b,\mathrm{Red}}, \tag{2.6a} \]
\[ R_{\mathrm{Green}} = m_s(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_{s,\mathrm{Green}} + m_b(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_{b,\mathrm{Green}}, \tag{2.6b} \]
\[ R_{\mathrm{Blue}} = m_s(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_{s,\mathrm{Blue}} + m_b(\mathbf{n}, \mathbf{s}, \mathbf{v}) \cdot c_{b,\mathrm{Blue}}. \tag{2.6c} \]
3 Neural-Based Color Reflectance Models
In the conventional SFS approach, critical parameters—the light source, the viewing direction, and some other assumptions—are required a priori. In this article, a simple neural network self-learning scheme, based on the relationship between the surface orientation and the intensity, is exploited to model the unknown parameters for generalizing the reflectance model. The use of a feedforward neural network (FNN) and a radial basis function (RBF) network can provide approximations of the Lambertian model and the specular model, respectively, in view of the universal approximation capability of neural networks (Hornik, Stinchcombe, & White, 1989; Park & Sandberg, 1991). There are clear advantages to establishing a hybrid-type neural reflectance model (Cho & Chow, in press),
which combines the FNN and the RBF networks. The developed model is expressed as

\[ R_{\mathrm{hybrid}} = (1 - \omega) F_{\mathrm{FNN}}(\mathbf{a}) + \omega F_{\mathrm{RBF}}(\mathbf{a}) = (1 - \omega)\, \Theta_a\!\left( v_0 + \sum_{k=1}^{N} v_k \Theta_a(\mathbf{w}_k \mathbf{a}_{i,j}) \right) + \omega \left( v_0 + \sum_{k=1}^{N} v_k \Theta_b(\| \mathbf{a}_{i,j} - \mathbf{c}_k \|) \right), \tag{3.1} \]
where \omega is the nonzero weighting factor of the specular components, used to establish the quantity of the specular component in the hybrid model. This approach enables us to generalize either a purely Lambertian surface (the diffuse component only) or a non-Lambertian surface (a hybrid of diffuse and specular components). In deriving the proposed color reflectance model, some assumptions about the scene and the object should be made. First, all parts of the surface under consideration have the same physical properties, such as color, reflection, and roughness. Second, the position, color, and intensity of the light source should be known. Third, the object does not receive light reflected from other objects or other parts of the same object. Based on these assumptions, there is a corresponding color image reflectance function R that maps surface orientations to colors for each reflection model, surface property, and illumination:

\[ \mathbf{c} = (r, g, b) = R_{\mathrm{color}}(\mathbf{n}). \tag{3.2} \]
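A minimal sketch of the forward pass of equation 3.1 follows. The choice of a sigmoid for \Theta_a and a gaussian for \Theta_b is our assumption (the article does not fix the activations at this point), and the toy parameter values are illustrative only.

```python
import numpy as np

def theta_a(x):
    # Assumed sigmoid activation for the FNN part.
    return 1.0 / (1.0 + np.exp(-x))

def theta_b(r, width=1.0):
    # Assumed gaussian radial basis for the RBF part.
    return np.exp(-(r / width) ** 2)

def hybrid_neural_reflectance(a, v0, v, w, c, omega=0.2):
    # Eq. 3.1: (1 - omega) * Theta_a(v0 + sum_k v_k Theta_a(w_k . a))
    #          + omega * (v0 + sum_k v_k Theta_b(||a - c_k||))
    fnn = theta_a(v0 + np.sum(v * theta_a(w @ a)))
    rbf = v0 + np.sum(v * theta_b(np.linalg.norm(a - c, axis=1)))
    return (1.0 - omega) * fnn + omega * rbf

# Toy instance with N = 3 hidden units and a 2D input a = (p, q).
rng = np.random.default_rng(0)
a = np.array([0.1, -0.2])            # surface gradients at one pixel
v0, v = 0.1, rng.normal(size=3)      # output weights
w = rng.normal(size=(3, 2))          # FNN hidden weights
c = rng.normal(size=(3, 2))          # RBF centers
intensity = hybrid_neural_reflectance(a, v0, v, w, c)
```

With \omega = 0 the model reduces to the pure FNN (diffuse) branch, whose sigmoid output stays in (0, 1).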
Each visible part of a surface has a unit normal \mathbf{n}. Each normal is mapped to a single color \mathbf{c}, and conversely, many different normals can be mapped to the same color. The range of R is a part of the RGB color space, which is represented as a cube with red, green, and blue axes. R^{-1} is a set-valued function that maps a color to the set of normals that correspond to that color:

\[ R_{\mathrm{color}}^{-1}(\mathbf{c}) \equiv \{ \mathbf{n} \mid R_{\mathrm{color}}(\mathbf{n}) = \mathbf{c} \}. \tag{3.3} \]
Using the above representations, there are several types of color image reflectance function for different materials. The reflectance function for matte surfaces can be described by the Lambertian reflectance model. The reflections from composite materials such as paint and plastic can be divided into a diffuse and a specular component. The diffuse component has the color of the object, whereas the specular component has the color of the light source. For dielectrics and colored metals, the hue of the reflected light depends on the angles, because the specular reflections of RGB light exhibit different
characteristics. A physical-based model that incorporates this hue change is described by the Torrance-Sparrow model. As a result, the color image reflectance function can be defined similarly to the Nayar et al. (1990a) hybrid model:

\[ \mathbf{R}_{\mathrm{color}}(\mathbf{n}) = \begin{pmatrix} R_{\mathrm{hybrid}}(\mathrm{Red}, \mathbf{n}) \\ R_{\mathrm{hybrid}}(\mathrm{Green}, \mathbf{n}) \\ R_{\mathrm{hybrid}}(\mathrm{Blue}, \mathbf{n}) \end{pmatrix}. \tag{3.4} \]
In accordance with these characteristics, it is essential to incorporate all these different models into a universal modeling method. Using the universal approximation capability of the hybrid neural-based model shown in equation 3.1, the color image reflectance function can be modeled by the proposed neural-based model in a hybrid structure as

\[ \mathbf{R}_{\mathrm{color}} = (1 - \omega) \begin{pmatrix} F_{\mathrm{FNN}}(\mathrm{Red}, \mathbf{a}) \\ F_{\mathrm{FNN}}(\mathrm{Green}, \mathbf{a}) \\ F_{\mathrm{FNN}}(\mathrm{Blue}, \mathbf{a}) \end{pmatrix} + \omega \begin{pmatrix} F_{\mathrm{RBF}}(\mathrm{Red}, \mathbf{a}) \\ F_{\mathrm{RBF}}(\mathrm{Green}, \mathbf{a}) \\ F_{\mathrm{RBF}}(\mathrm{Blue}, \mathbf{a}) \end{pmatrix}. \tag{3.5} \]
Similar to the dichromatic model shown in equation 2.5, the neural-based color reflectance model is derived from the following three color reflectance component models:

\[ R_{\mathrm{Red}} = (1 - \omega)\, \Theta_a\!\left( v_{0,\mathrm{Red}} + \sum_{k=1}^{N} v_{k,\mathrm{Red}} \Theta_a(\mathbf{w}_{k,\mathrm{Red}} \mathbf{a}_{i,j}) \right) + \omega \left( v_{0,\mathrm{Red}} + \sum_{k=1}^{N} v_{k,\mathrm{Red}} \Theta_b(\| \mathbf{a} - \mathbf{c}_{k,\mathrm{Red}} \|) \right), \tag{3.6a} \]
\[ R_{\mathrm{Green}} = (1 - \omega)\, \Theta_a\!\left( v_{0,\mathrm{Green}} + \sum_{k=1}^{N} v_{k,\mathrm{Green}} \Theta_a(\mathbf{w}_{k,\mathrm{Green}} \mathbf{a}_{i,j}) \right) + \omega \left( v_{0,\mathrm{Green}} + \sum_{k=1}^{N} v_{k,\mathrm{Green}} \Theta_b(\| \mathbf{a} - \mathbf{c}_{k,\mathrm{Green}} \|) \right), \tag{3.6b} \]
\[ R_{\mathrm{Blue}} = (1 - \omega)\, \Theta_a\!\left( v_{0,\mathrm{Blue}} + \sum_{k=1}^{N} v_{k,\mathrm{Blue}} \Theta_a(\mathbf{w}_{k,\mathrm{Blue}} \mathbf{a}_{i,j}) \right) + \omega \left( v_{0,\mathrm{Blue}} + \sum_{k=1}^{N} v_{k,\mathrm{Blue}} \Theta_b(\| \mathbf{a} - \mathbf{c}_{k,\mathrm{Blue}} \|) \right). \tag{3.6c} \]
Using the above three basic color reflectance models, three corresponding surface orientations can be reconstructed from the RGB color images by the iterative SFS learning algorithm. After these three surfaces are reconstructed, the final resulting surface can be computed. The details of the computations are described in the next section.

4 3D Shape Recovery Computation by Neural-Based Color Models
In solving the color SFS problem with the proposed neural-based color reflectance model in equation 3.5, the cost function is commonly defined as

\[ E_T = \iint_{\Omega} \| \mathbf{I}_{\mathrm{RGB}} - \mathbf{R}_{\mathrm{color}}(\mathbf{n}) \|^2 + \lambda \left( (\partial p / \partial x)^2 + (\partial p / \partial y)^2 + (\partial q / \partial x)^2 + (\partial q / \partial y)^2 \right) dx\, dy, \tag{4.1} \]
where \mathbf{I}_{\mathrm{RGB}} = (I_{\mathrm{Red}}, I_{\mathrm{Green}}, I_{\mathrm{Blue}}) represents the RGB image intensity and \mathbf{R}_{\mathrm{color}} represents the neural-based color reflectance model, which can be derived from the three color reflectance component models of equations 3.6a through 3.6c. Thus, three cost functions can be defined correspondingly:

\[ E_{\mathrm{Red}} = \iint_{\Omega} (I_{\mathrm{Red}} - R_{\mathrm{Red}})^2 + \lambda \left( (\partial p/\partial x)^2 + (\partial p/\partial y)^2 + (\partial q/\partial x)^2 + (\partial q/\partial y)^2 \right) dx\, dy, \tag{4.2a} \]
\[ E_{\mathrm{Green}} = \iint_{\Omega} (I_{\mathrm{Green}} - R_{\mathrm{Green}})^2 + \lambda \left( (\partial p/\partial x)^2 + (\partial p/\partial y)^2 + (\partial q/\partial x)^2 + (\partial q/\partial y)^2 \right) dx\, dy, \tag{4.2b} \]
\[ E_{\mathrm{Blue}} = \iint_{\Omega} (I_{\mathrm{Blue}} - R_{\mathrm{Blue}})^2 + \lambda \left( (\partial p/\partial x)^2 + (\partial p/\partial y)^2 + (\partial q/\partial x)^2 + (\partial q/\partial y)^2 \right) dx\, dy. \tag{4.2c} \]
In each case, the first term is the intensity error term, and the second is a smoothness constraint given by the spatial derivatives of p and q; \lambda is a positive scalar smoothness parameter. The free parameters of the three neural-based color reflectance models (R_Red, R_Green, R_Blue) and the object surface gradients (p, q) are determined by performing a unified learning mechanism. Figure 1 shows the diagram of the learning framework. Throughout the learning
Figure 1: Learning framework diagram for solving the color SFS algorithm with three neural-based color reflectance models.
scheme, the neural parameters and the surface gradients are computed iteratively, with the neural parameters optimized by a specific learning algorithm. The surface depth is estimated by a variational-based approach on a discrete grid of points. Three corresponding surface orientations can be reconstructed by using the iterative learning SFS algorithm. After these three surfaces are reconstructed, the 3D shape of the surface is computed from the three orientations. In our study, an IR method is used for this computation.

4.1 Variational Approach for Solving the Color Shape from Shading Problem. In the variational approach, proposed by Horn and Brooks (1986) and Zhang, Tsai, Cryer, and Shah (1999), we need to solve a first-order nonlinear partial differential equation derived in the form of Euler-Lagrange equations. The variational method is used to calculate the solutions of the surface gradients (p, q) and the corresponding object surface depths (z_Red, z_Green, z_Blue) on the discrete grid by the corresponding color
reflectance models. Using the calculus of variations to minimize the cost functions in equations 4.2a through 4.2c, a set of discrete recursive equations is formed. For instance, the surface gradient update equations of the R component are

\[ p^{\mathrm{Red}}_{i,j}(n+1) = \bar{p}^{\mathrm{Red}}_{i,j}(n) + \frac{\varepsilon^2}{4\lambda} \left( I^{\mathrm{Red}}_{i,j} - R^{\mathrm{Red}} \right) \frac{\partial R^{\mathrm{Red}}}{\partial p^{\mathrm{Red}}}, \tag{4.3} \]
\[ q^{\mathrm{Red}}_{i,j}(n+1) = \bar{q}^{\mathrm{Red}}_{i,j}(n) + \frac{\varepsilon^2}{4\lambda} \left( I^{\mathrm{Red}}_{i,j} - R^{\mathrm{Red}} \right) \frac{\partial R^{\mathrm{Red}}}{\partial q^{\mathrm{Red}}}, \tag{4.4} \]
where \varepsilon denotes the spacing between picture cells and \lambda acts as a smoothness parameter for the regularization. \bar{p}^{\mathrm{Red}}_{i,j} and \bar{q}^{\mathrm{Red}}_{i,j} are the averages of p^{\mathrm{Red}}_{i,j} and q^{\mathrm{Red}}_{i,j}, respectively, over the four nearest neighbors. They are given by

\[ \bar{p}^{\mathrm{Red}}_{i,j} = \left( p^{\mathrm{Red}}_{i+1,j} + p^{\mathrm{Red}}_{i-1,j} + p^{\mathrm{Red}}_{i,j+1} + p^{\mathrm{Red}}_{i,j-1} \right) / 4 \]
and
\[ \bar{q}^{\mathrm{Red}}_{i,j} = \left( q^{\mathrm{Red}}_{i+1,j} + q^{\mathrm{Red}}_{i-1,j} + q^{\mathrm{Red}}_{i,j+1} + q^{\mathrm{Red}}_{i,j-1} \right) / 4. \]

\partial R^{\mathrm{Red}} / \partial p^{\mathrm{Red}} and \partial R^{\mathrm{Red}} / \partial q^{\mathrm{Red}} are the spatial derivatives of the color reflectance model with respect to the surface gradients. They can be expressed as

\[ \frac{\partial R^{\mathrm{Red}}}{\partial p^{\mathrm{Red}}} = (1 - \omega) \frac{\partial F_{\mathrm{FNN}}(\mathrm{Red}, \mathbf{a})}{\partial p^{\mathrm{Red}}} + \omega \frac{\partial F_{\mathrm{RBF}}(\mathrm{Red}, \mathbf{a})}{\partial p^{\mathrm{Red}}} \]
and
\[ \frac{\partial R^{\mathrm{Red}}}{\partial q^{\mathrm{Red}}} = (1 - \omega) \frac{\partial F_{\mathrm{FNN}}(\mathrm{Red}, \mathbf{a})}{\partial q^{\mathrm{Red}}} + \omega \frac{\partial F_{\mathrm{RBF}}(\mathrm{Red}, \mathbf{a})}{\partial q^{\mathrm{Red}}}. \]
Similarly, the corresponding surface gradients of the G and B components are updated in the form of equations 4.3 and 4.4. After the updated surface gradients at iteration n + 1 are obtained, the surface depth is estimated iteratively from them. In determining the best-fit surface z to the given components of the gradients, the estimated surface depth for the R component is

\[ z^{\mathrm{Red}}_{i,j}(n+1) = \bar{z}^{\mathrm{Red}}_{i,j}(n) - \frac{\varepsilon}{4} \left( f^{\mathrm{Red}}_{i,j} + g^{\mathrm{Red}}_{i,j} \right), \tag{4.5} \]

where \bar{z}^{\mathrm{Red}}_{i,j} is an average of z of the same form as \bar{p}^{\mathrm{Red}}_{i,j}. f^{\mathrm{Red}}_{i,j} and g^{\mathrm{Red}}_{i,j} are estimates of the partial derivatives of p and q, respectively, such that f^{\mathrm{Red}}_{i,j} = (p^{\mathrm{Red}}_{i+1,j} - p^{\mathrm{Red}}_{i-1,j}) / 2 and g^{\mathrm{Red}}_{i,j} = (q^{\mathrm{Red}}_{i,j+1} - q^{\mathrm{Red}}_{i,j-1}) / 2. Note that the subscript i in the discrete case corresponds to x in the continuous case and the subscript j corresponds to y. Similarly, the corresponding surface gradients of the G
and B components are updated in the form of equations 4.3 and 4.4. Thus, the estimated surface depths for the G and B components are

\[ z^{\mathrm{Green}}_{i,j}(n+1) = \bar{z}^{\mathrm{Green}}_{i,j}(n) - \frac{\varepsilon}{4} \left( f^{\mathrm{Green}}_{i,j} + g^{\mathrm{Green}}_{i,j} \right), \tag{4.6} \]
\[ z^{\mathrm{Blue}}_{i,j}(n+1) = \bar{z}^{\mathrm{Blue}}_{i,j}(n) - \frac{\varepsilon}{4} \left( f^{\mathrm{Blue}}_{i,j} + g^{\mathrm{Blue}}_{i,j} \right). \tag{4.7} \]
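The per-channel update loop of equations 4.3 through 4.5 can be sketched as follows. This is a simplified numpy sketch: the borders are edge-replicated, the central-difference conventions for f and g follow the definitions above, and the \varepsilon, \lambda values are illustrative, whereas the article handles boundary conditions separately.

```python
import numpy as np

def neighbor_average(u):
    # Four-nearest-neighbor average used for p-bar, q-bar, and z-bar
    # (borders replicated for simplicity).
    v = np.pad(u, 1, mode='edge')
    return (v[2:, 1:-1] + v[:-2, 1:-1] + v[1:-1, 2:] + v[1:-1, :-2]) / 4.0

def update_gradients(p, q, I, R, dR_dp, dR_dq, eps=1.0, lam=100.0):
    # Eqs. 4.3-4.4: p(n+1) = p-bar(n) + (eps^2 / 4 lam) (I - R) dR/dp, same for q.
    scale = eps ** 2 / (4.0 * lam)
    p_new = neighbor_average(p) + scale * (I - R) * dR_dp
    q_new = neighbor_average(q) + scale * (I - R) * dR_dq
    return p_new, q_new

def update_depth(z, p, q, eps=1.0):
    # Eq. 4.5: z(n+1) = z-bar(n) - (eps / 4) (f + g), with central differences
    # f = dp/dx (axis 0, index i) and g = dq/dy (axis 1, index j).
    f = (np.roll(p, -1, axis=0) - np.roll(p, 1, axis=0)) / 2.0
    g = (np.roll(q, -1, axis=1) - np.roll(q, 1, axis=1)) / 2.0
    return neighbor_average(z) - (eps / 4.0) * (f + g)
```

In the full scheme these updates are interleaved with the network parameter updates of section 4.2, one pass per iteration n.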
4.2 Learning Algorithm for the Neural-Based Color Reflectance Model.
Instead of determining the physical reflectivity parameters of the color reflectance model, the free parameters and weights of the proposed neural-based color reflectance model are optimized to generalize the proper reflectance model. In discrete form, the cost function of equation 4.1 is written as

\[ E_T = \sum_{i,j \in D} \| \mathbf{I}_{\mathrm{RGB},i,j} - \mathbf{R}_{\mathrm{color}} \|^2 + \lambda \sum_{i,j \in D} \left( (\nabla_x p_{i,j})^2 + (\nabla_y p_{i,j})^2 + (\nabla_x q_{i,j})^2 + (\nabla_y q_{i,j})^2 \right), \tag{4.8} \]

where D is the index set of the image. \{\nabla_x p_{i,j}, \nabla_y p_{i,j}, \nabla_x q_{i,j}, \nabla_y q_{i,j}\} are discrete estimates of the partial derivatives of p and q such that

\[ \nabla_x p_{i,j} = (p_{i+1,j} - p_{i-1,j}) / 2, \qquad \nabla_y p_{i,j} = (p_{i,j+1} - p_{i,j-1}) / 2, \]
\[ \nabla_x q_{i,j} = (q_{i+1,j} - q_{i-1,j}) / 2, \qquad \nabla_y q_{i,j} = (q_{i,j+1} - q_{i,j-1}) / 2. \]
From the cost function, equation 4.8, the discrete form of the smoothness constraint is independent of the neural-based model, so the minimization of the cost function with respect to the network parameters is performed only on the error term. The cost function, equation 4.8, for each color component then takes the form

\[ E_{\mathrm{Red}} = \sum_{i,j \in D} \left( I^{\mathrm{Red}}_{i,j} - (1 - \omega) F_{\mathrm{FNN}}(\mathrm{Red}, \mathbf{a}_{i,j}) \right)^2 + \sum_{i,j \in D} \left( I^{\mathrm{Red}}_{i,j} - \omega F_{\mathrm{RBF}}(\mathrm{Red}, \mathbf{a}_{i,j}) \right)^2, \tag{4.9a} \]
\[ E_{\mathrm{Green}} = \sum_{i,j \in D} \left( I^{\mathrm{Green}}_{i,j} - (1 - \omega) F_{\mathrm{FNN}}(\mathrm{Green}, \mathbf{a}_{i,j}) \right)^2 + \sum_{i,j \in D} \left( I^{\mathrm{Green}}_{i,j} - \omega F_{\mathrm{RBF}}(\mathrm{Green}, \mathbf{a}_{i,j}) \right)^2, \tag{4.9b} \]
\[ E_{\mathrm{Blue}} = \sum_{i,j \in D} \left( I^{\mathrm{Blue}}_{i,j} - (1 - \omega) F_{\mathrm{FNN}}(\mathrm{Blue}, \mathbf{a}_{i,j}) \right)^2 + \sum_{i,j \in D} \left( I^{\mathrm{Blue}}_{i,j} - \omega F_{\mathrm{RBF}}(\mathrm{Blue}, \mathbf{a}_{i,j}) \right)^2. \tag{4.9c} \]
Conventionally, the FNN and RBF networks are optimized using a gradient-descent type of supervised learning process. But it is well known that this method suffers from the local minima problem and a slow convergence speed. In this article, a modified heuristic learning scheme is suggested for computing the free parameters of the hybrid neural network. This learning scheme is modified from a method set out in Cho and Chow (1999b). It is a hybrid-type algorithm combining the least-squares method and a penalty-based optimization approach, and it is able to speed up the convergence rate and avoid getting stuck in local minima. Based on the derivation of the proposed algorithm for, say, the R component, the weight update equation for the output layer of the FNN is

\[ \mathbf{v}^{\mathrm{Red}}(n+1) = \frac{1}{1-\omega} \left( \sum_{i,j \in D} \mathbf{W}_{i,j} \mathbf{W}_{i,j}^T \right)^{-1} \left( \sum_{i,j \in D} \Theta_a^{-1}(I^{\mathrm{Red}}_{i,j})\, \mathbf{W}_{i,j} \right), \tag{4.10} \]
and for the hidden layer of the FNN,

\[ \mathbf{w}^{\mathrm{Red}}_k(n+1) = \mathbf{w}^{\mathrm{Red}}_k(n) + \eta_w \left( -\frac{\partial E_{\mathrm{Red}}}{\partial \mathbf{w}^{\mathrm{Red}}_k} + \rho E_{\mathrm{pen}}(\mathbf{w}^{\mathrm{Red}}_k) \right), \qquad 1 \le k \le N, \tag{4.11} \]

where \Theta_a^{-1}(\cdot) is the inverse of the activation function and the vectors

\[ \mathbf{W}_{i,j} = \left( 1, \Theta_a(\mathbf{w}^{\mathrm{Red}}_1 \mathbf{a}_{i,j}), \ldots, \Theta_a(\mathbf{w}^{\mathrm{Red}}_N \mathbf{a}_{i,j}) \right)^T \]

and \mathbf{v}^{\mathrm{Red}} = (v^{\mathrm{Red}}_0, v^{\mathrm{Red}}_1, \ldots, v^{\mathrm{Red}}_N)^T are denoted. \eta_w is the learning rate, and \partial E_{\mathrm{Red}} / \partial \mathbf{w}^{\mathrm{Red}}_k is the error gradient with respect to \mathbf{w}^{\mathrm{Red}}_k. The proposed learning algorithm can also be modified to compute the free parameters of the RBF model. The update equations for the RBF model are as follows. For the linear parameters of the RBF,
\[ \mathbf{v}^{\mathrm{Red}}(n+1) = \frac{1}{\omega} \left( \sum_{i,j \in D} \mathbf{W}_{i,j} \mathbf{W}_{i,j}^T \right)^{-1} \left( \sum_{i,j \in D} I^{\mathrm{Red}}_{i,j}\, \mathbf{W}_{i,j} \right), \tag{4.12} \]
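Both least-squares updates, equations 4.10 and 4.12, share the same normal-equation form; a minimal sketch follows (the function name is ours, and the targets are assumed to be already inverse-activated in the FNN case):

```python
import numpy as np

def ls_output_weights(W, t, scale):
    # Eqs. 4.10/4.12: v = (1/scale) (sum W W^T)^{-1} (sum t W),
    # with W stacked row-wise as the design matrix and t the targets
    # (Theta_a^{-1}(I) for the FNN output layer, I itself for the RBF layer).
    A = W.T @ W          # sum over pixels of W_{i,j} W_{i,j}^T
    b = W.T @ t          # sum over pixels of t_{i,j} W_{i,j}
    return np.linalg.solve(A, b) / scale

# Sanity check: if the targets were generated as scale * (W @ v_true),
# the update recovers v_true exactly.
rng = np.random.default_rng(1)
W = np.column_stack([np.ones(50), rng.normal(size=(50, 4))])  # bias + N = 4 units
v_true = rng.normal(size=5)
omega = 0.2
v_est = ls_output_weights(W, (1 - omega) * (W @ v_true), scale=1 - omega)
```

Solving the normal equations in one shot is what gives the scheme its speed advantage over pure gradient descent on the linear weights.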
for the RBF centers,

\[ \mathbf{c}^{\mathrm{Red}}_k(n+1) = \mathbf{c}^{\mathrm{Red}}_k(n) + \eta_c \left( -\frac{\partial E_{\mathrm{Red}}}{\partial \mathbf{c}^{\mathrm{Red}}_k} + \rho E_{\mathrm{pen}}(\mathbf{c}^{\mathrm{Red}}_k) \right), \tag{4.13} \]

and for the RBF widths,

\[ \sigma^{\mathrm{Red}}_k(n+1) = \sigma^{\mathrm{Red}}_k(n) + \eta_\sigma \left( -\frac{\partial E_{\mathrm{Red}}}{\partial \sigma^{\mathrm{Red}}_k} + \rho E_{\mathrm{pen}}(\sigma^{\mathrm{Red}}_k) \right), \tag{4.14} \]
where the vectors \mathbf{v}^{\mathrm{Red}} = (v^{\mathrm{Red}}_0, v^{\mathrm{Red}}_1, \ldots, v^{\mathrm{Red}}_N)^T and

\[ \mathbf{W}_{i,j} = \left( 1, \Theta_b(\| \mathbf{a}_{i,j} - \mathbf{c}^{\mathrm{Red}}_1 \|), \ldots, \Theta_b(\| \mathbf{a}_{i,j} - \mathbf{c}^{\mathrm{Red}}_N \|) \right)^T \]

are denoted. \eta_c and \eta_\sigma are the learning rates for the RBF centers and widths, respectively. In the above formulation, \rho E_{\mathrm{pen}}(\cdot) is defined as a penalty term for the penalty optimization approach. The penalty term, which is designed to provide an uphill force so that the learning process does not get stuck in local minima, is assigned as a gaussian-like function,

\[ \rho E_{\mathrm{pen}}(\mathbf{x}) = -\frac{2}{\mu} \left| \mathbf{x}(n-1) - \mathbf{x}^* \right| \exp\!\left( -\frac{(\mathbf{x}(n-1) - \mathbf{x}^*)^T (\mathbf{x}(n-1) - \mathbf{x}^*)}{\mu^2} \right), \tag{4.15} \]
where \mathbf{x}^* is the suboptimal parameter assumed to be stuck in a local minimum, and \mu is a penalty weighting factor. Note that the selection and adjustment of the penalty weighting factor are crucial to the overall performance of the algorithm, with the corresponding conditions based on the convergence characteristics. Detailed descriptions of these conditions and the learning mechanism are presented in Cho and Chow (1999b). The learning mechanism of the proposed approach is illustrated in Figure 2. Similarly, the other two components (G and B) of the color reflectance model can be modeled by the above learning scheme. After all these computations, three surface orientations are reconstructed, corresponding to the three color components. The final surface orientation is then calculated from these three by means of the IR method, detailed in the next section.
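The penalty of equation 4.15 can be sketched as follows (the function name is ours; the term is scalar-valued, vanishing both at x = x* and far from it):

```python
import numpy as np

def penalty_term(x, x_star, mu=1.0):
    # Eq. 4.15: rho E_pen(x) = -(2/mu) |x - x*| exp(-(x - x*)^T (x - x*) / mu^2).
    d = np.atleast_1d(np.asarray(x, float) - np.asarray(x_star, float))
    dist = np.linalg.norm(d)
    return -(2.0 / mu) * dist * np.exp(-(d @ d) / mu ** 2)
```

The force is zero at the suboptimal point itself, peaks in magnitude at distance \mu / \sqrt{2}, and decays to zero far away, so it perturbs only parameters hovering near a detected local minimum.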
Figure 2: An iterative SFS algorithm with the neural-based color reflectance models.
4.3 Final Surface Orientation Computation by Using the IR Method.
The major difficulty of computing a proper surface orientation from three different surfaces lies in the insufficient information on the true surface: there may be many or even infinitely many solutions. Clearly, a priori information or constraints will help to reduce the number of possible solutions and to regularize the poorly conditioned cost function. This method uses the concept of blind image restoration (BIR) for determining the final surface orientation. The reconstruction process is based on the three surfaces reconstructed from the three color components. The BIR technique relies on the nonnegativity and support constraints of the recursive inverse filtering algorithm (Kundur & Hatzinakos, 1998; Chow et al., 2000). The BIR method, with no explicit knowledge of the original surfaces, mainly uses recursive filtering to minimize a convex cost function. In this study, we suppose the three
reconstructed surfaces, z^{\mathrm{Red}}(x,y), z^{\mathrm{Green}}(x,y), and z^{\mathrm{Blue}}(x,y), are inputs to a 3 \times 1 variable-coefficient filter h(x,y) whose output represents an estimate of the true surface, denoted \hat{z}(x,y). This estimate is passed through a nonlinear constraint process that uses a nonexpansive mapping to project the estimated surface into the space representing the known characteristics of the true surface. The difference between the projected surface \hat{z}_{NL}(x,y) and \hat{z}(x,y) is used as an error signal to update h(x,y). Figure 3 presents an overview of this method. Under the proposed methodology, the surface is assumed to be nonnegative with known support, and a cost function under this support constraint is required to define this ill-posed problem. \hat{z}_{NL}(x,y) represents the projection of the estimated surface onto the set of surfaces that are nonnegative with the given finite support:

\[ \hat{z}_{NL}(x,y) = \begin{cases} \hat{z}(x,y), & \text{if } \hat{z}(x,y) \ge 0 \text{ and } (x,y) \in D \\ 0, & \text{otherwise.} \end{cases} \tag{4.16} \]

The cost function is then defined as

\[ J(\mathbf{h}) = \sum_{\forall(x,y)} \left[ \hat{z}_{NL}(x,y) - \hat{z}(x,y) \right]^2 + \gamma \left[ \sum_{\forall(x,y)} h(x,y) - 1 \right]^2, \tag{4.17} \]
where \hat{z}(x,y) = (z^{\mathrm{Red}}(x,y)\; z^{\mathrm{Green}}(x,y)\; z^{\mathrm{Blue}}(x,y)) * h(x,y). The variable \gamma in the second term of equation 4.17 is nonzero only when the background value of the surface depth is zero; in that situation, the second term constrains the parameters away from the trivial all-zero global minimum. According to equation 4.16, the cost function, equation 4.17, can then be rewritten as

\[ J(\mathbf{h}) = \sum_{(x,y) \in D} \hat{z}^2(x,y) \left[ \frac{1 - \mathrm{sgn}(\hat{z}(x,y))}{2} \right] + \lambda \sum_{(x,y) \notin D} \hat{z}^2(x,y) + \gamma \left[ \sum_{\forall(x,y)} h(x,y) - 1 \right]^2, \tag{4.18} \]

where \mathrm{sgn}(\cdot) is defined as

\[ \mathrm{sgn}(z) = \begin{cases} 1, & \text{if } z \ge 0 \\ -1, & \text{if } z < 0. \end{cases} \tag{4.19} \]
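Equation 4.18 translates almost directly into code. The sketch below makes one simplification of ours: a single spatially constant 3 x 1 filter h, rather than the article's spatially varying one.

```python
import numpy as np

def ir_cost(h, z_rgb, support, lam=1.0, gamma=1.0):
    # Eq. 4.18. z_rgb has shape (3, H, W); support is a boolean (H, W) mask D.
    z_hat = np.tensordot(h, z_rgb, axes=(0, 0))      # z-hat = h1*zR + h2*zG + h3*zB
    neg_inside = np.sum(z_hat[support] ** 2 * (z_hat[support] < 0))  # (1 - sgn)/2 term
    outside = lam * np.sum(z_hat[~support] ** 2)
    tie_down = gamma * (np.sum(h) - 1.0) ** 2        # keeps h away from all-zero
    return neg_inside + outside + tie_down
```

The boolean factor `(z_hat < 0)` plays the role of the [1 - sgn(z-hat)] / 2 indicator, so only negative depths inside the support are penalized.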
The cost function consists of three components. The first term penalizes negative values of the surface estimate inside the region of support. The second term penalizes values of the surface estimate outside the region
Figure 3: An IR computation for finding the final 3D surface orientation.
of support. The third term constrains the coefficients h(x,y) away from the trivial all-zero global minimum. To ensure the convergence of the algorithm to the global minimum, the cost function, equation 4.18, should be shown to be convex with respect to the coefficients \{h(x,y)\}. To prove the convexity of the cost function, equation 4.18, the following theorem from Hiriart-Urruty and Lemaréchal (1993) is used.

Theorem 1. Let J_1, \ldots, J_m be in \mathrm{Conv}(\mathbb{R}^n) and let t_1, \ldots, t_m be nonnegative. Then

\[ J = \sum_{j=1}^{m} t_j J_j \tag{4.20} \]

is in \mathrm{Conv}(\mathbb{R}^n). Here,

\[ J_1(\mathbf{h}) = \sum_{(x,y) \in D} \hat{z}^2(x,y) \left[ \frac{1 - \mathrm{sgn}(\hat{z}(x,y))}{2} \right], \tag{4.21} \]
\[ J_2(\mathbf{h}) = \sum_{(x,y) \notin D} \hat{z}^2(x,y), \tag{4.22} \]
\[ J_3(\mathbf{h}) = \left[ \sum_{\forall(x,y)} h(x,y) - 1 \right]^2, \tag{4.23} \]
where \hat{z}(x,y) = (z^{\mathrm{Red}}(x,y)\; z^{\mathrm{Green}}(x,y)\; z^{\mathrm{Blue}}(x,y)) * h(x,y) and \mathbf{h} = (h_1, h_2, h_3)^T. Equation 4.18 can thus be represented as J(\mathbf{h}) = J_1(\mathbf{h}) + \lambda J_2(\mathbf{h}) + \gamma J_3(\mathbf{h}). To establish the convexity of J_1, the following definition and theorem are used.

Definition 1 (Monotone Function). The mapping J: \mathbb{R}^n \to \mathbb{R}^n is monotone on \mathbb{R}^n if

\[ \langle J(\mathbf{h}) - J(\mathbf{h}'), \mathbf{h} - \mathbf{h}' \rangle \ge 0, \qquad \forall \mathbf{h}, \mathbf{h}' \in \mathbb{R}^n, \tag{4.24} \]

where \langle \cdot, \cdot \rangle represents the inner product operation.

Theorem 2. Let J be a function differentiable on \mathbb{R}^n. Then J is convex on \mathbb{R}^n if and only if \nabla J is monotone on \mathbb{R}^n.
Thus, the derivative of J_1 with respect to \hat{z}(x,y) is given by

\[ \frac{\partial J_1}{\partial \hat{z}(x,y)} = 2\hat{z}(x,y) \left[ \frac{1 - \mathrm{sgn}(\hat{z}(x,y))}{2} \right] - \hat{z}^2(x,y)\, \delta(\hat{z}(x,y)) = \hat{z}(x,y) \left[ 1 - \mathrm{sgn}(\hat{z}(x,y)) \right] = 2\hat{z}(x,y)\, u_s(-\hat{z}(x,y)). \tag{4.25} \]

The functions \delta(\cdot) and u_s(\cdot) represent the Dirac delta function and the unit step function, respectively. The gradient of J_1 with respect to all \{\hat{z}(x,y)\} is given by

\[ \nabla J_1(\hat{z}) = 2 \left[ \hat{z}(1,1)\, u_s(-\hat{z}(1,1)),\; \hat{z}(1,2)\, u_s(-\hat{z}(1,2)),\; \ldots,\; \hat{z}(N_x, N_y)\, u_s(-\hat{z}(N_x, N_y)) \right]^T, \tag{4.26} \]

where N_x \times N_y are the dimensions of the surface estimate \hat{z}(x,y). The gradient inner product can be computed as

\[ P(\hat{z}, \hat{z}') = \langle \nabla J_1(\hat{z}) - \nabla J_1(\hat{z}'), \hat{z} - \hat{z}' \rangle = \sum_{(x,y) \in D} p(\hat{z}(x,y), \hat{z}'(x,y)), \tag{4.27} \]

where \hat{z}, \hat{z}' \in \mathbb{R}^{N_x \times N_y} and

\[ p(\hat{z}(x,y), \hat{z}'(x,y)) = 2 \left[ \hat{z}^2(x,y)\, u_s(-\hat{z}(x,y)) + \hat{z}'^2(x,y)\, u_s(-\hat{z}'(x,y)) - \hat{z}(x,y)\hat{z}'(x,y) \left( u_s(-\hat{z}(x,y)) + u_s(-\hat{z}'(x,y)) \right) \right]. \tag{4.28} \]
The gradient inner product term can be expressed case by case as

\[ p(\hat{z}(x,y), \hat{z}'(x,y)) = \begin{cases} 0, & \text{if } \hat{z}(x,y), \hat{z}'(x,y) \ge 0 \\ 2\left( \hat{z}'^2(x,y) - \hat{z}(x,y)\hat{z}'(x,y) \right) \ge 0, & \text{if } \hat{z}(x,y) \ge 0,\ \hat{z}'(x,y) < 0 \\ 2\left( \hat{z}^2(x,y) - \hat{z}(x,y)\hat{z}'(x,y) \right) \ge 0, & \text{if } \hat{z}(x,y) < 0,\ \hat{z}'(x,y) \ge 0 \\ 2\left( \hat{z}(x,y) - \hat{z}'(x,y) \right)^2 \ge 0, & \text{if } \hat{z}(x,y), \hat{z}'(x,y) < 0. \end{cases} \tag{4.29} \]

Thus, it follows from equations 4.27 and 4.29 that

\[ \langle \nabla J_1(\hat{z}) - \nabla J_1(\hat{z}'), \hat{z} - \hat{z}' \rangle \ge 0. \tag{4.30} \]
Using theorem 2, J_1 is convex with respect to \hat{z}, and since \mathbf{h} is related to \hat{z} linearly, J_1 is convex with respect to \mathbf{h}. J_2 and J_3 are quadratic in \mathbf{h}, so they are also convex. Since J_1, J_2, and J_3 are all convex, it follows from theorem 1 that J is convex with respect to its parameters \{h(1,1), \ldots, h(N_x, N_y)\}. The above shows that the cost function 4.18 is convex with respect to the 3 \times 1 coefficients h(x,y). The convergence of the algorithm can then be ensured by a variety of numerical optimization methods. In this article, a simple gradient-descent-type optimization routine is used for minimizing the cost function 4.18. The gradient vector of J is expressed as

\[ \nabla J(h(x,y)) = -\frac{\partial J(\mathbf{h})}{\partial h(x,y)} = -2 \left\{ \sum_{(x,y) \in D} \hat{z}(x,y) \left[ \frac{1 - \mathrm{sgn}(\hat{z}(x,y))}{2} \right] \begin{pmatrix} z^{\mathrm{Red}}(x,y) \\ z^{\mathrm{Green}}(x,y) \\ z^{\mathrm{Blue}}(x,y) \end{pmatrix} + \lambda \sum_{(x,y) \notin D} \hat{z}(x,y) \begin{pmatrix} z^{\mathrm{Red}}(x,y) \\ z^{\mathrm{Green}}(x,y) \\ z^{\mathrm{Blue}}(x,y) \end{pmatrix} + \gamma \left[ \sum_{\forall(x,y)} h(x,y) - 1 \right] \right\}. \tag{4.31} \]

The coefficients h(x,y) are updated by

\[ \mathbf{h}_{k+1}(x,y) = \mathbf{h}_k(x,y) + \eta\, \nabla J(h(x,y)), \tag{4.32} \]

where \eta is a small, positive step size. Based on these update equations, proper values of h(x,y) are optimized, and hence the final estimated surface orientation \hat{z}(x,y) is obtained.
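A single descent step on this convex cost (equations 4.31 and 4.32) then looks like the sketch below. It carries the same simplification as before, one spatially constant 3 x 1 filter, and the function name and parameter values are ours; note that stepping along \nabla J = -\partial J / \partial \mathbf{h} is ordinary gradient descent on J.

```python
import numpy as np

def ir_gradient_step(h, z_rgb, support, lam=1.0, gamma=1.0, eta=1e-3):
    # One gradient-descent step on J(h) of eq. 4.18; `neg` below is the
    # (1 - sgn(z-hat))/2 indicator of eq. 4.31 restricted to the support.
    z_hat = np.tensordot(h, z_rgb, axes=(0, 0))
    neg = (z_hat < 0) & support
    dJ_dh = 2.0 * (np.tensordot(z_rgb, z_hat * neg, axes=([1, 2], [0, 1]))
                   + lam * np.tensordot(z_rgb, z_hat * ~support, axes=([1, 2], [0, 1]))
                   + gamma * (np.sum(h) - 1.0))
    return h - eta * dJ_dh    # equivalent to h + eta * grad J, grad J = -dJ/dh
```

At a minimizer (nonnegative estimate inside the full support and the coefficients summing to one), the gradient vanishes and the filter is a fixed point of the step.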
5 Experimental Results and Discussion
In this section, the performance of the proposed methodology is investigated using two synthetic color images and two real color images. All the algorithms were written in Matlab script and run on a SUN Ultra-5 workstation. In all cases, a neural model of size 2-10-1 was used for the hybrid structure of the proposed color reflectance model. The boundary conditions of all the reconstruction algorithms were specified by taking the average of the two nearest values of the appropriate gradient component to obtain the nearest value of the depth in the interior of the grid. Color synthetic objects were used for the quantitative and noise sensitivity analyses of the proposed approach. Conventional methods, such as the Lambertian, Torrance-Sparrow, and dichromatic models, were used to compare their performance with that of the proposed approach. The comparisons were also performed on real colored objects.

5.1 Quantitative Results for a Single-Colored Object—Synthetic Sphere Object. Quantitative results were obtained by reconstructing a color synthetic sphere object. The true depth map of the synthetic sphere object was generated mathematically by

\[ z(x,y) = \begin{cases} \sqrt{r^2 - x^2 - y^2}, & \text{if } x^2 + y^2 \le r^2 \\ 0, & \text{otherwise,} \end{cases} \tag{5.1} \]

where r = 52 and -63 \le x, y \le 64. The synthetic image was generated from the depth map of equation 5.1, and the surface gradients were computed by the forward discrete approximation. The color-shaded images were generated according to the color image irradiance equation, with the specular components inserted on the basis of the Torrance-Sparrow model. Different color images, as well as different illumination conditions, were generated for the corresponding performance investigations. Figure 12 (see the color images section) shows the color synthetic sphere objects with different color and specular effects under different illumination conditions. Comparisons were performed between our proposed methodology and both the generalized hybrid reflectance model and the dichromatic reflectance model. The results are tabulated in Table 1. Clearly, the results indicate that the performance of our approach is substantially better than those of the hybrid model and the dichromatic model. From the results obtained for the different object colors (especially pink and yellow), neither the monochromatic hybrid reflectance model nor the dichromatic reflection model could recover accurate surfaces for these colored objects. Impressive results were obtained using the proposed approach even when the colors of the objects were mixed (pink or yellow). Figure 4 shows three sets of the reconstructed surfaces for the different colors of synthetic sphere objects. Each set of surfaces includes three surface orientations from the three
[Table 1: Comparative Results for the Reconstruction of the Color Synthetic Sphere Object. For each object color (red, green, blue) under four lighting/viewing conditions (combinations of tilt t and slant s, plus an unknown condition), the table lists the mean, maximum, and variation of the absolute depth error for the generalized hybrid reflectance model (Lambertian + Torrance-Sparrow), the dichromatic reflectance model, and the proposed color SFS approach; table entries not reproduced here.]
2772 Siu-Yeung Cho and Tommy W. S. Chow
[Table 1 continued: corresponding absolute depth errors (mean, maximum, variation) for the pink and yellow sphere objects under the same lighting/viewing conditions and the same three models; table entries not reproduced here.]
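For reference, the three error statistics reported in the tables can be computed as follows. This is an illustrative Python sketch, not the authors' code; the function name and array arguments are assumptions, and "variation" is taken here to mean the variance of the absolute depth error.

```python
import numpy as np

def depth_error_stats(z_est, z_true):
    """Mean, maximum, and variance of the absolute depth error."""
    err = np.abs(np.asarray(z_est) - np.asarray(z_true))
    return err.mean(), err.max(), err.var()
```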
Figure 4: Three sets of the reconstructed surfaces for the different colors of synthetic sphere objects. (A) Reconstructed 3D shape of the synthetic green sphere. (B) Reconstructed 3D shape of the synthetic yellow sphere. (C) Reconstructed 3D shape of the synthetic pink sphere. From left to right in all rows: surface of R component, surface of G component, surface of B component, and final computed surface.
fundamental color components (RGB) reconstructed by the neural-based approach. The final surface orientations computed from these three surfaces by the IR method were satisfactory. Moreover, under unknown illumination and specular conditions, the generalized hybrid reflectance model and the dichromatic reflectance model produced inaccurate reconstructions because of the dominating effect of the uncertain illumination condition and the strong specular components. The errors of the proposed neural-based model are lower than those of the hybrid model and the dichromatic model by about 70% and 50%, respectively. Clearly, the proposed methodology is able to deliver accurate results. For the other cases under different illumination conditions, experimental studies on different values of v were performed; the results are shown in Cho and Chow (in press). Distortions
occurred at the reconstructed object when the value of v was too large (i.e., very close to one); the specular effect is then clearly overemphasized during the reconstruction. Throughout the study in Cho and Chow (in press), a typical value of 0.7 was chosen, delivering optimal solutions by maintaining a good balance between avoiding distortion and capturing the contribution of the specular component modeling.

5.2 Noise Sensitivity Results of Colored Object—Synthetic Vase Object. The noise sensitivity of our neural-based approach is examined using a colored synthetic vase object. The true depth map of the synthetic vase object was generated mathematically by
\[
z(x, y) = \sqrt{f(y)^2 - x^2}, \tag{5.2}
\]
where f(y) = 0.15 − 0.1 y (6y + 1)^2 (y − 1)^2 (3y − 2), −0.5 ≤ x ≤ 0.5, and 0.0 ≤ y ≤ 1.0. As before, the color synthetic vase image was generated from the depth map of equation 5.2, and the different color-shaded images were generated according to the color shading function. Two noise types, gaussian and Rayleigh, were used, with different noise levels added to the image intensities. Figure 13 (see the color images section) presents the color synthetic vase images under noise-free, 5% gaussian noise, and 5% Rayleigh noise conditions. Table 2 compares the proposed color SFS methodology with the generalized hybrid reflectance model and the dichromatic reflectance model. Under noisy conditions, the mean depth error obtained by the proposed neural-based approach is significantly lower, by about 45%, than that of the monochromatic (generalized hybrid) reflectance model, and the proposed method gives consistent results. Under a high noise level (10% noise), the average depth error yielded by the neural-based color reflectance approach is also much lower than that of the dichromatic model. Thus, the proposed neural model provides accurate results and noise-robust 3D shape reconstruction under different illumination conditions and noisy environments.

5.3 Results of Synthetic Multicolored Objects. This section investigates the SFS of multicolored objects. Two types of multicolored presentation were used; the different color combinations of these multicolored objects are shown in Figure 14 in the color images section.
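The vase depth map of equation 5.2 above can be generated with a few lines of Python. This is an illustrative sketch, not the authors' Matlab code; the grid resolution and the clipping to zero of points outside the vase profile (where f(y)^2 − x^2 would be negative) are assumptions.

```python
import numpy as np

def vase_depth(nx=65, ny=65):
    """Depth map of the synthetic vase from equation 5.2 (illustrative)."""
    x = np.linspace(-0.5, 0.5, nx)
    y = np.linspace(0.0, 1.0, ny)
    X, Y = np.meshgrid(x, y)
    # Vase profile f(y) from equation 5.2
    f = 0.15 - 0.1 * Y * (6 * Y + 1) ** 2 * (Y - 1) ** 2 * (3 * Y - 2)
    # Clip negative values under the square root (outside the profile) to zero
    return np.sqrt(np.maximum(f ** 2 - X ** 2, 0.0))
```

At y = 0 the profile is f(0) = 0.15, so the depth on the central axis there is 0.15, while points with |x| > f(y) fall outside the vase and are clipped to zero.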
In the first case, the synthetic vase contains two colors, with the top of the vase body in one color and the rest of the vase in another. The colors of the object were sharply separated to validate the proposed approach for 3D shape reconstruction with more than one color. In our study, six color combinations were generated and investigated in comparing the generalized hybrid
[Table 2: Noise Sensitivity Investigation for the Color Synthetic Vase Object Reconstruction. For red and pink vase objects under no noise and under 2%, 5%, and 10% gaussian and Rayleigh noise, the table lists the mean, maximum, and variation of the absolute depth error for the generalized hybrid reflectance model (Lambertian + Torrance-Sparrow), the dichromatic reflectance model, and the proposed color SFS approach; table entries not reproduced here.]
reflectance and the dichromatic models. The results are tabulated in Table 3, and the 3D reconstructed surfaces of this multicolored vase object are shown in Figure 5. Figure 5a shows the reconstructed 3D shape of the R, G, and B components, as well as the final computed surfaces obtained by our proposed methodology; Figures 5b and 5c show the reconstructed 3D shapes obtained by the generalized hybrid reflectance model and the dichromatic model, respectively. Our proposed method clearly outperforms the other reflectance models in the case of multicolored objects. A second multicolored presentation was used to validate the proposed approach further. In this case, multicolored synthetic sphere and vase objects were used, with different colors generated on the reflectance surfaces (see Figure 15 in the color images section). The multicolored presentations were determined from the color maps of the images, so that the different presentations subject the color specular effect of the reflectance surface to different color information; obtaining the correct surface shape in this analysis is rather difficult. In these investigations, four two-colored presentations were used to compare the performance of the different reflectance models under different color specular conditions. Two examples of multicolor presentations of the synthetic sphere and vase were divided into their RGB components; their image intensities are shown in Figure 6. Based on the corresponding image intensities, three corresponding 3D surfaces were reconstructed, and the final surface orientations were computed from these surfaces. The resultant 3D surfaces are shown in Figures 7 and 8 for the reconstruction of the synthetic sphere and vase, respectively.
Convincing results are obtained by our approach because 3D surface reconstruction is performed on each primary color band individually, and the resultant 3D surface is then determined from the reconstructed surfaces of the three primary-color objects. The problem of the interreflection effect in color space analysis by the physically based model is thereby eliminated, avoiding incorrect surface recovery. In this investigation, the performance of our proposed color SFS methodology was compared with those of the generalized hybrid model and the dichromatic reflectance model. In Figure 7, the final 3D surfaces of the synthetic sphere are very convincing under the different color presentations, whereas the surface orientations obtained by the generalized hybrid model and the dichromatic model are unsatisfactory, probably because of the effect of the multicolored reflection. Similarly, in Figure 8, the 3D synthetic vase surfaces show consistently good performance under the different color presentations, whereas the other two models fail to provide satisfactory results. The numerical comparative results are set out in Table 4 for the multicolored synthetic sphere and vase. The results of our proposed neural-based color reflectance approach are much better than those obtained using the other reflectance models, and our approach outperforms them under all the multicolor presentations. The mean depth error obtained by our method
[Table 3: Comparative Results for the Different Multicolored Combinations of the Vase Object. For six two-color combinations of the vase (green+red, red+green, blue+red, red+blue, green+blue, blue+green; Figures 13a-13f), the table lists the mean, maximum, and variation of the absolute depth error for the generalized hybrid reflectance model (Lambertian + Torrance-Sparrow), the dichromatic reflectance model, and the proposed color SFS approach; table entries not reproduced here.]

[Table 4: Comparative Results for Multicolored Synthetic Objects. For the sphere and vase objects under multicolor presentations A-D, the table lists the mean, maximum, and variation of the absolute depth error for the same three models; table entries not reproduced here.]
Figure 5: The reconstructed 3D surfaces for the multicolored combination of Figure 14a. (a) Reconstructed 3D shape of the R, G, and B components. (b) Reconstructed 3D shape obtained by the hybrid reflectance model. (c) Reconstructed 3D shape obtained by the dichromatic model.
is also significantly lower than that of the dichromatic reflectance model, by about 60%. In addition, another type of multicolored presentation of the synthetic vase was applied to validate our approach on pastel colors with moderate contributions in all color bands. Figure 16 (in the color section) shows this presentation, with three colors changing gradually from light red to light blue, providing pastel colors in all color bands. Figure 9a shows the synthetic vase surfaces reconstructed by the proposed approach, and Figure 9b shows the result obtained by the dichromatic model. The comparison shows that our proposed method provides a convincing result for the pastel-colored presentation. It can be concluded that the proposed method provides robust immunity to chromatic reflectance surfaces with multicolored objects for 3D shape reconstruction.

Figure 6: The image intensities of the RGB components of the multicolored synthetic objects. (a) Multicolored synthetic sphere in color presentation A. (b) Multicolored synthetic vase in color presentation C. From left to right in both rows: R component, G component, B component.

5.4 Real Color Image Results. In this section, real color images are used to validate the proposed model. In all cases, the reflectance surfaces were chromatic and non-Lambertian, and the illumination came from unknown multiple sources. The projection was usually perspective, and the reflectance surface was also affected by color specular reflection components. The performance of the proposed method was compared with that of the generalized hybrid reflectance model and the dichromatic reflectance model. These real color images are quite difficult to handle because they contain non-Lambertian surfaces with different reflectivity properties and multicolored characteristics. All the parameters of the dichromatic reflectance model, such as the diffuse color, the specular color, and its roughness, had to be determined by a color space analysis of the color histogram of the image. This analysis was based on calculating the spectrum of the interreflected light following Nayar et al. (1990b): the spectrum is formed as the product of the illumination spectrum and the spectra of the two surface reflectance functions. In fact, the surface reflectance functions multiply together as many times as the light is interreflected, with the total mutual illumination contribution equivalent to the sum of an infinite number of interreflections. The resulting spectrum is then filtered by the sensor response functions to obtain an RGB triple representing the color. Because the illuminating and viewing directions were unknown, the generalized hybrid reflectance model required estimation of the light source direction; this was performed using Zheng and Chellappa's (1991)
Figure 7: Reconstructed 3D surfaces for the different color presentations of the multicolored synthetic sphere objects. (a) Reconstructed 3D shape obtained by the proposed methodology in color presentation A. (b) Reconstructed 3D shape obtained by the generalized hybrid reflectance model in color presentation A. (c) Reconstructed 3D shape obtained by the dichromatic reflectance model in color presentation A. (d) Reconstructed 3D shape obtained by the proposed methodology in color presentation C. (e) Reconstructed 3D shape obtained by the generalized hybrid reflectance model in color presentation C. (f) Reconstructed 3D shape obtained by the dichromatic reflectance model in color presentation C. All rows left to right: surface of R component, surface of G component, and the final computed surface.
Figure 8: Reconstructed 3D surfaces for the different color presentations of the multicolored synthetic vase objects. (a) Reconstructed 3D shape obtained by the proposed methodology in color presentation A. (b) Reconstructed 3D shape obtained by the generalized hybrid reflectance model in color presentation A. (c) Reconstructed 3D shape obtained by the dichromatic reflectance model in color presentation A. (d) Reconstructed 3D shape obtained by the proposed methodology in color presentation C. (e) Reconstructed 3D shape obtained by the generalized hybrid reflectance model in color presentation C. (f) Reconstructed 3D shape obtained by the dichromatic reflectance model in color presentation C. All rows left to right: surface of R component, surface of G component, and the final computed surface.
Figure 9: Reconstructed 3D shape of the synthetic vase in the multicolor presentation shown in Figure 16. (a) Reconstructed surface computed by our approach. (b) Reconstructed surface computed by the dichromatic model.
method, whereas our proposed model never requires estimation of the light source direction. The viewing direction is assumed to be orthogonal to the object surface. An image of a green pepper and its corresponding image intensities of the RGB components are shown in Figure 17 (see the color figures section). Figures 10a through 10c show the reconstructed 3D shapes of the corresponding RGB component intensities, and the 3D shape of the green pepper obtained by the proposed color SFS methodology is shown in Figure 10d; estimates of the illuminating and viewing directions are never required for our proposed method. Figure 10e shows the 3D shape recovered using the generalized hybrid reflectance model. The result is far from satisfactory because of the effect of the color information on the object and its interreflection. Figure 10f shows the result obtained by the hybrid reflectance model with an incorrect light source direction; in this case, the recovered 3D pepper has lost much of the surface orientation information under the abnormal color parameters and the strong effect of the color specular reflection. The 3D shapes recovered by the dichromatic reflectance model for color non-Lambertian surfaces under known and unknown reflectivity parameters are shown in Figures 10g and 10h, respectively. Although the dichromatic model performs better than the generalized hybrid model, which is intended for monochromatic reflection, the recovered 3D pepper is still unsatisfactory; this is attributed to the uncertain illumination condition and the color specular reflection. To validate the 3D reconstruction of multicolored objects, another real image was used. An image of this object and its corresponding RGB components are shown in Figure 18 (refer to the color images section).

Figure 10: Results for the green pepper object. (a) 3D surface orientation of the R component reconstructed by the neural-based red color reflectance model. (b) 3D surface orientation of the G component reconstructed by the neural-based green color reflectance model. (c) 3D surface orientation of the B component reconstructed by the neural-based blue color reflectance model. (d) Final 3D surface orientation of the green pepper object computed by the iterative recursive method from the above three surfaces. (e) 3D surface orientation reconstructed by the generalized hybrid reflectance model with the correct illumination condition. (f) 3D surface orientation reconstructed by the generalized hybrid reflectance model with an incorrect illumination condition. (g) 3D surface orientation reconstructed by the dichromatic reflectance model with correct color parameter determination. (h) 3D surface orientation reconstructed by the dichromatic reflectance model with incorrect color parameter determination.

Figure 11a shows the 3D shape of this multicolored object obtained by the proposed color SFS methodology. A satisfactory shape is obtained, and the multicolored 3D object shape can be sketched approximately. Figure 11b shows the 3D shape of this multicolored object obtained by the dichromatic color model with an incorrect result of color space analysis. The 3D shape recovered by this physically based color reflectance model appears quite good under the correct color space analysis of interreflection illumination, but the performance was obviously degraded by the incorrect color space analysis. We can therefore conclude that our proposed color model delivers much better results in multicolored object reconstruction than the physically based color model under an incorrectly calculated color space analysis.
Figure 11: Results for the multicolored real object of Figure 15a. (a) 3D reconstructed surface orientation of this multicolored real object computed by our approach. (b) 3D reconstructed surface orientation of this multicolored real object computed by the dichromatic color model with incorrect results of color space analysis.
6 Conclusion

In dealing with the color SFS problem, we cannot simply treat the color image as gray scale, yet most existing color reflectance models are purely monochromatic. This article presents a generalized methodology for recovering 3D shape from a color image of either a single-colored or a multicolored object. The proposed methodology consists of a neural-based color reflectance model and an iterative recursive method. It first treats the multicolor image as three primary color components, and three corresponding 3D surfaces are reconstructed using the proposed neural-based color reflectance models. These models are able to determine the object depth without the essential physical parameters of diffuse color, specular color reflection, and illumination condition. Finally, the developed iterative recursive method is used to reconstruct a final 3D surface from the three reconstructed depths, z^Red(x, y), z^Green(x, y), and z^Blue(x, y). The obtained results indicate that the proposed methodology exhibits encouraging performance in dealing with strong color specular effects and noise and can successfully handle the multicolored-object SFS problem.
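The final combination step summarized above can be sketched as follows. This is a hypothetical Python illustration, not the authors' exact iterative recursive method: the per-pixel weighted combination of the three channel depths mirrors the unit-sum constraint penalized in equation 4.31, but the function and its explicit normalization are assumptions.

```python
import numpy as np

def combine_depths(z_red, z_green, z_blue, h):
    """Per-pixel weighted combination z_hat = sum_c h_c * z_c of the three
    channel depths, with the 3x1 coefficients h(x, y) renormalized to sum
    to one (illustrative stand-in for the iterative recursive method)."""
    z = np.stack([z_red, z_green, z_blue], axis=-1)   # (Nx, Ny, 3)
    h = h / h.sum(axis=-1, keepdims=True)             # enforce unit sum
    return (h * z).sum(axis=-1)
```

When the three channel reconstructions agree, any unit-sum weighting returns that common surface; the weights matter only where the channels disagree.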
Acknowledgments
We thank the anonymous reviewers for their valuable comments. This research was supported by the Hong Kong Government Competitive Earmarked Research Grant 9040601.
Figure 12: Facing page. Color synthetic sphere images under different specular effects at different illuminations. (a–e) tilt = 45°, slant = 45°. (f–j) tilt = 225°, slant = 45°.

Figure 13: Facing page. Color synthetic vase under different noise conditions. (a) Red under no noise. (b) Red under 5% gaussian noise. (c) Red under 10% Rayleigh noise. (d) Pink under no noise. (e) Pink under 5% gaussian noise. (f) Pink under 10% Rayleigh noise.

Figure 14: Facing page. Multicolored synthetic vase object under different color combinations.

Figure 15: Facing page. Multicolored synthetic sphere and vase under different multicolor presentations. (a) Sphere A. (b) Sphere B. (c) Sphere C. (d) Sphere D. (e) Vase A. (f) Vase B. (g) Vase C. (h) Vase D.

Figure 16: Facing page. Multicolored synthetic vase with three colors changing gradually. (a) Original color image. (b) Image intensity of the red component. (c) Image intensity of the green component. (d) Image intensity of the blue component.

Figure 17: Facing page. Single green pepper image. (a) Original color image. (b) Image intensity of the red component. (c) Image intensity of the green component. (d) Image intensity of the blue component.

Figure 18: Facing page. Multicolored real object images. (a) Original color image. (b) Image intensity of the red component. (c) Image intensity of the green component. (d) Image intensity of the blue component.

References

Beckmann, P., & Spizzichino, A. (1963). The scattering of electromagnetic waves from rough surfaces. New York: Macmillan.

Bichsel, M., & Pentland, A. (1992). A simple algorithm for shape from shading. IEEE Conference on Computer Vision and Pattern Recognition (pp. 459–465). Washington, DC: IEEE Computer Society Press.

Bui-Tuong, P. (1975). Illumination for computer generated pictures. Comm. ACM, 18, 311–317.

Cheung, W. P., Lee, C. K., & Li, K. C. (1997). Direct shape from shading with improved rate of convergence. Pattern Recognition, 30, 353–365.

Cho, S. Y., & Chow, T.
W. S. (1999a). Shape recovery from shading by a new neural-based reflectance model. IEEE Transactions on Neural Networks, 10, 1536–1541.

Cho, S. Y., & Chow, T. W. S. (1999b). Training multilayer neural networks using fast global learning algorithm—Least squares and penalized optimization methods. Neurocomputing, 25, 115–131.

Cho, S. Y., & Chow, T. W. S. (2000). Learning parametric specular reflectance model by radial basis function networks. IEEE Transactions on Neural Networks, 17, 1346–1350.

Cho, S. Y., & Chow, T. W. S. (in press). Enhanced 3D shape recovery using the neural-based hybrid reflectance model. Neural Computation.
Chow, T. W. S., Li, X.-D., & Cho, S. Y. (2000). Improved blind image restoration scheme using recurrent filtering. IEE Proceedings—Vision, Image, and Signal Processing, 147(1).

Christensen, P. H., & Shapiro, L. G. (1994). Three-dimensional shape from color photometric stereo. International Journal of Computer Vision, 13, 213–227.

Healy, G., & Binford, T. O. (1988). Local shape from specularity. Computer Vision, Graphics and Image Processing, 42, 62–86.

Hiriart-Urruty, J. B., & Lemaréchal, C. (1993). Convex analysis and minimization algorithms I. New York: Springer-Verlag.

Horn, B. K. P. (1970). Shape from shading: A smooth opaque object from one view. Unpublished doctoral dissertation, MIT.

Horn, B. K. P. (1990). Height and gradient from shading. International Journal of Computer Vision, 5, 37–75.

Horn, B. K. P., & Brooks, M. J. (1986). The variational approach to shape from shading. Computer Vision, Graphics, and Image Processing, 33, 174–208.

Horn, B. K. P., & Brooks, M. J. (1989). Shape from shading. Cambridge, MA: MIT Press.

Horn, B. K. P., & Sjoberg, R. W. (1990). Calculating the reflectance map. Applied Optics, 18, 37–75.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Klinker, G. J., Shafer, S. A., & Kanade, T. (1987). Using a color reflection model to separate highlights from object color. In Proceedings of the International Conference on Computer Vision (pp. 145–150). Los Alamitos, CA: IEEE Computer Society Press.

Klinker, G. J., Shafer, S. A., & Kanade, T. (1990). A physical approach to color image understanding. Intl. J. Comp. Vision, 7, 7–38.

Kundur, D., & Hatzinakos, D. (1998). A novel blind deconvolution scheme for image restoration using recursive filtering. IEEE Transactions on Signal Processing, 46, 375–390.

Lee, K. M., & Kuo, C.-C. J. (1997). Shape from shading with a generalized reflectance map model. Computer Vision and Image Understanding, 67, 143–160.

Lee, S.
W., & Bajcsy, R. (1992). Detection of specularity using colors and multiple views. In Proceedings of the 2nd European Conference on Computer Vision (pp. 99–114). Orlando, FL: Academic Press.

Nayar, S. K., Ikeuchi, K., & Kanade, T. (1990a). Determining shape and reflectance of hybrid surfaces by photometric sampling. IEEE Transactions on Robotics and Automation, 6, 418–430.

Nayar, S. K., Ikeuchi, K., & Kanade, T. (1990b). Shape from interreflections. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2–11). Los Alamitos, CA: IEEE Computer Society Press.

Nayar, S. K., Ikeuchi, K., & Kanade, T. (1991). Surface reflection: Physical and geometrical perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 611–634.

Novak, C. L., & Shafer, S. A. (1992). Anatomy of a histogram. In Proceedings of the Conference on Computer Vision and Pattern Recognition (pp. 599–605). Los Alamitos, CA: IEEE Computer Society Press.
Park, J., & Sandberg, I. W. (1991). Universal approximation using radial basis function networks. Neural Computation, 3, 261–275.

Szeliski, R. (1991). Fast shape from shading. CVGIP: Image Understanding, 53, 129–153.

Tian, Y.-L., & Tsui, H. T. (1996). Shape from shading for non-Lambertian surfaces from one color image. IEEE Proceedings of ICPR'96 (pp. 258–262). Los Alamitos, CA: IEEE Computer Society Press.

Torrance, K. E., & Sparrow, E. M. (1967). Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America, 57, 1105–1114.

Ward, G. J. (1992). Measuring and modeling anisotropic reflection. Comp. Graphics, 26, 265–272.

Warnock, J. (1969). A hidden-surface algorithm for computer generated half-tone pictures (Tech. Rep. No. TR 4-15, NTIS AS-753 671). Salt Lake City: Computer Science Department, University of Utah.

Zhang, R., Tsai, P. S., Cryer, J. E., & Shah, M. (1999). Shape from shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 690–706.

Zheng, Q., & Chellappa, R. (1991). Estimation of illuminant direction, albedo, and shape from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 680–702.

Received July 24, 2001; accepted May 10, 2002.
REVIEW
Communicated by Tomaso Poggio
On Different Facets of Regularization Theory

Zhe Chen
[email protected]
Simon Haykin
[email protected]
Adaptive Systems Lab, Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada L8S 4K1
No more things should be presumed to exist than are absolutely necessary. —William of Occam

Everything should be made as simple as possible, but not simpler. —Albert Einstein

This review provides a comprehensive understanding of regularization theory from different perspectives, emphasizing smoothness and simplicity principles. Using the tools of operator theory and Fourier analysis, it is shown that the solution of the classical Tikhonov regularization problem can be derived from the regularized functional defined by a linear differential (integral) operator in the spatial (Fourier) domain. State-of-the-art research relevant to regularization theory is reviewed, covering Occam's razor, minimum description length, Bayesian theory, pruning algorithms, informational (entropy) theory, statistical learning theory, and equivalent regularization. The universal principle of regularization in terms of Kolmogorov complexity is discussed. Finally, some prospective studies on regularization theory and beyond are suggested.
Neural Computation 14, 2791–2846 (2002) © 2002 Massachusetts Institute of Technology

1 Introduction

Most of the inverse problems posed in science and engineering areas are ill posed—computational vision (Poggio, Torre, & Koch, 1985; Bertero, Poggio, & Torre, 1988), system identification (Akaike, 1974; Johansen, 1997), nonlinear dynamic reconstruction (Haykin, 1999), and density estimation (Vapnik, 1998a), to name a few. In other words, given the available input data, the solution to the problem is nonunique (one-to-many) or unstable. The classical regularization techniques, developed by Tikhonov in the 1960s (Tikhonov & Arsenin, 1977), have been shown to be powerful in making the solution well posed and thus have been applied successfully in
model selection and complexity control. Regularization theory was introduced to the machine learning community (Poggio & Girosi, 1990a, 1990b; Barron, 1991). Poggio and Girosi (1990a, 1990b) showed that a regularization algorithm for learning is equivalent to a multilayer network with a kernel in the form of a radial basis function (RBF), resulting in an RBF network. The regularization solution was originally derived by a differential linear operator and its Green's function (Poggio & Girosi, 1990a, 1990b). A large class of generalized regularization networks (RNs) is reviewed in Girosi, Jones, and Poggio (1995). In the classical regularization theory, a recent trend in studying the smoothness of the functional is to put the functionals into the reproducing kernel Hilbert space (RKHS), which has been well developed in different areas (Aronszajn, 1950; Parzen, 1961, 1963; Yosida, 1978; Kailath, 1971; Wahba, 1990). By studying the properties of the functionals in the RKHS, many learning models, including smoothing splines (Kimeldorf & Wahba, 1970; Wahba, 1990), RNs (Girosi et al., 1995), support vector machines (SVMs; Vapnik, 1995, 1998a), and gaussian processes (MacKay, 1998; Williams, 1998a), can be related to each other. In this sense, regularization theory has gone beyond its original implication and expectation. In this review, we emphasize the theory of operators, Green's functions, and kernel functions. In particular, based on operator theory and Fourier analysis, we derive the spectral regularization framework (Chen & Haykin, 2001a, 2001b) along the lines of classical spatial regularization (Poggio & Girosi, 1990a; see also Haykin, 1999). As we show, the essence of the regularization approach is to expand the solution in terms of a set of Green's functions, which depend on the form of the stabilizer in the context of differential or integral operators and their associated boundary conditions.
With two principles of regularization, smoothness and simplicity, in mind, we provide an extensive overview of regularization theory. The main contributions of this article are to review the theory of regularization, examine the relations between regularization theory and other related theoretical work, and present some new perspectives. The rest of the review is organized as follows. Section 2 briefly formulates the ill-posed problem and introduces regularization theory as the solution. Section 3 introduces the classical Tikhonov regularization framework with prerequisite material on machine learning and operator theory. Following the theory of operators and Green's functions, we derive the spectral regularization framework using Fourier analysis. Starting with Occam's razor and minimum description length (MDL) theory in section 4, we present different facets of regularization theory from sections 5 through 10. Various relevant research topics are reviewed and their connections to regularization theory highlighted. Finally, the universal principle of Kolmogorov complexity for regularization is explored in section 11, followed by the summary and comments in section 12.
2 Why Use Regularization?

A problem is said to be well posed in the Hadamard sense¹ if it satisfies the following three conditions (Tikhonov & Arsenin, 1977; Morozov, 1984; Haykin, 1999):

1. Existence. For every input vector x ∈ X, there exists an output vector y = F(x), where y ∈ Y.²

2. Uniqueness. For any pair of input vectors x, z ∈ X, it follows that F(x) = F(z) if and only if x = z.

3. Continuity (stability). The mapping is continuous; that is, for any ε > 0 there exists δ = δ(ε) such that the condition d_X(x, z) < δ implies that d_Y(F(x), F(z)) < ε, where d(·, ·) represents the distance metric between two arguments in their respective spaces.

If any of these three conditions is not satisfied, the problem is said to be ill posed. In terms of operator language, we have

Definition 1 (Kress, 1989). Let A: X → Y be an operator from a normed space X into a normed space Y. The equation Ax = y is said to be well posed if A is bijective³ and the inverse operator A⁻¹: Y → X is continuous. Otherwise the equation is called ill posed.

According to definition 1, if A is not surjective, then Ax = y is not solvable for all y ∈ Y and thus violates the existence condition; if A is not injective, Ax = y may have more than one solution and thus violates the uniqueness condition; if A⁻¹ exists but is not continuous, then the solution x does not depend continuously on the observation y, which again violates the stability condition (Kress, 1989). The following three equivalent conditions are usually used to describe the constraints of the solution of the inverse problem (Poggio & Koch, 1985):

Find x that minimizes ‖Ax − y‖ and that satisfies ‖Px‖ < c₁, where c₁ is a positive scalar constant.

Find x that minimizes ‖Px‖ and that satisfies ‖Ax − y‖ < c₂, where c₂ is a positive scalar constant.
¹ The Hadamard sense is a special case of the Tikhonov sense (Vapnik, 1998a).
² Throughout this review, X and Y represent the range or domain in a specific functional space.
³ The operator is bijective if it is injective and surjective. If for each y ∈ A(X) there is at most (or at least) one element x ∈ X with Ax = y, then A is said to be injective (or surjective).
Find x that minimizes ‖Ax − y‖² + λ‖Px‖², where λ is a regularization parameter (λ = c₂/c₁),

where x is the solution, y is the observation, A and P represent some linear operators, and ‖·‖ is some kind of norm operator depending on a specific physical scenario. The first condition is called the quasi-solution and the second is called the discrepancy principle, both of which belong to the constrained optimization problem. The third condition is an unconstrained optimization problem, which we focus on in this review. Ill-posed problems are ubiquitous in the real world, since most inverse problems are subject to some physical constraints and have no unique solution. We briefly describe three examples relevant to the neural networks and signal processing communities.

2.1 Computational Vision. The early vision problem refers to the first processing stage in computational vision, which consists of decoding two-dimensional images in terms of properties of three-dimensional objects. Many computational vision problems, such as shape from shading, surface reconstruction, edge detection, and computation of optical flow, are generally ill posed (Poggio & Koch, 1985; Poggio et al., 1985; Bertero et al., 1988). In general, the solutions to these ill-posed problems can be formulated as (Poggio et al., 1985)

\arg\min_{x} \|Ax - y\|^2 + \lambda \|Px\|^2.   (2.1)
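The unconstrained form in equation 2.1 has a closed-form minimizer, x = (AᵀA + λPᵀP)⁻¹Aᵀy. A minimal sketch follows, assuming NumPy is available; the Vandermonde matrix, the noise level, and the choice P = I (ridge regression) are illustrative assumptions, not from the text:

```python
import numpy as np

# Sketch of equation 2.1: x* = argmin_x ||Ax - y||^2 + lambda*||Px||^2.
# With P = I the normal equations give x* = (A^T A + lambda P^T P)^{-1} A^T y.

def tikhonov_solve(A, y, lam, P=None):
    P = np.eye(A.shape[1]) if P is None else P
    return np.linalg.solve(A.T @ A + lam * (P.T @ P), A.T @ y)

rng = np.random.default_rng(0)
A = np.vander(np.linspace(0.0, 1.0, 20), 8)   # nearly collinear columns
x_true = rng.standard_normal(8)
y = A @ x_true + 1e-3 * rng.standard_normal(20)

x_ls = np.linalg.lstsq(A, y, rcond=None)[0]   # unregularized least squares
x_reg = tikhonov_solve(A, y, lam=1e-6)        # Tikhonov-regularized solution
print(np.linalg.norm(x_ls), np.linalg.norm(x_reg))
```

Even a tiny λ tames the amplification of the measurement noise by the nearly collinear columns, at the price of a small bias in the recovered coefficients.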
By invoking different regularization tools (Marroquin, Mitter, & Poggio, 1987; Poggio, Voorhees, & Yuille, 1988; Yuille & Grzywacz, 1988; Schnorr & Sprengel, 1994; Bernard, 1999), many computational vision problems have been solved with great success.

2.2 Density Estimation. Density estimation is a fundamental problem in machine learning. Suppose the observed data are sampled by the density p(x) from the cumulative distribution function F(x), which is expressed by

\int_{-\infty}^{x} p(t)\, dt = F(x).   (2.2)
Now the problem is formulated as: Given some data x_i (i = 1, …, ℓ), how can p(x) be estimated from a finite number of noisy observations? Empirically, one may estimate the distribution function by

F_\ell(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} H(x - x_i),   (2.3)
where H(·) is a step function: H(x) = 1 if all components x_n > 0 and H(x) = 0 otherwise. The density is further estimated by solving

\int_{-\infty}^{x} p(t)\, dt = F_\ell(x).   (2.4)
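Equations 2.3 and 2.4 can be sketched directly, assuming NumPy is available (the four-point sample is an illustrative assumption): the empirical distribution function is a staircase, and recovering p(x) from it amounts to differentiating that staircase, which is where the instability enters.

```python
import numpy as np

# Sketch of equation 2.3: the empirical distribution function built from a
# finite sample, F_l(x) = (1/l) * sum_i H(x - x_i), H the unit step function.

def F_ell(x, sample):
    sample = np.asarray(sample, dtype=float)
    return np.mean(np.heaviside(x - sample, 1.0))

sample = [0.2, 0.5, 0.5, 0.9]
print(F_ell(0.6, sample))   # 3 of the 4 points lie at or below 0.6 -> 0.75
```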
Solving this inverse operator is generally a stochastic ill-posed problem, especially in the high-dimensional case (Vapnik, 1998a).

2.3 Dynamic Reconstruction. Nonlinear dynamic reconstruction (e.g., of sea clutter dynamics) is a common problem in the physical world (Haykin, 1999). Without loss of generality, the nonlinear dynamics can be formulated by the following state-space model:

x_{t+1} = F(x_t) + \nu_{1,t}   (2.5)

y_t = G(x_t) + \nu_{2,t},   (2.6)

where F is a nonlinear vector-valued mapping function, G is a scalar-valued function, and ν_{1,t} and ν_{2,t} represent the process noise and measurement noise contaminating the state variable x_t and the observable y_t, respectively. Given a time series of the observable y_t, the problem is to reconstruct the dynamics described by F, which is generally ill posed in the following sense: (1) for some unknown reasons, the existence condition may be violated; (2) there may not be sufficient information in the observation for reconstructing the nonlinear dynamics uniquely, which thus violates the uniqueness condition; and (3) the unavoidable presence of noise adds uncertainty to the dynamic reconstruction—when the signal-to-noise ratio (SNR) is too low, the continuity condition is also violated.

One way to solve ill-posed problems is to make the problems well posed by incorporating prior knowledge into the solutions (Tikhonov & Arsenin, 1977; Morozov, 1984; Wahba, 1990, 1995; Vapnik, 1998a). The forms of prior knowledge vary and are problem dependent. The most popular and important prior knowledge is the smoothness prior,⁴ which assumes that the functional mapping of the input to the output space is continuous and smooth. This is where regularization naturally comes in, arising from the well-known Tikhonov regularization theory,⁵ which we detail in the next section.

⁴ Another methodology to embed prior knowledge is the theory of hints and virtual examples. See Abu-Mostafa (1995) and Niyogi, Girosi, and Poggio (1998).
⁵ Many well-known statistical terminologies (for example, ridge regression, penalized least-squares estimation, penalized likelihood estimation, smoothing splines, and averaging kernel regression) can be well formulated in Tikhonov's regularization framework.
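A minimal simulation of the state-space model in equations 2.5 and 2.6 can be written as follows, assuming NumPy is available; the logistic map standing in for F, the identity observation G, and the noise levels are all illustrative assumptions:

```python
import numpy as np

# Toy instance of the state-space model of equations 2.5-2.6.

rng = np.random.default_rng(1)
F = lambda x: 3.9 * x * (1.0 - x)      # illustrative nonlinear mapping F
G = lambda x: x                        # illustrative observation function G

T = 200
x = np.empty(T)                        # hidden state x_t
y = np.empty(T)                        # observable y_t
x[0] = 0.3
y[0] = G(x[0]) + 0.01 * rng.standard_normal()
for t in range(T - 1):
    # clip keeps the noisy toy state inside the logistic map's domain
    x[t + 1] = np.clip(F(x[t]) + 0.001 * rng.standard_normal(), 0.0, 1.0)   # nu_1
    y[t + 1] = G(x[t + 1]) + 0.01 * rng.standard_normal()                   # nu_2

# Reconstructing F from the record y alone is the (generally ill-posed)
# inverse problem discussed in the text.
```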
3 Regularization Framework

3.1 Machine Learning. Consider the following machine learning problem: Given a set of observation data (learning examples) {(x_i, y_i) ∈ ℝ^N × ℝ}_{i=1}^{ℓ} ⊂ X × Y, the learning machine f is expected to find the solution to the inverse problem. In other words, it needs to approximate a real function in the hypothesis space satisfying the constraints f(x_i) = y(x_i) ≡ y_i, where y(x) is supposed to be a deterministic function in the target space. Hence, the learning can be viewed as a multivariate functional approximation problem (Poggio & Girosi, 1990a, 1990b). It can also be viewed as an interpolation problem (Powell, 1985; Micchelli, 1986; Broomhead & Lowe, 1988), where f is an interpolant parameterized by weights w. Note that this problem is ill posed in that the approximants satisfying the constraints are not unique. To solve the ill-posed problem, we usually require some smoothness property of the solution f, and regularization theory naturally comes in. Statistically, the approximation accuracy is measured by the expectation of the approximation error. In the Hilbert space,⁶ denoted as H, the expected risk functional may be expressed as

R = \int_{X \times Y} L(x, y)\, dP(x, y) = \int_{X \times Y} L(x, y)\, p(x, y)\, dx\, dy,   (3.1)
where L(x, y) represents the loss functional. A common loss function is the mean squared error defined by the L₂ norm. Suppose y is given by a nonlinear function f(x) corrupted by additive white noise independent of x: y = f(x) + ε, where ε is bounded and follows an unknown probability metric μ(ε) (namely, the noise ε can have various probability density models, as we discuss in section 8). In that case, p(x, y) = p(x)p(y | x), and the conditional probability density p(y | x) is represented by the metric function μ[y − f(x)]. In particular, the expected risk functional with the L₂ norm is given by

R = \int_{X \times Y} [y - f(x)]^2\, p(x, y)\, dx\, dy.   (3.2)
In practice, the joint probability p(x, y) is unknown, and an estimate of R based on a finite number (say, ℓ) of observations is used instead, with the empirical risk functional

R_{\mathrm{emp}} = \sum_{i=1}^{\ell} [y_i - f(x_i)]^2,   (3.3)
⁶ Hilbert space is defined as an inner product space that is complete in the norm induced by the inner product. The common measure space where the Lebesgue square-integrable functions are defined is a special case of Hilbert space: the L₂ space.
which introduces an estimate ŷ(x) (ŷ = y − ε = f(x)). Quantitatively, suppose A ≤ L(x, y) ≤ B and 0 ≤ η ≤ 1; for the loss taking the value η, the generalization error has the upper bound (Vapnik, 1995)

R \le R_{\mathrm{emp}} + (B - A) \sqrt{\frac{d_{VC} \log(1 + 2\ell/d_{VC}) - \log(\eta/4)}{\ell}}   (3.4)

with probability 1 − η, where d_VC is a nonnegative integer called the VC dimension, which is a capacity metric of the learning machine. The second term on the right-hand side of equation 3.4 determines the VC confidence.⁷

3.2 Tikhonov Regularization. In the theory of regularization, the expected risk is decomposed into two parts, the empirical risk functional R_emp[f] and the regularizer risk functional R_reg[f]:
R[f] = R_{\mathrm{emp}}[f] + \lambda R_{\mathrm{reg}}[f] = \frac{1}{2} \sum_{i=1}^{\ell} [y_i - f(x_i)]^2 + \frac{1}{2} \lambda \|Df\|^2,   (3.5)
where ‖·‖ is the norm operator⁸ and λ is a regularization parameter that controls the trade-off between the fidelity (goodness of fit) to the data and the roughness of the solution. D is a linear differential operator, which is defined as the Fréchet differential of the Tikhonov functional (Tikhonov & Arsenin, 1977; Haykin, 1999). Geometrically, D is interpreted as a local linear approximation of the curve (or manifold) in high-dimensional space. The smoothness prior implicated in D makes the solution stable and insensitive to noise.

Definition 2, Fréchet differential (Balakrishnan, 1976). A function f mapping X to Y is said to be Fréchet differentiable at a point x if, for every h in X,

\lim_{\varepsilon \to 0} \frac{f(x + \varepsilon h) - f(x)}{\varepsilon} = df(x, h)

exists and defines a linear bounded transformation (in h) mapping X into Y. df(x, h) = F(x)h is the Fréchet differential; F(x) is the Fréchet derivative.
exists, and denes a linear bounded transformation (in h) mapping X into Y. df (x, h) D F ( x) h is the FrÂechet differential; F ( x ) is the FrÂechet derivative. 7 See Niyogi and Girosi (1996, 1999) for detailed discussions on the approximation error (bias) and estimation error (variance). 8 It is usually referred to the L2 norm in the Hilbert space unless stated otherwise. Also note that it can be dened in a particular form in the RKHS.
Since the Fréchet differential is regarded as the best local linear approximation of a functional, we have

dR(f, h) = \left. \frac{d}{d\beta} R(f + \beta h) \right|_{\beta = 0},   (3.6)

where h(x) is a fixed function of x. By using the Riesz representation theorem (Yosida, 1978; Debnath & Mikusinski, 1999) and following the steps in Haykin (1999), we have

dR_{\mathrm{emp}}(f, h) = \left. \frac{d}{d\beta} R_{\mathrm{emp}}(f + \beta h) \right|_{\beta = 0} = -\sum_{i=1}^{\ell} [y_i - f(x_i)] h(x_i) = -\left\langle h, \sum_{i=1}^{\ell} (y_i - f(x_i))\, \delta(x - x_i) \right\rangle,   (3.7)

where δ(·) is the Dirac delta function and ⟨·, ·⟩ denotes the inner product of two functionals in the Hilbert space H. Similarly, the Fréchet differential of the regularizing term R_reg is written as

dR_{\mathrm{reg}}(f, h) = \left. \frac{d}{d\beta} R_{\mathrm{reg}}(f + \beta h) \right|_{\beta = 0} = \left. \int D[f + \beta h]\, Dh\, dx \right|_{\beta = 0} = \int Df\, Dh\, dx = \langle Dh, Df \rangle.   (3.8)
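The differential in equation 3.7 can be verified numerically, assuming NumPy is available; the sample points, targets, and the functions chosen for f and h are illustrative assumptions:

```python
import numpy as np

# Numerical check of equation 3.7: the Frechet differential of
# R_emp[f] = 0.5 * sum_i (y_i - f(x_i))^2 in direction h equals
# -sum_i (y_i - f(x_i)) h(x_i).

x = np.linspace(0.0, 1.0, 5)
y = np.array([0.1, 0.4, 0.2, 0.8, 0.5])
f = lambda x: x ** 2            # current hypothesis f (illustrative)
h = lambda x: np.cos(3 * x)     # perturbation direction h (illustrative)

def R_emp(beta):
    return 0.5 * np.sum((y - (f(x) + beta * h(x))) ** 2)

beta = 1e-6
numeric = (R_emp(beta) - R_emp(-beta)) / (2 * beta)   # central difference
analytic = -np.sum((y - f(x)) * h(x))                 # right side of eq. 3.7
print(numeric, analytic)
```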
The above results are well known in the spatial domain. See Haykin (1999) for details.

3.3 Operator Theory and Green's Function. Instead of presenting a rigorous mathematical treatment of operator theory, we briefly introduce the basic concepts and theorems that are relevant to our purposes and sketch a picture of differential and integral operators and their associated Green's (kernel) functions. More information can be found in books on functional analysis (Balakrishnan, 1976; Yosida, 1978; Kress, 1989; Debnath & Mikusinski, 1999).

3.3.1 Operator. Roughly speaking, an operator is a kind of correspondence relating one function to another in terms of a differential or integral equation (and, thus, the differential or integral operator arises from
the correspondence). An operator can be linear or nonlinear, finite dimensional or infinite dimensional. We discuss only linear operators here. Given a bounded linear operator A: X → Y, its adjoint operator is defined as Ã: Y → X such that ⟨Ax, y⟩ = ⟨x, Ãy⟩ (x ∈ X, y ∈ Y) and ‖A‖ = ‖Ã‖. It is called adjoint because it describes the inverse correspondence between the two functions in the operator, with range and domain exchanged. The role of the adjoint operator is similar to the (conjugate) transpose of a (complex) matrix. An operator is a self-adjoint operator if it is equal to its adjoint operator; in other words, ⟨Ax, x′⟩ = ⟨x′, Ãx⟩ (x, x′ ∈ X). Analogous to matrix computation, we can imagine functionals as infinite-dimensional vectors in the sense of generalized functions, and linear operators can be viewed as infinite-dimensional matrices. Solving a linear differential (integral) operator equation is in essence solving a differential (integral) equation. Many concepts familiar in linear algebra and matrix theory (e.g., norm, determinant, condition number, trace) can be extended to operator theory. The spectral theory is used to define, in operator language, the properties similar to eigenvectors and eigenvalues in matrix computation. Given a differential or integral operator A, we may define the eigenvectors as follows: find the values ξ for which there exists a nonzero function f satisfying the equation Af = ξf; the function f satisfying this equation is called the eigenvector of the operator A, and ξ is the corresponding eigenvalue.

Definition 3. Given a positive-definite kernel function K, an integral operator T is defined by Tf = f(s) = ∫ K(s, x) f(x) dx; its adjoint operator T̃ is defined by T̃f = f(x) = ∫ K(x, s) f(s) ds. T is self-adjoint if and only if K(s, x) = K̄(x, s) for all s and x, where the overbar denotes the complex conjugate.

Remarks. An integral operator with a symmetric kernel K(s, x) is self-adjoint.
Note that definition 3 is valid only for homogeneous integral equations.⁹ The integral equation given in definition 3 is sometimes called a Fredholm integral equation of the first kind (Kress, 1989). From definition 3, one may have TT̃f = g:

g(s) = \int K'(s, x)\, f(x)\, dx,   (3.9)

K'(s, x) = \int K(s, t)\, K(x, t)\, dt.   (3.10)

⁹ For the inhomogeneous integral equation, we have f(s) = ∫ K(s, x) f(x) dx + h(s), where h(s) is a given function. It is also called a Fredholm integral equation of the second kind.
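Equations 3.9 and 3.10 can be illustrated with a discretized operator, assuming NumPy is available; the grid and the deliberately nonsymmetric kernel are illustrative assumptions. Composing the quadrature matrix of T with its transpose yields the kernel K′, which is symmetric and positive semidefinite, hence self-adjoint:

```python
import numpy as np

# Discrete sketch of equations 3.9-3.10: on a grid, the composed kernel
# K'(s, x) = int K(s, t) K(x, t) dt becomes a matrix product.

n = 64
t = np.linspace(-3.0, 3.0, n)
dt = t[1] - t[0]

# K[i, j] ~ K(s_i, t_j); the sin factor breaks the symmetry of K itself
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2) * (1.0 + 0.5 * np.sin(t))[:, None]

Kprime = K @ K.T * dt            # quadrature for K'(s, x)

sym_err = np.max(np.abs(Kprime - Kprime.T))   # K' is symmetric ...
eigvals = np.linalg.eigvalsh(Kprime)          # ... and positive semidefinite
print(sym_err, eigvals.min())
```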
Denoting K = TT̃, we know, by definition, that K is still an integral operator with the associated kernel function K′, and K is also a self-adjoint operator; likewise, L = D̃D is still a differential operator.¹⁰
The integral operator essentially calculates E[f(x)] (where E[·] represents the mathematical expectation), provided K(s, x) takes the form of a Backus-Gilbert averaging kernel: K(s, x) = \sum_{i=1}^{m} s_i(x) k_i(s), where k_i(·) is a known smooth function (O'Sullivan, 1986). In particular, K(s, x) can be a covariance kernel if T is nuclear (Balakrishnan, 1976).

3.3.2 Norm. The norm operator essentially determines the smoothness of a functional, depending on the functional space where the inner product is defined. For example, a general norm in the Hilbert space H^m is defined by

\|f\|_{H^m} = \left[ \int_{\mathbb{R}} \left( |f|^2 + \left|\frac{\partial f}{\partial x}\right|^2 + \cdots + \left|\frac{\partial^m f}{\partial x^m}\right|^2 \right) dx \right]^{1/2} = \left[ \sum_{k=0}^{m} \left\| \frac{\partial^k f}{\partial x^k} \right\|_2^2 \right]^{1/2};

when m = 0, it reduces to the L₂ norm. If the norm operator is defined in the Sobolev space W_p^m (Adams, 1975),¹¹ the norm is given by

\|f\|_{W_p^m} = \left[ \int_{\mathbb{R}} \left( |f|^p + \left|\frac{\partial f}{\partial x}\right|^p + \cdots + \left|\frac{\partial^m f}{\partial x^m}\right|^p \right) dx \right]^{1/p}.   (3.11)

In other words, for 0 ≤ p ≤ ∞ and m ∈ {0, 1, …}, the functional f is said to belong to the Sobolev space W_p^m if it is m-times differentiable, with an associated norm ‖f‖_{W_p^m} = ‖f‖_p + ‖f^{(m)}‖_p. Based on different norm operators, different smoothness functionals can be defined. For example, the smoothing splines arise from the case m = 2, p = 2 (Wahba, 1990). In the RKHS (Aronszajn, 1950), the Hilbert norm is defined by¹²

\|f\|_{H^m} \equiv \|f\|_K = \int_{\mathbb{R}^N} \frac{|F(s)|^2}{\tilde{K}(s)}\, ds,   (3.12)

where K is the (unique) kernel function associated with the RKHS and K̃ is the Fourier transform of K. For this reason, the RKHS is sometimes called a proper functional Hilbert space.

¹⁰ It can be understood by observing (Lu, v) = (D̃Du, v) = (Du, Dv) and (Ku, v) = (TT̃u, v) = (T̃u, T̃v).
¹¹ Hilbert space is a special case of Sobolev space, H^m = W_2^m, and the L₂ space is W_2^0.
¹² For a discussion of iterative evaluation of the RKHS norm, see Kailath (1971).
3.3.3 Green's Function. It is well known in functional analysis (Yosida, 1978) that given a positive integral operator T, we can always find a (pseudo-)differential operator¹³ D as its inverse. The operator D corresponds to the inner product of the RKHS with a reproducing kernel K associated with T, where the kernel K is called the Green's function of the differential operator D. If the operator D is a certain form of pseudodifferential operator, the functional space induced by the kernel¹⁴ is a Sobolev space (Adams, 1975). In analogy to the inverse of a matrix, the Green's function represents the inverse of a (sufficiently regular) linear differential operator (Lanczos, 1961). For a large class of problems, it appears in the form of a kernel function that depends on the position of two points in the given domain. The Green's function can be defined as the solution of a certain differential equation that has the Dirac delta function as the driving force. The reciprocity theorem makes it possible to define the Green's function in terms of either a differential operator or its adjoint operator (see Lanczos, 1961, chap. 5 for details). Mathematically, we have the following definition:

Definition 4 (Courant & Hilbert, 1970). Given a linear differential operator L, the function G(x, ξ) is said to be the Green's function for L if it has the following properties:

1. For a fixed ξ, G(x, ξ) is a function of x and satisfies the given boundary conditions.

2. Except at the point x = ξ, the derivatives of G(x, ξ) with respect to x are all continuous; the number of derivatives is determined by the order of the operator L.

3. With G(x, ξ) considered as a function of x, it satisfies the partial differential equation LG(x, ξ) = δ(x − ξ).

Denoting φ(x) a continuous function of x ∈ ℝ^N, it follows that the function

F(x) = \int_{\mathbb{R}^N} G(x, \xi)\, \varphi(\xi)\, d\xi   (3.13)

is the solution of the differential equation

LF(x) = \varphi(x),   (3.14)

¹³ The pseudodifferential operator differs from the differential operator in that it may contain an infinite sum of differential operators and its Fourier transform is not necessarily a polynomial.
¹⁴ There exists a one-to-one correspondence between the operator and the kernel according to Schwartz's kernel theorem.
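Definition 4 and equations 3.13 and 3.14 have a direct discrete analogue, assuming NumPy is available; the operator −d²/dx² with zero boundary values, the grid size, and the source term are illustrative assumptions. The inverse of the finite-difference matrix of L plays the role of G, each column being the response to a discrete delta source, and the discrete convention absorbs the quadrature weight of equation 3.13:

```python
import numpy as np

# Discrete Green's function for L = -d^2/dx^2 with zero boundary values.

n = 50
h = 1.0 / (n + 1)
L = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
G = np.linalg.inv(L)                  # discrete Green's function: L G = I

xs = np.linspace(h, 1.0 - h, n)       # interior grid points
q = np.sin(np.pi * xs)                # illustrative source term
F = G @ q                             # superposition of Green's functions (3.13)
print(np.max(np.abs(L @ F - q)))      # F solves L F = q, i.e., equation 3.14
```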
where G(x, ξ) is the Green's function for the linear differential operator L. The proof can be found in Courant and Hilbert (1970) and Haykin (1999).

3.4 Fourier Analysis and Spectral Regularization. Studying regularization theory in the Fourier domain dates back to Kimeldorf and Wahba (1970, 1971), Duchon (1977), Micchelli (1986), Wahba (1990), and Girosi (1992). It has also been treated fairly well in the kernel learning framework (e.g., Schölkopf & Smola, 2002). The common approach in the literature is to define the smoothness of a stabilizer in some functional space, for example, by the seminorms in the Banach space (Duchon, 1977; Micchelli, 1986) or the Hilbert norm in the Sobolev space (Kimeldorf & Wahba, 1970). (See Girosi, 1992, 1993; Girosi et al., 1995, for details.) In what follows, we establish the spectral regularization framework in the Fourier domain in a slightly different way. Starting with the definition of the Fourier operator and the Parseval theorem, we define the spectral operator and further discuss the regularization scheme in terms of operators and integral equations. To derive the regularization solution, we make use of operator theory and Green's function to establish the spatial and spectral regularization frameworks, with discussions relating to other relevant work. We also point out important theorems and give another proof of the regularization solution by using the Riesz representation theorem.

3.4.1 Spectral Operator. Essentially, the spectral operator is an integral operator given by definition 3:

Tf = \int_{\mathbb{R}^N} K(s, x)\, f(x)\, dx.

In the case of the Fourier operator, K(s, x) = \exp(-j\langle s, x \rangle), where j ≡ √−1. Formally, we have:

Definition 5. For any functional f(x) ∈ H, the Fourier operator T is defined by Tf = F: F(s) = \int_{-\infty}^{+\infty} f(x) \exp(-jxs)\, dx, with F(s) ∈ H.¹⁵

Theorem 1, Plancherel identity (Yosida, 1978). Given two functionals f, g ∈ H and their corresponding Fourier transforms F and G, the Plancherel identity (Parseval theorem) states that ⟨f(x), g(x)⟩ = (1/2π)⟨F(s), G(s)⟩.¹⁶ Written in operator form, it is expressed by ⟨f, g⟩ = ⟨Tf, Tg⟩.

¹⁵ The range and domain of the Fourier operator are both in the Hilbert space.
¹⁶ For simplicity of notation, we henceforth take all the constants that appear in the definitions of the (inverse) Fourier transform to be 1. Note that ⟨F(s), G(s)⟩ = ∫ F(s) Ḡ(s) ds.
Remarks. If we define the differential operator D as

D = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!} \frac{d^n}{dx^n},   (3.15)

then its corresponding spectral operator is

T_D = \sum_{n=0}^{\infty} \frac{(-1)^n (js)^n}{n!} = \exp(-js),   (3.16)

where T_D denotes the Fourier operator functioning on a differential operator D (with respect to f); namely, T_D f = T(Df) = D(Tf) = DF.

Recalling equation 3.10, for the Fourier operator T, the kernel function associated with the operator K is given by

K'(x, x_i) = \int \exp(jsx) \exp(-jsx_i)\, ds = \delta(x - x_i),

and it follows that

Kf = \int K'(x, x_i)\, f(s)\, ds = \int \delta(x - x_i)\, f(s)\, ds = f(x - x_i).   (3.17)

Example 1, Dirichlet kernel (Lanczos, 1961; Vapnik, 1995, 1998a).

K(\theta) = \frac{\sin\left(n + \frac{1}{2}\right)\theta}{2\pi \sin\frac{\theta}{2}}.   (3.18)

From equation 3.18, the truncated Fourier series is written as

f_n(x) = \int_{-\pi}^{\pi} f(s)\, K_n(s, x)\, ds,   (3.19)

where

K_n(s, x) = \frac{1}{\pi} \sum_{k=0}^{n} (\cos ks \cos kx + \sin ks \sin kx) = \frac{1}{\pi} \sum_{k=0}^{n} \cos k(s - x) = K_n(s - x),   (3.20)
where K_n(s − x) defines an integral operator in the sense that

f(x) = \int_{-\pi}^{\pi} f(s)\, K(s - x)\, ds.   (3.21)
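The truncation property of equations 3.19 through 3.21 can be checked numerically, assuming NumPy is available; the grid and the zero-mean trigonometric test signal are illustrative assumptions, chosen so the k = 0 term of the kernel plays no role. Integrating f against K_n(s − x) returns the order-n truncation of its Fourier expansion:

```python
import numpy as np

# Numerical check of equations 3.19-3.21 with the kernel
# K_n(s - x) = (1/pi) * sum_{k=0..n} cos k(s - x).

n_grid = 2000
s = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
ds = s[1] - s[0]
f = np.sin(3 * s) + 0.5 * np.sin(7 * s)   # illustrative zero-mean signal

def K_n(theta, n):
    return sum(np.cos(k * theta) for k in range(n + 1)) / np.pi

def f_trunc(x, n):
    # f_n(x) = integral over (-pi, pi) of f(s) K_n(s - x) ds
    return np.sum(f * K_n(s - x, n)) * ds

x0 = 0.7
print(f_trunc(x0, 5), np.sin(3 * x0))                          # sin(7s) filtered out
print(f_trunc(x0, 9), np.sin(3 * x0) + 0.5 * np.sin(7 * x0))   # both terms kept
```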
In particular, the Dirac delta function δ(s, x) has the series 3.19 as its Fourier expansion when n → ∞ (see appendix A for the proof). Hence, the Dirichlet kernel corresponds to a truncated Fourier expansion mapping (Burges, 1999).

Example 2, Fejér kernel (Lanczos, 1961; Vapnik, 1998a).

W_n(\theta) = \frac{\sin^2 \frac{n\theta}{2}}{2\pi n \sin^2 \frac{\theta}{2}},

which is the arithmetic mean of the Dirichlet kernel. In other words, the Fourier coefficients of the Fejér kernel are weighted versions of those of the Dirichlet kernel, dependent on n: a′_k = (1 − k/n) a_k, b′_k = (1 − k/n) b_k.

Other examples, such as the periodic gaussian kernel, B-spline kernel, Laplacian kernel, and regularized Fourier expansion kernel, can be found in Vapnik (1998a, 1998b), Smola, Schölkopf, and Müller (1998), and Schölkopf and Smola (2002). It is noted that the Dirichlet kernel, Fejér kernel, and B-spline kernel are interpolation kernels; translationally invariant kernels (e.g., gaussian) are convolution kernels; and polynomial kernels and the Fourier kernel are dot product kernels.

3.4.2 Regularization Scheme. Consider the operator equation Aφ = f (A: X → Y); the regularization scheme is to find an approximate solution φ_ε related to φ such that ‖φ_ε − φ‖ ≤ ε, where ε is a small, positive value. Generally, it consists of finding a linear operator R_λ: Y → X (λ > 0) with the property of pointwise convergence lim_{λ→0} R_λ A φ = φ for all φ ∈ X or, equivalently, R_λ f → A⁻¹ f as λ → 0. However, as shown in Kress (1989, theorem 15.6), for the regularization scheme, the operator R_λ cannot be uniformly bounded with respect to λ, and the operator R_λ A cannot be norm convergent as λ → 0. In particular, one has the approximation error

‖φ_ε^λ − φ‖ ≤ ‖R_λ A φ − φ‖ + ε ‖R_λ‖,

which states that the error consists of two parts: the first term on the right-hand side is due to the approximation error between R_λ and A⁻¹, and the second term reflects the influence of incorrect data. In general, the first term decreases as λ → 0, whereas the second term increases as λ → 0. The regularization parameter λ controls the trade-off between accuracy (the first term) and stability (the second term). The regularization approach
amounts to nding an appropriate l to achieve the minimum error of the regularized risk functional. Denition 6 (Kress, 1989). The regularization scheme Rl , namely, the choice of regularization parameter l D l(2 ) , is called regular if for all f 2 A ( X) and all f 2 2 Y with k f 2 ¡ f k · 2 , there holds Rl(2 ) f 2 ! A¡1 f, 2 ! 0. By using the tools of eigendecomposition (ED) and singular value decomposition (SVD) in spectral theory, we have the following theorems: Theorem 2 (Kress, 1989). A ( X ) ? D N ( AQ )
and
For a bounded linear operator, there holds N ( AQ ) ? D A ( X ) ,
where A ( X ) ? means for all Q 2 X and g 2 A ( X ) ? , hAQ, gi D 0; N ( AQ ) denotes the Q in the sense that Ag Q D 0 for all g. null space of A, Theorem 3 (Kress, 1989). Let X be a Hilbert space, and let A: X ! X be a (nonzero) self-adjoint compact operator. Then all eigenvalues of A are real. All eigenspaces N (j I ¡ A ) for nonzero eigenvalues j have nite dimension, and eigenspaces associated with different eigenvalues are orthogonal. Suppose the eigenvalues are ordered such that |j1 | ¸ |j2 | ¸ ¢ ¢ ¢, and denote by Pn : X ! N (jn I ¡ A ) the orthogonal projection onto the eigenspace for the eigenvalue jn ; then there holds AD
1 X
j n Pn
nD 1
in the sense of norm convergence. Let $Q\colon X \to N(A)$ denote the orthogonal projection onto the null space $N(A)$; then there holds
$$\varphi = \sum_{n=1}^{\infty} P_n \varphi + Q\varphi$$
for all $\varphi \in X$. In the case of an orthonormal basis, $\langle \varphi_n, \varphi_k\rangle = \delta_{n,k}$, we have the following expansion representation:
$$A\varphi = \sum_{n=1}^{\infty} \xi_n \langle \varphi, \varphi_n\rangle \varphi_n, \qquad \varphi = \sum_{n=1}^{\infty} \langle \varphi, \varphi_n\rangle \varphi_n + Q\varphi.$$
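The expansion above can be checked numerically in finite dimensions. The sketch below is an illustrative aside (the random matrix and vector are assumptions, not objects from the text): a symmetric matrix plays the role of the self-adjoint operator, and both $\varphi = \sum_n \langle\varphi, \varphi_n\rangle \varphi_n$ and $A\varphi = \sum_n \xi_n \langle\varphi, \varphi_n\rangle \varphi_n$ are verified over its orthonormal eigenvectors.

```python
import numpy as np

# Finite-dimensional sketch of theorem 3: for a symmetric (self-adjoint)
# matrix A, any vector phi expands over the orthonormal eigenvectors q_n as
# phi = sum_n <phi, q_n> q_n and A phi = sum_n xi_n <phi, q_n> q_n.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B + B.T                        # symmetric, so all eigenvalues are real

xi, Q = np.linalg.eigh(A)          # ED: A = Q diag(xi) Q^T
phi = rng.standard_normal(5)

coeffs = Q.T @ phi                 # <phi, q_n>
phi_rec = Q @ coeffs               # sum_n <phi, q_n> q_n
Aphi_rec = Q @ (xi * coeffs)       # sum_n xi_n <phi, q_n> q_n
print(np.allclose(phi_rec, phi), np.allclose(Aphi_rec, A @ phi))
```

In the compact-operator setting the sums are infinite and a null-space term $Q\varphi$ may remain; here the matrix is full rank, so the expansion is exact.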
2806
Zhe Chen and Simon Haykin
Theorem 4 (Kress, 1989). Let $X$ and $Y$ be Hilbert spaces. Let $A\colon X \to Y$ be a linear compact operator and $\tilde{A}\colon Y \to X$ be its adjoint. Let $\sigma_n$ denote the singular values of $A$, which are the square roots of the eigenvalues of the self-adjoint compact operator $\tilde{A}A\colon X \to X$. Let $\{\sigma_n\}$ be an ordered sequence of nonzero singular values of $A$ according to the dimension of the null spaces $N(\sigma_n^2 I - \tilde{A}A)$. Then there exist orthonormal sequences $\{\varphi_n\}$ in $X$ and $\{g_n\}$ in $Y$ such that
$$A\varphi_n = \sigma_n g_n, \qquad \tilde{A}g_n = \sigma_n \varphi_n$$
for all integers $n$. For each $\varphi \in X$, there holds the SVD
$$\varphi = \sum_{n=1}^{\infty} \langle \varphi, \varphi_n\rangle \varphi_n + Q\varphi$$
with the orthogonal projection $Q\colon X \to N(A)$, and
$$A\varphi = \sum_{n=1}^{\infty} \sigma_n \langle \varphi, \varphi_n\rangle g_n.$$
Theorem 5, Picard theorem (Kress, 1989). Let $A\colon X \to Y$ be a linear compact operator with singular system $(\sigma_n, \varphi_n, g_n)$. The Fredholm integral equation of the first kind $A\varphi = f$ is solvable if and only if $f \in N(\tilde{A})^\perp$ and
$$\sum_{n=1}^{\infty} \frac{1}{\sigma_n^2} |\langle f, g_n\rangle|^2 < \infty.$$
Then a solution is given by
$$\varphi = \sum_{n=1}^{\infty} \frac{1}{\sigma_n} \langle f, g_n\rangle \varphi_n.$$
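The instability the Picard theorem warns about can be seen directly in a discretized example. The sketch below is an illustrative assumption (the gaussian kernel operator and perturbation are not from the text): a tiny perturbation of $f$ along a direction with small $\sigma_n$ is amplified by $1/\sigma_n$ in the naive solution, while a Tikhonov-filtered inverse stays tame.

```python
import numpy as np

# Sketch of the Picard theorem's practical content: in
# phi = sum_n (1/sigma_n) <f, g_n> phi_n, a data perturbation along g_n is
# amplified by 1/sigma_n, so rapidly decaying singular values make naive
# inversion unstable.  A is a toy discretized gaussian smoothing kernel.
n = 20
x = np.linspace(0.0, 1.0, n)
A = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5) / n   # smooth => sigma_n decay fast

U, s, Vt = np.linalg.svd(A)
s = np.maximum(s, 1e-30)           # guard against exact floating-point zeros
phi_true = np.sin(2 * np.pi * x)
f = A @ phi_true

rng = np.random.default_rng(1)
f_eps = f + 1e-6 * rng.standard_normal(n)               # tiny data perturbation

phi_naive = Vt.T @ ((U.T @ f_eps) / s)                  # unfiltered inverse: blows up
lam = 1e-6
phi_reg = Vt.T @ (s * (U.T @ f_eps) / (s ** 2 + lam))   # Tikhonov-filtered inverse
err_naive = np.linalg.norm(phi_naive - phi_true)
err_reg = np.linalg.norm(phi_reg - phi_true)
print(err_naive, err_reg)
```

The regularized error stays bounded while the naive error explodes, which is exactly the trade-off the rest of this section formalizes.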
The Picard theorem essentially describes the ill-posed nature of the integral equation $A\varphi = f$. The perturbation ratio $\|\varphi^\varepsilon\| / \|f^\varepsilon\| = \varepsilon / \sigma_n$ determines the degree of ill posedness: the more quickly $\sigma_n$ decays, the more severe the ill posedness. For more examples of regularization in terms of operators and Fredholm integral equations of the first kind, see Kress (1989).

3.4.3 Regularization Solution. By virtue of theorem 1, it follows that
$$\langle Dh, Df\rangle = \langle T_D h, T_D f\rangle. \tag{3.22}$$
For ease of notation, we henceforth simply denote $T_D$ as $T$. From equation 3.22, equation 3.8 is rewritten as
$$dR_{\mathrm{reg}}(f, h) = \int Df\, Dh\, d\mathbf{x} = \langle Dh, Df\rangle = \int Tf\, Th\, d\mathbf{s} = \langle Th, Tf\rangle. \tag{3.23}$$
For any pair of functions $u(\mathbf{x})$ and $v(\mathbf{x})$, given a linear differential operator $D$ and its associated Fourier operator $T$ (i.e., $T_D$), their adjoint operators, $\tilde{D}$ and $\tilde{T}$, are uniquely determined to satisfy the boundary conditions
$$\int_{R^N} u(\mathbf{x})\, Dv(\mathbf{x})\, d\mathbf{x} = \int_{R^N} v(\mathbf{x})\, \tilde{D}u(\mathbf{x})\, d\mathbf{x} \tag{3.24}$$
and
$$\int_{V} u(\mathbf{s})\, Tv(\mathbf{s})\, d\mathbf{s} = \int_{V} v(\mathbf{s})\, \tilde{T}u(\mathbf{s})\, d\mathbf{s}, \tag{3.25}$$
where $V$ represents the spectrum support in the frequency domain. Equation 3.24 is called Green's identity, which carries the bilinear identity of matrix calculus into the realm of function space (Lanczos, 1961). Equation 3.25 follows from the facts that $\langle u(\mathbf{x}), Dv(\mathbf{x})\rangle = \langle \tilde{D}u(\mathbf{x}), v(\mathbf{x})\rangle$ and $\langle u(\mathbf{x}), Dv(\mathbf{x})\rangle = \langle u(\mathbf{s}), Tv(\mathbf{s})\rangle$. Using Green's identity and setting $u(\mathbf{x}) = Df(\mathbf{x})$ and $Dv(\mathbf{x}) = Dh(\mathbf{x})$, we further obtain an equivalent form of equation 3.23 in the light of equation 3.24, as shown by
$$dR_{\mathrm{reg}}(f, h) = \int h(\mathbf{x})\, \tilde{D}Df(\mathbf{x})\, d\mathbf{x} = \langle h, \tilde{D}Df\rangle(\mathbf{x}). \tag{3.26}$$
On the other hand, we also obtain another form in the light of equation 3.25 by setting $u(\mathbf{s}) = Tf(\mathbf{s})$ and $Tv(\mathbf{s}) = Th(\mathbf{s})$, as given by
$$dR_{\mathrm{reg}}(f, h) = \int h(\mathbf{s})\, \tilde{T}Tf(\mathbf{s})\, d\mathbf{s} = \langle h, \tilde{T}Tf\rangle(\mathbf{s}). \tag{3.27}$$
Thus, the condition that the Fréchet differential be zero, $dR(f, h) = dR_{\mathrm{emp}}(f, h) + \lambda\, dR_{\mathrm{reg}}(f, h) = 0$, can be rewritten, by virtue of equations 3.8, 3.26, and 3.27, in the following form,
$$dR(f, h) = \left\langle h(\mathbf{x}),\ \tilde{D}Df(\mathbf{x}) - \frac{1}{\lambda} \sum_{i}^{\ell} (y_i - f)\,\delta(\mathbf{x} - \mathbf{x}_i) \right\rangle, \tag{3.28}$$
or
$$dR(f, h) = \left\langle h(\mathbf{s}),\ \tilde{T}Tf(\mathbf{s}) - F\left\{\frac{1}{\lambda} \sum_{i}^{\ell} (y_i - f)\,\delta(\mathbf{x} - \mathbf{x}_i)\right\} \right\rangle, \tag{3.29}$$
where $F\{\cdot\}$ denotes taking the Fourier transform. The necessary condition for $f(\mathbf{x})$ to be an extremum of $R[f]$ is that $dR(f, h) = 0$ for all $h \in H$, or, equivalently, that the following conditions are satisfied in the distribution sense:
$$\tilde{D}Df_\lambda(\mathbf{x}) = \frac{1}{\lambda} \sum_{i=1}^{\ell} [y_i - f(\mathbf{x}_i)]\,\delta(\mathbf{x} - \mathbf{x}_i) \tag{3.30}$$
and
$$\tilde{T}Tf_\lambda(\mathbf{s}) = F\left\{\frac{1}{\lambda} \sum_{i=1}^{\ell} [y_i - f(\mathbf{x}_i)]\,\delta(\mathbf{x} - \mathbf{x}_i)\right\} = \frac{1}{\lambda} \sum_{i=1}^{\ell} [y_i - f(\mathbf{x}_i)] \exp(-j\mathbf{x}_i \mathbf{s}). \tag{3.31}$$
Equation 3.30 is the Euler-Lagrange equation of the Tikhonov functional $R[f]$, and equation 3.31 is its Fourier counterpart. Denoting $L = \tilde{D}D$ and $K = \tilde{T}T$ as earlier, $G(\mathbf{x}, \boldsymbol{\xi})$ is the Green's function for the linear differential operator $L$. Recalling definition 4,
$$LG(\mathbf{x}, \boldsymbol{\xi}) = \delta(\mathbf{x} - \boldsymbol{\xi}), \tag{3.32}$$
we may derive its counterpart in the frequency domain,
$$KG(\mathbf{s}, \boldsymbol{\xi}) = \exp(-j\mathbf{s}\boldsymbol{\xi}). \tag{3.33}$$
Recalling equation 3.13, it follows that
$$f(\mathbf{x}) = \int_{R^N} G(\mathbf{x}, \boldsymbol{\xi})\, Q(\boldsymbol{\xi})\, d\boldsymbol{\xi} \tag{3.34}$$
is the solution of the following differential equation and integral equation,
$$Lf(\mathbf{x}) = Q(\mathbf{x}), \tag{3.35}$$
$$Kf(\mathbf{s}) = W(\mathbf{s}), \tag{3.36}$$
where $W(\mathbf{s})$ is the Fourier transform of $Q(\mathbf{x})$. In the light of equations 3.32 through 3.34, we may derive the solution of the regularization problem as
$$f_\lambda(\mathbf{x}) = \sum_{i=1}^{\ell} w_i G(\mathbf{x}, \mathbf{x}_i), \tag{3.37}$$
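Equation 3.37 can be realized directly as a small linear solve. The sketch below is illustrative (the gaussian Green's function, data, bandwidth, and $\lambda$ are assumptions): the weights are obtained from $(\mathbf{G} + \lambda\mathbf{I})\mathbf{w} = \mathbf{y}$, which makes the relation $w_i = [y_i - f(\mathbf{x}_i)]/\lambda$ quoted just below hold exactly.

```python
import numpy as np

# Minimal sketch of the regularization-network solution of equation 3.37,
# f_lambda(x) = sum_i w_i G(x, x_i), with a gaussian Green's function.
def green(a, b, width=0.1):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * width ** 2))

rng = np.random.default_rng(2)
xi = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * xi) + 0.05 * rng.standard_normal(30)

lam = 1e-2
G = green(xi, xi)
w = np.linalg.solve(G + lam * np.eye(30), y)   # (G + lambda I) w = y

f_at_data = G @ w                              # f_lambda evaluated at the x_i
print(np.allclose((y - f_at_data) / lam, w))   # w_i = [y_i - f(x_i)] / lambda
```

Evaluating `green(x_new, xi) @ w` at new points gives the smoothed interpolant; larger `lam` trades fidelity to the noisy $y_i$ for smoothness.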
where $w_i = [y_i - f(\mathbf{x}_i)]/\lambda$, and $G(\mathbf{x}, \mathbf{x}_i)$ is a positive-definite Green's function for all $i$ (see appendix B for an outline of the proof).

Remarks. Provided the operator $L = \tilde{D}D$ is defined by the Laplace operator (Yuille & Grzywacz, 1988; Poggio & Girosi, 1990a; Haykin, 1999),17
$$L = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!\, 2^n} \nabla^{2n}, \tag{3.38}$$
where
$$\nabla^2 = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \frac{\partial^2}{\partial x_i \partial x_j},$$
then the corresponding operator $K = \tilde{T}T$ in the spectral domain reads
$$K = \sum_{n=0}^{\infty} \frac{(-1)^{2n} s^{2n}}{n!\, 2^n} = \exp\left(\frac{s^2}{2}\right). \tag{3.39}$$
By noting that
$$LG(\mathbf{x}) = \delta(\mathbf{x}), \tag{3.40}$$
$$KG(\mathbf{s}) = 1, \tag{3.41}$$
from the property of Green's function, it further follows that
$$G(\mathbf{s}) = \exp\left(-\frac{\|\mathbf{s}\|^2}{2}\right), \tag{3.42}$$
$$G(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x}\|^2}{2}\right), \tag{3.43}$$
where $G(\mathbf{x}) \leftrightarrow G(\mathbf{s})$ is a Fourier transform pair.
Note that the solution to the Tikhonov regularization problem, given by equation 3.37, is incomplete in the sense that it only represents the solution modulo a term that lies in the null space of the operator $D$ (Poggio & Girosi, 1990a). In general, the solution to the Tikhonov
17 This can also be regarded as an RKHS norm of Lévy's n-dimensional ($n \to \infty$) Brownian motion (Kailath, 1971).
regularization problem is given by (Kimeldorf & Wahba, 1970; Poggio & Girosi, 1990a; Girosi et al., 1995)
$$f_\lambda(\mathbf{x}) = \sum_{i=1}^{\ell} w_i G(\mathbf{x}, \mathbf{x}_i) + b(\mathbf{x}), \tag{3.44}$$
where $b(\mathbf{x})$ is a term that lies in the null space of $D$ (the simplest case is $b(\mathbf{x}) = \mathrm{const}$),18 which satisfies the orthogonality condition $\sum_{i=1}^{\ell} w_i b(\mathbf{x}_i) = 0$. Hence, the functional space of the solution $f_\lambda$ is an RKHS formed as the direct sum of two orthogonal RKHSs.19 Equation 3.44 can be understood by analogy with solving a matrix equation $A\mathbf{x} = \mathbf{b}$. The general solution of the equation is given by $\mathbf{x} = (A^\dagger + Z)\mathbf{b} = ((A^T A)^{-1} A^T + Z)\mathbf{b}$ ($\dagger$ represents the Moore-Penrose pseudoinverse), where $Z$ accounts for the orthogonal (null) space. Hence, the regularization solution 3.37 is somewhat similar to the minimum-norm solution of the matrix equation, as given by $\mathbf{x} = A^\dagger \mathbf{b} = (A^T A)^{-1} A^T \mathbf{b}$.
The essence of regularization is to find a proper subspace, namely, the eigenspace of $Lf$ or $Kf$, within which the operator behaves like a "well-posed" operator (Lanczos, 1961). The solution in the subspace is unique.
Another viewpoint from which to look at the regularization solution is the following. Rewriting equation 3.5 in matrix form,
$$R[\mathbf{f}] = \frac{1}{2}\|\mathbf{y} - \mathbf{f}\|^2 + \frac{1}{2}\lambda (D\mathbf{f})^T (D\mathbf{f}) = \frac{1}{2}(\mathbf{y} - \mathbf{f})^T (\mathbf{y} - \mathbf{f}) + \frac{1}{2}\lambda\, \mathbf{f}^T \tilde{D}D\, \mathbf{f} = \frac{1}{2}(\mathbf{y} - \mathbf{f})^T (\mathbf{y} - \mathbf{f}) + \frac{1}{2}\lambda\, \mathbf{f}^T \mathbf{K} \mathbf{f}, \tag{3.45}$$
and taking the derivative of the risk function with respect to $\mathbf{f}$ and setting it to zero, we have $\mathbf{f} = (\mathbf{I} + \lambda \mathbf{K})^{-1} \mathbf{y}$, which gives rise to a smoothing matrix20 $\mathbf{S} = (\mathbf{I} + \lambda \mathbf{K})^{-1}$, where $\mathbf{K}$ is a quadratic symmetric penalty matrix associated with $L$ (Hastie, 1996). By taking the ED of $\mathbf{S}$, say $\mathbf{S} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T$, $\mathbf{U}^T \mathbf{y}$ expresses $\mathbf{y}$ in terms of the orthonormal basis defined by the column vectors of $\mathbf{U}$, which is quite similar to the Fourier expansion (Hastie, 1996).
18 We notice that for $b(\mathbf{x})$, $Db(\mathbf{x}) = 0$ and $T_D b(\mathbf{s}) = 0$. 19 For a good mathematical treatment of the RKHS in the context of regularization theory, see Wahba (1990, 1999), Corradi and White (1995), Girosi (1998), Vapnik (1998a), and Evgeniou, Pontil, and Poggio (2000). 20 When $\lambda = 0$, the smoothing matrix reduces to an identity matrix, and the learning problem reduces to a pure interpolation problem in the noiseless situation.
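The smoothing-matrix view can be made concrete with a few lines of linear algebra. The sketch below is illustrative (the second-difference penalty matrix is an assumed toy choice of $\mathbf{K}$, not one from the text): the ED of $\mathbf{S} = (\mathbf{I} + \lambda\mathbf{K})^{-1}$ shows each eigencomponent of $\mathbf{y}$ being shrunk by $1/(1 + \lambda\mu_i)$.

```python
import numpy as np

# Sketch of f = (I + lambda*K)^{-1} y: in the eigenbasis of the penalty
# matrix K, each component of y is shrunk by the factor 1/(1 + lambda*mu_i).
n, lam = 10, 0.5
D2 = np.diff(np.eye(n), n=2, axis=0)       # second-difference (roughness) operator
K = D2.T @ D2                              # symmetric quadratic penalty matrix
S = np.linalg.inv(np.eye(n) + lam * K)     # smoothing matrix

mu, U = np.linalg.eigh(K)
shrink = 1.0 / (1.0 + lam * mu)            # per-component shrinkage factors <= 1
S_from_ed = (U * shrink) @ U.T             # ED reconstruction: U diag(shrink) U^T
print(np.allclose(S, S_from_ed))
```

Rough (high-$\mu_i$) components of $\mathbf{y}$ are damped most strongly, which is the matrix counterpart of the low-pass filtering discussed later in this section.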
Observing equation 3.37, one may replace $\sum$ by $\int$, $G(\mathbf{x}, \mathbf{x}_i)$ by $K(\mathbf{x}, \mathbf{x}_i)$, and $w_i$ by $[y_i - f(\mathbf{x}_i)]/\lambda$. Suppose that $K$ is a translationally invariant reproducing kernel that is positive definite and satisfies $K(\mathbf{x}, \mathbf{x}_i) = K(\mathbf{x} - \mathbf{x}_i)$. Then the approximated function can be expressed by the convolution between the observations and a kernel:
$$f(\mathbf{x}) = \frac{1}{\lambda} \int (y_i - f(\mathbf{x}_i)) K(\mathbf{x} - \mathbf{x}_i)\, d\mathbf{x}_i = -\frac{1}{\lambda} \int f(\mathbf{x}_i) K(\mathbf{x} - \mathbf{x}_i)\, d\mathbf{x}_i + \frac{1}{\lambda} \int y_i K(\mathbf{x} - \mathbf{x}_i)\, d\mathbf{x}_i,$$
where the first term has exactly the same form as the integral operator and the second term is similar to kernel regression. Intuitively, $f(\mathbf{x})$ can be reconstructed from the data samples smoothed by an averaged kernel, which accounts for a convolutional window (the so-called Parzen window). In RBF (regularization) networks, the number of RBF (i.e., Green's function) units, $m$, is exactly equal to the number of observations $\ell$. In practice, we may choose $m < \ell$, which corresponds to the so-called hyperbasis function networks or generalized RBF networks. From the matrix-equation viewpoint, there are $\ell$ equations associated with the $\ell$ observation pairs but only $m$ unknowns; hence, it is an overdetermined situation (Golub & Van Loan, 1996). In order to find the minimum-norm solution, we might use the SVD to obtain $\hat{f} = \sum_{i=1}^{m} \frac{\mathbf{v}_i^T \mathbf{G}^T \mathbf{y}}{\sigma_i^2} \mathbf{v}_i$ (assuming the rank of $\mathbf{G}$ is $m$), where $\mathbf{v}_i$ are the vectors of the unitary matrix $\mathbf{V}$ and $\sigma_i$ are the singular values. For more discussion on their optimization, see Moody and Darken (1989), Girosi (1992), and Haykin (1999). The preceding derivation states that the solution of the classical Tikhonov regularization is independent of the domain where the regularizer (stabilizer) is defined, and the regularization is equivalent to the expansion of the solution in terms of a linear superposition of $\ell$ Green's functions centered at the specific observations. Note that some invariance properties are implicitly embedded in $T$,21 which implies that the derived Green's function should be translationally and rotationally invariant.
In other words, $G(\mathbf{x}, \mathbf{x}_i)$ is inherently an RBF with the radially symmetric and shift-invariant form (Poggio & Girosi, 1990a),
$$G(\mathbf{x}, \mathbf{x}_i) = G(|\mathbf{x} - \mathbf{x}_i|). \tag{3.46}$$
More generally, it can be the hyperbasis function (HyperBF) (Poggio & Girosi, 1990a) with the form
$$G(\mathbf{x}, \mathbf{x}_i) = G(|\mathbf{x} - \mathbf{x}_i|^T \mathbf{S}^{-1} |\mathbf{x} - \mathbf{x}_i|), \tag{3.47}$$
21 The Fourier transform is invariant to shift, rotation, and starting point.
where $\mathbf{S}$ is a positive-definite covariance matrix; or it can be the RBF with a weighted norm (the so-called Mahalanobis distance metric),
$$G(\mathbf{x}, \mathbf{x}_i) = G(\|\mathbf{x} - \mathbf{x}_i\|_Q) = G(|\mathbf{x} - \mathbf{x}_i|^T \mathbf{Q}^T \mathbf{Q}\, |\mathbf{x} - \mathbf{x}_i|). \tag{3.48}$$
3.4.4 Spectral Regularization. Thus far, we have been able to establish a functionally equivalent form of the spectral regularizer, the counterpart of equation 3.5,
$$R[f] = R_{\mathrm{emp}}[f] + \lambda R_{\mathrm{reg}}[F] = \frac{1}{2}\sum_{i=1}^{\ell} [y_i - f(\mathbf{x}_i)]^2 + \frac{1}{2}\lambda \|DF\|^2 = \frac{1}{2}\sum_{i=1}^{\ell} [y_i - f(\mathbf{x}_i)]^2 + \frac{1}{2}\lambda \|Tf\|^2. \tag{3.49}$$
Written in matrix form, the spectral regularizer is given by
$$\|Tf\|^2 = (Tf)^T(Tf) = \mathbf{f}^T \tilde{T}T\, \mathbf{f} = \mathbf{f}^T \mathbf{K} \mathbf{f},$$
which again relates to the same smoothing matrix as equation 3.45. By a property of the Fourier transform, when $f(\mathbf{x}) \leftrightarrow F(\mathbf{s})$, one obtains $\frac{\partial^m f(\mathbf{x})}{\partial \mathbf{x}^m} \leftrightarrow (j\mathbf{s})^m F(\mathbf{s})$. Intuitively, we may think of the differential operator $\|Df\|$ as taking an $m$-order derivative in the Hilbert space $H^m$; thus, the corresponding $\|Tf\|^2$ has the form of a seminorm (Micchelli, 1986):
$$\|Tf\|^2 = \int |F(\mathbf{s})|^2 \times \|\mathbf{s}\|^{2m}\, d\mathbf{s}. \tag{3.50}$$
If the norm operator is defined in the RKHS associated with a kernel $K$ and its Fourier transform $K(\mathbf{s}) = \|\mathbf{s}\|^{-2m}$,22 the seminorm reduces to the form
$$\|Tf\|^2 = \int_{R^N} \frac{|F(\mathbf{s})|^2}{K(\mathbf{s})}\, d\mathbf{s}, \tag{3.51}$$
which defines a semi-Hilbert space (Micchelli, 1986). Equation 3.51 has the same form as the spectral penalty stabilizer given in Wahba (1990) and Girosi et al. (1995). $K(\mathbf{s})$ can be viewed as a low-pass filter, which plays a smoothing role in the sense that the solution $f$ can be viewed as the convolution of the observations with a smoothing filter,
$$f(\mathbf{x}) = y(\mathbf{x}) * B(\mathbf{x}), \tag{3.52}$$
22 The smoothness order $m$ has to be large enough to guarantee that $K$ is continuous and decays quickly to zero as $\mathbf{s} \to \infty$; for example, $m > N/2$.
where $*$ denotes the convolution product and $B(\mathbf{x})$ is the inverse Fourier transform of $B(\mathbf{s})$, given by Poggio et al. (1988) and Girosi et al. (1995):
$$B(\mathbf{s}) = \frac{K(\mathbf{s})}{\lambda + K(\mathbf{s})}. \tag{3.53}$$
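The low-pass character of this filter is easy to probe numerically. The sketch below is illustrative (the gaussian spectrum $K(s) = \exp(-s^2/2)$ is an assumed choice of kernel, following equation 3.42): the filter stays near 1 at low frequencies and falls toward 0 at high frequencies.

```python
import numpy as np

# Numerical sketch of the smoothing filter B(s) = K(s) / (lambda + K(s))
# with a gaussian kernel spectrum K(s) = exp(-s^2/2).
s = np.linspace(0.0, 10.0, 101)
K = np.exp(-s ** 2 / 2)

lam = 0.1
B = K / (lam + K)
# B is bounded by 1, passes low frequencies, and suppresses high ones.
# With lam = 0, B would be identically 1 (pure interpolation, no smoothing).
print(B[0], B[-1])
```

Increasing `lam` lowers the cutoff, that is, more aggressive smoothing; `lam = 0` recovers the noiseless interpolation case discussed next.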
When $\lambda = 0$, $f(\mathbf{x}) = y$; it follows that $B(\mathbf{s}) = 1$ and $B(\mathbf{x}) = \delta(\mathbf{x})$, which corresponds to an interpolation problem in the noiseless situation. Many choices of the kernel $K$ can be found in Girosi et al. (1995).

3.4.5 Important Theorems. We briefly mention four important theorems in the literature and indicate their relations to the theory of regularization:

Riesz representation theorem (Yosida, 1978; Debnath & Mikusinski, 1999): Any bounded linear functional on a Hilbert space has a unique representer in that space. This is useful in describing the dual space of any space that contains the compactly supported continuous functions as a dense subspace. Green's function is developed from this theorem (Lanczos, 1961). The norm of the bounded linear operator is the same as the norm of its representer. Based on this theorem, we are able to derive another proof of the regularization solution (see appendix C).

Bochner's theorem (Bochner, 1955; see also Wahba, 1990; Vapnik, 1998a): Among the continuous functions on $R^N$, the positive definite functions are those that are the Fourier transforms of finite measures. Hence, the Fourier transform of a positive measure constitutes a Hilbert-Schmidt kernel.

Representer theorem (Kimeldorf & Wahba, 1970; see also Wahba, 1999): Minimizing the risk functional with the spectral stabilizer 3.51 under interpolation constraints is well posed, and the solution is generally given by equation 3.44. The proof was first established for the squared loss function and later extended to general pointwise loss functions. See Schölkopf and Smola (2002, chap. 4) for many variations of this theorem.

Sobolev embedding theorem (see, e.g., Yosida, 1978): This describes the relation between the Hilbert space $H^m$ and a class of smooth functions whose derivatives exist pointwise. It is essentially related to the generalized function with infinite differentiability in the distribution sense.
See Watanabe, Namatame, and Kashiwaghi (1993) and Zhu, Williams, Rohwer, and Morciniec (1998) for discussion.

3.4.6 Interpretation. Geometrically, the smoothness of a functional is measured by the order of differentiability, tangent distance, or curvature in the spatial domain; in the frequency domain, the smoothness is seen
from its power spectral density (which is the Fourier transform of the correlation function). When the solution is smooth, the spectral components are concentrated in the low-frequency bands. From a biological point of view, regularization is essentially the expansion of a set of Green's functions taking the form of the RBF, which behaves like a lookup-table mapping similar to the working mechanism of the brain. In the case of a gaussian RBF, its factorizable property lends it physiologically convincing support (Poggio & Girosi, 1990a, 1990b), and the covariance matrix $\mathbf{S}$ in equation 3.47 defines the properties (e.g., shape, size, orientation) of the receptive field (Xu, Krzyzak, & Yuille, 1994; Haykin, 1999). Regularization based on the entropy principle also finds some plausible biological support, which we detail in section 6. Finally, the regularization problem has a very nice probabilistic interpretation (Poggio & Girosi, 1990a; Girosi et al., 1995; Evgeniou et al., 2000), on which we will give detailed discussions (in sections 5 and 8) in a Bayesian framework.

3.5 Transformation Regularization: Numerical Aspects. Thus far in section 3, we have discussed regularization in the continuous domain. In practice, we are more concerned with regularization from the standpoint of numerical calculation (Hansen, 1992, 1998). This approach is another kind of transformation regularization (or subspace regularization) in the sense that regularization is performed in a subspace domain obtained by matrix decomposition (e.g., ED, SVD, or QR decomposition). Owing to the analogy between operators and matrices, we can rewrite the risk functional in matrix form,
$$R = \|\mathbf{y} - \mathbf{G}\mathbf{w}\|^2 + \lambda \|\mathbf{P}\mathbf{w}\|^2, \tag{3.54}$$
where $\mathbf{w} = [w_1, \ldots, w_\ell]^T$, $\mathbf{G} = G(\mathbf{x}, \mathbf{x}_i)$ is a radial basis matrix, and $\mathbf{y} = [y_1, \ldots, y_\ell]^T$. $\mathbf{P}$ is a user-designed matrix for the stabilizer. Since $\mathbf{G}$ is usually ill conditioned,23 one might use matrix decomposition or factorization to alleviate this situation. Taking the SVD of $\mathbf{G}$,
$$\mathbf{G} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{Z}^T, \tag{3.55}$$
where $\mathbf{U}$ and $\mathbf{Z}$ are the left and right singular vectors of $\mathbf{G}$, and $\boldsymbol{\Sigma}$ is a diagonal matrix with the singular values $\sigma_i$ ($i = 1, \ldots, \ell$) on the main diagonal. In the case of zero-order Tikhonov regularization (where $\mathbf{P}$ is an identity matrix $\mathbf{I}$), supposing $\mathbf{w} = \mathbf{Z}\boldsymbol{\alpha}$, we obtain
$$(\boldsymbol{\Sigma}^T \boldsymbol{\Sigma} + \lambda \mathbf{I})\boldsymbol{\alpha} = \boldsymbol{\Sigma}^T \mathbf{d}, \tag{3.56}$$
23 By ill conditioned, we mean that the condition number (defined as the ratio of the largest eigenvalue to the smallest eigenvalue) of the matrix is very large, and the matrix is rank deficient or almost rank deficient.
where $\mathbf{d} = \mathbf{U}^T \mathbf{y}$ is a vector of Fourier coefficients (Hansen, 1998).24 Furthermore, applying the SVD to $\mathbf{P}$, suppose $\mathbf{P} = \mathbf{V}\mathbf{S}\mathbf{Z}^T$. Then the spectral coefficients corresponding to the zero-order and nonzero-order25 Tikhonov regularization can be described, respectively, as
$$c_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda}, \qquad \alpha_i = \frac{c_i d_i}{\sigma_i}, \tag{3.57}$$
and
$$c_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda s_i^2}, \qquad \alpha_i = \frac{c_i d_i}{\sigma_i}, \tag{3.58}$$
where $s_i$ are defined as the diagonal elements of the diagonal matrix $\mathbf{S}$. Furthermore, in the case of nonzero-order Tikhonov regularization, provided $\mathbf{w} = \mathbf{Z}\boldsymbol{\alpha}$, we obtain
$$(\boldsymbol{\Sigma}^T \boldsymbol{\Sigma} + \lambda \mathbf{S}^T \mathbf{S})\boldsymbol{\alpha} = \boldsymbol{\Sigma}^T \mathbf{d}. \tag{3.59}$$
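The filter-factor form of equation 3.57 can be checked against a direct solve. The sketch below is illustrative (random $\mathbf{G}$ and $\mathbf{y}$ are assumptions) and covers the zero-order case $\mathbf{P} = \mathbf{I}$: the SVD coefficients $\alpha_i = c_i d_i / \sigma_i$ reproduce the solution of the normal equations $(\mathbf{G}^T\mathbf{G} + \lambda\mathbf{I})\mathbf{w} = \mathbf{G}^T\mathbf{y}$.

```python
import numpy as np

# Sketch of equations 3.55-3.57 with P = I: filter factors
# c_i = sigma_i^2 / (sigma_i^2 + lambda) applied in the SVD basis of G.
rng = np.random.default_rng(3)
G = rng.standard_normal((8, 8))
y = rng.standard_normal(8)
lam = 0.1

U, sig, Zt = np.linalg.svd(G)
d = U.T @ y                                # Fourier coefficients d = U^T y
c = sig ** 2 / (sig ** 2 + lam)            # filter factors (eq. 3.57)
alpha = c * d / sig                        # alpha_i = c_i d_i / sigma_i
w_svd = Zt.T @ alpha                       # back to w = Z alpha

w_direct = np.linalg.solve(G.T @ G + lam * np.eye(8), G.T @ y)
print(np.allclose(w_svd, w_direct))
```

Note that $c_i / \sigma_i = \sigma_i / (\sigma_i^2 + \lambda)$ stays bounded even as $\sigma_i \to 0$, which is precisely how the filter factors tame the small singular values.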
The solution to equation 3.54 when $\mathbf{P} = \mathbf{I}$ can be computed explicitly as $\mathbf{w} = (\mathbf{G} + \lambda\mathbf{I})^\dagger \mathbf{y}$, which has the same form as in the continuous case (see appendix C). More generally, we have the relationship
$$\mathbf{Z}^T \mathbf{w} = \boldsymbol{\Sigma}_\lambda^\dagger \mathbf{d} = \boldsymbol{\Sigma}_\lambda^\dagger \mathbf{U}^T \mathbf{y}, \tag{3.60}$$
in which, consistent with equations 3.57 and 3.58, $\boldsymbol{\Sigma}_\lambda$ is the diagonal matrix with entries
$$\boldsymbol{\Sigma}_\lambda = \begin{cases} \operatorname{diag}\{(\sigma_i^2 + \lambda)/\sigma_i\}, & \mathbf{P} = \mathbf{I}, \\ \operatorname{diag}\{(\sigma_i^2 + \lambda s_i^2)/\sigma_i\}, & \mathbf{P} \neq \mathbf{I}, \end{cases} \tag{3.61}$$
where $\gamma_i = (\sigma_i^2 + s_i^2)^{1/2}$ are the generalized singular values of the pair $(\boldsymbol{\Sigma}, \mathbf{S})$ (see appendix D for a definition). The discrete Picard condition (Hansen, 1998) states that a necessary condition for obtaining a good regularized solution is that the magnitude of the Fourier coefficients $|d_i|$ must decay to zero faster than the singular values $\sigma_i$. By reweighting a posteriori the generalized singular values $s_i$ according to their contributions (through calculating the geometric mean of $|d_i|$), the
24 Note that it coincides with the remark following equation 3.45. In a loose sense, the Fourier transform can be viewed as diagonalizing the regularization operator in the continuous domain, a fact that corresponds to applying ED or SVD to the regularizer matrix in the discrete domain. 25 For instance, provided $D = \nabla$ is a first-order gradient operator, we can define $\mathbf{P} = \mathbf{J}(\mathbf{w})\mathbf{w}^{-1}$, where $\mathbf{J}(\mathbf{w})$ is a Jacobian matrix; provided $D = \nabla^2$ is a second-order Laplace operator, we can define $\mathbf{P} = \mathbf{H}(\mathbf{w})\mathbf{w}^{-1}$, where $\mathbf{H}(\mathbf{w})$ is a Hessian matrix.
weight coefficients can be devised to be the reciprocal of the energy spectrum of the data (Velipasaoglu, Sun, Zhang, Berrier, & Khoury, 2000). Since the ill-conditioned matrix $\mathbf{G}$ usually has a wide range of singular values,26 the singular values corresponding to components with high energy and high SNR should be penalized less, and those corresponding to components with low energy and low SNR should be penalized more (Velipasaoglu et al., 2000). In addition, for discussions of regularization in the context of optimization, see Dontchev and Zolezzi (1993).

3.6 Choosing the Regularization Parameter. In order to obtain a stable and convergent regularized solution, the choice of the regularization parameter is very important. There are several approaches to determining the regularization parameter in practice. According to MacKay (1992), the regularization parameter $\lambda$ can be estimated by a Bayesian method within the second evidence framework.27 Given data $\mathcal{D}$ and model $\mathcal{M}$, the regularization parameter can be estimated by
$$p(\lambda \mid \mathcal{D}, \mathcal{M}) \propto p(\mathcal{D} \mid \lambda, \mathcal{M})\, p(\lambda \mid \mathcal{M}). \tag{3.62}$$
The regularization parameter can also be estimated by the average squared error and generalized cross-validation (GCV) approaches (Wahba, 1990, 1995; Haykin, 1999; Yee & Haykin, 2001). The advantage of the GCV estimate over the average squared error and ordinary cross-validation approaches lies in the fact that it does not need any prior knowledge of the noise variance and treats all observation data equally in the estimate. Yee (1998) and Yee and Haykin (2001) proved that the regularized strict interpolation radial basis function network (SIRBFN) is asymptotically equivalent to the Nadaraya-Watson regression estimate with mean-square consistency, in which the optimal $\lambda$ is allowed to vary with the new observations. The optimal adaptive regularization parameter and a sufficient convergence condition were studied in Leung and Chow (1999), where it is shown that the choice of $\lambda$ should satisfy
$$\lambda \ge \frac{-\|\nabla R_{\mathrm{emp}}(\mathbf{w})\|^2}{\langle \nabla R_{\mathrm{emp}}(\mathbf{w}), \nabla R_{\mathrm{reg}}(\mathbf{w})\rangle} \tag{3.63}$$
and
$$\lambda \ge \frac{\langle \nabla R_{\mathrm{emp}}(\mathbf{w}), \nabla R_{\mathrm{reg}}(\mathbf{w})\rangle}{-\|\nabla R_{\mathrm{reg}}(\mathbf{w})\|^2} \tag{3.64}$$
in order to guarantee the convergence of $R_{\mathrm{emp}}(\mathbf{w})$ and $R_{\mathrm{reg}}(\mathbf{w})$, respectively.
26 Alternatively, its singularity can be observed using QR decomposition, which is discussed in section 9. 27 The first evidence framework is to estimate the posterior probability of the weights while supposing $\lambda$ is known. See section 5 for details.
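A minimal GCV computation illustrates why no noise-variance estimate is needed. The sketch below is an illustrative assumption throughout (synthetic data, a gaussian kernel smoother, and a coarse grid of candidate parameters); it evaluates the standard GCV score $\mathrm{GCV}(\lambda) = n\|(\mathbf{I}-\mathbf{S})\mathbf{y}\|^2 / \operatorname{tr}(\mathbf{I}-\mathbf{S})^2$ for the ridge-type smoother $\mathbf{S}(\lambda) = \mathbf{G}(\mathbf{G}^T\mathbf{G} + \lambda\mathbf{I})^{-1}\mathbf{G}^T$.

```python
import numpy as np

# Sketch of choosing lambda by generalized cross-validation on a grid.
rng = np.random.default_rng(4)
n = 40
x = np.linspace(0.0, 1.0, n)
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.05 ** 2))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

def gcv(lam):
    S = G @ np.linalg.solve(G.T @ G + lam * np.eye(n), G.T)  # smoother matrix
    r = y - S @ y                                            # GCV residual
    tr = np.trace(np.eye(n) - S)
    return n * (r @ r) / tr ** 2

grid = 10.0 ** np.arange(-6.0, 2.0)        # candidate regularization parameters
scores = np.array([gcv(l) for l in grid])
best = grid[int(np.argmin(scores))]
print(best)
```

Only the observed residuals and the effective degrees of freedom (through the trace) enter the score, in line with the noise-variance-free property noted above.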
The discussion so far assumes that only a single regularization parameter is used in the regularized functional; however, this is not a strict requirement. In fact, multiple regularization parameters are sometimes necessary in practical scenarios, depending on the prior knowledge underlying the data. In general, the regularized functional can be defined by
$$R_{\mathrm{reg}}[f] = \frac{1}{|\mathbf{S}|} \int_{R^N} \left(\frac{\partial f}{\partial \mathbf{x}}\right)^T \mathbf{S} \left(\frac{\partial f}{\partial \mathbf{x}}\right) d\mathbf{x} = \frac{1}{|\mathbf{S}|} \sum_{k} s_k \int_{R^N} \left(\frac{\partial f}{\partial x_k}\right)^2 dx_k, \tag{3.65}$$
where $\mathbf{S}$ is a diagonal matrix (i.e., with a radial symmetry constraint) with nonnegative elements $\operatorname{diag}\{s_1, s_2, \ldots, s_N\}$, and $|\mathbf{S}|$ denotes the determinant of $\mathbf{S}$. One possible choice is to assign $s_k$ ($k = 1, \ldots, N$) proportional to the variance of $x_k$ along the different coordinates. See Girosi (1992) for further practical examples and discussion.

4 Occam's Razor and MDL Principle

Occam's razor is a principle that favors the shortest (simplest) hypothesis that can well explain a given set of observations. In the artificial intelligence and neural network communities, it is commonly used for complexity control in data modeling, which is essentially connected to regularization theory. MacKay (1992) provided a descriptive discussion of Occam's razor from a Bayesian perspective (see section 5 for discussion). The evidence can be viewed as the product of a best-fit likelihood and an Occam factor. In a regression problem, Occam's razor is anticipated to find the simplest smooth model that can well approximate or interpolate the observation data. The minimum description length (MDL) principle is based on an information-theoretic analysis of the concepts of complexity and randomness. It was proposed as a computable approximation of Kolmogorov complexity (Rissanen, 1978; Cherkassky & Mulier, 1998). The idea of MDL is to view machine learning as a process of encoding the information carried by the observations into the model. The code length is interpreted as a characterization of the data related to the generalization ability of the code. Specifically, the observations $\mathcal{D} = \{\mathbf{x}_i, y_i\}$ are assumed to be drawn independently from an unknown distribution, and the learning problem is formulated as the dependency estimation of $y$ on $\mathbf{x}$. A metric measuring the complexity of the data length is given by (Rissanen, 1978)
$$R = L(\mathcal{D} \mid \mathcal{M}) + L(\mathcal{M}), \tag{4.1}$$
where the first term measures a code length (log likelihood) of the difference between the data and the model output, and the second term measures the
code length of the model, which relates to the number of lookup table entries (say $m$) of a codebook, that is, $L(\mathcal{M}) = \lceil \log_2 m \rceil$ (Cherkassky & Mulier, 1998). The MDL principle is close in spirit to Akaike's information-theoretic criterion (Akaike, 1974). MDL is closely related to the regularization principle (MacKay, 1992; Hinton & van Camp, 1993; Rohwer & van der Rest, 1996; Cherkassky & Mulier, 1998). Hinton and van Camp (1993) found that the regularizations of weight decay and soft weight sharing are indeed vindications of MDL. Basically, a neural network acts like an encoder-decoder. The data and weights constitute an information flow, which is transferred through the channel, that is, the hidden layers (see Figure 1 for an illustration). Regularization attempts to keep the weights simple by penalizing the information they carry. The amount of information in the weights can be controlled by adding noise with a specific density, and the noise level (i.e., the regularization parameter) can be adjusted during the learning process to optimize the trade-off between the empirical error (misfit of the data) and the amount of information in the weights (the regularizer). Suppose the approximation error is zero-mean gaussian distributed with standard deviation $\sigma_i$ and quantization width $\mu$; one then obtains (Hinton & van Camp, 1993)
$$-\log_2 p(e_i) \propto -\log_2 \mu + \log_2 \sigma + \frac{e_i^2}{2\sigma^2}, \tag{4.2}$$
and the misfit of the data is measured by the empirical risk,
$$R_{\mathrm{emp}} = k\ell + \frac{\ell}{2} \log_2 \left(\frac{1}{\ell} \sum_i e_i^2\right), \tag{4.3}$$
where $k$ is a constant dependent on $\mu$. Hence, minimizing the squared error (the second term on the right-hand side of 4.3) is equivalent to the MDL principle. Assuming the weights are gaussian distributed with zero mean and standard deviation $\sigma_w$, the regularizer $L(\mathcal{M})$ of weight decay can be written as an MDL metric,
$$R_{\mathrm{reg}} = \frac{1}{2\sigma_w^2} \sum_i w_i^2, \tag{4.4}$$
in which $\sigma_w^2$ plays the role of the regularization parameter. In the noisy-weight case, MDL corresponds to introducing a high variance to $L(\mathcal{D} \mid \mathcal{M})$; consequently, the misfit of the data is less reliable. More generally, the weights may be assumed to have a mixture density $p(w) = \sum_i \pi_i p(w_i)$. For a detailed discussion, see Hinton and van Camp (1993), Bishop (1995a), and Cherkassky and Mulier (1998).28
28 For a detailed discussion of MDL and the structural risk minimization (SRM) principle, as well as the shortcomings of MDL, see Vapnik (1998a).
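The equivalence behind equation 4.4 is a one-line identity: for zero-mean gaussian weights, the code length $-\log p(\mathbf{w})$ is the weight-decay penalty plus a constant that does not depend on the weights. The sketch below verifies this numerically (the particular weight values and $\sigma_w$ are arbitrary illustrative assumptions).

```python
import numpy as np

# Sketch of the MDL reading of weight decay (equation 4.4): for a zero-mean
# gaussian weight prior, -log p(w) = sum_i w_i^2 / (2*sigma_w^2) + const.
sigma_w = 0.5
w = np.array([0.3, -0.7, 1.2])

density = np.exp(-w ** 2 / (2 * sigma_w ** 2)) / np.sqrt(2 * np.pi * sigma_w ** 2)
neg_log_prior = -np.sum(np.log(density))

penalty = np.sum(w ** 2) / (2 * sigma_w ** 2)           # R_reg of equation 4.4
const = len(w) * 0.5 * np.log(2 * np.pi * sigma_w ** 2) # w-independent code cost
print(np.isclose(neg_log_prior, penalty + const))
```

Since the constant does not affect the minimizer, penalizing description length and penalizing $\sum_i w_i^2$ select the same weights.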
5 Bayesian Theory

Bayesian theory is an efficient approach to dealing with prior information. It lends itself naturally to the choice of the regularization operator (MacKay, 1992). This is not surprising, since regularization theory has a nice Bayesian interpretation (Poggio & Girosi, 1990a; Evgeniou et al., 2000). Much effort has been spent on applying the Bayesian framework to machine learning and complexity control (MacKay, 1992; Buntine & Weigend, 1991; Williams, 1994). Given the data $\mathcal{D}$, what one cares about is finding the most probable model underlying the observation data. Following the Bayes formula, the posterior probability of model $\mathcal{M}$ is estimated by $p(\mathcal{M} \mid \mathcal{D}) = p(\mathcal{D} \mid \mathcal{M})\, p(\mathcal{M}) / p(\mathcal{D})$, where the denominator is a normalizing constant representing the evidence. Maximizing $p(\mathcal{M} \mid \mathcal{D})$ is equivalent to minimizing $-\log p(\mathcal{M} \mid \mathcal{D})$, as shown by
$$-\log p(\mathcal{M} \mid \mathcal{D}) \propto -\log p(\mathcal{D} \mid \mathcal{M}) - \log p(\mathcal{M}), \tag{5.1}$$
where the first term on the right-hand side corresponds to the MDL described in the previous section and the second term represents the minimal code length for the model $\mathcal{M}$. Once $\mathcal{M}$ is known, one can apply Bayes' theorem to estimate its parameters. The posterior probability of the weights $\mathbf{w}$, given the data $\mathcal{D}$ and model $\mathcal{M}$, is estimated by
$$p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}) = \frac{p(\mathcal{D} \mid \mathbf{w}, \mathcal{M})\, p(\mathbf{w} \mid \mathcal{M})}{p(\mathcal{D} \mid \mathcal{M})}, \tag{5.2}$$
where $p(\mathcal{D} \mid \mathbf{w}, \mathcal{M})$ represents the likelihood and $p(\mathbf{w} \mid \mathcal{M})$ is the prior probability given the model. Assuming the training patterns are identically and independently distributed, we obtain
$$p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}) = \prod_i p(\mathbf{x}_i, y_i \mid \mathbf{w}, \mathcal{M}) = \prod_i p(y_i \mid \mathbf{x}_i, \mathbf{w}, \mathcal{M})\, p(\mathbf{x}_i). \tag{5.3}$$
Ignoring the normalizing denominator, equation 5.2 is further rewritten as
$$p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}) \propto p(\mathbf{w} \mid \mathcal{M}) \prod_i p(\mathbf{x}_i, y_i \mid \mathbf{w}, \mathcal{M}). \tag{5.4}$$
Furthermore, the prior $p(\mathbf{w} \mid \mathcal{M})$ is characterized as
$$p(\mathbf{w} \mid \mathcal{M}) = \frac{\exp(-\lambda R_{\mathrm{reg}}(\mathbf{w}))}{Z_w(\lambda)}, \tag{5.5}$$
where $Z_w(\lambda) = \int d\mathbf{w}\, \exp(-\lambda R_{\mathrm{reg}}(\mathbf{w}))$. The choices of $\lambda$ and $R_{\mathrm{reg}}(\mathbf{w})$ usually depend on some assumption about $p(\mathbf{w})$. For instance, a gaussian prior $p(\mathbf{w}) \propto \exp(-\frac{\lambda}{2}\|\mathbf{w}\|^2)$ leads to the maximum a posteriori (MAP) estimate:
$$\log p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}) = -\frac{\lambda}{2}\|\mathbf{w}\|^2 + \sum_i \log p(\mathbf{x}_i) + \sum_i \log p(y_i \mid \mathbf{x}_i, \mathbf{w}). \tag{5.6}$$
When $p(y_i \mid \mathbf{x}_i, \mathbf{w})$ is measured by the $L_2$ metric under a gaussian noise model assumption, equation 5.6 reads
$$-\log p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}) \propto \frac{\lambda}{2}\|\mathbf{w}\|^2 + \sum_i [y_i - f(\mathbf{x}_i, \mathbf{w})]^2.$$
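For a linear model this MAP objective has a closed-form minimizer, which is exactly ridge regression (weight decay). The sketch below is illustrative (the synthetic data, true weights, and $\lambda$ are assumptions): it forms the ridge solution and checks that random perturbations never lower the negative log posterior.

```python
import numpy as np

# Sketch of the MAP estimate for a linear model f(x, w) = x.w with gaussian
# prior and gaussian noise: the posterior mode is the ridge solution.
rng = np.random.default_rng(5)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)

lam = 0.5
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # ridge solution

def neg_log_post(w):                      # up to an additive constant
    return 0.5 * lam * (w @ w) + 0.5 * np.sum((y - X @ w) ** 2)

# The closed form is the minimizer: random perturbations never do better.
is_min = all(neg_log_post(w_map) <= neg_log_post(w_map + 0.01 * rng.standard_normal(3))
             for _ in range(5))
print(is_min)
```

The objective is a strictly convex quadratic, so the stationarity condition $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}$ identifies the unique posterior mode.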
In particular, remarks on several popular regularization techniques are in order.

Remarks.

Weight decay (Hinton, 1989): Weight decay is equivalent to the well-known ridge regression in statistics (Wahba, 1990), which is a version of zero-order Tikhonov regularization.

Weight elimination (Weigend, Rumelhart, & Huberman, 1991): Weight elimination can be interpreted as the negative log-likelihood of the prior probability of the weights. The weights are assumed to follow a mixture of uniform and gaussian-like distributions.

Approximate smoother (Moody & Rögnvaldsson, 1997): In the case of sigmoid networks with tanh nonlinearity, by letting $R_{\mathrm{reg}} = \|\frac{\partial^k f(\mathbf{x})}{\partial \mathbf{x}^k}\|^2$, it can be shown that it reduces to some sort of approximate smoother $\sum_j \|w_j\|^2 \sqrt{v_j}$ when $k = 1$.

As summarized in Table 1, most regularizers correspond to weight priors with different probability density functions.29 It is noteworthy that (1) weight decay is a special case of weight elimination, (2) the role of weight elimination is similar to that of the Cauchy prior, and (3) in the case of $p(w) \propto \cosh^{-1}(\beta w)$ ($p(w)$ approximates the Laplace prior as $\beta \to \infty$), we have $\frac{\partial}{\partial w_j} \sum_j |w_j| \approx \tanh(\beta w_j)$. In the sense of weight priors, the regularizer is sometimes written as $\|\mathbf{P}\mathbf{w}\|^2$ in place of $\|Df\|^2$ (we discuss the functional priors later in section 8).
29 The complexity of the weights is evaluated by the negative logarithm of the probability density function of the weights.
Table 1: Regularizers and Weight Priors.

  Regularizer        Distribution                  Comment
  constant           Uniform distribution          Uniform prior
  $w^2$              Gaussian distribution         Weight decay
  $w^2/(1+w^2)$      Uniform + gaussian            Weight elimination
  $\log(1+w^2)$      Cauchy distribution           Cauchy prior
  $|w|$              Laplace distribution          Laplace prior
  $\log(\cosh(w))$   Supergaussian distribution    Supergaussian prior
6 Shannon’s Information Theory According to information theory (Shannon, 1948), the efciency of coding is measured by its entropy: the less the entropy, the more efcient the encoder (Cover & Thomas, 1991). Imagine a learning machine (neural network) as an encoder; the information ow is transmitted through the channel, in which the noisy information is nonlinearly ltered. A schematic illustration is shown in Figure 1. Naturally, it is anticipated that the information is encoded as efciently as possible, that is, by using fewer look-up tables (basis functions) or fewer codes (connection weights). The entropy reduction in coding the information has been validated by neurobiological observations (Daugman, 1989). Intuitively, we may think that the minimum entropy (MinEnt) can be used as a criterion for regularization. Actually, MinEnt regularization may nd plausible theoretical support and interpretation in classication (Watanabe, 1981) and regression problems. In functional approximation, information in the data is expected to concentrate on as few hidden units as possible, which refers to sparse coding or sparse representation. 30 In other words, maximum energy concentration corresponds to minimum Shannon entropy (Coifman & Wickerhauser, 1992). On the other hand, in pattern recognition, one might anticipate that only a few hidden units correspond to a specic pattern (or specic pattern is coded by specic hidden units). In other words, the knowledge learned by the neural network is not uniformly distributed in the hidden units and its connections. Based on this principle, we may derive a MinEnt regularizer for the translationally invariant RBF network as follows. Recalling G(xi , xj ) is an ` £ ` symmetric matrix, it is noted that the symmetry is not a necessary condition and our statement holds for any ` £ m 6 m) asymmetric matrix (which corresponds to the generalized RBF net(` D work). For the purpose of expression clarity, we denote Gi ( i D 1, . . . 
, `) as 30 It was well studied and observed that the sensory information in the visual cortex is sparsely coded (Barlow, 1989; Atick, 1992; Olshausen & Field, 1996; Harpur & Prager, 1996).
Zhe Chen and Simon Haykin
Figure 1: A schematic illustration of a neural network as an encoder-decoder.
the ith vector of matrix G, and G_ij as the jth (j = 1, …, m) component of the vector G_i. In order to measure the coding efficiency, we normalize every component G_ij by its row vector,^31

P_ij = |G_ij|² / ||G_i||²,    (6.1)
and consequently obtain a (symmetric) probability matrix P with elements P_ij, which satisfy the relationship

Σ_{j=1}^m P_ij = 1.    (6.2)
Thus, the information-theoretic entropy^32 is defined by

H = Σ_{i=1}^ℓ H_i = −Σ_{i=1}^ℓ Σ_{j=1}^m P_ij log P_ij.    (6.3)

^31 If the kernel function is a normalized RBF, this step is not necessary.
^32 One can, alternatively, use the generalized Renyi entropy instead of the Shannon entropy.
On Different Facets of Regularization Theory
Specifically, when P_i1 = ··· = P_im = 1/m, H_i attains its maximum value log m, and consequently H_max = ℓ log m. Therefore, the normalized entropy can be computed as

Ĥ = H / (ℓ log m).    (6.4)
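As a concrete sketch of equations 6.1 through 6.4 (not part of the original derivation; the function name and conventions are ours), the normalized entropy regularizer can be computed from any design matrix G as follows:

```python
import numpy as np

def minent_regularizer(G):
    """Normalized entropy of a design/kernel matrix, following eqs. 6.1-6.4.

    Each squared component of G is normalized by its squared row norm to
    give a probability matrix P; the row-wise Shannon entropies are summed
    and divided by l*log(m), yielding a value in [0, 1].
    """
    G = np.asarray(G, dtype=float)
    l, m = G.shape
    P = G**2 / np.sum(G**2, axis=1, keepdims=True)   # eq. 6.1, rows sum to 1
    with np.errstate(divide="ignore", invalid="ignore"):
        # Convention 0*log(0) = 0 (nansum drops those terms):
        H = -np.nansum(P * np.log(P))                # eq. 6.3
    return H / (l * np.log(m))                       # eq. 6.4
```

A uniform matrix (all hidden units equally used) gives Ĥ = 1, while a matrix with one-hot rows (each row concentrated on a single hidden unit) gives Ĥ = 0, which is the sparse-coding extreme the MinEnt regularizer favors.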
By setting R_reg = Ĥ, we obtain an information-theoretic regularizer that relates to the simplicity principle of coding, given a fixed m.^33 It is also closely related to the maximization-of-collective-information principle (Kamimura, 1997) and to the decorrelation of the information by the nonlinear filter of the hidden layer (Deco, Finnoff, & Zimmermann, 1995). The MinEnt regularization principle is also connected to the well-studied Infomax principle in supervised learning (Kamimura, 1997) as well as in unsupervised learning (Becker, 1996). Suppose the input units are represented by A and the hidden units by B; then the mutual information I(A, B) between A and B can be represented by their conditional entropy (Cover & Thomas, 1991; Haykin, 1999):

I(A, B) = H(B) − H(B | A).    (6.5)
Minimizing the conditional entropy H(B | A), the uncertainty of the hidden units given the input data, is thus equivalent to maximizing the mutual information between the input and hidden layers.

7 Statistical Learning Theory

Rewriting the expected risk R in an explicit form, one can decompose it into two parts (O'Sullivan, 1986; Geman, Bienenstock, & Doursat, 1992),^34

R = E[(y − f(x))² | x]
  = E[(y − E[y|x] + E[y|x] − f(x))² | x]
  = E[(y − E[y|x])² | x] + (E[y|x] − f(x))²,    (7.1)
which is the well-known bias-variance dilemma in statistics (Geman et al., 1992; Wolpert, 1997; Breiman, 1998). The first term on the right-hand side of equation 7.1 is the intrinsic noise variance, which does not depend on f; the second term measures the squared approximation error, which, averaged over training sets, splits into bias and variance components. The regularization coefficient λ in equation 3.5 or 3.49 controls the trade-off between the two terms in the error decomposition.

^33 The redundant hidden nodes can be pruned based on this principle.
^34 The error decomposition of the Kullback-Leibler divergence risk functional was discussed in Heskes (1998).
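The trade-off can be observed numerically. The sketch below is our own illustration (not from the text; all names and constants are arbitrary): a one-dimensional ridge estimator is refit over many noise draws, and the squared bias and the variance of the estimate are tabulated for a given regularization strength.

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_variance(lam, n_trials=2000, n=20, w_true=2.0, sigma=0.5):
    """Monte Carlo estimate of bias^2 and variance of a 1-D ridge estimator."""
    x = np.linspace(0.1, 1.0, n)
    sxx = np.dot(x, x)
    w_hats = []
    for _ in range(n_trials):
        y = w_true * x + sigma * rng.standard_normal(n)
        # Scalar ridge solution: w_hat = <x, y> / (<x, x> + lam)
        w_hats.append(np.dot(x, y) / (sxx + lam))
    w_hats = np.array(w_hats)
    bias2 = (w_hats.mean() - w_true) ** 2
    var = w_hats.var()
    return bias2, var
```

Increasing λ shrinks the estimator toward zero: the variance term decreases while the squared bias grows, exactly the trade-off controlled by the regularization coefficient.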
In recent decades, a new statistical learning framework has been formalized in terms of structural risk minimization (SRM). Based on this paradigm, support vector machines (SVMs) have been developed for a wide class of learning problems (Vapnik, 1995, 1998a, 1998b; see also Schölkopf & Smola, 2002). The SVM can be regarded as a type of regularization network with the same form of solution as equation 3.37 but trained with a different loss function, and consequently with a different solution (Girosi, 1998; Evgeniou et al., 2000). In the SVM, the ε-insensitive loss function is used and trained by quadratic programming; some of the weights w_i are zero, and the x_i corresponding to the nonzero w_i are called support vectors. The solution found by the SVM is usually sparse (Vapnik, 1995; Poggio & Girosi, 1998). By choosing a specific kernel function, the mapping from data space to feature space corresponds to a regularization operator, which explains why SVMs exhibit good generalization performance in practice. In particular, writing f(x) in terms of a positive semidefinite kernel function (not necessarily satisfying the Mercer condition),^35 we obtain

f(x) = Σ_i a_i K(x_i, x) + b.    (7.2)
When K(x_i, x_j) = ⟨DK(x_i, ·), DK(x_j, ·)⟩, the regularization network (RN) is equivalent to the SVM (Girosi, 1998; Smola et al., 1998). Recalling the Green's function D̃DG(x_i, x) = δ_{x_i}(x) = δ(x − x_i) and minimizing the risk functional, we obtain

G(x_i, x_j) = ⟨DG(x_i, ·), DG(x_j, ·)⟩ = ⟨W(x_i), W(x_j)⟩    (7.3)

with W: x_i → DG(x_i, ·). Hence, the Green's function is actually the kernel function induced by the Hilbert norm of the RKHS. Further discussions of the links between SVMs and RNs can be found in Girosi (1998), Smola et al. (1998), and Evgeniou et al. (2000). Note that in the SVM, the dimensionality of the kernel (Gram) matrix is the same as the number of observations, ℓ, which is in line with the RBF network; similar to the generalized RBF network, in which the number of hidden units is less than ℓ, the SVM can also be trained with a reduced set. We recommend that readers consult Schölkopf and Smola (2002) for an exhaustive treatment of regularization theory in the context of kernel learning and SVMs.

8 Bayesian Interpretation: Revisited

Neal (1996) and Williams (1998b) found that a large class of neural networks with certain weight priors will converge to a gaussian process (GP) prior

^35 An insightful discussion of the bias b in terms of the conditionally positive-definite kernel was given in Poggio, Mukherjee, Rifkin, Rakhlin, and Verri (2001).
over the functions, in the limit of an infinite number of hidden units. This fact motivated researchers to consider the change from priors on weights to priors on functions: instead of imposing the prior on the weight (parameter) space, one can directly impose the prior on the functional space.^36 A common way is to define the prior in the RKHS (Wahba, 1990; Evgeniou et al., 2000). Using the Bayes formula p(f | D) ∝ p(D | f) p(f), we have

p(D | f) ∝ exp( −(1/2σ²) Σ_{i=1}^ℓ (y_i − f(x_i))² ),    (8.1)

where σ² is the variance of the gaussian noise, and the prior probability p(f) is given by

p(f) ∝ exp( −||f||²_K / 2s² ),    (8.2)
where the stabilizer ||f||²_K is a norm defined in the RKHS associated with the kernel K. In particular, Parzen (1961, 1963) showed that the choice of the RKHS is equivalent to the choice of a zero-mean stochastic process with covariance kernel K (which is assumed to be symmetric positive definite), that is, E[f(x) f(y)] = s² K(x, y). The regularization parameter is then shown to be the variance ratio σ²/s² (Bernard, 1999). Hence, for RNs or SVMs, choosing a kernel K is equivalent to assuming a gaussian prior on the function f with normalized covariance equal to K (Papageorgiou, Girosi, & Poggio, 1998; Evgeniou et al., 2000). Usually, K is chosen to be positive definite and stationary (which corresponds to the shift-invariant property in the RBF). Also, choosing the covariance kernel is inherently related to finding the correlation function of the gaussian process (Wahba, 1990; Schölkopf & Smola, 2002). Generally, if the function is represented by a mixture of gaussian processes, the kernel should also be a mixture of covariance kernels. It is noteworthy that the Bayesian interpretation discussed above applies not only to the classical regularization functional 3.5 with the square loss function but also to a generic loss functional (Evgeniou et al., 2000),

R[f] = Σ_{i=1}^ℓ L(y_i − f(x_i)) + λ||f||²_K,    (8.3)
where L is any monotonically increasing loss function. When p(f) is assumed to be gaussian, the kernel is essentially the correlation function E[f(x_i) f(x_j)], viewing f as a stochastic process. In the case of zero-mean gaussian processes, we have the covariance kernel

K(x_i, x_j) = exp( −|x_i − x_j|² ).    (8.4)

^36 One can also define the prior in the functional space and further project it to the weight space (Zhu & Rohwer, 1996).
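As an illustration (ours, not the authors'; grid and jitter are arbitrary), choosing the covariance kernel of equation 8.4 and drawing functions from the corresponding zero-mean gaussian prior can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean gaussian process prior with the covariance kernel of eq. 8.4,
# K(x_i, x_j) = exp(-|x_i - x_j|^2); choosing this kernel is the same as
# assuming a gaussian prior over functions with covariance K.
x = np.linspace(-3.0, 3.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Draw sample paths f ~ N(0, K); the jitter keeps the Cholesky factor stable.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
samples = L @ rng.standard_normal((len(x), 3))   # three sample functions
```

Each column of `samples` is one smooth random function; a narrower kernel would produce rougher sample paths, which is the prior-smoothness link discussed above.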
In the nongaussian situation, L is not a square loss function (Girosi, Poggio, & Caprile, 1991). In particular, for the SVM, it was shown (Evgeniou et al., 2000) that the ε-insensitive loss function can be interpreted by a nongaussian noise model, a superposition of gaussian processes with different variances and means,^37

exp( −|x|_ε ) = ∫_{−∞}^{∞} dt ∫_0^{∞} dβ λ_ε(t) μ(β) √β exp( −β(x − t)² ),    (8.5)

where β = 1/2σ² and

λ_ε(t) = ( χ_[−ε,ε](t) + δ(t − ε) + δ(t + ε) ) / ( 2(ε + 1) ),    (8.6)

μ(β) ∝ β² exp( −1/4β ),    (8.7)
where χ_[−ε,ε](t) is 1 for t ∈ [−ε, ε] and 0 otherwise. L can also be chosen as Huber's loss function (Huber, 1981):

L(ξ) = ξ²/2            for |ξ| < c,
L(ξ) = c|ξ| − c²/2     for |ξ| ≥ c,    (8.8)

which was used to model the ε-contaminated noise density p(ξ) = (1 − ε) g(ξ) + ε h(ξ), where g(ξ) is a fixed density and h(ξ) is an arbitrary density, both assumed to be symmetric with respect to the origin, and 0 < ε < 1. Provided g(ξ) is chosen to be the gaussian density g(ξ) = (1/(√(2π) σ)) exp(−ξ²/2σ²), the robust noise density is derived as

p(ξ) = ((1 − ε)/(√(2π) σ)) exp( −ξ²/2σ² )         for |ξ| < cσ,
p(ξ) = ((1 − ε)/(√(2π) σ)) exp( c²/2 − c|ξ|/σ )   for |ξ| ≥ cσ,    (8.9)

where c is determined from the normalization condition (Vapnik, 1998a). Table 2 lists some loss functions and their associated additive noise probability density models, which are commonly used in regression.

Table 2: Loss Functions and Associated Noise Density Models.

                     Loss function L(ξ)                        Noise density model p(ξ)
Gaussian             ξ²/2                                      (1/√(2π)) exp(−ξ²/2)
Laplacian            |ξ|                                       (1/2) exp(−|ξ|)
ε-insensitive        |ξ|_ε                                     (1/(2(ε + 1))) exp(−|ξ|_ε)
Huber's loss         ξ²/2 for |ξ| < c;                         ∝ exp(−ξ²/2) for |ξ| < c;
                     c|ξ| − c²/2 otherwise                     ∝ exp(c²/2 − c|ξ|) otherwise
Talwar               ξ²/2 for |ξ| < c; c²/2 otherwise          ∝ exp(−ξ²/2) for |ξ| < c; ∝ exp(−c²/2) otherwise
Cauchy               log(1 + ξ²)                               1/(π(1 + ξ²))
Hyperbolic cosine    log(cosh(ξ))                              1/(π cosh(ξ))
Lp norm              |ξ|^r / r                                 (r/(2Γ(1/r))) exp(−|ξ|^r)

Although we discuss the learning problem in the regression framework, the discussion is also valid for classification problems. In addition, the loss function can include higher-order cumulants to enhance robustness in the nongaussian noise situation (Leung & Chow, 1997, 1999, 2001). In light of these discussions, studying priors on the function provides much freedom for incorporating prior knowledge into the learning problem. In particular, the problem is in essence to design a kernel that explains well the observed data underlying the function. For example, if one knows that a function might not be described by a gaussian process (e.g., a Laplacian process or others), one can design a particular kernel (a Laplacian kernel or other localized nonstationary kernels) without worrying about the smoothness property. See Schölkopf and Smola (2002) and Genton (2000) for more discussion.

^37 A limitation of the Bayesian interpretation of the SVM was also discussed in Evgeniou et al. (2000).

9 Pruning Algorithms

Pruning is an efficient way to improve the generalization ability of neural networks (Reed, 1993; Cherkassky & Mulier, 1998; Haykin, 1999). It can be viewed as a sort of regularization approach in which the complexity term is measured by a stabilizer. Pruning can involve either connections or nodes. Roughly, there are four kinds of pruning algorithms, developed from different perspectives, with the first two mainly concerned with connection pruning and the last two with node pruning:

- Penalty-function (Setiono, 1997) or stabilizer-based pruning approaches, such as weight decay (Hinton, 1989), weight elimination (Weigend et al., 1991), and the Laplace prior (Williams, 1994). Soft weight sharing (Nowlan & Hinton, 1992) is another pruning algorithm of this kind, in which the weights are represented by a mixture of gaussians and are encouraged to share the same value.

- Second-order (Hessian) information-based pruning approaches, such as optimal brain damage (OBD; LeCun, Denker, & Solla, 1990) and optimal brain surgeon (OBS; Hassibi, Stork, & Wolff, 1992).

- Information-theoretic criteria-based pruning schemes (Deco et al., 1995; Kamimura, 1997).

- Matrix decomposition (or factorization)-based pruning methods, such as principal component analysis (PCA; Levin, Leen, & Moody, 1994), SVD (Kanjilal & Banerjee, 1995), QR decomposition (Jou, You, & Chang, 1994), discriminant component pruning analysis (DCP; Koene & Takane, 1999), and contribution analysis (Sanger, 1989).

The matrix decomposition-based pruning schemes rest on the observation that the matrix G is ill conditioned. For instance, taking the QR decomposition GL = QR, we may obtain a new expression after pruning some hidden nodes (see appendix E for the derivation):

y = Gw  →  ŷ = Ĝŵ,    (9.1)

where Ĝ = GL₁R₁⁻¹ (with L = [L₁ L₂]), and ŵ is calculated by

ŵᵀ = wᵀL₁ + wᵀL₂(R₁⁻¹R₂)ᵀ.    (9.2)
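A minimal numerical sketch of this pruning rule (our own construction, not from the text; for simplicity the transposition matrix L is taken as the identity, so L₁ selects the first r columns): when the last m − r hidden-node outputs are exact linear combinations of the first r, folding their weights into the survivors via the update of equation 9.2 leaves the network output unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pruning redundant RBF hidden nodes via the QR decomposition.
# Build a design matrix G whose last (m - r) columns are exact linear
# combinations of the first r, i.e., the network has redundant nodes.
ell, m, r = 40, 10, 6
G1 = rng.standard_normal((ell, r))          # independent hidden-node outputs
G2 = G1 @ rng.standard_normal((r, m - r))   # redundant hidden-node outputs
G = np.hstack([G1, G2])
w = rng.standard_normal(m)                  # original output weights

# QR of G (identity transposition matrix): G = Q R with R = [R1 R2; 0 0],
# since rank(G) = r.
Q, R = np.linalg.qr(G)
R1, R2 = R[:r, :r], R[:r, r:]

# Fold the weights of the pruned nodes into the surviving ones:
w_hat = w[:r] + np.linalg.solve(R1, R2) @ w[r:]
y_full = G @ w
y_pruned = G[:, :r] @ w_hat                 # network with only r hidden nodes
```

Because G[:, r:] = G[:, :r] R₁⁻¹R₂ exactly in this construction, `y_pruned` matches `y_full` to machine precision; with a merely ill-conditioned (rather than exactly rank-deficient) G, the same update gives an approximation.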
10 Equivalent Regularization

In the machine learning community, there are various approaches implementing implicit regularization (in neither the explicit form of equation 3.5 nor equation 3.49), which we call equivalent regularization. To name a few, these approaches include early stopping (Bishop, 1995a; Cherkassky & Mulier, 1998), use of momentum (Haykin, 1999), incorporation of invariance (Abu-Mostafa, 1995; Leen, 1995; Niyogi et al., 1998), tangent distance and tangent prop (Simard, Victorri, LeCun, & Denker, 1992; Simard, LeCun, Denker, & Victorri, 1998; Vapnik, 1998a), the smoothing regularizer (Wu & Moody, 1996), flat minima (Hochreiter & Schmidhuber, 1997), sigmoid gain scaling, target smoothing (Reed, Marks, & Oh, 1995), and training with noise (An, 1996; Bishop, 1995a, 1995b). In particular, training with noise is an approximation to training with a kernel regression estimator; choosing the variance of the noise is equivalent to choosing the bandwidth of the kernel of the regression estimator. For a detailed discussion of training with noise, see An (1996), Bishop (1995a, 1995b), and Reed et al. (1995). An equivalence discussion of gain scaling, learning rate, and scaling of weight magnitudes is given in Thimm, Moerland, and Fiesler (1996).
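For the linear case, the equivalence between training with noise and Tikhonov regularization can be checked numerically. The sketch below is our own illustration under stated assumptions (isotropic gaussian input noise of variance s2, squared loss): averaging the loss over the noise adds a penalty n·s2·||w||², so ordinary least squares on many jittered copies of the inputs should approach the ridge solution with λ = n·s2.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, s2 = 200, 3, 0.1
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.05 * rng.standard_normal(n)

# Ridge (Tikhonov) solution with lam = n * s2:
lam = n * s2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares on many jittered copies of the inputs:
reps = 2000
Xn = np.concatenate([X + np.sqrt(s2) * rng.standard_normal((n, d))
                     for _ in range(reps)])
yn = np.tile(y, reps)
w_noise = np.linalg.lstsq(Xn, yn, rcond=None)[0]
```

As the number of noisy copies grows, `w_noise` converges to `w_ridge`, illustrating (for the linear case) the noise-variance/bandwidth correspondence discussed above.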
Many regularization techniques correspond to the structural learning principle. With structural learning, the learning process (in terms of hypotheses, data space, parameters, or implementation) is controlled and performed in a nested structure: S₀ ⊂ S₁ ⊂ ··· ⊂ Sₘ ⊂ ···. Capacity control and generalization performance are guaranteed and improved by constraining the learning process to this structure, which can be viewed as an implicit regularization (Cherkassky & Mulier, 1998). Early stopping is an example of structural learning in terms of the implementation of a learning process. With early stopping, the training of a network is stopped before it reaches the minimum training error, when the error on an independent validation set begins to increase. In the case of a quadratic risk function R_emp, early stopping is similar to weight-decay regularization; the product of the iteration index t and the learning rate η plays the role of the regularization parameter λ, in the sense that the components of the weight vector parallel to the eigenvectors of the Hessian satisfy (Bishop, 1995a)

w_i^(t) ≈ w_i*       for ξ_i ≫ (ηt)⁻¹,    (10.1)
|w_i^(t)| ≪ |w_i*|   for ξ_i ≪ (ηt)⁻¹,    (10.2)
where w* denotes the desired minimum point of R_emp in the weight space and the ξ_i are the eigenvalues of the Hessian matrix H(w). In this sense, early stopping can be interpreted as an implicit regularization in which a penalty is defined on a search path in the parameter space. The solutions are penalized according to the number of gradient descent steps taken along the path from the starting point (MacKay, 1992; Cherkassky & Mulier, 1998).

11 Kolmogorov Complexity: A Universal Principle for Regularization?

Regularization theory can be established from many principles (e.g., MDL, Bayes, entropy), with many visible successes in model selection, complexity control, and generalization improvement. The weakness of these principles is that none of them is universal, in the sense that they cannot be applied to an arbitrary area under an arbitrary situation. Is it possible to find a universal principle for regularization theory in machine learning? This question leads us to a well-known concept in the computational learning community: Kolmogorov complexity (or algorithmic complexity). Kolmogorov complexity theory, motivated by the Turing machine, was first studied by Solomonoff and Kolmogorov. The main thrust of Kolmogorov complexity lies in its universality, which is dedicated to constructing a universal learning method based on the universal coding method (Schmidhuber, 1994). According to Kolmogorov complexity theory, any complexity can be measured by the length of the shortest program for a universal Turing machine that correctly reproduces the observation data in a look-up table. The core of Kolmogorov complexity theory contains three parts: complexity, randomness, and information. It is interesting to compare these three concepts with machine learning. In particular, the MDL principle is well connected to complexity; randomness is inherently related to the well-known no-free-lunch (NFL) theorems (Wolpert, 1996), such as NFL for cross-validation (Zhu, 1996; Goutte, 1997), optimization (Wolpert & Macready, 1997), noise prediction (Magdon-Ismail, 2000), and early stopping (Cataltepe, Abu-Mostafa, & Magdon-Ismail, 1999). All NFL theorems basically state that no learning algorithm can be universally good; algorithms that perform well in some situations will perform poorly in others. Information is related to entropy theory, which measures uncertainty in a probabilistic sense. From the Bayesian viewpoint, Kolmogorov complexity can be understood by dealing with a universal prior (the so-called Solomonoff-Levin distribution), which measures the prior probability of guessing a halting program that computes the bitstrings on a universal Turing machine. Since the Kolmogorov complexity and the universal prior cannot be computed, some generalized complexity concepts were developed for the purpose of computability (e.g., Levin complexity). An extended discussion of this subject, however, is beyond the scope of this review.^38 Some theoretical studies of the relations between MDL, Bayes theory, and Kolmogorov complexity can be found in Vitanyi and Li (2000).^39 Can Kolmogorov complexity be a universal principle for regularization theory?
In other words, can it cover different principles such as MDL, Bayes, and beyond? There seems to exist an underlying relationship between Kolmogorov complexity theory and regularization theory (especially in the machine learning area). However, for the time being, a solid theoretical justification is missing. Although some seminal work has been reported (Pearlmutter & Rosenfeld, 1991; Schmidhuber, 1994), the studies and results were empirical, and theoretical verification is still at an early stage. The question remains unanswered and needs investigation.

12 Summary and Beyond

This review presents a comprehensive understanding of regularization theory from different viewpoints. In particular, the spectral regularization

^38 For a descriptive and detailed treatment of Kolmogorov complexity, see Cover and Thomas (1991) and a special issue on Kolmogorov complexity in Computer (vol. 42, no. 4, 1999).
^39 See the NIPS 2001 workshop for more information.
framework is derived in the vein of the Fourier operator and the Plancherel identity. State-of-the-art research on various regularization techniques is reviewed, and many related topics in machine learning are addressed and explored. The contents of this review cover such topics as functional analysis, operator theory, machine learning, statistics, statistical learning, Bayesian inference, information theory, computational learning theory, matrix theory, numerical analysis, and optimization. Nevertheless, they are all closely related to regularization theory. Roughly, Occam's razor, MDL, and MinEnt are principles for implementing regularization; Bayesian theory, information theory, and statistical learning theory can be formulated at the theoretical framework level, which establishes the mathematical foundation for the regularization principle; whereas pruning algorithms, equivalent regularization approaches, RNs, SVMs, and GPs belong to the application or implementation level. A schematic relationship of the topics in this review is illustrated in Figure 2. To this end, we conclude with some comments on possible future investigations related to the topics discussed in this review:

It has been shown that there exists a close relationship between regularization theory and sparse representation (Poggio & Girosi, 1998), SVMs (Girosi, 1998; Smola et al., 1998; Evgeniou et al., 2000; Schölkopf & Smola, 2002), gaussian processes (MacKay, 1998; Williams, 1998b; Zhu et al., 1998), independent component analysis (Hochreiter & Schmidhuber, 1998; Evgeniou et al., 2000; Bach & Jordan, 2001), wavelet approximation (Bernard, 1999), matching pursuit (Mallat & Zhang, 1993), and basis pursuit (Chen, Donoho, & Saunders, 1998). Further efforts will be needed to put many machine learning problems into a general framework and discuss their properties.

Prospective studies of regularized networks are devoted to building an approximation framework in hybrid functional spaces. Many encouraging results have been attained in the RKHS. One can extend the framework to other functional spaces, such as the generalized Fock space (van Wyk & Durrani, 2000) or Besov space. The idea behind hybrid approximation is to find an overcomplete representation of an unknown function by means of a direct sum of some possibly overlapping functional spaces, which results in a very sparse representation of the function of interest. The SRM principle (Vapnik, 1995, 1998a) seems to be a solid framework and a powerful mathematical tool for this goal. In addition, the algorithmic implementation of regularization remains an important issue. Fast algorithms beyond quadratic programming are anticipated in the kernel learning community. A recent new direction is the on-line or incremental learning algorithms developed for RNs, SVMs, or GPs (Zhu & Rohwer, 1996; De Nicolao & Ferrari-Trecate, 2001; Cauwenberghs & Poggio, 2001; Csató & Opper, 2002).
Figure 2: Schematic illustration of the theory of regularization. The dashed lines distinguish three levels: principle, theory, and application; the undirected solid lines represent some conceptual links; and the directional arrows represent the routes from principle to theory and from theory to applications.
A wide class of RNs, including RBF (Powell, 1985; Micchelli, 1986; Broomhead & Lowe, 1988; Girosi, 1992, 1993), hyperBF (Poggio & Girosi, 1990a), smoothing splines (Silverman, 1984; Wahba, 1990), and generalized additive models (Hastie & Tibshirani, 1990), can be formalized mathematically on the basis of regularization theory (Girosi et al., 1995). Since the connection between RNs and SVMs was theoretically established, one has much freedom to choose the regularization operator and the associated basis functions or kernels (Genton, 2000). For example, the framework can be extended to wavelet networks (Koulouris, Bakshi, & Stephanopoulous, 1995; Mukherjee & Nayar, 1996; Bernard, 1999; Gao, Harris, & Gunn, 2001). Besides, theoretical studies on the generalization bounds of RNs or kernel machines are always important (Xu et al., 1994; Corradi & White, 1995; Krzyzak & Linder, 1996; Freeman & Saad, 1995; Niyogi & Girosi, 1996, 1999; Vapnik, 1998a; Williamson, Smola, & Schölkopf, 2001; Cucker & Smale, 2002), and they leave us with many open problems.

Most well-established smoothing operators in the literature so far are defined in either the spatial domain (e.g., smoothing splines) or the frequency domain (e.g., the RKHS norm), as reviewed in this article. One can, however, consider a space-frequency smoothing operator, which essentially corresponds to a localized, nonstationary (or locally stationary) kernel in the general case.^40 Careful design of the operator may achieve multiresolution smoothing or approximation, which is the spirit of structural learning. Besides, the operator and the associated kernels can be time varying, which allows on-line estimation from the data.

Canu and Elisseeff (1999) reported that if the Radon-Nikodym derivative is used in the regularized functional instead of the Fréchet derivative, the solution to the regularization problem gives rise to a sigmoid-shaped network, which partially answers the open question posed by Girosi et al. (1995): Can the sigmoid-like neural network be derived from regularization theory?

A wide class of neural networks and stochastic models form a curved exponential family of parameterized neuromanifolds (Amari & Nagaoka, 2000). It will be possible to study the intrinsic relationship between differential geometry and regularization theory in terms of the choice of kernels (Burges, 1999). Naturally, the smoothness of the nonlinear feature mapping in terms of the kernel (which is connected to the covariance property of the function of interest) is measured by the curvature of the corresponding hypersurface. The higher the curvature, the less smooth the parameterized model, and thus the poorer the generalization performance that is anticipated (Zhu & Rohwer, 1995). It is also possible to incorporate some invariance to implement equivalent regularization (Simard et al., 1998; Burges, 1999; Schölkopf & Smola, 2002).

Another area not covered in this review is nonclassical Tikhonov regularization, which involves a nonconvex risk functional. Without the nice quadratic property, it is still possible to use variational methods (e.g., the mean-field approximation method) or stochastic sampling methods (e.g., Gibbs sampling, Markov chain Monte Carlo) to handle regularized problems.
Some studies in this direction can be found in Geman and Geman (1984) and Lemm (1996, 1998).
Appendix A: Proof of the Dirichlet Kernel

To make the article self-contained, the proof of the Dirichlet kernel given in Lanczos (1961) is presented here. Observe that the Dirac delta
^40 Choosing a localized kernel with compact support is essentially related to finding an operator composed of some band-limiting and time-limiting operators.
function δ(s, x) satisfies the following conditions:

∫_{−π}^{π} δ(s, x) cos kx dx = cos ks,    (A.1)
∫_{−π}^{π} δ(s, x) sin kx dx = sin ks.    (A.2)

Consider a symmetric, translationally invariant function G(s, x) = G(x, s) = G(s − x) ≡ G(η) = G(−η), which is zero everywhere except in the interval |η| ≤ ε, where ε is a small, positive value. The expansion coefficients a_k and b_k of this function are

a_k = (1/π) ∫_{−ε}^{ε} cos k(s + η) G(η) dη = (1/π) cos kξ ∫_{−ε}^{ε} G(η) dη,    (A.3)
b_k = (1/π) ∫_{−ε}^{ε} sin k(s + η) G(η) dη = (1/π) sin kξ ∫_{−ε}^{ε} G(η) dη,    (A.4)

where ξ ∈ (s − ε, s + ε). Provided ∫_{−ε}^{ε} G(η) dη = 1, letting ε → 0, the point ξ → s, and one obtains the desired expansion coefficients in equations A.1 and A.2. If we replace K(s − x) by the Fourier expansion coefficients of the Dirac function, then, comparing with equation 3.21, the Dirichlet kernel K_n(s, x) acts like a Dirac delta function:

∫_{−π}^{π} f(s) δ(s, x) ds = f(x).    (A.5)
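Numerically (our sketch; the test function, grid, and order n are arbitrary), the sifting property of equation A.5 can be checked for the partial-sum Dirichlet kernel K_n(s, x) = 1/(2π) + (1/π) Σ_{k=1}^n cos k(s − x): integrating a trigonometric polynomial against K_n over one period reproduces its value, provided n is at least the highest harmonic present.

```python
import numpy as np

def dirichlet_kernel(s, x, n):
    """Partial-sum Dirichlet kernel K_n(s, x), which approximates delta(s, x)."""
    k = np.arange(1, n + 1)
    return 1.0 / (2 * np.pi) + np.cos(np.outer(s - x, k)).sum(axis=1) / np.pi

def f(s):
    # A trigonometric polynomial whose highest harmonic is 3.
    return np.cos(2 * s) + 0.5 * np.sin(3 * s)

# Rectangle rule over one full period (spectrally accurate for periodic integrands).
s = np.linspace(-np.pi, np.pi, 2001)[:-1]
ds = 2 * np.pi / 2000
x0 = 0.7
approx = np.sum(f(s) * dirichlet_kernel(s, x0, n=5)) * ds   # ~ f(x0)
```

With n below the highest harmonic of f, the kernel truncates the Fourier series instead, which is the low-pass behavior the delta-like limit n → ∞ removes.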
Appendix B: Proof of the Regularization Solution

The proof of the regularization solution was partly given in Haykin (1999) and is rewritten here for completeness. By virtue of equation 3.13, applying L to the function f(x), we obtain

L f(x) = L ∫_{R^N} G(x, ξ) Q(ξ) dξ
       = ∫_{R^N} LG(x, ξ) Q(ξ) dξ
       = ∫_{R^N} δ(x − ξ) Q(ξ) dξ
       = Q(x).    (B.1)
Similarly, applying K to the function f(x) yields

K f(s) = K ∫_{R^N} G(s, ξ) Q(ξ) dξ
       = ∫_{R^N} KG(s, ξ) Q(ξ) dξ
       = ∫_{R^N} exp(−jsξ) Q(ξ) dξ
       = W(s).    (B.2)
The solution of the regularization problem is further derived by setting

Q(ξ) = (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] δ(ξ − x_i),    (B.3)

W(ω) = F{Q(ξ)} = (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] exp(−j x_i ω).    (B.4)
In the spatial domain, we have

f_λ(x) = ∫_{R^N} G(x, ξ) { (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] δ(ξ − x_i) } dξ
       = (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] ∫_{R^N} G(x, ξ) δ(ξ − x_i) dξ
       = Σ_{i=1}^ℓ w_i G(x, x_i),    (B.5)
where w_i = [y_i − f(x_i)]/λ. Equivalently, in the frequency domain, we have

f_λ(x) = ∫_{R^N} F{G(x, ξ)} F{ (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] δ(ξ − x_i) } dω
       = ∫_{R^N} G(x, ω) (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] exp(−j x_i ω) dω
       = (1/λ) Σ_{i=1}^ℓ [y_i − f(x_i)] ∫_{R^N} G(x, ω) exp(j x_i ω) dω
       = Σ_{i=1}^ℓ w_i G(x, x_i),    (B.6)
which is identical to equation B.5. The second line of the above equation uses equation B.4, and the first line follows from the Parseval theorem.

Appendix C: Another Proof of the Regularization Solution

The proof is slightly modified from the proof given in Bernard (1999). From Riesz's representation theorem, we know that for all x ∈ R^N there exists a basis function φ_x ∈ H such that f(x) = ⟨f, φ_x⟩ for all f ∈ H. By defining an interpolation kernel

K(x, s) = ⟨φ_x, φ_s⟩,    (C.1)

with the property K(x, s) = K(s, x), the solution to the regularization problem can be written in terms of the basis functions φ_x:

R[f] = (1/2) Σ_{i=1}^ℓ [y_i − f(x_i)]² + (λ/2) ||f||²_H
     = (1/2) Σ_{i=1}^ℓ [y_i − f(x_i)]² + (λ/2) ⟨f, f⟩.    (C.2)
Differentiating equation C.2 with respect to f and setting the result to zero, we obtain

dR[f] = λ⟨f, df⟩ + Σ_{i=1}^ℓ ⟨df, φ_{x_i}⟩⟨f, φ_{x_i}⟩ − Σ_{i=1}^ℓ y_i ⟨df, φ_{x_i}⟩
      = ⟨ λf + Σ_{i=1}^ℓ ⟨f, φ_{x_i}⟩ φ_{x_i} − Σ_{i=1}^ℓ y_i φ_{x_i}, df ⟩ = 0.    (C.3)
It further follows that

λf = Σ_{i=1}^ℓ ( y_i − ⟨f, φ_{x_i}⟩ ) φ_{x_i}.

Hence, the solution can be represented by a linear combination of the basis functions φ_{x_i}:

f(x) = Σ_{i=1}^ℓ c_i φ_{x_i}(x) = Σ_{i=1}^ℓ c_i K(x, x_i).

Denote f = [⟨f, φ_{x_1}⟩, …, ⟨f, φ_{x_ℓ}⟩]ᵀ, c = [c_1, …, c_ℓ]ᵀ, y = [y_1, …, y_ℓ]ᵀ, and let K be the ℓ-by-ℓ kernel matrix; writing this in matrix form, we have Kc = f, λf = K(y − f), and

c = (K + λI)⁻¹ y.
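As a brief sketch (ours; the kernel width, λ, and toy data are arbitrary choices), the closed-form solution c = (K + λI)⁻¹ y can be implemented directly:

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_kernel(A, B, width=0.1):
    """Gaussian interpolation kernel K(x, s) between two 1-D point sets."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * width**2))

# Noisy samples of a smooth target function:
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

lam = 0.1
K = gaussian_kernel(x, x)
c = np.linalg.solve(K + lam * np.eye(len(x)), y)   # c = (K + lam*I)^{-1} y

# Evaluate f(x) = sum_i c_i K(x, x_i) on a test grid:
x_test = np.linspace(0.0, 1.0, 200)
f_test = gaussian_kernel(x_test, x) @ c
```

The fitted function smooths the noise rather than interpolating it; taking λ → 0 recovers (ill-conditioned) exact interpolation, while larger λ trades data fidelity for smoothness.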
Appendix D: GSVD

The generalized singular value decomposition (GSVD) is defined as [U, V, Z, Σ, S] = GSVD(A, B), where A, B are m × p and n × p matrices, respectively; U_{m×m} and V_{n×n} are unitary matrices; the matrix Z_{p×q} (q = min{m + n, p}) is usually (but not necessarily) square; and Σ and S are diagonal matrices. All satisfy the following relationships:

A = UΣZᵀ,  B = VSZᵀ,  ΣᵀΣ + SᵀS = I.

Suppose σ_i and s_i are the diagonal singular values of the singular matrices Σ and S, respectively; the generalized singular values are defined by γ_i = (σ_i² + s_i²)^{1/2}.

Appendix E: QR Decomposition

Applying the QR decomposition to the matrix G (Golub & Van Loan, 1996), one has GL = QR, where G is an ℓ × m input matrix, L is an m × m transposition matrix, Q is an ℓ × ℓ full-rank matrix, and R is an ℓ × m upper triangular matrix. In particular, R and Q are expressed by

R = [ R₁  R₂
      0   0 ],    Q = [ Q₁  Q₂ ];

henceforth,

GL = [ GL₁  GL₂ ],

and it further follows that

Q₁ = GL₁R₁⁻¹,  GL₂ = Q₁R₂ = GL₁R₁⁻¹R₂,

where L₁ is an m × r matrix, L₂ is an m × (m − r) matrix, Q₁ is an ℓ × r matrix, Q₂ is an ℓ × (m − r) matrix, R₁ is an r × r matrix, and R₂ is an r × (m − r) matrix. Suppose the number of hidden nodes is pruned from m to r (m > r): G_{ℓ×m} → Ĝ_{ℓ×r}; that is, there are (m − r) redundant hidden nodes to be
deleted. Denoting the new weight vector by ŵ (an r × 1 vector), the new expression for the network is written as ŷ = Ĝŵ, where Ĝ = Q₁ = GL₁R₁⁻¹, and ŵ is calculated as

ŵᵀ = wᵀL₁ + wᵀL₂(R₁⁻¹R₂)ᵀ.

Acknowledgments

The work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Z. C. is partly supported by a Clifton W. Sherman Scholarship. This work was initially motivated by and benefited greatly from the NIPS community. We thank Rambrandt Bakker for helpful feedback on the manuscript and the anonymous reviewers for many critical comments on the earlier draft.

References

Abu-Mostafa, Y. S. (1995). Hints. Neural Computation, 7, 639–671.
Adams, R. A. (1975). Sobolev spaces. San Diego, CA: Academic Press.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: AMS and Oxford University Press.
An, G. (1996). The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8, 643–674.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? International Journal of Neural Systems, 3, 213–251.
Bach, F. R., & Jordan, M. I. (2001). Kernel independent component analysis (Tech. Rep.). Berkeley: Division of Computer Science, University of California, Berkeley.
Balakrishnan, A. V. (1976). Applied functional analysis. New York: Springer-Verlag.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 311–325.
Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In G. Roussas (Ed.), Nonparametric functional estimation and related topics (pp. 561–576). Dordrecht: Kluwer.
Becker, S. (1996).
On Different Facets of Regularization Theory
Zhe Chen and Simon Haykin
Received September 25, 2001; accepted May 28, 2002.
NOTE
Communicated by Shun-ichi Amari
Notes on Bell-Sejnowski PDF-Matching Neuron

Simone Fiori
[email protected]
Faculty of Engineering, University of Perugia, Loc. Pentima bassa, 21, I-05100 Terni, Italy

This article investigates the behavior of a single-input, single-unit neuron model of the Bell-Sejnowski class, which learns through the maximum-entropy principle, in order to understand its probability density function matching ability.

1 Introduction
Independent component analysis (ICA) is an emerging neural signal processing technique that allows representing sets of signals as linear combinations of statistically independent bases (Bell & Sejnowski, 1996). In particular, the results of recent investigations of the statistical properties of natural images in relation to the properties of simple cells in V1 suggest that these cells learn to form spatial filters that perform an ICA of the images (Barlow, 2001; Bell & Sejnowski, 1997). One of the most interesting aspects of the ICA theory that Bell and Sejnowski (1996) proposed is the emerging ability of the neurons in the structural model to align to the statistical distributions of the stimuli. This observation was successfully exploited in order to design different learning rules for blind separation (Fiori, 2000), blind deconvolution (Fiori, 2001a), and probability density function estimation (Fiori, 2001b). In this note, we consider the basic Bell-Sejnowski class of neuron models and recall the maximum-entropy adapting formulas. By properly selecting a model in the class that gives rise to tractable mathematics, we are able to present the closed-form expressions of the learning equations, which we particularize for some special excitations. Our main goal is to discuss the features of the neuron model in an analytical way in order to gain deeper insight into the behavior of the equations governing information-theoretic nonlinear unit learning.

2 Neuron Model and pdf-Matching Learning Equations
We consider the simple neuron model depicted in Figure 1, which may be formally described by the input-output equation

y = s(wx + b),   (2.1)
Neural Computation 14, 2847–2855 (2002) © 2002 Massachusetts Institute of Technology
Figure 1: Structure of single-input neuron.
where $x(t) \in \mathbb{R}$ and $y(t) \in \mathbb{R}$ denote the neuron's input stimulus and response signal, respectively; $w \in \mathbb{R}$ denotes the neuron's connection strength; and $b \in \mathbb{R}$ stands for the bias. The nonlinear function $s(\cdot)$ represents a bounded (saturating) squashing activation function, which meets the monotonicity condition $s'(\cdot) > 0$. Both the input and output signals are treated as stationary random processes, described by the probability density functions (pdfs) $p_x(x)$ and $p_y(y)$. We make no particular hypothesis about the stimulus's statistical distribution beyond sufficient regularity: $p_x(x)$ should be a smooth function endowed with sufficiently high-order moments. The statistical distribution of the neuron's response depends on the distribution of the stimulus through the neuron's nonlinear transfer function; formally, the relationship between the two statistics is

p_y(y) = \frac{p_x(x)}{D(x)}, \qquad D(x) \stackrel{\mathrm{def}}{=} \frac{dy}{dx} = |w|\, s'(wx + b).   (2.2)
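The transformation rule 2.2 can be illustrated numerically. The sketch below uses a logistic squasher as a stand-in for $s$ (the specific squashing function introduced later is not needed for this check), a standard gaussian stimulus, and hypothetical values of $w$ and $b$; the density predicted by 2.2 at one point is compared against a histogram estimate from samples:

```python
import numpy as np

w, b = 1.5, 0.2

def s(u):                                   # logistic squasher, s'(u) > 0
    return 1.0 / (1.0 + np.exp(-u))

def s_prime(u):
    return s(u) * (1.0 - s(u))

rng = np.random.default_rng(1)
xs = rng.standard_normal(1_000_000)         # gaussian stimulus samples
ys = s(w * xs + b)                          # neuron responses

x0 = 0.0
y0 = s(w * x0 + b)                          # image of x0 under the squasher
D = abs(w) * s_prime(w * x0 + b)            # D(x0) = dy/dx at x0
p_formula = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi) / D   # p_x(x0) / D(x0)

h = 0.01                                    # histogram bin width around y0
p_hat = np.mean(np.abs(ys - y0) < h / 2) / h
print(abs(p_hat - p_formula) < 0.05 * p_formula)
```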
Bell-Sejnowski’s neuron under consideration here should learn to respond with a maximum entropy signal, which gives rise to a learning theory based on entropy maximization. The entropy of the neuron’s excitation and its response relate through the statistics transformation formula, 2.2, which gives the relationship Z Hy ( w, b) D Hx C
R
px (j ) log[|w|s0 ( wj C b) ] dj.
(2.3)
The neuron’s parameters w and b may be learned through an optimization principle that has the target of maximizing the neuron’s response entropy through an entropy-gradient learning rule. The aim of this analysis is to elucidate some properties of the pdf-matching neuron that descend from the mathematical properties of the learning equations. In order to carry out our analytical considerations, it is necessary to choose a neuron’s structure and dene the shape of the squashing function
Notes on Bell-Sejnowski PDF-Matching Neuron
2849
s (¢) . The following expression, which may be regarded as a kind of sigmoidal function, proves to generate tractable mathematics: ³ ´ 1 1 1 (2.4) s ( u ) D C Q ( u) C un C 1 , , n odd integer. 2 2 nC1 In formula 2.4, we used the (regularized) incomplete gamma function (Abramowitz & Stegun, 1972), n is an odd integer, and Q (¢) denotes the signum function; with the convention that Q (0) D 0, function (2.4) is continuous along with all its derivatives. It is worth noting that the maximum entropy learning equations do not depend explicitly on the sigmoidal function, but on the ratio s00 ( u ) / s0 ( u ) only. From equation 2.4, we nd s00 ( u ) / s0 ( u ) D ¡(n C 1) un . This remarkable result suggests that the learning terms depend not on the neuron’s response but on its net activation only. On the basis of the chosen neuron’s model, the closed-form entropy maximization learning equations read: Z 1 dw ¡ ( n C 1) (2.5) D px (j ) ( wj C b) nj dj, dt w R Z db (2.6) D ¡(n C 1) px (j ) ( wj C b) n dj. dt R The selected squashing function gives rise to tractable mathematics. In fact, by making use of the binomial expansion formula twice, the neuron’s learning equations can be cast in the friendly form n X n X 1 dw `¡m ` n¡` n ¡ ( n C 1) D T`n¡m w b , ¡m Tm ( mL m C 1 C mL m m 1 )m 1 dt w m D 0 `D m n X n X db n `¡m ` n¡` D ¡(n C 1) T`n¡m w b , ¡m TmmL m m 1 dt D 0 m `D m
(2.7)
(2.8)
n! n denotes the binomial (Tartaglia’s) coefcient where Tm , and we m! ( n¡m)! dened the stimulus moments, Z Z def def m1 D px (j )j dj, mL m D px (j ) (j ¡ m 1 ) m dj, m ¸ 0. R
R
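The double-sum forms 2.7 and 2.8 can be checked against the integral forms 2.5 and 2.6; a minimal sketch using a hypothetical three-point discrete stimulus, so that all expectations are exact sums rather than Monte Carlo estimates:

```python
import numpy as np
from math import comb

# Hypothetical discrete stimulus: xi takes the values xs with probabilities ps.
xs = np.array([-1.0, 0.5, 2.0])
ps = np.array([0.2, 0.5, 0.3])
n, w, b = 3, 0.7, -0.4

m1 = float(ps @ xs)                                      # first raw moment
mu = [float(ps @ (xs - m1) ** m) for m in range(n + 2)]  # central moments

# Direct integrals from equations 2.5 / 2.6
I_w = float(ps @ (xs * (w * xs + b) ** n))
I_b = float(ps @ ((w * xs + b) ** n))

# Double-sum forms from equations 2.7 / 2.8
S_w = sum(comb(n, l) * comb(l, m) * (mu[m + 1] + mu[m] * m1)
          * m1 ** (l - m) * w ** l * b ** (n - l)
          for m in range(n + 1) for l in range(m, n + 1))
S_b = sum(comb(n, l) * comb(l, m) * mu[m]
          * m1 ** (l - m) * w ** l * b ** (n - l)
          for m in range(n + 1) for l in range(m, n + 1))

print(abs(I_w - S_w) < 1e-12, abs(I_b - S_b) < 1e-12)
```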
It might also be interesting to write explicitly the neuron's entropy gap, the learning criterion defined as $C_h(w, b) \stackrel{\mathrm{def}}{=} H_y(w, b) - H_x$:

C_h(w, b) = \log(|w|) - \sum_{m=0}^{n+1} \sum_{\ell=m}^{n+1} T^{n+1}_{\ell}\, T^{\ell}_{m}\, \bar{\mu}_m\, m_1^{\ell-m}\, w^{\ell}\, b^{n-\ell+1}.   (2.9)

The entropy gap is just the neuron's response entropy minus a quantity that is independent of the neuron, and it can thus be used in place of the actual response entropy for learning purposes.
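Equation 2.9 can be checked the same way: since $\log s'(u) = -u^{n+1}$ up to an additive constant, the double sum should reproduce the expectation of $(w\xi + b)^{n+1}$. A sketch with a hypothetical three-point discrete stimulus:

```python
import numpy as np
from math import comb

# Hypothetical discrete stimulus (exact expectations, no sampling error).
xs = np.array([-1.0, 0.5, 2.0])
ps = np.array([0.2, 0.5, 0.3])
n, w, b = 3, 0.7, -0.4

m1 = float(ps @ xs)
mu = [float(ps @ (xs - m1) ** m) for m in range(n + 2)]

# Direct expectation E[(w*xi + b)^(n+1)] versus the double sum in 2.9
E_direct = float(ps @ ((w * xs + b) ** (n + 1)))
S = sum(comb(n + 1, l) * comb(l, m) * mu[m] * m1 ** (l - m)
        * w ** l * b ** (n + 1 - l)
        for m in range(n + 2) for l in range(m, n + 2))
C_h = np.log(abs(w)) - S           # the entropy gap of equation 2.9
print(abs(E_direct - S) < 1e-12)
```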
3 Analysis of the Case Study
In this section, we illustrate the behavior of the neuronal model when excited with external stimuli of known statistics.

3.1 Analysis for a Gaussian Excitation. The signal $x(t)$ might be conceived of as the net input to the neuron (the linear combination of many input signals) or as the only input to the neuron. In both cases, it makes sense to study the learning equations when $x(t)$ has a gaussian distribution, with mean $\mu \in \mathbb{R}$ and standard deviation $\sigma > 0$. In this case, we have $m_1 = \mu$, $\bar{\mu}_m = 0$ for $m$ odd, and $\bar{\mu}_{2m} = 1 \cdot 3 \cdots (2m-1)\, \sigma^{2m}$. The case $n = 3$ is interesting to study because it is nontrivial and does not give rise to intractable mathematics. By particularizing learning equations 2.7 and 2.8, we obtain the differential system governing the neuron's learning phase:

\frac{dw}{dt} = \frac{1}{w} - 4\left[\mu b^3 + 3(\mu^2 + \sigma^2) w b^2 + 3\mu(\mu^2 + 3\sigma^2) w^2 b + (\mu^4 + 6\sigma^2\mu^2 + 3\sigma^4) w^3\right],   (3.1)

\frac{db}{dt} = -4\left[b^3 + 3\mu w b^2 + 3(\mu^2 + \sigma^2) w^2 b + \mu(\mu^2 + 3\sigma^2) w^3\right].   (3.2)
One of the purposes of this analysis is to find, in closed form, the equilibrium points of the above learning equations; this may be achieved by solving the system of two equations that arises by setting the right-hand sides of the learning equations to zero, where it is understood that $w > 0$. We have been able to identify two special cases in which the equilibrium equations may be solved exactly.

When the mean value of the input gaussian excitation is zero, the equilibrium system noticeably simplifies to

1 = 4(3\sigma^2 w^2 b^2 + 3\sigma^4 w^4), \qquad 0 = b\,(b^2 + 3\sigma^2 w^2).

Clearly the second equation has $b = 0$ as the only feasible solution, because the subequation $b^2 + 3\sigma^2 w^2 = 0$ would lead to complex-valued solutions; by setting $b = 0$ in the first equation, we also find $w = \frac{1}{\sqrt[4]{12}\,\sigma}$.

As a numerical example, let us consider the case $\sigma = \sqrt{2}$; the learning equations have been discretized in time with sampling step $\Delta t = 0.001$. Figure 2 shows the course of the slope $w$ and bias $b$ of the neuron's activation during the learning phase; it may be readily verified that $b \to 0$ and $w \to \frac{1}{\sqrt[4]{12}\,\sqrt{2}}$. It would also be interesting to investigate the shape of the entropy gap as a function of the learnable parameters. The entropy gap surface and contour plot for this case are depicted in Figure 3.

Figure 2: Case $\mu = 0$: Learning curves for the parameters $w$ (dashed line) and $b$ (solid line).

Figure 3: Entropy gap surface (top) and contour plot (bottom) for the case $\mu = 0$.

The symmetry of the gap $C_h(w, b)$ with respect to the line $b = 0$, as well as the fact that the minima lie on this line, is quite apparent, confirming the conclusion of the theoretical analysis. The entropy barrier at $w = 0$ is also clearly visible.

The result pertaining to a unitary mean value is nontrivial and interesting. In fact, when $\mu = 1$, the equilibrium equation for $b$ becomes an identity over the line $w + b = 0$, and the equilibrium equation for $w$ then gives $w = -b = \frac{1}{\sqrt[4]{12}\,\sigma}$.

As a numerical example, let us consider again the case $\sigma = \sqrt{2}$. Figure 4 shows the course of $w$ and $b$ during the learning phase; it may be readily verified that this time $b \to -\frac{1}{\sqrt[4]{12}\,\sqrt{2}}$ while $w \to \frac{1}{\sqrt[4]{12}\,\sqrt{2}}$.

Figure 4: Case $\mu = 1$: Learning curves for the parameters $w$ (dashed line) and $b$ (solid line).

The shape of the entropy gap as a function of the learnable parameters, shown in Figure 5, again confirms that the learning algorithm has found the right neural configuration. The black diagonal line in the figure represents the line of equation $w + b = 0$, and the fact that the minima of the entropy gap lie on that line clearly emerges.
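The gaussian-case learning curves can be reproduced with a simple Euler discretization of equations 3.1 and 3.2; a sketch for $\mu = 0$, $\sigma = \sqrt{2}$, $\Delta t = 0.001$, with hypothetical initial conditions (the text does not state them):

```python
import numpy as np

# Euler discretization of the learning equations 3.1-3.2 (gaussian case).
mu, sigma = 0.0, np.sqrt(2.0)
dt, steps = 0.001, 30000
w, b = 1.0, 0.5   # hypothetical initial conditions

for _ in range(steps):
    dw = 1.0 / w - 4 * (mu * b**3 + 3 * (mu**2 + sigma**2) * w * b**2
                        + 3 * mu * (mu**2 + 3 * sigma**2) * w**2 * b
                        + (mu**4 + 6 * sigma**2 * mu**2 + 3 * sigma**4) * w**3)
    db = -4 * (b**3 + 3 * mu * w * b**2 + 3 * (mu**2 + sigma**2) * w**2 * b
               + mu * (mu**2 + 3 * sigma**2) * w**3)
    w, b = w + dt * dw, b + dt * db

w_star = 1.0 / (12**0.25 * sigma)   # predicted equilibrium slope
print(w, b)
```

The iteration settles on the predicted equilibrium $b = 0$, $w = 1/(\sqrt[4]{12}\,\sigma)$.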
Figure 5: Entropy gap surface (top) and contour plot (bottom) for the case μ = 1.

3.2 Analysis for a Laplacean Excitation. Here we aim at studying the behavior of the learning equations when x(t) has a Laplacean distribution,
p_x(x) = (λ/2) exp(−λ|x|),   (3.3)
with λ > 0 being the Laplacean parameter. In this case, we have μ₁ = 0 and μ_m^L = 0 for m odd; the other moments are easily computed to be μ_m^L = m!/λ^m. The case of Laplacean excitation is interesting because it allows an understanding of the behavior of the Bell-Sejnowski neuron when it is excited by a speech signal. In the case n = 3, learning equations 2.7 and 2.8 particularize to

dw/dt = 1/w − 4((6/λ²) w b² + (24/λ⁴) w³),   (3.4)

db/dt = −4(b³ + (6/λ²) w² b).   (3.5)

Also, the entropy gap is easily computed to be

C_h(w, b) = log(|w|) − (b⁴ + (12/λ²) w² b² + (24/λ⁴) w⁴).   (3.6)
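Equations 3.4 and 3.5 can be integrated numerically. In the Python sketch below, λ = 4 (the value used in the numerical example later in this section), while the initial values, step size, and duration are illustrative assumptions; setting the right-hand sides to zero gives the fixed point b = 0, w = λ/⁴√96, and the integration converges to it.

```python
# Euler integration of learning equations 3.4-3.5 for Laplacean excitation.
# lambda = 4 as in the text's numerical example; initial values, step size,
# and duration are illustrative assumptions.
lam = 4.0
dt = 0.001
w, b = 0.5, 0.3          # illustrative initial slope and bias

for _ in range(8000):
    dw = 1.0 / w - 4.0 * ((6.0 / lam ** 2) * w * b ** 2
                          + (24.0 / lam ** 4) * w ** 3)
    db = -4.0 * (b ** 3 + (6.0 / lam ** 2) * w ** 2 * b)
    w, b = w + dt * dw, b + dt * db

w_star = lam / 96.0 ** 0.25   # equilibrium lambda / 96^(1/4), about 1.2779
print(w, b, w_star)
```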
Figure 6: Entropy gap surface (top) and contour plot (bottom) for the case of Laplacean excitation.
In this case, the equilibrium points of the system of learning equations are easily found from equation 3.5 to be b = 0, which corresponds to w = ±λ/⁴√96.
Note that the other resolving equation, b² + (6/λ²)w² = 0, does not have any real solution except for b = w = 0, which is not feasible because w = 0 is excluded by the learning equations. The shape of the entropy gap as a function of the learnable parameters, shown in Figure 6, visually confirms that the equilibrium points found for the learning algorithm are correct and unique (apart from the sign of w). The black straight line in the figure represents the line of equation b = 0, on which the minima of the entropy gap lie. As a numerical example, let us consider the case that λ = 4. It may be readily verified that the learning equations make b → 0 while w → 4/⁴√96 ≈ 1.2779.

4 Conclusion
The aim of this article was to discuss in some detail the behavior of a single-weight, single-bias nonlinear neuron model of the Bell-Sejnowski class in the presence of stochastic excitations of known structure, in order to show
analytically that the maximum-entropy learning principle causes the neuron's transfer function to align with the input pdf, as predicted by Bell and Sejnowski (1996). This was shown by choosing an activation function whose shape is very close to the standard sigmoid but leads to tractable mathematics, and by considering a gaussian and a (double-sided) Laplacean excitation. In these cases, closed-form expressions of the learning equations could be presented analytically and their features investigated.

References

Abramowitz, M., & Stegun, I. A. (Eds.). (1972). Gamma (factorial) function and incomplete gamma function. In Handbook of mathematical functions with formulas, graphs, and mathematical tables (pp. 255–258, 260–263). New York: Dover.

Barlow, H. B. (2001). Redundancy reduction revisited. Network: Computation in Neural Systems, 12, 241–253.

Bell, A. J., & Sejnowski, T. J. (1996). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.

Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Fiori, S. (2000). Blind signal processing by the adaptive activation function neurons. Neural Networks, 13(6), 597–611.

Fiori, S. (2001a). A contribution to (neuromorphic) blind deconvolution by flexible approximated Bayesian estimation. Signal Processing, 81(10), 2131–2153.

Fiori, S. (2001b). Probability density function learning by unsupervised neurons. International Journal of Neural Systems, 11(5), 399–417.

Received February 13, 2002; accepted May 30, 2002.
LETTER
Communicated by Richard Lippmann
Biophysiologically Plausible Implementations of the Maximum Operation

Angela J. Yu
[email protected]
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

Martin A. Giese
[email protected]
Department for Cognitive Neurology, University Clinic Tübingen, 72076 Tübingen, Germany

Tomaso A. Poggio
[email protected]
CBCL, Massachusetts Institute of Technology, Cambridge, MA 02142, U.S.A.

Visual processing in the cortex can be characterized by a predominantly hierarchical architecture, in which specialized brain regions along the processing pathways extract visual features of increasing complexity, accompanied by greater invariance in stimulus properties such as size and position. Various studies have postulated that a nonlinear pooling function such as the maximum (MAX) operation could be fundamental in achieving such selectivity and invariance. In this article, we are concerned with neurally plausible mechanisms that may be involved in realizing the MAX operation. Different canonical models are proposed, each based on neural mechanisms that have been previously discussed in the context of cortical processing. Through simulations and mathematical analysis, we compare the performance and robustness of these mechanisms. We derive experimentally verifiable predictions for each model and discuss the relevant physiological considerations.

1 Introduction
Neurophysiological experiments have provided evidence that visual processing in the cortex can be characterized by a predominantly hierarchical system, in which areas farther along the ventral pathway are selective for increasingly complex stimulus features, accompanied by increasing invariance with respect to stimulus size and position (Hubel & Wiesel, 1962; Perrett et al., 1991; Logothetis, Pauls, & Poggio, 1995; Tanaka, 1996; Pasupathy & Connor, 1999).

Neural Computation 14, 2857–2881 (2002) © 2002 Massachusetts Institute of Technology
Different mechanisms have been proposed to account for the selectivity and invariance properties of the visual cortex. For instance, one body of theoretical work involves flexible central mechanisms that dynamically adjust stimulus scale and position selectivity according to the input (e.g., the "shifter circuit" presented in Anderson & Van Essen, 1987). However, although there is evidence for such dynamic modulation in a variety of visual areas (Motter, 1994; Connor, Preddie, Gallant, & Van Essen, 1997; Treue & Maunsell, 1996), it functions at a timescale too slow to account for the short latencies found in some object recognition tasks (Thorpe, Fize, & Marlot, 1996). From a different approach, a growing number of models of visual processing call for some form of pooling from feature detectors in an earlier stage of processing. The underlying idea was first postulated by Hubel and Wiesel (1962) to account for complex cell invariance with respect to spatial phase shifts via linear summation of the responses of rectifying, phase-sensitive neurons. Fukushima's (1980) more general, hierarchical Neocognitron network achieves invariant responses via a feedforward network of alternating layers of feature detectors and nonlinear pooling neurons. More recently, Riesenhuber and Poggio (1999a, 1999b) have reproduced neurophysiological data from area IT with a hierarchical model that uses a combination of linear summation and maximum (MAX) operations. Pooling by a MAX operation, as opposed to linear summation, achieves high feature specificity and invariance simultaneously (Riesenhuber & Poggio, 1999b). Suppose the inputs to the system are activity levels of a population of simple cell bar detectors that prefer the same orientation but have receptive fields in different locations. Summing over these detectors gives the same total response if the input is an oriented bar contained in any one of the receptive fields, achieving position invariance.
However, the response is even stronger if multiple bars (e.g., a grating) or background clutter is present, causing the pooling neuron to lose selectivity as a bar detector. Taking the maximum of the responses of these feature detectors alleviates this specificity problem, because then the system output is solely determined by the response of its most active afferent. Thus, the MAX operation both preserves feature specificity and achieves invariance in a more robust fashion than linear summation. It is interesting to note that pooling by the MAX operation is computationally equivalent to scanning an image with a template, which has been the basis for many recognition algorithms in computer vision (Riesenhuber & Poggio, 1999b, 2000). In this study, we are concerned with neurophysiological implementations of the MAX operation. In addition to its proposed involvement in a variety of cortical processes such as object recognition (Riesenhuber & Poggio, 1999b), motion recognition (Giese, 2000), and visual velocity estimation (Grzywacz & Yuille, 1990), the MAX operation is interesting as a basic nonlinear operator that can be implemented by simple, neurophysiologically plausible models. Throughout this article, we define the ideal MAX operation as a mapping from an input vector x = [x₁, x₂, ..., x_n] to an output signal z, where

z ∝ x_m ≡ max_{1≤i≤n} x_i.
More generally, the operation should achieve the following properties:

Selectivity: The output signal z depends only on the maximum of all the input signals (sometimes referred to as the input amplitude in this work), x_m, and not on the other values.

Linearity: The output signal z depends linearly on x_m with a constant gain factor g, that is, z = g x_m.

The first property is critical to achieving feature specificity. The second property is important for optimally recovering information about the strength of the maximal input. Of course, biological systems can be expected to implement only an approximation to the ideal MAX operation. In some computational models employing the MAX operation, it has been shown that an approximation to the ideal MAX operation indeed suffices (Riesenhuber & Poggio, 1999b). Both the computational and implementation aspects of our models derive much inspiration from two closely related areas of earlier work on neural modeling: winner-takes-all (WTA) and gain control networks. WTA has been widely studied in the neural networks and VLSI literature (Grossberg, 1973; Kohonen, 1995; Lazzaro, Ryckenbusch, Mahowald, & Mead, 1989; Starzyk & Fang, 1993; Hahnloser, Douglas, Mahowald, & Hepp, 2000; Fukai & Tanaka, 1997) and has been used to model cortical functions such as the integration of component motions (Nowlan & Sejnowski, 1995) and attentional selection (Koch & Ullman, 1985; Lee, Itti, Koch, & Braun, 1999). WTA networks select the afferent input with the largest amplitude, but their output is not required to reflect this amplitude and is often nonlinear or even binary. In general, WTA networks convey the identity of the "winner neuron" but not its precise amplitude to the downstream cortical processing areas,¹ whereas the MAX operation communicates the amplitude but not necessarily the identity.
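The contrast with plain summation (the grating example above) and the two defining properties can both be illustrated with a toy computation. In the Python sketch below, the bar-detector responses and positions are purely illustrative values, not fitted to any data.

```python
# Toy illustration of the ideal MAX operation versus sum pooling over
# position-shifted bar detectors; detector responses are illustrative.
def detectors(bar_positions, strength=1.0, n=5):
    # each detector responds with `strength` if a bar lies in its field
    return [strength if i in bar_positions else 0.0 for i in range(n)]

single_a = detectors({0})          # one bar, leftmost receptive field
single_b = detectors({4})          # same bar, rightmost field
grating  = detectors({0, 2, 4})    # multiple bars (clutter)

# Position invariance: both pooling rules match across bar positions.
# Selectivity: only MAX ignores the submaximal (clutter) responses.
print(sum(single_a), sum(grating))   # 1.0 3.0  (sum inflated by clutter)
print(max(single_a), max(grating))   # 1.0 1.0  (MAX unchanged)

# Linearity: the MAX output scales with the maximal input amplitude.
print(max(detectors({0}, strength=0.5)))   # 0.5
```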
Gain control circuits make the response of neural detectors independent of stimulus energy or contrast, thus exhibiting some invariance in stimulus size or intensity (Reichardt, Poggio, & Hausen, 1983; Carandini & Heeger, 1994; Wilson & Humanski, 1993; Simoncelli & Heeger, 1998; Salinas & Abbott, 1996).

¹ There are some exceptions: Yuille and Grzywacz's (1989) divisive feedback network and Hahnloser's (1998) linear threshold VLSI network both output the maximal input under certain conditions, and Lazzaro et al.'s (1989) VLSI network outputs the logarithm of the maximal input. These implementations sit on the definitional boundary between MAX and WTA networks.

However, typical
gain control circuits fail the linearity requirement because they normalize all input channels regardless of input amplitude. The remainder of the article is structured as follows. In section 2, we isolate a small number of neural mechanisms that may be involved in realizing the MAX operation and present four highly simplified canonical circuits that implement these mechanisms. In section 3, we present our main simulation results, with the relevant mathematical analysis to be found in the appendix. In section 4, a number of model-specific predictions are presented, along with a discussion of potential experimental paradigms that can be used to verify these predictions. In section 5, we discuss various neuronal implementations of the computational operations involved in each of the models and compare the relative plausibility of these implementations. Finally, section 6 relates our efforts to previous work in neural modeling and makes suggestions for future directions of research. Some of the results have appeared previously in abstract form (Yu, Giese, & Poggio, 2000a, 2000b).
2 Models
The main issues we explore in this work are divisive² versus subtractive inhibition,³ feedforward versus recurrent architecture, and mean firing rate versus integrate-and-fire description. We focus on four simple canonical neural models that implement the MAX operation with well-studied neural principles and which allow some degree of mathematical analysis (see Figure 1 for schematic diagrams of a generic feedforward and feedback network). All of these models can be described as a three-layer neural circuit with an input layer representing static input signals x_n, a symmetrically connected intermediate layer that transforms the input signals into output signals y_n in a nonlinear fashion, and an output unit z that linearly sums the intermediate-layer activities. In biophysiological terms, the inputs correspond to output signals from earlier stages of sensory processing. If these earlier feature detectors
² The physiological mechanisms and functional role of shunting inhibition have been a topic of intensive theoretical and experimental investigations (Naka & Rushton, 1966; Torre & Poggio, 1978; Koch, Poggio, & Torre, 1983; Koch & Poggio, 1987; Ferster & Jagadesh, 1992; Holt & Koch, 1997; Carandini, Heeger, & Movshon, 1997; Doiron, Longtin, Berman, & Maler, 2001; Borg-Graham, Monier, & Fregnac, 1998; Anderson, Carandini, & Ferster, 2000). The latest evidence on the role of shunting inhibition in the visual cortex converges on the observation that inhibitory synaptic conductances can rise significantly and rapidly after stimulus presentation, having a divisive effect on the excitatory synaptic input (Borg-Graham et al., 1998; Anderson et al., 2000).

³ Intracellular recordings show that half-wave rectification provides a good fit for intracellular recording data (Carandini & Ferster, 2000).
Figure 1: Schematic feedforward and feedback models. (a) Feedforward network. (b) Feedback network. Solid arrow: excitatory inputs. Open arrow: inhibitory inputs. Circles and lines represent computational units and interactions rather than explicit delineation of neurons and their processes. Apparent violations of Dale's law can be resolved via inhibitory interneurons (Li, 2000), as seems to be done in the cortex (White, 1989; Gilbert, 1992; Rockland & Lund, 1983).
collectively have a spectrum of selectivity in a certain feature dimension (e.g., stimulus position), then the output signal would reflect a degree of invariance in this dimension. For example, preliminary data indicate simple and complex cells may play the respective roles of x_n and z in implementing the MAX operation in V1 (Lampl, Riesenhuber, Poggio, & Ferster, 2001). A
dedicated summating interneuron receiving inputs from a number of simple cells could inhibit their inputs to intermediate-layer neurons or directly inhibit the dendrites of the complex cell receiving inputs from the simple cells. A dedicated summating unit might not even be necessary if each simple cell (or its downstream interneuron) inhibits the synaptic inputs from the other simple cells to the same complex cell. The first model is a feedforward network (FFN) with divisive (shunting) inhibition:

y_n = x_n f(x_n) / (c + Σ_k f(x_k)),   z = Σ_n y_n,   (2.1)
where 0 < c ≪ 1 is a small, positive constant that determines the baseline activity when the input signal vanishes. f(x) is a positive, monotonically increasing, convex function. In the simulations presented in section 3, f(x) = x^q is used, although the precise form of f(x) is not critical, and we obtained similar results using f(x) = e^{qx}. We also examine a divisive feedback network (DFB) variation of the FFN network:

τ ẏ_n = −y_n + x_n f(y_n) / (c + Σ_k f(y_k)),   z = Σ_n y_n,   (2.2)
where f and c are as for the FFN model and τ is the time constant of the dynamical system. The architectures of the divisive models, FFN and DFB, are similar to those previously proposed for WTA behavior (Grossberg, 1973; Koch & Ullman, 1985; Fukai & Tanaka, 1997), gain control in the fly visual system (Reichardt et al., 1983), and attentional modulation of orientation filters in human vision (Lee et al., 1999). The third model is a linear threshold network (LIN) with subtractive inhibition, represented as rectified linear inhibitory inputs:

τ ẏ_n = −y_n − w Σ_k [y_k]⁺ + x_n,   z = (w + 1) Σ_n [y_n]⁺,   (2.3)
where the constant w > 0 specifies the inhibitory synaptic strength. The output gain factor w + 1 in equation 2.3 ensures that z = x_m exactly at equilibrium (see section 3.1), but in fact any constant gain factor would
work equally well for the purpose of achieving linearity, as discussed in section 1. The fourth network is a leaky integrate-and-fire model (SPK), which directly implements the threshold nonlinearity of the LIN network:

τ ṁ_n = −m_n − w Σ_{k≠n} y_k + x_n,
y_n(t) = Σ_i δ(t − t_i),
z(t) = Σ_n y_n(t).   (2.4)
Unit n fires if m_n exceeds the threshold θ. The membrane potential, represented by m_n, is lower-bounded by zero and is also reset to zero after each spike at time t_i. One major difference between the SPK model and the LIN model is that while the dynamic variable y in the LIN model can become negative, m in the SPK model is lower-bounded by zero, and therefore the impact of past history on current activities is more limited in the SPK model. Also, the intermediate-layer units communicate only when one or more of them spike, making their activities more input-bound. As we will see in section 3, these differences result in some fundamentally different response properties. Network architecture similar to the linear threshold networks, LIN and SPK, has previously been proposed in the context of inhibitory interactions in the Limulus retina (Hartline & Ratliff, 1957), orientation tuning in the visual cortex (Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995), gain fields in the parietal cortex (Salinas & Abbott, 1996), and WTA behavior in analog VLSI circuits (Lazzaro et al., 1989; Hahnloser et al., 2000).
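Equation 2.4 can be sketched in discrete time. In the Python sketch below, the threshold θ = 0.5, time constant τ = 1, step size, and simulation length are hypothetical values (the text does not fix them); with w ≥ x_m − θ, only the unit receiving the maximal input should ever fire.

```python
# Discrete-time sketch of the SPK network (eq. 2.4). The threshold theta,
# time constant tau, step dt, and duration are hypothetical assumptions.
x = [1.0, 0.9, 0.9]            # inputs; the first unit receives x_m = 1
N = len(x)
w, theta, tau, dt = 20.0, 0.5, 1.0, 0.001

m = [0.0] * N                  # membrane potentials
spikes = [0] * N               # spike counts per unit

for _ in range(int(10.0 / dt)):        # simulate 10 time units
    fired = [n for n in range(N) if m[n] >= theta]
    for n in range(N):
        # integrated delta-function inhibition: each lateral spike
        # knocks the membrane potential down by w (with tau = 1)
        inhib = w * sum(1 for k in fired if k != n)
        m[n] += dt * (-m[n] + x[n]) / tau - inhib
        if n in fired:
            m[n] = 0.0                 # reset to zero after a spike
            spikes[n] += 1
        m[n] = max(m[n], 0.0)          # potential lower-bounded by zero

print(spikes)   # only unit 0, driven by the maximal input, ever fires
```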
3 Results
In the following, we examine the linearity and selectivity properties (as discussed in section 1) of each network. The responses of intermediate-layer and output-layer units in equilibrium conditions are examined. We also explore the models' abilities to approximate the MAX operation under a range of internal and external parameter values: input distribution, strength of lateral inhibition, number of inputs, initial conditions, and the presence of noise.⁴ We also point out circumstances in which z is a more robust reconstruction of x_m than y_m, the latter of which has been used in
⁴ In all of our simulations, the inputs are assumed to change much more slowly than network activities and thus are represented as constant as the network response relaxes toward equilibrium.
previous MAX models with similar architecture (Yuille & Grzywacz, 1989). The relevant mathematical analyses can be found in the appendix. 3.1 Network Dynamics.
3.1.1 Divisive Feedforward Network. As with all of the other models we present, the feedforward network has no synaptic delay; thus its "dynamics" is trivial in the sense that the output immediately and fully reflects the system's processing of the input. From equation 2.1, notice that if f(x) is sufficiently convex and therefore sufficiently exaggerates the difference between the maximal input, x_m, and the other inputs, then f(x_m) dominates the sum in the denominator of equation 2.1, resulting in y_m ≈ x_m and y_n ≈ 0, ∀n ≠ m, giving rise to z ≈ x_m.

3.1.2 Divisive Feedback Network. For the DFB model, the effect of the nonlinear operation f is similar, except that the difference between y_m and y_n is further exaggerated in each iteration until equilibrium is reached. The network dynamics
y_m = x_m y_m^q / (c + Σ_k y_k^q),
y_n = x_n y_n^q / (c + Σ_k y_k^q),   ∀n ≠ m,   (3.1)
give rise to the equilibrium relation y_n / y_m = (x_n / x_m)(y_n / y_m)^q, which is consistent if y_n = 0, ∀n ≠ m. It is easy to see that the system has a stable equilibrium at y_n = 0, z = y_m ≈ x_m.

3.1.3 Linear Threshold Network. Similarly, in the LIN network, given sufficiently strong lateral inhibition, represented by w in equation 2.3, y_n, ∀n, approaches

y_n = −w Σ_k [y_k]⁺ + x_n.   (3.2)
It is easy to see that y_m = x_m / (w + 1) and [y_n]⁺ = 0 is a stable equilibrium of this system, giving rise to z = x_m. In section A.3, we quantitatively describe the conditions on the parameters that allow the system to reach this equilibrium.

3.1.4 Spiking Network. For the SPK network, since it is deterministic, given similar initial conditions for each m_n, m_m always reaches firing threshold before all the other units, resulting in a strong lateral inhibition of its neighbors in the next time step. More quantitatively, if w ≥ x_m − θ, then [m_n − θ]⁺ tends toward 0 (y_n is silent), and m_m tends toward x_m, inducing a regular spike train in y_m, and therefore in z. The frequency of this spike train
Figure 2: Different input types. (a) Different input types: r gaussian, o ramp, * uniform. x_m = 1. Because the network connectivity is symmetric in all models, there is no inherent topology in the output. (b) Corresponding equilibrium intermediate-layer activities, y_n, in the FFN (q = 6). z for FFN: gaussian 0.97, ramp 0.95, uniform 0.91. (c) y_n activities for LIN (w = 15). z for LIN: gaussian 1.04, ramp 1.03, uniform 1.00. (d) Evolution of membrane potential m_n in the SPK network in response to "ramp" input: w = 10. Inset shows average firing rates (units normalized for comparison).
is roughly proportional to the input magnitude, tempered by the "leakiness" of the integrate-and-fire neuronal model.

3.2 Simulations.
3.2.1 Intermediate-Layer Activities. To illustrate the behavior of {y_n} in each of the models, we use several different classes of inputs, with identical amplitude a = 1 (see Figure 2a). There are N = 81 input and corresponding intermediate-layer units:

Gaussian: width σ = 10, centered at n = 0

Ramp: x_n = a(n/80 + 1/2)

Uniform inputs with one "winner": x_0 = a, x_n = 0.90a, ∀n ≠ 0
The responses to gaussian and ramp inputs are interesting visual illustrations of how the nonlinear intermediate-layer interactions differentially attenuate the input signals: the gaussian peak is narrower, and the ramp has been transformed into an exponential (see Figures 2b and 2c). The "uniform" inputs represent a worst-case scenario for some of the models (see section A.1) and will be used in later simulations to examine the dependence of z on the magnitude of the submaximal inputs. While LIN completely suppresses the nonmaximal inputs, FFN suppresses them only partially. However, z ≈ x_m for all types of inputs, for both FFN and LIN, as part of a general pattern where the final output, z, is much more accurate and consistent in reporting x_m than the maximal intermediate-layer activity, y_m. The DFB network responds very well to all the input types, where only y_m > 0 in every case (the results are difficult to visualize in the format of Figures 2b and 2c). The SPK network also responds well to all input types, where only m_m ever reaches firing threshold. Figure 2d shows the evolution of the membrane potential of the intermediate units in the SPK model, m_n, in response to a ramp input. The inset shows the corresponding average firing rate: only y_m is active.
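Because the FFN is feedforward, its equilibrium responses to these input classes can be evaluated in closed form from equation 2.1. In the sketch below (N = 81, q = 6), the baseline c and the exact input parameterization are assumptions, so the resulting z values only approximately match those quoted in the Figure 2 caption; the checks are therefore qualitative, namely that z stays close to x_m = 1 and the most active intermediate-layer unit tracks the maximal input.

```python
# Closed-form FFN responses (eq. 2.1) to the three input classes of this
# section; the baseline c and exact parameterization are assumptions.
import math

def ffn(x, q=6, c=0.01):
    f = [v ** q for v in x]
    denom = c + sum(f)
    y = [v * fv / denom for v, fv in zip(x, f)]
    return y, sum(y)

ns = range(-40, 41)                                    # N = 81 units
inputs = {
    "gaussian": [math.exp(-n ** 2 / (2 * 10.0 ** 2)) for n in ns],
    "ramp":     [n / 80.0 + 0.5 for n in ns],
    "uniform":  [1.0 if n == 0 else 0.9 for n in ns],
}

results = {}
for name, x in inputs.items():
    y, z = ffn(x)
    # z stays close to x_m = 1, and the most active intermediate-layer
    # unit coincides with the unit receiving the maximal input
    results[name] = (z, y.index(max(y)) == x.index(max(x)))
    print(name, round(z, 3), results[name][1])
```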
Figure 3 shows the dependence of z and ym on the magnitude of the submaximal inputs xn , while Figure 4 shows the effect of varying inhibitory strength (q for the divisive models, w for the linear models). For the FFN model, we see that z D xm for small values of xn and for xn D xm , but dips slightly for intermediate values, a phenomenon that concurs with our mathematical analysis (see section A.1). In contrast to z, ym has a much stronger dependence on xn , dropping to xm / N in the limit of xn D xm . Also, the strength of inhibition, represented in the model by q, has a very small effect on z compared to ym , which starts deviating from xm for even small xn when w is small. A more systematic study of this effect is shown in Figure 4. These results agree with the general observation that z is much more robust than ym in reproducing xm in a variety of situations in this and most of the other models studied. Divisive inhibition applied recurrently, as demonstrated by the DFB plot in Figure 3, signicantly improves the system’s performance in implementing the ideal MAX operation. z reproduces xm in equilibrium, independent of xn and q (see also Figure 4), agreeing with our mathematical analysis (see section A.2).
Figure 3: Dependence on nonmaximal input. For each network, the magnitude of x_n, ∀n ≠ m, was varied. N = 3, x_m = 1. The firing rate of the spiking network has been normalized for ease of comparison. z is more robust than y_m in general.
The final output, z, of the LIN model has a similar response to variations in x_n as that of FFN, although y_m in this case behaves more robustly with respect to the magnitude of the submaximal inputs x_n (see Figure 3), over a range of lateral inhibition w (see Figure 4). The SPK model behaves well except when x_n is large (see Figure 3) or w is small (see Figure 4), in which case a spike in a unit m causes its own membrane potential to drop to 0 in the succeeding time step, while the other units receiving inhibition from the spike may be only partially suppressed and be the first to reach firing threshold next. In this case, the firing can alternate between y_m and y_n, and the overall firing rate, z, increases accordingly.

3.2.3 Dependence on Number of Inputs. In some of the previous models proposed for the realization of the MAX operation or WTA, the performance of the models was dependent on the number of inputs (Yuille & Grzywacz, 1989). As the number of inputs increased, the system tended to become increasingly inaccurate in reproducing the maximal input. Our models are relatively immune to this problem (see the appendix). Figure 5 shows the
Figure 4: Strength of lateral inhibition. Relationship between output and input amplitude as a function of inhibitory strength, represented by the parameter q or w. Each network is simulated to convergence, as q or w is varied from 2 to 30. In general, z is more robust than y_m. For the DFB model, y_m and z overlap completely. N = 3, x_m = 1, x_n = 0.9.
network responses to a number of inputs varying from 2 to 30 (higher values of N result in responses similar to those in the case of N = 30). The robustness of z is striking, even though in the FFN model, y_m converges to x_m / N, and in LIN, y_m converges to w(x_m − x_n) (see sections A.1 and A.3 for details).

3.2.4 Dependence on Initial Conditions. The FFN model has no memory of its past states and therefore no dependence on initial conditions. The DFB model has, besides the attractor that gives rise to MAX-operator behavior, other non-MAX attractors. An example is shown in Figure 6b, where the initial strength of one unit allows it subsequently to suppress the other unit and dominate the sum z, even though its input ceases to be the maximum. In section A.2, we examine in more detail how this undesirable attractor can arise. The LIN network, in contrast, has a unique equilibrium for z that does not depend on the identity of x_m or the magnitude of x_n, as shown in
Figure 5: Number of inputs. Dependence of network performance on the number of inputs, N, ranging from 2 to 30. x_m = 1, x_n = 0.9. FFN: q = 15, DFB: q = 2, LIN: w = 10, SPK: w = 20.
Figure 6. The SPK model has behaviors very similar to LIN's: the intermediate-layer activities reflect the relative strength of the inputs, but the final output z depends only on the maximal input.

3.2.5 Noise Sensitivity. The reader may well wonder by this point how our deterministic models fare in the presence of noisy inputs. Does the nonlinear amplification of the signal also have an amplifying effect on the noise component of the input? To examine this issue, each model is fed with uniform inputs (with one "winner"), to which gaussian noise, independently generated in each iteration, is added. The results are shown in Figure 7. While the FFN responds with output noise of magnitude comparable to the input noise, the feedback networks actually suppress output noise, in accordance with previous work on recurrent networks (Yuille & Grzywacz, 1989; Salinas & Abbott, 1996): the recurrent interactions have an averaging effect on the noise. Note that the noise behavior of the spiking model is not directly comparable to that of the other models, as the input noise of the spiking
Figure 6: Multistability. Dependence on initial conditions. (a) N = 2. Dashed: unit 1. Solid: unit 2. Iterations 1–200: x_1 = 1, x_2 = 0.9. Iterations 201–400: x_1 = 0.95, x_2 = 1. Iterations 401–600: x_1 = 0.7, x_2 = 1. (b) y_1 (dashed) and y_2 (solid) responses of DFB. z depends on initial conditions. (c) y_1 (dashed) and y_2 (solid) responses of LIN. z = y_1 + y_2 does not depend on initial conditions. (d) SPK responses are similar to LIN's: y_1 (dashed) and y_2 (solid) activities reflect the relative strength of the inputs, but z is consistent and only a function of the maximal input.
model is measured in terms of membrane potential, while the output noise is measured in terms of firing rate; for all the other models, both the input and output noise are measured in terms of mean firing rate. Overall, it is reassuring that despite the presence of signal amplification in the systems, output noise increases linearly as a function of input noise with acceptably small slopes.

4 Predictions
The results from the simulation studies presented in section 3 give rise to a number of model-specific predictions, which are summarized in Table 1. Some of the predictions are easier to test than others. Preliminary evidence indicates there are single cells in striate cortex (Sakai & Tanaka, 1997; Lampl et al., 2001) and inferior temporal cortex (Sato, 1989) that respond to visual stimuli in a MAX-like manner. Given such a "MAX" neuron, it is
Implementations of the Maximum Operation
Figure 7: Noise sensitivity. Standard deviation of output for each network over 50 trials, normalized by the expected or noiseless output, as a function of the amount of uncorrelated noise (measured in terms of standard deviation from the noiseless input signal) added independently to each unit in each iteration. N = 3, x_m = 1, x_n = 0.9. FFN: q = 15, DFB: q = 2, LIN: w = 10, SPK: w = 20.
relatively easy to test whether the neuron exhibits hysteretic behavior. For instance, given the time courses of a MAX neuron's responses to two stimuli separated in space, its combined response should be the pointwise maximum of the two responses if it is nonhysteretic, but be dominated by the one that has the shorter latency and dominant transients if it is hysteretic. In fact,

Table 1: Model-Specific Predictions.

Prediction                       FFN    DFB    LIN    SPK
Hysteresis                       no     no     no     yes
Sensitive to GABA-A blocker      yes    yes    no     no
Sensitive to GABA-B blocker      no     no     yes    yes
Effect of GABA blocker on z      none   none   ↑      ↑
Sparse y_n activity              no     yes    no     no
Effect of larger N on y_m        ↓      none   ↓      none
Noise suppression                no     yes    yes    yes
preliminary data from Lampl et al. (2001) give some support to the latter behavior. In this work, the only model we examined that exhibits nontrivial hysteresis is the DFB model. However, it has been shown previously that if a little self-excitation is imposed on the LIN model, then it would also exhibit hysteresis (Hahnloser, 1998). In principle, it should also be possible to apply antagonists locally to different inhibitory neurotransmitters, such as GABA-A or GABA-B, in order to weaken the proposed inhibitory connections involved in implementing the MAX operation. For example, GABA-A or other inhibitory receptors with reversal potential near resting potential would be critical for implementing the divisive shunting inhibition used in the FFN and DFB models; GABA-B or other inhibitory receptors with very negative reversal potential could be involved in the subtractive inhibition used in the LIN or SPK models. Local application of these blockers therefore might be able to differentiate whether the MAX operation involves divisive or subtractive inhibition, which are most likely to be involved in cellular or network implementations of the MAX operation, respectively. Moreover, the specific effects of suppressing the inhibitory synapses could differ according to our simulation results: when the inhibitory activities are suppressed, LIN's and SPK's output should increase, while FFN's and DFB's output should not be significantly affected. Successful implementation of these experiments would provide valuable insight into the underlying mechanisms of the MAX operation, although in practice, interpretation of these experiments might be difficult, as the relevant synapses might live and interact on distal dendrites while recordings will mainly be in the soma.
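The nonhysteretic prediction above can be phrased concretely: given a MAX neuron's response time courses to each stimulus alone, its combined response should be their pointwise maximum. A minimal sketch (the time courses are made up for illustration; a hysteretic neuron would instead be dominated by the shorter-latency response throughout):

```python
# Pointwise-maximum prediction for a nonhysteretic MAX neuron.
# r1 and r2 are illustrative response time courses to each stimulus alone.
r1 = [0.0, 0.2, 0.9, 0.6, 0.3]  # response to stimulus 1 alone
r2 = [0.0, 0.0, 0.4, 0.8, 0.5]  # response to stimulus 2 alone

combined_prediction = [max(a, b) for a, b in zip(r1, r2)]
print(combined_prediction)  # [0.0, 0.2, 0.9, 0.8, 0.5]
```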
If experimental data could give some indication as to what x_n and y_n represent, whether they are computed synaptically within the recorded cell or represent input and output to other neurons, more specific predictions might then be tested: the sparseness of y_n activation, the effect of N on y_m, and even the effect of input noise on z.

5 Biophysiological Considerations
Even in the absence of more detailed experimental data, some reasonable conjectures can be made about the relative plausibility of the models based on biophysiological considerations. The feedforward divisive inhibition model can be implemented in multiple ways. Two particularly attractive implementations emerge if equation 2.1 is broken down in two distinct ways:

y_n = x_n \cdot \frac{f(x_n)}{c + \sum_k f(x_k)},   (5.1)

y_n = x_n f(x_n) \cdot \frac{1}{c + \sum_k f(x_k)}.   (5.2)
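Both groupings compute the same quantity y_n = x_n f(x_n) / (c + Σ_k f(x_k)); they differ only in which factor is attributed to which neural element. A quick numerical check (f, c, and the inputs below are illustrative choices):

```python
# Equations 5.1 and 5.2 are two groupings of the same quantity
# y_n = x_n f(x_n) / (c + sum_k f(x_k)); f, c, and the inputs are illustrative.
def f(x, q=4):
    return x ** q

x = [1.0, 0.9, 0.7]
c = 0.1
pool = c + sum(f(xk) for xk in x)               # shared inhibitory pool

y_51 = [xn * (f(xn) / pool) for xn in x]         # eq. 5.1: x_n times a normalized drive
y_52 = [(xn * f(xn)) * (1.0 / pool) for xn in x] # eq. 5.2: gated nonlinear response

print(all(abs(a - b) < 1e-12 for a, b in zip(y_51, y_52)))  # True
```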
Equation 5.1 can be thought of as a multiplication between x_n and the output of a normalizing network of mutually inhibitory neurons that also receives a copy of the input x_n. Such network interactions have been previously proposed to account for the responses of V1 neurons receiving afferent LGN inputs and lateral cortical inputs (Carandini & Heeger, 1994). Equation 5.2 represents another possibility: the gating of a nonlinear operation on the input, g(x_n), by an inhibitory pooling neuron computing \sum_k f(x_k), where g(x_n) \propto x_n f(x_n) is necessary for achieving the linearity property of the ideal MAX operation. Such multiplicative gain control interactions have been observed in the parietal cortex (Salinas & Abbott, 1996), the insect visual system (Hatsopoulos, Gabbiani, & Laurent, 1995), area LIP (Andersen, Bracewell, Barash, Gnadt, & Fogassi, 1990), and the superior colliculus (Van Opstal, Hepp, Suzuki, & Henn, 1995). The biophysiological considerations of the DFB model are similar to those of the FFN model, involving two main possibilities of decomposing the product \frac{x_n f(y_n)}{c + \sum_k f(y_k)}, as described before. The main differences are
that for the normalization implementation, the mutually inhibitory units are represented by y_n rather than x_n; similarly, for the gain control implementation, the gain factor now depends on y_n rather than x_n, and the gain-controlled neuron receives inputs from both x_n and y_n and computes g(x_n, y_n) \propto x_n f(y_n). One significant advantage of this recurrent model over the feedforward model is that it works well with little or no nonlinear amplification of the signal by the operator f, as we have seen in the simulation results in section 3 and the mathematical analysis in section A.2. Another computational advantage of this recurrent model is that it performs very efficient noise suppression. The implementation of the LIN and SPK models is straightforward. Each y_n probably represents the activity of a single neuron, which receives inhibition from its neighbors and excitation from the input. One ambiguity is that the summation can be done either by a dedicated summation neuron (in which case self-excitation is needed for the SPK model), or the unit could be directly inhibited by its neighbors via dendritic inputs (in which case self-inhibition is needed for the LIN model).

6 Discussion
In this work, we have reviewed and analyzed a number of neural circuits that provide good approximations to the MAX operation, which has been proposed to play a significant role in various processes of the visual system. The MAX operation is interesting because it is an example of a fundamental, nonlinear computational operation that can be realized with neurophysiologically plausible mechanisms. The models were chosen in order to demonstrate different neural principles for the realization of this computational operation. They are simple enough to allow an understanding of
the underlying parametric dependencies and some mathematical analysis. The neural mechanisms on which our models are based are also fundamental in models for contrast gain control and winner-take-all, both of which have been extensively studied in the context of important cortical processes. These processes and the MAX operation may well share a similar or overlapping neural substrate. From our mathematical and simulation analyses, it appears that each model is endowed with a variety of distinct computational properties, although they may be difficult to test in experimental settings. The biophysiological considerations for the various models are also complex and nontrivial. In general, in the absence of more detailed experimental data, the differences in the model behaviors do not lead to immediate support for one model over any other. This work gives rise to a number of potential directions for future theoretical research. One obvious extension of this work is to analyze neurophysiologically more detailed and more realistic models, which could involve more stochastic descriptions of network dynamics or biophysiologically more realistic neurons. For instance, software such as Neuron or Genesis could be used to simulate dendritic interactions and shunting conductances explicitly. Another interesting question is how any of these models may be learned through experience or wired up during development. Mechanisms similar to those proposed by Fukushima (1980) to explain learning in the Neocognitron may be explored. The design of these future studies is contingent on crucial experimental data that are not yet available. The simulation and mathematical results from our work give rise to a number of predictions, which can be used to guide future experimental investigations of the MAX operation.
Although the experimental demonstration of the different properties predicted by the models is nontrivial, this preliminary theoretical analysis should be a helpful first step in the preparation of more detailed neurophysiological experiments.
Appendix
The mathematical discussions here are intended to help explain and support the various simulation results presented in section 3. We are mainly interested in how well the networks satisfy the selectivity and linearity properties of the MAX operation and how they are affected by internal and external parameters, such as the magnitude of the nonmaximal inputs, the number of inputs, the strength of inhibition, and initial conditions. We also point out why z is a more robust reconstruction of xm than ym , the latter of which has been used in previous MAX models with similar architectures (Yuille & Grzywacz, 1989).
A.1 Divisive Feedforward Network. Given equation 2.1, and using a polynomial form of f, f(x) = x^q:

z = \sum_n y_n \approx \sum_{n=1}^{N} x_n \, \frac{x_n^q}{\sum_{k=1}^{N} x_k^q} = x_m \, \frac{1 + \sum_{n \neq m} r_n^{q+1}}{1 + \sum_{k \neq m} r_k^q} = x_m \Lambda,
where we have defined the ratio r_n \equiv x_n / x_m and

\Lambda \equiv \frac{1 + \sum_{n \neq m} r_n^{q+1}}{1 + \sum_{k \neq m} r_k^q}.

Note that 0 \leq \Lambda \leq 1, since 0 \leq r_n \leq 1. Simple calculus manipulation shows that \Lambda reaches its minimum (and z deviates maximally from x_m) when the ratios are

r_j = \frac{q \left( 1 + \sum_{k \neq m} r_k^{q+1} \right)}{(q + 1) \left( 1 + \sum_{k \neq m} r_k^q \right)},
a quantity independent of j. In other words, the selectivity of the model, that is, the ability of z to ignore the submaximal inputs, is most challenged when the x_n are identical. Identical nonmaximal inputs therefore provide a worst-case scenario for the FFN model, where z can be expressed as

z = x_m \, \frac{1 + (N - 1) r^{q+1}}{1 + (N - 1) r^q}

and y_m as

y_m = \frac{x_m}{1 + (N - 1) r^q}.
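The worst-case closed form can be checked against a direct evaluation of the FFN output (a sketch; parameter values are illustrative, and c is taken to be 0 so the expression is exact):

```python
# Direct FFN evaluation vs. the worst-case closed form
# z = x_m (1 + (N-1) r^(q+1)) / (1 + (N-1) r^q), with f(x) = x^q and c = 0.
def ffn_z(xs, q):
    pool = sum(x ** q for x in xs)               # sum_k f(x_k)
    return sum(x * (x ** q) / pool for x in xs)  # z = sum_n y_n

N, q, xm, r = 5, 4, 1.0, 0.8
xs = [xm] + [r * xm] * (N - 1)                   # one winner, identical rest
z_direct = ffn_z(xs, q)
z_formula = xm * (1 + (N - 1) * r ** (q + 1)) / (1 + (N - 1) * r ** q)
print(abs(z_direct - z_formula) < 1e-12)         # True
```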
Note that the dependence of z on N is minimal compared to that of y_m, since z has an extra O(N) term in the denominator to balance out the numerator (see Figure 5a). Notice also that the dependence of z on r is not monotonic: z ≈ 1 when r ≈ 0 or r ≈ 1, but is somewhat less for intermediate values of r, consistent with the simulation results in Figure 3a.

A.2 Divisive Feedback Network. As we discussed in section 3.1, there is an attractor at z = x_m, a phenomenon that is independent of q, N, or {r_n}. This is reflected in the robustness of the model's output in Figures 3b, 4b, and 5b. However, there also exist other attractors that do not approximate the MAX operation, such as the ones shown in Figure 6b. A closer look at the system of equations 3.1 makes the reason apparent: this set of equations is stable if any y_n ≈ x_n and all the other y_k = 0, k ≠ n. In particular, it means
that if y_n has been the active unit, then it may continue to dominate even if its input becomes smaller than those of its neighbors. Finally, let us note that under the special condition where all inputs are equal, x_m = x_n, there is a unique attractor where z = x_n and each y_n = z/N, as long as the initial conditions are not such that y_n = 0 for all n (the system also has a degenerate, unstable equilibrium at y_n = 0, ∀n).

A.3 Linear Threshold Network. The attractor giving rise to the ideal MAX operation, as described in section 3.1, requires [y_n]^+ = 0, ∀n ≠ m. Now we examine conditions under which this cannot be fulfilled. First, note that given equation 2.3, subtracting the expression for y_n from y_m gives the convergence
y_m - y_n = b_n \equiv x_m - x_n.

Thus, if b_n < x_m/(w + 1), then y_m - y_n < x_m/(w + 1), and it cannot be such that y_m = x_m/(w + 1) and y_n \leq 0. In this case, equation 3.2 still holds, and we have the following:

\sum_n [y_n]^+ = \sum_{j: y_j > 0} y_j = \sum_{j: y_j > 0} \Big( x_j - \sum_{k: y_k > 0} w y_k \Big) = x_m + \sum_{j: y_j > 0, \, j \neq m} x_j - Jw \sum_{k: y_k > 0} y_k.
Rearranging the terms,

(1 + Jw) \sum_{j: y_j > 0} y_j = x_m + \sum_{j \neq m} x_j = x_m + (J - 1) x_m - \sum_{j \neq m} b_j,
where J is the total number of units such that y_j > 0. Then we can express z as

z = (w + 1) \sum_{j: y_j > 0} y_j = \frac{Jw + J}{Jw + 1} x_m - \frac{w + 1}{Jw + 1} \sum_{j \neq m} b_j = x_m + \frac{w + 1}{Jw + 1} \sum_{j \neq m, \, y_j > 0} \Big( \frac{x_m}{w + 1} - b_j \Big).
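For uniform inputs, the expression above reduces to z = x_m (1 + (N − 1)/(Nw + 1)), which can be checked numerically. The sketch below assumes a linear-threshold relaxation dy_n/dt = −y_n + x_n − w Σ_{k: y_k>0} y_k (the update equation itself is not reproduced in this excerpt, so this form is an assumption consistent with the fixed-point conditions above; step size and iteration count are illustrative):

```python
# Numerical check of the LIN worst-case expression for uniform inputs,
# z = x_m * (1 + (N - 1) / (N * w + 1)).  The dynamics below are an assumed
# linear-threshold relaxation consistent with the fixed-point conditions.
N, w, xm, dt = 3, 10.0, 1.0, 0.01
x = [xm] * N
y = [0.0] * N
for _ in range(5000):                            # relax to the fixed point
    inhib = w * sum(max(yk, 0.0) for yk in y)    # shared inhibition over active units
    y = [yk + dt * (-yk + xk - inhib) for yk, xk in zip(y, x)]

z = (w + 1) * sum(max(yk, 0.0) for yk in y)      # z = (w + 1) * sum of active y_j
z_formula = xm * (1 + (N - 1) / (N * w + 1))
print(round(z, 6), round(z_formula, 6))          # both 1.064516
```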
We know b_n < x_m/(w + 1), so z is maximized if the b_j are minimized toward 0 and if J is maximized toward N (see Figure 3c). It is clear that using uniform inputs with decreasing b_n is a valid way of examining the network response as the inputs approach the worst-case scenario: z = x_m (1 + (N − 1)/(Nw + 1)), where the error factor is inversely proportional to w (see Figure 4c) and relatively independent of N (see Figure 5c). For identical and nonzero b_n = b, and large w,

z = x_m + \frac{(w + 1)(N - 1)}{Nw + 1} \Big( \frac{x_m}{w + 1} - b \Big) \approx x_m,
where the error term is roughly independent of N. We also have the equilibrium condition

(w + 1) y_m \approx \frac{x_m}{N} + \frac{N - 1}{N} w b,
indicating that y_m has an inverse dependence on N initially, but then becomes relatively independent of N as N becomes large.

A.4 Spiking Network. Because of the deterministic dynamics of equation 2.4, given identical initial conditions, y_m always fires before the other units can reach the threshold, and therefore y_m is the only unit that is active. In the degenerate case where y_n is identical for all inputs, all the units fire synchronously, and the overall firing rate, z, is N times higher (see Figure 3d). These behaviors are independent of the magnitude of w (see Figure 4d), as long as it is large enough to reset the membrane potentials, m_n, of all the nonfiring cells to 0 after each spike. If this is not the case, then the unit receiving the second largest input, x_{m2}, may have a greater membrane potential than m_m after unit m fires, m_{m2} > m_m, and may subsequently be the first to reach the firing threshold, as is the case in Figure 6d. In this way, the firing may alternate between two or more units, and the overall firing rate z increases as w decreases.

7 Acknowledgments
We thank Peter Dayan, Richard Hahnloser, and Maximilian Riesenhuber for helpful discussions. This work was supported by the Office of Naval Research under contract N00014-93-1-3085, the Office of Naval Research under contract N00014-95-1-0600, the National Science Foundation under contract IIS-9800032, and the National Science Foundation under contract DMS-9872936. A.Y. was also supported by the Massachusetts Institute of Technology under the UROP program, the National Science Foundation under the Graduate Research Fellowship Program, and the Gatsby Charitable Foundation under a grant
to the Gatsby Computational Neuroscience Unit. M.G. was also supported by the Deutsche Forschungsgemeinschaft, Honda R&D Americas, and the Deutsche Volkswagenstiftung.

References

Andersen, R. A., Bracewell, R. M., Barash, S., Gnadt, J. W., & Fogassi, L. (1990). Eye position effects on visual, memory, and saccade-related activity in areas LIP and 7a of macaque. Journal of Neuroscience, 10(4), 1176–1196.
Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Sciences (USA), 84(17), 6297–6301.
Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. Journal of Neurophysiology, 84, 909–926.
Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proceedings of the National Academy of Sciences (USA), 92, 3844–3848.
Borg-Graham, L., Monier, C., & Fregnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393, 367–373.
Carandini, M., & Ferster, D. (2000). Membrane potential and firing rate in cat primary visual cortex. Journal of Neurophysiology, 20(1), 470–484.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21), 8621–8644.
Connor, C. E., Preddie, D. C., Gallant, J. L., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. Journal of Neuroscience, 17(9), 3201–3214.
Doiron, B., Longtin, A., Berman, N., & Maler, L. (2001). Subtractive and divisive inhibition: Effect of voltage-dependent inhibitory conductances and noise. Neural Computation, 13(1), 227–248.
Ferster, D., & Jagadesh, B. (1992). EPSP-IPSP interactions in cat visual cortex studied with in vivo whole-cell patch recording. Journal of Neuroscience, 12(4), 1262–1274.
Fukai, T., & Tanaka, S. (1997). A simple neural network exhibiting selective activation of neural ensembles: From winner-takes-all to winner-shares-all. Neural Computation, 9, 77–97.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
Giese, M. A. (2000). Neural model for the recognition of biological motion. In G. Baratoff & H. Neumann (Eds.), Dynamische Perzeption (pp. 105–110). Berlin: Infix Verlag.
Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9(1), 1–13.
Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213–257.
Grzywacz, N. M., & Yuille, A. L. (1990). A model for the estimate of local image velocity by cells in the visual cortex. Proc. R. Soc. Lond. B, 239, 129–161.
Hahnloser, R. (1998). On the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11, 691–697.
Hahnloser, R., Douglas, R. J., Mahowald, M., & Hepp, K. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789), 947–951.
Hartline, H. K., & Ratliff, F. (1957). Spatial summation of inhibitory influences in the eye of Limulus, and the mutual interaction of receptor units. Journal of General Physiology, 41(5), 1049–1066.
Hatsopoulos, N., Gabbiani, F., & Laurent, G. (1995). Hysteresis reduction in proprioception using presynaptic shunting inhibition. Journal of Neurophysiology, 73(3), 1031–1042.
Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Computation, 9, 1001–1013.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Koch, C., & Poggio, T. (1987). Biophysics of computational systems: Neurons, synapses, and membranes. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic functions (pp. 637–697). New York: Wiley.
Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interactions in the dendritic tree: Localization, timing, and the role in information processing. Proceedings of the National Academy of Sciences (USA), 80, 2799–2802.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag.
Lampl, I., Riesenhuber, M., Poggio, T., & Ferster, D. (2001). The MAX operation in cells in the cat visual cortex. Society of Neuroscience Abstracts, 619, 30.
Lazzaro, J., Ryckenbusch, S., Mahowald, M. A., & Mead, C. A. (1989). Winner-take-all network of O(n) complexity. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 703–711). San Mateo, CA: Morgan Kaufmann.
Lee, D. K., Itti, L., Koch, C., & Braun, J. (1999). Attention activates winner-takes-all competition among visual filters. Nature Neuroscience, 2, 375–381.
Li, Z. (2000). Computational design and nonlinear dynamics of a recurrent network model of the primary visual cortex. Neural Computation, 15(8), 1749–1780.
Logothetis, N. K., Pauls, J., & Poggio, T. (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5, 552–563.
Motter, B. C. (1994). Neural correlates of attentive selection for color or luminance in extrastriate area V4. Journal of Neuroscience, 14(4), 2178–2189.
Naka, K. I., & Rushton, W. A. H. (1966). S-potential from luminosity units in the retina of fish (Cyprinidae). Journal of Physiology, 185, 587–599.
Nowlan, S. J., & Sejnowski, T. J. (1995). A selection model for motion processing in area MT of primates. Journal of Neuroscience, 15(2), 1195–1214.
Pasupathy, A., & Connor, C. E. (1999). Responses to contour features in macaque area V4. Journal of Neurophysiology, 82(5), 2490–2502.
Perrett, D. I., Oram, M. W., Harries, M. H., Bevan, R., Hietanen, J. K., Benson, P. J., & Thomas, S. (1991). Viewer-centred and object-centred coding of heads in the macaque temporal cortex. Experimental Brain Research, 86(1), 159–173.
Reichardt, W., Poggio, T., & Hausen, K. (1983). Figure-ground discrimination by relative movement in the visual system of the fly. Part II: Towards the neural circuitry. Biological Cybernetics, 46, 1–30.
Riesenhuber, M. K., & Poggio, T. (1999a). Are cortical models really bound by the "binding problem"? Neuron, 24, 87–99.
Riesenhuber, M. K., & Poggio, T. (1999b). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Riesenhuber, M. K., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3(Supp.), 1199–1204.
Rockland, K. S., & Lund, J. S. (1983). Intrinsic laminar lattice connections in primate visual cortex. Journal of Comparative Neurology, 216, 303–318.
Sakai, K., & Tanaka, S. (1997). A spatial pooling mechanism underlies the second-order structure of V1 complex cell receptive fields: A computational analysis. Soc. Neurosci. Abs., 23, 453.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences (USA), 93, 11956–11961.
Sato, T. (1989). Interactions of visual stimuli in the receptive fields of inferior temporal neurons in awake monkeys. Experimental Brain Research, 77, 23–30.
Simoncelli, E. P., & Heeger, D. J. (1998). A model of neural responses in visual area MT. Vision Research, 38, 743–761.
Starzyk, J. A., & Fang, X. (1993). CMOS current mode winner-take-all circuit with both excitatory and inhibitory feedback. Electronics Letters, 29(10), 908–910.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proceedings of the Royal Society London B, 202, 409–416.
Treue, S., & Maunsell, J. H. L. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
Van Opstal, A. J., Hepp, K., Suzuki, Y., & Henn, V. (1995). Influence of eye position on activity in monkey superior colliculus. Journal of Neurophysiology, 74(4), 1593–1610.
White, E. L. (1989). Cortical circuits. Boston: Birkhauser.
Wilson, H. R., & Humanski, R. (1993). Spatial frequency adaptation and contrast gain control. Vision Research, 33, 1133–1149.
Yu, A. J., Giese, M. A., & Poggio, T. (2000a). Neural circuits for the realization of the maximum operation. Society of Neuroscience Abstracts, 26, 1202.
Yu, A. J., Giese, M. A., & Poggio, T. (2000b). Neural mechanisms for the realization of maximum operations. European Journal of Neuroscience, 12, 74.
Yuille, A. L., & Grzywacz, N. M. (1989). A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Computation, 1, 334–347.

Received September 18, 2001; accepted May 10, 2002.
LETTER
Communicated by Haim Sompolinsky
Self-Regulation Mechanism of Temporally Asymmetric Hebbian Plasticity

Narihisa Matsumoto
[email protected]
Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan, and RIKEN Brain Science Institute, Saitama 351-0198, Japan

Masato Okada
[email protected]
RIKEN Brain Science Institute, Saitama 351-0198, Japan

Recent biological experimental findings have shown that synaptic plasticity depends on the relative timing of the pre- and postsynaptic spikes. This determines whether long-term potentiation (LTP) or long-term depression (LTD) is induced. This synaptic plasticity has been called temporally asymmetric Hebbian plasticity (TAH). Many authors have numerically demonstrated that neural networks are capable of storing spatiotemporal patterns. However, the mathematical mechanism of the storage of spatiotemporal patterns is still unknown, and the effect of LTD is particularly unknown. In this article, we employ a simple neural network model and show that interference between LTP and LTD disappears in a sparse coding scheme. On the other hand, the covariance learning rule is known to be indispensable for the storage of sparse patterns. We also show that TAH has the same qualitative effect as the covariance rule when spatiotemporal patterns are embedded in the network.

1 Introduction
Recent biological experimental findings have indicated that synaptic plasticity depends on the relative timing of the pre- and postsynaptic spikes. This determines whether long-term potentiation (LTP) or long-term depression (LTD) is induced (Bi & Poo, 1998; Markram, Lübke, Frotscher, & Sakmann, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998). LTP occurs when presynaptic firing precedes postsynaptic firing by no more than about 20 ms. In contrast, LTD occurs when presynaptic firing follows postsynaptic firing. The transition between LTP and LTD takes place within a few milliseconds. A learning rule of this type is called temporally asymmetric Hebbian learning (TAH) (Abbott & Song, 1999; Rubin, Lee, & Sompolinsky, 2001) or spike-timing-dependent synaptic plasticity (STDP) (Song, Miller, & Abbott, 2000). Many authors have numerically demonstrated that neural networks are capable of

Neural Computation 14, 2883–2902 (2002) © 2002 Massachusetts Institute of Technology
storing spatiotemporal patterns (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999; Munro & Hernandez, 2000; Song et al., 2000; Rao & Sejnowski, 2000; Yoshioka, 2002). Song et al. (2000) have used TAH in discussing the variability of spike generation in a network that consists of spiking neurons. They found that the area of LTD has to be slightly larger than that of LTP for stability. A precise balance between LTP and LTD is crucial. Yoshioka (2002) has used TAH in discussing an associative memory network that consists of spiking neurons. He found that stable retrieval requires that the areas of LTP and LTD be equal. Munro and Hernandez (2000) have numerically shown that LTD makes the retrieval of spatiotemporal patterns from a network possible even in a noisy environment. However, they did not discuss why TAH is effective in terms of the storage and retrieval of spatiotemporal patterns. Since both LTP and LTD are effects of TAH, the interference between LTP and LTD may prevent the retrieval of patterns. To investigate the unknown mathematical mechanism for retrieval, we employ an associative memory network that consists of binary neurons. Simplifying the dynamics of the internal potential enables us to analyze the details of the retrieval process. We define a learning rule that is similar to the formulations used in previous work. We show the mechanism that makes the retrieval of spatiotemporal patterns possible in this network. The storage of spatiotemporal patterns in associative memory networks by the covariance learning rule has been the subject of a lot of research (Chechik, Meilijson, & Ruppin, 1999; Kitano & Aoyagi, 1998). A number of biological findings have implied that sparse coding schemes may be used in the brain (Miyashita, 1988). The covariance rule is well known to be indispensable when sparse patterns are embedded in a network as attractors (Amari, 1989; Okada, 1996).
The information on the firing rate of the stored patterns is not indispensable to TAH, although it is indispensable to the covariance rule. We theoretically show that TAH has the same qualitative effect as the covariance rule when the spatiotemporal patterns are embedded in the network. This means that differences between spike times induce LTP or LTD, and this spike time difference is capable of canceling out the effect of the firing-rate information. We conclude that this is why information on the firing rate is not required for the storage of patterns by TAH. We introduce our model for the storage of spatiotemporal patterns in section 2. In section 3, we discuss the mechanism by which the spatiotemporal patterns are stored and retrieved and discuss the basis for TAH having the same effect as the covariance learning rule. In section 4, we investigate the properties of our model. In section 5, we discuss our model. The appendix presents a detailed derivation of the statistical neurodynamical equations of the model.

2 Model
We investigate a network that consists of N binary neurons that are mutually connected (see Figure 1). In this article, we consider the case where N → ∞.
Figure 1: An associative memory network that stores spatiotemporal patterns. x(t) and x(t+1) are the network states at times t and t+1, respectively. J_ij is the synaptic weight from the jth neuron to the ith neuron and is independent of time. The network consists of N binary neurons, that is, each state is {0, 1}.
We employ a neuronal model with binary state {0, 1}. We define discrete time steps and the following rule for synchronous updating:

u_i(t) = \sum_{j=1}^{N} J_{ij} x_j(t),   (2.1)

x_i(t + 1) = H(u_i(t) - \theta),   (2.2)

H(u) = \begin{cases} 1, & u \geq 0, \\ 0, & u < 0, \end{cases}   (2.3)
where x_i(t) is the state of the ith neuron at time t, u_i(t) is the internal potential of that neuron, and θ is a uniform threshold. If the ith neuron fires at t, its state is x_i(t) = 1; otherwise, x_i(t) = 0. The specific value of the threshold is discussed later. J_ij is the synaptic weight from the jth neuron to the ith neuron. Each element ξ_i^μ of the μth memory pattern ξ^μ = (ξ_1^μ, ξ_2^μ, ..., ξ_N^μ) is generated independently by

Prob[ξ_i^μ = 1] = 1 - Prob[ξ_i^μ = 0] = f.   (2.4)
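Drawing patterns according to equation 2.4 can be sketched as follows (N, p, and f below are illustrative values; the empirical mean firing rate concentrates near f):

```python
# Generating sparse binary memory patterns per eq. 2.4: Prob[xi = 1] = f.
import random

random.seed(0)
N, p, f = 1000, 5, 0.1
patterns = [[1 if random.random() < f else 0 for _ in range(N)]
            for _ in range(p)]
mean_rate = sum(map(sum, patterns)) / (N * p)
print(mean_rate)  # close to f = 0.1
```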
The expectation of ξ_i^μ is E[ξ_i^μ] = f, and thus f is considered to be the mean firing rate of the memory pattern. The memory pattern is sparse when f → 0, and this coding scheme is called sparse coding. The synaptic weight J_ij follows the form of the synaptic plasticity, which depends on the difference in spike times between the ith (post-) and jth (pre-) neurons. The difference determines whether LTP or LTD is induced. A learning rule of this type is called temporally asymmetric Hebbian learning (TAH) or spike-timing-dependent synaptic plasticity (STDP). The biological experimental findings
Figure 2: Temporally asymmetric Hebbian plasticity. (a) The result of biological findings (Zhang et al., 1998). (b) The learning rule in our model. LTP is induced when the jth neuron fires one time step before the ith neuron. LTD is induced when the jth neuron fires one time step after the ith one. The synaptic weight J_ij follows this form of TAH.
indicate that LTP or LTD is induced when the difference in the pre- and postsynaptic spike times falls within about 20 ms (Zhang et al., 1998; see Figure 2a). We define a single time step in equations 2.1 through 2.3 as 20 ms for Figure 2a, and durations within 20 ms are ignored (see Figure 2b). Figure 2b shows that LTP is induced when the jth neuron fires one time step before the ith neuron, ξ_i^{μ+1} = ξ_j^μ = 1, while LTD is induced when the jth neuron fires one time step after the ith neuron, ξ_i^{μ−1} = ξ_j^μ = 1.

Figure 3: The stored pattern sequence. Each pattern is retrieved periodically. The patterns retrieved are ξ^1 at time 1, ξ^2 at time 2, and ξ^1 at time p+1. In short, ξ_i^{p+1} = ξ_i^1 and ξ_i^0 = ξ_i^p.

Previous work has indicated the significance of the balance between LTP and LTD (Song et al., 2000). Therefore, we define the area of LTP to be the same as that of LTD, and the amplitude of LTP to be the same as that of LTD. On the basis of these definitions, we employ the following learning rule:

J_{ij} = \frac{1}{Nf(1-f)} \sum_{\mu=1}^{p} (\xi_i^{\mu+1}\xi_j^\mu - \xi_i^{\mu-1}\xi_j^\mu).   (2.5)
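As a concrete illustration of equations 2.1 through 2.5, the following sketch (our own illustration, not the authors' code; N, p, f, and the fixed threshold θ = 0.5 are example values) builds the TAH weight matrix for a small network and performs one synchronous update starting from the first pattern:

```python
import numpy as np

# Illustrative sketch of equations 2.1-2.5 (example values, not the authors' code).
rng = np.random.default_rng(0)
N, p, f, theta = 1000, 5, 0.1, 0.5

# p sparse binary patterns with Prob[xi_i^mu = 1] = f (equation 2.4).
X = (rng.random((p, N)) < f).astype(float)

# Periodic boundary condition: xi^{p+1} = xi^1 and xi^0 = xi^p.
X_next = np.roll(X, -1, axis=0)   # xi^{mu+1}
X_prev = np.roll(X, 1, axis=0)    # xi^{mu-1}

# Equation 2.5: J_ij = 1/(N f (1-f)) sum_mu (xi_i^{mu+1} - xi_i^{mu-1}) xi_j^mu.
J = (X_next - X_prev).T @ X / (N * f * (1 - f))

def overlap(xi, x):
    # m = 1/(N f (1-f)) sum_i (xi_i - f) x_i, the overlap used in equation 3.3.
    return (xi - f) @ x / (N * f * (1 - f))

x = X[0]                          # start exactly at pattern xi^1
u = J @ x                         # equation 2.1
x = (u >= theta).astype(float)    # equations 2.2-2.3 with a fixed threshold
m2 = overlap(X[1], x)             # overlap with the next pattern xi^2
print(m2)
```

Because a neuron with ξ_i^2 = ξ_i^0 = 1 receives a canceled potential (the LTD interference discussed in section 3), the measured overlap comes out near 1 − f rather than 1.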
The number of memory patterns is p = αN, where α is defined as the loading rate. There is a critical value α_C of the loading rate, such that a loading rate larger than α_C makes retrieval of the pattern sequence unstable. α_C is called the storage capacity. Previous work has shown that the learning method described by equation 2.5 is capable of storing spatiotemporal patterns, that is, pattern sequences (Munro & Hernandez, 2000; Rao & Sejnowski, 2000). We show that a sequence of p memory patterns is retrieved, that is, ξ^1 → ξ^2 → ⋯ → ξ^p → ξ^1 → ⋯ (see Figure 3). In other words, ξ^1 is retrieved at t = 1, ξ^2 is retrieved at t = 2, and ξ^1 is retrieved at t = p+1. Here, we discuss the value of the threshold θ. The threshold value is well known to require time-dependent control according to the progress
of the retrieval process (Amari, 1989; Okada, 1996). One candidate algorithm for controlling the threshold value is to maintain the mean firing rate of the network at that of the memory pattern, f, as follows:

f = \frac{1}{N} \sum_{i=1}^{N} x_i(t) = \frac{1}{N} \sum_{i=1}^{N} H(u_i(t) - \theta(t)).   (2.6)
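One simple way to realize the control rule of equation 2.6 in a simulation (a sketch under our own assumption; the paper does not prescribe a particular algorithm) is to choose θ(t) so that exactly a fraction f of the neurons satisfies u_i(t) ≥ θ(t):

```python
import numpy as np

# Sketch of one way to implement the threshold control of equation 2.6
# (our assumption, not the authors' algorithm): pick theta(t) as the k-th
# largest internal potential so that exactly k = f*N neurons fire.
rng = np.random.default_rng(1)
N, f = 1000, 0.1

u = rng.normal(0.0, 1.0, size=N)   # stand-in internal potentials u_i(t)
k = int(round(f * N))              # number of neurons that should fire
theta = np.sort(u)[::-1][k - 1]    # the k-th largest potential
x = (u >= theta).astype(float)     # equation 2.2

print(x.mean())                    # mean firing rate of the network
```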
The threshold value obtained is well known to be nearly optimal, since it gives an approximately maximal value for the storage capacity (Okada, 1996).

3 Theory
Neural network models for the storage and retrieval of sequential patterns by TAH have been discussed by many authors (Gerstner et al., 1996; Kempter et al., 1999; Munro & Hernandez, 2000; Rao & Sejnowski, 2000), who have numerically shown that TAH is effective for storing patterns. For example, Munro and Hernandez (2000) showed that their model provides the retrieval of stored patterns even in noisy environments. However, the reasons for the effectiveness of TAH have not been addressed in previous work. Exploring this mechanism is the main purpose of our article. Here, we discuss the mechanisms by which a network trained with TAH can store and retrieve sequential patterns. Before providing details of this retrieval process, we discuss the simple situation where there are very few memory patterns relative to the number of neurons: p ∼ O(1). Let the state at time t be equal to the tth memory pattern: x(t) = ξ^t. The internal potential u_i(t) of equation 2.1 is then given by

u_i(t) = \xi_i^{t+1} - \xi_i^{t-1}.   (3.1)

u_i(t) depends on two independent random variables, ξ_i^{t+1} and ξ_i^{t−1}, according to equation 3.1. The first term, ξ_i^{t+1}, is the signal term for the recall of the pattern ξ^{t+1}, which is intended to be retrieved at time t+1, and the second term, ξ_i^{t−1}, may interfere with the retrieval of ξ^{t+1}. According to equation 3.1, u_i(t) takes a value of 0, −1, or +1 (see Table 1). As stated above, ξ_i^{t−1} = 1 is the case where interference of LTD appears. If the threshold θ(t) is set between 0 and +1, ξ_i^{t+1} = 0 is not affected by the interference of ξ_i^{t−1} = 1. When ξ_i^{t+1} = 1 and ξ_i^{t−1} = 1, the interference does affect the retrieval of ξ^{t+1}. We consider the probability distribution of the internal potential u_i(t) to examine how the interference of LTD affects the retrieval of ξ^{t+1}. The probability of ξ_i^{t+1} = 1 and ξ_i^{t−1} = 1 is f², that of ξ_i^{t+1} = 1 and ξ_i^{t−1} = 0 is f − f², that of ξ_i^{t+1} = 0 and ξ_i^{t−1} = 1 is f − f², and that of ξ_i^{t+1} = 0 and ξ_i^{t−1} = 0 is (1 − f)². The probability distribution of u_i(t) is given by (see Figure 4a)

\mathrm{Prob}(u_i(t)) = (f - f^2)\,\delta(u_i(t) - 1) + (1 - 2f + 2f^2)\,\delta(u_i(t)) + (f - f^2)\,\delta(u_i(t) + 1).   (3.2)
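The three-point distribution in equation 3.2 can be checked by direct sampling (an illustrative sketch; f and the sample size are example values):

```python
import numpy as np

# Sampling check of equation 3.2: u_i(t) = xi_i^{t+1} - xi_i^{t-1} takes the
# values +1, 0, -1 with probabilities f - f^2, 1 - 2f + 2f^2, and f - f^2.
rng = np.random.default_rng(2)
f, n = 0.1, 200_000

xi_next = (rng.random(n) < f).astype(int)   # xi_i^{t+1}
xi_prev = (rng.random(n) < f).astype(int)   # xi_i^{t-1}
u = xi_next - xi_prev

p_plus, p_zero, p_minus = np.mean(u == 1), np.mean(u == 0), np.mean(u == -1)
print(p_plus, p_zero, p_minus)   # approx. 0.09, 0.82, 0.09 for f = 0.1
```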
Table 1: When (ξ_i^{t+1}, ξ_i^{t−1}) = (1, 0), u_i(t) = +1, and when (ξ_i^{t+1}, ξ_i^{t−1}) = (0, 1), u_i(t) = −1. Other than in these conditions, u_i(t) = 0.

ξ_i^{t+1} \ ξ_i^{t−1}     0      1
0                         0     −1
1                        +1      0

Note: ξ_i^{t−1} = 1 is the case where interference of LTD appears.
If the threshold θ(t) is set between 0 and +1, the state x_i(t+1) is 1 with a probability of f − f² and 0 with a probability of 1 − f + f². The overlap between the state x(t+1) and the memory pattern ξ^{t+1} is given by

m^{t+1}(t+1) = \frac{1}{Nf(1-f)} \sum_{i=1}^{N} (\xi_i^{t+1} - f)\, x_i(t+1) = 1 - f.   (3.3)
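The overlap value 1 − f in equation 3.3 can likewise be verified by sampling (again an illustrative sketch with example values):

```python
import numpy as np

# Sampling check of equation 3.3: with the threshold between 0 and +1, a
# neuron fires at t+1 only when xi_i^{t+1} = 1 and xi_i^{t-1} = 0, so the
# overlap with xi^{t+1} approaches 1 - f.
rng = np.random.default_rng(3)
f, n = 0.1, 200_000

xi_next = (rng.random(n) < f).astype(float)
xi_prev = (rng.random(n) < f).astype(float)
u = xi_next - xi_prev                       # equation 3.1
x_new = (u >= 0.5).astype(float)            # threshold between 0 and +1

m = (xi_next - f) @ x_new / (n * f * (1 - f))
print(m)   # close to 1 - f = 0.9
```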
In a sparse limit, f → 0, the probability of ξ_i^{t+1} = 1 and ξ_i^{t−1} = 1 approaches 0. This means that the interference of the LTD disappears in this case, and the model is able to retrieve the next pattern ξ^{t+1}. The overlap m^{t+1}(t+1) then approaches 1. We now discuss whether information on the firing rate is indispensable to TAH. To investigate this, we consider the case where there is a large number of memory patterns: p ∼ O(N). Using equation 3.3, the internal potential u_i(t) of the ith neuron at time t is represented as

u_i(t) = (\xi_i^{t+1} - \xi_i^{t-1})\, m^t(t) + z_i(t),   (3.4)

z_i(t) = \sum_{\mu \neq t} (\xi_i^{\mu+1} - \xi_i^{\mu-1})\, m^\mu(t).   (3.5)

z_i(t) is called the cross-talk noise; it represents the contributions of the nontarget patterns and prevents the retrieval of the target pattern, ξ^{t+1}. This term disappears in the finite loading case, p ∼ O(1). The covariance learning rule,

J_{ij} = \frac{1}{Nf(1-f)} \sum_{\mu=1}^{p} (\xi_i^{\mu+1} - f)(\xi_j^\mu - f),   (3.6)
is indispensable when sparse patterns are embedded in a network as attractors (Amari, 1989; Okada, 1996). Under sparse coding schemes, unless the covariance rule is employed, the cross-talk noise does diverge in the large
Figure 4: The probability distribution of the ith neuronal potential u_i(t) at time t. (a) p ∼ O(1). (b) p ∼ O(N). In a, there are three delta-function distributions. If the threshold θ(t) is set between 0 and +1 at time t, the pattern ξ^{t+1} is retrievable. In b, each distribution follows a gaussian distribution; the respective means are −1, 0, and +1, and the variance of each is σ²(t). Adjusting the threshold to an appropriate value is thus essential.
N limit. Storage of the patterns thus becomes impossible. The information on the firing rate of the stored patterns is not indispensable to TAH, while it is indispensable to the covariance rule. We used the method of statistical neurodynamics (Amari & Maginu, 1988; Okada, 1995) to examine whether the variance of the cross-talk noise diverges. If it is possible to store a pattern sequence, the cross-talk noise follows a gaussian distribution with a mean of 0 and time-dependent variance σ²(t). Otherwise, σ²(t) diverges. The probability distribution of u_i(t) is shown in Figure 4b. Since σ²(t) changes over time, it is necessary to control the threshold to an appropriate value at each time step (Amari, 1989; Okada, 1996). We then apply statistical neurodynamics to obtain recursive equations for the overlap m^t(t) between the network
state x(t) and the target pattern ξ^t and the variance σ²(t) of the cross-talk noise. The details of the derivation are given in the appendix. Here, we show the recursive equations for m^t(t) and σ²(t):

m^t(t) = \frac{1-2f}{2}\,\mathrm{erf}(w_0) - \frac{1-f}{2}\,\mathrm{erf}(w_1) + \frac{f}{2}\,\mathrm{erf}(w_2),   (3.7)

\sigma^2(t) = \sum_{a=0}^{t} {}_{2(a+1)}C_{a+1}\, \alpha\, q(t-a) \prod_{b=1}^{a} U^2(t-b+1),   (3.8)

U(t) = \frac{1}{\sqrt{2\pi}\,\sigma(t-1)} \{ (1-2f+2f^2)\, e^{-w_0^2} + f(1-f)(e^{-w_1^2} + e^{-w_2^2}) \},   (3.9)

q(t) = \frac{1}{2}\left(1 - (1-2f+2f^2)\,\mathrm{erf}(w_0) - f(1-f)(\mathrm{erf}(w_1) + \mathrm{erf}(w_2))\right),   (3.10)

where

\mathrm{erf}(y) = \frac{2}{\sqrt{\pi}} \int_0^y \exp(-u^2)\, du, \qquad {}_{b}C_{a} = \frac{b!}{a!(b-a)!}, \qquad a! = a \times (a-1) \times \cdots \times 1,

w_0 = \frac{\theta(t-1)}{\sqrt{2}\,\sigma(t-1)}, \qquad w_1 = \frac{-m^{t-1}(t-1) + \theta(t-1)}{\sqrt{2}\,\sigma(t-1)}, \qquad w_2 = \frac{m^{t-1}(t-1) + \theta(t-1)}{\sqrt{2}\,\sigma(t-1)}.
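The single-step overlap update (3.7) can be cross-checked against a direct Monte Carlo simulation of the signal-plus-gaussian-noise model (a sketch with example values of f, m^{t−1}(t−1), σ(t−1), and θ(t−1); not the authors' code):

```python
import math
import numpy as np

# Monte Carlo cross-check of equation 3.7: one thresholding step of
# u = (xi^t - xi^{t-2}) m + sigma z - theta, with xi ~ Bernoulli(f) and
# z ~ N(0,1), should reproduce the erf expression.
rng = np.random.default_rng(4)
f, m, sigma, theta, n = 0.1, 0.8, 0.3, 0.52, 500_000

w0 = theta / (math.sqrt(2) * sigma)
w1 = (-m + theta) / (math.sqrt(2) * sigma)
w2 = (m + theta) / (math.sqrt(2) * sigma)
analytic = ((1 - 2 * f) / 2 * math.erf(w0)
            - (1 - f) / 2 * math.erf(w1)
            + f / 2 * math.erf(w2))

xi_t = (rng.random(n) < f).astype(float)     # target component xi^t
xi_old = (rng.random(n) < f).astype(float)   # xi^{t-2}
z = rng.normal(size=n)                       # gaussian cross-talk noise
x = ((xi_t - xi_old) * m + sigma * z - theta >= 0).astype(float)
empirical = (xi_t - f) @ x / (n * f * (1 - f))

print(analytic, empirical)   # the two estimates should agree closely
```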
These equations reveal that the variance σ²(t) of the cross-talk noise does not diverge as long as a pattern sequence is retrievable. This result means that TAH has the same qualitative effect as the covariance rule. Next, we discuss the mechanism by which the variance of the cross-talk noise does not diverge. Let us consider equation 2.5. The synaptic weight J_ij from the jth neuron to the ith neuron can also be written in the following way:

J_{ij} = \frac{1}{Nf(1-f)} \sum_{\mu=1}^{p} (\xi_i^{\mu+1}\xi_j^\mu - \xi_i^{\mu-1}\xi_j^\mu)   (3.11)

      = \frac{1}{Nf(1-f)} \sum_{\mu=1}^{p} (\xi_i^{\mu}\xi_j^{\mu-1} - \xi_i^{\mu}\xi_j^{\mu+1})   (3.12)

      = \frac{1}{Nf(1-f)} \sum_{\mu=1}^{p} \xi_i^{\mu} (\xi_j^{\mu-1} - \xi_j^{\mu+1})   (3.13)

      = \frac{1}{Nf(1-f)} \sum_{\mu=1}^{p} \xi_i^{\mu} \{(\xi_j^{\mu-1} - f) - (\xi_j^{\mu+1} - f)\}.   (3.14)
This equation implies that TAH includes the information on the firing rate of the memory patterns when spatiotemporal patterns are embedded in a network. Therefore, the variance of the cross-talk noise does not diverge, and this is another factor that enables the network to store and retrieve a pattern sequence. We conclude that the differences in spike timing induce
Figure 5: The critical overlap (the lower line) and the overlap in the steady state (the upper line). The dashed line shows the normalized firing rate of the network when the firing rate of stored patterns is 0.1. The threshold is 0.52, and the number of neurons is 5000. The data points indicate the median values, and both ends of the error bars indicate the 1/4 and 3/4 deviations, respectively. They are obtained in 11 computer simulation trials. The storage capacity is 0.27.
LTP or LTD and that the effect of the firing-rate information may be canceled out by the differences in spike timing.

4 Results
We used the statistical neurodynamics and computer simulation to investigate the properties of our model and examine its behavior under the following two conditions: a fixed threshold and a time-dependent threshold. Figure 5 shows the dependence of the overlap m^t(t) and the mean firing rate of the network, x̄(t) = (1/N) Σ_{i=1}^{N} x_i(t), on the loading rate α when the mean firing rate of the memory pattern is f = 0.1 and the threshold is θ = 0.52, which is optimized to maximize the storage capacity. The upper line denotes the steady-state values of the overlap m^t(t) in retrieval of the pattern sequence. m^t(t) is obtained by setting the initial state of the network at the first memory pattern: x(1) = ξ^1. In this case, the storage capacity is α_C = 0.27. The lower line indicates the dependence of the critical initial overlap m_C on α. m_C is obtained by setting the initial state of the network at ξ^1 with additional noise. The method of adding noise is that 100s% of the minority components
Figure 6: Storage capacity for N = 50,000. The parameter values are set at the same values as in the N = 5000 case of Figure 5. The lines of the theoretical results are identical with those of Figure 5. The storage capacity of the simulations approached the analytical result compared with that for N = 5000 in Figure 5. This fact implies that the discrepancy between the storage capacity of the computer simulations and the analytical result is generated by the finite-size effect of the computer simulations.
(x_i(1) = 1) are flipped, while the same number of majority components (x_i(1) = 0) are also flipped (Okada, 1996). Thus, the mean firing rate of the network is kept equal to that of the memory pattern, which is f. The initial overlap m^1(1) is given as 1 − 2s/(1 − f). The stored pattern sequence is retrievable when the initial overlap m^1(1) is greater than the critical value m_C. Therefore, the lower line represents the basin of attraction for the retrieval of the pattern sequence. The dashed line shows the steady-state values of the normalized mean firing rate of the network, x̄(t)/f, for the pattern sequence. The data points and error bars indicate the results of computer simulation of 11 trials with 5000 neurons (N = 5000), with the former indicating median values and the latter indicating the 1/4 and 3/4 deviations. A discrepancy between the values of m_C obtained by the computer simulations and the theory exists. To investigate the cause of the discrepancy, we perform computer simulations with 50,000 neurons and show the results in Figure 6. The data points and error bars indicate the results of computer simulations of 11 trials. The parameter values are set at the same values as in the N = 5000 case of Figure 5. The lines of the theoretical result are identical with those of Figure 5. The storage
Figure 7: The probability distribution of the internal potential in the steady state at α = 0.1. The histogram is obtained by computer simulation with 5000 neurons, while the solid line is obtained by the statistical neurodynamics. Each probability distribution follows a gaussian distribution.
capacity of the simulations approached the analytical result compared to that of Figure 5. This fact implies that the discrepancy between the storage capacity of the computer simulations and the analytical result is generated by the finite-size effect of the computer simulations. Figure 7 shows the probability distribution of the internal potentials in the steady state for α = 0.1. The histogram is obtained by computer simulation with N = 5000, and the solid line denotes the theoretical result of the statistical neurodynamics. Since the results of the computer simulation and the statistical neurodynamics coincide, we give only the results obtained by the statistical neurodynamics. Next, we examine the application of the threshold control scheme given as equation 2.6, where the threshold is controlled to maintain the mean firing rate of the network at f. q(t) in equation 3.10 is equal to the mean firing rate, because q(t) = (1/N) Σ_{i=1}^{N} (x_i(t))² = (1/N) Σ_{i=1}^{N} x_i(t) under the condition x_i(t) ∈ {0, 1}. Thus, the threshold is adjusted to satisfy the following equation:

f = q(t) = \frac{1}{2}\left(1 - (1-2f+2f^2)\,\mathrm{erf}(w_0) - f(1-f)(\mathrm{erf}(w_1) + \mathrm{erf}(w_2))\right).   (4.1)
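Because the right-hand side of equation 4.1 decreases monotonically from 1 to 0 as θ grows, the threshold can be found numerically by bisection. The following sketch (our illustration, with example values of f, m, and σ) shows one way to do this:

```python
import math

# Sketch of solving equation 4.1 for the threshold (our illustration, not the
# authors' procedure): q(theta) decreases monotonically from 1 to 0, so the
# condition q(theta) = f can be solved by bisection.
f, m, sigma = 0.1, 0.9, 0.3

def q(theta):
    w0 = theta / (math.sqrt(2) * sigma)
    w1 = (-m + theta) / (math.sqrt(2) * sigma)
    w2 = (m + theta) / (math.sqrt(2) * sigma)
    return 0.5 * (1 - (1 - 2 * f + 2 * f * f) * math.erf(w0)
                  - f * (1 - f) * (math.erf(w1) + math.erf(w2)))

lo, hi = -5.0, 5.0          # q(lo) is near 1, q(hi) is near 0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if q(mid) > f:
        lo = mid            # firing rate still too high: raise the threshold
    else:
        hi = mid
theta = 0.5 * (lo + hi)
print(theta, q(theta))      # q(theta) is driven to f
```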
Another candidate for the threshold control scheme is to maintain the activity at f − f², which is the firing rate of the signal term in equation 3.4: Prob(u_i(t) = 1) = f − f². In this case, the left-hand side f of equation 4.1 is simply replaced by f − f². Figure 8 shows the overlap m^t(t) as a function of the loading rate α for both cases; in Figure 8a the activity is maintained at f, and in Figure 8b at f − f², with f = 0.1. The respective storage capacities are α_C = 0.234 in Figure 8a and α_C = 0.2597 in Figure 8b. The storage capacity under condition a is larger than under condition b, and both basins become larger than under the fixed-threshold condition, that is, θ = 0.52 (see Figure 5). The network is thus made robust against noise. This means that even if the initial state x(1) is different from the first memory pattern ξ^1 (that is, the state involves a lot of noise), the pattern sequence is retrievable.

Figure 8: The critical overlap (the lower line) and the overlap in the steady state (the upper line) when the threshold is varied over time to maintain the mean firing rate of the network at (a) f and (b) f − f². The dashed lines in both figures show the normalized firing rate of the network when the firing rate of stored patterns is 0.1. The respective storage capacities are α_C = 0.234 in a and α_C = 0.2597 in b. Both basins of attraction are larger than under the fixed-threshold condition in Figure 5.

Finally, we discuss the dependence of the storage capacity on the firing rate f of the memory pattern. The storage capacity is known to diverge as 1/(f|log f|) in a sparse limit, f → 0 (Tsodyks & Feigel'man, 1988; Perez-Vicente & Amit, 1989). Therefore, we investigate the asymptotic property of the storage capacity in the sparse limit. Figure 9 shows the dependence of the storage capacity on the firing rate when the threshold is controlled to maintain the network activity at f (symbol ○) and at f − f² (symbol +). As f decreases, the storage capacity in the case where activity is maintained at f approaches that for activity maintained at f − f². The storage capacity diverges as 1/(f|log f|) in the sparse limit.

Figure 9: The storage capacity α_C as a function of f in the cases where activity is maintained at f (○) and at f − f² (+). As f decreases, the storage capacity for activity maintained at f approaches that for activity maintained at f − f². The capacity diverges as 1/(f|log f|) in a sparse limit.

5 Discussion
We have used a simple neural network model to examine the mechanism by which TAH makes it possible for a pattern sequence to be stored and
retrieved. First, we showed that the interference between LTP and LTD disappears in a sparse coding scheme. This is one factor that enables the network to store and retrieve a pattern sequence. Next, we showed the mechanism by which TAH has the same qualitative effect as the covariance learning rule by analyzing the stability of the stored pattern sequence and the retrieval process by means of the statistical neurodynamics. As a consequence, the variance of the cross-talk noise does not diverge, and this is another factor that enables the network to store and retrieve a pattern sequence. We conclude that the differences in spike timing induce LTP or LTD and that the effect of the firing-rate information may be canceled out by the differences in spike timing. We investigated the properties of our model. To improve the retrieval property of the basin of attraction, we introduced a threshold control algorithm that adjusts the threshold value to maintain the mean firing rate of the network at that of the memory pattern. We found that this scheme enlarged the basin of attraction and made the network more robust against noise. We also found that the storage capacity diverged as 1/(f|log f|) in a sparse limit, f → 0.

Here, we compare the storage capacity of our model with that of the model where the covariance learning rule of equation 3.6 is used. A pattern sequence that consists of p memory patterns was stored according to this rule. The dynamical equations for this model have been derived by Kitano and Aoyagi (1998). We calculate the storage capacity α_C^{COV} from their dynamical equations and compare it with that of our model, α_C^{TAH}, by obtaining the ratio α_C^{TAH}/α_C^{COV}. The threshold control method is described earlier in this article. As f decreases, the ratio approaches 0.5 (see Figure 10). The near-halving of the storage capacity for our model is because of the contribution of LTD. Therefore, in terms of storage capacity, the covariance learning rule is better than TAH. However, the information on the firing rate is indispensable to the covariance rule. In biological systems, it seems to be difficult to obtain the information on the firing rate. Therefore, TAH is more biologically plausible than the covariance rule.

In our analysis, TAH was formulated as equation 2.5 so that the area of LTP was the same as that of LTD. Next, we investigated the balance between LTP and LTD. We considered the case where LTP was induced for the same duration as LTD but the amplitudes of LTP and LTD fluctuated. In our model, it is not possible to cancel out the information on the firing rate when the balance between LTP and LTD is not maintained. Accordingly, the cross-talk noise must diverge. However, if the amplitudes of LTP and LTD are the same on average in a learning process, it is possible to stably retrieve the memory patterns. In other words, the balance between LTP and LTD need not be maintained precisely but must be maintained on average. We intend to discuss this in qualitative terms in future work.

Finally, we must briefly mention the connection between our model and a biological experimental finding called a synfire chain: a volley of activity
Figure 10: A comparison of the storage capacities for our model, α_C^{TAH}, and for the model based on the covariance learning rule, α_C^{COV}. The ratio α_C^{TAH}/α_C^{COV} is plotted as a function of f. As f decreases, the ratio approaches 0.5. The near-halving of the storage capacity for our model is because of the contribution of LTD.
propagated from one pool of neurons to the next pool (Abeles, 1991). There are two candidates to act as a model for a synfire chain: a recurrent network with connections learned by Hebbian learning (Herrmann, Hertz, & Prügel-Bennett, 1995) and a feedforward network with uniform connections (Diesmann, Gewaltig, & Aertsen, 1999). Our model is a recurrent network in which a single sequence is embedded by TAH. Since our model is based on recent biological findings, it is more biologically plausible than that of Herrmann et al. Our model is easily extended to the storage and retrieval of several pattern sequences. When the connectivity is sparse, our model may be regarded as a feedforward network. Our model is thus a candidate to act as a model for a synfire chain.

Appendix: Derivation of the Dynamical Equations by the Statistical Neurodynamics
In this appendix, we derive recursive equations for the overlap m^t(t) and the variance σ²(t). The internal potential u_i(t) at time t is derived from equation 2.5 and the periodic boundary condition ξ_i^{p+1} = ξ_i^1 and ξ_i^0 = ξ_i^p
in the following way:

u_i(t) = \frac{1}{Nf(1-f)} \sum_{j=1}^{N} \sum_{\mu=1}^{p} (\xi_i^{\mu+1}\xi_j^\mu - \xi_i^{\mu-1}\xi_j^\mu)\, x_j(t)   (A.1)

      = \frac{1}{Nf(1-f)} \sum_{j=1}^{N} \sum_{\mu=1}^{p} \{(\xi_i^{\mu+1} - f) - (\xi_i^{\mu-1} - f)\}(\xi_j^\mu - f)\, x_j(t)   (A.2)

      = \sum_{\mu=1}^{p} (\xi_i^{\mu+1} - \xi_i^{\mu-1})\, m^\mu(t)   (A.3)

      = (\xi_i^{t+1} - \xi_i^{t-1})\, m^t(t) + z_i(t).   (A.4)
If it is possible to store a pattern sequence, the cross-talk noise will follow a gaussian distribution with a mean of 0 and time-dependent variance σ²(t): E[z_i(t)] = 0, E[(z_i(t))²] = σ²(t). The overlap is m^t(t) = \frac{1}{Nf(1-f)} \sum_{i=1}^{N} (\xi_i^t - f)\, x_i(t). We use a Taylor expansion to derive the following expression for the ith neuronal state at time t+1:

x_i(t+1) = F\left( \sum_{\nu=1}^{p} (\xi_i^{\nu+1} - \xi_i^{\nu-1})\, m^\nu(t) - \theta(t) \right)   (A.5)

        = F\left( \sum_{\nu \neq \mu} (\xi_i^{\nu+1} - \xi_i^{\nu-1})\, m^\nu(t) - \theta(t) + (\xi_i^{\mu+1} - \xi_i^{\mu-1})\, m^\mu(t) \right)   (A.6)

        = x_i^{(\mu)}(t+1) + (\xi_i^{\mu+1} - \xi_i^{\mu-1})\, m^\mu(t)\, x_i'^{(\mu)}(t+1).   (A.7)
Using this relationship, the cross-talk noise at time t, z_i(t), is derived next:

z_i(t) = \sum_{\mu \neq t} (\xi_i^{\mu+1} - \xi_i^{\mu-1})\, m^\mu(t)   (A.8)

      = \frac{1}{Nf(1-f)} \sum_{\mu \neq t} \sum_{j=1}^{N} \{(\xi_i^{\mu+1} - f) - (\xi_i^{\mu-1} - f)\}(\xi_j^\mu - f)\, x_j(t)   (A.9)

      = \frac{1}{Nf(1-f)} \sum_{\mu \neq t} \sum_{j=1}^{N} (\bar\xi_i^{\mu+1} - \bar\xi_i^{\mu-1})\, \bar\xi_j^\mu\, x_j^{(\mu)}(t) + \sum_{\nu \neq t} U(t)\, (\bar\xi_i^{\nu+1} m^{\nu-1}(t-1) - 2\bar\xi_i^{\nu} m^{\nu}(t-1) + \bar\xi_i^{\nu-1} m^{\nu+1}(t-1)),   (A.10)
where U(t) = E[x_i'^{(\mu)}(t)] and ξ̄_i^ν = ξ_i^ν − f. x_i'^{(\mu)}(t) is independent of x_i^{(\mu)}(t) and is the derivative of x_i^{(\mu)}(t). Therefore, the square of z_i(t) is given by
(z_i(t))^2 = \left( \frac{1}{Nf(1-f)} \right)^2 \sum_{\mu \neq t} \sum_{j=1}^{N} (\bar\xi_i^{\mu+1} - \bar\xi_i^{\mu-1})^2 (\bar\xi_j^\mu)^2 (x_j^{(\mu)}(t))^2 + \sum_{\nu \neq t} U(t)^2 (\bar\xi_i^{\nu+1} m^{\nu-1}(t-1) - 2\bar\xi_i^{\nu} m^{\nu}(t-1) + \bar\xi_i^{\nu-1} m^{\nu+1}(t-1))^2   (A.11)

= (1^2 + (-1)^2)\,\alpha\, q(t) + (1^2 + (-2)^2 + 1^2)\,\alpha\, q(t-1)\, U^2(t) + (1^2 + (-3)^2 + 3^2 + (-1)^2)\,\alpha\, q(t-2)\, U^2(t)\, U^2(t-1) + \cdots   (A.12)

= \sum_{a=0}^{t} {}_{2(a+1)}C_{a+1}\, \alpha\, q(t-a) \prod_{b=1}^{a} U^2(t-b+1),   (A.13)
where p = αN, q(t) = \frac{1}{N} \sum_{i=1}^{N} (x_i^{(\mu)}(t))^2, {}_{b}C_{a} = \frac{b!}{a!(b-a)!}, and a! is the factorial of the positive integer a. We applied the relationship \sum_{a=0}^{b} ({}_{b}C_{a})^2 = {}_{2b}C_{b} in this derivation. The expectation of z_i(t) is assumed to be 0, so the variance σ²(t) is equal to E[(z_i(t))²]. We then get a recursive equation for σ²(t):

\sigma^2(t) = \sum_{a=0}^{t} {}_{2(a+1)}C_{a+1}\, \alpha\, q(t-a) \prod_{b=1}^{a} U^2(t-b+1).   (A.14)
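The combinatorial identity used between (A.12) and (A.13), Σ_{a=0}^{b} (_bC_a)² = _{2b}C_b, can be checked directly:

```python
from math import comb

# Direct check of the identity sum_{a=0}^{b} C(b, a)^2 = C(2b, b),
# used to collapse the squared binomial coefficients in (A.12) into (A.13).
for b in range(8):
    assert sum(comb(b, a) ** 2 for a in range(b + 1)) == comb(2 * b, b)
print("identity holds for b = 0..7")
```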
Replacing the average over the sites with the average over the gaussian noise term, a recursive equation for the overlap m^t(t) between the network state x(t) and the retrieval pattern ξ^t is obtained as follows:

m^t(t) = \frac{1}{Nf(1-f)} \sum_{i=1}^{N} (\xi_i^t - f)\, x_i(t)   (A.15)

      = \frac{1}{f(1-f)} \left\langle\!\left\langle \int Dz\, (\xi^t - f)\, x(t) \right\rangle\!\right\rangle   (A.16)

      = \frac{1}{f(1-f)} \left\langle\!\left\langle \int Dz\, (\xi^t - f)\, F((\xi^t - \xi^{t-2})\, m^{t-1}(t-1) + \sigma(t-1)z - \theta(t-1)) \right\rangle\!\right\rangle   (A.17)

      = \frac{1-2f}{2}\,\mathrm{erf}(w_0) - \frac{1-f}{2}\,\mathrm{erf}(w_1) + \frac{f}{2}\,\mathrm{erf}(w_2),   (A.18)
where erf(y) = \frac{2}{\sqrt{\pi}} \int_0^y \exp(-u^2)\, du, w_0 = \frac{\theta(t-1)}{\sqrt{2}\,\sigma(t-1)}, w_1 = \frac{-m^{t-1}(t-1) + \theta(t-1)}{\sqrt{2}\,\sigma(t-1)}, and w_2 = \frac{m^{t-1}(t-1) + \theta(t-1)}{\sqrt{2}\,\sigma(t-1)}. U(t) is derived in the following way:
U(t) = \frac{1}{N} \sum_{i=1}^{N} x_i'(t) = \frac{1}{N} \sum_{i=1}^{N} x_i'^{(\mu)}(t) = E[x'(t)]   (A.19)

     = E[F'((\xi^t - \xi^{t-2})\, m^{t-1}(t-1) + \sigma(t-1)z - \theta(t-1))]   (A.20)

     = \int_{-\infty}^{\infty} Dz\, \langle\langle F'((\xi^t - \xi^{t-2})\, m^{t-1}(t-1) + \sigma(t-1)z - \theta(t-1)) \rangle\rangle   (A.21)

     = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} dz\, e^{-z^2/2}\, \langle\langle F'((\xi^t - \xi^{t-2})\, m^{t-1}(t-1) + \sigma(t-1)z - \theta(t-1)) \rangle\rangle   (A.22)

     = \frac{1}{\sqrt{2\pi}\,\sigma(t-1)} \{ (1-2f+2f^2)\, e^{-w_0^2} + f(1-f)(e^{-w_1^2} + e^{-w_2^2}) \}.   (A.23)
Finally, we derive q(t):

q(t) = \frac{1}{N} \sum_{i=1}^{N} (x_i(t))^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i^{(\mu)}(t))^2 = E[(x(t))^2]   (A.24)

     = E[F^2((\xi^t - \xi^{t-2})\, m^{t-1}(t-1) + \sigma(t-1)z - \theta(t-1))]   (A.25)

     = \int_{-\infty}^{\infty} Dz\, \langle\langle F^2((\xi^t - \xi^{t-2})\, m^{t-1}(t-1) + \sigma(t-1)z - \theta(t-1)) \rangle\rangle   (A.26)

     = \frac{1}{2}\left(1 - (1-2f+2f^2)\,\mathrm{erf}(w_0) - f(1-f)(\mathrm{erf}(w_1) + \mathrm{erf}(w_2))\right).   (A.27)
References

Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.
Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Amari, S. (1989). Characteristics of sparsely encoded associative memory. Neural Networks, 2, 1007–1018.
Amari, S., & Maginu, K. (1988). Statistical neurodynamics of various versions of correlation associative memory. Neural Networks, 1, 63–73.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18, 10464–10472.
Chechik, G., Meilijson, I., & Ruppin, E. (1999). Effective learning requires neuronal remodeling of Hebbian synapses. In M. S. Kearns, S. Solla, & D. Cohn
(Eds.), Advances in neural information processing systems, 11 (pp. 96–102). Cambridge, MA: MIT Press.
Diesmann, M., Gewaltig, M., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Herrmann, M., Hertz, J. A., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network: Computation in Neural Systems, 6, 403–414.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Physical Review E, 59, 4498–4514.
Kitano, K., & Aoyagi, T. (1998). Retrieval dynamics of neural networks for sparsely coded sequential patterns. Journal of Physics A: Mathematical and General, 31, L613–L620.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Miyashita, M. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.
Munro, P., & Hernandez, G. (2000). LTD facilitates learning in a noisy environment. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 150–156). Cambridge, MA: MIT Press.
Okada, M. (1995). A hierarchy of macrodynamical equations for associative memory. Neural Networks, 8, 833–838.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429–1458.
Perez-Vicente, C. J., & Amit, D. J. (1989). Optimized network for sparsely coded patterns. Journal of Physics A: Mathematical and General, 22, 559–569.
Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Rubin, J., Lee, D.
D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Tsodyks, M. V., & Feigel'man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhysics Letters, 6, 101–105.
Yoshioka, M. (2002). Spike-timing-dependent learning rule to encode spatiotemporal patterns in a network of spiking neurons. Physical Review E, 65, 011903.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received October 18, 2001; accepted May 10, 2002.
LETTER
Communicated by Alessandro Treves
Associative Memory with Dynamic Synapses

Lovorka Pantic
[email protected]
Joaquín J. Torres*
[email protected]
Hilbert J. Kappen
[email protected]
Stan C.A.M. Gielen
[email protected]
Department of Biophysics, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands

We have examined the role of dynamic synapses in stochastic Hopfield-like network behavior. Our results demonstrate the appearance of a novel phase characterized by quick transitions from one memory state to another. The network is able to retrieve memorized patterns corresponding to classical ferromagnetic states but switches between memorized patterns with an intermittent type of behavior. This phenomenon might reflect the flexibility of real neural systems and their readiness to receive and respond to novel and changing external stimuli.

1 Introduction
Recent experimental studies report that the synaptic transmission properties of cortical cells are strongly influenced by recent presynaptic activity (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997). The postsynaptic response is activity dependent and can decrease (in the case of synaptic depression) or increase (in the case of synaptic facilitation) under repeated stimulation (Tsodyks & Markram, 1997). Markram and Tsodyks (1996) found that for some cortical synapses, after induction of long-term potentiation (LTP), the temporal synaptic response was not uniformly increased. Instead, the amplitude of the initial postsynaptic potential was potentiated, whereas the steady-state synaptic response was unaffected by LTP (synaptic redistribution). These findings affect the transmission properties of single neurons, as well as network functioning and behavior. Previous studies on the effects of synaptic depression on network behavior have focused on both feedforward and recurrent networks. For feedforward networks, some authors

* Present address: Dept. of Electromagnetism and Material Physics, University of Granada, E-18071 Granada, Spain
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2903–2923 (2002)
have studied a role of activity-dependent synapses in a supervised learning paradigm (Natschläger, Maass, & Zador, 2001). They showed that even with a single hidden layer of binary neurons, such networks can approximate a large class of nonlinear filters. Another study (Liaw & Berger, 1996) demonstrated that dynamically changing synaptic strength allows the extraction of statistically significant features from noisy and variable temporal patterns. An application was presented where a simple neural network with dynamic synapses was able to perform speech recognition from unprocessed noisy waveforms of words spoken by multiple speakers. For recurrent networks, a number of theoretical and numerical studies revealed that a large population of excitatory neurons with depressing synapses can exhibit complex regimes of activity (Bressloff, 1999; Kistler & van Hemmen, 1999; Senn et al., 1996; Tsodyks, Pawelzik, & Markram, 1998; Tsodyks, Uziel, & Markram, 2000). Tsodyks et al. (2000) demonstrated that such networks exhibit short intervals of highly synchronous activity (population bursts) intermittent with long periods of asynchronous activity, resembling the behavior observed in neurons throughout the cortex. It has been proposed (Senn et al., 1996) that synaptic depression may also serve as a mechanism that can provoke rhythmic activity and therefore can have an important role in central pattern generation. The effects of synaptic dynamics on some complex neural functions such as associative memory are not yet fully established, and only a few studies (Bibitchkov, Herrmann, & Geisel, 2002) have focused on this important issue. The conventional way for model studies of associative memory functions is based on the view that recurrent (attractor) neural networks may capture some basic characteristics of associative memory (Amit, 1989; Miyashita & Chang, 1988).
In such networks, long-term storage of the memory patterns is enabled by synaptic connections that are adjusted (according to the Hebb rule) such that the attractors (fixed points) of the network dynamics represent the stored memories (Hopfield, 1982). The question one may ask is how the retrieval properties and the fixed points of the Hopfield network are affected by the experimental findings showing use-dependent weakening of the synaptic connections. Bibitchkov et al. (2002) reported that short-term synaptic depression does not affect the fixed points of the network but rather reduces the stability of the patterns and the storage capacity. They also showed that during external stimulation, the depression mechanism enables easier switching between patterns and better storage of temporal sequences. In this study, we aim to extend previous work and explore the role of noise, as well as the effect of the synaptic recovery process, in pattern retrieval. We found that a network with depressing synapses is sensitive to noise and displays rapid switching among stored memories. The switching behavior is observed for overlapping as well as nonoverlapping patterns. It exists for the network consisting of simple binary units and also for the network with integrate-and-fire neurons. The transitions among memories start to occur when the network dynamics, described by an iterative (mean-field) map, undergoes an intermittent behavior. The memory states lose absolute stability, and the system, instead of residing at one of the fixed-point attractors, reveals another type of dynamics. Linear stability analysis as well as numerical simulations reveal that in addition to the traditional ferromagnetic (memory) and paramagnetic (noisy) fixed-point regimes, there exists a limit-cycle phase that enables transitions among the stored memories.

2 Models
The classical Hopfield-like neural network consists of N fully connected binary neurons. The neurons are regarded as stochastic two-state devices with the state variable s taking values 1 (firing) or 0 (no firing) according to the probabilistic rule
\[
\mathrm{Prob}\{s_i(t+1) = 1\} = \tfrac{1}{2}\left[1 + \tanh\bigl(2\beta h_i(t)\bigr)\right]. \tag{2.1}
\]
The parameter $\beta = 1/T$ represents the level of noise caused by synaptic stochasticity or other fluctuations in neuron functioning. The local field $h_i$ is the net input to neuron i, given by $h_i = \sum_j w_{ij}^0 s_j - \theta_i$, where $w_{ij}^0$ represents the synaptic connection strength (weight) from neuron j to neuron i and $\theta_i$ is a threshold. The updating of all neurons can be carried out synchronously (Little dynamics) at each time step or asynchronously (Glauber dynamics), updating them one at a time. The connectivity matrix $w_{ij}^0$ is defined according to the standard covariance rule,
\[
w_{ij}^0 = \frac{1}{N f (1-f)} \sum_{\mu=1}^{P} (\xi_i^\mu - f)(\xi_j^\mu - f), \tag{2.2}
\]
which represents an optimal learning rule for associative memory networks (Dayan & Willshaw, 1991; Palm & Sommer, 1996). The $\xi^\mu$'s ($\mu = 1, \ldots, P$) denote the P patterns to be memorized (stored) and, later, to be retrieved by the network. All thresholds $\theta_i$ are set to zero (Peretto, 1992). The variable f represents the mean level of activity of the network for the P patterns. It is known that the memorized patterns are stable fixed points of the network dynamics as long as $\alpha \equiv P/N$ is small and as long as the noise level is not too high (Amit, 1989; Peretto, 1992) (ferromagnetic phase). When $\beta < 1$, the noise is stronger than the connectivity, and memory is lost (paramagnetic phase). When $\alpha$ is too large, the interference between patterns destroys pattern stability and causes the spin-glass phase to dominate (Amit, 1989). As a model of the synaptic depression, we use the phenomenological model introduced in Tsodyks and Markram (1997). This model assumes that the synaptic resources are divided into a recovered (x), an active (y), and an inactive (z) fraction and that the total amount of resources (neurotransmitter) in the recycling process is constant, which gives $x + y + z = 1$. Each
action potential (AP) activates a fraction (U) of the resources in the recovered state, which (instantaneously) becomes active (during transmission), then becomes inactivated with a time constant ($\tau_{in}$) of a few milliseconds, and finally recovers with a time constant ($\tau_{rec}$) that can vary from tens of milliseconds to seconds, depending on the type of the neuron and the type of synapse. The synaptic current is proportional to the fraction of active resources, $I(t) = A y(t)$, where A determines the maximal response of the synapse. $\tau_{in}$ is used to model the $\alpha$-function-like response of the synapse. The model can be further simplified, since the inactivation time constant $\tau_{in}$ is much shorter than the recovery constant $\tau_{rec}$ (Tsodyks et al., 1998). The detailed synaptic response can be ignored, and we can eliminate the variable y from the synaptic dynamics. Thus, we assume that the depressing synapse can be described by the dynamics of the recovered resources only. Since the states of the binary neurons are updated at discrete time steps, we use a discretized equation,
\[
x(t+\delta t) = x(t) + \delta t \left( \frac{1 - x(t)}{\tau_{rec}} - U x(t)\, s(t) \right), \tag{2.3}
\]
instead of the original differential equation (Tsodyks & Markram, 1997; Tsodyks et al., 1998), with $\delta t$ denoting the time step of discretization (updating). To incorporate the synaptic dynamics into the network, we consider that during memory retrieval, the strength of the synaptic connections (weights) changes in time according to the depressing mechanism. The dynamic synaptic strength $w_{ij}(t)$ is assumed to be proportional to the fraction of recovered resources $x_j(t)$ and the static Hebbian term $w_{ij}^0$, that is, $w_{ij}(t) = w_{ij}^0 x_j(t)$.¹ The local fields are then given by $h_i = \sum_j w_{ij}^0 x_j s_j$. The states of all the neurons are updated synchronously at time steps of $\delta t = 1$ msec and together define the consecutive network states. The overlap of the instantaneous network state with the memorized pattern $\mu$, that is,
\[
m^\mu \equiv \frac{1}{N f (1-f)} \sum_{i=1}^{N} (\xi_i^\mu - f)\, s_i,
\]
provides insight into the network dynamics and measures the similarity of the actual state and the stored patterns.

¹ It is worth mentioning that under this assumption, interactions between the neurons are no longer symmetric.

3 Simulation Results

3.1 Network with Binary Units. We analyze the behavior of a network consisting of N fully connected binary neurons for a small number of patterns. We first consider the case of a single pattern with $f = \tfrac{1}{2}$. The pattern is stored according to rule 2.2, and the states of the neurons are governed by equations 2.1 and 2.3.
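As a concrete illustration, the retrieval dynamics defined by equations 2.1 through 2.3 can be simulated in a few lines. This is a minimal sketch, not the authors' code: the release fraction U = 0.5 and the random seed are our own illustrative choices (U is not specified in this excerpt), with f = 1/2 and a single stored pattern as in this subsection.

```python
import numpy as np

rng = np.random.default_rng(1)

N, f = 120, 0.5
beta, U, tau_rec, dt = 20.0, 0.5, 30.0, 1.0   # U is an illustrative choice

# Single stored pattern (first half of the units active), covariance rule 2.2.
xi = np.zeros(N)
xi[: N // 2] = 1.0
w0 = np.outer(xi - f, xi - f) / (N * f * (1 - f))
np.fill_diagonal(w0, 0.0)

s = xi.copy()        # start the network in the memory state
x = np.ones(N)       # all synaptic resources initially recovered
overlaps = []
for t in range(300):
    h = w0 @ (x * s)                                 # local fields h_i = sum_j w0_ij x_j s_j
    p_fire = 0.5 * (1.0 + np.tanh(2.0 * beta * h))   # eq. 2.1
    s = (rng.random(N) < p_fire).astype(float)
    x += dt * ((1.0 - x) / tau_rec - U * x * s)      # eq. 2.3
    overlaps.append(((xi - f) * s).sum() / (N * f * (1 - f)))

overlaps = np.array(overlaps)
```

With depressing synapses the overlap is not pinned at $m^1 = +1$: the pattern is retrieved transiently and then destabilized as the efferent synapses of its active units depress, qualitatively as in Figure 1.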
For static synapses, it is well known that the network exhibits two types of behavior. The pattern (antipattern) is either permanently retrieved (ferromagnetic phase) or the retrieval is not possible (paramagnetic phase). For depressing synapses, the pattern is stable only for a certain period of time, after which it disappears (see Figure 1, top). All efferent (outgoing) synapses from active units are permanently depressed during this period; their average level of recovery $x_+ \equiv \frac{1}{Nf} \sum_{j \in \mathrm{Act}} x_j$ decreases to a low level (see Figure 1, bottom). At the same time, all efferent synapses from inactive units recover, that is, $x_- \equiv \frac{1}{N(1-f)} \sum_{j \notin \mathrm{Act}} x_j$ increases. Due to noise, the network at some moment of time switches to the attractor formed by the recovered synapses (antipattern) (see Figure 1, top). The antipattern is now temporarily active; then it disappears while the original pattern is being instantaneously reactivated, and so forth. This process repeats itself, and Figure 1 (bottom) shows the repetitive behavior of $x_+$, $x_-$, and the overlap $m^1$ for two values of $\tau_{rec}$. For smaller values of $\tau_{rec}$, the duration of the pattern (antipattern) activity is longer than for larger values of $\tau_{rec}$, for which the switching between attractor states occurs more frequently.
[Figure 1 appears here; left panels: β = 20, τ_rec = 18 msec; right panels: β = 20, τ_rec = 30 msec.]
Figure 1: The periodic regimes of the behavior of an associative memory network with dynamic synapses with a single stored pattern ($f = 0.5$). The pattern consists of active ($j = 1, \ldots, 60$) and inactive ($j = 61, \ldots, 120$) units. (Upper panels) A raster plot of the neural activity of all neurons ($N = 120$) in the network during 300 msec. (Bottom panels) The overlap $m^1$ (thin line), the average recovery level of the active units $x_+$ (solid thick line), and the average recovery level of the inactive units $x_-$ (dashed thick line) as a function of time. For clarity, the dashed thick line is not presented in the right panel.
In the case of several orthogonal patterns² whose active units overlap (see Figure 2, top), the network also displays switching behavior. Panels showing overlaps with $P = 6$ patterns indicate that the network first recalls pattern 6 and its antipattern, alternately (shown in the sixth panel of the figure), then switches to pattern 3 and its antipattern (third panel), and then retrieves pattern 5 and its corresponding antipattern (fifth panel). Switching that occurs on the fast timescale between the pattern and its antipattern is similar to the behavior shown in Figure 1 (limit cycle). In addition, there exists a switching on the slower timescale among the different patterns. The switching shown in Figure 2 is more complex than the switching considered in Figure 1, since one limit cycle switches to another pattern or to another limit cycle. This motion might be related to the behavior that can occur for several coexisting attractive states. Similar behavior is also present when several patterns with nonoverlapping active units are stored in the network. The upper panel of Figure 3 shows the activity of the network as a function of time, and the middle panel shows the level of recovery $x_j$ for neuron j ($j = 1, \ldots, N$). All synapses from temporarily active units become depressed, while synapses from inactive units are either recovered or are recovering (see Figure 3, middle). Random synchronous activity of some units from an inactive group (which occurs due to noise) will cause an increase of their local fields and further excitation of the other units in the same group (associative mechanisms). Meanwhile, the local fields of the neurons from the currently active group become suppressed, and the activity of that group is inhibited. As a result of this process, the network switches to other (mixture) states, and so forth.

3.2 Network with Integrate-and-Fire Units. We now explore the switching behavior for the more realistic network with integrate-and-fire (IF) neurons.
We consider $N = 200$ fully connected neurons whose states are modeled by the membrane potential $V_i(t)$ ($i = 1, \ldots, N$) according to the dynamics
\[
\tau_m \frac{dV_i(t)}{dt} = -V_i(t) + R\, I_i^{syn}(t) + \xi(t). \tag{3.1}
\]
The membrane time constant and the membrane resistance are fixed to $\tau_m = 20$ msec and $R = 100\,\mathrm{M}\Omega$, respectively. $\xi(t)$ represents gaussian white noise with autocorrelation $\langle \xi(t)\xi(t') \rangle = D\delta(t - t')$. The synaptic input to cell i from the other neurons is given by $I_i^{syn} = gA \sum_{j=1}^{N} w_{ij}^0 y_j$, where $A y_j$ represents the synaptic current $I_j$ of neuron j (see section 2), $w_{ij}^0$ denotes the connectivity matrix given by 2.2, and g is a scaling factor. Whenever the membrane potential reaches the threshold value ($V_{th} = 10$ mV), an action potential is emitted, and the potential is reset to zero with a refractory period of $\tau_{ref} = 1$ msec.

² $\frac{1}{N} \sum_{i=1}^{N} (\xi_i^\mu - f)(\xi_i^\nu - f) = 0$ when $\mu \neq \nu$.

Figure 2: (Top) Raster plot of a network with $N = 128$ neurons and $P = 6$ orthogonal patterns ($f = 0.5$). The patterns consist of alternating blocks of active units (ones) and inactive units (zeros). The first pattern consists of two blocks of size $fN$, the second pattern consists of $2^2$ blocks of size $f^2 N$, the third is composed of $2^3$ blocks of size $f^3 N$, and so forth. The noise level is fixed to $\beta = 30$, and the time constant for recovery is $\tau_{rec} = 30$ msec. The other panels represent overlaps between the current network state and the memorized patterns.

To enable easier observation, we store $P = 5$ patterns whose active bits do not overlap. Figure 4 shows the network behavior during pattern retrieval. For some fixed noise level and depending on the value of the time constant for recovery $\tau_{rec}$, the network either permanently recalls one of the stored patterns ($\tau_{rec} = 0$ msec: the standard behavior of the network with constant weights), or switches among patterns ($\tau_{rec} = 300$ msec and $\tau_{rec} = 800$ msec), or exhibits a paramagnetic type of behavior ($\tau_{rec} = 3000$ msec), for which retrieval of the patterns is not possible. We conclude that the effects of synaptic depression on the network level are qualitatively the same for
Figure 3: (A) Raster plot of neural activity of a network consisting of $N = 100$ neurons with $P = 10$ nonoverlapping patterns during a period of 500 msec. $\tau_{rec}$ is set to 75 msec. Each pattern consists of 10 active and 90 inactive units. The first pattern is defined as $\xi_j^1 = 1\ \forall j \in (1, \ldots, 10)$ and $\xi_j^1 = 0\ \forall j \notin (1, \ldots, 10)$, the second is given by $\xi_j^2 = 1\ \forall j \in (11, \ldots, 20)$ and $\xi_j^2 = 0\ \forall j \notin (11, \ldots, 20)$, and so forth. (B) Level of recovery $x_j$ for neuron j ($j = 1, \ldots, N$). White regions represent low levels of recovery (depression), while gray (black) parts indicate that synapses are recovering (recovered).
the network with IF units as for the network with binary neurons. The most interesting feature is that synaptic depression destabilizes the stored memories and enables noise-induced transitions from one pattern to another. It has been demonstrated (Mueller & Herz, 1999) that the associative properties of a network with IF neurons or with binary neurons are similar. Our results extend this observation to networks with depressing synapses.

4 Mean-Field Calculations
To investigate the switching phenomenon further, we analyzed the behavior of the network with the binary units within the mean-field framework. For simplicity, we consider the case $\alpha \to 0$. When the network is close to pattern
[Figure 4 appears here: raster plots (neurons versus time in msec) for τ_rec = 0, 300, 800, and 3000 msec.]
Figure 4: From top to bottom, panels represent raster plots of neural activity of the network with $N = 200$ integrate-and-fire units and $P = 5$ patterns (with nonoverlapping active bits) for four different values of $\tau_{rec}$ and for a fixed noise level of $D = 0.5$.
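A minimal Euler-scheme sketch of the IF network behind Figure 4 is given below. This is our own reconstruction, not the authors' code: the gain g, the noise amplitude, U, τ_in, and the seed are illustrative choices, the membrane resistance R and the factor A are absorbed into g, and only the qualitative ingredients (noisy spiking plus depressing recurrent synapses) are reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

N, P = 200, 5
f = 0.2                                   # five nonoverlapping patterns, 40 active units each
tau_m, V_th, tau_ref = 20.0, 10.0, 1.0    # msec, mV, msec
tau_rec, tau_in, U = 300.0, 3.0, 0.5      # U and tau_in are illustrative choices
g, sigma, dt = 50.0, 2.0, 0.5             # gain and noise amplitude are illustrative

xi = np.zeros((P, N))
for mu in range(P):
    xi[mu, 40 * mu: 40 * (mu + 1)] = 1.0
w0 = (xi - f).T @ (xi - f) / (N * f * (1 - f))   # covariance rule, eq. 2.2
np.fill_diagonal(w0, 0.0)

V = rng.uniform(0.0, V_th, N)   # membrane potentials (mV)
x = np.ones(N)                  # recovered synaptic resources
y = np.zeros(N)                 # active resources, decaying with tau_in
last_spike = np.full(N, -np.inf)
n_spikes = 0

for step in range(int(500.0 / dt)):     # 500 msec of simulated time
    t = step * dt
    I_syn = g * (w0 @ y)                             # recurrent synaptic input
    noise = sigma * np.sqrt(dt) * rng.standard_normal(N)
    V += dt * (-V + I_syn) / tau_m + noise           # Euler step of eq. 3.1
    V[t - last_spike < tau_ref] = 0.0                # refractory clamp
    fired = V >= V_th
    n_spikes += int(fired.sum())
    last_spike[fired] = t
    V[fired] = 0.0
    # Resource dynamics: a spike moves a fraction U of x into the active state.
    y += dt * (-y / tau_in)
    y[fired] += U * x[fired]
    x += dt * (1.0 - x) / tau_rec
    x[fired] -= U * x[fired]
```

At this level of simplification one should only expect the basic phenomenology: the network spikes, and the synaptic resources of active units deplete, which is the ingredient that destabilizes the retrieved pattern.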
$\mu = 1$, the local field is given by
\[
h_i = \sum_j w_{ij}^0 x_j \langle s_j \rangle
    = \frac{1}{N f (1-f)} \sum_j \sum_\mu (\xi_i^\mu - f)(\xi_j^\mu - f)\, x_j \langle s_j \rangle
    \approx \frac{1}{N f (1-f)} \sum_j (\xi_i^1 - f)(\xi_j^1 - f)\, x_j \langle s_j \rangle, \tag{4.1}
\]
where we have ignored the contributions from the other patterns; these terms scale as $\sqrt{P/N}$ for random patterns. The sum $\sum_j$ in the previous expression can be split into two terms:
\[
h_i = \frac{1}{N f (1-f)} (\xi_i^1 - f) \left[ \sum_{j \in \mathrm{Act}} (\xi_j^1 - f)\, x_j \langle s_j \rangle + \sum_{j \notin \mathrm{Act}} (\xi_j^1 - f)\, x_j \langle s_j \rangle \right], \tag{4.2}
\]
where $j \in \mathrm{Act}$ ($j \notin \mathrm{Act}$) denotes the active (inactive) neurons in pattern 1. We denote by $x_+$ the mean level of recovery and by $m_+ \equiv \frac{1}{Nf} \sum_{j \in \mathrm{Act}} \langle s_j \rangle$ the mean activity of the active units. For the inactive units, we have $x_-$ and $m_- \equiv \frac{1}{N(1-f)} \sum_{j \notin \mathrm{Act}} \langle s_j \rangle$. With these definitions, we have $m^1 = m_+ - m_-$. The local field is now given by
\[
h_i = (\xi_i^1 - f)\,(x_+ m_+ - x_- m_-). \tag{4.3}
\]
Hence, for $m_+$ we have
\[
m_+(t+1) = \tfrac{1}{2}\bigl\{ 1 + \tanh\bigl[ 2\beta (1-f) \bigl( x_+(t)\, m_+(t) - x_-(t)\, m_-(t) \bigr) \bigr] \bigr\}, \tag{4.4}
\]
and for $m_-$,
\[
m_-(t+1) = \tfrac{1}{2}\bigl\{ 1 - \tanh\bigl[ 2\beta f \bigl( x_+(t)\, m_+(t) - x_-(t)\, m_-(t) \bigr) \bigr] \bigr\}. \tag{4.5}
\]
The dynamics of the variables $x_+$ and $x_-$ is governed by the equations
\[
x_+(t+1) = x_+(t) + \left[ \frac{1 - x_+(t)}{\tau_{rec}} - U x_+(t)\, m_+(t) \right], \tag{4.6}
\]
\[
x_-(t+1) = x_-(t) + \left[ \frac{1 - x_-(t)}{\tau_{rec}} - U x_-(t)\, m_-(t) \right], \tag{4.7}
\]
which together with equations 4.4 and 4.5 represent the four-dimensional iterative map that approximately describes the network dynamics.

5 Stability Analysis
We analyze the stability of the four-dimensional map for the case $f = \tfrac{1}{2}$. The fixed points are given by the equations
\[
m_+ = \tfrac{1}{2}\{ 1 + \tanh[\beta (x_+ m_+ - x_- m_-)] \}, \quad
m_- = \tfrac{1}{2}\{ 1 - \tanh[\beta (x_+ m_+ - x_- m_-)] \},
\]
\[
x_+ = \frac{1}{1 + \gamma m_+}, \quad
x_- = \frac{1}{1 + \gamma m_-}, \tag{5.1}
\]
with $\gamma = U \tau_{rec}$. Let us denote by $\vec{x}\,' = \vec{g}(\vec{x})$
the iterative map, with $\vec{x} = (m_+, m_-, x_+, x_-)$, and by $\vec{x} = \vec{x}^*$ the fixed-point solution. Linear stability analysis demands linearization of $\vec{g}(\vec{x})$ near $\vec{x}^*$:
\[
x_i' = g_i(\vec{x}^*) + \sum_j \frac{\partial g_i(\vec{x}^*)}{\partial x_j} (x_j - x_j^*) + \cdots \tag{5.2}
\]
The eigenvalues $\lambda_i$ of the matrix $D \equiv \partial g_i(\vec{x}^*) / \partial x_j$ indicate the stability of the fixed point $\vec{x}^*$. If $|\lambda_i| < 1\ \forall i$, the solution is stable. In Figures 5 and 6 we
illustrate the stability of the system (see equations 4.4 through 4.7) for two different parameter sets. The stability matrix D and the eigenvalues are given in the appendix. Figure 5 shows that there are two critical (bifurcation) values of $\tau_{rec}$ ($\tau_1$ and $\tau_2$) where $|\lambda|_{max}$ crosses the value 1.³ For $\tau_{rec} < \tau_1$, the system has three fixed points, two of which are stable and correspond to memory (ferromagnetic) solutions. As $\tau_{rec}$ increases, the fixed points approach each other and coalesce near $\tau_{rec} = \tau_1$. The figure shows that near this point, the system undergoes a qualitative change in behavior. For $\tau_1 < \tau_{rec} < \tau_2$, stable oscillations occur (limit-cycle attractor). The amplitude of the oscillations (black and white circles in the figure) gradually decreases and finally ceases at $\tau_2$. At this point, the oscillatory behavior is replaced by a stable ($|\lambda|_{max} < 1$) fixed point, which represents the no-memory (paramagnetic) solution (Hopf bifurcation). In sum, we see that the network dynamics, in addition to the fixed-point (ferromagnetic and paramagnetic) attractors, reveals a limit-cycle attractor, which enables a periodic kind of behavior. In Figure 6, we present the stability of the map for a larger value of $\beta$. Note that for $\tau_{rec} < \tau_1$, there exist five fixed points instead of three. The additional fixed points are unstable (saddle), and near $\tau_1$ they coalesce with the stable fixed points. When $\tau_{rec}$ is increased, the stable and unstable (saddle) points disappear (a saddle-node type of bifurcation). The network dynamics is illustrated in Figure 7, where the fixed points are given by the intersections of the diagonal and the map. The intersection of the map with the diagonal at the value $m_+ = 0.5$ represents an unstable fixed point. The values $m_+ \in \{0, 1\}$ are stable fixed points in the ferromagnetic phase. However, in the oscillatory regime, they are no longer fixed points.
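The regimes just described can also be checked numerically by iterating the map 4.4 through 4.7 directly. The sketch below is ours, not the authors' code; β = 20 and U = 0.5 are illustrative choices (U is not stated in this excerpt), and a regime is classified simply by the range of m₊ after a transient.

```python
import numpy as np

def meanfield_step(m_p, m_m, x_p, x_m, beta, U, tau_rec):
    """One iteration of the map, eqs. 4.4-4.7, for f = 1/2 (so 2*beta*(1-f) = beta)."""
    u = x_p * m_p - x_m * m_m
    return (0.5 * (1.0 + np.tanh(beta * u)),
            0.5 * (1.0 - np.tanh(beta * u)),
            x_p + (1.0 - x_p) / tau_rec - U * x_p * m_p,
            x_m + (1.0 - x_m) / tau_rec - U * x_m * m_m)

def m_plus_range(beta, U, tau_rec, n_burn=2000, n_keep=500):
    """Range of m_+ after a transient: ~0 at a fixed point, large on a limit cycle."""
    state = (0.9, 0.1, 1.0, 1.0)          # start near the memory state
    for _ in range(n_burn):
        state = meanfield_step(*state, beta, U, tau_rec)
    traj = []
    for _ in range(n_keep):
        state = meanfield_step(*state, beta, U, tau_rec)
        traj.append(state[0])
    return max(traj) - min(traj)
```

Under these assumptions, a small τ_rec gives a vanishing range (the trajectory settles onto a ferromagnetic fixed point), while an intermediate τ_rec gives a large range (the limit-cycle phase).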
The consecutive states of the network at equal time intervals (black dots in the figure) show that for $\tau_{rec} \gg \tau_1$ (upper left panel), the system reveals a limit-cycle behavior. For smaller values of $\tau_{rec}$, the system spends more time near one of the ferromagnetic fixed points. When $\tau_{rec}$ is almost equal

³ The crossing that occurs for $\tau_{rec} < 1$ is not relevant and represents an approximation artifact, since our map is not valid in that range.
Figure 5: (Top) The largest eigenvalue $|\lambda|_{max}$ of the linearized (stability) matrix D for a fixed noise level $\beta = 3$ as a function of $\tau_{rec}$ (in units of milliseconds). The solid line corresponds to $|\lambda|_{max}$ of the ferromagnetic fixed points $m_+ \neq 0.5$ (memory solutions) and the dot-dashed line to $|\lambda|_{max}$ of the paramagnetic fixed point $m_+ = 0.5$ (no-memory solution). $\tau_1$ and $\tau_2$ denote the bifurcation points where $|\lambda|_{max}$ crosses the value 1. (Bottom) The solid line represents the three fixed points $m_+$ as a function of $\tau_{rec}$. For $\tau_{rec} < \tau_1$, the fixed point $m_+ = 0.5$ is unstable, while $m_+ \neq 0.5$ are stable fixed points. In the $\tau_1 < \tau_{rec} < \tau_2$ region, there exists only one fixed point, $m_+ = 0.5$, which is unstable, together with a (stable) limit cycle. The white (black) circles correspond to maximal (minimal) amplitudes of the oscillations emerging in this region. For $\tau_{rec} > \tau_2$, there is only one stable fixed-point solution ($m_+ = 0.5$).
Figure 6: (Top) The largest eigenvalue $|\lambda|_{max}$ of the linearized (stability) matrix D for a fixed level of noise $\beta = 20$ as a function of $\tau_{rec}$. The solid line corresponds to $|\lambda|_{max}$ of the ferromagnetic fixed points $m_+ \neq 0.5$ (memory solutions) and the dot-dashed line to $|\lambda|_{max}$ of the paramagnetic fixed point $m_+ = 0.5$ (no-memory solution). Note that near the first bifurcation point $\tau_1$, there exist two more fixed points. The additional fixed points are unstable, and their $|\lambda|_{max}$ values are represented by the dashed line. At the bifurcation point itself, the stable (solid line) and unstable (dashed line) points coalesce and, upon further increase of $\tau_{rec}$, disappear entirely. (Bottom) The solid line represents the fixed points $m_+$ for different values of $\tau_{rec}$. In the $\tau_1 < \tau_{rec} < \tau_2$ region, the circles indicate the presence of an oscillatory phase, with white (black) circles corresponding to maximal (minimal) amplitudes of oscillations.
[Figure 7 appears here: maps for β = 20 with τ_rec = 60 msec, τ_rec = 30 msec, and τ_rec = 16.5 msec, plus an enlargement.]
Figure 7: A plot of $m_+(t+1)$ versus $m_+(t)$ corresponding to equation 4.4 for $\beta = 20$ and for three different values of $\tau_{rec}$. The intersections of the diagonal and the map (solid line) represent the fixed points (see equation 5.1). The (iterative) dynamics of $m_+$ (black dots) indicates that the shape of the limit cycle depends on the value of $\tau_{rec}$. Note that for $\tau_{rec} \gg \tau_1$, the dynamics is uniformly slow, and after some transient, the black dots form the typical limit cycle. For lower values of $\tau_{rec}$, the concentration of the black dots is highest near the places where the ferromagnetic fixed points used to emerge. This means that the dynamics is slower in these areas of the limit cycle and faster in others. The lower right figure represents the enlarged left corner of the lower-left figure. Note that the system resides for a long time near the corridors (tangency) between the diagonal and the map when the value of $\tau_{rec}$ is close to the value $\tau_1$.
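A one-dimensional picture of this kind can be approximated by slaving $x_+$ and $x_-$ to their stationary values 5.1 inside equation 4.4. This quasi-static section is our own reconstruction, not necessarily how the figure was generated, so it is only qualitative; counting its intersections with the diagonal nevertheless distinguishes the ferromagnetic case (three fixed points at small γ) from the case with a single fixed point at large γ.

```python
import numpy as np

def m_section(m, beta, gamma):
    """Quasi-static one-dimensional section of eq. 4.4 (f = 1/2), with the
    resources x_+ and x_- slaved to their stationary values, eq. 5.1."""
    x_p = 1.0 / (1.0 + gamma * m)
    x_m = 1.0 / (1.0 + gamma * (1.0 - m))
    return 0.5 * (1.0 + np.tanh(beta * (x_p * m - x_m * (1.0 - m))))

def count_fixed_points(beta, gamma, n_grid=2001):
    """Count intersections of the section with the diagonal via sign changes."""
    m = np.linspace(0.0, 1.0, n_grid)
    h = m_section(m, beta, gamma) - m
    signs = np.sign(h)
    signs = signs[signs != 0]          # m = 0.5 sits exactly on the grid
    return int(np.count_nonzero(np.diff(signs)))
```

Under these assumptions, small $\gamma = U\tau_{rec}$ yields the three intersections of the ferromagnetic phase, while large γ leaves only the single (unstable) intersection at $m_+ = 0.5$.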
to $\tau_1$, the system resides for a long time near one of the ferromagnetic fixed points. In this condition, the lower panels in Figure 7 show the existence of two narrow "corridors" between the diagonal and the map. If the state of the system is in the neighborhood of these corridors, the system resides there for a long time before it escapes and approaches another corridor. Such a nontrivial dynamics resembles intermittent behavior (Hao, 1989) and represents in fact the beginning of a periodic motion (cycle). Figure 8 shows that the period of oscillations for the critical $\tau_1$ is very large. This is also confirmed in numerical simulations of the stochastic system (see Figure 1, left), which show long-lasting ($\approx 100$ msec) activations of pattern and antipattern, alternately. The duration of these long plateaus of pattern and antipattern activities near the critical $\tau_1$ depends on the value of the parameter $\beta$ (see Figure 8). For smaller values of $\beta$ (larger noise level), the residence periods are shorter, and for some critical value of $\beta$, the saddle-node bifurcation ceases to exist, while the intermittency disappears (compare Figure 5). In Figure 9, we summarize the network behavior in the parameter $(\beta, \tau_{rec})$ space. The ferro- and paramagnetic phases denote the regions with stable behavior, that is, the memory and no-memory (noisy) phases, respectively, whereas the oscillatory region represents the novel periodic type of behavior. The dynamics of the synaptic connections therefore modifies the phase diagram and produces macroscopic changes of the network behavior.

Figure 8: Period of oscillations as a function of the ratio $\tau_{rec} / \tau_{rec}^c$, where $\tau_{rec}^c$ denotes the critical value ($\tau_1$) of the recovery time constant (see Figures 5 and 6). The three different lines correspond to three different noise levels: $\beta = 10$, 50, and 100.
Figure 9: (Top) Three regimes of behavior in the parameter $(\beta, \tau_{rec})$ space. $\tau_{rec}$ is given in units of milliseconds. The dashed line corresponds to the transition between the memory (ferromagnetic) and oscillatory phases and represents the pairs of critical $\beta$ and $\tau_{rec}$ ($\tau_1$) values. The solid line shows the transition between the oscillatory and noisy (paramagnetic) phases, with pairs of critical $\beta$ and $\tau_2$ values. (Bottom) Inset of the lower-left part of the top figure, showing where the oscillatory phase begins. The dot-dashed line is not relevant and represents an approximation artifact, since our model (map) is not valid for $\tau_{rec} < 1$.
6 Discussion
We show that the incorporation of experimentally observed short-term synaptic dynamics into a stochastic Hopfield network leads to a novel and complex network behavior. Such a network displays bursting activity of memorized patterns and fast switching between memories. Synaptic depression affects the stability of the memory attractors and enables flexible transitions among them. The attractors display a kind of metastability, where a small amount of noise can trigger a jump from one state to another. The network dynamics exhibits an oscillatory activity and retrieves patterns or their mixtures alternately. Although the functional roles and relevance of these results for associative memory are not yet clear, they might be linked to the spatiotemporal encoding scheme (Hoshino, Kashimori, & Kambara, 1998) of the olfactory cortex, which is believed to represent a good model for analysis of associative memory processes (Haberly & Bower, 1989). In Hoshino et al. (1998), a model based on experimental findings showed that information about odors could be encoded into a spatiotemporal pattern of neural activity in the olfactory bulb, which consists of a temporal sequence of spatial activity patterns. The recognition of olfactory information, as also suggested by other authors, might be related to oscillatory activity patterns. The rapid switching among memory states and the sensitivity to noise resemble the flexibility and metastability of real systems (Kelso, 1995) in receiving and processing new data in continuously changing external conditions. They may also serve as a possible mechanism underlying attention or short-term and working memory (Nakahara & Doya, 1998). Nakahara and Doya (1998) emphasized that neural dynamics in working memory for goal-directed behaviors should have the properties of long-term maintenance and quick transition.
It was shown that both properties can be achieved when the system parameters are near a saddle-node bifurcation (point), which itself may be a functional necessity for survival in nonstationary environments. A similar switching phenomenon has been reported by Horn and Usher (1989), where threshold dynamics instead of synaptic dynamics was considered. Using a threshold function associated with the fatigue of a neuron, a self-driven temporal sequence of patterns was obtained. Both adaptive mechanisms, the dynamic threshold and synaptic depression, produce similar effects on the network behavior and enable transitions among stored memories. One may wonder whether these two mechanisms are formally equivalent. In order to explore this question in more detail, consider the mean-field equations $m_i = \tanh(\sum_j w_{ij} x_j m_j + \theta_i)$. The informal equivalence of dynamic synapses and thresholds is that for dynamic thresholds, increasing $m_i$ will decrease $\theta_i$, which decreases future $m_i$. Similarly, for dynamic synapses, an increased $m_j$ will decrease $x_j$, reducing the influence of the product $x_j m_j$ on the network. Therefore, the net effect is similar, and indeed dynamic thresholds also show oscillations. However, we think that there is no formal equivalence. The reason is that changes in $x_j$ affect the slope of $m_j$ on the right-hand side of the mean-field equations, thereby possibly changing the number of fixed points in the system and dynamically changing the system from ferromagnetic to paramagnetic or vice versa. Changes in $\theta_i$ cannot bring about such qualitative changes. These mechanisms, as well as some other biological mechanisms (Foss, Moss, & Milton, 1997) that also enable noise-induced transitions among attractors, may be related to more dynamic processes of associative memory, the process of associative thinking (Horn & Usher, 1989), or dynamic memory storage (Foss et al., 1997). Our numerical studies of a network consisting of IF neurons with dynamic synapses show the same switching capability among the memory states as the network with binary neurons. In order to determine the relevance of this behavior for neurobiology, further analyses of the switching phenomenon and the phenomenon of intermittency will require more realistic network models, as well as more realistic learning rules. Regarding the learning rule, it has been shown that the covariance learning rule provides optimal memory performance of an associative-memory network with binary neurons (Dayan & Willshaw, 1991; Palm & Sommer, 1996). However, like all other Hebbian forms of plasticity, this rule leads to excessive growth of synaptic strength. Recent experimental results suggest that there are several regulating mechanisms, such as synaptic scaling, spike-timing-dependent synaptic plasticity (STDP), and synaptic redistribution (Abbott & Nelson, 2000; Bi & Poo, 1998; Markram & Tsodyks, 1996; Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998), that can prevent uncontrolled synaptic runaway.
Chechik, Meilijson, and Ruppin (2001) have already demonstrated that a kind of synaptic scaling, that is, a neuronal regulating mechanism maintaining the postsynaptic activity near a fixed baseline, improves network memory capacity. The effects of these regulating mechanisms on the network behavior reported here need to be explored further. In this study, we have focused on a small number of patterns and on the basic mechanism that enables rapid switching among memories. The simulations have shown that noise-induced rapid transitions occur among the (non)overlapping patterns or the mixture states. The transitions become more complex when the number of patterns is increased, and the behavior resembles the motion in a complex multistable system with a large number of coexisting states. An important point is whether this behavior would be robust to interference among patterns and how it would affect storage capacity. Bibitchkov et al. (2002) have shown that the capacity of an infinite network decreases with increasing depression, using an equilibrium formulation of the system. Whether this conclusion is also valid for more realistic neuronal and synaptic dynamics needs to be explored further.
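To make the switching scenario concrete, here is a deliberately minimal toy: a stochastic binary network storing two patterns with Hebbian weights, combined with a discrete-time, Tsodyks-Markram-style depletion of synaptic resources x_i. All parameter values and the exact update rules are assumptions for this sketch, not the model specification used in the study.

```python
import numpy as np

rng = np.random.default_rng(2)
N, beta, U, tau_rec, T = 200, 10.0, 0.2, 5.0, 300
xi = rng.choice([-1.0, 1.0], size=(2, N))                  # two stored patterns
w = (np.outer(xi[0], xi[0]) + np.outer(xi[1], xi[1])) / N  # Hebbian weights
np.fill_diagonal(w, 0.0)

s = xi[0].copy()              # start the network in pattern 1
x = np.ones(N)                # synaptic resources, fully recovered
overlaps = np.zeros((T, 2))
for t in range(T):
    h = w @ (x * s)                                 # local fields, depressed by x
    p_up = 0.5 * (1.0 + np.tanh(beta * h))          # stochastic (Glauber-like) update
    s = np.where(rng.random(N) < p_up, 1.0, -1.0)
    active = (s > 0).astype(float)
    x += (1.0 - x) / tau_rec - U * x * active       # recovery minus usage
    overlaps[t] = xi @ s / N                        # overlap with each pattern
```

The network initially holds a high overlap with pattern 1; with enough depression and noise, the overlap traces hop between the stored patterns instead of settling, which is the noise-induced switching discussed above. In this sketch, setting U = 0 leaves the network locked in its initial attractor.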
Associative Memory with Dynamic Synapses
2921
Appendix: Stability Matrix
For the system 4.4 through 4.7 and using the stationary conditions 5.1, the matrix $D$ becomes:

$$D = \begin{pmatrix}
2m_+(1-m_+)\beta x_+ & -2m_+(1-m_+)\beta x_- & 2m_+^2(1-m_+)\beta & -2m_+(1-m_+)^2\beta \\
-2m_+(1-m_+)\beta x_+ & 2m_+(1-m_+)\beta x_- & -2m_+^2(1-m_+)\beta & 2m_+(1-m_+)^2\beta \\
-Ux_+ & 0 & 1-\frac{1}{\tau_{rec}}-Um_+ & 0 \\
0 & -Ux_- & 0 & 1-\frac{1}{\tau_{rec}}-U(1-m_+)
\end{pmatrix}. \tag{A.1}$$
To obtain the eigenvalues of the linearized system, we have to solve the characteristic equation,

$$|D - \lambda I| = 0, \tag{A.2}$$

where $|\cdot|$ refers to the determinant and $I$ is the identity matrix. This gives one eigenvalue $\lambda_1$ equal to zero. The other three eigenvalues are the solutions of the cubic equation:

$$\begin{vmatrix}
2m_+(1-m_+)\beta(x_+ + x_-) - \lambda & -2m_+^2(1-m_+)\beta & 2m_+(1-m_+)^2\beta \\
Ux_+ & 1-\frac{1}{\tau_{rec}}-Um_+-\lambda & 0 \\
-Ux_- & 0 & 1-\frac{1}{\tau_{rec}}-U(1-m_+)-\lambda
\end{vmatrix} = 0. \tag{A.3}$$
In general, equation A.3 must be solved numerically for an arbitrary solution $m_+$. For the particular solution $m_+ = 1/2$, equation A.3 reduces to

$$\left(1-\frac{1}{\tau_{rec}}-\frac{U}{2}-\lambda\right)^{2}\left(\frac{2\beta}{\gamma_+^2}-\lambda\right) + \frac{\beta U}{\gamma_+^2}\left(1-\frac{1}{\tau_{rec}}-\frac{U}{2}-\lambda\right) = 0, \tag{A.4}$$
which gives

$$\lambda_2 = 1 - \frac{1}{\tau_{rec}} - \frac{U}{2}, \tag{A.5}$$
and the other two eigenvalues are the solutions of the quadratic equation: ³ ´³ ´ 1 2b bU U 1¡ ¡ ¡l ¡l C (A.6) D 0. trec 2 c C2 c C2 Acknowledgments
J.J.T. acknowledges support from MCyT and FEDER (Ramón y Cajal program) and partial support from the Neuroinformatics Thematic Network, Universidad de Granada (Plan propio), MCyT (under project BFM2001-2841), and the Dutch Technology Foundation (STW).
References

Abbott, L., & Nelson, S. (2000). Synaptic plasticity: Taming the beast. Nat. Neurosci., 3, 1178–1183.
Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Bi, G. Q., & Poo, M. M. (1998). Activity-induced synaptic modifications in hippocampal culture: Dependence on spike timing, synaptic strength and cell type. Journal of Neuroscience, 18, 10464–10472.
Bibitchkov, D., Herrmann, J. M., & Geisel, T. (2002). Pattern storage and processing in attractor networks with short-time synaptic dynamics. Network: Comput. Neural Syst., 13(1), 115–129.
Bressloff, P. C. (1999). Mean-field theory of globally coupled integrate-and-fire neural oscillators with dynamic synapses. Phys. Rev. E, 60(2), 2160–2170.
Chechik, G., Meilijson, I., & Ruppin, E. (2001). Effective neuronal learning with ineffective Hebbian learning rules. Neural Comput., 13(4), 817–840.
Dayan, P., & Willshaw, D. (1991). Optimizing synaptic learning rules in linear associative memories. Biol. Cyber., 65(4), 253–265.
Foss, J., Moss, F., & Milton, J. (1997). Noise, multistability, and delayed recurrent loops. Phys. Rev. E, 55(4), 4536–4543.
Haberly, L. B., & Bower, J. M. (1989). Olfactory cortex: Model circuit for study of associative memory? Trends Neurosci., 12, 258–264.
Hao, B. L. (1989). Elementary symbolic dynamics. Singapore: World Scientific Publishing.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Horn, D., & Usher, M. (1989). Neural networks with dynamical thresholds. Physical Review A, 40(2), 1036–1044.
Hoshino, K., Kashimori, Y., & Kambara, T. (1998). An olfactory recognition model based on spatio-temporal encoding of odor quality in the olfactory bulb. Biol. Cybern., 79, 109–120.
Kelso, J. S. (1995). Dynamic patterns: The self-organization of brain and behavior. Cambridge, MA: MIT Press.
Kistler, W. M., & van Hemmen, J. L. (1999). Short-term synaptic plasticity and network behavior. Neural Comput., 11(7), 1579–1594.
Liaw, J. S., & Berger, T. W. (1996). Dynamic synapses: A new concept of neural representation and computation. Hippocampus, 6, 591–600.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331(6151), 68–70.
Mueller, R., & Herz, A. V. M. (1999). Content-addressable memory with spiking neurons. Phys. Rev. E, 59(3), 3330–3338.
Nakahara, H., & Doya, K. (1998). Near saddle-node bifurcation behavior as dynamics in working memory for goal-directed behavior. Neural Comput., 10(1), 113–132.
Natschläger, T., Maass, W., & Zador, A. (2001). Efficient temporal processing with biologically realistic dynamic synapses. Network: Computation in Neural Systems, 12(1), 75–87.
Palm, G., & Sommer, F. (1996). Models of neural networks III. Berlin: Springer-Verlag.
Peretto, P. (1992). An introduction to the modeling of neural networks. Cambridge: Cambridge University Press.
Senn, W., Segev, I., & Tsodyks, M. (1998). Reading neuronal synchrony with depressing synapses. Neural Comput., 10(4), 815–819.
Senn, W., Wyler, K., Streit, J., Larkum, M., Luscher, H. R., Mey, H., Muller, L., Stainhauser, D., Vogt, K., & Wannier, T. (1996). Dynamics of a random neural network with synaptic depression. Neural Networks, 9(4), 575–588.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comput., 10(4), 821–835.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(1), RC50.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.

Received July 12, 2001; accepted May 14, 2002.
LETTER
Communicated by Dario Ringach
Adaptive Spatiotemporal Receptive Field Estimation in the Visual Pathway

Garrett B. Stanley
[email protected]
Division of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, U.S.A.

The encoding properties of the visual pathway are under constant control from mechanisms of adaptation and systems-level plasticity. In all but the most artificial experimental conditions, these mechanisms serve to continuously modulate the spatial and temporal receptive field (RF) dynamics. Conventional reverse-correlation techniques designed to capture spatiotemporal RF properties assume invariant stimulus-response relationships over experimental trials and are thus limited in their applicability to more natural experimental conditions. Presented here is an approach to tracking time-varying encoding dynamics in the early visual pathway based on adaptive estimation of the spatiotemporal RF in the time domain. Simulations and experimental data from the lateral geniculate nucleus reveal that subtle features of encoding properties can be captured by the adaptive approach that would otherwise go undetected. Capturing the role of dynamically varying encoding mechanisms is vital to our understanding of vision in the natural setting, where there is no true steady state.

1 Introduction
In the early visual pathway, each neuron encodes information about a restricted region of visual space that is generally referred to as the receptive field (RF) of the cell. The dominant paradigm over the past several decades has been that the RF extent and the corresponding temporal encoding properties of the neuron are primarily a function of bottom-up processing and are thus invariant properties of the cell. Overwhelming evidence contradicts this static construct, pointing toward more complex time-varying encoding mechanisms that result from a variety of dynamic sources. Adaptation and plasticity exist on a number of different timescales ranging from milliseconds to hours; these mechanisms have been studied for some time, but their role in the encoding process is not yet known. In this article, a new approach to RF estimation is presented that is critical in precisely quantifying effects of systems-level plasticity or adaptation and the role they play in encoding information about the visual world. An adaptive implementation

Neural Computation 14, 2925–2946 (2002) © 2002 Massachusetts Institute of Technology
is developed that characterizes the functional nature of the time-varying properties that have been explored experimentally but have not yet been well quantified in the context of neuronal encoding. The linear correlation structure between controlled sensory stimuli and the corresponding recorded neural activity has been shown to reveal a significant amount of information about the underlying functional properties of the neuron. Reverse-correlation techniques, which refer to the cross-correlation between stimulus and response for white noise stimuli, have been implemented in a number of systems to characterize the linear component of the stimulus-response relationship of the cells. In the visual pathway, spatiotemporal white noise stimuli have been used extensively to characterize the dynamics of neurons in the retina, lateral geniculate nucleus (LGN), and primary visual cortex using the reverse-correlation technique (Jones & Palmer, 1987; Reid, Soodak, & Shapley, 1991; Reid, Victor, & Shapley, 1997; Mao, MacLeish, & Victor, 1998). Additionally, in the auditory system, researchers have long used reverse-correlation techniques for characterizing auditory neurons in the periphery (DeBoer & Kuyper, 1968), and others have recently used reverse-correlation techniques to extract sound-feature selectivity of neurons in the auditory cortex (deCharms, Blake, & Merzenich, 1998). These techniques have been instrumental in developing our understanding of systems-level processing in the early sensory pathways. One of the major assumptions of the reverse-correlation technique is that the dynamics of the underlying functional mechanisms in the visual pathway are time invariant, resulting in spatiotemporal RF properties that remain unchanged with time. As early as the retina, however, adaptation mechanisms act on a continuum of timescales to adjust encoding dynamics in response to changes in the visual environment (Enroth-Cugell & Shapley, 1973; Shapley & Victor, 1978).
Adaptation mechanisms have also been identified in the LGN (Ohzawa, Sclar, & Freeman, 1985) and cortical area V1 (Albrecht, Farrar, & Hamilton, 1984), and models have been suggested in which population activity acts in a divisive manner to regulate the cellular response properties (Heeger, 1992). Furthermore, several recent studies have shown that the spatial and temporal characteristics of visual neurons can vary drastically over a variety of different timescales (Pei, Vidyasagar, Volgushev, & Creutzfeldt, 1994; Gilbert, Das, Ito, Kapadia, & Westheimer, 1996; Mukherjee & Kaplan, 1995; Przybyszewski, Gaska, Foote, & Pollen, 2000), through neuromodulatory influences from other brain regions. Neurons throughout the visual pathway have been shown to exhibit temporal and spatial properties that evolve over a continuum of timescales, bringing into question the reliability of traditional reverse-correlation techniques applied over even short experimental trials. In this article, an adaptive approach for the estimation of spatiotemporal RF properties from single trials is developed specifically to address time-varying dynamics in the visual pathway, and its utility is demonstrated in simulations of geniculate and cortical dynamics, as well as experimental data from the cat LGN. Portions
of these results have been presented elsewhere in abstract form (Stanley, 2000b). 2 Receptive Field Estimation
The foundation for much of modern systems neurophysiology is the characterization of the relationship between sensory inputs and the corresponding evoked neural activity. This relationship is often referred to as the RF and in general encompasses temporal as well as spatial characteristics of the underlying sensory map. Temporal estimation is first developed here but is easily extended to complete spatiotemporal maps. Suppose the spike train of a neuron is denoted as r(t) = Σ_j δ(t − t_j), with spike event times t_0, t_1, t_2, . . . on the interval (0, t_f], where δ is the Dirac delta function. The fundamental task lies in the characterization of the relationship between a temporally continuous stimulus and the discrete process of the firing activity of the neuron. One approach is to relate the continuous input to the firing "rate" of the neuron, allowing the quantification of the relationship between the input and its modulatory effects on the rate of action potentials generated. Let the firing rate of the neuron be denoted r[k], which represents the number of events occurring in the interval ((k − 1)Δt, kΔt] normalized by the interval width, Δt. The rate is also often interpreted as the time-dependent rate of an inhomogeneous Poisson process, which obviously results in discrete stochastic event times (Ringach, Sapiro, & Shapley, 1997). Although the transformation to rate does simplify the relationship a great deal, the remaining relationship between the stimulus, which takes on values both positive and negative relative to the mean level, and the strictly nonnegative firing rate is nontrivial. Linear models are insufficient to capture such a transformation, and thus it generally must be described through a higher-order Wiener kernel expansion.
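The binned rate just defined can be computed in a couple of lines; the spike times below are hypothetical, and numpy's histogram uses half-open bins [kΔt, (k+1)Δt) rather than ((k−1)Δt, kΔt], a boundary convention that is immaterial for continuous-valued event times.

```python
import numpy as np

dt = 0.0077                                          # 7.7 ms bins, as used in Figure 2
t_spikes = np.array([0.010, 0.012, 0.031, 0.045])    # hypothetical event times (sec)
n_bins = 8
counts, edges = np.histogram(t_spikes, bins=n_bins, range=(0.0, n_bins * dt))
r = counts / dt                                      # firing rate r[k] in spikes/sec
```

Summing r[k]·Δt over the trial recovers the total spike count, a convenient sanity check on the binning.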
An alternate functional model to describe the relationship between the stimulus and the firing rate of a neuron is the Wiener system, which incorporates a linear (L) system followed by a static nonlinearity (N), and is therefore often also referred to as an LN system (Movshon, Thompson, & Tolhurst, 1978; Tolhurst & Dean, 1990; Ringach et al., 1997; Chichilnisky, 2001), as shown in Figure 1. The output of the linear element is expressed as a convolution of the temporal stimulus, s, with the
Figure 1: Wiener system. The stimulus s drives a linear kernel g, producing the intermediate signal x, which a static nonlinearity f(·) maps to the firing rate r.
first-order kernel g, plus an offset μ_x,

$$x[k] = \sum_{m=0}^{L-1} g[m]\, s[k-m] + \mu_x,$$
where LΔt is the filter length, which can be interpreted as the temporal integration window of the cell. For white noise inputs, since x is a linear combination of independent random variables, the process x is approximately gaussian. The firing rate of the neuron is then a nonlinear function of the intermediate signal x, so that r[k] = f(x[k]). The inherent rectifying properties of neurons can often be described through a simple half-wave rectification function, ⌊·⌋, resulting in the output r[k] = ⌊x[k]⌋, which then drives an inhomogeneous Poisson process. In contrast to the complex nature of a higher-order Wiener kernel representation, the Wiener system provides a relatively simple means of describing the inherent nonlinearity in neural encoding. The basic characteristics of the Wiener system have been discussed in a variety of studies focused on a number of systems. Several studies have provided algorithms for the identification of Wiener cascades (Hunter & Korenberg, 1986; Korenberg & Hunter, 1999), and the convergence properties of the estimators have also been well characterized (Greblicki, 1992, 1994, 1997). Although these developments were targeted at more general physiological systems, the utility of these techniques in neural systems has been recently demonstrated (Paulin, 1993; Dan, Atick, & Reid, 1996; Ringach et al., 1997; Smirnakis, Berry, Warland, Bialek, & Meister, 1997; Chichilnisky, 2001). In the work presented here, an adaptive approach to Wiener system estimation is used to capture time-varying characteristics in the dynamic relationship between visual stimulus and neuronal response.

2.1 Estimation of Time-Invariant Properties. Before the adaptive approach is developed, the framework for the general estimation problem must first be discussed. Let the parameter vector be defined as h ≜ [g[0] g[1] · · · g[L − 1]]^T ∈ R^{L×1} to represent the first-order kernel, where T denotes transpose.
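The cascade can be simulated and then inverted in a few lines. The sketch below uses a hypothetical biphasic kernel, a binary white-noise stimulus standing in for an m-sequence, half-wave rectification, and Poisson spike counts; it then recovers the kernel by reverse correlation, exploiting the fact that for white noise the stimulus autocovariance is proportional to the identity. All shapes, rates, and parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n, L = 1.0 / 128, 20000, 16
lags = np.arange(L)
g = np.exp(-lags / 4.0) * np.sin(2 * np.pi * lags / 12.0)  # hypothetical biphasic kernel

s = rng.choice([-1.0, 1.0], size=n)        # binary white noise (m-sequence-like)
x = np.convolve(s, g)[:n]                  # x[k] = sum_m g[m] s[k-m]
rate = np.maximum(30.0 * x + 10.0, 0.0)    # half-wave rectified rate (spikes/sec)
y = rng.poisson(rate * dt)                 # spike counts per bin

# Reverse correlation: with white noise, the kernel estimate reduces to a
# scaled cross-correlation between stimulus and response (cf. equation 2.1).
g_hat = np.array([s[L - m : n - m] @ y[L:] for m in range(L)]) / (n - L)
```

Up to the proportionality constant C₁ of equation 2.1, g_hat recovers the shape of g despite the rectification, which mirrors the scaling argument of section A.2.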
The response can then be written r[k] = ⌊x[k]⌋, where x[k] = h^T Q[k] + μ_x and Q[k] ≜ [s[k] s[k − 1] · · · s[k − L + 1]]^T is the time history of the stimulus. As is well known and shown for reference in section A.1, the parameter vector for a first-order kernel can be estimated from the cross-covariance between the stimulus and the output of the linear element, normalized by the autocovariance structure of the stimulus, or ĥ = W_ss^{−1} w_sx, where W_ss ∈ R^{L×L} is the Toeplitz matrix of the autocovariance of the stimulus s, w_sx ∈ R^{L×1} is the cross-covariance vector between the stimulus and the output of the linear block, x, and ˆ denotes estimate. As discussed in section A.2, for gaussian inputs to the static nonlinearity, the only effect is a scaling of the cross-covariance, and the estimate of the parameter vector
Figure 2: First-order kernel estimation. The first-order kernel was computed from (a) the temporal m-sequence and (b) the corresponding firing rate (in 7.7 ms bins) of an X cell in the LGN. (c) The temporal kernel at the center of the receptive field; the band represents ±2 SD around the estimate (see equation A.1). (d) The actual (positive) and predicted (negative) firing rates (in 40 msec bins) from the full spatiotemporal RF model. Approximately 4 minutes of data at a sampling rate of 128 Hz.
can be expressed as a function of the stimulus, s, and response, r,

$$\hat{h} = W_{ss}^{-1} w_{sx} = \frac{1}{C_1} W_{ss}^{-1} w_{sr}, \tag{2.1}$$
where C₁ ∈ R is a constant of proportionality. As shown in section A.2, for half-wave rectification with gaussian inputs whose mean is much less than the standard deviation, this scaling will be approximately 2, which is consistent with the experimental results presented here. Figure 2 shows the estimate of a low-pass biphasic first-order kernel at a single pixel at the center of the RF, from the response of a cat LGN X cell to a spatiotemporal m-sequence. A segment of the m-sequence stimulus (at 128 Hz) at the center pixel is shown in Figure 2a, and the corresponding firing rate (in 7.7 ms bins) is shown in Figure 2b. The kernel estimate for the center pixel computed over the entire trial is shown in Figure 2c. The band represents two standard deviations around the estimator (for a discussion of the kernel estimator statistics, see section A.4). The uncertainty in the estimation (due to noise and unmodeled dynamics) is an extremely valuable measure, especially when comparing the encoding properties in different physiological states. Using the complete spatiotemporal kernel, the firing rate of the cell is predicted in Figure 2d; the actual firing rate is shown in the dark-shaded (positive) region, and the predicted firing rate is shown in the gray-shaded region, reflected about the horizontal axis for comparison. The cascade of the linear system with the
static nonlinearity is obviously a good predictor of the neuronal response for this relatively linear cell.

2.2 Adaptive Estimation. As shown in section A.3, a recursive estimate of the kernel at time t is posed as the least-squares estimate based on information up to time t:

$$\hat{h}_t = \arg\min_h \sum_{k=L}^{t} \left( r[k] - \lfloor h^T Q[k] + \mu_x \rfloor \right)^2.$$
Kernel estimation formulated in this manner has various computational advantages (Stanley, 2000a; Ringach, Hawken, & Shapley, 2002). If this estimation is carried out over the entire trial, the resulting kernel will be equivalent to that of the nonrecursive estimate from equation 2.1, but the computation will be significantly more efficient. Perhaps the most beneficial aspect of the recursive estimation method, however, is the natural extension to adaptive estimation for neurons whose encoding dynamics vary over time, which has not been explored until recently (Stanley, 2000b, 2001; Brown, Nguyen, Frank, Wilson, & Solo, 2001). Many RF properties in the sensory pathways vary with time, violating a critical assumption of the reverse-correlation technique. Suppose that the recursive filter estimation (see section A.3) is reformulated as (Ljung, 1987)

$$\hat{h}_t = \arg\min_h \sum_{k=L}^{t} \lambda^{t-k} \left( r[k] - \lfloor h^T Q[k] + \mu_x \rfloor \right)^2,$$
where λ ∈ [0, 1] is a weighting parameter. Let C_t and P_t represent the input autocovariance and the cross-covariance between input and output at time t, respectively, which can be expressed in an adaptive form:

$$C_t = \lambda C_{t-1} + Q[t]\, Q^T[t], \qquad P_t = \lambda P_{t-1} + Q[t]\, r[t]. \tag{2.2}$$
As shown in section A.3, the estimate of the kernel at time t is finally written as the estimate at time t − 1 plus a correction based on new information arriving at time t:

$$\hat{h}_t = \hat{h}_{t-1} + \text{update} = \hat{h}_{t-1} + \frac{1}{t-L+1}\, C_t^{-1} Q[t] \left( r[t] - Q^T[t]\, \hat{h}_{t-1} \right). \tag{2.3}$$
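Equations 2.2 and 2.3 amount to recursive least squares with exponential forgetting. The sketch below applies the update to a kernel that switches halfway through the trial, in the spirit of the simulations of section 3.1; the kernels, the noise-free linear response, and all parameter values are illustrative assumptions, and C_t is kept as the weighted sum so that the 1/(t − L + 1) normalization of equation 2.3 is absorbed into it.

```python
import numpy as np

def rls_step(h, C, Q, r, lam):
    """One adaptive update: eq. 2.2 for the covariance, then the kernel correction."""
    C = lam * C + np.outer(Q, Q)                 # forgetting-factor covariance update
    h = h + np.linalg.solve(C, Q) * (r - Q @ h)  # correction from the new sample
    return h, C

rng = np.random.default_rng(1)
L, n, lam = 8, 4000, 0.99
g1, g2 = rng.standard_normal(L), rng.standard_normal(L)  # two hypothetical kernels
s = rng.standard_normal(n)

h, C = np.zeros(L), 1e-3 * np.eye(L)
for k in range(L, n):
    g_true = g1 if k < n // 2 else g2      # encoding dynamics switch mid-trial
    Q = s[k - L + 1 : k + 1][::-1]         # stimulus history Q[k]
    r = Q @ g_true                         # noise-free linear response
    h, C = rls_step(h, C, Q, r, lam)
```

Because past samples are downweighted by λ^{t−k}, the estimate re-converges to the post-switch kernel. At Δt = 1/128 s, λ = 0.99 corresponds to τ = Δt · ln(0.37)/ln(λ) ≈ 0.77 s, close to the ~800 msec time constants quoted in the simulations of section 3.1.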
This estimate therefore downweights past information in an exponential manner as controlled by the size of λ, which is often called the "forgetting" factor. As the encoding properties vary over time, the input-output covariance obviously changes properties, which is accounted for by a single weighting parameter, telescoping backward in time; the result is an
exponential downweighting of past information. Let the time constant of the adaptive estimation, τ, denote the time at which the exponential weighting has decayed by 63% (to 0.37 of its initial value), which is τ = Δt · ln(0.37)/ln(λ). The statistical properties of the adaptive estimator are discussed in section A.4.

3 Time-Varying Dynamics in Sensory Systems
Although there are computational advantages to the estimation posed above, the primary advantage in characterizing neuronal encoding is the ability to track time-varying characteristics of the cell response properties over a single trial. The importance of the paradigm is well illustrated through the following examples involving both simulation and experimentally obtained data from the mammalian visual pathway. 3.1 Simulations of Time-Varying Encoding Dynamics in the LGN and Cortex. Neurons in the LGN of cats have been shown to exhibit temporal
tuning properties that change over time (Sherman & Koch, 1986; Mukherjee & Kaplan, 1995). Over a relatively short time frame of seconds to minutes, the dynamics can change smoothly from a tonically firing state to a bursting state, which is thought to be modulated by neuromodulatory influences from the visual cortex and brain stem. Mukherjee and Kaplan analyzed the transfer function characteristics between the visual input and LGN cell activity and found that the tonic mode is associated with low-pass temporal filter characteristics, whereas the burst mode has more of a bandpass nature, as shown in Figure 3a. Such nonstationary behavior over experimental trials precludes the accurate characterization of encoding properties that change over the experimental trial. The adaptive approach presented here, however, is specifically designed to capture such phenomena. Consider the following example, for which both noise-free and more physiologically realistic stochastic simulations are conducted. Let the firing rate of the neuron simply be a linear function of a spatially uniform but temporally varying stimulus, r[t] = ⌊h_t^T Q[t]⌋. Figure 3a shows the temporal frequency characteristics of the neurons in the tonic (low-pass) and bursting (bandpass) states. The system can be transformed to the parameterized form of the first-order Wiener kernel, h, where h_tonic and h_burst represent the dynamics of the two modes in the time domain, as shown in Figure 3b. The tonic mode is associated with a slower rise to the peak, whereas the burst mode has a quicker transient and a long inhibitory rebound. In order to simulate the transition from tonic firing to bursting and back to tonic, the dynamics of the neuron were generated from a time-varying weighted average of the two models: h_t = a_tonic(t) h_tonic + a_burst(t) h_burst. The weighting functions a_tonic(t) and a_burst(t) are shown in Figure 3c. The adaptive approach was then applied to estimate the time-varying kernel,
Figure 3: Adaptive tracking of temporal tuning modulation in LGN. (a) Temporal frequency response of tonic and burst firing modes and (b) associated kernels, (c) temporal weighting functions for tonic (solid) and burst (dashed) modes, and (d) kernels estimated during the course of the trial (thick line). The nonadaptive estimate is superimposed for reference (thin line). Kernel length was 175 msec, and τ was 800 msec.
and the adaptive time constant τ was varied to produce the minimal prediction error. Figure 3d shows the first-order kernel estimate as a function of time, transitioning from the tonic to burst modes and then back to the tonic mode over the time course of 10 seconds. For this case, the estimation time constant was approximately 800 msec. The estimate of the nonadaptive kernel is superimposed (thin line) for comparison. The transition from tonic to burst and then back to tonic modes is well captured in the kernel estimate for this noise-free simulation. The same analysis was then performed for the case when the rectified output of the linear kernel was used as the time-varying rate of an inhomogeneous Poisson process. A thinning algorithm was used to efficiently produce spike times from this construct (Dayan & Abbott, 2001); a segment of the stimulus and the spike train are
Figure 4: Adaptive tracking of temporal tuning modulation in LGN with noise. Temporal white noise (a) was filtered with the time-varying linear kernel, rectified, and then used as the rate of an inhomogeneous Poisson process to generate spikes (b). The adaptive algorithm tracked the time-varying kernel, as shown in the image (c), and in the kernel evolution (black) in (d), where the nonadaptive kernel (gray) is again superimposed for reference. Bands represent two standard deviations around the estimate (see equation A.3). Kernel length was 175 msec, and τ was approximately 800 msec.
shown in Figures 4a and 4b. The resulting spikes were binned to represent firing rate again, and the estimation procedure was implemented. An image of the kernel evolution over the stimulus trial is shown in Figure 4c, where the prominent peak of the tonic mode is present at the beginning and end of the trial, and the quick transients and inhibitory lobe of the burst mode are present in the middle of the trial. Figure 4d provides another view of the progression, again with the adaptive estimate (black) and nonadaptive estimate (gray) superimposed for comparison. Bands represent two standard deviations around the estimate (details of statistical properties are in section A.4). Although the estimate suffers as compared to the noise-free
simulations of Figure 3, the fundamental properties of the two modes are still extracted from the data set, which is clear from both illustrations. The nonadaptive estimate essentially averages out the interesting time-varying dynamics, while the adaptive estimate deviates from it over the 3.75 to 7.5 second range. The quick transient of the burst mode is especially prominent in the adaptive estimate at 6.25 seconds. The confidence bands suggest that the adaptive estimates are significantly different from the nonadaptive estimates and also that the apparent evolution of the kernel properties over the trial is statistically significant. The spatial RF properties of neurons in the visual cortex have been shown to exhibit plasticity not only over long timescales of days or months, but also over short timescales of seconds to minutes. It has been proposed that
Figure 5: Dynamic cortical RF properties. (a) Original and (b) expanded RFs were used to simulate the neuronal response, which was tracked adaptively (c). Thick lines represent the adaptively estimated spatial RF (over one dimension), and shaded regions represent the nonadaptive estimates. (d) An image of the RF evolution. In an additional simulation, the Wiener system output was used as the time-varying rate of an inhomogeneous Poisson process, producing a noisy response. The resulting RF properties (again for 1D) are shown for the adaptively estimated RF (e). τ was approximately 800 msec.
this short-term plasticity plays a role in the continuing process of normalization and calibration of the visual system (Gilbert et al., 1996). Many cortical neurons exhibit stimulus-dependent RF expansions over relatively short time periods. Consider the following example, as illustrated in Figure 5. Concentrated and expanded RFs were generated using difference-of-gaussian functions with different spatial extents, as shown in Figures 5a and 5b, with essentially instantaneous temporal dynamics. The time-varying system was then generated using the temporal weighting functions of the previous example, varying from the original to the expanded and back to the original RF structure. The response to spatiotemporal white noise was simulated, as previously discussed, resulting in a mean firing rate of 35 Hz. Figure 5c shows a spatial slice of the adaptively estimated 2D receptive field as it evolves over the trial (thick line) with the static RF estimate (shaded) superimposed for reference. The static RF estimate is essentially the averaged RF over the trial, whereas the actual RF was initially more peaked, followed by a broadening of the tuning and finally a return to the peaked tuning at the end of the trial. The adaptive approach in this case is able to track these changes as they evolve over the stimulus trial, as shown in Figure 5d. In a more realistic simulation, the output of the linear filter-static nonlinearity cascade was used as the rate of an inhomogeneous Poisson process, as was done in the previous example. An image of the adaptively estimated RF (again for a 1D slice of the 2D field) over the stimulus trial is shown in Figure 5e. As with the previous example, the inherent noise of the Poisson simulation degrades the estimate, but the essential time-varying characteristics are still well captured.

3.2 Experimental Results from LGN Response to the Spatiotemporal M-Sequence. The above simulations provide insight into the development
of the methods of adaptive estimation, but the primary target of the approach is experimental. X cells in the cat LGN were stimulated with the
Figure 6: Evolution of adaptation in the LGN. (a) The first-order kernel at the beginning (beg) and end (end) of the 4 minute trial. For comparison, the kernel derived from the entire trial is superimposed (avg). Bands represent two standard deviations around the estimators. (b) An image of the kernel evolution over the 4 minute trial. τ was 7.8 sec.
Garrett B. Stanley
spatiotemporal m-sequence at 128 Hz and 100% contrast over several minutes (for details of the experimental protocol, see Stanley, Li, & Dan, 1999). Figure 6 illustrates typical results obtained using the adaptive estimation approach for a single pixel at the center of the RF. Figure 6a shows the first-order kernel estimates at the beginning (beg) and end (end) of the trial, with a marked change in temporal dynamics over the 4 minute period. This example was typical of the population from which this cell was drawn, undergoing an attenuation over the stimulus trial, consistent with slow adaptation effects related to a step up in contrast at stimulus onset. Also invariant among cells is the temporal compression that results in a decreased latency. The kernel derived from the entire trial (avg) is superimposed for reference, underscoring the importance of the adaptive approach presented here; when the kernel is computed from the entire trial, it produces essentially the average dynamics over the trial. Bands represent two standard deviations around the estimators. The image shown in Figure 6b illustrates the evolution of the kernel dynamics over the entire trial, with a decrease in latency and in kernel magnitude over the 4 minute period. The adaptive time constant τ was 7.8 seconds for this example. The sudden shift in latency is an artifact of the discretization of the kernel at 128 Hz, since the average shift in latency was on the order of magnitude of the sampling interval. The point to emphasize here is that these changes in encoding dynamics would not be discovered through traditional reverse-correlation techniques that assume a fixed relationship between stimulus and response over the trial.

4 Discussion
Reverse-correlation techniques have been used extensively in a wide range of sensory systems. Characterizing the RF properties of sensory neurons forms the basis for our current understanding of their functional significance. The assumption that the spatiotemporal RF properties are static over even short time intervals, however, is problematic. Systems-level adaptation, functional modulation, and plasticity occur on a continuum of timescales that exist in all but the most artificial of experimental conditions. Examples were presented here in which spatiotemporal RF properties of visual neurons were adaptively tracked over short time intervals through both simulation and experimental trials. Realistic, noisy simulations were conducted in the development of the adaptive approach, shown through both modulation of temporal coding properties in the LGN and modulation of spatial properties in the cortex. Furthermore, experimental data were used to demonstrate the validity of the approach under physiological conditions, and it was shown that the evolution of encoding properties could be smoothly tracked over short trials. The kernels showed a marked attenuation and temporal compression over a timescale that is consistent with slow forms of contrast and light adaptation. Obviously this could be controlled
Adaptive Spatiotemporal Receptive Field Estimation
for experimentally by allowing the cell to adapt first to the high-contrast stimulus, but one could argue that this induces an unnatural steady state that is not relevant ethologically. Statistical properties of both the adaptive and nonadaptive estimators were derived, providing a more rigorous means with which to compare encoding dynamics in different physiological conditions. It should also be noted that the approach presented here tracks the evolution of spatiotemporal encoding properties over a single trial, an especially critical point in tracking dynamics that are modulated by nonstimulus-related activity and thus may not be possible to repeat. Furthermore, for stimulus-related modulation of encoding dynamics, the applicability over single trials is important in natural viewing conditions, in which the eyes saccade across the visual scene, never invoking the same trajectory, and therefore stimulus pattern, twice. In this work, the neuronal response is characterized as the single-trial firing rate of the neuron in question. This simplification obviously ignores the precise temporal structure of the underlying point events that may be present in sparse coding strategies. However, for some subset of cells in early sensory systems, much of the information is thought to be coded in the firing rate. As is evident in Figure 2, X cells in the LGN are typically relatively linear in their response properties and are often well characterized with this strategy. The simple model prediction of firing rate is surprisingly good, although it may not capture subtle features that could be tracked through more elaborate strategies. One would expect the specific techniques presented here to be less ideal for the sparse coding observed in many cortical cells. The relatively high firing rate of thalamic neurons enables the use of an adaptive time constant that is small enough to allow tracking of relatively fast changes in the RF properties.
The lower firing rates associated with sparse coding strategies in the cortex would require much longer adaptive time constants, thereby limiting the effective bandwidth of adaptive processes that could be tracked. However, the general approach of adaptive estimation could certainly be applied; adaptive point process estimation techniques have recently been demonstrated in the hippocampus (Brown et al., 2001) and represent a promising direction of exploration in this context. The adaptive approach presented here is based on an exponential downweighting of past stimulus-response information. This functional form of the downweighting was chosen for computational reasons but could easily be extended to more complex relationships. However, when this strategy was implemented with other functional forms of the temporal dependence, the results were qualitatively similar. The relevant parameter in each case was the relative temporal window over which the estimation was performed (the time constant of the adaptive estimate, τ). A shorter window obviously enables one to track dynamics that change more rapidly over time, but at the expense of estimator variance; conversely, larger time constants reduce the variance but make it impossible to detect the nonstationarities.
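The window-length trade-off can be made concrete with a small simulation. The sketch below is illustrative only (the drifting gain, noise level, and forgetting factors are invented, and a single scalar gain stands in for a full kernel): an exponentially downweighted least-squares estimator is run with a short and a long effective window on a response whose gain drifts slowly.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 4000
s = rng.standard_normal(T)                                  # white-noise stimulus
h_true = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(T) / T)   # slowly drifting gain
r = h_true * s + 0.3 * rng.standard_normal(T)               # noisy response

def ew_estimate(lmbda):
    """Exponentially downweighted least squares for a scalar gain."""
    c = p = 1e-6
    h_hat = np.empty(T)
    for k in range(T):
        c = lmbda * c + s[k] * s[k]       # downweighted autocovariance
        p = lmbda * p + s[k] * r[k]       # downweighted cross-covariance
        h_hat[k] = p / c
    return h_hat

fast, slow = ew_estimate(0.95), ew_estimate(0.999)
err_fast = np.mean((fast[100:] - h_true[100:]) ** 2)   # noisy but tracks the drift
err_slow = np.mean((slow[100:] - h_true[100:]) ** 2)   # smooth but lags behind
print(err_fast, err_slow)
```

With the drift and noise levels chosen here, the short window wins overall: its extra estimator variance costs less than the tracking lag of the long window. For a stationary system, the ordering would reverse.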
The time constants used in this study were thus chosen so as to use as little past information as possible while still yielding meaningful structure in the kernel estimates. One of the limitations of the proposed methodology lies in the underlying assumptions imposed on the model structure. The input to the static nonlinearity was assumed to be gaussian in nature. However, even with binary inputs, such as that of the m-sequence, the output of the linear block is nearly gaussian for even relatively short filter lengths. The nonlinearity described in this work is a simple half-wave rectification. This obviously does not capture saturation at high firing rates, but instead assumes that stimuli are sufficiently small as to avoid driving the cells to saturation. The approach presented here could certainly be integrated with recent techniques in the identification of arbitrary nonlinearities in the cascade (Chichilnisky, 2001; Nykamp & Ringach, 2001), in order to capture the phenomena over a larger operating range. Finally, it was also assumed that the mean of the input to the static nonlinearity was much smaller than the standard deviation (see section A.4), which significantly reduced the complexity of several of the estimates. However, as demonstrated in section A.2, this assumption was reasonable for the experimental data presented. In summary, time-varying encoding properties may be a ubiquitous characteristic of all sensory neurons. Adaptation has been studied for some time and is known to dramatically affect temporal and spatial dynamics, and anecdotal evidence of modulatory effects from other brain regions on encoding properties has been reported for a number of stages in the visual pathway. However, it is not yet known what these phenomena imply for the coding strategies of the sensory pathway as a whole.
The approach presented here is a first step toward capturing the evolution of spatiotemporal RF properties over a range of timescales, with the goal of eventually understanding dynamic coding strategies employed by the visual pathway in natural settings, where there is no real concept of steady state.

Appendix

A.1 Time-Domain Estimation. Consider a purely linear system, with zero-mean input $s[k]$, output $x[k]$, and first-order kernel $g[k]$ that can be expressed through the vector $\mathbf{h}$. Define a data vector $\mathbf{Q}$,
$$\mathbf{Q}[k] \triangleq [s[k]\;\; s[k-1]\;\; \cdots\;\; s[k-L+1]]^T \in \mathbb{R}^{L\times 1}.$$
The output can be written $x[k] = \mathbf{h}^T\mathbf{Q}[k]$. Let the error in the prediction be defined as $e[k] = x[k] - \hat{x}[k]$. The mean square error (MSE) is then defined as $\mathrm{MSE} = E\{(e[k])^2\}$, where $E\{\cdot\}$ denotes statistical expectation. The first-order kernel can then be estimated by minimizing the MSE of prediction:
$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}\in\mathbb{R}^L} \sum_k \left( x[k] - \mathbf{h}^T\mathbf{Q}[k] \right)^2.$$
Taking the partial of this objective function with respect to $h[q]$ and setting it equal to zero yields
$$\varphi_{sx}[q] = \sum_{m=0}^{L-1} h[m]\,\varphi_{ss}[q-m],$$
where $\varphi_{sx}$ is the cross-covariance between the input and output, and $\varphi_{ss}$ is the autocovariance of the input. This can be expressed as
$$\begin{bmatrix} \varphi_{sx}[0] \\ \varphi_{sx}[1] \\ \vdots \\ \varphi_{sx}[L-1] \end{bmatrix} = \begin{bmatrix} \varphi_{ss}[0] & \varphi_{ss}[1] & \cdots & \varphi_{ss}[L-1] \\ \varphi_{ss}[1] & \varphi_{ss}[0] & \cdots & \varphi_{ss}[L-2] \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{ss}[L-1] & \varphi_{ss}[L-2] & \cdots & \varphi_{ss}[0] \end{bmatrix} \begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[L-1] \end{bmatrix}.$$
Using compact notation reduces the relationship to $\boldsymbol{\varphi}_{sx} = \mathbf{W}_{ss}\mathbf{h}$, where $\mathbf{W}_{ss}$ is the $L \times L$ Toeplitz matrix of the input autocovariance. The first-order kernel can then be estimated as $\hat{\mathbf{h}} = \mathbf{W}_{ss}^{-1}\boldsymbol{\varphi}_{sx}$. This estimator is the optimal estimator in the least-squares sense (Ljung, 1987).

A.2 Effect of Static Nonlinearity on Covariance Structure. The cascade of a linear system with a static nonlinearity is often referred to as a Wiener system, as shown in Figure 1. It has been shown that the effect of odd static nonlinearities is a simple scaling of the covariance structure (Bussgang, 1952; Papoulis, 1984; Hunter & Korenberg, 1986). Bussgang's theorem states that if the input to a memoryless system $r = f(x)$ is a normal process $x(t)$, the cross-covariance of $x(t)$ with the output $r(t)$ is proportional to the autocovariance of the input (Papoulis, 1984):
$$\varphi_{xr}(\tau) = C\,\varphi_{xx}(\tau),$$
where
$$C = E\{f'[x(t)]\}.$$
For a half-wave rectification, we have
$$f(x) = \begin{cases} x & \text{for } x \ge 0 \\ 0 & \text{else} \end{cases} \qquad f'(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ 0 & \text{else.} \end{cases}$$
The scaling constant $C$ then becomes
$$C = E\{f'(x)\} = \int_{-\infty}^{\infty} f'(x)\,p_x(x)\,dx,$$
where $p_x(x)$ is assumed to be a gaussian probability density function $N(\mu_x, \sigma_x^2)$. We therefore have
$$C = \int_0^{\infty} p_x(x)\,dx = \int_{-\mu_x/\sigma_x}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = 1 - \Psi\!\left(\frac{-\mu_x}{\sigma_x}\right),$$
where $\Psi(\cdot)$ is the standard normal cumulative. The theoretical and observed scalings for varying $\mu_x/\sigma_x$ are shown in Figure 7. Gaussian white noise processes with mean $\mu_x$ and variance $\sigma_x^2$ were rectified; the ratio of cross-covariance to autocovariance, $C$, is plotted as a function of $\mu_x/\sigma_x$ in Figure 7a.
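The scaling result is easy to check numerically, and it anticipates how the appendix later corrects the least-squares estimate of section A.1 for the rectification. The sketch below is illustrative (the kernel and sample sizes are invented): a Wiener system is simulated, the kernel is estimated from the rectified output by the Toeplitz solve of section A.1, and the estimate is divided by $C = 1 - \Psi(-\mu_x/\sigma_x)$.

```python
import numpy as np
from math import erf, sqrt
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
L, N = 10, 50000
g = np.exp(-np.arange(L) / 3.0)              # hypothetical first-order kernel

s = rng.standard_normal(N)                   # white-noise stimulus
x = np.convolve(s, g)[:N]                    # linear-block output (mean ~ 0)
r = np.maximum(x, 0.0)                       # half-wave rectified output

# Bussgang scaling C = 1 - Psi(-mu_x/sigma_x), evaluated from the data
mu_x, sigma_x = x.mean(), x.std()
C = 0.5 * (1.0 + erf((mu_x / sigma_x) / sqrt(2.0)))

# Toeplitz least-squares solve of section A.1 applied to the rectified output
w_ss = np.array([np.dot(s[:N - q], s[q:]) / N for q in range(L)])
w_sr = np.array([np.dot(s[:N - q], r[q:]) / N for q in range(L)])
h_hat = np.linalg.solve(toeplitz(w_ss), w_sr) / C   # undo the C scaling

print(np.max(np.abs(h_hat - g)))             # small: the kernel is recovered
```

In the simulation, $C$ is computed from the hidden signal $x$ for clarity; with real data it would follow from the $\mu_x \ll \sigma_x$ assumption (so $C \approx 1/2$), as the article argues.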
Figure 7: Effect of rectification on correlation structure. (a) Simulated and theoretical relationships between $\mu_x/\sigma_x$ and scaling $C$. Gaussian white noise signals with mean $\mu_x$ and variance $\sigma_x^2$ were rectified, and the ratio of cross-covariance to autocovariance computed. (b) The output of the linear filter tends to a gaussian distribution. The nonscaled estimated kernel underpredicts the experimentally observed firing rate. (c) The $1/C$ scaling on the kernel that minimizes the mean squared error is approximately 2, implying $\mu_x \ll \sigma_x$, as indicated in a.
Importantly, data from the cat LGN reveal that the scaling of the kernel, $1/C$, that minimizes the error in the predicted firing rate is approximately 2, as shown in Figure 7c. The combination of these results suggests that the mean of the filter output is small relative to the standard deviation, that is, $\mu_x \ll \sigma_x$ within this structural framework. For the Wiener system shown in Figure 1, it is straightforward to extend the discussion above to the estimate for the linear block. For an uncorrelated input, the output of the linear kernel tends to a gaussian distribution, as shown in Figure 7b. The resulting relationship is $\boldsymbol{\varphi}_{sr} = C\boldsymbol{\varphi}_{sx}$, giving, for half-wave rectification, the estimate
$$\hat{\mathbf{h}} = \frac{1}{C}\,\mathbf{W}_{ss}^{-1}\boldsymbol{\varphi}_{sr}.$$

A.3 Recursive Estimation. The recursive kernel estimation problem can be formulated in the following manner. After presenting a stimulus and observing the response of the neuron up to time $t$, the estimation can be expressed as
$$\hat{\mathbf{h}}_t = \arg\min_{\mathbf{h}} \sum_{k=L}^{t} \left( r[k] - \lfloor \mathbf{h}^T\mathbf{Q}[k] + \mu_x \rfloor \right)^2,$$
where the subscript $t$ denotes the time-varying nature of the parameter vector $\mathbf{h}$. Using the least-squares technique described previously, the filter
can be estimated based on information up to time $t$,
$$\hat{\mathbf{h}}_t = \frac{1}{C}\,\mathbf{C}_t^{-1}\mathbf{P}_t,$$
where $\mathbf{C}_t$ and $\mathbf{P}_t$ are the stimulus autocovariance Toeplitz matrix and the cross-covariance between stimulus and response, respectively, based on data up to time $t$:
$$\mathbf{C}_t = \frac{1}{t-L+1}\sum_{k=L}^{t}\mathbf{Q}[k]\mathbf{Q}^T[k] = \frac{t-L}{t-L+1}\,\mathbf{C}_{t-1} + \frac{1}{t-L+1}\,\mathbf{Q}[t]\mathbf{Q}^T[t]$$
$$\mathbf{P}_t = \frac{1}{t-L+1}\sum_{k=L}^{t}\mathbf{Q}[k]r[k] = \frac{t-L}{t-L+1}\,\mathbf{P}_{t-1} + \frac{1}{t-L+1}\,\mathbf{Q}[t]r[t].$$
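These running updates can be sketched in a few lines. The example below is illustrative (the kernel, data lengths, and the Bussgang constant $C = 1/2$ for a zero-mean linear-block output are assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
L, T = 8, 5000
g = np.exp(-np.arange(L) / 2.0)          # hypothetical first-order kernel
s = rng.standard_normal(T)               # white-noise stimulus
x = np.convolve(s, g)[:T]                # linear-block output (zero mean)
r = np.maximum(x, 0.0)                   # observed half-wave rectified response
C = 0.5                                  # Bussgang scaling for mu_x = 0

C_t = 1e-3 * np.eye(L)                   # running autocovariance (regularized start)
P_t = np.zeros(L)                        # running stimulus-response cross-covariance
for t in range(L - 1, T):
    Q = s[t - L + 1:t + 1][::-1]         # Q[t] = [s[t], s[t-1], ..., s[t-L+1]]
    n = t - L + 2                        # number of terms accumulated so far
    C_t = (n - 1) / n * C_t + np.outer(Q, Q) / n
    P_t = (n - 1) / n * P_t + Q * r[t] / n

h_hat = np.linalg.solve(C_t, P_t) / C    # h_t = (1/C) C_t^{-1} P_t
print(np.max(np.abs(h_hat - g)))
```

Rearranging the same updates gives the fully recursive form for $\hat{\mathbf{h}}_t$ derived next, which avoids re-solving the linear system from scratch at each step.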
This gives the final expression for the estimate:
$$\hat{\mathbf{h}}_t = \hat{\mathbf{h}}_{t-1} + \frac{1}{t-L+1}\,\mathbf{C}_t^{-1}\mathbf{Q}[t]\left(\frac{1}{C}\,r[t] - \mathbf{Q}^T[t]\hat{\mathbf{h}}_{t-1}\right).$$
The estimate based on data up to time $t-1$ is therefore used in the subsequent estimate based on information up to time $t$, which results in a recursion for the filter estimation (Goodwin & Sin, 1984; Ljung & Söderström, 1983).

A.4 Statistical Properties of Estimates. As with the traditional least-squares problem, the statistics of the estimator for the Wiener system can be characterized in a relatively straightforward manner. It is first necessary to establish the statistical properties of the linear problem, given here for reference. Define the following data matrices:
$$\mathbf{D} \triangleq [\mathbf{Q}[L]\;\; \mathbf{Q}[L+1]\;\; \cdots\;\; \mathbf{Q}[N]]^T \in \mathbb{R}^{(N-L+1)\times L}$$
$$\mathbf{y} \triangleq [x[L]\;\; x[L+1]\;\; \cdots\;\; x[N]]^T \in \mathbb{R}^{(N-L+1)\times 1}.$$
The estimate of the parameter vector from section A.1 can then be written in a more standard least-squares form, $\hat{\mathbf{h}} = \mathbf{C}^{-1}\mathbf{P} = (\mathbf{D}^T\mathbf{D})^{-1}\mathbf{D}^T\mathbf{y}$, where again $\mathbf{C}$ represents the stimulus autocovariance Toeplitz matrix and $\mathbf{P}$ represents the cross-covariance matrix between stimulus and response. This estimator is known to be unbiased, so that $E[\hat{\mathbf{h}}] = \mathbf{h}$. The covariance can be written (Westwick, Suki, & Lutchen, 1998)
$$\boldsymbol{\Lambda}_{\hat{\mathbf{h}}} = E\{(\hat{\mathbf{h}} - \mathbf{h})(\hat{\mathbf{h}} - \mathbf{h})^T\} = (\mathbf{D}^T\mathbf{D})^{-1}\mathbf{D}^T E\{\mathbf{e}\mathbf{e}^T\}\mathbf{D}(\mathbf{D}^T\mathbf{D})^{-1},$$
where $\mathbf{e} = [e[L]\;\; e[L+1]\;\; \cdots\;\; e[N]]^T \in \mathbb{R}^{(N-L+1)\times 1}$ is the noise vector, with $e[k] \triangleq x[k] - \hat{x}[k]$. If the noise is a white process with standard deviation $\sigma_e \gg \mu_e$, then $E\{\mathbf{e}\mathbf{e}^T\} \approx \sigma_e^2\mathbf{I}$, resulting in
$$\boldsymbol{\Lambda}_{\hat{\mathbf{h}}} \approx \sigma_e^2(\mathbf{D}^T\mathbf{D})^{-1} = \frac{\sigma_e^2}{N-L+1}\,\mathbf{C}^{-1} \in \mathbb{R}^{L\times L}.$$
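This white-noise covariance approximation is easy to check by Monte Carlo simulation. In the sketch below (all dimensions and values are invented for illustration), the stimulus matrix D is held fixed while the white gaussian noise is redrawn on each trial, and the empirical covariance of the least-squares estimates is compared with $\sigma_e^2(\mathbf{D}^T\mathbf{D})^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(4)
L, N, trials = 4, 400, 3000
h = np.array([1.0, -0.5, 0.25, -0.125])   # hypothetical kernel
sigma_e = 0.5                             # white-noise standard deviation

s = rng.standard_normal(N + L - 1)
# Data matrix D: row k holds the L most recent stimulus samples, newest first
D = np.stack([s[k:k + L][::-1] for k in range(N)])
x_clean = D @ h

DtD_inv = np.linalg.inv(D.T @ D)
ests = np.empty((trials, L))
for i in range(trials):
    y = x_clean + sigma_e * rng.standard_normal(N)   # redraw the noise only
    ests[i] = DtD_inv @ (D.T @ y)                    # least-squares estimate

emp_cov = np.cov(ests.T)
theory_cov = sigma_e ** 2 * DtD_inv
print(np.max(np.abs(emp_cov - theory_cov)))          # small relative to the entries
```

The sample mean of the estimates also stays near the true kernel, consistent with the unbiasedness noted above.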
The covariance structure of the estimator $\hat{\mathbf{h}}$ could be directly obtained from the noise term $e[k] = x[k] - \hat{x}[k]$. In this case, however, there is obviously no access to the intermediate signal $x[k]$. The unobservable noise term $\mathbf{e}$ must therefore be related to observable noise, which is the difference between the actual firing rate of the cell $r[k]$ and that predicted by the linear-nonlinear cascade, $n[k] = r[k] - \hat{r}[k]$. In order to understand how the two noise processes relate, it is helpful to consider first the effect of the static nonlinearity on the probability density function. Suppose $x$ is a random variable such that $x \sim N(\mu_x, \sigma_x^2)$. Let $p_x(x)$ represent the density function associated with the random variable $x$. Let $r$ be the output of the static nonlinearity $f(\cdot)$, such that $r = f(x)$. For the half-wave rectification, the expectation becomes
$$\mu_r = E\{r\} = \int_{-\infty}^{\infty} r\,p_r(r)\,dr = \int_{-\infty}^{\infty} f(x)\,p_x(x)\,dx = \int_{0}^{\infty} x\,\frac{1}{\sqrt{2\pi\sigma_x^2}}\,e^{-(x-\mu_x)^2/2\sigma_x^2}\,dx.$$
Letting $z = \frac{x-\mu_x}{\sigma_x}$, the expression becomes
$$E\{r\} = \frac{1}{\sqrt{2\pi\sigma_x^2}}\int_{-\mu_x/\sigma_x}^{\infty} (\sigma_x z + \mu_x)\,e^{-z^2/2}\,\sigma_x\,dz = \sigma_x\frac{1}{\sqrt{2\pi}}\int_{-\mu_x/\sigma_x}^{0} z\,e^{-z^2/2}\,dz + \sigma_x\frac{1}{\sqrt{2\pi}}\int_{0}^{\infty} z\,e^{-z^2/2}\,dz + \mu_x\int_{-\mu_x/\sigma_x}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz.$$
If $\mu_x \ll \sigma_x$, the expectation becomes
$$E\{r\} = \sigma_x\frac{1}{\sqrt{2\pi}}\int_{0}^{\infty} z\,e^{-z^2/2}\,dz = \frac{\sigma_x}{\sqrt{2\pi}}.$$
From the sample mean of the output signal $r$, the variance of the underlying zero-mean noise process can be estimated. For the variance, $\sigma_r^2 = E\{r^2\} - \mu_r^2$, where
$$E\{r^2\} = \frac{\sigma_x^3}{\sqrt{\sigma_x^2}}\int_{-\mu_x/\sigma_x}^{\infty} z^2\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz + 2\,\frac{\mu_x\sigma_x^2}{\sqrt{\sigma_x^2}}\int_{-\mu_x/\sigma_x}^{\infty} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz + \frac{\mu_x^2\sigma_x}{\sqrt{\sigma_x^2}}\int_{-\mu_x/\sigma_x}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz.$$
Figure 8: Equivalent noise sources. The top block diagram represents the encoding model as observed experimentally, with the source of noise at the point of measurement. The bottom block diagram represents the dynamics assumed in the adaptive estimation scheme. As detailed in the text, the statistical properties of the hidden noise source e can be estimated from the observed “noise” n.
If again $\mu_x \ll \sigma_x$, this results in
$$\sigma_r^2 = E\{r^2\} - E\{r\}^2 = \frac{\sigma_x^2}{2} - \frac{\sigma_x^2}{2\pi} = \frac{(\pi-1)}{2\pi}\,\sigma_x^2.$$
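Both moments of the rectified output are simple to confirm by simulation. In this sketch (the value of $\sigma_x$ is arbitrary and illustrative), $\mu_x = 0$, so the condition $\mu_x \ll \sigma_x$ holds exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma_x = 1.5
x = sigma_x * rng.standard_normal(1_000_000)   # gaussian with mu_x = 0
r = np.maximum(x, 0.0)                         # half-wave rectification

mu_r_theory = sigma_x / np.sqrt(2 * np.pi)
var_r_theory = (np.pi - 1) / (2 * np.pi) * sigma_x ** 2

print(r.mean(), mu_r_theory)     # mean of the rectified output
print(r.var(), var_r_theory)     # variance of the rectified output
```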
This result can now be extended to describe the equivalent noise sources shown in Figure 8. For the top block diagram,
$$\sigma_r^2 = \sigma_{\hat{r}}^2 + \sigma_n^2 = \frac{\pi-1}{2\pi}\,\sigma_{\hat{x}}^2 + \sigma_n^2 = \frac{\pi-1}{2\pi}\,\|\mathbf{h}\|^2\sigma_s^2 + \sigma_n^2.$$
Similarly, for the bottom system,
$$\sigma_r^2 = \frac{\pi-1}{2\pi}\,\sigma_x^2 = \frac{\pi-1}{2\pi}\,(\sigma_{\hat{x}}^2 + \sigma_e^2) = \frac{\pi-1}{2\pi}\,(\|\mathbf{h}\|^2\sigma_s^2 + \sigma_e^2),$$
where $\|\cdot\|$ is the standard Euclidean norm. Putting the two equations together yields
$$\sigma_n^2 = \frac{\pi-1}{2\pi}\,\sigma_e^2.$$
This gives the equivalent variance of the two noise sources. Given the observed neuronal response $r$, the statistical properties of the hidden noise process $\mathbf{e}$ can be estimated, which is necessary for determining the statistics of the least-squares estimate. The estimator remains unbiased, and the covariance of the estimator can be written as
$$\boldsymbol{\Lambda}_{\hat{\mathbf{h}}} = \sigma_e^2(\mathbf{D}^T\mathbf{D})^{-1} = \frac{\sigma_e^2}{N-L+1}\,\mathbf{C}^{-1} = \frac{1}{N-L+1}\,(2\pi\mu_r^2 - \|\mathbf{h}\|^2\sigma_s^2)\,\mathbf{C}^{-1}, \quad (A.1)$$
where $\mu_r$ is the mean observed firing rate and $\sigma_s^2$ is the variance of the stimulus. The quantities $\mu_r$, $\mathbf{h}$, and $\sigma_s^2$ can obviously be estimated from data. The uncertainty in the kernel estimate in Figure 2 was computed in this manner. This argument extends naturally to the recursive estimates, for which the covariance becomes
$$\boldsymbol{\Lambda}_{\hat{\mathbf{h}}_t} = \frac{1}{t-L+1}\,(2\pi\mu_r^2 - \|\mathbf{h}\|^2\sigma_s^2)\,\mathbf{C}_t^{-1}. \quad (A.2)$$
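The quantity $2\pi\mu_r^2 - \|\mathbf{h}\|^2\sigma_s^2$ in equations A.1 and A.2 is an estimate of the hidden noise power $\sigma_e^2$ built entirely from observable or estimable quantities. A quick simulation check, with invented kernel and noise values and $\mu_x = 0$ so that $\mu_x \ll \sigma_x$ holds:

```python
import numpy as np

rng = np.random.default_rng(6)
L, N = 6, 500000
h = np.array([0.8, 0.5, 0.3, 0.2, 0.1, 0.05])   # hypothetical kernel
sigma_s, sigma_e = 1.0, 0.4

s = sigma_s * rng.standard_normal(N)
x = np.convolve(s, h)[:N] + sigma_e * rng.standard_normal(N)
r = np.maximum(x, 0.0)                  # observed rectified response

mu_r = r.mean()                         # observable mean firing rate
sigma_e2_hat = 2 * np.pi * mu_r ** 2 - np.dot(h, h) * sigma_s ** 2
print(sigma_e2_hat, sigma_e ** 2)       # the hidden noise power is recovered
```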
Within the adaptive framework, the estimate remains unbiased, as with the nonweighted estimation problem, but the covariance of the estimator has a slightly different form. The previous expression for the covariance was normalized by the data length used in the estimate. With the adaptive algorithm presented here, the normalization depends on the structure of the exponential decay of contributions from the past:
$$\boldsymbol{\Lambda}_{\hat{\mathbf{h}}_t} = \frac{1}{w}\,(2\pi\mu_r^2 - \|\mathbf{h}\|^2\sigma_s^2)\,\mathbf{C}_t^{-1}, \qquad w \triangleq \sum_{k=L}^{t} \lambda^{t-k+1}, \quad (A.3)$$
where $w$ reflects the exponential downweighting of past information. Note that if $\lambda = 1$, the estimator variance matches that of the nonadaptive estimator. The uncertainty in the kernel estimates in Figures 4 through 6 was computed in this manner.

Acknowledgments
I thank Emery Brown, Roger Brockett, and Nick Lesica for helpful comments in the preparation of this article and Yang Dan for the use of the physiological data.

References

Albrecht, D. G., Farrar, S. B., & Hamilton, D. B. (1984). Spatial contrast adaptation characteristics of neurones recorded in the cat's visual cortex. J. Physiol., 347, 713–739.
Brown, E. N., Nguyen, D. P., Frank, L. M., Wilson, M. A., & Solo, V. (2001). An analysis of neural receptive field plasticity by point process adaptive filtering. PNAS, 98, 12261–12266.
Bussgang, J. J. (1952). Crosscorrelation functions of amplitude-distorted gaussian signals. MIT Res. Lab. Elec. Tech. Rep., 216, 1–14.
Chichilnisky, E. J. (2001). A simple white noise analysis of neuronal light responses. Network, 12, 199–213.
Dan, Y., Atick, J., & Reid, R. (1996). Efficient coding of natural scenes in the lateral geniculate nucleus: Experimental test of a computational theory. J. Neurosci., 16(10), 3351–3362.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
DeBoer, E., & Kuyper, P. (1968). Triggered correlation. IEEE Trans. Biomed. Eng., 15, 169–179.
deCharms, R. C., Blake, D., & Merzenich, M. (1998). Optimizing sound features for cortical neurons. Science, 280, 1439–1443.
Enroth-Cugell, C., & Shapley, R. (1973). Adaptation and dynamics of cat retinal ganglion cells. J. Physiol., 233, 271–309.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Natl. Acad. Sci. USA, 93, 615–622.
Goodwin, G. C., & Sin, K. S. (1984). Adaptive filtering prediction and control. Upper Saddle River, NJ: Prentice Hall.
Greblicki, W. (1992). Nonparametric identification of Wiener systems. IEEE Trans. on Inform. Thry., 38, 1487–1493.
Greblicki, W. (1994). Nonparametric identification of Wiener systems by orthogonal series. IEEE Trans. on Automatic Control, 39, 2077–2086.
Greblicki, W. (1997). Nonparametric approach to Wiener system identification. IEEE Trans. on Circuits and Sys., 44, 538–545.
Heeger, D. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci., 9, 181–197.
Hunter, I. W., & Korenberg, M. J. (1986). The identification of nonlinear biological systems: Wiener and Hammerstein cascade models. Biological Cybernetics, 55, 135–144.
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol., 58, 1187–1211.
Korenberg, M. J., & Hunter, I. W. (1999). Two methods for identifying Wiener cascades having noninvertible static nonlinearities. Annals of Biomed. Eng., 27, 793–804.
Ljung, L. (1987). System identification: Theory for the user. Upper Saddle River, NJ: Prentice Hall.
Ljung, L., & Söderström, T. (1983). Theory and practice of recursive identification. Cambridge, MA: MIT Press.
Mao, B. Q., MacLeish, P. R., & Victor, J. D. (1998). The intrinsic dynamics of retinal bipolar cells isolated from tiger salamander. Vis. Neurosci., 15, 425–438.
Movshon, J. A., Thompson, I. D., & Tolhurst, D. J. (1978). Spatial summation in the receptive fields of simple cells in the cat's striate cortex. J. Physiol., 283, 53–77.
Mukherjee, P., & Kaplan, E. (1995). Dynamics of neurons in the cat lateral geniculate nucleus: In vivo electrophysiology and computational modeling. J. Neurophysiol., 74(3), 1222–1243.
Nykamp, D. Q., & Ringach, D. L. (2001). Full identification of a linear-nonlinear system via cross-correlation analysis. Journal of Vision, 2, 1–11.
Ohzawa, I., Sclar, G., & Freeman, R. D. (1985). Contrast gain control in the cat's visual system. Journal of Neurophysiology, 54, 651–667.
Papoulis, A. (1984). Probability, random variables, and stochastic processes (2nd ed.). New York: McGraw-Hill.
Paulin, M. G. (1993). A method for constructing data-based models of spiking neurons using a dynamic linear-static nonlinear cascade. Biol. Cybern., 69, 67–76.
Pei, X., Vidyasagar, T. R., Volgushev, M., & Creutzfeld, O. D. (1994). Receptive field analysis and orientation selectivity of postsynaptic potentials of simple cells in cat visual cortex. J. Neurosci., 14(11), 7130–7140.
Przybyszewski, A. W., Gaska, J. P., Foote, W., & Pollen, D. A. (2000). Striate cortex increases contrast gain of macaque LGN neurons. Visual Neuroscience, 17, 485–494.
Reid, R. C., Soodak, R. E., & Shapley, R. M. (1991). Directional selectivity and spatiotemporal structure of receptive fields of simple cells in cat striate cortex. J. Neurophysiol., 66, 505–529.
Reid, R. C., Victor, J. D., & Shapley, R. M. (1997). The use of m-sequences in the analysis of visual neurons: Linear receptive field properties. Vis. Neurosci., 14, 1015–1027.
Ringach, D. L., Hawken, M. J., & Shapley, R. (2002). Receptive field structure in neurons in monkey primary visual cortex revealed by stimulation with natural image sequences. Journal of Vision, 2, 12–24.
Ringach, D., Sapiro, G., & Shapley, R. (1997). A subspace reverse-correlation technique for the study of visual neurons. Vision Res., 37, 2455–2464.
Shapley, R., & Victor, J. D. (1978). The effect of contrast on the transfer properties of cat retinal ganglion cells. J. Physiol., 285, 275–298.
Sherman, S., & Koch, C. (1986). The control of retinogeniculate transmission in the mammalian lateral geniculate nucleus. Exp. Brain. Res., 63, 1–20.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Stanley, G. B. (2000a). Algorithms for real-time prediction in neural systems. Paper presented at the Annual Meeting of the Engineering in Medicine and Biology Society, Chicago.
Stanley, G. B. (2000b). Time-varying properties of neurons in visual cortex: Estimation and prediction. Paper presented at the 30th Annual Meeting of the Society for Neuroscience, New Orleans, LA.
Stanley, G. B. (2001). Recursive stimulus reconstruction algorithms for real-time implementation in neural ensembles. Neurocomputing, 38–40, 1703–1708.
Stanley, G., Li, F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. Journal of Neuroscience, 19(18), 8036–8042.
Tolhurst, D. J., & Dean, A. F. (1990). The effects of contrast on the linearity of the spatial summation of simple cells in the cat's striate cortex. Experimental Brain Research, 79, 582–588.
Westwick, D. T., Suki, B., & Lutchen, K. (1998). Sensitivity analysis of kernel estimates: Implications in nonlinear physiological system identification. Annals of Biomed. Eng., 26, 488–501.

Received July 1, 2001; accepted May 13, 2002.
LETTER
Communicated by Morris Hirsch
Global Convergence Rate of Recurrently Connected Neural Networks

Tianping Chen
[email protected]
Wenlian Lu
[email protected]
Laboratory of Nonlinear Mathematics Science, Institute of Mathematics, Fudan University, Shanghai, China

Shun-ichi Amari
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama 351-01, Japan

We discuss recurrently connected neural networks, investigating their global exponential stability (GES). Some sufficient conditions for a class of recurrent neural networks to belong to GES are given. A sharp convergence rate is given as well.

1 Introduction
Recurrently connected neural networks, sometimes called Grossberg-Hopfield neural networks, have been extensively studied, and many applications in different areas have been found. Such applications heavily depend on the dynamical behavior of the networks. Therefore, analysis of these behaviors is a necessary step for the practical design of neural networks. The dynamical behaviors of recurrently connected neural networks have been studied since the early period of research on neural networks. For example, multistable and oscillatory behaviors were studied by Amari (1971, 1972) and Wilson and Cowan (1972). Chaotic behaviors were studied by Sompolinsky and Crisanti (1988). Hopfield and Tank (1984, 1986) studied the dynamic stability of symmetrically connected networks and showed their practical applicability to optimization problems. Cohen and Grossberg (1983) gave more rigorous results on the global stability of networks. Other work studied the local and global stability of asymmetrically connected networks: Chen and Amari (2001a, 2001b), Cohen and Grossberg (1983), Fang and Kincaid (1996), Hirsch (1989), Kaszkurewicz and Bhaya (1994), Kelly (1990), Matsuoka (1992), Sompolinsky and Crisanti (1988), Wilson and Cowan (1972), Yang and Dillon (1994), and others. Forti, Manetti, and Marini (1994) and Forti and Tesi (1995) proposed three main features a neural network should possess. Among them, the important one is that

Neural Computation 14, 2947–2957 (2002) © 2002 Massachusetts Institute of Technology
a neural network should be absolutely stable. They also pointed out that negative semidefiniteness of the interconnection matrix guarantees the global stability of a Grossberg-Hopfield network and proved absolute global stability for a class G of activation functions. However, they did not discuss absolute exponential stability or give the convergence rate. Liang and Wu (1999) discussed absolute exponential stability; however, they assumed that the connection matrix has a special structure, namely that the off-diagonal entries are nonnegative and the activation functions are bounded. Even with these restrictions, the attraction region is limited. Zhang, Heng, and Fu (1999), Liang and Wang (2000), and Liang and Si (2001) discussed the exponential stability of Grossberg-Hopfield neural networks and estimated the convergence rate. However, these estimates are less sharp. Our goal here is to provide a more accurate estimate of the convergence rate.

2 Some Definitions and Main Results
Definition 1. Class $\bar{G}$ of functions: Let $G = \mathrm{diag}[G_1, \ldots, G_n]$, where $G_i > 0$, $i = 1, \ldots, n$. $g(x) = (g_1(x), \ldots, g_n(x))^T$ is said to belong to $\bar{G}$ if the functions $g_i(x)$, $i = 1, \ldots, n$, satisfy
$$0 < \frac{g_i(x+u) - g_i(x)}{u} \le G_i.$$

Definition 2. $M^s = \frac{1}{2}(M + M^T)$ is defined as the symmetric part of the matrix $M$.
Here, we investigate the following dynamical system:
$$\frac{du_i(t)}{dt} = -d_i u_i(t) + \sum_j a_{ij}\,g_j(u_j(t)) + I_i, \qquad i = 1, \ldots, n, \quad (2.1)$$
where $d_i > 0$ and $g(x) = (g_1(x), \ldots, g_n(x))^T \in \bar{G}$.

Definition 3. The recurrently connected neural network 2.1 is said to be absolutely exponentially stable if any solution of it converges to an equilibrium point exponentially for any activation functions in $\bar{G}$.
We will prove two theorems on the globally exponential convergence.

Theorem 1. Let $I_n$ be the identity matrix of order $n$, $D = \mathrm{diag}[d_1, \ldots, d_n]$, $A = (a_{ij})_{i,j=1}^n$, $P = \mathrm{diag}[p_1, \ldots, p_n]$, where $p_i > 0$, $i = 1, \ldots, n$, and $g(x) = (g_1(x), \ldots, g_n(x))^T \in \bar{G}$ with $g_i(x)$ being differentiable. If there is a positive constant $\alpha > 0$ such that
$$\alpha < d_i, \qquad i = 1, \ldots, n, \quad (2.2)$$
and $\{P[(D - \alpha I_n)G^{-1} - A]\}^s$ is positive definite, then the dynamical system 2.1 has a unique equilibrium point $u^*$, such that
$$|u_i(t) - u_i^*| = O(e^{-\alpha t}), \qquad i = 1, \ldots, n. \quad (2.3)$$
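Theorem 1 can be illustrated numerically. In the sketch below, all network parameters are invented for the example: with $g = \tanh$ we have $g \in \bar{G}$ for $G = I_n$, and with $P = I_n$, $D = \mathrm{diag}(2, 2)$, and the matrix $A$ given below, $\alpha = 1$ satisfies $\alpha < d_i$ and makes $\{P[(D - \alpha I_n)G^{-1} - A]\}^s$ positive definite, so solutions should approach the equilibrium at least as fast as $e^{-\alpha t}$.

```python
import numpy as np

# System 2.1 with g = tanh (so G = I_n) and illustrative parameters
D = np.array([2.0, 2.0])
A = np.array([[0.5, -0.3],
              [0.4,  0.2]])
I_ext = np.array([0.3, -0.2])
alpha = 1.0          # alpha < d_i, and {(D - alpha*I_n)G^{-1} - A}^s > 0

def rhs(u):
    return -D * u + A @ np.tanh(u) + I_ext

dt = 1e-3
u0 = np.array([2.0, -1.5])

# Locate the equilibrium u* by integrating for a long time
u = u0.copy()
for _ in range(int(10.0 / dt)):
    u = u + dt * rhs(u)
u_star = u.copy()

# Re-integrate from u0, recording the distance to the equilibrium
u = u0.copy()
dist = np.empty(int(4.0 / dt))
for k in range(dist.size):
    u = u + dt * rhs(u)
    dist[k] = np.linalg.norm(u - u_star)

# Empirical exponential decay rate between t = 1 and t = 3
rate = (np.log(dist[int(1.0 / dt)]) - np.log(dist[int(3.0 / dt)])) / 2.0
print(rate)          # at least alpha, as the theorem predicts
```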
Proof. Pick a small $\epsilon > 0$ such that $d_i > \alpha + \epsilon$ and the least eigenvalue $\lambda_{\min}$ of $\{P[(D - \beta I_n)G^{-1} - A]\}^s$ satisfies
$$\lambda_{\min} = \lambda_{\min}\{P[(D - \beta I_n)G^{-1} - A]\}^s > 0, \quad (2.4)$$
where $\beta = \alpha + \epsilon$. It is well known that under the conditions given in Theorem 1, the dynamical system 2.1 has an equilibrium point $u^* = [u_1^*, \ldots, u_n^*]^T$. Define $v_i(t) = u_i(t) - u_i^*$ and
$$g_j^*(s) = g_j(s + u_j^*) - g_j(u_j^*). \quad (2.5)$$
Then we have
$$\frac{dv_i(t)}{dt} = -d_i v_i(t) + \sum_j a_{ij}\,g_j^*(v_j(t)), \qquad i = 1, \ldots, n. \quad (2.6)$$
The new variables $v_i$ put the unique equilibrium at zero. Let $v(t) = [v_1(t), \ldots, v_n(t)]^T$, $g^*(t) = [g_1^*(v_1(t)), \ldots, g_n^*(v_n(t))]^T$, and define the Lyapunov function, which was used in Forti and Tesi (1995),
$$L(k, t) = \frac{1}{2}\sum_{i=1}^n |v_i(t)|^2 + k\sum_{i=1}^n P_i \int_0^{v_i(t)} g_i^*(\xi)\,d\xi. \quad (2.7)$$
Direct differentiation leads to
$$\dot{L}(k,t) = -v^T(t)Dv(t) + v^T(t)Ag^*(t) - k\{g^*(t)\}^T PDv(t) + k\{g^*(t)\}^T PA\{g^*(t)\} \quad (2.8)$$
$$= -\beta v^T(t)v(t) - v^T(t)(D-\beta I_n)v(t) + v^T(t)Ag^*(t) - k\beta\{g^*(t)\}^T Pv(t) - k\{g^*(t)\}^T P(D-\beta I_n)v(t) + k\{g^*(t)\}^T PA\{g^*(t)\} \quad (2.9)$$
$$\le -\beta v^T(t)v(t) - v^T(t)(D-\beta I_n)v(t) + v^T(t)Ag^*(t) - k\beta\{g^*(t)\}^T Pv(t) - k\{g^*(t)\}^T \{P[(D-\beta I_n)G^{-1} - A]\}^s \{g^*(t)\} \quad (2.10)$$
$$= -\left\{(D-\beta I_n)^{1/2}v(t) - \tfrac{1}{2}(D-\beta I_n)^{-1/2}Ag^*(t)\right\}^T \left\{(D-\beta I_n)^{1/2}v(t) - \tfrac{1}{2}(D-\beta I_n)^{-1/2}Ag^*(t)\right\} + \tfrac{1}{4}\{g^*(t)\}^T A^T(D-\beta I_n)^{-1}A\{g^*(t)\} - k\{g^*(t)\}^T \{P[(D-\beta I_n)G^{-1} - A]\}^s \{g^*(t)\} - \beta v^T(t)v(t) - k\beta\{g^*(t)\}^T Pv(t) \quad (2.11)$$
$$\le \tfrac{1}{4}\{g^*(t)\}^T A^T(D-\beta I_n)^{-1}A\{g^*(t)\} - k\{g^*(t)\}^T \{P[(D-\beta I_n)G^{-1} - A]\}^s \{g^*(t)\} - \beta v^T(t)v(t) - k\beta\{g^*(t)\}^T Pv(t) \quad (2.12)$$
$$= -\{g^*(t)\}^T \left\{ k\{P[(D-\beta I_n)G^{-1} - A]\}^s - \tfrac{1}{4}A^T(D-\beta I_n)^{-1}A \right\}\{g^*(t)\} - \beta v^T(t)v(t) - k\beta\{g^*(t)\}^T Pv(t). \quad (2.13)$$
Pick a sufficiently large positive $k$ such that the least eigenvalue

$\lambda_{\min} = \lambda_{\min}\left\{k\{P[(D-\beta I_n)G^{-1}-A]\}^s - \dfrac{1}{4}A^T(D-\beta I_n)^{-1}A\right\} > 0.$  (2.14)
Then we have

$\dot L(k,t) \le -\beta v^T(t)v(t) - k\beta\{g^*(t)\}^T Pv(t) - \lambda_{\min}\{g^*(t)\}^T\{g^*(t)\}.$  (2.15)
To obtain the desired convergence rate, we use the following equalities:

$g_i^*(\xi_i) = g_i'(u_i^*)\xi_i + o(|\xi_i|),$  (2.16)

$g_i^*(v_i(t))v_i(t) = g_i'(u_i^*)v_i^2(t) + o(|v_i(t)|^2),$  (2.17)

$2\int_0^{v_i(t)} g_i^*(\xi)\,d\xi = g_i'(u_i^*)v_i^2(t) + o(|v_i(t)|^2).$  (2.18)
Combining the previous equalities, we have

$g_i^*(v_i(t))v_i(t) = 2\int_0^{v_i(t)} g_i^*(\xi)\,d\xi + o(|v_i(t)|^2).$  (2.19)
Substituting into equation 2.15, we see that

$\dot L(k,t) \le -\beta v^T(t)v(t) - k\beta\{g^*(t)\}^T Pv(t) - \lambda_{\min}\{g^*(t)\}^T\{g^*(t)\}$
$\le -\beta\sum_{i=1}^n |v_i(t)|^2 - 2k\beta\sum_{i=1}^n P_i\int_0^{v_i(t)} g_i^*(\xi)\,d\xi - \lambda_{\min}\sum_{i=1}^n \{g_i'(u_i^*)\}^2 |v_i(t)|^2 + o\left(\sum_{i=1}^n |v_i(t)|^2\right)$
$\le -\alpha\sum_{i=1}^n |v_i(t)|^2 - 2k\alpha\sum_{i=1}^n P_i\int_0^{v_i(t)} g_i^*(\xi)\,d\xi = -2\alpha L(k,t)$  (2.20)
holds for sufficiently large $t$. Therefore,

$\dfrac{1}{2}\sum_{i=1}^n |v_i(t)|^2 \le L(k,t) \le L(k,0)e^{-2\alpha t}$  (2.21)

and

$v_i(t) = O(e^{-\alpha t}), \quad i = 1,\ldots,n.$  (2.22)
Theorem 1 is proved.

Remark 1. The condition that the derivatives $g_i'(x)$ exist, imposed on the activations $g_i(x)$ in Theorem 1, is a little stronger than in the existing literature. However, in practice it is not very restrictive, because a monotonically increasing function is differentiable almost everywhere.
Remark 2. The assumption that $g_i(x)$, $i = 1,\ldots,N$, is differentiable can be relaxed to the existence of the right and left derivatives of $g_i(x)$, $i = 1,\ldots,N$. This relaxation is significant; for example, monotonically increasing piecewise linear functions satisfy this condition.
Remark 3. Let $g'_{i\pm}(x)$ denote the right and left derivatives, respectively. Suppose that for any $\epsilon > 0$ there is $\delta > 0$ such that

$\left|\dfrac{g_i(x \pm h) - g_i(x)}{h} - g'_{i\pm}(x)\right| < \epsilon$  (2.23)
holds for all $i = 1,\ldots,N$ and all $x$ whenever $0 < h < \delta$. Then

$|u_i(t) - u_i^*| = O(e^{-\alpha t}), \quad i = 1,\ldots,n,$  (2.24)
holds uniformly for all $g(x) = (g_1(x),\ldots,g_n(x))^T \in G$.

Corollary 1. Let the activation functions satisfy the assumptions given in Theorem 1, and let $\xi_i > 0$, $i = 1,\ldots,N$, be some constants. Then under any one of the following three inequalities,

$d_j\xi_j - G_j\left\{\xi_j A_{jj} + \sum_{i=1,\,i\ne j}^N \xi_i |A_{ij}|\right\} > \alpha\xi_j, \quad j = 1,\ldots,N,$  (2.25)

$\xi_i[d_i - G_i A_{ii}] - \sum_{j=1,\,j\ne i}^N \xi_j G_j |A_{ij}| > \alpha\xi_i, \quad i = 1,\ldots,N,$  (2.26)

$\xi_i[d_i - G_i A_{ii}] - \dfrac{1}{2}\sum_{j=1,\,j\ne i}^N \{\xi_i G_i |A_{ij}| + \xi_j G_j |A_{ji}|\} > \alpha\xi_i, \quad i = 1,\ldots,N,$  (2.27)

there is a unique equilibrium $u^* = [u_1^*,\ldots,u_n^*]^T \in R^N$ such that for any solution $u(t)$ of the differential equation 2.1,

$u(t) - u^* = O(e^{-\alpha t})$ as $t \to \infty$.  (2.28)
In fact, under any one of the previous three conditions, the matrix $(D - \alpha I_n) - MG$ is an M-matrix (see Berman & Plemmons, 1979), where $M = (m_{ij})_{i,j=1}^N$ with entries

$m_{ii} = A_{ii}, \qquad m_{ij} = |A_{ij}|, \quad i \ne j.$  (2.29)

By a property of M-matrices, $(D - \alpha I_n)G^{-1} - M$ is also an M-matrix. Therefore, there exists $P = \mathrm{diag}[p_1,\ldots,p_n]$, with $p_i > 0$, $i = 1,\ldots,n$, such that

$p_i[d_i - \alpha]G_i^{-1} - \dfrac{1}{2}\sum_{j=1,\,j\ne i}^N \{p_i|A_{ij}| + p_j|A_{ji}|\} > 0, \quad i = 1,\ldots,N.$  (2.30)

By Gershgorin's theorem (see Golub & Van Loan, 1990), all the eigenvalues of the matrix $\{P[(D - \alpha I_n)G^{-1} - M]\}^s$ are positive. Therefore, $\{P[(D - \alpha I_n)G^{-1} - A]\}^s$ is positive definite, and the corollary is a direct consequence of Theorem 1.
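The last step of this argument is easy to illustrate numerically. The sketch below uses made-up matrices (not from the paper) and checks, via Gershgorin's theorem applied to the symmetric part, that a strict diagonal-dominance condition of the type 2.30 forces $\{P[(D-\alpha I_n)G^{-1}-A]\}^s$ to be positive definite.

```python
# Illustrative numbers only, not taken from the paper.
n = 3
d = [2.0, 2.5, 3.0]
G = [1.0, 1.0, 1.0]
A = [[0.3, -0.2, 0.1],
     [0.1, 0.2, -0.3],
     [-0.2, 0.1, 0.3]]
P = [1.0, 1.0, 1.0]
alpha = 1.0

# B = (D - alpha*I) G^{-1} - A, then S = {P B}^s = (P B + (P B)^T) / 2.
B = [[(d[i] - alpha) / G[i] * (1 if i == j else 0) - A[i][j] for j in range(n)]
     for i in range(n)]
PB = [[P[i] * B[i][j] for j in range(n)] for i in range(n)]
S = [[(PB[i][j] + PB[j][i]) / 2 for j in range(n)] for i in range(n)]

# Gershgorin: if every disc centered at S[i][i] with radius sum_{j != i} |S[i][j]|
# lies strictly in the positive half-line, the symmetric matrix S is positive
# definite.
pos_def = all(S[i][i] - sum(abs(S[i][j]) for j in range(n) if j != i) > 0
              for i in range(n))
print(pos_def)  # prints True
```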
Remark 4. The results in Corollary 1 generalize those given in Chen and Amari (2001a). Moreover, they give a sharper convergence rate.
If the activation functions are sigmoidal, $g_i(x) = \tanh(G_i x)$, we have the following stronger result, and the proof is much simpler.

Theorem 2. Let $A = (a_{ij})_{i,j=1}^n$ and $P = \mathrm{diag}[p_1,\ldots,p_n]$, where $p_i > 0$, $i = 1,\ldots,n$, and let $g_i(x) = \tanh(G_i x)$. If there is a positive constant $\alpha > 0$ such that $\alpha \le d_i$ for $i = 1,\ldots,N$ and the least eigenvalue $\lambda_{\min} = \lambda_{\min}\{P[(D - \alpha I_n)G^{-1} - A]\}^s > 0$, then the dynamical system 2.1 has a unique equilibrium point $u^*$ such that

$|u_i(t) - u_i^*| = O(e^{-(\alpha + \delta - \epsilon)t}), \quad i = 1,\ldots,n,$  (2.31)
where

$\delta = \lambda_{\min}\min_{i=1,\ldots,n}\left\{\dfrac{g_i'(u_i^*)}{P_i}\right\},$  (2.32)

and $\epsilon > 0$ is any fixed small number.

Proof. Using the same notation as in the proof of Theorem 1, define
$L_1(t) = \sum_{i=1}^n P_i \int_0^{v_i(t)} g_i^*(\xi)\,d\xi$  (2.33)

and

$L_2(t) = \dfrac{1}{2}\sum_{i=1}^n |v_i(t)|^2.$  (2.34)
Under the assumptions, $v_i(t)$, $i = 1,\ldots,n$, are bounded. Therefore, there is a constant $C$ such that $G_i \ge g_i'(u_i(t)) > C$ for all $t > 0$ and $i = 1,\ldots,n$. Thus,

$\dfrac{1}{2}C|v_i(t)|^2 \le \int_0^{v_i(t)} g_i^*(\xi)\,d\xi \le \dfrac{1}{2}G_i|v_i(t)|^2.$  (2.35)

Then we can find two positive constants $C_1$ and $C_2$ such that

$C_1 L_1(t) \le L_2(t) \le C_2 L_1(t).$  (2.36)
Using equation 2.19 and direct differentiation leads to

$\dot L_1(t) = -\{g^*(t)\}^T PDv(t) + \{g^*(t)\}^T PA\{g^*(t)\} = -\alpha\{g^*(t)\}^T Pv(t) - \{g^*(t)\}^T P(D - \alpha I_n)v(t) + \{g^*(t)\}^T PA\{g^*(t)\}$  (2.37)
$\le -\alpha\{g^*(t)\}^T Pv(t) - \{g^*(t)\}^T \{P[(D - \alpha I_n)G^{-1} - A]\}^s g^*(t)$  (2.38)

$\le -\alpha\{g^*(t)\}^T Pv(t) - \lambda_{\min}\{g^*(t)\}^T\{g^*(t)\}$  (2.39)
$\le -2\alpha\sum_{i=1}^n P_i\int_0^{v_i(t)} g_i^*(\xi)\,d\xi - \lambda_{\min}\sum_{i=1}^n \{g_i'(u_i^*)\}^2 |v_i(t)|^2 + o\left(\sum_{i=1}^n |v_i(t)|^2\right)$  (2.40)

$\le -2\alpha\sum_{i=1}^n P_i\int_0^{v_i(t)} g_i^*(\xi)\,d\xi - 2\lambda_{\min}\left(\sum_{i=1}^n g_i'(u_i^*)\int_0^{v_i(t)} g_i^*(\xi)\,d\xi\right)(1 + o(1))$  (2.41)

$\le -2\left\{\alpha + \lambda_{\min}\min_{i=1,\ldots,n}\dfrac{g_i'(u_i^*)}{P_i}\right\}L_1(t)(1 + o(1))$  (2.42)

$\le -2(\alpha + \delta - \epsilon)L_1(t)$  (2.43)
holds for sufficiently large $t$. Therefore,

$\dfrac{1}{2}\sum_{i=1}^n |v_i(t)|^2 = L_2(t) \le C_2 L_1(t) \le C_2 L_1(0)e^{-2(\alpha+\delta-\epsilon)t}$  (2.44)

and

$v_i(t) = O(e^{-(\alpha+\delta-\epsilon)t}), \quad i = 1,\ldots,n,$  (2.45)
which proves Theorem 2.

Remark 5. Under the assumption that the activation functions $g_i$ satisfy $g \in G^N$ and $g_i'(x) > C$, $i = 1,\ldots,n$, on any finite interval for some constant $C$, the conclusions of Theorem 2 remain valid.
Remark 6. Liang and Si (2001) claim that the lower bound $\min_i d_i$ is attainable for a special degenerate (linear) system. Theorem 2 shows that for a wide class of activation functions, the convergence rate of the nonlinear system, equation 2.1, can be even better than the lower bound $\min_i d_i$ given in Liang and Si (2001).
3 Comparisons

In this section, we compare our results with the other work that most closely relates to ours. In that work, the activation functions are assumed to satisfy $0 \le \dfrac{g_i(x+h) - g_i(x)}{h} \le G_i$.
Zhang et al. (1999) estimated the convergence rate. Their main result is as follows: If there exists $0 < \alpha < \min_{1\le i\le n} d_i$ such that

$\lambda_{\max}\{-[(D - \alpha I_n)G^{-1} - A]\}^s \le 0,$  (3.1)

then dynamical system 2.1 has a unique equilibrium point $u^*$ such that

$|u_i(t) - u_i^*| = O(e^{-\frac{\alpha}{2}t}), \quad i = 1,\ldots,n.$  (3.2)
Instead, in Theorem 2, the condition $0 < \alpha < \min_{1\le i\le n} d_i$ can be relaxed to $\alpha \le d_i$, and the convergence rate is

$|u_i(t) - u_i^*| = O(e^{-(\alpha+\delta-\epsilon)t}), \quad i = 1,\ldots,n.$  (3.3)
Liang and Wu (1999) and Liang and Wang (2000) addressed absolute exponential stability under various assumptions. In most of these articles, the convergence rate is $O(e^{-\alpha t/2})$. Recently, Liang and Si (2001) discussed the lower bound of the convergence rate and showed that in some cases (when all $g_j(x) = 0$), the lower bound $\alpha = \min_i d_i$ is attainable. However, in this case the system is degenerate and of little interest. Theorem 2 shows that if the activation functions are sigmoidal, then under the condition

$\lambda_{\min}\{[(D - \min_i d_i\, I_n)G^{-1} - A]\}^s > 0,$  (3.4)

the convergence rate is

$|u_i(t) - u_i^*| = O(e^{-(\min_i d_i + \delta - \epsilon)t}), \quad i = 1,\ldots,n,$  (3.5)

which is even better than $O(e^{-\{\min_i d_i\}t})$.

4 Conclusion
We investigated the absolute exponential stability of a class of recurrently connected neural networks. The convergence rate given here is much sharper than those in the existing literature.

Acknowledgments
This project is supported by the National Science Foundation of China 69982003 and 60074005. This work was completed when T. Chen visited the RIKEN Brain Science Institute in Japan.
References

Amari, S. (1971). Characteristics of randomly connected threshold element networks and neural systems. Proc. IEEE, 59, 35–47.
Amari, S. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. SMC-2, 643–657.
Berman, A., & Plemmons, R. J. (1979). Nonnegative matrices in the mathematical sciences. New York: Academic Press.
Chen, T., & Amari, S. (2001a). Stability of asymmetric Hopfield networks. IEEE Transactions on Neural Networks, 12(1), 159–163.
Chen, T., & Amari, S. (2001b). New theorems on global convergence of some dynamical systems. Neural Networks, 14(3), 251–255.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. SMC-13, 815–826.
Fang, Y., & Kincaid, T. G. (1996). Stability analysis of dynamical neural networks. IEEE Trans. Neural Networks, 7(4), 996–1006.
Forti, M., Manetti, S., & Marini, M. (1994). Necessary and sufficient conditions for absolute stability of neural networks. IEEE Trans. Circuits Syst.-I: Fundamental Theory Appl., 41(7), 491–494.
Forti, M., & Tesi, A. (1995). New conditions for global stability of neural networks with application to linear and quadratic programming problems. IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, 42(7), 354–366.
Golub, G. H., & Van Loan, C. F. (1990). Matrix computations (2nd ed.). Baltimore, MD: Johns Hopkins University Press.
Hirsch, M. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Nat. Acad. Sci., 81, 3088–3092.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.
Kaszkurewicz, E., & Bhaya, A. (1994). On a class of globally stable neural circuits. IEEE Trans. Circuits Syst.-I: Fundamental Theory Appl., 41(2), 171–174.
Kelly, D. G. (1990). Stability in contractive nonlinear neural networks. IEEE Trans. Biomed. Eng., 37(3), 231–242.
Liang, X., & Si, J. (2001). Global exponential stability of neural networks with globally Lipschitz continuous activations and its application to linear variational inequality problem. IEEE Transactions on Neural Networks, 12(2), 349–359.
Liang, X., & Wang, J. (2000). Absolute exponential stability of neural networks with a general class of activation functions. IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, 47(8), 1258–1263.
Liang, X., & Wu, L. (1999). Global exponential stability of a class of neural circuits. IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, 46(6), 748–751.
Matsuoka, K. (1992). Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, 5, 495–500.
Sompolinsky, H., Crisanti, A., & Sommers, H. J. (1988). Chaos in random neural networks. Physical Review Letters, 61(3), 259–262.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical J., 12, 1–24.
Yang, H., & Dillon, T. (1994). Exponential stability and oscillation of Hopfield graded response neural network. IEEE Trans. Neural Networks, 5(5), 719–729.
Zhang, Y., Heng, P. A., & Fu, A. W. C. (1999). Estimate of exponential convergence rate and exponential stability for neural networks. IEEE Trans. Neural Networks, 10(6), 1487–1493.
Received September 27, 2001; accepted May 24, 2002.
LETTER
Communicated by Shun-ichi Amari
Locality of Global Stochastic Interaction in Directed Acyclic Networks

Nihat Ay
[email protected]
Max-Planck-Institute for Mathematics in the Sciences, 04103 Leipzig, Germany

The hypothesis of invariant maximization of interaction (IMI) is formulated within the setting of random fields. According to this hypothesis, learning processes maximize the stochastic interaction of the neurons subject to constraints. We consider the extrinsic constraint in terms of a fixed input distribution on the periphery of the network. Our main intrinsic constraint is given by a directed acyclic network structure. First mathematical results about the strong relation of the local information flow and the global interaction are stated in order to investigate the possibility of controlling IMI optimization in a completely local way. Furthermore, we discuss some relations of this approach to optimization according to Linsker's Infomax principle.

1 Introduction
This article is based on the hypothesis that neural networks are realizations of complex systems in the sense that their elementary units (neurons) interact strongly with each other, subject to constraints. The following challenging statement by Chaitin (2001, p. 97) encourages speculation on the fundamental subject of complexity from a very general point of view: "Many years ago I used mutual information in an attempt to formulate a general mathematical definition of life, to distinguish organized living matter from ordinary matter. The idea was that the parts of a living organism are highly correlated, highly interdependent, and have high mutual information. After all, a primordial quality of living beings is their tremendous complexity and interdependentness. I never got very far with this theory. Can you do better?" Within the field of neural networks, several complexity measures related to the one by Chaitin (1979) have been proposed. Tononi, Sporns, and Edelman (1994) introduced a measure for brain complexity that is intended to describe segregation and integration in the brain in terms of mutual information with respect to all bipartitions. Jost (2000/2001) generalized it to the case of arbitrary partitions, where higher-order interactions among more than two subsystems are included. In the approach presented here, we formalize the preceding hypothesis by using a notion of complexity based on the stochastic interaction among the elementary subsystems, namely, the

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 2959–2980 (2002)
2960
Nihat Ay
neurons. It quantifies the stochastic interaction in a neural system with respect to a joint probability distribution $p$ in terms of the Kullback-Leibler "distance" (Kullback & Leibler, 1951) of $p$ from the set of all factorizable distributions. Amari (2001) discussed this measure from the information-geometric point of view. He used the Pythagorean theorem (Amari, 1985; Amari & Nagaoka, 2000) to present a hierarchical decomposition of global stochastic interaction into lower-order interactions. In our previous work (Ay, 2002), we studied probability distributions that maximize the global interaction without any constraints. This article continues this work and applies it to the field of neural networks. Experimental evidence for the optimization of information flow in neural systems is mainly provided on the local level, in terms of the mutual information between a neuron and its neighbors. Early experimental results by Laughlin (1981) concerning the response characteristics of large monopolar cells (LMC) in the visual system of the fly suggest that local information transmission is maximized in a neural system. A detailed discussion of this topic is presented by Rieke, Warland, Ruyter van Steveninck, and Bialek (1998). Linsker (1986, 1988) pointed out that local optimization can be achieved by Hebb-like learning. He convincingly demonstrated by computer simulations that this kind of local learning leads to the emergence of receptive fields in feedforward networks that are surprisingly similar to those observed in the visual pathway of the cat and the monkey (Hubel & Wiesel, 1962, 1968). Motivated by the simulation results, he formulated the principle of maximum information preservation (Infomax) for unsupervised or self-organized learning in the field of neural networks.
In this article, we investigate the possibility of describing the optimization of local information transmission in directed acyclic networks in terms of the global interaction complexity (a generalization of this approach to recurrent networks is in progress). Such a local-global compatibility would provide an attractive way to formulate an invariant first principle for information processing in neural systems, based on our main hypothesis of strong global interaction. The article is organized as follows. Section 2 contains a brief introduction to the setting of random fields on directed acyclic graphs. Section 3.1 formalizes the hypothesis of strong interaction and applies it to the situation of our previous work (Ay, 2002). Section 3.2 discusses the relation of our hypothesis to Linsker's Infomax. Section 3.3 presents some locality properties of global interaction in directed acyclic networks. Section 4.1 continues the discussion of Linsker's work in view of the results of our article. We conclude with some problems and comments in section 4.2. The appendix contains the proofs of the mathematical statements.

2 Preliminaries: Random Fields on Directed Acyclic Graphs
We use the framework of random fields on directed acyclic networks, which is recalled in what follows (see Lauritzen, 1996; Cowell, Dawid, Lauritzen, & Spiegelhalter, 1999).
Locality of Global Stochastic Interaction
2961
2.1 Random Fields. The set of all probability distributions on a nonempty finite set $V$ is denoted by $\bar{\mathcal{P}}(V) \subset \mathbb{R}^V$. The support of a probability distribution $p \in \bar{\mathcal{P}}(V)$ is defined as $\mathrm{supp}\,p := \{v \in V : p(v) > 0\}$. The strictly positive distributions $\mathcal{P}(V)$ have the maximal support $V$. Let $V$ be a finite set of neurons. To each neuron $v \in V$, we assign a finite set $V_v$ of states. For a subset $A \subset V$, the set of all configurations on $A$ is given by the product $V_A := \times_{v\in A} V_v$. If $A = \emptyset$, then $V_A$ consists of exactly one element: the empty configuration. The elements of $\bar{\mathcal{P}}(V_A)$ are the random fields on $A$. One has the natural restriction $X_A : V_V \to V_A$, $(\omega_v)_{v\in V} \mapsto \omega_A := (\omega_v)_{v\in A}$, which induces the projection $\bar{\mathcal{P}}(V_V) \to \bar{\mathcal{P}}(V_A)$, $p \mapsto p_A$, where $p_A$ denotes the image measure of $p$ under the variable $X_A$.

2.2 Graphical Structure. In this article, we investigate random fields that are compatible with directed acyclic graphs. Such graphs are considered to be a general model for feedforward neural networks. The set $E$ of edges (synapses) is a subset of the set $V \times V$ of ordered pairs of neurons. The graph $N := (V, E)$ is called directed and acyclic iff for all sequences $v_0,\ldots,v_n \in V$ with $(v_{k-1}, v_k) \in E$, $k = 1,\ldots,n$, the first neuron $v_0$ is different from the last neuron $v_n$. With each $v \in V$, we associate the set
$\mathrm{pa}(v) := \{w \in V : (w, v) \in E\}$ of parents of $v$. We define the periphery of the graph $N$ as $\mathrm{per}(N) := \{v \in V : \mathrm{pa}(v) = \emptyset\}$. It is nonempty if $N$ is directed and acyclic. The complement $V\setminus \mathrm{per}(N)$ of the periphery is denoted by $\mathrm{int}(N)$.

2.3 Combination of Random Fields and Graphs. Now we introduce random fields on $V$ that are compatible with the structure of a given directed and acyclic graph. First, we assume that there is a probability distribution $p_\partial \in \bar{\mathcal{P}}(V_{\mathrm{per}(N)})$ on the network's periphery that is independent of intrinsic properties of the system. Furthermore, we assume that for each neuron $v \in \mathrm{int}(N)$, there is a local kernel function
$K_v : V_{\mathrm{pa}(v)} \times V_v \to [0,1], \quad (\omega, \omega') \mapsto K_v(\omega' \mid \omega),$

with

$\sum_{\omega' \in V_v} K_v(\omega' \mid \omega) = 1 \quad \text{for all } \omega \in V_{\mathrm{pa}(v)}.$
The distribution $p_\partial$ and the family $(K_v)_{v\in\mathrm{int}(N)}$ of kernels define the composed distribution

$p(\omega) := p_\partial(\omega_{\mathrm{per}(N)}) \prod_{v\in\mathrm{int}(N)} K_v(\omega_v \mid \omega_{\mathrm{pa}(v)}), \quad \omega \in V_V.$  (2.1)
We denote it by $p_\partial \otimes (\otimes_{v\in\mathrm{int}(N)} K_v)$ and say that it is $(N, p_\partial)$-adapted. Let $\bar{\mathcal{K}}_v$ be the set of local kernel functions for the neuron $v$, and let $\mathcal{K}_v$ be its subset of strictly positive kernel functions. Then we have the composition map

$\bar{\mathcal{P}}(V_\partial) \times \prod_{v\in\mathrm{int}(N)} \bar{\mathcal{K}}_v \hookrightarrow \bar{\mathcal{P}}(V_V), \quad (p_\partial;\, K_v,\, v\in\mathrm{int}(N)) \mapsto p_\partial \otimes (\otimes_{v\in\mathrm{int}(N)} K_v).$

The image of this map is denoted by $\mathcal{C}(N, p_\partial)$. The elements of $\mathcal{C}(N, p_\partial)$ are exactly the $(N, p_\partial)$-adapted probability distributions.

Remark. In this article, we use the term local with two different meanings, which can be clearly distinguished from the context. The first meaning refers to the definition of a local maximizer of a function. The second use describes what is usually meant by local information flow or local learning in a neural network and therefore depends on the underlying network structure.
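As a concrete illustration, the following sketch builds an $(N, p_\partial)$-adapted distribution (equation 2.1) for a hypothetical three-neuron binary network: periphery $\{0, 1\}$ and one internal neuron 2 with $\mathrm{pa}(2) = \{0, 1\}$. The peripheral distribution and the noisy-AND kernel probabilities are arbitrary illustrative choices, not taken from the paper.

```python
import itertools

# Peripheral distribution p@ on the states of neurons 0 and 1 (illustrative).
p_periph = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def kernel(v2, parents):
    # Local kernel K_2(v2 | parents): a noisy AND of the two parent states.
    p_one = 0.9 if parents == (1, 1) else 0.1
    return p_one if v2 == 1 else 1.0 - p_one

# Composed distribution, equation 2.1:
# p(omega) = p@(omega_per) * K_2(omega_2 | omega_pa(2)).
p = {(a, b, c): p_periph[(a, b)] * kernel(c, (a, b))
     for a, b, c in itertools.product((0, 1), repeat=3)}

print(abs(sum(p.values()) - 1.0) < 1e-12)  # prints True: a valid joint
```

Because each kernel is normalized, the composed table is automatically a probability distribution, which is exactly what the composition map expresses.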
3 Invariant Maximization of Interaction

3.1 Formalization of the Hypothesis. This article is based on the hypothesis that neural networks are realizations of systems with strongly interacting units. We now formalize this approach using well-known information-theoretic quantities. For two disjoint subsystems $A, B \subset V$ and a probability distribution $p \in \bar{\mathcal{P}}(V_V)$, we define the entropy on $A$ by

$H_p(A) := -\sum_{\omega\in V_A} p(X_A = \omega)\,\ln p(X_A = \omega)$

and the conditional entropy on $B$ given $A$ by

$H_p(B \mid A) := -\sum_{\omega\in V_A,\,\omega'\in V_B} p(X_A = \omega, X_B = \omega')\,\ln p(X_B = \omega' \mid X_A = \omega).$

The mutual information, or transinformation, of $A$ and $B$ is defined as

$I_p(A; B) := H_p(A) + H_p(B) - H_p(A \cup B).$

The (stochastic) interaction of the neurons in $A$ is defined in a similar way:

$I_p^A := \sum_{v\in A} H_p(v) - H_p(A).$
We have

$I_p^{A\cup B} = I_p^A + I_p^B + I_p(A; B).$  (3.1)
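The entropy, transinformation, and interaction defined above are straightforward to compute for small systems, and decomposition 3.1 can be verified directly. The sketch below uses an arbitrary strictly positive joint distribution on three binary neurons; all weights are illustrative choices.

```python
import itertools
import math

# An arbitrary strictly positive joint distribution p on three binary neurons.
states = list(itertools.product((0, 1), repeat=3))
w = [5, 1, 2, 4, 3, 2, 1, 6]                  # unnormalized weights
Z = sum(w)
p = {s: wi / Z for s, wi in zip(states, w)}

def marginal(sub):
    # Marginal distribution of the neurons listed in `sub`.
    m = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in sub)
        m[key] = m.get(key, 0.0) + ps
    return m

def H(sub):
    # Entropy H_p(A) in nats.
    return -sum(ps * math.log(ps) for ps in marginal(sub).values() if ps > 0)

def interaction(sub):
    # Stochastic interaction I_p^A = sum_v H_p(v) - H_p(A).
    return sum(H([v]) for v in sub) - H(sub)

def mutual(A, B):
    # Transinformation I_p(A; B) = H_p(A) + H_p(B) - H_p(A u B).
    return H(A) + H(B) - H(A + B)

# Check equation 3.1 with A = {0, 1} and B = {2}.
lhs = interaction([0, 1, 2])
rhs = interaction([0, 1]) + interaction([2]) + mutual([0, 1], [2])
print(abs(lhs - rhs) < 1e-12)  # prints True
```

The identity holds exactly for every joint distribution, since both sides reduce to $\sum_v H_p(v) - H_p(A\cup B)$.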
Without specifying the probability distribution $p$, we use the same notation for the corresponding functions $\bar{\mathcal{P}}(V_V) \to \mathbb{R}_+$. For example, $H(A)$ denotes the function $p \mapsto H_p(A)$. With these definitions in place, we assume that learning in neural systems is driven by mechanisms that maximize the stochastic interaction $I^V$ subject to constraints. The formulation of this hypothesis does not require any specification of the underlying constraints. For this reason, we call it the hypothesis of invariant maximization of interaction (IMI). The completely unconstrained maximization of $I^V$ leads to a strong reduction of the number of possible configurations in the system. Ay (2002) studied this reduction and proved the following proposition:

Proposition 1. Let $p$ be a local maximizer of $I^V$. Then the following bound on the support of $p$ holds:

$|\mathrm{supp}\,p| \le 1 + \sum_{v\in V} (|V_v| - 1).$  (3.2)
Thus, for binary neurons (i.e., $|V_v| = 2$ for all $v \in V$), the number of configurations is reduced from $2^{|V|}$ to at most $|V| + 1$. This is illustrated by the following example, obtained from computer simulations.

Example 1. Each Venn diagram in Figures 1 and 2 represents a probability distribution on the set of all atoms that are generated by the corresponding events. The gray value of an atom is proportional to the probability of the atom; "white" corresponds to the maximal probability in a given Venn diagram. The diagrams on the left-hand side are initial distributions that induce those on the right-hand side by the unconstrained natural gradient flow (Amari, 1998) that optimizes the stochastic interaction.

Here, we investigate the maximization of interaction with respect to extrinsic and intrinsic constraints. The extrinsic constraint is modeled by a fixed input distribution on the periphery of the system. It restricts the optimization to a convex subset of the set of probability distributions, which is discussed in section 3.2. The specification of the internal structure of the system restricts the optimization to a more complicated manifold, which describes the intrinsic constraints. Here we have to consider two aspects: specification of the network structure in terms of a graph, and specification of a system parameterization that is compatible with the network structure. Intrinsic constraints are investigated in section 3.3.
3.2 The Extrinsic Constraint and Linsker's Infomax. Let $\partial$ be a subset of the set $V$ of neurons, called the periphery of $V$, and let $p_\partial$ be a probability distribution on $V_\partial$. By $C(p_\partial)$, we denote the convex set of probability distributions $q$ on $V_V$ that satisfy

$q(X_\partial = \omega) = p_\partial(X_\partial = \omega) \quad \text{for all } \omega \in V_\partial.$

Figure 1: Computer simulations of the interaction maximization in the case of two binary neurons.
Using equation 3.1, we can split the global stochastic interaction in the following way:

$I^V = I^\partial + I(\partial; V\setminus\partial) + I^{V\setminus\partial}.$  (3.3)

Note that the first term on the right-hand side of equation 3.3 depends only on the peripheral distribution $p_\partial$. Therefore, it is constant on the set $C(p_\partial)$ of distributions with $\partial$-marginal $p_\partial$:

$I_p^V = I_p(\partial; V\setminus\partial) + I_p^{V\setminus\partial} + \mathrm{const}, \quad p \in C(p_\partial).$  (3.4)
Thus, the $C(p_\partial)$-constrained optimization of interaction can be thought of as a combination of two optimizations. The first one corresponds to

$I_p(\partial; V\setminus\partial) = H_p(V\setminus\partial) - H_p(V\setminus\partial \mid \partial)$  (3.5)

and is closely related to Linsker's Infomax principle (Linsker, 1988), which states that learning processes in layered networks maximize the transinformation between the input and the output (we discuss Linsker's work in section 4.1). In many models, this principle leads to a strong decorrelation of the output neurons (Bell & Sejnowski, 1995; Obradovic & Deco, 1998). Therefore, it is considered to be consistent with the concept of redundancy reduction by Attneave (1954) and Barlow (1989). On the other hand,
Figure 2: Computer simulations of the interaction maximization in the case of three binary neurons.
the internal integration of information is based on functional dependences and correlations, so that the optimization according to the Infomax principle should be controlled by the maximization of the redundancy of the output neurons (this has also been proposed by Barlow, 2001). Combining the Infomax term $I_p(\partial; V\setminus\partial)$ with the redundancy term $I_p^{V\setminus\partial}$, we end up with the optimization of the right-hand side of equation 3.4, which is equivalent on $C(p_\partial)$ to the maximization of interaction.
So far, we have compared the IMI hypothesis with Linsker's Infomax principle from the conceptual point of view. There is another way to compare these two approaches. To do so, we use equation 3.4 to rewrite the global interaction on $C(p_\partial)$:

$I_p^V = \left(H_p(V\setminus\partial) - H_p(V\setminus\partial \mid \partial)\right) + \left(\sum_{v\in V\setminus\partial} H_p(v) - H_p(V\setminus\partial)\right) + \mathrm{const} = \sum_{v\in V\setminus\partial} H_p(v) - H_p(V\setminus\partial \mid \partial) + \mathrm{const}.$  (3.6)
Comparing equation 3.6 with 3.5, we observe that the main difference between the global interaction and the transinformation on $C(p_\partial)$ is that the output entropy $H_p(V\setminus\partial)$ in equation 3.5 is replaced by the sum $\sum_{v\in V\setminus\partial} H_p(v)$ of the individual "local" output entropies. Under certain assumptions, the conditional entropy $H_p(V\setminus\partial \mid \partial)$ in equation 3.6 also localizes and provides a way to define local learning rules for the optimization of interaction. We study this locality property in section 3.3. Furthermore, we observe that both the Infomax and the IMI optimizations have a tendency to produce input-output relations with low conditional entropy. The following generalization of Proposition 1 gives an estimate of the conditional entropy with respect to a distribution that maximizes the interaction on $C(p_\partial)$.

Proposition 2. Let $\partial$ be a subset of $V$, and let $p_\partial$ be a probability distribution on $V_\partial$. If $p \in \bar{\mathcal{P}}(V_V)$ is a local maximizer of $I^V$ in the convex set $C(p_\partial)$ of distributions with $\partial$-marginal $p_\partial$, then

$0 \le |\mathrm{supp}\,p| - |\mathrm{supp}\,p_\partial| \le \sum_{v\in V\setminus\partial} (|V_v| - 1).$  (3.7)
In particular, for binary neurons, inequality 3.7 implies $0 \le |\mathrm{supp}\,p| - |\mathrm{supp}\,p_\partial| \le |V\setminus\partial|$. Note that we recover the estimate 3.2 if we set $\partial := \emptyset$ in equation 3.7.

Corollary 1. In the situation of Proposition 2, for all $\omega \in \mathrm{supp}\,p_\partial$,

$|\mathrm{supp}\,p(\cdot \mid X_\partial = \omega)| \le 1 + \sum_{v\in V\setminus\partial} (|V_v| - 1).$
Corollary 2. In the situation of Proposition 2, the conditional entropy of the internal state given the external state satisfies

$H_p(V\setminus\partial \mid \partial) \le \ln\left(1 + \sum_{v\in V\setminus\partial} (|V_v| - 1)\right).$
From Corollaries 1 and 2, we observe that the optimization of interaction with a fixed input distribution leads to a nearly functional dependency between the input and the output. Consider, for example, the binary case. Given an input configuration on the periphery $\partial$, there are in general $2^{|V\setminus\partial|}$ possible outputs. According to Corollary 1, this maximal number is reduced to at most $1 + |V\setminus\partial|$ configurations if we have a distribution that maximizes the interaction of the neurons in the set $C(p_\partial)$. Thus, we have an exponential reduction of the possible outputs. As stated in Corollary 2, the corresponding conditional entropy, which quantifies the uncertainty about the output given the input, is bounded from above by $\ln(1 + |V\setminus\partial|)$.

3.3 The Intrinsic Constraints and the Locality of Learning. To motivate the main idea of this section, we consider the example where all neurons except one, $v_0 \in V$, are elements of the periphery: $\partial = V\setminus v_0$. In this case, formula 3.3 becomes
$I^V = I^\partial + I(v_0; V\setminus v_0).$  (3.8)

Furthermore, consider a peripheral distribution $p_\partial$ and the directed graph $N = (V, E)$, where $E$ denotes the set of edges that go from the periphery to the neuron $v_0$, that is, $E = \partial \times \{v_0\}$. Then equation 3.8 implies that all $(N, p_\partial)$-adapted probability distributions $p$ (see section 2.3) satisfy

$I_p^V = I_{p_\partial}^\partial + I_p(v_0; \mathrm{pa}(v_0)).$  (3.9)
This formula states that the optimization of the global interaction $I^V$ in the set of $(N, p_\partial)$-adapted distributions is equivalent to the optimization of the transinformation between the neuron $v_0$ and its parents $\mathrm{pa}(v_0)$, and therefore also equivalent to the optimization according to Linsker's Infomax principle. In the field of neural networks, the hypothesis that neurons optimize information about their local environment is the subject of many theoretical and experimental investigations (Laughlin, 1981; Rieke et al., 1998). On the other hand, it is important to translate this hypothesis of local information maximization into an optimization principle that corresponds to a globally defined complexity measure (Tononi et al., 1994). In this article, we prove some locality properties of the IMI optimization for directed acyclic networks. The general theory, which also describes recurrent networks and
establishes the equivalence of the maximization of local interactions to the maximization of global complexity, will be presented in another article. Now we extend the representation 3.9 to more than one internal neuron.

Proposition 3. Let $N = (V, E)$ be a directed and acyclic graph with periphery $\partial = \mathrm{per}(N)$, and let $p_\partial$ be a probability distribution on $V_\partial$. If a probability distribution $p$ is $(N, p_\partial)$-adapted in the sense of section 2.3, then

$I_p^V = I_{p_\partial}^\partial + \sum_{v\in\mathrm{int}(N)} I_p(v; \mathrm{pa}(v)).$
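Proposition 3 can be verified numerically on a small adapted distribution. The sketch below uses a hypothetical noisy-AND network (periphery $\{0, 1\}$ with a correlated input distribution, one internal neuron 2 with $\mathrm{pa}(2) = \{0, 1\}$); all numbers are illustrative. For the single internal neuron, the sum of local terms reduces to the transinformation $I_p(2; \mathrm{pa}(2))$, and the identity holds exactly.

```python
import itertools
import math

# (N, p@)-adapted distribution: p(omega) = p@(omega_0, omega_1) * K_2(omega_2 | .)
p_periph = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def kernel(v2, parents):
    p_one = 0.9 if parents == (1, 1) else 0.1   # noisy AND (illustrative)
    return p_one if v2 == 1 else 1.0 - p_one

p = {(a, b, c): p_periph[(a, b)] * kernel(c, (a, b))
     for a, b, c in itertools.product((0, 1), repeat=3)}

def H(sub):
    m = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in sub)
        m[key] = m.get(key, 0.0) + ps
    return -sum(q * math.log(q) for q in m.values() if q > 0)

def interaction(sub):
    return sum(H([v]) for v in sub) - H(sub)

def mutual(A, B):
    return H(A) + H(B) - H(A + B)

# Proposition 3: I_p^V = I_{p@}^@ + sum over internal neurons of I_p(v; pa(v)).
lhs = interaction([0, 1, 2])
rhs = interaction([0, 1]) + mutual([2], [0, 1])
print(abs(lhs - rhs) < 1e-12)  # prints True
```

The adaptedness of $p$ is what makes the decomposition hold: the only dependence of neuron 2 on the rest of the network passes through its parents.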
Thus, the global stochastic interaction in a directed acyclic network $N$ can be expressed, up to a constant, as the sum of local interactions of the neurons with their neighbors. In particular, this is the case in a multilayer network without any lateral connections within the individual layers. Therefore, the optimization of the local interactions in a directed acyclic graph is sufficient for the optimization of the global interaction. This indicates the possibility of controlling learning processes in terms of local interactions. This local-global connection can be easily investigated in networks without any hidden part, which is the subject of what follows. Networks of this type are called simple. More precisely, a directed acyclic network $N = (V, E)$ is simple if $\mathrm{pa}(v) \subset \mathrm{per}(N)$ holds for all $v \in V$. A simple network consists of the two layers $\mathrm{per}(N)$ and $\mathrm{int}(N) = V\setminus\mathrm{per}(N)$, where all connections go from $\mathrm{per}(N)$ to $\mathrm{int}(N)$: $E \subset \mathrm{per}(N) \times \mathrm{int}(N)$. In particular, there are no lateral connections inside each layer. In simple networks, the following holds:

Proposition 4. Let $N = (V, E)$ be a simple network, let $p_\partial$ be a probability distribution on $\mathrm{per}(N)$, and let $p$ be an $(N, p_\partial)$-adapted distribution, that is, $p \in \mathcal{C}(N, p_\partial)$. Then $p$ is a local maximizer of $I^V$ in $\mathcal{C}(N, p_\partial)$ if and only if $p$ is a local maximizer of $I(v; \mathrm{pa}(v))$ in $\mathcal{C}(N, p_\partial)$ for all $v \in \mathrm{int}(N)$. This statement remains true if "local maximizer" is replaced by "isolated local maximizer."
Now we describe intrinsic constraints that are given not only by the network structure but also by the specification of a model that is compatible with this structure (see the comment on the specifications of intrinsic constraints at the end of section 3.1). Therefore, we consider a simple network $N = (V, E)$ and choose an arbitrary numbering $\mathrm{int}(N) = \{v_1, \ldots, v_m\}$. Furthermore, we assume that there is a probability distribution $p_\partial$ on the periphery $\partial = \mathrm{per}(N)$ of $N$ and a family of parameterizations (embeddings) $\kappa_r : H_r \to K^{v_r}$, where $H_r$ is an open subset of $\mathbb{R}^{d_r}$. The local parameters $\theta^{(r,i)}$, $i = 1, \ldots, d_r$, determine the local kernel of the neuron $v_r$. For each $\theta^{(r)} \in H_r$, the kernel function $\kappa_r(\theta^{(r)})$ is written as $(v, v') \mapsto \kappa_r(v' \mid v, \theta^{(r)})$. Using $p_\partial$ and this family of local parameterizations, we define the global parameterization $Q$ that assigns to each $\theta = (\theta^{(1)}, \ldots, \theta^{(m)}) \in H := \prod_{r=1}^{m} H_r$ the
Locality of Global Stochastic Interaction
2969
composed probability distribution with
$$p(v \mid \theta) = p_\partial(v_\partial) \prod_{r=1}^{m} \kappa_r\bigl(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \theta^{(r)}\bigr), \qquad v \in V_V. \tag{3.10}$$
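Equation 3.10 is constructive: the adapted joint distribution is just the product of the input distribution and the local kernels. A minimal sketch, restricted for simplicity to a simple network with binary units and hypothetical logistic kernels (the kernel shapes and parameters are illustrative assumptions, not part of the text):

```python
import numpy as np
from itertools import product

def compose(p_in, kernels):
    """Equation 3.10 for a simple network: joint over (inputs, internal units)
    from p_in and local kernels; kernels[r](c, v) = kappa_r(c | input state v)."""
    n = p_in.ndim          # number of binary peripheral neurons
    m = len(kernels)       # number of internal neurons
    p = {}
    for v in product((0, 1), repeat=n + m):
        prob = p_in[v[:n]]
        for r, kr in enumerate(kernels):
            prob *= kr(v[n + r], v[:n])
        p[v] = prob
    return p

rng = np.random.default_rng(1)
p_in = rng.random((2, 2)); p_in /= p_in.sum()
theta = rng.normal(size=(2, 3))                 # per-neuron local parameters
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
kernels = [lambda c, v, t=t: sig(t[0] + t[1]*v[0] + t[2]*v[1]) if c
           else 1 - sig(t[0] + t[1]*v[0] + t[2]*v[1]) for t in theta]

p = compose(p_in, kernels)
assert np.isclose(sum(p.values()), 1.0)  # a valid joint distribution
```

In the general directed acyclic case the parents of $v_r$ may include earlier internal neurons; only the indexing of the parent state changes.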
Variational learning in neural networks corresponds to an optimizing stochastic process in the parameter space that is usually derived from the gradient of the corresponding utility function by adaptation to available information (nonanticipation, locality). In this article, we do not derive such learning processes. We investigate the local-global relation of stochastic interaction entirely in terms of the natural gradient (Amari, 1998) and in this context also talk about learning. The natural gradient of a function $f$ on the image $\mathcal{N}$ of the parameterization $Q$ is determined by the first fundamental form $G := (g_{(r,i)(s,j)})_{(r,i)(s,j)}$ of the Fisher metric, with
$$g_{(r,i)(s,j)}(\theta) := E_{Q(\theta)}\!\left[ \frac{\partial \ln Q}{\partial \theta^{(r,i)}}(\theta)\, \frac{\partial \ln Q}{\partial \theta^{(s,j)}}(\theta) \right], \qquad \theta \in H.$$
The local coordinates of the natural gradient of $f$ are given by
$$\bigl(G^{-1}\, \nabla (f \circ Q)\bigr)(\theta), \qquad \theta \in H,$$
where $\nabla$ denotes the canonical gradient in $\bigoplus_{r=1}^{m} \mathbb{R}^{d_r} \cong \mathbb{R}^{d}$, $d = \sum_{r=1}^{m} d_r$. Note that in simple networks, the functions
$$I_Q(v_r; \mathrm{pa}(v_r)) : \theta \mapsto I_{Q(\theta)}(v_r; \mathrm{pa}(v_r)) \quad \text{and} \quad H_Q(\mathrm{pa}(v_r) \mid v_r) : \theta \mapsto H_{Q(\theta)}(\mathrm{pa}(v_r) \mid v_r)$$
do not depend on the nonlocal parameters $\theta^{(s,j)}$, $s \neq r$. The corresponding local functions $H_r \to \mathbb{R}$ are denoted by $I^{\mathrm{loc}}(v_r; \mathrm{pa}(v_r))$ and $H^{\mathrm{loc}}(\mathrm{pa}(v_r) \mid v_r)$.
Theorem 1. Let $N$ be a simple network, and let $Q$ be a parameterization of the form 3.10 with image $\mathcal{N}$. Then the coordinates of the natural gradient of $I(V)$ on $\mathcal{N}$ with respect to $Q$ split in a completely local way. More precisely, the following holds: for all $r$, the matrix $G_r := (g_{(r,i)(r,j)})_{i,j}$ depends only on the local parameter vector $\theta^{(r)}$, and
$$
(G^{-1}\, \nabla (I(V) \circ Q))(\theta) =
\begin{pmatrix}
(G_1^{-1}\, \nabla I^{\mathrm{loc}}(v_1; \mathrm{pa}(v_1)))(\theta^{(1)}) \\
\vdots \\
(G_m^{-1}\, \nabla I^{\mathrm{loc}}(v_m; \mathrm{pa}(v_m)))(\theta^{(m)})
\end{pmatrix}
= -\begin{pmatrix}
(G_1^{-1}\, \nabla H^{\mathrm{loc}}(\mathrm{pa}(v_1) \mid v_1))(\theta^{(1)}) \\
\vdots \\
(G_m^{-1}\, \nabla H^{\mathrm{loc}}(\mathrm{pa}(v_m) \mid v_m))(\theta^{(m)})
\end{pmatrix}.
$$
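The block splitting in theorem 1 is also what makes the natural gradient cheap to evaluate: instead of inverting the full Fisher matrix, one solves one small linear system per neuron. A numerical sketch, with random positive-definite blocks standing in for the local Fisher matrices $G_r$ (hypothetical values, for illustration only):

```python
import numpy as np

def natural_gradient(fisher_blocks, grad_blocks):
    # With a block-diagonal Fisher metric, the natural gradient
    # G^{-1} grad f splits into independent local solves (theorem 1).
    return [np.linalg.solve(G_r, g_r)
            for G_r, g_r in zip(fisher_blocks, grad_blocks)]

rng = np.random.default_rng(2)
blocks, grads = [], []
for d in (2, 3):                        # two "neurons" with 2 and 3 parameters
    A = rng.normal(size=(d, d))
    blocks.append(A @ A.T + np.eye(d))  # symmetric positive definite stand-in
    grads.append(rng.normal(size=d))

local = np.concatenate(natural_gradient(blocks, grads))

# Same result as inverting the full block-diagonal metric at once.
G = np.zeros((5, 5))
G[:2, :2], G[2:, 2:] = blocks[0], blocks[1]
full = np.linalg.solve(G, np.concatenate(grads))
assert np.allclose(local, full)
```

With $m$ neurons of $d$ parameters each, this replaces one solve of size $md$ by $m$ independent solves of size $d$.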
Informally, theorem 1 can be stated as follows: in simple networks, global learning according to the IMI hypothesis is equivalent to local learning according to a family of local potential functions: each neuron maximizes the flow of information from its parents. According to theorem 1, in a simple network, it is sufficient to compute separately the gradients of the local information flows. The following example discusses a specific parameterization of a simple network with a single output neuron.

Example 2 (Simple Perceptron). Consider the simple network $N = (V, E)$ with $V := \{v_0, v_1, v_2, \ldots, v_n\}$ and $E := \{v_1, v_2, \ldots, v_n\} \times \{v_0\}$. Obviously, we have $\partial := \mathrm{per}(N) = \mathrm{pa}(v_0)$ and $\mathrm{int}(N) = \{v_0\}$. We assume that all neurons are binary: $V_v := \{\pm 1\}$, $v \in V$. The input distribution on $\{\pm 1\}^\partial$ is denoted by $p_\partial$. For each input $v = (v_1, \ldots, v_n)$ (we identify $v_i$ with $v_{v_i}$), the neuron $v_0$ computes the weighted sum of the input activities according to a vector $\theta = (\theta_1, \ldots, \theta_n) \in \mathbb{R}^n$ of synaptic weights:
$$h_\theta(v) := \sum_{i=1}^{n} \theta_i v_i, \qquad v \in \{\pm 1\}^\partial.$$
Then it makes a transition to the state $+1$ with probability $f(h_\theta(v))$, where $f : \mathbb{R} \to (0, 1)$ is the logistic function:
$$f(x) := \frac{1}{1 + \exp(-x)}.$$
Thus, for $\theta \in \mathbb{R}^n$, we have the transition kernel $k(\cdot \mid \cdot, \theta)$ defined by
$$k(+1 \mid v, \theta) = \left( 1 + \exp\left( -\sum_{i=1}^{n} \theta_i v_i \right) \right)^{-1}.$$
This model implies the following parameterization:
$$Q : \theta \mapsto p(v \mid \theta) := p_\partial(v_\partial)\, k(v_0 \mid v_\partial, \theta).$$
After elementary calculations, we get the first fundamental form,
$$g_{i,j}(\theta) = \sum_{v \in \{\pm 1\}^\partial} p_\partial(v)\, \frac{df}{dx}(h_\theta(v))\, v_i v_j, \qquad 1 \le i, j \le n,$$
Figure 3: The shape of the modulation function f for several values of c.
and
$$\frac{\partial I^{\mathrm{loc}}(v_0; \partial)}{\partial \theta_i}(\theta) = \sum_{v \in \{\pm 1\}^\partial} p_\partial(v)\, v_i\, \frac{df}{dx}(h_\theta(v)) \left( h_\theta(v) - \ln \frac{\sum_{s \in \{\pm 1\}^\partial} p_\partial(s)\, f(h_\theta(s))}{\sum_{s \in \{\pm 1\}^\partial} p_\partial(s)\, (1 - f(h_\theta(s)))} \right). \tag{3.11}$$
Consider the functional that assigns to each random variable $X : \{\pm 1\}^\partial \to (0, 1)$ the modulated variable
$$\widetilde{X} := X(1 - X)\left( \ln\frac{X}{1 - X} - \ln\frac{E_{p_\partial}(X)}{1 - E_{p_\partial}(X)} \right). \tag{3.12}$$
This modulation is illustrated in Figure 3 by the shape of the function $x \mapsto x(1 - x)\bigl(\ln\frac{x}{1 - x} - c\bigr)$. Applying the modulation 3.12 to the variable $X_\theta := f \circ h_\theta$, equation 3.11 simplifies to
$$\frac{\partial I^{\mathrm{loc}}(v_0; \partial)}{\partial \theta_i}(\theta) = \sum_{v \in \{\pm 1\}^\partial} p_\partial(v)\, v_i\, \widetilde{X}_\theta(v). \tag{3.13}$$
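Equation 3.13 can be verified against a finite-difference gradient of the local mutual information $I^{\mathrm{loc}}(v_0; \partial)$; the sketch below uses a randomly chosen input distribution and weight vector (illustrative values, not from the text):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n = 3
inputs = np.array(list(product((-1, 1), repeat=n)))   # the input space {±1}^∂
p_in = rng.random(len(inputs)); p_in /= p_in.sum()
theta = rng.normal(size=n)

sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def I_loc(theta):
    """Mutual information I(v0; ∂) of the simple perceptron."""
    X = sig(inputs @ theta)                # P(v0 = +1 | v)
    Q = p_in @ X                           # P(v0 = +1)
    h2 = lambda q: -q*np.log(q) - (1-q)*np.log(1-q)
    return h2(Q) - p_in @ h2(X)

# Gradient via the modulated Hebb rule, equation 3.13.
X = sig(inputs @ theta)
Q = p_in @ X
X_mod = X * (1 - X) * (np.log(X/(1-X)) - np.log(Q/(1-Q)))
grad = inputs.T @ (p_in * X_mod)

# Central finite differences of I_loc itself.
eps = 1e-6
num = np.array([(I_loc(theta + eps*e) - I_loc(theta - eps*e)) / (2*eps)
                for e in np.eye(n)])
assert np.allclose(grad, num, atol=1e-5)
```

The simplification from 3.11 to 3.13 uses two properties of the logistic function: $f' = f(1-f)$ and $\ln\frac{f(h)}{1-f(h)} = h$.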
Thus, one can interpret this part of the gradient to be Hebb-like, where the contribution of the output is not linear but appropriately modulated.

4 Discussion

4.1 More on Linsker's Work. In 1986, Linsker implemented a Hebb-like learning rule in a parameterized feedforward network that led to the emergence of receptive fields similar to those in the early visual pathway of mammals (Hubel & Wiesel, 1962, 1968). In his description, Linsker considered separately the building blocks of the layered network consisting of two neighboring layers L and M (see Figure 4).

Figure 4: A building block of the layered network considered by Linsker.

Linsker (1988, p. 112) observed that the underlying Hebb-like learning is compatible with the optimization of local information flow: "For more general Hebb-type rules, we found that variance of the output activity was maximized subject to various constraints. This result led us to suggest that, at least in an intuitive sense, a Hebb rule may act to generate an M cell whose output activity preserves maximum information about the input activities, subject to constraints." This statement describes in an intuitive sense the optimization of the local interaction of each M-neuron with its parents. According to theorem 1, this optimization can be thought of as driven by the IMI principle. Linsker postulated the principle of maximum information preservation (Infomax), which is different from the IMI. According to Linsker's Infomax, the mutual information $I(L; M)$ between the input layer L and the output layer M should be maximized. The Infomax is obtained by induction from the observation that the local information flow is maximized (Linsker, 1988, p. 113): "The formulation of this principle arose from studying Hebb-type rules and recognizing certain optimization properties to which they lead for single M cells. Once formulated, however, the principle is independent of any particular local algorithm, whether Hebb-related or otherwise, that may be found to implement it." Linsker's Infomax has been applied to several models (see Linsker, 1997), but one important question remains open: Is this principle consistent? The phenomenologically convincing optimization principle that Linsker started with is the local one, which was implemented by Hebb-like learning rules. Is it possible to recover the local principle by applying the induced one to the original model?
As stated above, the IMI principle has this consistency for the building block of two layers.
4.2 Some Problems and Comments
4.2.1 Application to General Layered Networks. In order to apply the IMI optimization to Linsker's model systematically, instead of considering the two-layer building blocks separately, the feedforward network should be analyzed as a whole. This has been done only partially in this article (see proposition 3). In particular, the derivation of learning rules in terms of locally adapted stochastic processes that optimize the global stochastic interaction is necessary for establishing the strong local-global connection of interaction in general directed acyclic networks.

4.2.2 Translation to a Dynamical Setting. An important property of the IMI principle is its independence from any model specification. In particular, IMI can be applied to arbitrary graphical structures within the setting of random fields (directed, nondirected, or mixed in terms of chain graphs in the sense of Lauritzen, 1996, and Cowell et al., 1999). Nevertheless, in order to develop a dynamical theory of strongly interacting units, we considered the temporal aspects of interaction (Ay, 2001; Ay & Wennekers, 2001). The application of this dynamical approach to recurrent neural networks is in progress.

4.2.3 Preference for Neural Models. Starting with the hypothesis that neural networks are realizations of complex systems in the sense that the elementary units interact strongly with each other, one should assume not only that learning produces high complexity but also that evolution generates structural and neurophysiological properties that make the production of high complexity possible. Thus, there should be a preference for intrinsic constraints with sufficient variability, such that learning according to the IMI hypothesis leads to strong interaction between neurons. On the other hand, too large a variability has a negative effect on the generalization ability of learning systems.
This leads to the question of the existence of low-dimensional neuromanifolds that are compatible with the IMI optimization. In connection with the unconstrained optimization, we proved the following theorem (theorem 3.5 of Ay, 2002): Consider $n$ binary neurons, $n \ge 8$. Then there exists an exponential family $\mathcal{E}$ of dimension less than or equal to $n^2$ such that all distributions with locally maximal interaction of the neurons are in the topological closure of $\mathcal{E}$. It is not known how to relate such exponential families to specific neural network models. For instance, one can investigate the manifold of Boltzmann machines with regard to its variability for the IMI optimization.
Appendix: Proofs

Proof of Proposition 2. The lower bound is trivial. We prove the upper bound. Define the convex sets
$$\mathcal{Q} := \{q \in \bar{P}(V_V) : \mathrm{supp}\, q_\partial = \mathrm{supp}\, p_\partial\}$$
and
$$\mathcal{R} := \{q \in \mathcal{Q} : q_\partial = p_\partial,\; q_v = p_v \text{ for all } v \in V \setminus \partial\} \subseteq C(p_\partial).$$
The set $\mathcal{R}$ can be considered as the intersection of $\mathcal{Q}$ with an affine subspace $\mathcal{V}$ of $\mathrm{aff}\, \mathcal{Q}$ that is given by
$$r = |\mathrm{supp}\, p_\partial| - 1 + \sum_{v \in V \setminus \partial} (|V_v| - 1)$$
linear equations. Of course, $\dim \mathcal{V} \ge \dim \mathrm{aff}\, \mathcal{Q} - r$. Consider the open face of the simplex $\bar{P}(V_V)$,
$$\mathcal{P} := \{q \in \bar{P}(V_V) : \mathrm{supp}\, q = \mathrm{supp}\, p\}.$$
Then the set
$$\mathcal{S} := \mathcal{V} \cap \mathcal{P} = \mathcal{R} \cap \mathcal{P}$$
is relatively open (i.e., $\mathcal{S}$ is open in $\mathrm{aff}\, \mathcal{S}$), and $p \in \mathcal{S}$. Furthermore, $p$ locally maximizes the strictly convex restriction $I(V)|_{\mathcal{S}}$ of the interaction $I(V)$ to $\mathcal{S}$. Thus, $p$ must be an extreme point of $\mathcal{S}$, which is possible only if $\mathcal{S} = \mathcal{V} \cap \mathcal{P} = \{p\}$. This implies
$$\mathcal{V} \cap \mathrm{aff}\, \mathcal{P} = \{p\}. \tag{A.1}$$
Now we apply the dimension formula:
$$\dim \mathrm{aff}\, \mathcal{Q} \ge \dim \mathrm{aff}(\mathcal{V} \cup \mathrm{aff}\, \mathcal{P}) = \dim \mathcal{V} + (|\mathrm{supp}\, p| - 1) - \underbrace{\dim(\mathcal{V} \cap \mathrm{aff}\, \mathcal{P})}_{= 0 \text{ by (A.1)}} \ge \dim \mathrm{aff}\, \mathcal{Q} - r + |\mathrm{supp}\, p| - 1.$$
This gives us the estimate 3.7.
Proof of Corollary 1. Assume that there is a $v \in \mathrm{supp}\, p_\partial$ with
$$|\mathrm{supp}\, p(\cdot \mid X_\partial = v)| > 1 + \sum_{v \in V \setminus \partial} (|V_v| - 1).$$
Then
$$|\mathrm{supp}\, p| = \sum_{v \in \mathrm{supp}\, p_\partial} |\mathrm{supp}\, p(\cdot \mid X_\partial = v)| > (|\mathrm{supp}\, p_\partial| - 1) + 1 + \sum_{v \in V \setminus \partial} (|V_v| - 1).$$
This is a contradiction to the estimate 3.7.

Proof of Corollary 2. This follows directly from corollary 1.
Proof of Proposition 3. We choose a numbering $\{v_1, \ldots, v_m\}$ of $\mathrm{int}(N)$ such that $\mathrm{pa}(v_k) \subseteq \mathrm{per}(N) \cup \{v_1, \ldots, v_{k-1}\}$ for all $k = 1, \ldots, m$. This is always possible for directed acyclic graphs. Then, with the chain rule for the entropy (Cover & Thomas, 1991) and the Markov property of $p$ with respect to $N$ (Cowell et al., 1999), we get
$$H_p(V) = H_{p_\partial}(\partial) + H_p(v_1 \mid \partial) + \cdots + H_p(v_m \mid \partial, v_1, \ldots, v_{m-1}) = H_{p_\partial}(\partial) + H_p(v_1 \mid \mathrm{pa}(v_1)) + \cdots + H_p(v_m \mid \mathrm{pa}(v_m)).$$
This implies
$$I_p(V) = \sum_{v \in V} H_p(v) - H_p(V) = \left( \sum_{v \in \partial} H_{p_\partial}(v) - H_{p_\partial}(\partial) \right) + \sum_{i=1}^{m} \bigl( H_p(v_i) - H_p(v_i \mid \mathrm{pa}(v_i)) \bigr) = I_{p_\partial}(\partial) + \sum_{v \in \mathrm{int}(N)} I_p(v; \mathrm{pa}(v)).$$
Proof of Proposition 4. We prove this statement by using proposition 3. If $p$ is a local maximizer of $I(v; \mathrm{pa}(v))$ for all $v \in \mathrm{int}(N)$, then it also locally maximizes the global interaction $I(V)$ in the set of $(N, p_\partial)$-adapted distributions. Thus, we only have to prove the opposite implication. Let $p = p_\partial \otimes (\bigotimes_{w \in \mathrm{int}(N)} K_w)$ be a local maximizer of $I(V)$ in the set of $(N, p_\partial)$-adapted distributions, and assume that there exists a neuron $v \in \mathrm{int}(N)$ such that $p$ is not a local maximizer of $I(v; \mathrm{pa}(v))$. Then there exist neighborhoods $U_w$ of $K_w$ in $K^w$ with the following properties i and ii:

i. For all $K'_w \in U_w$, $w \in \mathrm{int}(N)$, with $p_\partial \otimes (\bigotimes_{w \in \mathrm{int}(N)} K'_w) \neq p$, one has
$$I_{p_\partial \otimes (\otimes_{w \in \mathrm{int}(N)} K'_w)}(V) \le I_p(V). \tag{A.2}$$

ii. There exists a kernel $K'_v \in U_v$, $K'_v \neq K_v$, such that
$$I_q(v; \mathrm{pa}(v)) > I_p(v; \mathrm{pa}(v)), \tag{A.3}$$
with $q := p_\partial \otimes (\bigotimes_{w \in \mathrm{int}(N),\, w \neq v} K_w) \otimes K'_v$.

Note that for $q$ and $w \in \mathrm{int}(N)$, $w \neq v$, the following holds:
$$I_q(w; \mathrm{pa}(w)) = I_p(w; \mathrm{pa}(w)). \tag{A.4}$$

We choose a kernel $K'_v$ that satisfies equation A.3. Applying proposition 3, we get the following contradiction to equation A.2:
$$I_q(V) = I_{q_\partial}(\partial) + \sum_{\substack{w \in \mathrm{int}(N) \\ w \neq v}} I_q(w; \mathrm{pa}(w)) + I_q(v; \mathrm{pa}(v)) \overset{(\mathrm{A.3}), (\mathrm{A.4})}{>} I_{p_\partial}(\partial) + \sum_{\substack{w \in \mathrm{int}(N) \\ w \neq v}} I_p(w; \mathrm{pa}(w)) + I_p(v; \mathrm{pa}(v)) = I_p(V).$$
The second statement of proposition 4, concerning the isolated local maximizers, can be proven in the same way. Here, one has to replace $\le$ by $<$ in equation A.2 and $>$ by $\ge$ in equation A.3.

In order to prove theorem 1, we need the following lemma:

Lemma 1. Let $N$ be a simple network, and let $Q$ be a parameterization of the form 3.10. Then for each $r$, the matrix $G_r = (g_{(r,i)(r,j)})_{i,j}$ depends only on the local parameter vector $\theta^{(r)}$, and the first fundamental form of the Fisher metric is given by
$$G(\theta) = \begin{pmatrix} G_1(\theta^{(1)}) & & 0 \\ & \ddots & \\ 0 & & G_m(\theta^{(m)}) \end{pmatrix}.$$
Proof. With the score function
$$\frac{\partial \ln p(v \mid \cdot)}{\partial \theta^{(r,i)}}(\theta) = \frac{\partial \ln \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,i)}}(\theta^{(r)}),$$
one has
$$g_{(r,i)(s,j)}(\theta) = \sum_{v \in V_V} p(v \mid \theta)\, \frac{\partial \ln p(v \mid \cdot)}{\partial \theta^{(r,i)}}(\theta)\, \frac{\partial \ln p(v \mid \cdot)}{\partial \theta^{(s,j)}}(\theta) = \sum_{v \in V_V} p(v \mid \theta)\, \frac{\partial \ln \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,i)}}(\theta^{(r)})\, \frac{\partial \ln \kappa_s(v_{v_s} \mid v_{\mathrm{pa}(v_s)}, \cdot)}{\partial \theta^{(s,j)}}(\theta^{(s)}).$$
We consider the case $r \neq s$:
$$
g_{(r,i)(s,j)}(\theta) = \sum_{v \in V_V} p_\partial(v_\partial) \prod_{\substack{t=1 \\ t \neq r,s}}^{m} \kappa_t(v_{v_t} \mid v_{\mathrm{pa}(v_t)}, \theta^{(t)}) \cdot \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \theta^{(r)})\, \frac{\partial \ln \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,i)}}(\theta^{(r)}) \cdot \kappa_s(v_{v_s} \mid v_{\mathrm{pa}(v_s)}, \theta^{(s)})\, \frac{\partial \ln \kappa_s(v_{v_s} \mid v_{\mathrm{pa}(v_s)}, \cdot)}{\partial \theta^{(s,j)}}(\theta^{(s)})
$$
$$
= \sum_{v \in V_V} p_\partial(v_\partial) \prod_{\substack{t=1 \\ t \neq r,s}}^{m} \kappa_t(v_{v_t} \mid v_{\mathrm{pa}(v_t)}, \theta^{(t)}) \cdot \frac{\partial \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,i)}}(\theta^{(r)})\, \frac{\partial \kappa_s(v_{v_s} \mid v_{\mathrm{pa}(v_s)}, \cdot)}{\partial \theta^{(s,j)}}(\theta^{(s)})
$$
$$
= \frac{\partial}{\partial \theta^{(r,i)}} \frac{\partial}{\partial \theta^{(s,j)}} \underbrace{\left\{ \sum_{v \in V_V} p_\partial(v_\partial) \prod_{\substack{t=1 \\ t \neq r,s}}^{m} \kappa_t(v_{v_t} \mid v_{\mathrm{pa}(v_t)}, \theta^{(t)})\, \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)\, \kappa_s(v_{v_s} \mid v_{\mathrm{pa}(v_s)}, \cdot) \right\}}_{\equiv\, 1} \Bigg|_{\theta^{(r)},\, \theta^{(s)}} = 0.
$$
It remains to prove that $g_{(r,i)(r,j)}$ is a function of the local parameter vector $\theta^{(r)}$:
$$
g_{(r,i)(r,j)}(\theta) = \sum_{v \in V_V} p_\partial(v_\partial) \prod_{t=1}^{m} \kappa_t(v_{v_t} \mid v_{\mathrm{pa}(v_t)}, \theta^{(t)}) \cdot \frac{\partial \ln \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,i)}}(\theta^{(r)})\, \frac{\partial \ln \kappa_r(v_{v_r} \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,j)}}(\theta^{(r)})
$$
$$
= \sum_{v \in V_\partial} p_\partial(v) \sum_{v' \in V_{v_r}} \kappa_r(v' \mid v_{\mathrm{pa}(v_r)}, \theta^{(r)}) \cdot \frac{\partial \ln \kappa_r(v' \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,i)}}(\theta^{(r)})\, \frac{\partial \ln \kappa_r(v' \mid v_{\mathrm{pa}(v_r)}, \cdot)}{\partial \theta^{(r,j)}}(\theta^{(r)}) = g_{(r,i)(r,j)}(\theta^{(r)}).
$$
Proof of Theorem 1. According to proposition 3, we have
$$I(V) \circ Q = I_{p_\partial}(\partial) + \sum_{r=1}^{m} I^{\mathrm{loc}}(v_r; \mathrm{pa}(v_r)) = I_{p_\partial}(\partial) + \sum_{r=1}^{m} \bigl( H_{p_\partial}(\mathrm{pa}(v_r)) - H^{\mathrm{loc}}(\mathrm{pa}(v_r) \mid v_r) \bigr).$$
Therefore,
$$\frac{\partial (I(V) \circ Q)}{\partial \theta^{(r,i)}}(\theta) = \frac{\partial I^{\mathrm{loc}}(v_r; \mathrm{pa}(v_r))}{\partial \theta^{(r,i)}}(\theta^{(r)}) = -\frac{\partial H^{\mathrm{loc}}(\mathrm{pa}(v_r) \mid v_r)}{\partial \theta^{(r,i)}}(\theta^{(r)}).$$
With lemma 1, this completes the proof.

Acknowledgments
I am grateful to Eugene Gutkin, Jürgen Jost, and Thomas Wennekers for helpful comments, and to two anonymous referees for constructive suggestions. The computer simulations shown in Figures 1 and 2 are based on an implementation by Ulrich Steinmetz. I worked out this article at the Max Planck Institute for Mathematics in the Sciences in Leipzig.

References

Amari, S. (1985). Differential-geometric methods in statistics. Heidelberg: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Trans. IT, 47, 1701–1711.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193.
Ay, N. (2001). Information geometry on complexity and stochastic interaction. Manuscript submitted for publication.
Ay, N. (2002). An information-geometric approach to a theory of pragmatic structuring. Annals of Probability, 30, 416–436.
Ay, N., & Wennekers, T. (2001). Dynamical properties of strongly interacting Markov chains. Manuscript submitted for publication.
Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Barlow, H. (2001). Redundancy reduction revisited. Network: Comput. Neural Syst., 12, 241–253.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Chaitin, G. J. (1979). Toward a mathematical definition of "life." In R. Levine & M. Tribus (Eds.), The maximum entropy formalism (pp. 477–498). Cambridge, MA: MIT Press.
Chaitin, G. J. (2001). Exploring randomness. New York: Springer-Verlag.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley-Interscience.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., & Spiegelhalter, D. J. (1999). Probabilistic networks and expert systems. New York: Springer-Verlag.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (Lond.), 195, 215–243.
Jost, J. (2000/2001). Komplexe Systeme. Lecture, University of Leipzig.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36, 910–912.
Lauritzen, S. L. (1996). Graphical models. Oxford: Clarendon Press.
Linsker, R. (1986). From basic network principles to neural architecture. Proceedings of the National Academy of Sciences, USA, 83, 7508–7512.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21, 105–117.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9, 1661–1665.
Obradovic, D., & Deco, G. (1998). Information maximization and independent component analysis: Is there a difference? Neural Computation, 10, 2085–2101.
Rieke, F., Warland, D., Ruyter van Steveninck, R., & Bialek, W. (1998). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Tononi, G., Sporns, O., & Edelman, G. M. (1994). A measure for brain complexity: Relating functional segregation and integration in the nervous system. Proc. Natl. Acad. Sci. USA, 91, 5033–5037.
Received September 12, 2001; accepted May 10, 2002.
LETTER
Communicated by Morris Hirsch
On Unique Representations of Certain Dynamical Systems Produced by Continuous-Time Recurrent Neural Networks

Masahiro Kimura
[email protected]
NTT Communication Science Laboratories, Seika-cho, Kyoto 619-0237, Japan

This article extends previous mathematical studies on elucidating the redundancy in describing functions by feedforward neural networks (FNNs) to the elucidation of the redundancy in describing dynamical systems (DSs) by continuous-time recurrent neural networks (RNNs). In order to approximate a DS on $\mathbb{R}^n$ using an RNN with $n$ visible units, an n-dimensional affine neural dynamical system (A-NDS) can be used as the DS actually produced by the RNN under an affine map from its visible state-space $\mathbb{R}^n$ to its hidden state-space. Therefore, we consider the problem of clarifying the redundancy in describing A-NDSs by RNNs and affine maps. We clarify to what extent a pair of an RNN and an affine map is uniquely determined by its corresponding A-NDS and also give a nonredundant sufficient search set for the DS approximation problem based on A-NDSs.

1 Introduction
Neural networks in general exhibit redundancy when describing systems such as functions or dynamical systems; that is, distinct parameter values can determine the same system. In the case of feedforward neural networks (FNNs), such redundancy has been investigated from various points of view. We consider multilayer perceptrons as FNNs. For multilayer FNNs with tanh activation function, Hecht-Nielsen (1989) and Chen, Lu, and Hecht-Nielsen (1993) investigated the conditions under which two networks with the same number of hidden units produce the same input-output function. Sussmann (1992) clarified to what extent a three-layer FNN with tanh activation function is uniquely determined by its input-output function. The purpose of this article is to extend Sussmann's work on function description by FNNs to the problem of dynamical system (DS) description by continuous-time recurrent neural networks (RNNs).

Neural Computation 14, 2981–2996 (2002) © 2002 Massachusetts Institute of Technology

Let us consider the function approximation problem using a three-layer FNN. Let $P_r$ be the space of all three-layer FNNs with $r$ hidden units. The ultimate goal of the network training is to find, for any nonnegative integer $r$, the parameter value $p_r^0 \in P_r$ that minimizes the error $E_r(p_r)$, where $E_r$ is some error function that depends on
2982
Masahiro Kimura
only the input-output function of an FNN with $r$ hidden units. When we search for the optimum parameter value $p_r \in P_r$, we expect to make our search efficient by confining it to a subset of $P_r$ in which all redundancy has been eliminated. This topic has been investigated by Hecht-Nielsen (1990), Hecht-Nielsen and Evans (1991), and Chen et al. (1993) for multilayer FNNs with tanh activation function. Chen et al. (1993) in particular found a reasonably small sufficient search set in $P_r$ described by simple linear and quadratic inequalities, although it still exhibited redundancy for each $r$. Recently, Fukumizu and Amari (2000) proved the existence of local minima and plateaus in the hierarchical geometric structure of the space of all three-layer FNNs. They proved that under a certain condition, there exists a one-dimensional submanifold $C_r$ of $P_r$ with the following properties: each element of $C_r$ is a critical point of $E_r$ and determines the same input-output function; there exists a local minimum $p_{r-1} \in P_{r-1}$ of $E_{r-1}$ such that the input-output function determined by an element of $C_r$ is the same as the input-output function determined by $p_{r-1}$; and there exists an open subset $C_r^0$ of $C_r$ such that any element of $C_r^0$ is a local minimum of $E_r$ and any element of $C_r \setminus C_r^0$ is a saddle of $E_r$. Moreover, they explained that this sort of one-dimensional submanifold causes serious plateaus in the category of on-line learning. Here, a plateau is defined as a situation that causes a very slow decrease of the training error over a long period of time before a sudden exit from it. Therefore, it is important to know a nonredundant sufficient search set in $\bigcup_{r \ge 0} P_r$ from the point of view of developing effective learning algorithms. By improving the work of Chen et al. (1993), this article gives a nonredundant sufficient search set for the DS approximation problem using an RNN.
We consider additive networks with tanh activation function as RNNs and use them to approximate DSs on the n-dimensional Euclidean space $\mathbb{R}^n$. Mathematically, a flow on $\mathbb{R}^n$, that is, a one-parameter group of transformations of $\mathbb{R}^n$, is called a DS on $\mathbb{R}^n$ (Hirsch & Smale, 1974). The units of an RNN are divided into visible units and hidden units. Funahashi and Nakamura (1993) proved that given a finite-time trajectory of a DS on $\mathbb{R}^n$, there exists an RNN with $n$ visible units such that the trajectory is approximated by one visible trajectory of the RNN to any precision. Note that an RNN with hidden units cannot define a DS on its visible state-space unless a map is successfully specified from its visible state-space to its hidden state-space to determine the initial states of the hidden units for an initial state of a DS. Kimura and Nakano (1998) extended the result of Funahashi and Nakamura (1993) to the category of approximating a DS on $\mathbb{R}^n$ itself by using an RNN with $n$ visible units. They used an affine neural dynamical system (A-NDS), a DS produced by one RNN with hidden units under one affine map, to approximate the target DS to any precision.
Unique Representations of Certain Dynamical Systems
2983
An n-dimensional A-NDS is produced by a suitable pair of an RNN with $n$ visible units and an affine map from its visible state-space $\mathbb{R}^n$ to its hidden state-space. In this article, as one extension of the work by Sussmann (1992) and Chen et al. (1993) on function description using FNNs to DS description using RNNs, we construct a unique parametric representation of n-dimensional A-NDSs. Namely, we clarify to what extent a pair of an RNN and an affine map is uniquely determined by its corresponding DS, and we give a nonredundant sufficient search set for the DS approximation problem using an RNN. In section 2, the RNNs to be considered are specified, and the definition and basic facts for A-NDSs are summarized. In section 3, we summarize our main results before the mathematically rigorous discussions. We define the notion of symmetric transformations in section 4 and the notion of irreducible representations in section 5. In section 6, using these ideas, the redundancy for producing A-NDSs is completely elucidated, and a unique parametric representation of n-dimensional A-NDSs is theoretically constructed. In section 7, we present a nonredundant sufficient search set of an n-dimensional A-NDS. Section 8 sets out the conclusion.

2 Preliminaries

2.1 Continuous-Time Recurrent Neural Networks. This article considers the following RNNs with $n$ visible units, called additive networks, since they are widely used and their electrical circuit implementations have already been constructed (Hertz, Krogh, & Palmer, 1991; Hirsch, 1991; Baldi, 1995; Pearlmutter, 1995). An RNN $\mathcal{N}_n^r(W)$ consists of $n + r$ units, where units 1 to $n$ are visible and units $n+1$ to $n+r$ are hidden, and the state $u_i(t)$ of unit $i$ at time $t$ is governed by a system of ordinary differential equations,
$$\frac{du_i(t)}{dt} = -\frac{1}{\tau} u_i(t) + \sum_{j=1}^{n+r} w_{ij} \tanh(u_j(t)) + w_{i0}, \qquad i = 1, \ldots, n + r,$$
where $\tau$ is a fixed positive constant called the time constant, $w_{ij}$ is a real-valued parameter called the connection weight from unit $j$ to unit $i$, $w_{i0}$ is a real-valued parameter called the bias to unit $i$, and
$$W = \begin{pmatrix} w_{10} & w_{11} & \cdots & w_{1\,n+r} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n+r\,0} & w_{n+r\,1} & \cdots & w_{n+r\,n+r} \end{pmatrix} \in M_{n+r,\,n+r+1}(\mathbb{R}).$$
Here, we denote by $M_{m,m'}(\mathbb{R})$ the set of real $m \times m'$ matrices. Note that the number $r$ of hidden units is also a nonnegative integer-valued parameter.
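As a concrete reading of this definition, the system can be integrated directly; the following sketch (a plain Euler scheme with arbitrary weights, an illustration rather than part of the original analysis) produces one visible trajectory of an RNN $\mathcal{N}_n^r(W)$:

```python
import numpy as np

def rnn_step(u, W, tau=1.0, dt=0.01):
    # One Euler step of the additive network:
    # du_i/dt = -u_i/tau + sum_j w_ij tanh(u_j) + w_i0, with W = [w0 | (w_ij)].
    w0, W1 = W[:, 0], W[:, 1:]
    return u + dt * (-u / tau + W1 @ np.tanh(u) + w0)

rng = np.random.default_rng(4)
n, r = 2, 3                            # n visible units, r hidden units
W = rng.normal(size=(n + r, n + r + 1))
u = rng.normal(size=n + r)             # joint state; u[:n] is the visible part

trajectory = []
for _ in range(1000):
    u = rnn_step(u, W)
    trajectory.append(u[:n].copy())    # one visible trajectory of the RNN

assert np.all(np.isfinite(u))
```

Because the tanh nonlinearity is bounded, the leak term $-u_i/\tau$ keeps trajectories in a bounded region, which is why the forward Euler scheme above is well behaved at small step sizes.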
2.2 Affine Neural Dynamical Systems. An RNN $\mathcal{N}_n^0(W)$ without hidden units produces a DS $\Phi^W$ on $\mathbb{R}^n$ in the natural manner (Kimura & Nakano, 1998). However, an RNN $\mathcal{N}_n^r(W)$ with $r\,({>}\,0)$ hidden units does not produce a DS on $\mathbb{R}^n$ unless a map $h$ from the visible state-space $\mathbb{R}^n$ to the hidden state-space $\mathbb{R}^r$ is successfully specified. We consider affine maps from $\mathbb{R}^n$ to $\mathbb{R}^r$. For any
$$A = \begin{pmatrix} a_{10} & a_{11} & \cdots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r0} & a_{r1} & \cdots & a_{rn} \end{pmatrix} \in M_{r,\,n+1}(\mathbb{R}),$$
define the affine map $h_A : \mathbb{R}^n \to \mathbb{R}^r$ by
$$h_A(x) = ((h_A)_1(x), \ldots, (h_A)_r(x)), \quad x = (x_1, \ldots, x_n) \in \mathbb{R}^n; \qquad (h_A)_k(x) = \sum_{j=1}^{n} a_{kj} x_j + a_{k0} \quad (k = 1, \ldots, r).$$
Let $M_n^r$ denote the set of those pairs of $W = (w_{ij}) \in M_{n+r,\,n+r+1}(\mathbb{R})$ ($i = 1, \ldots, n+r$; $j = 0, 1, \ldots, n+r$) and $A = (a_{kj}) \in M_{r,\,n+1}(\mathbb{R})$ ($k = 1, \ldots, r$; $j = 0, 1, \ldots, n$) such that
$$w_{n+k\,0} = \sum_{i=1}^{n} a_{ki} w_{i0} + (a_{k0} / \tau), \qquad k = 1, \ldots, r,$$
$$w_{n+k\,j} = \sum_{i=1}^{n} a_{ki} w_{ij}, \qquad k = 1, \ldots, r;\; j = 1, \ldots, n+r.$$
For any $m = (W, A) \in M_n^r$, we define the $C^1$ map $\varphi_m : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^n$ by
$$\varphi_m(t, x) = (\pi \circ \Phi^W)(t, (x, h_A(x))), \qquad t \in \mathbb{R},\; x \in \mathbb{R}^n,$$
where $\pi : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is the canonical projection. Then the following theorem holds (Kimura & Nakano, 1998):

Theorem 1.
i. For any $m = (W, A) \in M_n^r$, $\varphi_m$ forms a DS on $\mathbb{R}^n$; that is, the RNN $\mathcal{N}_n^r(W)$ produces the DS $\varphi_m$ on the visible state-space $\mathbb{R}^n$ under the affine map $h_A$.
ii. Let $\psi$ be a DS on $\mathbb{R}^n$. Given a positive constant $\varepsilon$, a bounded closed interval $J \subseteq \mathbb{R}$, and a compact set $K \subseteq \mathbb{R}^n$, there exist a positive integer $r$ and $m \in M_n^r$ such that
$$\|\varphi_m(t, x) - \psi(t, x)\| < \varepsilon, \qquad t \in J,\; x \in K.$$
For any $m \in M_n^r$ ($r > 0$), the DS $\varphi_m$ is called an n-dimensional affine neural dynamical system (A-NDS). We define $M_n^0$ as the set of $W = (w_{ij}) \in M_{n,\,n+1}(\mathbb{R})$ ($i = 1, \ldots, n$; $j = 0, 1, \ldots, n$). For any $W \in M_n^0$, the DS $\Phi^W$ produced by the RNN $\mathcal{N}_n^0(W)$ is also called an n-dimensional A-NDS. For $m = (W, A) \in M_n^r$ ($0 < r \in \mathbb{Z}$), we denote by $X_m$ the vector field on $\mathbb{R}^n$ corresponding to the n-dimensional A-NDS $\varphi_m$. It turns out that
$$X_m(x) = ((X_m)_1(x), \ldots, (X_m)_n(x)) \quad (x = (x_1, \ldots, x_n) \in \mathbb{R}^n);$$
$$(X_m)_i(x) = -\frac{1}{\tau} x_i + \sum_{j=1}^{n} w_{ij} \tanh(x_j) + w_{i0} + \sum_{k=1}^{r} w_{i\,n+k} \tanh\left( \sum_{j=1}^{n} a_{kj} x_j + a_{k0} \right) \quad (i = 1, \ldots, n) \tag{2.1}$$
(Kimura & Nakano, 1998). Note that the notion of A-NDSs can be regarded as an extension of the pseudoneural systems of Funahashi (1994) and an extension of the dynamic perceptrons with self-connections of Moreau, Louiès, Vandewalle, and Brenig (1999). Theorem 1 states that it is valid to use an n-dimensional A-NDS as the DS that an RNN with $n$ visible units actually produces on its visible state-space $\mathbb{R}^n$ to approximate a target DS on $\mathbb{R}^n$.

Throughout this article, we use the following notation in order to simplify the expression. For $(W, A) \in M_n^r$, we denote by $w_{ij}$ the $(i, j)$-component of $W$, denote by $a_{kj}$ the $(k, j)$-component of $A$, and define
$$w_j = (w_{1j}, w_{2j}, \ldots, w_{nj}) \in \mathbb{R}^n, \qquad j = 0, 1, \ldots, n+r,$$
$$\bar{a}_k = (a_{k0}, a_{k1}, \ldots, a_{kn}) \in \mathbb{R}^{n+1}, \qquad k = 1, \ldots, r,$$
$$a_k = (a_{k1}, a_{k2}, \ldots, a_{kn}) \in \mathbb{R}^n, \qquad k = 1, \ldots, r,$$
$$W_v = \begin{pmatrix} w_{10} & w_{11} & \cdots & w_{1\,n+r} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n0} & w_{n1} & \cdots & w_{n\,n+r} \end{pmatrix} \in M_{n,\,n+r+1}(\mathbb{R}).$$

Remark. By definition, $M_n^r$ can be identified with $M_{n,\,n+r+1}(\mathbb{R}) \times M_{r,\,n+1}(\mathbb{R})$ in the following way:
$$M_n^r \ni (W, A) \leftrightarrow (W_v, A) \in M_{n,\,n+r+1}(\mathbb{R}) \times M_{r,\,n+1}(\mathbb{R}).$$
Therefore, $M_n^r$ can be regarded as an $(n^2 + 2nr + n + r)$-dimensional Euclidean space. Note that any element $(W, A)$ of $M_n^r$ is determined by specifying $w_j \in \mathbb{R}^n$ ($j = 0, 1, \ldots, n+r$) and $\bar{a}_k \in \mathbb{R}^{n+1}$ ($k = 1, \ldots, r$).
2.3 Representations of A-NDSs. Let $\mathcal{D}^n$ denote the set of all n-dimensional A-NDSs, and let
$$F : \bigcup_{0 \le r \in \mathbb{Z}} M_n^r \to \mathcal{D}^n$$
denote the parametric representation of $\mathcal{D}^n$, that is,
$$F(m) = \begin{cases} \varphi_m, & m = (W, A) \in M_n^r,\ r > 0, \\ \Phi^m, & m = W \in M_n^0. \end{cases}$$
For $\varphi \in \mathcal{D}^n$, $m \in M_n^r$ is called a representation of $\varphi$ if $F(m) = \varphi$. By definition, the map $F$ is surjective. We also know that the restriction $F|_{M_n^0} : M_n^0 \to \mathcal{D}^n$ is injective (Kimura & Nakano, 1998). However, the parametric representation $F$ exhibits redundancy. In fact, there exist distinct RNNs with the same number of hidden units that produce the same A-NDS, and there also exist distinct RNNs that produce the same A-NDS even though their numbers of hidden units differ.

3 Main Results
In this section, we describe our main results before the formal discussion. The mathematically rigorous statements and their proofs are given in the following sections.

Let us consider the subset $\widehat{M}^n_r \subset M^n_r$ that consists of $(W, A) \in M^n_r$ satisfying the following conditions: the RNN $\mathcal{N}^n_r(W)$ has at least one connection from each hidden unit $n + k$ to the visible units; each affine function $(h_A)_k(x)$ is not a constant function; $|(h_A)_k(x_1, \dots, x_n)| \not\equiv |x_i|$ for $k = 1, \dots, r$ and $i = 1, \dots, n$; and $|(h_A)_k(x)| \not\equiv |(h_A)_\ell(x)|$ for $k \neq \ell$. Note that these conditions can also be stated in terms of $W$ and $A$ (see the definition of irreducible representations in section 5). In the case $r = 0$, we put $\widehat{M}^n_0 = M^n_0$. Then we obtain the following results: There exists no representation of an $n$-dimensional A-NDS $\varphi_\mu$ using an RNN with fewer hidden units than $r$ if $\mu \in \widehat{M}^n_r$, although there always exists such a representation of $\varphi_\mu$ if $\mu \in M^n_r \setminus \widehat{M}^n_r$. If $\varphi_\mu = \varphi_{\mu'}$ for $\mu, \mu' \in \widehat{M}^n_r$, then $\mu'$ is obtained from $\mu$ by appropriately relabeling the hidden units $\{n+1, \dots, n+r\}$ and changing the signs of the weights $w_{n+\ell}, a_\ell$ associated with a hidden unit $n + \ell$ for some $\ell$'s (see the definition of symmetric transformations in section 4). These results clarify the redundancy in describing A-NDSs by RNNs and affine maps.
Unique Representations of Certain Dynamical Systems
2987
In this article, we use the following notation: for any positive integer $d$ and for $x = (x_1, \dots, x_d)$, $y = (y_1, \dots, y_d) \in \mathbb{R}^d$ such that $x \neq y$, we write $x \prec y$ if
$$x_1 < y_1;\ \text{or}\ x_1 = y_1,\ x_2 < y_2;\ \text{or} \dots \text{or}\ x_1 = y_1, \dots, x_{d-1} = y_{d-1},\ x_d < y_d$$
(the lexicographic order). Let us consider the subset $B^n_r \subset \widehat{M}^n_r$ that consists of $(W, A) \in \widehat{M}^n_r$ satisfying the conditions
$$0 \prec w_{n+1}, \dots, 0 \prec w_{n+r}, \qquad a_1 \prec \cdots \prec a_r,$$
where $0 = (0, \dots, 0) \in \mathbb{R}^n$. In the case $r = 0$, we put $B^n_0 = \widehat{M}^n_0$. Then we obtain the following result: $B^n := \bigcup_{0 \le r \in \mathbb{Z}} B^n_r$ is a nonredundant sufficient search set of an $n$-dimensional A-NDS. Namely, the restriction $\Phi|_{B^n}: B^n \to D^n$ is bijective.

4 Symmetric Transformations
We return to the formal discussion. Let us consider the case where distinct RNNs with the same number of hidden units produce the same A-NDS. Assume that $r$ is a positive integer. A $C^\infty$-diffeomorphism of $M^n_r$ is referred to as a transformation of $M^n_r$. We introduce the notion of symmetric transformations of $M^n_r$.

Definition. A transformation $g$ of $M^n_r$ is said to be symmetric if for each $\mu \in M^n_r$, $\mu$ and $g(\mu)$ represent the same A-NDS, that is,
$$\Phi(\mu) = \Phi(g(\mu)), \quad \mu \in M^n_r.$$
Let us construct some symmetric transformations of $M^n_r$. First, for $\ell = 1, \dots, r$, we define the transformation $h_\ell: M^n_r \to M^n_r$ by $h_\ell(W, A) = (W', A')$, $(W, A) \in M^n_r$, where
$$w'_{n+\ell} = -w_{n+\ell}, \qquad a'_\ell = -a_\ell, \qquad w'_j = w_j \ (j \neq n + \ell), \qquad a'_k = a_k \ (k \neq \ell).$$
Note that the transformation $h_\ell$ means changing the signs of all of the weights associated with the hidden unit $n + \ell$. By equation 2.1, it is easily proved that $h_\ell$ $(\ell = 1, \dots, r)$ is a symmetric transformation of $M^n_r$. Next, let $S_r$ denote the set of permutations of the set $\{1, \dots, r\}$. For $\sigma \in S_r$, we define the transformation $q_\sigma: M^n_r \to M^n_r$ by $q_\sigma(W, A) = (W', A')$, $(W, A) \in M^n_r$, where
$$w'_{n+\sigma(k)} = w_{n+k} \ (k = 1, \dots, r), \qquad a'_{\sigma(k)} = a_k \ (k = 1, \dots, r), \qquad w'_j = w_j \ (j = 0, 1, \dots, n).$$
Note that the transformation $q_\sigma$ means relabeling the hidden units according to the permutation $\sigma$. By equation 2.1, it is easily proved that $q_\sigma$ $(\sigma \in S_r)$ is a symmetric transformation of $M^n_r$. Let $G^n_r$ be the subgroup of the transformation group of $M^n_r$ generated by the symmetric transformations $h_\ell$ $(\ell = 1, \dots, r)$ and $q_\sigma$ $(\sigma \in S_r)$. We can easily prove the following proposition.

Proposition 1. $G^n_r$ is a noncommutative finite group of order $2^r r!$. In particular,

i. $h_k \circ h_k = \mathrm{id}$ $(k = 1, \dots, r)$, where $\mathrm{id}$ is the identity transformation of $M^n_r$.

ii. $h_k \circ h_\ell = h_\ell \circ h_k$ $(k, \ell = 1, \dots, r)$.

iii. $q_\sigma \circ h_\ell = h_{\sigma(\ell)} \circ q_\sigma$ $(\sigma \in S_r,\ \ell = 1, \dots, r)$.

iv. $G^n_r = \{h_1^{l_1} \circ \cdots \circ h_r^{l_r} \circ q_\sigma \mid l_1, \dots, l_r = 0, 1;\ \sigma \in S_r\}$, where $h_\ell^0$ expresses $\mathrm{id}$ and $h_\ell^1$ expresses $h_\ell$.

We define $G^n_0$ as the trivial group consisting of the identity transformation of $M^n_0$.
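To make the symmetric transformations concrete, here is a small NumPy sketch. The explicit form of the A-NDS vector field (equation 2.1) is not reproduced in this section, so the `field` function below assumes a form consistent with equation 6.1, namely $\dot{x}_i = -x_i + w_{i0} + \sum_j w_{ij}\tanh(x_j) + \sum_k w_{i\,n+k}\tanh((h_A)_k(x))$; the invariance checks do not depend on the $-x_i$ term. All names and values are illustrative.

```python
import numpy as np

n, r = 2, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n + r + 1))  # columns: w_0, w_1..w_n, w_{n+1}..w_{n+r}
A = rng.normal(size=(r, n + 1))      # row k-1 holds a_k = (a_k0, a_k1, ..., a_kn)

def field(W, A, x):
    # hypothetical A-NDS vector field, consistent with equation 6.1
    h = np.tanh(A[:, 0] + A[:, 1:] @ x)                    # (h_A)_k(x)
    return -x + W[:, 0] + W[:, 1:n + 1] @ np.tanh(x) + W[:, n + 1:] @ h

def h_ell(W, A, ell):
    # flip the signs of all weights attached to hidden unit n + ell (ell 1-based)
    W2, A2 = W.copy(), A.copy()
    W2[:, n + ell] *= -1.0
    A2[ell - 1] *= -1.0
    return W2, A2

def q_sigma(W, A, sigma):
    # relabel hidden units: unit k goes to position sigma[k] (0-based permutation)
    W2, A2 = W.copy(), A.copy()
    W2[:, n + 1 + np.asarray(sigma)] = W[:, n + 1:]
    A2[np.asarray(sigma)] = A
    return W2, A2

x = rng.normal(size=n)
f0 = field(W, A, x)
for ell in range(1, r + 1):                                # h_ell is symmetric
    assert np.allclose(field(*h_ell(W, A, ell), x), f0)
sigma = [2, 0, 1]
assert np.allclose(field(*q_sigma(W, A, sigma), x), f0)    # q_sigma is symmetric

# relation iii of proposition 1: q_sigma o h_ell = h_{sigma(ell)} o q_sigma
lhs = q_sigma(*h_ell(W, A, 2), sigma)
rhs = h_ell(*q_sigma(W, A, sigma), sigma[2 - 1] + 1)
assert all(np.allclose(u, v) for u, v in zip(lhs, rhs))
```

The symmetry of $h_\ell$ rests on the oddness of $\tanh$: negating both $w_{n+\ell}$ and $a_\ell$ leaves the product $w_{i\,n+\ell}\tanh((h_A)_\ell(x))$ unchanged.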
5 Irreducible Representations
Let us consider the case where distinct RNNs with different numbers of hidden units produce the same A-NDS. First, we introduce the notion of minimal representations of A-NDSs.

Definition. Let $\mu \in M^n_r$ be a representation of an $n$-dimensional A-NDS $\varphi$. Then $\mu$ is called a minimal representation of $\varphi$ if there is no other representation of $\varphi$ using an RNN with fewer hidden units than $r$, that is, if
$$\exists\, \mu' \in M^n_{r'} \text{ such that } \Phi(\mu') = \varphi \implies r \le r'.$$
In order to state the minimality condition of $\mu \in M^n_r$ in terms of $\mu$ itself, we introduce the notion of irreducible representations of A-NDSs.

Definition. $\mu = (W, A) \in M^n_r$ is said to be irreducible if $r = 0$ or the following conditions hold:

i. $w_{n+k} \neq 0$ $(k = 1, \dots, r)$.

ii. $\bar{a}_k \neq 0$ $(k = 1, \dots, r)$.

iii. $a_k \neq \pm e_1, \dots, \pm e_n$ $(k = 1, \dots, r)$, where $e_i \in \mathbb{R}^{n+1}$ denotes the vector $(0, \dots, 0, 1, 0, \dots, 0)$ whose $(i+1)$-st component is 1.

iv. $a_k \neq \pm a_\ell$ if $k \neq \ell$.
We call $\mu \in M^n_r$ an irreducible representation of an $n$-dimensional A-NDS $\varphi$ if $\Phi(\mu) = \varphi$ and $\mu$ is irreducible. Let $\widehat{M}^n_r$ denote the set of irreducible elements of $M^n_r$. Note that $\widehat{M}^n_0 = M^n_0$. We call the set $\widehat{M}^n_r$ the irreducible set of $M^n_r$. By equation 2.1, we can easily prove the following proposition.

Proposition 2. If $\mu \in M^n_r$ is not an irreducible representation of an $n$-dimensional A-NDS $\varphi$, then it is not a minimal representation of $\varphi$.

Here, we explain our irreducibility conditions in terms of RNNs and affine maps according to Sussmann's view (1992) on irreducible FNNs. We begin with the following definition:

Definition. Two functions $f_1$ and $f_2$ on $\mathbb{R}^n$ are said to be sign equivalent if $|f_1(x)| = |f_2(x)|$ for any $x \in \mathbb{R}^n$.
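For affine functions, sign equivalence admits a simple algebraic test: $|f_1| = |f_2|$ pointwise if and only if the coefficient vectors agree up to a global sign. A small NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def sign_equivalent(c1, c2):
    # affine f_i(x) = c_i[0] + c_i[1:] @ x; |f1| == |f2| pointwise
    # iff the coefficient vectors agree up to one global sign
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    return bool(np.allclose(c1, c2) or np.allclose(c1, -c2))

assert sign_equivalent([1.0, -2.0, 3.0], [-1.0, 2.0, -3.0])
assert not sign_equivalent([1.0, -2.0, 3.0], [1.0, 2.0, 3.0])

# cross-check by sampling |f1(x)| and |f2(x)| at random points
rng = np.random.default_rng(1)
c1, c2 = np.array([1.0, -2.0, 3.0]), np.array([-1.0, 2.0, -3.0])
for x in rng.normal(size=(5, 2)):
    assert np.isclose(abs(c1[0] + c1[1:] @ x), abs(c2[0] + c2[1:] @ x))
```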
Let $r > 0$, $\mu = (W, A) \in M^n_r$, and let $(x_1, \dots, x_n)$ denote the canonical coordinate system in $\mathbb{R}^n$. Consider the affine map $h_A$ from the visible state-space $\mathbb{R}^n$ to the hidden state-space $\mathbb{R}^r$. Now, we can state conditions i, ii, iii, and iv in the definition of irreducible representations as follows: Condition i states that there exists at least one connection from each hidden unit $n + k$ to the visible units. Condition ii states that each affine function $(h_A)_k(x)$ is not a constant function. Condition iii states that each affine function $(h_A)_k(x)$ is not sign equivalent to any of the affine functions $x_1, \dots, x_n$ on $\mathbb{R}^n$. Condition iv states that no two of the affine functions $(h_A)_1(x), \dots, (h_A)_r(x)$ are sign equivalent.

6 Unique Representations
Now we investigate a unique parametric representation of $D^n$. First, we state the following result by Sussmann (1992), which is used to prove our theorem.

Lemma 1. Let $f_1, \dots, f_d$ be nonconstant affine functions on $\mathbb{R}^n$ such that no two of them are sign equivalent, where $d$ is a positive integer. Then the $C^\infty$ functions $\tanh \circ f_1, \dots, \tanh \circ f_d$ and the constant function 1 on $\mathbb{R}^n$ are linearly independent in the real vector space of $C^\infty$ functions on $\mathbb{R}^n$.

Let us prove the following proposition.

Proposition 3. Let $r$ and $r'$ be nonnegative integers. Suppose that $\mu = (W, A) \in \widehat{M}^n_r$ and $\mu' = (W', A') \in \widehat{M}^n_{r'}$ satisfy $\Phi(\mu) = \Phi(\mu')$. Then $r = r'$, and there exists a $g \in G^n_r$ such that $g(\mu) = \mu'$.
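Lemma 1 can be probed numerically: sampling the functions at many points and checking the rank of the resulting matrix gives (numerical) evidence of linear independence. The sketch below draws random affine coefficients, which are pairwise not sign equivalent with probability 1; all concrete sizes are my choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2, 4
coeffs = rng.normal(size=(d, n + 1))  # f_i(x) = coeffs[i, 0] + coeffs[i, 1:] @ x

xs = rng.normal(size=(200, n))        # 200 sample points in R^n
cols = [np.ones(len(xs))]             # the constant function 1
for c in coeffs:
    cols.append(np.tanh(c[0] + xs @ c[1:]))
M = np.column_stack(cols)             # 200 x (d+1) sample matrix

# full column rank <=> no nontrivial linear relation among the d+1 functions
assert np.linalg.matrix_rank(M) == d + 1
```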
Proof. First, we assume that $r, r' > 0$. By assumption, we have $\varphi_\mu = \varphi_{\mu'}$. Therefore, by equation 2.1, it follows that
$$w_{i0} - w'_{i0} + \sum_{j=1}^{n} (w_{ij} - w'_{ij}) \tanh(x_j) + \sum_{k=1}^{r} w_{i\,n+k} \tanh\bigl((h_A)_k(x)\bigr) - \sum_{\ell=1}^{r'} w'_{i\,n+\ell} \tanh\bigl((h_{A'})_\ell(x)\bigr) = 0 \tag{6.1}$$
for $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, $i = 1, \dots, n$. Since $\mu$ and $\mu'$ are irreducible, each $(h_A)_k(x)$ and each $(h_{A'})_\ell(x)$ is a nonconstant affine function and is not sign equivalent to any of the affine functions $1, x_1, \dots, x_n$ on $\mathbb{R}^n$. Moreover, no two of $(h_A)_1(x), \dots, (h_A)_r(x)$ are sign equivalent, and no two of $(h_{A'})_1(x), \dots, (h_{A'})_{r'}(x)$ are sign equivalent.

Suppose that for any $k = 1, \dots, r$ and any $\ell = 1, \dots, r'$, the affine functions $(h_A)_k(x)$ and $(h_{A'})_\ell(x)$ are not sign equivalent. Then, by equation 6.1 and lemma 1, we have
$$w_{i\,n+k} = 0, \quad i = 1, \dots, n;\ k = 1, \dots, r, \qquad w'_{i\,n+\ell} = 0, \quad i = 1, \dots, n;\ \ell = 1, \dots, r'.$$
However, this contradicts the fact that $\mu$ and $\mu'$ are irreducible. Hence, there exists a pair $(h_A)_k(x)$ and $(h_{A'})_\ell(x)$ such that they are sign equivalent. Let $\{(k_1, \ell_1), \dots, (k_d, \ell_d)\}$ be all of the pairs $(k, \ell)$ such that $(h_A)_k(x)$ and $(h_{A'})_\ell(x)$ are sign equivalent, and put
$$(h_{A'})_{\ell_a}(x) = \rho_{k_a} (h_A)_{k_a}(x), \quad a = 1, \dots, d,$$
where $\rho_{k_a} = 1$ or $-1$ $(a = 1, \dots, d)$. Then equation 6.1 is equivalent to the following equation:
$$w_{i0} - w'_{i0} + \sum_{j=1}^{n} (w_{ij} - w'_{ij}) \tanh(x_j) + \sum_{a=1}^{d} \bigl(w_{i\,n+k_a} - \rho_{k_a} w'_{i\,n+\ell_a}\bigr) \tanh\bigl((h_A)_{k_a}(x)\bigr) + \sum_{k \neq k_1, \dots, k_d} w_{i\,n+k} \tanh\bigl((h_A)_k(x)\bigr) - \sum_{\ell \neq \ell_1, \dots, \ell_d} w'_{i\,n+\ell} \tanh\bigl((h_{A'})_\ell(x)\bigr) = 0 \tag{6.2}$$
for $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, $i = 1, \dots, n$. By lemma 1 and the irreducibility of $\mu$ and $\mu'$, the following $C^\infty$ functions on $\mathbb{R}^n$ are linearly independent:
$$1,\ \tanh(x_1), \dots, \tanh(x_n),\ \tanh\bigl((h_A)_1(x)\bigr), \dots, \tanh\bigl((h_A)_r(x)\bigr),\ \tanh\bigl((h_{A'})_\ell(x)\bigr)\ (\ell \neq \ell_1, \dots, \ell_d).$$
Hence, by equation 6.2 and the irreducibility of $\mu$ and $\mu'$, we have $r = r' = d$,
$$w_{ij} = w'_{ij}, \quad i = 1, \dots, n;\ j = 0, 1, \dots, n,$$
$$w'_{i\,n+\sigma(k)} = \rho_k w_{i\,n+k}, \quad i = 1, \dots, n;\ k = 1, \dots, r,$$
where $\sigma$ is the element of $S_r$ defined by $\sigma(k_a) = \ell_a$ $(a = 1, \dots, r)$. Observe that
$$(h_{A'})_{\sigma(k)}(x) = \rho_k (h_A)_k(x), \quad x \in \mathbb{R}^n,\ k = 1, \dots, r.$$
Now we define $g \in G^n_r$ by
$$g = h_1^{(1-\rho_1)/2} \circ \cdots \circ h_r^{(1-\rho_r)/2} \circ q_\sigma.$$
Then it can be easily verified that $g(\mu) = \mu'$. We can prove in the same way that the proposition is also true in the case of $r' = 0$. Hence, we have completed the proof of proposition 3.

Consider an $n$-dimensional A-NDS $\varphi$ and an element $\hat{\mu}$ of $M^n_{\hat{r}}$ with $\Phi(\hat{\mu}) = \varphi$. In general, it is difficult to decide whether $\hat{\mu}$ is a minimal representation of $\varphi$. Indeed, in order to verify the minimality of $\hat{\mu}$, we need to examine all elements of the sets $M^n_r$ $(r \in \mathbb{Z};\ 0 \le r \le \hat{r})$. By definition, it is much easier to decide whether $\hat{\mu}$ is irreducible. We saw in proposition 2 that a minimal representation of an A-NDS is an irreducible representation of the A-NDS. The following corollary shows that the converse is also true.

Corollary 1. Let $\varphi$ be an $n$-dimensional A-NDS and $\mu$ an element of $M^n_r$ such that $\Phi(\mu) = \varphi$. Then $\mu$ is a minimal representation of $\varphi$ if and only if $\mu$ is irreducible, that is, $\mu \in \widehat{M}^n_r$.

Proof. It is sufficient to prove that if $\mu$ is irreducible, then it is a minimal representation of $\varphi$. Suppose that $\mu$ is irreducible but not a minimal representation of $\varphi$. Then there exist a nonnegative integer $r'$ and an element $\mu'$ of $M^n_{r'}$ such that $r' < r$ and $\Phi(\mu) = \Phi(\mu')$. Taking a minimal representation $\hat{\mu} \in M^n_{\hat{r}}$ of $\varphi$, we have
$$\hat{r} \le r' < r. \tag{6.3}$$
Since minimal representations are irreducible, $\hat{\mu}$ is irreducible. Hence, we obtain $r = \hat{r}$ by proposition 3. This contradicts condition 6.3, and we have proved the corollary.
Let $r$ be a nonnegative integer. Note that the irreducible set $\widehat{M}^n_r$ is an open set of $M^n_r$ and $g(\widehat{M}^n_r) = \widehat{M}^n_r$ for any $g \in G^n_r$. Consider the quotient space $\widehat{M}^n_r / G^n_r$, and let $\pi_r: \widehat{M}^n_r \to \widehat{M}^n_r / G^n_r$ be the canonical projection. Then we have the following proposition.

Proposition 4. The quotient space $\widehat{M}^n_r / G^n_r$ has a unique structure of $C^\infty$ manifold such that the canonical projection $\pi_r$ is a covering map.

Proof. It is sufficient to prove that $G^n_r$ acts freely and properly discontinuously on $\widehat{M}^n_r$. Since $G^n_r$ is a finite group, it is sufficient to show that $G^n_r$ acts freely on $\widehat{M}^n_r$. Therefore, our task is to prove that if $\mu = (W, A) \in \widehat{M}^n_r$ and
$$G^n_r \ni g = h_1^{l_1} \circ \cdots \circ h_r^{l_r} \circ q_\sigma \neq \mathrm{id} \quad (l_1, \dots, l_r \in \{0, 1\},\ \sigma \in S_r),$$
then $g(\mu) \neq \mu$.

First, consider the case $\sigma \neq \mathrm{id}$. Then there exists an integer $k \in \{1, \dots, r\}$ such that $\sigma(k) \neq k$. If $g(\mu) = \mu$, then we have
$$a_k = \pm a_{\sigma(k)}.$$
This contradicts $\mu \in \widehat{M}^n_r$. Next, consider the case $\sigma = \mathrm{id}$. In this case, there exists an integer $k \in \{1, \dots, r\}$ such that $l_k = 1$. If $g(\mu) = \mu$, then we have
$$w_{n+k} = 0.$$
This contradicts $\mu \in \widehat{M}^n_r$. Hence, we have completed the proof of proposition 4.

Now we theoretically construct a unique parametric representation of $D^n$. We define the parameter space $\widetilde{M}^n$ by
$$\widetilde{M}^n = \bigsqcup_{0 \le r \in \mathbb{Z}} \bigl(\widehat{M}^n_r / G^n_r\bigr) \quad \text{(disjoint union)}.$$
Note that $\widetilde{M}^n$ is a countable disjoint union of $C^\infty$ manifolds with distinct dimensions. We also define the map $\widetilde{\Phi}: \widetilde{M}^n \to D^n$ by
$$\widetilde{\Phi}(\pi_r(\mu)) = \Phi(\mu), \quad \mu \in \widehat{M}^n_r,\ 0 \le r \in \mathbb{Z}.$$
Note that this map is well defined since $\Phi(\mu) = \Phi(g(\mu))$ $(\mu \in \widehat{M}^n_r,\ g \in G^n_r,\ 0 \le r \in \mathbb{Z})$. Proposition 3 implies that the map $\widetilde{\Phi}$ is injective.
Corollary 1 implies that the map $\widetilde{\Phi}$ is surjective. Hence, we have proved the following theorem.

Theorem 2. $\widetilde{\Phi}: \widetilde{M}^n \to D^n$ is bijective. Namely, $\widetilde{\Phi}: \widetilde{M}^n \to D^n$ gives a unique parametric representation of $n$-dimensional A-NDSs.

7 Search Sets
Let us construct a nonredundant sufficient search set for the DS approximation problem using an RNN, that is, a nonredundant sufficient search set of an $n$-dimensional A-NDS. We define the subset $B^n$ of $\bigcup_{0 \le r \in \mathbb{Z}} M^n_r$ by
$$B^n = \bigcup_{0 \le r \in \mathbb{Z}} B^n_r,$$
where
$$B^n_r = \bigl\{ (W, A) \in \widehat{M}^n_r \bigm| 0 \prec w_{n+1}, \dots, 0 \prec w_{n+r},\ a_1 \prec \cdots \prec a_r \bigr\}$$
for $r > 0$, and $B^n_0 = M^n_0$. Then the following theorem holds.

Theorem 3. The restriction $\Phi|_{B^n}: B^n \to D^n$ is bijective. Namely, $B^n$ is a nonredundant sufficient search set of an $n$-dimensional A-NDS.

Proof. It is sufficient to prove that the map
$$\pi_r|_{B^n_r}: B^n_r \to \widehat{M}^n_r / G^n_r$$
is bijective for any positive integer $r$.

First, we prove the surjectivity. It is sufficient to prove that for any $\mu \in \widehat{M}^n_r$, there exists a $g \in G^n_r$ such that $g(\mu) \in B^n_r$. The following lemma is needed to prove the surjectivity.

Lemma 2. Let $\mu = (W, A) \in C^r$, where
$$C^r = \bigl\{ (W, A) \in \widehat{M}^n_r \bigm| 0 \prec w_{n+1}, \dots, 0 \prec w_{n+r} \bigr\}.$$
Then there exists a $g \in G^n_r$ such that $g(\mu) \in B^n_r$.

Proof. Since $\mu = (W, A) \in C^r \subset \widehat{M}^n_r$, we have $a_k \neq a_\ell$ if $k \neq \ell$. Therefore, there exists a $\sigma \in S_r$ such that
$$a_{\sigma^{-1}(1)} \prec \cdots \prec a_{\sigma^{-1}(r)}.$$
Then it is easily verified that $q_\sigma(\mu) \in B^n_r$. This completes the proof of lemma 2.

Now, let us prove the surjectivity. Let $\mu = (W, A) \in \widehat{M}^n_r$. Note that $w_{n+1} \neq 0, \dots, w_{n+r} \neq 0$. We define the constants $l_1, \dots, l_r$ by
$$l_k = \begin{cases} 0 & \text{if } 0 \prec w_{n+k}, \\ 1 & \text{if } w_{n+k} \prec 0, \end{cases} \qquad k = 1, \dots, r.$$
Then it is easily verified that $h_1^{l_1} \circ \cdots \circ h_r^{l_r}(\mu) \in C^r$. Hence, by lemma 2, we can prove the surjectivity.

Next, we prove the injectivity. It is sufficient to prove that if $\mu \in B^n_r$ and $g \in G^n_r$ satisfy $g(\mu) \in B^n_r$, then $g(\mu) = \mu$. In particular, it is sufficient to prove that
$$\mathrm{id} \neq g \in G^n_r,\ \mu \in B^n_r \implies g(\mu) \notin B^n_r.$$
Hence, the proof of the injectivity is achieved by the following two lemmas.

Lemma 3. Let $\sigma \in S_r$ and $l_1, \dots, l_r \in \{0, 1\}$ such that $l_1 + \cdots + l_r \neq 0$. If $g = h_1^{l_1} \circ \cdots \circ h_r^{l_r} \circ q_\sigma$, then $g(\mu) \notin B^n_r$ for any $\mu \in B^n_r$.

Proof. Let $\mu = (W, A) \in B^n_r$, and write $g(\mu) = (\widetilde{W}, \widetilde{A})$. Suppose that $l_k = 1$, where $k$ is an integer such that $1 \le k \le r$. Since $0 \prec w_{n+1}, \dots, 0 \prec w_{n+r}$, we can easily show that $\widetilde{w}_{n+k} \prec 0$. Hence, $g(\mu) \notin B^n_r$. This completes the proof of lemma 3.

Lemma 4. Let $\sigma \in S_r$ such that $\sigma \neq \mathrm{id}$. Then $q_\sigma(\mu) \notin B^n_r$ for any $\mu \in B^n_r$.

Proof. Let $\mu = (W, A) \in B^n_r$, and write $q_\sigma(\mu) = (\widetilde{W}, \widetilde{A})$. Since $\sigma \neq \mathrm{id}$, there exist integers $k$ and $\ell$ such that $1 \le k < \ell \le r$ and $\sigma^{-1}(k) > \sigma^{-1}(\ell)$. Since
$a_1 \prec \cdots \prec a_r$, we can easily show that $\tilde{a}_\ell \prec \tilde{a}_k$. Hence, $q_\sigma(\mu) \notin B^n_r$. This completes the proof of lemma 4.

We have completed the proof of theorem 3.

8 Conclusion

We discussed the redundancy in describing systems by neural networks, that is, the property that distinct parameter values can determine the same system. In particular, we extended the work of Sussmann (1992) and Chen et al. (1993) on function description by FNNs to DS description by continuous-time RNNs. In order to approximate a DS on $\mathbb{R}^n$ using an RNN with $n$ visible units, an $n$-dimensional A-NDS can be used as the DS actually produced by such an RNN under an affine map from its visible state-space to its hidden state-space. We therefore set out to clarify the redundancy in describing A-NDSs by RNNs and affine maps: we constructed a unique parametric representation of $n$-dimensional A-NDSs and presented a nonredundant sufficient search set for the A-NDS-based DS approximation problem.

Acknowledgments
I express my sincere gratitude to Naonori Ueda for his encouragement and useful suggestions.

References

Baldi, P. (1995). Gradient descent learning algorithm overview: A general dynamical systems perspective. IEEE Transactions on Neural Networks, 6, 182–195.
Chen, A. M., Lu, H., & Hecht-Nielsen, R. (1993). On the geometry of feedforward neural network error surfaces. Neural Computation, 5, 910–927.
Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structure of multilayer perceptrons. Neural Networks, 13, 317–327.
Funahashi, K. (1994). Approximation theory, dynamical systems, and recurrent neural networks. In D. Bainov & A. Dishliev (Eds.), Proceedings of the Fifth International Colloquium on Differential Equations, 2 (pp. 51–58). Plovdiv, Bulgaria: SCT Publishing.
Funahashi, K., & Nakamura, Y. (1993). Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6, 801–806.
Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks (pp. 593–605). New York: IEEE.
Hecht-Nielsen, R. (1990). On the algebraic structure of feedforward network weight spaces. In R. Eckmiller (Ed.), Advanced neural computers. Amsterdam: Elsevier.
Hecht-Nielsen, R., & Evans, K. M. (1991). A method for error surface analysis. In M. Novák & E. Pelikán (Eds.), Theoretical aspects of neurocomputing (pp. 13–18). Singapore: World Scientific.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hirsch, M. W. (1991). Network dynamics: Principles and problems. In F. Pasemann & H. D. Doebner (Eds.), Neurodynamics (pp. 3–29). Singapore: World Scientific.
Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems, and linear algebra. New York: Academic Press.
Kimura, M., & Nakano, R. (1998). Learning dynamical systems by recurrent neural networks from orbits. Neural Networks, 11, 1589–1599.
Moreau, Y., Louiès, S., Vandewalle, J., & Brenig, L. (1999). Embedding recurrent neural networks into predator-prey models. Neural Networks, 12, 237–245.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6, 1212–1228.
Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5, 589–593.
Received October 29, 2001; accepted May 28, 2002.
LETTER
Communicated by Irwin Sandberg
Descartes' Rule of Signs for Radial Basis Function Neural Networks

Michael Schmitt
[email protected]
Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany

We establish versions of Descartes' rule of signs for radial basis function (RBF) neural networks. The RBF rules of signs provide tight bounds for the number of zeros of univariate networks with certain parameter restrictions. Moreover, they can be used to infer that the Vapnik-Chervonenkis (VC) dimension and pseudodimension of these networks are no more than linear. This contrasts with previous work showing that RBF neural networks with two or more input nodes have superlinear VC dimension. The rules also give rise to lower bounds for network sizes, thus demonstrating the relevance of network parameters for the complexity of computing with RBF neural networks.

1 Introduction
First published in 1637, Descartes' rule of signs is an astoundingly simple yet powerful method to estimate the number of zeros of a univariate polynomial (Descartes, 1954; Struik, 1986). Precisely, the rule says that the number of positive zeros of a real univariate polynomial is equal to the number of sign changes in the sequence of its coefficients, or is less than this number by a multiple of two (see, e.g., Henrici, 1974). We show that versions of Descartes' rule of signs can be established for radial basis function (RBF) neural networks. We focus on networks with standard, that is, gaussian, hidden units with parameters satisfying certain conditions. In particular, we formulate RBF rules of signs for networks with uniform widths. In the strongest version, they yield that for every univariate function computed by these networks, the number of zeros is bounded by the number of sign changes occurring in a sequence of the output weights. A similar rule is stated for networks with uniform centers. We further show that the bounds given by these rules are tight, a fact also known for the original Descartes' rule (Anderson, Jackson, & Sitharam, 1998; Grabiner, 1999). The RBF rules of signs are presented in section 2.

Neural Computation 14, 2997–3011 (2002) © 2002 Massachusetts Institute of Technology

We employ the rules to derive bounds on the Vapnik-Chervonenkis (VC) dimension and the pseudodimension of RBF neural networks. These dimensions measure the diversity of a function class and are basic concepts
studied in theories of learning and generalization (see, e.g., Anthony & Bartlett, 1999). The pseudodimension also has applications in approximation theory (Maiorov & Ratsaby, 1999; Dung, 2001). We show that the VC dimension and the pseudodimension of RBF neural networks with a single input node and uniform widths are no more than linear in the number of parameters. The same holds for networks with uniform centers. These results contrast nicely with previous work showing that these dimensions grow superlinearly for RBF neural networks with more than one input node and at least two different width values (Schmitt, 2002). In particular, the bound established there is $\Omega(W \log k)$, where $W$ is the number of parameters and $k$ the number of hidden units. Further, for the networks considered here, the linear upper bound improves a result of Bartlett and Williamson (1996), who have shown that gaussian RBF neural networks restricted to discrete inputs from $\{-D, \dots, D\}$ have these dimensions bounded by $O(W \log(WD))$. Moreover, the linear bounds are tight and considerably smaller than the bound $O(W^2 k^2)$ known for unrestricted gaussian RBF neural networks (Karpinski & Macintyre, 1997). We establish the linear bounds in section 3.

The computational power of neural networks is a well-recognized and frequently studied phenomenon. Not so well understood is the question of how large a network must be to perform a specific task. The rules of signs and the VC dimension bounds provided here give rise to answers in terms of lower bounds on the size of RBF networks. These bounds cast light on the relevance of network parameters for the complexity of computing with networks having uniform widths. In particular, we show that a network with zero bias must be more than twice as large to be as powerful as a network with variable bias. Further, to have the capabilities of a network with nonuniform widths, it must be larger in size by a factor of at least 1.5.
The latter result is of particular significance since, as shown by Park and Sandberg (1991), the universal approximation capabilities of RBF networks hold even for networks with uniform widths (see also Park & Sandberg, 1993). The bounds for the complexity of computing are derived in section 4.

The results presented here are concerned with networks having a single input node. Networks of this type are not, and have never been, of minor importance. This is evident from the numerous simulation experiments performed to assess the capabilities of neural networks for approximating univariate functions. For instance, Broomhead and Lowe (1988), Moody and Darken (1989), Mel and Omohundro (1991), Wang and Zhu (2000), and Li and Leiss (2001) have done such studies using RBF networks. Moreover, the best-known lower bound for the VC dimension of neural networks, being quadratic in the number of parameters, holds, among others, for univariate networks (Koiran & Sontag, 1997). Thus, the RBF rules of signs provide new insights for a neural network class that is indeed relevant in theory and practice.
2 Rules of Signs for RBF Neural Networks
We consider a type of RBF neural network known as a gaussian or standard RBF neural network.

Definition 1. An RBF neural network computes functions over the reals of the form
$$w_0 + w_1 \exp\left(-\frac{\|x - c_1\|^2}{\sigma_1^2}\right) + \cdots + w_k \exp\left(-\frac{\|x - c_k\|^2}{\sigma_k^2}\right),$$
where $\|\cdot\|$ denotes the Euclidean norm. Each exponential term corresponds to a hidden unit with center $c_i$ and width $\sigma_i$, respectively. The input nodes are represented by $x$, and $w_0, \dots, w_k$ denote the output weights of the network. Centers, widths, and output weights are the network parameters. The network has uniform widths if $\sigma_1 = \cdots = \sigma_k$; it has uniform centers if $c_1 = \cdots = c_k$. The parameter $w_0$ is also referred to as the bias or threshold of the network.

Note that the output weights and the widths are scalar parameters, whereas $x$ and the centers are tuples if the network has more than one input node. In the following, all functions have a single input variable unless indicated otherwise.

Definition 2. Given a finite sequence $w_1, \dots, w_k$ of real numbers, a change of sign occurs at position $j$ if there is some $i < j$ such that the conditions
$$w_i w_j < 0 \quad \text{and} \quad w_l = 0 \ \text{ for } i < l < j$$
are satisfied. Thus we can refer to the number of changes of sign of a sequence as a well-defined quantity.
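For later reference, the two definitions translate directly into code. The sketch below (names and NumPy usage are my own, not from the paper) evaluates a univariate gaussian RBF network and counts changes of sign:

```python
import numpy as np

def rbf(x, w0, weights, centers, widths):
    # univariate gaussian RBF network output (definition 1)
    x = np.asarray(x, dtype=float)[..., None]
    w = np.asarray(weights, dtype=float)
    c = np.asarray(centers, dtype=float)
    s = np.asarray(widths, dtype=float)
    return w0 + np.sum(w * np.exp(-(x - c) ** 2 / s ** 2), axis=-1)

def sign_changes(seq):
    # number of changes of sign (definition 2); zero entries are skipped
    nonzero = [v for v in seq if v != 0]
    return sum(1 for a, b in zip(nonzero, nonzero[1:]) if a * b < 0)

assert sign_changes([1, 0, -2, 3, 3]) == 2
assert np.isclose(rbf(0.0, 0.0, [1.0], [0.0], [1.0]), 1.0)
```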
are satised. Thus we can refer to the number of changes of sign of a sequence as a well-dened quantity. 2.1 Uniform Widths. We rst focus on networks with zero bias and show that the number of zeros of a function computed by an RBF network with uniform widths does not exceed the number of changes of sign occurring in a sequence of its output weights. Moreover, the difference of the two quantities is always even. Thus, we obtain for RBF networks a precise rephrasing of Descartes’ rule of signs for polynomials (see, e.g., Henrici, 1974, theorem 6.2d). We call it the strong RBF rule of signs. In contrast to
Descartes' rule, the RBF rule is not confined to the strictly positive numbers but is valid for the whole real domain. Its derivation generalizes Laguerre's proof of Descartes' rule (Pólya & Szegő, 1976).

Theorem 1 (Strong RBF Rule of Signs for Uniform Widths). Consider a function $f: \mathbb{R} \to \mathbb{R}$ computed by an RBF neural network with zero bias and $k$ hidden units having uniform width. Assume that the centers are ordered such that $c_1 \le \cdots \le c_k$, and let $w_1, \dots, w_k$ denote the associated output weights. Let $s$ be the number of changes of sign of the sequence $w_1, \dots, w_k$ and $z$ the number of zeros of $f$. Then $s - z$ is even and nonnegative.

Proof. Without loss of generality, let $k$ be the smallest number of RBF units required for computing $f$. Then we have $w_i \neq 0$ for $1 \le i \le k$ and, due to the equality of the widths, $c_1 < \cdots < c_k$.

We first show by induction on $s$ that $z \le s$ holds. Obviously, if $s = 0$, then $f$ has no zeros. Let $s > 0$, and assume the statement is valid for $s - 1$ changes of sign. Suppose that there is a change of sign at position $i$. As a consequence of the assumptions above, $w_{i-1} w_i < 0$ holds, and there is some real number $b$ satisfying $c_{i-1} < b < c_i$. Clearly, $f$ has the same number of zeros as the function $g$ defined by
$$g(x) = \exp\left(\frac{x^2 - 2bx}{\sigma^2}\right) \cdot f(x),$$
where $\sigma$ is the common width of the hidden units in the network computing $f$. According to Rolle's theorem, the derivative $g'$ of $g$ must have at least $z - 1$ zeros. Then the function $h$ defined as
$$h(x) = \exp\left(-\frac{x^2 - 2bx}{\sigma^2}\right) \cdot g'(x)$$
has at least $z - 1$ zeros as well. We can write this function as
$$h(x) = \sum_{l=1}^{k} w_l \, \frac{2(c_l - b)}{\sigma^2} \exp\left(-\frac{(x - c_l)^2}{\sigma^2}\right).$$
Hence, $h$ is computed by an RBF network with zero bias and $k$ hidden units having width $\sigma$ and centers $c_1, \dots, c_k$. Since $b$ satisfies $c_{i-1} < b < c_i$ and the sequence $w_1, \dots, w_k$ has a change of sign at position $i$, there is no change of sign at position $i$ in the sequence of output weights
$$w_1 \frac{2(c_1 - b)}{\sigma^2},\ \dots,\ w_{i-1} \frac{2(c_{i-1} - b)}{\sigma^2},\ w_i \frac{2(c_i - b)}{\sigma^2},\ \dots,\ w_k \frac{2(c_k - b)}{\sigma^2}.$$
At positions different from $i$, there is a change of sign if and only if the sequence $w_1, \dots, w_k$ has a change of sign at this position. Hence, the number of changes of sign in the output weights for $h$ is equal to $s - 1$. By the induction
hypothesis, $h$ has at most $s - 1$ zeros. As we have argued above, $h$ has at least $z - 1$ zeros. This yields $z \le s$, completing the induction step.

It remains to show that $s - z$ is even. For $x \to +\infty$, the term
$$w_k \exp\left(-\frac{(x - c_k)^2}{\sigma^2}\right)$$
becomes in absolute value larger than the sum of the remaining terms, so that its sign determines the sign of $f$. Similarly, for $x \to -\infty$, the behavior of $f$ is governed by the term
$$w_1 \exp\left(-\frac{(x - c_1)^2}{\sigma^2}\right).$$
Thus, the number of zeros of $f$ is even if and only if both terms have the same sign. And this holds if and only if the number of sign changes in the sequence $w_1, \dots, w_k$ is even. This proves the statement.

The above result establishes the number of sign changes as an upper bound for the number of zeros. It is easy to see that this bound cannot be improved. If the width is sufficiently small or the intervals between the centers are sufficiently large, the sign of the output of the network at some center $c_i$ is equal to the sign of $w_i$. Thus, we can enforce a zero between any pair of adjacent centers $c_{i-1}, c_i$ through $w_{i-1} w_i < 0$.

Corollary 1. The parameters of every univariate RBF neural network can be chosen such that all widths are the same and the number of changes of sign in the sequence of the output weights $w_1, \dots, w_k$, ordered according to $c_1 \le \cdots \le c_k$, is equal to the number of zeros of the function computed by the network.
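The strong rule is easy to confirm numerically. The sketch below builds a zero-bias network with uniform width and well-separated centers (all concrete values are my choices for illustration), counts sign crossings of the output on a fine grid, and checks $z \le s$ with $s - z$ even:

```python
import numpy as np

centers = np.array([0.0, 5.0, 10.0, 15.0])
weights = np.array([1.0, -1.0, 1.0, 1.0])   # s = 2 sign changes
sigma = 1.0

x = np.linspace(-5.0, 20.0, 200001)
f = np.sum(weights * np.exp(-(x[:, None] - centers) ** 2 / sigma ** 2), axis=1)

# count sign crossings on the grid as a proxy for the number of zeros
z = int(np.sum(np.sign(f[:-1]) * np.sign(f[1:]) < 0))
s = 2
assert z <= s and (s - z) % 2 == 0
assert z == s  # with well-separated centers, the bound is attained
```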
Next, we consider networks with freely selectable bias. The following result is the main step in deriving a bound on the number of zeros for these networks. It deals with a more general network class where the bias can be an arbitrary polynomial.

Theorem 2. Suppose $p: \mathbb{R} \to \mathbb{R}$ is a polynomial of degree $l$ and $g: \mathbb{R} \to \mathbb{R}$ is computed by an RBF neural network with zero bias and $k$ hidden units having uniform width. Then the function $p + g$ has at most $l + 2k$ zeros.
Proof. We perform induction on $k$. For the sake of formal simplicity, we say that the all-zero function is the (only) function computed by the network with zero hidden units. Then for $k = 0$, the function $p + g$ is a polynomial of degree $l$ that, by the fundamental theorem of algebra, has no more than $l$ zeros.

Now, let $g$ be computed by an RBF network with $k > 0$ hidden units, so that $p + g$ can be written as
$$p + \sum_{i=1}^{k} w_i \exp\left(-\frac{(x - c_i)^2}{\sigma^2}\right).$$
Let $z$ denote the number of zeros of this function. Multiplying with $\exp((x - c_k)^2 / \sigma^2)$, we obtain the function
$$\exp\left(\frac{(x - c_k)^2}{\sigma^2}\right) p + w_k + \sum_{i=1}^{k-1} w_i \exp\left(-\frac{(x - c_i)^2 - (x - c_k)^2}{\sigma^2}\right),$$
which must have $z$ zeros as well. Differentiating this function with respect to $x$ yields
$$\exp\left(\frac{(x - c_k)^2}{\sigma^2}\right)\left(p' + p\,\frac{2(x - c_k)}{\sigma^2}\right) + \sum_{i=1}^{k-1} w_i \exp\left(-\frac{(x - c_i)^2 - (x - c_k)^2}{\sigma^2}\right)\frac{2(c_i - c_k)}{\sigma^2},$$
which, according to Rolle's theorem, must have at least $z - 1$ zeros. Again, multiplication with $\exp(-(x - c_k)^2 / \sigma^2)$ leaves the number of zeros unchanged, and we get with
$$p' + p\,\frac{2(x - c_k)}{\sigma^2} + \sum_{i=1}^{k-1} w_i \exp\left(-\frac{(x - c_i)^2}{\sigma^2}\right)\frac{2(c_i - c_k)}{\sigma^2}$$
a function of the form $q + h$, where $q$ is a polynomial of degree $l + 1$ and $h$ is computed by an RBF network with zero bias and $k - 1$ hidden units having uniform width. By the induction hypothesis, $q + h$ has at most $l + 1 + 2(k - 1) = l + 2k - 1$ zeros. Since $z - 1$ is a lower bound, it follows that $z \le l + 2k$ as claimed.

From this, we immediately have a bound on the number of zeros for RBF networks with nonzero bias. We call this fact the weak RBF rule of signs. In contrast to the strong rule, the bound is not in terms of the number of sign changes but in terms of the number of hidden units. The naming, however, is justified, since the proofs of both rules are very similar. As for the strong rule, the bound of the weak rule cannot be improved.
which, according to Rolle’s theorem, must have at least z ¡ 1 zeros. Again, multiplication with exp(¡(x ¡ ck ) 2 / s 2 ) leaves the number of zeros unchanged, and we get with ³ ´ k¡1 X ( x ¡ ci ) 2 2(x ¡ ck ) 2(ci ¡ ck ) 0 C C exp ¡ p p wi s2 s2 s2 i D1 a function of the form q C h, where q is a polynomial of degree l C 1 and h is computed by an RBF network with zero bias and k ¡ 1 hidden units having uniform width. By the induction hypothesis, q C h has at most l C 1 C 2(k¡1) D l C 2k ¡ 1 zeros. Since z ¡ 1 is a lower bound, it follows that z · l C 2k as claimed. From this, we immediately have a bound for the number of zeros for RBF networks with nonzero bias. We call this fact the weak RBF rule of signs. In contrast to the strong rule, the bound is not in terms of the number of sign changes but in terms of the number of hidden units. The naming, however, is justied since the proofs for both rules are very similar. As for the strong rule, the bound of the weak rule cannot be improved. Let function f : R ! R be computed by an RBF neural network with arbitrary bias and k hidden units having uniform width. Then f has at most 2k zeros. Furthermore, this bound is tight. Corollary 2 (Weak RBF Rule of Signs for Uniform Widths).
Descartes’ Rule of Signs
3003
Figure 1: Optimality of the weak RBF rule of signs for uniform widths: a function having twice as many zeros as hidden units.
Proof. The upper bound follows from theorem 2 since $l = 0$. The lower bound is obtained using a network where the bias is negative and all other output weights are positive. For instance, $-1$ for the bias and $2$ for the weights are suitable. An example for $k = 3$ is displayed in Figure 1. For sufficiently small widths and large intervals between the centers, the output value is close to $1$ for inputs close to a center and approaches $-1$ between the centers and toward $+\infty$ and $-\infty$. Thus, each hidden unit gives rise to two zeros.
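This lower-bound construction is easy to check numerically. The sketch below (numpy; the specific centers, width, and evaluation grid are illustrative choices, not values from the text) counts the sign changes of such a network along a dense grid; each hidden unit contributes two zeros, giving $2k = 6$ for $k = 3$:

```python
import numpy as np

def rbf_net(x, centers, weights, sigma, bias=0.0):
    """Univariate RBF network: bias + sum_i w_i * exp(-(x - c_i)^2 / sigma^2)."""
    x = np.asarray(x, dtype=float)
    return bias + sum(w * np.exp(-((x - c) ** 2) / sigma ** 2)
                      for c, w in zip(centers, weights))

# Corollary 2's lower-bound construction: bias -1, all output weights +2,
# k = 3 units with small width and well-separated centers (illustrative values).
centers, weights, sigma, bias = [10.0, 20.0, 30.0], [2.0, 2.0, 2.0], 1.0, -1.0

xs = np.linspace(0.0, 40.0, 20001)
fx = rbf_net(xs, centers, weights, sigma, bias)
crossings = int(np.sum(fx[:-1] * fx[1:] < 0))
print(crossings)  # two zeros per hidden unit -> 2k = 6
```

The product test `fx[:-1] * fx[1:] < 0` counts transversal zero crossings; it is adequate here because the zeros are well separated relative to the grid spacing.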
The function in Figure 1 has only one change of sign: the bias is negative; all other weights are positive. Consequently, a strong rule of signs does not hold for networks with nonzero bias.

2.2 Uniform Centers. We now establish a rule of signs for networks with uniform centers. Here, in contrast to the rules for uniform widths, the bias plays a role in determining the number of sign changes.
Theorem 3 (RBF Rule of Signs for Uniform Centers). Suppose $f \colon \mathbb{R} \to \mathbb{R}$ is computed by an RBF neural network with arbitrary bias and $k$ hidden units having uniform centers. Assume the ordering $\sigma_1^2 \le \cdots \le \sigma_k^2$, and let $s$ denote the number of sign changes of the sequence $w_0, w_1, \ldots, w_k$. Then $f$ has at most $2s$ zeros.
Proof. Given $f$ as supposed, consider the function $g$ obtained by introducing the new variable $y = (x - c)^2$, where $c$ is the common center. Then

\[ g(y) = w_0 + w_1 \exp(-\sigma_1^{-2} y) + \cdots + w_k \exp(-\sigma_k^{-2} y). \]

Let $z$ be the number of zeros of $g$ and $s$ the number of sign changes of the sequence $w_0, w_1, \ldots, w_k$. Similarly as in theorem 1, we deduce $z \le s$ by induction on $s$. Assume there is a change of sign at position $i$, and let $b$ satisfy $\sigma_i^{-2} < b < \sigma_{i-1}^{-2}$. (Without loss of generality, the widths are all different.) By Rolle's theorem, the derivative of the function $\exp(by) \cdot g(y)$ has at least $z - 1$ zeros. Multiplying this derivative by $\exp(-by)$ yields

\[ w_0 b + w_1 (b - \sigma_1^{-2}) \exp(-\sigma_1^{-2} y) + \cdots + w_k (b - \sigma_k^{-2}) \exp(-\sigma_k^{-2} y), \]

with at least $z - 1$ zeros. Moreover, this function has at most $s - 1$ sign changes. (Note that $b > 0$.) This completes the induction.

Now, each zero $y \neq 0$ of $g$ gives rise to exactly two zeros $x = c \pm \sqrt{y}$ of $f$. (By definition, $y \ge 0$.) Further, $g(0) = 0$ if and only if $f(c) = 0$. Thus, $f$ has at most $2s$ zeros.

That the bound is optimal can be seen from Figure 2, which shows a function having three hidden units with center 20, widths $\sigma_1 = 1$, $\sigma_2 = 4$, $\sigma_3 = 8$, and output weights $w_0 = 1/2$, $w_1 = -3/4$, $w_2 = 2$, $w_3 = -2$. The example can easily be generalized to any number of hidden units.

3 Linear Bounds for VC Dimension and Pseudodimension
We apply the RBF rules of signs to obtain bounds for the VC dimension and the pseudodimension that are linear in the number of hidden units. First, we give the definitions.

Definition 3. A set $S$ is said to be shattered by a class $F$ of $\{0, 1\}$-valued functions if for every dichotomy $(S_0, S_1)$ of $S$ (where $S_0 \cup S_1 = S$ and $S_0 \cap S_1 = \emptyset$), there is some $f \in F$ such that $f(S_0) \subseteq \{0\}$ and $f(S_1) \subseteq \{1\}$. Let $\mathrm{sgn} \colon \mathbb{R} \to \{0, 1\}$ denote the function satisfying $\mathrm{sgn}(x) = 1$ if $x \ge 0$ and $\mathrm{sgn}(x) = 0$ otherwise. Given a class $G$ of real-valued functions, the VC dimension of $G$ is the largest integer $m$ such that there exists a set $S$ of cardinality $m$ that is shattered by the class $\{\mathrm{sgn} \circ g : g \in G\}$. The pseudodimension of $G$ is the VC dimension of the class $\{(x, y) \mapsto g(x) - y : g \in G\}$.

We extend this definition to neural networks in the obvious way. The VC dimension of a neural network is defined as the VC dimension of the class of functions computed by the network (obtained by assigning all possible values to its parameters), and analogously for the pseudodimension. It is evident that the VC dimension of a neural network is not larger than its pseudodimension.
Figure 2: Optimality of the RBF rule of signs for uniform centers: a function having twice as many zeros as sign changes.
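The optimality example for theorem 3 can likewise be checked numerically. The sketch below (numpy; only the evaluation grid is an invented choice) uses the parameters quoted above: common center 20, widths 1, 4, 8, and weight sequence $(1/2, -3/4, 2, -2)$ with $s = 3$ sign changes:

```python
import numpy as np

# Figure 2 example for theorem 3: one common center c = 20, widths
# sigma = (1, 4, 8), bias w0 = 1/2, output weights w = (-3/4, 2, -2).
# The weight sequence (1/2, -3/4, 2, -2) has s = 3 sign changes,
# so the function has at most 2s = 6 zeros -- and it attains them.
c = 20.0
sigmas = np.array([1.0, 4.0, 8.0])
w0, ws = 0.5, np.array([-0.75, 2.0, -2.0])

def f(x):
    x = np.asarray(x, dtype=float)[:, None]
    return w0 + np.sum(ws * np.exp(-((x - c) ** 2) / sigmas ** 2), axis=1)

xs = np.linspace(0.0, 40.0, 20001)
fx = f(xs)
crossings = int(np.sum(fx[:-1] * fx[1:] < 0))
print(crossings)  # 2s = 6
```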
3.1 Uniform Widths. The RBF rules of signs give rise to the following VC dimension bounds.

Theorem 4. The VC dimension of every univariate RBF neural network with $k$ hidden units, variable but uniform widths, and zero bias does not exceed $k$. For variable bias, the VC dimension is at most $2k + 1$.
Proof. Let $S = \{x_1, \ldots, x_m\}$ be shattered by an RBF neural network with $k$ hidden units. Assume that $x_1 < \cdots < x_m$, and consider the dichotomy $(S_0, S_1)$ defined by

\[ x_i \in S_0 \iff i \text{ even}. \]

Let $f$ be the function of the network that implements this dichotomy. We want to avoid that $f(x_i) = 0$ for some $i$. This is the case only if $x_i \in S_1$. Since $f(x_i) < 0$ for all $x_i \in S_0$, we can slightly adjust some output weight such that the resulting function $g$ of the network still induces the dichotomy $(S_0, S_1)$ but does not yield zero on any element of $S$. Clearly, $g$ must have a zero between $x_i$ and $x_{i+1}$ for every $1 \le i \le m - 1$. If the bias is zero, then, by theorem 1, $g$ has at most $k - 1$ zeros. This implies $m \le k$. From corollary 2, we have $m \le 2k + 1$ for nonzero bias.
As a consequence of this result, we now know the VC dimension of univariate uniform-width RBF neural networks exactly. In particular, we observe that the use of a bias more than doubles their VC dimension.

Corollary 3. The VC dimension of every univariate RBF neural network with $k$ hidden units and variable but uniform widths equals $k$ if the bias is zero and $2k + 1$ if the bias is variable.
Proof. Theorem 4 has shown that $k$ and $2k + 1$ are upper bounds, respectively. The lower bound $k$ for zero bias is implied by corollary 1. For nonzero bias, consider a set $S$ with $2k + 1$ elements and a dichotomy $(S_0, S_1)$. If $|S_1| \le k$, we put a hidden unit with positive output weight over each element of $S_1$ and let the bias be negative. Figure 1, for instance, implements the dichotomy $(\{5, 15, 25, 35\}, \{10, 20, 30\})$. If $|S_0| \le k$, we proceed the other way, with negative output weights and a positive bias.

It is easy to see that the result holds also if the widths are fixed. This observation is helpful in the following, where we address the pseudodimension and establish linear bounds for networks with uniform and fixed widths.

Theorem 5. Consider a univariate RBF neural network with $k$ hidden units having the same fixed width. For zero bias, the pseudodimension $d$ of the network satisfies $k \le d \le 2k$; if the bias is variable, then $2k + 1 \le d \le 4k + 1$.
Proof. The lower bounds follow from the fact that the VC dimension is a lower bound for the pseudodimension and that corollary 3 is also valid for fixed widths. For the upper bound, we use an idea that has been previously employed by Karpinski and Werther (1993) and Andrianov (1999) to obtain pseudodimension bounds in terms of zeros of functions. Let $F_{k,\sigma}$ denote the class of functions computed by the network with $k$ hidden units and width $\sigma$. Assume the set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq \mathbb{R}^2$ is shattered by the class

\[ \{(x, y) \mapsto \mathrm{sgn}(f(x) - y) : f \in F_{k,\sigma}\}. \]

Let $x_1 < \cdots < x_m$, and consider the dichotomy $(S_0, S_1)$ defined by

\[ (x_i, y_i) \in S_0 \iff i \text{ even}. \]

Since $S$ is shattered, there exist functions $f_1, f_2 \in F_{k,\sigma}$ that implement the dichotomies $(S_0, S_1)$ and $(S_1, S_0)$, respectively. That is, for $i = 1, \ldots, m$ we have

\[ f_2(x_i) \ge y_i > f_1(x_i) \iff i \text{ even}, \qquad f_1(x_i) \ge y_i > f_2(x_i) \iff i \text{ odd}. \]

This implies

\[ (f_1(x_i) - f_2(x_i)) \cdot (f_1(x_{i+1}) - f_2(x_{i+1})) < 0 \]
for $i = 1, \ldots, m - 1$. Therefore, $f_1 - f_2$ must have at least $m - 1$ zeros. On the other hand, since $f_1, f_2 \in F_{k,\sigma}$, the function $f_1 - f_2$ can be computed by an RBF network with $2k$ hidden units having equal width. Considering networks with zero bias, it follows from the strong RBF rule of signs (theorem 1) that $f_1 - f_2$ has at most $2k - 1$ zeros. For variable bias, the weak rule of signs (corollary 2) bounds the number of zeros by $4k$. This yields $m \le 2k$ for zero and $m \le 4k + 1$ for variable bias.

3.2 Uniform Centers. Finally, we state the linear VC dimension and pseudodimension bounds for networks with uniform centers.
Corollary 4. The VC dimension of every univariate RBF neural network with $k$ hidden units and variable but uniform centers is at most $2k + 1$. For networks with uniform and fixed centers, the pseudodimension is at most $4k + 1$.
Proof. Using the RBF rule of signs for uniform centers (theorem 3), the VC dimension bound is derived following the proof of theorem 4, and the bound for the pseudodimension is obtained as in theorem 5.

Employing techniques from corollary 3, it is not hard to derive that $k$ is a lower bound for the VC dimension of these networks and, hence, also for the pseudodimension.

4 Computational Complexity
In this section, we demonstrate how our results can be used to derive lower bounds on the size of networks required for simulating other networks that have more parameters. We focus on networks with uniform widths. Lower bounds for networks with uniform centers can be obtained analogously.

We know from corollary 3 that introducing a variable bias increases the VC dimension of uniform-width RBF networks by a factor of more than two. As an immediate consequence, we have a lower bound on the size of networks with zero bias for simulating networks with nonzero bias. It is evident that a network with zero bias cannot simulate a network with nonzero bias on the entire real domain. The question makes sense, however, if the inputs are restricted to a compact subset, as is done in statements about the uniform approximation capabilities of neural networks.

Corollary 5. Every univariate RBF neural network with variable bias and $k$ hidden units requires at least $2k + 1$ hidden units to be simulated by a network with zero bias and uniform widths.
In the following, we show that the number of zeros of functions computed by networks with zero bias and nonuniform widths need not be bounded
by the number of sign changes. Thus, the strong RBF rule of signs does not hold for these networks.

Figure 3: Function $f_2$ of lemma 1 has five zeros and is computed by an RBF network with four hidden units using two different widths.

Lemma 1. For every $k \ge 1$, there is a function $f_k \colon \mathbb{R} \to \mathbb{R}$ that has at least $3k - 1$ zeros and can be computed by an RBF neural network with zero bias, $2k$ hidden units, and two different width values.
Proof. Consider the function

\[ f_k(x) = \sum_{i=1}^{k} (-1)^i \left( w_1 \exp\left(-\frac{(x - c_i)^2}{\sigma_1^2}\right) - w_2 \exp\left(-\frac{(x - c_i)^2}{\sigma_2^2}\right) \right). \]

Clearly, it is computed by a network with zero bias and $2k$ hidden units having widths $\sigma_1$ and $\sigma_2$. It can also immediately be seen that the parameters can be instantiated such that $f_k$ has at least $3k - 1$ zeros. An example for $k = 2$ is shown in Figure 3. Here, the centers are $c_1 = 5$, $c_2 = 15$, the widths are $\sigma_1 = 1/2$, $\sigma_2 = 2$, and the output weights are $w_1 = 2$, $w_2 = 1$. By juxtaposing copies of this figure, examples for larger values of $k$ are easily obtained.

For zero bias, the family of functions constructed in the previous proof gives rise to a lower bound on the size of networks with uniform widths
simulating networks with nonuniform widths. In particular, uniform-width networks must be at least a factor of 1.5 larger to be as powerful as nonuniform-width networks. This separates the computational capabilities of the two network models and demonstrates the power of the width parameter.

Corollary 6. A univariate zero-bias RBF neural network with $2k$ hidden units of arbitrary widths cannot be simulated by a zero-bias network with uniform widths and fewer than $3k$ hidden units. This holds even if the nonuniform-width network uses only two different width values.
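Lemma 1's $k = 2$ instance (the Figure 3 parameters: centers 5 and 15, widths 1/2 and 2, output weights 2 and 1) can be checked with a few lines of numpy; the evaluation grid below is an illustrative choice:

```python
import numpy as np

# f_k from lemma 1 with k = 2: centers 5 and 15, widths 1/2 and 2,
# output weights w1 = 2, w2 = 1 (the Figure 3 instantiation).
def f2(x, c=(5.0, 15.0), s1=0.5, s2=2.0, w1=2.0, w2=1.0):
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for i, ci in enumerate(c, start=1):
        out += (-1) ** i * (w1 * np.exp(-((x - ci) ** 2) / s1 ** 2)
                            - w2 * np.exp(-((x - ci) ** 2) / s2 ** 2))
    return out

xs = np.linspace(0.0, 20.0, 2000)  # grid chosen so no point lands exactly on a zero
fx = f2(xs)
crossings = int(np.sum(fx[:-1] * fx[1:] < 0))
print(crossings)  # 3k - 1 = 5
```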
Proof. Consider again the functions $f_k$ defined in the proof of lemma 1. Each $f_k$ is computed by a network with $2k$ hidden units using two different width values and has $3k - 1$ zeros. By virtue of the strong RBF rule of signs (theorem 1), at least $3k$ hidden units are required for a network with uniform widths to have $3k - 1$ zeros.

The lower bounds derived here are concerned with zero-bias networks only. It remains open whether and to what extent nonuniform widths increase the computational power of networks with nonzero bias.

5 Conclusion
We have derived rules of signs for various types of univariate RBF neural networks. These rules have been shown to yield tight bounds on the number of zeros of functions computed by these networks. Two quantities have turned out to be crucial: (1) the number of sign changes in the sequence of output weights, for networks with zero bias and uniform widths and for networks with uniform centers, and (2) the number of hidden units, for networks with nonzero bias and uniform widths. The most challenging open problem remains to find a rule of signs for networks with nonuniform widths and nonuniform centers.

Using the rules of signs, linear bounds on the VC dimension and pseudodimension of the studied networks have been established. The smallest bound has been found for networks with zero bias and uniform widths, for which the VC dimension equals the number of hidden units and is, hence, less than half the number of parameters. Further, introducing one more parameter, the bias, more than doubles this VC dimension. The pseudodimension bounds have been obtained only for networks with fixed widths or fixed centers. Since the pseudodimension is defined via the VC dimension using one more variable, obtaining bounds for more general networks might be a matter of looking at higher input dimensions.

Finally, we have calculated lower bounds on the sizes of networks simulating other networks with more parameters. These results assume that the simulations are performed exactly. In view of the approximation capabilities of neural networks, it is desirable to have such bounds also for notions of approximative computation.

Acknowledgments
This work has been supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150.

References

Anderson, B., Jackson, J., & Sitharam, M. (1998). Descartes' rule of signs revisited. American Mathematical Monthly, 105, 447–451.
Andrianov, A. (1999). On pseudo-dimension of certain sets of functions. East Journal on Approximations, 5, 393–402.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press.
Bartlett, P. L., & Williamson, R. C. (1996). The VC dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8, 625–628.
Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355.
Descartes, R. (1954). The geometry of René Descartes with a facsimile of the first edition. (Trans. D. E. Smith and M. L. Latham). New York: Dover.
Dung, D. (2001). Non-linear approximations using sets of finite cardinality or finite pseudo-dimension. Journal of Complexity, 17, 467–492.
Grabiner, D. J. (1999). Descartes' rule of signs: Another construction. American Mathematical Monthly, 106, 854–856.
Henrici, P. (1974). Applied and computational complex analysis 1: Power series, integration, conformal mapping, location of zeros. New York: Wiley.
Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54, 169–176.
Karpinski, M., & Werther, T. (1993). VC dimension and uniform learnability of sparse polynomials and rational functions. SIAM Journal on Computing, 22, 1276–1285.
Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54, 190–198.
Li, S.-T., & Leiss, E. L. (2001). On noise-immune RBF networks. In R. J. Howlett & L. C. Jain (Eds.), Radial basis function networks 1: Recent developments in theory and applications (pp. 95–124). Berlin: Springer-Verlag.
Maiorov, V., & Ratsaby, J. (1999). On the degree of approximation by manifolds of finite pseudo-dimension. Constructive Approximation, 15, 291–300.
Mel, B. W., & Omohundro, S. M. (1991). How receptive field parameters affect neural learning. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 757–763). San Mateo, CA: Morgan Kaufmann.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246–257.
Park, J., & Sandberg, I. W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5, 305–316.
Pólya, G., & Szegő, G. (1976). Problems and theorems in analysis II: Theory of functions. Zeros. Polynomials. Determinants. Number theory. Geometry. Berlin: Springer-Verlag.
Schmitt, M. (2002). Neural networks with local receptive fields and superlinear VC dimension. Neural Computation, 14, 919–956.
Struik, D. J. (Ed.). (1986). A source book in mathematics, 1200–1800. Princeton, NJ: Princeton University Press.
Wang, Z., & Zhu, T. (2000). An efficient learning algorithm for improving generalization performance of radial basis function neural networks. Neural Networks, 13, 545–553.

Received April 29, 2002; accepted May 24, 2002.
LETTER
Communicated by Grace Wahba
Approximation Bounds for Some Sparse Kernel Regression Algorithms

Tong Zhang
[email protected]
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.

Gaussian processes have been widely applied to regression problems with good performance. However, they can be computationally expensive. In order to reduce the computational cost, there have been recent studies on using sparse approximations in gaussian processes. In this article, we investigate properties of certain sparse regression algorithms that approximately solve a gaussian process. We obtain approximation bounds and compare our results with related methods.

1 Introduction
Gaussian processes have recently attracted much attention since they are relatively simple and achieve good performance for regression problems (see Wahba, 1990; Williams & Rasmussen, 1996; Williams, 1998). From the convex analysis point of view, they can be written as dual forms of ridge regressions. By using kernel representations, gaussian processes can solve ridge regression problems in infinite-dimensional Hilbert spaces. One disadvantage associated with gaussian processes is that there is a parameter for every data point. This implies that the computation can still be very expensive when we have a large number of training data. In order to overcome this problem, various methods have been proposed in the literature to accelerate the computation. In this article, we are interested in certain sparse gaussian process algorithms such as those in Csató and Opper (2002), Gao, Wahba, Klein, and Klein (2001), Luo and Wahba (1997), Lin et al. (2000), Nair, Choudhury, and Keane (2001), Smola and Bartlett (2001), and Xiang and Wahba (1997).

The problem of achieving a sparse representation in a regression problem has drawn much attention in the literature. Some previous approaches are summarized in section 2. In section 3, we review the duality between ridge regression and gaussian processes and establish some notation and conventions used throughout the article. Section 4 outlines some sparse gaussian process algorithms. We then derive approximation bounds for these algorithms in section 5. However, such results do not distinguish the performance of greedy algorithms from randomized algorithms. Therefore, from a closely related but different point of view, we derive additional approximation results that are applicable only to greedy algorithms. Section 7 discusses relationships among different methods. In section 8, we use numerical examples to illustrate aspects of the proposed algorithms. Section 9 summarizes the article.

It is worth mentioning that this article focuses on the quality of sparse approximation. In particular, we do not consider the generalization performance of the proposed procedures, so that the article can focus on the main topic.

Neural Computation 14, 3013–3042 (2002) © 2002 Massachusetts Institute of Technology

2 Previous Work
We consider the standard linear regression model in the following setting. Assume we have a set of feature vectors $x_1, \ldots, x_n$, with corresponding real-valued output variables $y_1, \ldots, y_n$. Our goal is to find a linear weight vector $w$ such that $y \approx w^T x$ for all future feature data $x$. The quality of this approximation is measured by the squared loss

\[ \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2. \tag{2.1} \]

We would like to find a weight vector $w$ that gives a small loss in equation 2.1. Many methods have been proposed over the years to achieve sparseness in $w$ for regression problems. We shall mention only ideas that are directly related to this article.

In equation 2.1, we may consider the output variables $[y_i]_{i=1:n}$ as a vector in a (finite-dimensional) Hilbert space $H$ and consider the components of the training inputs $[x_{i,j}]_{i=1:n}$ as a library of basis vectors in $H$, where $x_{i,j}$ denotes the $j$th component of a training input $x_i$. In this regard, equation 2.1 can be viewed as the approximation of a vector (or function) in a Hilbert space by a linear combination of a library of basis vectors (or functions). Assume that the target vector is in the convex hull of the basis vectors. Jones (1992) observed that by using greedy approximation, one can achieve an error rate of $O(1/k)$ if we approximate the target vector using $k$ library vectors. The same idea was used in Barron (1993) to analyze the approximation properties of neural networks. In an algorithmic form, it led to the matching pursuit method of Mallat and Zhang (1993).

It is also possible to impose a sparseness constraint directly in the problem formulation itself. However, a difficulty is that the resulting formulation is typically NP-hard (see, e.g., Natarajan, 1995). One way to remedy this problem is to include a one-norm regularization condition in the ridge regression formulation, as in basis pursuit (Chen, Donoho, & Saunders, 1999). That is, we favor a weight vector $w$ that has a small one-norm, as in the following formulation:

\[ \hat{w} = \arg\min_{w} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|_1 \right]. \tag{2.2} \]
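Formulation 2.2 is convex and can be solved, for instance, by proximal gradient descent (ISTA), which alternates a gradient step on the squared loss with soft thresholding. This particular solver is not discussed in the article; the sketch below (numpy, synthetic data; all sizes and the value of $\lambda$ are illustrative) only demonstrates the sparseness effect of the one-norm penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [5.0, -4.0, 3.0]             # sparse ground truth (illustrative)
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 0.5                                  # one-norm penalty, illustrative value
# Lipschitz constant of the gradient of (1/n)||Xw - y||^2.
L = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max()
w = np.zeros(d)
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ w - y) / n     # gradient of the squared-loss term
    u = w - grad / L                       # gradient step
    w = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft thresholding

nnz = int(np.sum(np.abs(w) > 1e-8))
print(nnz, "of", d, "coordinates are nonzero")
```

With a sufficiently large one-norm penalty, most coordinates of $\hat{w}$ are driven exactly to zero, which is the sparse-solution behavior described above; a two-norm (ridge) penalty would leave all coordinates nonzero.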
Sparse Approximation Bounds
3015
The positive parameter $\lambda$ is used to control the complexity of $w$ as measured by its one-norm. In basis pursuit, we still have a convex programming problem, which can be solved efficiently. Furthermore, one-norm regularization leads to a sparse solution $\hat{w}$.

These methods deal directly with the primal form of the regression formulation, equation 2.1. They compute sparse coefficients for the primal weight vector $w$. As we mentioned, kernel methods such as gaussian processes are expressed in the dual form of (ridge) regression. The dual representation has the advantage that a finite-dimensional kernel formulation corresponds to a possibly infinite-dimensional primal formulation. It is thus useful to consider an approximation of the solution using a sparse dual representation (the primal weight vector can still be dense). One way to achieve dual sparseness is through Vapnik's $\epsilon$-insensitive loss function in a support vector machine (SVM; Vapnik, 1998): its equivalent primal formulation can be expressed as¹

\[ \hat{w} = \arg\min_{w} \left[ \frac{1}{n} \sum_{i=1}^{n} \max(|y_i - w^T x_i| - \epsilon, 0)^2 + \lambda w^2 \right], \tag{2.3} \]

where we use $w^2$ to denote $w^T w$. The relationship between this method and primal basis pursuit regression has been discussed in Girosi (1998).

In this article, we are mainly interested in achieving sparsity through greedy approximation. This approach has a number of advantages over a direct sparse convex programming formulation such as basis pursuit or support vector learning. One advantage is that the sparsity is directly controlled in a greedy approximation algorithm rather than through some other quantity such as $\lambda$ in equation 2.2 or $\epsilon$ in equation 2.3. Another advantage is that greedy approximation does not change the objective optimization function, while a direct method such as basis pursuit (or SVM) modifies the objective function by including a one-norm regularization (or $\epsilon$-insensitive loss). The third advantage, which is more important for this article, is that, in general, provably good approximation bounds can be obtained for greedy methods but not for direct convex programming methods.

Methods studied in this article are closely related to some recent work. One related method is the on-line update scheme in Csató and Opper (2002), which deals only with algorithmic issues. It can be regarded as a special form of the rank-one matrix update algorithms widely studied in numerical linear algebra (e.g., see Golub & Van Loan, 1996). That work does not contain theoretical results concerning the approximation rate, which we are interested in here. In the statistical literature, sparse approximation has been widely used for kernel spline regression (Gao et al., 2001; Luo & Wahba, 1997; Lin et al.,

¹ Originally, Vapnik's $\epsilon$-insensitive loss was linear. We use a quadratic form here for compatibility with other formulations.
2000; Xiang & Wahba, 1997). Although some theoretical justifications were presented, they were different from the techniques used here. In addition, they did not provide explicit approximation bounds, which we are interested in.

Although some greedy approximation bounds were described in Smola and Bartlett (2001), they relied on results proved in Natarajan (1995). These results specify the approximation quality of the proposed greedy procedure in terms of the best possible sparse approximation and of quantities that are not always well behaved. They are different from those in Barron (1993) and Jones (1992), where such quantities do not appear. Both types of bounds have advantages and disadvantages. In spite of their differences, we show that both types of bounds can be obtained using the same underlying technique.

In section 5, we derive bounds in the form of Barron (1993) and Jones (1992) by using a randomization technique. One disadvantage of this analysis is that the resulting approximation bound has a relatively slow convergence rate of $O(1/k)$, where $k$ is the number of sparse components. Note that such a rate is very typical for randomization or Monte Carlo methods. Another disadvantage is that, due to the underlying randomization, the analysis does not distinguish greedy algorithms from randomized algorithms. Therefore, in section 6, we investigate the behavior of greedy algorithms, where we improve the main result of Natarajan (1995) (and hence the bound given in Smola and Bartlett, 2001). In this analysis, the convergence rate of certain greedy algorithms can be as fast as $O((1 - \mu)^k)$, but $\mu$ depends on additional quantities that characterize the orthogonality of the underlying basis functions (which may not always be well behaved). We outline our proofs in such a way that it becomes apparent that the analyses of sections 5 and 6 are based on the same underlying principle. This investigation therefore establishes a fundamental relationship between Natarajan (1995) and Jones (1992).

Note that we study gaussian processes in the dual representation, which has an objective function different from the primal form of matching pursuit. Therefore, in order to analyze the system, we also employ ideas from on-line learning. As we see later, this analysis leads to some interesting theoretical consequences and algorithmic implications.

3 Preliminaries

3.1 Duality in Ridge Regression. Gaussian processes can be expressed in dual forms of regression problems. Since this duality is important in our discussion and in distinguishing primal greedy algorithms such as matching pursuit from their dual counterparts, we briefly review the basic ideas. This discussion also introduces the notation used throughout the article.

Consider the linear least-squares regression problem in equation 2.1. In this article, we assume that the data $x$ belong to a Hilbert space $H$ that may be infinite dimensional. If the dimension of the input-vector space is larger than the training sample size, the problem of minimizing the least-squares
loss in equation 2.1 is singular. This is because there are infinitely many solutions $w$ such that $w^T x_i = \hat{w}^T x_i$ for all $i$, which implies that all these solutions have the same expected loss. To eliminate this problem of uncertainty, one may consider a simple remedy called ridge regression (Hoerl & Kennard, 1970), which is widely used in statistics. In ridge regression, we obtain a weight $\hat{w}$ by minimizing the following modified primal objective function,

\[ \hat{w} = \arg\min_{w} \left[ \frac{1}{2n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \frac{\lambda}{2} w^2 \right], \tag{3.1} \]

where $\lambda$ is a regularization parameter that may be either predetermined or computed from the data using various empirical methods such as cross-validation. As we shall later see from our results, the regularization parameter $\lambda$ also influences the rate of sparse kernel approximation. Sparse representation itself can also be regarded as a form of regularization. There is generally a trade-off between $\lambda$ and $k$ (where $k$ is the number of basis functions in the sparse representation). However, there could be a broad region in the $(k, \lambda)$ space that gives very similar results. In such a case, a smaller $k$ is often preferred for its obvious computational advantage.

Equation 3.1 also has a natural Bayesian statistical interpretation. It is the maximum a posteriori (MAP) estimator corresponding to a probability model with gaussian noise and a gaussian weight prior. Although such an interpretation can provide valuable insights, it is not essential in this work. (Interested readers are referred to Smola & Bartlett, 2001; Wahba, 1990; Williams, 1998.)

Compared with the basis pursuit formulation 2.2, ridge regression in equation 3.1 employs a two-norm regularization term rather than a one-norm regularization. This two-norm regularization is necessary in order to obtain kernel methods such as gaussian processes. The conversion to kernel methods relies on a duality relationship between the feature-space representation and the sample-space representation, which we now derive.

By differentiating equation 3.1 with respect to $w$ at the optimal solution $\hat{w}$, we obtain a representation of $\hat{w}$ in the following form:

\[ \hat{w} = \sum_{i=1}^{n} \hat{\alpha}_i x_i, \tag{3.2} \]

\[ \hat{\alpha}_i = -\frac{1}{\lambda n} (\hat{w}^T x_i - y_i), \qquad i = 1, \ldots, n. \tag{3.3} \]
Equation 3.1 is called the primal formulation, which involves the primal variable $w$; $\hat{\alpha}$ is the dual variable. To obtain the corresponding dual
formulation in the dual variable $\alpha$, we can eliminate $\hat{w}$ and obtain

\[ \hat{\alpha}_i = -\frac{1}{\lambda n} \left( \sum_{k=1}^{n} \hat{\alpha}_k x_k^T x_i - y_i \right), \qquad i = 1, \ldots, n. \tag{3.4} \]

It is easy to check that $\hat{\alpha}$ is the solution of the following dual optimization problem:

\[ \hat{\alpha} = \arg\max_{\alpha} \left[ \lambda \sum_{i=1}^{n} \left( -\frac{\lambda n}{2} \alpha_i^2 + \alpha_i y_i \right) - \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k x_i^T x_k \right]. \tag{3.5} \]

The prediction $\hat{w}^T x$ can be expressed in the dual variable as

\[ \hat{w}^T x = \sum_{k=1}^{n} \hat{\alpha}_k x_k^T x. \tag{3.6} \]

An equivalent formulation of gaussian processes can be obtained by directly inserting equation 3.2 into the primal formulation 3.1 to obtain a system involving the dual variable $\alpha$:

\[ \hat{\alpha} = \arg\min_{\alpha} \left[ \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=1}^{n} \alpha_k x_k^T x_i - y_i \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k x_i^T x_k \right]. \tag{3.7} \]
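This duality can be checked numerically: rearranging equation 3.4 gives the linear system $(K + \lambda n I)\alpha = y$ with Gram matrix $K_{ik} = x_i^T x_k$, and equation 3.2 then recovers the primal minimizer. A minimal numpy sketch (synthetic data; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 5, 0.1
X = rng.standard_normal((n, d))        # rows are the x_i
y = rng.standard_normal(n)

# Primal ridge solution of equation 3.1: (X^T X / n + lam I) w = X^T y / n.
w_primal = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Dual solution: equation 3.4 rearranges to (K + lam * n * I) alpha = y.
K = X @ X.T                             # Gram matrix of inner products
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

# Equation 3.2: w_hat = sum_i alpha_i x_i reproduces the primal weights.
print(np.allclose(w_primal, X.T @ alpha))  # True
```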
The solution of equation 3.5 is equivalent to the solution of equation 3.7. For all primal vectors $w$, we define

\[ R(w) = \frac{1}{2n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \frac{\lambda}{2} w^2, \tag{3.8} \]

and for any dual vector $\alpha$, we define

\[ Q(\alpha) = \lambda \sum_{i=1}^{n} \left( -\frac{\lambda n}{2} \alpha_i^2 + \alpha_i y_i \right) - \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k x_i^T x_k. \tag{3.9} \]

We also define

\[ \Delta R(w) = R(w) - \inf_{w} R(w), \qquad \Delta Q(\alpha) = \sup_{\alpha} Q(\alpha) - Q(\alpha). \]

The following facts are useful in our later analysis. Although these results follow from simple linear algebra, we include proofs for completeness.
Sparse Approximation Bounds

Proposition 1. At the optimal solution ŵ of equation 3.1 and α̂ of equation 3.5, R(ŵ) = Q(α̂).

Proof. Since ŵ and α̂ satisfy equation 3.2, we have

$$
\begin{aligned}
Q(\hat{\alpha}) &= \sum_{i=1}^{n}\lambda\left(-\frac{\lambda n}{2}\hat{\alpha}_i^2 + \hat{\alpha}_i y_i\right) - \frac{\lambda}{2}\sum_{i=1}^{n}\sum_{k=1}^{n}\hat{\alpha}_i\hat{\alpha}_k x_i^T x_k \\
&= \sum_{i=1}^{n}\lambda\left[-\frac{\lambda n}{2}\hat{\alpha}_i^2 + \hat{\alpha}_i(y_i - \hat{w}^T x_i)\right] + \lambda\left[\hat{w}^T\sum_{i=1}^{n}\hat{\alpha}_i x_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n}\hat{\alpha}_i\hat{\alpha}_k x_i^T x_k\right] \\
&= \sum_{i=1}^{n}\left[-\frac{\lambda^2 n}{2}\frac{1}{\lambda^2 n^2}(\hat{w}^T x_i - y_i)^2 - \frac{1}{n}(\hat{w}^T x_i - y_i)(y_i - \hat{w}^T x_i)\right] + \lambda\left[\hat{w}^T\hat{w} - \frac{1}{2}\hat{w}^T\hat{w}\right] \\
&= \frac{1}{2n}\sum_{i=1}^{n}(\hat{w}^T x_i - y_i)^2 + \frac{\lambda}{2}\hat{w}^T\hat{w} = R(\hat{w}).
\end{aligned}
$$

Proposition 2.
Let ŵ be the solution of equation 3.1. Then for all w,

$$\Delta R(w) = \frac{1}{2}(w - \hat{w})^T G (w - \hat{w}),$$

where G denotes the following operator (in matrix form): $G = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T + \lambda I$, where I is the identity operator.

Proof. Note that

$$R(w) - R(\hat{w}) = \frac{1}{2}\left[\frac{1}{n}\sum_{i=1}^{n}\left((w - \hat{w})^T x_i\right)^2 + \lambda(w - \hat{w})^2\right] + (w - \hat{w})^T\left[\frac{1}{n}\sum_{i=1}^{n}(\hat{w}^T x_i - y_i)x_i + \lambda\hat{w}\right].$$

Now, using the first-order condition, 3.2, at the optimal solution ŵ, we have

$$\frac{1}{n}\sum_{i=1}^{n}(\hat{w}^T x_i - y_i)x_i + \lambda\hat{w} = 0.$$

Combining the above two equalities, we obtain the proposition.
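Proposition 2 says that the excess risk ΔR(w) is exactly a quadratic form in w − ŵ. A small numerical check (our illustration, not code from the article) confirms this identity for a random problem:

```python
# Numerical check of proposition 2:
# Delta R(w) = 0.5 * (w - w_hat)^T G (w - w_hat), with G = X^T X / n + lam I.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 4, 0.05
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def R(w):
    # The ridge objective of equations 3.1 / 3.8.
    return ((X @ w - y) ** 2).sum() / (2 * n) + lam * w @ w / 2

G = X.T @ X / n + lam * np.eye(d)
w_hat = np.linalg.solve(G, X.T @ y / n)   # first-order condition: G w = X^T y / n

w = rng.standard_normal(d)                # an arbitrary weight vector
delta_R = R(w) - R(w_hat)
quad = 0.5 * (w - w_hat) @ G @ (w - w_hat)
assert np.isclose(delta_R, quad)
print("proposition 2 verified, Delta R =", delta_R)
```

This identity is what later lets the approximation analysis of section 5 reason about ΔR through the operator G.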
Proposition 3. Let ŵ be the solution of equation 3.1. Then

$$\Delta Q(\alpha) \geq \frac{1}{2n}\sum_{i=1}^{n}\left(\hat{w}^T x_i - y_i + \lambda n\,\alpha_i\right)^2.$$

Proof. Let α̂ be the solution of equation 3.5. For any α, using equation 3.4, we have

$$\lambda\sum_{i=1}^{n}\left(\lambda n\,\hat{\alpha}_i - y_i + \sum_{k=1}^{n}\hat{\alpha}_k x_k^T x_i\right)(\hat{\alpha}_i - \alpha_i) = 0.$$

This implies that

$$\Delta Q(\alpha) = Q(\hat{\alpha}) - Q(\alpha) + \lambda\sum_{i=1}^{n}\left(\lambda n\,\hat{\alpha}_i - y_i + \sum_{k=1}^{n}\hat{\alpha}_k x_k^T x_i\right)(\hat{\alpha}_i - \alpha_i).$$

Now replace Q(·) by its definition in equation 3.9 in the above equation and combine relevant terms. We obtain

$$\Delta Q(\alpha) = \lambda\left[\sum_{i=1}^{n}\frac{\lambda n}{2}(\hat{\alpha}_i - \alpha_i)^2 + \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n}(\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_k - \alpha_k)\,x_i^T x_k\right]
\geq \lambda\sum_{i=1}^{n}\frac{\lambda n}{2}(\hat{\alpha}_i - \alpha_i)^2 = \frac{1}{2n}\sum_{i=1}^{n}\left(\hat{w}^T x_i - y_i + \lambda n\,\alpha_i\right)^2.$$
The above inequality follows from the positive semidefiniteness of the Gram matrix [xᵢᵀxₖ].

3.2 Kernel Formulation. Since in the dual formulation 3.5 and in the dual linear prediction representation 3.6 we need only to compute the data-space inner product of the form xₖᵀx, we may replace it by any symmetric positive kernel function K(xₖ, x), which leads to the general form of gaussian processes. The idea of using kernels has been used in statistics for a long time (see Wahba, 1990, for example). A kernel form of equation 3.7 can be obtained by replacing each inner product xₖᵀx with a symmetric positive kernel function K(xₖ, x) as

$$\hat{\alpha} = \arg\min_{\alpha}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(\sum_{k=1}^{n}\alpha_k K(x_k, x_i) - y_i\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n}\sum_{k=1}^{n}\alpha_i\alpha_k K(x_i, x_k)\right].$$
For technical reasons, we need to assume that a kernel function K(x₁, x₂) considered in this article can be decomposed as

$$K(x_1, x_2) = \sum_{j}\varphi_j(x_1)\,\varphi_j(x_2). \tag{3.10}$$
Mercer's theorem (Mercer, 1909) guarantees the existence of such a decomposition for all (positive) kernels that satisfy some moderate regularity conditions. For some practical kernels, such as the polynomial kernel K(x₁, x₂) = (1 + x₁ᵀx₂)^d and the exponential kernel K(x₁, x₂) = exp(x₁ᵀx₂), a decomposition in equation 3.10 can also be obtained directly from Taylor expansion. Using equation 3.10, the kernel function K(·, ·) induces a feature representation of data x as x → φ(x) = {φⱼ(x)}ⱼ. φ(x) lies in a Hilbert space H_K (reproducing kernel Hilbert space) under the two-norm inner product. A kernel predictor in the form of

$$p(x) = \sum_{i=1}^{n}\alpha_i K(x_i, x)$$

can be considered a linear predictor p(x) = wᵀφ(x) in H_K with weight w given by

$$w = \sum_{i=1}^{n}\alpha_i\,\varphi(x_i) \in H_K.$$

This implies that kernel methods can also be analyzed in the primal form by considering linear regression in H_K, where we simply replace a data point x by the corresponding feature-space representation φ(x) ∈ H_K. For notational simplicity, the analysis in this article is presented under the special case φ(x) = x. However, the discussion implies that the algorithms and analysis here are also suitable for general kernel methods, for which we simply replace x by φ(x). Since kernel methods require only the inner product computation, we do not need to know the feature representation φ(x) to work with a kernel algorithmically.

4 Sparse Regression Algorithms
We have already proved that the two systems 3.7 and 3.5 are equivalent. One might think that it is possible to work with either formulation to obtain a sparse representation of the dual variable α. Unfortunately, proposition 3 implies that in general, equation 3.5 is not appropriate for obtaining a sparse representation. To see this, we note that for a regression problem containing noise, there can be a significant portion of data such that ŵᵀxᵢ − yᵢ is bounded away from zero. In this case, for a sparse variable α, a significant portion of (ŵᵀxᵢ − yᵢ + λnαᵢ)² is also bounded away from zero. Now the inequality in proposition 3 implies that ΔQ(α) has to be bounded away from zero. That is, a sparse dual variable α does not approximately solve equation 3.5. We shall therefore work with equation 3.7 in order to obtain a sparse representation of the dual variable α. Equation 3.7 is simply equation 3.1 with the primal variable w substituted by equation 3.2. In order to obtain a sparse dual representation in equation 3.7, we only need to approximately minimize equation 3.1 with w expressed in the form w = Σᵢ αᵢxᵢ. It is easy to see that under this assumption, the objective function R(w) = R(Σᵢ αᵢxᵢ) can be expressed in a kernel form. In this article, we consider the following algorithms that approximately solve a gaussian process with a sparse dual variable.

Algorithm 1 (greedy sparse gaussian process)

    let w₀ = 0
    for k = 1, 2, ...
        find iₖ ∈ {1, ..., n}, aₖ, and bₖ that minimize R(bₖw_{k−1} + aₖx_{iₖ})
        let wₖ = bₖw_{k−1} + aₖx_{iₖ}
    end

Algorithm 2 (random sparse gaussian process)

    Given sparseness k
    Randomly draw k samples x_{i₁}, ..., x_{iₖ} uniformly from {x₁, ..., xₙ}
    find a₁, ..., aₖ that minimize R(Σⱼ₌₁ᵏ aⱼ x_{iⱼ})
    let wₖ = Σⱼ₌₁ᵏ aⱼ x_{iⱼ}

Algorithm 3 (κ-greedy sparse gaussian process)

    Given integer κ: 1 ≤ κ ≤ n
    let w₀ = 0
    for k = 1, 2, ...
        Randomly draw a subset Sₖ of size κ uniformly from {1, ..., n}
        find iₖ ∈ Sₖ, aₖ, and bₖ that minimize R(bₖw_{k−1} + aₖx_{iₖ})
        let wₖ = bₖw_{k−1} + aₖx_{iₖ}
    end
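Under the special case φ(x) = x, algorithm 1 above can be sketched in a few lines of numpy (an illustrative implementation of ours, not the article's code; function and variable names are assumptions). At each step, minimizing R(b·w + a·xᵢ) over (b, a) is a two-dimensional quadratic problem whose normal equations are solved in closed form:

```python
# Greedy sparse gaussian process (algorithm 1), special case phi(x) = x.
# Each step scans all n candidates, costing O(n^2) inner products per step.
import numpy as np

def R(w, X, y, lam):
    n = X.shape[0]
    return ((X @ w - y) ** 2).sum() / (2 * n) + lam * w @ w / 2

def greedy_sparse_gp(X, y, lam, steps):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        best = None
        for i in range(n):
            # Minimize R(b*w + a*x_i) over theta = (b, a): a 2x2 linear system.
            B = np.column_stack([X @ w, X @ X[i]])   # predictions of w and x_i
            P = np.column_stack([w, X[i]])           # the two search directions
            M = B.T @ B / n + lam * P.T @ P
            r = B.T @ y / n
            theta, *_ = np.linalg.lstsq(M, r, rcond=None)
            cand = P @ theta
            val = R(cand, X, y, lam)
            if best is None or val < best[0]:
                best = (val, cand)
        w = best[1]
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
lam = 0.1
w_hat = np.linalg.solve(X.T @ X / 50 + lam * np.eye(20), X.T @ y / 50)
w10 = greedy_sparse_gp(X, y, lam, steps=10)
# Delta R shrinks as terms are added (theorem 1 later gives an O(1/k) rate).
assert R(w10, X, y, lam) - R(w_hat, X, y, lam) < R(np.zeros(20), X, y, lam) - R(w_hat, X, y, lam)
print("Delta R after 10 greedy steps:", R(w10, X, y, lam) - R(w_hat, X, y, lam))
```

Algorithm 3 is obtained by restricting the inner loop to a random subset of size κ, and algorithm 2 by solving once in a randomly chosen k-dimensional subspace.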
Algorithm 1 is a greedy sparse approximation algorithm. In the statistical literature, this type of algorithm is sometimes referred to as an adaptation method. A work particularly relevant here is Luo and Wahba (1997), where the authors investigated greedy approximation for (kernel-based) splines. A similar algorithm has also been proposed in Smola and Bartlett (2001). However, there are two major differences. One difference is that unlike the algorithm proposed in Smola and Bartlett (2001), we do not use the dual functional Q(α). The reason is that in the most general situation, it is not possible to find a sparse solution α that approximately maximizes Q(α). The second difference is that at each step, in addition to aₖ, we also seek a scaling factor bₖ that shrinks the current weight w_{k−1}. As we shall see, this scaling is important in our analysis. Algorithm 2 is a random sparse approximation method that appeared in Williams and Seeger (2000). In our analysis, this algorithm has similar worst-case approximation properties as those of the greedy approximation algorithm. However, in practice, the greedy algorithm is usually superior in approximation quality. The full greedy algorithm is computationally quite expensive; therefore, a compromise is to consider searching through iₖ in a subset of {1, ..., n} at each step, as in algorithm 3. This idea of choosing a random subset, suggested in Smola and Bartlett (2001), is a compromise between the full greedy sparse algorithm and the random sparse algorithm. Note that in algorithm 2, we use the more expensive approach of computing the exact optimal approximation in the randomly selected k-dimensional subspace. In the other two algorithms, we optimize only in a two-dimensional subspace at each step. However, one may also consider using the full subspace optimization approach as in algorithm 2. The resulting approximation bounds for the modified algorithms will be the same.

Similar randomized algorithms for kernel methods have appeared in the literature, such as approximate smoothing spline methods (Gao et al., 2001; Xiang & Wahba, 1997). Although here we focus on the approximation aspect of different algorithms, it is useful to compare their computational costs. In the kernel formulation, we make the assumption that the inner product K(xᵢ, xₖ) can be computed in O(1) flops (floating-point operations). In the full greedy and the κ-greedy algorithms, assume we have stored the results of w_{k−1}ᵀxᵢ for i = 1, ..., n and w_{k−1}ᵀw_{k−1} at step k. Then with each fixed iₖ, the minimization of R(bₖw_{k−1} + aₖx_{iₖ}) requires O(n) flops. Therefore, the flops required to find iₖ, aₖ, and bₖ that minimize R(bₖw_{k−1} + aₖx_{iₖ}) are O(n²) for the full greedy algorithm and O(κn) for the κ-greedy algorithm. Using the recursion wₖ = bₖw_{k−1} + aₖx_{iₖ}, the computation of wₖᵀxᵢ for i = 1, ..., n and wₖᵀwₖ requires another O(n) operations. Overall, we need O(n²) flops per step in algorithm 1 and O(κn) flops per step in algorithm 3. The computational costs after k steps are O(kn²) for algorithm 1 and O(κkn) for algorithm 3.
Table 1: Computational Costs (in Flops) for k-Term Sparse Approximation.

    Algorithm      Full Greedy    Random    κ-Greedy
    Computation    O(kn²)         O(k²n)    O(κkn)
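The Random column of Table 1 comes from forming and solving a reduced k × k system. A linear-kernel numpy sketch (our illustration with hypothetical variable names, not the article's code) of this reduced solve for algorithm 2:

```python
# Reduced system A a = b for algorithm 2, using the linear kernel
# k(u, v) = u^T v. Forming A costs O(k^2 n); solving it costs O(k^3).
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, k = 40, 8, 0.1, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

idx = rng.choice(n, size=k, replace=False)   # randomly drawn basis points
K = X @ X.T                                  # Gram matrix, K_ij = k(x_i, x_j)
K_S = K[idx, :]                              # k x n block: k(x_{i_j}, x_i)

# Normal equations of minimizing R(sum_j a_j x_{i_j}) over a.
A = K_S @ K_S.T / n + lam * K_S[:, idx]
b = K_S @ y / n
a = np.linalg.solve(A, b)

w = X[idx].T @ a                             # the induced primal weight
# Sanity check: the gradient of R at w vanishes along the chosen directions.
grad = X.T @ (X @ w - y) / n + lam * w
assert np.allclose(X[idx] @ grad, 0)
print("reduced-system residual along basis:", np.abs(X[idx] @ grad).max())
```

Note that each entry of A aggregates over all n data points, which is the O(n)-per-entry cost discussed next.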
In order to solve the reduced problem in algorithm 2, we need to rewrite the system in the form Aα = b. Note that to form the (j₁, j₂)th element of A (j₁, j₂ = 1, ..., k), we need to evaluate Σᵢ₌₁ⁿ k(x_{ij₁}, xᵢ) k(x_{ij₂}, xᵢ), which requires O(n) flops. Therefore, overall, we need O(k²n) time to form the reduced system Aα = b. The solution of the reduced system Aα = b can then be obtained in O(k³) flops. Based on the above discussion, we summarize the computational costs of the three algorithms in Table 1.

5 Approximation Bounds for Sparse Regression Algorithms
Let w be an approximate solution to equation 3.1 and ŵ be the exact solution. We consider the following type of stochastic gradient update rule with a datum xᵤ:

$$w_u = w - \eta\left((\hat{w}^T x_u - y_u)x_u + \lambda w\right), \tag{5.1}$$

where η > 0 is a fixed variable that is to be determined later. Obviously, if w is of form 3.2, then wᵤ is also of form 3.2. Therefore, the update rule is suitable for kernel methods. Our goal is to find u in equation 5.1 so that R(wᵤ) is minimized. For convenience, we introduce the following shorthand notation: ψᵤ(w) = (ŵᵀxᵤ − yᵤ)xᵤ + λw. Using G in proposition 2, we define the following quantities:

$$a_w = \frac{1}{n}\sum_{u=1}^{n}\psi_u(w)^T G\,\psi_u(w), \tag{5.2}$$

$$b_w = \frac{1}{n}\sum_{i=1}^{n}(w^T x_i - y_i)x_i + \lambda w = Gw - \frac{1}{n}\sum_{i=1}^{n} y_i x_i. \tag{5.3}$$
Our analysis starts with the following lemma.

Lemma 1. Let η = 2λΔR(w)/a_w in equation 5.1. Then the following equality holds:

$$\frac{1}{n}\sum_{u=1}^{n} R(w_u) = R(w) - \frac{2\lambda^2\,\Delta R(w)^2}{a_w}. \tag{5.4}$$
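The equality 5.4 is exact, not just a bound, so it can be checked numerically. The following numpy sketch (ours, not from the article; names are assumptions) applies the update 5.1 with the step size of lemma 1 at every datum and compares the averaged objective against the right-hand side of 5.4:

```python
# Numerical check of lemma 1 for the stochastic update 5.1
# with optimal step size eta = 2 * lam * Delta_R / a_w.
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 25, 6, 0.2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def R(w):
    return ((X @ w - y) ** 2).sum() / (2 * n) + lam * w @ w / 2

G = X.T @ X / n + lam * np.eye(d)
w_hat = np.linalg.solve(G, X.T @ y / n)
w = rng.standard_normal(d)                       # current approximate solution

psi = (X @ w_hat - y)[:, None] * X + lam * w     # psi_u(w), one row per u
a_w = np.mean([p @ G @ p for p in psi])          # definition 5.2
dR = R(w) - R(w_hat)
eta = 2 * lam * dR / a_w

avg_R_u = np.mean([R(w - eta * p) for p in psi])
assert np.isclose(avg_R_u, R(w) - 2 * lam**2 * dR**2 / a_w)   # equation 5.4
print("lemma 1 equality holds; averaged objective:", avg_R_u)
```

Since the average over u decreases R by a fixed amount, the best single u decreases it at least as much, which is the observation driving lemma 2.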
Proof. Note that

$$R(w_u) - R(w) = \frac{1}{2}\left[\frac{1}{n}\sum_{i=1}^{n}\left((w_u - w)^T x_i\right)^2 + \lambda(w_u - w)^2\right] + (w_u - w)^T\left[\frac{1}{n}\sum_{i=1}^{n}(w^T x_i - y_i)x_i + \lambda w\right].$$

Now, summing over u, we obtain

$$\frac{1}{n}\sum_{u=1}^{n} R(w_u) - R(w) = \frac{1}{2n}\sum_{u=1}^{n}\left[\frac{1}{n}\sum_{i=1}^{n}\left((w_u - w)^T x_i\right)^2 + \lambda(w_u - w)^2\right] + \frac{1}{n}\sum_{u=1}^{n}(w_u - w)^T\left(\frac{1}{n}\sum_{i=1}^{n}(w^T x_i - y_i)x_i + \lambda w\right) = \frac{a_w}{2}\eta^2 - \eta\left(b_{\hat{w}} + \lambda(w - \hat{w})\right)^T b_w.$$

The first-order condition, equation 3.2, implies that $b_{\hat{w}} = 0$. Using proposition 2, we have

$$\left(b_{\hat{w}} + \lambda(w - \hat{w})\right)^T b_w = \lambda(w - \hat{w})^T(b_w - b_{\hat{w}}) = \lambda(w - \hat{w})^T G(w - \hat{w}) = 2\lambda\,\Delta R(w).$$

Therefore,

$$\frac{1}{n}\sum_{u=1}^{n} R(w_u) = R(w) + \frac{a_w}{2}\eta^2 - 2\eta\lambda\,\Delta R(w).$$

This leads to equation 5.4 by picking η such that η = 2λΔR(w)/a_w.

In lemma 1, we computed the average (over data points) of a specific gradient-descent rule, equation 5.1, with the optimal choice of η. Clearly, this gives an upper bound of the infimum over all such rules (such as those used in a sparse greedy algorithm). We thus have the following result:

Lemma 2.
Let M = maxᵢ ‖xᵢ‖₂. For all vectors w,

$$\frac{1}{n}\sum_{i=1}^{n}\inf_{a,b} R(a x_i + b w) \leq R(w) - \frac{\lambda^2\left(\Delta R(w)\right)^2}{\left(R(\hat{w})M^2 + \lambda\,\Delta R(w)\right)(M^2 + \lambda)}.$$
Proof. In order to bound the right-hand side of equation 5.4, we bound a_w in equation 5.2 as

$$a_w = \frac{1}{n}\sum_{u=1}^{n}\left[\frac{1}{n}\sum_{i=1}^{n}\left(\psi_u(w)^T x_i\right)^2 + \lambda\,\psi_u(w)^2\right] \leq \left(\frac{1}{n}\sum_{u=1}^{n}\psi_u(w)^2\right)\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \lambda\right).$$

The above inequality follows from the Cauchy-Schwarz inequality. Using $b_{\hat{w}} = 0$, we obtain

$$
\begin{aligned}
\frac{1}{n}\sum_{u=1}^{n}\psi_u(w)^2 &= \frac{1}{n}\sum_{u=1}^{n}\left((\hat{w}^T x_u - y_u)x_u + \lambda w\right)^2 = \frac{1}{n}\sum_{u=1}^{n}\left((\hat{w}^T x_u - y_u)x_u\right)^2 - \lambda^2\hat{w}^2 + \lambda^2(w - \hat{w})^2 \\
&\leq \frac{1}{n}\sum_{u=1}^{n}(\hat{w}^T x_u - y_u)^2 M^2 + \lambda(w - \hat{w})^T G(w - \hat{w}) = \frac{1}{n}\sum_{u=1}^{n}(\hat{w}^T x_u - y_u)^2 M^2 + 2\lambda\,\Delta R(w).
\end{aligned}
$$

The inequality follows directly from the assumption that xᵤ² ≤ M² and the fact that G − λI is positive semidefinite. The last equality follows from proposition 2. Now we have

$$a_w \leq \left[\frac{1}{n}\sum_{u=1}^{n}(\hat{w}^T x_u - y_u)^2 M^2 + 2\lambda\,\Delta R(w)\right](M^2 + \lambda) \leq \left(2R(\hat{w})M^2 + 2\lambda\,\Delta R(w)\right)(M^2 + \lambda). \tag{5.5}$$

We obtain from equation 5.4

$$\frac{1}{n}\sum_{u=1}^{n} R(w_u) \leq R(w) - \frac{\lambda^2\,\Delta R(w)^2}{\left(R(\hat{w})M^2 + \lambda\,\Delta R(w)\right)(M^2 + \lambda)}.$$

It is easy to see that this inequality implies the lemma. Now, by using induction and applying lemma 2 repeatedly, we obtain the following approximation bounds on sparse gaussian process algorithms.
Theorem 1. Let M = maxᵢ ‖xᵢ‖₂. Then in algorithm 1,

$$\Delta R(w_k) \leq \frac{\Delta R(0)\,(M^2 + \lambda)}{\dfrac{\lambda^2\,\Delta R(0)}{R(\hat{w})M^2 + \lambda\,\Delta R(0)}\,k + (M^2 + \lambda)}.$$

Furthermore, in algorithms 2 and 3, the expected approximation error satisfies

$$E\,\Delta R(w_k) \leq \frac{\Delta R(0)\,(M^2 + \lambda)}{\dfrac{\lambda^2\,\Delta R(0)}{R(\hat{w})M^2 + \lambda\,\Delta R(0)}\,k + (M^2 + \lambda)},$$

where the expectation E is taken over all possible randomized instances.

Proof. We prove the theorem by induction. The bounds clearly hold for k = 0. Now assume the theorem holds for wₖ at an integer k ≥ 0. Let

$$s = \frac{\dfrac{\lambda^2\,\Delta R(0)}{R(\hat{w})M^2 + \lambda\,\Delta R(0)}\,(k + 1) + (M^2 + \lambda)}{\Delta R(0)\,(M^2 + \lambda)}.$$
Then the induction hypothesis for wₖ implies that

$$s\,E\,\Delta R(w_k) - \frac{\lambda^2\,E\,\Delta R(w_k)}{\left(R(\hat{w})M^2 + \lambda\,\Delta R(0)\right)(M^2 + \lambda)} \leq 1. \tag{5.6}$$
In the above inequality, the expectation E is deterministic for algorithm 1 and is averaged over all possible randomized instances for algorithms 2 and 3. Using lemma 2, and noting that E ΔR(wₖ)² ≥ (E ΔR(wₖ))² always holds (Jensen's inequality), we obtain

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} E \inf_{a,b}\,\Delta R(a x_i + b w_k)
&\leq E\,\Delta R(w_k) - \frac{\lambda^2\,E\,\Delta R(w_k)^2}{\left(R(\hat{w})M^2 + \lambda\,\Delta R(0)\right)(M^2 + \lambda)} \qquad \text{(lemma 2)} \\
&\leq \frac{1}{s}\,s\,E\,\Delta R(w_k)\left(1 - \frac{\lambda^2\,E\,\Delta R(w_k)}{\left(R(\hat{w})M^2 + \lambda\,\Delta R(0)\right)(M^2 + \lambda)}\right) \qquad \text{(Jensen's inequality)} \\
&\leq \frac{1}{4s}\left[s\,E\,\Delta R(w_k) + 1 - \frac{\lambda^2\,E\,\Delta R(w_k)}{\left(R(\hat{w})M^2 + \lambda\,\Delta R(0)\right)(M^2 + \lambda)}\right]^2 \leq \frac{1}{s}.
\end{aligned}
$$
In the above, the third inequality follows from the algebraic inequality ab ≤ ¼(a + b)². The last inequality follows from equation 5.6. This inequality implies that the theorem holds for w_{k+1}. We also have the following corollary:

Corollary 1.
Let M = maxᵢ ‖xᵢ‖₂. Then in algorithm 1,

$$\Delta R(w_k) \leq \frac{R(0)\,(M^2 + \lambda)^2}{\lambda^2 k + (M^2 + \lambda)^2}.$$

Furthermore, in algorithms 2 and 3, the expected approximation error satisfies

$$E\,\Delta R(w_k) \leq \frac{R(0)\,(M^2 + \lambda)^2}{\lambda^2 k + (M^2 + \lambda)^2},$$

where the expectation E is taken over all possible randomized instances.

Proof. Observe that ΔR(0) ≤ R(0) and R(ŵ)M² + λΔR(0) ≤ R(0)(M² + λ). The corollary follows from theorem 1 and simple algebra.
For small-noise problems, theorem 1 can be significantly better than corollary 1. In this case, we typically choose a small λ and can assume that R(ŵ)/λ is small.² Theorem 1 implies an approximation bound of the order O(1/λk), while corollary 1 implies an approximation bound of the order O(1/λ²k). One drawback of our analysis is that the bounds described in theorem 1 are the same for all three sparse kernel regression algorithms considered here. In reality, algorithm 1 usually gives better sparse approximation than algorithm 3, which gives better sparse approximation than algorithm 2. An experiment is included in section 8 to demonstrate this phenomenon. One may also draw this conclusion from our analysis. Note that theorem 1 relies on lemma 2, which is an average bound for the stochastic gradient update, equation 5.1. However, in reality, a greedy algorithm can decrease the objective function much more quickly than the average case when the variance corresponding to the average is large. Unfortunately, this difference is difficult to quantify in our theoretical analysis. We have considered only the worst-case behavior in theorem 1, without any attempt to optimize the bounds. These results show that with fixed λ, the approximation error decreases at a rate of O(1/k), where k is the

² If the problem is noise free, that is, we can find a parameter w̃ such that ‖w̃‖ is small and y = w̃ᵀx, then it is easy to see that R(ŵ)/λ ≤ w̃² is also small.
number of nonzero coefficients in the sparse approximation. An interesting observation is that the bounds in theorem 1 do not depend on the sample size n. For some simplified problems, it was shown by Xiang (1996) that as n increases, a sparse approximation with k = O(n^a/λ^b) basis functions (where a and b are problem-dependent constants) can converge at the same rate to the true solution as that of using the full set of basis functions. Approximation bounds obtained in this section are complementary to such results. Although we do not have generalization error results, the approximation bounds imply the possibility of achieving a significant reduction in the number of basis functions without reducing the accuracy of the solution by much. The analysis given in this section has certain advantages over the analysis given in section 6 or that of Smola and Bartlett (2001). In particular, the bound they present relies on some additional quantities that can be ill behaved; in fact, the bound can become trivialized when the Gram matrix [xᵢᵀxⱼ], after appropriate column normalization, is nearly singular. Although the refined analysis given in section 6 will address this problem to a certain degree, it still relies on similar quantities that are not always well behaved. As a comparison, theorem 1 does not suffer from this problem. Bounds given in corollary 1 are expressed in terms of the initial approximation R(0), the maximum norm M = maxᵢ ‖xᵢ‖₂, and the regularization parameter λ. The first two quantities, which also appear in the matching pursuit approximation bound, are easy to understand. In addition, we show that the dependency on the regularization parameter λ cannot be avoided in the worst-case scenario. The following example demonstrates that if we allow λ → 0 and n → ∞, then any k-term (with k fixed) sparse approximation error can be bounded below by a positive number that is independent of k. Thus, if we let λ →
0, the best possible k-term sparse approximation error does not decrease to zero as k → ∞. We consider an orthonormal basis {eᵢ : i = 1, ..., d} in a d-dimensional real vector space. Let xᵢ = eᵢ and yᵢ = 1 for i = 1, ..., n, where n = d. For any λ > 0, the optimal solution ŵ of equation 3.1 is ŵ = Σᵢ₌₁ⁿ 1/(1 + λn) eᵢ. For any k, the best k-term approximation is given by wₖ = Σⱼ₌₁ᵏ 1/(1 + λn) e_{iⱼ}, where {iⱼ : j = 1, ..., k} is any k-member subset of {1, ..., n}. We have ΔR(wₖ) = R(wₖ) − R(ŵ) = [1/(2(1 + λn))](1 − k/n). Clearly, for all k, if we allow n → ∞ and λ = o(1/n), then ΔR(wₖ) → 0.5.

6 Additional Performance Bounds for Greedy Algorithms
Based on some statistical insights, Luo and Wahba (1997) argued that a greedy algorithm acts differently from a random or stratified sample of the basis functions. A greedy algorithm may perform better than a random algorithm, although the bounds given in the previous section share the same form. Note that the analysis is based on a randomized average. If the variance with respect to this randomization is large, then greedy algorithms can perform much better than the average. However, for technical reasons, it is difficult to estimate the variance directly and quantify the difference. In this section, we shall approach the problem from a slightly different point of view, similar to that of Natarajan (1995). Using the same underlying idea as that of section 5 (but with a different stochastic gradient-descent rule to focus on the high-variance part of equation 5.1), we are able to derive a bound that improves the main result in Natarajan (1995). Therefore, the analysis in this section establishes a relationship between the analysis given in section 5 and that of Natarajan (1995). For notational simplicity, we rewrite equation 3.8 as

$$R(w) = \frac{1}{2}w^T G w - w^T z + c, \tag{6.1}$$
where, for kernel regression, G is given in proposition 2, z = (1/n)Σᵢ₌₁ⁿ xᵢyᵢ, and c = (1/2n)Σᵢ₌₁ⁿ yᵢ². Our goal is to seek a sparse representation of w as

$$w = \sum_{i=1}^{n}\alpha_i x_i \tag{6.2}$$

to approximately minimize equation 6.1. Most of the derivation given in this section does not require specific forms of G and z (unless explicitly mentioned). However, we do assume that G is a symmetric positive definite linear operator for simplicity (though positive semidefiniteness would have been sufficient). This operator G induces a norm that will become useful in our analysis: ‖v‖_G = (vᵀGv)^{1/2}. It is useful to start with a relatively simple scenario and gain some insights into the performance of greedy algorithms.

Proposition 4. If xᵢᵀGxⱼ = 0 when i ≠ j, then each step of algorithm 1 computes an optimal sparse solution to equation 6.1:

$$R(w_k) = \inf_{(j_1, a_1), \ldots, (j_k, a_k)} R\left(\sum_{\ell=1}^{k} a_\ell x_{j_\ell}\right).$$

Proof.
Using the assumption xᵢᵀGxⱼ = 0 when i ≠ j, it is easy to verify that

$$R\left(\sum_{i=1}^{n} a_i x_i\right) = \sum_{i=1}^{n}\left[\frac{1}{2}a_i^2\,x_i^T G x_i - a_i\,x_i^T z\right] + c.$$

The above is minimized at âᵢ = xᵢᵀz/(xᵢᵀGxᵢ) (we let âᵢ = 0 when xᵢᵀGxᵢ = 0), and it is easy to check:

$$\Delta R\left(\sum_{i=1}^{n} a_i x_i\right) = \sum_{i=1}^{n}\frac{1}{2}(a_i - \hat{a}_i)^2\,x_i^T G x_i. \tag{6.3}$$
From equation 6.3, we can see that given any w = Σᵢ₌₁ⁿ aᵢxᵢ, if we want to minimize ΔR(Σᵢ₌₁ⁿ aᵢxᵢ) by modifying only one component aᵤ, then it is necessary to choose the component u with the largest value of (aᵤ − âᵤ)²xᵤᵀGxᵤ, and we should change aᵤ to âᵤ (unless xᵤᵀGxᵤ = 0). Clearly, this implies that starting with w₀ = 0, each time we will select one component aₖx_{iₖ} in algorithm 1 with the largest âᵤ²xᵤᵀGxᵤ value among all remaining components not selected so far. This implies that after k iterations, we will find components i₁, ..., iₖ with the k largest values of âᵢ²xᵢᵀGxᵢ. Moreover, bⱼ = 1 and (aⱼ − â_{iⱼ})²x_{iⱼ}ᵀGx_{iⱼ} = 0 for 1 ≤ j ≤ k. This is clearly the optimal way of choosing wₖ to minimize equation 6.3 among all weight vectors of form 6.2 with k nonzero components.

The above result shows that the greedy algorithm works optimally when the basis vectors are G-orthogonal to each other. This is because the basis vectors work independently from each other, and greedy search will select the optimal component. For the kernel regression problems considered in this article, the orthogonality condition in proposition 4 will be satisfied when xᵢᵀxⱼ = 0 (i ≠ j). Since in this case the greedy algorithm is optimal, it can perform significantly better than random sampling. In certain methods, basis functions are specifically chosen to be orthogonal (for example, a Fourier or wavelet basis). If such basis functions are used, greedy algorithms are superior to random sampling. If the basis vectors are not G-orthogonal to each other, then the greedy algorithm may not perform optimally. However, if the basis vectors are nearly G-orthogonal, then near-optimal performance can be achieved. For technical reasons, it is necessary to modify algorithm 1. In the following, we use span{x_{i₁}, ..., x_{iₖ}} to denote the subspace spanned by x_{i₁}, ..., x_{iₖ}.

Algorithm 4 (subspace greedy sparse gaussian process)
    let w₀ = 0
    for k = 1, 2, ...
        find iₖ ∈ {1, ..., n} and wₖ ∈ span{x_{i₁}, ..., x_{iₖ}} to minimize R(wₖ)
    end
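A compact numpy illustration (our sketch under the special case φ(x) = x, not the article's code; helper names are assumptions) of algorithm 4: at every step, the candidate basis vector is chosen by re-solving the full least-squares problem in the enlarged subspace.

```python
# Subspace greedy sparse gaussian process (algorithm 4), phi(x) = x.
import numpy as np

def R(w, X, y, lam):
    n = X.shape[0]
    return ((X @ w - y) ** 2).sum() / (2 * n) + lam * w @ w / 2

def subspace_greedy(X, y, lam, steps):
    n, d = X.shape
    idx, w = [], np.zeros(d)
    for _ in range(steps):
        best = None
        for i in range(n):
            if i in idx:
                continue
            P = X[idx + [i]]                     # candidate basis vectors
            M = P @ X.T @ X @ P.T / n + lam * P @ P.T
            a = np.linalg.lstsq(M, P @ X.T @ y / n, rcond=None)[0]
            cand = P.T @ a                       # optimal weight in span(P)
            val = R(cand, X, y, lam)
            if best is None or val < best[0]:
                best = (val, i, cand)
        idx.append(best[1])
        w = best[2]
    return w, idx

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
w5, idx = subspace_greedy(X, y, lam=0.1, steps=5)
assert len(set(idx)) == 5                        # five distinct basis vectors
print("R after 5 subspace-greedy steps:", R(w5, X, y, 0.1))
```

Because each step optimizes over the whole selected subspace rather than a two-dimensional one, this variant is more expensive per step than algorithm 1 but never worse in objective value at a given sparseness.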
Clearly the analysis given in section 5 still applies to algorithm 4. In the following, we derive another bound for this algorithm that improves the main result of Natarajan (1995).
Definition 1. Let V be a vector space, and consider x₁, ..., x_N ∈ V. Let V₀ be a subspace of V. Define

$$\rho(x_1, \ldots, x_N; V_0) = \sup_{\alpha \in \mathbb{R}^N,\, v \in V_0} \frac{\sum_{i=1}^{N}\alpha_i^2\,\|x_i\|_G^2}{\left\|\sum_{i=1}^{N}\alpha_i x_i + v\right\|_G^2}.$$

Note that ρ ≥ 1, and the equality is achieved if we assume that xᵢᵀGxⱼ = 0 when i ≠ j and xᵢᵀGv = 0 when v ∈ V₀. This means that the above quantity measures the degree of orthogonality of the underlying basis vectors.

Theorem 2. Consider a permutation (j₁, ..., jₙ) of (1, ..., n) and a number N: 1 ≤ N < n. Let V₁ = span{x_{j₁}, ..., x_{j_N}} and V₂ = span{x_{j_{N+1}}, ..., x_{jₙ}}. Then for all w̄ ∈ V₁ and c > 0, wₖ obtained in algorithm 4 satisfies

$$\Delta R(w_k) \leq \max\left((1 + c)^2\,\Delta R(\bar{w}),\ \left[1 - \frac{c^2}{N(2 + c)^2\,\rho(x_{j_1}, \ldots, x_{j_N}; V_2)}\right]^k \Delta R(0)\right).$$

Proof. See the appendix.
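For the special case V₀ = {0}, the supremum in definition 1 is a Rayleigh quotient, so ρ can be computed as the largest generalized eigenvalue of the pair (D, S), where D is the diagonal of G-norms and S the G-inner-product Gram matrix. A small numeric illustration (ours, not from the article):

```python
# rho(x_1,...,x_N; {0}) as a generalized eigenvalue problem D v = rho * S v,
# with S_{ij} = x_i^T G x_j and D = diag(||x_i||_G^2).
import numpy as np

def rho(Xb, G):
    S = Xb @ G @ Xb.T                # G-inner-product Gram matrix
    D = np.diag(np.diag(S))          # diagonal holds ||x_i||_G^2
    vals = np.linalg.eigvals(np.linalg.solve(S, D))
    return float(np.max(vals.real))

G = np.eye(3)
# G-orthogonal basis vectors: rho attains its minimum value 1.
assert np.isclose(rho(np.eye(3), G), 1.0)
# Nearly parallel vectors: rho blows up, signalling a near-singular Gram matrix,
# which is exactly when the rate in theorem 2 degrades.
Xb = np.array([[1.0, 0.0, 0.0], [1.0, 1e-3, 0.0]])
print("rho for nearly parallel vectors:", rho(Xb, G))
```

This makes concrete the earlier remark that bounds depending on such quantities can become ill behaved for certain problems.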
From the appendix, we can see that the proof of theorem 2 is very similar to the analysis given in section 5. The resulting bound improves the main result in Natarajan (1995), which, if adapted to our notation, can be stated as follows. Under the assumptions of theorem 2, ΔR(wₖ) ≤ 4ΔR(w̄) after at most

$$\bar{k} = \left\lceil 9N\rho(x_1, \ldots, x_n; 0)\,\ln\frac{4\,\Delta R(0)}{\Delta R(\bar{w})}\right\rceil$$

steps. As a comparison, the following bound can be obtained by setting c = 1 in theorem 2:

$$\bar{k} = \left\lceil 9N\rho(x_{j_1}, \ldots, x_{j_N}; V_2)\,\ln\frac{4\,\Delta R(0)}{\Delta R(\bar{w})}\right\rceil.$$

Since
$$\rho(x_1, \ldots, x_n; 0) = \sup_{\alpha \in \mathbb{R}^n}\frac{\sum_{i=1}^{n}\alpha_i^2\,\|x_{j_i}\|_G^2}{\left\|\sum_{i=1}^{n}\alpha_i x_{j_i}\right\|_G^2} \geq \sup_{\alpha \in \mathbb{R}^n}\frac{\sum_{i=1}^{N}\alpha_i^2\,\|x_{j_i}\|_G^2}{\left\|\sum_{i=1}^{n}\alpha_i x_{j_i}\right\|_G^2} = \rho(x_{j_1}, \ldots, x_{j_N}; V_2),$$

our bound is always better. The difference can be significant when V₂ contains a large number of basis vectors (this always happens in the case of sparse approximation) that are not orthogonal.
It is worth mentioning that due to a somewhat different reasoning used in Natarajan (1995), it does not appear that their derivation can be easily refined to obtain improved results such as theorem 2. We have achieved this improvement by identifying the essential element in their analysis and relating it to the randomization idea presented in section 5. Consequently, a number of major modifications are made in the approach presented in the appendix. On one hand, this simplifies their proof; on the other hand, we are also able to obtain an improved bound.

7 Relationships Among Different Methods
Since our analysis of sparse gaussian process algorithms is based on the stochastic gradient-descent updates 5.1 and A.2, it is closely related to on-line learning, which in general is based on a similar form of stochastic gradient-descent update as follows: wᵤ = w − η((wᵀxᵤ − yᵤ)xᵤ + λw). Note that in both equations 5.1 and A.2, we have used quantities that are not available to on-line algorithms. The usual mistake-bound framework in on-line learning focuses on worst-case upper bounds for stochastic gradient descent over a sequence of data. On the other hand, in lemma 2, we consider upper bounds on the average improvement of stochastic gradient descent over the training data. This implies that we use different criteria to measure the performance of a stochastic gradient-descent rule. The analysis in section 5 is also closely related to matching pursuit-type greedy approximations. However, in matching pursuit, one works with the primal variable w and seeks a sparse representation of ŵ. The techniques used in the proofs of the approximation rates are closely related. Both employ the same averaging technique, which results in a rate of O(1/k). In matching pursuit, the averaging is with respect to a measure that is not known a priori. This means that the proof of matching pursuit is nonconstructive, and one has to employ the full greedy approach to achieve the approximation rate of O(1/k). The proof given in section 5 is constructive, and the averaging is with respect to a uniform distribution over the observed samples. As a comparison, the proof of theorem 2 relies on a nonconstructive randomization measure. It also implies a faster convergence rate when some additional quantities are well behaved. From our analysis, a close relationship between Natarajan (1995) and Jones (1992) can be established. Another related approach is to apply sparse approximation directly to the dual objective function Q(α).
By proposition 3, we know that this approach fails for noisy problems. As we have pointed out, an SVM with Vapnik's ε-insensitive loss slightly modifies the dual objective function Q(α) so that its solution may directly yield a sparse dual variable α̂. This is similar to basis pursuit, which incorporates a primal sparseness condition on ŵ into the primal objective function. Our analysis implies that SVMs can produce
Figure 1: Approximation performance of different algorithms: ΔR as a function of sparseness k.
only sparse coefficients for nearly noise-free problems. However, the sparse approximation algorithms studied in this article have provable approximation bounds for both noise-free and noisy problems.

8 Experiments

In this section, we use two experiments to study some aspects of the proposed sparse gaussian process algorithms as well as the corresponding theoretical bounds. Since we are mainly interested in the bounds in section 5, we will not include results on algorithm 4. For easier control of the experimental design, we use artificially generated data, which are sufficient for the purpose of this article. Real-data experiments for similar algorithms can be found in Gao et al. (2001), Luo and Wahba (1997), Nair et al. (2001), Smola and Bartlett (2001), and Xiang and Wahba (1998).

The first experiment compares the three different sparse gaussian process algorithms proposed in section 4. Although in section 5 we can derive only the same approximation bound for all three algorithms, the experiment shows that the greedy approach achieves better approximation quality in practice. We use "greedy" to denote algorithm 1, "κ-greedy" to denote algorithm 3, and "random" to denote algorithm 2. We consider a regression problem in d = 1000 dimensions, with n = 1000 samples. Each input datum x is a sparse vector, where each of its components is a random number independently generated as 0 with probability 0.99 and a uniform random number in (0, 1) with probability 0.01. We also generate a random vector w̄ with each component a uniform random number in (0, 1). We then set yᵢ = w̄ᵀxᵢ + νᵢ, where νᵢ is a uniform noise in (−1, 1). We set λ = 0.1 and compare the three algorithms considered in this article. In this example, R(ŵ) ≈ 3.03 and ΔR(0) ≈ 0.90. In Figure 1, we compare
Figure 2: Variance in randomized algorithms: ΔR as a function of sparseness k.
the sparse approximation performance of the algorithms, where sparseness k is the number of nonzero terms in the approximation, and approximation error is D R. For algorithms 2 and 3 that are random, the curves are obtained by averaging over ve different random runs. In Figure 2, we plot the average (the “avg” curves), the best (the “min” curves), and the worst (the “max” curves) values of D R over the ve random runs. This result suggests that the variance caused by randomization is very small. In Figure 3, we compare the approximation performance for different values of l. We also x the sparseness k at k D 50 for illustration purpose. It is appropriate to measure the rate of approximation as D R (w k ) / D R (0). Although the result does not show the behavior predicted by theorem 1, which implies that the approximation rate degrades when l ! 0, this does not suggest a contradiction since theorem 1 gives only the worst-case behavior. To illustrate the phenomenon that the proposed sparse gaussian process algorithms converge more slowly as l ! 0, we consider another regression problem in d D 500 dimension with n D 500 samples. In this problem, each component of an input datum x is generated as one plus a random number uniformly distributed in (¡0.1, 0.1) . Each output y is generated as a random number uniformly distributed in (0, 2). The approximation rates of different algorithms at different l are plotted in Figure 4. The example shows that the rate of approximation degrades at a rate of 1 / l as l ! 0. That all three algorithms perform similarly in this case is not too surprising since data are all similar, with each component having value one plus noise. Therefore, random choices and greedy choices do not make much difference. The reason that the random algorithm is slightly better for small l is due to the fact that in algorithm 2, we use the more expensive approach of computing the exact optimal approximation in the randomly selected k-dimensional subspace. 
In the other two algorithms, we only optimize in a two-dimensional subspace at each step, which does not give the optimal approximation within the greedily selected k-dimensional subspace. However, it is interesting to observe that even in this case, greedy algorithms show advantages with large λ.

Figure 3: Approximation rates as functions of λ (k = 50).

9 Summary
In this article, we considered a few procedures for obtaining sparse approximations in a gaussian process. In section 5, we obtained approximation error bounds of the form O(1/k), where k is the number of nonzeros in the dual variable. Such a rate is typical for the randomization-based analysis of Jones (1992). In addition, we derived a bound of the form O((1 − μ)^k) for a greedy algorithm, but μ relies on some additional quantities that may be ill behaved for certain problems. This bound improves a related result in Natarajan (1995). We also included some experimental comparisons for the proposed sparse gaussian processes. We demonstrated that in practice, greedy search can usually give better sparse approximation than randomized sample selection. However, the difference can be small for some problems and large for others. This trade-off between randomization and greedy search is consistent with our theory. Although the analysis provides valuable theoretical insights into certain sparse approximation algorithms, the obtained bounds are not necessarily optimal. In particular, we do not have lower bounds that can match the derived upper bounds. It also remains an interesting issue how tight and useful these approximation bounds are for practical kernel regression problems.

Figure 4: Approximation rates as functions of λ (k = 50).

Appendix: Proof of Theorem 2
The proof is similar to that of theorem 1. For easy comparison, we use the same notation as in section 5. Without loss of generality, we assume that $j_\ell = \ell$ for $\ell = 1, \ldots, n$ throughout the proof. Assume that $w = w_k$ for some $k$ in algorithm 4. Under the assumptions of theorem 2, we can decompose $w$ as $w = v_1 + v_2$, where $v_1 \in V_1$ and $v_2 \in V_2$. We assume that $v_1$ is a linear combination of those basis vectors $x_{i_j}$ found by algorithm 4 that lie in $\{x_1, \ldots, x_N\}$, and $v_2$ is a linear combination of those basis vectors found by algorithm 4 that lie in $\{x_{N+1}, \ldots, x_n\}$. The algorithm implies that $R(v_1 + b v_2)$ is minimized at $b = 1$. Differentiating $R(v_1 + b v_2)$ in equation 6.1 with respect to $b$, it is easy to check that we have the following first-order condition:

$$v_2^T (Gw - z) = 0. \qquad (A.1)$$

Note that this equation is the only part of our analysis that differentiates algorithm 4 from algorithm 1. Following the same idea used in section 5, we will use a stochastic gradient rule to derive an equality similar to equation 5.4. Observe that $v_1 - \bar{w} \in V_1$; we can thus represent it as

$$v_1 - \bar{w} = \sum_{i=1}^{N} \beta_i x_i.$$

We consider the following stochastic gradient update rule with a datum $x_u$ ($1 \le u \le N$):

$$w_u = w - \gamma \beta_u x_u. \qquad (A.2)$$

We also replace equations 5.2 and 5.3 by the quantities

$$a_w = \sum_{u=1}^{N} \beta_u^2 \, x_u^T G x_u, \qquad b_w = (w - \bar{w})^T (Gw - z).$$

We start our analysis with the following lemma, which is similar to lemma 1.

Lemma 3. Let $\gamma = b_w / a_w$. Then the following equality holds:

$$\frac{1}{N} \sum_{u=1}^{N} R(w_u) = R(w) - \frac{b_w^2}{2 N a_w}. \qquad (A.3)$$

Proof. Using equation A.2, we obtain from simple algebra that

$$R(w_u) - R(w) = \frac{\gamma^2}{2} \beta_u^2 \, x_u^T G x_u - \gamma \beta_u \, x_u^T (Gw - z).$$

Summing over $u$, we have

$$\sum_{u=1}^{N} R(w_u) - N R(w) = \frac{\gamma^2}{2} \sum_{u=1}^{N} \beta_u^2 \, x_u^T G x_u - \gamma (v_1 - \bar{w})^T (Gw - z).$$

Using equation A.1, we have $(v_1 - \bar{w})^T (Gw - z) = b_w$. Therefore,

$$\sum_{u=1}^{N} R(w_u) - N R(w) = \frac{\gamma^2}{2} a_w - \gamma b_w = -\frac{b_w^2}{2 a_w}.$$
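The averaging identity (A.3) is easy to confirm numerically. The sketch below is illustrative and its variable names are not from the paper; it takes the simplest setting in which the basis is the standard basis of $\mathbb{R}^N$, so $V_2$ is empty, condition A.1 is vacuous, and the coefficient of the $u$-th basis vector is just the $u$-th coordinate of $w - \bar{w}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Check identity (A.3) with {x_1,...,x_N} = standard basis of R^N.
N = 30
A = rng.standard_normal((N, N))
G = A @ A.T + np.eye(N)            # positive definite "Gram" matrix
z = rng.standard_normal(N)
R = lambda w: 0.5 * w @ G @ w - z @ w

w = rng.standard_normal(N)
wbar = rng.standard_normal(N)
beta = w - wbar                    # coordinates of w - wbar in this basis

a_w = np.sum(beta**2 * np.diag(G))     # sum_u beta_u^2 x_u' G x_u
b_w = (w - wbar) @ (G @ w - z)         # (w - wbar)' (G w - z)
step = b_w / a_w                       # the step size b_w / a_w

# one gradient step along each coordinate u: w_u = w - step * beta_u * x_u
avg = np.mean([R(w - step * beta[u] * np.eye(N)[u]) for u in range(N)])

lhs, rhs = avg, R(w) - b_w**2 / (2 * N * a_w)
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```

The two sides agree for any choice of w and w̄ in this setting, because with a full basis the orthogonality condition A.1 holds trivially and the derivation above is purely algebraic once the step size b_w/a_w is substituted.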
Clearly, the above lemma plays the same role as lemma 1 in section 5. Since the averaging is taken only with respect to part of the basis $x_1, \ldots, x_N$, the analysis applies only to a greedy algorithm. Next, we need to lower-bound the quantity $b_w$.

Lemma 4. For all $w, \bar{w}$ such that $\Delta R(w) \ge \Delta R(\bar{w})$, we have

$$b_w \ge \sqrt{2} \, \Delta R(w)^{1/2} \, \|w - \bar{w}\|_G \, \frac{\Delta R(w)^{1/2} - \Delta R(\bar{w})^{1/2}}{\Delta R(w)^{1/2} + \Delta R(\bar{w})^{1/2}}.$$

Proof. Let $\hat{w} = G^{-1} z$. Then, similar to proposition 2, we have

$$\Delta R(w) = \frac{1}{2} (w - \hat{w})^T (Gw - z) = \frac{1}{2} (w - \hat{w})^T G (w - \hat{w}).$$

This implies that

$$\|w - \bar{w}\|_G \le \|w - \hat{w}\|_G + \|\bar{w} - \hat{w}\|_G = \sqrt{2} \, [\Delta R(w)^{1/2} + \Delta R(\bar{w})^{1/2}]. \qquad (A.4)$$

We thus have

$$\begin{aligned}
\frac{b_w}{2} &= \frac{1}{2} (w - \bar{w})^T (Gw - z) \\
&= \frac{1}{2} (w - \hat{w})^T (Gw - z) + \frac{1}{2} (\hat{w} - \bar{w})^T G (w - \hat{w}) \\
&\ge \Delta R(w) - \frac{1}{2} \left| (\hat{w} - \bar{w})^T G (w - \hat{w}) \right| \\
&\ge \Delta R(w) - \frac{1}{2} \|\bar{w} - \hat{w}\|_G \, \|w - \hat{w}\|_G \\
&= \Delta R(w) - \Delta R(w)^{1/2} \Delta R(\bar{w})^{1/2} \\
&\ge \frac{1}{\sqrt{2}} \, \Delta R(w)^{1/2} \, \|w - \bar{w}\|_G \, \frac{\Delta R(w)^{1/2} - \Delta R(\bar{w})^{1/2}}{\Delta R(w)^{1/2} + \Delta R(\bar{w})^{1/2}}.
\end{aligned}$$
In the above, the second inequality follows from the Cauchy–Schwarz inequality, and the third inequality follows from equation A.4.

Proof of Theorem 2.
From lemmas 3 and 4, if $\Delta R(w) \ge (1 + c)^2 \Delta R(\bar{w})$, then

$$\frac{1}{N} \sum_{u=1}^{N} R(w_u) - R(w) \le -\frac{c^2 \, \Delta R(w) \, \|w - \bar{w}\|_G^2}{N a_w (2 + c)^2} \le -\frac{c^2 \, \Delta R(w)}{N (2 + c)^2 \, \rho(x_1, \ldots, x_N; V_2)}.$$

This implies that if in algorithm 4, $\Delta R(w_k) \ge (1 + c)^2 \Delta R(\bar{w})$, then

$$\Delta R(w_{k+1}) \le \Delta R(w_k) \left[ 1 - \frac{c^2}{N (2 + c)^2 \, \rho(x_1, \ldots, x_N; V_2)} \right].$$

Using induction and assuming $\Delta R(w_k) \ge (1 + c)^2 \Delta R(\bar{w})$, we obtain

$$\Delta R(w_{k+1}) \le \left[ 1 - \frac{c^2}{N (2 + c)^2 \, \rho(x_1, \ldots, x_N; V_2)} \right]^{k+1} \Delta R(0).$$

This proves the theorem.

Acknowledgments
I thank anonymous referees for pointing out related work and for constructive suggestions that helped to improve the article.
References

Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.

Chen, S. S., Donoho, D. L., & Saunders, M. A. (1999). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1), 33–61.

Csató, L., & Opper, M. (2002). Sparse on-line gaussian processes. Neural Comp., 14, 641–668.

Gao, F., Wahba, G., Klein, R., & Klein, B. (2001). Smoothing spline ANOVA for multivariate Bernoulli observations, with applications to ophthalmology data. J. Amer. Statist. Assoc., 96, 127–160.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Golub, G., & Van Loan, C. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.

Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1), 608–613.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., & Klein, B. (2000). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Ann. Statist., 28, 1570–1600.

Luo, Z., & Wahba, G. (1997). Hybrid adaptive splines. J. Amer. Statist. Assoc., 92, 107–114.

Mallat, S., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12), 3397–3415.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A, 209, 415–446.

Nair, P., Choudhury, A., & Keane, A. (2001). Some greedy algorithms for sparse nonlinear regression. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 369–376). San Mateo, CA: Morgan Kaufmann.

Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2), 227–234.

Smola, A. J., & Bartlett, P. (2001). Sparse greedy gaussian process regression. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 619–625). Cambridge, MA: MIT Press.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Williams, C. K. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. Jordan (Ed.), Learning and inference in graphical models. Norwell, MA: Kluwer.

Williams, C., & Rasmussen, C. (1996). Gaussian processes for regression. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Williams, C., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Xiang, D. (1996). Model fitting and testing for non-gaussian data with a large data set. Unpublished doctoral dissertation, University of Wisconsin.

Xiang, D., & Wahba, G. (1997). Approximate smoothing spline methods for large data sets in the binary case. In Proceedings of the 1997 ASA Joint Statistical Meetings, Biometrics Section (pp. 94–98). Alexandria, VA: American Statistical Association.

Received May 30, 2001; accepted May 30, 2002.
Index Volume 14 By Author Aertsen, Ad—See Grün, Sonja Aertsen, Ad—See Gütig, Robert Agakov, F.V.—See Williams, C.K.I. Ahuja, Narendra—See Roth, Dan Aihara, Kazuyuki—See Masuda, Naoki Alonso-Betanzos, Amparo—See Castillo, Enrique Amari, Shun-ichi—See Chen, Tianping Amari, Shun-ichi—See Nakahara, Hiroyuki Amari, Shun-ichi—See Wu, Si Anile, Angelo Marcello—See Plebe, Alessio Aussem, Alex Sufficient Conditions for Error Backflow Convergence in Dynamical Recurrent Neural Networks (Letter)
14(8): 1907–1927
Ay, Nihat Locality of Global Stochastic Interaction in Directed Acyclic Networks (Letter)
14(12): 2959–2980
Back, Andrew D. and Chen, Tianping Universal Approximation of Multiple Nonlinear Operators by Neural Networks (Note)
14(11): 2561–2566
Balboa, Rosario M.—See Grzywacz, Norberto M. Baraduc, Pierre and Guigon, Emmanuel Population Computation of Vectorial Transformations (Letter) Barbieri, Riccardo—See Brown, Emery N. Beer, Randall D.—See Mathayomchan, Boonyanit
14(4): 845–871
Bengio, Samy—See Collobert, Ronan Bengio, Yoshua—See Collobert, Ronan Bengio, Yoshua—See Takeuchi, Ichiro Bethge, M., Rotermund, D., and Pawelzik, K. Optimal Short-Term Population Coding: When Fisher Information Fails (Letter) Bezzi, Michele, Samengo, In´es, Leutgeb, Stefan, and Mizumori, Sheri J. Measuring Information Spatial Densities (Letter)
14(10): 2317–2351
14(2): 405–420
Bodelón, Clara—See Hasselmo, Michael E. Bonhoeffer, Tobias—See Swindale, Nicholas V. Brandman, Relly and Nelson, Mark E. A Simple Model of Long-Term Spike Train Regularization (Letter)
14(7): 1575–1597
Bressloff, Paul C., Cowan, Jack D., Golubitsky, Martin, Thomas, Peter J., and Wiener, Matthew C. What Geometric Visual Hallucinations Tell Us about the Visual Cortex (View)
14(3): 473–491
Bressloff, Paul C. and Cowan, Jack D. An Amplitude Equation Approach to Contextual Effects in Visual Cortex (Letter)
14(3): 493–525
Brown, Emery N., Barbieri, Riccardo, Ventura, Valérie, Kass, Robert E., and Frank, Loren M. The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis (Note)
14(2): 325–346
Browne, Antony Representation and Extrapolation in Multilayer Perceptrons (Letter) Brunel, Nicolas—See Fourcaud, Nicolas Canavier, Carmen C.—See Oprisan, Sorinel A.
14(7): 1739–1754
Carpenter, Gail A. and Milenova, Boriana L. Redistribution of Synaptic Efficacy Supports Stable Pattern Learning in Neural Networks (Letter) Carreira-Perpiñán, Miguel A. and Goodhill, Geoffrey J. Are Visual Cortex Maps Optimized For Coverage? (Note)
14(4): 873–888
14(7): 1545–1560
Castellanos, J.—See Galleske, I. Casti, A.R.R., Omurtag, A., Sornborger, A., Kaplan, E., Knight, B., Victor, J., and Sirovich, L. A Population Study of Integrate-and-Fire-or-Burst Neurons (Article)
14(5): 957–986
Castillo, Enrique, Fontenla-Romero, Oscar, Guijarro-Berdiñas, Bertha, and Alonso-Betanzos, Amparo A Global Optimum Approach for One-Layer Neural Networks (Letter)
14(6): 1429–1449
Caticha, N., Palo Tejada, J.E., Lancet, D., and Domany, E. Computational Capacity of an Odorant Discriminator: The Linear Separability of Curves (Letter)
14(9): 2201–2220
Chang, Chih-Chung and Lin, Chih-Jen Training ν-Support Vector Regression: Theory and Algorithms (Letter)
14(8): 1959–1977
Chater, Nick—See French, Robert M. Chen, Tianping—See Back, Andrew D. Chen, Tianping, Lu, Wenlian, and Amari, Shun-ichi Global Convergence Rate of Recurrently Connected Neural Networks (Letter)
14(12): 2947–2957
Chen, Zhe and Haykin, Simon On Different Facets of Regularization Theory (Review)
14(12): 2791–2846
Cho, Siu-Yeung and Chow, Tommy W.S. A New Color 3D SFS Methodology Using Neural-Based Color Reflectance Models and Iterative Recursive Method (Letter)
14(11): 2751–2789
Chow, Tommy W.S.—See Cho, Siu-Yeung Collobert, Ronan, Bengio, Samy, and Bengio, Yoshua A Parallel Mixture of SVMs for Very Large Scale Problems (Letter)
14(5): 1105–1114
Cottrell, Garrison W.—See Taylor, Adam L. Coughlan, James M. and Yuille, A.L. Bayesian A* Tree Search with Expected O(N) Node Expansions: Applications to Road Tracking (Letter)
14(8): 1929–1958
Cowan, Jack D.—See Bressloff, Paul C. Cremers, Daniel and Herz, Andreas V.M. Traveling Waves of Excitation in Neural Field Models: Equivalence of Rate Descriptions and Integrate-and-Fire Dynamics (Letter)
14(7): 1651–1667
Csató, Lehel and Opper, Manfred Sparse On-Line Gaussian Processes (Letter)
14(3): 641–668
Dang, Chuangyin and Xu, Lei A Lagrange Multiplier and Hopfield-Type Barrier Function Method for the Traveling Salesman Problem (Note)
14(2): 303–324
Daw, Nathaniel D. and Touretzky, David S. Long-Term Reward Prediction in TD Models of the Dopamine System (Letter) Decaestecker, Christine—See Saerens, Marco De Moor, B.—See Van Gestel, T. De Zeeuw, Chris I.—See Kistler, Werner M. Diesmann, Markus—See Grün, Sonja Domany, E.—See Caticha, N. Douglas, Rodney J.—See Hahnloser, Richard H.R.
14(11): 2567–2583
Doya, Kenji, Samejima, Kazuyuki, Katagiri, Ken-ichi, and Kawato, Mitsuo Multiple Model-Based Reinforcement Learning (Letter)
14(6): 1347–1369
Dreyfus, Gérard—See Monari, Gaétan Duckro, Donald E., Quinn, Dennis W., and Gardner III, Samuel J. Neural Network Pruning with Tukey-Kramer Multiple Comparison Procedure (Letter)
14(5): 1149–1168
Eck, D.—See Schmidhuber, J. Eguchi, Shinto—See Mihoko, Minami Elliott, Terry and Kramer, Jörg Coupling an aVLSI Neuromorphic Vision Chip to a Neurotrophic Model of Synaptic Plasticity: The Development of Topography (Letter) Elliott, T. and Shadbolt, N.R. Multiplicative Synaptic Normalization and a Nonlinear Hebb Rule Underlie a Neurotrophic Model of Competitive Synaptic Plasticity (Letter)
14(10): 2353–2370
14(6): 1311–1322
Eurich, Christian W.—See Wilke, Stefan D. Fellous, J.-M.—See Tiesinga, P.H.E. Feng, Jianfeng and Li, Guibin Impact of Geometrical Structures on the Output of Neuronal Models: A Theoretical and Numerical Analysis (Letter) Fiori, Simone Notes on Bell-Sejnowski PDF-Matching Neuron (Note)
14(3): 621–640
14(12): 2847–2855
Fontenla-Romero, Oscar—See Castillo, Enrique Fourcaud, Nicolas and Brunel, Nicolas Dynamics of the Firing Probability of Noisy Integrate-and-Fire Neurons (Letter) Frank, Loren M.—See Brown, Emery N.
14(9): 2057–2110
French, Robert M. and Chater, Nick Using Noise to Compute Error Surfaces in Connectionist Networks: A Novel Means of Reducing Catastrophic Forgetting (Letter)
14(7): 1755–1769
Galleske, I. and Castellanos, J. Optimization of the Kernel Functions in a Probabilistic Neural Network Analyzing the Local Pattern Distribution (Letter)
14(5): 1183–1194
Gardner III, Samuel J.—See Duckro, Donald E. Gers, F.—See Schmidhuber, J. Gerstner, Wulfram—See Kistler, Werner M. Gielen, Stan C.A.M.—See Pantic, Lovorka Giese, Martin A.—See Yu, Angela J. Girolami, Mark Orthogonal Series Density Estimation and the Kernel Eigenvalue Problem (Letter) Glavinović, Mladen I. Mechanisms Shaping Fast Excitatory Postsynaptic Currents in the Central Nervous System (View)
14(3): 669–688
14(1): 1–19
Golubitsky, Martin—See Bressloff, Paul C. Goodhill, Geoffrey J.—See Carreira-Perpiñán, Miguel A. Gottschalk, Allan Derivation of the Visual Contrast Response Function by Maximizing Information Rate (Letter)
14(3): 527–542
Grinvald, Amiram—See Swindale, Nicholas V. Grün, Sonja, Diesmann, Markus, and Aertsen, Ad Unitary Events in Multiple Single-Neuron Spiking Activity: I. Detection and Significance (Letter)
14(1): 43–80
Grün, Sonja, Diesmann, Markus, and Aertsen, Ad Unitary Events in Multiple Single-Neuron Spiking Activity: II. Nonstationary Data (Letter) Grzywacz, Norberto M. and Balboa, Rosario M. A Bayesian Framework for Sensory Adaptation (Letter)
14(1): 81–119
14(3): 543–559
Guigon, Emmanuel—See Baraduc, Pierre Guijarro-Berdiñas, Bertha—See Castillo, Enrique Gütig, Robert, Aertsen, Ad, and Rotter, Stefan Statistical Significance of Coincident Spikes: Count-Based Versus Rate-Based Statistics (Letter)
14(1): 121–153
Hagiwara, Katsuyuki On the Problem in Model Selection of Neural Network Regression in Overrealizable Scenario (Letter)
14(8): 1979–2002
Hahnloser, Richard H.R., Douglas, Rodney J., and Hepp, Klaus Attentional Recruitment of Inter-Areal Recurrent Networks for Selective Gain Control (Letter)
14(7): 1669–1689
Hahnloser, Richard H.R.—See Xie, Xiaohui Halees, Anason—See Sollich, Peter Hansen, Lars Kai—See Højen-Sørensen, Pedro A.d.F.R. Hanson, Stephen José and Negishi, Michiro On the Emergence of Rules in Neural Networks (Letter) Hasselmo, Michael E., Bodelón, Clara, and Wyble, Bradley P. A Proposed Function for Hippocampal Theta Rhythm: Separate Phases of Encoding and Retrieval Enhance Reversal of Prior Learning (Letter) Haykin, Simon—See Chen, Zhe
14(9): 2245–2268
14(4): 793–817
Hepp, Klaus—See Hahnloser, Richard H.R. Hertz, John—See Scarpetta, Silvia Herz, Andreas V.M.—See Cremers, Daniel Herz, Andreas V.M.—See Schreiber, Susanne Hikosaka, Okihide—See Nakahara, Hiroyuki Hinton, Geoffrey E. Training Products of Experts by Minimizing Contrastive Divergence (Article) Højen-Sørensen, Pedro A.d.F.R., Winther, Ole, and Hansen, Lars Kai Mean-Field Approaches to Independent Component Analysis (Letter)
14(8): 1771–1800
14(4): 889–918
Hübener, Mark—See Swindale, Nicholas V. Jacobs, Robert A., Jiang, Wenxin, and Tanner, Martin A. Factorial Hidden Markov Models and the Generalized Backfitting Algorithm (Letter)
14(10): 2415–2437
Jiang, Wenxin—See Jacobs, Robert A. Johnson, M.H.—See Spratling, M.W. Kanamori, Takafumi—See Takeuchi, Ichiro Kaplan, E.—See Casti, A.R.R. Kappen, Hilbert J.—See Pantic, Lovorka Karhunen, Juha—See Valpola, Harri Kärkkäinen, Tommi MLP in Layer-Wise Form with Applications to Weight Decay (Letter)
14(6): 1451–1480
Karlsson, Mattias F. and Robinson, John W.C. On Optimality in Auditory Information Processing (Letter)
14(9): 2181–2200
Kaski, Samuel—See Sinkkonen, Janne Kass, Robert E.—See Brown, Emery N.
Katagiri, Ken-ichi—See Doya, Kenji Kawanabe, Motoaki—See Tsuda, Koji Kawato, Mitsuo—See Doya, Kenji Kimura, Masahiro On Unique Representations of Certain Dynamical Systems Produced by Continuous-Time Recurrent Neural Networks (Letter) Kistler, Werner M. and Gerstner, Wulfram Stable Propagation of Activity Pulses in Populations of Spiking Neurons (Note) Kistler, Werner M. and De Zeeuw, Chris I. Dynamical Working Memory and Timed Responses: The Role of Reverberating Loops in the Olivo-Cerebellar System (Letter)
14(12): 2981–2996
14(5): 987–997
14(11): 2597–2626
Knight, B.—See Casti, A.R.R. Koch, Christof—See Manwani, Amit Koh, Ingrid Y.Y., Lindquist, W. Brent, Zito, Karen, Nimchinsky, Esther A., and Svoboda, Karel An Image Analysis Algorithm for Dendritic Spines (Letter) Kramer, Jörg—See Elliott, Terry Kristan, Jr., William B.—See Taylor, Adam L. Kröse, Ben—See Vlassis, Nikos Lambrechts, A.—See Van Gestel, T. Lampinen, Jouko—See Vehtari, Aki Lancet, D.—See Caticha, N. Lanckriet, G.—See Van Gestel, T. Latinne, Patrice—See Saerens, Marco Laughlin, Simon B.—See Schreiber, Susanne Leutgeb, Stefan—See Bezzi, Michele
14(6): 1283–1310
Li, Guibin—See Feng, Jianfeng Liao, Shuo-Peng, Lin, Hsuan-Tien, and Lin, Chih-Jen A Note on the Decomposition Methods for Support Vector Regression (Note)
14(6): 1267–1281
Likas, Aristidis—See Titsias, Michalis K. Lin, Chih-Jen—See Liao, Shuo-Peng Lin, Chih-Jen—See Chang, Chih-Chung Lin, Hsuan-Tien—See Liao, Shuo-Peng Lindquist, W. Brent—See Koh, Ingrid Y.Y. Lu, Wenlian—See Chen, Tianping Maass, Wolfgang, Natschläger, Thomas, and Markram, Henry Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations (Article)
14(11): 2531–2560
Machens, Christian K.—See Schreiber, Susanne Magesh, M.—See Sastry, P.S. Malaszenko, Natalie A.—See Tang, Akaysha C. Mandic, Danilo P. Data-Reusing Recurrent Neural Adaptive Filters (Letter) Manwani, Amit, Steinmetz, Peter N., and Koch, Christof The Impact of Spike Timing Variability on the Signal-Encoding Performance of Neural Spiking Models (Letter)
14(11): 2693–2707
14(2): 347–367
Markram, Henry—See Maass, Wolfgang Masuda, Naoki and Aihara, Kazuyuki Spatiotemporal Spike Encoding of a Continuous External Signal (Letter)
14(7): 1599–1628
Mathayomchan, Boonyanit and Beer, Randall D. Center-Crossing Recurrent Neural Networks for the Evolution of Rhythmic Behavior (Note) Matsumoto, Narihisa and Okada, Masato Self-Regulation Mechanism of Temporally Asymmetric Hebbian Plasticity (Letter) Meyer, Carsten and van Vreeswijk, Carl Temporal Correlations in Stochastic Networks of Spiking Neurons (Letter) Mihoko, Minami and Eguchi, Shinto Robust Blind Source Separation by Beta Divergence (Letter)
14(9): 2043–2051
14(12): 2883–2902
14(2): 369–404
14(8): 1859–1886
Milenova, Boriana L.—See Carpenter, Gail A. Minagawa, Akihiro, Tagawa, Norio, and Tanaka, Toshiyuki SMEM Algorithm Is Not Fully Compatible with Maximum-Likelihood Framework (Note)
14(6): 1261–1266
Mineiro, Paul—See Movellan, Javier R. Mizumori, Sheri J.—See Bezzi, Michele Monari, Gaétan and Dreyfus, Gérard Local Overfitting Control via Leverages (Letter)
14(6): 1481–1506
Motomura, Yoichi—See Vlassis, Nikos Movellan, Javier R., Mineiro, Paul, and Williams, R.J. A Monte Carlo EM Approach for Partially Observable Diffusion Processes: Theory and Applications to Neural Networks (Article)
14(7): 1507–1544
Müller, Klaus-Robert—See Tsuda, Koji Nakahara, Hiroyuki and Amari, Shun-ichi Information-Geometric Measure for Neural Spikes (Article)
14(10): 2269–2316
Nakahara, Hiroyuki, Amari, Shun-ichi, and Hikosaka, Okihide Self-Organization in the Basal Ganglia with Modulation of Reinforcement Signals (Letter)
14(4): 819–844
Nakahara, Hiroyuki—See Wu, Si Natschläger, Thomas—See Maass, Wolfgang Negishi, Michiro—See Hanson, Stephen José Nelson, Mark E.—See Brandman, Relly Nimchinsky, Esther A.—See Koh, Ingrid Y.Y. O’Halloran, Micah—See Sarpeshkar, Rahul Okada, Masato—See Matsumoto, Narihisa Omurtag, A.—See Casti, A.R.R. Opper, Manfred—See Csató, Lehel Oprisan, Sorinel A. and Canavier, Carmen C. The Influence of Limit Cycle Topology on the Phase Resetting Curve (Letter) Pakdaman, K. The Reliability of the Stochastic Active Rotator (Note)
14(5): 1027–1057
14(4): 781–792
Palo Tejada, J.E.—See Caticha, N. Pantic, Lovorka, Torres, Joaquín J., Kappen, Hilbert J., and Gielen, Stan C.A.M. Associative Memory with Dynamic Synapses (Letter) 14(12): 2903–2923 Pawelzik, K.—See Bethge, M. Pearlmutter, Barak A.—See Tang, Akaysha C. Phung, Dan B.—See Tang, Akaysha C. Plebe, Alessio and Anile, Angelo Marcello A Neural-Network-Based Approach to the Double Traveling Salesman Problem (Letter) Poggio, Tomaso A.—See Yu, Angela J.
14(2): 437–471
Quinn, Dennis W.—See Duckro, Donald E. Rätsch, Gunnar—See Tsuda, Koji Rattray, Magnus Stochastic Trapping in a Solvable Model of On-Line Independent Component Analysis (Letter) Read, Jenny C.A. A Bayesian Approach to the Stereo Correspondence Problem (Letter)
14(2): 421–435
14(6): 1371–1392
Reeb, Bethany C.—See Tang, Akaysha C. Robinson, John W. C.—See Karlsson, Mattias F. Rohde, Douglas L.T. Methods for Binary Multidimensional Scaling (Letter)
14(5): 1195–1232
Rolls, Edmund T.—See Stringer, Simon M. Rotermund, D.—See Bethge, M. Roth, Dan, Yang, Ming-Hsuan, and Ahuja, Narendra Learning to Recognize Three-Dimensional Objects (Letter)
14(5): 1071–1103
Rotter, Stefan—See Gütig, Robert Ruf, Berthold—See Senn, Walter Saerens, Marco, Latinne, Patrice, and Decaestecker, Christine Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure (Note) Salinas, Emilio and Sejnowski, Terrence J. Integrate-and-Fire Neurons Driven by Correlated Stochastic Input (Letter) Samejima, Kazuyuki—See Doya, Kenji Samengo, Inés—See Bezzi, Michele
14(1): 21–41
14(9): 2111–2155
Samengo, Inés Information Loss in an Optimal Maximum Likelihood Decoding (Note)
14(4): 771–779
Sarpeshkar, Rahul and O’Halloran, Micah Scalable Hybrid Computation with Spikes (Article)
14(9): 2003–2038
Sastry, P.S., Magesh, M., and Unnikrishnan, K.P. Two Timescale Analysis of Alopex Algorithm for Optimization (Letter) 14(11): 2729–2750 Scarpetta, Silvia, Zhaoping, L., and Hertz, John Hebbian Imprinting and Retrieval in Oscillatory Neural Networks (Letter) Schmidhuber, J., Gers, F., and Eck, D. Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM (Note)
14(10): 2371–2396
14(9): 2039–2041
Schmitt, Michael On the Complexity of Computing and Learning with Multiplicative Neural Networks (Article)
14(2): 241–301
Schmitt, Michael Neural Networks with Local Receptive Fields and Superlinear VC Dimension (Letter)
14(4): 919–956
Schmitt, Michael Descartes’ Rule of Signs for Radial Basis Function Neural Networks (Letter)
14(12): 2997–3011
Schneider, Martin—See Senn, Walter Schraudolph, Nicol N. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent (Letter)
14(7): 1723–1738
Schreiber, Susanne, Machens, Christian K., Herz, Andreas V.M., and Laughlin, Simon B. Energy-Efficient Coding with Discrete Stochastic Events (Letter)
14(6): 1323–1346
Sejnowski, Terrence J.—See Salinas, Emilio Sejnowski, Terrence J.—See Tiesinga, P.H.E.
Sejnowski, Terrence J.—See Wiskott, Laurenz Senn, Walter, Schneider, Martin, and Ruf, Berthold Activity-Dependent Development of Axonal and Dendritic Delays, or, Why Synaptic Transmission Should Be Unreliable (Letter)
14(3): 583–619
Seung, H. Sebastian—See Xie, Xiaohui Shadbolt, N.R.—See Elliott, T. Shoham, Doron—See Swindale, Nicholas V. Šíma, Jiří Training a Single Sigmoidal Neuron Is Hard (Letter) Sinkkonen, Janne and Kaski, Samuel Clustering Based on Conditional Distributions in an Auxiliary Space (Letter)
14(11): 2709–2728
14(1): 217–239
Sirovich, L.—See Casti, A.R.R. Sollich, Peter and Halees, Anason Learning Curves for Gaussian Process Regression: Approximations and Bounds (Letter)
14(6): 1393–1428
Sonnenburg, Sören—See Tsuda, Koji Sornborger, A.—See Casti, A.R.R. Spratling, M.W. and Johnson, M.H. Preintegration Lateral Inhibition Enhances Unsupervised Learning (Letter)
14(9): 2157–2179
Stanley, Garrett B. Adaptive Spatiotemporal Receptive Field Estimation in the Visual Pathway (Letter)
14(12): 2925–2946
Steinmetz, Peter N.—See Manwani, Amit Stracuzzi, David J.—See Utgoff, Paul E. Stringer, Simon M. and Rolls, Edmund T. Invariant Object Recognition in the Visual System with Novel Views of 3D Objects (Letter)
14(11): 2585–2596
Suykens, J.A.K.—See Van Gestel, T. Svoboda, Karel—See Koh, Ingrid Y.Y. Swindale, Nicholas V., Shoham, Doron, Grinvald, Amiram, Bonhoeffer, Tobias, and Hübener, Mark Reply to Carreira-Perpiñán and Goodhill (Note)
14(9): 2053–2056
Tagawa, Norio—See Minagawa, Akihiro Takeuchi, Ichiro, Bengio, Yoshua, and Kanamori, Takafumi Robust Regression with Asymmetric Heavy-Tail Noise Distributions (Letter)
14(10): 2469–2496
Tanaka, Toshiyuki—See Minagawa, Akihiro Tang, Akaysha C., Pearlmutter, Barak A., Malaszenko, Natalie A., Phung, Dan B., and Reeb, Bethany C. Independent Components of Magnetoencephalography: Localization (Letter)
14(8): 1827–1858
Tanner, Martin A.—See Jacobs, Robert A. Taylor, Adam L., Cottrell, Garrison W., and Kristan, Jr., William B. Analysis of Oscillations in a Reciprocally Inhibitory Network with Synaptic Depression (Letter)
14(3): 561–581
Thomas, Peter J.—See Bressloff, Paul C. Tiesinga, P.H.E., Fellous, J.-M., and Sejnowski, Terrence J. Attractor Reliability Reveals Deterministic Structure in Neuronal Spike Trains (Letter)
14(7): 1629–1650
Titsias, Michalis K. and Likas, Aristidis Mixture of Experts Classification Using a Hierarchical Mixture Model (Letter)
14(9): 2221–2244
Todorov, Emanuel Cosine Tuning Minimizes Motor Errors (Article)
14(6): 1233–1260
Torres, Joaquín J.—See Pantic, Lovorka Touretzky, David S.—See Daw, Nathaniel D. Tsuda, Koji, Kawanabe, Motoaki, Rätsch, Gunnar, Sonnenburg, Sören, and Müller, Klaus-Robert A New Discriminative Kernel from Probabilistic Models (Letter)
14(10): 2397–2414
Unnikrishnan, K.P.—See Sastry, P.S. Utgoff, Paul E. and Stracuzzi, David J. Many-Layered Learning (Letter)
14(10): 2497–2529
Valpola, Harri and Karhunen, Juha An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models (Letter)
14(11): 2647–2692
Vandewalle, J.—See Van Gestel, T. Van Gestel, T., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B., and Vandewalle, J. Bayesian Framework for Least-Squares Support Vector Machine Classifiers, Gaussian Processes, and Kernel Fisher Discriminant Analysis (Letter)
14(5): 1115–1147
Van Hulle, Marc M. Kernel-Based Topographic Map Formation by Local Density Modeling (Note)
14(7): 1561–1573
Van Hulle, Marc M. Joint Entropy Maximization in Kernel-Based Topographic Maps (Letter)
14(8): 1887–1906
Van Loocke, Philip Fields as Limit Functions of Stochastic Discrimination and Their Adaptability (Letter)
14(5): 1059–1070
van Vreeswijk, Carl—See Meyer, Carsten Vehtari, Aki and Lampinen, Jouko Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities (Letter)
14(10): 2439–2468
Ventura, Valérie—See Brown, Emery N. Victor, J.—See Casti, A.R.R. Vlassis, Nikos, Motomura, Yoichi, and Kröse, Ben Supervised Dimension Reduction of Intrinsically Low-Dimensional Data (Letter)
14(1): 191–215
Wennekers, Thomas Dynamic Approximation of Spatiotemporal Receptive Fields in Nonlinear Neural Field Models (Letter)
14(8): 1801–1825
Wiener, Matthew C.—See Bressloff, Paul C. Wilke, Stefan D. and Eurich, Christian W. Representational Accuracy of Stochastic Neural Populations (Letter)
14(1): 155–189
Williams, C.K.I. and Agakov, F.V. Products of Gaussians and Probabilistic Minor Component Analysis (Letter)
14(5): 1169–1182
Williams, R.J.—See Movellan, Javier R. Winther, Ole—See Højen-Sørensen, Pedro A.d.F.R. Wiskott, Laurenz and Sejnowski, Terrence J. Slow Feature Analysis: Unsupervised Learning of Invariances (Article)
14(4): 715–770
Wu, Jiann-Ming Natural Discriminant Analysis Using Interactive Potts Models (Letter)
14(3): 689–713
Wu, Si, Amari, Shun-ichi, Nakahara, Hiroyuki Population Coding and Decoding in a Neural Field: A Computational Study (Letter)
14(5): 999–1026
Wyble, Bradley P.—See Hasselmo, Michael E. Xie, Xiaohui, Hahnloser, Richard H.R., and Seung, H. Sebastian Selectively Grouping Neurons in Recurrent Networks of Lateral Inhibition (Letter) Xu, Lei—See Dang, Chuangyin
14(11): 2627–2646
Yang, Ming-Hsuan—See Roth, Dan Yu, Angela J., Giese, Martin A., and Poggio, Tomaso A. Biophysiologically Plausible Implementations of the Maximum Operation (Letter)
14(12): 2857–2881
Yuille, A.L. CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies: Convergent Alternatives to Belief Propagation (Letter)
14(7): 1691–1722
Yuille, A.L.—See Coughlan, James M. Zhang, Tong Approximation Bounds for Some Sparse Kernel Regression Algorithms (Letter) Zhaoping, L.—See Scarpetta, Silvia Zito, Karen—See Koh, Ingrid Y.Y.
14(12): 3013–3042